Leaner, More Efficient Storage Infrastructure for the AI Era

The AI era demands a simple infrastructure strategy that prioritizes scalability, performance and cost efficiency in managing AI data pipelines. A key challenge is supporting large language model (LLM) training, which requires massive data, compute and storage resources. Efficient training relies on the continuous feeding of large data sets and the storage of model parameters, intermediate results and checkpoints. Above all, the infrastructure strategy must ensure that the AI resources are scalable, reliable and cost-efficient. Scaling AI Training Infrastructure As models grow, so do the demands on high-performance block storage system with multiattach capabilities. The block Continue reading

How we found a bug in Go’s arm64 compiler

Every second, 84 million HTTP requests are hitting Cloudflare across our fleet of data centers in 330 cities. It means that even the rarest of bugs can show up frequently. In fact, it was our scale that recently led us to discover a bug in Go's arm64 compiler which causes a race condition in the generated code.

This post breaks down how we first encountered the bug, investigated it, and ultimately drove to the root cause.

Investigating a strange panic

We run a service in our network which configures the kernel to handle traffic for some products like Magic Transit and Magic WAN. Our monitoring watches this closely, and it started to observe very sporadic panics on arm64 machines.

We first saw one with a fatal error stating that traceback did not unwind completely. That error suggests that invariants were violated when traversing the stack, likely because of stack corruption. After a brief investigation we decided that it was probably rare stack memory corruption. This was a largely idle control plane service where unplanned restarts have negligible impact, and so we felt that following up was not a priority unless it kept happening.

And then it kept happening. 

Coredumps Continue reading

Ultra Ethernet: Fabric Object – What it is and How it is created

Fabric Object


Fabric Object Overview

In libfabric, a fabric represents a logical network domain, a group of hardware and software resources that can communicate with each other through a shared network. All network ports that can exchange traffic belong to the same fabric domain. In practice, a fabric corresponds to one interconnected network, such as an Ethernet or Ultra Ethernet Transport (UET) fabric.

A good way to think about a fabric is to compare it to a Virtual Data Center (VDC) in a cloud environment. Just as a VDC groups together compute, storage, and networking resources into an isolated logical unit, a libfabric fabric groups together network interfaces, addresses, and transport resources that belong to the same communication context. Multiple fabrics can exist on the same system, just like multiple VDCs can operate independently within one cloud infrastructure.

The fabric object acts as the top-level context for all communication. Before an application can create domains, endpoints, or memory regions, it must first open a fabric using the fi_fabric() call. This creates the foundation for all other libfabric objects.

Each fabric is associated with a specific provider,  for example, libfabric-uet, which defines how the fabric interacts with the underlying hardware and Continue reading

Changes in ipSpace.net RSS Feeds

TL&DR: You shouldn’t see any immediate impact of this change, but I’ll eventually clean up old stuff, so you might want to check the URLs if you use RSS/Atom feeds to get the list of ipSpace.net blog posts or podcast episodes. The (hopefully) final URLs are listed on this page.

Executive Summary: I cleaned up the whole ipSpace.net RSS/Atom feeds system. The script that generated the content for various feeds has been replaced with static Hugo-generated RSS/Atom feeds. I added redirects for all the old stuff I could find (including ioshints.blogspot.com), but I could have missed something. The only defunct feed is the free content feed (which hasn’t changed in a while, anyway), as it required scanning the documents database. You can use this page to find the (ever-increasing) free content.

And now for the real story ;)

Ultra Etherent: Discovery

Updated 8-Ocotber 2025

Creating the fi_info Structure

Before the application can discover what communication services are available, it first needs a way to describe what it is looking for. This description is built using a structure called fi_info. The fi_info structure acts like a container that holds the application’s initial requirements, such as desired endpoint type or capabilities.

The first step is to reserve memory for this structure in the system’s main memory. The fi_allocinfo() helper function does this for the application. When called, fi_allocinfo() allocates space for a new fi_info structure, which this book refers to as the pre-fi_info, that will later be passed to the libfabric core for matching against available providers.

At this stage, most of the fields inside the pre-fi_info structure are left at their default values. The application typically sets only the most relevant parameters that express what it needs, such as the desired endpoint type or provider name, and leaves the rest for the provider to fill in later.

In addition to the main fi_info structure, the helper function also allocates memory for a set of sub-structures. These describe different parts of the communication stack, including the fabric, domain, and endpoints. The fixed sub-structures Continue reading

Working for a Vendor with David Gee

When I first met David Gee, he worked for a large system integrator. A few years later, he moved to a networking vendor, worked for a few of them, then for a software vendor, and finally decided to start his own system integration business.

Obviously, I wanted to know what drove him to make those changes, what lessons he learned working in various parts of the networking industry, and what (looking back with perfect hindsight) he would have changed.

Spaghetti Pasta Networking

Here’s an interesting data point in case you ever wondered why things are getting slower, even though the CPU performance is supposedly increasing. Albert Siersema sent me a link to a confusing implementation of spaghetti networking.

It looks like they’re trying to solve the how do I connect two containers (network namespaces) without having the privilege to create a vEth pair challenge with plenty of chewing gum and duct tape tap interfaces 🤦‍♂️

SysLinuxOS: The Go-To Linux for System Administrators

Today’s Linux distributions are plentiful and run the gamut of purposes. There are Linux distributions for those who are new to the open source OS, for gaming, developing, content creation, multimedia, containers, Internet of Things (IoT), edge, routers, firewalls, refrigerators … the list goes on and on. And, of course, there are Linux distributions that are purpose-built for those in IT, such as its site, “SysLinuxOS was built to work right out of the box, with all networking tools already installed by default. There is no need to install anything; it is a Swiss army knife to always carry with us. There are all the major Virtual Private Networks (VPN), several remote control clients, various browsers, as well as Wine, Wireshark, Etherape, Ettercap, PackETH, Packetsender, Putty, Nmap, Packet Tracer 8.2.2, Virtualbox 7.2, Munin, Zabbix-agent2, Icinga, Monit, Nagios4, and tools for serial console and the latest stable liquorix kernel.” At first blush, SysLinuxOS seems to be similar to Tails, only instead of it being targeted at pentesters, it’s more for administrators who need Continue reading

How to Take Packet Captures in ContainerLab/Netlab?

How to Take Packet Captures in ContainerLab/Netlab?

If you follow my blog, you probably know that I’m a big advocate for using Containerlab and Netlab to spin up network labs. I’ve already covered both tools in detail, so I won’t go over the basics again here. You can check the links below if you’re new to them or want a quick refresher. In this post, we’ll look at how to take packet captures in Containerlab labs. So, let’s get started.

Containerlab - Creating Network Labs Can’t be Any Easier
What if I tell you that all you need is just a YAML file with just a bunch of lines to create a Network Lab that can run easily on your laptop? I’ll walk you through what Containerlab is
How to Take Packet Captures in ContainerLab/Netlab?
Netlab - The Fastest Way to Build Network Labs
Netlab then takes care of creating the topology, assigning IP addresses, configuring routing protocols, and even pushing custom configs.
How to Take Packet Captures in ContainerLab/Netlab?

How do I run Containerlab?

I know everyone has their own way of running Containerlab, so I thought I’d share how I set up and run my labs. My daily driver is a MacBook, but I run Containerlab on a server that’s set up as Continue reading

SUSE and Tigera: Empowering Secure, Scalable Kubernetes with Calico Enterprise

Modern Workloads Demand Modern Kubernetes Infrastructure

As organizations expand Kubernetes adoption—modernizing legacy applications on VMs and bare metal, running next-generation AI workloads, and deploying intelligence at the edge—the demand for infrastructure that is scalable, flexible, resilient, secure, and performant has never been greater. At the same time, compliance, consistent visibility, and efficient management without overburdening teams remain critical.

The combination of Calico Enterprise from Tigera and SUSE Rancher Prime delivers a resilient and scalable platform that combines high-performance networking, robust network security, and operational simplicity in one stack.

Comprehensive Security Without Compromise

Calico Enterprise provides a unified platform for Kubernetes networking, security, and observability:

  • eBPF-powered networking for high performance without sidecar overhead
  • One platform for all Kubernetes traffic: ingress, egress, in-cluster, and multi-cluster
  • Security for every workload type: containers, VMs, and bare metal
  • Seamless scaling with built-in multi-cluster networking and security
  • Zero-trust security with identity-aware policies and workload-based microsegmentation
  • Integrated observability for policy enforcement and troubleshooting
  • Compliance features that simplify audits (PCI-DSS, HIPAA, SOC 2, FedRAMP)

Deployed with Rancher Prime, these capabilities extend directly into every cluster, enabling security-conscious industries such as finance, healthcare, and government to confidently run Kubernetes for any use case—from application modernization to AI and edge Continue reading

How to Connect Nested KubeVirt Clusters with Calico and BGP Peering

Running Kubernetes inside Kubernetes isn’t just a fun experiment anymore – it’s becoming a key pattern for delivering multi-environment platforms at scale. With KubeVirt, a virtualization add-on for Kubernetes that uses QEMU (an open-source machine emulator and virtualizer), you can run full-featured Kubernetes clusters as virtual machines (VMs) inside a parent Kubernetes cluster. This nested architecture makes it possible to unify containerized and virtualized workloads, and opens the door to new platform engineering use cases.

But here’s the challenge: how can you ensure that these nested clusters, and the workloads within, can reach, and be reached by, your physical network and are treated the same way as any other cluster?

That’s where Calico’s Advanced BGP (Border Gateway Protocol) peering with workloads comes into play. By enabling BGP route exchange between the parent cluster and nested KubeVirt VMs, Calico extends dynamic routing directly to virtualized workloads. This allows nested clusters to participate in the broader network topology and advertise their pod and service IPs just like any other node. Thus eliminating the need for tunnels or overlays to achieve true layer 3 connectivity.

In this blog, we’ll walk through the big picture, prerequisites, and step-by-step configuration for setting up BGP Continue reading

Using BIRD BGP Daemon as a BGP Route Reflector

In this challenge lab, you’ll configure a BIRD daemon running in a container as a BGP route reflector in a transit autonomous system. You should be familiar with the configuration concepts if you completed the IBGP lab exercises, but will probably struggle with BIRD configuration if you’re not familiar with it.

Click here to start the lab in your browser using GitHub Codespaces (or set up your own lab infrastructure). After starting the lab environment, change the directory to challenge/01-bird-rr, build the BIRD container with netlab clab build bird if needed, and execute netlab up.

LIU001: Growing Pains

Starting any new endeavor is hard. That’s particularly true for a career in tech. And that’s the reason Alexis Bertholf and Kevin Nanns are launching the Life In Uptime podcast. In each episode they’ll sit down with engineers, leaders, and builders in tech to uncover the stories behind their careers to help you see how... Read more »

netlab: Applying Simple Configuration Changes

For years, netlab has had custom configuration templates that can be used to deploy custom configurations onto lab devices. The custom configuration templates can be Jinja2 templates, and you can create different templates (for the same functionality) for different platforms. However, using that functionality if you need an extra command or two makes approximately as much sense as using a Kubernetes cluster to deploy a BusyBox container.

netlab release 25.09 solves that problem with the files plugin and the inline config functionality.