netlab: Embed Configuration Templates in a Lab Topology File

A few days ago, I described how you can use the new config.inline functionality to apply additional configuration commands to individual devices in a netlab-powered lab.

However, sometimes you have to apply the same set of commands to several devices. Although you could use device groups to do that, netlab release 25.09 offers a much better mechanism: you can embed custom configuration templates in the lab topology file.

Ultra Ethernet: Creating Endpoint Object

Endpoint Creation and Operation

[Updated 12-October, 2025: Figure & uet addressing section]

In libfabric and Ultra Ethernet Transport (UET), the endpoint, represented by the object fid_ep, serves as the primary communication interface between a process and the underlying network fabric. Every data exchange, whether it involves message passing, remote memory access (RMA), or atomic operations, ultimately passes through an endpoint. It acts as a software abstraction of the transport hardware, exposing a programmable interface that the application can use to perform high-performance data transfers.

Conceptually, an endpoint resembles a socket in the TCP/IP world. However, while sockets hide much of the underlying network stack behind a simple API, endpoints expose far more detail and control. They allow the process to define which completion queues to use, what capabilities to enable, and how multiple communication contexts are managed concurrently. This design gives applications, especially large distributed training frameworks and HPC workloads, direct control over latency, throughput, and concurrency in ways that traditional sockets cannot provide.

Furthermore, socket-based communication typically relies on the operating system’s networking stack and consumes CPU cycles for data movement and protocol handling. In contrast, endpoint communication paths can interact directly with the NIC, enabling user-space data transfers Continue reading

Netpicker NetBox Plugin and Automation

Netpicker NetBox Plugin and Automation

Netpicker brings together configuration backups, security and compliance checks, and automation in one place. It supports over 150 network vendors, including Cisco, Juniper, Arista, Palo Alto, Fortinet and many more. It also integrates well with other tools like Netbox, Nautobot, and Infrahub for inventory management, and Slurp’it for network discovery. All of these features, including backups, integrations, and automation, are available in the free version. The paid version adds features such as RBAC, approvals, detailed logging, support, and workflows for teams that need more control.

In our previous introductory post, we looked at what Netpicker is, how to set it up, and how it can back up configurations across multiple vendors. In this post, we'll focus on Netpicker Automation and how to use the Netpicker plugin with Netbox, so let's get to it.

SPONSORED

Netpicker has partnered with me for this post, and they also support my blog as a sponsor.

Learn more

Prerequisites

This post assumes you already have a functioning Netpicker and NetBox instances. If you're completely new to Netpicker, check out the introductory post first, where we covered the basics like installation and initial setup. For this example, we’re using NetBox Community version 4.3. Continue reading

Hedge 283: Technical Planning

We network engineers often find ourselves without a viable plan–our plans always seem to go awry, to the point that many network engineers just give up on planning. Is “giving up” the right solution? Or can we learn to be better planners? Jonathan Adams and Tim McConnaughy join Russ to discussion planning for network engineers.

download

IBM Ships Homegrown “Spyre” Accelerators, Embraces Anthropic For AI Push

Big Blue may have missed the boat on being one of the big AI model builders, but its IBM Research division has built its own enterprise-grade family of models and its server and research divisions have plenty of experience building accelerators and supercomputers.

IBM Ships Homegrown “Spyre” Accelerators, Embraces Anthropic For AI Push was written by Timothy Prickett Morgan at The Next Platform.

netlab 25.10: Cisco 8000v, Nicer Graphs

netlab release 25.10 includes:

You’ll find more details in the release notes.

Introducing REACT: Why We Built an Elite Incident Response Team

Cloudforce One’s mission is to help defend the Internet. In Q2’25 alone, Cloudflare stopped an average of 190 billion cyber threats every single day. But real-world customer experiences showed us that stopping attacks at the edge isn’t always enough. We saw ransomware disrupt financial operations, data breaches cripple real estate firms, and misconfigurations cause major data losses.

In each case, the real damage occurred inside networks.

These internal breaches uncovered another problem: customers had to hand off incidents to separate internal teams for investigation and remediation. Those handoffs created delays and fractured the response. The result was a gap that attackers could exploit. Critical context collected at the edge didn’t reach the teams managing cleanup, and valuable time was lost. Closing this gap has become essential, and we recognized the need to take responsibility for providing customers with a more unified defense.

Today, Cloudforce One is launching a new suite of incident response and security services to help organizations prepare for and respond to breaches.

These services are delivered by Cloudforce One REACT (Respond, Evaluate, Assess, Consult Team), a group of seasoned responders and security veterans who investigate threats, hunt adversaries, and work closely with executive leadership to guide Continue reading

Troubleshooting Multi-Pod EVPN

An engineer reading my multi-pod EVPN article asked an interesting question:

How do you handle troubleshooting when VTEPs cannot reach each other across pods?

The ancient Romans already knew the rough answer: divide and conquer.

In this particular case, the “divide” part starts with a simple realization: VXLAN/EVPN is just another application running on top of IP.

How to Deploy Whisker and Goldmane in Manifest Only Calico Setups

Your Step-by-Step to Observability Without the Operator

If you’re running Calico using manifests, you may have found that enabling the observability features introduced in version 3.30, including Whisker and Goldmane, requires a more hands-on approach. Earlier documentation focused on the Tigera operator, which automates key tasks such as certificate management and secure service configuration. In a manifest-based setup, these responsibilities shift to the user. While the process involves more manual steps, it provides greater transparency and control over each component. With the right guidance, setting up these observability tools is entirely achievable and offers valuable insight into the internal behavior of your Calico deployment.

We’ve heard from many of you in the Calico Slack community: you’re eager to try out Whisker and Goldmane but aren’t sure how to set them up without Helm or the operator. For anyone who’s up for a challenge, this blog post provides a step-by-step guide on how to get everything wired up the hard way.
However, even if you already use the operator, keep reading! We’re going to pull back the curtain on the magic it performs behind the scenes. Understanding these mechanics will help you troubleshoot, customize, and better appreciate a managed Continue reading

Leaner, More Efficient Storage Infrastructure for the AI Era

The AI era demands a simple infrastructure strategy that prioritizes scalability, performance and cost efficiency in managing AI data pipelines. A key challenge is supporting large language model (LLM) training, which requires massive data, compute and storage resources. Efficient training relies on the continuous feeding of large data sets and the storage of model parameters, intermediate results and checkpoints. Above all, the infrastructure strategy must ensure that the AI resources are scalable, reliable and cost-efficient. Scaling AI Training Infrastructure As models grow, so do the demands on high-performance block storage system with multiattach capabilities. The block Continue reading

How we found a bug in Go’s arm64 compiler

Every second, 84 million HTTP requests are hitting Cloudflare across our fleet of data centers in 330 cities. It means that even the rarest of bugs can show up frequently. In fact, it was our scale that recently led us to discover a bug in Go's arm64 compiler which causes a race condition in the generated code.

This post breaks down how we first encountered the bug, investigated it, and ultimately drove to the root cause.

Investigating a strange panic

We run a service in our network which configures the kernel to handle traffic for some products like Magic Transit and Magic WAN. Our monitoring watches this closely, and it started to observe very sporadic panics on arm64 machines.

We first saw one with a fatal error stating that traceback did not unwind completely. That error suggests that invariants were violated when traversing the stack, likely because of stack corruption. After a brief investigation we decided that it was probably rare stack memory corruption. This was a largely idle control plane service where unplanned restarts have negligible impact, and so we felt that following up was not a priority unless it kept happening.

And then it kept happening. 

Coredumps Continue reading

Ultra Ethernet: Fabric Object – What it is and How it is created

Fabric Object


Fabric Object Overview

In libfabric, a fabric represents a logical network domain, a group of hardware and software resources that can communicate with each other through a shared network. All network ports that can exchange traffic belong to the same fabric domain. In practice, a fabric corresponds to one interconnected network, such as an Ethernet or Ultra Ethernet Transport (UET) fabric.

A good way to think about a fabric is to compare it to a Virtual Data Center (VDC) in a cloud environment. Just as a VDC groups together compute, storage, and networking resources into an isolated logical unit, a libfabric fabric groups together network interfaces, addresses, and transport resources that belong to the same communication context. Multiple fabrics can exist on the same system, just like multiple VDCs can operate independently within one cloud infrastructure.

The fabric object acts as the top-level context for all communication. Before an application can create domains, endpoints, or memory regions, it must first open a fabric using the fi_fabric() call. This creates the foundation for all other libfabric objects.

Each fabric is associated with a specific provider,  for example, libfabric-uet, which defines how the fabric interacts with the underlying hardware and Continue reading

Changes in ipSpace.net RSS Feeds

TL&DR: You shouldn’t see any immediate impact of this change, but I’ll eventually clean up old stuff, so you might want to check the URLs if you use RSS/Atom feeds to get the list of ipSpace.net blog posts or podcast episodes. The (hopefully) final URLs are listed on this page.

Executive Summary: I cleaned up the whole ipSpace.net RSS/Atom feeds system. The script that generated the content for various feeds has been replaced with static Hugo-generated RSS/Atom feeds. I added redirects for all the old stuff I could find (including ioshints.blogspot.com), but I could have missed something. The only defunct feed is the free content feed (which hasn’t changed in a while, anyway), as it required scanning the documents database. You can use this page to find the (ever-increasing) free content.

And now for the real story ;)

Ultra Etherent: Discovery

Updated 8-Ocotber 2025

Creating the fi_info Structure

Before the application can discover what communication services are available, it first needs a way to describe what it is looking for. This description is built using a structure called fi_info. The fi_info structure acts like a container that holds the application’s initial requirements, such as desired endpoint type or capabilities.

The first step is to reserve memory for this structure in the system’s main memory. The fi_allocinfo() helper function does this for the application. When called, fi_allocinfo() allocates space for a new fi_info structure, which this book refers to as the pre-fi_info, that will later be passed to the libfabric core for matching against available providers.

At this stage, most of the fields inside the pre-fi_info structure are left at their default values. The application typically sets only the most relevant parameters that express what it needs, such as the desired endpoint type or provider name, and leaves the rest for the provider to fill in later.

In addition to the main fi_info structure, the helper function also allocates memory for a set of sub-structures. These describe different parts of the communication stack, including the fabric, domain, and endpoints. The fixed sub-structures Continue reading

Working for a Vendor with David Gee

When I first met David Gee, he worked for a large system integrator. A few years later, he moved to a networking vendor, worked for a few of them, then for a software vendor, and finally decided to start his own system integration business.

Obviously, I wanted to know what drove him to make those changes, what lessons he learned working in various parts of the networking industry, and what (looking back with perfect hindsight) he would have changed.