NetworkingNexus.net

Notes on Active Queue Management Techniques

“When you can measure what you are speaking about, and express it in numbers, you know something about it” - Lord Kelvin

Ultra Ethernet: Creating Endpoint Object

Endpoint Creation and Operation

[Updated 12-October, 2025: Figure & uet addressing section]

In libfabric and Ultra Ethernet Transport (UET), the endpoint, represented by the object fid_ep, serves as the primary communication interface between a process and the underlying network fabric. Every data exchange, whether it involves message passing, remote memory access (RMA), or atomic operations, ultimately passes through an endpoint. It acts as a software abstraction of the transport hardware, exposing a programmable interface that the application can use to perform high-performance data transfers.

Conceptually, an endpoint resembles a socket in the TCP/IP world. However, while sockets hide much of the underlying network stack behind a simple API, endpoints expose far more detail and control. They allow the process to define which completion queues to use, what capabilities to enable, and how multiple communication contexts are managed concurrently. This design gives applications, especially large distributed training frameworks and HPC workloads, direct control over latency, throughput, and concurrency in ways that traditional sockets cannot provide.

Furthermore, socket-based communication typically relies on the operating system’s networking stack and consumes CPU cycles for data movement and protocol handling. In contrast, endpoint communication paths can interact directly with the NIC, enabling user-space data transfers Continue reading

Netpicker NetBox Plugin and Automation

Netpicker brings together configuration backups, security and compliance checks, and automation in one place. It supports over 150 network vendors, including Cisco, Juniper, Arista, Palo Alto, Fortinet and many more. It also integrates well with other tools like Netbox, Nautobot, and Infrahub for inventory management, and Slurp’it for network discovery. All of these features, including backups, integrations, and automation, are available in the free version. The paid version adds features such as RBAC, approvals, detailed logging, support, and workflows for teams that need more control.

In our previous introductory post, we looked at what Netpicker is, how to set it up, and how it can back up configurations across multiple vendors. In this post, we'll focus on Netpicker Automation and how to use the Netpicker plugin with Netbox, so let's get to it.

Netpicker has partnered with me for this post, and they also support my blog as a sponsor.

Learn more

Prerequisites

This post assumes you already have a functioning Netpicker and NetBox instances. If you're completely new to Netpicker, check out the introductory post first, where we covered the basics like installation and initial setup. For this example, we’re using NetBox Community version 4.3. Continue reading

Worth Reading 101025

A number of recent changes have helped to push Ethernet forward, advancing its capabilities to better meet the needs of AI.

Cloudflare has confessed to a coding error using a React useEffect hook, notorious for being problematic if not handled carefully, that caused an outage for the platform’s dashboard and many of its APIs.

Working with MikroTik and IP Infusion’s OcNOS to interop EVPN/VxLAN has been on my wish list for a long time.

Have you heard about MRT dumps, but never tried to use them because the bar seems too high? Or are you tired of doing “parse -> grep -> process” every time you touch BGP MRT dumps?

Unfortunately, history is again being rewritten. It is rapidly becoming clear that LLMs are not economical.

As part of our research into post-quantum cryptography (PQC) for DNSSEC, we test PQC as a drop-in replacement for classical algorithms. We explore a transition where both run simultaneously, analysing how resolvers validate records, edge cases, and the feasibility and impact of such a period.

In much of the world, we are in an era that I like to call the “post-gigabit era”. Many users have access to gigabit connections—or at least hundreds of Continue reading

Hedge 283: Technical Planning

We network engineers often find ourselves without a viable plan–our plans always seem to go awry, to the point that many network engineers just give up on planning. Is “giving up” the right solution? Or can we learn to be better planners? Jonathan Adams and Tim McConnaughy join Russ to discussion planning for network engineers.

download

IBM Ships Homegrown “Spyre” Accelerators, Embraces Anthropic For AI Push

Big Blue may have missed the boat on being one of the big AI model builders, but its IBM Research division has built its own enterprise-grade family of models and its server and research divisions have plenty of experience building accelerators and supercomputers. …

IBM Ships Homegrown “Spyre” Accelerators, Embraces Anthropic For AI Push was written by Timothy Prickett Morgan at The Next Platform.

TNO045: IP and Optical Integration: Automation Across Layers

LightRiver has software and service products focused on the automation, optimization, and simplification of multi-layer, multi-vendor, and multi-generation networking. Today we have a team from LightRiver lead by Jim Brinksma to help explain how LightRiver is advancing automation in optical and bridging the gap between the IP and optical layers. They discuss the challenges, obstacles... Read more »

HN800: Root Cause Analysis for the Entire Stack (Sponsored)

Today’s show is one of those “We’re living in the future” episodes, where we talk about using AI to perform root cause analysis of a performance issue. But not root cause analysis for just the networking part of the stack. The full stack. Why? Because it’s not good enough to say “it’s not the network”.... Read more »

netlab 25.10: Cisco 8000v, Nicer Graphs

netlab release 25.10 includes:

Support for container version of Cisco 8000v emulator (finally a reasonable IOS-XR platform)
Support for vJunosEVO (vPTX) release 24+ (it needs UEFI BIOS), thanks to Aleksey Popov and Stefano Sasso
Wildcards or regular expressions in group- or as_list members.
Graphing improvements
OSPFv2/v3 on OpenBSD thanks to Remi Locherer
OSPFv2/v3 interface parameters on IOS XR

You’ll find more details in the release notes.

IPB185: When IPv6 VPN and DNS Don’t Cooperate

Sometimes weirdness occurs within DNS if you’re on an IPv4 network and you connect to a dual-stack or v6-only VPN. Maybe the browser doesn’t connect, but you can still send pings, or vice versa. Is the OS getting confused about which stack and which order of interfaces to request services? Is the weird behavior being... Read more »

Introducing REACT: Why We Built an Elite Incident Response Team

Cloudforce One’s mission is to help defend the Internet. In Q2’25 alone, Cloudflare stopped an average of 190 billion cyber threats every single day. But real-world customer experiences showed us that stopping attacks at the edge isn’t always enough. We saw ransomware disrupt financial operations, data breaches cripple real estate firms, and misconfigurations cause major data losses.

In each case, the real damage occurred inside networks.

These internal breaches uncovered another problem: customers had to hand off incidents to separate internal teams for investigation and remediation. Those handoffs created delays and fractured the response. The result was a gap that attackers could exploit. Critical context collected at the edge didn’t reach the teams managing cleanup, and valuable time was lost. Closing this gap has become essential, and we recognized the need to take responsibility for providing customers with a more unified defense.

Today, Cloudforce One is launching a new suite of incident response and security services to help organizations prepare for and respond to breaches.

These services are delivered by Cloudforce One REACT (Respond, Evaluate, Assess, Consult Team), a group of seasoned responders and security veterans who investigate threats, hunt adversaries, and work closely with executive leadership to guide Continue reading

N4N040: Mock Interview for a Network Administration Job

The job interview is an inescapable part of a career journey. Today on N Is For Networking, we conduct a mock interview with a candidate who’s applying for a junior network administration role. The goal is to give Toni Mrowetz, our candidate, feedback. At the same time, we hope this helps anyone listening who might... Read more »

Troubleshooting Multi-Pod EVPN

An engineer reading my multi-pod EVPN article asked an interesting question:

How do you handle troubleshooting when VTEPs cannot reach each other across pods?

The ancient Romans already knew the rough answer: divide and conquer.

In this particular case, the “divide” part starts with a simple realization: VXLAN/EVPN is just another application running on top of IP.

How to Deploy Whisker and Goldmane in Manifest Only Calico Setups

Your Step-by-Step to Observability Without the Operator

If you’re running Calico using manifests, you may have found that enabling the observability features introduced in version 3.30, including Whisker and Goldmane, requires a more hands-on approach. Earlier documentation focused on the Tigera operator, which automates key tasks such as certificate management and secure service configuration. In a manifest-based setup, these responsibilities shift to the user. While the process involves more manual steps, it provides greater transparency and control over each component. With the right guidance, setting up these observability tools is entirely achievable and offers valuable insight into the internal behavior of your Calico deployment.

We’ve heard from many of you in the Calico Slack community: you’re eager to try out Whisker and Goldmane but aren’t sure how to set them up without Helm or the operator. For anyone who’s up for a challenge, this blog post provides a step-by-step guide on how to get everything wired up the hard way.
However, even if you already use the operator, keep reading! We’re going to pull back the curtain on the magic it performs behind the scenes. Understanding these mechanics will help you troubleshoot, customize, and better appreciate a managed Continue reading

Cisco Takes On Broadcom, Nvidia For Fat AI Datacenter Interconnects

Wide area networks and datacenter interconnects, or DCIs, as we have known them for the past decade or so are nowhere beefy enough or fast enough to take on the job of scaling AI training workloads across multiple datacenters. …

Cisco Takes On Broadcom, Nvidia For Fat AI Datacenter Interconnects was written by Timothy Prickett Morgan at The Next Platform.

Leaner, More Efficient Storage Infrastructure for the AI Era

The AI era demands a simple infrastructure strategy that prioritizes scalability, performance and cost efficiency in managing AI data pipelines. A key challenge is supporting large language model (LLM) training, which requires massive data, compute and storage resources. Efficient training relies on the continuous feeding of large data sets and the storage of model parameters, intermediate results and checkpoints. Above all, the infrastructure strategy must ensure that the AI resources are scalable, reliable and cost-efficient. Scaling AI Training Infrastructure As models grow, so do the demands on high-performance block storage system with multiattach capabilities. The block Continue reading

How we found a bug in Go’s arm64 compiler

Every second, 84 million HTTP requests are hitting Cloudflare across our fleet of data centers in 330 cities. It means that even the rarest of bugs can show up frequently. In fact, it was our scale that recently led us to discover a bug in Go's arm64 compiler which causes a race condition in the generated code.

This post breaks down how we first encountered the bug, investigated it, and ultimately drove to the root cause.

Investigating a strange panic

We run a service in our network which configures the kernel to handle traffic for some products like Magic Transit and Magic WAN. Our monitoring watches this closely, and it started to observe very sporadic panics on arm64 machines.

We first saw one with a fatal error stating that traceback did not unwind completely. That error suggests that invariants were violated when traversing the stack, likely because of stack corruption. After a brief investigation we decided that it was probably rare stack memory corruption. This was a largely idle control plane service where unplanned restarts have negligible impact, and so we felt that following up was not a priority unless it kept happening.

And then it kept happening.

Coredumps Continue reading

Ultra Ethernet: Fabric Object – What it is and How it is created

Fabric Object

Fabric Object Overview

In libfabric, a fabric represents a logical network domain, a group of hardware and software resources that can communicate with each other through a shared network. All network ports that can exchange traffic belong to the same fabric domain. In practice, a fabric corresponds to one interconnected network, such as an Ethernet or Ultra Ethernet Transport (UET) fabric.

A good way to think about a fabric is to compare it to a Virtual Data Center (VDC) in a cloud environment. Just as a VDC groups together compute, storage, and networking resources into an isolated logical unit, a libfabric fabric groups together network interfaces, addresses, and transport resources that belong to the same communication context. Multiple fabrics can exist on the same system, just like multiple VDCs can operate independently within one cloud infrastructure.

The fabric object acts as the top-level context for all communication. Before an application can create domains, endpoints, or memory regions, it must first open a fabric using the fi_fabric() call. This creates the foundation for all other libfabric objects.

Each fabric is associated with a specific provider, for example, libfabric-uet, which defines how the fabric interacts with the underlying hardware and Continue reading

D2DO284: AI, MCP, and the Identities that Tie Them All Together

AI is developing at an incredible pace. With that development comes questions. For instance, how do you connect your resources to agents? How do agents connect with each other? And how do you keep it all secure? Our guest Christian Posta is here to guide us through AI, MCP, and the concept of workload identities.... Read more »

Changes in ipSpace.net RSS Feeds

TL&DR: You shouldn’t see any immediate impact of this change, but I’ll eventually clean up old stuff, so you might want to check the URLs if you use RSS/Atom feeds to get the list of ipSpace.net blog posts or podcast episodes. The (hopefully) final URLs are listed on this page.

Executive Summary: I cleaned up the whole ipSpace.net RSS/Atom feeds system. The script that generated the content for various feeds has been replaced with static Hugo-generated RSS/Atom feeds. I added redirects for all the old stuff I could find (including ioshints.blogspot.com), but I could have missed something. The only defunct feed is the free content feed (which hasn’t changed in a while, anyway), as it required scanning the documents database. You can use this page to find the (ever-increasing) free content.

And now for the real story ;)