OSPF Router ID and Loopback Interface Myths

Daniel Dib wrote a nice article describing the history of the loopback interface1, triggering an inevitable mention of the role of a loopback interface in OSPF and related flood of ancient memories on my end.

Before going into the details, let’s get one fact straight: an OSPF router ID was always (at least from the days of OSPFv1 described in RFC 1133) just a 32-bit identifier, not an IPv4 address2. Straight from the RFC 1133:

Why Modern IPv6 Failed This Massive Kubernetes Networking Test

PARIS —When I worked for NASA in the 1980s, I helped build a Near Space Network tracking program using Datatrieve on VAX/VMS for the backend. When completed, it manually tracked just over a thousand static network links. That’s nothing — nothing — compared to what Starlink. This is not easy, as OpenInfra Summit Europe 2025. The problem they face is that while the mega-constellations of Low Earth Orbit (LEO) and Medium Earth Orbit (MEO) are revolutionizing telecom, traditional network routing protocols such as Open Shortest Path First (OSPF) and Border Gateway Protocol (BGP) struggle with their dynamic topologies — not to mention the next-generation Internet protocol, IPv6. The Challenge of Emulating Dynamic Satellite Networks So, the goal is to emulate large-scale, satellite mesh networks where the nodes are constantly moving and falling in and out of contact as they orbit the Earth and the world revolves underneath them. Deutsche Continue reading

NB548: Broadcom Brings Chips to Wi-Fi 8 Party; Attorneys General Scrutinize HPE/Juniper Settlement

Take a Network Break! On today’s coverage, F5 releases an emergency security update after state-backed threat actors breach internal systems, and North Korean attackers use the blockchain to host and hide malware. Broadcom is shipping an 800G NIC aimed at AI workloads, and Broadcom joins the Wi-Fi 8 party early with a sampling of pre-standard... Read more »

AI / ML network performance metrics at scale

The charts above show information from a GPU cluster running an AI / ML training workload. The 244 nodes in the cluster are connected by 100G links to a single large switch. Industry standard sFlow telemetry from the switch is shown in the two trend charts generated by the sFlow-RT real-time analytics engine. The charts are updated every 100mS.
  • Per Link Telemetry shows RoCEv2 traffic on 5 randomly selected links from the cluster. Each trend is computed based on sFlow random packet samples collected on the link. The packet header in each sample is decoded and the metric is computed for packets identified as RoCEv2.
  • Combined Fabric-Wide Telemetry combines the signals from all the links to create a fabric wide metric. The signals are highly correlated since the AI training compute / exchange cycle is synchronized across all compute nodes in the cluster. Constructive interference from combining data from all the links removes the noise in each individual signal and clearly shows the traffic pattern for the cluster.
This is a relatively small cluster. For larger clusters, the effect is even more pronounced, resulting in extremely sharp cluster-wide metrics. The sFlow instrumentation embedded as a standard feature of data center Continue reading

Using Git Pre-Commit Hooks

A while ago I wrote an article about linting Markdown files with markdownlint. In that article, I presented the use case of linting the Markdown source files for this site. While manually running linting checks is fine—there are times and situations when this is appropriate and necessary—this is the sort of task that is ideally suited for a Git pre-commit hook. In this post, I’ll discuss Git pre-commit hooks in the context of using them to run linting checks.

Before moving on, a disclaimer: I am not an expert on Git hooks. This post shares my limited experience and provides an example based on what I use for this site. I have no doubt that my current implementation will improve over time as my knowledge and experience grow.

What is a Git Hook?

As this page explains, a hook is a program “you can place in a hooks directory to trigger actions at certain points in git’s execution.” Generally, a hook is a script of some sort. Git supports different hooks that get invoked in response to specific actions in Git; in this particular instance, I’m focusing on the pre-commit hook. This hook gets invoked by git-commit (i.e. Continue reading

netlab: Embed Files in a Lab Topology

Today, I’ll focus on another feature of the new files plugin – you can use it to embed any (hopefully small) file in a lab topology (configlets are just a special case in which the plugin creates the relative file path from the configlets dictionary data).

You could use this functionality to include configuration files for Linux containers, custom reports, or even plugins in the lab topology, and share a complete solution as a single file that can be downloaded from a GitHub repository.

TNO046: Prisma AIRS: Securing the Multi-Cloud and AI Runtime (Sponsored)

Multi-cloud, automation, and AI are changing how modern networks operate and how firewalls and security policies are administered. In today’s sponsored episode with Palo Alto Networks, we dig into offerings such as CLARA (Cloud and AI Risk Assessment) that help ops teams gain more visibility into the structure and workflows of their multi-cloud networks. We... Read more »

Hedge 284: Netops and Corporate Culture

We all know netops, NRE, and devops can increase productivity, increase Mean Time Between Mistakes (MTBM), and decrease MTTR–but how do we deploy and use these tools? We often think of the technical hurdles you face in their deployment, but most of the blockers are actually cultural. Chris Grundemann, Eyvonne, Russ, and Tom discuss the cultural issues with deploying netops on this episode of the Hedge.
 

 
download

Load Balancing Monitor Groups: Multi-Service Health Checks for Resilient Applications

Modern applications are not monoliths. They are complex, distributed systems where availability depends on multiple independent components working in harmony. A web server might be running, but if its connection to the database is down or the authentication service is unresponsive, the application as a whole is unhealthy. Relying on a single health check is like knowing the “check engine” light is not on, but not knowing that one of your tires has a puncture. It’s great your engine is going, but you’re probably not driving far.

As applications grow in complexity, so does the definition of "healthy." We've heard from customers, big and small, that they need to validate multiple services to consider an endpoint ready to receive traffic. For example, they may need to confirm that an underlying API gateway is healthy and that a specific ‘/login’ service is responsive before routing users there. Until now, this required building custom, synthetic services to aggregate these checks, adding operational overhead and another potential point of failure.

Today, we are introducing Monitor Groups for Cloudflare Load Balancing. This feature provides a new way to create sophisticated, multi-service health assessments directly on our platform. With Monitor Groups, you can bundle Continue reading

Lab: Hide Transit Subnets in IS-IS Networks

Sometimes you want to assign IPv4/IPv6 subnets to transit links in your network (for example, to identify interfaces in traceroute outputs), but don’t need to have those subnets in the IP routing tables throughout the whole network. Like OSPF, IS-IS has a nerd knob you can use to exclude transit subnets from the router PDUs.

Want to check how that feature works with your favorite device? Use the Hide Transit Subnets in IS-IS Networks lab exercise.

Click here to start the lab in your browser using GitHub Codespaces (or set up your own lab infrastructure). After starting the lab environment, change the directory to feature/4-hide-transit and execute netlab up.

Ultra Ethernet: Memory Region

Memory Registration and Endpoint Binding in UET with libfabric 

[updated 25-Oct, 2025 - (RIs in the figure)]

In distributed AI workloads, each process requires memory regions that are visible to the fabric for efficient data transfer. The Job framework or application typically allocates these buffers in GPU VRAM to maximize throughput and enable low-latency direct memory access. These buffers store model parameters, gradients, neuron outputs, and temporary workspace, such as intermediate activations or partial gradients during collective operations in forward and backward passes.


Memory Registration and Key Generation

Once memory is allocated, it must be registered with the fabric domain using fi_mr_reg(). Registration informs the NIC that the memory is pinned and accessible for data transfers initiated by endpoints. The fabric library associates the buffer with a Memory Region handle (fid_mr) and internally generates a remote protection key (fi_mr_key), which uniquely identifies the memory region within the Job and domain context.

The local endpoint binds the fid_mr using fi_mr_bind() to define permitted operations, FI_REMOTE_WRITE in figure 4-10. This allows the NIC to access local memory efficiently and perform zero-copy operations.

The application retrieves the memory key using fi_mr_key(fid_mr) and constructs a Resource Index (RI) entry. The RI entry serves as Continue reading

Improving the trustworthiness of Javascript on the Web

The web is the most powerful application platform in existence. As long as you have the right API, you can safely run anything you want in a browser.

Well… anything but cryptography.

It is as true today as it was in 2011 that Javascript cryptography is Considered Harmful. The main problem is code distribution. Consider an end-to-end-encrypted messaging web application. The application generates cryptographic keys in the client’s browser that lets users view and send end-to-end encrypted messages to each other. If the application is compromised, what would stop the malicious actor from simply modifying their Javascript to exfiltrate messages?

It is interesting to note that smartphone apps don’t have this issue. This is because app stores do a lot of heavy lifting to provide security for the app ecosystem. Specifically, they provide integrity, ensuring that apps being delivered are not tampered with, consistency, ensuring all users get the same app, and transparency, ensuring that the record of versions of an app is truthful and publicly visible.

It would be nice if we could get these properties for our end-to-end encrypted web application, and the web as a whole, without requiring a single central authority like Continue reading