AI / ML network performance metrics at scale

The charts above show information from a GPU cluster running an AI / ML training workload. The 244 nodes in the cluster are connected by 100G links to a single large switch. Industry standard sFlow telemetry from the switch is shown in the two trend charts generated by the sFlow-RT real-time analytics engine. The charts are updated every 100mS.
  • Per Link Telemetry shows RoCEv2 traffic on 5 randomly selected links from the cluster. Each trend is computed based on sFlow random packet samples collected on the link. The packet header in each sample is decoded and the metric is computed for packets identified as RoCEv2.
  • Combined Fabric-Wide Telemetry combines the signals from all the links to create a fabric wide metric. The signals are highly correlated since the AI training compute / exchange cycle is synchronized across all compute nodes in the cluster. Constructive interference from combining data from all the links removes the noise in each individual signal and clearly shows the traffic pattern for the cluster.
This is a relatively small cluster. For larger clusters, the effect is even more pronounced, resulting in extremely sharp cluster-wide metrics. The sFlow instrumentation embedded as a standard feature of data center Continue reading

Hedge 284: Netops and Corporate Culture

We all know netops, NRE, and devops can increase productivity, increase Mean Time Between Mistakes (MTBM), and decrease MTTR–but how do we deploy and use these tools? We often think of the technical hurdles you face in their deployment, but most of the blockers are actually cultural. Chris Grundemann, Eyvonne, Russ, and Tom discuss the cultural issues with deploying netops on this episode of the Hedge.
 

 
download

Load Balancing Monitor Groups: Multi-Service Health Checks for Resilient Applications

Modern applications are not monoliths. They are complex, distributed systems where availability depends on multiple independent components working in harmony. A web server might be running, but if its connection to the database is down or the authentication service is unresponsive, the application as a whole is unhealthy. Relying on a single health check is like knowing the “check engine” light is not on, but not knowing that one of your tires has a puncture. It’s great your engine is going, but you’re probably not driving far.

As applications grow in complexity, so does the definition of "healthy." We've heard from customers, big and small, that they need to validate multiple services to consider an endpoint ready to receive traffic. For example, they may need to confirm that an underlying API gateway is healthy and that a specific ‘/login’ service is responsive before routing users there. Until now, this required building custom, synthetic services to aggregate these checks, adding operational overhead and another potential point of failure.

Today, we are introducing Monitor Groups for Cloudflare Load Balancing. This feature provides a new way to create sophisticated, multi-service health assessments directly on our platform. With Monitor Groups, you can bundle Continue reading

Lab: Hide Transit Subnets in IS-IS Networks

Sometimes you want to assign IPv4/IPv6 subnets to transit links in your network (for example, to identify interfaces in traceroute outputs), but don’t need to have those subnets in the IP routing tables throughout the whole network. Like OSPF, IS-IS has a nerd knob you can use to exclude transit subnets from the router PDUs.

Want to check how that feature works with your favorite device? Use the Hide Transit Subnets in IS-IS Networks lab exercise.

Click here to start the lab in your browser using GitHub Codespaces (or set up your own lab infrastructure). After starting the lab environment, change the directory to feature/4-hide-transit and execute netlab up.

Ultra Ethernet: Memory Region

Memory Registration and Endpoint Binding in UET with libfabric 


In distributed AI workloads, each process requires memory regions that are visible to the fabric for efficient data transfer. The Job framework or application typically allocates these buffers in GPU VRAM to maximize throughput and enable low-latency direct memory access. These buffers store model parameters, gradients, neuron outputs, and temporary workspace, such as intermediate activations or partial gradients during collective operations in forward and backward passes.


Memory Registration and Key Generation

Once memory is allocated, it must be registered with the fabric domain using fi_mr_reg(). Registration informs the NIC that the memory is pinned and accessible for data transfers initiated by endpoints. The fabric library associates the buffer with a Memory Region handle (fid_mr) and internally generates a remote protection key (fi_mr_key), which uniquely identifies the memory region within the Job and domain context.

The local endpoint binds the fid_mr using fi_mr_bind() to define permitted operations, FI_REMOTE_WRITE in figure 4-10. This allows the NIC to access local memory efficiently and perform zero-copy operations.

The application retrieves the memory key using fi_mr_key(fid_mr) and constructs a Resource Index (RI) entry. The RI entry serves as a compact, portable identifier of the memory region Continue reading

Improving the trustworthiness of Javascript on the Web

The web is the most powerful application platform in existence. As long as you have the right API, you can safely run anything you want in a browser.

Well… anything but cryptography.

It is as true today as it was in 2011 that Javascript cryptography is Considered Harmful. The main problem is code distribution. Consider an end-to-end-encrypted messaging web application. The application generates cryptographic keys in the client’s browser that lets users view and send end-to-end encrypted messages to each other. If the application is compromised, what would stop the malicious actor from simply modifying their Javascript to exfiltrate messages?

It is interesting to note that smartphone apps don’t have this issue. This is because app stores do a lot of heavy lifting to provide security for the app ecosystem. Specifically, they provide integrity, ensuring that apps being delivered are not tampered with, consistency, ensuring all users get the same app, and transparency, ensuring that the record of versions of an app is truthful and publicly visible.

It would be nice if we could get these properties for our end-to-end encrypted web application, and the web as a whole, without requiring a single central authority like Continue reading

How NRP Scales Global Scientific Research with Calico

The National Research Platform (NRP) operates a globally distributed, high-performance computing and networking environment, with an average of 15,000 pods across 450 nodes supporting more than 3,000 scientific project namespaces. With its head node in San Diego, NRP connects research institutions and data centers worldwide via links ranging from 10 to 400 Gbps, serving more than 5,000 users in 70+ locations.

  • Non-profit company
  • Uses Calico Open Source

NRP is a partnership of more than 50 institutions, led by researchers at UC San Diego, University of Nebraska-Lincoln, and Massachusetts Green High Performance Computing Center and includes contributions by the National Science Foundation, the Department of Energy, the Department of Defense, and many research universities and R&E networking organizations in the US and around the world.

Challenges

Complex Network Visibility and Debugging

NRP needed a way to diagnose connectivity problems across globally distributed storage nodes. Frequent changes to edge network configurations, ACLs, firewalls, and static routes caused blocked ports, forcing manual troubleshooting with tools such as nmap and iperf. This process slowed down root-cause analysis and problem resolution.

Balancing Performance with Granular Security at Scale

Scientific workflows demanded maximum throughput over 100/400 Gbps links and jumbo frames. Traditional host firewalls introduced unacceptable Continue reading

Ellison: Oracle Is Leveraging All Of Its Advantages To Build A Different Cloud

For years, Oracle has found itself solidly in the second tier of cloud providers, well behind the top three of Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform, which combined account for more than 60 percent of global cloud infrastructure services market.

Ellison: Oracle Is Leveraging All Of Its Advantages To Build A Different Cloud was written by Jeffrey Burt at The Next Platform.

Ultra Ethernet: Address Resolution with Address Vector Table

Address Vector


Overview

To enable Remote Memory Access (RMA) operations between processes, each endpoint — representing a communication channel much like a TCP socket — must know the destination process’s location within the fabric. This location is represented by the Fabric Address (FA) assigned to a Fabric Endpoint (FEP).

During job initialization, FAs are distributed through a control-plane–like procedure in which the master rank collects FAs from all ranks and then broadcasts the complete Rank-to-FA mapping to every participant (see Chapter 3 for details). Each process stores this Rank–FA mapping locally as a structure, which can then be inserted into the Address Vector (AV) Table.

When FAs from the distributed Rank-to-FA table are inserted into the AV Table, the provider assigns each entry an index number, which is published to the application as an fi_addr_t handle. After an endpoint object is bound to the AV Table, the application uses this handle — rather than the full address — when referencing a destination process. This abstraction hides the underlying address structure from the application and allows fast and efficient lookups during communication.

This mechanism resembles the functionality of a BGP Route Reflector (RR) in IP networks. Each RR client advertises its Continue reading

EVPN Designs: Multi-Pod with IP-Only WAN Routers

In the multi-pod EVPN design, I described a simple way to merge two EVPN fabrics into a single end-to-end fabric. Here are a few highlights of that design:

  • Each fabric is running OSPF and IBGP, with core (spine) devices being route reflectors
  • There’s an EBGP session between the WAN edge routers (sometimes called border leaf switches)
  • Every BGP session carries IPv4 (underlay) and EVPN (overlay) routes.

In that design, the WAN edge routers have to support EVPN (at least in the control plane) and carry all EVPN routes for both fabrics. Today, we’ll change the design to use simpler WAN edge routers that support only IP forwarding.

Unpacking Cloudflare Workers CPU Performance Benchmarks

On October 4, independent developer Theo Browne published a series of benchmarks designed to compare server-side JavaScript execution speed between Cloudflare Workers and Vercel, a competing compute platform built on AWS Lambda. The initial results showed Cloudflare Workers performing worse than Node.js on Vercel at a variety of CPU-intensive tasks, by a factor of as much as 3.5x.

We were surprised by the results. The benchmarks were designed to compare JavaScript execution speed in a CPU-intensive workload that never waits on external services. But, Cloudflare Workers and Node.js both use the same underlying JavaScript engine: V8, the open source engine from Google Chrome. Hence, one would expect the benchmarks to be executing essentially identical code in each environment. Physical CPUs can vary in performance, but modern server CPUs do not vary by anywhere near 3.5x.

On investigation, we discovered a wide range of small problems that contributed to the disparity, ranging from some bad tuning in our infrastructure, to differences between the JavaScript libraries used on each platform, to some issues with the test itself. We spent the week working on many of these problems, which means over the past week Workers got better and faster Continue reading

Why Can’t We Have Good Documentation

Daniel Dib asked a sad question on LinkedIn:

Where did all the great documentation go?

In more detail:

There was a time when documentation answered almost all questions:

  • What is the thing?
  • What does the thing do?
  • Why would you use the thing?
  • How do you configure the thing?

I’ve seen the same thing happening in training, and here’s my cynical TL&DR answer: because the managers of the documentation/training departments don’t understand the true value of what they’re producing and thus cannot justify a decent budget to make it happen.

1 2 3 3,820