We all know netops, NRE, and devops can increase productivity, increase Mean Time Between Mistakes (MTBM), and decrease MTTR–but how do we deploy and use these tools? We often think of the technical hurdles you face in their deployment, but most of the blockers are actually cultural. Chris Grundemann, Eyvonne, Russ, and Tom discuss the cultural issues with deploying netops on this episode of the Hedge.
download
Modern applications are not monoliths. They are complex, distributed systems where availability depends on multiple independent components working in harmony. A web server might be running, but if its connection to the database is down or the authentication service is unresponsive, the application as a whole is unhealthy. Relying on a single health check is like knowing the “check engine” light is not on, but not knowing that one of your tires has a puncture. It’s great your engine is going, but you’re probably not driving far.
As applications grow in complexity, so does the definition of "healthy." We've heard from customers, big and small, that they need to validate multiple services to consider an endpoint ready to receive traffic. For example, they may need to confirm that an underlying API gateway is healthy and that a specific ‘/login’ service is responsive before routing users there. Until now, this required building custom, synthetic services to aggregate these checks, adding operational overhead and another potential point of failure.
Today, we are introducing Monitor Groups for Cloudflare Load Balancing. This feature provides a new way to create sophisticated, multi-service health assessments directly on our platform. With Monitor Groups, you can bundle Continue reading
Sometimes you want to assign IPv4/IPv6 subnets to transit links in your network (for example, to identify interfaces in traceroute outputs), but don’t need to have those subnets in the IP routing tables throughout the whole network. Like OSPF, IS-IS has a nerd knob you can use to exclude transit subnets from the router PDUs.
Want to check how that feature works with your favorite device? Use the Hide Transit Subnets in IS-IS Networks lab exercise.
Click here to start the lab in your browser using GitHub Codespaces (or set up your own lab infrastructure). After starting the lab environment, change the directory to feature/4-hide-transit and execute netlab up.
The world’s largest and one might argue most important chip foundry is telling Wall Street that its AI-related sales are running ahead of schedule. …
TSMC Is Running Ahead Of Forecasts On AI Growth was written by Timothy Prickett Morgan at The Next Platform.
[updated 25-Oct, 2025 - (RIs in the figure)]
In distributed AI workloads, each process requires memory regions that are visible to the fabric for efficient data transfer. The Job framework or application typically allocates these buffers in GPU VRAM to maximize throughput and enable low-latency direct memory access. These buffers store model parameters, gradients, neuron outputs, and temporary workspace, such as intermediate activations or partial gradients during collective operations in forward and backward passes.
Once memory is allocated, it must be registered with the fabric domain using fi_mr_reg(). Registration informs the NIC that the memory is pinned and accessible for data transfers initiated by endpoints. The fabric library associates the buffer with a Memory Region handle (fid_mr) and internally generates a remote protection key (fi_mr_key), which uniquely identifies the memory region within the Job and domain context.
The local endpoint binds the fid_mr using fi_mr_bind() to define permitted operations, FI_REMOTE_WRITE in figure 4-10. This allows the NIC to access local memory efficiently and perform zero-copy operations.
The application retrieves the memory key using fi_mr_key(fid_mr) and constructs a Resource Index (RI) entry. The RI entry serves as Continue reading
The web is the most powerful application platform in existence. As long as you have the right API, you can safely run anything you want in a browser.
Well… anything but cryptography.
It is as true today as it was in 2011 that Javascript cryptography is Considered Harmful. The main problem is code distribution. Consider an end-to-end-encrypted messaging web application. The application generates cryptographic keys in the client’s browser that lets users view and send end-to-end encrypted messages to each other. If the application is compromised, what would stop the malicious actor from simply modifying their Javascript to exfiltrate messages?
It is interesting to note that smartphone apps don’t have this issue. This is because app stores do a lot of heavy lifting to provide security for the app ecosystem. Specifically, they provide integrity, ensuring that apps being delivered are not tampered with, consistency, ensuring all users get the same app, and transparency, ensuring that the record of versions of an app is truthful and publicly visible.
It would be nice if we could get these properties for our end-to-end encrypted web application, and the web as a whole, without requiring a single central authority like Continue reading

The National Research Platform (NRP) operates a globally distributed, high-performance computing and networking environment, with an average of 15,000 pods across 450 nodes supporting more than 3,000 scientific project namespaces. With its head node in San Diego, NRP connects research institutions and data centers worldwide via links ranging from 10 to 400 Gbps, serving more than 5,000 users in 70+ locations.
NRP is a partnership of more than 50 institutions, led by researchers at UC San Diego, University of Nebraska-Lincoln, and Massachusetts Green High Performance Computing Center and includes contributions by the National Science Foundation, the Department of Energy, the Department of Defense, and many research universities and R&E networking organizations in the US and around the world.

NRP needed a way to diagnose connectivity problems across globally distributed storage nodes. Frequent changes to edge network configurations, ACLs, firewalls, and static routes caused blocked ports, forcing manual troubleshooting with tools such as nmap and iperf. This process slowed down root-cause analysis and problem resolution.
Scientific workflows demanded maximum throughput over 100/400 Gbps links and jumbo frames. Traditional host firewalls introduced unacceptable Continue reading
For years, Oracle has found itself solidly in the second tier of cloud providers, well behind the top three of Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform, which combined account for more than 60 percent of global cloud infrastructure services market. …
Ellison: Oracle Is Leveraging All Of Its Advantages To Build A Different Cloud was written by Jeffrey Burt at The Next Platform.
To enable Remote Memory Access (RMA) operations between processes, each endpoint — representing a communication channel much like a TCP socket — must know the destination process’s location within the fabric. This location is represented by the Fabric Address (FA) assigned to a Fabric Endpoint (FEP).
During job initialization, FAs are distributed through a control-plane–like procedure in which the master rank collects FAs from all ranks and then broadcasts the complete Rank-to-FA mapping to every participant (see Chapter 3 for details). Each process stores this Rank–FA mapping locally as a structure, which can then be inserted into the Address Vector (AV) Table.
When FAs from the distributed Rank-to-FA table are inserted into the AV Table, the provider assigns each entry an index number, which is published to the application as an fi_addr_t handle. After an endpoint object is bound to the AV Table, the application uses this handle — rather than the full address — when referencing a destination process. This abstraction hides the underlying address structure from the application and allows fast and efficient lookups during communication.
This mechanism resembles the functionality of a BGP Route Reflector (RR) in IP networks. Each RR client advertises its Continue reading
In the multi-pod EVPN design, I described a simple way to merge two EVPN fabrics into a single end-to-end fabric. Here are a few highlights of that design:
In that design, the WAN edge routers have to support EVPN (at least in the control plane) and carry all EVPN routes for both fabrics. Today, we’ll change the design to use simpler WAN edge routers that support only IP forwarding.
On October 4, independent developer Theo Browne published a series of benchmarks designed to compare server-side JavaScript execution speed between Cloudflare Workers and Vercel, a competing compute platform built on AWS Lambda. The initial results showed Cloudflare Workers performing worse than Node.js on Vercel at a variety of CPU-intensive tasks, by a factor of as much as 3.5x.
We were surprised by the results. The benchmarks were designed to compare JavaScript execution speed in a CPU-intensive workload that never waits on external services. But, Cloudflare Workers and Node.js both use the same underlying JavaScript engine: V8, the open source engine from Google Chrome. Hence, one would expect the benchmarks to be executing essentially identical code in each environment. Physical CPUs can vary in performance, but modern server CPUs do not vary by anywhere near 3.5x.
On investigation, we discovered a wide range of small problems that contributed to the disparity, ranging from some bad tuning in our infrastructure, to differences between the JavaScript libraries used on each platform, to some issues with the test itself. We spent the week working on many of these problems, which means over the past week Workers got better and faster Continue reading
It is Oracle OpenWorld CloudWorld AI World this week, so we expect a lot of AI infrastructure announcements from Big Red, with AI being the biggest new workload to hit the enterprise in decades. …
Oracle First In Line For AMD “Altair” MI450 GPUs, “Helios” Racks was written by Timothy Prickett Morgan at The Next Platform.