How ARP Killed a Static Route

The amount of weird stuff we discover in netlab integration tests is astounding, or maybe I have a knack for looking into the wrong dark corners (my wife would definitely agree with that). Today’s special: when having two next hops kills a static route.

TL&DR: default ARP settings on a multi-subnet Linux host are less than optimal.
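To make that claim a bit more concrete, here is a toy model of how a Linux host decides whether to answer an ARP request. It is a simplified sketch of the documented arp_ignore sysctl (values 0 and 1 only), not kernel code, and it is not necessarily the exact setting this post ends up blaming. The interface names and addresses are made up for illustration.

```python
# Toy model of the Linux arp_ignore sysctl (values 0 and 1 only).
# Interface names and IP addresses are hypothetical.

def replies_to_arp(interfaces, incoming_if, target_ip, arp_ignore=0):
    """Return True if the host would answer an ARP request for target_ip
    that arrived on incoming_if.

    interfaces maps an interface name to the set of IPs configured on it.
    """
    if arp_ignore == 0:
        # Default: answer for any local address, whichever interface owns it.
        return any(target_ip in addrs for addrs in interfaces.values())
    if arp_ignore == 1:
        # Answer only if the address is configured on the receiving interface.
        return target_ip in interfaces.get(incoming_if, set())
    raise ValueError("this sketch models arp_ignore 0 and 1 only")


host = {"eth1": {"10.1.0.2"}, "eth2": {"10.2.0.2"}}

# A request for the eth2 address arrives on eth1.
print(replies_to_arp(host, "eth1", "10.2.0.2", arp_ignore=0))
print(replies_to_arp(host, "eth1", "10.2.0.2", arp_ignore=1))
```

With the default setting, the multi-subnet host answers on an interface that does not own the address, which is the kind of behavior that can confuse a neighbor resolving a next hop.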

We use these principles when creating netlab integration tests:

  • They should contain a single device-under-test and a bunch of attached probes.
  • They should test a single feature.
  • They should not rely on the device-under-test. All validation has to be done on probes.

How do you test static routes under these restrictions? Here’s what we did:

Ultra Valuable uv for Dynamic, On-Demand Python Virtual Environments

What is UV? Astral's uv is a fast, all-in-one Python package and project manager written in Rust that unifies and accelerates Python development workflows by replacing multiple tools and actions including:

  • pip, pip-tools, poetry, pipx, pyenv, virtualenv, and twine
  • initializing a git repository
  • creating base files like .gitignore and pyproject.toml (think of this as requirements.txt READ MORE

The post Ultra Valuable uv for Dynamic, On-Demand Python Virtual Environments appeared first on The Gratuitous Arp.

AI Metrics with Prometheus and Grafana

The Grafana AI Metrics dashboard shown above tracks performance metrics for AI/ML RoCEv2 network traffic, for example, large scale CUDA compute tasks using NVIDIA Collective Communication Library (NCCL) operations for inter-GPU communications: AllReduce, Broadcast, Reduce, AllGather, and ReduceScatter.

The metrics include:

  • Total Traffic: Total traffic entering fabric
  • Operations: Total RoCEv2 operations broken out by type
  • Core Link Traffic: Histogram of load on fabric links
  • Edge Link Traffic: Histogram of load on access ports
  • RDMA Operations: Total RDMA operations
  • RDMA Bytes: Average RDMA operation size
  • Credits: Average number of credits in RoCEv2 acknowledgements
  • Period: Detected period of compute / exchange activity on fabric (in this case just over 0.5 seconds)
  • Congestion: Total ECN / CNP congestion messages
  • Errors: Total ingress / egress errors
  • Discards: Total ingress / egress discards
  • Drop Reasons: Packet drop reasons

This article gives step-by-step instructions to set up the dashboard in a production environment.

git clone https://github.com/sflow-rt/prometheus-grafana.git
cd prometheus-grafana
env RT_IMAGE=ai-metrics ./start.sh

The easiest way to get started is to use Docker, see Deploy real-time network dashboards using Docker compose, and deploy the sflow/ai-metrics image bundling the AI Metrics application to generate metrics.

scrape_configs:
  - job_name: 'sflow-rt-ai-metrics'
    metrics_path: /app/ai-metrics/scripts/metrics.js/prometheus/txt
    scheme: http
Continue reading
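Before pointing Prometheus at the metrics path from the scrape configuration, it can be handy to check by hand what the endpoint returns. The following is a minimal sketch, not part of the article: the base URL (localhost, port 8008) and the metric names in the sample are assumptions, and the parser only handles plain "name value" lines without timestamps.

```python
import urllib.request

# Path taken from the scrape configuration above.
METRICS_PATH = "/app/ai-metrics/scripts/metrics.js/prometheus/txt"


def parse_prometheus_text(text):
    """Parse 'name{labels} value' lines from Prometheus text output.

    Simplification: assumes no trailing timestamps; comments are skipped.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.rpartition(" ")
        samples[name] = float(value)
    return samples


def fetch_metrics(base_url="http://localhost:8008"):
    """Fetch and parse the metrics. base_url is an assumed address."""
    with urllib.request.urlopen(base_url + METRICS_PATH, timeout=5) as resp:
        return parse_prometheus_text(resp.read().decode("utf-8"))


# Hypothetical sample; real metric names may differ.
SAMPLE = """# HELP example_total An illustrative counter
example_total 12
example_bytes{type="write"} 4096.5
"""
print(parse_prometheus_text(SAMPLE))
```

Calling fetch_metrics() against a running instance should return a dictionary you can compare with what the Grafana panels display.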

Tech Bytes: Network Observability AIOps Tips For Success (Sponsored)

Today on the Tech Bytes podcast we’re talking AI readiness with sponsor Broadcom. More specifically, getting your network observability ready to support AI operations. This isn’t just a hardware or software issue. It’s also a data issue. We’ll get some tips with our guest Jeremy Rossbach. Jeremy is Chief Technical Evangelist and Lead Product Marketing... Read more »

AI for Network Engineers: Understanding Flow, Flowlet, and Packet-Based Load Balancing

Though BGP supports the traditional Flow-based Layer 3 Equal Cost Multi-Pathing (ECMP) traffic load balancing method, it is not the best fit for a RoCEv2-based AI backend network. This is because GPU-to-GPU communication creates massive elephant flows, which RDMA-capable NICs transmit at line rate. These flows can easily cause congestion in the backend network.

In ECMP, all packets of a single flow follow the same path. If that path becomes congested, ECMP does not adapt or reroute traffic. This leads to uneven bandwidth usage across the network. Some links become overloaded, while others remain idle. In AI workloads, where multiple high-bandwidth flows occur at the same time, this imbalance can degrade performance.
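The behavior described above can be sketched in a few lines of Python. This is an illustration only: real switches use hardware hash functions, SHA-256 merely stands in for them here, and the addresses and source ports are made up. The destination port 4791 is the UDP port used by RoCEv2.

```python
import hashlib


def ecmp_path(flow, num_paths):
    """Pick an uplink from the flow's 5-tuple.

    Every packet of the same flow hashes to the same path, regardless
    of how loaded that path is.
    """
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % num_paths


def path_load(flows, num_paths):
    """Count how many flows land on each path."""
    load = [0] * num_paths
    for flow in flows:
        load[ecmp_path(flow, num_paths)] += 1
    return load


# Eight hypothetical GPU-to-GPU flows: (src, dst, protocol, sport, dport).
flows = [
    (f"10.0.0.{i}", f"10.0.1.{i}", "udp", 49152 + i, 4791)
    for i in range(1, 9)
]
print(path_load(flows, num_paths=4))
```

With only a handful of elephant flows, nothing forces the hash to spread them evenly, so some paths can carry several flows while others carry none.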

Deep learning models rely heavily on collective operations like all-reduce, all-gather, and broadcast. These generate dense traffic patterns between GPUs, often at terabit-per-second speeds. If these flows are not evenly distributed, a single congested path can slow down the entire training job.

This chapter introduces two alternatives to traditional flow-based Layer 3 ECMP load balancing: 1) Flowlet-Based Load Balancing with Adaptive Routing, and 2) Packet-Based Load Balancing with Packet Spraying. Both aim to improve traffic distribution in RoCEv2-based AI backend networks, where conventional flow-based routing often Continue reading
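As a rough sketch of the two ideas, assuming a flowlet is a burst of packets separated from the next burst by an idle gap: the class and function names below are hypothetical, "least-loaded path" is a stand-in for whatever link-quality metric an adaptive routing implementation uses, and round-robin is the simplest possible form of packet spraying.

```python
class FlowletBalancer:
    """Toy flowlet-based load balancer."""

    def __init__(self, num_paths, gap):
        self.num_paths = num_paths
        self.gap = gap                 # idle time (seconds) that ends a flowlet
        self.state = {}                # flow -> (path, time of last packet)
        self.load = [0] * num_paths    # bytes sent per path

    def send(self, flow, now, size):
        """Return the path used for a packet of `size` bytes sent at `now`."""
        entry = self.state.get(flow)
        if entry is None or now - entry[1] > self.gap:
            # New flowlet: safe to move, pick the least-loaded path.
            path = min(range(self.num_paths), key=lambda p: self.load[p])
        else:
            # Same flowlet: stay on the current path to keep packets in order.
            path = entry[0]
        self.state[flow] = (path, now)
        self.load[path] += size
        return path


def spray(packet_index, num_paths):
    """Packet spraying in its simplest form: round-robin per packet."""
    return packet_index % num_paths


balancer = FlowletBalancer(num_paths=2, gap=0.001)
for t in (0.0, 0.0005, 0.01):
    print(t, balancer.send("flow-a", t, 1000))
print([spray(i, 4) for i in range(5)])
```

The trade-off is visible even in the toy: flowlets only move traffic at idle gaps, while spraying balances per packet and leaves reordering for the receiver to handle.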

🔍 Debugging Web Apps with Cursor Just Got Smarter: Evaluating Browser Assist Tools

In my previous post, I shared my experience using Vibe coding and highlighted one of the biggest challenges in that workflow: AI coding tools often lack awareness of what’s happening in the browser when you run your app. This leads to a frustrating dev loop: you’re forced to constantly copy-paste screenshots, console errors, and network … Continue reading

TSMC: The Second Most Profitable Company In The AI Revolution

The AI boom has been very, very good to Taiwan Semiconductor Manufacturing Co, which is positioned to do well if Nvidia continues with its hegemony over AI training and inference or if the rebel alliance forms behind AMD or if the hyperscalers and cloud builders dedicate a substantial portion of their capital budgets to etching and packaging homegrown compute engines.

TSMC: The Second Most Profitable Company In The AI Revolution was written by Timothy Prickett Morgan at The Next Platform.

HN777: Overlay All the Things?

Today’s Heavy Networking is all about overlay technologies, their history, development, and current state, both from engineer and vendor perspectives. We discuss why the industry turns to overlays to solve problems, and look at overlay and segmentation approaches including VXLAN, SRv6, and EVPN. We also drill into the idea that EVPN could become the standard... Read more »

NVIDIA GTC 2025 Wrap-Up: 18 New Products to Watch

If you follow the tech news, you have read a lot about NVIDIA and its graphics processing units (GPUs). However, it would be incorrect to conclude that NVIDIA is solely focused on GPUs. My biggest revelation from NVIDIA’s GTC 2025 conference last month was that NVIDIA innovates across compute, networking and storage. Most of these innovations are all about AI, but gamers should not be concerned; there is a new RTX chip for you. The new announcements and key technologies that were the spotlight of CEO GeForce RTX 5090 will be the new high-end desktop GPU for gamers and creative professionals. (Did you know that RTX stands for Ray Tracing Texel Extreme? Continue reading

From Python to Go 019. Interaction With Applications Via REST API.

Hello my friend,

So far we’ve covered all the means of interacting with network devices that are meaningful in our opinion: SSH, NETCONF/YANG, and GNMI/YANG. There is one more protocol for managing network devices, RESTCONF, which is an application of REST API to network devices. In our experience, its support across network vendors is very limited; therefore, we don’t cover it. However, REST API itself is immensely important, as it is still the most widely used protocol for applications to talk to each other. And this is the focus of today’s blog.
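For a taste of what that looks like on the Python side, here is a minimal, self-contained sketch using only the standard library. It is not code from the blog: the /api/devices endpoint and its payload are invented, and a throwaway local server stands in for a real application so the example runs without external dependencies.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Stand-in application exposing one hypothetical REST endpoint."""

    def do_GET(self):
        if self.path == "/api/devices":
            body = json.dumps([{"name": "r1", "role": "spine"}]).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def get_json(url):
    """Send a GET request and decode the JSON response body."""
    # An empty ProxyHandler ignores any proxy settings in the environment.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    req = urllib.request.Request(url, headers={"Accept": "application/json"})
    with opener.open(req, timeout=5) as resp:
        return json.loads(resp.read().decode("utf-8"))


def demo():
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0: any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        port = server.server_address[1]
        return get_json(f"http://127.0.0.1:{port}/api/devices")
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(demo())
```

The client side is just get_json(); everything else exists only to give it something to talk to.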

I See “Stop Learning to Code” Everywhere. Why Do You Teach It?

Generative AI, agentic AI, and all other kinds of AI are absolutely useful things. The advancements there are very quick, and we are using them in our projects ourselves. At the same time, if you don’t know how to code or how to solve algorithmic tasks, how can you reason about whether the solution provided by AI is correct? Whether it is optimal? And moreover, when it breaks, because every software breaks sooner or later, how can you fix it? That’s why we believe it is absolutely important to learn software development, tools and algorithms. Perhaps, more Continue reading

How To Read a Traceroute for Network Troubleshooting

The traceroute tool is one of the most valuable yet straightforward diagnostic utilities available for network troubleshooting. Built into virtually every operating system, traceroute runs a connection test from one computer to another device, showing each “hop” the data takes between network devices. This comprehensive guide will help you understand how traceroute works, interpret its results and recognize common network problems it can reveal.

Traceroute: Understanding What It Does

To see traceroute in action, we can begin with a simple example of running a traceroute from your computer to Catchpoint’s servers. The specific results will be different for each person. However, in most cases, the results will show you around four to 20 “hops” that packets take to get from your computer to Catchpoint’s servers and back. The first one would likely be your local router, and from there, the data will take multiple “hops” through your internal network and out through your internet service provider (ISP) and over the internet, before finally reaching Catchpoint’s servers. Figure 1 shows an example of what you might see on the command prompt of a Windows computer.

Figure 1: Image of a traceroute command and the results generated.

Understanding how to run this Continue reading
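The excerpt stops before the mechanics, so here is a small simulation of the usual approach: probes are sent with an increasing time-to-live (TTL), each router on the way decrements it, and the device where it reaches zero reports back. This is a sketch of the general technique, not Catchpoint's code. It sends no packets, and the addresses come from the ranges reserved for documentation.

```python
def simulate_traceroute(path, max_hops=30):
    """Simulate TTL-based hop discovery.

    path is the ordered list of devices between you and the destination,
    with the destination last. Returns (ttl, responder, reply_type) tuples.
    """
    hops = []
    for ttl, responder in enumerate(path[:max_hops], start=1):
        # A probe sent with this TTL expires at the ttl-th device.
        reached = ttl == len(path)
        reply = "destination reply" if reached else "time exceeded"
        hops.append((ttl, responder, reply))
    return hops


# Hypothetical path: local router, two ISP routers, then the server.
path = ["192.0.2.1", "198.51.100.1", "198.51.100.9", "203.0.113.10"]
for ttl, responder, reply in simulate_traceroute(path):
    print(f"{ttl:2d}  {responder:15s}  {reply}")
```

Each line of real traceroute output corresponds to one of these TTL values, which is why the first entry is usually your local router.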

Stacking Up Google’s “Ironwood” TPU Pod To Other AI Supercomputers

As part of the pre-briefings ahead of the Google Cloud Next 2025 conference last week and then during the keynote address, the top brass at Google kept comparing a pod of “Ironwood” TPU v7p systems to the “El Capitan” supercomputer at Lawrence Livermore National Laboratory.

Stacking Up Google’s “Ironwood” TPU Pod To Other AI Supercomputers was written by Timothy Prickett Morgan at The Next Platform.

N4N022: SNMP Fundamentals

Following last week’s introduction to network monitoring, we discuss the Simple Network Management Protocol (SNMP), one of the most implemented types of network monitoring. We discuss how it is organized, operations that SNMP can perform, and versions of SNMP. This week’s bonus conversation is a discussion on the future for SNMP. Episode Links: MIB tree... Read more »
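As a small illustration of how the MIB tree and the basic operations fit together, here is a toy in-memory agent. It is a sketch, not a real SNMP implementation: nothing goes over the network, the class and values are hypothetical, and only GET, GETNEXT, and a walk built on GETNEXT are modeled. The OIDs are the standard sysDescr, sysName, and ifNumber instances.

```python
import bisect


def oid_key(oid):
    """Turn '1.3.6.1' into (1, 3, 6, 1) so OIDs sort in tree order."""
    return tuple(int(part) for part in oid.split("."))


class ToyAgent:
    """In-memory stand-in for an SNMP agent."""

    def __init__(self, values):
        self.values = {oid_key(oid): val for oid, val in values.items()}
        self.sorted_oids = sorted(self.values)

    def get(self, oid):
        """GET: return the value at exactly this OID, or None."""
        return self.values.get(oid_key(oid))

    def get_next(self, oid):
        """GETNEXT: return the first (oid, value) ordered after this OID."""
        i = bisect.bisect_right(self.sorted_oids, oid_key(oid))
        if i == len(self.sorted_oids):
            return None
        nxt = self.sorted_oids[i]
        return ".".join(map(str, nxt)), self.values[nxt]

    def walk(self, root):
        """Repeat GETNEXT until the result leaves the subtree under root."""
        prefix = oid_key(root)
        results, current = [], root
        while True:
            item = self.get_next(current)
            if item is None or oid_key(item[0])[:len(prefix)] != prefix:
                return results
            results.append(item)
            current = item[0]


agent = ToyAgent({
    "1.3.6.1.2.1.1.1.0": "toy router",   # sysDescr.0
    "1.3.6.1.2.1.1.5.0": "r1",           # sysName.0
    "1.3.6.1.2.1.2.1.0": 2,              # ifNumber.0
})
print(agent.get("1.3.6.1.2.1.1.5.0"))
print(agent.walk("1.3.6.1.2.1.1"))
```

The walk shows why the tree ordering matters: a manager can discover everything under a branch without knowing the leaf OIDs in advance.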

Rant: You Should Have Written a Book

I apologize for the rant; I have to vent my frustration with people whose quantity of opinions seems to be exceeding their experience (or maybe they’re coming from an alternate universe with different laws of physics, which would be way cool but also unlikely). You’ve been warned; please feel free to move on or skip the rant part of the blog post.

Rant mode: ON

This is the (unedited) gem I received after making some of my EVPN videos public:

Nvidia Sacrifices Profits To Preserve Revenues In The US

Making a graphics card for gamers is one thing, but manufacturing a rackscale supercomputer with over 600,000 components that burns 120 kilowatts of power, that has over 5,000 copper cables for an all-to-all interconnect mesh for 72 dual-chip compute engines, and that weighs over 3,000 pounds is another thing entirely.

Nvidia Sacrifices Profits To Preserve Revenues In The US was written by Timothy Prickett Morgan at The Next Platform.