Generating an optimal segment list for SR-TE

Segment Routing simplifies MPLS for the network operator – but not for the developer.

How Traffic Engineering works

Consider the topology:

I want to steer traffic from R1 to R7 using only blue links. R1 (or controller) runs Constrained Shortest …
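
The constrained shortest path step can be sketched as "prune, then run Dijkstra". This is a minimal Python sketch, not the post's implementation: the topology, the link costs, and the `color` attribute below are all hypothetical (the original diagram is not reproduced here), and `cspf` is an illustrative helper name.

```python
import heapq

# Hypothetical topology: (node_a, node_b, igp_cost, color).
# This is NOT the diagram from the post, just an illustration.
LINKS = [
    ("R1", "R2", 10, "blue"),
    ("R2", "R3", 10, "red"),
    ("R3", "R7", 10, "blue"),
    ("R1", "R4", 10, "blue"),
    ("R4", "R5", 10, "blue"),
    ("R5", "R6", 10, "blue"),
    ("R6", "R7", 10, "blue"),
]


def cspf(links, src, dst, color):
    """Return (cost, path) of the shortest path that uses only links of
    the given color, or None if no such path exists."""
    # Step 1: prune every link that violates the constraint.
    graph = {}
    for a, b, cost, link_color in links:
        if link_color != color:
            continue
        graph.setdefault(a, []).append((b, cost))
        graph.setdefault(b, []).append((a, cost))

    # Step 2: plain Dijkstra on what is left.
    queue = [(0, src, [src])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == dst:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, link_cost in graph.get(node, []):
            if neighbor not in visited:
                heapq.heappush(queue, (cost + link_cost, neighbor, path + [neighbor]))
    return None


result = cspf(LINKS, "R1", "R7", "blue")
```

In this made-up topology the unconstrained shortest path would cross the red R2-R3 link, so the blue-only result is the longer path through R4, R5, and R6. Turning that hop list into a segment list is a separate step that this sketch does not attempt.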

How ARP Killed a Static Route

The amount of weird stuff we discover in netlab integration tests is astounding, or maybe I have a knack for looking into the wrong dark corners (my wife would definitely agree with that). Today’s special: when having two next hops kills a static route.

TL&DR: default ARP settings on a multi-subnet Linux host are less than optimal.

We use these principles when creating netlab integration tests:

They should contain a single device-under-test and a bunch of attached probes.
They should test a single feature.
They should not rely on the device-under-test. All validation has to be done on probes.

How do you test static routes under these restrictions? Here’s what we did:

Ultra Valuable uv for Dynamic, On-Demand Python Virtual Environments

What is UV? Astral's uv is a fast, all-in-one Python package and project manager written in Rust that unifies and accelerates Python development workflows by replacing multiple tools and actions including:

pip, pip-tools, poetry, pipx, pyenv, virtualenv, and twine
initializing a git repository
creating base files like .gitignore and pyproject.toml (think of this as requirements.txt READ MORE

The post Ultra Valuable uv for Dynamic, On-Demand Python Virtual Environments appeared first on The Gratuitous Arp.

AI Metrics with Prometheus and Grafana

The Grafana AI Metrics dashboard shown above tracks performance metrics for AI/ML RoCEv2 network traffic, for example, large scale CUDA compute tasks using NVIDIA Collective Communication Library (NCCL) operations for inter-GPU communications: AllReduce, Broadcast, Reduce, AllGather, and ReduceScatter.

The metrics include:

Total Traffic: Total traffic entering fabric
Operations: Total RoCEv2 operations broken out by type
Core Link Traffic: Histogram of load on fabric links
Edge Link Traffic: Histogram of load on access ports
RDMA Operations: Total RDMA operations
RDMA Bytes: Average RDMA operation size
Credits: Average number of credits in RoCEv2 acknowledgements
Period: Detected period of compute / exchange activity on fabric (in this case just over 0.5 seconds)
Congestion: Total ECN / CNP congestion messages
Errors: Total ingress / egress errors
Discards: Total ingress / egress discards
Drop Reasons: Packet drop reasons

This article gives step-by-step instructions to set up the dashboard in a production environment.

git clone https://github.com/sflow-rt/prometheus-grafana.git
cd prometheus-grafana
env RT_IMAGE=ai-metrics ./start.sh

The easiest way to get started is to use Docker, see Deploy real-time network dashboards using Docker compose, and deploy the sflow/ai-metrics image bundling the AI Metrics application to generate metrics.

scrape_configs:
  - job_name: 'sflow-rt-ai-metrics'
    metrics_path: /app/ai-metrics/scripts/metrics.js/prometheus/txt
    scheme: http

Continue reading

NB523: CVE’s Reprieve, Google WAN For Sale, NVIDIA Pledges More US-Made Chips

Take a Network Break! We start with a red alert for the CVE program, which nearly lost government funding before CISA stepped in, and then raise an alert for a Dpanel vulnerability. Google makes its Cloud WAN available to enterprises and governments that want connectivity options, a US District Court rules Google’s ad business a... Read more »

Tech Bytes: Network Observability AIOps Tips For Success (Sponsored)

Today on the Tech Bytes podcast we’re talking AI readiness with sponsor Broadcom. More specifically, getting your network observability ready to support AI operations. This isn’t just a hardware or software issue. It’s also a data issue. We’ll get some tips with our guest Jeremy Rossbach. Jeremy is Chief Technical Evangelist and Lead Product Marketing... Read more »

AI for Network Engineers: Understanding Flow, Flowlet, and Packet-Based Load Balancing

Though BGP supports the traditional Flow-based Layer 3 Equal Cost Multi-Pathing (ECMP) traffic load balancing method, it is not the best fit for a RoCEv2-based AI backend network. This is because GPU-to-GPU communication creates massive elephant flows, which RDMA-capable NICs transmit at line rate. These flows can easily cause congestion in the backend network.

In ECMP, all packets of a single flow follow the same path. If that path becomes congested, ECMP does not adapt or reroute traffic. This leads to uneven bandwidth usage across the network. Some links become overloaded, while others remain idle. In AI workloads, where multiple high-bandwidth flows occur at the same time, this imbalance can degrade performance.
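
The "same flow, same path" behavior follows from how flow-based ECMP typically selects a next hop: it hashes the packet's 5-tuple. The Python sketch below illustrates that idea only. The SHA-256 hash, the uplink names, and the addresses are assumptions for illustration, not any vendor's algorithm; the one grounded detail is that RoCEv2 uses UDP destination port 4791.

```python
import hashlib


def ecmp_next_hop(flow, next_hops):
    """Pick a next hop from a hash of the flow's 5-tuple.

    Illustrative only: real switches use hardware hash functions,
    not SHA-256, but the per-flow determinism is the same.
    """
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:4], "big") % len(next_hops)
    return next_hops[index]


# Hypothetical uplinks and flow: (src IP, dst IP, protocol, src port, dst port).
UPLINKS = ["spine1", "spine2", "spine3", "spine4"]
flow = ("10.0.0.1", "10.0.1.1", "UDP", 49152, 4791)

# Every packet of the flow hashes to the same uplink, however loaded it is.
choices = {ecmp_next_hop(flow, UPLINKS) for _ in range(1000)}
```

Because the choice depends only on header fields, nothing in this selection reacts to congestion: a line-rate elephant flow stays pinned to one uplink while the others may sit idle.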

Deep learning models rely heavily on collective operations like all-reduce, all-gather, and broadcast. These generate dense traffic patterns between GPUs, often at terabit-per-second speeds. If these flows are not evenly distributed, a single congested path can slow down the entire training job.

This chapter introduces two alternatives to traditional flow-based load balancing with Layer 3 ECMP: 1) Flowlet-Based Load Balancing with Adaptive Routing, and 2) Packet-Based Load Balancing with Packet Spraying. Both aim to improve traffic distribution in RoCEv2-based AI backend networks, where conventional flow-based routing often Continue reading

🔍 Debugging Web Apps with Cursor Just Got Smarter: Evaluating Browser Assist Tools

In my previous post, I shared my experience using Vibe coding and highlighted one of the biggest challenges in that workflow: AI coding tools often lack awareness of what’s happening in the browser when you run your app. This leads to a frustrating dev loop: you’re forced to constantly copy-paste screenshots, console errors, and network … Continue reading →

TNO025: Know the Packet Flow

Knowing the flow of how a packet gets from one end to the other is one of the most important pieces of advice given by today’s guest, Mark Prosser. He shares his journey in the IT industry, his thoughts on Ops vs. design, and the differences between working in a telco NOC and the enterprise... Read more »

IPB173: The IPv6 Test Pod Project

What if there was a simple way to test software and services to see if they work in a variety of IPv6 environments, including v6-only, dual-stack, and others? That’s the goal of the IPv6 Test Pod Project. This project provides a device, at no cost to the applicant, that comes with a variety of v6... Read more »

HN777: Overlay All the Things?

Today’s Heavy Networking is all about overlay technologies, their history, development, and current state, both from engineer and vendor perspectives. We discuss why the industry turns to overlays to solve problems, and look at overlay and segmentation approaches including VXLAN, SRv6, and EVPN. We also drill into the idea that EVPN could become the standard... Read more »

NVIDIA GTC 2025 Wrap-Up: 18 New Products to Watch

If you follow the tech news, you have read a lot about NVIDIA and its graphics processing units (GPUs). However, it would be incorrect to conclude that NVIDIA is solely focused on GPUs. My biggest revelation from NVIDIA’s GTC 2025 conference last month was that NVIDIA innovates across compute, networking and storage. Most of these innovations are all about AI, but gamers should not be concerned; there is a new RTX chip for you. The new announcements and key technologies that were the spotlight of CEO …

GeForce RTX 5090 will be the new high-end desktop GPU for gamers and creative professionals. (Did you know that RTX stands for Ray Tracing Texel Extreme? Continue reading

From Python to Go 019. Interaction With Applications Via REST API.

Hello my friend,

So far we’ve covered all the ways of interacting with network devices that we consider meaningful: SSH, NETCONF/YANG, and gNMI/YANG. There is one more protocol for managing network devices, RESTCONF, which applies REST API to network devices. In our experience, its support across network vendors is very limited; therefore, we don’t cover it. However, REST API itself is immensely important, as it is still the most widely used protocol for applications to talk to each other. And this is the focus of today’s blog.
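
To make "applications talking to each other via REST API" concrete, here is a minimal Python sketch using only the standard library. It is not the code from the original post: the helper names `build_request` and `call_api`, and the `inventory.example.com` endpoint, are hypothetical and chosen for illustration.

```python
import json
import urllib.request


def build_request(base_url, path, payload=None):
    """Build a JSON REST request: GET without a payload, POST with one."""
    url = base_url.rstrip("/") + "/" + path.lstrip("/")
    headers = {"Accept": "application/json"}
    data = None
    if payload is not None:
        data = json.dumps(payload).encode("utf-8")
        headers["Content-Type"] = "application/json"
    method = "GET" if payload is None else "POST"
    return urllib.request.Request(url, data=data, headers=headers, method=method)


def call_api(request, timeout=5.0):
    """Send the request and return (HTTP status, decoded JSON body)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, json.loads(response.read().decode("utf-8"))


# Hypothetical endpoint; nothing is sent until call_api(request) is invoked.
request = build_request("https://inventory.example.com/api", "/devices", {"name": "r1"})
```

Keeping request construction separate from sending makes the first half easy to check without a live server; a real client would also need error handling for non-2xx responses and timeouts.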

I See Everywhere Stop Learning Code, Why Do You Teach It?

Generative AI, agentic AI, and all other kinds of AI are absolutely useful things. The advancements there are very quick, and we ourselves are using them in our projects. At the same time, if you don’t know how to code or how to solve algorithmic tasks, how can you reason whether the solution provided by AI is correct? Whether it is optimal? And moreover, when it breaks, because every software breaks sooner or later, how can you fix it? That’s why we believe it is absolutely important to learn software development, tools and algorithms. Perhaps, more Continue reading

How To Read a Traceroute for Network Troubleshooting

The traceroute tool is one of the most valuable yet straightforward diagnostic utilities available for network troubleshooting. Built into virtually every operating system, traceroute runs a connection test from one computer to another device, showing each “hop” the data takes between network devices. This comprehensive guide will help you understand how traceroute works, interpret its results and recognize common network problems it can reveal.

Traceroute: Understanding What It Does

To see traceroute in action, we can begin with a simple example of running a traceroute from your computer to Catchpoint’s servers. The specific results will be different for each person. However, in most cases, the results will show you around four to 20 “hops” that packets take to get from your computer to Catchpoint’s servers and back. The first one would likely be your local router, and from there, the data will take multiple “hops” through your internal network and out through your internet service provider (ISP) and over the internet, before finally reaching Catchpoint’s servers. Figure 1 shows an example of what you might see on the command prompt of a Windows computer.

Figure 1: Image of a traceroute command and the results generated.

Understanding how to run this Continue reading
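
Reading hop-by-hop output like the Windows example described above is easier once each line is split into its parts. The Python sketch below assumes the usual Windows `tracert` line layout (hop number, three round-trip times, responder). The sample output is made up using documentation address ranges and is not the content of Figure 1; `parse_tracert` is an illustrative helper name.

```python
# Made-up sample in the Windows tracert layout (not the post's Figure 1).
SAMPLE = """\
Tracing route to example.com [198.51.100.7]
over a maximum of 30 hops:

  1    <1 ms    <1 ms    <1 ms  192.168.1.1
  2     9 ms     8 ms    10 ms  203.0.113.1
  3     *        *        *     Request timed out.
  4    21 ms    20 ms    22 ms  198.51.100.7
"""


def parse_tracert(output):
    """Turn tracert hop lines into dicts with hop number, RTTs, and responder."""
    hops = []
    for line in output.splitlines():
        tokens = line.split()
        if not tokens or not tokens[0].isdigit():
            continue  # skip banner and blank lines
        rtts, i = [], 1
        while len(rtts) < 3 and i < len(tokens):
            if tokens[i] == "*":
                rtts.append(None)  # probe got no reply
                i += 1
            else:
                # "<1 ms" is recorded as 1, i.e. an upper bound
                rtts.append(int(tokens[i].lstrip("<")))
                i += 2  # skip the "ms" unit
        hops.append({
            "hop": int(tokens[0]),
            "rtt_ms": rtts,
            "responder": " ".join(tokens[i:]),
        })
    return hops


hops = parse_tracert(SAMPLE)
```

In the parsed result, a hop whose three probes are all `None` is one that did not answer. That alone does not prove a fault, since later hops in this sample still respond.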

N4N022: SNMP Fundamentals

Following last week’s introduction to network monitoring, we discuss the Simple Network Management Protocol (SNMP), one of the most implemented types of network monitoring. We discuss how it is organized, operations that SNMP can perform, and versions of SNMP. This week’s bonus conversation is a discussion on the future for SNMP. Episode Links: MIB tree... Read more »

Rant: You Should Have Written a Book

I apologize for the rant; I have to vent my frustration with people whose quantity of opinions seems to be exceeding their experience (or maybe they’re coming from an alternate universe with different laws of physics, which would be way cool but also unlikely). You’ve been warned; please feel free to move on or skip the rant part of the blog post.

Rant mode: ON

This is the (unedited) gem I received after making some of my EVPN videos public:

D2DO269: Intergalactic Dataspeak and Making Data Actionable

Data and what you do with it is the topic of today’s show with guest Dominic Chapman, the Head of Product at Axiom. Dominic’s career focus is on data, how to store and how to operationalize all the data that different technology stacks create. We talk with Dominic about how the data does not lie,... Read more »

netlab: Execute a Command on Multiple Devices

When I was updating the Network Migration with BGP Local-AS Feature blog post, I wanted to execute the same command (show ip bgp) on all routers in my network.

Not a problem: since Dan Partelly added the netlab exec command, it’s as simple as netlab exec * show ip bgp. Well, not exactly; there are still a few quirks.

Why I joined Cloudflare: to build world-class partnerships in EMEA

Cloudflare is not just another technology company. It’s a mission-driven force, committed to helping build a better Internet; one that is faster, safer, and more resilient. That mission is more critical than ever as organizations worldwide navigate an increasingly complex digital landscape, rife with cyber threats, regulatory challenges, and the need for scalable, cost-effective solutions.

In EMEA, that mission has special significance. The region is a patchwork of diverse markets, industries, and regulatory environments. It demands a partner-centric approach, one that empowers businesses of all sizes to harness Cloudflare’s comprehensive connectivity cloud platform to protect, connect, and accelerate their operations. That’s why I joined Cloudflare as VP of EMEA Partnerships.

A moment of inflection

Every great company has an inflection point, a moment when the market, the strategy, and the execution align to create unstoppable momentum. Cloudflare is at that moment now.

With record revenue growth, increasing traction among large customers, and an expanding suite of Zero Trust, AI, and network security solutions, Cloudflare is emerging as the partner of choice for enterprises and service providers across EMEA.

But what excites me most is the people, the opportunity to build a team in EMEA that is world-class in its expertise, Continue reading

Internet Governance – The End of Multi-Stakeholderism?

The recent erratic moves by the US President to initiate a trade war on a global scale will have far-reaching implications beyond stock markets and will inevitably include the digital world and what we refer to as Internet Governance. The US moves on the unilateral imposition of tariffs can be interpreted as a vote of no confidence in global trade and open markets by the US, and a resurgence of a theme of strategic national self-reliance in all areas of economic activity, including the digital realm. The tenets of Multi-Stakeholderism, the foundation of Internet Governance, are crumbling.
