Archive

Category Archives for "Networking"

NAN090: Examining Network Automation’s Present and Future With Chris Grundemann

Today we explore the current state of network automation with Chris Grundemann, a co-founder of Network Automation Forum. Chris gives the history of the Network Automation Forum and AutoCon conference, which aims to blend technology and community. He shares ideas for advancing network automation, and invites other network engineers to participate in shaping the future... Read more »

D2DO270: Spacelift Is Your Infrastructure Glue (Sponsored)

Working with multiple Infrastructure as Code (IaC) tools can be painful. Spacelift provides a platform that operates on top of disparate IaC tools, including Ansible, Kubernetes, Pulumi, Terraform, and OpenTofu. Spacelift helps build the automation with and between these tools, creating graphs of graphs that make your dependencies just work. On today’s show, we talk... Read more »

PP059: News Roundup – Oracle Plays Breach Word Games, Fast Flux Worries CISA, AI Package Hallucinations, and More

Once a month, the Packet Protector podcast likes to see what’s going on out there via our news roundup. There’s a lot happening! Today we discuss Fortinet warning that a threat actor has found a way to maintain read-only access on Fortinet devices even if you’ve applied the patch for the original threat. Ivanti VPNs... Read more »

New year, no shutdowns: the Q1 2025 Internet disruption summary

Cloudflare’s network spans more than 330 cities in over 125 countries, where we interconnect with over 13,000 network providers in order to provide a broad range of services to millions of customers. The breadth of both our network and our customer base provides us with a unique perspective on Internet resilience, enabling us to observe the impact of Internet disruptions at both a local and national level, as well as at a network level.

As we have noted in the past, this post is intended as a summary overview of observed and confirmed disruptions, and is not an exhaustive or complete list of issues that have occurred during the quarter. A larger list of detected traffic anomalies is available in the Cloudflare Radar Outage Center. Note that both bytes-based and request-based traffic graphs are used within the post to illustrate the impact of the observed disruptions — the choice of metric was generally made based on which better illustrated the impact of the disruption.

In the first quarter of 2025, we observed a significant number of Internet disruptions due to cable damage and power outages. Severe storms caused outages in Ireland and Réunion, and an earthquake caused ongoing connectivity issues Continue reading

How ARP Killed a Static Route

The amount of weird stuff we discover in netlab integration tests is astounding, or maybe I have a knack for looking into the wrong dark corners (my wife would definitely agree with that). Today’s special: when having two next hops kills a static route.

TL&DR: default ARP settings on a multi-subnet Linux host are less than optimal.

We use these principles when creating netlab integration tests:

  • They should contain a single device-under-test and a bunch of attached probes.
  • They should test a single feature.
  • They should not rely on the device-under-test. All validation has to be done on probes.

How do you test static routes under these restrictions? Here’s what we did:

Ultra Valuable uv for Dynamic, On-Demand Python Virtual Environments

What is UV?

Astral's uv is a fast, all-in-one Python package and project manager written in Rust that unifies and accelerates Python development workflows by replacing multiple tools and actions, including:

  • pip, pip-tools, poetry, pipx, pyenv, virtualenv, and twine
  • initializing a git repository
  • creating base files like .gitignore and pyproject.toml (think of this as requirements.txt READ MORE

The post Ultra Valuable uv for Dynamic, On-Demand Python Virtual Environments appeared first on The Gratuitous Arp.

AI Metrics with Prometheus and Grafana

The Grafana AI Metrics dashboard shown above tracks performance metrics for AI/ML RoCEv2 network traffic, for example, large scale CUDA compute tasks using NVIDIA Collective Communication Library (NCCL) operations for inter-GPU communications: AllReduce, Broadcast, Reduce, AllGather, and ReduceScatter.

The metrics include:

  • Total Traffic: Total traffic entering fabric
  • Operations: Total RoCEv2 operations broken out by type
  • Core Link Traffic: Histogram of load on fabric links
  • Edge Link Traffic: Histogram of load on access ports
  • RDMA Operations: Total RDMA operations
  • RDMA Bytes: Average RDMA operation size
  • Credits: Average number of credits in RoCEv2 acknowledgements
  • Period: Detected period of compute / exchange activity on fabric (in this case just over 0.5 seconds)
  • Congestion: Total ECN / CNP congestion messages
  • Errors: Total ingress / egress errors
  • Discards: Total ingress / egress discards
  • Drop Reasons: Packet drop reasons

This article gives step-by-step instructions to set up the dashboard in a production environment.

git clone https://github.com/sflow-rt/prometheus-grafana.git
cd prometheus-grafana
env RT_IMAGE=ai-metrics ./start.sh

The easiest way to get started is to use Docker, see Deploy real-time network dashboards using Docker compose, and deploy the sflow/ai-metrics image bundling the AI Metrics application to generate metrics.

scrape_configs:
  - job_name: 'sflow-rt-ai-metrics'
    metrics_path: /app/ai-metrics/scripts/metrics.js/prometheus/txt
    scheme: http
     Continue reading

Tech Bytes: Network Observability AIOps Tips For Success (Sponsored)

Today on the Tech Bytes podcast we’re talking AI readiness with sponsor Broadcom. More specifically, getting your network observability ready to support AI operations. This isn’t just a hardware or software issue. It’s also a data issue. We’ll get some tips with our guest Jeremy Rossbach. Jeremy is Chief Technical Evangelist and Lead Product Marketing... Read more »

AI for Network Engineers: Understanding Flow, Flowlet, and Packet-Based Load Balancing

Though BGP supports the traditional Flow-based Layer 3 Equal Cost Multi-Pathing (ECMP) traffic load balancing method, it is not the best fit for a RoCEv2-based AI backend network. This is because GPU-to-GPU communication creates massive elephant flows, which RDMA-capable NICs transmit at line rate. These flows can easily cause congestion in the backend network.

In ECMP, all packets of a single flow follow the same path. If that path becomes congested, ECMP does not adapt or reroute traffic. This leads to uneven bandwidth usage across the network. Some links become overloaded, while others remain idle. In AI workloads, where multiple high-bandwidth flows occur at the same time, this imbalance can degrade performance.
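The pinning behavior described above can be illustrated with a minimal sketch. It assumes a switch that hashes the 5-tuple and takes the result modulo the number of equal-cost links; the hash function, the addresses, and the per-flow rates are all hypothetical and do not represent any vendor's algorithm. UDP port 4791 is the registered RoCEv2 destination port.

```python
import hashlib

def ecmp_pick(five_tuple, num_links):
    """Pick an uplink by hashing the flow's 5-tuple.

    Illustrative hash only (SHA-256 truncated to 32 bits); real switches
    use their own hardware hash functions.
    """
    key = "|".join(str(field) for field in five_tuple).encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % num_links

# Hypothetical elephant flows:
# (src_ip, dst_ip, protocol, src_port, dst_port, rate_gbps)
FLOWS = [
    ("10.0.0.1", "10.0.1.1", 17, 49152, 4791, 400),
    ("10.0.0.2", "10.0.1.2", 17, 49153, 4791, 400),
    ("10.0.0.3", "10.0.1.3", 17, 49154, 4791, 400),
    ("10.0.0.4", "10.0.1.4", 17, 49155, 4791, 400),
]

def link_load(flows, num_links):
    """Sum the offered rate per link; every packet of a flow lands on one link."""
    load = {link: 0 for link in range(num_links)}
    for *five_tuple, gbps in flows:
        load[ecmp_pick(five_tuple, num_links)] += gbps
    return load

if __name__ == "__main__":
    # With only a handful of large flows, nothing guarantees an even spread:
    # two flows can hash to the same link while another link stays idle.
    print(link_load(FLOWS, num_links=4))
```

Because the link choice depends only on the header fields, the mapping never changes while the flow lives, regardless of how loaded the chosen link becomes.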

Deep learning models rely heavily on collective operations like all-reduce, all-gather, and broadcast. These generate dense traffic patterns between GPUs, often at terabit-per-second speeds. If these flows are not evenly distributed, a single congested path can slow down the entire training job.

This chapter introduces two alternatives to traditional flow-based Layer 3 ECMP load balancing: 1) Flowlet-Based Load Balancing with Adaptive Routing, and 2) Packet-Based Load Balancing with Packet Spraying. Both aim to improve traffic distribution in RoCEv2-based AI backend networks, where conventional flow-based routing often Continue reading
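Since the excerpt is cut off before the chapter defines the two methods, the following is only a rough sketch based on the common understanding of the terms: a flowlet is a burst of packets from one flow separated from the next burst by an idle gap, and packet spraying distributes individual packets across links regardless of flow. The class name, the `gap` parameter, and the least-loaded link choice are hypothetical simplifications.

```python
from dataclasses import dataclass, field

@dataclass
class FlowletBalancer:
    """Toy flowlet switching: re-pick the link only after an idle gap."""
    num_links: int
    gap: float  # idle time in seconds that ends a flowlet (hypothetical value)
    link_bytes: list = field(init=False)
    state: dict = field(init=False)  # flow id -> (last_seen, link)

    def __post_init__(self):
        self.link_bytes = [0] * self.num_links
        self.state = {}

    def forward(self, flow, now, size):
        last = self.state.get(flow)
        if last is None or now - last[0] > self.gap:
            # New flowlet: choose the currently least-loaded link.
            link = min(range(self.num_links), key=self.link_bytes.__getitem__)
        else:
            # Same flowlet: stay on the current link to preserve packet order.
            link = last[1]
        self.state[flow] = (now, link)
        self.link_bytes[link] += size
        return link

def spray(num_packets, num_links):
    """Toy packet spraying: round-robin every packet, ignoring flow identity."""
    return [i % num_links for i in range(num_packets)]

if __name__ == "__main__":
    lb = FlowletBalancer(num_links=2, gap=0.001)
    print(lb.forward("flow-a", 0.0000, 4096))  # first packet starts a flowlet
    print(lb.forward("flow-a", 0.0005, 4096))  # within the gap: same link
    print(lb.forward("flow-a", 0.0100, 4096))  # after the gap: may move
    print(spray(4, 2))
```

The trade-off the sketch hints at is that spraying balances load most evenly but can deliver packets out of order, while flowlet switching only moves traffic at idle gaps, where reordering is less likely.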

🔍 Debugging Web Apps with Cursor Just Got Smarter: Evaluating Browser Assist Tools

In my previous post, I shared my experience using Vibe coding and highlighted one of the biggest challenges in that workflow: AI coding tools often lack awareness of what’s happening in the browser when you run your app. This leads to a frustrating dev loop: you’re forced to constantly copy-paste screenshots, console errors, and network … Continue reading

HN777: Overlay All the Things?

Today’s Heavy Networking is all about overlay technologies, their history, development, and current state, both from engineer and vendor perspectives. We discuss why the industry turns to overlays to solve problems, and look at overlay and segmentation approaches including VXLAN, SRv6, and EVPN. We also drill into the idea that EVPN could become the standard... Read more »

NVIDIA GTC 2025 Wrap-Up: 18 New Products to Watch

If you follow the tech news, you have read a lot about NVIDIA and its graphics processing units (GPUs). However, it would be incorrect to conclude that NVIDIA is solely focused on GPUs. My biggest revelation from NVIDIA’s GTC 2025 conference last month was that NVIDIA innovates across compute, networking and storage. Most of these innovations are all about AI, but gamers should not be concerned; there is a new RTX chip for you. The new announcements and key technologies that were the spotlight of CEO GeForce RTX 5090 will be the new high-end desktop GPU for gamers and creative professionals. (Did you know that RTX stands for Ray Tracing Texel Extreme? Continue reading

From Python to Go 019. Interaction With Applications Via REST API.

Hello my friend,

So far we’ve covered all the means of interacting with network devices that are meaningful in our opinion: SSH, NETCONF/YANG, and gNMI/YANG. There is one more protocol for managing network devices, RESTCONF, which is an application of REST API to network devices. In our experience, its support across network vendors is very limited; therefore, we don’t cover it. However, REST API itself is immensely important, as it is still the most widely used protocol for applications to talk to each other. And this is the focus of today’s blog.

I See Everywhere Stop Learning Code, Why Do You Teach It?

Generative AI, agentic AI, and all other kinds of AI are absolutely useful things. The advancements there are very quick, and we ourselves are using them in our projects. At the same time, if you don’t know how to code or how to solve algorithmic tasks, how can you reason about whether the solution provided by AI is correct? Whether it is optimal? And moreover, when it breaks, because every software breaks sooner or later, how can you fix it? That’s why we believe it is absolutely important to learn software development, tools and algorithms. Perhaps, more Continue reading

How To Read a Traceroute for Network Troubleshooting

The traceroute tool is one of the most valuable yet straightforward diagnostic utilities available for network troubleshooting. Built into virtually every operating system, traceroute runs a connection test from one computer to another device, showing each “hop” the data takes between network devices. This comprehensive guide will help you understand how traceroute works, interpret its results and recognize common network problems it can reveal.

Traceroute: Understanding What It Does

To see traceroute in action, we can begin with a simple example of running a traceroute from your computer to Catchpoint’s servers. The specific results will be different for each person. However, in most cases, the results will show you around four to 20 “hops” that packets take to get from your computer to Catchpoint’s servers and back. The first one would likely be your local router, and from there, the data will take multiple “hops” through your internal network and out through your internet service provider (ISP) and over the internet, before finally reaching Catchpoint’s servers. Figure 1 shows an example of what you might see on the command prompt of a Windows computer.

Figure 1: Image of a traceroute command and the results generated.

Understanding how to run this Continue reading
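To make the hop-by-hop idea concrete, here is a small simulation of the TTL mechanism that traceroute implementations generally rely on: each probe is sent with a TTL one larger than the last, every router on the way decrements it, and the router where it reaches zero reports back. This is a sketch only; it sends no real packets, and the node names in the example path are hypothetical.

```python
def probe(path, ttl):
    """Simulate one probe along `path` (routers first, destination last).

    Returns (responder, reached_destination).
    """
    for index, node in enumerate(path):
        if index == len(path) - 1:
            return node, True   # the destination itself answers the probe
        ttl -= 1                # each router decrements TTL before forwarding
        if ttl == 0:
            return node, False  # TTL expired here: router reports back
    raise ValueError("path must contain at least one node")

def traceroute(path, max_hops=30):
    """Send probes with TTL 1, 2, 3, ... until the destination replies."""
    hops = []
    for ttl in range(1, max_hops + 1):
        responder, done = probe(path, ttl)
        hops.append((ttl, responder))
        if done:
            break
    return hops

if __name__ == "__main__":
    # Hypothetical path: local router, ISP gateway, then the target server.
    for hop, name in traceroute(["home-router", "isp-gateway", "target-server"]):
        print(hop, name)
```

A real traceroute additionally measures the round-trip time of each probe and typically sends several probes per TTL value, which is where the multiple timing columns in the output come from.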

N4N022: SNMP Fundamentals

Following last week’s introduction to network monitoring, we discuss the Simple Network Management Protocol (SNMP), one of the most implemented types of network monitoring. We discuss how it is organized, operations that SNMP can perform, and versions of SNMP. This week’s bonus conversation is a discussion on the future for SNMP. Episode Links: MIB tree... Read more »