State of Network Automation with Urs Baumann

I stopped tracking the (lack of) progress in network automation years ago, when I realized I had nothing new to say. As an eternal optimist, I hoped I was just missing something, but Urs Baumann (the guest of Software Gone Wild Episode 206) destroyed my hopes when he said, “I can still use the same slides I created 10 years ago”. On a more positive note, he recently completed his Master’s thesis on AI in network engineering, so we ended with a nice chat on its potential impact.

Worth Reading: AI and Knowledge Stagnation

Another week, another interesting AI article (is anyone writing about anything else these days?), this time from Noah Smith (another author worth following). I found this gem hidden in his weekly roundup:

Instead of trying to write a piece of code from scratch, or prove a math theorem from scratch, or figure out some piece of knowledge for yourself, you just ask AI to do it all for you. So everyone ends up getting the right answers to questions whose answers are already known, so they don’t end up adding anything new.

Making Rust Workers reliable: panic and abort recovery in wasm‑bindgen

Rust Workers run on the Cloudflare Workers platform by compiling Rust to WebAssembly, but as we’ve found, WebAssembly has some sharp edges. When things go wrong with a panic or an unexpected abort, the runtime can be left in an undefined state. For users of Rust Workers, panics were historically fatal, poisoning the instance and possibly even bricking the Worker for a period of time.

While we were able to detect and mitigate these issues, there remained a small chance that a Rust Worker would unexpectedly fail and cause other requests to fail along with it. An unhandled Rust abort in a Worker affecting one request might escalate into a broader failure affecting sibling requests or even continue to affect new incoming requests. The root cause of this was in wasm-bindgen, the core project that generates the Rust-to-JavaScript bindings Rust Workers depend on, and its lack of built-in recovery semantics.

In this post, we’ll share how the latest version of Rust Workers handles comprehensive Wasm error recovery that solves this abort-induced sandbox poisoning. This work has been contributed back into wasm-bindgen as part of our collaboration within the wasm-bindgen organization formed last year. First with panic=unwind support, which ensures that Continue reading

Ten Years of ITNOG

I spent the last two days in Bologna at ITNOG 10 in the excellent company of Italian networking engineers (many of them personal friends) and a few guests from around the world. As always, the organizers and the program committee didn’t disappoint – it was a smoothly organized, lovely event full of interesting presentations. Thanks a million to everyone involved; I’ll definitely be back!

Now for the highlights, starting with the ultimate catnip for the differently attentive: running two presentations in parallel on the same screen with the soundtrack distributed via headphones. I’ve never seen anything like that, and while it looked weird (I have no idea how the presenters took it), it turned out to be very useful, as you could easily tune out AI-washing presentations and switch to something more interesting. On the other hand, you could be faced with a hard choice of having to select one of two excellent presentations:

KubeVirt Networking: How to Preserve VM IP Addresses During Migration

Organisations are re-evaluating their VM infrastructure. The economics have shifted, the tooling has matured, and the case for running two separate platforms, one for containers, one for VMs, is getting harder to justify. Platform teams that spent years managing hypervisor infrastructure are being asked to consolidate, and most are landing on the same answer: Kubernetes.

KubeVirt makes running VMs on Kubernetes possible. But KubeVirt networking – what happens to a VM’s IP address, VLAN, and security posture when it lands in a cluster – is where most migration plans hit a wall. The reasons go beyond cost:

  • Most enterprises already run Kubernetes. Containers are already there. Adding VMs to the same platform consolidates tooling, lifecycle management, networking models, and security policy into a single operational model.
  • Two platforms means double the overhead. Separate infrastructure means separate upgrade cycles, separate monitoring, separate network configuration, and separate on-call runbooks. Platform consolidation has direct operational value.
  • Kubernetes is mature enough. KubeVirt has reached the point where it’s a viable production choice for enterprise VM workloads.

The decision to migrate is being made. The question is how to do it without causing chaos.

Introducing KubeVirt

KubeVirt extends the Kubernetes API with new custom resource Continue reading

Moving past bots vs. humans

For us humans to interact with the online world, we need a gateway: keyboard, screen, browser, device. What is called "human detection" online are patterns that humans use when interacting with such devices. These patterns have changed in recent years: a startup CEO now uses their browser to summarize the news, a tech enthusiast automates the process to book their concert tickets when sales open at night, someone who's visually impaired enables accessibility on their screen reader, and companies route their employee traffic through zero trust proxies.

At the same time, website owners are still looking to protect their data, manage their resources, control content distribution, and prevent abuse. These problems aren’t solved by knowing whether the client is a human or a bot: There are wanted bots and there are unwanted humans. These problems require knowing intent and behavior. The ability to detect automation remains critical. However, as the distinctions between actors become blurry, the systems we build now should accommodate a future where "bots vs. humans" is not the important data point.

What actually matters is not humanity in the abstract, but questions such as: is this attack traffic, is that crawler load proportional to the traffic it Continue reading

Building the agentic cloud: everything we launched during Agents Week 2026

Today marks the end of our first Agents Week, an innovation week dedicated entirely to the age of agents. It couldn’t have been more timely: over the past year, agents have swiftly changed how people work. Coding agents are helping developers ship faster than ever. Support agents resolve tickets end-to-end. Research agents validate hypotheses across hundreds of sources in minutes. And people aren't just running one agent: they're running several in parallel and around the clock.

As Cloudflare's CTO Dane Knecht and VP of Product Rita Kozlov noted in our welcome to Agents Week post, the potential scale of agents is staggering: If even a fraction of the world's knowledge workers each run a few agents in parallel, you need compute capacity for tens of millions of simultaneous sessions. The one-app-serves-many-users model the cloud was built on doesn't work for that. But that's exactly what developers and businesses want to do: build agents, deploy them to users, and run them at scale.

Getting there means solving problems across the entire stack. Agents need compute that scales from full operating systems to lightweight isolates. They need security and identity built into how they run.  They need an agent toolbox Continue reading

The AI engineering stack we built internally — on the platform we ship

In the last 30 days, 93% of Cloudflare’s R&D organization used AI coding tools powered by infrastructure we built on our own platform.

Eleven months ago, we undertook a major project: to truly integrate AI into our engineering stack. We needed to build the internal MCP servers, access layer, and AI tooling necessary for agents to be useful at Cloudflare. We pulled together engineers from across the company to form a tiger team called iMARS (Internal MCP Agent/Server Rollout Squad). The sustained work landed with the Dev Productivity team, who also own much of our internal tooling including CI/CD, build systems, and automation.

Here are some numbers that capture our own agentic AI use over the last 30 days:

  • 3,683 internal users actively using AI coding tools (60% company-wide, 93% across R&D), out of approximately 6,100 total employees

  • 47.95 million AI requests 

  • 295 teams are currently utilizing agentic AI tools and coding assistants.

  • 20.18 million AI Gateway requests per month

  • 241.37 billion tokens routed through AI Gateway

  • 51.83 billion tokens processed on Workers AI

The impact on developer velocity internally is clear: we’ve never seen a quarter-to-quarter increase in merge requests to this degree.

As AI Continue reading

Orchestrating AI Code Review at scale

Code review is a fantastic mechanism for catching bugs and sharing knowledge, but it is also one of the most reliable ways to bottleneck an engineering team. A merge request sits in a queue, a reviewer eventually context-switches to read the diff, they leave a handful of nitpicks about variable naming, the author responds, and the cycle repeats. Across our internal projects, the median wait time for a first review was often measured in hours.

When we first started experimenting with AI code review, we took the path that most other people probably take: we tried out a few different AI code review tools and found that a lot of these tools worked pretty well, and a lot of them even offered a good amount of customisation and configurability! Unfortunately, though, the one recurring theme that kept coming up was that they just didn’t offer enough flexibility and customisation for an organisation the size of Cloudflare.

So, we jumped to the next most obvious path, which was to grab a git diff, shove it into a half-baked prompt, and ask a large language model to find bugs. The results were exactly as noisy as you might expect, with a flood Continue reading

Multicast PIM Bootstrap Router (BSR) (VI)

Multicast PIM Bootstrap Router (BSR) (VI)

We are at the tail end of our multicast series. So far, we have covered multicast basics, IGMP, PIM Dense Mode, PIM Sparse Mode, and Auto-RP. In this post, we will look at Bootstrap Router, or BSR.

Multicast PIM Auto RP
AutoRP solves this problem by allowing routers to dynamically learn the RP address. Instead of manually configuring the RP on each router
Multicast PIM Bootstrap Router (BSR) (VI)

Bootstrap Router (BSR) Overview

In the previous post, we looked at Auto-RP, which is Cisco's proprietary method for dynamically distributing RP information to all routers in the network. BSR solves the same problem but is defined in the PIM standards (RFC 5059), making it vendor-neutral and interoperable across different platforms. The core idea is the same. Instead of manually configuring the RP address on every router, BSR allows routers to learn RP information automatically. A router called the Bootstrap Router collects RP information from Candidate RPs and distributes it to all PIM routers in the network.

The key difference is in how this information flows. With Auto-RP, the Mapping Agent sends RP mappings to the specific multicast group 224.0.1.40. This creates the chicken-and-egg problem we discussed in the previous post Continue reading

1 2 3 3,863