Archive

Category Archives for "Networking"

3-Hour Cloudflare Outage Knocks Out AI Chatbots, Shopify

On Nov. 18, 2025, Cloudflare suffered a major outage lasting several hours that disrupted access to numerous popular websites and online services worldwide. This was only the latest in a wave of outages at major Internet service providers; others have included Amazon Web Services. At Cloudflare, the root cause was a database system’s permissions blunder. This resulted in popular sites and services such as Shopify, Amazon, and Roblox failing, and in essentially all AI chatbots, such as ChatGPT, Perplexity, and Anthropic Claude, being knocked out.

Root Cause: A Database Permissions Blunder

The outage was triggered not by a cyberattack, but by a software bug in Cloudflare’s Bot Management system. Specifically, a recent change to the permissions for a database query generated an oversized “feature file” with many duplicate entries, which the Bot Management module then used. This Continue reading

Building VXLAN/EVPN Data Center Lab with netlab

Dmitry Klepcha published an excellent document describing how you can use netlab to build a series of data center fabric labs, starting from a simple IP network (without routing) and finishing with a complex EVPN/VXLAN network using symmetric IRB and MLAG toward hosts.

But wait, there’s more: all the lab topologies he used in his exercises are available on GitHub, which means that you could just clone the repo and start using them (I also “borrowed” some of his ideas as future netlab improvements).

Finally, thanks a million to Roman Pomazanov for bringing Dmitry’s work to my attention (and for the quote at the end of his post ;).

Interesting: an MCP Agent for Link-State Routing Protocols

Vadim Semenov created a nice demo that allows you to use an LLM to query the collected link-state graphs through an MCP agent (SuzieQ would probably be faster and easier to deploy, but hey, AI).

If you want to kick the tires, you’ll find the source code on GitHub (Network AI assistant, MCP server for Topolograph service). You’ll also need Vadim’s previous projects: Topolograph and OSPF watcher or IS-IS watcher.

HS117: Environmental, Social and Governance Initiatives: What That Means for Your Organization (Sponsored)

Environmental, Social, Governance (ESG) initiatives aren’t just “the right thing to do”, they can also save companies real dollars, particularly if they’re investing in data centers and other infrastructure. Join Jonathan Ciccio, Continuous Improvement Manager for The Siemon Company, as we discuss The Siemon Company’s ESG initiatives. The Siemon Company has been in business for... Read more »

Multi-Pod EVPN Troubleshooting: Fixing Next Hops

Last month, I wrote about the specifics of troubleshooting multi-pod EVPN designs. Today, I’d like to start a journey through an example in which (channeling my inner CCIE preparation lab instructor) I broke as many things as I could think of.

Here’s the lab topology we’ll use (and as usual, the corresponding netlab topology file and device configurations are on GitHub). Our network has two sites (pods), each with a spine switch, a leaf switch, and a host attached to the leaf switch. The inter-pod link is connected to the spine switches to minimize the number of devices.

Cloudflare outage on November 18, 2025

On 18 November 2025 at 11:20 UTC (all times in this blog are UTC), Cloudflare's network began experiencing significant failures to deliver core network traffic. This showed up to Internet users trying to access our customers' sites as an error page indicating a failure within Cloudflare's network.

The issue was not caused, directly or indirectly, by a cyber attack or malicious activity of any kind. Instead, it was triggered by a change to one of our database systems' permissions which caused the database to output multiple entries into a “feature file” used by our Bot Management system. That feature file, in turn, doubled in size. The larger-than-expected feature file was then propagated to all the machines that make up our network.

The software running on these machines to route traffic across our network reads this feature file to keep our Bot Management system up to date with ever changing threats. The software had a limit on the size of the feature file that was below its doubled size. That caused the software to fail.
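The failure mode described above can be sketched in a few lines of Python. This is only an illustration of the mechanism: the limit of 100 entries, the 60 feature names, and both function names are hypothetical and are not taken from Cloudflare's code.

```python
# Hypothetical sketch: the limit and feature names below are illustrative
# assumptions, not Cloudflare's actual values or code.
MAX_FEATURES = 100  # assumed hard limit enforced by the consumer


def build_feature_file(rows):
    """Producer side: emit one entry per row returned by the metadata query."""
    return [name for name, _type in rows]


def load_feature_file(features, limit=MAX_FEATURES):
    """Consumer side: refuse any file with more entries than the limit."""
    if len(features) > limit:
        raise ValueError(
            f"feature file has {len(features)} entries, limit is {limit}"
        )
    return list(features)


# Normal case: the query returns each feature once, well under the limit.
rows = [(f"feature_{i}", "Float64") for i in range(60)]
loaded = load_feature_file(build_feature_file(rows))

# After the permissions change: the same query returns every row twice,
# so the file doubles in size and crosses the limit.
duplicated_rows = rows + rows
try:
    load_feature_file(build_feature_file(duplicated_rows))
    failed = False
except ValueError:
    failed = True
```

In this sketch the consumer rejects the oversized file outright. A more tolerant consumer could deduplicate entries or keep serving the last known good file, although whether that matches the actual remediation is not stated in the excerpt.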

After we initially wrongly suspected the symptoms we were seeing were caused by a hyper-scale DDoS attack, we correctly identified the core issue and were able Continue reading

Tech Bytes: Monitoring AI Agents and MCP Servers with Cisco ThousandEyes (Sponsored)

If you’re building, testing, or using AI agents and services, how can you get network visibility into, and ensure performance of, these tools? On today’s Tech Bytes podcast, we talk with sponsor Cisco ThousandEyes about how to monitor and measure performance of AI agents, AI models and services, and Model Context Protocol servers. We’ll get... Read more »

UET Data Transfer Operation: Introduction

Introduction

[Updated 22 November 2025: Handoff Section]

The previous chapter described how an application gathers information about available hardware resources and uses that information to initialize the job environment. During this initialization, hardware resources are abstracted and made accessible to the UET provider as objects.

This chapter explains the data transport process, using gradient synchronization as an example.

Figure 5-1 depicts two GPUs—Rank 0 and Rank 2—participating in the same training job (JobID: 101). Both GPUs belong to the same NCCL topology and are connected to the Scale-Out Backend Network’s rail0.

Because the training model is large, each layer of the neural network is split across two GPUs using tensor parallelism, meaning that the computations of a single layer are distributed between the GPUs.
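As a rough illustration of the idea, the sketch below splits one layer's weight matrix column-wise between two simulated GPUs and checks that the combined partial outputs equal the unsplit result. The column-wise split, the tiny matrix sizes, and the helper names are assumptions made for the example; the chapter does not say how the layer is actually partitioned.

```python
# Illustrative only: tiny matrices, plain Python lists, two simulated "GPUs".

def matmul(x, w):
    """Multiply a (rows x k) matrix by a (k x cols) matrix."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)]
            for row in x]


def split_columns(w, parts):
    """Split a weight matrix column-wise into `parts` equal shards."""
    step = len(w[0]) // parts
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(parts)]


x = [[1.0, 2.0], [3.0, 4.0]]                      # layer input (2 x 2)
w = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 1.0, 3.0]]  # layer weights (2 x 4)

# Each simulated GPU holds half of the columns and computes a partial output.
shards = split_columns(w, parts=2)
partials = [matmul(x, shard) for shard in shards]

# Concatenating the partial outputs row by row reproduces the full result.
combined = [left + right for left, right in zip(*partials)]
full = matmul(x, w)
```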

During the first forward-pass training iteration, the predicted model output does not match the expected result. This triggers the backward pass process, in which gradients—values indicating how much each weight parameter should be adjusted to improve the next forward-pass prediction—are computed.
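To make the role of a gradient concrete, here is a minimal single-weight sketch. The model (prediction = w * x), the squared-error loss, the input values, and the learning rate are all assumptions chosen for the example and are not taken from the chapter.

```python
# Illustrative single-weight model: prediction = w * x,
# loss = (prediction - y) ** 2.

def loss(w, x, y):
    """Squared error between the prediction and the expected result."""
    return (w * x - y) ** 2


def gradient(w, x, y):
    """Derivative of the loss with respect to the weight w."""
    return 2 * (w * x - y) * x


w, x, y = 0.5, 2.0, 3.0
learning_rate = 0.1  # assumed value for the example

before = loss(w, x, y)                          # forward pass: prediction is off
w_new = w - learning_rate * gradient(w, x, y)   # backward pass: adjust the weight
after = loss(w_new, x, y)                       # next forward pass: smaller error
```

The gradient is negative here because the prediction is too low, so subtracting it increases the weight and the loss on the next forward pass shrinks.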

Rank 0 computes its gradients, which in Figure 5-1 are stored as a 2D matrix with 3 rows and 1024 columns. The results are stored in a memory space registered for the process in local VRAM. Continue reading

Replicate is joining Cloudflare

We have some big news to share today: Replicate, the leading platform for running AI models, is joining Cloudflare.

We first started talking to Replicate because we shared a lot in common beyond just a passion for bright color palettes. Our mission for Cloudflare’s Workers developer platform has been to make building and deploying full-stack applications as easy as possible. Meanwhile, Replicate has been on a similar mission to make deploying AI models as easy as writing a single line of code. And we realized we could build something even better together by integrating the Replicate platform into Cloudflare directly.

We are excited to share this news and even more excited for what it will mean for customers. Bringing Replicate’s tools into Cloudflare will continue to make our Developer Platform the best place on the Internet to build and deploy any AI or agentic workflow.

What does this mean for you? 

Before we spend more time talking about the future of AI, we want to answer the questions that are top of mind for Replicate and Cloudflare users. In short: 

For existing Replicate users: Your APIs and workflows will continue to work without interruption. You will soon benefit from the Continue reading

AI Enshittification: Swiss Airlines Edition

Remember the vendor consultants who persuasively told you how to use their gear to build a disaster recovery solution with stretched VLANs, even though the only disaster recovery they ever experienced was the frantic attempt to restart their PowerPoint slide deck? Fortunately, I was only involved in the aftermath of their activity when the laws of physics reasserted themselves, and I helped the poor victims rearchitect their network into a somewhat saner state.

There’s another batch of snake-oil salesmen consultants peddling their warez to the gullible incompetent managers: the AI preachers promising reduction in support costs. Like the other group of consultants, they have never worked in support and have never implemented a working AI solution in their lives, but that never bothered them or their audience.

Unfortunately, this time I had the “privilege” of a painful front-row seat.

SC25: SDSC Expanse cluster live AI/ML metrics

The SDSC Expanse cluster live AI/ML metrics dashboard is a joint InMon / San Diego Supercomputer Center (SDSC) demonstration at The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC25), being held this week in St. Louis, November 16-21. Click on the dashboard link during the show to see live traffic.

By default, the dashboard shows the Last 24 Hours of traffic. To explore the data, select Last 30 Days for a long-term view or Last 5 Minutes for an up-to-the-second view, click on items in a chart legend to show the selected metric, or drag to select an interval and zoom in.

The Expanse cluster at the San Diego Supercomputer Center is a batch-oriented science computing gateway serving thousands of users and a wide range of research projects; see Google News for examples.

The SDSC Expanse cluster live AI/ML metrics dashboard displays real-time metrics for workloads running on the cluster:

  • Total Traffic: Total traffic entering fabric
  • Cluster Services: Traffic associated with Lustre, Ceph and NFS storage, and Slurm workload management
  • Core Link Traffic: Histogram of load on fabric links
  • Edge Link Traffic: Histogram of load on access ports
  • RDMA Operations: Total RDMA operations
  • Continue reading