Archive

Category Archives for "Networking"

Get better visibility for the WAF with payload logging

As the surface area for attacks on the web increases, Cloudflare’s Web Application Firewall (WAF)  provides a myriad of solutions to mitigate these attacks. This is great for our customers, but the cardinality in the workloads of the millions of requests we service means that generating false positives is inevitable. This means that the default configuration we have for our customers has to be fine-tuned. 

Fine-tuning isn’t an opaque process: customers have to get some data points and then decide what works for them. This post explains the technologies we offer to enable customers to see why the WAF takes certain actions — and the improvements that have been made to reduce noise and increase signal.

The Log action is great — can we do more?

Cloudflare’s WAF protects origin servers from different kinds of layer 7 attacks, which are attacks that target the application layer. Protection is provided with various tools like:

IOS/XR Route Redistribution Configuration Mess

One would hope that the developers of a network operating system wouldn’t feel the irresistible urge to reinvent what should have been a common configuration feature for every routing protocol. Alas, the IOS/XR developers failed to get that memo.

I decided to implement route redistribution (known as route import in netlab) for OSPFv2/OSPFv3, IS-IS, and BGP on IOS/XR (Cisco 8000v running IOS/XR release 24.4.1) and found that each routing protocol uses a different syntax for the source routing protocol part of the redistribute command.

HN806: Let’s Get NUTS!

Unit testing is a software development practice for checking that an individual component of code works before integrating that unit with other components in a larger program. A new open source project called Network Unit Testing System, or NUTS, brings the same concept to network automation. The big idea is that by incorporating unit tests into... Read more »

Tutorial: Implement a Nginx Gateway Fabric as an Alternative to Ingress

The Kubernetes ecosystem is undergoing a fundamental shift in how it manages external traffic. On Nov. 12, 2025, Kubernetes Kubernetes Gateway API emerges as the successor, which is a standardized, extensible framework that addresses these fundamental limitations. Rather than relying on fragmented implementations and proprietary annotations, Gateway API introduces a unified model that supports multiprotocol routing (L4 and L7), fine-grained traffic control, header-based pattern matching, request mirroring, and native traffic metrics. Achieving General Availability in 2023, Gateway API represents the Kubernetes community’s answer to the ingress problem. For a detailed comparison of Ingress controller and Gateway, refer to my previous article Continue reading

3-Hour Cloudflare Outage Knocks Out AI Chatbots, Shopify

On Nov. 18, 2025, major outage lasting several hours that disrupted access to numerous popular websites and online services worldwide. This was only the latest in a wave of major Internet service providers going down. Others have included Amazon Web Services and Cloudflare, the root cause was a database system’s permissions blunder. This resulted in popular sites and services such as Shopify, Amazon, and Robox failing, and in essentially all AI chatbots, such as ChatGPT, Perplexity, and Anthropic Claude, being knocked out. Root Cause: A Database Permissions Blunder Specifically, the outage was triggered not by a cyberattack, but by a software bug in Cloudflare’s Bot Management system. Specifically, a recent change to the permissions for a database query generated an overlarge “feature file” that was used by the Bot Management module with many duplicate entries. This Continue reading

Building VXLAN/EVPN Data Center Lab with netlab

Dmitry Klepcha published an excellent document describing how you can use netlab to build a series of data center fabric labs, starting from a simple IP network (without routing) and finishing with a complex EVPN/VXLAN network using symmetric IRB and MLAG toward hosts.

But wait, there’s more: all the lab topologies he used in his exercises are available on GitHub, which means that you could just clone the repo and start using them (I also “borrowed” some of his ideas as future netlab improvements).

Finally, thanks a million to Roman Pomazanov for bringing Dmitry’s work to my attention (and for the quote at the end of his post ;).

Interesting: an MCP Agent for Link-State Routing Protocols

Vadim Semenov created a nice demo that allows you to use an LLM to query the collected link-state graphs through an MCP agent (SuzieQ would probably be faster and easier to deploy, but hey, AI).

If you want to kick the tires, you’ll find the source code on GitHub (Network AI assistant, MCP server for Topolograph service). You’ll also need Vadim’s previous projects: Topolograph and OSPF watcher or IS-IS watcher.

HS117: Environmental, Social and Governance Initiatives: What That Means for Your Organization (Sponsored)

Environmental, Social, Governance (ESG) initiatives aren’t just “the right thing to do”, they can also save companies real dollars, particularly if they’re investing in data centers and other infrastructure. Join Jonathan Ciccio, Continuous Improvement Manager for The Siemon Company, as we discuss The Siemon Company’s ESG initiatives. The Siemon Company has been in business for... Read more »

Multi-Pod EVPN Troubleshooting: Fixing Next Hops

Last month, I wrote about the specifics of troubleshooting multi-pod EVPN designs. Today, I’d like to start a journey through an example in which (channeling my inner CCIE preparation lab instructor) I broke as many things as I could think of.

Here’s the lab topology we’ll use (and as usual, the corresponding netlab topology file and device configurations are on GitHub). Our network has two sites (pods), each with a spine switch, a leaf switch, and a host attached to the leaf switch. The inter-pod link is connected to the spine switches to minimize the number of devices.

Cloudflare outage on November 18, 2025

On 18 November 2025 at 11:20 UTC (all times in this blog are UTC), Cloudflare's network began experiencing significant failures to deliver core network traffic. This showed up to Internet users trying to access our customers' sites as an error page indicating a failure within Cloudflare's network.

The issue was not caused, directly or indirectly, by a cyber attack or malicious activity of any kind. Instead, it was triggered by a change to one of our database systems' permissions which caused the database to output multiple entries into a “feature file” used by our Bot Management system. That feature file, in turn, doubled in size. The larger-than-expected feature file was then propagated to all the machines that make up our network.

The software running on these machines to route traffic across our network reads this feature file to keep our Bot Management system up to date with ever changing threats. The software had a limit on the size of the feature file that was below its doubled size. That caused the software to fail.

After we initially wrongly suspected the symptoms we were seeing were caused by a hyper-scale DDoS attack, we correctly identified the core issue and were able Continue reading