On 18 November 2025 at 11:20 UTC (all times in this blog are UTC), Cloudflare's network began experiencing significant failures to deliver core network traffic. To Internet users trying to access our customers' sites, this appeared as an error page indicating a failure within Cloudflare's network.
The issue was not caused, directly or indirectly, by a cyber attack or malicious activity of any kind. Instead, it was triggered by a change to one of our database systems' permissions, which caused the database to output multiple entries into a “feature file” used by our Bot Management system. That feature file, in turn, doubled in size. The larger-than-expected feature file was then propagated to all the machines that make up our network.
The software running on these machines to route traffic across our network reads this feature file to keep our Bot Management system up to date with ever-changing threats. The software had a limit on the size of the feature file, and the doubled file exceeded that limit, causing the software to fail.
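To make the failure mode concrete, here is a minimal sketch in Python, assuming a hypothetical loader and limit rather than Cloudflare's actual proxy code: a loader that accepts at most a fixed number of features and fails hard once the propagated file exceeds that limit.

```python
# A minimal sketch of the failure mode described above (not Cloudflare's code).
# A hypothetical loader enforces a hard limit on the number of feature entries;
# a feature file that has doubled in size blows past that limit, and the error
# takes down traffic handling instead of being handled gracefully.
MAX_FEATURES = 200  # hypothetical preallocated limit

def load_feature_file(path: str) -> list[str]:
    with open(path) as f:
        features = [line.strip() for line in f if line.strip()]
    if len(features) > MAX_FEATURES:
        # If callers treat this as unrecoverable, the whole proxy fails,
        # which is the behaviour the excerpt describes.
        raise RuntimeError(
            f"feature file has {len(features)} entries; limit is {MAX_FEATURES}"
        )
    return features
```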
After initially, and incorrectly, suspecting that the symptoms we were seeing were caused by a hyper-scale DDoS attack, we correctly identified the core issue and were able Continue reading
[Updated 22 November 2025: Handoff Section]
The previous chapter described how an application gathers information about available hardware resources and uses that information to initialize the job environment. During this initialization, hardware resources are abstracted and made accessible to the UET provider as objects.
This chapter explains the data transport process, using gradient synchronization as an example.
Figure 5-1 depicts two GPUs—Rank 0 and Rank 2—participating in the same training job (JobID: 101). Both GPUs belong to the same NCCL topology and are connected to the Scale-Out Backend Network’s rail0.
Because the training model is large, each layer of the neural network is split across the two GPUs using tensor parallelism, meaning that the computations of a single layer are distributed between the GPUs.
During the first forward-pass training iteration, the predicted model output does not match the expected result. This triggers the backward pass process, in which gradients—values indicating how much each weight parameter should be adjusted to improve the next forward-pass prediction—are computed.
Rank 0 computes its gradients, which in Figure 5-1 form a 2D matrix with 3 rows and 1024 columns. The results are stored in a memory space registered for the process in local VRAM. Continue reading
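As a rough illustration of this gradient-synchronization step (a sketch assuming PyTorch with the NCCL backend, not the book's actual code), each rank could hold its 3 x 1024 gradient matrix in local GPU memory and average it with the other rank via an all-reduce:

```python
import os
import torch
import torch.distributed as dist

def sync_gradients() -> None:
    # One process per GPU/rank; launched with e.g. torchrun --nproc_per_node=2 ...
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    torch.cuda.set_device(local_rank)

    # Stand-in for the gradients computed during the backward pass:
    # a 3 x 1024 matrix registered in this rank's local VRAM.
    grads = torch.randn(3, 1024, device="cuda")

    # All-reduce sums the per-rank gradients across the job over NCCL;
    # dividing by the world size gives the average used for the weight update.
    dist.all_reduce(grads, op=dist.ReduceOp.SUM)
    grads /= dist.get_world_size()

    dist.destroy_process_group()

if __name__ == "__main__":
    sync_gradients()
```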
We have some big news to share today: Replicate, the leading platform for running AI models, is joining Cloudflare.
We first started talking to Replicate because we had a lot in common beyond just a passion for bright color palettes. Our mission for Cloudflare’s Workers developer platform has been to make building and deploying full-stack applications as easy as possible. Meanwhile, Replicate has been on a similar mission to make deploying AI models as easy as writing a single line of code. And we realized we could build something even better together by integrating the Replicate platform into Cloudflare directly.
We are excited to share this news and even more excited for what it will mean for customers. Bringing Replicate’s tools into Cloudflare will continue to make our Developer Platform the best place on the Internet to build and deploy any AI or agentic workflow.
Before we spend more time talking about the future of AI, we want to answer the questions that are top of mind for Replicate and Cloudflare users. In short:
For existing Replicate users: Your APIs and workflows will continue to work without interruption. You will soon benefit from the Continue reading
Remember the vendor consultants who persuasively told you how to use their gear to build a disaster recovery solution with stretched VLANs, even though the only disaster recovery they ever experienced was the frantic attempt to restart their PowerPoint slide deck? Fortunately, I was only involved in the aftermath of their activity when the laws of physics reasserted themselves, and I helped the poor victims rearchitect their network into a somewhat saner state.
There’s another batch of snake-oil-selling consultants peddling their warez to gullible, incompetent managers: the AI preachers promising a reduction in support costs. Like the other group of consultants, they have never worked in support and have never implemented a working AI solution in their lives, but that never bothered them or their audience.
Unfortunately, this time I had the dubious “privilege” of a painful front-row seat.
By default, the dashboard shows the Last 24 Hours of traffic. Explore the data: select Last 30 Days for a long-term view, select Last 5 Minutes for an up-to-the-second view, click on items in a chart legend to show only the selected metrics, or drag to select an interval and zoom in.
The Expanse cluster at the San Diego Supercomputer Center is a batch-oriented science computing gateway serving thousands of users and a wide range of research projects; see Google News for examples. The SDSC Expanse cluster live AI/ML metrics dashboard displays real-time metrics for workloads running on the cluster.

If you’re using IP address truncation thinking it makes data “anonymous” or “non-personal,” you’re creating a false sense of security.
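For reference, here is roughly what IP address truncation does in practice, as a generic Python sketch not tied to any particular implementation: the low-order bits are zeroed, yet the result still points at a small and often identifiable network.

```python
import ipaddress

def truncate_ip(addr: str) -> str:
    """Zero the host bits: keep /24 for IPv4, /48 for IPv6 (common choices)."""
    ip = ipaddress.ip_address(addr)
    prefix = 24 if ip.version == 4 else 48
    network = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
    # The truncated value still narrows a user down to a small network,
    # which is why it is better treated as pseudonymized, not anonymous.
    return str(network.network_address)

print(truncate_ip("203.0.113.42"))    # 203.0.113.0
print(truncate_ip("2001:db8::1234"))  # 2001:db8::
```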
In March 2024, I received my first PR from an airplane: Sander Steffann was flying to South Africa to deliver an Ansible training and fixed a minor annoyance in the then-new multilab feature.
Of course, I wanted to know more about his setup, but it took us over a year and a half till we managed to sit down (virtually) and chat about it, the state of IPv6, the impact of CG-NAT on fraud prevention, and why digital twins don’t make sense in large datacenter migrations.
For more details, listen to Episode 202 of Software Gone Wild.
Sometimes we just like to talk about “life issues” as they relate to network engineering and technology career fields. Loneliness seems to especially plague network engineering and other “small” IT fields, where communities are small, change is rapid, and stress is high. Loneliness expert Lucy Rose joins Eyvonne, Tom, and Russ to discuss loneliness.
How do you find the root cause of a configuration management failure when you have a peak of hundreds of changes in 15 minutes on thousands of servers?
That was the challenge we faced as we built the infrastructure to reduce release delays caused by failures of Salt, a configuration management tool. (We eventually reduced such failures on the edge by over 5%, as we’ll explain below.) We’ll explore the fundamentals of Salt and how it is used at Cloudflare, then describe the common failure modes and how they delay our ability to release valuable changes to serve our customers.
By first solving an architectural problem, we provided the foundation for self-service mechanisms to find the root cause of Salt failures on servers, in datacenters, and across groups of datacenters. This system is able to correlate failures with git commits, external service failures, and ad hoc releases. The result has been shorter software release delays and an overall reduction in toilsome, repetitive triage for SRE.
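As a simplified illustration of that correlation idea (a hypothetical sketch, not Cloudflare's implementation), the core of such a system can be as small as matching failure timestamps against recent commits to the configuration repository:

```python
# Hypothetical sketch: flag config-repo commits that landed shortly before
# a spike of Salt highstate failures. Names and data structures are invented.
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Failure:
    server: str
    state_id: str
    time: datetime

@dataclass
class Commit:
    sha: str
    time: datetime

def suspect_commits(failures: list[Failure],
                    commits: list[Commit],
                    window: timedelta = timedelta(minutes=15)) -> dict[str, int]:
    """Count how many failures started within `window` after each commit."""
    counts: dict[str, int] = {}
    for commit in commits:
        hits = sum(1 for f in failures
                   if commit.time <= f.time <= commit.time + window)
        if hits:
            counts[commit.sha] = hits
    # Commits correlated with the most failures get triaged first.
    return dict(sorted(counts.items(), key=lambda kv: kv[1], reverse=True))
```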
To start, we will go into the basics of the Cloudflare network and how Salt operates within it. And then we’ll get to how we solved the challenge Continue reading
I managed to push out netlab release 25.11 yesterday. Here are the highlights:
We also had to make a few potentially-breaking changes, fixed a bunch of bugs, and added over a dozen small improvements.
You’ll find all the details in the release notes.
Remote bindings are bindings that connect to a deployed resource on your Cloudflare account instead of a locally simulated one – and recently, we announced that they are now generally available.
With this launch, you can now connect to deployed resources like R2 buckets and D1 databases while running Worker code on your local machine. This means you can test your local code changes against real data and services, without the overhead of deploying for each iteration.
In this blog post, we’ll dig into the technical details of how we built it, creating a seamless local development experience.
A key part of the Cloudflare Workers platform has been the ability to develop your code locally without having to deploy it every time you wanted to test something – though the way we’ve supported this has changed greatly over the years.
We started with wrangler dev running in remote mode. This works by deploying and connecting to a preview version of your Worker that runs on Cloudflare’s network every time you make a change to your code, allowing you to test things out as you develop. However, remote mode isn’t perfect — Continue reading
Many engineers who have tried out (or use) various AI products would agree that they’re useful when used correctly, but way overhyped. However, as Anil Dash explains in his Majority AI View article, we rarely hear that opinion:
What’s amazing is the reality that virtually 100% of tech experts I talk to in the industry feel this way, yet nobody outside of that cohort will mention this reality.