Cloudflare incident on August 21, 2025

On August 21, 2025, an influx of traffic directed toward clients hosted in the Amazon Web Services (AWS) us-east-1 facility caused severe congestion on links between Cloudflare and AWS us-east-1. This impacted many users who were connecting to or receiving connections from Cloudflare via servers in AWS us-east-1 in the form of high latency, packet loss, and failures to origins.

Customers with origins in AWS us-east-1 began experiencing impact at 16:27 UTC. The impact was substantially reduced by 19:38 UTC, with intermittent latency increases continuing until 20:18 UTC.

This was a regional problem between Cloudflare and AWS us-east-1, and global Cloudflare services were not affected. The degradation in performance was limited to traffic between Cloudflare and AWS us-east-1. The incident was a result of a surge of traffic from a single customer that overloaded Cloudflare's links with AWS us-east-1. It was a network congestion event, not an attack or a BGP hijack.

We’re very sorry for this incident. In this post, we explain what the failure was, why it occurred, and what we’re doing to make sure this doesn’t happen again.

Background

Cloudflare helps anyone to build, connect, protect, and accelerate their websites on the Internet. Most customers host their Continue reading

How a volunteer-run wildfire site in Portugal stayed online during DDoS attacks

On July 31, 2025, just as Portugal entered the peak of another intense wildfire season, João Pina, also known as Tomahock, received an automated alert from Cloudflare. His volunteer-run project, fogos.pt, now a trusted source of real-time wildfire information for millions across Portugal, was under attack.

One of the several alerts fogos.pt received related to the DDoS attack

What started in 2015 as a late-night side project with friends around a dinner table in Aveiro has grown into a critical public resource. During wildfires, the site is where firefighters, journalists, citizens, and even government agencies go to understand what’s happening on the ground. Over the years, fogos.pt has evolved from parsing PDFs into visual maps to a full-featured app and website with historical data, weather overlays, and more. It’s also part of Project Galileo, Cloudflare’s initiative to protect vulnerable but important public interest sites at no cost.

Wildfires are not just a Portuguese challenge. They are frequent across southern Europe (Spain, Greece, currently also under alert), California, Australia, and in Canada, which in 2023 faced record-setting fires. In all these cases, reliable information can be crucial, sometimes life-saving. Other organizations offering similar public services can Continue reading

Configuring BGP Community Propagation is Confusing

A large number of vendors claim to use industry-standard CLI, which means “something that looks like Cisco IOS, but we can’t say that in public.” The implementations of that “standard” are full of quirks; as I was making fun of Cisco IOS last week, it’s only fair to look at how others deal with BGP community propagation.

netlab has BGP configuration templates for 14 different platforms1, including these implementations that look like Cisco IOS from a distance if you squint just right2: Arista EOS, Aruba CX, and FRRouting. You can check the configuration templates if you wish; here’s the TC&DB3 overview:

How we built AI face cropping for Images

During Developer Week 2024, we introduced AI face cropping in private beta. This feature automatically crops images around detected faces, and marks the first release in our upcoming suite of AI image manipulation capabilities.

AI face cropping is now available in Images for everyone. To bring this feature to general availability, we moved our CPU-based prototype to a GPU-based implementation in Workers AI, enabling us to address a number of technical challenges, including memory leaks that could hamper large-scale use.

Photograph by Suad Kamardeen (@suadkamardeen) on Unsplash

Turning raw images into production-ready assets

We developed face cropping with two particular use cases in mind:

Social media platforms and AI chatbots. We observed a lot of traffic from customers who use Images to turn unedited images of people into smaller profile pictures in neat, fixed shapes.

E-commerce platforms. The same product photo might appear in a grid of thumbnails on a gallery page, then again on an individual product page with a larger view. The following example illustrates how cropping can change the emphasis from the model’s shirt to their sunglasses.

Photograph by Media Modifier (@mediamodifier) on Unsplash

When handling high volumes of media content, preparing images for production can be Continue reading

SwiNOG 40: Trustworthy Network Automation

The SwiNOG 40 event started with an interesting presentation on Building Trustworthy Network Automation (video) by Damien Garros (now CEO @ OpsMill) who discussed the principles one can use to build a trustworthy network automation solution, including idempotency, dry runs, and transactional changes. He also covered the crucial roles of the declarative approach, version control, and testing.

If you have ever watched any of my network automation materials, you won’t be surprised by anything he said, but if you’re just starting your network automation journey, you MUST watch this presentation to get your bearings straight.

Announcing the Cloudflare Browser Developer Program

Today, we are announcing Cloudflare’s Browser Developer Program, a collaborative initiative to strengthen partnership between Cloudflare and browser development teams.

Browser developers can apply to join here

At Cloudflare, we aim to help build a better Internet. One way we achieve this is by providing website owners with the tools to detect and block unwanted traffic from bots through Cloudflare Challenges or Turnstile. As both bots and our detection systems become more sophisticated, the security checks required to validate human traffic become more complicated. While we aim to strike the right balance, we recognize these security measures can sometimes cause issues for legitimate browsers and their users.

Building a better web together

A core objective of the program is to provide a space for intentional collaboration where we can work directly with browser developers to ensure that both accessibility and security can co-exist. We aim to support the evolving browser landscape, while upholding our responsibility to our customers to deliver the best security products. This program provides a dedicated channel for browser teams to share feedback, report issues, and help ensure that Cloudflare’s Challenges and Turnstile work seamlessly with all browsers.

What the program includes

Browser developers in Continue reading

Parallelization Strategies in Neural Networks

From a network engineer’s perspective, it is not mandatory to understand the full functionality of every application running in a datacenter. However, understanding the communication patterns of the most critical applications—such as their packet and flow sizes, entropy, transport frequency, and link utilization—is essential. Additionally, knowing the required transport services, including reliability, in-order packet delivery, and lossless transmission, is important.

In AI fabrics, a neural network, including both its training and inference phases, can be considered an application. For this reason, this section first briefly explains the basic operation of the simplest neural network: the Feed Forward Neural Network (FNN). It then discusses the operation of a single neuron. Although a deep understanding of the application itself is not required, this section equips the reader with knowledge of what pieces of information are exchanged between GPUs during each phase and why these data exchanges are important.


Feedforward Neural Network: Forward Pass


Figure 1-7 illustrates a simple four-layer Feed Forward Neural Network (FNN) distributed across four GPUs. The two leftmost GPUs reside in Node-1, and the other two GPUs reside in Node-2. The training data is fed into the first layer. In real neural networks, this first layer is the input Continue reading

AI Cluster Networking

Introduction

The Ultra Ethernet Specification v1.0 (UES), created by the Ultra Ethernet Consortium (UEC), defines end-to-end communication practices for Remote Direct Memory Access (RDMA) services in AI and HPC workloads over Ethernet network infrastructure. UES not only specifies a new RDMA-optimized transport layer protocol, Ultra Ethernet Transport (UET), but also defines how the full application stack—from Software through Transport, Network, Link, and Physical—can be adjusted to provide improved RDMA services while continuing to leverage well-established standards. UES includes, but is not limited to, a software API, mechanisms for low-latency and lossless packet delivery, and an end-to-end secure software communication path. 

Before diving into the details of Ultra Ethernet, let’s briefly look at what we are dealing with when we talk about an AI cluster. From this point onward, we focus on Ultra Ethernet from the AI cluster perspective. This chapter first introduces the AI cluster networking. Then, it briefly explains how a neural network operates during the training process, including an short introduction to the backpropagation algorithm and its forward and backward pass functionality.

Note: This book doesn’t include any complex mathematical algorithms related backpropagation algorithm, or detailed explanation of different neural networks. I have written a book Continue reading

The Rise of CLI-Based AI Coding Agents: Claude code vs Gemini CLI

Introduction I have been a Cursor user for vibe coding for 3 months. I was very skeptical about using Claude Code and Gemini CLI at first, since I wasn’t comfortable with the idea of using a terminal as an AI agent. But in the last 1–2 months, I’ve been trying them both — and it … Continue reading The Rise of CLI-Based AI Coding Agents: Claude code vs Gemini CLI

How Agentic AI Is Redefining Campus and Branch Network Needs

The workplace is being redefined. AI workloads, an explosion of connected devices, and changing working patterns are forcing organizations to rethink their campus and branch network designs to support business goals and deliver great digital experiences to customers and employees. Over the last decade, IT teams have had to manage significant change with the adoption of cloud computing, widespread use of mobile devices, and SaaS applications becoming critical to core business operations. Now, the transformation that is AI presents an opportunity to gain a core competitive advantage and a productivity multiplier for those organizations that successfully embrace it. When it comes to the rise of Small Language Models (SLMs) and agentic AI, sophisticated AI capabilities are moving closer to where business happens — at the branch office and on campus. This shift to “edge AI” promises exciting possibilities but also brings significant implications for network infrastructure that network architects and decision-makers must address now. Understanding Local Small Language Models (SLMs) at the Edge Local SLMs are designed to be compact and efficient enough to run on local servers or even dedicated edge devices. For tasks like answering simple queries or summarizing documents using local data, these models perform inference right Continue reading

Congestion Control at IETF 123

The Internet Engineering Task Force (IETF) meets three times a year to develop Internet Standards and related best practices. At its July 2025 meeting in Madrid, several sessions explored the evolving role of congestion control in transport protocols and sparked the observations in this post.