Archive

Category Archives for "Networking"

Hyper-volumetric DDoS attacks skyrocket: Cloudflare’s 2025 Q2 DDoS threat report

Welcome to the 22nd edition of the Cloudflare DDoS Threat Report. Published quarterly, this report offers a comprehensive analysis of the evolving threat landscape of Distributed Denial of Service (DDoS) attacks based on data from the Cloudflare network. In this edition, we focus on the second quarter of 2025. To view previous reports, visit www.ddosreport.com.

June was the busiest month for DDoS attacks in 2025 Q2, accounting for nearly 38% of all observed activity. One notable target was an independent Eastern European news outlet protected by Cloudflare, which reported being attacked following its coverage of a local Pride parade during LGBTQ Pride Month.

Key DDoS insights

  • DDoS attacks continue to break records. During 2025 Q2, Cloudflare automatically blocked the largest ever reported DDoS attacks, peaking at 7.3 terabits per second (Tbps) and 4.8 billion packets per second (Bpps).

  • Overall, in 2025 Q2, hyper-volumetric DDoS attacks skyrocketed. Cloudflare blocked over 6,500 hyper-volumetric DDoS attacks, an average of 71 per day. 

  • Although the overall number of DDoS attacks dropped compared to the previous quarter — which saw an unprecedented surge driven by a large-scale campaign targeting Cloudflare’s network and critical Internet infrastructure protected by Cloudflare — the Continue reading

Blog Reboot

When I first launched this site, many years ago, it served as a humble lab notebook and a place to share short personal stories from my working life. I shared diagrams, Junos configs, and field notes written after late-night maintenance windows or proof-of-concept builds. Those stories took on a life of their own. They brought […]

The post Blog Reboot first appeared on Rick Mur.

Integration Testing in Infrahub – Validate Your Automation in Real Environments

Testing individual components is a good start, but what happens when you need to validate how everything works together? In this post, we’ll show you how to run integration tests in Infrahub that verify your schema, data, and Git workflows in a real, running environment.

You’ll learn how to spin up isolated Infrahub instances on the fly using Docker and Testcontainers, automate schema and data loading, and catch issues before they reach production.
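If you haven’t used Testcontainers before, the core pattern looks roughly like this: a pytest fixture boots a throwaway Infrahub stack from a Docker Compose file and hands its URL to your tests. This is a minimal sketch, not the post’s actual code; the compose file location, the service name, and the port are assumptions, so check the OpsMill repo for the real values.

# Minimal sketch: boot an isolated Infrahub stack once per test session.
# The compose file path, the "server" service name, and port 8000 are
# assumptions -- adjust them to match the OpsMill repository.
import pytest
import requests
from testcontainers.compose import DockerCompose

@pytest.fixture(scope="session")
def infrahub_url():
    compose = DockerCompose(".", compose_file_name="docker-compose.yml")
    with compose:  # starts the stack, tears it down when the session ends
        host = compose.get_service_host("server", 8000)
        port = compose.get_service_port("server", 8000)
        yield f"http://{host}:{port}"

def test_infrahub_is_reachable(infrahub_url):
    # Replace "/" with a real readiness endpoint for a stronger check
    assert requests.get(infrahub_url, timeout=30).status_code == 200

Because the fixture is session-scoped, the containers start once, every test runs against the same isolated instance, and cleanup happens automatically.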

SPONSORED

OpsMill has partnered with me for this post, and they also support my blog as a sponsor. The post was originally published at https://opsmill.com/blog/integration-testing-infrahub/

You don’t need to be a Python expert to follow along. We’ll walk through everything step by step, with example code and tooling recommendations. You can also follow this guide in video form on the Cisco DevNet YouTube channel:

All the sample data and code used here are available on the OpsMill GitHub repo, so you can set up your own test environment and try it yourself.

Quick recap

Previously, we covered how to write smoke and unit tests using the Continue reading

Triggering QUIC

We look in detail at the mechanisms used to trigger a client application (typically a browser) to connect to the server using the QUIC transport protocol.
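Two of those triggers are easy to observe from the command line: the Alt-Svc HTTP response header that advertises an HTTP/3 (h3) endpoint, and the newer HTTPS DNS resource record that can carry h3 in its ALPN list. A quick sketch (the hostname is just an example of a QUIC-capable site):

# Look for an Alt-Svc header advertising HTTP/3 on UDP/443
curl -sI https://www.cloudflare.com | grep -i alt-svc

# Look for an HTTPS (type 65) DNS record with h3 in its alpn list
dig +short www.cloudflare.com https

If the server supports QUIC, the first command should show something like alt-svc: h3=":443", inviting the client to retry the connection over UDP/443, and the second should return an HTTPS record whose alpn list includes h3.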

Go is still not good

Previous posts Why Go is not my favourite language and Go programs are not portable have me critiquing Go for over a decade.

These things about Go are bugging me more and more. Mostly because they’re so unnecessary. The world knew better, and yet Go was created the way it was.

For readers of previous posts you’ll find some things repeated here. Sorry about that.

Error variable scope is forced to be wrong

Here’s an example of the language forcing you to do the wrong thing. It’s very helpful to the reader of code (and code is read more often than it’s written) to minimize the scope of a variable. If mere syntax can tell the reader that a variable is used in just these two lines, that’s a good thing.

Example:

if err := foo(); err != nil {
   return err
}

(enough has been said about this verbose repeated boilerplate that I don’t have to. I also don’t particularly care)

So that’s fine. The reader knows err is here and only here.

But then you encounter this:

bar, err := foo()
if err != nil {
  return err
}
if err =  Continue reading

AWS Transit Gateway Peering Attachments (VIII)

Hi all, welcome back to the AWS networking series. This is actually the third post dedicated to the Transit Gateway alone. I know some of you might be thinking, why are we still talking about Transit Gateway? But please bear with me. TGW is such an important concept, and it shows up in almost every architecture you come across.

So far, we've covered what a Transit Gateway is, how to create one, how route tables work, and how to manage associations and propagations. We also looked at how to create a VPN and attach it to the TGW, and we went through the process of sharing a TGW with other AWS accounts using AWS Resource Access Manager (RAM). In this post, we'll look at how to peer a Transit Gateway with another TGW, even when they are in different regions. So let's get to it.
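As a preview, the peering workflow itself boils down to a request/accept pair of AWS CLI calls, after which you add static routes pointing at the peering attachment (all IDs, account numbers, and regions below are placeholders):

# Requester side: create a peering attachment towards the remote TGW
aws ec2 create-transit-gateway-peering-attachment \
  --transit-gateway-id tgw-1111aaaa \
  --peer-transit-gateway-id tgw-2222bbbb \
  --peer-account-id 123456789012 \
  --peer-region eu-west-1

# Accepter side: approve the attachment in the peer account/region
aws ec2 accept-transit-gateway-peering-attachment \
  --transit-gateway-attachment-id tgw-attach-3333cccc \
  --region eu-west-1

One design detail worth remembering: TGW peering attachments do not propagate routes dynamically, so each side needs static routes in its TGW route table for the prefixes on the other side.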

If you're completely new to Transit Gateway, I highly recommend checking out the earlier introductory posts listed below.

Tracing network packets with eBPF and pwru

pwru (packet, where are you?) is an open source tool from Cilium that uses eBPF instrumentation in recent Linux kernels to trace network packets through the kernel.

In this article we will use Multipass to create a virtual machine to experiment with pwru. Multipass is a command line tool for running Ubuntu virtual machines on Mac or Windows. Multipass uses the native virtualization capabilities of the host operating system to simplify the creation of virtual machines.

multipass launch --name=ebpf noble
multipass exec ebpf -- sudo apt update
multipass exec ebpf -- sudo apt -y install git clang llvm make libbpf-dev flex bison golang
multipass exec ebpf -- git clone https://github.com/cilium/pwru.git
multipass exec ebpf --working-directory pwru -- make
multipass exec ebpf -- sudo ./pwru/pwru -h

Run the commands above to create the virtual machine and build pwru from sources.

multipass exec ebpf -- sudo ./pwru/pwru port https

Run pwru to trace https traffic on the virtual machine.

multipass exec ebpf -- curl https://sflow-rt.com

In a second window, run the above command to generate an https request from the virtual machine.

SKB                CPU PROCESS          NETNS      MARK/x        IFACE       PROTO  MTU   LEN   TUPLE FUNC
0xffff9fc40335a0e8 0   ~r/bin/curl:8966 4026531840 0               0          Continue reading

Google Brings the Lustre Parallel File System to Its Cloud

Google Cloud now offers a fully managed version of Lustre: the Google Cloud Managed Lustre service went live (“general availability”) globally on July 8. Lustre is an open source, high-performance file system for running supercomputing jobs, with the ability to stream data in the range of terabytes per second and to scale up to 8PiB or more. With this release, Google Cloud has caught up with other cloud providers in offering a cloud-based Lustre. It competes with Amazon FSx and Oracle’s EXAScaler Continue reading

Dual-Stack Common-Services VRF Confuses Aruba CX

As I was running the netlab pre-release integration tests, I noticed that ArubaCX failed the IPv6 Common Services test (it worked before). Here’s the gist of what that test does:

  • It creates three VRFs (red, blue, and common)
  • It imports routes from red and blue VRF into the common VRF and routes from the common VRF into the red and blue VRF (the schoolbook example of common services VRF)
  • Just to be on the safe side, it imports red routes into the red VRF and so on.

Here’s the relevant part of the netlab lab topology:
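Based on that description, the VRF part of the topology looks roughly like this (a sketch of netlab’s VRF syntax reconstructed from the bullets above, not the verbatim topology file):

vrfs:
  red:
    import: [ red, common ]
  blue:
    import: [ blue, common ]
  common:
    import: [ red, blue, common ]

netlab translates those import lists into route targets, so the common VRF imports routes from red and blue, while red and blue import the common routes on top of their own.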

AI Metrics with InfluxDB Cloud

The InfluxDB AI Metrics dashboard shown above tracks performance metrics for AI/ML RoCEv2 network traffic, for example, large scale CUDA compute tasks using NVIDIA Collective Communication Library (NCCL) operations for inter-GPU communications: AllReduce, Broadcast, Reduce, AllGather, and ReduceScatter.

The metrics include:

  • Total Traffic: total traffic entering the fabric
  • Operations: total RoCEv2 operations broken out by type
  • Core Link Traffic: histogram of load on fabric links
  • Edge Link Traffic: histogram of load on access ports
  • RDMA Operations: total RDMA operations
  • RDMA Bytes: average RDMA operation size
  • Credits: average number of credits in RoCEv2 acknowledgements
  • Period: detected period of compute / exchange activity on the fabric (in this case just over 0.5 seconds)
  • Congestion: total ECN / CNP congestion messages
  • Errors: total ingress / egress errors
  • Discards: total ingress / egress discards
  • Drop Reasons: packet drop reasons

This article shows how to integrate with InfluxDB Cloud instead of running the services locally.

Note: InfluxDB Cloud has a free service tier that can be used to test this example.

Save the following compose.yml file on a system running Docker.

configs:
  config.telegraf:
    content: |
      [agent]
        interval = '15s'
        round_interval = true
        omit_hostname = true
      [[outputs.influxdb_v2]]
        urls = ['https://<INFLUXDB_CLOUD_INSTANCE>.cloud2.influxdata.com']
         Continue reading

Quicksilver v2: evolution of a globally distributed key-value store (Part 1)

Quicksilver is a key-value store developed internally by Cloudflare to enable fast global replication and low-latency access on a planet scale. It was initially designed to be a global distribution system for configurations, but over time it gained popularity and became the foundational storage system for many products in Cloudflare.

A previous post described how we moved Quicksilver to production and started replicating on all machines across our global network. That is what we called Quicksilver v1: each server has a full copy of the data and updates it through asynchronous replication. The design served us well for some time. However, as our business grew with an ever-expanding data center footprint and a growing dataset, it became more and more expensive to store everything everywhere.

We realized that storing the full dataset on every server is inefficient. Due to the uniform design, data accessed in one region or data center is replicated globally, even if it's never accessed elsewhere. This leads to wasted disk space. We decided to introduce a more efficient system with two new server roles: replica, which stores the full dataset, and proxy, which acts as a persistent cache, evicting unused key-value pairs to free Continue reading
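The proxy role is, at its core, a bounded cache in front of a full copy of the data. As a toy illustration of that idea (the post does not describe Quicksilver's actual eviction algorithm; least-recently-used is just an assumption for this sketch):

# Toy sketch of the proxy role: serve hot keys locally, fall back to a
# replica on a miss, and evict cold entries when the cache is full.
from collections import OrderedDict

class ProxyCache:
    def __init__(self, capacity, replica):
        self.capacity = capacity
        self.replica = replica      # stand-in for a replica holding the full dataset
        self.store = OrderedDict()

    def get(self, key):
        if key in self.store:
            self.store.move_to_end(key)      # mark as recently used
            return self.store[key]
        value = self.replica[key]            # cache miss: ask a replica
        self.store[key] = value
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)   # evict the least recently used key
        return value

The interesting engineering lives in everything this sketch leaves out: persistence, replication lag, and sizing the cache per server.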

AWS Transit Gateway VPN Attachment and Resource Access Manager (VII)

In the previous post, we covered the basics of Transit Gateway, what it is, what problem it solves, and we also looked at how to create one. We walked through attaching two VPCs to the TGW and establishing connectivity between them. We also covered the important concepts of TGW attachments, associations, and propagations.

In this post, we will build on that knowledge and look at:

  • Creating and attaching a Site-to-Site VPN to the Transit Gateway.
  • Sharing a single Transit Gateway across different AWS accounts using Resource Access Manager.

As always, if you find this post helpful, press the ‘clap’ button. It means a lot to me and helps me know you enjoy this type of content. If I get enough claps for this series, I’ll make sure to write more on this specific topic.

Transit Gateway AWS Site-to-Site VPN Attachment

We have already seen how to create a Site-to-Site Continue reading
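For orientation before the walkthrough: the key difference from a classic VPN is that the VPN connection is created against the Transit Gateway rather than a Virtual Private Gateway, and cross-account sharing is a single RAM resource share. A hedged sketch with placeholder IDs, account numbers, and ARNs:

# Create a Site-to-Site VPN connection terminating on the TGW
aws ec2 create-vpn-connection \
  --type ipsec.1 \
  --customer-gateway-id cgw-1111aaaa \
  --transit-gateway-id tgw-2222bbbb

# Share the TGW with another AWS account via Resource Access Manager
aws ram create-resource-share \
  --name tgw-share \
  --resource-arns arn:aws:ec2:eu-west-1:123456789012:transit-gateway/tgw-2222bbbb \
  --principals 210987654321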

Worth Reading: The Secret Rules of the Terminal

Did you ever wonder why pressing an up-arrow in a (Linux) terminal window sometimes recalls the previous command but other times creates ^[[A?

Julia Evans did, and spent months exploring the quirks of the Linux terminal (and writing blog posts describing what she found), finally resulting in The Secret Rules of the Terminal (including the various shells, terminal emulators, escape codes, and TTY driver). A must-read if you’re a newbie who wants to understand why things happen the way they do.
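You can reproduce the up-arrow behaviour in a few seconds: the key always sends the same escape sequence (ESC [ A); what differs is whether the foreground program interprets it as "previous history entry" or just echoes the raw bytes.

# At a shell prompt, line editing (readline) interprets ESC [ A as history recall.
# Inside a program that does no line editing, the bytes appear literally:
cat > /dev/null
# now press the up arrow: the terminal displays ^[[A (ESC rendered as ^[)
# press Ctrl-D (or Ctrl-C) to exit cat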

How TimescaleDB helped us scale analytics and reporting

At Cloudflare, PostgreSQL and ClickHouse are our standard databases for transactional and analytical workloads. If you’re part of a team building products with configuration in our Dashboard, chances are you're using PostgreSQL. It’s fast, versatile, reliable, and backed by over 30 years of development and real-world use. It has been a foundational part of our infrastructure since the beginning, and today we run hundreds of PostgreSQL instances across a wide range of configurations and replication setups.

ClickHouse is a more recent addition to our stack. We started using it around 2017, and it has enabled us to ingest tens of millions of rows per second while supporting millisecond-level query performance. ClickHouse is a remarkable technology, but like all systems, it involves trade-offs.

In this post, I’ll explain why we chose TimescaleDB — a Postgres extension — over ClickHouse to build the analytics and reporting capabilities in our Zero Trust product suite.

Designing for future growth

After a decade in software development, I’ve grown to appreciate systems that are simple and boring. Over time, I’ve found myself consistently advocating for architectures with the fewest moving parts possible. Whenever I see a system diagram with more than three boxes, I ask: Why Continue reading

Calico Whisker & Staged Network Policies: Secure Kubernetes Workloads Without Downtime

Rolling out network policies in a live Kubernetes cluster can feel like swapping wings mid-flight—one typo or overly broad rule and critical traffic is grounded. Calico’s Staged Network Policies remove the turbulence by letting you deploy policies in staged mode, so you can observe their impact before enforcing anything. Add Whisker, the open-source policy enforcement and testing tool (introduced as part of Calico Open Source 3.30) that captures every flow and tags it with a policy verdict, and you’ve got a safety harness that proves your change is sound long before you flip the switch. In this post, we’ll walk you through how you can leverage these capabilities to tighten security, validate intent, and ship changes confidently—without a single packet of downtime.

Deploying a Kubernetes Cluster

Calico for Policy is a CNI-agnostic tool. Refer to the Calico Open Source docs for a list of supported CNIs. The git repository for this blog post can be found here.

For this post, let’s deploy a simple AKS cluster with Azure CNI.

## Configure 
az group create --name calicooss --location eastus2

## Create a 3 node AKS cluster with Azure CNI
az aks create \
  --resource-group calicooss \
  --name  Continue reading
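Once the cluster is running, a staged policy is declared exactly like a regular Calico policy, just with a different kind, so Whisker can report its allow/deny verdicts on live flows without anything being enforced. A minimal sketch (the namespace, selector, and ports are illustrative, not taken from the original post):

apiVersion: projectcalico.org/v3
kind: StagedNetworkPolicy
metadata:
  name: default.web-allow-egress
  namespace: demo
spec:
  tier: default
  selector: app == 'web'
  types:
    - Egress
  egress:
    - action: Allow
      protocol: TCP
      destination:
        ports: [443]

When the flow logs show only the verdicts you expect, promoting the policy is essentially a matter of changing the kind from StagedNetworkPolicy to NetworkPolicy.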

Introducing simple and secure egress policies by hostname in Cloudflare’s SASE platform

Cloudflare’s SASE platform is on a mission to strengthen our platform-wide support for hostname- and domain-based policies. This mission is being driven by enthusiastic demands from our customers, and boosted along the way by several interesting engineering challenges. Today, we’re taking a deep dive into the first milestone of this mission, which we recently released in open beta: egress policies by hostname, domain, content category, and application. Let’s dive right in! 

Egress policies and IP ACLs

Customers use our egress policies to control how their organization's Internet traffic connects to external services. An egress policy allows a customer to control the source IP address their traffic uses, as well as the geographic location that their traffic uses to egress onto the public Internet. Control of the source IP address is especially useful when accessing external services that apply policies to traffic based on source IPs, using IP Access Control Lists (ACLs). Some services use IP ACLs because they improve security, while others use them because they are explicitly required by regulation or compliance frameworks. 

(That said, it's important to clarify that we do not recommend relying on IP ACLs as the only security mechanism used to gate Continue reading

Ultra Ethernet

Introduction

Remote Direct Memory Access over Converged Ethernet (RoCE) is a transport model that extends InfiniBand semantics over Ethernet networks. It enables direct memory access between hosts by encapsulating InfiniBand transport headers—such as the InfiniBand Transport Header (IBTH) and the RDMA Extended Transport Header (RETH)—within Ethernet, IP, and UDP packets. Chapter 9 of my book “Deep Learning for Network Engineers” describes how RDMA NICs process application work requests, known as InfiniBand verbs, and how these are encoded into IBTH and RETH headers for delivery to remote targets using RoCEv2.

This post shifts focus to the Ultra Ethernet Transport (UET) model, developed by the Ultra Ethernet Consortium (UEC). UET defines an alternative RDMA transport architecture that operates over standard Ethernet networks, without relying on InfiniBand message formats or semantics. While both RoCEv2 and UET enable remote memory access between nodes, UET is not based on InfiniBand transport headers, and the term RoCE is not used in UET systems.

Instead, UET introduces a new Ultra Ethernet (UE) layer composed of several sublayers, including the Semantic Sublayer (SES) and the Packet Delivery Sublayer (PDS). These sublayers are responsible for encoding and transmitting RDMA operations—such as memory addresses, remote keys (RKEYs), operation codes, and Continue reading

Expanding a Running Netlab Topology

One of the happy netlab users sent me an interesting challenge:

  • He’s built a large lab and added tons of extra configuration to the lab devices.
  • Afterwards, he realized he’d like to add a few more devices to the lab and was worried about losing all the changes he had made.

Unfortunately, you cannot add new devices to an already-running lab. You must shut down the lab, change the topology description, and start a new lab. However, there are things you can do to preserve the extra work you already did: