Cloudflare incident on June 20, 2024

On Thursday, June 20, 2024, two independent events caused an increase in latency and error rates for Internet properties and Cloudflare services that lasted 114 minutes. During the 30-minute peak of the impact, we saw that 1.4 - 2.1% of HTTP requests to our CDN received a generic error page, and observed a 3x increase for the 99th percentile Time To First Byte (TTFB) latency.

These events occurred because:

  1. Automated network monitoring detected performance degradation, re-routing traffic suboptimally and causing backbone congestion between 17:33 and 17:50 UTC
  2. A new Distributed Denial-of-Service (DDoS) mitigation mechanism deployed between 14:14 and 17:06 UTC triggered a latent bug in our rate limiting system that allowed a specific form of HTTP request to cause a process handling it to enter an infinite loop between 17:47 and 19:27 UTC

Impact from these events was observed in many Cloudflare data centers around the world.

With respect to the backbone congestion event, we were already working on expanding backbone capacity in the affected data centers, and improving our network mitigations to use more information about the available capacity on alternative network paths when taking action. In the remainder of this blog post, we will go into Continue reading

Looking for a Simple Multihop EBGP Use Case

I plan to add several challenge labs using multihop EBGP sessions to the BGP labs project, including:

  1. Running BGP between VMs and central BGP route servers
  2. Using a multihop EBGP session to send the full Internet routing table to a customer without overloading the PE-router
  3. Running an EBGP EVPN session between loopbacks advertised with an EBGP IPv4 session (🤢)

However, I would love to start with a simple use case to help engineers unfamiliar with BGP realize when they might have to use multihop EBGP sessions. Unfortunately, I can’t find one, and the scenarios where I used multihop EBGP in the past (EBGP load balancing and using a low-end router in the EBGP path, where I was effectively using the reverse application of #2 as a customer) are mostly irrelevant.
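
As background for readers new to the topic: many BGP implementations send EBGP packets with an IP TTL of 1 by default, so a session only comes up when the neighbor is directly connected, and a multihop session is essentially one that raises that TTL. The Python sketch below is a toy model of that rule, not router code. The function names and the hop-counting simplification are assumptions made for illustration, and the model ignores GTSM-style TTL security and platform-specific checks.

```python
def packet_arrives(initial_ttl: int, intermediate_routers: int) -> bool:
    """Return True if a packet sent with initial_ttl reaches the neighbor.

    Each intermediate router decrements the TTL before forwarding and
    drops the packet when the TTL reaches zero.
    """
    ttl = initial_ttl
    for _ in range(intermediate_routers):
        ttl -= 1
        if ttl <= 0:
            return False  # dropped in transit, the TCP session never forms
    return True


def min_ttl_needed(intermediate_routers: int) -> int:
    """Smallest initial TTL that survives the given number of routers."""
    return intermediate_routers + 1


if __name__ == "__main__":
    for routers in range(3):
        print(
            f"routers in path: {routers}, "
            f"TTL 1 arrives: {packet_arrives(1, routers)}, "
            f"minimum TTL: {min_ttl_needed(routers)}"
        )
```

In this model a directly connected neighbor works with the default TTL, while a neighbor behind even one router needs the TTL raised, which is what a multihop setting does.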

Would you have an easy-to-understand use case that is best solved with a multihop EBGP session? Please share it in the comments. Thanks a million!

How to Best Use Panorama Templates and Stacks?

I’ve been working with Palo Alto Firewalls and Panorama for a few years now, yet the best ways to use Templates still seem somewhat mysterious. I bet many of you feel the same way. Since every network is unique, there isn’t one “right” way to manage this. In this blog post, I’ll break down what Templates and Template Stacks are in Panorama and share some effective strategies for organizing them. Let’s dive in.

A Quick Note on Panorama

If you’re new to Panorama, it’s a centralized management tool that simplifies managing multiple Palo Alto firewalls from a single place. Panorama has two key concepts: Device Groups and Templates. Device Groups manage the configurations you’d usually find under the Policies and Objects tabs on the firewall, while Templates manage configurations from the Network and Device tabs.

It’s important to note that Device Groups and Templates serve different purposes and manage different parts of the configurations. This blog post will focus exclusively on Templates. If you need a refresher on Device Groups and Templates, I’ve covered that in a previous post. Feel free to check it out here for a quick recap.

Ruminations About Europe’s “Alice Recoque” Exascale Supercomputer

Designing chips and shepherding them through the foundry and package and assembly is a complex and difficult process, and not having these skills at a national level has profound implications for the competitiveness of those nations.

Ruminations About Europe’s “Alice Recoque” Exascale Supercomputer was written by Timothy Prickett Morgan at The Next Platform.

Kubernetes network policies: 4 pain points and how to address them

Kubernetes is used everywhere, from test environments to the most critical production foundations that we use daily, making it undoubtedly a de facto standard in cloud computing. While this is great news for everyone who works with, administers, and expands Kubernetes, the downside is that it makes Kubernetes a favorable target for malicious actors.

Malicious actors typically exploit flaws in the system to gain access to a portion of the environment. They then chain these flaws together to move laterally within the environment, ultimately seeking root access or access to critical information.

While the best way to fix security flaws in any software is to patch it with appropriate fixes that the project maintainers publish, there are certain security practices that you can adopt to fortify your environment, like using network policies. However, most people find network policies complex and overwhelming, which discourages them from implementing policies in their environment.
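
For a sense of scale, a basic policy is quite small. The sketch below builds a default-deny ingress policy as a Python dictionary and prints it as JSON, which `kubectl apply -f` accepts alongside YAML. The policy name and the `demo` namespace are placeholders chosen for illustration. An empty `podSelector` selects every pod in the namespace, and listing `Ingress` under `policyTypes` without any ingress rules means no inbound traffic is allowed.

```python
import json


def default_deny_ingress(namespace: str) -> dict:
    """Build a NetworkPolicy manifest that blocks all ingress in a namespace."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {
            "name": "default-deny-ingress",  # placeholder name
            "namespace": namespace,
        },
        "spec": {
            "podSelector": {},           # empty selector: all pods in the namespace
            "policyTypes": ["Ingress"],  # no ingress rules listed, so nothing is allowed in
        },
    }


if __name__ == "__main__":
    print(json.dumps(default_deny_ingress("demo"), indent=2))
```

Note that a policy like this only takes effect when the cluster's network plugin enforces network policies.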

In this blog post, we will examine four pain points that people face when they want to implement network policies and provide solutions to help you effectively secure your Kubernetes environment.

What is a network policy and why should I use it?

In Kubernetes, a network policy (KNP) resource is the Continue reading

HS076: Greg’s Finale

This is Greg’s last Heavy Strategy episode before he heads off to retirement. He gives us his final pieces of career and life advice, opinions on private equity, and a Cookie Monster quote. We also briefly introduce John Burke, the new Heavy Strategy co-host. Farewell, Greg. Thank you for all the great debates. Episode Transcript... Read more »

Running BGP Labs in GitHub Codespaces

I love open-source tools (and their GitHub repositories). Someone launches a cool idea, and you can dig through their source code to figure out how it works. It beats reading documentation or fixing AI hallucinations every day of the week ;)

Not too long ago, the containerlab team launched the ability to run containerlab within a free container running on GitHub, and that seemed like a perfect solution to run the BGP labs (Jeroen van Bemmel pointing me in the right direction was another significant step forward).

Why Didn’t We Have Anycast Gateways Before VXLAN?

A while back, I started wondering why it took so long before we started using anycast gateways, and what the reasons might have been for not doing it earlier. I came up with some good reasons, and it started making sense to me. I then asked you all for your thoughts and received a ton of great responses. Here are a few that were mentioned:

  • It was a natural evolution.
  • More powerful devices.
  • We didn’t have overlays.
  • There were no protocols to map what device a MAC sits behind.
  • Reusing the same IP would cause IP conflicts.

These are all certainly true to some degree. I would argue, though, that the main reason we didn’t have it earlier is the topology and protocols we used in traditional LANs. The typical design had three layers: access, distribution, and core. The links between the access and distribution layers were L2 only, and the distribution layer had all the L3 configuration. A typical topology looked like this:

In a topology like this, there are only two devices that host the L3 configuration needed for hosts. When you have two of something, it’s natural to think Continue reading

NB484: HPE, Nvidia Cozy Up on AI Infrastructure; Norway Uncovers Rare Earth Elements Motherlode

Take a Network Break! We start with a longish round of FUs, and then dive into news. HPE announces HPE Private Cloud, which includes software, switches, and GPUs from Nvidia for building out infrastructure to support AI workloads. HPE also announced KVM virtualization for HPE Private Cloud. We debate whether Britain’s anti-trust inquiry to the... Read more »

Tech Bytes: Performance, Power Efficiency Drive Dutch IXP Upgrade to Nokia Routers, Silicon (Sponsored)

Today on the Tech Bytes podcast, sponsored by Nokia, we hear from Nokia customer NL-IX. NL-IX is a leading Internet Exchange based in the Netherlands. The organization recently redesigned and upgraded its network. Nokia played a role in that upgrade, including 400 and 800G routers and FP5 silicon. We’re joined by Dirk Kalkman, Chief Network... Read more »

Helping keep customers safe with leaked password notification

Password reuse is a real problem. When people use the same password across multiple services, it creates a risk that a breach of one service will give attackers access to a different, apparently unrelated, service. Attackers know people reuse passwords and build giant lists of known passwords and known usernames or email addresses.

If you got to the end of that paragraph and realized you’ve reused the same password in multiple places, stop reading and go change those passwords. We’ll wait.

To help protect Cloudflare customers who have used a password attackers know about, we are releasing a feature to improve the security of the Cloudflare dashboard for all our customers by automatically checking whether their Cloudflare user password has appeared in an attacker's list. Cloudflare will securely check a customer’s password against threat intelligence sources that monitor data breaches in other services.

If a customer logs in to Cloudflare with a password that was leaked in a breach elsewhere on the Internet, Cloudflare will alert them and ask them to choose a new password.
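
One common way to run this kind of check without revealing the password is a k-anonymity range query: only a short prefix of the password hash leaves the client, and the comparison against the returned suffixes happens locally. The Python sketch below illustrates that general technique under stated assumptions. It is not a description of Cloudflare's implementation, and `BreachIndex`, `is_leaked`, and the in-memory corpus are hypothetical stand-ins for a real threat intelligence source.

```python
import hashlib

PREFIX_LEN = 5  # number of hex characters revealed to the lookup service


def sha1_hex(password: str) -> str:
    """Uppercase hex SHA-1 digest of the password."""
    return hashlib.sha1(password.encode("utf-8")).hexdigest().upper()


class BreachIndex:
    """Hypothetical stand-in for a breach corpus, keyed by hash prefix."""

    def __init__(self, leaked_passwords):
        self._by_prefix = {}
        for leaked in leaked_passwords:
            digest = sha1_hex(leaked)
            prefix, suffix = digest[:PREFIX_LEN], digest[PREFIX_LEN:]
            self._by_prefix.setdefault(prefix, set()).add(suffix)

    def suffixes_for(self, prefix: str) -> set:
        """Return every known hash suffix sharing this prefix."""
        return self._by_prefix.get(prefix, set())


def is_leaked(password: str, index: BreachIndex) -> bool:
    """Check a password while exposing only the hash prefix to the index."""
    digest = sha1_hex(password)
    prefix, suffix = digest[:PREFIX_LEN], digest[PREFIX_LEN:]
    return suffix in index.suffixes_for(prefix)


if __name__ == "__main__":
    index = BreachIndex(["password123", "letmein"])
    for candidate in ("letmein", "correct horse battery staple"):
        print(candidate, "->", "leaked" if is_leaked(candidate, index) else "not found")
```

The index only ever sees the five-character prefix, which many unrelated passwords share, so it cannot tell which password was being checked.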

For some customers, the news that their password was known to hackers will come as a surprise – no one wants to intentionally use passwords that Continue reading

MikroTik ROSv7 Cheat Sheets – OSPFv2/v3 standard area LSA propagation.

When learning how to use OSPF with MikroTik, it can sometimes be difficult to understand how the different LSA types flow between areas.

MikroTik’s OSPF documentation briefly covers the LSAs for OSPFv2 but doesn’t list the OSPFv3 LSAs yet.

To better illustrate how the LSAs work, I created these graphical overviews for OSPFv2 and OSPFv3. When troubleshooting OSPF, it’s very helpful to understand which LSAs you should see in an area and how IPv4 and IPv6 differ.
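
As a text companion to the diagrams, the Python sketch below lists the LSA types you would expect in a standard (non-stub, non-NSSA) area together with their flooding scope. The tables are drawn from the OSPF RFCs (2328 for OSPFv2, 5340 for OSPFv3), not from the cheat sheets themselves, and the `Lsa` class and helper name are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lsa:
    code: str   # LSA type as it appears in the protocol
    name: str
    scope: str  # flooding scope: "link", "area", or "as"


# Standard-area LSAs only, so the NSSA Type 7 LSA is left out.
OSPFV2 = [
    Lsa("1", "Router", "area"),
    Lsa("2", "Network", "area"),
    Lsa("3", "Summary (network)", "area"),
    Lsa("4", "Summary (ASBR)", "area"),
    Lsa("5", "AS-External", "as"),
]

OSPFV3 = [
    Lsa("0x2001", "Router", "area"),
    Lsa("0x2002", "Network", "area"),
    Lsa("0x2003", "Inter-Area-Prefix", "area"),
    Lsa("0x2004", "Inter-Area-Router", "area"),
    Lsa("0x4005", "AS-External", "as"),
    Lsa("0x0008", "Link", "link"),
    Lsa("0x2009", "Intra-Area-Prefix", "area"),
]


def lsas_with_scope(table, scope):
    """Return the LSAs in a table that flood with the given scope."""
    return [lsa for lsa in table if lsa.scope == scope]


if __name__ == "__main__":
    for label, table in (("OSPFv2", OSPFV2), ("OSPFv3", OSPFV3)):
        print(label)
        for lsa in table:
            print(f"  {lsa.code:<7} {lsa.name:<20} scope={lsa.scope}")
```

The main difference to watch for when troubleshooting is that OSPFv3 moves prefix information out of the Router and Network LSAs into the Intra-Area-Prefix LSA and adds a Link LSA that never leaves the local link.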

Hope you find these helpful!

OSPFv2

PDF: https://stubarea51.net/wp-content/uploads/2024/06/ROSv7-OSPF-Fundamentals-SA51-OSPFv2-Default-Area.pdf

OSPFv3

PDF: https://stubarea51.net/wp-content/uploads/2024/06/ROSv7-OSPF-Fundamentals-SA51-OSPFv3-Default-Area.pdf