Archive

Category Archives for "CloudFlare"

Cloudflare’s perspective of the October 30 OVHcloud outage

On October 30, 2024, cloud hosting provider OVHcloud (AS16276) suffered a brief but significant outage. According to their incident report, the problem started at 13:23 UTC, and was described simply as “An incident is in progress on our backbone infrastructure.” OVHcloud noted that the incident ended 17 minutes later, at 13:40 UTC. As a major global cloud hosting provider, some customers use OVHcloud as an origin for sites delivered by Cloudflare — if a given content asset is not in our cache for a customer’s site, we retrieve the asset from OVHcloud.

We observed traffic starting to drop at 13:21 UTC, just ahead of the reported start time. By 13:28 UTC, it was approximately 95% lower than pre-incident levels. Recovery appeared to start at 13:31 UTC, and by 13:40 UTC, the reported end time of the incident, it had reached approximately 50% of pre-incident levels.

Traffic from OVHcloud (AS16276) to Cloudflare

Cloudflare generally exchanges most of our traffic with OVHcloud over peering links. However, as shown below, peered traffic volume during the incident fell significantly. It appears that some small amount of traffic briefly began to flow over transit links from Cloudflare to OVHcloud due to sudden Continue reading

Migrating billions of records: moving our active DNS database while it’s in use

According to a survey done by W3Techs, as of October 2024, Cloudflare is used as an authoritative DNS provider by 14.5% of all websites. As an authoritative DNS provider, we are responsible for managing and serving all the DNS records for our clients’ domains. This means we have an enormous responsibility to provide the best service possible, starting at the data plane. As such, we are constantly investing in our infrastructure to ensure the reliability and performance of our systems.

DNS is often referred to as the phone book of the Internet, and is a key component of the Internet. If you have ever used a phone book, you know that they can become extremely large depending on the size of the physical area it covers. A zone file in DNS is no different from a phone book. It has a list of records that provide details about a domain, usually including critical information like what IP address(es) each hostname is associated with. For example:

example.com      59 IN A 198.51.100.0
blog.example.com 59 IN A 198.51.100.1
ask.example.com  59 IN A 198.51.100.2

It is not unusual Continue reading

Forced offline: the Q3 2024 Internet disruption summary

Cloudflare’s network spans more than 330 cities in over 120 countries, where we interconnect with over 13,000 network providers in order to provide a broad range of services to millions of customers. The breadth of both our network and our customer base provides us with a unique perspective on Internet resilience, enabling us to observe the impact of Internet disruptions. Thanks to Cloudflare Radar functionality released earlier this year, we can explore the impact from a routing perspective, as well as a traffic perspective, at both a network and location level.

As we have noted in the past, this post is intended as a summary overview of observed and confirmed disruptions, and is not an exhaustive or complete list of issues that have occurred during the quarter. 

A larger list of detected traffic anomalies is available in the Cloudflare Radar Outage Center.

Having said that, the third quarter of 2024 was particularly active, with quite a few significant Internet disruptions. Unfortunately, governments continued to impose nationwide Internet shutdowns intended to prevent cheating on exams. Damage to both terrestrial and submarine cables impacted Internet connectivity across Africa and in other parts of the world. Damage caused by an active hurricane Continue reading

Elephants in tunnels: how Hyperdrive connects to databases inside your VPC networks

With September’s announcement of Hyperdrive’s ability to send database traffic from Workers over Cloudflare Tunnels, we wanted to dive into the details of what it took to make this happen.

Hyper-who?

Accessing your data from anywhere in Region Earth can be hard. Traditional databases are powerful, familiar, and feature-rich, but your users can be thousands of miles away from your database. This can cause slower connection startup times, slower queries, and connection exhaustion as everything takes longer to accomplish.

Cloudflare Workers is an incredibly lightweight runtime, which enables our customers to deploy their applications globally by default and renders the cold start problem almost irrelevant. The trade-off for these light, ephemeral execution contexts is the lack of persistence for things like database connections. Database connections are also notoriously expensive to spin up, with many round trips required between client and server before any query or result bytes can be exchanged.

Hyperdrive is designed to make the centralized databases you already have feel like they’re global while keeping connections to those databases hot. We use our global network to get faster routes to your database, keep connection pools primed, and cache your most frequently run queries as close to users Continue reading

Durable Objects aren’t just durable, they’re fast: a 10x speedup for Cloudflare Queues

Cloudflare Queues let a developer decouple their Workers into event-driven services. Producer Workers write events to a Queue, and consumer Workers are invoked to take actions on the events. For example, you can use a Queue to decouple an e-commerce website from a service which sends purchase confirmation emails to users. During 2024’s Birthday Week, we announced that Cloudflare Queues is now Generally Available, with significant performance improvements that enable larger workloads. To accomplish this, we switched to a new architecture for Queues that enabled the following improvements:

  • Median latency for sending messages has dropped from ~200ms to ~60ms

  • Maximum throughput for each Queue has increased over 10x, from 400 to 5000 messages per second

  • Maximum Consumer concurrency for each Queue has increased from 20 to 250 concurrent invocations

Median latency drops from ~200ms to ~60ms as Queues are migrated to the new architecture

In this blog post, we'll share details about how we built Queues using Durable Objects and the Cloudflare Developer Platform, and how we migrated from an initial Beta architecture to a geographically-distributed, horizontally-scalable architecture for General Availability.

v1 Beta architecture

When initially designing Cloudflare Queues, we decided to build something simple that we could get Continue reading

Billions and billions (of logs): scaling AI Gateway with the Cloudflare Developer Platform

With the rapid advancements occurring in the AI space, developers face significant challenges in keeping up with the ever-changing landscape. New models and providers are continuously emerging, and understandably, developers want to experiment and test these options to find the best fit for their use cases. This creates the need for a streamlined approach to managing multiple models and providers, as well as a centralized platform to efficiently monitor usage, implement controls, and gather data for optimization.

AI Gateway is specifically designed to address these pain points. Since its launch in September 2023, AI Gateway has empowered developers and organizations by successfully proxying over 2 billion requests in just one year, as we highlighted during September’s Birthday Week. With AI Gateway, developers can easily store, analyze, and optimize their AI inference requests and responses in real time.

With our initial architecture, AI Gateway faced a significant challenge: the logs, those critical trails of data interactions between applications and AI models, could only be retained for 30 minutes. This limitation was not just a minor inconvenience; it posed a substantial barrier for developers and businesses needing to analyze long-term patterns, ensure compliance, or simply debug over more extended periods.

In Continue reading

Build durable applications on Cloudflare Workers: you write the Workflows, we take care of the rest

Workflows, Cloudflare’s durable execution engine that allows you to build reliable, repeatable multi-step applications that scale for you, is now in open beta. Any developer with a free or paid Workers plan can build and deploy a Workflow right now: no waitlist, no sign-up form, no fake line around-the-block.

If you learn by doing, you can create your first Workflow via a single command (or visit the docs for the full guide):

npm create cloudflare@latest workflows-starter -- \
  --template "cloudflare/workflows-starter"

Open the src/index.ts file, poke around, start extending it, and deploy it with a quick wrangler deploy.

If you want to learn more about how Workflows works, how you can use it to build applications, and how we built it, read on.

Workflows? Durable Execution?

Workflows—which we announced back during Developer Week earlier this year—is our take on the concept of “Durable Execution”: the ability to build and execute applications that are durable in the face of errors, network issues, upstream API outages, rate limits, and (most importantly) infrastructure failure.

As over 2.4 million developers continue to build applications on top of Cloudflare Workers, R2, and Workers AI, we’ve noticed more developers building multi-step applications and workflows Continue reading

Introducing Access for Infrastructure: SSH

BastionZero joined Cloudflare in May 2024. We are thrilled to announce Access for Infrastructure as BastionZero’s native integration into our SASE platform, Cloudflare One. Access for Infrastructure will enable organizations to apply Zero Trust controls in front of their servers, databases, network devices, Kubernetes clusters, and more. Today, we’re announcing short-lived SSH access as the first available feature. Over the coming months we will announce support for other popular infrastructure access target types like Remote Desktop Protocol (RDP), Kubernetes, and databases.

Applying Zero Trust principles to infrastructure

Organizations have embraced Zero Trust initiatives that modernize secure access to web applications and networks, but often the strategies they use to manage privileged access to their infrastructure can be siloed, overcomplicated, or ineffective. When we speak to customers about their infrastructure access solution, we see common themes and pain points:

  • Too risky: Long-lived credentials and shared keys get passed around and inflate the risk of compromise, excessive permissions, and lateral movement

  • Too clunky: Manual credential rotations and poor visibility into infrastructure access slow down incident response and compliance efforts

Some organizations have dealt with the problem of privileged access to their infrastructure by purchasing a Privileged Access Management (PAM) solution Continue reading

4.2 Tbps of bad packets and a whole lot more: Cloudflare’s Q3 DDoS report

Welcome to the 19th edition of the Cloudflare DDoS Threat Report. Released quarterly, these reports provide an in-depth analysis of the DDoS threat landscape as observed across the Cloudflare network. This edition focuses on the third quarter of 2024.

With a 296 Terabit per second (Tbps) network located in over 330 cities worldwide, Cloudflare is used as a reverse proxy by nearly 20% of all websites. Cloudflare holds a unique vantage point to provide valuable insights and trends to the broader Internet community.

Key insights 

  • The number of DDoS attacks spiked in the third quarter of 2024. Cloudflare mitigated nearly 6 million DDoS attacks, representing a 49% increase QoQ and 55% increase YoY.

  • Out of those 6 million, Cloudflare’s autonomous DDoS defense systems detected and mitigated over 200 hyper-volumetric DDoS attacks exceeding rates of 3 terabits per second (Tbps) and 2 billion packets per second (Bpps). The largest attack peaked at 4.2 Tbps and lasted just a minute.

  • The Banking & Financial Services industry was subjected to the most DDoS attacks. China was the country most targeted by DDoS attacks, and Indonesia was the largest source of DDoS attacks.

To learn more about DDoS attacks and other types Continue reading

Training a million models per day to save customers of all sizes from DDoS attacks

Our always-on DDoS protection runs inside every server across our global network.  It constantly analyzes incoming traffic, looking for signals associated with previously identified DDoS attacks. We dynamically create fingerprints to flag malicious traffic, which is dropped when detected in high enough volume — so it never reaches its destination — keeping customer websites online.

In many cases, flagging bad traffic can be straightforward. For example, if we see too many requests to a destination with the same protocol violation, we can be fairly sure this is an automated script, rather than a surge of requests from a legitimate web browser.

Our DDoS systems are great at detecting attacks, but there’s a minor catch. Much like the human immune system, they are great at spotting attacks similar to things they have seen before. But for new and novel threats, they need a little help knowing what to look for, which is an expensive and time-consuming human endeavor.

Cloudflare protects millions of Internet properties, and we serve over 60 million HTTP requests per second on average, so trying to find unmitigated attacks in such a huge volume of traffic is a daunting task. In order to protect the smallest of companies, Continue reading

Is this thing on? Using OpenBMC and ACPI power states for reliable server boot

Introduction

At Cloudflare, we provide a range of services through our global network of servers, located in 330 cities worldwide. When you interact with our long-standing application services, or newer services like Workers AI, you’re in contact with one of our fleet of thousands of servers which support those services.

These servers which provide Cloudflare services are managed by a Baseboard Management Controller (BMC). The BMC is a special purpose processor  — different from the Central Processing Unit (CPU) of a server — whose sole purpose is ensuring a smooth operation of the server.

Regardless of the server vendor, each server has this BMC. The BMC runs independently of the CPU and has its own embedded operating system, usually referred to as firmware. At Cloudflare, we customize and deploy a server-specific version of the BMC firmware. The BMC firmware we deploy at Cloudflare is based on the Linux Foundation Project for BMCs, OpenBMC. OpenBMC is an open-sourced firmware stack designed to work across a variety of systems including enterprise, telco, and cloud-scale data centers. The open-source nature of OpenBMC gives us greater flexibility and ownership of this critical server subsystem, instead of the closed nature of proprietary firmware. Continue reading

Building Vectorize, a distributed vector database, on Cloudflare’s Developer Platform

Vectorize is a globally distributed vector database that enables you to build full-stack, AI-powered applications with Cloudflare Workers. Vectorize makes querying embeddings — representations of values or objects like text, images, audio that are designed to be consumed by machine learning models and semantic search algorithms — faster, easier and more affordable.

In this post, we dive deep into how we built Vectorize on Cloudflare’s Developer Platform, leveraging Cloudflare’s global network, Cache, Workers, R2, Queues, Durable Objects, and container platform.

What is a vector database?

A vector database is a queryable store of vectors. A vector is a large array of numbers called vector dimensions.

A vector database has a similarity search query: given an input vector, it returns the vectors that are closest according to a specified metric, potentially filtered on their metadata.

Vector databases are used to power semantic search, document classification, and recommendation and anomaly detection, as well as contextualizing answers generated by LLMs (Retrieval Augmented Generation, RAG).

Why do vectors require special database support?

Conventional data structures like B-trees, or binary search trees expect the data they index to be cheap to compare and to follow Continue reading

The story of web framework Hono, from the creator of Hono

Hono is a fast, lightweight web framework that runs anywhere JavaScript does, built with Web Standards. Of course, it runs on Cloudflare Workers.

It was three years ago, in December 2021. At that time, I wanted to create applications for Cloudflare Workers, but the code became verbose without using a framework, and couldn't find a framework that suited my needs. Itty-router was very nice but too simple. Worktop and Sunder did the same things I wanted to do, but their APIs weren't quite to my liking. I was also interested in creating a router — a program that determines which action is executed based on the HTTP method and URL path of the Request — made of a Trie tree structure because it’s fast. So, I started building a web framework with a Trie tree-based router.

 “While trying to create my applications, I ended up creating my framework for them.” — a classic example of yak shaving. However, Hono is now used by many developers, including Cloudflare, which uses Hono in core products. So, this journey into the depths of yak shaving was ultimately meaningful.

Write once, run anywhere

Hono truly runs anywhere — not just on Cloudflare Continue reading

Analysis of the EPYC 145% performance gain in Cloudflare Gen 12 servers

Cloudflare's network spans more than 330 cities in over 120 countries, serving over 60 million HTTP requests per second and 39 million DNS queries per second on average. These numbers will continue to grow, and at an accelerating pace, as will Cloudflare’s infrastructure to support them. While we can continue to scale out by deploying more servers, it is also paramount for us to develop and deploy more performant and more efficient servers.

At the heart of each server is the processor (central processing unit, or CPU). Even though many aspects of a server rack can be redesigned to improve the cost to serve a request, CPU remains the biggest lever, as it is typically the primary compute resource in a server, and the primary enabler of new technologies.

Cloudflare’s 12th Generation server with AMD EPYC 9684-X (codenamed Genoa-X) is 145% more performant and 63% more efficient. These are big numbers, but where do the performance gains come from? Cloudflare’s hardware system engineering team did a sensitivity analysis on three variants of 4th generation AMD EPYC processor to understand the contributing factors.

For the 4th generation AMD EPYC Processors, AMD offers three architectural variants: 

  1. mainstream classic Zen 4 cores, codenamed Continue reading

Protect against identity-based attacks by sharing Cloudflare user risk scores with Okta

Cloudflare One, our secure access service edge (SASE) platform, is introducing a new integration with Okta, the identity and access management (IAM) vendor, to share risk indicators in real-time and simplify how organizations can dynamically manage their security posture in response to changes across their environments.

For many organizations, it is becoming increasingly challenging and inefficient to adapt to risks across their growing attack surface. In particular, security teams struggle with multiple siloed tools that fail to share risk data effectively with each other, leading to excessive manual effort to extract signals from the noise. To address this complexity, Cloudflare launched risk posture management capabilities earlier this year to make it easier for organizations to accomplish three key jobs on one platform:

  1. Evaluating risk posed by people by using first-party user entity and behavior analytics (UEBA) models

  2. Exchanging risk telemetry with best-in-class security tools, and

  3. Enforcing risk controls based on those dynamic first- and third-party risk scores.

Today’s announcement builds on these capabilities (particularly job #2) and our partnership with Okta by enabling organizations to share Cloudflare’s real-time user risk scores with Okta, which can then automatically enforce policies based on that user’s risk. In this way, organizations can adapt Continue reading

CJ Desai: Why I joined Cloudflare as President of Product and Engineering

I am thrilled to embark on this journey to run Product and Engineering at Cloudflare, driving forward the mission of helping build a better Internet. 

A little about me

While I was a graduate student at University of Illinois, the university introduced the Mosaic web browser to students. In addition to being super easy to install and use, it displayed pictures next to text for the first time. This may not seem impressive today, but back then it felt like a magical step forward.

This simple but powerful upgrade opened up the once niche user base from academics to the masses, transforming the world wide web to become an Internet phenomenon. Since then, I’ve always sought to be part of teams that worked on transformational technologies, including Software-as-a-Service, cloud computing, and AI. Innovation is the life blood of every technology company. To this day, I’m inspired by building products and technology that get adopted at mass scale.

Why Cloudflare

The world is in a very interesting moment for technological innovation: the AI landscape is uncharted and developing at an exponential rate; the urgency for enterprises to reduce tech debt and reliance on legacy applications is at an all Continue reading

What’s new in Cloudflare One: Digital Experience (DEX) monitoring notifications and seamless access to Cloudflare Gateway with China Express

At Cloudflare, we are constantly innovating and launching new features and capabilities across our product portfolio. We are introducing roundup blog posts to ensure that you never miss the latest updates across our platform. In this post, we are excited to share two new ways that our customers can continue to keep their web properties performant and secure with Cloudflare One: new Digital Experience Monitoring (DEX) notifications help proactively identify issues that can affect the end-user digital experience, and integration with China Express enables secure access to China-hosted sites for Cloudflare Gateway customers.   

Using DEX Notifications for proactive monitoring with Cloudflare Zero Trust

As with other notification types, DEX notifications can be configured and reviewed from Cloudflare dashboard notifications.

What problem does it solve?

DEX notifications address the challenge of proactively identifying issues affecting the digital experience of your end users. By monitoring device health and conducting synthetic tests from WARP clients deployed on your fleet's end-user devices, DEX provides valuable insights. These notifications empower IT administrators to quickly identify and address connectivity and application performance problems before they impact a wide range of users.

By proactively notifying administrators when problems arise, DEX helps minimize user disruption and provides Continue reading

CJ Desai: Why I joined Cloudflare as President of Product and Engineering

I am thrilled to embark on this journey to run Product and Engineering at Cloudflare, driving forward the mission of helping build a better Internet. 

A little about me

While I was a graduate student at University of Illinois, the university introduced the Mosaic web browser to students. In addition to being super easy to install and use, it displayed pictures next to text for the first time. This may not seem impressive today, but back then it felt like a magical step forward.

This simple but powerful upgrade opened up the once niche user base from academics to the masses, transforming the world wide web to become an Internet phenomenon. Since then, I’ve always sought to be part of teams that worked on transformational technologies, including Software-as-a-Service, cloud computing, and AI. Innovation is the life blood of every technology company. To this day, I’m inspired by building products and technology that get adopted at mass scale.

Why Cloudflare

The world is in a very interesting moment for technological innovation: the AI landscape is uncharted and developing at an exponential rate; the urgency for enterprises to reduce tech debt and reliance on legacy applications is at an all time Continue reading

What’s new in Cloudflare One: Digital Experience (DEX) monitoring notifications and seamless access to Cloudflare Gateway with China Express

At Cloudflare, we are constantly innovating and launching new features and capabilities across our product portfolio. We are introducing roundup blog posts to ensure that you never miss the latest updates across our platform. In this post, we are excited to share two new ways that our customers can continue to keep their web properties performant and secure with Cloudflare One: new Digital Experience Monitoring (DEX) notifications help proactively identify issues that can affect the end-user digital experience, and integration with China Express enables secure access to China-hosted sites for Cloudflare Gateway customers.   

Using DEX Notifications for proactive monitoring with Cloudflare Zero Trust

Digital Experience Monitoring (DEX) offers device, application, and network performance monitoring, providing IT administrators with insights to quickly identify and resolve issues. With DEX notifications , account administrators can create configurable alert rules based on available algorithms (z-score, SLO) and existing DEX filters. When notification criteria are satisfied, customers are notified via email, Pagerduty, or Webhooks

As with other notification types, DEX notifications can be configured and reviewed from Cloudflare dashboard notifications.

What problem does it solve?

DEX notifications address the challenge of proactively identifying issues affecting the digital experience of your end users. Continue reading

Improving platform resilience at Cloudflare through automation

Failure is an expected state in production systems, and no predictable failure of either software or hardware components should result in a negative experience for users. The exact failure mode may vary, but certain remediation steps must be taken after detection. A common example is when an error occurs on a server, rendering it unfit for production workloads, and requiring action to recover.

When operating at Cloudflare’s scale, it is important to ensure that our platform is able to recover from faults seamlessly. It can be tempting to rely on the expertise of world-class engineers to remediate these faults, but this would be manual, repetitive, unlikely to produce enduring value, and not scaling. In one word: toil; not a viable solution at our scale and rate of growth.

In this post we discuss how we built the foundations to enable a more scalable future, and what problems it has immediately allowed us to solve.

Growing pains

The Cloudflare Site Reliability Engineering (SRE) team builds and manages the platform that helps product teams deliver our extensive suite of offerings to customers. One important component of this platform is the collection of servers that power critical products such as Durable Objects, Workers, Continue reading