PIPEFAIL: How a missing shell option slowed Cloudflare down


At Cloudflare, we’re used to being the fastest in the world. However, for approximately 30 minutes last December, Cloudflare was slow. Between 20:10 and 20:40 UTC on December 16, web requests served by Cloudflare were artificially delayed by up to five seconds before being processed. This post tells the story of how a missing shell option called “pipefail” slowed Cloudflare down.
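For readers who haven't met the option before, here is a minimal sketch of what bash's pipefail changes. It is a generic illustration, not the script involved in the incident, and the variable names status_without and status_with are ours.

```shell
#!/usr/bin/env bash
# Illustrative only: a generic demonstration of pipefail,
# not the script involved in the incident.

# Without pipefail, a pipeline's exit status is that of its last command,
# so the failure of `false` is masked by the success of `cat`.
set +o pipefail
false | cat
status_without=$?

# With pipefail, the pipeline's exit status is that of the rightmost
# command that exited non-zero, so the failure of `false` is visible.
set -o pipefail
false | cat
status_with=$?
set +o pipefail

echo "without pipefail: $status_without"   # prints "without pipefail: 0"
echo "with pipefail: $status_with"         # prints "with pipefail: 1"
```

A script that checks the exit status of a pipeline sees the first result as a success, even though a command in the pipeline failed.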
Background
Before we can tell this story, we need to introduce you to some of its characters.

Cloudflare’s Front Line protects millions of users from some of the largest attacks ever recorded. This protection is orchestrated by a sidecar service called dosd, which analyzes traffic and looks for attacks. When dosd detects an attack, it provides Front Line with a list of attack fingerprints that describe how Front Line can match and block the attack traffic.

Instances of dosd run on every Cloudflare server, and they communicate with each other using a peer-to-peer mesh to identify malicious traffic patterns. This decentralized design allows dosd to perform analysis with much higher fidelity than is possible with a centralized system, but its scale also imposes some strict performance requirements. To meet these requirements, we need to provide dosd with very