Analysis of Today’s CenturyLink/Level(3) Outage

Today CenturyLink/Level(3), a major ISP and Internet bandwidth provider, experienced a significant outage that impacted some of Cloudflare’s customers as well as a significant number of other services and providers across the Internet. While we’re waiting for a post mortem from CenturyLink/Level(3), I wanted to write up the timeline of what we saw, how Cloudflare’s systems routed around the problem, why some of our customers were still impacted in spite of our mitigations, and what appears to be the likely root cause of the issue.
Increase In Errors
At 10:03 UTC our monitoring systems started to observe an increased number of errors reaching our customers’ origin servers. These show up as “522 Errors” and indicate that there is an issue connecting from Cloudflare’s network to wherever our customers’ applications are hosted.
Cloudflare is connected to CenturyLink/Level(3) among a large and diverse set of network providers. When we see an increase in errors from one network provider, our systems automatically attempt to reach customers’ applications across alternative providers. Given the number of providers we have access to, we are generally able to continue to route traffic even when one provider has an issue.



