Docker Hub Incident Review – 5 July 2020
Background
This is Docker’s first time publishing an incident report publicly. While we have always done detailed post mortems on incidents internally, as part of the changing culture at Docker, we want to be more open externally as well. For example, this year we have started publishing our roadmap publicly and asking our users for their input. You should expect to see us continue publishing reports for most significant incidents.
In publishing these reports, we hope others can learn from the issues we have faced and how we have dealt with them. We hope it builds trust in our services and our teams. We also think this one is pretty interesting due to the complex interaction between multiple services and stakeholders.
Incident Summary
Amazon Linux users in several regions encountered intermittent hanging downloads of Docker images from the Docker Hub registry between roughly July 5 19:00 UTC and July 6 06:30 UTC. The issue stemmed from an anti-botnet protection mechanism our CDN provider Cloudflare had deployed. Teams from Docker, Cloudflare, and AWS worked together to pinpoint the issue and the mechanism in question was disabled, leading to full service restoration.
What Happened
At about 01:45 UTC on Monday July 6th Continue reading