Author Archives: Brett Inman

Docker Hub Incident Reviews – April 3rd and 15th 2021

In line with our promise last year to continue publishing incident reviews for Docker Hub, we have two to discuss from April. While many users were unaffected, it is important for us to be transparent with our community, and we hope these reviews are both informative and instructive.

April 3rd 2021

Starting at about 07:30 UTC, a small proportion of registry requests (under 3%) against Docker Hub began failing. Initial investigation pointed to several possible causes, including overloaded internal DNS services and significant, unusual load from several users and IPs. Changes were made to address all of these (scaling, blocking, etc.), and while the issue seemed to resolve for several hours at a time, it kept returning.

The issue recurred intermittently into the next day, at which point the root cause was determined to be under-scaled load balancers handling service discovery and routing for our applications.

In the past, the bottleneck for the load balancing system was network bandwidth on the nodes, and auto scaling rules were thus tied to bandwidth metrics. Over time and across some significant changes to this system, the load balancing application had become more CPU intensive, and thus the current auto scaling setup Continue reading
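The excerpt is cut off above, but the fix it points to is a common one: retarget the auto-scaling rules from the old bottleneck (network bandwidth) to the new one (CPU). As a rough, hypothetical sketch only, here is what that kind of change might look like with an AWS EC2 Auto Scaling target-tracking policy via boto3. The group name, target values, and the choice of AWS as the scaler are assumptions for illustration, not details from the incident review.

```python
# Hypothetical sketch: moving an auto-scaling policy from a bandwidth metric
# to a CPU metric. Names and values are illustrative, not Docker's config.
import boto3

autoscaling = boto3.client("autoscaling")

# Old behaviour (commented out): scale on average network throughput per node,
# which made sense when bandwidth was the bottleneck.
# autoscaling.put_scaling_policy(
#     AutoScalingGroupName="hub-load-balancers",  # hypothetical group name
#     PolicyName="scale-on-bandwidth",
#     PolicyType="TargetTrackingScaling",
#     TargetTrackingConfiguration={
#         "PredefinedMetricSpecification": {
#             "PredefinedMetricType": "ASGAverageNetworkOut"
#         },
#         "TargetValue": 500_000_000.0,  # bytes out per instance (illustrative)
#     },
# )

# New behaviour: scale on average CPU utilisation, which is what the
# load-balancing application is now bound by.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="hub-load-balancers",  # hypothetical group name
    PolicyName="scale-on-cpu",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 60.0,  # keep average CPU around 60%
    },
)
```

The general lesson transfers regardless of tooling: when the resource profile of a service changes, the metric its auto-scaling keys on has to change with it.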

Docker Hub Incident Review – 5 July 2020

Background

This is Docker’s first time publishing an incident report publicly. While we have always written detailed postmortems for incidents internally, as part of the changing culture at Docker we want to be more open externally as well. For example, this year we have started publishing our roadmap publicly and asking our users for their input. You should expect to see us continue publishing reports for most significant incidents.

In publishing these reports, we hope others can learn from the issues we have faced and how we have dealt with them. We hope it builds trust in our services and our teams. We also think this one is pretty interesting due to the complex interaction between multiple services and stakeholders.

Incident Summary

Amazon Linux users in several regions encountered intermittent hanging downloads of Docker images from the Docker Hub registry between roughly July 5 19:00 UTC and July 6 06:30 UTC. The issue stemmed from an anti-botnet protection mechanism our CDN provider Cloudflare had deployed. Teams from Docker, Cloudflare, and AWS worked together to pinpoint the issue and the mechanism in question was disabled, leading to full service restoration.
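To make "hanging downloads" concrete: the failure mode here is a transfer that starts and then stalls mid-stream rather than failing quickly with an error. A minimal, hypothetical probe for that pattern (not Docker's or Cloudflare's actual tooling, and using a placeholder URL rather than a real Hub endpoint) could stream a download with a per-read timeout so a stall surfaces as an error instead of an indefinite hang:

```python
# Minimal sketch of a probe that distinguishes "slow" from "hung": stream an
# HTTP download and fail if no bytes arrive within a read-timeout window.
import requests

BLOB_URL = "https://example.com/path/to/layer-blob"  # placeholder URL

def probe(url: str, read_timeout: float = 30.0) -> int:
    """Return bytes received, raising if the transfer stalls."""
    received = 0
    # (connect timeout, read timeout): the read timeout applies to each
    # individual socket read, so a mid-transfer stall trips it.
    with requests.get(url, stream=True, timeout=(5.0, read_timeout)) as resp:
        resp.raise_for_status()
        for chunk in resp.iter_content(chunk_size=64 * 1024):
            received += len(chunk)
    return received

if __name__ == "__main__":
    try:
        print(f"downloaded {probe(BLOB_URL)} bytes")
    except requests.exceptions.RequestException as exc:
        print(f"transfer failed or stalled: {exc}")
```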

What Happened

At about 01:45 UTC on Monday July 6th Continue reading

Running Docker on Docker Enterprise Edition 2.0

Did you know that Docker Hub has millions of users pulling roughly one billion container images every two weeks — and it all runs on Docker Enterprise Edition?

Docker Enterprise Edition 2.0 may now be available to commercial customers who require an enterprise-ready container platform, but the Docker operations team has already been using it in production for some time. As part of our commitment to delivering high quality software that is ready to support your mission-critical applications, we use Docker Enterprise Edition 2.0 as the platform behind Docker Hub and our other SaaS services, Docker Store and Docker Cloud.

Some organizations call it “dogfooding”; some call it “drinking your own champagne.” Whatever you call it, the point of this program is to be fully invested in our own container platform and to share the same operational experiences as our customers.

Our Migration to Kubernetes

One of the main features of this latest release is the integration of Kubernetes, so we wanted to make sure we were taking full advantage of this capability. Working closely with our SaaS team leads, we chose a few services to migrate to Kubernetes while keeping others on Swarm.

For people already running Docker EE, Continue reading