Docker Hub Incident Reviews – April 3rd and 15th 2021
In line with our promise last year to continue publishing incident reviews for Docker Hub, we have two to discuss from April. While many users were unaffected, it is important for us to be transparent with our community, and we hope these reviews are both informative and instructive.
April 3rd 2021
Starting at about 07:30 UTC, a small proportion of registry requests (under 3%) against Docker Hub began failing. Initial investigation pointed towards several causes, including overloaded internal DNS services and significant, unusual load from several users and IPs. Changes were made to address all of these (scaling, blocking, etc.), and while the issue seemed to resolve for several hours at a time, it kept coming back.

The issue recurred intermittently into the next day, at which point the actual root cause was determined to be under-scaled load balancers doing service discovery and routing for our applications.
In the past, the bottleneck for the load balancing system was network bandwidth on the nodes, and auto scaling rules were thus tied to bandwidth metrics. Over time and across some significant changes to this system, the load balancing application had become more CPU intensive, and thus the current auto scaling setup ...
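To illustrate the bandwidth-versus-CPU point, here is a minimal sketch of how a scaling rule tied to a single metric can miss a bottleneck that has moved to another resource. It is not Docker's actual auto scaling configuration: the `NodeMetrics` type, the function names, and the 50% utilization targets are all hypothetical and chosen only for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class NodeMetrics:
    """Average utilization across the load balancer fleet (hypothetical shape)."""
    bandwidth_utilization: float  # fraction of network capacity in use, 0.0 to 1.0
    cpu_utilization: float        # fraction of CPU capacity in use, 0.0 to 1.0


def _scaled_count(current_nodes: int, ratio: float) -> int:
    # Round before taking the ceiling so floating-point noise does not add a node.
    return max(1, math.ceil(round(current_nodes * ratio, 6)))


def desired_nodes_bandwidth_only(current_nodes: int, metrics: NodeMetrics,
                                 bandwidth_target: float = 0.5) -> int:
    """Scale on bandwidth alone (assumed target of 50% utilization)."""
    return _scaled_count(current_nodes,
                         metrics.bandwidth_utilization / bandwidth_target)


def desired_nodes_combined(current_nodes: int, metrics: NodeMetrics,
                           bandwidth_target: float = 0.5,
                           cpu_target: float = 0.5) -> int:
    """Scale on whichever resource is furthest above its assumed target."""
    bandwidth_ratio = metrics.bandwidth_utilization / bandwidth_target
    cpu_ratio = metrics.cpu_utilization / cpu_target
    return _scaled_count(current_nodes, max(bandwidth_ratio, cpu_ratio))


if __name__ == "__main__":
    # A CPU-bound fleet: plenty of spare bandwidth, but CPUs are running hot.
    metrics = NodeMetrics(bandwidth_utilization=0.25, cpu_utilization=0.75)
    print("bandwidth-only rule:", desired_nodes_bandwidth_only(10, metrics))
    print("combined rule:      ", desired_nodes_combined(10, metrics))
```

With 10 nodes at 25% bandwidth and 75% CPU, the bandwidth-only rule in this sketch asks for 5 nodes, while the combined rule asks for 15. A fleet that has become CPU bound can look over-provisioned to a rule that only watches bandwidth.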