Millions of tiny databases, Brooker et al., NSDI’20
This paper is a real joy to read. It takes you through the thinking processes and engineering practices behind the design of a key part of the control plane for AWS Elastic Block Storage (EBS): the Physalia database that stores configuration information.
In the same spirit as Paxos Made Live, this paper describes the details, choices and tradeoffs that are required to put a consensus system into production.
The core algorithms (chain replication, Paxos-based consensus) aren’t the stars of the show here; instead, the paper focuses on how these algorithms are deployed, and on the software engineering practices behind building a mission-critical production system that employs them.
A guiding principle
Engineering decisions involve making lots of trade-offs. If you want to emerge with a coherent design, then it’s well worth spending some time thinking about the principle(s) by which you’re going to make them. For Physalia, and for AWS more generally, the guiding principle is to minimise the blast radius.
Over the decade since [the introduction of Availability Zones], our thinking on failure and availability has continued to evolve, and we paid increasing attention to blast radius and correlation of failure.