For decades, distributed databases have been built around the assumption that storage lives close to compute.
The farther data travels over the network, the reasoning goes, the greater the potential for delay. Local RAID (redundant array of independent disks) arrays, network-attached storage (NAS), and cluster file systems keep data close, making it quick and easy to access.
But in a distributed system, keeping the entire data store close to compute makes scaling slow, cumbersome, and expensive. Each time a node or cluster is replicated, its associated data must be replicated as well.
It isn’t ideal, but until recently, there wasn’t any reasonable alternative. Databases had to scale. Service-level agreements (SLAs) had to be met. Wide-area networks weren’t reliable enough to support high-performance databases at scale. Database designers accordingly spent a great deal of energy solving problems related to coordination, consistency, and replication logic.
But imagine things were different. What if they didn’t have to worry about the network, where their data lived, or how to get it from Point A to Point B? How would they design a database then?
That’s the intriguing question raised by the advent of cloud object storage services like AWS S3 and Google Cloud Storage.
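To make that question concrete, the sketch below shows what treating an object store as the primary data store can look like at the API level: compute nodes read and write whole, durable objects by key instead of seeking into local disks. This is an illustrative assumption, not the design of any particular database; the bucket name, key layout, and the use of Python's boto3 S3 client are all hypothetical.

```python
# Illustrative sketch: a compute node persists and fetches immutable data
# segments through an object-store API instead of a local filesystem.
# Bucket and key names below are hypothetical.
import boto3

s3 = boto3.client("s3")  # assumes AWS credentials are already configured

# Write a data segment (for example, an immutable, sorted run of rows)
# as a single whole object.
s3.put_object(
    Bucket="example-db-segments",
    Key="tables/users/segment-000001.dat",
    Body=b"...serialized segment bytes...",
)

# Any compute node, not only the one that wrote it, can later fetch the
# same object by key; the storage service handles durability and replication.
response = s3.get_object(
    Bucket="example-db-segments",
    Key="tables/users/segment-000001.dat",
)
segment_bytes = response["Body"].read()
```

Because every node sees the same durable objects, cloning compute no longer means copying data, which is exactly the coupling described above that makes scaling slow, cumbersome, and expensive.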