Google: Addressing Cascading Failures

Like the Spanish Inquisition, nobody expects cascading failures. Here's how Google handles them.
This excerpt is a particularly interesting and comprehensive chapter—Chapter 22 - Addressing Cascading Failures—from Google's awesome book on Site Reliability Engineering. Worth reading if it hasn't been on your radar. And it's free!
"If at first you don't succeed, back off exponentially."
"Why do people always forget that you need to add a little jitter?"
A cascading failure is a failure that grows over time as a result of positive feedback. It can occur when a portion of an overall system fails, increasing the probability that other portions of the system fail. For example, a single replica for a service can fail due to overload, increasing load on remaining replicas and increasing their probability of failing, causing a domino effect that takes down all the replicas for a service.
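The domino effect can be illustrated with a toy model. The sketch below is not from the book. It assumes that load is spread evenly across healthy replicas, that every replica has the same fixed capacity, and that one overloaded replica fails per round. The function name and parameters are hypothetical.

```python
def simulate_cascade(total_load, replicas, capacity_per_replica, initial_failures=1):
    """Return the count of healthy replicas after each redistribution round.

    Toy model (assumptions, not the book's method): load is spread evenly,
    capacity is identical per replica, and one overloaded replica fails
    per round.
    """
    healthy = replicas - initial_failures
    history = [healthy]
    while healthy > 0:
        load_per_replica = total_load / healthy
        if load_per_replica <= capacity_per_replica:
            break  # survivors can absorb the load; the failure does not spread
        healthy -= 1  # another replica is pushed over its capacity and fails
        history.append(healthy)
    return history


# With headroom, losing one of ten replicas is absorbed by the other nine.
print(simulate_cascade(900, 10, 100))  # [9]

# Without headroom, each failure raises per-replica load and the service collapses.
print(simulate_cascade(950, 10, 100))  # [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
```

The only difference between the two runs is the spare capacity left after the first failure. That is the positive feedback described above: each failure increases the load on the survivors, which makes the next failure more likely.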
We’ll use the Shakespeare search service discussed in Shakespeare: A Sample Service as an example throughout this chapter. Its production configuration might look something like Figure 22-1.
Figure 22-1. Example production configuration for the Shakespeare search service