Archive

Category Archives for "Russ White"

Random Thoughts on Grey Failures and Scale

I have used the example of adding parallel paths until the control plane converges more slowly, which increases the Mean Time to Repair (MTTR), to show that too much redundancy can actually reduce overall network availability. Many engineers I’ve talked to balk at this idea, because it seems hard to believe that adding another link could affect routing protocol convergence in this way. I ran across a paper a while back that provides a different kind of example of the trade-off around redundancy in a network, but I never got around to reading the entire paper and figuring out how it fits in.
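To make the link between repair time and availability concrete, here is a minimal sketch using the standard steady-state formula, availability = MTBF / (MTBF + MTTR), where MTBF is the mean time between failures and MTTR is the mean time to repair. The function name and every number below are hypothetical, chosen only for illustration; they are not measurements from any real network.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the network is usable."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical figures, for illustration only.
baseline = availability(mtbf_hours=1000.0, mttr_hours=1.0)

# Assumed effect of extra paths: failures become somewhat rarer (higher MTBF),
# but slower convergence doubles the time to repair (higher MTTR).
more_paths = availability(mtbf_hours=1100.0, mttr_hours=2.0)

print(f"baseline:   {baseline:.5f}")
print(f"more paths: {more_paths:.5f}")
```

With these assumed numbers, the gain in MTBF does not make up for the longer MTTR, so the second figure comes out lower than the first. Whether a real network behaves this way depends on how much each added path actually slows convergence.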

In Gray Failure: The Achilles’ Heel of Cloud-Scale Systems, the authors argue that one of the main problems in building a cloud system is grey failures: cases where a router fails only some of the time, or drops (or delays) only a small percentage of the traffic. The example given is:

  • A single service must collect information from many other services on the network to complete a particular operation
  • Each of these information collection operations represents a single transaction carried across the network
  • The more transactions there are, the …
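One way to see why fan-out matters when a component fails only some of the time is a little arithmetic. The sketch below assumes that each transaction independently has the same small probability p of being dropped or delayed; both the independence assumption and the value of p are mine, for illustration, and are not taken from the paper. Under that assumption, the chance that at least one of n transactions is affected is 1 - (1 - p)^n.

```python
def p_any_affected(p: float, n: int) -> float:
    """Probability that at least one of n independent transactions is affected.

    p is the assumed per-transaction chance of hitting the degraded component.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n < 0:
        raise ValueError("n must be non-negative")
    return 1.0 - (1.0 - p) ** n


# Hypothetical per-transaction failure probability of 0.1%.
for n in (1, 10, 100, 1000):
    print(n, round(p_any_affected(0.001, n), 4))
```

The per-transaction probability stays fixed in this sketch; only the number of transactions per operation changes, and the chance that the operation as a whole is touched by the fault rises with it.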