Engineering dependability and fault tolerance in a distributed system

This is a guest post by Paddy Byers, Co-founder and CTO at Ably, a realtime data delivery platform. You can view the original article on Ably's blog.
Users need to know that they can depend on the service provided to them. In practice, because individual elements will inevitably fail from time to time, this means the service has to be able to continue in spite of those failures.
In this article, we discuss the concepts of dependability and fault tolerance in detail and explain how the Ably platform is designed with fault-tolerant approaches to uphold its dependability guarantees.
As a basis for that discussion, first some definitions:
Dependability
The degree to which a product or service can be relied upon. Availability and Reliability are forms of dependability.
Availability
The degree to which a product or service is available for use when required. This often boils down to provisioning sufficient redundancy of resources with statistically independent failures.
Reliability
The degree to which the product or service conforms to its specification when in use. This means a system that is not merely available but is also engineered with extensive redundant measures to continue to work as its …
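To make the availability definition above concrete, here is a minimal sketch of why redundancy with statistically independent failures helps. It is not taken from Ably's platform. The function name `combined_availability` and the 99% figure are illustrative assumptions, and the formula holds only if failures really are independent.

```python
def combined_availability(single: float, replicas: int) -> float:
    """Probability that at least one of `replicas` resources is up.

    Assumes every resource has the same availability `single` and that
    failures are statistically independent (an idealisation; correlated
    failures make the real figure lower).
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be a probability in [0, 1]")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    # The group is unavailable only when every replica is down at once.
    all_down = (1.0 - single) ** replicas
    return 1.0 - all_down


if __name__ == "__main__":
    # Hypothetical example: each resource is available 99% of the time.
    for n in (1, 2, 3):
        print(f"{n} replica(s): {combined_availability(0.99, n):.6f}")
```

Under these assumptions, two resources that are each available 99% of the time give a combined availability of about 99.99%. If the failures are correlated, for example because both resources share a power supply or a network path, the benefit shrinks. That is why the definition stresses independence.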






