0
Whenever you build a complex system, you need to test that it works as expected, including properly handling failures. It’s easy enough to do simple component failure testing, but harder to do rapid automated failure tests. Big Switch is showing that it can be done though. Hopefully we can keep improving our testing to pick up some more of the software failures.
Testing is hard
Over the course of my career I’ve built many clustered systems – HP-UX Serviceguard, firewalls, routers, load balancers, RedHat Clusters, etc. Good clusters have redundant everything – servers, power supplies, disks, NICs, etc.
The commissioning process always included testing. We’d go through each of the components, trying to simulate failures. Unplug each of the power cables, the network cables, unseat a hard drive, remove a hot-swappable fan, etc. That would test out the redundant components within each server, and then of course you’d simulate a complete system failure, forcing full failover.
This is all important stuff, but it doesn’t pick up all the failures – e.g. What happens if you’ve got a faulty patch lead, and the link starts flapping? Sometimes a simple failure gets messy when it happens repeatedly over a short Continue reading