When correlation (or lack of it) can be causation
Rex: preventing bugs and misconfiguration in large services using correlated change analysis, Mehta et al., NSDI’20
and
Check before you change: preventing correlated failures in service updates, Zhai et al., NSDI’20
Today’s post is a double header. I’ve chosen two papers from NSDI’20 that are both about correlation. Rex is a tool widely deployed across Microsoft that checks for correlations you don’t have but probably should: it looks at the files changed in a commit and warns developers when files that are frequently changed alongside them have been left untouched. CloudCanary, on the other hand, is about detecting correlations you do have but probably don’t want: it looks for potential causes of correlated failures across a system, and can make targeted recommendations for improving system reliability.
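To make the Rex idea concrete, here is a minimal sketch of co-change checking. It is not Rex's actual algorithm: it uses a plain pairwise co-occurrence count over commit history, and the function names, the thresholds (`min_support`, `min_confidence`), and the file names are all hypothetical, chosen only for illustration.

```python
from collections import Counter
from itertools import combinations


def mine_cochange_rules(history, min_support=3, min_confidence=0.8):
    """Return {(a, b): confidence}, read as 'when a changes, b usually changes too'.

    history is a list of commits, each a list of changed file names.
    The thresholds are illustrative assumptions, not values from the paper.
    """
    file_counts = Counter()
    pair_counts = Counter()
    for commit in history:
        files = set(commit)
        file_counts.update(files)
        # Count each unordered pair once per commit, in both directions.
        for a, b in combinations(sorted(files), 2):
            pair_counts[(a, b)] += 1
            pair_counts[(b, a)] += 1

    rules = {}
    for (a, b), together in pair_counts.items():
        confidence = together / file_counts[a]
        if together >= min_support and confidence >= min_confidence:
            rules[(a, b)] = confidence
    return rules


def missing_cochanges(rules, changed_files):
    """List (changed, expected) pairs where the expected file was left untouched."""
    changed = set(changed_files)
    return sorted((a, b) for (a, b) in rules if a in changed and b not in changed)


# Hypothetical history: foo.conf and client.py always change together.
history = [
    ["foo.conf", "client.py"],
    ["foo.conf", "client.py"],
    ["foo.conf", "client.py", "README"],
    ["README"],
]
rules = mine_cochange_rules(history)
warnings = missing_cochanges(rules, ["foo.conf"])
```

With this toy history, a commit that touches only `foo.conf` gets a warning that `client.py` was expected to change as well, while a commit touching both produces no warnings.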
Improving system reliability through correlation
"If you change the foo setting, don’t forget that you also need to update all the clients…"
Large-scale services run on a foundation of very large codebases and configuration repositories. To run uninterrupted a service not only depends on correct code, but also on correct network and security configuration, and suitable deployment specification. This causes various dependencies both within and across components/sources of the service which emerge…







