Who monitors the monitoring systems?
Adrian Cockroft poses an interesting question in, Who monitors the monitoring systems? He states, The first thing that would be useful is to have a monitoring system that has failure modes which are uncorrelated with the infrastructure it is monitoring. ... I don’t know of a specialized monitor-of-monitors product, which is one reason I wrote this blog post.This article offers a response, describing how to introduce an uncorrelated monitor-of-monitors into the data center to provide real-time visibility that survives when the primary monitoring systems fail.
Summary of the AWS Service Event in the Northern Virginia (US-EAST-1) Region, This congestion immediately impacted the availability of real-time monitoring data for our internal operations teams, which impaired their ability to find the source of congestion and resolve it. December 10th, 2021
Standardizing on a small set of communication primitives (gRPC, Thrift, Kafka, etc.) simplifies the creation of large scale distributed services. The communication primitives abstract the physical network to provide reliable communication to support distributed services running on compute nodes. Monitoring is typically regarded as a distributed service that is part of the compute infrastructure, relying on agents on compute nodes to transmit measurements to scale out analysis, storage, automation, and Continue reading