Root Cause Analysis – It’s Not Perfect
Automated Root Cause Analysis promises a lot. High-end network monitoring systems (NMS) claim they can automatically isolate network problems and tell you only about the thing that needs fixing. This sounds very enticing. Who wants a flood of alarms when a single alarm could tell us what we need to fix? But it’s not perfect, and you do still need to keep an eye on it.
Consider this contrived network:
What happens if the upstream link from the router fails?
From the perspective of the NMS, all systems at that site are unreachable. A simple NMS that is unaware of topology will create 4 alarms – one for each of the router, the switches and the server. A smarter NMS will recognise that it only needs to raise one alarm, for the router’s WAN link being down (which takes the whole site offline). It will still know that the switches and server are unreachable, but those alarms will be suppressed under the key incident.
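As a rough illustration of the difference, here is a minimal Python sketch of topology-aware suppression. It is not how any particular NMS implements root cause analysis. The device names and the `PARENT` map are hypothetical, and the layout (one router, two switches, one server) is only an assumption that matches the four alarms described above.

```python
# Hypothetical topology: each device maps to the device the NMS reaches it through.
# None means the device sits directly behind the WAN link.
PARENT = {
    "router": None,
    "switch1": "router",
    "switch2": "router",
    "server": "switch1",
}


def has_down_ancestor(device, down, parent):
    """Return True if any device upstream of `device` is also unreachable."""
    node = parent.get(device)
    while node is not None:  # assumes the topology is a tree (no loops)
        if node in down:
            return True
        node = parent.get(node)
    return False


def correlate(unreachable, parent=PARENT):
    """Split unreachable devices into root-cause alarms and suppressed ones."""
    down = set(unreachable)
    root_causes, suppressed = [], []
    for device in sorted(down):
        if has_down_ancestor(device, down, parent):
            suppressed.append(device)   # symptom of an upstream failure
        else:
            root_causes.append(device)  # nothing upstream is down: alarm on this
    return root_causes, suppressed


# The WAN link fails, so every device at the site stops responding.
site_down = ["router", "switch1", "switch2", "server"]
naive_alarms = list(site_down)          # topology-unaware: one alarm per device
root_causes, suppressed = correlate(site_down)
print(len(naive_alarms), root_causes, suppressed)
```

Note that the result is only as good as the `PARENT` map: the code trusts whatever topology it has been given.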
This all sounds like a good idea. Why wouldn’t you want that?
But what if the NMS view of the network is incomplete? What might happen then?
Consider the same network as above, but this time a new WAN router has been …