BRKDCT-2333 – Data Center Network Failure Detection
Presenter: Arkadiy Shapiro, Manager Technical Marketing (Nexus 2000 – 7000) @ArkadiyShapiro
You could say I’m obsessed with BFD –Arkadiy
The focus of this session is failure detection (not reconvergence, protocol tuning, etc.). It does not cover user-driven failure detection methods (ping, traceroute, etc.).
Fast failure detection is the key to fast convergence.
Routing convergence steps:
- Detect
- Propagate (tell my neighbors)
- Process (routing recalc, SPF, DUAL, etc)
- Update (update RIB/FIB, program hardware tables)
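To see why detection dominates, here is a rough sketch that treats the four steps as running back to back and adds up their times. The class name and all of the millisecond values are made-up illustrations, not numbers from the session; real networks overlap some of these phases.

```python
from dataclasses import dataclass


@dataclass
class ConvergenceBudget:
    """Time spent in each convergence step, in milliseconds (illustrative)."""
    detect_ms: float
    propagate_ms: float
    process_ms: float
    update_ms: float

    def total_ms(self) -> float:
        # Simplifying assumption: the steps run strictly one after another.
        return self.detect_ms + self.propagate_ms + self.process_ms + self.update_ms

    def detect_share(self) -> float:
        # Fraction of the total outage spent just noticing the failure.
        return self.detect_ms / self.total_ms()


# Hypothetical numbers: same propagate/process/update cost, different detection.
slow = ConvergenceBudget(detect_ms=3000, propagate_ms=50, process_ms=100, update_ms=150)
fast = ConvergenceBudget(detect_ms=150, propagate_ms=50, process_ms=100, update_ms=150)

for name, budget in (("slow detect", slow), ("fast detect", fast)):
    print(f"{name}: {budget.total_ms():.0f} ms total, "
          f"{budget.detect_share():.0%} spent on detection")
```

With these assumed values, shrinking only the detect step takes the total from 3300 ms to 450 ms, which is the point of the "detection is the key" line above.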
Failure detection tools follow a layered approach: Layer 1, Layer 2, MPLS, Layer 3, application.
Interconnect options:
- Point to point – failure detection is really easy here; event driven; fast
- Layer 3 with Layer 1 (DWDM) bump in the wire
- Layer 3 with Layer 2 (Ethernet) bump in the wire
- Layer 3 with Layer 3 (firewall/router) bump in the wire
Think about this: as link speeds increase (1G -> 10G -> 40G -> beyond), more data is lost during an outage unless the failure detection/reconvergence characteristics of the network improve to match. A 1 second reconvergence time at 40G can lose up to 40 times as much data as 1 second at 1G.
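A quick back-of-the-envelope sketch of that arithmetic follows. The function name and the `utilization` parameter are my own illustrative choices; it assumes a worst case where the link is fully loaded and everything sent during the outage is dropped.

```python
def bytes_lost(link_gbps: float, outage_seconds: float, utilization: float = 1.0) -> float:
    """Worst-case bytes dropped while the network has not yet reconverged.

    Assumes traffic at link_gbps * utilization is blackholed for the
    whole outage (an illustrative simplification).
    """
    bits = link_gbps * 1e9 * outage_seconds * utilization
    return bits / 8  # 8 bits per byte


# Same 1 second outage at each link speed mentioned above.
for gbps in (1, 10, 40):
    lost_mb = bytes_lost(gbps, 1.0) / 1e6
    print(f"{gbps}G link, 1 s outage: up to {lost_mb:.0f} MB lost")
```

At full load, 1 second costs up to 125 MB at 1G and up to 5 GB at 40G.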
Be aware: ISSU may not support aggressive timers on various protocols. This is another reason to be wary of cranking timers down.