0
Data Center Quantized Congestion Notification (DCQCN) is a hybrid congestion control method. DCQCN brings together both Priority Flow Control (PFC) and Explicit Congestion Notification (ECN) so that we can get high throughput, low latency, and lossless delivery across our AI fabric. In this approach, each mechanism plays a specific role in addressing different aspects of congestion, and together they create a robust flow-control system for RDMA traffic.
DCQCN tackles two main issues in large-scale RDMA networks:
1. Head-of-Line Blocking and Congestion Spreading: This is caused by PFC’s pause frames, which stop traffic across switches.
2. Throughput Reduction with ECN Alone: When the ECN feedback is too slow, packet loss may occur despite the rate adjustments.
DCQCN uses a two-tiered approach. It applies ECN early on to gently reduce the sending rate at the GPU NICs, and it uses PFC as a backup to quickly stop traffic on upstream switches (hop-by-hop) when congestion becomes severe.
How DCQCN Combines ECN and PFC
DCQCN carefully combines Explicit Congestion Notification (ECN) and Priority Flow Control (PFC) in the right sequence:
Early Action with ECN: When congestion begins to build up, the switch uses WRED thresholds (minimum and maximum) to mark packets. This signals the Continue reading