NVIDIA Cumulus Linux 5.11 for AI / ML
NVIDIA Cumulus Linux 5.11 includes major upgrades to the sFlow agent that fully expose the advanced instrumentation built into NVIDIA Spectrum-X silicon. The enhanced real-time telemetry is particularly relevant to the AI / machine learning workloads that Spectrum-X is designed to handle.
With Cumulus Linux 5.11, the sFlow agent is easily configured using NVUE commands (see "Monitoring System Statistics and Network Traffic with sFlow"):
nv set system sflow dropmon hw
nv set system sflow poll-interval 20
nv set system sflow collector 192.0.2.1
nv set system sflow state enabled
nv config apply
Note: In this case, enabling dropmon ensures that every dropped packet is captured, along with ingress port and drop reason (e.g. ttl_exceeded).
The same commands should be applied to every switch in the fabric for comprehensive visibility.
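One way to push the same settings to every switch is a small shell wrapper around the commands shown above. The following is a minimal sketch, not a tested deployment tool: the function names `sflow_commands` and `apply_to_fabric`, the `DRY_RUN` variable, the `cumulus` login, and the switch names in the usage note are all assumptions made for illustration. It also assumes SSH access to each switch and that `nv config apply` completes without an interactive prompt.

```shell
#!/bin/sh
# Print the NVUE commands from the article for a given collector
# address and (optional) polling interval in seconds.
sflow_commands() {
    collector="$1"
    poll_interval="${2:-20}"
    printf '%s\n' \
        "nv set system sflow dropmon hw" \
        "nv set system sflow poll-interval ${poll_interval}" \
        "nv set system sflow collector ${collector}" \
        "nv set system sflow state enabled" \
        "nv config apply"
}

# Send the commands to each switch named on the command line.
# DRY_RUN defaults to 1, which only prints what would be sent;
# set DRY_RUN=0 to run the commands over SSH.
# The "cumulus" account is an assumption; use your own login.
apply_to_fabric() {
    collector="$1"
    shift
    for switch in "$@"; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "# ${switch}"
            sflow_commands "${collector}"
        else
            sflow_commands "${collector}" | ssh "cumulus@${switch}" sh
        fi
    done
}
```

For example, `apply_to_fabric 192.0.2.1 leaf01 leaf02 spine01` (hypothetical switch names) prints the command set once per switch; reviewing that output before setting `DRY_RUN=0` keeps the change easy to audit.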
"RDMA over Converged Ethernet (RoCE)" describes how sFlow provides detailed visibility into the RoCE flows used to move data between GPUs in an AI / ML data center fabric. A chart from the RDMA network visibility demonstration at the SC22 conference shows that sFlow monitoring easily scales to the 400/800G speeds needed for machine learning. In this example, the sFlow-RT real-time analytics engine receives sFlow ...





