Monitoring RoCEv2 with sFlow
The talk Seeing Through the RDMA Fog: Monitoring RoCEv2 with sFlow, presented at the recent North American Network Operators' Group (NANOG) conference, describes how industry-standard sFlow telemetry from data center switches provides visibility into RDMA activity in AI / ML networks.
Note: Slides are available from the talk link.
The SDSC Expanse cluster live AI/ML metrics dashboard described in the talk can be accessed by clicking on the dashboard link. The San Diego Supercomputer Center (SDSC) Expanse cluster specifications: 5 Pflop/s peak; 93,184 CPU cores; 208 NVIDIA GPUs; 220 TB total DRAM; 810 TB total NVMe.
Note: AI Metrics with Prometheus and Grafana shows how to set up the monitoring stack.
More recently, the Expanse heatmap provides a publicly accessible real-time visualization of live traffic flowing between nodes in the Expanse cluster. See Real-time visualization of AI / ML traffic matrix for more information.
