AI network performance monitoring using containerlab
AI Metrics is available on GitHub. The application provides performance metrics for AI/ML RoCEv2 network traffic, for example, large-scale CUDA compute tasks using NVIDIA Collective Communication Library (NCCL) operations for inter-GPU communications: AllReduce, Broadcast, Reduce, AllGather, and ReduceScatter.
The screen capture is from a containerlab topology that emulates an AI compute cluster connected by a leaf and spine network. The metrics include:
- Total Traffic: Total traffic entering fabric
- Operations: Total RoCEv2 operations broken out by type
- Core Link Traffic: Histogram of load on fabric links
- Edge Link Traffic: Histogram of load on access ports
- RDMA Operations: Total RDMA operations
- RDMA Bytes: Average RDMA operation size
- Credits: Average number of credits in RoCEv2 acknowledgements
- Period: Detected period of compute / exchange activity on fabric (in this case just over 0.5 seconds)
- Congestion: Total ECN / CNP congestion messages
- Errors: Total ingress / egress errors
- Discards: Total ingress / egress discards
- Drop Reasons: Packet drop reasons
Note: Clicking on peaks in the charts shows values at that time.
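The Period metric in the list above reports how often the fabric alternates between compute and exchange phases. The source does not say how AI Metrics derives this value, so the following is only a minimal sketch of one common approach (autocorrelation of a traffic time series), not the application's actual method. The function name detect_period, its input format (one traffic sample per line), and the sampling interval argument are all hypothetical.

```shell
# Hypothetical helper (not part of AI Metrics): estimate the dominant period
# of a traffic time series using autocorrelation.
# stdin: one numeric sample per line; $1: sampling interval in seconds.
detect_period() {
  awk -v dt="$1" '
    { x[NR] = $1; sum += $1 }
    END {
      n = NR
      if (n < 4) { print "n/a"; exit }
      mean = sum / n
      best = 0; best_lag = 0
      # Skip lag 0 and search lags up to half the window; keep the lag
      # with the largest positive autocorrelation.
      for (lag = 1; lag <= int(n / 2); lag++) {
        acc = 0
        for (i = 1; i + lag <= n; i++)
          acc += (x[i] - mean) * (x[i + lag] - mean)
        if (acc > best) { best = acc; best_lag = lag }
      }
      if (best_lag == 0) print "n/a"
      else printf "%.2f\n", best_lag * dt
    }'
}

# Example: bursts repeating every 5 samples, sampled every 0.1 seconds.
printf '%s\n' 1 1 0 0 0 1 1 0 0 0 1 1 0 0 0 1 1 0 0 0 | detect_period 0.1
```

A series with no repeating structure yields "n/a" in this sketch; a production detector would also need to handle noise and drifting periods.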
This article gives step-by-step instructions to run the demonstration.
Run the following commands to download the sflow-rt/containerlab project from GitHub and run the run-clab script:

git clone https://github.com/sflow-rt/containerlab.git
cd containerlab
./run-clab
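Before running the commands above, it can help to confirm that the required tools are installed. The sketch below assumes the demonstration needs git (to clone the project) and docker (on the assumption that the run-clab script launches Containerlab in containers); the helper name missing_tools is hypothetical and not part of the project.

```shell
# Hypothetical preflight helper: print each named tool that is not on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Assumption: git and docker are the prerequisites for this demo.
missing=$(missing_tools git docker)
if [ -n "$missing" ]; then
  echo "Install these before running the demo: $missing" >&2
fi
```

The check only reports what is missing and leaves installation to the reader.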
