AI Metrics with Grafana Cloud
The Grafana AI Metrics dashboard shown above tracks performance metrics for AI/ML RoCEv2 network traffic, for example, large scale CUDA compute tasks using NVIDIA Collective Communication Library (NCCL) operations for inter-GPU communications: AllReduce, Broadcast, Reduce, AllGather, and ReduceScatter.The metrics include:
- Total Traffic Total traffic entering fabric
- Operations Total RoCEv2 operations broken out by type
- Core Link Traffic Histogram of load on fabric links
- Edge Link Traffic Histogram of load on access ports
- RDMA Operations Total RDMA operations
- RDMA Bytes Average RDMA operation size
- Credits Average number of credits in RoCEv2 acknowledgements
- Period Detected period of compute / exchange activity on fabric (in this case just over 0.5 seconds)
- Congestion Total ECN / CNP congestion messages
- Errors Total ingress / egress errors
- Discards Total ingress / egress discards
- Drop Reasons Packet drop reasons
Note: Grafana Cloud has a free service tier that can be used to test this example.





