AI Metrics with Prometheus and Grafana
The Grafana AI Metrics dashboard shown above tracks performance metrics for AI/ML RoCEv2 network traffic, for example, large-scale CUDA compute tasks that use NVIDIA Collective Communications Library (NCCL) operations for inter-GPU communication: AllReduce, Broadcast, Reduce, AllGather, and ReduceScatter. The metrics include:
- Total Traffic: Total traffic entering fabric
- Operations: Total RoCEv2 operations broken out by type
- Core Link Traffic: Histogram of load on fabric links
- Edge Link Traffic: Histogram of load on access ports
- RDMA Operations: Total RDMA operations
- RDMA Bytes: Average RDMA operation size
- Credits: Average number of credits in RoCEv2 acknowledgements
- Period: Detected period of compute / exchange activity on fabric (in this case just over 0.5 seconds)
- Congestion: Total ECN / CNP congestion messages
- Errors: Total ingress / egress errors
- Discards: Total ingress / egress discards
- Drop Reasons: Packet drop reasons
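The Period metric reports how often the compute / exchange cycle repeats. The text here does not describe how the AI Metrics application detects it, so the following is only a minimal sketch of one common technique, autocorrelation of a traffic time series, and not the application's actual method. The function name `estimate_period`, the `min_score` threshold, and the sample data are all hypothetical.

```python
def estimate_period(samples, interval, min_score=0.5):
    """Estimate the repetition period (seconds) of evenly spaced traffic samples.

    Illustrative autocorrelation approach; NOT taken from the AI Metrics app.
    samples  : traffic measurements, one per `interval` seconds
    interval : spacing between samples in seconds
    min_score: hypothetical threshold below which no period is reported
    """
    n = len(samples)
    if n < 4:
        return None
    mean = sum(samples) / n
    centered = [s - mean for s in samples]
    energy = sum(c * c for c in centered)
    if energy == 0:
        return None  # flat signal, nothing periodic to detect

    best_lag, best_score = None, 0.0
    # Compare the signal with shifted copies of itself; the shift (lag)
    # with the strongest match is taken as the period.
    for lag in range(1, n // 2 + 1):
        score = sum(centered[i] * centered[i + lag] for i in range(n - lag)) / energy
        if score > best_score:
            best_lag, best_score = lag, score

    if best_lag is None or best_score < min_score:
        return None
    return best_lag * interval


# Synthetic example: a two-sample burst followed by three idle samples,
# repeated eight times and sampled every 0.1 seconds.
burst_pattern = [1.0, 1.0, 0.0, 0.0, 0.0] * 8
period = estimate_period(burst_pattern, interval=0.1)
```

Because the sum is not normalized by the number of overlapping samples, shorter lags are slightly favored, which helps the sketch report the fundamental period rather than a multiple of it.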
This article gives step-by-step instructions to set up the dashboard in a production environment.
git clone https://github.com/sflow-rt/prometheus-grafana.git
sed -i -e 's/prometheus/ai-metrics/g' prometheus-grafana/env_vars
./prometheus-grafana/start.sh
The easiest way to get started is to use Docker (see Deploy real-time network dashboards using Docker compose) and deploy the sflow/ai-metrics image, which bundles the AI Metrics application that generates the metrics.
scrape_configs:
  - job_name: 'sflow-rt-ai-metrics'
    metrics_path: /app/ai-metrics/scripts/metrics.js/prometheus/txt
    scheme: …
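Before pointing Prometheus at the application, it can help to confirm that the metrics path in the scrape configuration returns data. The sketch below builds the URL from that path and parses the Prometheus text exposition format into a dictionary. The host, port, and scheme are parameters the caller must supply (the scheme value is cut off in the configuration above, and the port 8008 used in the usage note is an assumption about the sFlow-RT default). The function names and the metric names used for testing are hypothetical.

```python
from urllib.request import urlopen

# Path taken from the scrape configuration above.
METRICS_PATH = "/app/ai-metrics/scripts/metrics.js/prometheus/txt"


def metrics_url(host, port, scheme):
    """Build the metrics URL; host, port and scheme are supplied by the caller."""
    return f"{scheme}://{host}:{port}{METRICS_PATH}"


def parse_prometheus_text(text):
    """Parse Prometheus text exposition format into {series: value}.

    Simplified sketch: skips comment lines (# HELP / # TYPE), ignores
    optional timestamps, and keeps the label set as part of the key.
    """
    samples = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        close = line.rfind("}")
        if close != -1:
            # name{label="value"} value [timestamp]
            series, rest = line[:close + 1], line[close + 1:].split()
        else:
            # name value [timestamp]
            fields = line.split()
            series, rest = fields[0], fields[1:]
        if not rest:
            continue  # malformed line without a value
        samples[series] = float(rest[0])
    return samples


def fetch_metrics(url, timeout=5):
    """Fetch the URL and return the parsed samples."""
    with urlopen(url, timeout=timeout) as response:
        return parse_prometheus_text(response.read().decode("utf-8"))
```

With the application running locally, a call such as `fetch_metrics(metrics_url("localhost", 8008, "http"))` would return the current values, assuming those connection details match the deployment.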
