The Grafana AI Metrics dashboard shown above tracks performance metrics for AI/ML RoCEv2 network traffic, for example, large-scale CUDA compute tasks using NVIDIA Collective Communication Library (NCCL) operations for inter-GPU communication: AllReduce, Broadcast, Reduce, AllGather, and ReduceScatter.
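To make the collectives concrete, here is a hedged sketch of what an AllReduce computes, modeled with plain Python lists rather than GPU buffers; the function name and data are illustrative, not part of the NCCL API.

```python
# Hedged sketch of an AllReduce (sum) collective: each rank contributes
# a vector, and every rank receives the elementwise sum of all vectors.
# Plain Python stands in for GPU buffers; this is not the NCCL API.
def allreduce_sum(buffers):
    total = [sum(vals) for vals in zip(*buffers)]
    return [total[:] for _ in buffers]  # every rank gets a copy of the sum

ranks = [[1, 2], [3, 4], [5, 6]]  # three ranks, two elements each
print(allreduce_sum(ranks))  # every rank ends with [9, 12]
```

Reduce delivers the sum to a single root rank, and ReduceScatter splits the summed vector across ranks; each collective generates a distinct RoCEv2 traffic pattern on the fabric.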
The metrics include:
- Total Traffic: total traffic entering the fabric
- Operations: total RoCEv2 operations broken out by type
- Core Link Traffic: histogram of load on fabric links
- Edge Link Traffic: histogram of load on access ports
- RDMA Operations: total RDMA operations
- RDMA Bytes: average RDMA operation size
- Credits: average number of credits in RoCEv2 acknowledgements
- Period: detected period of compute / exchange activity on the fabric (in this case just over 0.5 seconds)
- Congestion: total ECN / CNP congestion messages
- Errors: total ingress / egress errors
- Discards: total ingress / egress discards
- Drop Reasons: packet drop reasons
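The derived metrics above follow from counter deltas. As a hedged sketch (the function name and sample numbers are illustrative, not from a real fabric), the "RDMA Bytes" average operation size and the traffic rate can be computed from two successive samples like this:

```python
# Hedged sketch: derive average RDMA operation size and throughput from
# counter deltas taken over a sampling interval. Sample values are
# illustrative, not measurements from a real fabric.
def rdma_stats(bytes_delta, ops_delta, interval_s):
    """Return (average operation size in bytes, throughput in bits/s)."""
    avg_op_size = bytes_delta / ops_delta if ops_delta else 0.0
    bits_per_s = bytes_delta * 8 / interval_s
    return avg_op_size, bits_per_s

avg_size, rate = rdma_stats(bytes_delta=1_048_576_000,
                            ops_delta=1_000_000,
                            interval_s=10)
print(avg_size)  # 1048.576 bytes per RDMA operation
print(rate)      # 838860800.0 bits/s, roughly 839 Mbit/s
```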
This article gives step-by-step instructions to set up the dashboard in a production environment.
The easiest way to get started is to use Docker, see Deploy real-time network dashboards using Docker compose, and deploy the sflow/ai-metrics image bundling the AI Metrics application to generate metrics.

git clone https://github.com/sflow-rt/prometheus-grafana.git
sed -i -e 's/prometheus/ai-metrics/g' prometheus-grafana/env_vars
./prometheus-grafana/start.sh
scrape_configs:
  - job_name: 'sflow-rt-ai-metrics'
    metrics_path: /app/ai-metrics/scripts/metrics.js/prometheus/txt
    scheme:
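The metrics_path above serves metrics in the Prometheus text exposition format, so they can also be inspected directly, for example by fetching the path from the sFlow-RT instance (port 8008 is the sFlow-RT default, an assumption here). Below is a hedged sketch of a minimal parser for simple, unlabeled samples; the metric names in the sample text are illustrative, the real names come from the endpoint itself.

```python
# Hedged sketch: parse simple, unlabeled samples from the Prometheus
# text exposition format served by the AI Metrics endpoint.
# The metric names in the sample string are illustrative only.
def parse_prometheus_text(text):
    """Return {metric_name: float_value}, skipping comments and HELP/TYPE lines."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # comments, HELP and TYPE lines
        name, _, value = line.rpartition(' ')
        try:
            samples[name] = float(value)
        except ValueError:
            pass  # ignore lines that do not end in a numeric value
    return samples

sample = """# HELP total_traffic Total traffic entering fabric
# TYPE total_traffic gauge
total_traffic 123456.0
rdma_operations 42
"""
print(parse_prometheus_text(sample))  # {'total_traffic': 123456.0, 'rdma_operations': 42.0}
```

In practice Prometheus itself does this scraping; the sketch is only useful for spot-checking that the endpoint is exporting values before wiring up the dashboard.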