AI Metrics describes how to deploy the open source ai-metrics application. The application provides performance metrics for AI/ML RoCEv2 network traffic, for example, large-scale CUDA compute tasks using NVIDIA Collective Communication Library (NCCL) operations for inter-GPU communication: AllReduce, Broadcast, Reduce, AllGather, and ReduceScatter. The screen capture from the article (above) shows results from a simulated 48,000 GPU cluster.
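Since the dashboard breaks traffic out by these collective operations, their data-movement semantics are worth keeping in mind. The following is a minimal Python sketch of those semantics using plain lists (one entry per simulated rank); it is an illustration only, not real NCCL usage, which operates on GPU device buffers:

```python
# Illustrative semantics of the NCCL collectives named above,
# modeled on plain Python lists: ranks[i] is the buffer held by
# simulated GPU rank i. Real NCCL calls (e.g. ncclAllReduce) move
# device memory over NVLink/RDMA; this only shows the end state.

def all_reduce(ranks):
    """Every rank ends up with the element-wise sum across all ranks."""
    total = [sum(vals) for vals in zip(*ranks)]
    return [list(total) for _ in ranks]

def broadcast(ranks, root=0):
    """Every rank receives a copy of the root rank's buffer."""
    return [list(ranks[root]) for _ in ranks]

def reduce(ranks, root=0):
    """Only the root rank holds the element-wise sum; others are unchanged."""
    total = [sum(vals) for vals in zip(*ranks)]
    return [total if i == root else list(r) for i, r in enumerate(ranks)]

def all_gather(ranks):
    """Every rank receives the concatenation of all ranks' buffers."""
    gathered = [x for r in ranks for x in r]
    return [list(gathered) for _ in ranks]

def reduce_scatter(ranks):
    """Element-wise sum, then one chunk of the result per rank.
    Assumes buffer length equals the number of ranks for simplicity."""
    total = [sum(vals) for vals in zip(*ranks)]
    return [[total[i]] for i in range(len(ranks))]

# Example: 4 simulated ranks, rank i holds the buffer [i, i, i, i].
ranks = [[i] * 4 for i in range(4)]
print(all_reduce(ranks)[0])   # [6, 6, 6, 6] on every rank
print(reduce_scatter(ranks))  # [[6], [6], [6], [6]]
```

AllReduce dominates data-parallel training traffic, which is why a burst of these operations shows up so clearly on edge links.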
This article goes beyond simulation to demonstrate the AI Metrics dashboard
by comparing live traffic seen in two production AI clusters.
Cluster 1
This cluster consists of 250 GPUs connected via 100G ports to a single large switch. The results are broadly consistent with the simulation in the original article. In this case there is no Core Link Traffic because the cluster consists of a single switch. The Discards chart shows a burst of Out (egress) discards, and the Drop Reasons chart gives the reason as ingress_vlan_filter. The Total Traffic, Operations, Edge Link Traffic, and RDMA Operations charts all show a transient drop in throughput coincident with the discard spike. Further details of the dropped packets, such as source/destination address, operation, ingress/egress port, QP pair, etc., can be extracted from the sFlow Dropped Packet Notifications that are populating the dashboard.
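To show how such notifications might be aggregated into a Drop Reasons view, here is a minimal Python sketch. The record fields used below (reason, egress_port, source/destination addresses) are assumptions for illustration, not the exact schema emitted by an sFlow collector:

```python
# Hedged sketch: tallying dropped-packet notifications by drop reason
# and egress port. The dict keys below are an assumed, illustrative
# record structure, not the actual sFlow drop-notification schema.

from collections import Counter

def tally_drops(notifications):
    """Count drop notifications, keyed by (reason, egress port)."""
    counts = Counter()
    for n in notifications:
        counts[(n["reason"], n.get("egress_port", "unknown"))] += 1
    return counts

# Hypothetical notifications matching the ingress_vlan_filter drops
# described above.
sample = [
    {"reason": "ingress_vlan_filter", "src": "10.0.0.1",
     "dst": "10.0.0.2", "egress_port": "eth12"},
    {"reason": "ingress_vlan_filter", "src": "10.0.0.3",
     "dst": "10.0.0.4", "egress_port": "eth12"},
]
print(tally_drops(sample))  # Counter({('ingress_vlan_filter', 'eth12'): 2})
```

A real deployment would feed records streamed from the collector into a tally like this to drive the Drop Reasons chart.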