By default, the dashboard shows the Last 24 Hours of traffic. Explore the data: select Last 30 Days for a long-term view, select Last 5 Minutes for an up-to-the-second view, click on items in a chart legend to show only the selected metric, or drag to select an interval and zoom in.
The Expanse cluster at the San Diego Supercomputer Center is a batch-oriented scientific computing gateway serving thousands of users and a wide range of research projects; see Google News for examples. The SDSC Expanse cluster live AI/ML metrics dashboard displays real-time metrics for workloads running on the cluster:
The sFlow instrumentation embedded as a standard feature of data center switch hardware from all leading vendors (Arista, Cisco, Dell, Juniper, NVIDIA, etc.) provides a cost-effective solution for gaining visibility into UET traffic in large production AI / ML fabrics.
docker run -p 8008:8008 -p 6343:6343/udp sflow/prometheus
The easiest way to get started is to use the pre-built sflow/prometheus Docker image to analyze the sFlow telemetry. The chart at the top of this page shows an up-to-the-second view of UET operations using the included Flow Browser application; see Defining Flows for a list of available UET attributes. Getting Started describes how to set up the sFlow monitoring system.
Flow metrics with Prometheus and Grafana describes how to collect custom network traffic flow metrics using the Prometheus time series database and include the metrics in Grafana dashboards. Use the Flow Browser to explore Continue reading
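Flow definitions can also be created programmatically. The following is a minimal sketch of defining a flow through the sFlow-RT REST API and reading back the current top flows, assuming sFlow-RT is listening on localhost:8008 (the port published by the docker command above); the endpoint paths and key names are illustrative and should be checked against the Defining Flows and REST API documentation:
import json
import urllib.request

RT = "http://localhost:8008"  # sFlow-RT REST API (assumed address)

def put_json(path, obj):
    # Send a JSON body with an HTTP PUT to the sFlow-RT REST API.
    req = urllib.request.Request(RT + path, data=json.dumps(obj).encode(),
                                 headers={"Content-Type": "application/json"},
                                 method="PUT")
    urllib.request.urlopen(req).close()

# Define a flow keyed on source and destination address, measured in bytes.
# The keys are generic examples; substitute UET attributes from Defining Flows.
put_json("/flow/example/json", {"keys": "ipsource,ipdestination", "value": "bytes"})

# Read the current largest flows matching the definition.
with urllib.request.urlopen(RT + "/activeflows/ALL/example/json?maxFlows=5") as resp:
    for flow in json.load(resp):
        print(flow["key"], flow["value"])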
sflow sampling-rate 10000
sflow polling-interval 20
sflow header-bytes 128
sflow direction both
sflow drop-monitoring enable
sflow enable GigabitEthernet0/8/0
sflow enable GigabitEthernet0/9/0
sflow enable GigabitEthernet0/a/0
The above VPP configuration commands enable sFlow monitoring of the VPP dataplane, randomly sampling packets, periodically polling counters, and capturing dropped packets and reason codes. The measurements are sent via Linux netlink messages to an instance of the open source Host sFlow agent (hsflowd), which combines the measurements and streams standard sFlow telemetry to a remote collector.
sflow {
  collector { ip=192.0.2.1 udpport=6343 }
  psample { group=1 egress=on }
  dropmon { start=on limit=50 }
  vpp { }
}
The /etc/hsflowd.conf file above enables the modules needed to receive netlink messages from VPP and send the resulting sFlow telemetry to a collector at 192.0.2.1. See vpp-sflow for detailed instructions.
docker run -p 6343:6343/udp sflow/sflowtool
Run sflowtool on the sFlow collector host to verify that the data is being received and Continue reading
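If sflowtool isn't convenient, a few lines of Python provide the same basic confirmation that datagrams are reaching the collector on UDP port 6343; a minimal sketch that does not decode the sFlow payload:
import socket

# Listen for sFlow datagrams on the standard sFlow port and report their size.
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("0.0.0.0", 6343))
while True:
    data, (addr, port) = sock.recvfrom(65535)
    print(f"received {len(data)} byte sFlow datagram from {addr}")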
The latest version of the AI Metrics dashboard uses industry standard sFlow telemetry from network switches to monitor the number of trimmed packets as a congestion metric.
Ultra Ethernet Specification Update describes how the Ultra Ethernet Transport (UET) Protocol has the ability to leverage optional “packet trimming” in network switches, which allows packets to be truncated rather than dropped in the fabric during congestion events. As packet spraying causes reordering, it becomes more complicated to detect loss. Packet trimming gives the receiver and the sender an early explicit indication of congestion, allowing immediate loss recovery in spite of reordering, and is a critical feature in the low-RTT environments where UET is designed to operate.
cumulus@switch:~$ nv set system forwarding packet-trim profile packet-trim-default
cumulus@switch:~$ nv config apply
NVIDIA Cumulus Linux release 5.14 for NVIDIA Spectrum Ethernet Switches includes support for Packet Trimming. The above commands enable packet trimming, set the DSCP remark value to 11, set the truncation size to 256 bytes, set the switch priority to 4, and set the eligibility to all ports on the switch with traffic class 1, 2, and 3. NVIDIA BlueField host adapters respond to trimmed packets to ensure fast congestion recovery.
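To make the mechanism concrete, the hypothetical sketch below shows how a receiver can use the trim markings to trigger immediate, targeted recovery. The constants mirror the DSCP remark value (11) and truncation size (256 bytes) configured above; the function and parameter names are purely illustrative, not UET protocol code:
# Hypothetical illustration of receiver-side handling of a trimmed packet.
TRIM_DSCP = 11   # DSCP remark applied to trimmed packets (value from the config above)
TRIM_LEN = 256   # truncation size in bytes (value from the config above)

def handle_packet(dscp, length, seq, request_retransmit, deliver):
    if dscp == TRIM_DSCP and length <= TRIM_LEN:
        # The payload was trimmed in the fabric, but the header still identifies
        # exactly which packet was lost, so recovery starts immediately instead
        # of waiting for a timeout -- even when packet spraying reorders traffic.
        request_retransmit(seq)
    else:
        deliver(seq)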
Linux 6.11+ kernels provide TCX attachment points for eBPF programs to efficiently examine packets as they ingress and egress the host. The latest version of the open source Host sFlow agent includes support for TCX packet sampling, streaming industry standard sFlow telemetry to a central collector for network-wide visibility. For example, Deploy real-time network dashboards using Docker compose describes how to quickly set up a Prometheus database and use Grafana to build network dashboards.
static __always_inline void sample_packet(struct __sk_buff *skb, __u8 direction) {
  /* Look up the sampling rate configured for this interface. */
  __u32 key = skb->ifindex;
  __u32 *rate = bpf_map_lookup_elem(&sampling, &key);
  /* Drop out unless this packet is selected by the 1-in-N random sampling draw. */
  if (!rate || (*rate > 0 && bpf_get_prandom_u32() % *rate != 0))
    return;
  /* Build the sample record: metadata plus the initial packet header bytes. */
  struct packet_event_t pkt = {};
  pkt.timestamp = bpf_ktime_get_ns();
  pkt.ifindex = skb->ifindex;
  pkt.sampling_rate = *rate;
  pkt.ingress_ifindex = skb->ingress_ifindex;
  pkt.routed_ifindex = direction ? 0 : get_route(skb);
  pkt.pkt_len = skb->len;
  pkt.direction = direction;
  __u32 hdr_len = skb->len < MAX_PKT_HDR_LEN ? skb->len : MAX_PKT_HDR_LEN;
  if (hdr_len > 0 && bpf_skb_load_bytes(skb, 0, pkt.hdr, hdr_len) < 0)
    return;
  /* Push the sample to user space via the perf event buffer. */
  bpf_perf_event_output(skb, &events, BPF_F_CURRENT_CPU, &pkt, sizeof(pkt));
}
SEC("tcx/ingress")
int tcx_ingress(struct __sk_buff *skb) {
sample_packet(skb, 0);
return TCX_NEXT;
}
SEC("tcx/egress")
int tcx_egress(struct __sk_buff *skb) {
sample_packet(skb, 1);
return TCX_NEXT;
}
The sample.bpf.c file Continue reading
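Because sample_packet() exports only one in every rate packets, a consumer of the perf events must scale by the sampling rate to estimate traffic totals, the same extrapolation an sFlow collector performs. A minimal sketch of that scaling, assuming each event exposes the packet_event_t fields above as a dict:
# Scale sampled packet events up to estimated totals.
def estimate_totals(events):
    est_packets = 0
    est_bytes = 0
    for e in events:
        est_packets += e["sampling_rate"]                  # each sample stands in for ~rate packets
        est_bytes += e["sampling_rate"] * e["pkt_len"]     # scale bytes by the same factor
    return est_packets, est_bytes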
In this article we will use Multipass to create a virtual machine to experiment with pwru. Multipass is a command line tool for running Ubuntu virtual machines on Mac or Windows. Multipass uses the native virtualization capabilities of the host operating system to simplify the creation of virtual machines.
multipass launch --name=ebpf noble
multipass exec ebpf -- sudo apt update
multipass exec ebpf -- sudo apt -y install git clang llvm make libbpf-dev flex bison golang
multipass exec ebpf -- git clone https://github.com/cilium/pwru.git
multipass exec ebpf --working-directory pwru -- make
multipass exec ebpf -- sudo ./pwru/pwru -h
Run the commands above to create the virtual machine and build pwru from sources.
multipass exec ebpf -- sudo ./pwru/pwru port https
Run pwru to trace https traffic on the virtual machine.
multipass exec ebpf -- curl https://sflow-rt.com
In a second window, run the above command to generate an https request from the virtual machine.
SKB                CPU PROCESS          NETNS      MARK/x IFACE PROTO MTU LEN TUPLE FUNC
0xffff9fc40335a0e8 0   ~r/bin/curl:8966 4026531840 0      0 Continue reading
The metrics include:
Note: InfluxDB Cloud has a free service tier that can be used to test this example.
Save the following compose.yml file on a system running Docker.
configs:
  config.telegraf:
    content: |
      [agent]
        interval = '15s'
        round_interval = true
        omit_hostname = true
      [[outputs.influxdb_v2]]
        urls = ['https://<INFLUXDB_CLOUD_INSTANCE>.cloud2.influxdata.com']
Continue reading
The screen capture is from a containerlab topology that emulates an AI compute cluster connected by a leaf and spine network. The metrics include:
Note: Clicking on peaks in the charts shows values at that time.
This article gives step-by-step instructions to run the demonstration.
git clone https://github.com/sflow-rt/containerlab.git
Download the sflow-rt/containerlab project from GitHub.
git clone https://github.com/sflow-rt/containerlab.git
cd containerlab
./run-clab
Run the above commands Continue reading
The metrics include:
Note: Grafana Cloud has a free service tier that can be used to test this example.
Today the following network operating systems include support for the drop notification extension in their sFlow agent implementations:
Two additional sFlow dropped packet notification implementations are in the pipeline and should be available later this year:
The metrics include:
This article gives step-by-step instructions to set up the dashboard in a production environment.
git clone https://github.com/sflow-rt/prometheus-grafana.git
sed -i -e 's/prometheus/ai-metrics/g' prometheus-grafana/env_vars
./prometheus-grafana/start.sh
The easiest way to get started is to use Docker (see Deploy real-time network dashboards using Docker compose) and deploy the sflow/ai-metrics image, which bundles the AI Metrics application used to generate the metrics.
scrape_configs:
  - job_name: 'sflow-rt-ai-metrics'
    metrics_path: /app/ai-metrics/scripts/metrics.js/prometheus/txt
    scheme: Continue reading
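Before wiring up Prometheus, it can be useful to confirm the exporter is working by fetching the metrics_path above directly. A minimal sketch, assuming sFlow-RT with the ai-metrics application is listening on localhost:8008:
import urllib.request

# Fetch the same endpoint Prometheus scrapes and print the first few metrics.
URL = "http://localhost:8008/app/ai-metrics/scripts/metrics.js/prometheus/txt"
with urllib.request.urlopen(URL) as resp:
    for line in resp.read().decode().splitlines()[:10]:
        print(line)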
The availability of Cisco IOS XR Release 25.1.1 brings sFlow dropped packet notification support to Cisco 8000 series routers, making it easy to capture and analyze packets dropped at router ingress. This aids in understanding blocked traffic types, identifying potential security threats, and optimizing network performance.
sFlow Configuration for Traffic Monitoring and Analysis describes the steps to enable sFlow and configure packet sampling and interface counter export from a Cisco 8000 Series router to a remote sFlow analyzer.
Note: Devices using NetFlow or IPFIX must transition to sFlow for regular sampling before utilizing the dropped packet feature, ensuring compatibility and consistency in data analysis.
Router(config)#monitor-session monitor1
Router(config)#destination sflow EXP-MAP
Router(config)#forward-drops rx
Configure a monitor-session with the new destination sflow option to export dropped packet notifications (which include ingress interface, drop reason, and header of dropped packet) to the configured sFlow analyzer.
Cisco lists the following benefits of streaming dropped packets in the configuration guide:
This article goes beyond simulation to demonstrate the AI Metrics dashboard by comparing live traffic seen in two production AI clusters.
docker run --rm -p 6343:6343/udp sflow/sflowtool -M > sflow.pcap
Either compile the latest version of sflowtool or, as shown above, use Docker to run the pre-built sflow/sflowtool image. The -M option outputs the whole UDP datagrams received to standard output. In either case, type CTRL+C to end the capture.
The dashboard shown above is from a simulated network of 1,000 switches, each with 48 access ports connected to a host. Activity occurs in a 256mS on / off cycle to emulate an AI learning run. The metrics include:
Note: Clicking on peaks in the charts shows values at that time.
This article gives step-by-step instructions to run the AI Metrics application in a production environment and integrate Continue reading
sudo tcpdump -i any -s 0 -w sflow.pcap udp port 6343
Run the command above on the system you are using to collect sFlow data (if you aren't yet collecting sFlow, see Agents for suggested configuration settings). Type Control-C to end the capture after 5 to 10 minutes. Copy the resulting sflow.pcap file to your laptop.
docker run --rm -it -v $PWD/sflow.pcap:/sflow.pcap sflow/sflowtool \
  -r /sflow.pcap -P 1
Either compile the latest version of sflowtool or, as shown above, use Docker to run the pre-built sflow/sflowtool image. The -P (Playback) option replays the trace in real time and displays the contents of each sFlow message. Running sflowtool using Docker provides additional examples, including converting the sFlow messages into JSON format for processing by a Python script.
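As an illustration of that kind of Python processing, the sketch below tallies samples per sFlow agent from line-delimited JSON piped to standard input, assuming sflowtool is run in its JSON output mode; the "agent" and "samples" field names are assumptions to be checked against the actual output:
import json
import sys
from collections import Counter

# Count samples per sFlow agent from line-delimited JSON on stdin.
counts = Counter()
for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    try:
        msg = json.loads(line)
    except json.JSONDecodeError:
        continue  # ignore partial or non-JSON lines
    counts[msg.get("agent", "unknown")] += len(msg.get("samples", []))

for agent, n in counts.most_common():
    print(agent, n)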
docker run --rm -it -v $PWD/sflow.pcap:/sflow.pcap sflow/sflowtool \
  -r /sflow.pcap -f 192.168.4.198/6343 -P 1
The -f (forwarding) option takes an IP address and UDP port number as arguments, in this Continue reading
NVIDIA Cumulus Linux 5.11 for AI / ML describes how NVIDIA 400/800G Spectrum-X switches combined with the latest Cumulus Linux release deliver enhanced real-time telemetry that is particularly relevant to the AI / machine learning workloads that Spectrum-X switches are designed to handle.
This article shows how to extract Topology from an NVIDIA fabric in order to perform advanced fabric aware analytics, for example: detect flow collisions, trace flow paths, and de-duplicate traffic.
In this example, we will use NVIDIA NetQ, a highly scalable, modern network operations toolset that provides visibility, troubleshooting, and validation of your Cumulus and SONiC fabrics in real time.
netq show lldp json
For example, the NetQ Link Layer Discovery Protocol (LLDP) service simplifies the task of gathering neighbor data from switches in the network and, with the json option, makes the output easy to process with a Python script, for example, lldp-rt.py.
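The sketch below gives a flavor of the conversion a script like lldp-rt.py performs: turning the LLDP neighbor records into node/link JSON for fabric-aware analytics in sFlow-RT. The LLDP field names and the topology schema shown are assumptions; check lldp-rt.py and the sFlow-RT topology documentation for the exact formats:
import json
import subprocess

# Run the NetQ command above and parse its JSON output (structure assumed to be
# a list of neighbor entries).
lldp = json.loads(subprocess.check_output(["netq", "show", "lldp", "json"]))

# Build a link table pairing each local switch/port with its LLDP neighbor.
links = {}
for i, entry in enumerate(lldp):
    links["link%d" % i] = {
        "node1": entry["hostname"],     "port1": entry["ifname"],
        "node2": entry["peerHostname"], "port2": entry["peerIfname"],
    }

print(json.dumps({"links": links}, indent=2))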
The simplest way to try sFlow-RT is to use the pre-built sflow/topology Docker image that packages sFlow-RT with additional applications that are useful for monitoring network topologies.
docker run -p 6343:6343/udp -p 8008:8008 sflow/topology
Configure Cumulus Linux to stream sFlow telemetry to sFlow-RT on UDP port 6343 (the default for Continue reading
The SC24 WAN Stress Test chart shows 10.3 Terabits per second of WAN traffic to The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC24), held this week in Atlanta. The conference network used in the demonstration, SCinet, is described as the most powerful and advanced network on Earth, connecting the SC community to the world.
SC24 Real-time RoCEv2 traffic visibility describes a demonstration of wide area network bulk data transmission using RDMA over Converged Ethernet (RoCEv2) flows typically seen in AI/ML data centers. In the example, sustained transmissions of 3.2Tbits/second from sources geographically distributed around the United States were demonstrated.
SC24 Dropped packet visibility demonstration shows how the sFlow data model integrates three telemetry streams: counters, packet samples, and packet drop notifications. Each type of data is useful on its own, but together they provide the comprehensive network-wide observability needed to drive automation. Real-time network visibility is particularly relevant to AI / ML data center networks, where congestion and dropped packets can result in serious performance degradation; in this screen capture you can see multiple 400Gbits/s RoCEv2 flows.
SC24 SCinet traffic describes the architecture of the real-time monitoring system used to Continue reading