Peter, Author at NetworkingNexus.net

Peter

Author Archives: Peter

Linux packet sampling using eBPF

Linux 6.11+ kernels provide TCX attachment points for eBPF programs to efficiently examine packets as they ingress and egress the host. The latest version of the open source Host sFlow agent includes support for TCX packet sampling to stream industry standard sFlow telemetry to a central collector for network wide visibility, e.g. Deploy real-time network dashboards using Docker compose describes how to quickly set up a Prometheus database and use Grafana to build network dashboards.

static __always_inline void sample_packet(struct __sk_buff *skb, __u8 direction) {
    __u32 key = skb->ifindex;
    __u32 *rate = bpf_map_lookup_elem(&sampling, &key);
    if (!rate || (*rate > 0 && bpf_get_prandom_u32() % *rate != 0))
        return;

    struct packet_event_t pkt = {};
    pkt.timestamp = bpf_ktime_get_ns();
    pkt.ifindex = skb->ifindex;
    pkt.sampling_rate = *rate;
    pkt.ingress_ifindex = skb->ingress_ifindex;
    pkt.routed_ifindex = direction ? 0 : get_route(skb);
    pkt.pkt_len = skb->len;
    pkt.direction = direction;

    __u32 hdr_len = skb->len < MAX_PKT_HDR_LEN ? skb->len : MAX_PKT_HDR_LEN;
    if (hdr_len > 0 && bpf_skb_load_bytes(skb, 0, pkt.hdr, hdr_len) < 0)
        return;
    bpf_perf_event_output(skb, &events, BPF_F_CURRENT_CPU, &pkt, sizeof(pkt));
}

SEC("tcx/ingress")
int tcx_ingress(struct __sk_buff *skb) {
    sample_packet(skb, 0);

    return TCX_NEXT;
}

SEC("tcx/egress")
int tcx_egress(struct __sk_buff *skb) {
    sample_packet(skb, 1);

    return TCX_NEXT;
}

The sample.bpf.c file Continue reading

Tracing network packets with eBPF and pwru

pwru (packet, where are you?) is an open source tool from Cilium that used eBPF instrumentation in recent Linux kernels to trace network packets through the kernel.

In this article we will use Multipass to create a virtual machine to experiment with pwru. Multipass is a command line tool for running Ubuntu virtual machines on Mac or Windows. Multipass uses the native virtualization capabilities of the host operating system to simplify the creation of virtual machines.

multipass launch --name=ebpf noble
multipass exec ebpf -- sudo apt update
multipass exec ebpf -- sudo apt -y install git clang llvm make libbpf-dev flex bison golang
multipass exec ebpf -- git clone https://github.com/cilium/pwru.git
multipass exec ebpf --working-directory pwru -- make
multipass exec ebpf -- sudo ./pwru/pwru -h

Run the commands above to create the virtual machine and build pwru from sources.

multipass exec ebpf -- sudo ./pwru/pwru port https

Run pwru to trace https traffic on the virtual machine.

multipass exec ebpf -- curl https://sflow-rt.com

In a second window, run the above command to generate an https request from the virtual machine.

SKB                CPU PROCESS          NETNS      MARK/x        IFACE       PROTO  MTU   LEN   TUPLE FUNC
0xffff9fc40335a0e8 0   ~r/bin/curl:8966 4026531840 0               0          Continue reading

AI Metrics with InfluxDB Cloud

The InfluxDB AI Metrics dashboard shown above tracks performance metrics for AI/ML RoCEv2 network traffic, for example, large scale CUDA compute tasks using NVIDIA Collective Communication Library (NCCL) operations for inter-GPU communications: AllReduce, Broadcast, Reduce, AllGather, and ReduceScatter.

The metrics include:

Total Traffic Total traffic entering fabric
Operations Total RoCEv2 operations broken out by type
Core Link Traffic Histogram of load on fabric links
Edge Link Traffic Histogram of load on access ports
RDMA Operations Total RDMA operations
RDMA Bytes Average RDMA operation size
Credits Average number of credits in RoCEv2 acknowledgements
Period Detected period of compute / exchange activity on fabric (in this case just over 0.5 seconds)
Congestion Total ECN / CNP congestion messages
Errors Total ingress / egress errors
Discards Total ingress / egress discards
Drop Reasons Packet drop reasons

This article shows how to integrate with InfluxDB Cloud instead of running the services locally.

Note: InfluxDB Cloud has a free service tier that can be used to test this example.

Save the following compose.yml file on a system running Docker.

configs:
  config.telegraf:
    content: |
      [agent]
        interval = '15s'
        round_interval = true
        omit_hostname = true
      [[outputs.influxdb_v2]]
        urls = ['https://<INFLUXDB_CLOUD_INSTANCE>.cloud2.influxdata.com']
         Continue reading

AI network performance monitoring using containerlab

AI Metrics is available on GitHub. The application provides performance metrics for AI/ML RoCEv2 network traffic, for example, large scale CUDA compute tasks using NVIDIA Collective Communication Library (NCCL) operations for inter-GPU communications: AllReduce, Broadcast, Reduce, AllGather, and ReduceScatter.

The screen capture is from a containerlab topology that emulates a AI compute cluster connected by a leaf and spine network. The metrics include:

Total Traffic Total traffic entering fabric
Operations Total RoCEv2 operations broken out by type
Core Link Traffic Histogram of load on fabric links
Edge Link Traffic Histogram of load on access ports
RDMA Operations Total RDMA operations
RDMA Bytes Average RDMA operation size
Credits Average number of credits in RoCEv2 acknowledgements
Period Detected period of compute / exchange activity on fabric (in this case just over 0.5 seconds)
Congestion Total ECN / CNP congestion messages
Errors Total ingress / egress errors
Discards Total ingress / egress discards
Drop Reasons Packet drop reasons

Note: Clicking on peaks in the charts shows values at that time.

This article gives step-by-step instructions to run the demonstration.

git clone https://github.com/sflow-rt/containerlab.git

Download the sflow-rt/containerlab project from GitHub.

git clone https://github.com/sflow-rt/containerlab.git
cd containerlab
./run-clab

Run the above commands Continue reading

AI Metrics with Grafana Cloud

The Grafana AI Metrics dashboard shown above tracks performance metrics for AI/ML RoCEv2 network traffic, for example, large scale CUDA compute tasks using NVIDIA Collective Communication Library (NCCL) operations for inter-GPU communications: AllReduce, Broadcast, Reduce, AllGather, and ReduceScatter.

The metrics include:

Total Traffic Total traffic entering fabric
Operations Total RoCEv2 operations broken out by type
Core Link Traffic Histogram of load on fabric links
Edge Link Traffic Histogram of load on access ports
RDMA Operations Total RDMA operations
RDMA Bytes Average RDMA operation size
Credits Average number of credits in RoCEv2 acknowledgements
Period Detected period of compute / exchange activity on fabric (in this case just over 0.5 seconds)
Congestion Total ECN / CNP congestion messages
Errors Total ingress / egress errors
Discards Total ingress / egress discards
Drop Reasons Packet drop reasons

AI Metrics with Prometheus and Grafana describes how to stand up an analytics stack with Prometheus and Grafana to track performance metrics for an AI/ML GPU cluster. This article shows how to integrate with Prometheus and Grafana hosted in the cloud, Grafana Cloud, instead of running the services locally.

Note: Grafana Cloud has a free service tier that can be used to test this example.

Multi-vendor support for dropped packet notifications

The sFlow Dropped Packet Notification Structures extension was published in October 2020. Extending sFlow to provide visibility into dropped packets offers significant benefits for network troubleshooting, providing real-time network wide visibility into the specific packets that were dropped as well the reason the packet was dropped. This visibility instantly reveals the root cause of drops and the impacted connections. Packet discard records complement sFlow's existing counter polling and packet sampling mechanisms and share a common data model so that all three sources of data can be correlated, for example, packet sampling reveals the top consumers of bandwidth on a link, helping to get to the root cause of congestion related packet drops reported for the link.

Today the following network operating systems include support for the drop notification extension in their sFlow agent implementations:

Arista Dropped packet notifications with Arista Networks
Cisco Dropped packet notifications with Cisco 8000 Series Routers
Linux Linux as a network operating system Using sFlow to monitor dropped packets
NVIDIA NVIDIA Cumulus Linux 5.11 for AI / ML
VyOS VyOS 1.4 LTS released

Two additional sFlow dropped packet notification implementations are in the pipeline and should be available later this year:

SONiC The Switch Continue reading

AI Metrics with Prometheus and Grafana

The metrics include:

Total Traffic Total traffic entering fabric
Operations Total RoCEv2 operations broken out by type
Core Link Traffic Histogram of load on fabric links
Edge Link Traffic Histogram of load on access ports
RDMA Operations Total RDMA operations
RDMA Bytes Average RDMA operation size
Credits Average number of credits in RoCEv2 acknowledgements
Period Detected period of compute / exchange activity on fabric (in this case just over 0.5 seconds)
Congestion Total ECN / CNP congestion messages
Errors Total ingress / egress errors
Discards Total ingress / egress discards
Drop Reasons Packet drop reasons

This article gives step-by-step instructions to set up the dashboard in a production environment.

git clone https://github.com/sflow-rt/prometheus-grafana.git
cd prometheus-grafana
env RT_IMAGE=ai-metrics ./start.sh

The easiest way to get started is to use Docker, see Deploy real-time network dashboards using Docker compose, and deploy the sflow/ai-metrics image bundling the AI Metrics application to generate metrics.

scrape_configs:
  - job_name: 'sflow-rt-ai-metrics'
    metrics_path: /app/ai-metrics/scripts/metrics.js/prometheus/txt
    scheme: http
     Continue reading

Dropped packet notifications with Cisco 8000 Series Routers

The availability of the Cisco IOS XR Release 25.1.1 brings sFlow dropped packet notification support to Cisco 8000 series routers, making it easy to capture and analyze packets dropped at router ingress, aiding in understanding blocked traffic types, identifying potential security threats, and optimizing network performance.

sFlow Configuration for Traffic Monitoring and Analysis describes the steps to enable sFlow and configure packet sampling and interface counter export from a Cisco 8000 Series router to a remote sFlow analyzer.

Note: Devices using NetFlow or IPFIX must transition to sFlow for regular sampling before utilizing the dropped packet feature, ensuring compatibility and consistency in data analysis.

Router(config)#monitor-session monitor1
Router(config)#destination sflow EXP-MAP
Router(config)#forward-drops rx

Configure a monitor-session with the new destination sflow option to export dropped packet notifications (which include ingress interface, drop reason, and header of dropped packet) to the configured sFlow analyzer.

Cisco lists the following benefits of streaming dropped packets in the configuration guide:

Enhanced Network Visibility: Captures and forwards dropped packets to an sFlow collector, providing detailed insights into packet loss and improving diagnostic capabilities.
Comprehensive Analysis: Allows for simultaneous analysis of regular and dropped packet flows, offering a holistic view of network performance.
Troubleshooting: Empowers Continue reading

Comparing AI / ML activity from two production networks

AI Metrics describes how to deploy the open source ai-metrics application. The application provides performance metrics for AI/ML RoCEv2 network traffic, for example, large scale CUDA compute tasks using NVIDIA Collective Communication Library (NCCL) operations for inter-GPU communications: AllReduce, Broadcast, Reduce, AllGather, and ReduceScatter. The screen capture from the article (above) shows results from a simulated 48,000 GPU cluster.

This article goes beyond simulation to demonstrate the AI Metrics dashboard by comparing live traffic seen in two production AI clusters.

Cluster 1

This cluster consists of 250 GPUs connected via 100G ports to single large switch. The results are pretty consistent with simulation from the original article. In this case there is no Core Link Traffic because the cluster consists of a single switch. The Discards chart shows a burst of Out (egress) discards and the Drop Reasons chart gives the reason as ingress_vlan_filter. The Total Traffic, Operations, Edge Link Traffic, and RDMA Operations charts all show a transient drop in throughput coincident with the discard spike. Further details of the dropped packets, such as source/destination address, operation, ingress / egress port, QP pair, etc. can be extracted from the sFlow Dropped Packet Notifications that are populating Continue reading

Capture to pcap file using sflowtool

Replay pcap files using sflowtool describes how to capture sFlow datagrams using tcpdump and replay them in real time using sflowtool. However, using tcpdump for the capture has the downside of requiring root privileges. A recent update to sflowtool now makes it possible to use sflowtool to capture sFlow datagrams in tcpdump pcap format without the need for root access.

docker run --rm -p 6343:6343/udp sflow/sflowtool -M > sflow.pcap

Either compile the latest version of sflowtool or, as shown above, use Docker to run the pre-built sflow/sflowtool image. The -M option outputs whole UDP datagrams received to standard output. In either case, type CNTRL + C to end the capture.

AI Metrics

The dashboard shown above is from a simulated network 1,000 switches, each with 48 ports access ports connected to a host. Activity occurs in a 256mS on / off cycle to emulate an AI learning run. The metrics include:

Total Traffic Total traffic entering fabric
Operations Total RoCEv2 operations broken out by type
Core Link Traffic Histogram of load on fabric links
Edge Link Traffic Histogram of load on access ports
RDMA Operations Total RDMA operations
RDMA Bytes Average RDMA operation size
Credits Average number of credits in RoCEv2 acknowledgements
Period Detected period of compute / exchange activity on fabric (in this case just over 0.5 seconds)
Congestion Total ECN / CNP congestion messages
Errors Total ingress / egress errors
Discards Total ingress / egress discards
Drop Reasons Packet drop reasons

Note: Clicking on peaks in the charts shows values at that time.

This article gives step-by-step instructions to run the AI Metrics application in a production environment and integrate Continue reading

Replay pcap files using sflowtool

It can be very useful to capture sFlow telemetry from production networks so that it can be replayed later to perform off-line analysis, or to develop or evaluate sFlow collection tools.

sudo tcpdump -i any -s 0 -w sflow.pcap udp port 6343

Run the command above on the system you are using to collect sFlow data (if you aren't yet collecting sFlow, see Agents for suggested configuration settings). Type Control-C to end the capture after 5 to 10 minutes. Copy the resulting sflow.pcap file to your laptop.

docker run --rm -it -v $PWD/sflow.pcap:/sflow.pcap sflow/sflowtool \
  -r /sflow.pcap -P 1

Either compile the latest version of sflowtool or, as shown above, use Docker to run the pre-built sflow/sflowtool image. The -P (Playback) option replays the trace in real-time and displays the contents of each sFlow message. Running sflowtool using Docker provides additional examples, including converting the sFlow messages into JSON format for processing by a Python script.

docker run --rm -it -v $PWD/sflow.pcap:/sflow.pcap sflow/sflowtool \
  -r /sflow.pcap -f 192.168.4.198/6343 -P 1

The -f (forwarding) option takes an IP address and UDP port number as arguments, in this Continue reading

Topology aware flow analytics with NVIDIA NetQ

NVIDIA Cumulus Linux 5.11 for AI / ML describes how NVIDIA 400/800G Spectrum-X switches combined with the latest Cumulus Linux release deliver enhanced real-time telemetry that is particularly relevant to the AI / machine learning workloads that Spectrum-X switches are designed to handle.

This article shows how to extract Topology from an NVIDIA fabric in order to perform advanced fabric aware analytics, for example: detect flow collisions, trace flow paths, and de-duplicate traffic.

In this example, we will use NVIDIA NetQ, a highly scalable, modern network operations toolset that provides visibility, troubleshooting, and validation of your Cumulus and SONiC fabrics in real time.

netq show lldp json

For example, the NetQ Link Layer Discovery Protocol (LLDP) service simplifies the task of gathering neighbor data from switches in the network, and with the json option, makes the output easy to process with a Python script, for example, lldp-rt.py.

The simplest way to try sFlow-RT is to use the pre-built sflow/topology Docker image that packages sFlow-RT with additional applications that are useful for monitoring network topologies.

docker run -p 6343:6343/udp -p 8008:8008 sflow/topology

Configure Cumulus Linux to steam sFlow telemetry to sFlow-RT on UDP port 6343 (the default for Continue reading

SC24 Over 10 Terabits per Second of WAN Traffic

The SC24 WAN Stress Test chart shows 10.3 Terabits bits per second of WAN traffic to the The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC24) conference held this week in Atlanta. The conference network used in the demonstration, SCinet, is described as the most powerful and advanced network on Earth, connecting the SC community to the world.

SC24 Real-time RoCEv2 traffic visibility describes a demonstration of wide area network bulk data transmission using RDMA over Converged Ethernet (RoCEv2) flows typically seen in AI/ML data centers. In the example, 3.2Tbits/second sustained trasmissions from sources geographically distributed around the United States was demonstrated.

SC24 Dropped packet visibility demonstration shows how the sFlow data model integrates three telemetry streams: counters, packet samples, and packet drop notifications. Each type of data is useful on its own, but together they provide the comprehensive network wide observability needed to drive automation. Real-time network visibility is particularly relevant to AI / ML data center networks where congestion and dropped packets can result in serious performance degradation and in this screen capture you can see multiple 400Gbits/s RoCEv2 flows.

SC24 SCinet traffic describes the architecture of the real-time monitoring system used to Continue reading

SC24 Real-time RoCEv2 traffic visibility

The chart shows eight 400Gbits/s RDMA over Converged Ethernet (RoCEv2) flows, typically seen in AI / ML data centers, totaling 3.2 Tbits/s. The unique challenge in this case is that flows are being routed from locations scattered around the United States to Atlanta, the location of the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC24) conference.

SC24 Network Research Exhibit: The Resiliant, Performant Networks and Distributed Processing demonstration aims to explore performance limitations and enablers for high volume bulk data tranfers. Maintaining stable 400Gbits/s RoCEv2 connections over a wide area network is challenging since the packets have to traverse multiple links, avoid contention on links, and deal with buffering associated with transmission latency that is orders of magnitude higher than data center environments where RoCEv2 is typically deployed (one way latency across the USA is a minimum of 16 milliseconds due to speed of light, but in practice the latency is quite a bit larger, on the other hand latency across a leaf and spine data center fabric is measured in microseconds).

During setup it was noticed that total throughput with 8 concurrent flows was only 2.7Tbits/s (instead of the 3Tbits/second plus expected). Examining a Continue reading

SC24 SCinet traffic

The real-time dashboard shows total network traffic at The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC24) conference being held this week in Atlanta. The dashboard shows that 31 Petabytes of data have been transferred already and the conference has just started.

The conference network used in the demonstration, SCinet, is described as the most powerful and advanced network on Earth, connecting the SC community to the world.

In this example, the sFlow-RT real-time analytics engine receives sFlow telemetry from switches, routers, and servers in the SCinet network and creates metrics to drive the real-time charts in the dashboard. Getting Started provides a quick introduction to deploying and using sFlow-RT for real-time network-wide flow analytics.

Finally, check out the SC24 Dropped packet visibility demonstration to learn about one of newest developments in sFlow monitoring and see a live demonstration.

NVIDIA Cumulus Linux 5.11 for AI / ML

NVIDIA Cumulus Linux 5.11 includes major upgrades to the sFlow agent that fully exposes the advanced instrumentation built into NVIDIA Spectrum-X silicon. The enhanced real-time telemetry is particularly relevant to the AI / machine learning workloads that Spectrum-X is designed to handle.

With Cumulus Linux 5.11, the sFlow agent is easily configured using nvue commands, see Monitoring System Statistics and Network Traffic with sFlow:

nv set system sflow dropmon hw
nv set system sflow poll-interval 20
nv set system sflow collector 192.0.2.1
nv set system sflow state enabled
nv config apply

Note: In this case, enabling dropmon ensures that every dropped packet is captured, along with ingress port and drop reason (e.g. ttl_exceeded).

The same commands should be applied to every switch in the fabric for comprehensive visibility.

RDMA over Converged Ethernet (RoCE) describes how sFlow provides detailed visibility into RoCE flows used to move data between GPUs in an AI / ML data center fabric. The chart above from the RDMA network visibility demonstration at the SC22 conference shows that sFlow monitoring easily scales to the 400/800G speeds needed for machine learning.

In this example, the sFlow-RT real-time analytics engine receives sFlow Continue reading

SC24 Dropped packet visibility demonstration

The real-time dashboard is a joint InMon / Arista demonstration at The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC24) conference being held this week in Atlanta.

The conference network used in the demonstration, SCinet, is described as the most powerful and advanced network on Earth, connecting the SC community to the world.

The sFlow Packet Drop Monitoring In High Performance Networks dashboard combines telemetry from all the Arista switches in the SCinet network to provide real-time network-wide view of performance. Each of the three charts demonstrate a different type of measurement in the sFlow telemetry stream:

Counters: Total Traffic shows total traffic calculated from interface counters streamed from all interfaces. Counters provide a useful way of accurately reporting byte, frame, error and discard counters for each network interface. In this case, the chart rolls up data from all interfaces to trend total traffic on the network.
Samples: Top Flows shows the top 5 largest traffic flows traversing the network. The chart is based on sFlow's random packet sampling mechanism, providing a scaleable method of determining the hosts and services responsible for the traffic reported by the counters. Visibility into top flows is essential if one Continue reading

Worldwide deployment of real-time flow analytics

Industry standard sFlow telemetry is widely supported by network equipment vendors and network management platforms. However, the advent of real-time sFlow analytics has opened up a range of new applications for sFlow. The map above shows the proportion of sFlow-RT instances running in each of the over 70 countries in which it is deployed.

The following use cases are driving current deployments:

Augmenting observability dashboards with real-time network analytics, see Flow metrics with Prometheus and Grafana
Monitoring Internet Exchange Points, see Internet eXchange Provider (IXP) Metrics
Automated DDoS mitigation, see DDoS protection quickstart guide

Addressing the challenge of operating AI / ML clusters is the emerging application for sFlow visibility. High speed (400/800G) data center switches needed to handle machine learning traffic flows include sFlow agents and real-time analytics are essential to optimize the network so that expensive GPU and compute resources are fully utilized, see Leveraging open technologies to monitor packet drops in AI cluster fabrics.

If you would like to see how real-time network analytics can transform network operations, Getting Started describes how to download and configure sFlow-RT analytics software for use in your network, or how to try it out using an emulator, or pre-captured data.

Leveraging open technologies to monitor packet drops in AI cluster fabrics

In this talk from the recent OCP Global Summit, Aldrin Isaac, eBay, describes the challenge, AI clusters operate most efficiently over lossless networks for optimum job completion times which can be significantly impacted by dropped packets. Although networks can be designed to minimize packet loss by choosing the right network topology, optimizing network devices and protocols, an effective monitoring and troubleshooting network performance tool is still required. Such tool should capture packet drops, raise notifications and identify various drop reasons and pin point where the drops caused congestions. In turn, it allows the governing management application to tune configurations of relevant infrastructure components, including switches, NICs and GPU servers.

The talk shares the results and best practices of a TAM (Telemetry and Monitoring) solution being prepared for deployment at eBay. It leverages OCP’s SAI and open sFlow drop notification technologies as part of eBay’s ongoing initiatives to adopt open networking hardware and community SONiC for its data centers.

The sFlow Dropped Packet Notification Structures extension mentioned in the talk adds real-time packet drop notifications (including dropped packet header and drop reason) as part of an industry standard sFlow telemetry feed, making the data available to open source and commercial sFlow analytics Continue reading

1 2 3 … 15 Next »