
Author Archives: Toni Pasanen

Ultra Ethernet: Inflight Bytes and CWND Adjustment

Inflight Packet Adjustment

Figure 6-12 depicts the ACK_CC header structure and fields. When NSCC is enabled in the UET node, the PDS must set pds.type to ACK_CC in the prologue header, which serves as the common header structure for all PDS messages. Within the PDS ACK_CC header itself, the pds.cc_type must be set to CC_NSCC. The pds.ack_cc_state field carries the values for service_time, rc (restore congestion CWND), rcv_cwnd_pend, and received_bytes. The sender uses the received_bytes parameter to compute the updated inflight-bytes state.

The CCC computes the reduction in the inflight state by subtracting the received_bytes value carried in the previous ACK_CC message from the received_bytes value carried in the latest ACK_CC message. As illustrated in Figure 6-12, the inflight state is decreased by 4,096 bytes, the delta between 16,384 and 12,288 bytes.

Recap: to transmit data to the network, the number of inflight bytes must be less than the CWND size.
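The bookkeeping above can be sketched in C. This is a minimal illustration; the struct fields and function names are assumptions, not definitions from the UET specification:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical CCC state; the spec's actual layout may differ. The CCC
 * tracks the cumulative received_bytes reported in ACK_CC messages and
 * shrinks its inflight estimate by the delta between reports. */
struct ccc_state {
    uint32_t inflight_bytes;   /* bytes sent but not yet acknowledged */
    uint32_t cwnd_bytes;       /* shared congestion window */
    uint32_t last_rcvd_bytes;  /* received_bytes from the previous ACK_CC */
};

/* Apply the received_bytes field of a newly arrived ACK_CC message. */
static void ccc_on_ack_cc(struct ccc_state *s, uint32_t rcvd_bytes)
{
    uint32_t delta = rcvd_bytes - s->last_rcvd_bytes; /* e.g. 16384 - 12288 */
    s->inflight_bytes -= delta;
    s->last_rcvd_bytes = rcvd_bytes;
}

/* A packet may be sent only while inflight stays below the CWND. */
static int ccc_may_send(const struct ccc_state *s, uint32_t pkt_bytes)
{
    return s->inflight_bytes + pkt_bytes <= s->cwnd_bytes;
}
```

With the values from Figure 6-12 (latest ACK_CC reporting 16,384 bytes against a previous 12,288), the inflight state drops by 4,096 bytes.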



Figure 6-12: NSCC: Inflight Bytes adjustment.


CWND Adjustment


A single, shared Congestion Window (CWND) regulates the total volume of bytes across all PDCs that are permitted for transmission to the backend network. The transport rate and network performance are continuously monitored and Continue reading

Ultra Ethernet: Network-Signaled Congestion Control (NSCC) – Overview

Network-Signaled Congestion Control (NSCC)


The Network-Signaled Congestion Control (NSCC) algorithm operates on the principle that the network fabric itself is the best source of truth regarding congestion. Rather than waiting for packet loss to occur, NSCC relies on proactive feedback from switches to adjust transmission rates in real time. The primary mechanism for this feedback is Explicit Congestion Notification (ECN) marking. When a switch interface's egress queue begins to build up, it employs Random Early Detection (RED) logic to mark specific packets. Once the buffer’s Minimum Threshold is crossed, the switch begins randomly marking packets by setting the last two bits of the IP header’s Type of Service (ToS) field to the CE (11) state. If the congestion worsens and the Maximum Threshold is reached, every packet passing through that interface is marked, providing a clear and urgent signal to the endpoints.
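The RED marking behavior described above can be sketched as follows. The thresholds and the percentage-based probability are illustrative assumptions for this sketch, not values mandated by the UET specification:

```c
#include <assert.h>

/* RED-style ECN marking sketch: below min_th never mark, at or above
 * max_th mark every packet, and in between mark with a probability that
 * rises linearly toward max_p_pct. Returns the marking probability in
 * percent (0-100). */
static int red_mark_prob_pct(int qdepth, int min_th, int max_th, int max_p_pct)
{
    if (qdepth <= min_th)
        return 0;                 /* queue healthy: no marking */
    if (qdepth >= max_th)
        return 100;               /* severe congestion: every packet gets CE */
    return (qdepth - min_th) * max_p_pct / (max_th - min_th);
}
```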

The practical impact of this mechanism is best illustrated by a hash collision event, such as the one shown in Figure 6-10. In this scenario, multiple GPUs on the left-hand side of the fabric transmit data at line rate. Due to the specific entropy of these flows, the ECMP hashing algorithms on leaf switches 1A-1 and 1A-2 Continue reading

Ultra Ethernet: Congestion Control Context

 Ultra Ethernet Transport (UET) uses a vendor-neutral, sender-specific congestion window–based congestion control mechanism together with flow-based, adjustable entropy-value (EV) load balancing to manage incast, outcast, local, link, and network congestion events. Congestion control in UET is implemented through coordinated sender-side and receiver-side functions to enforce end-to-end congestion control behavior.

On the sender side, UET relies on the Network-Signaled Congestion Control (NSCC) algorithm. Its main purpose is to regulate how quickly packets are transmitted by a Packet Delivery Context (PDC). The sender adapts its transmission window based on round-trip time (RTT) measurements and Explicit Congestion Notification (ECN) Congestion Experienced (CE) feedback conveyed through acknowledgments from the receiver.

On the receiver side, Receiver Credit-based Congestion Control (RCCC) limits incast pressure by issuing credits to senders. These credits define how much data a sender is permitted to transmit toward the receiver. The receiver also observes ECN-CE markings in incoming packets to detect path congestion. When congestion is detected, the receiver can instruct the sender to change the entropy value, allowing traffic to be steered away from congested paths.

Both sender-side and receiver-side mechanisms ultimately control congestion by limiting the amount of in-flight data, meaning data that has been sent but not yet acknowledged. Continue reading

UET Congestion Management: CCC Base RTT

Calculating Base RTT

[Edit: January 7, 2026: RTT role in the CWND adjustment process]

As described in the previous section, the Bandwidth-Delay Product (BDP) is a baseline value used when setting the maximum size (MaxWnd) of the Congestion Window (CWND). The BDP is calculated by multiplying the lowest link speed among the source and destination nodes by the Base Round-Trip Time (Base_RTT).
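The BDP arithmetic can be sketched in C. The link speed and Base_RTT values in the test below are illustrative, not taken from the book's example network:

```c
#include <assert.h>
#include <stdint.h>

/* BDP = slowest link speed along the path (bits/s) x Base_RTT, expressed
 * in bytes. Using nanoseconds for the RTT keeps the math in integers. */
static uint64_t bdp_bytes(uint64_t link_bps, uint64_t base_rtt_ns)
{
    /* bits/s x ns = bits x 1e-9; divide by 8 to convert bits to bytes */
    return link_bps * base_rtt_ns / 8 / 1000000000ULL;
}
```

For example, a 100-Gbps link with a 10-µs Base_RTT yields a BDP of 125,000 bytes.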

In addition to its role in BDP calculation, Base_RTT plays a key role in the CWND adjustment process. During operation, the RTT measured for each packet is compared against the Base_RTT. If the measured RTT is significantly higher than the Base_RTT, the CWND is reduced. If the RTT is close to or lower than the Base_RTT, the CWND is allowed to increase.
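A minimal sketch of this comparison in C. The 1.5x and 1.1x thresholds are assumptions for illustration only; the actual NSCC update rule is described in the upcoming sections:

```c
#include <assert.h>
#include <stdint.h>

/* Direction of a per-packet CWND update based on the measured RTT versus
 * Base_RTT. Returns a signed byte adjustment (here one MTU per step). */
static int32_t cwnd_step(uint32_t rtt_ns, uint32_t base_rtt_ns, uint32_t mtu)
{
    if (rtt_ns > base_rtt_ns + base_rtt_ns / 2)   /* > 1.5x Base_RTT */
        return -(int32_t)mtu;                     /* congestion: shrink */
    if (rtt_ns <= base_rtt_ns + base_rtt_ns / 10) /* near Base_RTT */
        return (int32_t)mtu;                      /* headroom: grow */
    return 0;                                     /* in between: hold */
}
```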

This adjustment process is described in more detail in the upcoming sections.

The config_base_rtt parameter represents the RTT of the longest path between sender and receiver when no other packets are in flight. In other words, it reflects the minimum RTT under uncongested conditions. Figure 6-7 illustrates the individual delay components that together form the RTT.

Serialization Delay: The network shown in Figure 6-7 supports jumbo frames with an MTU of 9216 bytes. Serialization delay is measured Continue reading
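As a sketch of the serialization-delay arithmetic, assuming the 100-Gbps links of the example network:

```c
#include <assert.h>
#include <stdint.h>

/* Serialization delay = frame size / link speed. Expressing the link
 * speed in Gbit/s makes the result come out directly in nanoseconds. */
static uint64_t serialization_delay_ns(uint64_t frame_bytes, uint64_t link_gbps)
{
    return frame_bytes * 8 / link_gbps; /* bits / (Gbit/s) = ns */
}
```

A 9,216-byte jumbo frame on a 100-Gbps link serializes in roughly 737 ns.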

UET Congestion Management: Congestion Control Context

Congestion Control Context

Updated 5.1.2026: Added CWND computation example into figure. Added CWND computation into text.
Updated 13.1.2026: Deprecated by: Ultra Ethernet: Congestion Control Context

Ultra Ethernet Transport (UET) uses a vendor-neutral, sender-specific congestion window–based congestion control mechanism together with flow-based, adjustable entropy-value (EV) load balancing to manage incast, outcast, local, link, and network congestion events. Congestion control in UET is implemented through coordinated sender-side and receiver-side functions to enforce end-to-end congestion control behavior.

On the sender side, UET relies on the Network-Signaled Congestion Control (NSCC) algorithm. Its main purpose is to regulate how quickly packets are transmitted by a Packet Delivery Context (PDC). The sender adapts its transmission window based on round-trip time (RTT) measurements and Explicit Congestion Notification (ECN) Congestion Experienced (CE) feedback conveyed through acknowledgments from the receiver.

On the receiver side, Receiver Credit-based Congestion Control (RCCC) limits incast pressure by issuing credits to senders. These credits define how much data a sender is permitted to transmit toward the receiver. The receiver also observes ECN-CE markings in incoming packets to detect path congestion. When congestion is detected, the receiver can instruct the sender to change the entropy value, allowing traffic to be Continue reading

UET Congestion Management: Introduction

 Introduction


Figure 6-1 depicts a simple scale-out backend network for an AI data center. The topology follows a modular design, allowing the network to scale out or scale in as needed. The smallest building block in this example is a segment, which consists of two nodes, two rail switches, and one spine switch. Each node in the segment is equipped with a dual-port UET NIC and two GPUs.

Within a segment, GPUs are connected to the leaf switches using a rail-based topology. For example, in Segment 1A, the communication path between GPU 0 on Node A1 and GPU 0 on Node A2 uses Rail A0 (Leaf 1A-1). Similarly, GPU 1 on both nodes is connected to Rail A1 (Leaf 1A-2). In this example, we assume that intra-node GPU collective communication takes place over an internal, high-bandwidth scale-up network (such as NVLink). As a result, intra-segment GPU traffic never reaches the spine layer. Communication between segments is carried over the spine layer.

The example network is a best-effort (that is, PFC is not enabled) two-tier, three-stage non-blocking fat-tree topology, where each leaf and spine switch has four 100-Gbps links. Leaf switches have two host-facing links and two inter-switch links, while spine Continue reading

UET Request–Response Packet Flow Overview

This section brings together the processes described earlier and explains the packet flow from the node perspective. A detailed network-level packet walk is presented in the following sections.

Initiator – SES Request Packet Transmission

After the Work Request Entity (WRE) and the corresponding SES and PDS headers are constructed, they are submitted to the NIC as a Work Element (WE). As part of this process, a Packet Delivery Context (PDC) is created, and the base Packet Sequence Number (PSN) is selected and encoded into the PDS header.

Once the PDC is established, it begins tracking transmitted PSNs and the acknowledgments returned by the target. For example, PSN 0x12000 is marked as transmitted.

The NIC then fetches the payload data from local memory according to the address and length information in the WRE. The NIC autonomously performs these steps without CPU intervention, illustrating the hardware offload capabilities of UET.

Next, the NIC encapsulates the data with the required protocol headers: Ethernet, IP, optional UDP, PDS, and SES, and computes the Cyclic Redundancy Check (CRC). The fully formed packet is then transmitted toward the target with Traffic Class (TC) set to Low.

Note: The Traffic Class is orthogonal to the PDC; a single PDC Continue reading

UET Protocol: How the NIC constructs packet from the Work Entries (WRE+SES+PDS)

 Semantic Sublayer (SES) Operation 

[Rewritten 12 Dec 2025]

After a Work Request Entity (WRE) is created, the UET provider generates the parameters needed by the Semantic Sublayer (SES) headers. At this stage, the SES does not construct the actual wire header. Instead, it provides the header parameters, which are later used by the Packet Delivery Context (PDC) state machine to construct the final SES wire header, as explained in the upcoming PDC section. These parameters ensure that all necessary information about the message, including addressing and size, is available for later stages of processing.

Fragmentation Due to Guaranteed Buffer Limits

In our example, the data to be written to the remote GPU is 16,384 bytes. The dual-port NIC in Figure 5-5 has a total memory capacity of 16,384 bytes, divided into three regions: a 4,096-byte guaranteed per-port buffer for Eth0 and Eth1, and an 8,192-byte shared memory pool available to both ports. Because gradient synchronization requires lossless delivery, all data must fit within the guaranteed buffer region. The shared memory pool cannot be used, as its buffer space is not guaranteed.

Since the message exceeds the size of the guaranteed buffer, it must be fragmented. The UET provider splits the 16,384-byte Continue reading
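The fragmentation arithmetic above can be sketched as a ceiling division:

```c
#include <assert.h>
#include <stdint.h>

/* Number of fragments needed when a message must fit within the
 * guaranteed buffer: ceil(msg_len / buf_len). A 16,384-byte message
 * with a 4,096-byte guaranteed buffer requires 4 fragments. */
static uint32_t fragments_needed(uint32_t msg_len, uint32_t buf_len)
{
    return (msg_len + buf_len - 1) / buf_len;
}
```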

UET Relative Addressing and Its Similarities to VXLAN

 Relative Addressing


As described in the previous section, applications use endpoint objects as their communication interfaces for data transfer. To write data from local memory to a target memory region on a remote GPU, the initiator must authorize the local UE-NIC to fetch data from local memory and describe where that data should be written on the remote side.

To route the packet to the correct Fabric Endpoint (FEP), the application and the UET provider must supply the FEP’s IP address (its Fabric Address, FA). To determine where in the remote process’s memory the received data belongs, the UE-NIC must also know:

  • Which job the communication belongs to
  • Which process within that job owns the target memory
  • Which Resource Index (RI) table should be used
  • Which entry in that table describes the exact memory location

This indirection model is called relative addressing.
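The indirection chain above can be sketched with C structures. These types and field names are illustrative only; they are not the UET specification's actual data structures:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One entry in a Resource Index (RI) table: a target memory region. */
struct ri_entry { uint64_t base_addr; uint64_t len; };

struct ri_table { struct ri_entry entries[4]; };

/* A process (identified by PID) owns an RI table. */
struct process_ctx { uint32_t pid; struct ri_table ri; };

/* A job groups the processes that share a communication context. */
struct job_ctx { uint32_t job_id; struct process_ctx procs[2]; size_t nprocs; };

/* Resolve Job -> PID -> RI entry to find the exact target memory
 * location (ri_index is assumed valid for this sketch). */
static const struct ri_entry *resolve(const struct job_ctx *job,
                                      uint32_t pid, uint32_t ri_index)
{
    for (size_t i = 0; i < job->nprocs; i++)
        if (job->procs[i].pid == pid)
            return &job->procs[i].ri.entries[ri_index];
    return NULL; /* no such process in this job */
}
```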

How Relative Addressing Works

Figure 5-6 illustrates the concept. Two GPUs participate in distributed training. A process on GPU 0 with global rank 0 (PID 0) receives data from GPU 1 with global rank 1 (PID 1). The UE-NIC determines the target Fabric Endpoint (FEP) based on the destination IP address (FA = 10.0.1.11). This Continue reading

UET Data Transfer Operation: Work Request Entity and Semantic Sublayer

Work Request Entity (WRE) 

[SES part updated 7 December 2025: text and figure]

The UET provider constructs a Work Request Entity (WRE) from a fi_write RMA operation that has been validated and passed by the libfabric core. The WRE is a software-level representation of the requested transfer and semantically describes both the source memory (local buffer) and the target memory (remote buffer) for the operation. Using the WRE, the UET provider constructs the Semantic Sublayer (SES) header and the Packet Delivery Context (PDC) header.

From the local memory perspective, the WRE specifies the address of the data in registered local memory, the length of the data, and the local memory key (lkey). This information allows the NIC to fetch the data directly from local memory when performing the transmission.

From the target memory perspective, the WRE describes the Resource Index (RI) table, which contains information about the destination memory region, including its base address and the offset within that region where the data should be written. The RI table also defines the allowed operations on the region. Because an RI table may contain multiple entries, the actual memory region is selected using the rkey, which is also included in the WRE. Continue reading
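The WRE fields described above can be summarized as a struct. This is a software-level sketch with illustrative names, not the UET provider's actual definition:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of a Work Request Entity (WRE). */
struct wre {
    /* Local (source) side: lets the NIC fetch the payload directly. */
    uint64_t local_addr;    /* address in registered local memory */
    uint32_t length;        /* bytes to transfer */
    uint32_t lkey;          /* local memory key */
    /* Target side: selects the remote memory region. */
    uint32_t rkey;          /* selects the entry in the RI table */
    uint64_t remote_offset; /* offset within the target region */
};
```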

UET Data Transfer Operation: Introduction

Introduction

[Updated 22 November 2025: Handoff Section]

The previous chapter described how an application gathers information about available hardware resources and uses that information to initialize the job environment. During this initialization, hardware resources are abstracted and made accessible to the UET provider as objects.

This chapter explains the data transport process, using gradient synchronization as an example.

Figure 5-1 depicts two GPUs—Rank 0 and Rank 2—participating in the same training job (JobID: 101). Both GPUs belong to the same NCCL topology and are connected to the Scale-Out Backend Network’s rail0.

Because the training model is large, each layer of the neural network is split across two GPUs using tensor parallelism, meaning that the computations of a single layer are distributed between the GPUs.

During the first forward-pass training iteration, the predicted model output does not match the expected result. This triggers the backward pass process, in which gradients—values indicating how much each weight parameter should be adjusted to improve the next forward-pass prediction—are computed.

Rank 0 computes its gradients, which in Figure 5-1 are stored as a 2D matrix with 3 rows and 1024 columns. The results are stored in a memory space registered for the process in local VRAM. Continue reading

UET Data Transport Part I: Introduction

[Figure updated 13 November 2025]

My previous UET posts explained how an application uses libfabric function API calls to discover available hardware resources and how this information is used to create a hardware abstraction layer composed of Fabric, Domain, and Endpoint objects, along with their child objects — Event Queues, Completion Queues, Completion Counters, Address Vectors, and Memory Regions.

This chapter explains how these objects are used during data transfer operations. It also describes how information is encoded into UET protocol headers, including the Semantic Sublayer (SES) and Packet Delivery Sublayer (PDS). In addition, the chapter covers how the Congestion Management Sublayer (CMS) monitors and controls send queue rates to prevent egress buffer overflows.

Note: In this book, libfabric API calls are divided into two categories for clarity. Functions are used to create and configure fabric objects such as fabrics, domains, endpoints, and memory regions (for example, fi_fabric(), fi_domain(), and fi_mr_reg()). Operations, on the other hand, perform actual data transfer or synchronization between processes (for example, fi_write(), fi_read(), and fi_send()).

Figure 5-1 provides a high-level overview of a libfabric Remote Memory Access (RMA) operation using the fi_write function call. When an application needs to transfer data, such as gradients, from Continue reading

Ultra Ethernet: Memory Region

Memory Registration and Endpoint Binding in UET with libfabric 

[Updated 25 October 2025: RIs in the figure]

In distributed AI workloads, each process requires memory regions that are visible to the fabric for efficient data transfer. The Job framework or application typically allocates these buffers in GPU VRAM to maximize throughput and enable low-latency direct memory access. These buffers store model parameters, gradients, neuron outputs, and temporary workspace, such as intermediate activations or partial gradients during collective operations in forward and backward passes.


Memory Registration and Key Generation

Once memory is allocated, it must be registered with the fabric domain using fi_mr_reg(). Registration informs the NIC that the memory is pinned and accessible for data transfers initiated by endpoints. The fabric library associates the buffer with a Memory Region handle (fid_mr) and internally generates a remote protection key (fi_mr_key), which uniquely identifies the memory region within the Job and domain context.

The local endpoint binds the fid_mr using fi_mr_bind() to define the permitted operations, FI_REMOTE_WRITE in Figure 4-10. This allows the NIC to access local memory efficiently and perform zero-copy operations.

The application retrieves the memory key using fi_mr_key(fid_mr) and constructs a Resource Index (RI) entry. The RI entry serves as Continue reading
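A hedged sketch of the registration sequence described above, in the fragment style of the book's other examples. The buffer size, access flags, and the surrounding domain and ep objects are illustrative assumptions from this chapter's scenario:

```c
/* Register a 16,384-byte buffer with the domain, bind it to the
 * endpoint, and retrieve the key that goes into the RI entry. Error
 * handling is omitted for brevity. */
struct fid_mr *mr;
int ret = fi_mr_reg(domain, buf, 16384,
                    FI_REMOTE_WRITE,   /* permitted remote operation */
                    0,                 /* offset */
                    0,                 /* requested key (provider may assign) */
                    0,                 /* flags */
                    &mr, NULL);

ret = fi_mr_bind(mr, &ep->fid, FI_REMOTE_WRITE);

uint64_t rkey = fi_mr_key(mr);         /* key placed into the RI entry */
```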

Ultra Ethernet: Address Resolution with Address Vector Table

Address Vector


Overview

To enable Remote Memory Access (RMA) operations between processes, each endpoint — representing a communication channel much like a TCP socket — must know the destination process’s location within the fabric. This location is represented by the Fabric Address (FA) assigned to a Fabric Endpoint (FEP).

During job initialization, FAs are distributed through a control-plane–like procedure in which the master rank collects FAs from all ranks and then broadcasts the complete Rank-to-FA mapping to every participant (see Chapter 3 for details). Each process stores this Rank–FA mapping locally as a structure, which can then be inserted into the Address Vector (AV) Table.

When FAs from the distributed Rank-to-FA table are inserted into the AV Table, the provider assigns each entry an index number, which is published to the application as an fi_addr_t handle. After an endpoint object is bound to the AV Table, the application uses this handle — rather than the full address — when referencing a destination process. This abstraction hides the underlying address structure from the application and allows fast and efficient lookups during communication.
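The insertion-and-index mechanism above can be sketched in plain C. The structures below are illustrative stand-ins, not libfabric internals:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* fi_addr_t-style handle: simply the entry's index in the AV table. */
typedef uint64_t fi_addr_like_t;

struct av_table {
    char fas[16][32];  /* stored Fabric Addresses, e.g. "10.0.1.11" */
    size_t count;
};

/* Insert an FA and return its index as the handle the application will
 * use instead of the full address. */
static fi_addr_like_t av_insert(struct av_table *av, const char *fa)
{
    strncpy(av->fas[av->count], fa, 31);
    return (fi_addr_like_t)av->count++;
}
```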

This mechanism resembles the functionality of a BGP Route Reflector (RR) in IP networks. Each RR client advertises its Continue reading

Ultra Ethernet: Creating Endpoint Object

Endpoint Creation and Operation

[Updated 12 October 2025: Figure & UET addressing section]

In libfabric and Ultra Ethernet Transport (UET), the endpoint, represented by the object fid_ep, serves as the primary communication interface between a process and the underlying network fabric. Every data exchange, whether it involves message passing, remote memory access (RMA), or atomic operations, ultimately passes through an endpoint. It acts as a software abstraction of the transport hardware, exposing a programmable interface that the application can use to perform high-performance data transfers.

Conceptually, an endpoint resembles a socket in the TCP/IP world. However, while sockets hide much of the underlying network stack behind a simple API, endpoints expose far more detail and control. They allow the process to define which completion queues to use, what capabilities to enable, and how multiple communication contexts are managed concurrently. This design gives applications, especially large distributed training frameworks and HPC workloads, direct control over latency, throughput, and concurrency in ways that traditional sockets cannot provide.

Furthermore, socket-based communication typically relies on the operating system’s networking stack and consumes CPU cycles for data movement and protocol handling. In contrast, endpoint communication paths can interact directly with the NIC, enabling user-space data transfers Continue reading

Ultra Ethernet: Fabric Object – What it is and How it is created

Fabric Object


Fabric Object Overview

In libfabric, a fabric represents a logical network domain, a group of hardware and software resources that can communicate with each other through a shared network. All network ports that can exchange traffic belong to the same fabric domain. In practice, a fabric corresponds to one interconnected network, such as an Ethernet or Ultra Ethernet Transport (UET) fabric.

A good way to think about a fabric is to compare it to a Virtual Data Center (VDC) in a cloud environment. Just as a VDC groups together compute, storage, and networking resources into an isolated logical unit, a libfabric fabric groups together network interfaces, addresses, and transport resources that belong to the same communication context. Multiple fabrics can exist on the same system, just like multiple VDCs can operate independently within one cloud infrastructure.

The fabric object acts as the top-level context for all communication. Before an application can create domains, endpoints, or memory regions, it must first open a fabric using the fi_fabric() call. This creates the foundation for all other libfabric objects.

Each fabric is associated with a specific provider,  for example, libfabric-uet, which defines how the fabric interacts with the underlying hardware and Continue reading

Ultra Ethernet: Discovery

Updated 8 October 2025

Creating the fi_info Structure

Before the application can discover what communication services are available, it first needs a way to describe what it is looking for. This description is built using a structure called fi_info. The fi_info structure acts like a container that holds the application’s initial requirements, such as desired endpoint type or capabilities.

The first step is to reserve memory for this structure in the system’s main memory. The fi_allocinfo() helper function does this for the application. When called, fi_allocinfo() allocates space for a new fi_info structure, which this book refers to as the pre-fi_info, that will later be passed to the libfabric core for matching against available providers.

At this stage, most of the fields inside the pre-fi_info structure are left at their default values. The application typically sets only the most relevant parameters that express what it needs, such as the desired endpoint type or provider name, and leaves the rest for the provider to fill in later.
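The matching step can be sketched as a capability-bitmask check: every capability the application requested in its hints must be supported by the provider. The bit values below are illustrative, not libfabric's actual FI_* flag values:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative capability bits (not libfabric's FI_* definitions). */
#define CAP_RMA    (1u << 0)
#define CAP_MSG    (1u << 1)
#define CAP_ATOMIC (1u << 2)

/* A provider matches when it supports every requested capability. */
static int provider_matches(uint32_t provider_caps, uint32_t hint_caps)
{
    return (provider_caps & hint_caps) == hint_caps;
}
```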

In addition to the main fi_info structure, the helper function also allocates memory for a set of sub-structures. These describe different parts of the communication stack, including the fabric, domain, and endpoints. The fixed sub-structures Continue reading

Ultra Ethernet: Address Vector (AV)

 Address Vector (AV)

The Address Vector (AV) is a provider-managed mapping that connects remote fabric addresses to compact integer handles (fi_addr_t) used in communication operations. Unlike a routing table, the AV does not store IP-to-device mappings. Instead, it converts an opaque Fabric Address (FA)—which may contain IP, port, and transport-specific identifiers—into a simple handle that endpoints can use for sending and receiving messages. The application never needs to reference the raw IP addresses directly.

Phase 1: Application – Request & Definition

The application begins by requesting an Address Vector (AV) through the fi_av_open() call. To do this, it first defines the desired AV properties in a fi_av_attr structure:

int fi_av_open(struct fid_domain *domain, struct fi_av_attr *attr,
               struct fid_av **av, void *context);

struct fi_av_attr av_attr = {
    .type        = FI_AV_TABLE,   /* indexed table of addresses */
    .count       = 16,            /* expected number of entries */
    .rx_ctx_bits = 0,
    .ep_per_node = 1,
    .name        = "my_av",
    .map_addr    = NULL,
    .flags       = 0
};

Example 4-1: structure Continue reading

Ultra Ethernet: Completion Queue

Completion Queue Creation (fi_cq_open)


Phase 1: Application – Request & Definition


The purpose of this phase is to define the queue where operation completions will be reported. Completion queues are used to report the completion of operations submitted to endpoints, such as data transfers, RMA accesses, or remote write requests. By preparing a struct fi_cq_attr, the application describes exactly what it needs, so the provider can allocate a CQ that meets its requirements.


Example API Call:

struct fi_cq_attr cq_attr = {
    .size      = 2048,
    .format    = FI_CQ_FORMAT_DATA,
    .wait_obj  = FI_WAIT_FD,
    .flags     = FI_WRITE | FI_REMOTE_WRITE | FI_RMA,
    .data_size = 64
};

struct fid_cq *cq;
int ret = fi_cq_open(domain, &cq_attr, &cq, NULL);


Explanation of fields:

.size = 2048:  The CQ can hold up to 2048 completions. This determines how many completed operations can be buffered before the application consumes them.

.format = FI_CQ_FORMAT_DATA: This setting determines the level of detail included in each completion entry. With FI_CQ_FORMAT_DATA, the CQ entries contain information about the operation, such as the buffer pointer, the length of data, and optional completion data. If the application uses tagged messaging, choosing FI_CQ_FORMAT_TAGGED expands the entries to Continue reading

Ultra Ethernet: Event Queue

Event Queue Creation (fi_eq_open)


Phase 1: Application – Request & Definition


The purpose of this phase is to specify the type, size, and capabilities of the Event Queue (EQ) your application needs. Event queues are used to report events associated with control operations. They can be linked to memory registration, address vectors, connection management, and fabric- or domain-level events. Reported events are either associated with a requested operation or affiliated with a call that registers for specific types of events, such as listening for connection requests. By preparing a struct fi_eq_attr, the application describes exactly what it needs so the provider can allocate the EQ properly.

In addition to basic properties like .size (number of events the queue can hold) and .wait_obj (how the application waits for events), the .flags field can request specific EQ capabilities. Common flags include:


  • FI_WRITE: Requests support for user-inserted events via fi_eq_write(). If this flag is set, the provider must allow the application to invoke fi_eq_write().
  • FI_REMOTE_WRITE: Requests support for remote write completions being reported to this EQ.
  • FI_RMA: Requests support for Remote Memory Access events (e.g., RMA completions) to be delivered to this EQ.

Flags are encoded as a bitmask, so multiple Continue reading
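As a sketch mirroring the CQ example earlier on this page, the EQ request might look as follows. The size value is illustrative; fabric is assumed to be an already opened fabric object:

```c
struct fi_eq_attr eq_attr = {
    .size     = 1024,        /* events the queue can hold */
    .wait_obj = FI_WAIT_FD,  /* wait for events via a file descriptor */
    .flags    = FI_WRITE     /* allow user-inserted events via fi_eq_write() */
};

struct fid_eq *eq;
int ret = fi_eq_open(fabric, &eq_attr, &eq, NULL);
```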
