
Author Archives: Toni Pasanen

Congestion Avoidance in AI Fabric – Part III: Data Center Quantized Congestion Notification (DCQCN)

Data Center Quantized Congestion Notification (DCQCN) is a hybrid congestion control method. DCQCN combines Priority Flow Control (PFC) and Explicit Congestion Notification (ECN) to provide high throughput, low latency, and lossless delivery across the AI fabric. In this approach, each mechanism plays a specific role in addressing different aspects of congestion, and together they create a robust flow-control system for RDMA traffic.


DCQCN tackles two main issues in large-scale RDMA networks:

1. Head-of-Line Blocking and Congestion Spreading: This is caused by PFC’s pause frames, which stop traffic across switches.

2. Throughput Reduction with ECN Alone: When the ECN feedback is too slow, packet loss may occur despite the rate adjustments.

DCQCN uses a two-tiered approach. It applies ECN early on to gently reduce the sending rate at the GPU NICs, and it uses PFC as a backup to quickly stop traffic on upstream switches (hop-by-hop) when congestion becomes severe.
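The division of labor can be sketched in pseudocode. The snippet below is a minimal, simplified model rather than any vendor's implementation: packets are ECN-marked with a WRED-style probability between a minimum and maximum queue threshold, and only if the queue keeps growing past a much higher threshold does the switch fall back to a PFC pause toward the upstream hop. All threshold values are illustrative assumptions.

```python
# Minimal sketch of DCQCN's two-tiered reaction to a growing egress queue.
# Threshold values are illustrative assumptions, not vendor defaults.
import random

K_MIN = 100       # WRED minimum threshold (cells): start ECN marking
K_MAX = 400       # WRED maximum threshold (cells): mark every packet
P_MAX = 0.2       # marking probability at K_MAX
PFC_XOFF = 800    # last-resort threshold: send a PFC pause upstream

def handle_packet(queue_depth: int) -> str:
    """Return the congestion action a switch would take for one packet."""
    if queue_depth >= PFC_XOFF:
        return "PFC pause upstream"   # severe congestion: stop the traffic class
    if queue_depth >= K_MAX:
        return "ECN mark"             # always mark above the maximum threshold
    if queue_depth >= K_MIN:
        # Marking probability grows linearly between K_MIN and K_MAX (WRED-style).
        p = P_MAX * (queue_depth - K_MIN) / (K_MAX - K_MIN)
        return "ECN mark" if random.random() < p else "forward"
    return "forward"                  # no congestion

for depth in (50, 250, 500, 900):
    print(depth, handle_packet(depth))
```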


How DCQCN Combines ECN and PFC

DCQCN carefully combines Explicit Congestion Notification (ECN) and Priority Flow Control (PFC) in the right sequence:


Early Action with ECN: When congestion begins to build up, the switch uses WRED thresholds (minimum and maximum) to mark packets. This signals the Continue reading

Congestion Avoidance in AI Fabric – Part II: Priority Flow Control (PFC)

Priority Flow Control (PFC) is a mechanism designed to prevent packet loss during network congestion by pausing traffic selectively based on priority levels. While the original IEEE 802.1Qbb standard operates at Layer 2, using the Priority Code Point (PCP) field in Ethernet headers, AI Fabrics rely on Layer 3 forwarding, where traditional Layer 2-based PFC is no longer applicable. To extend lossless behavior across routed (Layer 3) networks, DSCP-based PFC is used.

In DSCP-based PFC, the Differentiated Services Code Point (DSCP) field in the IP header identifies the traffic class or priority. Switches map specific DSCP values to internal traffic classes and queues. If congestion occurs on an ingress interface and a particular priority queue fills beyond a threshold, the switch can send a PFC pause frame back to the sender switch, instructing it to temporarily stop sending traffic of that class—just as in Layer 2 PFC, but now triggered based on Layer 3 classifications.
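As a rough illustration, the mapping and pause decision can be modeled as below. The DSCP value (26), the priority, and the queue threshold are example assumptions; real switches apply this logic per ingress port and per priority with configurable headroom.

```python
# Sketch of DSCP-based PFC: DSCP -> traffic-class mapping plus a per-priority
# pause decision. DSCP 26, priority 3, and the threshold are example values.
DSCP_TO_PRIORITY = {26: 3}      # e.g. RoCEv2 traffic mapped to priority 3
XOFF_THRESHOLD = 1000           # ingress buffer cells before pausing

def classify(dscp: int) -> int:
    """Map a DSCP value to an internal priority/queue (default: best effort)."""
    return DSCP_TO_PRIORITY.get(dscp, 0)

def should_send_pause(priority: int, buffer_cells: dict) -> bool:
    """Pause this priority if its ingress buffer exceeds the threshold."""
    return buffer_cells.get(priority, 0) >= XOFF_THRESHOLD

buffers = {3: 1200, 0: 50}
prio = classify(26)
print(prio, should_send_pause(prio, buffers))   # -> 3 True
```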

This behavior differs from Explicit Congestion Notification (ECN), which operates at Layer 3 as well but signals congestion by marking packets instead of stopping traffic. ECN acts on the egress port, informing the receiver to notify the sender to reduce the transmission rate over Continue reading

Congestion Avoidance in AI Fabric – Part I: Explicit Congestion Notification (ECN)

As explained in the preceding chapter, “Egress Interface Congestions,” both the Rail switch links to GPU servers and the inter-switch links can become congested during gradient synchronization. It is essential to implement congestion control mechanisms specifically designed for RDMA workloads in AI fabric back-end networks because congestion slows down the learning process and even a single packet loss may restart the whole training process.

This section begins by introducing Explicit Congestion Notification (ECN) and Priority-based Flow Control (PFC), two foundational technologies used in modern lossless Ethernet networks. ECN allows switches to mark packets, rather than dropping them, when congestion is detected, enabling endpoints to react proactively. PFC, on the other hand, offers per-priority flow control, which can pause selected traffic classes while allowing others to continue flowing.

Finally, we describe how Datacenter Quantized Congestion Notification (DCQCN) combines ECN and PFC to deliver a scalable and lossless transport mechanism for RoCEv2 traffic in AI clusters.

GPU-to-GPU RDMA Write Without Congestion


Figure 11-1 illustrates a standard Remote Direct Memory Access (RDMA) Write operation between two GPUs. This example demonstrates how GPU-0 on Host-1 transfers local gradients (∇₁ and ∇₂) from memory to GPU-0 on Host-2. Both GPUs use RDMA-capable NICs connected Continue reading

AI for Network Engineers: Challenges in AI Fabric Design

 Introduction


Figure 10-1 illustrates a simple distributed GPU cluster consisting of three GPU hosts. Each host has two GPUs and a Network Interface Card (NIC) with two interfaces. Intra-host GPU communication uses high-speed NVLink interfaces, while inter-host communication takes place via NICs over slower PCIe buses.

GPU-0 on each host is connected to Rail Switch A through interface E1. GPU-1 uses interface E2 and connects to Rail Switch B. In this setup, inter-host communication between GPUs connected to the same rail passes through a single switch. However, communication between GPUs on different rails goes over three hops: Rail–Spine–Rail switches.

In Figure 10-1, we use a data parallelization strategy where a training dataset is split into six micro-batches, which are distributed across the GPUs. All GPUs use the shared feedforward neural network model and compute local model outputs. Next, each GPU calculates the model error and begins the backward pass to compute neuron-based gradients. These gradients indicate how much, and in which direction, the weight parameters should be adjusted to improve the training result (see Chapter 2 for details).

Figure 10-1: Rail-Optimized Topology.


Egress Interface Congestions


After computing all gradients, each GPU stores the results in a local memory buffer and Continue reading

Tensor Parallelism

 The previous section described how Pipeline Parallelism distributes entire layers across multiple GPUs. However, Large Language Models (LLMs) based on transformer architectures contain billions of parameters, making this approach insufficient.

For example, GPT-3 has approximately 605 million parameters in a single self-attention layer and about 1.2 billion parameters in a feedforward layer, and these figures apply to just one transformer block. Since GPT-3 has 96 transformer blocks, the total parameter count reaches approximately 173 billion. When adding embedding and normalization parameters, the total increases to roughly 175 billion parameters.
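The arithmetic behind these figures can be reproduced with the usual transformer parameter-count approximations, assuming GPT-3's reported hidden size of 12,288, a vocabulary of roughly 50,257 tokens, and a 2,048-token context (bias and normalization terms are ignored, so the results are approximate):

```python
# Rough parameter-count arithmetic for GPT-3 (approximations only).
d_model = 12_288          # hidden size
n_blocks = 96             # transformer blocks
vocab = 50_257            # vocabulary size
context = 2_048           # maximum sequence length

attention = 4 * d_model**2                    # Wq, Wk, Wv, Wo  -> ~604 million
feedforward = 2 * d_model * (4 * d_model)     # up- and down-projection -> ~1.2 billion
per_block = attention + feedforward           # ~1.8 billion per block
blocks_total = n_blocks * per_block           # ~174 billion
embeddings = vocab * d_model + context * d_model   # ~0.6 billion

print(f"attention per block  : {attention/1e6:.0f} M")
print(f"feedforward per block: {feedforward/1e9:.2f} B")
print(f"all blocks           : {blocks_total/1e9:.0f} B")
print(f"total with embeddings: {(blocks_total + embeddings)/1e9:.0f} B")
```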

The number of parameters in a single layer alone often exceeds the memory capacity of a single GPU, making Pipeline Parallelism insufficient. Additionally, performing large matrix multiplications on a single GPU would be extremely slow and inefficient. Tensor Parallelism addresses this challenge by splitting computations within individual layers across multiple GPUs rather than assigning whole layers to separate GPUs, as done in Pipeline Parallelism.
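A minimal sketch of the idea, using NumPy and two simulated "GPUs": the weight matrix of one layer is split column-wise, each shard computes its part of the matrix multiplication, and the partial results are combined (in a real system this is a collective operation such as all-gather). The shapes are toy values.

```python
# Toy sketch of tensor parallelism: split one layer's weight matrix column-wise
# across two simulated GPUs and combine the partial results.
import numpy as np

batch, d_in, d_out = 4, 8, 6
x = np.random.randn(batch, d_in)          # layer input (replicated on both "GPUs")
W = np.random.randn(d_in, d_out)          # full weight matrix of the layer

W_gpu0, W_gpu1 = np.split(W, 2, axis=1)   # each GPU stores half of the columns

y_gpu0 = x @ W_gpu0                       # partial result on GPU 0
y_gpu1 = x @ W_gpu1                       # partial result on GPU 1

y = np.concatenate([y_gpu0, y_gpu1], axis=1)   # "all-gather" of the shards
assert np.allclose(y, x @ W)                   # matches the single-GPU result
```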

Chapter 7 introduces the Transformer architecture, but as a refresher, Figure 8-15 illustrates a stack of decoder modules in a Transformer architecture. Each decoder module consists of a Self-Attention layer and a Feedforward layer. The figure also shows how an input word, represented by x1, is first Continue reading

Model Parallelism with Pipeline Parallelism

 

In Model Parallelism, the neural network is partitioned across multiple GPUs, with each GPU responsible for specific layers of the model. This strategy is particularly beneficial for large-scale models that surpass the memory limitations of a single GPU.

Pipeline Parallelism, in turn, divides the model into consecutive stages, assigning each stage to a different GPU. This setup allows data to be processed in a pipeline fashion, akin to an assembly line, enabling simultaneous processing of multiple training samples. Without pipeline parallelism, each GPU would process its inputs sequentially from the complete dataset, while all other GPUs remain idle.

Our example neural network in Figure 8-3 consists of three hidden layers and an output layer. The first hidden layer is assigned to GPU A1, while the second and third hidden layers are assigned to GPU A2 and GPU B1, respectively. The output layer is placed on GPU B2. The training dataset is divided into four micro-batches and stored on the GPUs. These micro-batches are fed sequentially into the first hidden layer on GPU A1. 
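A simplified forward-pass schedule for this example can be sketched as follows: the four stages stand for GPUs A1, A2, B1, and B2, and the print-out shows micro-batches entering the pipeline one step apart so that several are processed simultaneously (backward passes and pipeline bubbles are ignored in this sketch).

```python
# Toy forward-pass pipeline schedule: 4 stages (GPU A1, A2, B1, B2),
# 4 micro-batches, one pipeline step per time slot.
stages = ["GPU A1", "GPU A2", "GPU B1", "GPU B2"]
micro_batches = [0, 1, 2, 3]

n_steps = len(stages) + len(micro_batches) - 1   # 7 time slots in total
for t in range(n_steps):
    slot = []
    for s, name in enumerate(stages):
        mb = t - s                               # micro-batch this stage works on
        if 0 <= mb < len(micro_batches):
            slot.append(f"{name}: mb{mb}")
    print(f"t={t}: " + ", ".join(slot))
```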

Note 8-1. In this example, we use a small training dataset. However, if the dataset is too large to fit on a Continue reading

Parallelism Strategies in Deep Learning

Introduction

Figure 8-1 depicts some of the model parameters that need to be stored in GPU memory: a) Weight matrices associated with connections to the preceding layer, b) Weighted sum (z), c) Activation values (y), d) Errors (E), e) Local gradients (local ∇), f) Gradients received from peer GPUs (remote ∇), g) Learning rates (LR), and h) Weight adjustment values (Δw).

In addition, the training and test datasets, along with the model code, must also be stored in GPU memory. However, a single GPU may not have enough memory to accommodate all these elements. To address this limitation, an appropriate parallelization strategy must be chosen to efficiently distribute computations across multiple GPUs.

This chapter introduces the most common strategies: data parallelism, model parallelism, pipeline parallelism, and tensor parallelism.


Figure 8-1: Overview of Neural Networks Parameters.


Data Parallelism


In data parallelization, each GPU has an identical copy of the complete model but processes different mini-batches of data. Gradients from all GPUs are averaged and synchronized before updating the model. This approach is effective when the model fits within a single GPU’s memory.
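The synchronization step amounts to averaging each gradient element across the GPUs (an all-reduce), as in the sketch below; the per-GPU gradient values are made-up numbers.

```python
# Sketch of data-parallel gradient synchronization: every GPU computed its own
# local gradients, and the averaged result is applied on all model replicas.
local_gradients = {                  # made-up per-GPU gradients for one weight tensor
    "gpu0": [0.10, -0.20, 0.05],
    "gpu1": [0.14, -0.16, 0.01],
    "gpu2": [0.08, -0.24, 0.03],
    "gpu3": [0.12, -0.20, 0.07],
}

n_gpus = len(local_gradients)
length = len(next(iter(local_gradients.values())))
averaged = [sum(g[i] for g in local_gradients.values()) / n_gpus
            for i in range(length)]

print(averaged)   # every GPU updates its identical model copy with this result
```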

In Figure 8-2, the batch of training data is split into eight micro-batches. The first four micro-batches are Continue reading

Training Neural Networks: Backpropagation Algorithm

 Introduction 


The previous chapter explained the operation of a single artificial neuron. It covered how input values are multiplied by their respective weight parameters, summed together, and combined with a bias term. The resulting value, z, is then passed through a non-linear sigmoid function, which squeezes the neuron's output value ŷ between 0 and 1.
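As a refresher, that single-neuron computation can be written in a few lines; the inputs, weights, and bias below are arbitrary example values.

```python
# A single artificial neuron: weighted sum plus bias, passed through a sigmoid.
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

x = [0.5, 0.8]      # example inputs
w = [0.4, -0.6]     # example weights
b = 0.1             # bias term

z = sum(xi * wi for xi, wi in zip(x, w)) + b   # weighted sum
y_hat = sigmoid(z)                             # output squeezed between 0 and 1
print(z, y_hat)
```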

In this chapter, we form the smallest possible Feed Forward Neural Network (FFNN) model using only two neurons. While this is far from a Deep Neural Network (DNN), a simple NN with two neurons is sufficient to explain the Backpropagation algorithm, which is the focus of this chapter.

The goal is to demonstrate the training process and illustrate how the Forward Pass (computation phase) first generates a model output, ŷ. The algorithm then evaluates the model’s accuracy by computing the error term using Mean Squared Error (MSE). The first training iteration rarely, if ever, produces a perfect output. To gradually bring the training result closer to the expected value, the Backward Pass (adjustment and communication phase) calculates the magnitude and direction by which the weight values should be adjusted. The Backward Pass is repeated as many times as necessary until an acceptable model Continue reading
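The chapter walks through the math step by step, but a compressed sketch of a few training iterations for such a two-neuron network (one hidden neuron feeding one output neuron, MSE loss, arbitrary initial values) looks like this:

```python
# Minimal two-neuron network: forward pass, MSE error, and a backward pass
# that adjusts the weights with gradient descent. All values are arbitrary.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x, target, lr = 1.0, 1.0, 0.5   # input, expected output, learning rate
w1, b1 = 0.6, 0.0               # hidden neuron parameters
w2, b2 = -0.4, 0.0              # output neuron parameters

for step in range(3):
    # Forward pass
    h = sigmoid(w1 * x + b1)
    y_hat = sigmoid(w2 * h + b2)
    error = 0.5 * (target - y_hat) ** 2           # MSE for a single sample

    # Backward pass (chain rule)
    d_out = (y_hat - target) * y_hat * (1 - y_hat)   # dE/dz at the output neuron
    d_hid = d_out * w2 * h * (1 - h)                 # dE/dz at the hidden neuron

    # Weight adjustments
    w2 -= lr * d_out * h;  b2 -= lr * d_out
    w1 -= lr * d_hid * x;  b1 -= lr * d_hid
    print(f"step {step}: error={error:.4f}")
```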

Introduction of an Artificial Neuron

 Introduction 


Before diving into the somewhat complex world of Artificial Intelligence (AI), let’s first consider what intelligence means from a human perspective. Judo, as a martial art, serves as a good—though not an obvious—example. I trained in judo for over 20 years. During that time, I learned which throwing techniques to use to take down an opponent efficiently by leveraging their movement energy and reactions. But how did I learn that? Through a supervised training process, where our coach first taught us the throwing techniques and the situations in which they work best. Then, we practiced them ourselves. Mastering these techniques requires thousands of repetitions before achieving perfection. Ultimately, timing and reaction to the opponent’s movements play a significant role in determining whether a throw is successful or not. After mastering several throwing techniques, I was able to apply them in situations I had not necessarily seen before.

How does this relate to Artificial Intelligence (AI)? AI is a broad term encompassing solutions that aim to mimic human brain functions. A subset of AI is Machine Learning (ML), which enables systems to make decisions based on input data without being explicitly programmed for each scenario. The driving force behind this Continue reading

Large Language Model (LLM) – Part 2/2: Transformer Architecture

 Introduction

Sequence-to-sequence (seq2seq) language translation and Generative Pretrained Transformer (GPT) models are subcategories of Natural Language Processing (NLP) that utilize the Transformer architecture. Seq2seq models typically use Long Short-Term Memory (LSTM) networks or encoder-decoder-based Transformers. In contrast, GPT is an autoregressive language model that uses a decoder-only Transformer mechanism. The purpose of this chapter is to provide an overview of the decoder-only Transformer architecture.

The Transformer consists of stacks of decoder modules. A word embedding vector, the result of word tokenization and embedding, is fed as input to the first decoder module. After processing, the resulting context vector is passed to the next decoder, and so on. After the final decoder, a softmax layer evaluates the output against the complete vocabulary to predict the next word. As an autoregressive model, the predicted word vector from the softmax layer is converted into a token before being fed back into the subsequent decoder layer. This process involves a token-to-word vector transformation prior to re-entering the decoder.
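The autoregressive loop itself is simple to sketch: the decoder stack (stubbed out below as a toy scoring function) produces a distribution over the vocabulary, the most probable token is appended to the sequence, and the extended sequence is fed back in. The tiny vocabulary, the bigram stub, and greedy selection are all illustrative assumptions, not the GPT implementation.

```python
# Toy autoregressive decoding loop. The "decoder stack" is a stub that scores
# each vocabulary entry; a real model would run stacked decoder modules.
import math

vocab = ["the", "sky", "was", "clear", "<end>"]

def decoder_stack(tokens):
    """Stub: a hand-written bigram table instead of real stacked decoders."""
    bigram = {"the": "sky", "sky": "was", "was": "clear", "clear": "<end>"}
    predicted = bigram.get(tokens[-1], "<end>")
    return [10.0 if w == predicted else 0.0 for w in vocab]   # raw scores (logits)

def softmax(logits):
    exps = [math.exp(v) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]

tokens = ["the"]
for _ in range(4):                               # generate up to four more tokens
    probs = softmax(decoder_stack(tokens))       # distribution over the vocabulary
    next_word = vocab[probs.index(max(probs))]   # greedy pick of the next token
    if next_word == "<end>":
        break
    tokens.append(next_word)                     # fed back in on the next step
print(" ".join(tokens))                          # -> "the sky was clear"
```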

Each decoder module consists of an attention layer, an Add & Normalization layer, and a feedforward neural network (FFNN). Rather than feeding the embedded word vector (i.e., token embedding plus positional encoding) directly Continue reading

Large Language Models (LLM) – Part 1/2: Word Embedding

Introduction

This chapter introduces the basic operations of Transformer-based Large Language Models (LLMs), focusing on fundamental concepts rather than any specific LLM, such as OpenAI’s GPT (Generative Pretrained Transformer). The chapter begins with an introduction to tokenization and word embeddings, which convert input words into a format the model can process. Next, it explains how the transformer component leverages the decoder architecture for input processing and prediction.

This chapter has two main goals. First, it explains how an LLM understands the context of a word. For example, the word “clear” can be used as a verb (Please, clear the table.) or as an adjective (The sky was clear.), depending on the context. Second, it discusses why LLMs require parallelization across hundreds or even thousands of GPUs due to the large model size, massive datasets, and the computational complexity involved.


Tokenizer and Word Embedding Matrix

As a first step, we import a vocabulary into the model. The vocabulary used for training large language models (LLMs) typically consists of a mix of general and domain-specific terms, including basic vocabulary, technical terminology, academic and formal language, idiomatic expressions, cultural references, as well as synonyms and antonyms. Each word and character is Continue reading
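In miniature, the lookup works as sketched below: each vocabulary entry gets a token ID, the embedding matrix has one row per token, and tokenizing a sentence becomes a row lookup. The tiny vocabulary and three-dimensional embedding vectors are assumptions for illustration; real models use tens of thousands of tokens and much larger embedding dimensions.

```python
# Miniature tokenizer and embedding-matrix lookup. The vocabulary and the
# 3-dimensional embedding vectors are made up for illustration.
vocab = {"the": 0, "sky": 1, "was": 2, "clear": 3}

embedding_matrix = [
    [0.12, -0.40, 0.33],   # "the"
    [0.91,  0.05, -0.20],  # "sky"
    [-0.30, 0.77, 0.10],   # "was"
    [0.44, -0.12, 0.58],   # "clear"
]

def tokenize(sentence: str):
    """Map each word to its token ID (a real tokenizer also handles sub-words)."""
    return [vocab[word] for word in sentence.lower().split()]

def embed(token_ids):
    """Look up one embedding vector (matrix row) per token."""
    return [embedding_matrix[t] for t in token_ids]

ids = tokenize("The sky was clear")
print(ids)          # [0, 1, 2, 3]
print(embed(ids))   # four 3-dimensional word vectors
```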

AI for Network Engineers: LSTM-Based RNN


Recap of the Operation of an LSTM Cell

The previous section introduced the construction and operation of a single Long Short-Term Memory (LSTM) cell. This section briefly discusses an LSTM-based Recurrent Neural Network (RNN). Before diving into the details, let’s recap how an individual LSTM cell operates with a theoretical, non-mathematical example.

Suppose we want our model to produce the sentence: “It was cloudy, but it is raining now.” The first part of the sentence refers to the past, and one of the LSTM cells has stored the tense “was” in its internal cell state. However, the last portion of the sentence refers to the present. Naturally, we want the model to forget the previous tense “was” and update its state to reflect the current tense “is.”

The Forget Gate plays a role in discarding unnecessary information. In this case, the forget gate suppresses the word “was” by closing its gate (outputting 0). The Input Gate is responsible for providing a new candidate cell state, which in this example is the word “is.” The input gate is fully open (outputting 1) to allow the latest information to be introduced.

The Identification function computes the updated cell state by Continue reading
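Numerically, the gate interaction in this example reduces to a couple of multiplications and an addition, as in the sketch below; the "tense" encodings are arbitrary stand-ins for the real state vectors.

```python
# Sketch of the LSTM cell-state update for the "was" -> "is" example.
# The numeric encodings of the tenses are arbitrary stand-ins for real vectors.
prev_cell_state = 0.9     # information encoding the past tense "was"
candidate_state = -0.7    # new candidate information encoding "is"

forget_gate = 0.0         # sigmoid output ~0: discard the old tense
input_gate = 1.0          # sigmoid output ~1: admit the new candidate

new_cell_state = forget_gate * prev_cell_state + input_gate * candidate_state
print(new_cell_state)     # -0.7 -> the cell now carries the present tense
```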

AI for Network Engineers: Long Short-Term Memory (LSTM)

 Introduction


As mentioned in the previous chapter, Recurrent Neural Networks (RNNs) can have hundreds or even thousands of time steps. These basic RNNs often suffer from the gradient vanishing problem, where the network struggles to retain historical information across all time steps. In other words, the network gradually "forgets" historical information as it progresses through the time steps.

One solution to the horizontal gradient vanishing problem between time steps is to use a Long Short-Term Memory (LSTM) based RNN instead of a basic RNN. LSTM cells can preserve historical information across all time steps, whether the model contains ten or several thousand time steps.

Figure 6-1 illustrates the overall architecture of an LSTM cell. It includes three gates: the Forget gate, the Input gate (a.k.a. Remember gate), and the Output gate. Each gate contains input neurons that use the Sigmoid activation function. The reason for employing the Sigmoid function, as shown in Figure 5-4 of the previous chapter, is its ability to produce outputs in the range of 0 to 1. An output of 0 indicates that the gate is "closed," meaning the information is excluded from contributing to the cell's internal state calculations. An output of Continue reading

AI for Network Engineers: Recurrent Neural Network (RNN) – Part II

Challenges of an RNN Model


Figure 5-3 shows the last two time steps of our Recurrent Neural Network (RNN). At time step n (on the left side), there are two inputs for the weighted sum calculation: Xn (the input at the current time step) and ht−1 (the hidden state from the previous time step).

First, the model calculates the weighted sum of these inputs. The result is then passed through the neuron’s activation function (Sigmoid in this example). The output of the activation function, ht, is fed back into the recurrent layer on the next time step, n+1. At time step n+1, ht is combined with the input Xn+1 to calculate the weighted sum. This result is then passed through the activation function, which now produces the model's prediction, ŷ (y hat). These steps are part of the Forward Pass process.

As the final step in the forward pass, we calculate the model's accuracy using the Mean Square Error (MSE) function (explained in Chapter 2).
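Written out, one recurrent time step is just a weighted sum over the current input and the previous hidden state, as in this sketch (scalar weights stand in for the weight matrices, and the values are arbitrary):

```python
# One recurrent time step: combine the current input with the previous hidden
# state and pass the weighted sum through a sigmoid. Scalars stand in for matrices.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w_x, w_h, bias = 0.7, 0.5, 0.0   # input weight, recurrent weight, bias
h_prev = 0.0                     # hidden state before the first time step

for x_t in [1.0, 0.0, 1.0]:                  # input sequence X1..X3
    z = w_x * x_t + w_h * h_prev + bias      # weighted sum of both inputs
    h_prev = sigmoid(z)                      # becomes h(t-1) for the next step
    print(round(h_prev, 3))
```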

If the model's accuracy is not close enough to the expected result, it begins the Backward Pass to improve its performance. The most commonly used optimization algorithm for minimizing the loss function during Continue reading

AI for Network Engineers: Recurrent Neural Network (RNN)

 Introduction

So far, this book has introduced two neural network architectures. The first one, the Feed-Forward Neural Network (FNN), works well for simple tasks, such as recognizing handwritten digits in small-sized images. The second one, the Convolutional Neural Network (CNN), is designed for processing larger images. CNNs can identify objects in images even when the location or orientation of the object changes.

This chapter introduces the Recurrent Neural Network (RNN). Unlike FNNs and CNNs, an RNN’s inputs include not only the current data but also all the inputs it has processed previously. In other words, an RNN preserves and uses historical data. This is achieved by feeding the output of the previous time step back into the hidden layer along with the current input vector.

Although RNNs can be used for predicting sequential data of variable lengths, such as sales figures or a patient’s historical health records, this chapter focuses on how RNNs can perform character-based text autocompletion. The upcoming chapters will explore word-based text prediction.


Text Datasets

For training the RNN model, we typically use text datasets like IMDB Reviews or the Wikipedia Text Corpus. However, in this chapter, we simplify the process by using a tailored dataset containing Continue reading

AI for Network Engineers: Convolutional Neural Network

 Introduction


The previous chapter explained how Feed-forward Neural Networks (FNNs) can be used for multi-class classification of 28 x 28 pixel handwritten digits from the MNIST dataset. While FNNs work well for this type of task, they have significant limitations when dealing with larger, high-resolution color images.

In neural network terminology, each RGB value of an image is treated as an input feature. For instance, a high-resolution 600 dpi RGB color image with dimensions 3.937 x 3.937 inches contains approximately 5.58 million pixels, resulting in roughly 17 million RGB values.

If we use a fully connected FNN for training, all these 17 million input values are fed into every neuron in the first hidden layer. Each neuron must compute a weighted sum based on these 17 million inputs. The memory required for storing the weights depends on the numerical precision format used. For example, using the 16-bit floating-point (FP16) format, each weight requires 2 bytes. Thus, the memory requirement per neuron would be approximately 32 MB. If the first hidden layer has 10,000 neurons, the total memory required for storing the weights in this layer would be around 316 GB.
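The arithmetic can be checked quickly; the sketch below uses binary megabytes and gigabytes, which is one reasonable reading of the chapter's approximate figures.

```python
# Back-of-the-envelope memory math for a fully connected first hidden layer
# fed with a 600 dpi, 3.937 x 3.937 inch RGB image and FP16 weights.
pixels = int(600 * 3.937) ** 2      # ~5.58 million pixels
rgb_values = pixels * 3             # ~16.7 million input features
bytes_per_weight = 2                # FP16

per_neuron = rgb_values * bytes_per_weight   # one weight per input feature
layer_total = per_neuron * 10_000            # 10,000 neurons in the first layer

print(f"pixels      : {pixels/1e6:.2f} M")
print(f"RGB values  : {rgb_values/1e6:.1f} M")
print(f"per neuron  : {per_neuron/2**20:.0f} MiB")
print(f"first layer : {layer_total/2**30:.0f} GiB")
```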

In contrast, Convolutional Neural Networks (CNNs) use Continue reading

AI for Network Engineers: Multi-Class Classification

 Introduction 

This chapter explains the multi-class classification training process. It begins with an introduction to the MNIST dataset (Modified National Institute of Standards and Technology dataset). Next, it describes how the SoftMax activation function computes the probability of the image fed into the model during the forward pass and how the weight parameters are adjusted during the backward pass to improve training results. Additionally, the chapter discusses the data parallelization strategy from a network perspective.
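For reference, the SoftMax computation over the ten output neurons is sketched below with made-up raw outputs; it turns the outputs into a probability distribution that can be compared against the one-hot encoded label (here assumed to be the digit 3).

```python
# SoftMax over ten output neurons (digits 0-9) with made-up raw outputs,
# compared against a one-hot label for the digit "3".
import math

logits = [1.2, 0.3, -0.5, 2.8, 0.0, 0.1, -1.0, 0.7, 0.2, -0.3]   # example values
exps = [math.exp(v) for v in logits]
probs = [v / sum(exps) for v in exps]                            # sums to 1.0

label = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]     # one-hot encoding of the digit 3
predicted = probs.index(max(probs))

print([round(p, 3) for p in probs])
print("predicted digit:", predicted, "correct:", label[predicted] == 1)
```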


MNIST Dataset

We will use the MNIST dataset [1], which consists of handwritten digits, to demonstrate the training process. The MNIST dataset includes four files: (1) a training set with 60,000 gray-scale images (28x28 pixels), (2) the corresponding training labels, (3) a test set with 10,000 images (28x28 pixels), and (4) the corresponding test labels. Figure 3-1 illustrates the structure and dependencies between the test dataset and the labels.

The file train-images-idx3-ubyte contains metadata describing how the images are ordered, along with the image pixel order. The file train-labels-idx1-ubyte defines which label (the digits 0-9) corresponds to which image in the image file. Since we have ten possible outputs, we use ten output neurons.

Before the training process begins, the labels for each image-label pair are one-hot Continue reading

AI for Network Engineers: Backpropagation Algorithm

 Introduction 


This chapter introduces the training model of a neural network based on the Backpropagation algorithm. The goal is to provide a clear and solid understanding of the process without delving deeply into the mathematical formulas, while still explaining the fundamental operations of the involved functions. The chapter also briefly explains why, and in which phases the training job generates traffic to the network, and why lossless packet transport is required. The Backpropagation algorithm is composed of two phases: the Forward pass (computation phase) and the Backward pass (adjustment and communication phase).

In the Forward pass, neurons in the first hidden layer calculate the weighted sum of input parameters received from the input layer, which is then passed to the neuron's activation function. Note that neurons in the input layer are not computational units; they simply pass the input variables to the connected neurons in the first hidden layer. The output from the activation function of a neuron is then used as input for the connected neurons in the next layer. The result of the activation function in the output layer represents the model's prediction, which is compared to the expected value (ground truth) using the error function. The output Continue reading

AI for Network Engineers: Chapter 2 – Backpropagation Algorithm: Introduction

This chapter introduces the training model of a neural network based on the Backpropagation algorithm. The goal is to provide a clear and solid understanding of the process without delving deeply into the mathematical formulas, while still explaining the fundamental operations of the involved functions. The chapter also briefly explains why, and in which phases the training job generates traffic to the network, and why lossless packet transport is required. The Backpropagation algorithm is composed of two phases: the Forward pass (computation phase) and the Backward pass (adjustment and communication phase).

In the Forward pass, neurons in the first hidden layer calculate the weighted sum of input parameters received from the input layer, which is then passed to the neuron's activation function. Note that neurons in the input layer are not computational units; they simply pass the input variables to the connected neurons in the first hidden layer. The output from the activation function of a neuron is then used as input for the connected neurons in the next layer, whether it is another hidden layer or the output layer. The result of the activation function in the output layer represents the model's prediction, which is compared to the expected Continue reading

Artificial Intelligence for Network Engineers: Introduction

Several books on artificial intelligence (AI) and deep learning (DL) have been published over the past decade. However, I have yet to find a book that explains deep learning from a networking perspective while providing a solid introduction to DL. My goal is to fill this gap by writing a book titled AI for Network Engineers (note that the title name may change during the writing process). Writing about such a complex subject will take time, but I hope to complete and release it within a year.

Part I: Deep Learning and Deep Neural Networks

The first part of the book covers the theory behind Deep Learning. It begins by explaining the construct of a single artificial neuron and its functionality. Then, it explores various Deep Neural Network models, such as Feedforward Neural Networks (FNN), Convolutional Neural Networks (CNN), and Recurrent Neural Networks (RNN). Next, the first part discusses data and model parallelization strategies such as Data, Pipeline, and Tensor Parallelism, explaining how input data and/or model sizes that exceed the memory capacity of GPUs within a single server can be distributed across multiple GPU servers.

Part II: AI Data Center Networking - Lossless Ethernet

After a brief Continue reading
