Toni Pasanen

Author Archives: Toni Pasanen

Deep Learning for Network Engineers: Understanding Traffic Patterns and Network Requirements in the AI Data Center

About This Book

Several excellent books have been published over the past decade on Deep Learning (DL) and Datacenter Networking. However, I have not found a book that covers these topics together—as an integrated deep learning training system—while also highlighting the architecture of the datacenter network, especially the backend network, and the demands it must meet.

This book aims to bridge that gap by offering insights into how Deep Learning workloads interact with and influence datacenter network design.

So, what is Deep Learning?

Deep Learning is a subfield of Machine Learning (ML), which itself is a part of the broader concept of Artificial Intelligence (AI). Unlike traditional software systems where machines follow explicitly programmed instructions, Deep Learning enables machines to learn from data without manual rule-setting.

At its core, Deep Learning is about training artificial neural networks. These networks are mathematical models composed of layers of artificial neurons. Different types of networks suit different tasks—Convolutional Neural Networks (CNNs) for image recognition, and Large Language Models (LLMs) for natural language processing, to name a few.

Training a neural network involves feeding it labeled data and adjusting its internal parameters through a process called backpropagation. During the forward pass, the model Continue reading

AI for Network Engineers: Rail Designs in GPU Fabric

When building a scalable, resilient GPU network fabric, the design of the rail layer, the portion of the topology that interconnects GPU servers via Top-of-Rack (ToR) switches, plays a critical role. This section explores three different models: Multi-rail-per-switch, Dual-rail-per-switch, and Single-rail-per-switch. All three support dual-NIC-per-GPU designs, allowing each GPU to connect redundantly to two separate switches, thereby removing the Rail switch as a single point of failure.

Multi-Rail-per-Switch

In this model, multiple small subnets and VLANs are configured per switch, with each logical rail mapped to a subset of physical interfaces. For example, a single 48-port switch might host four or eight logical rails using distinct Layer 2 and Layer 3 domains. Because all logical rails share the same physical device, isolation is logical. As a result, a hardware or software failure in the switch can impact all rails and their associated GPUs, creating a large failure domain.
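As a rough illustration of the port-to-rail mapping described above, the following sketch carves a 48-port switch into eight logical rails, each with its own VLAN and subnet. The supernet `10.0.0.0/24`, the /28 rail subnets, the VLAN numbering, and the function name `plan_logical_rails` are all assumptions made for this example, not values from any validated design.

```python
import ipaddress

def plan_logical_rails(n_ports=48, n_rails=8, supernet="10.0.0.0/24", first_vlan=101):
    """Map contiguous port ranges to logical rails (hypothetical numbering)."""
    ports_per_rail = n_ports // n_rails
    # One small subnet per rail; a /24 yields sixteen /28 subnets.
    subnets = list(ipaddress.ip_network(supernet).subnets(new_prefix=28))
    plan = []
    for rail in range(n_rails):
        first_port = rail * ports_per_rail + 1
        plan.append({
            "rail": rail + 1,
            "vlan": first_vlan + rail,
            "subnet": str(subnets[rail]),
            "ports": list(range(first_port, first_port + ports_per_rail)),
        })
    return plan

plan = plan_logical_rails()
for rail in plan:
    print(rail)
```

Every entry in the plan lives on the same physical device, which is why a single switch failure takes down all of the logical rails at once.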

This model is not part of NVIDIA’s validated Scalable Unit (SU) architecture but may suit test environments, development clusters, or small-scale GPU fabrics where hardware cost efficiency is a higher priority than strict fault isolation. From a CapEx perspective, multi-rail-per-switch is the most economical, requiring fewer switches.

Figure 13-10 illustrates the Continue reading

Backend Network Topologies for AI Fabrics

Although there are best practices for AI Fabric backend networks, such as Data Center Quantized Congestion Notification (DCQCN) for congestion avoidance, rail-optimized routed Clos fabrics, and Layer 2 Rail-Only topologies for small-scale implementations, each vendor offers its own validated design. This approach is beneficial because validated designs are thoroughly tested, and when you build your system based on the vendor’s recommendations, you receive full vendor support and avoid having to reinvent the wheel.

However, instead of focusing on any specific vendor’s design, this chapter explains general design principles for building a resilient, non-blocking, and lossless Ethernet backend network for AI workloads.

Before diving into backend network design, this chapter first provides a high-level overview of a GPU server based on NVIDIA H100 GPUs. The first section introduces a shared NIC architecture, where 8 GPUs share two NICs. The second section covers an architecture where each of the 8 GPUs has a dedicated NIC.

Shared NIC

Figure 13-1 illustrates a shared NIC approach. In this example setup, NVIDIA H100 GPUs 0–3 are connected to NVSwitch chips 1-1, 1-2, 1-3, and 1-4 on baseboard-1, while GPUs 4–7 are connected to NVSwitch chips 2-1, 2-2, 2-3, and 2-4 on baseboard-2. Each GPU connects Continue reading

AI for Network Engineers: Understanding Flow, Flowlet, and Packet-Based Load Balancing

Though BGP supports the traditional Flow-based Layer 3 Equal Cost Multi-Pathing (ECMP) traffic load balancing method, it is not the best fit for a RoCEv2-based AI backend network. This is because GPU-to-GPU communication creates massive elephant flows, which RDMA-capable NICs transmit at line rate. These flows can easily cause congestion in the backend network.

In ECMP, all packets of a single flow follow the same path. If that path becomes congested, ECMP does not adapt or reroute traffic. This leads to uneven bandwidth usage across the network. Some links become overloaded, while others remain idle. In AI workloads, where multiple high-bandwidth flows occur at the same time, this imbalance can degrade performance.
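The path pinning described above can be sketched in a few lines. This is a minimal illustration that assumes a CRC32 hash over the 5-tuple; real switch ASICs use their own hash functions, and the uplink names and addresses here are hypothetical.

```python
import zlib

def ecmp_pick(five_tuple, uplinks):
    """Pick an uplink by hashing the flow's 5-tuple (illustrative hash only)."""
    key = "|".join(str(field) for field in five_tuple).encode()
    return uplinks[zlib.crc32(key) % len(uplinks)]

uplinks = ["spine-1", "spine-2", "spine-3", "spine-4"]

# Source IP, destination IP, protocol (17 = UDP), source port, destination port.
# RoCEv2 uses UDP destination port 4791.
flow = ("10.0.0.1", "10.0.1.1", 17, 49152, 4791)

# Every packet of the flow hashes to the same uplink, regardless of load.
paths = {ecmp_pick(flow, uplinks) for _ in range(1000)}
print(paths)
```

Because the hash input never changes during the life of the flow, the choice cannot react to congestion on the selected uplink.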

Deep learning models rely heavily on collective operations like all-reduce, all-gather, and broadcast. These generate dense traffic patterns between GPUs, often at terabit-per-second speeds. If these flows are not evenly distributed, a single congested path can slow down the entire training job.

This chapter introduces two alternative load balancing methods to traditional Flow-Based with Layer 3 ECMP: 1) Flowlet-Based Load Balancing with Adaptive Routing, and 2) Packet-Based Load Balancing with Packet Spraying. Both aim to improve traffic distribution in RoCEv2-based AI backend networks, where conventional flow-based routing often Continue reading

Congestion Avoidance in AI Fabric – Part III: Data Center Quantized Congestion Notification (DCQCN)

Data Center Quantized Congestion Notification (DCQCN) is a hybrid congestion control method. DCQCN brings together both Priority Flow Control (PFC) and Explicit Congestion Notification (ECN) so that we can get high throughput, low latency, and lossless delivery across our AI fabric. In this approach, each mechanism plays a specific role in addressing different aspects of congestion, and together they create a robust flow-control system for RDMA traffic.

DCQCN tackles two main issues in large-scale RDMA networks:

1. Head-of-Line Blocking and Congestion Spreading: This is caused by PFC’s pause frames, which stop traffic across switches.

2. Throughput Reduction with ECN Alone: When the ECN feedback is too slow, packet loss may occur despite the rate adjustments.

DCQCN uses a two-tiered approach. It applies ECN early on to gently reduce the sending rate at the GPU NICs, and it uses PFC as a backup to quickly stop traffic on upstream switches (hop-by-hop) when congestion becomes severe.
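The two tiers can be sketched as a simple function of buffer depth. This is a deliberately simplified model: it collapses the switch buffers into one number, uses a linear marking ramp between two thresholds, and all threshold values are hypothetical rather than recommended settings.

```python
def dcqcn_switch_reaction(depth_kb, ecn_min_kb=150, ecn_max_kb=1500, pfc_xoff_kb=3000):
    """Return (ECN marking probability, PFC pause needed) for a buffer depth.

    Thresholds are hypothetical; the PFC threshold sits above the ECN
    thresholds so that ECN acts first and PFC acts only as a backup.
    """
    if depth_kb < ecn_min_kb:
        mark_prob = 0.0
    elif depth_kb >= ecn_max_kb:
        mark_prob = 1.0
    else:
        mark_prob = (depth_kb - ecn_min_kb) / (ecn_max_kb - ecn_min_kb)
    return mark_prob, depth_kb >= pfc_xoff_kb

for depth in (100, 825, 2000, 3500):
    print(depth, dcqcn_switch_reaction(depth))
```

The ordering of the thresholds is the important design choice: if the PFC threshold were reached before ECN marking began, senders would be paused before they had any chance to slow down.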

How DCQCN Combines ECN and PFC

DCQCN carefully combines Explicit Congestion Notification (ECN) and Priority Flow Control (PFC) in the right sequence:

Early Action with ECN: When congestion begins to build up, the switch uses WRED thresholds (minimum and maximum) to mark packets. This signals the Continue reading

Congestion Avoidance in AI Fabric – Part II: Priority Flow Control (PFC)

Priority Flow Control (PFC) is a mechanism designed to prevent packet loss during network congestion by pausing traffic selectively based on priority levels. While the original IEEE 802.1Qbb standard operates at Layer 2, using the Priority Code Point (PCP) field in Ethernet headers, AI Fabrics rely on Layer 3 forwarding, where traditional Layer 2-based PFC is no longer applicable. To extend lossless behavior across routed (Layer 3) networks, DSCP-based PFC is used.

In DSCP-based PFC, the Differentiated Services Code Point (DSCP) field in the IP header identifies the traffic class or priority. Switches map specific DSCP values to internal traffic classes and queues. If congestion occurs on an ingress interface and a particular priority queue fills beyond a threshold, the switch can send a PFC pause frame back to the sender switch, instructing it to temporarily stop sending traffic of that class—just as in Layer 2 PFC, but now triggered based on Layer 3 classifications.
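A minimal sketch of this classification and pause logic follows. The DSCP-to-class mapping, the lossless class number, and the threshold are assumptions chosen for illustration; actual values depend on the platform and the QoS policy in use.

```python
from collections import defaultdict

# Hypothetical mapping: DSCP 26 -> lossless traffic class 3, DSCP 48 -> class 6.
DSCP_TO_TC = {26: 3, 48: 6}
LOSSLESS_TCS = {3}
DEFAULT_TC = 0

class IngressPort:
    def __init__(self, xoff_bytes):
        self.xoff_bytes = xoff_bytes
        self.buffered = defaultdict(int)  # bytes held per traffic class

    def receive(self, dscp, size_bytes):
        """Classify by DSCP; request a pause if a lossless class overfills."""
        tc = DSCP_TO_TC.get(dscp, DEFAULT_TC)
        self.buffered[tc] += size_bytes
        if tc in LOSSLESS_TCS and self.buffered[tc] > self.xoff_bytes:
            return ("PFC_PAUSE", tc)  # sent hop-by-hop to the upstream device
        return None

port = IngressPort(xoff_bytes=8192)
events = [port.receive(26, 4096) for _ in range(3)]
print(events)
```

Only the lossless class triggers a pause in this sketch, so best-effort traffic on the same link keeps flowing while the congested class is held back.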

This behavior differs from Explicit Congestion Notification (ECN), which operates at Layer 3 as well but signals congestion by marking packets instead of stopping traffic. ECN acts on the egress port, informing the receiver to notify the sender to reduce the transmission rate over Continue reading

Congestion Avoidance in AI Fabric – Part I: Explicit Congestion Notification (ECN)

As explained in the preceding chapter, “Egress Interface Congestions,” both the Rail switch links to GPU servers and the inter-switch links can become congested during gradient synchronization. It is essential to implement congestion control mechanisms specifically designed for RDMA workloads in AI fabric back-end networks because congestion slows down the learning process and even a single packet loss may restart the whole training process.

This section begins by introducing Explicit Congestion Notification (ECN) and Priority-based Flow Control (PFC), two foundational technologies used in modern lossless Ethernet networks. ECN allows switches to mark packets, rather than dropping them, when congestion is detected, enabling endpoints to react proactively. PFC, on the other hand, offers per-priority flow control, which can pause selected traffic classes while allowing others to continue flowing.

Finally, we describe how Datacenter Quantized Congestion Notification (DCQCN) combines ECN and PFC to deliver a scalable and lossless transport mechanism for RoCEv2 traffic in AI clusters.

GPU-to-GPU RDMA Write Without Congestion

Figure 11-1 illustrates a standard Remote Direct Memory Access (RDMA) Write operation between two GPUs. This example demonstrates how GPU-0 on Host-1 transfers local gradients (∇₁ and ∇₂) from memory to GPU-0 on Host-2. Both GPUs use RDMA-capable NICs connected Continue reading

AI for Network Engineers: Challenges in AI Fabric Design

Introduction

Figure 10-1 illustrates a simple distributed GPU cluster consisting of three GPU hosts. Each host has two GPUs and a Network Interface Card (NIC) with two interfaces. Intra-host GPU communication uses high-speed NVLink interfaces, while inter-host communication takes place via NICs over slower PCIe buses.

GPU-0 on each host is connected to Rail Switch A through interface E1. GPU-1 uses interface E2 and connects to Rail Switch B. In this setup, inter-host communication between GPUs connected to the same rail passes through a single switch. However, communication between GPUs on different rails takes three hops across the Rail, Spine, and Rail switches.
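The difference between same-rail and cross-rail paths can be expressed as a small lookup. The function name `switch_path` and the dictionary are illustrative only, and the sketch covers inter-host traffic, since intra-host traffic stays on NVLink.

```python
# GPU index -> rail, following the wiring described above.
RAIL_OF_GPU = {0: "A", 1: "B"}

def switch_path(src_gpu, dst_gpu):
    """Return the switches an inter-host flow crosses between two GPUs."""
    src_rail, dst_rail = RAIL_OF_GPU[src_gpu], RAIL_OF_GPU[dst_gpu]
    if src_rail == dst_rail:
        return [f"Rail Switch {src_rail}"]
    return [f"Rail Switch {src_rail}", "Spine", f"Rail Switch {dst_rail}"]

print(switch_path(0, 0))
print(switch_path(0, 1))
```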

In Figure 10-1, we use a data parallelization strategy where a training dataset is split into six micro-batches, which are distributed across the GPUs. All GPUs use the shared feedforward neural network model and compute local model outputs. Next, each GPU calculates the model error and begins the backward pass to compute neuron-based gradients. These gradients indicate how much, and in which direction, the weight parameters should be adjusted to improve the training result (see Chapter 2 for details).

Figure 10-1: Rail-Optimized Topology.

Egress Interface Congestions

After computing all gradients, each GPU stores the results in a local memory buffer and Continue reading

Tensor Parallelism

The previous section described how Pipeline Parallelism distributes entire layers across multiple GPUs. However, Large Language Models (LLMs) based on transformer architectures contain billions of parameters, making this approach insufficient.

For example, GPT-3 has approximately 605 million parameters in a single self-attention layer and about 1.2 billion parameters in a feedforward layer, and these figures apply to just one transformer block. Since GPT-3 has 96 transformer blocks, the total parameter count reaches approximately 173 billion. When adding embedding and normalization parameters, the total increases to roughly 175 billion parameters.
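The arithmetic behind these figures can be reproduced with a short calculation. It assumes a model dimension of 12,288, four square projection matrices (query, key, value, output) per attention layer, a feedforward hidden size of four times the model dimension, and it ignores bias terms.

```python
D_MODEL = 12288   # assumed model (embedding) dimension
N_BLOCKS = 96     # transformer blocks

# Query, key, value and output projections: four D_MODEL x D_MODEL matrices.
attention_params = 4 * D_MODEL * D_MODEL

# Two feedforward matrices: D_MODEL x 4*D_MODEL and 4*D_MODEL x D_MODEL.
ffn_params = 2 * D_MODEL * (4 * D_MODEL)

per_block = attention_params + ffn_params
total = N_BLOCKS * per_block

print(f"attention layer : {attention_params:,}")
print(f"feedforward     : {ffn_params:,}")
print(f"all {N_BLOCKS} blocks   : {total:,}")
```

With exact values the total lands just under 174 billion, which is consistent with the rounded per-layer figures quoted above.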

The number of parameters in a single layer alone often exceeds the memory capacity of a single GPU, making Pipeline Parallelism insufficient. Additionally, performing large matrix multiplications on a single GPU would be extremely slow and inefficient. Tensor Parallelism addresses this challenge by splitting computations within individual layers across multiple GPUs rather than assigning whole layers to separate GPUs, as done in Pipeline Parallelism.
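One common way to split a layer is to divide its weight matrix by columns, so that each GPU computes a slice of the output. The pure-Python sketch below shows that idea on a tiny matrix; the helper names are hypothetical, and plain lists stand in for the per-GPU shards.

```python
def matmul(x, w):
    """Multiply matrix x (rows x k) by matrix w (k x cols)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]

def split_columns(w, parts):
    """Split a weight matrix into equal column shards, one per GPU."""
    step = len(w[0]) // parts
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(parts)]

x = [[1.0, 2.0]]                       # one input row
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]             # full weight matrix

shards = split_columns(w, 2)           # each "GPU" holds half the columns
partials = [matmul(x, shard) for shard in shards]

# An all-gather style step concatenates the partial outputs per row.
gathered = [sum((p[r] for p in partials), []) for r in range(len(x))]
print(gathered)
```

Each shard needs only half of the weights in memory, and the concatenation step is where inter-GPU communication enters the picture.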

Chapter 7 introduces the Transformer architecture, but as a refresher, Figure 8-15 illustrates a stack of decoder modules in a transformer architecture. Each decoder module consists of a Self-Attention layer and a Feedforward layer. The figure also shows how an input word, represented by x1, is first Continue reading

Model Parallelism with Pipeline Parallelism

In Model Parallelism, the neural network is partitioned across multiple GPUs, with each GPU responsible for specific layers of the model. This strategy is particularly beneficial for large-scale models that surpass the memory limitations of a single GPU.

Conversely, Pipeline Parallelism involves dividing the model into consecutive stages, assigning each stage to a different GPU. This setup allows data to be processed in a pipeline fashion, akin to an assembly line, enabling simultaneous processing of multiple training samples. Without pipeline parallelism, each GPU would process its inputs sequentially from the complete dataset, while all other GPUs remain idle.

Our example neural network in Figure 8-3 consists of three hidden layers and an output layer. The first hidden layer is assigned to GPU A1, while the second and third hidden layers are assigned to GPU A2 and GPU B1, respectively. The output layer is placed on GPU B2. The training dataset is divided into four micro-batches and stored on the GPUs. These micro-batches are fed sequentially into the first hidden layer on GPU A1.
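To see how the four micro-batches overlap across the four GPUs, the sketch below prints a forward-pass timeline. It is a simplified schedule that assumes every stage takes one time step and leaves out the backward pass; it is not necessarily the schedule used in the book's figures.

```python
STAGES = ["GPU A1", "GPU A2", "GPU B1", "GPU B2"]
N_MICRO_BATCHES = 4

def forward_schedule(stages, n_micro):
    """For each time step, map every busy stage to its micro-batch number."""
    steps = []
    for t in range(len(stages) + n_micro - 1):
        active = {
            stage: t - s + 1            # micro-batches are numbered from 1
            for s, stage in enumerate(stages)
            if 0 <= t - s < n_micro
        }
        steps.append(active)
    return steps

schedule = forward_schedule(STAGES, N_MICRO_BATCHES)
for t, active in enumerate(schedule, start=1):
    print(f"t={t}: {active}")
```

All four GPUs are busy only in the middle of the timeline; the partially idle steps at the start and end are the pipeline fill and drain phases.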

Note 8-1. In this example, we use a small training dataset. However, if the dataset is too large to fit on a Continue reading

Parallelism Strategies in Deep Learning

Introduction

Figure 8-1 depicts some of the model parameters that need to be stored in GPU memory: a) Weight matrices associated with connections to the preceding layer, b) Weighted sum (z), c) Activation values (y), d) Errors (E), e) Local gradients (local ∇), f) Gradients received from peer GPUs (remote ∇), g) Learning rates (LR), and h) Weight adjustment values (Δw).

In addition, the training and test datasets, along with the model code, must also be stored in GPU memory. However, a single GPU may not have enough memory to accommodate all these elements. To address this limitation, an appropriate parallelization strategy must be chosen to efficiently distribute computations across multiple GPUs.

This chapter introduces the most common strategies: data parallelism, model parallelism, pipeline parallelism, and tensor parallelism.

Figure 8-1: Overview of Neural Networks Parameters.

Data Parallelism

In data parallelization, each GPU has an identical copy of the complete model but processes different mini-batches of data. Gradients from all GPUs are averaged and synchronized before updating the model. This approach is effective when the model fits within a single GPU’s memory.
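The synchronization step can be sketched as an element-wise average followed by the same update on every replica. Plain Python lists stand in for GPU tensors here, and the gradient values, learning rate, and helper names are made up for the example.

```python
def all_reduce_mean(local_grads):
    """Average gradients element-wise across all GPUs."""
    n = len(local_grads)
    return [sum(values) / n for values in zip(*local_grads)]

def sgd_step(weights, grads, lr):
    """Apply one plain gradient-descent update."""
    return [w - lr * g for w, g in zip(weights, grads)]

weights = [0.5, -0.5]                       # identical copy on every GPU
local_grads = [[0.25, -0.5], [0.75, 0.5]]   # one gradient list per GPU

avg_grads = all_reduce_mean(local_grads)
replicas = [sgd_step(weights, avg_grads, lr=0.5) for _ in local_grads]
print(avg_grads, replicas)
```

Because every GPU applies the same averaged gradients to the same starting weights, the model copies stay identical after each step.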

In Figure 8-2, the batch of training data is split into eight micro-batches. The first four micro-batches are Continue reading

Training Neural Networks: Backpropagation Algorithm

Introduction

The previous chapter explained the operation of a single artificial neuron. It covered how input values are multiplied by their respective weight parameters, summed together, and combined with a bias term. The resulting value, z, is then passed through a non-linear sigmoid function, which squeezes the neuron’s output value ŷ between 0 and 1.
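As a quick refresher, a single neuron of this kind can be written in a few lines. The input, weight, and bias values are arbitrary examples.

```python
import math

def sigmoid(z):
    """Squeeze any real value into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias):
    """Return the weighted sum z and the activated output y_hat."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return z, sigmoid(z)

z, y_hat = neuron(inputs=[1.0, 2.0], weights=[0.5, -0.25], bias=0.0)
print(z, y_hat)
```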

In this chapter, we form the smallest possible Feed Forward Neural Network (FFNN) model using only two neurons. While this is far from a Deep Neural Network (DNN), a simple NN with two neurons is sufficient to explain the Backpropagation algorithm, which is the focus of this chapter.

The goal is to demonstrate the training process and illustrate how the Forward Pass (computation phase) first generates a model output, ŷ. The algorithm then evaluates the model’s accuracy by computing the error term using Mean Squared Error (MSE). The first training iteration rarely, if ever, produces a perfect output. To gradually bring the training result closer to the expected value, the Backward Pass (adjustment and communication phase) calculates the magnitude and direction by which the weight values should be adjusted. The Backward Pass is repeated as many times as necessary until an acceptable model Continue reading

Introduction of an Artificial Neuron

Introduction

Before diving into the somewhat complex world of Artificial Intelligence (AI), let’s first consider what intelligence means from a human perspective. Judo, as a martial art, serves as a good, though not obvious, example. I trained in judo for over 20 years. During that time, I learned which throwing techniques to use to take down an opponent efficiently by leveraging their movement energy and reactions. But how did I learn that? Through a supervised training process, where our coach first taught us the throwing techniques and the situations in which they work best. Then, we practiced them ourselves. Mastering these techniques requires thousands of repetitions before achieving perfection. Ultimately, timing and reaction to the opponent’s movements play a significant role in determining whether a throw is successful or not. After mastering several throwing techniques, I was able to apply them in situations I had not necessarily seen before.

How does this relate to Artificial Intelligence (AI)? AI is a broad term encompassing solutions that aim to mimic human brain functions. A subset of AI is Machine Learning (ML), which enables systems to make decisions based on input data without being explicitly programmed for each scenario. The driving force behind this Continue reading

Large Language Model (LLM) – Part 2/2: Transformer Architecture

Introduction

Sequence-to-sequence (seq2seq) language translation and Generative Pretrained Transformer (GPT) models are subcategories of Natural Language Processing (NLP) that utilize the Transformer architecture. Seq2seq models typically use Long Short-Term Memory (LSTM) networks or encoder-decoder-based Transformers. In contrast, GPT is an autoregressive language model that uses a decoder-only Transformer. The purpose of this chapter is to provide an overview of the decoder-only Transformer architecture.

The Transformer consists of stacks of decoder modules. A word embedding vector, a result of the word tokenization and embedding, is fed as input to the first decoder module. After processing, the resulting context vector is passed to the next decoder, and so on. After the final decoder, a softmax layer evaluates the output against the complete vocabulary to predict the next word. As an autoregressive model, the predicted word vector from the softmax layer is converted into a token before being fed back into the subsequent decoder layer. This process involves a token-to-word vector transformation prior to re-entering the decoder.

Each decoder module consists of an attention layer, Add & Normalization layer and a feedforward neural network (FFNN). Rather than feeding the embedded word vector (i.e., token embedding plus positional encoding) directly Continue reading

Large Language Models (LLM) – Part 1/2: Word Embedding

Introduction

This chapter introduces the basic operations of Transformer-based Large Language Models (LLMs), focusing on fundamental concepts rather than any specific LLM, such as OpenAI’s GPT (Generative Pretrained Transformer). The chapter begins with an introduction to tokenization and word embeddings, which convert input words into a format the model can process. Next, it explains how the transformer component leverages decoder architecture for input processing and prediction.

This chapter has two main goals. First, it explains how an LLM understands the context of a word. For example, the word “clear” can be used as a verb (Please, clear the table.) or as an adjective (The sky was clear.), depending on the context. Second, it discusses why LLMs require parallelization across hundreds or even thousands of GPUs due to the large model size, massive datasets, and the computational complexity involved.

Tokenizer and Word Embedding Matrix

As a first step, we import a vocabulary into the model. The vocabulary used for training large language models (LLMs) typically consists of a mix of general and domain-specific terms, including basic vocabulary, technical terminology, academic and formal language, idiomatic expressions, cultural references, as well as synonyms and antonyms. Each word and character is Continue reading

AI for Network Engineers: LSTM-Based RNN

Recap of the Operation of an LSTM Cell

The previous section introduced the construction and operation of a single Long Short-Term Memory (LSTM) cell. This section briefly discusses an LSTM-based Recurrent Neural Network (RNN). Before diving into the details, let’s recap how an individual LSTM cell operates with a theoretical, non-mathematical example.

Suppose we want our model to produce the sentence: “It was cloudy, but it is raining now.” The first part of it refers to the past, and one of the LSTM cells has stored the tense “was” in its internal cell state. However, the last portion of the sentence refers to the present. Naturally, we want the model to forget the previous tense “was” and update its state to reflect the current tense “is.”

The Forget Gate plays a role in discarding unnecessary information. In this case, the forget gate suppresses the word “was” by closing its gate (outputting 0). The Input Gate is responsible for providing a new candidate cell state, which in this example is the word “is.” The input gate is fully open (outputting 1) to allow the latest information to be introduced.
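The effect of the two gates can be illustrated with the commonly used LSTM cell-state update, in which the forget gate scales the old state and the input gate scales the new candidate. The numeric stand-ins for "was" and "is" are purely hypothetical, and real cells operate on vectors rather than single numbers.

```python
def update_cell_state(prev_state, forget_gate, input_gate, candidate):
    """Standard LSTM cell-state update, shown here for scalar values."""
    return forget_gate * prev_state + input_gate * candidate

PAST, PRESENT = -1.0, 1.0   # hypothetical encodings of the tenses "was" and "is"

# Forget gate closed (0) drops "was"; input gate open (1) admits "is".
new_state = update_cell_state(PAST, forget_gate=0.0, input_gate=1.0, candidate=PRESENT)

# The opposite setting would keep the old tense and ignore the new one.
kept_state = update_cell_state(PAST, forget_gate=1.0, input_gate=0.0, candidate=PRESENT)
print(new_state, kept_state)
```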

The Identification function computes the updated cell state by Continue reading

AI for Network Engineers: Long Short-Term Memory (LSTM)

Introduction

As mentioned in the previous chapter, Recurrent Neural Networks (RNNs) can have hundreds or even thousands of time steps. These basic RNNs often suffer from the gradient vanishing problem, where the network struggles to retain historical information across all time steps. In other words, the network gradually "forgets" historical information as it progresses through the time steps.

One solution to the horizontal gradient vanishing problem between time steps is to use a Long Short-Term Memory (LSTM) based RNN instead of a basic RNN. LSTM cells can preserve historical information across all time steps, whether the model contains ten or several thousand time steps.

Figure 6-1 illustrates the overall architecture of an LSTM cell. It includes three gates: the Forget gate, the Input gate (a.k.a. Remember gate), and the Output gate. Each gate contains input neurons that use the Sigmoid activation function. The reason for employing the Sigmoid function, as shown in Figure 5-4 of the previous chapter, is its ability to produce outputs in the range of 0 to 1. An output of 0 indicates that the gate is "closed," meaning the information is excluded from contributing to the cell's internal state calculations. An output of Continue reading

AI for Network Engineers: Recurrent Neural Network (RNN) – Part II

Challenges of an RNN Model

Figure 5-3 shows the last two time steps of our Recurrent Neural Network (RNN). At the time step n (on the left side), there are two inputs for the weighted sum calculation: Xn (the input at the current time step) and ht−1 (the hidden state from the previous time step).

First, the model calculates the weighted sum of these inputs. The result is then passed through the neuron’s activation function (Sigmoid in this example). The output of the activation function, ht, is fed back into the recurrent layer on the next time step, n+1. At time step n+1, ht is combined with the input Xn+1 to calculate the weighted sum. This result is then passed through the activation function, which now produces the model's prediction, ŷ (y hat). These steps are part of the Forward Pass process.

As the final step in the forward pass, we calculate the model's accuracy using the Mean Square Error (MSE) function (explained in Chapter 2).
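The two time steps and the error calculation can be sketched with scalar values. The weights, inputs, and target below are arbitrary, and the sketch assumes a single recurrent neuron whose final output is used directly as the prediction.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def rnn_step(x, h_prev, w_x, w_h, bias):
    """One recurrent step: weighted sum of input and previous hidden state."""
    return sigmoid(w_x * x + w_h * h_prev + bias)

def mse(predictions, targets):
    """Mean Squared Error over paired predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

w_x, w_h, bias = 0.8, 0.5, 0.0        # arbitrary example parameters
h_n = rnn_step(x=1.0, h_prev=0.0, w_x=w_x, w_h=w_h, bias=bias)     # time step n
y_hat = rnn_step(x=0.5, h_prev=h_n, w_x=w_x, w_h=w_h, bias=bias)   # time step n+1
loss = mse([y_hat], [1.0])
print(h_n, y_hat, loss)
```

The same weights are reused at both time steps, which is what distinguishes the recurrent layer from two separate feed-forward layers.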

If the model's accuracy is not close enough to the expected result, it begins the Backward Pass to improve its performance. The most used optimization algorithm for minimizing the loss function during Continue reading

AI for Network Engineers: Recurrent Neural Network (RNN)

Introduction

So far, this book has introduced two neural network architectures. The first one, the Feed-Forward Neural Network (FNN), works well for simple tasks, such as recognizing handwritten digits in small-sized images. The second one, the Convolutional Neural Network (CNN), is designed for processing larger images. CNNs can identify objects in images even when the location or orientation of the object changes.

This chapter introduces the Recurrent Neural Network (RNN). Unlike FNNs and CNNs, an RNN’s inputs include not only the current data but also all the inputs it has processed previously. In other words, an RNN preserves and uses historical data. This is achieved by feeding the output of the previous time step back into the hidden layer along with the current input vector.

Although RNNs can be used for predicting sequential data of variable lengths, such as sales figures or a patient’s historical health records, this chapter focuses on how RNNs can perform character-based text autocompletion. The upcoming chapters will explore word-based text prediction.

Text Datasets

For training the RNN model, we typically use text datasets like IMDB Reviews or the Wikipedia Text Corpus. However, in this chapter, we simplify the process by using a tailored dataset containing Continue reading

AI for Network Engineers: Convolutional Neural Network

Introduction

The previous chapter explained how Feed-forward Neural Networks (FNNs) can be used for multi-class classification of 28 x 28 pixel handwritten digits from the MNIST dataset. While FNNs work well for this type of task, they have significant limitations when dealing with larger, high-resolution color images.

In neural network terminology, each RGB value of an image is treated as an input feature. For instance, a high-resolution 600 dpi RGB color image with dimensions 3.937 x 3.937 inches contains approximately 5.58 million pixels, resulting in roughly 17 million RGB values.

If we use a fully connected FNN for training, all these 17 million input values are fed into every neuron in the first hidden layer. Each neuron must compute a weighted sum based on these 17 million inputs. The memory required for storing the weights depends on the numerical precision format used. For example, using the 16-bit floating-point (FP16) format, each weight requires 2 bytes. Thus, the memory requirement per neuron would be approximately 32 MB. If the first hidden layer has 10,000 neurons, the total memory required for storing the weights in this layer would be around 316 GB.
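The sizing above can be checked with a short calculation. The exact result depends on rounding and on whether decimal (MB, GB) or binary (MiB, GiB) units are used, so the output is close to, but not identical with, the rounded figures in the text.

```python
DPI = 600
SIDE_INCHES = 3.937
BYTES_PER_WEIGHT = 2        # FP16
NEURONS = 10_000

side_pixels = round(DPI * SIDE_INCHES)
pixels = side_pixels ** 2
inputs = pixels * 3         # one value each for R, G and B

bytes_per_neuron = inputs * BYTES_PER_WEIGHT
layer_bytes = bytes_per_neuron * NEURONS

print(f"pixels           : {pixels:,}")
print(f"input values     : {inputs:,}")
print(f"per neuron       : {bytes_per_neuron / 1e6:.1f} MB ({bytes_per_neuron / 2**20:.1f} MiB)")
print(f"first layer total: {layer_bytes / 1e9:.1f} GB ({layer_bytes / 2**30:.1f} GiB)")
```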

In contrast, Convolutional Neural Networks (CNNs) use Continue reading
