UET Data Transfer Operation: Introduction
Introduction
[Updated 22 November 2025: Handoff Section]
The previous chapter described how an application gathers information about available hardware resources and uses that information to initialize the job environment. During this initialization, hardware resources are abstracted and made accessible to the UET provider as objects.
This chapter explains the data transport process, using gradient synchronization as an example.
Figure 5-1 depicts two GPUs—Rank 0 and Rank 2—participating in the same training job (JobID: 101). Both GPUs belong to the same NCCL topology and are connected to the Scale-Out Backend Network’s rail0.
Because the training model is large, each layer of the neural network is split across the two GPUs using tensor parallelism, meaning that the computations of a single layer are distributed between the GPUs.
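The short sketch below illustrates the idea of splitting one layer's computation between two ranks. It is only an illustrative example: the column-wise split, the array shapes, and the rank names are assumptions, not the actual partitioning scheme used by the training framework.

```python
import numpy as np

# Minimal sketch of tensor (column) parallelism for one linear layer.
# The full weight matrix W is split column-wise so that each of the two
# ranks holds, and computes with, only its own half of the layer.
hidden, out_features = 1024, 2048
x = np.random.randn(8, hidden)               # activations entering the layer
W = np.random.randn(hidden, out_features)    # full layer weights (never stored on one GPU)

W_rank0, W_rank2 = np.split(W, 2, axis=1)    # each rank keeps 1024 columns

y_rank0 = x @ W_rank0                        # partial output computed on Rank 0
y_rank2 = x @ W_rank2                        # partial output computed on Rank 2

# Concatenating the partial results reproduces the full layer output,
# which is what the collective communication step achieves in practice.
y_full = np.concatenate([y_rank0, y_rank2], axis=1)
assert np.allclose(y_full, x @ W)
```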
During the first forward-pass training iteration, the predicted model output does not match the expected result. This triggers the backward pass, in which gradients are computed: values indicating how much each weight parameter should be adjusted to improve the prediction of the next forward pass.
Rank 0 computes its gradients, which in Figure 5-1 form a 2D matrix with 3 rows and 1024 columns. The gradients are stored in a memory space registered for the process in local VRAM.
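The snippet below is a minimal, framework-level sketch of that step. The gradient shape matches the figure, but the layer sizes, loss function, and device selection are illustrative assumptions, and registering the buffer with the UET provider is a separate step that is not shown here.

```python
import torch

# Illustrative backward pass: the weight shard has the same 3 x 1024 shape
# as the gradient matrix in Figure 5-1, so backward() produces a gradient
# buffer of that shape in GPU memory (local VRAM) when a GPU is available.
device = "cuda" if torch.cuda.is_available() else "cpu"
torch.manual_seed(0)

weights = torch.randn(3, 1024, requires_grad=True, device=device)
inputs = torch.randn(16, 3, device=device)
target = torch.randn(16, 1024, device=device)

predicted = inputs @ weights                             # forward pass (partial layer)
loss = torch.nn.functional.mse_loss(predicted, target)   # mismatch with expected result
loss.backward()                                          # backward pass computes gradients

print(weights.grad.shape)   # torch.Size([3, 1024]): the gradient matrix to be synchronized
```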


