UET Relative Addressing and Its Similarities to VXLAN
Relative Addressing
As described in the previous section, applications use endpoint objects as their communication interfaces for data transfer. To write data from local memory to a target memory region on a remote GPU, the initiator must authorize the local UE-NIC to fetch data from local memory and describe where that data should be written on the remote side.
To route the packet to the correct Fabric Endpoint (FEP), the application and the UET provider must supply the FEP’s IP address (its Fabric Address, FA). To determine where in the remote process’s memory the received data belongs, the UE-NIC must also know:
- Which job the communication belongs to
- Which process within that job owns the target memory
- Which Resource Index (RI) table should be used
- Which entry in that table describes the exact memory location
This indirection model is called relative addressing.
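The lookup chain above can be sketched in a few lines of Python. This is a minimal illustration, not the UET wire format or any real provider API: the names `RelativeAddress`, `MemoryRegion`, `TargetNic`, `expose`, and `resolve`, the string used to name the RI table, and all numeric values are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, NamedTuple, Tuple


class RelativeAddress(NamedTuple):
    """Hypothetical container for the fields the text lists."""
    fabric_address: str  # FEP IP address (FA), used to route the packet
    job_id: int          # which job the communication belongs to
    pid: int             # which process within that job owns the memory
    ri_table: str        # which Resource Index (RI) table to use (name is illustrative)
    ri_index: int        # which entry in that table describes the memory


@dataclass(frozen=True)
class MemoryRegion:
    base: int    # start of the exposed buffer (illustrative value)
    length: int  # size of the exposed buffer in bytes


class TargetNic:
    """Toy model of the target side: one FEP plus its RI tables."""

    def __init__(self, fabric_address: str) -> None:
        self.fabric_address = fabric_address
        # (job_id, pid, ri_table) -> {ri_index: MemoryRegion}
        self._tables: Dict[Tuple[int, int, str], Dict[int, MemoryRegion]] = {}

    def expose(self, job_id: int, pid: int, ri_table: str,
               ri_index: int, region: MemoryRegion) -> None:
        """Register a memory region under a job, process, table, and index."""
        key = (job_id, pid, ri_table)
        self._tables.setdefault(key, {})[ri_index] = region

    def resolve(self, addr: RelativeAddress, offset: int = 0) -> int:
        """Walk job -> process -> RI table -> entry and return a local address."""
        if addr.fabric_address != self.fabric_address:
            raise LookupError("packet was routed to a different FEP")
        table = self._tables[(addr.job_id, addr.pid, addr.ri_table)]
        region = table[addr.ri_index]
        if not 0 <= offset < region.length:
            raise ValueError("offset falls outside the exposed region")
        return region.base + offset


# Usage: the target exposes a buffer, then an incoming write is resolved.
nic = TargetNic("10.0.1.11")
nic.expose(job_id=101, pid=0, ri_table="rma", ri_index=7,
           region=MemoryRegion(base=0x7F000000, length=4096))
addr = RelativeAddress("10.0.1.11", job_id=101, pid=0, ri_table="rma", ri_index=7)
local_address = nic.resolve(addr, offset=256)
```

The point of the sketch is that the initiator never handles a raw remote pointer. It names the job, the process, the table, and the entry, and the target side translates that tuple into a local memory location.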
How Relative Addressing Works
Figure 5-6 illustrates the concept. Two GPUs participate in distributed training. A process on GPU 0 with global rank 0 (PID 0) receives data from GPU 1 with global rank 1 (PID 1). The UE-NIC determines the target Fabric Endpoint (FEP) based on the destination IP address (FA = 10.0.1.11).
