Model Parallelism with Pipeline Parallelism
In Model Parallelism, the neural network is partitioned across multiple GPUs, with each GPU responsible for specific layers of the model. This strategy is particularly beneficial for large-scale models that exceed the memory capacity of a single GPU.
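As a rough illustration of what this partitioning looks like in practice, here is a minimal PyTorch sketch. The layer sizes and the cuda:0/cuda:1 device names are illustrative assumptions, not part of the example above, and the sketch assumes two GPUs are available.

```python
import torch
import torch.nn as nn

class TwoGPUModel(nn.Module):
    """Toy model whose layers are split across two GPUs (model parallelism)."""

    def __init__(self):
        super().__init__()
        # First part of the network lives on GPU 0 ...
        self.part1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        # ... and the remaining layers live on GPU 1.
        self.part2 = nn.Linear(4096, 10).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        # Activations are copied across GPUs at the partition boundary.
        return self.part2(x.to("cuda:1"))

model = TwoGPUModel()
out = model(torch.randn(32, 1024))  # output tensor ends up on cuda:1
```

Each GPU holds only its own slice of the parameters, which is what lets the combined model exceed the memory of any single device.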
Pipeline Parallelism builds on this partitioning by dividing the model into consecutive stages and assigning each stage to a different GPU. This setup allows data to be processed in a pipeline fashion, akin to an assembly line, so that multiple micro-batches of training data are processed simultaneously on different stages. Without pipeline parallelism, each GPU would process the entire batch through its layers in turn while all the other GPUs sat idle.
Our example neural network in Figure 8-3 consists of three hidden layers and an output layer. The first hidden layer is assigned to GPU A1, while the second and third hidden layers are assigned to GPU A2 and GPU B1, respectively. The output layer is placed on GPU B2. The training dataset is divided into four micro-batches and stored on the GPUs. These micro-batches are fed sequentially into the first hidden layer on GPU A1.
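To make the flow of Figure 8-3 concrete, the following sketch models that schedule, assuming four pipeline stages mapped to four GPUs and four micro-batches. The stage definitions, layer sizes, and device names are illustrative placeholders rather than the actual configuration in the figure, and the single-threaded loop only models the fill-and-drain schedule; real pipeline engines overlap the stages with asynchronous execution.

```python
import torch
import torch.nn as nn

# Four pipeline stages, one per GPU (hidden 1 -> hidden 2 -> hidden 3 -> output).
# Device names are stand-ins for GPUs A1, A2, B1, and B2; four GPUs are assumed.
devices = ["cuda:0", "cuda:1", "cuda:2", "cuda:3"]
stages = [
    nn.Sequential(nn.Linear(512, 512), nn.ReLU()).to(devices[0]),
    nn.Sequential(nn.Linear(512, 512), nn.ReLU()).to(devices[1]),
    nn.Sequential(nn.Linear(512, 512), nn.ReLU()).to(devices[2]),
    nn.Linear(512, 10).to(devices[3]),
]

# Split one batch into four micro-batches that enter the pipeline one per step.
micro_batches = list(torch.randn(64, 512).chunk(4))
in_flight = [None] * len(stages)  # micro-batch currently held by each stage
outputs = []

num_steps = len(micro_batches) + len(stages) - 1  # pipeline fill + drain
for step in range(num_steps):
    # Advance the pipeline back-to-front so each hand-off happens once per step.
    for s in reversed(range(len(stages))):
        if s == 0:
            x = micro_batches[step] if step < len(micro_batches) else None
        else:
            x = in_flight[s - 1]
            in_flight[s - 1] = None
        if x is not None:
            in_flight[s] = stages[s](x.to(devices[s]))
    # The last stage emits one finished micro-batch per step once the pipe is full.
    if in_flight[-1] is not None:
        outputs.append(in_flight[-1])
        in_flight[-1] = None
```

By the fourth step the pipeline is full and all four stages are working on different micro-batches at the same time, which is exactly the overlap that plain model parallelism cannot provide.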
Note 8-1. In this example, we use a small training dataset. However, if the dataset is too large to fit on a

