Questions about pipeline parallelism
#103
by ink0215 · opened
My question is about the description of the activation memory for pipeline parallelism.

Since each GPU only holds part of the model's layers, the activations should also be distributed among the ranks, right? So from my perspective, pipeline parallelism should reduce the activation memory on each GPU rank during training.
Am I right?
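
To make my reasoning concrete, here is a minimal back-of-envelope sketch (the function, the uniform per-layer cost, and all numbers are illustrative assumptions on my part, not from the article):

```python
# Hypothetical estimate of forward-pass activation memory on one pipeline rank.
# Assumes every layer produces the same number of activation bytes.

def activation_memory_per_rank(total_layers: int,
                               pp_ranks: int,
                               act_bytes_per_layer: float,
                               microbatches_in_flight: int = 1) -> float:
    """Activation bytes held by one rank: it stores activations only for its
    own total_layers / pp_ranks layers, but possibly for several in-flight
    microbatches at once."""
    layers_per_rank = total_layers / pp_ranks
    return layers_per_rank * act_bytes_per_layer * microbatches_in_flight

# Example: a 32-layer model with 1 GiB of activations per layer.
print(activation_memory_per_rank(32, 1, 2**30) / 2**30)  # 32.0 GiB, no PP
print(activation_memory_per_rank(32, 4, 2**30) / 2**30)  #  8.0 GiB, PP=4
```

The `microbatches_in_flight` knob is there because, with schedules like GPipe, early stages buffer activations for several microbatches during warm-up, so the per-rank saving can be smaller than the naive `1 / pp_ranks` factor.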