Questions about pipeline parallelism

#103
by ink0215 - opened

My question is about the description of activation memory for pipeline parallelism, as the figure below suggests:
[image: image.png]

Since each GPU now holds only a subset of the model's layers, the activations should also be distributed among the GPUs, right? So from my perspective, pipeline parallelism should reduce the activation memory on each GPU rank during training.
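To make my intuition concrete, here is a back-of-envelope sketch with purely hypothetical numbers (layer count, per-layer activation size, pipeline degree, and microbatch count are all assumptions, not taken from any real model). It compares a single GPU holding all activations against the busiest stage under a 1F1B-style schedule, where each rank holds `L/p` layers but keeps up to `p` microbatches' activations in flight:

```python
# Hypothetical numbers for illustration only.
L = 32    # total layers (assumed)
A = 2.0   # activation memory (GB) per layer for the full batch (assumed)
p = 4     # pipeline-parallel degree (assumed)
m = 8     # number of microbatches (assumed)

# No parallelism: one GPU stores activations for all L layers.
mem_single = L * A

# Pipeline parallelism with 1F1B: each rank holds L/p layers, and the
# first stage keeps up to p microbatches alive at once; each microbatch
# contributes A/m of activation memory per layer.
mem_pp_first_stage = (L // p) * p * (A / m)

print(f"single GPU:            {mem_single:.1f} GB")
print(f"PP first stage (1F1B): {mem_pp_first_stage:.1f} GB")
```

If this accounting is right, the per-rank activation memory does shrink, though the saving depends on the schedule and microbatch count rather than being a simple 1/p.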

Am I right?
