Widget does not take TP into account for Parameter / Gradient / Optimizer State Sharding

#98
by Turakar - opened

As far as I know, TP not only reduces activation memory but should also shard the parameters, and with them the associated gradients and optimizer state. This does not seem to be reflected in the memory widget. Or am I missing something here? Would love to hear a clarification :)
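To make concrete what I'd expect, here is a rough back-of-the-envelope sketch (my own illustration, not the widget's actual code): with TP degree t and every weight matrix sharded across the TP group, parameters, gradients, and Adam states should all scale roughly as 1/t per GPU. The byte counts below assume bf16 params/grads plus fp32 master weights, momentum, and variance, i.e. the usual 2 + 2 + 12 = 16 bytes per parameter.

```python
# Hypothetical accounting sketch: per-GPU memory for parameters, gradients,
# and Adam optimizer states with and without TP sharding. Assumes every
# parameter is evenly sharded across the TP group (in reality layer norms
# and biases are replicated, so the scaling is slightly worse than 1/t).

def per_gpu_state_bytes(n_params: float, tp_degree: int = 1) -> dict:
    sharded = n_params / tp_degree  # each TP rank holds 1/t of the weights
    return {
        "params_bf16": 2 * sharded,
        "grads_bf16": 2 * sharded,
        "adam_fp32": 12 * sharded,  # master weights + momentum + variance
    }

# Example: a 7B-parameter model.
for tp in (1, 8):
    total_gib = sum(per_gpu_state_bytes(7e9, tp).values()) / 2**30
    print(f"TP={tp}: ~{total_gib:.1f} GiB per GPU for params+grads+optimizer")
# TP=1: ~104.3 GiB; TP=8: ~13.0 GiB -- an 8x reduction in these three
# terms, which the widget does not appear to account for.
```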

Anyway, thanks a lot for this great notebook! I enjoyed reading the side-by-side comparison of the pros and cons of so many parallelization strategies; it has been very informative!
