Widget does not take TP into account for Parameter / Gradient / Optimizer State Sharding
#98 · opened by Turakar
As far as I know, TP not only reduces activation memory, but should also shard the parameters and thus the associated gradients and the optimizer state. This does not seem to be reflected in the memory widget. Or am I missing something here? I would love to hear a clarification :)
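To make the expected effect concrete, here is a minimal back-of-the-envelope sketch (not the widget's actual code). It assumes bf16 parameters and gradients plus Adam with fp32 master weights, momentum, and variance (a common mixed-precision setup), and that a TP group of degree `tp` shards each of these states 1/tp per GPU; activation memory is ignored:

```python
def memory_per_gpu_gb(n_params, tp=1, bytes_per_param=2,
                      bytes_per_grad=2, bytes_per_opt=12):
    """Per-GPU memory (GB) for parameters, gradients, and optimizer state.

    Assumptions (illustrative, not the widget's implementation):
      - bf16 params (2 B) and bf16 grads (2 B)
      - Adam state in fp32: master weights + momentum + variance = 12 B
      - TP of degree `tp` shards all three states evenly across the group
    """
    total_bytes = n_params * (bytes_per_param + bytes_per_grad + bytes_per_opt)
    return total_bytes / tp / 1e9

# Hypothetical 7B-parameter model:
# 7e9 * 16 B = 112 GB unsharded; with TP=8 each rank would hold 14 GB.
full = memory_per_gpu_gb(7e9, tp=1)
tp8 = memory_per_gpu_gb(7e9, tp=8)
```

If the widget only divides activation memory by the TP degree and leaves these three terms at their full size, it would overestimate per-GPU memory by this factor.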
Anyway, thanks a lot for this great notebook! I enjoyed reading the side-by-side pros and cons of so many parallelization strategies; it has been very informative!