Widget does not take TP into account for Parameter / Gradient / Optimizer State Sharding
#98 · opened by Turakar
As far as I know, TP not only reduces activation memory, but should also shard the parameters and thus the associated gradients and the optimizer state. This does not seem to be reflected in the memory widget. Or am I missing something here? I would love to hear a clarification :)
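To make the expected effect concrete, here is a minimal back-of-the-envelope sketch (not the widget's actual code). It assumes bf16 parameters and gradients plus Adam with fp32 master weights, momentum, and variance (a common mixed-precision setup), and that a TP group of degree `tp` shards each of these states 1/tp per GPU; activation memory is ignored:

```python
def memory_per_gpu_gb(n_params, tp=1, bytes_per_param=2,
                      bytes_per_grad=2, bytes_per_opt=12):
    """Per-GPU memory (GB) for parameters, gradients, and optimizer state.

    Assumptions (illustrative, not the widget's implementation):
      - bf16 params (2 B) and bf16 grads (2 B)
      - Adam state in fp32: master weights + momentum + variance = 12 B
      - TP of degree `tp` shards all three states evenly across the group
    """
    total_bytes = n_params * (bytes_per_param + bytes_per_grad + bytes_per_opt)
    return total_bytes / tp / 1e9

# Hypothetical 7B-parameter model:
# 7e9 * 16 B = 112 GB unsharded; with TP=8 each rank would hold 14 GB.
full = memory_per_gpu_gb(7e9, tp=1)
tp8 = memory_per_gpu_gb(7e9, tp=8)
```

If the widget only divides activation memory by the TP degree and leaves these three terms at their full size, it would overestimate per-GPU memory by this factor.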
Anyway, thanks a lot for this great notebook! I enjoyed reading the side-by-side pros and cons of so many parallelization strategies; it has been very informative!