A Few Errors

#86
by gordicaleksa - opened

Awesome work! And thanks for linking my Flash Attention blog post :)

Caught a few errors while reading (WIP - will add more as I go through the whole thing):

Typos:

  1. Cheatsheet glossary: ep -> "expert parallelism degree" not "context parallelism degree"
  2. "PROFILING THE MEMORY USAGE" -> "througho ut training" -> "throughout training"
  3. "extremely usefull" -> "extremely useful"
  4. "attention module will requires" -> "require"
  5. "the memory savings in activations when using TP with SP helps us fit far bigger batches than TP alone" mentioned twice (in succession) in the summarization section of the TP/SP chapter, i.e. bullet points 2 & 3 are the same
  6. "As you can see, ZeRO-3 and PP sove" -> "solve"
  7. "need to be balanced in Pipaline Parallelism," -> "Pipeline"
  8. "that are actually used to distribute and training larger" -> "train larger"
  9. "Efficiently accessing data from global memory can improve a lot the performance." -> "can improve performance by a lot"
  10. "Let's briefly mentionned" -> "Let's briefly go through"
  11. "For float16 it is ..." -> there is a weird tilda (~) over 10^-3 here
  12. "and when you should should be ready to follow the blog post easily." -> "and you should now be ready to follow the blog post easily."
    (note: maybe just pass it once through Grammarly free :) you can just Ctrl+F the strings on the left side to find matches for the errors I found)

Logic:

  1. Throughput Scaling with TP/SP (3B Model) -> for TP=32 you get 41.4% whereas for TP=16 you get 43.4% (so it gets better :) despite the chart & logic showing the opposite)
  2. In general I'm a bit suspicious of the TP vs. TP/SP throughput scaling / maximum batch size plots; it seems like for TP=32 you can fit 5x the batch size just due to SP?
  3. "Looking at the figure above, we notice something interesting: while the model parameters are nicely split across GPUs, the activation memory remains the same on each GPU! This is because each GPU still needs to process the full batch of data, just with different layers" <- this doesn't make sense for pipeline parallelism? Activations for only a subset of layers now need to be kept on each GPU. And assuming activation checkpointing the conclusion is the same: if we keep 4 layers per GPU, you now need 4 × X memory (assuming, simplistically, that you store activations at the beginning of each transformer layer) vs. 4 × X × PP without pipeline parallelism, where PP is the number of pipeline stages.
  4. The final table in the "5D parallelism in a nutshell" section has errors in the "Disadvantage" and "Parallel/sharding dimension" columns for ZeRO-1, ZeRO-2, and ZeRO-3.
  5. (A2: "typical scales in LLM training" section): "So total optimizer states will be around (6 x h^2) per weight matrix" -> shouldn't this be 12 x h^2, given that we need fp32?
  6. In the A3 section it might be worth mentioning (since the book is meant even for those who lack background) that attention FLOPs are dropped because they're (usually, assuming a shorter context length) negligible. E.g. you set the FLOPs for a single transformer layer to 32 x seq x mbs x h^2 -> so per token you have 32 x h^2, which is 4 x (2 x h^2) for the attention projections [4 matrices Q, K, V, O, each matmul taking 2 FLOPs per multiply-accumulate] plus 3 x (4 x 2 x h^2) for the MLP [assuming a gated unit, otherwise it would be 2 matrices, and assuming the intermediate dim is 4x; that's a lot of assumptions for someone new, heh]. See the sketch right after this list.
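
To make that arithmetic concrete, here's a minimal sketch (the hidden size is made up, and the attention-score FLOPs are ignored as mentioned):

```python
# Per-token FLOPs of one transformer layer, dropping the attention-score term
# (the seq^2 part). Hypothetical hidden size; 2 FLOPs per multiply-accumulate.

h = 4096  # hidden size (made up for illustration)

attn_proj = 4 * (2 * h * h)  # Q, K, V, O projections: 4 * 2h^2 = 8h^2
mlp = 3 * (2 * h * 4 * h)    # gated MLP (gate/up/down), 4x intermediate: 3 * 8h^2 = 24h^2

assert attn_proj + mlp == 32 * h * h  # the per-token 32*h^2 behind the 32 x seq x mbs x h^2 formula
```
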
Nanotron Research org

Thanks for the feedback! Corrected the typos (except the one in the cheatsheet, which I don't have access to). Will let @nouamanetazi correct/answer the logic part :)

Nanotron Research org

cc @nouamanetazi in case you missed it

Nanotron Research org

Thank you @gordicaleksa for the detailed review :))

Regarding your Logic questions:

  1. The 2% gain is just noise, as you can see from the variance we noticed in our interconnect benchmarks. The lessons to be learned from the charts are summarized right below them (a big performance drop when moving from NVLink to EFA, and better memory savings with SP)
  2. For throughput scaling, see 1. Regarding the maximum batch size we could fit, there are also some memory fragmentation issues due to PyTorch that I'm hoping to document in a later tweet 🙃 (to get a sense, it's a bit similar to some of the issues here)
  3. Good catch. Yeah, I remember we had that discussion internally at some point; we kept the answer but the clarification didn't make it into the blog post, sorry 😅 All memory usage plots were made to simulate the max memory usage in a training step. In the case of PP, if you look at any PP schedule we covered, you'll find that GPU0 has to do pp forwards before it can start doing backwards. So GPU0, which holds num_params / pp parameters, will have to store (activs / pp) * pp ≈ activs. That's why we say that PP doesn't really reduce activation memory for a training step, assuming no checkpointing (see the sketch after this list)
  4. Can you specify what the errors are? For example, ZeRO-1 does shard the optimizer states along DP, and it does have the params communication overhead disadvantage (the big AllGather in the overlap diagram)
  5. The 6⋅h^2 actually comes from 2⋅h^2 (FP32 master weights) + 4⋅h^2 (FP32 optimizer states). So I do need to fix the latter two, thanks for noticing!
  6. True! Added a note regarding the 32⋅seq_len⋅mbs⋅h^2 (another note that got dropped when porting, lol)
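
Here's a minimal sketch of the accounting in point 3 (the numbers are made up; it just assumes GPU0 runs pp forward microbatches before its first backward, as in the schedules covered in the post):

```python
# Toy accounting for the PP warm-up argument above (hypothetical numbers,
# not a real profile).

pp = 4                       # number of pipeline stages
full_model_activs_gb = 16.0  # activation memory of the whole model for ONE microbatch (made up)

activs_per_stage = full_model_activs_gb / pp  # GPU0 holds only 1/pp of the layers...
live_microbatches = pp                        # ...but keeps pp microbatches alive before backward starts

peak_gpu0 = live_microbatches * activs_per_stage
print(peak_gpu0)  # == full_model_activs_gb: PP alone doesn't reduce peak activation memory
```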

Hope that answers your questions! (and sorry for the delay)
