Checkpointing
Intermediate checkpoints should be saved with fsdp_state_dict_type: SHARDED_STATE_DICT. Saving the full state dict with CPU offloading on rank 0 takes a long time and often ends in NCCL Timeout errors, because the process hangs indefinitely during broadcasting.
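As a minimal sketch, the setting above might sit in an Accelerate config file roughly as follows. Only fsdp_state_dict_type comes from the text; the surrounding keys are assumptions about a typical FSDP setup and would need to match the rest of your configuration.

```yaml
# Sketch of an Accelerate config fragment (assumed layout, not a complete config).
distributed_type: FSDP
fsdp_config:
  # Each rank writes its own shard, so no full state dict is gathered on rank 0.
  fsdp_state_dict_type: SHARDED_STATE_DICT
```

Checkpoints saved this way are sharded per rank, so they are meant to be reloaded under FSDP rather than read as a single consolidated state dict.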