By sharding the model parameters, gradients, and optimizer states across devices, and optionally offloading them to the CPU when they're inactive, FSDP (Fully Sharded Data Parallel) can reduce the per-GPU memory cost of large-scale training.
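To get a rough feel for why sharding helps, the sketch below estimates how much training state each GPU holds with and without sharding. It is a back-of-the-envelope calculation, not part of FSDP itself: the function name `per_gpu_state_bytes` is hypothetical, and the figure of 16 bytes per parameter is an assumption that corresponds to mixed-precision training with Adam. Activations, buffers, and communication overhead are ignored, and CPU offloading is not modeled.

```python
# Assumption: 16 bytes of training state per parameter (mixed-precision Adam):
#   2 (fp16 params) + 2 (fp16 grads) + 12 (fp32 master params, momentum, variance).
BYTES_PER_PARAM = 16


def per_gpu_state_bytes(num_params: int, world_size: int, sharded: bool = True) -> int:
    """Estimate bytes of params + grads + optimizer state held by each GPU.

    Hypothetical helper for illustration only; ignores activations and buffers.
    """
    if num_params < 0 or world_size < 1:
        raise ValueError("num_params must be >= 0 and world_size >= 1")
    total = num_params * BYTES_PER_PARAM
    if not sharded:
        # Plain data parallelism: every GPU keeps a full replica of all state.
        return total
    # Full sharding: each GPU keeps roughly 1/world_size of the state.
    return -(-total // world_size)  # ceiling division


if __name__ == "__main__":
    params = 1_000_000_000  # a 1B-parameter model
    for gpus in (1, 8, 64):
        replicated = per_gpu_state_bytes(params, gpus, sharded=False)
        sharded = per_gpu_state_bytes(params, gpus)
        print(
            f"{gpus:>2} GPUs: replicated {replicated / 1e9:.1f} GB, "
            f"sharded {sharded / 1e9:.2f} GB per GPU"
        )
```

Under these assumptions the replicated footprint stays constant as GPUs are added, while the sharded footprint shrinks roughly in proportion to the number of GPUs.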