CPU offload
You can also offload parameters and gradients to the CPU when they are not in use. This saves even more GPU memory and can help you fit large models where FSDP alone may not be sufficient.
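As a rough illustration of what CPU offload looks like at the PyTorch level, the sketch below builds a `CPUOffload` setting and passes it to `FullyShardedDataParallel`. The helper names `make_cpu_offload` and `wrap_with_offload` are hypothetical and exist only for this example; it also assumes the script is started with a distributed launcher so that a process group is already initialized.

```python
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import CPUOffload
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


def make_cpu_offload() -> CPUOffload:
    """Hypothetical helper: build the FSDP CPU offload setting."""
    # offload_params=True moves sharded parameters to the CPU when they are
    # not involved in computation; gradients are offloaded along with them.
    return CPUOffload(offload_params=True)


def wrap_with_offload(model: nn.Module) -> nn.Module:
    """Hypothetical helper: wrap a model in FSDP with CPU offload enabled."""
    # Assumption: a launcher (for example torchrun) has already called
    # init_process_group, because FSDP needs a process group to shard across.
    if not dist.is_initialized():
        raise RuntimeError("initialize torch.distributed before wrapping with FSDP")
    return FSDP(model, cpu_offload=make_cpu_offload())
```

Offloading trades GPU memory for extra host-to-device transfers, so training steps are typically slower; it is usually worth enabling only when the model does not fit otherwise.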