Now you can install this wheel locally or on another machine.

```bash
pip install deepspeed-0.3.13+8cd046f-cp38-cp38-linux_x86_64.whl
```

## Multi-GPU Network Issues Debug

When training or inferencing with `DistributedDataParallel` and multiple GPUs, if you run into issues with inter-communication between processes and/or nodes, you can use the following script to diagnose network problems.

```bash
wget https://raw.githubusercontent.com/huggingface/transformers/main/scripts/distributed/torch-distributed-gpu-test.py
```

For example, to test how 2 GPUs interact, run:

```bash
python -m torch.distributed.run --nproc_per_node 2 --nnodes 1 torch-distributed-gpu-test.py
```

If both processes can talk to each other and allocate GPU memory, each will print an OK status.

For more GPUs or nodes, adjust the arguments in the script.

You will find a lot more details inside the diagnostics script, and even a recipe for how to run it in a SLURM environment.

An additional level of debugging is to add the `NCCL_DEBUG=INFO` environment variable as follows:

```bash
NCCL_DEBUG=INFO python -m torch.distributed.run --nproc_per_node 2 --nnodes 1 torch-distributed-gpu-test.py
```

This will dump a lot of NCCL-related debug information, which you can then search online if you find that some problems are reported.
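To illustrate what such a connectivity check boils down to, here is a minimal sketch (not the actual `torch-distributed-gpu-test.py`, which is more thorough): each process binds to its local GPU, joins the NCCL process group, and performs an `all_reduce`, so any allocation or inter-process communication failure surfaces as an error or a hang instead of an OK status.

```python
# Minimal sketch of a distributed GPU connectivity probe (hypothetical file name:
# gpu_connectivity_sketch.py). Launch it the same way as the real diagnostics script:
#   python -m torch.distributed.run --nproc_per_node 2 --nnodes 1 gpu_connectivity_sketch.py
import os

import torch
import torch.distributed as dist


def main():
    # torch.distributed.run sets LOCAL_RANK, RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Joining the NCCL process group already exercises the rendezvous step.
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    world_size = dist.get_world_size()

    # Allocate a tensor on this GPU and sum it across all processes.
    tensor = torch.ones(1, device=f"cuda:{local_rank}")
    dist.all_reduce(tensor)

    # Reaching this point means this rank can allocate GPU memory and talk to its peers.
    print(f"rank {rank}/{world_size}: OK (all_reduce result = {tensor.item()})")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```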