|
Now you can install this wheel locally or on another machine. |
|
|
|
pip install deepspeed-0.3.13+8cd046f-cp38-cp38-linux_x86_64.whl |
|
Multi-GPU Network Issues Debug |
|
When training or running inference with DistributedDataParallel and multiple GPUs, if you run into inter-communication issues between processes and/or nodes, you can use the following script to diagnose network problems.
|
|
|
wget https://raw.githubusercontent.com/huggingface/transformers/main/scripts/distributed/torch-distributed-gpu-test.py |
|
For example, to test how 2 GPUs interact, run:
|
|
|
python -m torch.distributed.run --nproc_per_node 2 --nnodes 1 torch-distributed-gpu-test.py |
|
If both processes can talk to each other and allocate GPU memory, each will print an OK status.
|
For more GPUs or nodes, adjust the launcher arguments accordingly.
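
For instance, on a hypothetical setup of 2 nodes with 8 GPUs each, the launch on the first node could look like the command below (the address and port are placeholders, not values from this document):

python -m torch.distributed.run --nproc_per_node 8 --nnodes 2 --node_rank 0 --master_addr 10.0.0.1 --master_port 29500 torch-distributed-gpu-test.py

On the second node, run the same command with --node_rank 1.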
|
You will find a lot more details inside the diagnostics script, including a recipe for running it in a SLURM environment.
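
For intuition, the kind of connectivity check such a script performs boils down to a few lines. The following is a minimal sketch of the idea, not the actual script's contents: each rank pins its GPU, joins an all_reduce and a barrier over NCCL, and prints an OK status on success.

import os
import torch
import torch.distributed as dist

# torch.distributed.run sets LOCAL_RANK (plus MASTER_ADDR, MASTER_PORT,
# RANK and WORLD_SIZE), so the default env:// rendezvous works here
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl")

# allocate GPU memory and exercise the NCCL collectives
tensor = torch.ones(1, device=f"cuda:{local_rank}")
dist.all_reduce(tensor)
dist.barrier()
print(f"rank {dist.get_rank()}: OK")

If a rank hangs or crashes here instead of printing OK, the problem lies in the network or NCCL setup rather than in your training code.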
|
An additional level of debugging is to add the NCCL_DEBUG=INFO environment variable as follows:
|
|
|
NCCL_DEBUG=INFO python -m torch.distributed.run --nproc_per_node 2 --nnodes 1 torch-distributed-gpu-test.py |
|
This will dump a lot of NCCL-related debug information, which you can then search for online if you find that problems are reported.
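
If the output is too noisy to read in the terminal, you can capture it to a file and narrow it down with NCCL's NCCL_DEBUG_SUBSYS environment variable, which restricts logging to a comma-separated list of subsystems such as INIT and NET (the log filename below is an arbitrary choice):

NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,NET python -m torch.distributed.run --nproc_per_node 2 --nnodes 1 torch-distributed-gpu-test.py 2>&1 | tee nccl-debug.log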