|
Now you can install this wheel locally or on another machine. |
|
|
|
pip install deepspeed-0.3.13+8cd046f-cp38-cp38-linux_x86_64.whl |
|
Multi-GPU Network Issues Debug |
|
When training or running inference with DistributedDataParallel and multiple GPUs, if you run into inter-communication issues between processes and/or nodes, you can use the following script to diagnose network problems.
|
|
|
wget https://raw.githubusercontent.com/huggingface/transformers/main/scripts/distributed/torch-distributed-gpu-test.py |
|
For example, to test how 2 GPUs interact, run:
|
|
|
python -m torch.distributed.run --nproc_per_node 2 --nnodes 1 torch-distributed-gpu-test.py |
|
If both processes can talk to each other and allocate GPU memory, each will print an OK status.
|
For more GPUs or nodes, adjust the launcher arguments accordingly.
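
For instance, on a hypothetical setup of 2 nodes with 8 GPUs each, the launch on the first node could look like the command below (the address and port are placeholders, not values from this document):

python -m torch.distributed.run --nproc_per_node 8 --nnodes 2 --node_rank 0 --master_addr 10.0.0.1 --master_port 29500 torch-distributed-gpu-test.py

On the second node, run the same command with --node_rank 1.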
|
You will find a lot more details inside the diagnostics script, including a recipe for running it in a SLURM environment.
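
For intuition, the kind of connectivity check such a script performs boils down to a few lines. The following is a minimal sketch of the idea, not the actual script's contents: each rank pins its GPU, joins an all_reduce and a barrier over NCCL, and prints an OK status on success.

import os
import torch
import torch.distributed as dist

# torch.distributed.run sets LOCAL_RANK (plus MASTER_ADDR, MASTER_PORT,
# RANK and WORLD_SIZE), so the default env:// rendezvous works here
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl")

# allocate GPU memory and exercise the NCCL collectives
tensor = torch.ones(1, device=f"cuda:{local_rank}")
dist.all_reduce(tensor)
dist.barrier()
print(f"rank {dist.get_rank()}: OK")

If a rank hangs or crashes here instead of printing OK, the problem lies in the network or NCCL setup rather than in your training code.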
|
An additional level of debugging is to add the NCCL_DEBUG=INFO environment variable as follows:
|
|
|
NCCL_DEBUG=INFO python -m torch.distributed.run --nproc_per_node 2 --nnodes 1 torch-distributed-gpu-test.py |
|
This will dump a lot of NCCL-related debug information, which you can then search for online if you find that problems are reported.
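
If the output is too noisy to read in the terminal, you can capture it to a file and narrow it down with NCCL's NCCL_DEBUG_SUBSYS environment variable, which restricts logging to a comma-separated list of subsystems such as INIT and NET (the log filename below is an arbitrary choice):

NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,NET python -m torch.distributed.run --nproc_per_node 2 --nnodes 1 torch-distributed-gpu-test.py 2>&1 | tee nccl-debug.log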