Here is the full benchmark code and outputs:

```bash
# DDP w/ NVLink

rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 torchrun \
--nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py \
--model_name_or_path openai-community/gpt2 --dataset_name wikitext \
--dataset_config_name wikitext-2-raw-v1 --do_train \
--output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200

{'train_runtime': 101.9003, 'train_samples_per_second': 1.963, 'epoch': 0.69}

# DDP w/o NVLink

rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 NCCL_P2P_DISABLE=1 torchrun \
--nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py \
--model_name_or_path openai-community/gpt2 --dataset_name wikitext \
--dataset_config_name wikitext-2-raw-v1 --do_train \
--output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200

{'train_runtime': 131.4367, 'train_samples_per_second': 1.522, 'epoch': 0.69}
```

Hardware: 2x TITAN RTX (24GB each) + NVLink with 2 NVLinks (`NV2` in `nvidia-smi topo -m`)

Software: `pytorch-1.8-to-be` + `cuda-11.0` / `transformers==4.3.0.dev0`
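
The only difference between the two runs is `NCCL_P2P_DISABLE=1`, which turns off peer-to-peer GPU transfers so NCCL falls back to the PCIe path instead of NVLink, making this a direct comparison on the same hardware. From the reported runtimes, NVLink makes this particular training roughly 23% faster; a quick sanity check using the `train_runtime` values copied from the outputs above:

```bash
# Speedup from NVLink, computed from the two train_runtime values above
awk 'BEGIN { with_nvlink=101.9003; without_nvlink=131.4367;
             printf "NVLink speedup: %.1f%%\n", (without_nvlink - with_nvlink) / without_nvlink * 100 }'
# -> NVLink speedup: 22.5%
```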