# Efficient training on multiple CPUs [[efficient-training-on-multiple-cpus]]

ํ•˜๋‚˜์˜ CPU์—์„œ ํ›ˆ๋ จํ•˜๋Š” ๊ฒƒ์ด ๋„ˆ๋ฌด ๋Š๋ฆด ๋•Œ๋Š” ๋‹ค์ค‘ CPU๋ฅผ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ด ๊ฐ€์ด๋“œ๋Š” PyTorch ๊ธฐ๋ฐ˜์˜ DDP๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋ถ„์‚ฐ CPU ํ›ˆ๋ จ์„ ํšจ์œจ์ ์œผ๋กœ ์ˆ˜ํ–‰ํ•˜๋Š” ๋ฐฉ๋ฒ•์— ๋Œ€ํ•ด ์„ค๋ช…ํ•ฉ๋‹ˆ๋‹ค.

## Intel® oneCCL bindings for PyTorch [[intel-oneccl-bindings-for-pytorch]]

Intel® oneCCL (collective communications library) is a library for efficient distributed deep learning training that implements collective communications such as allreduce, allgather, and alltoall. For more information about oneCCL, refer to the oneCCL documentation and the oneCCL specification.

The oneccl_bindings_for_pytorch module (torch_ccl before version 1.12) implements the PyTorch C10D ProcessGroup API, can be loaded dynamically as an external ProcessGroup, and currently only works on Linux platforms.
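
As an illustration, here is a minimal sketch of how the bindings plug into `torch.distributed` when a script is launched with mpirun. The `PMI_RANK`/`PMI_SIZE` fallbacks and the master address/port values are assumptions for an Intel MPI launch, not something prescribed by this guide:

```python
import os

import torch.distributed as dist
import oneccl_bindings_for_pytorch  # noqa: F401 - importing this registers the "ccl" backend

# Under mpirun, rank and world size can be derived from the MPI environment.
# PMI_RANK/PMI_SIZE are assumptions for an Intel MPI launch.
os.environ["RANK"] = str(os.environ.get("PMI_RANK", 0))
os.environ["WORLD_SIZE"] = str(os.environ.get("PMI_SIZE", 1))
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

# Initialize the process group with the oneCCL backend.
dist.init_process_group(backend="ccl")
print(f"initialized rank {dist.get_rank()} of {dist.get_world_size()}")
```

When you launch training through the Trainer with `--ddp_backend ccl`, this initialization is handled for you.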

See oneccl_bind_pt for more detailed information.

### Intel® oneCCL bindings for PyTorch installation [[intel-oneccl-bindings-for-pytorch-installation]]

๋‹ค์Œ Python ๋ฒ„์ „์— ๋Œ€ํ•œ Wheel ํŒŒ์ผ์„ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

| Extension Version | Python 3.6 | Python 3.7 | Python 3.8 | Python 3.9 | Python 3.10 |
| :---------------: | :--------: | :--------: | :--------: | :--------: | :---------: |
| 1.13.0            |            | √          | √          | √          | √           |
| 1.12.100          |            | √          | √          | √          | √           |
| 1.12.0            |            | √          | √          | √          | √           |
| 1.11.0            |            | √          | √          | √          | √           |
| 1.10.0            | √          | √          | √          | √          |             |
```bash
pip install oneccl_bind_pt=={pytorch_version} -f https://developer.intel.com/ipex-whl-stable-cpu
```

where `{pytorch_version}` is your PyTorch version, for instance 1.13.0. Check out more approaches for installing oneccl_bind_pt. The versions of oneCCL and PyTorch must match.

The prebuilt wheel for oneccl_bindings_for_pytorch 1.12.0 is not compatible with PyTorch 1.12.1 (it is for PyTorch 1.12.0). PyTorch 1.12.1 should be used with oneccl_bindings_for_pytorch 1.12.100.

## Intel® MPI library [[intel-mpi-library]]

Use this standards-based MPI implementation to deliver flexible, efficient, and scalable cluster messaging on Intel® architecture. This component is part of the Intel® oneAPI HPC Toolkit.

oneccl_bindings_for_pytorch is installed along with the MPI tool set. The environment needs to be sourced before using it.

For Intel® oneCCL versions 1.12.0 and above:

```bash
oneccl_bindings_for_pytorch_path=$(python -c "from oneccl_bindings_for_pytorch import cwd; print(cwd)")
source $oneccl_bindings_for_pytorch_path/env/setvars.sh
```

For Intel® oneCCL versions below 1.12.0:

```bash
torch_ccl_path=$(python -c "import torch; import torch_ccl; import os;  print(os.path.abspath(os.path.dirname(torch_ccl.__file__)))")
source $torch_ccl_path/env/setvars.sh
```

### IPEX installation [[ipex-installation]]

IPEX provides performance optimizations for CPU training with both Float32 and BFloat16; refer to the single CPU section for details.
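
For reference, a minimal sketch of the kind of optimization IPEX applies, assuming `intel_extension_for_pytorch` is installed. The model and optimizer below are placeholders for illustration only; when you pass `--use_ipex` to the Trainer, this step is applied for you:

```python
import torch
from torch import nn
import intel_extension_for_pytorch as ipex

# Placeholder model and optimizer, only for illustration.
model = nn.Linear(768, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

# ipex.optimize applies CPU-specific optimizations to the model and optimizer;
# dtype=torch.bfloat16 enables BF16 auto mixed precision on supported hardware.
model.train()
model, optimizer = ipex.optimize(model, optimizer=optimizer, dtype=torch.bfloat16)
```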

The following "Usage in Trainer" section takes mpirun from the Intel® MPI library as an example.

## Usage in Trainer [[usage-in-trainer]]

To enable multi-CPU distributed training in the Trainer with the ccl backend, add **`--ddp_backend ccl`** to the command arguments.
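
If you configure training from Python rather than the command line, the same flags map to `TrainingArguments` parameters. A minimal sketch (the output directory is a placeholder):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="/tmp/debug_squad/",  # placeholder output directory
    ddp_backend="ccl",  # use the oneCCL backend for distributed training
    use_ipex=True,      # optional: enable IPEX optimizations
    no_cuda=True,       # train on CPU
)
```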

์งˆ์˜ ์‘๋‹ต ์˜ˆ์ œ๋ฅผ ์‚ฌ์šฉํ•œ ์˜ˆ๋ฅผ ์‚ดํŽด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.

๋‹ค์Œ ๋ช…๋ น์€ ํ•œ Xeon ๋…ธ๋“œ์—์„œ 2๊ฐœ์˜ ํ”„๋กœ์„ธ์Šค๋กœ ํ›ˆ๋ จ์„ ํ™œ์„ฑํ™”ํ•˜๋ฉฐ, ๊ฐ ์†Œ์ผ“๋‹น ํ•˜๋‚˜์˜ ํ”„๋กœ์„ธ์Šค๊ฐ€ ์‹คํ–‰๋ฉ๋‹ˆ๋‹ค. OMP_NUM_THREADS/CCL_WORKER_COUNT ๋ณ€์ˆ˜๋Š” ์ตœ์ ์˜ ์„ฑ๋Šฅ์„ ์œ„ํ•ด ์กฐ์ •ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

```bash
 export CCL_WORKER_COUNT=1
 export MASTER_ADDR=127.0.0.1
 mpirun -n 2 -genv OMP_NUM_THREADS=23 \
 python3 run_qa.py \
 --model_name_or_path bert-large-uncased \
 --dataset_name squad \
 --do_train \
 --do_eval \
 --per_device_train_batch_size 12  \
 --learning_rate 3e-5  \
 --num_train_epochs 2  \
 --max_seq_length 384 \
 --doc_stride 128  \
 --output_dir /tmp/debug_squad/ \
 --no_cuda \
 --ddp_backend ccl \
 --use_ipex
```

๋‹ค์Œ ๋ช…๋ น์€ ๋‘ ๊ฐœ์˜ Xeon(๋…ธ๋“œ0 ๋ฐ ๋…ธ๋“œ1, ์ฃผ ํ”„๋กœ์„ธ์Šค๋กœ ๋…ธ๋“œ0์„ ์‚ฌ์šฉ)์—์„œ ์ด 4๊ฐœ์˜ ํ”„๋กœ์„ธ์Šค๋กœ ํ›ˆ๋ จ์„ ํ™œ์„ฑํ™”ํ•˜๋ฉฐ, ๊ฐ ์†Œ์ผ“๋‹น ํ•˜๋‚˜์˜ ํ”„๋กœ์„ธ์Šค๊ฐ€ ์‹คํ–‰๋ฉ๋‹ˆ๋‹ค. OMP_NUM_THREADS/CCL_WORKER_COUNT ๋ณ€์ˆ˜๋Š” ์ตœ์ ์˜ ์„ฑ๋Šฅ์„ ์œ„ํ•ด ์กฐ์ •ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

On node0, you need to create a configuration file containing the IP addresses of each node (for example, hostfile) and pass the path of that configuration file as an argument.

```bash
 cat hostfile
 xxx.xxx.xxx.xxx #node0 ip
 xxx.xxx.xxx.xxx #node1 ip
```

Now, run the following command on node0 and 4DDP will be enabled on node0 and node1 with BF16 auto mixed precision:

```bash
 export CCL_WORKER_COUNT=1
 export MASTER_ADDR=xxx.xxx.xxx.xxx #node0 ip
 mpirun -f hostfile -n 4 -ppn 2 \
 -genv OMP_NUM_THREADS=23 \
 python3 run_qa.py \
 --model_name_or_path bert-large-uncased \
 --dataset_name squad \
 --do_train \
 --do_eval \
 --per_device_train_batch_size 12  \
 --learning_rate 3e-5  \
 --num_train_epochs 2  \
 --max_seq_length 384 \
 --doc_stride 128  \
 --output_dir /tmp/debug_squad/ \
 --no_cuda \
 --ddp_backend ccl \
 --use_ipex \
 --bf16
```