# Mixed Precision ImageNet Training in PyTorch

`main_amp.py` is based on [https://github.com/pytorch/examples/tree/master/imagenet](https://github.com/pytorch/examples/tree/master/imagenet).

It implements Automatic Mixed Precision (Amp) training of popular model architectures, such as ResNet, AlexNet, and VGG, on the ImageNet dataset. Command-line flags forwarded to `amp.initialize` are used to easily manipulate and switch between various pure and mixed precision "optimization levels" or `opt_level`s. For a detailed explanation of `opt_level`s, see the [updated API guide](https://nvidia.github.io/apex/amp.html).

Three lines enable Amp:
```
# Added after model and optimizer construction
model, optimizer = amp.initialize(model, optimizer, flags...)
...
# loss.backward() changed to:
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
```

With the new Amp API **you never need to explicitly convert your model, or the input data, to half().**
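For context, here is a minimal sketch of how those three lines fit into an ordinary training step. The model, data, and optimizer below are placeholders for illustration, not the actual `main_amp.py` setup:

```python
import torch
from apex import amp

# Placeholder model, optimizer, and data; main_amp.py builds these from torchvision and ImageNet.
model = torch.nn.Linear(1024, 1000).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = torch.nn.CrossEntropyLoss().cuda()
inputs = torch.randn(8, 1024).cuda()
targets = torch.randint(0, 1000, (8,)).cuda()

# Amp takes care of all FP16 casting; the model and data stay in default (FP32) precision here.
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

optimizer.zero_grad()
loss = criterion(model(inputs), targets)
# Scales the loss (dynamically, under O1) so FP16 gradients do not underflow.
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
optimizer.step()
```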
## Requirements

- Download the ImageNet dataset and move validation images to labeled subfolders
- The following script may be helpful: https://raw.githubusercontent.com/soumith/imagenetloader.torch/master/valprep.sh

## Training

To train a model, create softlinks to the ImageNet dataset, then run `main_amp.py` with the desired model architecture, as shown in `Example commands` below.

The default learning rate schedule is set for ResNet50. The `main_amp.py` script rescales the learning rate according to the global batch size (number of distributed processes \* per-process minibatch size).
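As a rough sketch of what linear rescaling by global batch size looks like (the base learning rate of 0.1 and reference batch size of 256 are illustrative assumptions, not values read from the script):

```python
# Hypothetical illustration of linear learning-rate scaling by global batch size.
base_lr = 0.1              # learning rate tuned for the reference global batch size (assumed)
reference_batch = 256      # reference global batch size (assumed)
world_size = 2             # number of distributed processes
per_process_batch = 224    # per-process minibatch size (--b)

global_batch = world_size * per_process_batch          # 448
scaled_lr = base_lr * global_batch / reference_batch   # 0.175
print(scaled_lr)
```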
## Example commands

**Note:** batch size `--b 224` assumes your GPUs have >=16GB of onboard memory. You may be able to increase this to 256, but that's cutting it close, so it may run out of memory for some PyTorch versions.

**Note:** All of the following use 4 dataloader subprocesses (`--workers 4`) to reduce potential
CPU data loading bottlenecks.

**Note:** `--opt-level` `O1` and `O2` both use dynamic loss scaling by default unless manually overridden.
`--opt-level` `O0` and `O3` (the "pure" training modes) do not use loss scaling by default.
`O0` and `O3` can be told to use loss scaling via manual overrides, but using loss scaling with `O0`
(pure FP32 training) does not really make sense, and will trigger a warning.
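The same overrides can be passed directly to `amp.initialize` when calling Amp from Python; a minimal sketch with a placeholder model (the static scale of 128.0 is just an example value):

```python
import torch
from apex import amp

# Placeholder model and optimizer for illustration.
model = torch.nn.Linear(10, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# O3 ("pure FP16") overridden to use a static loss scale of 128.0.
# Passing loss_scale="dynamic" would request dynamic loss scaling instead.
model, optimizer = amp.initialize(model, optimizer, opt_level="O3", loss_scale=128.0)
```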
Softlink training and validation datasets into the current directory:
```
$ ln -sf /data/imagenet/train-jpeg/ train
$ ln -sf /data/imagenet/val-jpeg/ val
```
### Summary

Amp allows easy experimentation with various pure and mixed precision options.
```
$ python main_amp.py -a resnet50 --b 128 --workers 4 --opt-level O0 ./
$ python main_amp.py -a resnet50 --b 224 --workers 4 --opt-level O3 ./
$ python main_amp.py -a resnet50 --b 224 --workers 4 --opt-level O3 --keep-batchnorm-fp32 True ./
$ python main_amp.py -a resnet50 --b 224 --workers 4 --opt-level O1 ./
$ python main_amp.py -a resnet50 --b 224 --workers 4 --opt-level O1 --loss-scale 128.0 ./
$ python -m torch.distributed.launch --nproc_per_node=2 main_amp.py -a resnet50 --b 224 --workers 4 --opt-level O1 ./
$ python main_amp.py -a resnet50 --b 224 --workers 4 --opt-level O2 ./
$ python main_amp.py -a resnet50 --b 224 --workers 4 --opt-level O2 --loss-scale 128.0 ./
$ python -m torch.distributed.launch --nproc_per_node=2 main_amp.py -a resnet50 --b 224 --workers 4 --opt-level O2 ./
```
Options are explained below. Again, the [updated API guide](https://nvidia.github.io/apex/amp.html) provides more detail.

#### `--opt-level O0` (FP32 training) and `O3` (FP16 training)

"Pure FP32" training:
```
$ python main_amp.py -a resnet50 --b 128 --workers 4 --opt-level O0 ./
```
"Pure FP16" training:
```
$ python main_amp.py -a resnet50 --b 224 --workers 4 --opt-level O3 ./
```
FP16 training with FP32 batchnorm:
```
$ python main_amp.py -a resnet50 --b 224 --workers 4 --opt-level O3 --keep-batchnorm-fp32 True ./
```
Keeping the batchnorms in FP32 improves stability and allows PyTorch
to use cudnn batchnorms, which significantly increases speed in ResNet50.

The `O3` options might not converge, because they are not true mixed precision.
However, they can be useful to establish "speed of light" performance for
your model, which provides a baseline for comparison with `O1` and `O2`.
For ResNet50 in particular, `--opt-level O3 --keep-batchnorm-fp32 True` establishes
the "speed of light." (Without `--keep-batchnorm-fp32`, it's slower, because it does
not use cudnn batchnorm.)
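If you are driving Amp directly from Python rather than through the script's flags, the same option is available as an `amp.initialize` argument; a minimal sketch with a placeholder model:

```python
import torch
import torchvision
from apex import amp

model = torchvision.models.resnet50().cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# FP16 training with batchnorm layers kept in FP32, which is what
# --opt-level O3 --keep-batchnorm-fp32 True requests on the command line.
model, optimizer = amp.initialize(model, optimizer,
                                  opt_level="O3",
                                  keep_batchnorm_fp32=True)
```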
#### `--opt-level O1` (Official Mixed Precision recipe, recommended for typical use)

`O1` patches Torch functions to cast inputs according to a whitelist-blacklist model.
FP16-friendly (Tensor Core) ops like gemms and convolutions run in FP16, while ops
that benefit from FP32, like batchnorm and softmax, run in FP32.
Also, dynamic loss scaling is used by default.
```
$ python main_amp.py -a resnet50 --b 224 --workers 4 --opt-level O1 ./
```
`O1` overridden to use static loss scaling:
```
$ python main_amp.py -a resnet50 --b 224 --workers 4 --opt-level O1 --loss-scale 128.0 ./
```
Distributed training with 2 processes (1 GPU per process, see **Distributed training** below
for more detail):
```
$ python -m torch.distributed.launch --nproc_per_node=2 main_amp.py -a resnet50 --b 224 --workers 4 --opt-level O1 ./
```
For best performance, set `--nproc_per_node` equal to the total number of GPUs on the node
to use all available resources.

#### `--opt-level O2` ("Almost FP16" mixed precision. More dangerous than O1.)

`O2` exists mainly to support some internal use cases. Please prefer `O1`.

`O2` casts the model to FP16, keeps batchnorms in FP32,
maintains master weights in FP32, and implements
dynamic loss scaling by default. (Unlike `--opt-level O1`, `--opt-level O2`
does not patch Torch functions.)
```
$ python main_amp.py -a resnet50 --b 224 --workers 4 --opt-level O2 ./
```
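In terms of the Python API, `O2` roughly corresponds to the following sketch (placeholder model; the dtype checks illustrate the FP16 model / FP32 master weight split described above):

```python
import torch
import torchvision
from apex import amp

model = torchvision.models.resnet50().cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# O2: cast the model to FP16, keep batchnorms in FP32, maintain FP32 master weights,
# and use dynamic loss scaling by default. Unlike O1, Torch functions are not patched.
model, optimizer = amp.initialize(model, optimizer, opt_level="O2")

print(next(model.parameters()).dtype)            # expected: torch.float16
print(next(amp.master_params(optimizer)).dtype)  # expected: torch.float32
```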
| "Fast mixed precision" overridden to use static loss scaling: | |
| ``` | |
| $ python main_amp.py -a resnet50 --b 224 --workers 4 --opt-level O2 --loss-scale 128.0 ./ | |
| ``` | |
| Distributed training with 2 processes (1 GPU per process) | |
| ``` | |
| $ python -m torch.distributed.launch --nproc_per_node=2 main_amp.py -a resnet50 --b 224 --workers 4 --opt-level O2 ./ | |
| ``` | |
## Distributed training

`main_amp.py` optionally uses `apex.parallel.DistributedDataParallel` (DDP) for multiprocess training with one GPU per process.
```
model = apex.parallel.DistributedDataParallel(model)
```
is a drop-in replacement for
```
model = torch.nn.parallel.DistributedDataParallel(model,
                                                  device_ids=[args.local_rank],
                                                  output_device=args.local_rank)
```
(Because Torch DDP permits multiple GPUs per process, it requires you to
manually specify the devices to run on and the output device.
Apex DDP uses only the current device by default.)

The choice of DDP wrapper (Torch or Apex) is orthogonal to the use of Amp and other Apex tools. It is safe to use `apex.amp` with either `torch.nn.parallel.DistributedDataParallel` or `apex.parallel.DistributedDataParallel`. In the future, I may add some features that permit optional tighter integration between `Amp` and `apex.parallel.DistributedDataParallel` for marginal performance benefits, but currently, there's no compelling reason to use Apex DDP versus Torch DDP for most models.

To use DDP with `apex.amp`, the only gotcha is that
```
model, optimizer = amp.initialize(model, optimizer, flags...)
```
must precede
```
model = DDP(model)
```
If DDP wrapping occurs before `amp.initialize`, `amp.initialize` will raise an error.

With both Apex DDP and Torch DDP, you must also call `torch.cuda.set_device(args.local_rank)` within
each process prior to initializing your model or any other tensors.
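Putting those pieces together, a hedged sketch of the per-process setup order (the argument parsing and backend choice are illustrative, not copied verbatim from `main_amp.py`):

```python
import argparse
import torch
import torchvision
from apex import amp
from apex.parallel import DistributedDataParallel as DDP

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)  # supplied by torch.distributed.launch
args = parser.parse_args()

# 1. Pin this process to its GPU before creating the model or any other tensors.
torch.cuda.set_device(args.local_rank)
torch.distributed.init_process_group(backend="nccl", init_method="env://")

# 2. Build the model and optimizer on the current device.
model = torchvision.models.resnet50().cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# 3. amp.initialize must come before DDP wrapping.
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

# 4. Wrap with Apex DDP, which uses the current device by default.
model = DDP(model)
```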
More information can be found in the docs for the
PyTorch multiprocess launcher module [torch.distributed.launch](https://pytorch.org/docs/stable/distributed.html#launch-utility).

`main_amp.py` is written to interact with
[torch.distributed.launch](https://pytorch.org/docs/master/distributed.html#launch-utility),
which spawns multiprocess jobs using the following syntax:
```
python -m torch.distributed.launch --nproc_per_node=NUM_GPUS main_amp.py args...
```
`NUM_GPUS` should be less than or equal to the number of visible GPU devices on the node. The use of `torch.distributed.launch` is unrelated to the choice of DDP wrapper. It is safe to use either Apex DDP or Torch DDP with `torch.distributed.launch`.

Optionally, one can run ImageNet training with synchronized batch normalization across processes by adding
`--sync_bn` to the `args...`
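The `--sync_bn` option relies on Apex's synced batchnorm conversion utility; a sketch of the relevant call with a placeholder model (check the script for how it is wired in):

```python
import torchvision
from apex.parallel import convert_syncbn_model

# Replaces torch.nn.BatchNorm layers with apex.parallel.SyncBatchNorm, which reduces
# batchnorm statistics across all participating processes during training.
model = convert_syncbn_model(torchvision.models.resnet50()).cuda()
```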
## Deterministic training (for debugging purposes)

Running with the `--deterministic` flag should produce bitwise identical outputs run-to-run,
regardless of what other options are used (see [PyTorch docs on reproducibility](https://pytorch.org/docs/stable/notes/randomness.html)).
Since `--deterministic` disables `torch.backends.cudnn.benchmark`, `--deterministic` may
cause a modest performance decrease.
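The flag's effect amounts to the usual PyTorch reproducibility switches; a rough sketch of what such a mode typically sets (the seed value is arbitrary, and this is not a verbatim excerpt from `main_amp.py`):

```python
import torch

# Ask cuDNN for deterministic kernels and disable the autotuner (benchmark mode),
# which is the main source of the modest slowdown mentioned above.
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

# Seed the RNGs so every run starts from identical state.
torch.manual_seed(0)
torch.cuda.manual_seed_all(0)
```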
## Profiling

If you're curious how the network actually looks on the CPU and GPU timelines (for example, how good is the overall utilization,
and is the prefetcher really overlapping data transfers?), try profiling `main_amp.py`.
[Detailed instructions can be found here](https://gist.github.com/mcarilli/213a4e698e4a0ae2234ddee56f4f3f95).
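A common companion technique for that kind of timeline profiling is to bracket interesting regions with NVTX ranges so they appear by name in the profiler; a minimal sketch (the range names are arbitrary):

```python
import torch

torch.cuda.nvtx.range_push("forward")
# ... run the forward pass here ...
torch.cuda.nvtx.range_pop()

torch.cuda.nvtx.range_push("backward")
# ... run the backward pass here ...
torch.cuda.nvtx.range_pop()
```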