# Introduction

This repository holds NVIDIA-maintained utilities to streamline mixed precision and distributed training in PyTorch.
Some of the code here will be included in upstream PyTorch eventually.
The intent of Apex is to make up-to-date utilities available to users as quickly as possible.

## Full API Documentation: [https://nvidia.github.io/apex](https://nvidia.github.io/apex)

## [GTC 2019](https://github.com/mcarilli/mixed_precision_references/tree/master/GTC_2019) and [Pytorch DevCon 2019](https://github.com/mcarilli/mixed_precision_references/tree/master/Pytorch_Devcon_2019) Slides
# Contents

## 1. Amp: Automatic Mixed Precision

**Deprecated. Use [PyTorch AMP](https://pytorch.org/docs/stable/amp.html)**

`apex.amp` is a tool to enable mixed precision training by changing only 3 lines of your script.
Users can easily experiment with different pure and mixed precision training modes by supplying
different flags to `amp.initialize`.
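For orientation, the three-line change looks roughly like the sketch below; the tiny model, optimizer, and loss are placeholders, and `"O1"` is just one of the available opt levels (see the Imagenet example linked below for a complete script).

```python
import torch
from apex import amp

# Placeholders standing in for a real model, optimizer, and data pipeline.
model = torch.nn.Linear(10, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

# Line 1 of 3: let Amp patch the model and optimizer for mixed precision.
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

loss = model(torch.randn(4, 10, device="cuda")).sum()

# Lines 2-3 of 3: scale the loss before backward so FP16 gradients don't underflow.
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
optimizer.step()
```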
[Webinar introducing Amp](https://info.nvidia.com/webinar-mixed-precision-with-pytorch-reg-page.html)
(The flag `cast_batchnorm` has been renamed to `keep_batchnorm_fp32`.)

[API Documentation](https://nvidia.github.io/apex/amp.html)

[Comprehensive Imagenet example](https://github.com/NVIDIA/apex/tree/master/examples/imagenet)

[DCGAN example coming soon...](https://github.com/NVIDIA/apex/tree/master/examples/dcgan)

[Moving to the new Amp API](https://nvidia.github.io/apex/amp.html#transition-guide-for-old-api-users) (for users of the deprecated "Amp" and "FP16_Optimizer" APIs)
## 2. Distributed Training

**`apex.parallel.DistributedDataParallel` is deprecated. Use [`torch.nn.parallel.DistributedDataParallel`](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html?highlight=distributeddataparallel#torch.nn.parallel.DistributedDataParallel)**

`apex.parallel.DistributedDataParallel` is a module wrapper, similar to
`torch.nn.parallel.DistributedDataParallel`. It enables convenient multiprocess distributed training,
optimized for NVIDIA's NCCL communication library.
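As a rough sketch (assuming the script is launched with one process per GPU, e.g. via `torchrun`, and that the model and training loop are placeholders), wrapping a model looks like this:

```python
import torch
import torch.distributed as dist
from apex.parallel import DistributedDataParallel as DDP

# Assumes one process per GPU, launched e.g. with torchrun, so the NCCL
# process group can be initialized from the usual environment variables.
dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = torch.nn.Linear(10, 10).cuda()  # placeholder model
model = DDP(model)  # gradients are allreduced across processes during backward()
```

The walkthrough linked below covers launcher setup and a full training loop.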
[API Documentation](https://nvidia.github.io/apex/parallel.html)

[Python Source](https://github.com/NVIDIA/apex/tree/master/apex/parallel)

[Example/Walkthrough](https://github.com/NVIDIA/apex/tree/master/examples/simple/distributed)

The [Imagenet example](https://github.com/NVIDIA/apex/tree/master/examples/imagenet)
shows use of `apex.parallel.DistributedDataParallel` along with `apex.amp`.
### Synchronized Batch Normalization

**Deprecated. Use [`torch.nn.SyncBatchNorm`](https://pytorch.org/docs/stable/generated/torch.nn.SyncBatchNorm.html)**

`apex.parallel.SyncBatchNorm` extends `torch.nn.modules.batchnorm._BatchNorm` to
support synchronized BN.
It allreduces stats across processes during multiprocess (DistributedDataParallel) training.
Synchronous BN has been used in cases where only a small
local minibatch can fit on each GPU.
Allreduced stats increase the effective batch size for the BN layer to the
global batch size across all processes (which, technically, is the correct
formulation).
Synchronous BN has been observed to improve converged accuracy in some of our research models.
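For reference, converting an existing model is typically a one-line change. The sketch below assumes the `apex.parallel.convert_syncbn_model` helper; the conversion only changes behavior when run under multiprocess (DistributedDataParallel) training.

```python
import torch
from apex.parallel import convert_syncbn_model

# Placeholder model containing ordinary BatchNorm layers.
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 8, kernel_size=3),
    torch.nn.BatchNorm2d(8),
    torch.nn.ReLU(),
)

# Deprecated Apex path: swap every BatchNorm layer for apex.parallel.SyncBatchNorm.
model = convert_syncbn_model(model).cuda()

# Recommended upstream equivalent (per the deprecation notice above):
# model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
```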
### Checkpointing

To save and load `amp` training state properly, we introduce `amp.state_dict()`, which contains all `loss_scalers` and their corresponding unskipped steps,
as well as `amp.load_state_dict()` to restore these attributes.

In order to get bitwise accuracy, we recommend the following workflow:
```python
# Initialization
opt_level = 'O1'
model, optimizer = amp.initialize(model, optimizer, opt_level=opt_level)

# Train your model
...
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
...

# Save checkpoint
checkpoint = {
    'model': model.state_dict(),
    'optimizer': optimizer.state_dict(),
    'amp': amp.state_dict()
}
torch.save(checkpoint, 'amp_checkpoint.pt')
...

# Restore
model = ...
optimizer = ...
checkpoint = torch.load('amp_checkpoint.pt')

model, optimizer = amp.initialize(model, optimizer, opt_level=opt_level)
model.load_state_dict(checkpoint['model'])
optimizer.load_state_dict(checkpoint['optimizer'])
amp.load_state_dict(checkpoint['amp'])

# Continue training
...
```
Note that we recommend restoring the model using the same `opt_level`. Also note that we recommend calling the `load_state_dict` methods after `amp.initialize`.
# Installation

Each [`apex.contrib`](./apex/contrib) module requires one or more install options other than `--cpp_ext` and `--cuda_ext`.
Note that contrib modules do not necessarily support stable PyTorch releases.
## Containers

NVIDIA PyTorch containers are available on NGC: https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch.
The containers come with all the custom extensions available at the moment.

See [the NGC documentation](https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/index.html) for details such as:
- how to pull a container
- how to run a pulled container
- release notes
## From Source

To install Apex from source, we recommend using the nightly PyTorch obtainable from https://github.com/pytorch/pytorch.
The latest stable release obtainable from https://pytorch.org should also work.

We recommend installing [`Ninja`](https://ninja-build.org/) to make compilation faster.
### Linux

For performance and full functionality, we recommend installing Apex with
CUDA and C++ extensions via

```bash
git clone https://github.com/NVIDIA/apex
cd apex
# if pip >= 23.1 (ref: https://pip.pypa.io/en/stable/news/#v23-1) which supports multiple `--config-settings` with the same key...
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
# otherwise
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --global-option="--cpp_ext" --global-option="--cuda_ext" ./
```
APEX also supports a Python-only build via

```bash
pip install -v --disable-pip-version-check --no-build-isolation --no-cache-dir ./
```

A Python-only build omits:
- Fused kernels required to use `apex.optimizers.FusedAdam`.
- Fused kernels required to use `apex.normalization.FusedLayerNorm` and `apex.normalization.FusedRMSNorm`.
- Fused kernels that improve the performance and numerical stability of `apex.parallel.SyncBatchNorm`.
- Fused kernels that improve the performance of `apex.parallel.DistributedDataParallel` and `apex.amp`.

`DistributedDataParallel`, `amp`, and `SyncBatchNorm` will still be usable, but they may be slower.
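If you are unsure which kind of build you have, you can probe for the compiled extension modules at runtime. This is a minimal sketch, not an official Apex utility; the module names come from the table at the end of this README.

```python
# Sketch: report which Apex C++/CUDA extension modules were built.
import importlib.util

for ext in ("apex_C", "amp_C", "syncbn", "fused_layer_norm_cuda"):
    found = importlib.util.find_spec(ext) is not None
    print(f"{ext}: {'available' if found else 'not built (Python-only install?)'}")
```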
### [Experimental] Windows

`pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" .` may work if you were able to build PyTorch from source
on your system. A Python-only build via `pip install -v --no-cache-dir .` is more likely to work.

If you installed PyTorch in a Conda environment, make sure to install Apex in that same environment.
## Custom C++/CUDA Extensions and Install Options

If a requirement of a module is not met, then it will not be built.

| Module Name | Install Option | Misc |
|---------------|------------------|--------|
| `apex_C` | `--cpp_ext` | |
| `amp_C` | `--cuda_ext` | |
| `syncbn` | `--cuda_ext` | |
| `fused_layer_norm_cuda` | `--cuda_ext` | [`apex.normalization`](./apex/normalization) |
| `mlp_cuda` | `--cuda_ext` | |
| `scaled_upper_triang_masked_softmax_cuda` | `--cuda_ext` | |
| `generic_scaled_masked_softmax_cuda` | `--cuda_ext` | |
| `scaled_masked_softmax_cuda` | `--cuda_ext` | |
| `fused_weight_gradient_mlp_cuda` | `--cuda_ext` | Requires CUDA>=11 |
| `permutation_search_cuda` | `--permutation_search` | [`apex.contrib.sparsity`](./apex/contrib/sparsity) |
| `bnp` | `--bnp` | [`apex.contrib.groupbn`](./apex/contrib/groupbn) |
| `xentropy` | `--xentropy` | [`apex.contrib.xentropy`](./apex/contrib/xentropy) |
| `focal_loss_cuda` | `--focal_loss` | [`apex.contrib.focal_loss`](./apex/contrib/focal_loss) |
| `fused_index_mul_2d` | `--index_mul_2d` | [`apex.contrib.index_mul_2d`](./apex/contrib/index_mul_2d) |
| `fused_adam_cuda` | `--deprecated_fused_adam` | [`apex.contrib.optimizers`](./apex/contrib/optimizers) |
| `fused_lamb_cuda` | `--deprecated_fused_lamb` | [`apex.contrib.optimizers`](./apex/contrib/optimizers) |
| `fast_layer_norm` | `--fast_layer_norm` | [`apex.contrib.layer_norm`](./apex/contrib/layer_norm). Different from `fused_layer_norm` |
| `fmhalib` | `--fmha` | [`apex.contrib.fmha`](./apex/contrib/fmha) |
| `fast_multihead_attn` | `--fast_multihead_attn` | [`apex.contrib.multihead_attn`](./apex/contrib/multihead_attn) |
| `transducer_joint_cuda` | `--transducer` | [`apex.contrib.transducer`](./apex/contrib/transducer) |
| `transducer_loss_cuda` | `--transducer` | [`apex.contrib.transducer`](./apex/contrib/transducer) |
| `cudnn_gbn_lib` | `--cudnn_gbn` | Requires cuDNN>=8.5, [`apex.contrib.cudnn_gbn`](./apex/contrib/cudnn_gbn) |
| `peer_memory_cuda` | `--peer_memory` | [`apex.contrib.peer_memory`](./apex/contrib/peer_memory) |
| `nccl_p2p_cuda` | `--nccl_p2p` | Requires NCCL >= 2.10, [`apex.contrib.nccl_p2p`](./apex/contrib/nccl_p2p) |
| `fast_bottleneck` | `--fast_bottleneck` | Requires `peer_memory_cuda` and `nccl_p2p_cuda`, [`apex.contrib.bottleneck`](./apex/contrib/bottleneck) |
| `fused_conv_bias_relu` | `--fused_conv_bias_relu` | Requires cuDNN>=8.4, [`apex.contrib.conv_bias_relu`](./apex/contrib/conv_bias_relu) |
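As an illustration, building one of these contrib modules means passing its install option in addition to the core extensions. The command below is a sketch that reuses the pip >= 23.1 syntax from the Linux section, with `--xentropy` taken from the table as an example; other options are assumed to follow the same pattern.

```bash
# Sketch (pip >= 23.1): build the core extensions plus the xentropy contrib module.
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation \
  --config-settings "--build-option=--cpp_ext" \
  --config-settings "--build-option=--cuda_ext" \
  --config-settings "--build-option=--xentropy" ./
```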