.. role:: hidden
    :class: hidden-section

apex.amp
===================================

This page documents the updated API for Amp (Automatic Mixed Precision),
a tool to enable Tensor Core-accelerated training in only 3 lines of Python.

A `runnable, comprehensive Imagenet example`_ demonstrating good practices can be found
on the Github page.

GANs are a tricky case that many people have requested.  A `comprehensive DCGAN example`_
is under construction.

If you already implemented Amp based on the instructions below, but it isn't behaving as expected,
please review `Advanced Amp Usage`_ to see if any topics match your use case.  If that doesn't help,
`file an issue`_.

.. _`file an issue`:
    https://github.com/NVIDIA/apex/issues

``opt_level``\ s and Properties
-------------------------------

Amp allows users to easily experiment with different pure and mixed precision modes.
Commonly-used default modes are chosen by
selecting an "optimization level" or ``opt_level``; each ``opt_level`` establishes a set of
properties that govern Amp's implementation of pure or mixed precision training.
Finer-grained control of how a given ``opt_level`` behaves can be achieved by passing values for
particular properties directly to ``amp.initialize``.  These manually specified values
override the defaults established by the ``opt_level``.

Example::

    # Declare model and optimizer as usual, with default (FP32) precision
    model = torch.nn.Linear(D_in, D_out).cuda()
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

    # Allow Amp to perform casts as required by the opt_level
    model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

    ...

    # loss.backward() becomes:
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()

    ...

Users **should not** manually cast their model or data to ``.half()``, regardless of what ``opt_level``
or properties are chosen.  Amp intends that users start with an existing default (FP32) script,
add the three lines corresponding to the Amp API, and begin training with mixed precision.
Amp can also be disabled, in which case the original script will behave exactly as it used to.
In this way, there's no risk in adhering to the Amp API, and a lot of potential performance benefit.
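
For instance, the same script can be switched between mixed precision and pure FP32 training
with a single flag.  The following is a minimal sketch (the ``use_amp`` flag is hypothetical;
it assumes ``amp.initialize``'s ``enabled`` keyword argument, which is expected to make Amp a
no-op when ``False``)::

    import torch
    from apex import amp

    model = torch.nn.Linear(64, 32).cuda()
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

    use_amp = True  # hypothetical flag; set False to run the original FP32 script unchanged

    # With enabled=False, amp.initialize should return the model and optimizer
    # untouched, and amp.scale_loss simply yields the unscaled loss.
    model, optimizer = amp.initialize(model, optimizer, opt_level="O1",
                                      enabled=use_amp)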

.. note::
    Because it's never necessary to manually cast your model (aside from the call to ``amp.initialize``)
    or input data, a script that adheres to the new API
    can switch between different ``opt_level``\ s without having to make any other changes.
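
In practice the ``opt_level`` is often exposed as a command-line argument, so the same script
serves as an accuracy baseline (``O0``), a mixed precision run (``O1``/``O2``), or a speed
baseline (``O3``).  A sketch, assuming a hypothetical ``--opt-level`` argument::

    import argparse
    import torch
    from apex import amp

    parser = argparse.ArgumentParser()
    parser.add_argument("--opt-level", type=str, default="O1",
                        choices=["O0", "O1", "O2", "O3"])
    args = parser.parse_args()

    model = torch.nn.Linear(64, 32).cuda()
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

    # Only the opt_level string changes between runs; the rest of the training
    # script, including the amp.scale_loss context, stays identical.
    model, optimizer = amp.initialize(model, optimizer, opt_level=args.opt_level)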

.. _`runnable, comprehensive Imagenet example`:
    https://github.com/NVIDIA/apex/tree/master/examples/imagenet

.. _`comprehensive DCGAN example`:
    https://github.com/NVIDIA/apex/tree/master/examples/dcgan

.. _`Advanced Amp Usage`:
    https://nvidia.github.io/apex/advanced.html

Properties
**********

Currently, the under-the-hood properties that govern pure or mixed precision training are the following:

- ``cast_model_type``: Casts your model's parameters and buffers to the desired type.
- ``patch_torch_functions``: Patch all Torch functions and Tensor methods to perform Tensor Core-friendly ops like GEMMs and convolutions in FP16, and any ops that benefit from FP32 precision in FP32.
- ``keep_batchnorm_fp32``: To enhance precision and enable cudnn batchnorm (which improves performance), it's often beneficial to keep batchnorm weights in FP32 even if the rest of the model is FP16.
- ``master_weights``: Maintain FP32 master weights to accompany any FP16 model weights.  FP32 master weights are stepped by the optimizer to enhance precision and capture small gradients.
- ``loss_scale``: If ``loss_scale`` is a float value, use this value as the static (fixed) loss scale.  If ``loss_scale`` is the string ``"dynamic"``, adaptively adjust the loss scale over time.  Dynamic loss scale adjustments are performed by Amp automatically.

Again, you often don't need to specify these properties by hand.  Instead, select an ``opt_level``,
which will set them up for you.  After selecting an ``opt_level``, you can optionally pass property
kwargs as manual overrides.
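
For example, a static loss scale and FP32 batchnorm can be requested on top of ``O2``'s
defaults.  The particular values below are only illustrative::

    # O2 already defaults to keep_batchnorm_fp32=True; the dynamic loss scale
    # is overridden here with a fixed value of 128.0.
    model, optimizer = amp.initialize(model, optimizer,
                                      opt_level="O2",
                                      keep_batchnorm_fp32=True,
                                      loss_scale=128.0)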

If you attempt to override a property that does not make sense for the selected ``opt_level``,
Amp will raise an error with an explanation.  For example, selecting ``opt_level="O1"`` combined with
the override ``master_weights=True`` does not make sense.  ``O1`` inserts casts
around Torch functions rather than model weights.  Data, activations, and weights are recast
out-of-place on the fly as they flow through patched functions.  Therefore, the model weights themselves
can (and should) remain FP32, and there is no need to maintain separate FP32 master weights.
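
As an illustration of this distinction (a sketch of the expected dtypes, not additional
required code), the parameters of a model initialized with ``O1`` stay in FP32::

    model = torch.nn.Linear(64, 32).cuda()
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

    # O1 patches functions; it does not touch the model's parameters.
    model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
    print(model.weight.dtype)  # expected: torch.float32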

``opt_level``\ s
****************

Recognized ``opt_level``\ s are ``"O0"``, ``"O1"``, ``"O2"``, and ``"O3"``.

``O0`` and ``O3`` are not true mixed precision, but they are useful for establishing accuracy and
speed baselines, respectively.

``O1`` and ``O2`` are different implementations of mixed precision.  Try both, and see
what gives the best speedup and accuracy for your model.

``O0``: FP32 training
^^^^^^^^^^^^^^^^^^^^^^

Your incoming model should be FP32 already, so this is likely a no-op.
``O0`` can be useful to establish an accuracy baseline.

| Default properties set by ``O0``:
| ``cast_model_type=torch.float32``
| ``patch_torch_functions=False``
| ``keep_batchnorm_fp32=None`` (effectively, "not applicable," everything is FP32)
| ``master_weights=False``
| ``loss_scale=1.0``
|
|

``O1``: Mixed Precision (recommended for typical use)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Patch all Torch functions and Tensor methods to cast their inputs according to a whitelist-blacklist
model.  Whitelist ops (for example, Tensor Core-friendly ops like GEMMs and convolutions) are performed
in FP16.  Blacklist ops that benefit from FP32 precision (for example, softmax)
are performed in FP32.  ``O1`` also uses dynamic loss scaling, unless overridden.

| Default properties set by ``O1``:
| ``cast_model_type=None`` (not applicable)
| ``patch_torch_functions=True``
| ``keep_batchnorm_fp32=None`` (again, not applicable, all model weights remain FP32)
| ``master_weights=None`` (not applicable, model weights remain FP32)
| ``loss_scale="dynamic"``
|
|
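
The effect is observable on the tensors a patched function returns.  A sketch of what to
expect under ``O1``, continuing the ``torch.nn.Linear(64, 32)`` model from the earlier
sketch (output dtypes follow the whitelist/blacklist rules described above)::

    model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

    x = torch.randn(8, 64, device="cuda")                # FP32 input, as usual
    logits = model(x)                                    # GEMM is whitelisted -> runs in FP16
    print(logits.dtype)                                  # expected: torch.float16
    probs = torch.nn.functional.softmax(logits, dim=1)   # softmax is blacklisted -> runs in FP32
    print(probs.dtype)                                   # expected: torch.float32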
``O2``: "Almost FP16" Mixed Precision | |
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
``O2`` casts the model weights to FP16, | |
patches the model's ``forward`` method to cast input | |
data to FP16, keeps batchnorms in FP32, maintains FP32 master weights, | |
updates the optimizer's ``param_groups`` so that the ``optimizer.step()`` | |
acts directly on the FP32 weights (followed by FP32 master weight->FP16 model weight | |
copies if necessary), | |
and implements dynamic loss scaling (unless overridden). | |
Unlike ``O1``, ``O2`` does not patch Torch functions or Tensor methods. | |
| Default properties set by ``O2``: | |
| ``cast_model_type=torch.float16`` | |
| ``patch_torch_functions=False`` | |
| ``keep_batchnorm_fp32=True`` | |
| ``master_weights=True`` | |
| ``loss_scale="dynamic"`` | |
| | |
| | |
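
A sketch of the resulting layout (the dtype checks are illustrative, assuming a model that
contains a batchnorm layer)::

    model = torch.nn.Sequential(torch.nn.Linear(64, 32),
                                torch.nn.BatchNorm1d(32)).cuda()
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

    model, optimizer = amp.initialize(model, optimizer, opt_level="O2")

    print(model[0].weight.dtype)                      # expected: torch.float16 (cast_model_type)
    print(model[1].weight.dtype)                      # expected: torch.float32 (keep_batchnorm_fp32)
    print(next(amp.master_params(optimizer)).dtype)   # expected: torch.float32 master weights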

``O3``: FP16 training
^^^^^^^^^^^^^^^^^^^^^^

``O3`` may not achieve the stability of the true mixed precision options ``O1`` and ``O2``.
However, it can be useful to establish a speed baseline for your model, against which
the performance of ``O1`` and ``O2`` can be compared.  If your model uses batch normalization,
to establish "speed of light" you can try ``O3`` with the additional property override
``keep_batchnorm_fp32=True`` (which enables cudnn batchnorm, as stated earlier).

| Default properties set by ``O3``:
| ``cast_model_type=torch.float16``
| ``patch_torch_functions=False``
| ``keep_batchnorm_fp32=False``
| ``master_weights=False``
| ``loss_scale=1.0``
|
|
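
To establish that "speed of light" baseline, the override mentioned above is passed directly
to ``amp.initialize``::

    # Pure FP16, but batchnorm weights stay FP32 so cudnn batchnorm can be used.
    model, optimizer = amp.initialize(model, optimizer,
                                      opt_level="O3",
                                      keep_batchnorm_fp32=True)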

Unified API
-----------

.. automodule:: apex.amp
.. currentmodule:: apex.amp

.. autofunction:: initialize

.. autofunction:: scale_loss

.. autofunction:: master_params

Checkpointing
-------------

To properly save and load your Amp-trained model, we introduce ``amp.state_dict()``, which contains all ``loss_scaler``\ s and their corresponding unskipped steps,
as well as ``amp.load_state_dict()`` to restore these attributes.

In order to get bitwise accuracy, we recommend the following workflow::

    # Initialization
    opt_level = 'O1'
    model, optimizer = amp.initialize(model, optimizer, opt_level=opt_level)

    # Train your model
    ...

    # Save checkpoint
    checkpoint = {
        'model': model.state_dict(),
        'optimizer': optimizer.state_dict(),
        'amp': amp.state_dict()
    }
    torch.save(checkpoint, 'amp_checkpoint.pt')
    ...

    # Restore
    model = ...
    optimizer = ...
    checkpoint = torch.load('amp_checkpoint.pt')
    model, optimizer = amp.initialize(model, optimizer, opt_level=opt_level)
    model.load_state_dict(checkpoint['model'])
    optimizer.load_state_dict(checkpoint['optimizer'])
    amp.load_state_dict(checkpoint['amp'])

    # Continue training
    ...

Note that we recommend restoring the model using the same ``opt_level``.  Also note that we recommend calling the ``load_state_dict`` methods after ``amp.initialize``.

Advanced use cases
------------------

The unified Amp API supports gradient accumulation across iterations,
multiple backward passes per iteration, multiple models/optimizers,
custom/user-defined autograd functions, and custom data batch classes.  Gradient clipping and GANs also
require special treatment, but this treatment does not need to change
for different ``opt_level``\ s.  Further details can be found here:

.. toctree::
   :maxdepth: 1

   advanced
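
As one example of the special treatment mentioned above, gradient clipping under Amp is typically
applied to the parameters owned by the optimizer (the master weights, where they exist) rather than
to ``model.parameters()``, and only after the gradients have been unscaled.  A sketch following that
pattern (``max_norm=1.0`` is an arbitrary illustrative value)::

    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    # Gradients are unscaled when the scale_loss context exits, so they can be clipped here.
    torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), max_norm=1.0)
    optimizer.step()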

Transition guide for old API users
----------------------------------

We strongly encourage moving to the new Amp API, because it's more versatile, easier to use, and future proof.  The original :class:`FP16_Optimizer` and the old "Amp" API are deprecated, and subject to removal at any time.

For users of the old "Amp" API
******************************

In the new API, ``opt_level="O1"`` performs the same patching of the Torch namespace as the old "Amp" API.
However, the new API allows static or dynamic loss scaling, while the old API only allowed dynamic loss scaling.

In the new API, the old call to ``amp_handle = amp.init()``, and the returned ``amp_handle``, are no
longer exposed or necessary.  The new ``amp.initialize()`` does the duty of ``amp.init()`` (and more).
Therefore, any existing calls to ``amp_handle = amp.init()`` should be deleted.

The functions formerly exposed through ``amp_handle`` are now free
functions accessible through the ``amp`` module.

The backward context manager must be changed accordingly::

    # old API
    with amp_handle.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()

    ->

    # new API
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()

For now, the deprecated "Amp" API documentation can still be found on the Github README: https://github.com/NVIDIA/apex/tree/master/apex/amp.  The old API calls that `annotate user functions`_ to run
with a particular precision are still honored by the new API.

.. _`annotate user functions`:
    https://github.com/NVIDIA/apex/tree/master/apex/amp#annotating-user-functions
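
For reference, such annotations look roughly like the following.  This is only a sketch; the
decorator name is taken from the README linked above, and ``my_custom_fp16_op`` is a
hypothetical user function::

    from apex import amp

    # Ask Amp to run this user-defined function in FP16 under O1's casting rules.
    @amp.half_function
    def my_custom_fp16_op(a, b):
        return a.matmul(b)

    # Per the README, library functions can also be annotated without modifying
    # their source via the register_* variants, called before amp.initialize.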

For users of the old FP16_Optimizer
***********************************

``opt_level="O2"`` is equivalent to :class:`FP16_Optimizer` with ``dynamic_loss_scale=True``.
Once again, the backward pass must be changed to the unified version::

    optimizer.backward(loss)

    ->

    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()

One annoying aspect of FP16_Optimizer was that the user had to manually convert their model to half
(either by calling ``.half()`` on it, or using a function or module wrapper from
``apex.fp16_utils``), and also manually call ``.half()`` on input data.  **Neither of these are
necessary in the new API.  No matter what ``opt_level``
you choose, you can and should simply build your model and pass input data in the default FP32 format.**
The new Amp API will perform the right conversions during
``model, optimizer = amp.initialize(model, optimizer, opt_level=....)`` based on the ``opt_level``
and any overridden flags.  Floating point input data may be FP32 or FP16, but you may as well just
let it be FP32, because the ``model`` returned by ``amp.initialize`` will have its ``forward``
method patched to cast the input data appropriately.
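
Putting this together, a script that formerly wrapped its optimizer in
``FP16_Optimizer(optimizer, dynamic_loss_scale=True)`` reduces to the standard pattern below
(a sketch; ``build_model``, ``criterion``, and ``loader`` are hypothetical placeholders)::

    model = build_model().cuda()      # plain FP32 model, no .half() anywhere
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

    # Roughly the counterpart of FP16_Optimizer with dynamic loss scaling.
    model, optimizer = amp.initialize(model, optimizer, opt_level="O2")

    for input, target in loader:      # FP32 input data; the patched forward casts it
        loss = criterion(model(input), target)
        optimizer.zero_grad()
        with amp.scale_loss(loss, optimizer) as scaled_loss:
            scaled_loss.backward()
        optimizer.step()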