.. role:: hidden
    :class: hidden-section

apex.amp
===================================

This page documents the updated API for Amp (Automatic Mixed Precision),
a tool to enable Tensor Core-accelerated training in only 3 lines of Python.

A `runnable, comprehensive Imagenet example`_ demonstrating good practices can be found
on the Github page.

GANs are a tricky case that many people have requested. A `comprehensive DCGAN example`_
is under construction.

If you already implemented Amp based on the instructions below, but it isn't behaving as expected,
please review `Advanced Amp Usage`_ to see if any topics match your use case. If that doesn't help,
`file an issue`_.

.. _`file an issue`:
    https://github.com/NVIDIA/apex/issues

``opt_level``\ s and Properties
-------------------------------

Amp allows users to easily experiment with different pure and mixed precision modes.
Commonly-used default modes are chosen by
selecting an "optimization level" or ``opt_level``; each ``opt_level`` establishes a set of
properties that govern Amp's implementation of pure or mixed precision training.
Finer-grained control of how a given ``opt_level`` behaves can be achieved by passing values for
particular properties directly to ``amp.initialize``. These manually specified values
override the defaults established by the ``opt_level``.

Example::

    # Declare model and optimizer as usual, with default (FP32) precision
    model = torch.nn.Linear(D_in, D_out).cuda()
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

    # Allow Amp to perform casts as required by the opt_level
    model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
    ...

    # loss.backward() becomes:
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    ...

Users **should not** manually cast their model or data to ``.half()``, regardless of what ``opt_level``
or properties are chosen. Amp intends that users start with an existing default (FP32) script,
add the three lines corresponding to the Amp API, and begin training with mixed precision.
Amp can also be disabled, in which case the original script will behave exactly as it used to.
In this way, there's no risk in adhering to the Amp API, and a lot of potential performance benefit.

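For example, you can leave the Amp calls in place and switch them off with the ``enabled`` argument
of ``amp.initialize`` (a minimal sketch; ``args.use_amp`` is a hypothetical command-line flag)::

    # When enabled=False, Amp's calls become no-ops and the script behaves
    # like the original FP32 version.
    model, optimizer = amp.initialize(model, optimizer,
                                      opt_level="O1",
                                      enabled=args.use_amp)
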
.. note::
    Because it's never necessary to manually cast your model (aside from the call to ``amp.initialize``)
    or input data, a script that adheres to the new API
    can switch between different ``opt_level``\ s without having to make any other changes.

.. _`runnable, comprehensive Imagenet example`:
    https://github.com/NVIDIA/apex/tree/master/examples/imagenet

.. _`comprehensive DCGAN example`:
    https://github.com/NVIDIA/apex/tree/master/examples/dcgan

.. _`Advanced Amp Usage`:
    https://nvidia.github.io/apex/advanced.html

Properties
**********

Currently, the under-the-hood properties that govern pure or mixed precision training are the following:

- ``cast_model_type``: Casts your model's parameters and buffers to the desired type.
- ``patch_torch_functions``: Patch all Torch functions and Tensor methods to perform Tensor Core-friendly ops like GEMMs and convolutions in FP16, and any ops that benefit from FP32 precision in FP32.
- ``keep_batchnorm_fp32``: To enhance precision and enable cudnn batchnorm (which improves performance), it's often beneficial to keep batchnorm weights in FP32 even if the rest of the model is FP16.
- ``master_weights``: Maintain FP32 master weights to accompany any FP16 model weights. FP32 master weights are stepped by the optimizer to enhance precision and capture small gradients.
- ``loss_scale``: If ``loss_scale`` is a float value, use this value as the static (fixed) loss scale. If ``loss_scale`` is the string ``"dynamic"``, adaptively adjust the loss scale over time. Dynamic loss scale adjustments are performed by Amp automatically.

Again, you often don't need to specify these properties by hand. Instead, select an ``opt_level``,
which will set them up for you. After selecting an ``opt_level``, you can optionally pass property
kwargs as manual overrides.

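For example, the following sketch selects ``O2`` but overrides its default dynamic loss scaling
with a static loss scale (``128.0`` is an arbitrary value chosen for illustration)::

    # opt_level="O2" establishes its usual defaults (FP16 model weights, FP32
    # batchnorm, FP32 master weights); the explicit kwarg overrides loss_scale="dynamic".
    model, optimizer = amp.initialize(model, optimizer,
                                      opt_level="O2",
                                      loss_scale=128.0)
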
If you attempt to override a property that does not make sense for the selected ``opt_level``,
Amp will raise an error with an explanation. For example, selecting ``opt_level="O1"`` combined with
the override ``master_weights=True`` does not make sense. ``O1`` inserts casts
around Torch functions rather than model weights. Data, activations, and weights are recast
out-of-place on the fly as they flow through patched functions. Therefore, the model weights themselves
can (and should) remain FP32, and there is no need to maintain separate FP32 master weights.

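Concretely, a call like the following sketch would be rejected by Amp with an explanatory error
rather than silently ignored::

    # Invalid combination: O1 keeps model weights in FP32, so there are no FP16
    # weights for separate FP32 master weights to shadow.
    model, optimizer = amp.initialize(model, optimizer,
                                      opt_level="O1",
                                      master_weights=True)
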
``opt_level``\ s
****************

Recognized ``opt_level``\ s are ``"O0"``, ``"O1"``, ``"O2"``, and ``"O3"``.

``O0`` and ``O3`` are not true mixed precision, but they are useful for establishing accuracy and
speed baselines, respectively.

``O1`` and ``O2`` are different implementations of mixed precision. Try both, and see
what gives the best speedup and accuracy for your model.

``O0``: FP32 training
^^^^^^^^^^^^^^^^^^^^^^

Your incoming model should be FP32 already, so this is likely a no-op.
``O0`` can be useful to establish an accuracy baseline.

| Default properties set by ``O0``:
| ``cast_model_type=torch.float32``
| ``patch_torch_functions=False``
| ``keep_batchnorm_fp32=None`` (effectively, "not applicable," everything is FP32)
| ``master_weights=False``
| ``loss_scale=1.0``
|
|

``O1``: Mixed Precision (recommended for typical use)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Patch all Torch functions and Tensor methods to cast their inputs according to a whitelist-blacklist
model. Whitelist ops (for example, Tensor Core-friendly ops like GEMMs and convolutions) are performed
in FP16. Blacklist ops that benefit from FP32 precision (for example, softmax)
are performed in FP32. ``O1`` also uses dynamic loss scaling, unless overridden.

| Default properties set by ``O1``:
| ``cast_model_type=None`` (not applicable)
| ``patch_torch_functions=True``
| ``keep_batchnorm_fp32=None`` (again, not applicable, all model weights remain FP32)
| ``master_weights=None`` (not applicable, model weights remain FP32)
| ``loss_scale="dynamic"``
|
|

``O2``: "Almost FP16" Mixed Precision
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

``O2`` casts the model weights to FP16,
patches the model's ``forward`` method to cast input
data to FP16, keeps batchnorms in FP32, maintains FP32 master weights,
updates the optimizer's ``param_groups`` so that the ``optimizer.step()``
acts directly on the FP32 weights (followed by FP32 master weight->FP16 model weight
copies if necessary),
and implements dynamic loss scaling (unless overridden).
Unlike ``O1``, ``O2`` does not patch Torch functions or Tensor methods.

| Default properties set by ``O2``:
| ``cast_model_type=torch.float16``
| ``patch_torch_functions=False``
| ``keep_batchnorm_fp32=True``
| ``master_weights=True``
| ``loss_scale="dynamic"``
|
|

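As a quick sanity check of what ``O2`` does, the following sketch (reusing ``D_in``/``D_out`` from
the example above; the ``print`` calls are purely illustrative) shows that the model weights become
FP16 while the optimizer owns FP32 master weights::

    model = torch.nn.Linear(D_in, D_out).cuda()
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    model, optimizer = amp.initialize(model, optimizer, opt_level="O2")

    # The model's own weights have been cast to FP16...
    print(model.weight.dtype)                        # torch.float16
    # ...while the FP32 master weights are owned by the optimizer and can be
    # iterated with amp.master_params.
    print(next(amp.master_params(optimizer)).dtype)  # torch.float32
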
``O3``: FP16 training
^^^^^^^^^^^^^^^^^^^^^^

``O3`` may not achieve the stability of the true mixed precision options ``O1`` and ``O2``.
However, it can be useful to establish a speed baseline for your model, against which
the performance of ``O1`` and ``O2`` can be compared. If your model uses batch normalization,
to establish "speed of light" you can try ``O3`` with the additional property override
``keep_batchnorm_fp32=True`` (which enables cudnn batchnorm, as stated earlier).

| Default properties set by ``O3``:
| ``cast_model_type=torch.float16``
| ``patch_torch_functions=False``
| ``keep_batchnorm_fp32=False``
| ``master_weights=False``
| ``loss_scale=1.0``
|
|

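For example, a "speed of light" run for a model that uses batch normalization could be launched
with the override described above passed directly to ``amp.initialize``::

    # Pure FP16 training, except batchnorm weights stay FP32 so that
    # cudnn batchnorm can be used.
    model, optimizer = amp.initialize(model, optimizer,
                                      opt_level="O3",
                                      keep_batchnorm_fp32=True)
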
Unified API
-----------

.. automodule:: apex.amp
.. currentmodule:: apex.amp

.. autofunction:: initialize

.. autofunction:: scale_loss

.. autofunction:: master_params

Checkpointing
-------------

To properly save and load training runs that use Amp, we introduce ``amp.state_dict()``, which contains all ``loss_scaler``\ s and their corresponding unskipped steps, as well as ``amp.load_state_dict()`` to restore these attributes.

In order to get bitwise accuracy, we recommend the following workflow::

    # Initialization
    opt_level = 'O1'
    model, optimizer = amp.initialize(model, optimizer, opt_level=opt_level)

    # Train your model
    ...

    # Save checkpoint
    checkpoint = {
        'model': model.state_dict(),
        'optimizer': optimizer.state_dict(),
        'amp': amp.state_dict()
    }
    torch.save(checkpoint, 'amp_checkpoint.pt')
    ...

    # Restore
    model = ...
    optimizer = ...
    checkpoint = torch.load('amp_checkpoint.pt')
    model, optimizer = amp.initialize(model, optimizer, opt_level=opt_level)
    model.load_state_dict(checkpoint['model'])
    optimizer.load_state_dict(checkpoint['optimizer'])
    amp.load_state_dict(checkpoint['amp'])

    # Continue training
    ...

Note that we recommend restoring the model using the same ``opt_level``. Also note that we recommend calling the ``load_state_dict`` methods after ``amp.initialize``.

Advanced use cases
------------------

The unified Amp API supports gradient accumulation across iterations,
multiple backward passes per iteration, multiple models/optimizers,
custom/user-defined autograd functions, and custom data batch classes. Gradient clipping and GANs also
require special treatment, but this treatment does not need to change
for different ``opt_level``\ s.

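Gradient clipping, for instance, should be applied to the gradients of the params owned by the
optimizer (the master params) rather than to ``model.parameters()``; a minimal sketch of that
pattern, where ``max_norm`` is a hypothetical clipping threshold (see `Advanced Amp Usage`_ for
the full recipe)::

    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    # Clip the gradients held by the (possibly FP32 master) params that the
    # optimizer will actually step.
    torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), max_norm)
    optimizer.step()
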
Further details can be found here:

.. toctree::
   :maxdepth: 1

   advanced

Transition guide for old API users
----------------------------------

We strongly encourage moving to the new Amp API, because it's more versatile, easier to use, and future-proof. The original :class:`FP16_Optimizer` and the old "Amp" API are deprecated, and subject to removal at any time.

For users of the old "Amp" API
******************************

In the new API, ``opt_level="O1"`` performs the same patching of the Torch namespace as the old "Amp" did.
However, the new API allows static or dynamic loss scaling, while the old API only allowed dynamic loss scaling.

In the new API, the old call to ``amp_handle = amp.init()``, and the returned ``amp_handle``, are no
longer exposed or necessary. The new ``amp.initialize()`` does the duty of ``amp.init()`` (and more).
Therefore, any existing calls to ``amp_handle = amp.init()`` should be deleted.

The functions formerly exposed through ``amp_handle`` are now free
functions accessible through the ``amp`` module.

The backward context manager must be changed accordingly::

    # old API
    with amp_handle.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    ->
    # new API
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()

For now, the deprecated "Amp" API documentation can still be found on the Github README: https://github.com/NVIDIA/apex/tree/master/apex/amp. The old API calls that `annotate user functions`_ to run
with a particular precision are still honored by the new API.

.. _`annotate user functions`:
    https://github.com/NVIDIA/apex/tree/master/apex/amp#annotating-user-functions

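If your script used those annotations, they carry over unchanged; a sketch (the decorator names
come from the old Amp README linked above, and ``my_custom_op`` is a hypothetical user function)::

    from apex import amp

    # Ask Amp to run this user-defined op in FP32 even when O1 patching is active.
    @amp.float_function
    def my_custom_op(x, y):
        return x @ y
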
For users of the old FP16_Optimizer
***********************************

``opt_level="O2"`` is equivalent to :class:`FP16_Optimizer` with ``dynamic_loss_scale=True``.
Once again, the backward pass must be changed to the unified version::

    optimizer.backward(loss)
    ->
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()

One annoying aspect of FP16_Optimizer was that the user had to manually convert their model to half
(either by calling ``.half()`` on it, or using a function or module wrapper from
``apex.fp16_utils``), and also manually call ``.half()`` on input data. **Neither of these are
necessary in the new API. No matter what ``opt_level``
you choose, you can and should simply build your model and pass input data in the default FP32 format.**
The new Amp API will perform the right conversions during
``model, optimizer = amp.initialize(model, optimizer, opt_level=....)`` based on the ``opt_level``
and any overridden flags. Floating point input data may be FP32 or FP16, but you may as well just
let it be FP32, because the ``model`` returned by ``amp.initialize`` will have its ``forward``
method patched to cast the input data appropriately.

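Putting it together, a hypothetical migration from an FP16_Optimizer script might look like the
following sketch (``Net``, ``loader``, and ``criterion`` are placeholders for your own model, data
loader, and loss)::

    # Build everything in the default FP32 format; no manual .half() anywhere.
    model = Net().cuda()
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

    # O2 replaces the old FP16_Optimizer(..., dynamic_loss_scale=True) setup.
    model, optimizer = amp.initialize(model, optimizer, opt_level="O2")

    for data, target in loader:
        optimizer.zero_grad()
        output = model(data.cuda())            # the patched forward casts the FP32 input
        loss = criterion(output, target.cuda())
        with amp.scale_loss(loss, optimizer) as scaled_loss:
            scaled_loss.backward()
        optimizer.step()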