Quantization

Quantization is a technique to reduce the computational and memory cost of running inference by representing the weights and activations with low-precision data types like 8-bit integer (INT8) instead of the usual 32-bit floating point (FP32). Reducing the number of bits means the resulting model requires less memory storage, and operations like matrix multiplication can be performed much faster with integer arithmetic. Remarkably, these performance gains can be realized with little to no loss in accuracy!

The basic idea behind quantization is that we can “discretize” the floating-point values in each tensor by linearly mapping their range onto a smaller range of fixed-point integers, so that every value in between is represented by its nearest point on that integer grid.
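
As a concrete illustration, here is a minimal NumPy sketch of such a linear (affine) mapping from FP32 to INT8. The function names, the per-tensor scale/zero-point computation, and the symmetric integer range are illustrative assumptions for this example, not Optimum's API:

```python
import numpy as np

def quantize(x, num_bits=8):
    """Linearly map a float32 tensor onto the signed integer range [-128, 127]."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    # The scale stretches the float range onto the integer range;
    # the zero point shifts it so the minimum float value lands on qmin.
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = int(np.round(qmin - x.min() / scale))
    x_q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)
    return x_q, scale, zero_point

def dequantize(x_q, scale, zero_point):
    """Approximately recover the original float values."""
    return (x_q.astype(np.float32) - zero_point) * scale

x = np.random.randn(4, 4).astype(np.float32)
x_q, scale, zp = quantize(x)
print(np.abs(x - dequantize(x_q, scale, zp)).max())  # small reconstruction error
```

Dequantizing the INT8 tensor does not give back the exact original values, but for a well-chosen scale and zero point the rounding error stays small, which is why accuracy is largely preserved in practice.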