Quantization

Quantization is a technique to reduce the computational and memory cost of running inference by representing the weights and activations with low-precision data types like 8-bit integer (INT8) instead of the usual 32-bit floating point (FP32). Reducing the number of bits means the resulting model requires less memory storage, and operations like matrix multiplication can be performed much faster with integer arithmetic. Remarkably, these performance gains can be realized with little to no loss in accuracy!

The basic idea behind quantization is that we can “discretize” the floating-point values in each tensor by linearly mapping their range onto a smaller range of fixed-point integers, so that every value in between is represented by its nearest point on that integer grid.
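
As a concrete illustration, here is a minimal NumPy sketch of such a linear (affine) mapping from FP32 to INT8. The function names, the per-tensor scale/zero-point computation, and the symmetric integer range are illustrative assumptions for this example, not Optimum's API:

```python
import numpy as np

def quantize(x, num_bits=8):
    """Linearly map a float32 tensor onto the signed integer range [-128, 127]."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    # The scale stretches the float range onto the integer range;
    # the zero point shifts it so the minimum float value lands on qmin.
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = int(np.round(qmin - x.min() / scale))
    x_q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)
    return x_q, scale, zero_point

def dequantize(x_q, scale, zero_point):
    """Approximately recover the original float values."""
    return (x_q.astype(np.float32) - zero_point) * scale

x = np.random.randn(4, 4).astype(np.float32)
x_q, scale, zp = quantize(x)
print(np.abs(x - dequantize(x_q, scale, zp)).max())  # small reconstruction error
```

Dequantizing the INT8 tensor does not give back the exact original values, but for a well-chosen scale and zero point the rounding error stays small, which is why accuracy is largely preserved in practice.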