Ideally, pre-compiled kernels are included in the pip package.
The method can run on commonly-used hardware (CPU, GPU, ...).
The method is wrapped in an nn.Module (e.g., Linear8bitLt, Linear4bit), and the quantized linear layer should have the following definition:

```python
class Linear4bit(nn.Module):
    def __init__(self, ...):
        ...

    def forward(self, x):
        return my_4bit_kernel(x, self.weight, self.bias)
```

This way, Transformers models can be easily quantized by replacing some instances of nn.Linear with a target class.
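A minimal sketch of that replacement step, assuming only plain PyTorch: `Linear4bitStub` and `replace_linear` are hypothetical names for illustration (this is not the actual Transformers helper), and the stub does no real 4-bit math.

```python
import torch
import torch.nn as nn


class Linear4bitStub(nn.Module):
    """Hypothetical quantized linear layer; a real one would pack weights
    to 4 bits and call a dedicated kernel in forward()."""

    def __init__(self, in_features, out_features, bias=True):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.weight = nn.Parameter(torch.zeros(out_features, in_features))
        self.bias = nn.Parameter(torch.zeros(out_features)) if bias else None

    def forward(self, x):
        # Placeholder: a real implementation calls the 4-bit kernel here.
        return nn.functional.linear(x, self.weight, self.bias)


def replace_linear(module: nn.Module) -> None:
    """Recursively swap every nn.Linear child for the quantized class."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            setattr(
                module,
                name,
                Linear4bitStub(child.in_features, child.out_features,
                               child.bias is not None),
            )
        else:
            replace_linear(child)


model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
replace_linear(model)
```

After the call, both Linear layers are instances of the quantized class while non-linear modules are left untouched.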

The quantization method should be serializable.
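One way to meet the serializability requirement is to keep the packed weights in registered buffers, so they are captured by state_dict() and survive a save/load round trip. A minimal sketch, assuming plain PyTorch; QuantLinearStub is a hypothetical name and the int8 packing is only illustrative:

```python
import io

import torch
import torch.nn as nn


class QuantLinearStub(nn.Module):
    """Hypothetical quantized layer whose packed weights live in buffers,
    so state_dict() serializes them alongside any parameters."""

    def __init__(self, in_features, out_features):
        super().__init__()
        # Packed integer weights plus a per-row float scale; both are
        # registered as buffers so they serialize with the module.
        self.register_buffer(
            "qweight", torch.zeros(out_features, in_features, dtype=torch.int8))
        self.register_buffer("scale", torch.ones(out_features, 1))

    def forward(self, x):
        # Dequantize on the fly; a real method would call a fused kernel.
        return x @ (self.qweight.float() * self.scale).t()


layer = QuantLinearStub(4, 2)
layer.qweight.fill_(3)

# Round-trip through an in-memory file.
buf = io.BytesIO()
torch.save(layer.state_dict(), buf)
buf.seek(0)

restored = QuantLinearStub(4, 2)
restored.load_state_dict(torch.load(buf))
```

The same buffer-based layout is what lets quantized checkpoints be saved and reloaded like any other model weights.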