The method is wrapped in an `nn.Module` (e.g., `Linear8bitLt`, `Linear4bit`), and the quantized linear layer should have the following definition:

```py
import torch.nn as nn

class Linear4bit(nn.Module):
    def __init__(self, *args, **kwargs):
        ...  # set up the quantized weight/bias storage here

    def forward(self, x):
        # my_4bit_kernel is a placeholder for the method's custom 4-bit matmul kernel
        return my_4bit_kernel(x, self.weight, self.bias)
```
This way, Transformers models can be easily quantized by replacing some instances of `nn.Linear` with the target class.
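For illustration, here is a minimal sketch of such a replacement pass. The helper name `replace_linear_layers` is hypothetical, as is the assumption that the target class accepts the same `(in_features, out_features, bias)` arguments as `nn.Linear`; real conversion code must also copy or quantize the original weights and typically skips certain modules (e.g., the language modeling head).

```py
import torch.nn as nn

def replace_linear_layers(model: nn.Module, target_cls, modules_to_not_convert=()):
    """Recursively swap `nn.Linear` submodules for the quantized `target_cls`.

    A minimal sketch: assumes `target_cls(in_features, out_features, bias=...)`
    mirrors the `nn.Linear` constructor.
    """
    for name, module in model.named_children():
        if name in modules_to_not_convert:
            continue  # leave excluded submodules (e.g., the LM head) in full precision
        if isinstance(module, nn.Linear):
            # hypothetical constructor call; adapt to the target class's signature
            setattr(
                model,
                name,
                target_cls(module.in_features, module.out_features, bias=module.bias is not None),
            )
        else:
            # recurse into containers such as transformer blocks
            replace_linear_layers(module, target_cls, modules_to_not_convert)
    return model
```

Usage would then look like `replace_linear_layers(model, Linear4bit, modules_to_not_convert=("lm_head",))` before loading the quantized weights.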