<!--Copyright 2020 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# I-BERT

## Overview

The I-BERT model was proposed in [I-BERT: Integer-only BERT Quantization](https://arxiv.org/abs/2101.01321) by
Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney and Kurt Keutzer. It is a quantized version of RoBERTa that
runs inference up to four times faster.

The abstract from the paper is the following:

*Transformer based models, like BERT and RoBERTa, have achieved state-of-the-art results in many Natural Language
Processing tasks. However, their memory footprint, inference latency, and power consumption are prohibitive for
efficient inference at the edge, and even at the data center. While quantization can be a viable solution for this,
previous work on quantizing Transformer based models use floating-point arithmetic during inference, which cannot
efficiently utilize integer-only logical units such as the recent Turing Tensor Cores, or traditional integer-only ARM
processors. In this work, we propose I-BERT, a novel quantization scheme for Transformer based models that quantizes
the entire inference with integer-only arithmetic. Based on lightweight integer-only approximation methods for
nonlinear operations, e.g., GELU, Softmax, and Layer Normalization, I-BERT performs an end-to-end integer-only BERT
inference without any floating point calculation. We evaluate our approach on GLUE downstream tasks using
RoBERTa-Base/Large. We show that for both cases, I-BERT achieves similar (and slightly higher) accuracy as compared to
the full-precision baseline. Furthermore, our preliminary implementation of I-BERT shows a speedup of 2.4 - 4.0x for
INT8 inference on a T4 GPU system as compared to FP32 inference. The framework has been developed in PyTorch and has
been open-sourced.*

This model was contributed by [kssteven](https://huggingface.co/kssteven). The original code can be found [here](https://github.com/kssteven418/I-BERT).

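Below is a minimal sketch of running masked language modeling with I-BERT. It assumes the `kssteven/ibert-roberta-base` checkpoint is available on the Hub; any other I-BERT checkpoint can be substituted, and the same checkpoint can be loaded into the head classes listed further down this page for fine-tuning on downstream tasks.

```python
import torch
from transformers import AutoTokenizer, IBertForMaskedLM

# Assumed checkpoint name; substitute any I-BERT checkpoint from the Hub.
checkpoint = "kssteven/ibert-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = IBertForMaskedLM.from_pretrained(checkpoint)

# I-BERT reuses the RoBERTa tokenizer, so the mask token is "<mask>".
inputs = tokenizer("The capital of France is <mask>.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Find the masked position and take the highest-scoring token for it.
mask_positions = (inputs.input_ids == tokenizer.mask_token_id)[0].nonzero(as_tuple=True)[0]
predicted_ids = logits[0, mask_positions].argmax(dim=-1)
print(tokenizer.decode(predicted_ids))
```
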
## Documentation resources

- [Text classification task guide](../tasks/sequence_classification)
- [Token classification task guide](../tasks/token_classification)
- [Question answering task guide](../tasks/question_answering)
- [Masked language modeling task guide](../tasks/masked_language_modeling)
- [Multiple choice task guide](../tasks/multiple_choice)

## IBertConfig

[[autodoc]] IBertConfig

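A short sketch of instantiating a configuration and a randomly initialized model from it. The `quant_mode` flag is assumed here to be the switch that enables the integer-only code path, and the checkpoint name is an assumption; see the generated reference above for the exact parameters and their defaults.

```python
from transformers import IBertConfig, IBertModel

# Build a randomly initialized I-BERT; quant_mode is assumed to enable
# the integer-only (quantized) inference path.
config = IBertConfig(quant_mode=True)
model = IBertModel(config)

# Alternatively, inspect the configuration of a pretrained checkpoint (assumed name).
pretrained_config = IBertConfig.from_pretrained("kssteven/ibert-roberta-base")
print(pretrained_config.quant_mode)
```
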
## IBertModel

[[autodoc]] IBertModel
    - forward

## IBertForMaskedLM

[[autodoc]] IBertForMaskedLM
    - forward

## IBertForSequenceClassification

[[autodoc]] IBertForSequenceClassification
    - forward

## IBertForMultipleChoice

[[autodoc]] IBertForMultipleChoice
    - forward

## IBertForTokenClassification

[[autodoc]] IBertForTokenClassification
    - forward

## IBertForQuestionAnswering

[[autodoc]] IBertForQuestionAnswering
    - forward