<!--Copyright 2022 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# LayoutLMv3

## Overview

The LayoutLMv3 model was proposed in [LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking](https://arxiv.org/abs/2204.08387) by Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, Furu Wei.
LayoutLMv3 simplifies [LayoutLMv2](layoutlmv2) by using patch embeddings (as in [ViT](vit)) instead of leveraging a CNN backbone, and pre-trains the model on 3 objectives: masked language modeling (MLM), masked image modeling (MIM)
and word-patch alignment (WPA).

The abstract from the paper is the following:

*Self-supervised pre-training techniques have achieved remarkable progress in Document AI. Most multimodal pre-trained models use a masked language modeling objective to learn bidirectional representations on the text modality, but they differ in pre-training objectives for the image modality. This discrepancy adds difficulty to multimodal representation learning. In this paper, we propose LayoutLMv3 to pre-train multimodal Transformers for Document AI with unified text and image masking. Additionally, LayoutLMv3 is pre-trained with a word-patch alignment objective to learn cross-modal alignment by predicting whether the corresponding image patch of a text word is masked. The simple unified architecture and training objectives make LayoutLMv3 a general-purpose pre-trained model for both text-centric and image-centric Document AI tasks. Experimental results show that LayoutLMv3 achieves state-of-the-art performance not only in text-centric tasks, including form understanding, receipt understanding, and document visual question answering, but also in image-centric tasks such as document image classification and document layout analysis.*
Tips:

- In terms of data processing, LayoutLMv3 is identical to its predecessor [LayoutLMv2](layoutlmv2), except that:
    - images need to be resized and normalized with channels in regular RGB format. LayoutLMv2, on the other hand, normalizes its images internally and expects the channels in BGR format.
    - text is tokenized using byte-pair encoding (BPE), as opposed to WordPiece.

  Due to these differences in data preprocessing, one can use [`LayoutLMv3Processor`], which internally combines a [`LayoutLMv3FeatureExtractor`] (for the image modality) and a [`LayoutLMv3Tokenizer`]/[`LayoutLMv3TokenizerFast`] (for the text modality), to prepare all data for the model; see the sketch after this list.
- Regarding usage of [`LayoutLMv3Processor`], we refer to the [usage guide](layoutlmv2#usage-layoutlmv2processor) of its predecessor.
- Demo notebooks for LayoutLMv3 can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/LayoutLMv3).
- Demo scripts can be found [here](https://github.com/huggingface/transformers/tree/main/examples/research_projects/layoutlmv3).
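
A minimal sketch of preparing a document image with [`LayoutLMv3Processor`] and running it through [`LayoutLMv3ForTokenClassification`]. The image path, words, bounding boxes, and `num_labels` below are placeholders; it assumes you supply your own OCR results (`apply_ocr=False`), with boxes normalized to a 0-1000 scale:

```python
from PIL import Image
from transformers import LayoutLMv3Processor, LayoutLMv3ForTokenClassification

# apply_ocr=False: we provide the words and (0-1000 normalized) bounding boxes ourselves
processor = LayoutLMv3Processor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=False)
# num_labels=7 is a placeholder (e.g. a FUNSD-style label set); this head is randomly
# initialized, so fine-tune before relying on its predictions
model = LayoutLMv3ForTokenClassification.from_pretrained("microsoft/layoutlmv3-base", num_labels=7)

image = Image.open("document.png").convert("RGB")  # placeholder path
words = ["Invoice", "Date:", "2022-05-04"]         # placeholder OCR output
boxes = [[48, 84, 156, 108], [48, 120, 110, 144], [118, 120, 218, 144]]  # placeholder boxes

# the processor resizes/normalizes the image, tokenizes the words with BPE,
# and aligns one bounding box with each resulting token
encoding = processor(image, words, boxes=boxes, return_tensors="pt")
outputs = model(**encoding)
logits = outputs.logits  # shape (batch_size, sequence_length, num_labels)
```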
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/layoutlmv3_architecture.png" | |
alt="drawing" width="600"/> | |
<small> LayoutLMv3 architecture. Taken from the <a href="https://arxiv.org/abs/2204.08387">original paper</a>. </small>

This model was contributed by [nielsr](https://huggingface.co/nielsr). The TensorFlow version of this model was added by [chriskoo](https://huggingface.co/chriskoo), [tokec](https://huggingface.co/tokec), and [lre](https://huggingface.co/lre). The original code can be found [here](https://github.com/microsoft/unilm/tree/master/layoutlmv3).

## Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with LayoutLMv3. If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.

<Tip>

LayoutLMv3 is nearly identical to LayoutLMv2, so we've also included LayoutLMv2 resources you can adapt for LayoutLMv3 tasks. For these notebooks, take care to use [`LayoutLMv2Processor`] instead when preparing data for the model!

</Tip>

<PipelineTag pipeline="text-classification"/>

- [`LayoutLMv2ForSequenceClassification`] is supported by this [notebook](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/LayoutLMv2/RVL-CDIP/Fine_tuning_LayoutLMv2ForSequenceClassification_on_RVL_CDIP.ipynb).
- [Text classification task guide](../tasks/sequence_classification)
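
As a quick orientation for document classification with LayoutLMv3 itself, here is a minimal sketch. The image path is a placeholder, `num_labels=16` assumes an RVL-CDIP-style label set, and the classification head of the base checkpoint is randomly initialized, so fine-tune before use. The processor runs OCR via the Tesseract engine by default, which must be installed:

```python
from PIL import Image
from transformers import LayoutLMv3Processor, LayoutLMv3ForSequenceClassification

# the default processor applies OCR (requires Tesseract + pytesseract)
processor = LayoutLMv3Processor.from_pretrained("microsoft/layoutlmv3-base")
# num_labels=16 matches RVL-CDIP's 16 document classes; the head is randomly initialized
model = LayoutLMv3ForSequenceClassification.from_pretrained("microsoft/layoutlmv3-base", num_labels=16)

image = Image.open("scanned_letter.png").convert("RGB")  # placeholder path
encoding = processor(image, return_tensors="pt")
predicted_class = model(**encoding).logits.argmax(-1).item()
```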
<PipelineTag pipeline="token-classification"/> | |
- [`LayoutLMv3ForTokenClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/research_projects/layoutlmv3) and [notebook](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/LayoutLMv3/Fine_tune_LayoutLMv3_on_FUNSD_(HuggingFace_Trainer).ipynb). | |
- A [notebook](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/LayoutLMv2/FUNSD/Inference_with_LayoutLMv2ForTokenClassification.ipynb) for how to perform inference with [`LayoutLMv2ForTokenClassification`] and a [notebook](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/LayoutLMv2/FUNSD/True_inference_with_LayoutLMv2ForTokenClassification_%2B_Gradio_demo.ipynb) for how to perform inference when no labels are available with [`LayoutLMv2ForTokenClassification`]. | |
- A [notebook](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/LayoutLMv2/FUNSD/Fine_tuning_LayoutLMv2ForTokenClassification_on_FUNSD_using_HuggingFace_Trainer.ipynb) for how to finetune [`LayoutLMv2ForTokenClassification`] with the 🤗 Trainer. | |
- [Token classification task guide](../tasks/token_classification) | |
<PipelineTag pipeline="question-answering"/> | |
- [`LayoutLMv2ForQuestionAnswering`] is supported by this [notebook](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/LayoutLMv2/DocVQA/Fine_tuning_LayoutLMv2ForQuestionAnswering_on_DocVQA.ipynb). | |
- [Question answering task guide](../tasks/question_answering) | |
**Document question answering** | |
- [Document question answering task guide](../tasks/document_question_answering) | |
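
For extractive document question answering along the lines of the DocVQA setup, a minimal sketch is shown below. The image path and question are placeholders, and the QA head of the base checkpoint is randomly initialized, so fine-tune it (or load a fine-tuned checkpoint) before use:

```python
import torch
from PIL import Image
from transformers import LayoutLMv3Processor, LayoutLMv3ForQuestionAnswering

processor = LayoutLMv3Processor.from_pretrained("microsoft/layoutlmv3-base")  # OCR via Tesseract by default
# the QA head of the base checkpoint is randomly initialized; fine-tune before use
model = LayoutLMv3ForQuestionAnswering.from_pretrained("microsoft/layoutlmv3-base")

image = Image.open("invoice.png").convert("RGB")  # placeholder path
question = "What is the invoice date?"            # placeholder question

encoding = processor(image, question, return_tensors="pt")
with torch.no_grad():
    outputs = model(**encoding)

# decode the highest-scoring start/end span back to text
start = outputs.start_logits.argmax(-1).item()
end = outputs.end_logits.argmax(-1).item()
answer = processor.tokenizer.decode(encoding.input_ids[0, start : end + 1])
```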
## LayoutLMv3Config | |
[[autodoc]] LayoutLMv3Config | |
## LayoutLMv3FeatureExtractor | |
[[autodoc]] LayoutLMv3FeatureExtractor | |
- __call__ | |
## LayoutLMv3ImageProcessor | |
[[autodoc]] LayoutLMv3ImageProcessor | |
- preprocess | |
## LayoutLMv3Tokenizer | |
[[autodoc]] LayoutLMv3Tokenizer | |
- __call__ | |
- save_vocabulary | |
## LayoutLMv3TokenizerFast | |
[[autodoc]] LayoutLMv3TokenizerFast | |
- __call__ | |
## LayoutLMv3Processor | |
[[autodoc]] LayoutLMv3Processor | |
- __call__ | |
## LayoutLMv3Model | |
[[autodoc]] LayoutLMv3Model | |
- forward | |
## LayoutLMv3ForSequenceClassification | |
[[autodoc]] LayoutLMv3ForSequenceClassification | |
- forward | |
## LayoutLMv3ForTokenClassification | |
[[autodoc]] LayoutLMv3ForTokenClassification | |
- forward | |
## LayoutLMv3ForQuestionAnswering | |
[[autodoc]] LayoutLMv3ForQuestionAnswering | |
- forward | |
## TFLayoutLMv3Model | |
[[autodoc]] TFLayoutLMv3Model | |
- call | |
## TFLayoutLMv3ForSequenceClassification | |
[[autodoc]] TFLayoutLMv3ForSequenceClassification | |
- call | |
## TFLayoutLMv3ForTokenClassification | |
[[autodoc]] TFLayoutLMv3ForTokenClassification | |
- call | |
## TFLayoutLMv3ForQuestionAnswering | |
[[autodoc]] TFLayoutLMv3ForQuestionAnswering | |
- call | |