<!--Copyright 2022 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the
License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Donut

## Overview

The Donut model was proposed in [OCR-free Document Understanding Transformer](https://arxiv.org/abs/2111.15664) by
Geewook Kim, Teakgyu Hong, Moonbin Yim, Jeongyeon Nam, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, Seunghyun Park.
Donut consists of an image Transformer encoder and an autoregressive text Transformer decoder to perform document understanding
tasks such as document image classification, form understanding and visual question answering.

The abstract from the paper is the following:

*Understanding document images (e.g., invoices) is a core but challenging task since it requires complex functions such as reading text and a holistic understanding of the document. Current Visual Document Understanding (VDU) methods outsource the task of reading text to off-the-shelf Optical Character Recognition (OCR) engines and focus on the understanding task with the OCR outputs. Although such OCR-based approaches have shown promising performance, they suffer from 1) high computational costs for using OCR; 2) inflexibility of OCR models on languages or types of document; 3) OCR error propagation to the subsequent process. To address these issues, in this paper, we introduce a novel OCR-free VDU model named Donut, which stands for Document understanding transformer. As the first step in OCR-free VDU research, we propose a simple architecture (i.e., Transformer) with a pre-training objective (i.e., cross-entropy loss). Donut is conceptually simple yet effective. Through extensive experiments and analyses, we show a simple OCR-free VDU model, Donut, achieves state-of-the-art performances on various VDU tasks in terms of both speed and accuracy. In addition, we offer a synthetic data generator that helps the model pre-training to be flexible in various languages and domains.*

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/donut_architecture.jpg"
alt="drawing" width="600"/>

<small> Donut high-level overview. Taken from the <a href="https://arxiv.org/abs/2111.15664">original paper</a>. </small>

This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code can be found
[here](https://github.com/clovaai/donut).
Tips:

- The quickest way to get started with Donut is by checking the [tutorial
  notebooks](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/Donut), which show how to use the model
  at inference time as well as fine-tuning on custom data.
- Donut is always used within the [VisionEncoderDecoder](vision-encoder-decoder) framework, as the short sketch after this list illustrates.
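To make the second tip concrete, here is a minimal sketch of loading a Donut checkpoint through [`VisionEncoderDecoderModel`] and inspecting its two halves. On current versions of Transformers, the checkpoint resolves to a DonutSwin image encoder paired with an MBart-style causal text decoder:

```py
>>> from transformers import VisionEncoderDecoderModel

>>> # Donut checkpoints load as a VisionEncoderDecoderModel: a Swin-based image encoder
>>> # plus a BART-style autoregressive text decoder
>>> model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base")
>>> model.encoder  # doctest: +IGNORE_RESULT
>>> model.decoder  # doctest: +IGNORE_RESULT
```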
## Inference

Donut's [`VisionEncoderDecoder`] model accepts images as input and makes use of
[`~generation.GenerationMixin.generate`] to autoregressively generate text given the input image.

The [`DonutFeatureExtractor`] class is responsible for preprocessing the input image and
[`XLMRobertaTokenizer`/`XLMRobertaTokenizerFast`] decodes the generated target tokens to the target string. The
[`DonutProcessor`] wraps [`DonutFeatureExtractor`] and [`XLMRobertaTokenizer`/`XLMRobertaTokenizerFast`]
into a single instance to both extract the input features and decode the predicted token ids.
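For orientation, a minimal sketch of the two components the processor bundles, using the base (non-finetuned) checkpoint. Note that on more recent versions of Transformers the image preprocessor may be exposed as [`DonutImageProcessor`] rather than [`DonutFeatureExtractor`]:

```py
>>> from transformers import DonutProcessor

>>> # the processor bundles the image preprocessor and the tokenizer into one object
>>> processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base")
>>> processor.feature_extractor  # doctest: +IGNORE_RESULT
>>> processor.tokenizer  # doctest: +IGNORE_RESULT
```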
- Step-by-step Document Image Classification

```py
>>> import re

>>> from transformers import DonutProcessor, VisionEncoderDecoderModel
>>> from datasets import load_dataset
>>> import torch

>>> processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-rvlcdip")
>>> model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base-finetuned-rvlcdip")

>>> device = "cuda" if torch.cuda.is_available() else "cpu"
>>> model.to(device)  # doctest: +IGNORE_RESULT

>>> # load document image
>>> dataset = load_dataset("hf-internal-testing/example-documents", split="test")
>>> image = dataset[1]["image"]

>>> # prepare decoder inputs
>>> task_prompt = "<s_rvlcdip>"
>>> decoder_input_ids = processor.tokenizer(task_prompt, add_special_tokens=False, return_tensors="pt").input_ids

>>> pixel_values = processor(image, return_tensors="pt").pixel_values

>>> outputs = model.generate(
...     pixel_values.to(device),
...     decoder_input_ids=decoder_input_ids.to(device),
...     max_length=model.decoder.config.max_position_embeddings,
...     early_stopping=True,
...     pad_token_id=processor.tokenizer.pad_token_id,
...     eos_token_id=processor.tokenizer.eos_token_id,
...     use_cache=True,
...     num_beams=1,
...     bad_words_ids=[[processor.tokenizer.unk_token_id]],
...     return_dict_in_generate=True,
... )

>>> sequence = processor.batch_decode(outputs.sequences)[0]
>>> sequence = sequence.replace(processor.tokenizer.eos_token, "").replace(processor.tokenizer.pad_token, "")
>>> sequence = re.sub(r"<.*?>", "", sequence, count=1).strip()  # remove first task start token
>>> print(processor.token2json(sequence))
{'class': 'advertisement'}
```
- Step-by-step Document Parsing

```py
>>> import re

>>> from transformers import DonutProcessor, VisionEncoderDecoderModel
>>> from datasets import load_dataset
>>> import torch

>>> processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")
>>> model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")

>>> device = "cuda" if torch.cuda.is_available() else "cpu"
>>> model.to(device)  # doctest: +IGNORE_RESULT

>>> # load document image
>>> dataset = load_dataset("hf-internal-testing/example-documents", split="test")
>>> image = dataset[2]["image"]

>>> # prepare decoder inputs
>>> task_prompt = "<s_cord-v2>"
>>> decoder_input_ids = processor.tokenizer(task_prompt, add_special_tokens=False, return_tensors="pt").input_ids

>>> pixel_values = processor(image, return_tensors="pt").pixel_values

>>> outputs = model.generate(
...     pixel_values.to(device),
...     decoder_input_ids=decoder_input_ids.to(device),
...     max_length=model.decoder.config.max_position_embeddings,
...     early_stopping=True,
...     pad_token_id=processor.tokenizer.pad_token_id,
...     eos_token_id=processor.tokenizer.eos_token_id,
...     use_cache=True,
...     num_beams=1,
...     bad_words_ids=[[processor.tokenizer.unk_token_id]],
...     return_dict_in_generate=True,
... )

>>> sequence = processor.batch_decode(outputs.sequences)[0]
>>> sequence = sequence.replace(processor.tokenizer.eos_token, "").replace(processor.tokenizer.pad_token, "")
>>> sequence = re.sub(r"<.*?>", "", sequence, count=1).strip()  # remove first task start token
>>> print(processor.token2json(sequence))
{'menu': {'nm': 'CINNAMON SUGAR', 'unitprice': '17,000', 'cnt': '1 x', 'price': '17,000'}, 'sub_total': {'subtotal_price': '17,000'}, 'total': {'total_price': '17,000', 'cashprice': '20,000', 'changeprice': '3,000'}}
```
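The nested dictionary above comes from [`DonutProcessor.token2json`], which parses Donut's XML-like tags into Python dicts. A minimal sketch on a hand-written (hypothetical) token sequence, reusing the `processor` from the example above:

```py
>>> # token2json turns Donut's XML-like tags back into nested Python dicts
>>> processor.token2json("<s_menu><s_nm>CINNAMON SUGAR</s_nm><s_cnt>1 x</s_cnt></s_menu>")  # doctest: +IGNORE_RESULT
>>> # expected to yield a dict of the form {'menu': {'nm': 'CINNAMON SUGAR', 'cnt': '1 x'}}
```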
- Step-by-step Document Visual Question Answering (DocVQA)

```py
>>> import re

>>> from transformers import DonutProcessor, VisionEncoderDecoderModel
>>> from datasets import load_dataset
>>> import torch

>>> processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-docvqa")
>>> model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base-finetuned-docvqa")

>>> device = "cuda" if torch.cuda.is_available() else "cpu"
>>> model.to(device)  # doctest: +IGNORE_RESULT

>>> # load document image from the DocVQA dataset
>>> dataset = load_dataset("hf-internal-testing/example-documents", split="test")
>>> image = dataset[0]["image"]

>>> # prepare decoder inputs
>>> task_prompt = "<s_docvqa><s_question>{user_input}</s_question><s_answer>"
>>> question = "When is the coffee break?"
>>> prompt = task_prompt.replace("{user_input}", question)
>>> decoder_input_ids = processor.tokenizer(prompt, add_special_tokens=False, return_tensors="pt").input_ids

>>> pixel_values = processor(image, return_tensors="pt").pixel_values

>>> outputs = model.generate(
...     pixel_values.to(device),
...     decoder_input_ids=decoder_input_ids.to(device),
...     max_length=model.decoder.config.max_position_embeddings,
...     early_stopping=True,
...     pad_token_id=processor.tokenizer.pad_token_id,
...     eos_token_id=processor.tokenizer.eos_token_id,
...     use_cache=True,
...     num_beams=1,
...     bad_words_ids=[[processor.tokenizer.unk_token_id]],
...     return_dict_in_generate=True,
... )

>>> sequence = processor.batch_decode(outputs.sequences)[0]
>>> sequence = sequence.replace(processor.tokenizer.eos_token, "").replace(processor.tokenizer.pad_token, "")
>>> sequence = re.sub(r"<.*?>", "", sequence, count=1).strip()  # remove first task start token
>>> print(processor.token2json(sequence))
{'question': 'When is the coffee break?', 'answer': '11-14 to 11:39 a.m.'}
```
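For DocVQA, the same checkpoint can also be driven through the `document-question-answering` [`pipeline`], which handles prompt construction, generation and decoding internally. A minimal sketch, reusing the `image` loaded above:

```py
>>> from transformers import pipeline

>>> # the pipeline wraps prompting, generation and decoding for Donut DocVQA checkpoints
>>> doc_qa = pipeline("document-question-answering", model="naver-clova-ix/donut-base-finetuned-docvqa")
>>> doc_qa(image=image, question="When is the coffee break?")  # doctest: +IGNORE_RESULT
```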
See the [model hub](https://huggingface.co/models?filter=donut) to look for Donut checkpoints.
## Training

We refer to the [tutorial notebooks](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/Donut).
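The notebooks cover the full recipe, including adding the task's special tokens to the tokenizer. As a rough orientation, a single supervised step is plain teacher forcing on the target token sequence. A minimal sketch, assuming `image` is a PIL document image and `target_sequence` is its ground-truth output string (both hypothetical here):

```py
>>> import torch
>>> from transformers import DonutProcessor, VisionEncoderDecoderModel

>>> processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base")
>>> model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base")

>>> # preprocess the image and tokenize the ground-truth sequence
>>> pixel_values = processor(image, return_tensors="pt").pixel_values
>>> labels = processor.tokenizer(target_sequence, add_special_tokens=False, return_tensors="pt").input_ids
>>> labels[labels == processor.tokenizer.pad_token_id] = -100  # padding (if any) is ignored by the loss

>>> # the model derives the decoder inputs from the labels internally and returns a cross-entropy loss
>>> outputs = model(pixel_values=pixel_values, labels=labels)
>>> outputs.loss.backward()
```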
## DonutSwinConfig

[[autodoc]] DonutSwinConfig

## DonutImageProcessor

[[autodoc]] DonutImageProcessor
    - preprocess

## DonutFeatureExtractor

[[autodoc]] DonutFeatureExtractor
    - __call__

## DonutProcessor

[[autodoc]] DonutProcessor
    - __call__
    - from_pretrained
    - save_pretrained
    - batch_decode
    - decode

## DonutSwinModel

[[autodoc]] DonutSwinModel
    - forward