# LayoutReader
LayoutReader captures the text and layout information for reading order prediction using a seq2seq model. In our experiments, it significantly improves the ordering of text lines in the results of both open-source and commercial OCR engines.
Our paper "[LayoutReader: Pre-training of Text and Layout for Reading Order Detection](https://arxiv.org/pdf/2108.11591.pdf)" has been accepted by EMNLP 2021.
**ReadingBank** is a benchmark dataset for reading order detection built with weak supervision from Microsoft Word documents. It contains 500K document images covering a wide range of document types, together with the corresponding reading order information. For more details, please refer to [ReadingBank](https://aka.ms/readingbank).
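To make the seq2seq formulation concrete, the sketch below builds a toy page: the source side is a set of words with bounding boxes (normalized to a 0-1000 grid, LayoutLM-style) in an arbitrary order such as OCR output, and the target side is the sequence of source indices in human reading order. The field names are illustrative only and not the exact ReadingBank schema.
~~~
# Minimal sketch of the reading order task (illustrative field names, not the
# exact ReadingBank schema): the source is a set of words with layout, and the
# target is the index sequence giving their correct reading order.

# Source side: words with bounding boxes normalized to a 0-1000 grid
# (x0, y0, x1, y1), listed in an arbitrary (e.g. OCR) order.
src = [
    {"text": "Order",    "bbox": (520, 100, 600, 120)},  # read 2nd
    {"text": "Reading",  "bbox": (430, 100, 515, 120)},  # read 1st
    {"text": "matters.", "bbox": (430, 130, 540, 150)},  # read 3rd
]

# Target side: indices into `src` in human reading order.
tgt = [1, 0, 2]

# The seq2seq model encodes the source tokens together with their bounding
# boxes and decodes the target index sequence.
reordered = [src[i]["text"] for i in tgt]
print(" ".join(reordered))  # -> "Reading Order matters."
~~~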
## Installation
~~~
# Create and activate a Python 3.7 environment
conda create -n LayoutReader python=3.7
conda activate LayoutReader
# Install PyTorch 1.7.1
conda install pytorch==1.7.1 -c pytorch
# NLTK with the punkt tokenizer data
pip install nltk
python -c "import nltk; nltk.download('punkt')"
# NVIDIA Apex provides the fused CUDA/C++ extensions used for fp16 training
git clone https://github.com/NVIDIA/apex.git && cd apex && python setup.py install --cuda_ext --cpp_ext
pip install transformers==2.10.0
# Install LayoutReader in editable mode
git clone https://github.com/microsoft/unilm.git
cd unilm/layoutreader
pip install -e .
~~~
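After installation, a quick sanity check along the following lines (a sketch, not part of the repository) confirms that the pinned versions are in place and that Apex is importable for fp16 training:
~~~
# Quick environment sanity check (not part of the repository; adjust as needed).
import torch
import transformers

print("torch:", torch.__version__)                 # expected 1.7.1
print("transformers:", transformers.__version__)   # expected 2.10.0
print("CUDA available:", torch.cuda.is_available())

try:
    from apex import amp  # used for --fp16 / --fp16_opt_level O2
    print("apex.amp imported OK")
except ImportError:
    print("apex not installed correctly; fp16 training will not work")
~~~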
## Run
1. Download the [pre-processed data](https://layoutlm.blob.core.windows.net/readingbank/dataset/ReadingBank.zip?sv=2022-11-02&ss=b&srt=o&sp=r&se=2033-06-08T16:48:15Z&st=2023-06-08T08:48:15Z&spr=https&sig=a9VXrihTzbWyVfaIDlIT1Z0FoR1073VB0RLQUMuudD4%3D). For more details of the dataset, please refer to [ReadingBank](https://aka.ms/readingbank).
2. (Optional) Download our [pre-trained model](https://layoutlm.blob.core.windows.net/readingbank/model/layoutreader-base-readingbank.zip?sv=2022-11-02&ss=b&srt=o&sp=r&se=2033-06-08T16:48:15Z&st=2023-06-08T08:48:15Z&spr=https&sig=a9VXrihTzbWyVfaIDlIT1Z0FoR1073VB0RLQUMuudD4%3D) and evaluate it following step 4. A download sketch is shown below.
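For reference, fetching and extracting either archive could be scripted roughly as follows (a sketch, not part of the repository; substitute the full URL from step 1 or 2 and your target directory):
~~~
# Sketch only: download and extract one of the archives linked above.
# Replace the placeholder with the full URL (including the SAS token).
import urllib.request, zipfile

url = "<archive URL from step 1 or 2>"
archive = "ReadingBank.zip"

urllib.request.urlretrieve(url, archive)
with zipfile.ZipFile(archive) as zf:
    zf.extractall("/path/to/ReadingBank")
~~~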
3. Training
~~~
export CUDA_VISIBLE_DEVICES=0,1,2,3
export OMP_NUM_THREADS=4
export MKL_NUM_THREADS=4
python -m torch.distributed.launch --nproc_per_node=4 run_seq2seq.py \
--model_type layoutlm \
--model_name_or_path layoutlm-base-uncased \
--train_folder /path/to/ReadingBank/train \
--output_dir /path/to/output/LayoutReader/layoutlm \
--do_lower_case \
--fp16 \
--fp16_opt_level O2 \
--max_source_seq_length 513 \
--max_target_seq_length 511 \
--per_gpu_train_batch_size 2 \
--gradient_accumulation_steps 1 \
--learning_rate 7e-5 \
--num_warmup_steps 500 \
--num_training_steps 75000 \
--cache_dir /path/to/output/LayoutReader/cache \
--label_smoothing 0.1 \
--save_steps 5000 \
--cached_train_features_file /path/to/ReadingBank/features_train.pt
~~~
4. Decoding
~~~
export CUDA_VISIBLE_DEVICES=0
export OMP_NUM_THREADS=4
export MKL_NUM_THREADS=4
python decode_seq2seq.py --fp16 \
--model_type layoutlm \
--tokenizer_name bert-base-uncased \
--input_folder /path/to/ReadingBank/test \
--cached_feature_file /path/to/ReadingBank/features_test.pt \
--output_file /path/to/output/LayoutReader/layoutlm/output.txt \
--split test \
--do_lower_case \
--model_path /path/to/output/LayoutReader/layoutlm/ckpt-75000 \
--cache_dir /path/to/output/LayoutReader/cache \
--max_seq_length 1024 \
--max_tgt_length 511 \
--batch_size 32 \
--beam_size 1 \
--length_penalty 0 \
--forbid_duplicate_ngrams \
--mode s2s \
--forbid_ignore_word "."
~~~
## Results
Our released [pre-trained model](https://layoutlm.blob.core.windows.net/readingbank/dataset/layoutreader-base-readingbank.zip) achieves a 98.2% average page-level BLEU score. Detailed results are reported as follows:
* Evaluation results of LayoutReader on the reading order detection task, where the source side of the training/testing data is in left-to-right and top-to-bottom order.
| Method | Encoder | Avg. Page-level BLEU ↑ | ARD ↓ |
| -------------------------- | ---------------------- | ---------------------- | ----- |
| Heuristic Method | - | 0.6972 | 8.46 |
| LayoutReader (text only) | BERT | 0.8510 | 12.08 |
| LayoutReader (text only) | UniLM | 0.8765 | 10.65 |
| LayoutReader (layout only) | LayoutLM (layout only) | 0.9732 | 2.31 |
| LayoutReader | LayoutLM | 0.9819 | 1.75 |
* Input order study with left-to-right and top-to-bottom inputs in evaluation, where r is the proportion of shuffled samples in training.
| Method | BLEU ↑ (r=100%) | BLEU ↑ (r=50%) | BLEU ↑ (r=0%) | ARD ↓ (r=100%) | ARD ↓ (r=50%) | ARD ↓ (r=0%) |
|---------------------------------|--------|--------|--------|--------|-------|-------|
| LayoutReader (text only, BERT)  | 0.3355 | 0.8397 | 0.8510 | 77.97  | 15.62 | 12.08 |
| LayoutReader (text only, UniLM) | 0.3440 | 0.8588 | 0.8765 | 78.67  | 13.65 | 10.65 |
| LayoutReader (layout only)      | 0.9701 | 0.9729 | 0.9732 | 2.85   | 2.61  | 2.31  |
| LayoutReader                    | 0.9765 | 0.9788 | 0.9819 | 2.50   | 2.24  | 1.75  |
* Input order study with token-shuffled inputs in evaluation, where r is the proportion of shuffled samples in training.
| Method | BLEU ↑ (r=100%) | BLEU ↑ (r=50%) | BLEU ↑ (r=0%) | ARD ↓ (r=100%) | ARD ↓ (r=50%) | ARD ↓ (r=0%) |
|---------------------------------|--------|--------|--------|--------|-------|--------|
| LayoutReader (text only, BERT)  | 0.3085 | 0.2730 | 0.1711 | 78.69  | 85.44 | 67.96  |
| LayoutReader (text only, UniLM) | 0.3119 | 0.2855 | 0.1728 | 80.00  | 85.60 | 71.13  |
| LayoutReader (layout only)      | 0.9718 | 0.9714 | 0.1331 | 2.72   | 2.82  | 105.40 |
| LayoutReader                    | 0.9772 | 0.9770 | 0.1783 | 2.48   | 2.46  | 72.94  |
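For intuition about the two metrics: page-level BLEU treats the predicted index sequence for a page as a candidate against the ground-truth order, and ARD (Average Relative Distance) penalizes how far each element lands from its ground-truth position. The snippet below is an illustrative computation only; it is not the repository's evaluation code, and the ARD reading (mean absolute positional displacement) is an assumption rather than the paper's exact definition.
~~~
# Illustrative scoring of a predicted reading order against the ground truth.
# NOT the repository's evaluation code; the exact BLEU setup and the precise
# ARD definition in the paper may differ from the assumptions below.
from nltk.translate.bleu_score import sentence_bleu

gold = [0, 1, 2, 3, 4, 5]   # ground-truth reading order (token indices)
pred = [0, 2, 1, 3, 4, 5]   # model prediction for the same page

# Page-level BLEU: treat the index sequences as sentences of string tokens.
bleu = sentence_bleu([list(map(str, gold))], list(map(str, pred)))

# ARD, read here as the mean absolute displacement of each element between its
# position in the prediction and in the ground truth (assumption).
pos_in_pred = {tok: i for i, tok in enumerate(pred)}
ard = sum(abs(i - pos_in_pred[tok]) for i, tok in enumerate(gold)) / len(gold)

print(f"page-level BLEU: {bleu:.4f}, ARD: {ard:.2f}")
~~~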
## Citation
If you find LayoutReader helpful, please cite us:
```
@misc{wang2021layoutreader,
title={LayoutReader: Pre-training of Text and Layout for Reading Order Detection},
author={Zilong Wang and Yiheng Xu and Lei Cui and Jingbo Shang and Furu Wei},
year={2021},
eprint={2108.11591},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
## License
This project is licensed under the license found in the LICENSE file in the root directory of this source tree.
Portions of the source code are based on the [transformers](https://github.com/huggingface/transformers) and [s2s-ft](../s2s-ft) projects.
[Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct)
## Contact
For help or issues using LayoutReader, please submit a GitHub issue.
For other communications related to LayoutLM, please contact Lei Cui (`[email protected]`) and Furu Wei (`[email protected]`).