# LayoutReader

LayoutReader captures text and layout information for reading order prediction with a seq2seq model. In our experiments, it significantly improves the text-line ordering produced by both open-source and commercial OCR engines.


Our paper "[LayoutReader: Pre-training of Text and Layout for Reading Order Detection](https://arxiv.org/pdf/2108.11591.pdf)" has been accepted by EMNLP 2021.

**ReadingBank** is a benchmark dataset for reading order detection built with weak supervision from Word documents. It contains 500K document images covering a wide range of document types, together with the corresponding reading order information. For more details, please refer to [ReadingBank](https://aka.ms/readingbank).
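
As a point of reference for the results below, the simplest baseline orders words purely by geometry: top-to-bottom, then left-to-right. The following is a minimal illustrative sketch of such a left-to-right, top-to-bottom heuristic (not the repository's baseline code); the `Word` structure and the line-grouping tolerance are assumptions made for the example.

```python
# Illustrative left-to-right, top-to-bottom heuristic (not the official baseline code).
# The `Word` structure and the line-grouping tolerance are assumptions for this sketch.
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    text: str
    x0: float  # left
    y0: float  # top
    x1: float  # right
    y1: float  # bottom


def heuristic_reading_order(words: List[Word], line_tol: float = 5.0) -> List[Word]:
    """Group words into rough lines by vertical position, then read each line left to right."""
    lines: List[List[Word]] = []
    for word in sorted(words, key=lambda w: (w.y0, w.x0)):
        # Start a new line when the word's top is far from the current line's top.
        if lines and abs(word.y0 - lines[-1][0].y0) <= line_tol:
            lines[-1].append(word)
        else:
            lines.append([word])
    return [word for line in lines for word in sorted(line, key=lambda w: w.x0)]
```

LayoutReader instead learns the reading order from text and layout jointly, which is what closes the gap over the heuristic reported in the Results section.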

## Installation
~~~
# Create and activate the conda environment
conda create -n LayoutReader python=3.7
conda activate LayoutReader
conda install pytorch==1.7.1 -c pytorch
pip install nltk
python -c "import nltk; nltk.download('punkt')"
# NVIDIA Apex is required for the fp16 (O2) training options below
git clone https://github.com/NVIDIA/apex.git && cd apex && python setup.py install --cuda_ext --cpp_ext && cd ..
pip install transformers==2.10.0
# Install LayoutReader in editable mode
git clone https://github.com/microsoft/unilm.git
cd unilm/layoutreader
pip install -e .
~~~
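
As a quick, unofficial sanity check of the environment, you can confirm the pinned package versions and that Apex and a GPU are visible:

```python
# Unofficial sanity check for the environment created above.
import torch
import transformers
import nltk

print("torch:", torch.__version__)                 # expected: 1.7.1
print("transformers:", transformers.__version__)   # expected: 2.10.0
print("nltk:", nltk.__version__)
print("CUDA available:", torch.cuda.is_available())

try:
    from apex import amp  # noqa: F401 -- needed for the --fp16 flags below
    print("apex: OK")
except ImportError:
    print("apex: missing (required for fp16 training)")
```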

## Run
1. Download the [pre-processed data](https://layoutlm.blob.core.windows.net/readingbank/dataset/ReadingBank.zip?sv=2022-11-02&ss=b&srt=o&sp=r&se=2033-06-08T16:48:15Z&st=2023-06-08T08:48:15Z&spr=https&sig=a9VXrihTzbWyVfaIDlIT1Z0FoR1073VB0RLQUMuudD4%3D). For more details on the dataset, please refer to [ReadingBank](https://aka.ms/readingbank). A hedged sketch for peeking at the cached feature files used in the commands below is shown after this list.
2. (Optional) Download our [pre-trained model](https://layoutlm.blob.core.windows.net/readingbank/model/layoutreader-base-readingbank.zip?sv=2022-11-02&ss=b&srt=o&sp=r&se=2033-06-08T16:48:15Z&st=2023-06-08T08:48:15Z&spr=https&sig=a9VXrihTzbWyVfaIDlIT1Z0FoR1073VB0RLQUMuudD4%3D) and evaluate it following step 4.
3. Training
    ~~~
    export CUDA_VISIBLE_DEVICES=0,1,2,3
    export OMP_NUM_THREADS=4
    export MKL_NUM_THREADS=4
    
    python -m torch.distributed.launch --nproc_per_node=4 run_seq2seq.py \
        --model_type layoutlm \
        --model_name_or_path layoutlm-base-uncased \
        --train_folder /path/to/ReadingBank/train \
        --output_dir /path/to/output/LayoutReader/layoutlm \
        --do_lower_case \
        --fp16 \
        --fp16_opt_level O2 \
        --max_source_seq_length 513 \
        --max_target_seq_length 511 \
        --per_gpu_train_batch_size 2 \
        --gradient_accumulation_steps 1 \
        --learning_rate 7e-5 \
        --num_warmup_steps 500 \
        --num_training_steps 75000 \
        --cache_dir /path/to/output/LayoutReader/cache \
        --label_smoothing 0.1 \
        --save_steps 5000 \
        --cached_train_features_file /path/to/ReadingBank/features_train.pt
    ~~~
4. Decoding
    ~~~
    export CUDA_VISIBLE_DEVICES=0
    export OMP_NUM_THREADS=4
    export MKL_NUM_THREADS=4
    
    python decode_seq2seq.py --fp16 \
        --model_type layoutlm \
        --tokenizer_name bert-base-uncased \
        --input_folder /path/to/ReadingBank/test \
        --cached_feature_file /path/to/ReadingBank/features_test.pt \
        --output_file /path/to/output/LayoutReader/layoutlm/output.txt \
        --split test \
        --do_lower_case \
        --model_path /path/to/output/LayoutReader/layoutlm/ckpt-75000 \
        --cache_dir /path/to/output/LayoutReader/cache \
        --max_seq_length 1024 \
        --max_tgt_length 511 \
        --batch_size 32 \
        --beam_size 1 \
        --length_penalty 0 \
        --forbid_duplicate_ngrams \
        --mode s2s \
        --forbid_ignore_word "."
    ~~~
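
The commands above reference cached feature files (`features_train.pt` / `features_test.pt`) from the pre-processed data. As mentioned in step 1, the sketch below is a hedged way to peek at one of them; it assumes the file is an ordinary torch-serialized object, so adjust the printout to whatever structure you actually find.

```python
# Hedged sketch: peek at a cached feature file from the pre-processed ReadingBank data.
# Assumes an ordinary torch-serialized object; the actual structure may differ.
import torch

features = torch.load("/path/to/ReadingBank/features_test.pt", map_location="cpu")

print(type(features))
try:
    print("number of examples:", len(features))
    print("first example:", features[0])
except (TypeError, KeyError):
    # Fall back to printing the whole object if it is not a plain sequence.
    print(features)
```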

## Results
Our released [pre-trained model](https://layoutlm.blob.core.windows.net/readingbank/dataset/layoutreader-base-readingbank.zip) achieves a 98.2% Average Page-level BLEU score. Detailed results are reported as follows:

* Evaluation results of LayoutReader on the reading order detection task, where the source side of the training/testing data is in left-to-right, top-to-bottom order

  | Method                     | Encoder                | Avg. Page-level BLEU ↑ | ARD ↓ |
  | -------------------------- | ---------------------- | ---------------------- | ----- |
  | Heuristic Method           | -                      | 0.6972                 | 8.46  |
  | LayoutReader (text only)   | BERT                   | 0.8510                 | 12.08 |
  | LayoutReader (text only)   | UniLM                  | 0.8765                 | 10.65 |
  | LayoutReader (layout only) | LayoutLM (layout only) | 0.9732                 | 2.31  |
  | LayoutReader               | LayoutLM               | 0.9819                 | 1.75  |

* Input order study with left-to-right and top-to-bottom inputs in evaluation, where r is the proportion of
shuffled samples in training.

  | Method                          | Avg. Page-level BLEU ↑ (r=100%) | Avg. Page-level BLEU ↑ (r=50%) | Avg. Page-level BLEU ↑ (r=0%) | ARD ↓ (r=100%) | ARD ↓ (r=50%) | ARD ↓ (r=0%) |
  |---------------------------------|---------------------------------|--------------------------------|-------------------------------|----------------|---------------|--------------|
  | LayoutReader (text only, BERT)  | 0.3355                          | 0.8397                         | 0.8510                        | 77.97          | 15.62         | 12.08        |
  | LayoutReader (text only, UniLM) | 0.3440                          | 0.8588                         | 0.8765                        | 78.67          | 13.65         | 10.65        |
  | LayoutReader (layout only)      | 0.9701                          | 0.9729                         | 0.9732                        | 2.85           | 2.61          | 2.31         |
  | LayoutReader                    | 0.9765                          | 0.9788                         | 0.9819                        | 2.50           | 2.24          | 1.75         |

* Input order study with token-shuffled inputs in evaluation, where r is the proportion of shuffled samples in training.

  | Method                          | Avg. Page-level BLEU ↑ (r=100%) | Avg. Page-level BLEU ↑ (r=50%) | Avg. Page-level BLEU ↑ (r=0%) | ARD ↓ (r=100%) | ARD ↓ (r=50%) | ARD ↓ (r=0%) |
  |---------------------------------|---------------------------------|--------------------------------|-------------------------------|----------------|---------------|--------------|
  | LayoutReader (text only, BERT)  | 0.3085                          | 0.2730                         | 0.1711                        | 78.69          | 85.44         | 67.96        |
  | LayoutReader (text only, UniLM) | 0.3119                          | 0.2855                         | 0.1728                        | 80.00          | 85.60         | 71.13        |
  | LayoutReader (layout only)      | 0.9718                          | 0.9714                         | 0.1331                        | 2.72           | 2.82          | 105.40       |
  | LayoutReader                    | 0.9772                          | 0.9770                         | 0.1783                        | 2.48           | 2.46          | 72.94        |
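
For context on the metric in the tables above, page-level BLEU treats the reordered token sequence of each page as a hypothesis against the ground-truth reading order and averages the score over pages. The sketch below is a hedged illustration using NLTK (installed earlier); the official evaluation script may differ in tokenization, n-gram weights, and smoothing.

```python
# Hedged illustration of average page-level BLEU; the official evaluation may differ
# in tokenization, n-gram weights, and smoothing details.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction


def avg_page_level_bleu(predicted_pages, reference_pages):
    """Each page is a list of tokens; compare the predicted order against the gold order."""
    smooth = SmoothingFunction().method1
    scores = [
        sentence_bleu([ref], hyp, smoothing_function=smooth)
        for hyp, ref in zip(predicted_pages, reference_pages)
    ]
    return sum(scores) / len(scores)


# Toy example: one page whose predicted order swaps two words.
gold = [["reading", "order", "matters", "for", "ocr"]]
pred = [["reading", "order", "for", "matters", "ocr"]]
print(avg_page_level_bleu(pred, gold))
```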

## Citation

If you find LayoutReader helpful, please cite us:
```
@misc{wang2021layoutreader,
      title={LayoutReader: Pre-training of Text and Layout for Reading Order Detection}, 
      author={Zilong Wang and Yiheng Xu and Lei Cui and Jingbo Shang and Furu Wei},
      year={2021},
      eprint={2108.11591},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```


## License

This project is licensed under the license found in the LICENSE file in the root directory of this source tree.
Portions of the source code are based on the [transformers](https://github.com/huggingface/transformers) and [s2s-ft](../s2s-ft) projects.
This project follows the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct).

## Contact

For help or issues using LayoutReader, please submit a GitHub issue.

For other communications related to LayoutLM, please contact Lei Cui (`[email protected]`) and Furu Wei (`[email protected]`).