Install dependencies
pip install -r requirement.txt
Download the data set
export WORKDIR_ROOT=<a directory which will hold all working files>
The downloaded data will be stored in $WORKDIR_ROOT/ML50.
Preprocess the data
Install SentencePiece (SPM).
export WORKDIR_ROOT=<a directory which will hold all working files>
export SPM_PATH=<a path pointing to sentencepiece spm_encode.py>
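Before running the preprocessing scripts, it can help to confirm that both variables are set and point at something that exists. The following is a minimal sketch, not part of the original scripts: the helper name `check_env` is hypothetical, and it assumes that `WORKDIR_ROOT` should be an existing directory and `SPM_PATH` an existing file.

```python
import os
from pathlib import Path


def check_env(env=None):
    """Return a list of problems with WORKDIR_ROOT and SPM_PATH (empty if none).

    Hypothetical helper: the original scripts do not ship this check.
    """
    env = os.environ if env is None else env
    problems = []

    # Assumption: WORKDIR_ROOT must already exist as a directory.
    workdir = env.get("WORKDIR_ROOT")
    if not workdir:
        problems.append("WORKDIR_ROOT is not set")
    elif not Path(workdir).is_dir():
        problems.append(f"WORKDIR_ROOT is not a directory: {workdir}")

    # Assumption: SPM_PATH must point at an existing file.
    spm = env.get("SPM_PATH")
    if not spm:
        problems.append("SPM_PATH is not set")
    elif not Path(spm).is_file():
        problems.append(f"SPM_PATH is not a file: {spm}")

    return problems


if __name__ == "__main__":
    for problem in check_env():
        print(problem)
```

Running the file directly prints one line per problem and nothing if both variables look usable.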
The working directory is organized as follows:
- $WORKDIR_ROOT/ML50/raw: extracted raw data
- $WORKDIR_ROOT/ML50/dedup: deduplicated data
- $WORKDIR_ROOT/ML50/clean: deduplicated data with the validation and test sentences removed
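How the three directories relate can be illustrated with a small in-memory sketch. It rests on assumptions about what "dedup" and "clean" mean here: exact-match removal of duplicate sentence pairs, followed by dropping any pair whose source or target sentence appears in the validation or test data. The actual scripts may use different matching rules, and the function names below are hypothetical.

```python
def dedup_pairs(pairs):
    """Drop exact duplicate (source, target) pairs, keeping first occurrences."""
    seen = set()
    unique = []
    for pair in pairs:
        if pair not in seen:
            seen.add(pair)
            unique.append(pair)
    return unique


def remove_held_out(pairs, held_out_sentences):
    """Drop pairs whose source or target sentence occurs in valid/test data."""
    held_out = set(held_out_sentences)
    return [
        (src, tgt)
        for src, tgt in pairs
        if src not in held_out and tgt not in held_out
    ]


# Toy data standing in for the contents of ML50/raw.
raw = [
    ("hello", "bonjour"),
    ("hello", "bonjour"),  # exact duplicate of the first pair
    ("thank you", "merci"),
    ("good night", "bonne nuit"),
]
valid_and_test = ["thank you", "merci"]

dedup = dedup_pairs(raw)                        # analogous to ML50/dedup
clean = remove_held_out(dedup, valid_and_test)  # analogous to ML50/clean
print(len(raw), len(dedup), len(clean))  # prints: 4 3 2
```

Removing held-out sentences after deduplication keeps the training data from overlapping with the evaluation sets, which would otherwise inflate validation and test scores.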