# LM-Combiner

All the code and model weights are released [link](https://github.com/wyxstriker/LM-Combiner). Thank you for your patience!

# Model Weight

- cbart_large.zip
    - Weight of Bart baseline model.
- lm_combiner.zip
    - Weight of LM-Combiner for Bart baseline on FCGEC dataset.

# Requirements

The model is implemented with the Hugging Face framework, and the required environment is as follows:

- Python
- torch
- transformers
- datasets
- tqdm

For the evaluation, we refer to the relevant environment configurations of [ChERRANT](https://github.com/HillZhang1999/MuCGEC/tree/main/scorers/ChERRANT).

# Training Stage

## Preprocessing

### Baseline Model

- First, we train a baseline model (Chinese-Bart-large) for LM-Combiner on the FCGEC dataset using the Seq2Seq format.

```bash
sh ./script/run_bart_baseline.sh
```

### Candidate Datasets

1. Candidate Sentence Generation
   - We use the baseline model to generate candidate sentences for the training and test sets.
   - On tasks where the model fits the training data more closely (spelling correction, etc.), we recommend using the K-fold cross-inference from the paper to generate candidate sentences separately.

   ```bash
   python ./src/predict_bl_tsv.py
   ```

2. Golden Labels Merging
   - We use the ChERRANT tool to fully decouple the error correction task and the rewriting task by merging the golden labels.

   ```bash
   python ./scorer_wapper/golden_label_merging.py
   ```

## LM-combiner (gpt2)

- Subsequently, we train LM-Combiner on the constructed candidate dataset.
- In particular, we supplement the gpt2 vocab (mainly **double quotes**) to better fit the FCGEC dataset; see `./pt_model/gpt2-base/vocab.txt` for details.

```bash
sh ./script/run_lm_combiner.py
```

# Evaluation

- We use the official ChERRANT script to evaluate the model on FCGEC-dev.
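
For readers unfamiliar with the F0.5 column in the results table below, here is a minimal Python sketch of the metric, assuming the standard F-beta definition with beta = 0.5 (precision weighted more heavily than recall). The helper name `f_beta` is illustrative and is not part of this repository or of ChERRANT.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Standard F-beta score; beta < 1 weights precision more than recall."""
    beta_sq = beta ** 2
    denominator = beta_sq * precision + recall
    if denominator == 0:
        return 0.0  # precision and recall are both zero
    return (1 + beta_sq) * precision * recall / denominator


# Precision and recall of the +lm_combiner row, in percent.
print(round(f_beta(52.15, 37.41), 2))  # → 48.34
```

Because precision dominates the score, a method that raises precision sharply at a small cost in recall can still improve F0.5. The evaluation script itself is invoked as follows:
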
```shell
sh ./script/compute_score.sh
```

| method | Prec | Rec | F0.5 |
|-|-|-|-|
| bart_baseline | 28.88 | **38.95** | 40.46 |
| +lm_combiner | **52.15** | 37.41 | **48.34** |

# Citation

If you find this work useful for your research, please cite our paper:

```
@inproceedings{wang-etal-2024-lm-combiner,
    title = "{LM}-Combiner: A Contextual Rewriting Model for {C}hinese Grammatical Error Correction",
    author = "Wang, Yixuan and
      Wang, Baoxin and
      Liu, Yijun and
      Wu, Dayong and
      Che, Wanxiang",
    editor = "Calzolari, Nicoletta and
      Kan, Min-Yen and
      Hoste, Veronique and
      Lenci, Alessandro and
      Sakti, Sakriani and
      Xue, Nianwen",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.lrec-main.934",
    pages = "10675--10685",
}
```