DecoderImmortal committed (verified)
Commit c0c6002 · Parent: a30b942

Update README.md

Files changed (1): README.md (+76, -3)
# LM-Combiner
All code and models are released at [this link](https://github.com/wyxstriker/LM-Combiner). Thank you for your patience!

# Requirements

The model is implemented with the Hugging Face framework; the required environment is as follows:
- Python
- torch
- transformers
- datasets
- tqdm

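No versions are pinned above; as a quick sanity check (a sketch, not part of the repo), the listed packages can be probed like this:

```python
# Quick environment probe (not part of the repo): verifies the packages
# listed above are importable and prints their versions.
import torch, transformers, datasets, tqdm

for pkg in (torch, transformers, datasets, tqdm):
    print(pkg.__name__, getattr(pkg, "__version__", "unknown"))
```
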
For evaluation, we follow the environment configuration of [ChERRANT](https://github.com/HillZhang1999/MuCGEC/tree/main/scorers/ChERRANT).

# Training Stage
## Preprocessing
### Baseline Model
- First, we train a baseline model (Chinese-Bart-large) for LM-Combiner on the FCGEC dataset in the Seq2Seq format; a training sketch follows the command below.
```bash
sh ./script/run_bart_baseline.sh
```
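The script wraps a standard Hugging Face Seq2Seq fine-tuning run. A minimal sketch of that step, assuming `fnlp/bart-large-chinese` as the Chinese-Bart-large checkpoint and a tab-separated `source\ttarget` training file; the real data paths and hyperparameters live in `run_bart_baseline.sh`:

```python
# Minimal Seq2Seq fine-tuning sketch; the checkpoint name, file layout and
# hyperparameters are assumptions, not the repo's actual configuration.
from datasets import load_dataset
from transformers import (BertTokenizer, BartForConditionalGeneration,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

name = "fnlp/bart-large-chinese"  # assumed Chinese-Bart-large checkpoint
tokenizer = BertTokenizer.from_pretrained(name)
model = BartForConditionalGeneration.from_pretrained(name)

# Assumed layout: one "erroneous sentence \t corrected sentence" pair per line.
raw = load_dataset("csv", data_files={"train": "fcgec_train.tsv"},
                   delimiter="\t", column_names=["src", "tgt"])

def preprocess(batch):
    enc = tokenizer(batch["src"], max_length=128, truncation=True)
    enc["labels"] = tokenizer(text_target=batch["tgt"], max_length=128,
                              truncation=True)["input_ids"]
    return enc

train_set = raw["train"].map(preprocess, batched=True,
                             remove_columns=["src", "tgt"])

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="./bart_baseline",
                                  per_device_train_batch_size=32,
                                  learning_rate=3e-5, num_train_epochs=10),
    train_dataset=train_set,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```
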
### Candidate Datasets
1. Candidate Sentence Generation
- We use the baseline model to generate candidate sentences for the training and test sets.
- On tasks where the model fits the training data more closely (spelling correction, etc.), we recommend the K-fold cross-inference from the paper to generate the training-set candidates separately; see the sketch after this step.
```bash
python ./src/predict_bl_tsv.py
```
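As a rough illustration of what this step does with the trained baseline (the actual logic, including K-fold cross-inference for the training set, is in `./src/predict_bl_tsv.py`; paths and decoding settings here are assumptions):

```python
# Illustrative candidate generation with the trained baseline checkpoint.
# Input/output paths and the beam size are assumptions for this sketch.
import torch
from transformers import BertTokenizer, BartForConditionalGeneration

tokenizer = BertTokenizer.from_pretrained("./bart_baseline")
model = BartForConditionalGeneration.from_pretrained("./bart_baseline").eval()

with open("fcgec_test.src", encoding="utf-8") as f, \
     open("candidates.tsv", "w", encoding="utf-8") as out:
    for line in f:
        src = line.strip()
        ids = tokenizer(src, return_tensors="pt").input_ids
        with torch.no_grad():
            cand_ids = model.generate(ids, num_beams=5, max_length=128)
        cand = tokenizer.decode(cand_ids[0], skip_special_tokens=True)
        # BertTokenizer decodes Chinese with spaces between characters.
        out.write(f"{src}\t{cand.replace(' ', '')}\n")
```
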
2. Golden Labels Merging
- We use the ChERRANT tool to merge the correct labels into each candidate, fully decoupling the error correction task from the rewriting task; the idea is sketched below.
```bash
python ./scorer_wapper/golden_label_merging.py
```
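The merging step keeps, for each candidate, only those edits that the gold annotation confirms, so the combiner's training target contains exactly the correct part of the candidate. A toy illustration, with `difflib` standing in for ChERRANT's character-level edit extraction (alignments may differ in corner cases):

```python
# Toy illustration of golden-label merging; difflib is a stand-in for
# ChERRANT's edit extraction. Only candidate edits that also appear among
# the gold edits are applied to the source sentence.
import difflib

def edits(src: str, tgt: str):
    """All (start, end, replacement) edits turning src into tgt."""
    sm = difflib.SequenceMatcher(a=src, b=tgt)
    return {(i1, i2, tgt[j1:j2])
            for tag, i1, i2, j1, j2 in sm.get_opcodes() if tag != "equal"}

def merge_gold(src: str, cand: str, gold: str) -> str:
    keep = edits(src, cand) & edits(src, gold)  # correct candidate edits only
    out, pos = [], 0
    for i1, i2, rep in sorted(keep):
        out.append(src[pos:i1])
        out.append(rep)
        pos = i2
    out.append(src[pos:])
    return "".join(out)

# The candidate's wrong punctuation edit is dropped; the correct word edit
# it shares with the gold sentence is kept.
print(merge_gold("他明天去了北京。", "他昨天去了北京!", "他昨天去了北京。"))
# -> 他昨天去了北京。
```
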
## LM-Combiner (GPT-2)
- Subsequently, we train LM-Combiner on the constructed candidate dataset; a sketch of the training format follows the command below.
- In particular, we supplement the GPT-2 vocab (mainly **double quotes**) to better fit the FCGEC dataset; see ```./pt_model/gpt2-base/vocab.txt``` for details.
```bash
sh ./script/run_lm_combiner.sh
```
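LM-Combiner itself is a causal GPT-2 language model. A hedged sketch of how a training example might be built, assuming a `source[SEP]candidate[SEP]target` layout (the actual format is defined by the script above, and the checkpoint is assumed to live next to the supplemented vocab):

```python
# Hedged sketch of the combiner's training-example construction; the
# source[SEP]candidate[SEP]target layout is an assumption about the format.
import torch
from transformers import BertTokenizer, GPT2LMHeadModel

tokenizer = BertTokenizer.from_pretrained("./pt_model/gpt2-base")  # supplemented vocab
model = GPT2LMHeadModel.from_pretrained("./pt_model/gpt2-base")

def build_example(src, cand, tgt):
    prompt_ids = tokenizer(src + "[SEP]" + cand + "[SEP]",
                           add_special_tokens=False).input_ids
    target_ids = tokenizer(tgt, add_special_tokens=False).input_ids
    input_ids = torch.tensor([prompt_ids + target_ids])
    # Compute the LM loss on the rewritten target only, not on the prompt.
    labels = torch.tensor([[-100] * len(prompt_ids) + target_ids])
    return input_ids, labels

input_ids, labels = build_example("他明天去了北京。", "他昨天去了北京!", "他昨天去了北京。")
loss = model(input_ids=input_ids, labels=labels).loss
```
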

# Evaluation
- We use the official ChERRANT script to evaluate the model on FCGEC-dev.
```shell
sh ./script/compute_score.sh
```
| Method | Prec | Rec | F0.5 |
| - | - | - | - |
| bart_baseline | 28.88 | **38.95** | 40.46 |
| +lm_combiner | **52.15** | 37.41 | **48.34** |

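ChERRANT reports F0.5, which weights precision more heavily than recall; for instance, the +lm_combiner row can be reproduced from its P and R:

```python
# F-beta from precision and recall; beta = 0.5 favors precision.
def f_beta(p: float, r: float, beta: float = 0.5) -> float:
    return (1 + beta**2) * p * r / (beta**2 * p + r)

print(round(f_beta(52.15, 37.41), 2))  # 48.34, matching the table
```
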
# Citation

If you find this work useful for your research, please cite our paper:

```
@inproceedings{wang-etal-2024-lm-combiner,
    title = "{LM}-Combiner: A Contextual Rewriting Model for {C}hinese Grammatical Error Correction",
    author = "Wang, Yixuan and
      Wang, Baoxin and
      Liu, Yijun and
      Wu, Dayong and
      Che, Wanxiang",
    editor = "Calzolari, Nicoletta and
      Kan, Min-Yen and
      Hoste, Veronique and
      Lenci, Alessandro and
      Sakti, Sakriani and
      Xue, Nianwen",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.lrec-main.934",
    pages = "10675--10685",
}
```