---
license: apache-2.0
language:
- fr
library_name: transformers
tags:
- t5
- commonvoice
- pytorch
- pictograms
- translation
metrics:
- bleu
widget:
- text: "je mange une pomme"
  example_title: "A simple sentence"
- text: "je ne pense pas à toi"
  example_title: "Sentence with a negation"
- text: "il y a 2 jours, les gendarmes ont vérifié ma licence"
  example_title: "Sentence with a polylexical term"
---

# t2p-t5-large-commonvoice

*t2p-t5-large-commonvoice* is a text-to-pictogram translation model built by fine-tuning the [t5-large](https://huggingface.co/google-t5/t5-large) model on a dataset of pairs of transcriptions and pictogram token sequences (each token is linked to a pictogram image from [ARASAAC](https://arasaac.org/)).
The model is intended for **inference** only.

## Training details

### Datasets

The [Propicto-commonvoice dataset](https://www.ortolang.fr/market/corpora/propicto) is used, which was created from the CommonVoice v.15.0 corpus.
This dataset was built with the method presented in the research paper ["A Multimodal French Corpus of Aligned Speech, Text, and Pictogram Sequences for Speech-to-Pictogram Machine Translation"](https://aclanthology.org/2024.lrec-main.76/) at LREC-COLING 2024. The dataset was split into training, validation, and test sets.

| **Split** | **Number of utterances** |
|:---------:|:------------------------:|
| train     | 527,390                  |
| valid     | 16,124                   |
| test      | 16,120                   |

### Parameters

A full list of the parameters is available in the config.json file. These are the arguments used in the training pipeline:

```python
training_args = Seq2SeqTrainingArguments(
    output_dir="checkpoints_commonvoice/",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=40,
    predict_with_generate=True,
    fp16=True,
    load_best_model_at_end=True,
)
```
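
For context, here is a minimal sketch of how these arguments could be wired into a `Seq2SeqTrainer`; the tokenized dataset splits (`tokenized_train`, `tokenized_valid`) are assumptions for illustration, not part of this card:

```python
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
)

tokenizer = AutoTokenizer.from_pretrained("google-t5/t5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("google-t5/t5-large")

# Pads inputs and labels dynamically per batch for encoder-decoder training.
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,             # the Seq2SeqTrainingArguments above
    train_dataset=tokenized_train,  # hypothetical tokenized Propicto-commonvoice splits
    eval_dataset=tokenized_valid,
    tokenizer=tokenizer,
    data_collator=data_collator,
)
trainer.train()
```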

### Evaluation

The model was evaluated with [sacreBLEU](https://huggingface.co/spaces/evaluate-metric/sacrebleu/blob/d94719691d29f7adf7151c8b1471de579a78a280/sacrebleu.py), where we compared the reference pictogram translation with the model hypothesis.
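
As an illustration, a score of this kind can be computed with the `evaluate` library; the hypothesis and reference sequences below are placeholders, not actual model outputs:

```python
import evaluate

# Placeholder pictogram token sequences, for illustration only.
predictions = ["je manger pomme"]
references = [["je manger pomme"]]  # sacreBLEU allows several references per sample

sacrebleu = evaluate.load("sacrebleu")
result = sacrebleu.compute(predictions=predictions, references=references)
print(round(result["score"], 1))
```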

### Results

Comparison to other translation models:

| **Model** | **validation (BLEU)** | **test (BLEU)** |
|:---------:|:---------------------:|:---------------:|
| **t2p-t5-large-commonvoice** | 86.3 | 86.5 |
| t2p-nmt-commonvoice | 86.0 | 82.6 |
| t2p-mbart-large-cc25-commonvoice | 72.3 | 72.3 |
| t2p-nllb-200-distilled-600M-commonvoice | **87.4** | **87.6** |

### Environmental Impact

Fine-tuning was performed on a single Nvidia V100 GPU with 32 GB of memory and took around 30 hours in total.

## Using the t2p-t5-large-commonvoice model with HuggingFace transformers

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("Propicto/t2p-t5-large-commonvoice")
# Move the model to the GPU so it matches the device of the input tensors below.
model = AutoModelForSeq2SeqLM.from_pretrained("Propicto/t2p-t5-large-commonvoice").to("cuda:0")

inputs = tokenizer("Je mange une pomme", return_tensors="pt").input_ids
outputs = model.generate(inputs.to("cuda:0"), max_new_tokens=40, do_sample=True, top_k=30, top_p=0.95)
pred = tokenizer.decode(outputs[0], skip_special_tokens=True)
```
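
Note that `do_sample=True` makes the decoding stochastic. If you prefer deterministic outputs, beam search is a common alternative (a suggestion, not necessarily the decoding strategy used in the original experiments):

```python
# Deterministic decoding with beam search instead of sampling.
outputs = model.generate(inputs.to("cuda:0"), max_new_tokens=40, num_beams=4)
pred = tokenizer.decode(outputs[0], skip_special_tokens=True)
```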

## Linking the predicted sequence of tokens to the corresponding ARASAAC pictograms

```python
import pandas as pd

def process_output_trad(pred):
    # Split the decoded prediction into individual pictogram tokens.
    return pred.split()

def read_lexicon(lexicon):
    # Load the tab-separated lexicon and normalize lemmas:
    # drop the " #category" suffix and replace spaces with underscores.
    df = pd.read_csv(lexicon, sep='\t')
    df['keyword_no_cat'] = df['lemma'].str.split(' #').str[0].str.strip().str.replace(' ', '_')
    return df

def get_id_picto_from_predicted_lemma(df_lexicon, lemma):
    # Look up the ARASAAC pictogram ID for a lemma; 0 marks "not found".
    id_picto = df_lexicon.loc[df_lexicon['keyword_no_cat'] == lemma, 'id_picto'].tolist()
    return (id_picto[0], lemma) if id_picto else (0, lemma)

lexicon = read_lexicon("lexicon.csv")
sentence_to_map = process_output_trad(pred)
pictogram_ids = [get_id_picto_from_predicted_lemma(lexicon, lemma) for lemma in sentence_to_map]
```
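
The code above expects the lexicon to be a tab-separated file with at least a `lemma` and an `id_picto` column, along these lines (illustrative values, not entries from the actual file):

```
lemma	id_picto
manger #verbe	6456
pomme	2462
```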

## Viewing the predicted sequence of ARASAAC pictograms in an HTML file

```python
def generate_html(ids):
    html_content = '<html><body>'
    for picto_id, lemma in ids:
        if picto_id != 0:  # skip lemmas with no matching pictogram
            img_url = f"https://static.arasaac.org/pictograms/{picto_id}/{picto_id}_500.png"
            html_content += f'''
            <figure style="display:inline-block; margin:1px;">
                <img src="{img_url}" alt="{lemma}" width="200" height="200" />
                <figcaption>{lemma}</figcaption>
            </figure>
            '''
    html_content += '</body></html>'
    return html_content

html = generate_html(pictogram_ids)
with open("pictograms.html", "w") as file:
    file.write(html)
```

## Information

- **Language(s):** French
- **License:** Apache-2.0
- **Developed by:** Cécile Macaire
- **Funded by:**
  - GENCI-IDRIS (Grant 2023-AD011013625R1)
  - PROPICTO ANR-20-CE93-0005
- **Authors:**
  - Cécile Macaire
  - Chloé Dion
  - Emmanuelle Esperança-Rodier
  - Benjamin Lecouteux
  - Didier Schwab

## Citation

If you use this model for your own research work, please cite as follows:

```bibtex
@inproceedings{macaire_jeptaln2024,
  title     = {{Approches cascade et de bout-en-bout pour la traduction automatique de la parole en pictogrammes}},
  author    = {Macaire, C{\'e}cile and Dion, Chlo{\'e} and Schwab, Didier and Lecouteux, Benjamin and Esperan{\c c}a-Rodier, Emmanuelle},
  url       = {https://inria.hal.science/hal-04623007},
  booktitle = {{35{\`e}mes Journ{\'e}es d'{\'E}tudes sur la Parole (JEP 2024) 31{\`e}me Conf{\'e}rence sur le Traitement Automatique des Langues Naturelles (TALN 2024) 26{\`e}me Rencontre des {\'E}tudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RECITAL 2024)}},
  address   = {Toulouse, France},
  publisher = {{ATALA \& AFPC}},
  volume    = {1 : articles longs et prises de position},
  pages     = {22--35},
  year      = {2024}
}
```