---
license: apache-2.0
language:
- fr
library_name: transformers
tags:
- mbart
- orfeo
- pytorch
- pictograms
- translation
metrics:
- sacrebleu
inference: false
---
# t2p-mbart-large-cc25-orfeo
*t2p-mbart-large-cc25-orfeo* is a text-to-pictograms translation model built by fine-tuning the [mbart-large-cc25](https://huggingface.co/facebook/mbart-large-cc25) model on a dataset of pairs of transcriptions and pictogram token sequences (each token is linked to a pictogram image from [ARASAAC](https://arasaac.org/)).
The model is used only for **inference**.
## Training details
The model was trained with [Fairseq](https://github.com/facebookresearch/fairseq/blob/main/examples/mbart/README.md).
### Datasets
The [Propicto-orféo dataset](https://www.ortolang.fr/market/corpora/propicto), created from the CEFC-Orféo corpus, is used.
This dataset was presented in the research paper titled ["A Multimodal French Corpus of Aligned Speech, Text, and Pictogram Sequences for Speech-to-Pictogram Machine Translation"](https://aclanthology.org/2024.lrec-main.76/) at LREC-COLING 2024. The dataset was split into training, validation, and test sets.
| **Split** | **Number of utterances** |
|:-----------:|:-----------------------:|
| train | 231,374 |
| valid | 28,796 |
| test | 29,009 |
### Parameters
These are the arguments used in the training pipeline:
```bash
fairseq-train $DATA \
--encoder-normalize-before --decoder-normalize-before \
--arch mbart_large --layernorm-embedding \
--task translation_from_pretrained_bart \
--source-lang fr --target-lang frp \
--criterion label_smoothed_cross_entropy --label-smoothing 0.2 \
--optimizer adam --adam-eps 1e-06 --adam-betas '(0.9, 0.98)' \
--lr-scheduler polynomial_decay --lr 3e-05 --warmup-updates 2500 --total-num-update 40000 \
--dropout 0.3 --attention-dropout 0.1 --weight-decay 0.0 \
--max-tokens 1024 --update-freq 2 \
--save-interval 1 --save-interval-updates 5000 --keep-interval-updates 5 \
--seed 222 --log-format simple --log-interval 2 \
--langs fr \
--ddp-backend legacy_ddp \
--max-epoch 40 \
--save-dir models/checkpoints/mt_mbart_fr_frp_orfeo \
--keep-best-checkpoints 5 \
--keep-last-epochs 5
```
### Evaluation
The model was evaluated with sacreBLEU, comparing the reference pictogram translation with the model hypothesis.
```bash
fairseq-generate orfeo_data/data/ \
--path $model_dir/checkpoint_best.pt \
--task translation_from_pretrained_bart \
--gen-subset test \
-t frp -s fr \
--bpe 'sentencepiece' --sentencepiece-model mbart.cc25.v2/sentence.bpe.model \
--sacrebleu \
--batch-size 32 --langs $langs > out.txt
```
The output file contains, for each utterance, the source (S), the reference pictogram sequence (T), the scored hypothesis before and after detokenization (H and D), the per-token scores (P), and the corpus-level BLEU:
```txt
S-27886 ça sera tout madame<unk>
T-27886 prochain celle-là être tout monsieur
H-27886 -0.2824968993663788 ▁prochain ▁celle - là ▁être ▁tout ▁monsieur
D-27886 -0.2824968993663788 prochain celle-là être tout monsieur
P-27886 -0.5773 -0.1780 -0.2587 -0.2361 -0.2726 -0.3167 -0.1312 -0.3103 -0.2615
Generate test with beam=5: BLEU4 = 75.62, 85.7/78.9/73.9/69.3 (BP=0.986, ratio=0.986, syslen=407923, reflen=413636)
```
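If the score needs to be recomputed outside Fairseq, the reference (`T-`) and detokenized hypothesis (`D-`) lines can be extracted from `out.txt` and passed to the `sacrebleu` Python package. This is a minimal sketch assuming the standard tab-separated `fairseq-generate` output shown above; the resulting score may differ slightly from the one printed by Fairseq depending on tokenization settings.
```python
import sacrebleu

# Collect reference (T-) and detokenized hypothesis (D-) lines from the
# fairseq-generate output, keyed by sentence id so pairs stay aligned.
refs, hyps = {}, {}
with open("out.txt", encoding="utf-8") as f:
    for line in f:
        if line.startswith("T-"):
            idx, text = line.rstrip("\n").split("\t", 1)
            refs[idx[2:]] = text
        elif line.startswith("D-"):
            idx, _score, text = line.rstrip("\n").split("\t", 2)
            hyps[idx[2:]] = text

ids = sorted(refs, key=int)
bleu = sacrebleu.corpus_bleu([hyps[i] for i in ids], [[refs[i] for i in ids]])
print(bleu.score)
```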
### Results
Comparison to other translation models (BLEU scores):
| **Model** | **validation** | **test** |
|:-----------:|:-----------------------:|:-----------------------:|
| t2p-t5-large-orféo | 85.2 | 85.8 |
| t2p-nmt-orféo | **87.2** | **87.4** |
| **t2p-mbart-large-cc25-orfeo** | 75.2 | 75.6 |
| t2p-nllb-200-distilled-600M-orfeo | 86.3 | 86.9 |
### Environmental Impact
Fine-tuning was performed on a single NVIDIA V100 GPU with 32 GB of memory and took 18 hours in total.
## Using the t2p-mbart-large-cc25-orfeo model
The scripts to use the *t2p-mbart-large-cc25-orfeo* model are located in the [speech-to-pictograms GitHub repository](https://github.com/macairececile/speech-to-pictograms).
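Since the card is tagged with the `transformers` library, the checkpoint can in principle also be loaded through the Transformers API. The snippet below is only a sketch: the repository id is hypothetical (replace it with the Hub id of this model card), and the exact pre- and post-processing of pictogram tokens should follow the scripts in the GitHub repository.
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Hypothetical repository id: replace with the actual Hub id of this model card.
model_id = "Propicto/t2p-mbart-large-cc25-orfeo"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# Translate a French transcription into a sequence of pictogram tokens.
inputs = tokenizer("ça sera tout madame", return_tensors="pt")
generated = model.generate(**inputs, max_length=128, num_beams=5)
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
```
Depending on how the Fairseq checkpoint was converted, the generation settings (for instance a forced BOS / target-language token) may need to be adjusted as done in the reference scripts.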
## Information
- **Language(s):** French
- **License:** Apache-2.0
- **Developed by:** Cécile Macaire
- **Funded by:**
- GENCI-IDRIS (Grant 2023-AD011013625R1)
- PROPICTO ANR-20-CE93-0005
- **Authors:**
- Cécile Macaire
- Chloé Dion
- Emmanuelle Esperança-Rodier
- Benjamin Lecouteux
- Didier Schwab
## Citation
If you use this model for your own research work, please cite as follows:
```bibtex
@inproceedings{macaire_jeptaln2024,
title = {{Approches cascade et de bout-en-bout pour la traduction automatique de la parole en pictogrammes}},
author = {Macaire, C{\'e}cile and Dion, Chlo{\'e} and Schwab, Didier and Lecouteux, Benjamin and Esperan{\c c}a-Rodier, Emmanuelle},
url = {https://inria.hal.science/hal-04623007},
booktitle = {{35{\`e}mes Journ{\'e}es d'{\'E}tudes sur la Parole (JEP 2024) 31{\`e}me Conf{\'e}rence sur le Traitement Automatique des Langues Naturelles (TALN 2024) 26{\`e}me Rencontre des {\'E}tudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RECITAL 2024)}},
address = {Toulouse, France},
publisher = {{ATALA \& AFPC}},
volume = {1 : articles longs et prises de position},
pages = {22-35},
year = {2024}
}
```