Commit efd7721 · 1 parent: 326dbc8
Update README.md

README.md CHANGED

@@ -121,6 +121,29 @@ The model architecture and config are the same as [M2M-100](https://huggingface.
 
 **Note**: SMALL100Tokenizer requires sentencepiece, so make sure to install it with `pip install sentencepiece`
 
+# Supervised Training
+
+SMaLL-100 is a sequence-to-sequence model for the translation task. The input to the model is `source: [tgt_lang_code] + src_tokens + [EOS]` and the target is `tgt_tokens + [EOS]`. An example of supervised training is shown below:
+
+```
+from transformers import M2M100ForConditionalGeneration
+from tokenization_small100 import SMALL100Tokenizer
+
+model = M2M100ForConditionalGeneration.from_pretrained("alirezamsh/small100")
+tokenizer = SMALL100Tokenizer.from_pretrained("alirezamsh/small100", tgt_lang="fr")
+
+src_text = "Life is like a box of chocolates."
+tgt_text = "La vie est comme une boîte de chocolat."
+
+model_inputs = tokenizer(src_text, text_target=tgt_text, return_tensors="pt")
+
+loss = model(**model_inputs).loss  # forward pass
+```
+
+Training data can be provided upon request.
+
+# Generation
+
 ```
 from transformers import M2M100ForConditionalGeneration
 from tokenization_small100 import SMALL100Tokenizer
@@ -146,7 +169,9 @@ tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
 # => "Life is like a box of chocolate."
 ```
 
-
+# Evaluation
+
+Please refer to the [original repository](https://github.com/alirezamshi/small100) for spBLEU computation.
 
 # Languages Covered
 
@@ -156,10 +181,21 @@ Afrikaans (af), Amharic (am), Arabic (ar), Asturian (ast), Azerbaijani (az), Bas
 
 If you use this model for your research, please cite the following work:
 ```
-@
-
-
-
-
+@misc{mohammadshahi2022small100,
+  title={SMaLL-100: Introducing Shallow Multilingual Machine Translation Model for Low-Resource Languages},
+  author={Alireza Mohammadshahi and Vassilina Nikoulina and Alexandre Berard and Caroline Brun and James Henderson and Laurent Besacier},
+  year={2022},
+  eprint={2210.11621},
+  archivePrefix={arXiv},
+  primaryClass={cs.CL}
+}
+
+@misc{mohammadshahi2022compressed,
+  title={What Do Compressed Multilingual Machine Translation Models Forget?},
+  author={Alireza Mohammadshahi and Vassilina Nikoulina and Alexandre Berard and Caroline Brun and James Henderson and Laurent Besacier},
+  year={2022},
+  eprint={2205.10828},
+  archivePrefix={arXiv},
+  primaryClass={cs.CL}
 }
 ```
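
The supervised-training hunk above stops at the forward pass. For readers who want the rest of the step, here is a minimal fine-tuning sketch built on that snippet; the AdamW optimizer, the learning rate, and the toy parallel pair are illustrative assumptions, not part of the commit:

```
import torch
from transformers import M2M100ForConditionalGeneration
from tokenization_small100 import SMALL100Tokenizer

model = M2M100ForConditionalGeneration.from_pretrained("alirezamsh/small100")
tokenizer = SMALL100Tokenizer.from_pretrained("alirezamsh/small100", tgt_lang="fr")

# Toy parallel data for illustration; the actual training data is available on request.
pairs = [
    ("Life is like a box of chocolates.", "La vie est comme une boîte de chocolat."),
]

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)  # assumed hyperparameters
model.train()

for src_text, tgt_text in pairs:
    # The tokenizer builds source: [tgt_lang_code] + src_tokens + [EOS]
    # and target: tgt_tokens + [EOS], as described in the README.
    batch = tokenizer(src_text, text_target=tgt_text, return_tensors="pt")
    loss = model(**batch).loss  # forward pass
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

In practice you would iterate over the training data in shuffled batches; the single-pair loop above only shows the mechanics of one update.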
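
The new Evaluation section defers spBLEU to the original repository. As an unofficial illustration, sacrebleu exposes a SentencePiece-based BLEU through its `spm` tokenizer; whether this matches the exact spBLEU configuration used in the paper is an assumption to verify against that repository:

```
import sacrebleu

# Hypothetical system outputs and references, for illustration only.
hypotheses = ["Life is like a box of chocolate."]
references = [["Life is like a box of chocolates."]]

# tokenize="spm" applies a SentencePiece model before BLEU (spBLEU-style scoring).
score = sacrebleu.corpus_bleu(hypotheses, references, tokenize="spm")
print(f"spBLEU: {score.score:.2f}")
```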