Update README.md
README.md
CHANGED
@@ -1,27 +1,27 @@
 ---
 license: apache-2.0
 language:
-- en
 - eu
 metrics:
 - BLEU
 - TER
 ---
-## Hitz Center’s English

 ## Model description

-This model was trained from scratch using [Marian NMT](https://marian-nmt.github.io/) on a combination of English-Basque datasets totalling

 - **Developed by:** HiTZ Research Center & IXA Research group (University of the Basque Country UPV/EHU)
 - **Model type:** traslation
-- **Source Language:**
-- **Target Language:**
 - **License:** apache-2.0

 ## Intended uses and limitations

-You can use this model for machine translation from

 At the time of submission, no measures have been taken to estimate the bias and toxicity embedded in the model. However, we are aware that our models may be biased since the corpora have been collected using crawling techniques on multiple web sources.

@@ -34,9 +34,9 @@ from transformers import MarianMTModel, MarianTokenizer
 from transformers import AutoTokenizer
 from transformers import AutoModelForSeq2SeqLM

-src_text = ["

-model_name = "HiTZ/mt-hitz-en
 tokenizer = MarianTokenizer.from_pretrained(model_name)

 model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
@@ -44,17 +44,17 @@ translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=T
 rue))
 print([tokenizer.decode(t, skip_special_tokens=True) for t in translated])`
 ```
-The recommended environments include the following transfomer versions: 4.12.3 , 4.15.0 , 4.26.1

 ## Training Details

 ### Training Data

-The

 | Dataset | Sentences before cleaning |
 |-----------------|--------------------------:|
-| CCMatrix v1
 | EhuHac | 585,210 |
 | Ehuskaratuak | 482,259 |
 | Ehuskaratuak | 482,259 |
@@ -64,17 +64,18 @@ The English-Basque data collected from the web was a combination of the followin
 | PaCO_2012 | 109,524 |
 | PaCO_2013 | 48,892 |
 | WikiMatrix | 119,480 |
-| **Total** | **15,653,108**



-The 11,489,433 sentence pairs of synthetic parallel data were created by translating a compendium of ES-EU parallel corpora into English using the [ES-EN translator from Google Translate](https://translate.google.com/about/).

 ### Training Procedure

 #### Preprocessing

-After concatenation, all datasets are cleaned and deduplicated using [bifixer](https://github.com/bitextor/bifixer) and [

 #### Tokenization
 All data is tokenized using sentencepiece, with a 32,000 token sentencepiece model learned from the combination of all filtered training data. This model is included.
@@ -84,26 +85,26 @@ All data is tokenized using sentencepiece, with a 32,000 token sentencepiece mod
 We use the BLEU and TER scores for evaluation on test sets: [Flores-200](https://github.com/facebookresearch/flores/tree/main/flores200), [TaCon](https://elrc-share.eu/repository/browse/tacon-spanish-constitution-mt-test-set/84a96138b98611ec9c1a00155d02670628f3e6857b0f422abd82abc3795ec8c2/) and [NTREX](https://github.com/MicrosoftTranslator/NTREX)

 ### Evaluation results
-Below are the evaluation results on the machine translation from

 ####BLEU scores

-
 |----------------------|-----------------|----------|-------------|
-| Flores 200 devtest |**
-| TaCON |
-| NTREX | **
-| Average | **

 ####TER scores

-| Test set |Google Translate | NLLB 3.3 |mt-hitz-en
 |----------------------|-----------------|----------|-------------|
-| Flores 200 devtest |**
-| TaCON |**
-| NTREX |**
-| Average |**
-


 <!-- Momentuz ez dugu artikulurik. ILENIAn zerbait egiten bada eguneratu beharko da -->
@@ -131,7 +132,7 @@ For further information, send an email to <[email protected]>
 This work is licensed under a [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)
 ### Funding
 This work is funded by the Ministerio para la Transformación Digital y de la Función Pública - Funded by EU – NextGenerationEU within the framework of the [project ILENIA](https://proyectoilenia.es/) with reference 2022/TL22/00215337, 2022/TL22/00215336, 2022/TL22/00215335 y 2022/TL22/00215334
-###
 <details>
 <summary>Click to expand</summary>
 The models published in this repository are intended for a generalist purpose and are available to third parties. These models may have bias and/or any other undesirable distortions.

---
license: apache-2.0
language:
- eu
- en
metrics:
- BLEU
- TER
---

## HiTZ Center’s Basque-English machine translation model

## Model description

This model was trained from scratch using [Marian NMT](https://marian-nmt.github.io/) on a combination of English-Basque datasets totalling 18,067,996 sentence pairs. 9,033,998 sentence pairs were parallel data collected from the web, while the remaining 9,033,998 sentence pairs were synthetic parallel data created using the [ES-EU translator from HiTZ](https://huggingface.co/HiTZ/mt-hitz-es-eu). The model was evaluated on the Flores, TaCon and NTREX evaluation datasets.

- **Developed by:** HiTZ Research Center & IXA Research group (University of the Basque Country UPV/EHU)
- **Model type:** translation
- **Source Language:** Basque
- **Target Language:** English
- **License:** apache-2.0

## Intended uses and limitations

You can use this model for machine translation from Basque to English.

At the time of submission, no measures have been taken to estimate the bias and toxicity embedded in the model. However, we are aware that our models may be biased since the corpora have been collected using crawling techniques on multiple web sources.

```python
from transformers import MarianMTModel, MarianTokenizer
from transformers import AutoTokenizer
from transformers import AutoModelForSeq2SeqLM

src_text = ["hau proba bat da"]

model_name = "HiTZ/mt-hitz-eu-en"
tokenizer = MarianTokenizer.from_pretrained(model_name)

model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))
print([tokenizer.decode(t, skip_special_tokens=True) for t in translated])
```
The recommended environments include the following transformers versions: 4.12.3, 4.15.0 and 4.26.1.

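For quick experiments, the same checkpoint can also be loaded through the generic `pipeline` API of `transformers`. The snippet below is a minimal sketch: the model id and the example sentence are taken from the code above, everything else is standard pipeline usage.

```python
from transformers import pipeline

# Load the Basque->English checkpoint with the generic translation pipeline.
translator = pipeline("translation", model="HiTZ/mt-hitz-eu-en")

# The pipeline returns a list of dicts with a "translation_text" field.
result = translator("hau proba bat da")
print(result[0]["translation_text"])
```
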
## Training Details

### Training Data

The English-Basque data collected from the web was a combination of the following datasets:

| Dataset | Sentences before cleaning |
|-----------------|--------------------------:|
| CCMatrix v1 | 7,788,871 |
| EhuHac | 585,210 |
| Ehuskaratuak | 482,259 |
| Ehuskaratuak | 482,259 |
| PaCO_2012 | 109,524 |
| PaCO_2013 | 48,892 |
| WikiMatrix | 119,480 |
| **Total** | **15,653,108** |

The 9,033,998 sentence pairs of synthetic parallel data were created by translating a compendium of ES-EN parallel corpora into Basque using the [ES-EU translator from HiTZ](https://huggingface.co/HiTZ/mt-hitz-es-eu).

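As an illustration of that recipe, the sketch below translates the Spanish side of a Spanish-English corpus into Basque with the `HiTZ/mt-hitz-es-eu` checkpoint linked above and writes it out line-aligned with the English side, which is what yields synthetic Basque-English pairs. The file names and batch size are hypothetical; the exact compendium of corpora and the translation pipeline used by the authors are not specified in this card.

```python
from transformers import MarianTokenizer, AutoModelForSeq2SeqLM

# Hypothetical file names: the Spanish half of a line-aligned ES-EN corpus,
# and the Basque output that is later paired with the English half.
ES_FILE, OUT_EU = "corpus.es", "synthetic.eu"

model_name = "HiTZ/mt-hitz-es-eu"  # ES->EU translator referenced above
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def translate_batch(sentences):
    batch = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)
    outputs = model.generate(**batch)
    return [tokenizer.decode(t, skip_special_tokens=True) for t in outputs]

with open(ES_FILE, encoding="utf-8") as src, open(OUT_EU, "w", encoding="utf-8") as out:
    buffer = []
    for line in src:
        buffer.append(line.strip())
        if len(buffer) == 32:  # hypothetical batch size
            out.write("\n".join(translate_batch(buffer)) + "\n")
            buffer = []
    if buffer:
        out.write("\n".join(translate_batch(buffer)) + "\n")
```
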
### Training Procedure

#### Preprocessing

After concatenation, all datasets are cleaned and deduplicated using the [bifixer](https://github.com/bitextor/bifixer) and [bicleaner](https://github.com/bitextor/bicleaner) tools [(Ramírez-Sánchez et al., 2020)](https://aclanthology.org/2020.eamt-1.31/). Any sentence pair with a classification score of less than 0.5 is removed. The filtered corpus is composed of 9,033,998 parallel sentences.

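The thresholding step described above can be reproduced in a few lines of Python once bicleaner has scored the corpus. A minimal sketch, assuming a tab-separated file whose last column holds the bicleaner score (the file names are hypothetical):

```python
# Keep only sentence pairs whose bicleaner score is >= 0.5, mirroring the
# filtering step described above. Assumes a tab-separated file where the
# last column is the bicleaner score.
THRESHOLD = 0.5

with open("corpus.scored.tsv", encoding="utf-8") as fin, \
     open("corpus.filtered.tsv", "w", encoding="utf-8") as fout:
    for line in fin:
        fields = line.rstrip("\n").split("\t")
        try:
            score = float(fields[-1])
        except ValueError:
            continue  # skip malformed lines
        if score >= THRESHOLD:
            fout.write(line)
```
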
#### Tokenization

All data is tokenized using sentencepiece, with a 32,000 token sentencepiece model learned from the combination of all filtered training data. This sentencepiece model is included in the repository.

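For reference, a 32,000-token model of this kind can be trained with the `sentencepiece` Python package. The sketch below uses assumed file names; the exact training options chosen by the authors are not documented in this card.

```python
import sentencepiece as spm

# Learn a joint 32k SentencePiece model from the filtered training data.
# "train.eu-en.txt" (concatenated Basque and English sentences) is an
# assumed file name.
spm.SentencePieceTrainer.train(
    input="train.eu-en.txt",
    model_prefix="eu-en.spm32k",
    vocab_size=32000,
)

# Load the resulting model and segment a sentence.
sp = spm.SentencePieceProcessor(model_file="eu-en.spm32k.model")
print(sp.encode("hau proba bat da", out_type=str))
```
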
We use the BLEU and TER scores for evaluation on test sets: [Flores-200](https://github.com/facebookresearch/flores/tree/main/flores200), [TaCon](https://elrc-share.eu/repository/browse/tacon-spanish-constitution-mt-test-set/84a96138b98611ec9c1a00155d02670628f3e6857b0f422abd82abc3795ec8c2/) and [NTREX](https://github.com/MicrosoftTranslator/NTREX).

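Both metrics can be computed with [sacreBLEU](https://github.com/mjpost/sacrebleu). The sketch below uses placeholder hypothesis and reference lists; the exact sacreBLEU configuration behind the scores reported here is not stated in this card.

```python
import sacrebleu

# Placeholder system outputs and references; in practice these would be the
# detokenized model translations and the Flores-200 / TaCon / NTREX references.
hypotheses = ["This is a test."]
references = [["This is a test."]]  # one reference stream, aligned with hypotheses

bleu = sacrebleu.corpus_bleu(hypotheses, references)
ter = sacrebleu.corpus_ter(hypotheses, references)
print(f"BLEU = {bleu.score:.1f}  TER = {ter.score:.1f}")
```
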
### Evaluation results

Below are the evaluation results on the machine translation from Basque to English compared to [Google Translate](https://translate.google.com/) and [NLLB 200 3.3B](https://huggingface.co/facebook/nllb-200-3.3B):

#### BLEU scores

| Test set           | Google Translate | NLLB 3.3 | mt-hitz-eu-en |
|--------------------|------------------|----------|---------------|
| Flores 200 devtest | **36.1**         | 32.2     | 28.6          |
| TaCON              | **22.8**         | 22.7     | 21.9          |
| NTREX              | **33.7**         | 28.9     | 25.8          |
| Average            | **30.9**         | 27.9     | 25.4          |

#### TER scores

| Test set           | Google Translate | NLLB 3.3 | mt-hitz-eu-en |
|--------------------|------------------|----------|---------------|
| Flores 200 devtest | **46.5**         | 51.2     | 53.1          |
| TaCON              | **57.0**         | 63.0     | 57.5          |
| NTREX              | **50.2**         | 55.5     | 58.2          |
| Average            | **51.2**         | 56.6     | 56.3          |

<!-- For now we have no papers. If something is produced within ILENIA, this will need to be updated -->

This work is licensed under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)

### Funding
This work is funded by the Ministerio para la Transformación Digital y de la Función Pública - Funded by EU – NextGenerationEU within the framework of the [project ILENIA](https://proyectoilenia.es/) with references 2022/TL22/00215337, 2022/TL22/00215336, 2022/TL22/00215335 and 2022/TL22/00215334

### Disclaimer
<details>
<summary>Click to expand</summary>
The models published in this repository are intended for a generalist purpose and are available to third parties. These models may have bias and/or any other undesirable distortions.