anegda committed · verified · Commit 705ffa6 · 1 Parent(s): 6f9254a

Update README.md

Files changed (1): README.md +28 -27

README.md CHANGED
@@ -1,27 +1,27 @@
---
license: apache-2.0
language:
- - en
- eu
metrics:
- BLEU
- TER
---
- ## Hitz Center’s English-Basque machine translation model

## Model description

- This model was trained from scratch using [Marian NMT](https://marian-nmt.github.io/) on a combination of English-Basque datasets totalling 20,523,431 sentence pairs. 9,033,998 sentence pairs were parallel data collected from the web while the remaining 11,489,433 sentence pairs were parallel synthetic data created using the [Google Translate translator](https://translate.google.com/about/). The model was evaluated on the Flores, TaCon and NTREX evaluation datasets.

- **Developed by:** HiTZ Research Center & IXA Research group (University of the Basque Country UPV/EHU)
- **Model type:** translation
- - **Source Language:** English
- - **Target Language:** Basque
- **License:** apache-2.0

## Intended uses and limitations

- You can use this model for machine translation from English to Basque.

At the time of submission, no measures have been taken to estimate the bias and toxicity embedded in the model. However, we are aware that our models may be biased since the corpora have been collected using crawling techniques on multiple web sources.
 
@@ -34,9 +34,9 @@ from transformers import MarianMTModel, MarianTokenizer
from transformers import AutoTokenizer
from transformers import AutoModelForSeq2SeqLM

- src_text = ["this is a test"]

- model_name = "HiTZ/mt-hitz-en-eu"
tokenizer = MarianTokenizer.from_pretrained(model_name)

model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
@@ -44,17 +44,17 @@ translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))
print([tokenizer.decode(t, skip_special_tokens=True) for t in translated])
```
- The recommended environments include the following transformers versions: 4.12.3, 4.15.0, 4.26.1
 
## Training Details

### Training Data

- The English-Basque data collected from the web was a combination of the following datasets:

| Dataset | Sentences before cleaning |
|-----------------|--------------------------:|
- | CCMatrix v1 | 7,788,871 |
| EhuHac | 585,210 |
| Ehuskaratuak | 482,259 |
| Ehuskaratuak | 482,259 |
@@ -64,17 +64,18 @@ The English-Basque data collected from the web was a combination of the followin
| PaCO_2012 | 109,524 |
| PaCO_2013 | 48,892 |
| WikiMatrix | 119,480 |
- | **Total** | **15,653,108** |

- The 11,489,433 sentence pairs of synthetic parallel data were created by translating a compendium of ES-EU parallel corpora into English using the [ES-EN translator from Google Translate](https://translate.google.com/about/).

### Training Procedure

#### Preprocessing

- After concatenation, all datasets are cleaned and deduplicated using [bifixer](https://github.com/bitextor/bifixer) and [bicleaner](https://github.com/bitextor/bicleaner) tools [(Ramírez-Sánchez et al., 2020)](https://aclanthology.org/2020.eamt-1.31/). Any sentence pairs with a classification score of less than 0.5 are removed. The filtered corpus is composed of 9,033,998 parallel sentences.
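
A minimal sketch of that thresholding step, assuming bicleaner-style tab-separated output with the classifier score in the last column (the file names and exact column layout are illustrative assumptions, not taken from the card):

```python
# Keep only sentence pairs whose bicleaner classification score is >= 0.5.
# "scored.tsv" and the src<TAB>tgt<TAB>score layout are assumed for
# illustration; the actual pipeline files are not described in the card.
THRESHOLD = 0.5

with open("scored.tsv", encoding="utf-8") as fin, \
     open("filtered.tsv", "w", encoding="utf-8") as fout:
    for line in fin:
        score = float(line.rstrip("\n").split("\t")[-1])  # score in last column
        if score >= THRESHOLD:
            fout.write(line)
```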

#### Tokenization
All data is tokenized using sentencepiece, with a 32,000 token sentencepiece model learned from the combination of all filtered training data. This model is included.
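
As a rough sketch of this step, the sentencepiece Python API can train a 32,000-piece model; the input path and model prefix below are hypothetical:

```python
import sentencepiece as spm

# Train a 32,000-piece sentencepiece model on the filtered training data.
# "filtered_corpus.txt" and the "en_eu_sp" prefix are hypothetical names.
spm.SentencePieceTrainer.train(
    input="filtered_corpus.txt",
    model_prefix="en_eu_sp",
    vocab_size=32000,
)

# Load the trained model and segment a sample sentence into pieces.
sp = spm.SentencePieceProcessor(model_file="en_eu_sp.model")
print(sp.encode("this is a test", out_type=str))
```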
@@ -84,26 +85,26 @@ All data is tokenized using sentencepiece, with a 32,000 token sentencepiece mod
We use the BLEU and TER scores for evaluation on test sets: [Flores-200](https://github.com/facebookresearch/flores/tree/main/flores200), [TaCon](https://elrc-share.eu/repository/browse/tacon-spanish-constitution-mt-test-set/84a96138b98611ec9c1a00155d02670628f3e6857b0f422abd82abc3795ec8c2/) and [NTREX](https://github.com/MicrosoftTranslator/NTREX).
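
The card does not name a scoring toolkit; sacrebleu is one common implementation of both metrics. A minimal sketch, with hypothetical file names and one detokenized hypothesis/reference per line:

```python
from sacrebleu.metrics import BLEU, TER

# "hypotheses.txt" and "references.txt" are hypothetical file names.
with open("hypotheses.txt", encoding="utf-8") as f:
    hyps = [line.strip() for line in f]
with open("references.txt", encoding="utf-8") as f:
    refs = [line.strip() for line in f]

print(BLEU().corpus_score(hyps, [refs]))  # corpus-level BLEU
print(TER().corpus_score(hyps, [refs]))   # corpus-level TER
```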

### Evaluation results
- Below are the evaluation results on the machine translation from English to Basque compared to [Google Translate](https://translate.google.com/) and [NLLB 200 3.3B](https://huggingface.co/facebook/nllb-200-3.3B):

#### BLEU scores

- | Test set | Google Translate | NLLB 3.3 | mt-hitz-en-eu |
|----------------------|-----------------|----------|-------------|
- | Flores 200 devtest | **20.5** | 13.3 | 19.2 |
- | TaCON | **12.1** | 9.4 | 8.8 |
- | NTREX | **15.7** | 8.0 | 14.5 |
- | Average | **16.1** | 10.2 | 14.2 |

#### TER scores

- | Test set | Google Translate | NLLB 3.3 | mt-hitz-en-eu |
|----------------------|-----------------|----------|-------------|
- | Flores 200 devtest | **59.5** | 70.4 | 65.0 |
- | TaCON | **69.5** | 75.3 | 76.8 |
- | NTREX | **65.8** | 81.6 | 66.7 |
- | Average | **64.9** | 75.8 | 68.2 |
-

<!-- We do not have a paper yet. If something comes out of ILENIA, this will need to be updated. -->
@@ -131,7 +132,7 @@ For further information, send an email to <[email protected]>
This work is licensed under an [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)
### Funding
This work is funded by the Ministerio para la Transformación Digital y de la Función Pública - Funded by EU – NextGenerationEU within the framework of the [project ILENIA](https://proyectoilenia.es/) with references 2022/TL22/00215337, 2022/TL22/00215336, 2022/TL22/00215335 and 2022/TL22/00215334
- ### Disclaimer
<details>
<summary>Click to expand</summary>
The models published in this repository are intended for a generalist purpose and are available to third parties. These models may have bias and/or any other undesirable distortions.
 
---
license: apache-2.0
language:
- eu
+ - en
metrics:
- BLEU
- TER
---
+ ## Hitz Center’s Basque-English machine translation model

## Model description

+ This model was trained from scratch using [Marian NMT](https://marian-nmt.github.io/) on a combination of English-Basque datasets totalling 18,067,996 sentence pairs. 9,033,998 sentence pairs were parallel data collected from the web while the remaining 9,033,998 sentence pairs were parallel synthetic data created using the [ES-EU translator from HiTZ](https://huggingface.co/HiTZ/mt-hitz-es-eu). The model was evaluated on the Flores, TaCon and NTREX evaluation datasets.

- **Developed by:** HiTZ Research Center & IXA Research group (University of the Basque Country UPV/EHU)
- **Model type:** translation
+ - **Source Language:** Basque
+ - **Target Language:** English
- **License:** apache-2.0

## Intended uses and limitations

+ You can use this model for machine translation from Basque to English.

At the time of submission, no measures have been taken to estimate the bias and toxicity embedded in the model. However, we are aware that our models may be biased since the corpora have been collected using crawling techniques on multiple web sources.
 
 
from transformers import AutoTokenizer
from transformers import AutoModelForSeq2SeqLM

+ src_text = ["hau proba bat da"]

+ model_name = "HiTZ/mt-hitz-eu-en"
tokenizer = MarianTokenizer.from_pretrained(model_name)

model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))
print([tokenizer.decode(t, skip_special_tokens=True) for t in translated])
```
 

## Training Details

### Training Data

+ The English-Basque data collected from the web was a combination of the following datasets:
+
| Dataset | Sentences before cleaning |
|-----------------|--------------------------:|
+ | CCMatrix v1 | 7,788,871 |
| EhuHac | 585,210 |
| Ehuskaratuak | 482,259 |
| Ehuskaratuak | 482,259 |
 
| PaCO_2012 | 109,524 |
| PaCO_2013 | 48,892 |
| WikiMatrix | 119,480 |
+ | **Total** | **15,653,108** |

+ The recommended environments include the following transformers versions: 4.12.3, 4.15.0, 4.26.1

+ The 9,033,998 sentence pairs of synthetic parallel data were created by translating a compendium of ES-EN parallel corpora into Basque using the [ES-EU translator from HiTZ](https://huggingface.co/HiTZ/mt-hitz-es-eu).
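
A minimal sketch of how that synthetic-data step could look, reusing the same transformers interface as the usage example above; the helper, batching and example sentence are illustrative assumptions, not the actual HiTZ pipeline:

```python
# Translate the Spanish side of an ES-EN corpus into Basque with
# HiTZ/mt-hitz-es-eu to obtain synthetic EU-EN sentence pairs.
from transformers import MarianTokenizer, AutoModelForSeq2SeqLM

model_name = "HiTZ/mt-hitz-es-eu"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def translate_batch(sentences):
    # Batched greedy decoding; the batching strategy is an assumption.
    batch = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)
    return [tokenizer.decode(t, skip_special_tokens=True)
            for t in model.generate(**batch)]

es_sentences = ["esto es una prueba"]         # Spanish side of an ES-EN pair
eu_sentences = translate_batch(es_sentences)  # synthetic Basque side
```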

### Training Procedure

#### Preprocessing

+ After concatenation, all datasets are cleaned and deduplicated using [bifixer](https://github.com/bitextor/bifixer) and [bicleaner](https://github.com/bitextor/bicleaner) tools [(Ramírez-Sánchez et al., 2020)](https://aclanthology.org/2020.eamt-1.31/). Any sentence pairs with a classification score of less than 0.5 are removed. The filtered corpus is composed of 9,033,998 parallel sentences.

#### Tokenization
All data is tokenized using sentencepiece, with a 32,000 token sentencepiece model learned from the combination of all filtered training data. This model is included.
 
We use the BLEU and TER scores for evaluation on test sets: [Flores-200](https://github.com/facebookresearch/flores/tree/main/flores200), [TaCon](https://elrc-share.eu/repository/browse/tacon-spanish-constitution-mt-test-set/84a96138b98611ec9c1a00155d02670628f3e6857b0f422abd82abc3795ec8c2/) and [NTREX](https://github.com/MicrosoftTranslator/NTREX).

### Evaluation results
+ Below are the evaluation results on the machine translation from Basque to English compared to [Google Translate](https://translate.google.com/) and [NLLB 200 3.3B](https://huggingface.co/facebook/nllb-200-3.3B):

#### BLEU scores

+ | Test set | Google Translate | NLLB 3.3 | mt-hitz-eu-en |
|----------------------|-----------------|----------|-------------|
+ | Flores 200 devtest | **36.1** | 32.2 | 28.6 |
+ | TaCON | **22.8** | 22.7 | 21.9 |
+ | NTREX | **33.7** | 28.9 | 25.8 |
+ | Average | **30.9** | 27.9 | 25.4 |

#### TER scores

+ | Test set | Google Translate | NLLB 3.3 | mt-hitz-eu-en |
|----------------------|-----------------|----------|-------------|
+ | Flores 200 devtest | **46.5** | 51.2 | 53.1 |
+ | TaCON | **57.0** | 63.0 | 57.5 |
+ | NTREX | **50.2** | 55.5 | 58.2 |
+ | Average | **51.2** | 56.6 | 56.3 |

<!-- We do not have a paper yet. If something comes out of ILENIA, this will need to be updated. -->
 
This work is licensed under an [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)
### Funding
This work is funded by the Ministerio para la Transformación Digital y de la Función Pública - Funded by EU – NextGenerationEU within the framework of the [project ILENIA](https://proyectoilenia.es/) with references 2022/TL22/00215337, 2022/TL22/00215336, 2022/TL22/00215335 and 2022/TL22/00215334
+ ### Disclaimer
<details>
<summary>Click to expand</summary>
The models published in this repository are intended for a generalist purpose and are available to third parties. These models may have bias and/or any other undesirable distortions.