RichardErkhov committed on
Commit 5e1d1ae · verified · 1 Parent(s): 50892a9

uploaded readme

Files changed (1): README.md ADDED (+181 -0)
 
Quantization made by Richard Erkhov.

[Github](https://github.com/RichardErkhov)

[Discord](https://discord.gg/pvy7H8DZMG)

[Request more models](https://github.com/RichardErkhov/quant_request)

Plume32k - bnb 8bits
- Model creator: https://huggingface.co/projecte-aina/
- Original model: https://huggingface.co/projecte-aina/Plume32k/
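
This repository provides an 8-bit bitsandbytes quantization of Plume32k. As a minimal sketch (not an official recipe for this repository), the original checkpoint can be loaded in 8-bit with `transformers` and `bitsandbytes` as shown below; the snippet assumes `bitsandbytes` and `accelerate` are installed and a CUDA GPU is available. Loading the pre-quantized weights stored here with `AutoModelForCausalLM.from_pretrained` should behave analogously.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Original full-precision checkpoint; quantization happens at load time.
model_id = "projecte-aina/Plume32k"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# 8-bit weights via bitsandbytes; device_map="auto" places the layers on the available GPU(s).
bnb_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
```
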
Original model description:
---
license: apache-2.0
language:
- en
- gl
- de
- es
- ca
- it
- fr
- eu
- pt
metrics:
- comet
- bleu
pipeline_tag: translation
widget:
- text: <s> [spa_Latn] Ayer él se fue, tomó sus cosas y se puso a navegar. \n[cat_Latn]
inference: false
---

# Plume32k

This is the model card of Plume (**P**arallel **L**ang**u**age **M**od**e**l) with a vocabulary size of 32k.

## Table of Contents
<details>
<summary>Click to expand</summary>

- [Model description](#model-description)
- [Intended uses and limitations](#intended-uses-and-limitations)
- [Run the model](#run-the-model)
- [Training](#training)
- [Evaluation](#evaluation)
- [Citation](#citation)
- [Additional information](#additional-information)

</details>

## Summary
Plume is the first LLM trained from scratch for Neural Machine Translation using only parallel Catalan-centric data. It shares the architecture of Gemma 2B and is trained for general sentence-level translation tasks. For more information about the training, architecture, and interpretability of the model, check out the paper "Investigating the translation capabilities of Large Language Models trained on parallel data only"; the preprint is available on [arXiv](https://arxiv.org/abs/2406.09140).

- **Developed by:** The Language Technologies Unit from Barcelona Supercomputing Center (BSC).
- **Languages:** Spanish, French, Italian, Portuguese, Galician, German, English, and Basque.
- **License:** Apache License, Version 2.0
## Model Description

In recent years, Large Language Models (LLMs) have demonstrated exceptional proficiency across a broad spectrum of Natural Language Processing (NLP) tasks, including Machine Translation. However, previous methodologies have predominantly relied on iterative processes such as instruction fine-tuning or continual pre-training, leaving unexplored the challenges of training LLMs solely on parallel data. In this work, we introduce Plume (**P**arallel **L**ang**u**age **M**od**e**l), a collection of three 2B LLMs featuring varying vocabulary sizes (32k, 128k, and 256k) trained exclusively on Catalan-centric parallel examples. These models perform comparably to previous encoder-decoder architectures on 16 supervised translation directions and 56 zero-shot ones.

For more details regarding the model architecture, the dataset, and model interpretability, take a look at the [paper](https://arxiv.org/abs/2406.09140).
## Intended Uses and Limitations

The model is proficient in the 16 supervised translation directions that include Catalan and is also capable of translating in 56 other zero-shot directions.

At the time of submission, no measures have been taken to estimate the bias and added toxicity embedded in the model. However, we are aware that our models may be biased since the corpora have been collected using crawling techniques on multiple web sources. We intend to conduct research in these areas in the future, and if completed, this model card will be updated.
## Run the model

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# language codes: spa_Latn (Spanish), cat_Latn (Catalan), eng_Latn (English), ita_Latn (Italian),
# eus_Latn (Basque), deu_Latn (German), por_Latn (Portuguese), glg_Latn (Galician), fra_Latn (French)

model_id = "projecte-aina/Plume32k"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

src_lang_code = 'spa_Latn'
tgt_lang_code = 'cat_Latn'
sentence = 'Ayer se fue, tomó sus cosas y se puso a navegar.'

# Prompt format: "<s> [src_lang_code] source sentence \n[tgt_lang_code]"
prompt = '<s> [{}] {} \n[{}]'.format(src_lang_code, sentence, tgt_lang_code)
input_ids = tokenizer(prompt, return_tensors='pt').input_ids

# Generate with beam search and decode only the newly generated tokens
output_ids = model.generate(input_ids, max_length=200, num_beams=5)
input_length = input_ids.shape[1]
generated_text = tokenizer.decode(output_ids[0, input_length:], skip_special_tokens=True).strip()
# Ahir se'n va anar, va agafar les seves coses i es va posar a navegar.
```
## Training

For training, the learning rate is warmed up from 1e-7 to a maximum of 3e-4 over the first 2000 steps. We apply a weight decay of 0.1 and gradient clipping of 1.0. During training, we set an effective batch size of 81,920 tokens per gradient step, distributed over 40 NVIDIA H100-64GB GPUs. We use DeepSpeed with full *float32* training. The following table lists the training hyperparameters:

| **Hyper-Parameter** | **Value** |
|---------------------|-----------|
| Batch size          | 40        |
| Number of Epochs    | 1         |
| Optimizer           | Adam      |
| Adam-β₁             | 0.9       |
| Adam-β₂             | 0.999     |
| Adam-ε              | 1e-08     |
| Learning rate       | 3e-04     |
| LR Scheduler        | Linear    |
| Warmup Steps        | 2000      |

More training details are specified in the [paper](https://arxiv.org/abs/2406.09140). Code for training the model and running other experiments can be found in our [GitHub repository](https://github.com/projecte-aina/Plume).
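
The snippet below is only a minimal PyTorch sketch of the optimizer and schedule described above, not the actual DeepSpeed training code: the total step count and the placeholder model are assumptions for illustration, and AdamW stands in for Adam with decoupled weight decay.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Linear(8, 8)              # placeholder; training used a Gemma-2B-style model
max_lr, min_lr = 3e-4, 1e-7                # peak learning rate and warmup floor
warmup_steps, total_steps = 2000, 100_000  # total_steps is an assumed placeholder

optimizer = AdamW(model.parameters(), lr=max_lr,
                  betas=(0.9, 0.999), eps=1e-8, weight_decay=0.1)

def lr_lambda(step):
    # Linear warmup from min_lr to max_lr over the first 2000 steps,
    # then linear decay over the remaining steps.
    if step < warmup_steps:
        return (min_lr + (max_lr - min_lr) * step / warmup_steps) / max_lr
    return max(0.0, (total_steps - step) / (total_steps - warmup_steps))

scheduler = LambdaLR(optimizer, lr_lambda)

# Inside the training loop, gradients are clipped before each optimizer step:
# torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
# optimizer.step(); scheduler.step(); optimizer.zero_grad()
```
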
## Evaluation

Below are the evaluation results on Flores-200 and NTREX for the supervised MT directions. For more details about the model evaluation, check out the [paper](https://arxiv.org/abs/2406.09140).
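
As a rough illustration of how BLEU and COMET scores of this kind are typically computed (this is not the exact evaluation pipeline used in the paper), the sketch below uses the `sacrebleu` and Unbabel `comet` packages; the hypothesis/reference lists and the choice of the `wmt22-comet-da` checkpoint are placeholders. The results table follows the sketch.

```python
import sacrebleu
from comet import download_model, load_from_checkpoint

# Placeholder data: source sentences, system translations, and references.
sources = ["Ayer se fue, tomó sus cosas y se puso a navegar."]
hypotheses = ["Ahir se'n va anar, va agafar les seves coses i es va posar a navegar."]
references = ["Ahir se'n va anar, va agafar les seves coses i es va posar a navegar."]

# Corpus-level BLEU with sacreBLEU.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.2f}")

# Reference-based COMET; downloads the checkpoint on first use.
comet_model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
data = [{"src": s, "mt": h, "ref": r} for s, h, r in zip(sources, hypotheses, references)]
print(f"COMET: {comet_model.predict(data, batch_size=8, gpus=0).system_score:.2f}")
```
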
| Model          | FLORES BLEU | FLORES COMET | NTREX BLEU | NTREX COMET |
|----------------|-------------|--------------|------------|-------------|
| NLLB-1.3B      | 31.02       | 0.86         | 29.68      | 0.85        |
| NLLB-600M      | 29.24       | 0.85         | 28.37      | 0.84        |
| Bilinguals BSC | 31.93       | 0.86         | 29.77      | 0.84        |
| **Plume 32k**  | 30.44       | 0.86         | 28.46      | 0.84        |
| **Plume 128k** | 30.81       | 0.86         | 28.78      | 0.84        |
| **Plume 256k** | 30.72       | 0.86         | 28.87      | 0.84        |

## Citation

```bibtex
@misc{gilabert2024investigating,
      title={Investigating the translation capabilities of Large Language Models trained on parallel data only},
      author={Javier García Gilabert and Carlos Escolano and Aleix Sant Savall and Francesca De Luca Fornaciari and Audrey Mash and Xixian Liao and Maite Melero},
      year={2024},
      eprint={2406.09140},
      archivePrefix={arXiv}
}
```
## Additional information

### Author
The Language Technologies Unit from Barcelona Supercomputing Center.

### Contact
Feel free to write to us with any questions you may have at {javier.garcia1, carlos.escolano, aleix.santsavall, francesca.delucafornaciari, audrey.mash, xixian.liao, maite.melero}@bsc.es.
### Copyright
Copyright (c) 2023 by Language Technologies Unit, Barcelona Supercomputing Center.

### License
[Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)

### Funding

This work has been promoted and financed by the Government of Catalonia through the [Aina](https://projecteaina.cat/) project, by the Ministerio para la Transformación Digital y de la Función Pública and Plan de Recuperación, Transformación y Resiliencia - Funded by EU – NextGenerationEU within the framework of the project [ILENIA](https://proyectoilenia.es/) with reference 2022/TL22/00215337, as well as by [DeepR3](https://ixa2.si.ehu.eus/deepr3/) (TED2021-130295B-C32) funded by MCIN/AEI/10.13039/501100011033 and European Union NextGeneration EU/PRTR.
### Disclaimer

<details>
<summary>Click to expand</summary>

The model published in this repository is intended for a generalist purpose and is available to third parties under a permissive Apache License, Version 2.0.

Be aware that the model may have biases and/or any other undesirable distortions.

When third parties deploy or provide systems and/or services to other parties using this model (or any system based on it) or become users of the model, they should note that it is their responsibility to mitigate the risks arising from its use and, in any event, to comply with applicable regulations, including regulations regarding the use of Artificial Intelligence.

In no event shall the owner and creator of the model (Barcelona Supercomputing Center) be liable for any results arising from the use made by third parties.

</details>