---
license: apache-2.0
pipeline_tag: text-generation
language:
- fr
- en
tags:
- pretrained
- llama-3
- openllm-france
datasets:
- cmh/alpaca_data_cleaned_fr_52k
- OpenLLM-France/Croissant-Aligned-Instruct
- Gael540/dataSet_ens_sup_fr-v1
- ai2-adapt-dev/flan_v2_converted
- teknium/OpenHermes-2.5
- allenai/tulu-3-sft-personas-math
- allenai/tulu-3-sft-personas-math-grade
- allenai/WildChat-1M
base_model:
- OpenLLM-France/Lucie-7B
widget:
- text: |-
    Quelle est la capitale de l'Espagne ? Madrid.
    Quelle est la capitale de la France ?
  example_title: Capital cities in French
  group: 1-shot Question Answering
training_progress:
  context_length: 32000
---

# Model Card for Lucie-7B-Instruct-v1.1

* [Model Description](#model-description)
<!-- * [Uses](#uses) -->
* [Training Details](#training-details)
  * [Training Data](#training-data)
  * [Preprocessing](#preprocessing)
  * [Instruction template](#instruction-template)
  * [Training Procedure](#training-procedure)
<!-- * [Evaluation](#evaluation) -->
* [Testing the model with ollama](#testing-the-model-with-ollama)
* [Citation](#citation)
* [Acknowledgements](#acknowledgements)
* [Contact](#contact)

## Model Description

Lucie-7B-Instruct-v1.1-gguf is a quantized version of [Lucie-7B-Instruct-v1.1](https://huggingface.co/OpenLLM-France/Lucie-7B-Instruct-v1.1) (see [llama.cpp](https://github.com/ggerganov/llama.cpp) for quantization details). Lucie-7B-Instruct-v1.1 is a fine-tuned version of [Lucie-7B](https://huggingface.co/OpenLLM-France/Lucie-7B), an open-source, multilingual causal language model created by OpenLLM-France.

Lucie-7B-Instruct is fine-tuned on a mixture of human-templated and synthetic instructions (produced by ChatGPT) and a small set of customized prompts about OpenLLM and Lucie.

Note that this instruction training is light and is meant to allow Lucie to produce responses of a desired type (answer, summary, list, etc.). Lucie-7B-Instruct-v1.1 would need further training before being deployed in pipelines for specific use cases or for particular generation tasks such as code generation or mathematical problem solving. It is also susceptible to hallucinations, that is, producing responses that sound plausible but are false. Its performance and accuracy can be improved through further fine-tuning and through alignment techniques such as DPO or RLHF.

Due to its size, Lucie-7B is limited in the information that it can memorize; its ability to produce correct answers could be improved by deploying the model in a retrieval-augmented generation (RAG) pipeline.

While Lucie-7B-Instruct is trained on sequences of 4096 tokens, its base model, Lucie-7B, has a context size of 32K tokens. Based on needle-in-a-haystack evaluations, Lucie-7B-Instruct maintains the capacity of the base model to handle 32K-token context windows.
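
A quick way to try the quantized model locally is [llama-cpp-python](https://github.com/abetlen/llama-cpp-python). The sketch below is illustrative rather than an official recipe; it assumes the `Lucie-7B-Instruct-v1.1-q4_k_m.gguf` file from this repository has been downloaded to the working directory:

```python
# Minimal sketch: run the quantized model with llama-cpp-python
# (pip install llama-cpp-python). The model path is an assumption;
# point it at wherever you downloaded the GGUF file.
from llama_cpp import Llama

llm = Llama(
    model_path="Lucie-7B-Instruct-v1.1-q4_k_m.gguf",
    n_ctx=4096,  # instruction tuning used 4096-token sequences
)

# create_chat_completion formats the conversation with the model's
# chat template and returns an OpenAI-style response dict.
response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Quelle est la capitale de la France ?"},
    ],
    max_tokens=128,
)
print(response["choices"][0]["message"]["content"])
```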

## Training details

### Training data

Lucie-7B-Instruct-v1.1 is trained on the following datasets:
* [Alpaca-cleaned-fr](https://huggingface.co/datasets/cmh/alpaca_data_cleaned_fr_52k) (French; 51,655 samples)
* [Croissant-Aligned-Instruct](https://huggingface.co/datasets/OpenLLM-France/Croissant-Aligned-Instruct) (English-French; 20,000 samples taken from 80,000 total)
* [ENS](https://huggingface.co/datasets/Gael540/dataSet_ens_sup_fr-v1) (French; 394 samples)
* [FLAN v2 Converted](https://huggingface.co/datasets/ai2-adapt-dev/flan_v2_converted) (English; 78,580 samples)
* [Open Hermes 2.5](https://huggingface.co/datasets/teknium/OpenHermes-2.5) (English; 1,000,495 samples)
* [Oracle](https://github.com/opinionscience/InstructionFr/tree/main/wikipedia) (French; 4,613 samples)
* [PIAF](https://www.data.gouv.fr/fr/datasets/piaf-le-dataset-francophone-de-questions-reponses/) (French; 1,849 samples)
* [TULU3 Personas Math](https://huggingface.co/datasets/allenai/tulu-3-sft-personas-math)
* [TULU3 Personas Math Grade](https://huggingface.co/datasets/allenai/tulu-3-sft-personas-math-grade)
* [WildChat](https://huggingface.co/datasets/allenai/WildChat-1M) (French subset; 26,436 samples)
* Hard-coded prompts concerning OpenLLM and Lucie (based on [allenai/tulu-3-hard-coded-10x](https://huggingface.co/datasets/allenai/tulu-3-hard-coded-10x))
  * French: openllm_french.jsonl (24x10 samples)
  * English: openllm_english.jsonl (24x10 samples)

One epoch was passed over each dataset, except for Croissant-Aligned-Instruct, for which we randomly selected 20,000 translation pairs.

### Preprocessing
* Filtering by keyword: Examples from the four synthetic datasets were filtered out if the assistant response contained a keyword from the list [filter_strings](https://github.com/OpenLLM-France/Lucie-Training/blob/98792a1a9015dcf613ff951b1ce6145ca8ecb174/tokenization/data.py#L2012). This filter is designed to remove examples in which the assistant presents itself as a model other than Lucie (e.g., ChatGPT, Gemma, Llama, ...); the logic is sketched below.
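
The actual filtering code lives in the linked `data.py`; the snippet below is only an illustrative sketch of the logic, with a hypothetical, abbreviated keyword list (the real `filter_strings` list is much longer):

```python
# Illustrative sketch of the keyword filter; not the project's actual code.
# FILTER_STRINGS is a hypothetical, abbreviated stand-in for the real
# filter_strings list linked above.
FILTER_STRINGS = ["ChatGPT", "Gemma", "Llama"]

def keep_example(example: dict) -> bool:
    """Return False if any assistant turn mentions a filtered model name."""
    return not any(
        turn["role"] == "assistant"
        and any(keyword in turn["content"] for keyword in FILTER_STRINGS)
        for turn in example["messages"]
    )

# Example: the first conversation is dropped, the second is kept.
samples = [
    {"messages": [{"role": "assistant", "content": "As ChatGPT, I cannot..."}]},
    {"messages": [{"role": "assistant", "content": "Paris is the capital."}]},
]
print([keep_example(s) for s in samples])  # [False, True]
```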

### Instruction template

Lucie-7B-Instruct-v1.1 was trained on the chat template from Llama 3.1, with the sole difference that `<|begin_of_text|>` is replaced with `<s>`. The resulting template:

```
<s><|start_header_id|>system<|end_header_id|>

{SYSTEM}<|eot_id|><|start_header_id|>user<|end_header_id|>

{INPUT}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{OUTPUT}<|eot_id|>
```

An example:

```
<s><|start_header_id|>system<|end_header_id|>

You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

Give me three tips for staying in shape.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

1. Eat a balanced diet and be sure to include plenty of fruits and vegetables.
2. Exercise regularly to keep your body active and strong.
3. Get enough sleep and maintain a consistent sleep schedule.<|eot_id|>
```
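
For reference, the template can be rendered by hand as in the sketch below. This is only needed when calling a raw completion endpoint; chat-oriented runtimes (e.g., Ollama with the provided `Modelfile`) apply the template automatically. The function name here is ours, not part of any official API:

```python
# Sketch: render the Lucie chat template by hand for a single-turn exchange.
# build_prompt is a hypothetical helper, not an official API.
def build_prompt(system: str, user: str) -> str:
    return (
        "<s><|start_header_id|>system<|end_header_id|>\n\n"
        f"{system}<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n"
        f"{user}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
    )

prompt = build_prompt(
    "You are a helpful assistant.",
    "Give me three tips for staying in shape.",
)
print(prompt)  # ends with the assistant header, ready for generation
```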

### Training procedure

The model architecture and hyperparameters are the same as for [Lucie-7B](https://huggingface.co/OpenLLM-France/Lucie-7B) during the annealing phase, with the following exceptions:
* context length: 4096<sup>*</sup>
* batch size: 1024
* max learning rate: 3e-5
* min learning rate: 3e-6

<sup>*</sup>As noted above, while Lucie-7B-Instruct is trained on sequences of 4096 tokens, it maintains the capacity of the base model, Lucie-7B, to handle context sizes of up to 32K tokens.

## Testing the model with ollama

* Download and install [Ollama](https://ollama.com/download)
* Download the [GGUF model](https://huggingface.co/OpenLLM-France/Lucie-7B-Instruct-v1.1-gguf/blob/main/Lucie-7B-Instruct-v1.1-q4_k_m.gguf)
* Copy the [`Modelfile`](https://huggingface.co/OpenLLM-France/Lucie-7B-Instruct-v1.1-gguf/blob/main/Modelfile), adapting the path to the GGUF file if necessary (the line starting with `FROM`).
* Run in a shell (a programmatic alternative is sketched below):
  * `ollama create -f Modelfile Lucie`
  * `ollama run Lucie`
* Once the `>>>` prompt appears, type your prompt(s) and press Enter.
* Optionally, restart a conversation by typing `/clear`.
* End the session by typing `/bye`.
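
Once the model has been created, it can also be queried programmatically through Ollama's local REST API. A minimal sketch in Python, assuming Ollama is running on its default port (11434) and the model was registered under the name `Lucie` as above:

```python
# Query the locally served model through Ollama's REST API.
# Assumptions: Ollama is running on the default port 11434 and the
# model was registered as "Lucie" with `ollama create` (see above).
import requests

response = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "Lucie",
        "messages": [
            {"role": "user", "content": "Quelle est la capitale de la France ?"}
        ],
        "stream": False,  # return one JSON object instead of a stream
    },
)
print(response.json()["message"]["content"])
```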

Useful for debugging:
* [How to print input requests and output responses in Ollama server?](https://stackoverflow.com/a/78831840)
* [Documentation on Modelfile](https://github.com/ollama/ollama/blob/main/docs/modelfile.md#parameter)
* Examples: [Ollama model library](https://github.com/ollama/ollama#model-library)
  * Llama 3 example: https://ollama.com/library/llama3.1
* Add a GUI: https://docs.openwebui.com/

## Citation

When using the Lucie-7B-Instruct model, please cite the following paper:

✍ Olivier Gouvert, Julie Hunter, Jérôme Louradour, Christophe Cérisara, Evan Dufraisse, Yaya Sy, Laura Rivière, Jean-Pierre Lorré (2025). The Lucie-7B LLM and the Lucie Training Dataset: open resources for multilingual language generation.

```bibtex
@misc{openllm2025lucie,
  title={The Lucie-7B LLM and the Lucie Training Dataset: open resources for multilingual language generation},
  author={Olivier Gouvert and Julie Hunter and Jérôme Louradour and Christophe Cérisara and Evan Dufraisse and Yaya Sy and Laura Rivière and Jean-Pierre Lorré},
  year={2025},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```

## Acknowledgements

This work was performed using HPC resources from GENCI–IDRIS (Grant 2024-GC011015444). We gratefully acknowledge support from GENCI and IDRIS and, in particular, from Pierre-François Lavallée (IDRIS) and Stephane Requena (GENCI).

Lucie-7B-Instruct-v1.1 was created by members of [LINAGORA](https://labs.linagora.com/) and the [OpenLLM-France](https://www.openllm-france.fr/) community, including, in alphabetical order:
Olivier Gouvert (LINAGORA),
Ismaïl Harrando (LINAGORA/SciencesPo),
Julie Hunter (LINAGORA),
Jean-Pierre Lorré (LINAGORA),
Jérôme Louradour (LINAGORA),
Michel-Marie Maudet (LINAGORA), and
Laura Rivière (LINAGORA).

We thank
Clément Bénesse (Opsci),
Christophe Cérisara (LORIA),
Émile Hazard (Opsci),
Evan Dufraisse (CEA List),
Guokan Shang (MBZUAI),
Joël Gombin (Opsci),
Jordan Ricker (Opsci), and
Olivier Ferret (CEA List)
for their helpful input.

Finally, we thank the entire OpenLLM-France community, whose members have helped in diverse ways.

## Contact