Update README.md
---
license: apache-2.0
pipeline_tag: text-generation
language:
- fr
- en
tags:
- pretrained
- llama-3
- openllm-france
datasets:
- cmh/alpaca_data_cleaned_fr_52k
- OpenLLM-France/Croissant-Aligned-Instruct
- Gael540/dataSet_ens_sup_fr-v1
- ai2-adapt-dev/flan_v2_converted
- teknium/OpenHermes-2.5
- allenai/tulu-3-sft-personas-math
- allenai/tulu-3-sft-personas-math-grade
- allenai/WildChat-1M
base_model:
- OpenLLM-France/Lucie-7B
widget:
- text: |-
    Quelle est la capitale de l'Espagne ? Madrid.
    Quelle est la capitale de la France ?
  example_title: Capital cities in French
  group: 1-shot Question Answering
training_progress:
  context_length: 32000
---

# Model Card for Lucie-7B-Instruct-v1.1

* [Model Description](#model-description)
<!-- * [Uses](#uses) -->
* [Training Details](#training-details)
  * [Training Data](#training-data)
  * [Preprocessing](#preprocessing)
  * [Instruction template](#instruction-template)
  * [Training Procedure](#training-procedure)
<!-- * [Evaluation](#evaluation) -->
* [Testing the model with Ollama](#testing-the-model-with-ollama)
* [Citation](#citation)
* [Acknowledgements](#acknowledgements)
* [Contact](#contact)

## Model Description

Lucie-7B-Instruct-v1.1-gguf is a quantized version of [Lucie-7B-Instruct-v1.1](https://huggingface.co/OpenLLM-France/Lucie-7B-Instruct-v1.1) (see [llama.cpp](https://github.com/ggerganov/llama.cpp) for quantization details). Lucie-7B-Instruct-v1.1 is a fine-tuned version of [Lucie-7B](https://huggingface.co/OpenLLM-France/Lucie-7B), an open-source, multilingual causal language model created by OpenLLM-France.

Lucie-7B-Instruct is fine-tuned on a mixture of human-templated and synthetic instructions (produced by ChatGPT) and a small set of customized prompts about OpenLLM and Lucie.

Note that this instruction training is light and is meant to allow Lucie to produce responses of a desired type (answer, summary, list, etc.). Lucie-7B-Instruct-v1.1 would need further training before being deployed in pipelines for specific use cases or for particular generation tasks such as code generation or mathematical problem solving. It is also susceptible to hallucinations; that is, producing false answers that result from its training. Its performance and accuracy can be improved through further fine-tuning and alignment with methods such as DPO, RLHF, etc.

Due to its size, Lucie-7B is limited in the information that it can memorize; its ability to produce correct answers could be improved by implementing the model in a retrieval-augmented generation pipeline.

While Lucie-7B-Instruct is trained on sequences of 4096 tokens, its base model, Lucie-7B, has a context size of 32K tokens. Based on needle-in-a-haystack evaluations, Lucie-7B-Instruct maintains the capacity of the base model to handle 32K-token context windows.
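To try the quantized weights directly, outside of the Ollama workflow described below, the GGUF file can be loaded from Python. The snippet below is a minimal sketch rather than an official usage example: it assumes the [llama-cpp-python](https://github.com/abetlen/llama-cpp-python) bindings are installed, that `Lucie-7B-Instruct-v1.1-q4_k_m.gguf` has been downloaded to the working directory, and that the GGUF metadata carries the chat template.

```python
# Sketch (assumptions: llama-cpp-python installed, GGUF file downloaded locally).
from llama_cpp import Llama

# Path to the downloaded quantized weights; adjust to your local location.
llm = Llama(
    model_path="Lucie-7B-Instruct-v1.1-q4_k_m.gguf",
    n_ctx=4096,  # instruction tuning used 4096-token sequences
)

# If the GGUF metadata includes the chat template, the chat API applies it automatically.
response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Quelle est la capitale de la France ?"},
    ],
    max_tokens=128,
)
print(response["choices"][0]["message"]["content"])
```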
## Training Details

### Training Data

Lucie-7B-Instruct-v1.1 is trained on the following datasets:
* [Alpaca-cleaned-fr](https://huggingface.co/datasets/cmh/alpaca_data_cleaned_fr_52k) (French; 51,655 samples)
* [Croissant-Aligned-Instruct](https://huggingface.co/datasets/OpenLLM-France/Croissant-Aligned-Instruct) (English-French; 20,000 samples taken from 80,000 total)
* [ENS](https://huggingface.co/datasets/Gael540/dataSet_ens_sup_fr-v1) (French; 394 samples)
* [FLAN v2 Converted](https://huggingface.co/datasets/ai2-adapt-dev/flan_v2_converted) (English; 78,580 samples)
* [Open Hermes 2.5](https://huggingface.co/datasets/teknium/OpenHermes-2.5) (English; 1,000,495 samples)
* [Oracle](https://github.com/opinionscience/InstructionFr/tree/main/wikipedia) (French; 4,613 samples)
* [PIAF](https://www.data.gouv.fr/fr/datasets/piaf-le-dataset-francophone-de-questions-reponses/) (French; 1,849 samples)
* [TULU3 Personas Math](https://huggingface.co/datasets/allenai/tulu-3-sft-personas-math)
* [TULU3 Personas Math Grade](https://huggingface.co/datasets/allenai/tulu-3-sft-personas-math-grade)
* [Wildchat](https://huggingface.co/datasets/allenai/WildChat-1M) (French subset; 26,436 samples)
* Hard-coded prompts concerning OpenLLM and Lucie (based on [allenai/tulu-3-hard-coded-10x](https://huggingface.co/datasets/allenai/tulu-3-hard-coded-10x))
  * French: openllm_french.jsonl (24x10 samples)
  * English: openllm_english.jsonl (24x10 samples)

One epoch was performed on each dataset, except for Croissant-Aligned-Instruct, for which we randomly selected 20,000 translation pairs out of the 80,000 available.

### Preprocessing
* Filtering by keyword: examples were filtered out of the four synthetic datasets if the assistant response contained a keyword from the list [filter_strings](https://github.com/OpenLLM-France/Lucie-Training/blob/98792a1a9015dcf613ff951b1ce6145ca8ecb174/tokenization/data.py#L2012). This filter is designed to remove examples in which the assistant is presented as a model other than Lucie (e.g., ChatGPT, Gemma, Llama, ...).
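For illustration, the sketch below shows the general shape of such a keyword filter over assistant turns. The keyword list and the example records are placeholders; the actual list is the `filter_strings` variable in the Lucie-Training repository linked above.

```python
# Illustrative sketch of keyword filtering (placeholder keywords; the real
# list is `filter_strings` in the Lucie-Training repository linked above).
FILTER_STRINGS = ["ChatGPT", "Gemma", "Llama"]  # placeholders, not the real list

def keep_example(example: dict) -> bool:
    """Return False if any assistant turn mentions a filtered model name."""
    for turn in example["messages"]:
        if turn["role"] == "assistant" and any(s in turn["content"] for s in FILTER_STRINGS):
            return False
    return True

# Toy records in a chat-style format; only the first one survives the filter.
dataset = [
    {"messages": [{"role": "user", "content": "Qui es-tu ?"},
                  {"role": "assistant", "content": "Je suis Lucie."}]},
    {"messages": [{"role": "user", "content": "Who are you?"},
                  {"role": "assistant", "content": "I am ChatGPT."}]},
]
filtered = [ex for ex in dataset if keep_example(ex)]
```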
### Instruction template

Lucie-7B-Instruct-v1.1 was trained on the chat template from Llama 3.1, with the sole difference that `<|begin_of_text|>` is replaced with `<s>`. The resulting template:

```
<s><|start_header_id|>system<|end_header_id|>

{SYSTEM}<|eot_id|><|start_header_id|>user<|end_header_id|>

{INPUT}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{OUTPUT}<|eot_id|>
```

An example:

```
<s><|start_header_id|>system<|end_header_id|>

You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

Give me three tips for staying in shape.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

1. Eat a balanced diet and be sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule.<|eot_id|>
```
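For reference, the same prompt can be rendered programmatically. The snippet below is a sketch that assumes the chat template is bundled with the tokenizer of [Lucie-7B-Instruct-v1.1](https://huggingface.co/OpenLLM-France/Lucie-7B-Instruct-v1.1) on the Hugging Face Hub and that the `transformers` library is installed.

```python
# Sketch: render the instruction template with the Hub tokenizer
# (assumption: the chat template ships with the tokenizer files).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("OpenLLM-France/Lucie-7B-Instruct-v1.1")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Give me three tips for staying in shape."},
]

# Produce the prompt text shown above, ending with the assistant header so
# that generation continues from the assistant turn.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```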
### Training procedure

The model architecture and hyperparameters are the same as for [Lucie-7B](https://huggingface.co/OpenLLM-France/Lucie-7B) during the annealing phase, with the following exceptions:
* context length: 4096<sup>*</sup>
* batch size: 1024
* max learning rate: 3e-5
* min learning rate: 3e-6

<sup>*</sup>As noted above, while Lucie-7B-Instruct is trained on sequences of 4096 tokens, it maintains the capacity of the base model, Lucie-7B, to handle context sizes of up to 32K tokens.

## Testing the model with Ollama

* Download and install [Ollama](https://ollama.com/download)
* Download the [GGUF model](https://huggingface.co/OpenLLM-France/Lucie-7B-Instruct-v1.1-gguf/blob/main/Lucie-7B-Instruct-v1.1-q4_k_m.gguf)
* Copy the [`Modelfile`](https://huggingface.co/OpenLLM-France/Lucie-7B-Instruct-v1.1-gguf/blob/main/Modelfile), adapting the path to the GGUF file if necessary (the line starting with `FROM`)
* Run in a shell:
  * `ollama create -f Modelfile Lucie`
  * `ollama run Lucie`
  * Once ">>>" appears, type your prompt(s) and press Enter.
  * Optionally, restart a conversation by typing "`/clear`"
  * End the session by typing "`/bye`".
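Beyond the interactive prompt, the local Ollama server can also be queried programmatically over its HTTP API. The snippet below is a sketch under a few assumptions: Ollama is running on its default port 11434, the model was created under the name `Lucie` as above, and the third-party `requests` package is installed.

```python
# Sketch: query the local Ollama server for the model created above with
# `ollama create -f Modelfile Lucie`. Assumes the default port and `requests`.
import requests

response = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "Lucie",
        "messages": [
            {"role": "user", "content": "Quelle est la capitale de la France ?"}
        ],
        "stream": False,  # return a single JSON object instead of a token stream
    },
    timeout=120,
)
print(response.json()["message"]["content"])
```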

Useful for debugging:
* [How to print input requests and output responses in Ollama server?](https://stackoverflow.com/a/78831840)
* [Documentation on Modelfile](https://github.com/ollama/ollama/blob/main/docs/modelfile.md#parameter)
* Examples: [Ollama model library](https://github.com/ollama/ollama#model-library)
  * Llama 3 example: https://ollama.com/library/llama3.1
* Add a GUI: https://docs.openwebui.com/

## Citation

When using the Lucie-7B-Instruct model, please cite the following paper:

✍ Olivier Gouvert, Julie Hunter, Jérôme Louradour, Christophe Cérisara,
Evan Dufraisse, Yaya Sy, Laura Rivière, Jean-Pierre Lorré (2025).
The Lucie-7B LLM and the Lucie Training Dataset:
open resources for multilingual language generation
```bibtex
@misc{openllm2025lucie,
      title={The Lucie-7B LLM and the Lucie Training Dataset:
      open resources for multilingual language generation},
      author={Olivier Gouvert and Julie Hunter and Jérôme Louradour and Christophe Cérisara and Evan Dufraisse and Yaya Sy and Laura Rivière and Jean-Pierre Lorré},
      year={2025},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

## Acknowledgements

This work was performed using HPC resources from GENCI–IDRIS (Grant 2024-GC011015444). We gratefully acknowledge support from GENCI and IDRIS and from Pierre-François Lavallée (IDRIS) and Stephane Requena (GENCI) in particular.

Lucie-7B-Instruct-v1.1 was created by members of [LINAGORA](https://labs.linagora.com/) and the [OpenLLM-France](https://www.openllm-france.fr/) community, including in alphabetical order:
Olivier Gouvert (LINAGORA),
Ismaïl Harrando (LINAGORA/SciencesPo),
Julie Hunter (LINAGORA),
Jean-Pierre Lorré (LINAGORA),
Jérôme Louradour (LINAGORA),
Michel-Marie Maudet (LINAGORA), and
Laura Rivière (LINAGORA).

We thank
Clément Bénesse (Opsci),
Christophe Cerisara (LORIA),
Émile Hazard (Opsci),
Evan Dufraisse (CEA List),
Guokan Shang (MBZUAI),
Joël Gombin (Opsci),
Jordan Ricker (Opsci),
and
Olivier Ferret (CEA List)
for their helpful input.

Finally, we thank the entire OpenLLM-France community, whose members have helped in diverse ways.

## Contact