---
license: apache-2.0
pipeline_tag: text-generation
language:
- fr
- en
tags:
- pretrained
- llama-3
- openllm-france
datasets:
- cmh/alpaca_data_cleaned_fr_52k
- OpenLLM-France/Croissant-Aligned-Instruct
- Gael540/dataSet_ens_sup_fr-v1
- ai2-adapt-dev/flan_v2_converted
- teknium/OpenHermes-2.5
- allenai/tulu-3-sft-personas-math
- allenai/tulu-3-sft-personas-math-grade
- allenai/WildChat-1M
base_model:
- OpenLLM-France/Lucie-7B
widget:
- text: |-
    Quelle est la capitale de l'Espagne ? Madrid.
    Quelle est la capitale de la France ?
  example_title: Capital cities in French
  group: 1-shot Question Answering
training_progress:
  context_length: 32000
---

# Model Card for Lucie-7B-Instruct-v1.1

* [Model Description](#model-description)
<!-- * [Uses](#uses) -->
* [Training Details](#training-details)
  * [Training Data](#training-data)
  * [Preprocessing](#preprocessing)
  * [Instruction template](#instruction-template)
  * [Training Procedure](#training-procedure)
<!-- * [Evaluation](#evaluation) -->
* [Testing the model with ollama](#testing-the-model-with-ollama)
* [Citation](#citation)
* [Acknowledgements](#acknowledgements)
* [Contact](#contact)

## Model Description

Lucie-7B-Instruct-v1.1-gguf is a quantized version of [Lucie-7B-Instruct-v1.1](https://huggingface.co/OpenLLM-France/Lucie-7B-Instruct-v1.1) (see [llama.cpp](https://github.com/ggerganov/llama.cpp) for quantization details). Lucie-7B-Instruct-v1.1 is a fine-tuned version of [Lucie-7B](https://huggingface.co/OpenLLM-France/Lucie-7B), an open-source, multilingual causal language model created by OpenLLM-France.

Lucie-7B-Instruct is fine-tuned on a mixture of human-templated and synthetic instructions (produced by ChatGPT) and a small set of customized prompts about OpenLLM and Lucie.

Note that this instruction training is light and is meant to allow Lucie to produce responses of a desired type (answer, summary, list, etc.). Lucie-7B-Instruct-v1.1 would need further training before being deployed in pipelines for specific use cases or for particular generation tasks such as code generation or mathematical problem solving. It is also susceptible to hallucinations, that is, producing responses that sound plausible but are false. Its performance and accuracy can be improved through further fine-tuning and through alignment techniques such as DPO or RLHF.

Due to its size, Lucie-7B is limited in the information that it can memorize; its ability to produce correct answers could be improved by deploying the model in a retrieval-augmented generation (RAG) pipeline.

While Lucie-7B-Instruct is trained on sequences of 4096 tokens, its base model, Lucie-7B, has a context size of 32K tokens. Based on needle-in-a-haystack evaluations, Lucie-7B-Instruct maintains the capacity of the base model to handle 32K-token context windows.
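
A quick way to try the quantized model locally is [llama-cpp-python](https://github.com/abetlen/llama-cpp-python). The sketch below is illustrative rather than an official recipe; it assumes the `Lucie-7B-Instruct-v1.1-q4_k_m.gguf` file from this repository has been downloaded to the working directory:

```python
# Minimal sketch: run the quantized model with llama-cpp-python
# (pip install llama-cpp-python). The model path is an assumption;
# point it at wherever you downloaded the GGUF file.
from llama_cpp import Llama

llm = Llama(
    model_path="Lucie-7B-Instruct-v1.1-q4_k_m.gguf",
    n_ctx=4096,  # instruction tuning used 4096-token sequences
)

# create_chat_completion formats the conversation with the model's
# chat template and returns an OpenAI-style response dict.
response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Quelle est la capitale de la France ?"},
    ],
    max_tokens=128,
)
print(response["choices"][0]["message"]["content"])
```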

## Training details

### Training data

Lucie-7B-Instruct-v1.1 is trained on the following datasets:
* [Alpaca-cleaned-fr](https://huggingface.co/datasets/cmh/alpaca_data_cleaned_fr_52k) (French; 51,655 samples)
* [Croissant-Aligned-Instruct](https://huggingface.co/datasets/OpenLLM-France/Croissant-Aligned-Instruct) (English-French; 20,000 samples taken from 80,000 total)
* [ENS](https://huggingface.co/datasets/Gael540/dataSet_ens_sup_fr-v1) (French; 394 samples)
* [FLAN v2 Converted](https://huggingface.co/datasets/ai2-adapt-dev/flan_v2_converted) (English; 78,580 samples)
* [Open Hermes 2.5](https://huggingface.co/datasets/teknium/OpenHermes-2.5) (English; 1,000,495 samples)
* [Oracle](https://github.com/opinionscience/InstructionFr/tree/main/wikipedia) (French; 4,613 samples)
* [PIAF](https://www.data.gouv.fr/fr/datasets/piaf-le-dataset-francophone-de-questions-reponses/) (French; 1,849 samples)
* [TULU3 Personas Math](https://huggingface.co/datasets/allenai/tulu-3-sft-personas-math)
* [TULU3 Personas Math Grade](https://huggingface.co/datasets/allenai/tulu-3-sft-personas-math-grade)
* [WildChat](https://huggingface.co/datasets/allenai/WildChat-1M) (French subset; 26,436 samples)
* Hard-coded prompts concerning OpenLLM and Lucie (based on [allenai/tulu-3-hard-coded-10x](https://huggingface.co/datasets/allenai/tulu-3-hard-coded-10x))
  * French: openllm_french.jsonl (24x10 samples)
  * English: openllm_english.jsonl (24x10 samples)

One epoch was passed over each dataset, except for Croissant-Aligned-Instruct, for which we randomly selected 20,000 translation pairs.

### Preprocessing
* Filtering by keyword: Examples from the four synthetic datasets were filtered out if the assistant response contained a keyword from the list [filter_strings](https://github.com/OpenLLM-France/Lucie-Training/blob/98792a1a9015dcf613ff951b1ce6145ca8ecb174/tokenization/data.py#L2012). This filter is designed to remove examples in which the assistant presents itself as a model other than Lucie (e.g., ChatGPT, Gemma, Llama, ...); the logic is sketched below.
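
The actual filtering code lives in the linked `data.py`; the snippet below is only an illustrative sketch of the logic, with a hypothetical, abbreviated keyword list (the real `filter_strings` list is much longer):

```python
# Illustrative sketch of the keyword filter; not the project's actual code.
# FILTER_STRINGS is a hypothetical, abbreviated stand-in for the real
# filter_strings list linked above.
FILTER_STRINGS = ["ChatGPT", "Gemma", "Llama"]

def keep_example(example: dict) -> bool:
    """Return False if any assistant turn mentions a filtered model name."""
    return not any(
        turn["role"] == "assistant"
        and any(keyword in turn["content"] for keyword in FILTER_STRINGS)
        for turn in example["messages"]
    )

# Example: the first conversation is dropped, the second is kept.
samples = [
    {"messages": [{"role": "assistant", "content": "As ChatGPT, I cannot..."}]},
    {"messages": [{"role": "assistant", "content": "Paris is the capital."}]},
]
print([keep_example(s) for s in samples])  # [False, True]
```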

### Instruction template

Lucie-7B-Instruct-v1.1 was trained on the chat template from Llama 3.1, with the sole difference that `<|begin_of_text|>` is replaced with `<s>`. The resulting template:

```
<s><|start_header_id|>system<|end_header_id|>

{SYSTEM}<|eot_id|><|start_header_id|>user<|end_header_id|>

{INPUT}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{OUTPUT}<|eot_id|>
```

An example:

```
<s><|start_header_id|>system<|end_header_id|>

You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

Give me three tips for staying in shape.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

1. Eat a balanced diet and be sure to include plenty of fruits and vegetables.
2. Exercise regularly to keep your body active and strong.
3. Get enough sleep and maintain a consistent sleep schedule.<|eot_id|>
```
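
For reference, the template can be rendered by hand as in the sketch below. This is only needed when calling a raw completion endpoint; chat-oriented runtimes (e.g., Ollama with the provided `Modelfile`) apply the template automatically. The function name here is ours, not part of any official API:

```python
# Sketch: render the Lucie chat template by hand for a single-turn exchange.
# build_prompt is a hypothetical helper, not an official API.
def build_prompt(system: str, user: str) -> str:
    return (
        "<s><|start_header_id|>system<|end_header_id|>\n\n"
        f"{system}<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n"
        f"{user}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
    )

prompt = build_prompt(
    "You are a helpful assistant.",
    "Give me three tips for staying in shape.",
)
print(prompt)  # ends with the assistant header, ready for generation
```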

### Training procedure

The model architecture and hyperparameters are the same as for [Lucie-7B](https://huggingface.co/OpenLLM-France/Lucie-7B) during the annealing phase, with the following exceptions:
* context length: 4096<sup>*</sup>
* batch size: 1024
* max learning rate: 3e-5
* min learning rate: 3e-6

<sup>*</sup>As noted above, while Lucie-7B-Instruct is trained on sequences of 4096 tokens, it maintains the capacity of the base model, Lucie-7B, to handle context sizes of up to 32K tokens.

## Testing the model with ollama

* Download and install [Ollama](https://ollama.com/download)
* Download the [GGUF model](https://huggingface.co/OpenLLM-France/Lucie-7B-Instruct-v1.1-gguf/blob/main/Lucie-7B-Instruct-v1.1-q4_k_m.gguf)
* Copy the [`Modelfile`](https://huggingface.co/OpenLLM-France/Lucie-7B-Instruct-v1.1-gguf/blob/main/Modelfile), adapting the path to the GGUF file if necessary (the line starting with `FROM`).
* Run in a shell (a programmatic alternative is sketched below):
  * `ollama create -f Modelfile Lucie`
  * `ollama run Lucie`
* Once the `>>>` prompt appears, type your prompt(s) and press Enter.
* Optionally, restart a conversation by typing `/clear`.
* End the session by typing `/bye`.
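
Once the model has been created, it can also be queried programmatically through Ollama's local REST API. A minimal sketch in Python, assuming Ollama is running on its default port (11434) and the model was registered under the name `Lucie` as above:

```python
# Query the locally served model through Ollama's REST API.
# Assumptions: Ollama is running on the default port 11434 and the
# model was registered as "Lucie" with `ollama create` (see above).
import requests

response = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "Lucie",
        "messages": [
            {"role": "user", "content": "Quelle est la capitale de la France ?"}
        ],
        "stream": False,  # return one JSON object instead of a stream
    },
)
print(response.json()["message"]["content"])
```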

Useful for debugging:
* [How to print input requests and output responses in Ollama server?](https://stackoverflow.com/a/78831840)
* [Documentation on Modelfile](https://github.com/ollama/ollama/blob/main/docs/modelfile.md#parameter)
* Examples: [Ollama model library](https://github.com/ollama/ollama#model-library)
  * Llama 3 example: https://ollama.com/library/llama3.1
* Add a GUI: https://docs.openwebui.com/

## Citation

When using the Lucie-7B-Instruct model, please cite the following paper:

✍ Olivier Gouvert, Julie Hunter, Jérôme Louradour, Christophe Cérisara, Evan Dufraisse, Yaya Sy, Laura Rivière, Jean-Pierre Lorré (2025). The Lucie-7B LLM and the Lucie Training Dataset: open resources for multilingual language generation.

```bibtex
@misc{openllm2025lucie,
  title={The Lucie-7B LLM and the Lucie Training Dataset: open resources for multilingual language generation},
  author={Olivier Gouvert and Julie Hunter and Jérôme Louradour and Christophe Cérisara and Evan Dufraisse and Yaya Sy and Laura Rivière and Jean-Pierre Lorré},
  year={2025},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```

## Acknowledgements

This work was performed using HPC resources from GENCI–IDRIS (Grant 2024-GC011015444). We gratefully acknowledge support from GENCI and IDRIS and, in particular, from Pierre-François Lavallée (IDRIS) and Stephane Requena (GENCI).

Lucie-7B-Instruct-v1.1 was created by members of [LINAGORA](https://labs.linagora.com/) and the [OpenLLM-France](https://www.openllm-france.fr/) community, including, in alphabetical order:
Olivier Gouvert (LINAGORA),
Ismaïl Harrando (LINAGORA/SciencesPo),
Julie Hunter (LINAGORA),
Jean-Pierre Lorré (LINAGORA),
Jérôme Louradour (LINAGORA),
Michel-Marie Maudet (LINAGORA), and
Laura Rivière (LINAGORA).

We thank
Clément Bénesse (Opsci),
Christophe Cérisara (LORIA),
Émile Hazard (Opsci),
Evan Dufraisse (CEA List),
Guokan Shang (MBZUAI),
Joël Gombin (Opsci),
Jordan Ricker (Opsci), and
Olivier Ferret (CEA List)
for their helpful input.

Finally, we thank the entire OpenLLM-France community, whose members have helped in diverse ways.

## Contact