ChocoLlama/Llama-3-ChocoLlama-8B-instruct

@@ -1,47 +1,128 @@
 ---
-license: llama3
-base_model: llama-2-nl/Meta-Llama-3-8B-lora-original
-tags:
-- alignment-handbook
-- generated_from_trainer
 datasets:
 - BramVanroy/ultrachat_200k_dutch
 - BramVanroy/stackoverflow-chat-dutch
 - BramVanroy/alpaca-cleaned-dutch
 - BramVanroy/dolly-15k-dutch
 - BramVanroy/no_robots_dutch
-model-index:
-- name: Meta-Llama-3-8B-lora-original-sft
-  results: []
 ---
-<!-- This model card has been generated automatically according to the information the Trainer had access to. You
-should probably proofread and complete it, then remove this comment. -->
-# Meta-Llama-3-8B-lora-original-sft
-This model is a fine-tuned version of [llama-2-nl/Meta-Llama-3-8B-lora-original](https://huggingface.co/llama-2-nl/Meta-Llama-3-8B-lora-original) on the BramVanroy/ultrachat_200k_dutch, the BramVanroy/stackoverflow-chat-dutch, the BramVanroy/alpaca-cleaned-dutch, the BramVanroy/dolly-15k-dutch and the BramVanroy/no_robots_dutch datasets.
-It achieves the following results on the evaluation set:
-- Loss: 1.0188
-## Model description
-More information needed
-## Intended uses & limitations
-More information needed
-## Training and evaluation data
-More information needed
-## Training procedure
-### Training hyperparameters
-The following hyperparameters were used during training:
-- learning_rate: 2e-05
 - train_batch_size: 4
 - eval_batch_size: 4
 - seed: 42
@@ -55,16 +136,39 @@ The following hyperparameters were used during training:
 - lr_scheduler_warmup_ratio: 0.1
 - num_epochs: 1
-### Training results
-| Training Loss | Epoch | Step | Validation Loss |
-|:-------------:|:-----:|:----:|:---------------:|
-| 1.0172        | 1.0   | 812  | 1.0188          |
-### Framework versions
-- Transformers 4.40.1
-- Pytorch 2.1.2
-- Datasets 2.19.0
-- Tokenizers 0.19.1

 ---
+language:
+- nl
+license: cc-by-nc-4.0
+base_model: ChocoLlama/Llama-3-ChocoLlama-8B-base
 datasets:
 - BramVanroy/ultrachat_200k_dutch
 - BramVanroy/stackoverflow-chat-dutch
 - BramVanroy/alpaca-cleaned-dutch
 - BramVanroy/dolly-15k-dutch
 - BramVanroy/no_robots_dutch
+- BramVanroy/ultra_feedback_dutch
 ---
+<p align="center" style="margin:0;padding:0">
+<img src="./chocollama_logo.png" alt="ChocoLlama logo" width="500" style="margin-left:auto; margin-right:auto; display:block"/>
+</p>
+<div style="margin:auto; text-align:center">
+<h1 style="margin-bottom: 0">ChocoLlama</h1>
+<em>A Llama-2/3-based family of Dutch language models</em>
+</div>
+## Llama-3-ChocoLlama-8B-instruct: Getting Started
+Here we present **Llama-3-ChocoLlama-8B-instruct**, an instruction-tuned version of Llama-3-ChocoLlama-8B-base, fine-tuned on a collection of Dutch translations of instruction-tuning datasets, using SFT followed by DPO.
+Its base model, [Llama-3-ChocoLlama-8B-base](https://huggingface.co/ChocoLlama/Llama-3-ChocoLlama-8B-base), is a language-adapted version of Meta's Llama-3-8B, fine-tuned on the same Dutch dataset as ChocoLlama-2-7B-base (104GB, 32B Llama-2 tokens) using LoRA.
+Use the code below to get started with the model.
+```python
+from transformers import AutoTokenizer, AutoModelForCausalLM
+
+# Load the tokenizer and model; device_map="auto" needs the accelerate package
+# and places the weights on the available device(s).
+tokenizer = AutoTokenizer.from_pretrained('ChocoLlama/Llama-3-ChocoLlama-8B-instruct')
+model = AutoModelForCausalLM.from_pretrained('ChocoLlama/Llama-3-ChocoLlama-8B-instruct', device_map="auto")
+
+messages = [
+    {"role": "system", "content": "Je bent een artificiële intelligentie-assistent en geeft behulpzame, gedetailleerde en beleefde antwoorden op de vragen van de gebruiker."},
+    {"role": "user", "content": "Jacques Brel, Willem Elsschot en Jan Jambon zitten op café. Waar zouden ze over babbelen?"},
+]
+
+# Apply the model's chat template and append the assistant header,
+# so that generation starts at the assistant's reply.
+input_ids = tokenizer.apply_chat_template(
+    messages,
+    add_generation_prompt=True,
+    return_tensors="pt"
+).to(model.device)
+
+# Stop at either the regular EOS token or the end-of-turn token.
+new_terminators = [
+    tokenizer.eos_token_id,
+    tokenizer.convert_tokens_to_ids("<|eot_id|>")
+]
+
+outputs = model.generate(
+    input_ids,
+    max_new_tokens=512,
+    eos_token_id=new_terminators,
+    do_sample=True,
+    temperature=0.8,
+    top_p=0.95,
+)
+
+# Drop the prompt tokens and decode only the newly generated reply.
+response = outputs[0][input_ids.shape[-1]:]
+print(tokenizer.decode(response, skip_special_tokens=True))
+```
+Note that the datasets used for instruction-tuning were translated using GPT-3.5/4, which means that this instruction-tuned model cannot be used for commercial purposes.
+Hence, for any commercial applications, we recommend fine-tuning the base model on your own Dutch data.
+## Model Details
+ChocoLlama is a family of open LLMs specifically adapted to Dutch, contributing to the state of the art of Dutch open LLMs in their weight class.
+We provide 6 variants (of which 3 base and 3 instruction-tuned models):
+- **ChocoLlama-2-7B-base** ([link](https://huggingface.co/ChocoLlama/ChocoLlama-2-7B-base)): A language-adapted version of Meta's Llama-2-7b, fine-tuned on 32B Dutch Llama-2 tokens (104GB) using LoRA.
+- **ChocoLlama-2-7B-instruct** ([link](https://huggingface.co/ChocoLlama/ChocoLlama-2-7B-instruct)): An instruction-tuned version of ChocoLlama-2-7B-base, fine-tuned on a collection of Dutch translations of instruction-tuning datasets, using SFT followed by DPO.
+- **ChocoLlama-2-7B-tokentrans-base** ([link](https://huggingface.co/ChocoLlama/ChocoLlama-2-7B-tokentrans-base)): A language-adapted version of Meta's Llama-2-7b, using a Dutch RoBERTa-based tokenizer. The token embeddings of this model were reinitialized using the token translation algorithm proposed by [Remy et al.](https://arxiv.org/pdf/2310.03477). The model was subsequently fine-tuned on the same Dutch dataset as ChocoLlama-2-7B-base, again using LoRA.
+- **ChocoLlama-2-7B-tokentrans-instruct** ([link](https://huggingface.co/ChocoLlama/ChocoLlama-2-7B-tokentrans-instruct)): An instruction-tuned version of ChocoLlama-2-7B-tokentrans-base, fine-tuned on the same dataset as ChocoLlama-2-7B-instruct, again using SFT followed by DPO.
+- **Llama-3-ChocoLlama-8B-base** ([link](https://huggingface.co/ChocoLlama/Llama-3-ChocoLlama-8B-base)): A language-adapted version of Meta's Llama-3-8B, fine-tuned on the same Dutch dataset as ChocoLlama-2-7B-base, again using LoRA.
+- **Llama-3-ChocoLlama-8B-instruct** ([link](https://huggingface.co/ChocoLlama/Llama-3-ChocoLlama-8B-instruct)): An instruction-tuned version of Llama-3-ChocoLlama-8B-base, fine-tuned on the same dataset as ChocoLlama-2-7B-instruct, again using SFT followed by DPO.
+For benchmark results for all models, including comparisons with their base models and other Dutch LLMs, we refer to our paper [here](some_url).
+### Model Description
+- **Developed by:** [Matthieu Meeus](https://huggingface.co/matthieumeeus97), [Anthony Rathé](https://huggingface.co/anthonyrathe)
+- **Funded by:** [Vlaams Supercomputer Centrum](https://www.vscentrum.be/), through a grant of approx. 40K GPU hours (NVIDIA A100-80GB)
+- **Language(s):** Dutch
+- **License:** cc-by-nc-4.0
+- **Finetuned from model:** [Llama-3-ChocoLlama-8B-base](https://huggingface.co/ChocoLlama/Llama-3-ChocoLlama-8B-base)
+### Model Sources
+- **Repository:** Will be released soon.
+- **Paper:** Will be released soon.
+## Uses
+### Direct Use
+This is an instruction-tuned (SFT + DPO) Dutch model, optimized for Dutch language generation in conversational settings.
+For optimal behavior, we advise using the model only with the correct chat template (see Python code above), optionally supported by a system prompt.
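To make the role of the chat template concrete, the sketch below renders a conversation by hand. It assumes the model follows the standard Llama-3 instruct layout, which the `<|eot_id|>` terminator in the code above suggests but this card does not state. The helper name `render_llama3_prompt` is hypothetical; for real use, rely on `tokenizer.apply_chat_template`, which reads the template shipped with the model.

```python
def render_llama3_prompt(messages, add_generation_prompt=True):
    """Render messages in the standard Llama-3 instruct layout.

    Assumption: the model uses this layout. The tokenizer's own chat
    template is the authoritative source and may differ.
    """
    parts = ["<|begin_of_text|>"]
    for message in messages:
        # Each turn: a role header, a blank line, the content, an end-of-turn token.
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content'].strip()}<|eot_id|>"
        )
    if add_generation_prompt:
        # An open assistant header tells the model to start its reply.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "Je bent een behulpzame assistent."},
    {"role": "user", "content": "Wat is de hoofdstad van België?"},
]
prompt = render_llama3_prompt(messages)
print(prompt)
```

Printing the rendered string is mainly useful for inspection, for instance to confirm that a system prompt ends up in its own turn.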
+### Out-of-Scope Use
+Use cases requiring understanding or generation of text in languages other than Dutch. The dataset on which this model was fine-tuned contains no data in languages other than Dutch, so we expect significant catastrophic forgetting to have occurred for English, the language Llama-3 was originally trained for.
+## Bias, Risks, and Limitations
+We have taken care to include only widely used and high-quality data in our dataset. Some of this data has been filtered by the original creators.
+However, we did not explicitly conduct any additional filtering of this dataset with regard to biased or otherwise harmful content.
+## Training Details
+We adopt the same strategy as used to align GEITje-7B to [GEITje-7B-ultra](https://huggingface.co/BramVanroy/GEITje-7B-ultra).
+First, we apply supervised finetuning (SFT), utilizing the data made available by [Vanroy](https://arxiv.org/pdf/2312.12852):
+- [BramVanroy/ultrachat_200k_dutch](https://huggingface.co/datasets/BramVanroy/ultrachat_200k_dutch)
+- [BramVanroy/no_robots_dutch](https://huggingface.co/datasets/BramVanroy/no_robots_dutch)
+- [BramVanroy/stackoverflow-chat-dutch](https://huggingface.co/datasets/BramVanroy/stackoverflow-chat-dutch)
+- [BramVanroy/alpaca-cleaned-dutch](https://huggingface.co/datasets/BramVanroy/alpaca-cleaned-dutch)
+- [BramVanroy/dolly-15k-dutch](https://huggingface.co/datasets/BramVanroy/dolly-15k-dutch)
+Next, we apply Direct Preference Optimization (DPO) to the SFT versions of all the pretrained models we develop here,
+now utilizing a Dutch version of the data used to train Zephyr-7B-$\beta$, [BramVanroy/ultra_feedback_dutch](https://huggingface.co/datasets/BramVanroy/ultra_feedback_dutch).
+For both the SFT and DPO stages, we update all model weights and apply the same set of hyperparameters to all models as used in GEITje-7B-ultra:
+- learning_rate: 5e-07
 - train_batch_size: 4
 - eval_batch_size: 4
 - seed: 42
 - lr_scheduler_warmup_ratio: 0.1
 - num_epochs: 1
+Further, we leverage the publicly available [alignment handbook](https://github.com/huggingface/alignment-handbook) and use a set of 4 NVIDIA A100 (80 GB) GPUs for both stages.
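For readers unfamiliar with the DPO stage, the sketch below computes the standard DPO loss for a single preference pair in plain Python. It is an illustration, not the training code used here (that comes from the alignment handbook). The function name `dpo_loss` and the value `beta=0.1` are assumptions, since this card does not state the beta that was used.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are summed sequence log-probabilities under the policy being
    trained and under the frozen reference (SFT) model.
    beta=0.1 is an assumed value, not taken from this model card.
    """
    # How much more likely each answer is under the policy than the reference.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)), written as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Policy identical to the reference: the loss equals log(2).
print(dpo_loss(-15.0, -15.0, -15.0, -15.0))
# Policy prefers the chosen answer more than the reference does: lower loss.
print(dpo_loss(-10.0, -20.0, -15.0, -15.0))
```

The loss falls as the policy raises the chosen answer and lowers the rejected one relative to the reference, which is what pushes the SFT model toward the preferred responses.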
+## Evaluation
+### Quantitative evaluation
+We have evaluated our models on several industry-standard Dutch benchmarks, translated from their original versions. The results can be found in the table below, together with results from several other prominent Dutch models.
+| Model                                   | ARC        | HellaSwag  | MMLU       | TruthfulQA | Avg.       |
+|-----------------------------------------|------------|------------|------------|------------|------------|
+| **Llama-3-ChocoLlama-8B-instruct**      | **0.48**   | **0.66**   | **0.49**   | **0.49**   | **0.53**   |
+| llama-3-8B-rebatch                      | 0.44       | 0.64       | 0.46       | 0.48       | 0.51       |
+| llama-3-8B-instruct                     | 0.47       | 0.59       | 0.47       | 0.52       | 0.51       |
+| llama-3-8B                              | 0.44       | 0.64       | 0.47       | 0.45       | 0.50       |
+| Reynaerde-7B-Chat                       | 0.44       | 0.62       | 0.39       | 0.52       | 0.49       |
+| **Llama-3-ChocoLlama-8B-base**          | **0.45**   | **0.64**   | **0.44**   | **0.44**   | **0.49**   |
+| zephyr-7b-beta                          | 0.43       | 0.58       | 0.43       | 0.53       | 0.49       |
+| geitje-7b-ultra                         | 0.40       | 0.66       | 0.36       | 0.49       | 0.48       |
+| **ChocoLlama-2-7B-tokentrans-instruct** | **0.45**   | **0.62**   | **0.34**   | **0.42**   | **0.46**   |
+| mistral-7b-v0.1                         | 0.43       | 0.58       | 0.37       | 0.45       | 0.46       |
+| **ChocoLlama-2-7B-tokentrans-base**     | **0.42**   | **0.61**   | **0.32**   | **0.43**   | **0.45**   |
+| **ChocoLlama-2-7B-instruct**            | **0.36**   | **0.57**   | **0.33**   | **0.45**   | **0.43**   |
+| **ChocoLlama-2-7B-base**                | **0.35**   | **0.56**   | **0.31**   | **0.43**   | **0.41**   |
+| llama-2-7b-chat-hf                      | 0.36       | 0.49       | 0.33       | 0.44       | 0.41       |
+| llama-2-7b-hf                           | 0.36       | 0.51       | 0.32       | 0.41       | 0.40       |
+On average, Llama-3-ChocoLlama-8B-instruct surpasses the previous state of the art on these benchmarks.
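The Avg. column appears to be the unweighted mean of the four benchmark scores, rounded to two decimals. The sketch below checks this for three rows copied from the table; the mean-of-four interpretation is an assumption inferred from the numbers rather than something the card states.

```python
# Scores copied from the table above, in the order ARC, HellaSwag, MMLU, TruthfulQA.
rows = {
    "Llama-3-ChocoLlama-instruct": (0.48, 0.66, 0.49, 0.49),
    "llama-3-8B-instruct": (0.47, 0.59, 0.47, 0.52),
    "geitje-7b-ultra": (0.40, 0.66, 0.36, 0.49),
}


def average(scores):
    """Unweighted mean of the benchmark scores (assumed definition of Avg.)."""
    return sum(scores) / len(scores)


for name, scores in rows.items():
    print(f"{name}: {average(scores):.2f}")
```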
+### Qualitative evaluation
+In our paper, we also provide an additional qualitative evaluation of all models, which we empirically find more reliable.
+For details, we refer to the paper and to our benchmark [ChocoLlama-Bench](https://huggingface.co/datasets/ChocoLlama/ChocoLlama-Bench).
+### Compute Infrastructure
+All ChocoLlama models have been trained on the compute cluster provided by the [Flemish Supercomputer Center (VSC)](https://www.vscentrum.be/). We used 8 to 16 NVIDIA A100 GPUs with 80 GB of VRAM.