Tags: Text Generation · Transformers · Safetensors · Dutch · llama · conversational · text-generation-inference
matthieumeeus97 committed · commit 0075366 · verified · 1 parent: 3f068c1

Update README.md

Files changed (1):
  1. README.md +137 -33
README.md CHANGED
@@ -1,47 +1,128 @@
  ---
- license: llama3
- base_model: llama-2-nl/Meta-Llama-3-8B-lora-original
- tags:
- - alignment-handbook
- - generated_from_trainer
  datasets:
  - BramVanroy/ultrachat_200k_dutch
  - BramVanroy/stackoverflow-chat-dutch
  - BramVanroy/alpaca-cleaned-dutch
  - BramVanroy/dolly-15k-dutch
  - BramVanroy/no_robots_dutch
- model-index:
- - name: Meta-Llama-3-8B-lora-original-sft
-   results: []
  ---

- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->

- # Meta-Llama-3-8B-lora-original-sft

- This model is a fine-tuned version of [llama-2-nl/Meta-Llama-3-8B-lora-original](https://huggingface.co/llama-2-nl/Meta-Llama-3-8B-lora-original) on the BramVanroy/ultrachat_200k_dutch, the BramVanroy/stackoverflow-chat-dutch, the BramVanroy/alpaca-cleaned-dutch, the BramVanroy/dolly-15k-dutch and the BramVanroy/no_robots_dutch datasets.
- It achieves the following results on the evaluation set:
- - Loss: 1.0188

- ## Model description

- More information needed

- ## Intended uses & limitations

- More information needed

- ## Training and evaluation data

- More information needed

- ## Training procedure

- ### Training hyperparameters

- The following hyperparameters were used during training:
- - learning_rate: 2e-05
  - train_batch_size: 4
  - eval_batch_size: 4
  - seed: 42
@@ -55,16 +136,39 @@ The following hyperparameters were used during training:
  - lr_scheduler_warmup_ratio: 0.1
  - num_epochs: 1

- ### Training results

- | Training Loss | Epoch | Step | Validation Loss |
- |:-------------:|:-----:|:----:|:---------------:|
- | 1.0172 | 1.0 | 812 | 1.0188 |

- ### Framework versions

- - Transformers 4.40.1
- - Pytorch 2.1.2
- - Datasets 2.19.0
- - Tokenizers 0.19.1
  ---
+ language:
+ - nl
+ license: cc-by-nc-4.0
+ base_model: ChocoLlama/Llama-3-ChocoLlama-8B-base
  datasets:
  - BramVanroy/ultrachat_200k_dutch
  - BramVanroy/stackoverflow-chat-dutch
  - BramVanroy/alpaca-cleaned-dutch
  - BramVanroy/dolly-15k-dutch
  - BramVanroy/no_robots_dutch
+ - BramVanroy/ultra_feedback_dutch
+
  ---

+ <p align="center" style="margin:0;padding:0">
+ <img src="./chocollama_logo.png" alt="ChocoLlama logo" width="500" style="margin-left:'auto' margin-right:'auto' display:'block'"/>
+ </p>
+ <div style="margin:auto; text-align:center">
+ <h1 style="margin-bottom: 0">ChocoLlama</h1>
+ <em>A Llama-2/3-based family of Dutch language models</em>
+ </div>
+
+ ## Llama-3-ChocoLlama-8B-instruct: Getting Started
+
+ We present **Llama-3-ChocoLlama-8B-instruct**, an instruction-tuned version of Llama-3-ChocoLlama-8B-base, fine-tuned on a collection of Dutch translations of instruction-tuning datasets using SFT followed by DPO.
+ Its base model, [Llama-3-ChocoLlama-8B-base](https://huggingface.co/ChocoLlama/Llama-3-ChocoLlama-8B-base), is a language-adapted version of Meta's Llama-3-8B, fine-tuned on 32B Dutch Llama-2 tokens (104GB) using LoRa.
+
+ Use the code below to get started with the model.
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+
+ tokenizer = AutoTokenizer.from_pretrained('ChocoLlama/Llama-3-ChocoLlama-8B-instruct')
+ model = AutoModelForCausalLM.from_pretrained('ChocoLlama/Llama-3-ChocoLlama-8B-instruct', device_map="auto")
+
+ messages = [
+     {"role": "system", "content": "Je bent een artificiële intelligentie-assistent en geeft behulpzame, gedetailleerde en beleefde antwoorden op de vragen van de gebruiker."},
+     {"role": "user", "content": "Jacques brel, Willem Elsschot en Jan Jambon zitten op café. Waar zouden ze over babbelen?"},
+ ]
+
+ input_ids = tokenizer.apply_chat_template(
+     messages,
+     add_generation_prompt=True,
+     return_tensors="pt"
+ ).to(model.device)
+
+ new_terminators = [
+     tokenizer.eos_token_id,
+     tokenizer.convert_tokens_to_ids("<|eot_id|>")
+ ]
+
+ outputs = model.generate(
+     input_ids,
+     max_new_tokens=512,
+     eos_token_id=new_terminators,
+     do_sample=True,
+     temperature=0.8,
+     top_p=0.95,
+ )
+ response = outputs[0][input_ids.shape[-1]:]
+ print(tokenizer.decode(response, skip_special_tokens=True))
+ ```
+
+ Note that the datasets used for instruction-tuning were translated using GPT-3.5/4, which means that this instruction-tuned model cannot be used for commercial purposes.
+ Hence, for any commercial applications, we recommend fine-tuning the base model on your own Dutch data.
+
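+ For illustration only (this sketch is not part of the original ChocoLlama release): one way to fine-tune the base model on your own Dutch data is LoRA via `peft` with the standard `transformers` Trainer. The dataset path, column name, and hyperparameters below are placeholders, assuming a JSONL corpus with a `text` field.
+
+ ```python
+ from datasets import load_dataset
+ from peft import LoraConfig, get_peft_model
+ from transformers import (AutoModelForCausalLM, AutoTokenizer,
+                           DataCollatorForLanguageModeling, Trainer, TrainingArguments)
+
+ base = "ChocoLlama/Llama-3-ChocoLlama-8B-base"
+ tokenizer = AutoTokenizer.from_pretrained(base)
+ tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token
+ model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")
+
+ # Wrap the base model with LoRA adapters (rank/alpha are illustrative, not the original settings).
+ model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))
+
+ # Placeholder corpus: any Dutch dataset with a "text" column works here.
+ dataset = load_dataset("json", data_files="my_dutch_data.jsonl", split="train")
+ dataset = dataset.map(lambda x: tokenizer(x["text"], truncation=True, max_length=2048),
+                       remove_columns=dataset.column_names)
+
+ trainer = Trainer(
+     model=model,
+     args=TrainingArguments(output_dir="chocollama-own-data", per_device_train_batch_size=4,
+                            learning_rate=2e-4, num_train_epochs=1, bf16=True),
+     train_dataset=dataset,
+     data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
+ )
+ trainer.train()
+ model.save_pretrained("chocollama-own-data/lora-adapter")  # saves only the adapter weights
+ ```
+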
+ ## Model Details
+
+ ChocoLlama is a family of open LLMs specifically adapted to Dutch, contributing to the state-of-the-art of Dutch open LLMs in their weight class.
+
+ We provide 6 variants (3 base and 3 instruction-tuned models):
+ - **ChocoLlama-2-7B-base** ([link](https://huggingface.co/ChocoLlama/ChocoLlama-2-7B-base)): A language-adapted version of Meta's Llama-2-7b, fine-tuned on 32B Dutch Llama-2 tokens (104GB) using LoRa.
+ - **ChocoLlama-2-7B-instruct** ([link](https://huggingface.co/ChocoLlama/ChocoLlama-2-7B-instruct)): An instruction-tuned version of ChocoLlama-2-7B-base, fine-tuned on a collection of Dutch translations of instruction-tuning datasets, using SFT followed by DPO.
+ - **ChocoLlama-2-7B-tokentrans-base** ([link](https://huggingface.co/ChocoLlama/ChocoLlama-2-7B-tokentrans-base)): A language-adapted version of Meta's Llama-2-7b, using a Dutch RoBERTa-based tokenizer. The token embeddings of this model were reinitialized using the token translation algorithm proposed by [Remy et al.](https://arxiv.org/pdf/2310.03477). The model was subsequently fine-tuned on the same Dutch dataset as ChocoLlama-2-7B-base, again using LoRa.
+ - **ChocoLlama-2-7B-tokentrans-instruct** ([link](https://huggingface.co/ChocoLlama/ChocoLlama-2-7B-tokentrans-instruct)): An instruction-tuned version of ChocoLlama-2-7B-tokentrans-base, fine-tuned on the same dataset as ChocoLlama-2-7B-instruct, again using SFT followed by DPO.
+ - **Llama-3-ChocoLlama-8B-base** ([link](https://huggingface.co/ChocoLlama/Llama-3-ChocoLlama-8B-base)): A language-adapted version of Meta's Llama-3-8B, fine-tuned on the same Dutch dataset as ChocoLlama-2-7B-base, again using LoRa.
+ - **Llama-3-ChocoLlama-8B-instruct** ([link](https://huggingface.co/ChocoLlama/Llama-3-ChocoLlama-8B-instruct)): An instruction-tuned version of Llama-3-ChocoLlama-8B-base, fine-tuned on the same dataset as ChocoLlama-2-7B-instruct, again using SFT followed by DPO.
+
+ For benchmark results for all models, including comparisons with their base models and other Dutch LLMs, we refer to our paper [here](some_url).
+
+ ### Model Description
+
+ - **Developed by:** [Matthieu Meeus](https://huggingface.co/matthieumeeus97), [Anthony Rathé](https://huggingface.co/anthonyrathe)
+ - **Funded by:** [Vlaams Supercomputer Centrum](https://www.vscentrum.be/), through a grant of approximately 40K GPU hours (NVIDIA A100-80GB)
+ - **Language(s):** Dutch
+ - **License:** cc-by-nc-4.0
+ - **Finetuned from model:** [Llama-3-ChocoLlama-8B-base](https://huggingface.co/ChocoLlama/Llama-3-ChocoLlama-8B-base)
+
+ ### Model Sources
+
+ - **Repository:** Will be released soon.
+ - **Paper:** Will be released soon.
+
+ ## Uses
+
+ ### Direct Use
+
+ This is an instruction-tuned (SFT + DPO) Dutch model, optimized for Dutch language generation in conversational settings.
+ For optimal behavior, we advise using the model only with the correct chat template (see the Python code above), optionally supported by a system prompt.
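+
+ As an illustration (not part of the original card), you can render the chat template to a plain string to check the exact prompt the model will see; the messages below are placeholders.
+
+ ```python
+ from transformers import AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained('ChocoLlama/Llama-3-ChocoLlama-8B-instruct')
+
+ messages = [
+     {"role": "system", "content": "Je bent een behulpzame assistent."},  # placeholder system prompt
+     {"role": "user", "content": "Schrijf een kort gedicht over chocolade."},
+ ]
+
+ # Render the template without tokenizing, to inspect the formatted prompt string.
+ prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ print(prompt)
+ ```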
+
+ ### Out-of-Scope Use
+
+ Use cases requiring understanding or generation of text in languages other than Dutch. The dataset on which this model was fine-tuned does not contain data in languages other than Dutch; hence, we expect significant catastrophic forgetting to have occurred for English, the language on which the original Llama models were primarily trained.
+
+ ## Bias, Risks, and Limitations
+
+ We have taken care to include only widely used and high-quality data in our dataset. Some of this data has been filtered by the original creators.
+ However, we did not explicitly conduct any additional filtering of this dataset with regard to biased or otherwise harmful content.
+
+ ## Training Details
+
+ We adopt the same strategy as was used to align GEITje-7B into [GEITje-7B-ultra](https://huggingface.co/BramVanroy/GEITje-7B-ultra).
+ First, we apply supervised fine-tuning (SFT), utilizing the data made available by [Vanroy](https://arxiv.org/pdf/2312.12852):
+ - [BramVanroy/ultrachat_200k_dutch](https://huggingface.co/datasets/BramVanroy/ultrachat_200k_dutch)
+ - [BramVanroy/no_robots_dutch](https://huggingface.co/datasets/BramVanroy/no_robots_dutch)
+ - [BramVanroy/stackoverflow-chat-dutch](https://huggingface.co/datasets/BramVanroy/stackoverflow-chat-dutch)
+ - [BramVanroy/alpaca-cleaned-dutch](https://huggingface.co/datasets/BramVanroy/alpaca-cleaned-dutch)
+ - [BramVanroy/dolly-15k-dutch](https://huggingface.co/datasets/BramVanroy/dolly-15k-dutch)
+
+ Next, we apply Direct Preference Optimization (DPO) to the SFT versions of all the pretrained models we develop here,
+ now utilizing a Dutch version of the data used to train Zephyr-7B-$\beta$, [BramVanroy/ultra_feedback_dutch](https://huggingface.co/datasets/BramVanroy/ultra_feedback_dutch).
+
+ For both the SFT and DPO stages, we update all model weights and apply the same set of hyperparameters to all models as used in GEITje-7B-ultra:
+ - learning_rate: 5e-07
  - train_batch_size: 4
  - eval_batch_size: 4
  - seed: 42
  - lr_scheduler_warmup_ratio: 0.1
  - num_epochs: 1
+
+ Further, we leverage the publicly available [alignment handbook](https://github.com/huggingface/alignment-handbook) and use a set of 4 NVIDIA A100 GPUs (80 GB) for both stages.
+
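+ As a minimal sketch (not from the original card) of what the DPO stage could look like with `trl`'s `DPOTrainer`, assuming recent versions of `trl` and `datasets`: the hyperparameters mirror the list above, the SFT checkpoint path is a placeholder, and the preference dataset may need to be remapped to the `prompt`/`chosen`/`rejected` columns that `DPOTrainer` expects.
+
+ ```python
+ from datasets import load_dataset
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ from trl import DPOConfig, DPOTrainer
+
+ # Start from the SFT checkpoint (placeholder path).
+ model = AutoModelForCausalLM.from_pretrained("path/to/sft-checkpoint")
+ tokenizer = AutoTokenizer.from_pretrained("path/to/sft-checkpoint")
+
+ # Dutch preference data used for the DPO stage; check the dataset card for the exact
+ # split and column names, and remap to prompt/chosen/rejected if needed.
+ dataset = load_dataset("BramVanroy/ultra_feedback_dutch", split="train")
+
+ # Hyperparameters mirroring the list above; everything else is left at its default.
+ config = DPOConfig(
+     output_dir="llama-3-chocollama-8b-dpo",
+     learning_rate=5e-7,
+     per_device_train_batch_size=4,
+     per_device_eval_batch_size=4,
+     num_train_epochs=1,
+     warmup_ratio=0.1,
+     seed=42,
+ )
+
+ trainer = DPOTrainer(
+     model=model,
+     args=config,
+     train_dataset=dataset,
+     processing_class=tokenizer,  # older trl releases name this argument `tokenizer`
+ )
+ trainer.train()
+ ```
+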
+ ## Evaluation
+
+ ### Quantitative evaluation
+
+ We have evaluated our models on several industry-standard Dutch benchmarks, translated from their original versions. The results can be found in the table below, together with results from several other prominent Dutch models.
+
+ | Model | ARC | HellaSwag | MMLU | TruthfulQA | Avg. |
+ |----------------------------------------------|----------------|----------------|----------------|----------------|----------------|
+ | **Llama-3-ChocoLlama-instruct** | **0.48** | **0.66** | **0.49** | **0.49** | **0.53** |
+ | llama-3-8B-rebatch | 0.44 | 0.64 | 0.46 | 0.48 | 0.51 |
+ | llama-3-8B-instruct | 0.47 | 0.59 | 0.47 | 0.52 | 0.51 |
+ | llama-3-8B | 0.44 | 0.64 | 0.47 | 0.45 | 0.50 |
+ | Reynaerde-7B-Chat | 0.44 | 0.62 | 0.39 | 0.52 | 0.49 |
+ | **Llama-3-ChocoLlama-base** | **0.45** | **0.64** | **0.44** | **0.44** | **0.49** |
+ | zephyr-7b-beta | 0.43 | 0.58 | 0.43 | 0.53 | 0.49 |
+ | geitje-7b-ultra | 0.40 | 0.66 | 0.36 | 0.49 | 0.48 |
+ | **ChocoLlama-2-7B-tokentrans-instruct** | **0.45** | **0.62** | **0.34** | **0.42** | **0.46** |
+ | mistral-7b-v0.1 | 0.43 | 0.58 | 0.37 | 0.45 | 0.46 |
+ | **ChocoLlama-2-7B-tokentrans-base** | **0.42** | **0.61** | **0.32** | **0.43** | **0.45** |
+ | **ChocoLlama-2-7B-instruct** | **0.36** | **0.57** | **0.33** | **0.45** | **0.43** |
+ | **ChocoLlama-2-7B-base** | **0.35** | **0.56** | **0.31** | **0.43** | **0.41** |
+ | llama-2-7b-chat-hf | 0.36 | 0.49 | 0.33 | 0.44 | 0.41 |
+ | llama-2-7b-hf | 0.36 | 0.51 | 0.32 | 0.41 | 0.40 |
+
+ On average, Llama-3-ChocoLlama-instruct surpasses the previous state-of-the-art on these benchmarks.
+
+ ### Qualitative evaluation
+
+ In our paper, we also provide an additional qualitative evaluation of all models, which we empirically find more reliable.
+ For details, we refer to the paper and to our benchmark [ChocoLlama-Bench](https://huggingface.co/datasets/ChocoLlama/ChocoLlama-Bench).
+
+ ### Compute Infrastructure
+
+ All ChocoLlama models have been trained on the compute cluster provided by the [Flemish Supercomputer Center (VSC)](https://www.vscentrum.be/). We used 8 to 16 NVIDIA A100 GPUs with 80 GB of VRAM.