droussis committed on
Commit 57c29d4 · verified · 1 Parent(s): 3d02913

Update README.md

Files changed (1): README.md (+63 -6)
README.md CHANGED
@@ -17,11 +17,14 @@ base_model:
 
 # Llama-Krikri-8B-Instruct: An Instruction-tuned Large Language Model for the Greek language
 
-Following the release of [Meltemi-7B](https://huggingface.co/ilsp/Meltemi-7B-v1) on the 26th of March 2024, we are happy to welcome Krikri to the family of ILSP open Greek LLMs.
-Krikri is built on top of [Llama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B), extending its capabilities for Greek through continual pretraining on a large corpus of high-quality and locally relevant Greek texts. We present **Llama-Krikri-8B-Instruct**, along with the base model, [Llama-Krikri-8B-Base](https://huggingface.co/ilsp/Llama-Krikri-8B-Base).
-![image/png](https://cdn-uploads.huggingface.co/production/uploads/639215a81bae0dde85842ab8/VMTgYzygHsarC9QRv2rGV.png)
+<div align="center">
+<img src="https://huggingface.co/ilsp/Llama-Krikri-8B-Instruct/resolve/main/KriKri_Logo-eng_54307d80-ee25-49f9-9204-0ce774499fbc.svg?raw=true" width="60%" alt="Krikri" />
+</div>
+Following the release of [Meltemi-7B](https://huggingface.co/ilsp/Meltemi-7B-v1) on the 26th of March 2024, we are happy to welcome Krikri to the family of ILSP open Greek LLMs.
+Krikri is built on top of [Llama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B), extending its capabilities for Greek through continual pretraining on a large corpus of high-quality and locally relevant Greek texts. We present **Llama-Krikri-8B-Instruct**, along with the base model, [Llama-Krikri-8B-Base](https://huggingface.co/ilsp/Llama-Krikri-8B-Base).
+<!-- ![image/png](llama-krikri-image.jpg) -->
 
 # Model Information
@@ -43,7 +46,7 @@ Krikri is built on top of [Llama-3.1-8B](https://huggingface.co/meta-llama/Llama
 | English | 21.0 B | 23.1 % |
 | Parallel | 5.5 B | 6.0 % |
 | Math/Code | 7.8 B | 8.6 % |
-| **Total** | 91 B | **100%** |
+| **Total** | **91 B** | **100%** |
 
 Chosen subsets of the 91-billion-token corpus were upsampled, resulting in a size of **110 billion tokens**.
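As a quick sanity check of the mixture table in the hunk above, each subset's percentage is simply its token count over the 91 B total. This is a toy snippet, not part of the model card, and the Greek subset row falls outside this excerpt:

```python
# Recompute the visible mixture percentages from the token counts (in billions).
total_b = 91.0
subsets = {"English": 21.0, "Parallel": 5.5, "Math/Code": 7.8}
for name, tokens_b in subsets.items():
    print(f"{name}: {100 * tokens_b / total_b:.1f} %")
# -> English: 23.1 %, Parallel: 6.0 %, Math/Code: 8.6 %
```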
@@ -60,7 +63,25 @@ Llama-Krikri-8B-Instruct is the result of post-training Llama-Krikri-8B-Base and
 - Conversion or structured extraction (e.g., XML, JSON) in data-to-text & text-to-data settings.
 - Analytical thinking and Chain-of-Thought (CoT) reasoning for problem-solving.
 
-🚨 **More information on the post-training corpus and methodology coming soon.** 🚨
+## Post-training Methodology
+
+We used a multi-stage process to build Llama-Krikri-8B-Instruct, which includes:
+- 2-stage Supervised Fine-Tuning with a combination of Greek & English instruction-response pairs (& multi-turn conversations):
+  - **Stage 1**: **856,946** instruction-response pairs (371,379 Greek + 485,567 English)
+  - **Stage 2**: **638,408** instruction-response pairs (279,948 Greek + 358,460 English)
+- Alignment with a combination of Greek & English preference triplets (Instruction - Chosen Response - Rejected Response):
+  - **Length-Normalized DPO**: **92,394** preference triplets (47,132 Greek + 45,262 English)
+
+## Post-training Data Construction
+
+To build the SFT & DPO data, we utilized various methodologies, including:
+- Collecting existing high-quality datasets such as [Tulu 3](https://huggingface.co/datasets/allenai/tulu-3-sft-mixture), [SmolTalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk), [MAGPIE Ultra](https://huggingface.co/datasets/argilla/magpie-ultra-v1.0), [Orca Agent Instruct](https://huggingface.co/datasets/microsoft/orca-agentinstruct-1M-v1), [IFEval Like Data](https://huggingface.co/datasets/argilla/ifeval-like-data), [UltraFeedback](https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized), [NVIDIA HelpSteer2](https://huggingface.co/datasets/nvidia/HelpSteer2), [Intel Orca](https://huggingface.co/datasets/argilla/distilabel-intel-orca-dpo-pairs), [UltraMedical](https://huggingface.co/datasets/TsinghuaC3I/UltraMedical-Preference), and other datasets focused on safety, truthfulness, and instruction-following.
+- Translating various data into Greek using an in-house translation tool.
+- Regenerating translated data and contrasting the translated with the regenerated responses (i.e., for creating preference triplets).
+- Distilling (with the MAGPIE methodology) models which exhibit strong performance in Greek, such as [Gemma 2 27B IT](https://huggingface.co/google/gemma-2-27b-it).
+- Scoring data with the [Skywork Reward Gemma 2 27B v0.2](https://huggingface.co/Skywork/Skywork-Reward-Gemma-2-27B-v0.2) Reward Model and filtering using rule-based filters.
+- Creating data for sentence and document translation using high-quality parallel corpora mainly from [ELRC-SHARE](https://elrc-share.eu/).
+- Synthetically extracting question-answer pairs and multi-turn dialogues from diverse sources such as Wikipedia, EUR-LEX, Greek School Books, and Kallipos.
 
 # How to use
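The alignment stage added in the hunk above uses Length-Normalized DPO over preference triplets. As a rough illustration only (this is not the authors' training code; `beta` and all statistics below are made-up values), here is a minimal PyTorch sketch of the length-normalized objective:

```python
import torch
import torch.nn.functional as F

def ln_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                ref_chosen_logps, ref_rejected_logps,
                chosen_lens, rejected_lens, beta=0.1):
    """Length-Normalized DPO loss for a batch of preference triplets.

    Each *_logps tensor holds the summed token log-probabilities of a
    response under the trainable policy or the frozen reference model;
    *_lens hold the response lengths in tokens. Standard DPO contrasts
    the raw sums; the length-normalized variant divides each sum by its
    response length, removing the reward bias towards longer responses.
    """
    policy_margin = (policy_chosen_logps / chosen_lens
                     - policy_rejected_logps / rejected_lens)
    ref_margin = (ref_chosen_logps / chosen_lens
                  - ref_rejected_logps / rejected_lens)
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

# Toy batch of two triplets with dummy statistics.
loss = ln_dpo_loss(
    policy_chosen_logps=torch.tensor([-45.0, -60.0]),
    policy_rejected_logps=torch.tensor([-80.0, -75.0]),
    ref_chosen_logps=torch.tensor([-50.0, -62.0]),
    ref_rejected_logps=torch.tensor([-78.0, -74.0]),
    chosen_lens=torch.tensor([50.0, 64.0]),
    rejected_lens=torch.tensor([70.0, 66.0]),
)
```

In practice the summed log-probabilities come from a forward pass of each model over the concatenated prompt and response, with the prompt tokens masked out.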
@@ -133,7 +154,43 @@ print(response.choices[0].message.content)
 
 # Evaluation
 
-🚨 **Instruction following and chat capability evaluation benchmarks coming soon.** 🚨
+In the table below, we report the scores for our chat evaluation suite, which includes:
+- [Greek IFEval](https://huggingface.co/datasets/ilsp/ifeval_greek) (strict average)
+- [English IFEval](https://huggingface.co/datasets/google/IFEval) (strict average)
+- [Greek MT-Bench](https://huggingface.co/datasets/ilsp/mt-bench-greek) using gpt-4o-2024-08-06 as the judge model
+- [English MT-Bench](https://huggingface.co/datasets/HuggingFaceH4/mt_bench_prompts) using gpt-4o-2024-08-06 as the judge model
+
+We observe that *Llama-Krikri-8B-Instruct exhibits the strongest performance* in instruction following for both Greek and English across all the models we tested. In particular, it surpasses Llama-3.1-8B-Instruct by **+21.7%** and **+7.3%** on the Greek and English IFEval, respectively.
+It also exhibits **the strongest chat capabilities on the Greek MT-Bench** (+0.28 compared to Aya Expanse 8B), while remaining very competitive on the English MT-Bench.
+
+|                              | IFEval EL (strict avg) | IFEval EN (strict avg) | MT-Bench EL | MT-Bench EN |
+|------------------------------|------------------------|------------------------|-------------|-------------|
+| Qwen 2.5 7B Instruct         | 46.2%                  | 74.8%                  | 5.83        | **7.87**    |
+| EuroLLM 9B Instruct          | 51.3%                  | 64.5%                  | 5.98        | 6.27        |
+| Aya Expanse 8B               | 50.4%                  | 62.2%                  | 7.68        | 6.92        |
+| Meltemi 7B v1.5 Instruct     | 32.7%                  | 41.2%                  | 6.25        | 5.46        |
+| Llama-3.1-8B Instruct        | 45.8%                  | 75.1%                  | 6.46        | 7.25        |
+| **Llama-Krikri-8B Instruct** | **67.5%**              | **82.4%**              | **7.96**    | 7.21        |
+
+We also used the [Arena-Hard-Auto](https://huggingface.co/datasets/lmarena-ai/arena-hard-auto-v0.1) automatic evaluation tool, as well as its translated (and post-edited) Greek version, which is publicly available [here](https://huggingface.co/datasets/ilsp/m-ArenaHard_greek). We report two scores for Arena-Hard-Auto:
+- No Style Control: the original version of the benchmark.
+- With Style Control: the benchmark with style control methods for Markdown elements. You can read more about the methodology and technical background in this [blog post](https://lmsys.org/blog/2024-08-28-style-control/).
+
+Below, we show the scores for the Greek version of Arena-Hard-Auto for various open and closed chat models, determined using **gpt-4o-2024-08-06 as the judge model** and **gpt-4o-mini-2024-07-18 as the baseline model** (i.e., the baseline scores 50% by definition).
+
+Llama-Krikri-8B Instruct exhibits very strong chat capabilities by scoring **higher than models over 8 times its size** (such as Llama-3.1-70B Instruct), while also being **competitive with closed-source models** (e.g., GPT-4o-Mini) and **highly performant open-source models** (e.g., Gemma 2 27B IT & Aya Expanse 32B).
+![image/png](arena_hard_el.png)
+
+Below, we show the scores for the original Arena-Hard-Auto dataset for various open and closed chat models. We followed the original methodology, using **gpt-4-1106-preview as the judge model** and **gpt-4-0314 as the baseline model**.
+
+Llama-Krikri-8B Instruct also performs very well on the English variant of Arena-Hard-Auto: it is **competitive with similarly sized LLMs** and **improves upon Llama-3.1-8B Instruct by +24.5% / +16%** (no style control / with style control).
+![image/png](arena_hard_en.png)
+
+**Please note** that judge models are biased towards student models trained on data distilled from them. You can read more [here](https://arxiv.org/pdf/2502.01534).
+
+🚨 **More information on the post-training methodology and evaluation coming soon.** 🚨
+
 
 # Acknowledgements
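The evaluation hunk above relies on pairwise, judge-based scoring: for Arena-Hard-Auto, a judge model compares each candidate answer against a fixed baseline model's answer, so the baseline sits at 50% by construction. Below is a simplified sketch of that loop, assuming an OpenAI-compatible client; the judge prompt, verdict parsing, and tie handling are hypothetical placeholders, and the real benchmark additionally swaps answer order to control for position bias:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical judge prompt; the actual Arena-Hard-Auto template differs.
JUDGE_PROMPT = """Compare the two assistant answers to the user question.
Reply with exactly one of: A (first answer is better), B (second is better), TIE.

[Question]
{question}

[Answer A]
{answer_a}

[Answer B]
{answer_b}"""

def judge_pair(question, candidate, baseline, judge="gpt-4o-2024-08-06"):
    """Return 1.0 if the candidate wins, 0.5 for a tie, 0.0 otherwise."""
    verdict = client.chat.completions.create(
        model=judge,
        temperature=0,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, answer_a=candidate, answer_b=baseline)}],
    ).choices[0].message.content.strip()
    return {"A": 1.0, "TIE": 0.5}.get(verdict, 0.0)

def win_rate(examples):
    """Mean pairwise score in percent; the baseline against itself scores 50%."""
    scores = [judge_pair(e["question"], e["candidate"], e["baseline"])
              for e in examples]
    return 100 * sum(scores) / len(scores)
```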