---
license: llama3.1
---

# Llama-Krikri-8B: A Large Foundation Language Model for the Greek Language

Following the release of [Meltemi-7B](https://huggingface.co/ilsp/Meltemi-7B-v1) on 26 March 2024, we are happy to welcome Krikri to the family of ILSP open Greek LLMs.

Krikri is built on top of [Llama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B), extending its capabilities for Greek through continual pretraining on a large corpus of high-quality and locally relevant Greek texts. We present Llama-Krikri-8B-Base, as well as an instruct version, [Llama-Krikri-8B-Instruct](https://huggingface.co/ilsp/Llama-Krikri-8B-instruct).

# Model Information

- Vocabulary extension of the Llama-3.1 tokenizer with Greek tokens (see the tokenizer sketch after the corpus table below)
- 128k context length
- We extend the pretraining of Llama-3.1-8B with added proficiency for the Greek language by utilizing a large corpus of approximately **110 billion tokens**.
    * This corpus includes 55 billion monolingual Greek tokens, constructed from publicly available resources. Additionally, to mitigate catastrophic forgetting and to ensure that the model retains bilingual capabilities, we use additional sub-corpora with monolingual English texts (23.3 billion tokens) and Greek-English parallel data (5.26 billion tokens).
    * The training corpus also contains approximately 6 billion math and code tokens.
    * This corpus has been processed, filtered, and deduplicated to ensure data quality, and is outlined below:
| Sub-corpus | # Tokens           | Percentage |
|------------|--------------------|------------|
| Greek      | 55,097,452,359     | 61.4%      |
| English    | 23,340,749,356     | 26.0%      |
| Parallel   | 5,262,998,873      | 6.0%       |
| Math/Code  | 5,951,964,497      | 6.6%       |
| **Total**  | **89,653,165,085** | **100%**   |

Selected subsets of this 89.65 billion token corpus were upsampled, resulting in a final training size of 110 billion tokens.
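
Since the vocabulary extension is one of the key changes over Llama-3.1, the following minimal sketch (ours, not part of the original card; the repository ids are assumptions, and `meta-llama/Llama-3.1-8B` is gated on the Hub) illustrates how one might compare how the two tokenizers segment Greek text:

```python
# Illustrative sketch (not from the model card): compare how the original
# Llama-3.1 tokenizer and the extended Krikri tokenizer segment Greek text.
# Repository ids are assumptions; meta-llama/Llama-3.1-8B is gated and
# requires accepting its license on the Hugging Face Hub.
from transformers import AutoTokenizer

greek_text = "Το κρι-κρι είναι είδος αγριοκάτσικου που ζει στην Κρήτη."

base_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
krikri_tok = AutoTokenizer.from_pretrained("ilsp/Llama-Krikri-8B-Base")

# The extended vocabulary should encode Greek with fewer tokens per sentence.
print("Llama-3.1 vocab size:", len(base_tok))
print("Krikri vocab size:   ", len(krikri_tok))
print("Llama-3.1 token count:", len(base_tok(greek_text)["input_ids"]))
print("Krikri token count:   ", len(krikri_tok(greek_text)["input_ids"]))
```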

# Usage

Please make sure that the BOS token is always included in the tokenized prompts, as this might not be the default setting in all evaluation or fine-tuning frameworks.
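
A minimal text-completion sketch with 🤗 Transformers is shown below (ours, not part of the original card; the repository id is an assumption). The assertion simply checks that the tokenizer prepends the BOS token, as recommended above:

```python
# Minimal usage sketch (the repository id is an assumption).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ilsp/Llama-Krikri-8B-Base"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "Η πρωτεύουσα της Ελλάδας είναι"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Sanity check: the first input id must be the BOS token.
assert inputs["input_ids"][0, 0].item() == tokenizer.bos_token_id, "BOS token missing"

outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```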

# Evaluation

## Greek Benchmarks

The evaluation suite we created for the Greek language includes six test sets. You can run the suite by cloning this [lighteval fork](https://github.com/LeonVouk/lighteval).

Our evaluation suite includes:
* Four machine-translated versions ([ARC Greek](https://huggingface.co/datasets/ilsp/arc_greek), [Truthful QA Greek](https://huggingface.co/datasets/ilsp/truthful_qa_greek), [HellaSwag Greek](https://huggingface.co/datasets/ilsp/hellaswag_greek), [MMLU Greek](https://huggingface.co/datasets/ilsp/mmlu_greek)) of established English benchmarks for language understanding and reasoning ([ARC Challenge](https://arxiv.org/abs/1803.05457), [Truthful QA](https://arxiv.org/abs/2109.07958), [HellaSwag](https://arxiv.org/abs/1905.07830), [MMLU](https://arxiv.org/abs/2009.03300)).
* An existing benchmark for question answering in Greek ([Belebele](https://arxiv.org/abs/2308.16884)).
* A novel benchmark created by the ILSP team for medical question answering, based on the medical exams of [DOATAP](https://www.doatap.gr) ([Medical MCQA](https://huggingface.co/datasets/ilsp/medical_mcqa_greek)).

Our evaluation of Llama-Krikri-8B is performed in a few-shot setting, consistent with the settings of the [Open LLM leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard). Our training enhances performance across all Greek test sets, with an average improvement of **+14.9%**. The results for the Greek test sets are shown in the following table:

|                 | Medical MCQA EL (15-shot) | Belebele EL (5-shot) | HellaSwag EL (10-shot) | ARC-Challenge EL (25-shot) | TruthfulQA MC2 EL (0-shot) | MMLU EL (5-shot) | Average   |
|-----------------|---------------------------|----------------------|------------------------|----------------------------|----------------------------|------------------|-----------|
| Meltemi 7B v1.5 | 42.2%                     | 61.0%                | 53.8%                  | 40.0%                      | 49.0%                      | 41.2%            | 47.9%     |
| Llama-3.1-8B    | 33.4%                     | 72.8%                | 52.1%                  | 39.9%                      | 51.1%                      | 42.6%            | 48.7%     |
| Llama-Krikri-8B | 53.8%                     | 82.7%                | 64.6%                  | 49.4%                      | 54.2%                      | 52.0%            | **59.5%** |

Please note that the above evaluations were run with a newer version of lighteval, which has some differences from past versions; this is why we report scores for Meltemi-7B-v1.5 that differ from previously published ones.
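
As a quick sanity check (our addition, not part of the original evaluation), the Average column above is the unweighted mean of the six per-task scores in each row:

```python
# Quick sanity check (ours): the Average column is the unweighted mean of the
# six per-task scores in each row of the Greek benchmark table.
greek_scores = {
    "Meltemi 7B v1.5": [42.2, 61.0, 53.8, 40.0, 49.0, 41.2],
    "Llama-3.1-8B":    [33.4, 72.8, 52.1, 39.9, 51.1, 42.6],
    "Llama-Krikri-8B": [53.8, 82.7, 64.6, 49.4, 54.2, 52.0],
}
for model, scores in greek_scores.items():
    mean = sum(scores) / len(scores)
    # 47.87, 48.65, 59.45 -> rounded to one decimal place in the table above
    print(f"{model}: {mean:.2f}%")
```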

## English Benchmarks

|                 | Winogrande (5-shot) | Belebele (5-shot) | HellaSwag (10-shot) | ARC-Challenge (25-shot) | TruthfulQA MC2 (0-shot) | MMLU (5-shot) | Average   |
|-----------------|---------------------|-------------------|---------------------|-------------------------|-------------------------|---------------|-----------|
| Meltemi 7B v1.5 | 73.4%               | 77.7%             | 79.6%               | 54.1%                   | 49.0%                   | 41.2%         | 47.9%     |
| Llama-3.1-8B    | 74.6%               | 71.5%             | 82.0%               | 58.5%                   | 51.1%                   | 42.6%         | 48.7%     |
| Llama-Krikri-8B | 72.6%               | 79.8%             | 80.7%               | 57.8%                   | 54.2%                   | 52.0%         | **59.5%** |

# Ethical Considerations

This model has not been aligned with human preferences and may therefore generate misleading, harmful, or toxic content.

# Acknowledgements

The ILSP team utilized Amazon's cloud computing services, which were made available via GRNET under the [OCRE Cloud framework](https://www.ocre-project.eu/), providing Amazon Web Services for the Greek Academic and Research Community.