soksof committed on
Commit b0dd771 · verified · 1 Parent(s): d4b0524

Update README.md

Files changed (1)
  1. README.md +74 -3
README.md CHANGED
---
license: llama3.1
---

# Llama-Krikri-8B: A large foundation language model for the Greek language

Following the release of [Meltemi-7B](https://huggingface.co/ilsp/Meltemi-7B-v1) on the 26th of March 2024, we are happy to welcome Krikri to the family of ILSP open Greek LLMs.
Krikri is built on top of [Llama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B), extending its capabilities for Greek through continual pretraining on a large corpus of high-quality and locally relevant Greek texts. We present Llama-Krikri-8B-Base, as well as an instruct version, [Llama-Krikri-8B-Instruct](https://huggingface.co/ilsp/Llama-Krikri-8B-instruct).

# Model Information

- Vocabulary extension of the Llama-3.1 tokenizer with Greek tokens (see the tokenizer sketch after the corpus table below)
- 128k context length
- We extend the pretraining of Llama-3.1-8B with added proficiency for the Greek language by utilizing a large corpus of approximately **110 billion tokens**.
  * This corpus includes 55 billion monolingual Greek tokens, constructed from publicly available resources. Additionally, to mitigate catastrophic forgetting and ensure that the model has bilingual capabilities, we use additional sub-corpora with monolingual English texts (23.3 billion tokens) and Greek-English parallel data (5.26 billion tokens).
  * The training corpus also contains approximately 6 billion math and code tokens.
  * This corpus has been processed, filtered, and deduplicated to ensure data quality and is outlined below:

| Sub-corpus | # Tokens | Percentage |
|-----------|------------------|------------|
| Greek | 55,097,452,359 | 61.4% |
| English | 23,340,749,356 | 26.0% |
| Parallel | 5,262,998,873 | 6.0% |
| Math/Code | 5,951,964,497 | 6.6% |
| **Total** | **89,653,165,085** | **100%** |

Chosen subsets of the 89.65 billion token corpus were upsampled, resulting in a final size of 110 billion tokens.
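
As a quick illustration of the vocabulary extension referenced above, the two tokenizers can be compared directly. This is a minimal sketch, assuming the repository ids `ilsp/Llama-Krikri-8B-Base` and `meta-llama/Llama-3.1-8B` (the latter is gated and requires access approval):

```python
from transformers import AutoTokenizer

# Repository ids assumed for illustration; meta-llama/Llama-3.1-8B is gated.
krikri_tok = AutoTokenizer.from_pretrained("ilsp/Llama-Krikri-8B-Base")
llama_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

text = "Η επεξεργασία φυσικής γλώσσας για τα ελληνικά βελτιώνεται διαρκώς."

# A tokenizer extended with Greek tokens should segment Greek text into
# fewer tokens than the original Llama-3.1 tokenizer.
print("Llama-3.1 tokens:", len(llama_tok(text)["input_ids"]))
print("Krikri tokens:   ", len(krikri_tok(text)["input_ids"]))
print("Krikri vocabulary size:", len(krikri_tok))
```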

# Usage

Please make sure that the BOS token is always included in the tokenized prompts. This might not be the default setting in all evaluation or fine-tuning frameworks.
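
A minimal sketch with the Hugging Face `transformers` library, assuming this repository's id is `ilsp/Llama-Krikri-8B-Base`, which verifies the BOS token before generating:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Repository id assumed for illustration.
model_id = "ilsp/Llama-Krikri-8B-Base"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "Η πρωτεύουσα της Ελλάδας είναι"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Verify that tokenization prepended the BOS token; some evaluation and
# fine-tuning frameworks disable this by default.
assert inputs["input_ids"][0, 0].item() == tokenizer.bos_token_id, "BOS token missing"

outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```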

# Evaluation

## Greek Benchmarks

The evaluation suite we created for the Greek language includes six test sets. You can run the suite by cloning this [lighteval fork](https://github.com/LeonVouk/lighteval).

Our evaluation suite includes:
* Four machine-translated versions ([ARC Greek](https://huggingface.co/datasets/ilsp/arc_greek), [Truthful QA Greek](https://huggingface.co/datasets/ilsp/truthful_qa_greek), [HellaSwag Greek](https://huggingface.co/datasets/ilsp/hellaswag_greek), [MMLU Greek](https://huggingface.co/datasets/ilsp/mmlu_greek)) of established English benchmarks for language understanding and reasoning ([ARC Challenge](https://arxiv.org/abs/1803.05457), [Truthful QA](https://arxiv.org/abs/2109.07958), [HellaSwag](https://arxiv.org/abs/1905.07830), [MMLU](https://arxiv.org/abs/2009.03300)).
* An existing benchmark for question answering in Greek ([Belebele](https://arxiv.org/abs/2308.16884)).
* A novel benchmark created by the ILSP team for medical question answering, based on the medical exams of [DOATAP](https://www.doatap.gr) ([Medical MCQA](https://huggingface.co/datasets/ilsp/medical_mcqa_greek)).

Our evaluation of Llama-Krikri-8B is performed in a few-shot setting, consistent with the settings of the [Open LLM leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard). Our training enhances performance across all Greek test sets, yielding an average improvement of **+14.9%**. The results for the Greek test sets are shown in the following table:
50
+
51
+ | | Medical MCQA EL (15-shot) | Belebele EL (5-shot) | HellaSwag EL (10-shot) | ARC-Challenge EL (25-shot) | TruthfulQA MC2 EL (0-shot) | MMLU EL (5-shot) | Average |
52
+ |----------------|----------------|-------------|--------------|------------------|-------------------|---------|---------|
53
+ | Meltemi 7B v1.5 | 42.2% | 61.0% | 53.8% | 40.0% | 49.0% | 41.2% | 47.9% |
54
+ | Llama-3.1-8B | 33.4% | 72.8% | 52.1% | 39.9% | 51.1% | 42.6% | 48.7% |
55
+ | Llama-Krikri-8B | 53.8% | 82.7% | 64.6% | 49.4% | 54.2% | 52.0% | **59.5%** |
56
+
57
+ Please note that the above evaluations were run with the newer version of lighteval, which has some differences from past versions. This is why we report different scores for Meltemi-7B-v1.5
58
+
59
+ ## English Benchmarks
60
+
61
+ | | Winogrande (5-shot) | Belebele (5-shot) | HellaSwag (10-shot) | ARC-Challenge (25-shot) | TruthfulQA MC2 (0-shot) | MMLU (5-shot) | Average |
62
+ |----------------|----------------|-------------|--------------|------------------|-------------------|---------|---------|
63
+ | Meltemi 7B v1.5 | 73.4% | 77.7% | 79.6% | 54.1% | 49.0% | 41.2% | 47.9% |
64
+ | Llama-3.1-8B | 74.6% | 71.5% | 82.0% | 58.5% | 51.1% | 42.6% | 48.7% |
65
+ | Llama-Krikri-8B | 72.6% | 79.8% | 80.7% | 57.8% | 54.2% | 52.0% | **59.5%** |

# Ethical Considerations

This model has not been aligned with human preferences and might therefore generate misleading, harmful, or toxic content.

# Acknowledgements

The ILSP team utilized Amazon's cloud computing services, which were made available via GRNET under the [OCRE Cloud framework](https://www.ocre-project.eu/), providing Amazon Web Services for the Greek Academic and Research Community.