soksof committed on
Commit b0dd771 · verified · 1 Parent(s): d4b0524

Update README.md

Files changed (1)
  1. README.md +74 -3
README.md CHANGED
---
license: llama3.1
---

# Llama-Krikri-8B: A large foundation language model for the Greek language

Following the release of [Meltemi-7B](https://huggingface.co/ilsp/Meltemi-7B-v1) on the 26th of March 2024, we are happy to welcome Krikri to the family of ILSP open Greek LLMs.
Krikri is built on top of [Llama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B), extending its capabilities for Greek through continual pretraining on a large corpus of high-quality and locally relevant Greek texts. We present Llama-Krikri-8B-Base, as well as an instruct version, [Llama-Krikri-8B-Instruct](https://huggingface.co/ilsp/Llama-Krikri-8B-instruct).

# Model Information

- Vocabulary extension of the Llama-3.1 tokenizer with Greek tokens (see the tokenizer sketch after the corpus table below)
- 128k context length
- We extend the pretraining of Llama-3.1-8B with added proficiency for the Greek language by utilizing a large corpus of approximately **110 billion tokens**.
  * This corpus includes 55 billion monolingual Greek tokens, constructed from publicly available resources. Additionally, to mitigate catastrophic forgetting and ensure that the model has bilingual capabilities, we use additional sub-corpora with monolingual English texts (23.3 billion tokens) and Greek-English parallel data (5.26 billion tokens).
  * The training corpus also contains approximately 6 billion math and code tokens.
  * This corpus has been processed, filtered, and deduplicated to ensure data quality and is outlined below:

| Sub-corpus | # Tokens | Percentage |
|-----------|------------------|------------|
| Greek | 55,097,452,359 | 61.4% |
| English | 23,340,749,356 | 26.0% |
| Parallel | 5,262,998,873 | 6.0% |
| Math/Code | 5,951,964,497 | 6.6% |
| **Total** | **89,653,165,085** | **100%** |

Chosen subsets of the 89.65 billion token corpus were upsampled, resulting in a final size of 110 billion tokens.
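
As a quick illustration of the vocabulary extension referenced above, the two tokenizers can be compared directly. This is a minimal sketch, assuming the repository ids `ilsp/Llama-Krikri-8B-Base` and `meta-llama/Llama-3.1-8B` (the latter is gated and requires access approval):

```python
from transformers import AutoTokenizer

# Repository ids assumed for illustration; meta-llama/Llama-3.1-8B is gated.
krikri_tok = AutoTokenizer.from_pretrained("ilsp/Llama-Krikri-8B-Base")
llama_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

text = "Η επεξεργασία φυσικής γλώσσας για τα ελληνικά βελτιώνεται διαρκώς."

# A tokenizer extended with Greek tokens should segment Greek text into
# fewer tokens than the original Llama-3.1 tokenizer.
print("Llama-3.1 tokens:", len(llama_tok(text)["input_ids"]))
print("Krikri tokens:   ", len(krikri_tok(text)["input_ids"]))
print("Krikri vocabulary size:", len(krikri_tok))
```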

# Usage

Please make sure that the BOS token is always included in the tokenized prompts. This might not be the default setting in all evaluation or fine-tuning frameworks.
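
A minimal sketch with the Hugging Face `transformers` library, assuming this repository's id is `ilsp/Llama-Krikri-8B-Base`, which verifies the BOS token before generating:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Repository id assumed for illustration.
model_id = "ilsp/Llama-Krikri-8B-Base"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "Η πρωτεύουσα της Ελλάδας είναι"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Verify that tokenization prepended the BOS token; some evaluation and
# fine-tuning frameworks disable this by default.
assert inputs["input_ids"][0, 0].item() == tokenizer.bos_token_id, "BOS token missing"

outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```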

# Evaluation

## Greek Benchmarks

The evaluation suite we created for the Greek language includes six test sets. You can run the suite by cloning this [lighteval fork](https://github.com/LeonVouk/lighteval).

Our evaluation suite includes:
* Four machine-translated versions ([ARC Greek](https://huggingface.co/datasets/ilsp/arc_greek), [Truthful QA Greek](https://huggingface.co/datasets/ilsp/truthful_qa_greek), [HellaSwag Greek](https://huggingface.co/datasets/ilsp/hellaswag_greek), [MMLU Greek](https://huggingface.co/datasets/ilsp/mmlu_greek)) of established English benchmarks for language understanding and reasoning ([ARC Challenge](https://arxiv.org/abs/1803.05457), [Truthful QA](https://arxiv.org/abs/2109.07958), [HellaSwag](https://arxiv.org/abs/1905.07830), [MMLU](https://arxiv.org/abs/2009.03300)).
* An existing benchmark for question answering in Greek ([Belebele](https://arxiv.org/abs/2308.16884)).
* A novel benchmark created by the ILSP team for medical question answering, based on the medical exams of [DOATAP](https://www.doatap.gr) ([Medical MCQA](https://huggingface.co/datasets/ilsp/medical_mcqa_greek)).

Our evaluation of Llama-Krikri-8B is performed in a few-shot setting, consistent with the settings of the [Open LLM leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard). Our training enhances performance across all Greek test sets, yielding an average improvement of **+14.9%**. The results for the Greek test sets are shown in the following table:
50
+
51
+ | | Medical MCQA EL (15-shot) | Belebele EL (5-shot) | HellaSwag EL (10-shot) | ARC-Challenge EL (25-shot) | TruthfulQA MC2 EL (0-shot) | MMLU EL (5-shot) | Average |
52
+ |----------------|----------------|-------------|--------------|------------------|-------------------|---------|---------|
53
+ | Meltemi 7B v1.5 | 42.2% | 61.0% | 53.8% | 40.0% | 49.0% | 41.2% | 47.9% |
54
+ | Llama-3.1-8B | 33.4% | 72.8% | 52.1% | 39.9% | 51.1% | 42.6% | 48.7% |
55
+ | Llama-Krikri-8B | 53.8% | 82.7% | 64.6% | 49.4% | 54.2% | 52.0% | **59.5%** |
56
+
57
+ Please note that the above evaluations were run with the newer version of lighteval, which has some differences from past versions. This is why we report different scores for Meltemi-7B-v1.5
58
+
59
+ ## English Benchmarks
60
+
61
+ | | Winogrande (5-shot) | Belebele (5-shot) | HellaSwag (10-shot) | ARC-Challenge (25-shot) | TruthfulQA MC2 (0-shot) | MMLU (5-shot) | Average |
62
+ |----------------|----------------|-------------|--------------|------------------|-------------------|---------|---------|
63
+ | Meltemi 7B v1.5 | 73.4% | 77.7% | 79.6% | 54.1% | 49.0% | 41.2% | 47.9% |
64
+ | Llama-3.1-8B | 74.6% | 71.5% | 82.0% | 58.5% | 51.1% | 42.6% | 48.7% |
65
+ | Llama-Krikri-8B | 72.6% | 79.8% | 80.7% | 57.8% | 54.2% | 52.0% | **59.5%** |

# Ethical Considerations

This model has not been aligned with human preferences and might therefore generate misleading, harmful, or toxic content.

# Acknowledgements

The ILSP team utilized Amazon's cloud computing services, which were made available via GRNET under the [OCRE Cloud framework](https://www.ocre-project.eu/), providing Amazon Web Services for the Greek Academic and Research Community.