---
license: llama3.1
language:
- el
- en
pipeline_tag: text-generation
library_name: transformers
tags:
- text-generation-inference
---
# Llama-Krikri-8B-Base: A large foundation Language Model for the Greek language
Following the release of [Meltemi-7B](https://huggingface.co/ilsp/Meltemi-7B-v1) on 26 March 2024, we are happy to welcome Krikri to the family of ILSP open Greek LLMs.
Krikri is built on top of [Llama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B), extending its capabilities for Greek through continual pretraining on a large corpus of high-quality and locally relevant Greek texts. We present Llama-Krikri-8B-Base, as well as an instruct version, [Llama-Krikri-8B-Instruct](https://huggingface.co/ilsp/Llama-Krikri-8B-instruct).

# Model Information
- Vocabulary extension of the Llama-3.1 tokenizer with Greek tokens
- 128k context length (**approximately 80,000 Greek words**)
- We extend the pretraining of Llama-3.1-8B on a large training corpus to add proficiency in the Greek language.
  * This corpus includes 56.7 billion monolingual Greek tokens, constructed from publicly available resources.
  * Additionally, to mitigate catastrophic forgetting and ensure that the model has bilingual capabilities, we use additional sub-corpora with monolingual English texts (21 billion tokens) and Greek-English parallel data (5.5 billion tokens).
  * The training corpus also contains 7.8 billion math and code tokens.
  * This corpus has been processed, filtered, and deduplicated to ensure data quality and is outlined below:

| Sub-corpus | # Tokens | Percentage |
|-----------|------------------|------------|
| Greek | 56.7 B | 62.3 % |
| English | 21.0 B | 23.1 % |
| Parallel | 5.5 B | 6.0 % |
| Math/Code | 7.8 B | 8.6 % |
| **Total** | 91 B | **100%** |

Selected subsets of the 91-billion-token corpus were upsampled, resulting in a final training set of **110 billion tokens**.
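
The percentages in the table follow directly from the token counts. As a minimal sketch, the snippet below recomputes them and derives the overall upsampling factor implied by the 110 billion figure. The variable names are illustrative and not part of any released tooling, and which subsets were upsampled by how much is not stated here, so only the aggregate ratio is computed.

```python
# Token counts in billions, copied from the table above.
corpus_tokens_b = {
    "Greek": 56.7,
    "English": 21.0,
    "Parallel": 5.5,
    "Math/Code": 7.8,
}

total_b = sum(corpus_tokens_b.values())

# Share of each sub-corpus as a percentage, rounded to one decimal place.
shares = {
    name: round(100 * count / total_b, 1)
    for name, count in corpus_tokens_b.items()
}

# Ratio between the upsampled training mix (110 B) and the deduplicated corpus.
upsampling_factor = 110 / total_b

for name, share in shares.items():
    print(f"{name:<10} {share:>5.1f} %")
print(f"Total: {total_b:.1f} B tokens, overall upsampling factor ~{upsampling_factor:.2f}")
```
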
# How to use
## With Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"  # switch to "cpu" if no GPU is available

model = AutoModelForCausalLM.from_pretrained("ilsp/Llama-Krikri-8B-Base")
tokenizer = AutoTokenizer.from_pretrained("ilsp/Llama-Krikri-8B-Base")
model.to(device)

# Prompt: "A kri-kri differs from a llama because"
inputs = tokenizer("Ένα κρικρί διαφέρει απο ένα λάμα επειδή", return_tensors="pt").to(device)
# Unpacking `inputs` passes the attention mask along with the input IDs.
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True)
print(tokenizer.batch_decode(outputs)[0])
```
## With an OpenAI-compatible server via vLLM
```bash
vllm serve ilsp/Llama-Krikri-8B-Base \
--enforce-eager \
--dtype 'bfloat16' \
--api-key token-abc123
```
Then, the model can be used through Python using:
```python
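from openai import OpenAI

api_key = "token-abc123"  # must match the --api-key passed to `vllm serve`
base_url = "http://localhost:8000/v1"
client = OpenAI(
    api_key=api_key,
    base_url=base_url,
)

# This is a base model, so we call the plain completions endpoint.
# Prompt: "Training large language models involves"
response = client.completions.create(
    model="ilsp/Llama-Krikri-8B-Base",
    prompt="Η εκπαίδευση μεγάλων γλωσσικών μοντέλων περιλαμβάνει",
)
print(response.choices[0].text)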
```
# Evaluation
Below, we report improvements of Llama-Krikri-8B-Base over Llama-3.1-8B for Greek and English:
- **+10.8 percentage points** on Greek benchmarks (average)
- **+0.8 percentage points** on English benchmarks (average)
Our evaluations for Llama-Krikri-8B-Base, Llama-3.1-8B, and Meltemi 7B v1.5 are performed in a few-shot setting, consistent with the settings in the [Open LLM leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).
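
In a k-shot setting, each test item is preceded in the prompt by k solved examples, and the model is scored on its continuation. The sketch below illustrates the idea only: the actual prompt templates are defined per task by the evaluation harness, and the `Question:`/`Answer:` layout and the `build_few_shot_prompt` helper are hypothetical.

```python
def build_few_shot_prompt(demonstrations, question, k):
    """Concatenate k solved examples followed by the unanswered question.

    `demonstrations` is a list of (question, answer) pairs. The layout is a
    hypothetical one, not the template used by any specific benchmark.
    """
    blocks = [f"Question: {q}\nAnswer: {a}" for q, a in demonstrations[:k]]
    blocks.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(blocks)


# Toy demonstrations, used only to show the shape of the prompt.
demos = [("2 + 2 = ?", "4"), ("3 + 5 = ?", "8"), ("7 - 4 = ?", "3")]

two_shot = build_few_shot_prompt(demos, "6 + 1 = ?", k=2)
zero_shot = build_few_shot_prompt(demos, "6 + 1 = ?", k=0)
print(two_shot)
```

With `k=0` the prompt contains only the test question, which corresponds to the 0-shot setting listed for TruthfulQA MC2.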
## Greek Benchmarks
The evaluation suite we created for the Greek language includes 6 test sets. You can run the suite by cloning this [lighteval fork](https://github.com/LeonVouk/lighteval).
Our evaluation suite includes:
* Four machine-translated versions ([ARC Greek](https://huggingface.co/datasets/ilsp/arc_greek), [Truthful QA Greek](https://huggingface.co/datasets/ilsp/truthful_qa_greek), [HellaSwag Greek](https://huggingface.co/datasets/ilsp/hellaswag_greek), [MMLU Greek](https://huggingface.co/datasets/ilsp/mmlu_greek)) of established English benchmarks for language understanding and reasoning ([ARC Challenge](https://arxiv.org/abs/1803.05457), [Truthful QA](https://arxiv.org/abs/2109.07958), [Hellaswag](https://arxiv.org/abs/1905.07830), [MMLU](https://arxiv.org/abs/2009.03300)).
* An existing benchmark for question answering in Greek ([Belebele](https://arxiv.org/abs/2308.16884))
* A novel benchmark created by the ILSP team for medical question answering based on the medical exams of [DOATAP](https://www.doatap.gr) ([Medical MCQA](https://huggingface.co/datasets/ilsp/medical_mcqa_greek)).
Our continual pretraining methodology improves performance on every Greek test set, with an average gain of **+10.8 percentage points** over the base model, Llama-3.1-8B. The results for the Greek test sets are shown in the following table:
| | Medical MCQA EL (15-shot) | Belebele EL (5-shot) | HellaSwag EL (10-shot) | ARC-Challenge EL (25-shot) | TruthfulQA MC2 EL (0-shot) | MMLU EL (5-shot) | Average |
|----------------|----------------|-------------|--------------|------------------|-------------------|---------|---------|
| Meltemi 7B v1.5 | 42.2% | 61.0% | 53.8% | 40.0% | 49.0% | 41.2% | 47.9% |
| Llama-3.1-8B | 33.4% | 72.8% | 52.1% | 39.9% | 51.1% | 42.6% | 48.7% |
| Llama-Krikri-8B | **53.8%** | **82.7%** | **64.6%** | **49.4%** | **54.2%** | **52.0%** | **59.5%** |
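
The Average column is consistent with an unweighted mean over the six tasks, and the headline gain is the difference between two such averages. A minimal sketch, assuming a simple macro-average (the aggregation is not spelled out explicitly above; `macro_average` is an illustrative helper):

```python
# Scores (%) copied from the Greek benchmark table, in column order:
# Medical MCQA, Belebele, HellaSwag, ARC-Challenge, TruthfulQA MC2, MMLU.
greek_scores = {
    "Meltemi 7B v1.5": [42.2, 61.0, 53.8, 40.0, 49.0, 41.2],
    "Llama-3.1-8B": [33.4, 72.8, 52.1, 39.9, 51.1, 42.6],
    "Llama-Krikri-8B": [53.8, 82.7, 64.6, 49.4, 54.2, 52.0],
}


def macro_average(scores):
    """Unweighted mean over tasks (assumed aggregation)."""
    return sum(scores) / len(scores)


averages = {model: macro_average(s) for model, s in greek_scores.items()}

# Gain of Krikri over its base model, in percentage points.
gain = averages["Llama-Krikri-8B"] - averages["Llama-3.1-8B"]

for model, avg in averages.items():
    print(f"{model:<16} {avg:.2f}")
print(f"Gain over Llama-3.1-8B: {gain:+.1f} points")
```
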
## English Benchmarks
Our training methodology also mitigates catastrophic forgetting effectively: average performance on the English test sets improves by **+0.8 percentage points** over Llama-3.1-8B, although not every individual benchmark improves. The results for the English test sets are shown in the following table:
| | Winogrande (5-shot) | Belebele (5-shot) | HellaSwag (10-shot) | ARC-Challenge (25-shot) | TruthfulQA MC2 (0-shot) | MMLU (5-shot) | Average |
|----------------|----------------|-------------|--------------|------------------|-------------------|---------|---------|
| Meltemi 7B v1.5 | 73.4% | 77.7% | 79.6% | 54.1% | 40.5% | 56.9% | 63.7% |
| Llama-3.1-8B | **74.6%** | 71.5% | **82.0%** | **58.5%** | 44.2% | **66.2%** | 66.2% |
| Llama-Krikri-8B | 72.6% | **79.8%** | 80.7% | 57.8% | **44.8%** | 65.1% | **67.0%** |
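
The per-task picture behind the average can be read off the table. The sketch below computes Llama-Krikri-8B minus Llama-3.1-8B for each English benchmark, in percentage points; the variable names are illustrative. The largest shift is on Belebele, while every other task moves by two points or less.

```python
# Scores (%) copied from the English benchmark table.
tasks = ["Winogrande", "Belebele", "HellaSwag", "ARC-Challenge", "TruthfulQA MC2", "MMLU"]
llama_scores = [74.6, 71.5, 82.0, 58.5, 44.2, 66.2]
krikri_scores = [72.6, 79.8, 80.7, 57.8, 44.8, 65.1]

# Per-task difference (Krikri minus base model), rounded to one decimal place.
deltas = {
    task: round(krikri - base, 1)
    for task, base, krikri in zip(tasks, llama_scores, krikri_scores)
}

improved = [task for task, delta in deltas.items() if delta > 0]
regressed = [task for task, delta in deltas.items() if delta < 0]

for task, delta in deltas.items():
    print(f"{task:<15} {delta:+.1f}")
```
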

Please note that all evaluations were run with the latest version of lighteval, which has some differences from past versions. This is why the scores we report here for Meltemi-7B-v1.5 differ from previously published ones.
# Ethical Considerations
This model has not been aligned with human preferences, and therefore might generate misleading, harmful, and toxic content.
# Acknowledgements
The ILSP team utilized Amazon's cloud computing services, which were made available via GRNET under the [OCRE Cloud framework](https://www.ocre-project.eu/), providing Amazon Web Services for the Greek Academic and Research Community.