---
license: llama3.1
language:
- el
- en
pipeline_tag: text-generation
library_name: transformers
tags:
- text-generation-inference
---

# Llama-Krikri-8B-Base: A Large Foundation Language Model for the Greek Language

Following the release of [Meltemi-7B](https://huggingface.co/ilsp/Meltemi-7B-v1) on March 26th, 2024, we are happy to welcome Krikri to the family of ILSP open Greek LLMs.
Krikri is built on top of [Llama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B), extending its capabilities for Greek through continual pretraining on a large corpus of high-quality and locally relevant Greek texts. We present Llama-Krikri-8B-Base, as well as an instruct version, [Llama-Krikri-8B-Instruct](https://huggingface.co/ilsp/Llama-Krikri-8B-instruct).

![image/png](llama-krikri-image.jpg)

# Model Information

- Vocabulary extension of the Llama-3.1 tokenizer with Greek tokens
- 128k context length (**approximately 80,000 Greek words**)
- We extend the pretraining of Llama-3.1-8B with added proficiency for the Greek language by utilizing a large training corpus.
  * This corpus includes 56.7 billion monolingual Greek tokens, constructed from publicly available resources.
  * Additionally, to mitigate catastrophic forgetting and to ensure that the model has bilingual capabilities, we use additional sub-corpora with monolingual English texts (21 billion tokens) and Greek-English parallel data (5.5 billion tokens).
  * The training corpus also contains 7.8 billion math and code tokens.
  * This corpus has been processed, filtered, and deduplicated to ensure data quality; its composition is outlined below:


| Sub-corpus | # Tokens | Percentage |
|------------|----------|------------|
| Greek      | 56.7 B   | 62.3%      |
| English    | 21.0 B   | 23.1%      |
| Parallel   | 5.5 B    | 6.0%       |
| Math/Code  | 7.8 B    | 8.6%       |
| **Total**  | 91.0 B   | **100%**   |


Selected subsets of the 91 billion token corpus were upsampled, resulting in a final size of **110 billion tokens**.
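
To get a feel for the effect of the vocabulary extension on Greek text, the following minimal sketch compares the Krikri tokenizer with the original Llama-3.1 tokenizer. The example sentence and the comparison itself are illustrative and not part of the official training setup, and the Llama-3.1 tokenizer is gated (it requires accepting Meta's license on the Hub):

```python
from transformers import AutoTokenizer

krikri_tok = AutoTokenizer.from_pretrained("ilsp/Llama-Krikri-8B-Base")
llama_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")  # gated; license acceptance required

# Illustrative Greek sentence: "Natural language processing is a branch of artificial intelligence."
text = "Η επεξεργασία φυσικής γλώσσας είναι ένας κλάδος της τεχνητής νοημοσύνης."

print("Krikri vocabulary size:   ", len(krikri_tok))
print("Llama-3.1 vocabulary size:", len(llama_tok))
print("Krikri tokens for the sentence:   ", len(krikri_tok.encode(text)))
print("Llama-3.1 tokens for the sentence:", len(llama_tok.encode(text)))
```

The fewer tokens Krikri needs per Greek word, the more Greek text fits into the 128k-token context window, which is what the estimate of approximately 80,000 Greek words above refers to.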


# How to use

## With Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"

model = AutoModelForCausalLM.from_pretrained("ilsp/Llama-Krikri-8B-Base")
tokenizer = AutoTokenizer.from_pretrained("ilsp/Llama-Krikri-8B-Base")

model.to(device)

# Prompt: "A kri-kri differs from a llama because"
inputs = tokenizer("Ένα κρικρί διαφέρει από ένα λάμα επειδή", return_tensors='pt').to(device)
outputs = model.generate(inputs['input_ids'], max_new_tokens=256, do_sample=True)

print(tokenizer.batch_decode(outputs)[0])
```
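
If GPU memory is limited, the weights can also be loaded in half precision. This is an optional variant of the snippet above rather than part of the official instructions; `device_map="auto"` additionally requires the `accelerate` package:

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "ilsp/Llama-Krikri-8B-Base",
    torch_dtype=torch.bfloat16,  # roughly halves memory compared to the default fp32 load
    device_map="auto",           # places the model on available GPUs; requires accelerate
)
```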

## With an OpenAI-compatible server via vLLM

```bash
vllm serve ilsp/Llama-Krikri-8B-Base \
  --enforce-eager \
  --dtype 'bfloat16' \
  --api-key token-abc123
```

The model can then be queried from Python using the OpenAI client:
```python
from openai import OpenAI

api_key = "token-abc123"
base_url = "http://localhost:8000/v1"

client = OpenAI(
    api_key=api_key,
    base_url=base_url,
)

response = client.completions.create(
    model="ilsp/Llama-Krikri-8B-Base",
    # Prompt: "Training large language models involves"
    prompt="Η εκπαίδευση μεγάλων γλωσσικών μοντέλων περιλαμβάνει",
)
print(response.choices[0].text)
```
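
Since vLLM exposes the standard OpenAI completions API, responses can also be streamed token by token. A minimal sketch, assuming the same server settings as above:

```python
from openai import OpenAI

client = OpenAI(api_key="token-abc123", base_url="http://localhost:8000/v1")

stream = client.completions.create(
    model="ilsp/Llama-Krikri-8B-Base",
    # Prompt: "Training large language models involves"
    prompt="Η εκπαίδευση μεγάλων γλωσσικών μοντέλων περιλαμβάνει",
    max_tokens=128,
    stream=True,
)
for chunk in stream:
    # Each chunk carries the next piece of generated text.
    print(chunk.choices[0].text, end="", flush=True)
print()
```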

# Evaluation

Below, we report improvements of Llama-Krikri-8B-Base over Llama-3.1-8B for Greek and English:
- **+10.8%** on Greek benchmarks
- **+0.8%** on English benchmarks

Our evaluations for Llama-Krikri-8B-Base, Llama-3.1-8B, and Meltemi 7B v1.5 are performed in a few-shot setting, consistent with the settings in the [Open LLM leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard). 

## Greek Benchmarks


The evaluation suite we created for the Greek language includes 6 test sets. You can run the suite by cloning this [lighteval fork](https://github.com/LeonVouk/lighteval).

Our evaluation suite includes: 
* Four machine-translated versions ([ARC Greek](https://huggingface.co/datasets/ilsp/arc_greek), [Truthful QA Greek](https://huggingface.co/datasets/ilsp/truthful_qa_greek), [HellaSwag Greek](https://huggingface.co/datasets/ilsp/hellaswag_greek), [MMLU Greek](https://huggingface.co/datasets/ilsp/mmlu_greek)) of established English benchmarks for language understanding and reasoning ([ARC Challenge](https://arxiv.org/abs/1803.05457), [Truthful QA](https://arxiv.org/abs/2109.07958), [HellaSwag](https://arxiv.org/abs/1905.07830), [MMLU](https://arxiv.org/abs/2009.03300)).
* An existing benchmark for question answering in Greek ([Belebele](https://arxiv.org/abs/2308.16884)).
* A novel benchmark created by the ILSP team for medical question answering based on the medical exams of [DOATAP](https://www.doatap.gr) ([Medical MCQA](https://huggingface.co/datasets/ilsp/medical_mcqa_greek)).
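
The translated datasets are publicly available on the Hugging Face Hub, so individual examples can be inspected directly. A minimal sketch; the available configs and splits are assumptions here, so check the respective dataset cards:

```python
from datasets import load_dataset

# Load one of the machine-translated Greek benchmarks used in the evaluation suite.
# NOTE: config and split names are assumptions; consult the dataset card for the exact layout.
ds = load_dataset("ilsp/hellaswag_greek")

print(ds)                     # overview of the available splits
first_split = next(iter(ds))
print(ds[first_split][0])     # one translated example
```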

Our continual pretraining methodology enhances performance across all Greek test sets, yielding an average improvement of **+10.8%** over the base model. The results for the Greek test sets are shown in the following table:

|                | Medical MCQA EL (15-shot) | Belebele EL (5-shot) | HellaSwag EL (10-shot) | ARC-Challenge EL (25-shot) | TruthfulQA MC2 EL (0-shot) | MMLU EL (5-shot) | Average |
|----------------|----------------|-------------|--------------|------------------|-------------------|---------|---------|
| Meltemi 7B v1.5 | 42.2%         | 61.0%       | 53.8%        | 40.0%            | 49.0%             | 41.2%   | 47.9%   |
| Llama-3.1-8B    | 33.4%         | 72.8%       | 52.1%        | 39.9%            | 51.1%             | 42.6%   | 48.7%   |
| Llama-Krikri-8B | **53.8%**         | **82.7%**       | **64.6%**        | **49.4%**            | **54.2%**             | **52.0%**   | **59.5%**   |


## English Benchmarks

Our training methodology not only mitigates catastrophic forgetting, but also improves average performance across the English test sets by **+0.8%**. The results for the English test sets are shown in the following table:

|                | Winogrande (5-shot) | Belebele (5-shot) | HellaSwag (10-shot) | ARC-Challenge (25-shot) | TruthfulQA MC2 (0-shot) | MMLU (5-shot) | Average |
|----------------|----------------|-------------|--------------|------------------|-------------------|---------|---------|
| Meltemi 7B v1.5 | 73.4%         | 77.7%       | 79.6%        | 54.1%            | 40.5%             | 56.9%   | 63.7%   |
| Llama-3.1-8B    | **74.6%**         | 71.5%       | **82.0%**        | **58.5%**            | 44.2%             | **66.2%**   | 66.2%   |
| Llama-Krikri-8B | 72.6%         | **79.8%**       | 80.7%        | 57.8%            | **44.8%**             | 65.1%   | **67.0%**   |

Please note that all evaluations were run with the latest version of lighteval, which differs in some respects from past versions. This is why the scores reported here for Meltemi 7B v1.5 differ from previously published ones.


# Ethical Considerations

This model has not been aligned with human preferences and might therefore generate misleading, harmful, or toxic content.


# Acknowledgements

The ILSP team utilized Amazon's cloud computing services, which were made available via GRNET under the [OCRE Cloud framework](https://www.ocre-project.eu/), providing Amazon Web Services for the Greek Academic and Research Community.