Update README.md
README.md
CHANGED
@@ -3,14 +3,16 @@ license: llama3.1
 language:
 - el
 - en
-pipeline_tag:
+pipeline_tag: text-generation
 library_name: transformers
+tags:
+- text-generation-inference
 ---
 
 # Llama-Krikri-8B: A large foundation Language Model for the Greek language
 
-Following the release of [Meltemi-7B](https://huggingface.co/ilsp/Meltemi-7B-v1) on the 26th of March 2024 we are happy to welcome Krikri to the family of ILSP open Greek LLMs.
-Krikri is built on top of [Llama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B), extending its capabilities for Greek through continual pretraining on a large corpus of high-quality and locally relevant Greek texts. We present Llama-Krikri-8B-Base, as well as an instruct version [Llama-Krikri-8B-Instruct](https://huggingface.co/ilsp/Llama-Krikri-8B-instruct).
+Following the release of [Meltemi-7B](https://huggingface.co/ilsp/Meltemi-7B-v1) on the 26th of March 2024, we are happy to welcome Krikri to the family of ILSP open Greek LLMs.
+Krikri is built on top of [Llama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B), extending its capabilities for Greek through continual pretraining on a large corpus of high-quality and locally relevant Greek texts. We present Llama-Krikri-8B-Base, as well as an instruct version, [Llama-Krikri-8B-Instruct](https://huggingface.co/ilsp/Llama-Krikri-8B-instruct).
 
 ![image]
 
@@ -18,7 +20,7 @@ Krikri is built on top of [Llama-3.1-8B](https://huggingface.co/meta-llama/Llama
 
 - Vocabulary extension of the Llama-3.1 tokenizer with Greek tokens
 - 128k context length
-- We extend the pretraining of Llama-3.1-8B with added proficiency for the Greek language, by utilizing a large corpus
+- We extend the pretraining of Llama-3.1-8B with added proficiency for the Greek language by utilizing a large training corpus.
   * This corpus includes 55 billion monolingual Greek tokens, constructed from publicly available resources.
   * Additionally, to mitigate catastrophic forgetting and ensure that the model has bilingual capabilities, we use additional sub-corpora with monolingual English texts (23.3 billion tokens) and Greek-English parallel data (5.26 billion tokens).
   * The training corpus also contains 6 billion math and code tokens.
@@ -33,7 +35,7 @@ Krikri is built on top of [Llama-3.1-8B](https://huggingface.co/meta-llama/Llama
 | Math/Code | 5,951,964,497 | 6.6% |
 | **Total** | **89,653,165,085** | **100%** |
 
-Chosen subsets of the 89.65 billion corpus were upsampled resulting in a size of 110 billion tokens
+Chosen subsets of the 89.65 billion token corpus were upsampled, resulting in a size of **110 billion tokens**.
 
 
 # How to use