soksof committed
Commit d0f062e · verified · 1 parent: d11ec13

Update README.md

Files changed (1): README.md (+7 -5)
README.md CHANGED
@@ -3,14 +3,16 @@ license: llama3.1
 language:
 - el
 - en
-pipeline_tag: token-classification
+pipeline_tag: text-generation
 library_name: transformers
+tags:
+- text-generation-inference
 ---

 # Llama-Krikri-8B: A large foundation Language Model for the Greek language

-Following the release of [Meltemi-7B](https://huggingface.co/ilsp/Meltemi-7B-v1) on the 26th March 2024 we are happy to welcome Krikri to the family of ILSP open Greek LLMs.
-Krikri is built on top of [Llama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B), extending its capabilities for Greek through continual pretraining on a large corpus of high-quality and locally relevant Greek texts. We present Llama-Krikri-8B-Base, as well as an instruct version [Llama-Krikri-8B-Instruct](https://huggingface.co/ilsp/Llama-Krikri-8B-instruct).
+Following the release of [Meltemi-7B](https://huggingface.co/ilsp/Meltemi-7B-v1) on the 26th of March 2024, we are happy to welcome Krikri to the family of ILSP open Greek LLMs.
+Krikri is built on top of [Llama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B), extending its capabilities for Greek through continual pretraining on a large corpus of high-quality and locally relevant Greek texts. We present Llama-Krikri-8B-Base, as well as an instruct version, [Llama-Krikri-8B-Instruct](https://huggingface.co/ilsp/Llama-Krikri-8B-instruct).

 ![image/png](llama-krikri-image.jpg)

@@ -18,7 +20,7 @@ Krikri is built on top of [Llama-3.1-8B](https://huggingface.co/meta-llama/Llama

 - Vocabulary extension of the Llama-3.1 tokenizer with Greek tokens
 - 128k context length
-- We extend the pretraining of Llama-3.1-8B with added proficiency for the Greek language, by utilizing a large corpus consisting of approximately **110 billion tokens**.
+- We extend the pretraining of Llama-3.1-8B with added proficiency for the Greek language, by utilizing a large training corpus.
   * This corpus includes 55 billion monolingual Greek tokens, constructed from publicly available resources.
   * Additionally, to mitigate catastrophic forgetting and ensure that the model has bilingual capabilities, we use additional sub-corpora with monolingual English texts (23.3 billion tokens) and Greek-English parallel data (5.26 billion tokens).
   * The training corpus also contains 6 billion math and code tokens.
@@ -33,7 +35,7 @@ Krikri is built on top of [Llama-3.1-8B](https://huggingface.co/meta-llama/Llama
 | Math/Code | 5,951,964,497 | 6.6% |
 | **Total** | **89,653,165,085** | **100%** |

-Chosen subsets of the 89.65 billion corpus were upsampled resulting in a size of 110 billion tokens.
+Chosen subsets of the 89.65-billion-token corpus were upsampled, resulting in a size of **110 billion tokens**.


 # How to use
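
The diff ends at the `# How to use` heading without showing the usage snippet itself. For context, here is a minimal sketch of what text generation with this model typically looks like under the corrected `pipeline_tag: text-generation` metadata, assuming the standard `transformers` causal-LM API; the repo id `ilsp/Llama-Krikri-8B-Base` is an assumption inferred from the model name and the linked instruct repo, not stated in this commit.

```python
# Hypothetical usage sketch; the repo id below is assumed, not taken from the commit.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ilsp/Llama-Krikri-8B-Base"  # assumed repo id for the base model

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # 8B weights fit in bf16 on a single large GPU
    device_map="auto",
)

# The extended tokenizer covers Greek, so Greek prompts tokenize efficiently.
prompt = "Η Ελλάδα είναι"  # "Greece is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Since this is the base model rather than the instruct version, it is suited to plain text completion as above; chat-style prompting is the domain of [Llama-Krikri-8B-Instruct](https://huggingface.co/ilsp/Llama-Krikri-8B-instruct).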