Commit 662c05c (verified) · Parent(s): 2c5e3ac
StefanKrsteski committed: Update README.md
Files changed (1):
  README.md (+7, -4)
README.md CHANGED
@@ -8,12 +8,12 @@ base_model:
  - meta-llama/Llama-3.1-8B-Instruct
  ---
 
- # Macedonian Language Model - Base Version
+ # 🐂 domestic-yak, a Macedonian LM (base version)
 
  ## Model Summary
  This model is a Macedonian language adaptation of the Llama 3.1 8B model. It has undergone continued pretraining on a deduplicated version of the Macedonian Corpus Raw dataset, containing approximately 1.6 billion tokens. The model has been pretrained for one epoch on this corpus, making it well-suited for tasks involving the Macedonian language, such as text classification, language generation, and translation.
 
- ### Results
+ ### 📊 Results
  The table below compares the performance of our model, domestic-yak-8B, with its foundational model, LLaMA 3.1-8B Instruct evaluated using the [macedonian-llm-eval](https://github.com/LVSTCK/macedonian-llm-eval) benchmark.
 
  As shown in the table, domestic-yak-8B consistently outperforms its foundational model on all tasks.
@@ -29,7 +29,7 @@ As shown in the table, domestic-yak-8B consistently outperforms its foundational
  | **NQ Open** | **0.0416 ± 0.0033** | 0.0335 ± 0.0030 |
  | **WinoGrande** | **0.6259 ± 0.0136** | 0.5683 ± 0.0139 |
 
- ## Key Details
+ ## 🔑 Key Details
  - **Language:** Macedonian (`mk`)
  - **Base Model:** [Meta Llama 3.1 8B Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)
  - **Dataset:** [LVSTCK/macedonian-corpus-raw](https://huggingface.co/datasets/LVSTCK/macedonian-corpus-raw) (deduplicated version)
@@ -37,7 +37,10 @@ As shown in the table, domestic-yak-8B consistently outperforms its foundational
  - **Pretraining Epochs:** 1 epoch
  - **Pretraining Objective:** Causal Language Modeling (continued pretraining using all the weights)
 
- ## Limitations
+ ## Usage
+ ...
+
+ ## ⚠️ Limitations
  - **Biases:** The model may show biases present in the training dataset. Efforts were made to clean and deduplicate the corpus, but further bias mitigation might be necessary for sensitive applications.
  - **Domain Specificity:** While the dataset covers diverse domains, performance may vary for niche or underrepresented topics. For example, the dataset is heavily skewed toward 'news'-themed texts, while domains such as 'science' or 'medicine' are less represented.
  - **Chat Capabilities:** This version is the base model so its chat capabilities might be limited. If you would like to chat use the [instruct version](https://huggingface.co/LVSTCK/domestic-yak-8B-instruct).
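
For context on the **Pretraining Objective** bullet (causal language modeling with all weights trainable), here is a rough sketch of what one epoch of continued pretraining on the corpus could look like with the 🤗 Trainer. The hyperparameters, the `text` column name, the sequence length, and the existence of a `train` split are assumptions for illustration, not the authors' actual recipe.

```python
# Illustrative sketch of continued pretraining (causal LM, full-parameter).
# Hyperparameters and dataset column/split names are assumptions.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base_id = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base_id)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base_id)  # all weights are updated

# Assumed: the deduplicated corpus exposes a "text" column in a "train" split.
dataset = load_dataset("LVSTCK/macedonian-corpus-raw", split="train")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=4096)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="domestic-yak-8B",
        num_train_epochs=1,               # one pass over the corpus, as stated
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        learning_rate=1e-5,
        bf16=True,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),  # causal LM
)
trainer.train()
```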
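
The new `## Usage` section is left as a placeholder ("...") in this commit. Below is a minimal, hypothetical sketch of loading the base model with 🤗 Transformers; the repo id `LVSTCK/domestic-yak-8B` is an assumption inferred from the linked instruct version (`LVSTCK/domestic-yak-8B-instruct`), and the generation settings are illustrative only.

```python
# Hypothetical loading/generation example for the base model.
# The repo id is assumed; check the model card for the exact id.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "LVSTCK/domestic-yak-8B"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # half-precision weights for an 8B model
    device_map="auto",            # place layers on available GPU(s)/CPU
)

# This is the base (non-instruct) checkpoint, so prompt with plain
# Macedonian text and let the model continue it; no chat template.
prompt = "Скопје е главниот град на"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

As the Limitations section notes, the base checkpoint is best suited to text continuation or further fine-tuning; for conversational use, the instruct version is the intended entry point.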