Update README.md
base_model:
- meta-llama/Llama-3.1-8B-Instruct
---

# 🐂 domestic-yak, a Macedonian LM (base version)

## Model Summary
This model is a Macedonian language adaptation of the Llama 3.1 8B Instruct model. It has undergone continued pretraining for one epoch on a deduplicated version of the Macedonian Corpus Raw dataset, which contains approximately 1.6 billion tokens, making it well-suited for Macedonian-language tasks such as text classification, language generation, and translation.

### 📊 Results
The table below compares the performance of our model, domestic-yak-8B, with its base model, Llama 3.1 8B Instruct, evaluated using the [macedonian-llm-eval](https://github.com/LVSTCK/macedonian-llm-eval) benchmark.

As shown in the table, domestic-yak-8B consistently outperforms its base model on all tasks.

| Benchmark | domestic-yak-8B | Llama 3.1 8B Instruct |
|:----------|:----------------|:----------------------|
| **NQ Open** | **0.0416 ± 0.0033** | 0.0335 ± 0.0030 |
| **WinoGrande** | **0.6259 ± 0.0136** | 0.5683 ± 0.0139 |

## 🔑 Key Details
- **Language:** Macedonian (`mk`)
- **Base Model:** [Meta Llama 3.1 8B Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)
- **Dataset:** [LVSTCK/macedonian-corpus-raw](https://huggingface.co/datasets/LVSTCK/macedonian-corpus-raw) (deduplicated version)
- **Pretraining Epochs:** 1 epoch
- **Pretraining Objective:** Causal Language Modeling (continued pretraining with all model weights updated); see the sketch after this list.
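
The exact training recipe is not published in this card, so the snippet below is only a rough sketch of what one epoch of full-weight causal-LM continued pretraining on the corpus could look like with Hugging Face `transformers` and `datasets`; the `text` column name, sequence length, and hyperparameters are illustrative assumptions, not the authors' actual configuration.

```python
import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token          # Llama tokenizers ship without a pad token
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16)

# Assumption: the deduplicated corpus exposes a plain "text" column in its train split.
corpus = load_dataset("LVSTCK/macedonian-corpus-raw", split="train")
corpus = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=4096),
    batched=True,
    remove_columns=corpus.column_names,
)

args = TrainingArguments(
    output_dir="domestic-yak-8B",
    num_train_epochs=1,                            # one epoch, as stated above
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=1e-5,
    bf16=True,
    logging_steps=100,
    save_strategy="epoch",
)

Trainer(
    model=model,
    args=args,
    train_dataset=corpus,
    # mlm=False gives the standard next-token (causal LM) objective
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
).train()
```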
## Usage
...
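
A usage snippet has not been filled in yet. As an illustration only, loading the base model for Macedonian text generation might look like the sketch below; the repository id `LVSTCK/domestic-yak-8B` is assumed from the model name used elsewhere in this card and may differ.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "LVSTCK/domestic-yak-8B"  # assumed repo id; adjust if the actual id differs
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "Скопје е главен град на"  # "Skopje is the capital of"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```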
## ⚠️ Limitations
- **Biases:** The model may show biases present in the training dataset. Efforts were made to clean and deduplicate the corpus, but further bias mitigation might be necessary for sensitive applications.
- **Domain Specificity:** While the dataset covers diverse domains, performance may vary for niche or underrepresented topics. For example, the dataset is heavily skewed toward 'news'-themed texts, while domains such as 'science' or 'medicine' are less represented.
- **Chat Capabilities:** This is the base model, so its chat capabilities may be limited. If you want to chat with the model, use the [instruct version](https://huggingface.co/LVSTCK/domestic-yak-8B-instruct); see the chat sketch after this list.
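
For chat, a minimal sketch with the instruct version is shown below; it assumes `LVSTCK/domestic-yak-8B-instruct` exposes a chat template through `tokenizer.apply_chat_template`, which this card does not confirm.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "LVSTCK/domestic-yak-8B-instruct"  # instruct variant linked above
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# "Write me a short poem about Ohrid."
messages = [{"role": "user", "content": "Напиши ми кратка песна за Охрид."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=200)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```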