---
license: llama3.1
datasets:
- LVSTCK/macedonian-corpus-raw-dedup
language:
- mk
base_model:
- meta-llama/Llama-3.1-8B-Instruct
---

# 🐂 domestic-yak, a Macedonian LM (base version)

## Model Summary
This model is a Macedonian-language adaptation of Llama 3.1 8B. It was continually pretrained for one epoch on a deduplicated version of the Macedonian Corpus Raw dataset (approximately 1.6 billion tokens), making it well suited for Macedonian-language tasks such as text classification, text generation, and translation.

### 📊 Results
The table below compares the performance of our model, domestic-yak-8B, with its base model, Llama 3.1 8B Instruct, evaluated using the [macedonian-llm-eval](https://github.com/LVSTCK/macedonian-llm-eval) benchmark.

As shown in the table, domestic-yak-8B outperforms its base model on every task.

| **Task (mk-version)**               | **domestic-yak-8B** | **Llama 3.1-8B Instruct** |
|-------------------------|---------------------------|-----------------------|
| **ARC Easy**           | **0.5244 ± 0.0102**       | 0.4453 ± 0.0102      |
| **ARC Challenge**      | **0.3183 ± 0.0136**       | 0.2824 ± 0.0132      |
| **BoolQ**              | **0.7676 ± 0.0074**       | 0.7639 ± 0.0074      |
| **HellaSwag**          | **0.4324 ± 0.0049**       | 0.3740 ± 0.0048      |
| **Openbook QA**        | **0.2920 ± 0.0204**       | 0.2520 ± 0.0194      |
| **PIQA**               | **0.6687 ± 0.0110**       | 0.5865 ± 0.0115      |
| **NQ Open**            | **0.0416 ± 0.0033**       | 0.0335 ± 0.0030      |
| **WinoGrande**         | **0.6259 ± 0.0136**       | 0.5683 ± 0.0139      |

## 🔑 Key Details
- **Language:** Macedonian (`mk`)
- **Base Model:** [Meta Llama 3.1 8B Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)
- **Dataset:** [LVSTCK/macedonian-corpus-raw-dedup](https://huggingface.co/datasets/LVSTCK/macedonian-corpus-cleaned-dedup) (deduplicated version)
- **Training Tokens:** ~1.6 billion
- **Pretraining Epochs:** 1 epoch
- **Pretraining Objective:** Causal language modeling (full continued pretraining, with all weights updated); see the sketch below
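
As a rough illustration of this setup (not the authors' actual training script), the sketch below shows full-parameter continued pretraining with a causal-LM objective in 🤗 Transformers. The `text` field name, sequence length, and all hyperparameters are assumptions.

```python
# Illustrative sketch only -- not the authors' training script.
# Assumes the dataset exposes a "text" field; hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers have no pad token by default
model = AutoModelForCausalLM.from_pretrained(base)  # all weights remain trainable

dataset = load_dataset("LVSTCK/macedonian-corpus-raw-dedup", split="train")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=4096)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

# Causal LM objective: labels are the input ids (shifted inside the model), so mlm=False.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="domestic-yak-8B",
        num_train_epochs=1,              # one pass over the ~1.6B-token corpus
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        learning_rate=2e-5,
        bf16=True,
    ),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```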

## Usage 
... 
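
A minimal loading and generation sketch with 🤗 Transformers (the repository id comes from the citation below; the prompt and generation settings are illustrative, not official recommendations):

```python
# Illustrative usage sketch; prompt and generation settings are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "LVSTCK/domestic-yak-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Base (non-instruct) model: plain text completion, no chat template.
prompt = "Скопје е главниот град на"  # "Skopje is the capital of"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Because this is the base model rather than the instruct version, the sketch uses plain text completion instead of a chat template.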

## 📬 Contact

For inquiries, feedback, or contributions, please feel free to reach out to the core team:

- [Stefan Krsteski](https://www.linkedin.com/in/stefan-krsteski-136abb235/) [📧](mailto:[email protected])
- [Matea Tashkovska](https://www.linkedin.com/in/matea-tashkovska-774603198/) [📧](mailto:[email protected])
- [Borjan Sazdov](https://www.linkedin.com/in/borjan-sazdov-4b2187211/) [📧](mailto:[email protected])

## ⚠️ Limitations
- **Biases:** The model may reflect biases present in the training data. The corpus was cleaned and deduplicated, but further bias mitigation may be necessary for sensitive applications.
- **Domain Specificity:** While the dataset covers diverse domains, performance may vary for niche or underrepresented topics. For example, the dataset is heavily skewed toward 'news'-themed texts, while domains such as 'science' or 'medicine' are less represented.
- **Chat Capabilities:** This is the base model, so its chat capabilities may be limited. For chat, use the [instruct version](https://huggingface.co/LVSTCK/domestic-yak-8B-instruct).

## Citation
```
@misc{domestic-yak-8B,
  title={domestic-yak-8B: A Macedonian Language Model},
  author={Stefan Krsteski and Matea Tashkovska and Borjan Sazdov},
  year={2024},
  url={https://huggingface.co/LVSTCK/domestic-yak-8B},
  note={Macedonian adaptation of Llama 3.1 8B.}
}
```