|
--- |
|
license: llama3.1 |
|
datasets: |
|
- LVSTCK/macedonian-corpus-raw-dedup |
|
language: |
|
- mk |
|
base_model: |
|
- meta-llama/Llama-3.1-8B-Instruct |
|
--- |
|
|
|
# 🐂 domestic-yak, a Macedonian LM (base version) |
|
|
|
## Model Summary |
|
This model is a Macedonian-language adaptation of the Llama 3.1 8B Instruct model. It was continually pretrained for one epoch on a deduplicated version of the Macedonian Corpus Raw dataset, containing approximately 1.6 billion tokens, making it well suited for Macedonian-language tasks such as text classification, language generation, and translation.
|
|
|
### 📊 Results |
|
The table below compares the performance of our model, domestic-yak-8B, with its base model, Llama 3.1 8B Instruct, evaluated using the [macedonian-llm-eval](https://github.com/LVSTCK/macedonian-llm-eval) benchmark. Scores are reported as mean ± standard error.
|
|
|
As the table shows, domestic-yak-8B outperforms its base model on all tasks.
|
|
|
| **Task (mk version)** | **domestic-yak-8B** | **Llama 3.1 8B Instruct** |
|
|-------------------------|---------------------------|-----------------------| |
|
| **ARC Easy** | **0.5244 ± 0.0102** | 0.4453 ± 0.0102 | |
|
| **ARC Challenge** | **0.3183 ± 0.0136** | 0.2824 ± 0.0132 | |
|
| **BoolQ** | **0.7676 ± 0.0074** | 0.7639 ± 0.0074 | |
|
| **HellaSwag** | **0.4324 ± 0.0049** | 0.3740 ± 0.0048 | |
|
| **Openbook QA** | **0.2920 ± 0.0204** | 0.2520 ± 0.0194 | |
|
| **PIQA** | **0.6687 ± 0.0110** | 0.5865 ± 0.0115 | |
|
| **NQ Open** | **0.0416 ± 0.0033** | 0.0335 ± 0.0030 | |
|
| **WinoGrande** | **0.6259 ± 0.0136** | 0.5683 ± 0.0139 | |
|
|
|
## 🔑 Key Details |
|
- **Language:** Macedonian (`mk`) |
|
- **Base Model:** [Meta Llama 3.1 8B Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) |
|
- **Dataset:** [LVSTCK/macedonian-corpus-raw-dedup](https://huggingface.co/datasets/LVSTCK/macedonian-corpus-cleaned-dedup) (deduplicated version) |
|
- **Training Tokens:** ~1.6 billion |
|
- **Pretraining Epochs:** 1 epoch |
|
- **Pretraining Objective:** Causal language modeling (full-parameter continued pretraining; all weights are updated). A simplified sketch of this setup is shown below.
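
For illustration only, the snippet below sketches what such a continued-pretraining setup could look like with Hugging Face Transformers. The model and dataset IDs are the ones listed above; the text column name, block size, batch size, and learning rate are placeholder assumptions, not the settings actually used to train domestic-yak-8B.

```python
# Illustrative continued-pretraining sketch; hyperparameters are placeholders,
# not the settings actually used to train domestic-yak-8B.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base_model_id = "meta-llama/Llama-3.1-8B-Instruct"
dataset_id = "LVSTCK/macedonian-corpus-raw-dedup"

tokenizer = AutoTokenizer.from_pretrained(base_model_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token

model = AutoModelForCausalLM.from_pretrained(base_model_id)  # all weights remain trainable

raw = load_dataset(dataset_id, split="train")  # assumes a single "train" split

def tokenize(batch):
    # Assumes the corpus exposes a "text" column; 4096 is an assumed block size.
    return tokenizer(batch["text"], truncation=True, max_length=4096)

tokenized = raw.map(tokenize, batched=True, remove_columns=raw.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="yak-continued-pretraining",
        num_train_epochs=1,                # one pass over the corpus, as described above
        per_device_train_batch_size=1,     # placeholder; real runs use larger effective batches
        gradient_accumulation_steps=16,    # placeholder
        learning_rate=2e-5,                # placeholder
        bf16=True,
    ),
    train_dataset=tokenized,
    # mlm=False gives the standard next-token (causal) objective.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```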
|
|
|
## Usage |
|
... |
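
A minimal text-generation sketch using the standard Hugging Face Transformers API; the repository ID `LVSTCK/domestic-yak-8B` is taken from the citation below, and the dtype and sampling settings are illustrative rather than tuned recommendations.

```python
# Minimal usage sketch (illustrative; generation settings are not tuned recommendations).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "LVSTCK/domestic-yak-8B"  # repository ID as listed in the citation below

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # assumes a GPU with bf16 support; use float16/float32 otherwise
    device_map="auto",
)

# This is the base (non-instruct) model, so prompt it as plain text continuation.
prompt = "Скопје е главниот град на"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```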
|
|
|
## 📬 Contact |
|
|
|
For inquiries, feedback, or contributions, please feel free to reach out to the core team: |
|
|
|
- [Stefan Krsteski](https://www.linkedin.com/in/stefan-krsteski-136abb235/) [📧](mailto:[email protected]) |
|
- [Matea Tashkovska](https://www.linkedin.com/in/matea-tashkovska-774603198/) [📧](mailto:[email protected]) |
|
- [Borjan Sazdov](https://www.linkedin.com/in/borjan-sazdov-4b2187211/) [📧](mailto:[email protected]) |
|
|
|
## ⚠️ Limitations |
|
- **Biases:** The model may show biases present in the training dataset. Efforts were made to clean and deduplicate the corpus, but further bias mitigation might be necessary for sensitive applications. |
|
- **Domain Specificity:** While the dataset covers diverse domains, performance may vary for niche or underrepresented topics. For example, the dataset is heavily skewed toward 'news'-themed texts, while domains such as 'science' or 'medicine' are less represented. |
|
- **Chat Capabilities:** This is the base model, so its chat capabilities may be limited. If you would like to chat, use the [instruct version](https://huggingface.co/LVSTCK/domestic-yak-8B-instruct).
|
|
|
## Citation |
|
``` |
|
@misc{domestic-yak-8B,
  title={domestic-yak-8B: A Macedonian Language Model},
  author={Stefan Krsteski and Matea Tashkovska and Borjan Sazdov},
  year={2024},
  url={https://huggingface.co/LVSTCK/domestic-yak-8B},
  note={Macedonian adaptation of Llama 3.1 8B.}
}
|
``` |