---
license: llama3.1
datasets:
- LVSTCK/macedonian-corpus-raw-dedup
language:
- mk
base_model:
- meta-llama/Llama-3.1-8B-Instruct
tags:
- mk
- mkd
- macedonia
---
# 🐂 domestic-yak, a Macedonian LM (base version)

## Model Summary
This model is a Macedonian-language adaptation of Llama 3.1 8B Instruct. It underwent one epoch of continued pretraining on the deduplicated version of the Macedonian Corpus Raw dataset, approximately 1.6 billion tokens, making it well-suited for Macedonian-language tasks such as text classification, language generation, and translation.
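The model can be loaded with the Hugging Face transformers library. The snippet below is a minimal sketch, assuming the standard AutoModelForCausalLM loading path; the prompt and generation settings are illustrative only.

```python
# Minimal usage sketch (assumed standard transformers loading, not an official example).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "LVSTCK/domestic-yak-8B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # assumes a bf16-capable GPU; use float16/float32 otherwise
    device_map="auto",
)

# As a base model, it continues text rather than following chat instructions.
prompt = "Скопје е главниот град на"  # "Skopje is the capital of"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```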
## 📊 Results
The table below compares the performance of our model, domestic-yak-8B, with its base model, Llama 3.1 8B Instruct, evaluated using the macedonian-llm-eval benchmark.
As shown in the table, domestic-yak-8B consistently outperforms its base model on all tasks. A hedged sketch of reproducing this kind of evaluation follows the table.
| Task (mk-version) | domestic-yak-8B | Llama 3.1-8B Instruct |
|---|---|---|
| ARC Easy | 0.5244 ± 0.0102 | 0.4453 ± 0.0102 |
| ARC Challenge | 0.3183 ± 0.0136 | 0.2824 ± 0.0132 |
| BoolQ | 0.7676 ± 0.0074 | 0.7639 ± 0.0074 |
| HellaSwag | 0.4324 ± 0.0049 | 0.3740 ± 0.0048 |
| OpenBookQA | 0.2920 ± 0.0204 | 0.2520 ± 0.0194 |
| PIQA | 0.6687 ± 0.0110 | 0.5865 ± 0.0115 |
| NQ Open | 0.0416 ± 0.0033 | 0.0335 ± 0.0030 |
| WinoGrande | 0.6259 ± 0.0136 | 0.5683 ± 0.0139 |
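The snippet below is a hedged sketch of running such an evaluation, assuming macedonian-llm-eval follows the standard lm-evaluation-harness Python API; the task identifier `arc_easy_mk` is an illustrative placeholder, not a confirmed task name — consult the benchmark's own documentation for the exact invocation.

```python
# Hedged sketch: evaluating a model via the lm-evaluation-harness Python API.
# Assumption: macedonian-llm-eval exposes its tasks through this harness interface.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=LVSTCK/domestic-yak-8B,dtype=bfloat16",
    tasks=["arc_easy_mk"],  # placeholder task name, not a confirmed identifier
    batch_size=8,
)
print(results["results"])  # per-task accuracy with standard error
```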
## 🔑 Key Details
- Language: Macedonian (`mk`)
- Base Model: Meta Llama 3.1 8B Instruct
- Dataset: LVSTCK/macedonian-corpus-raw-dedup (deduplicated version)
- Training Tokens: ~1.6 billion
- Pretraining Epochs: 1 epoch
- Pretraining Objective: Causal Language Modeling (continued pretraining with all weights updated, i.e. full-parameter training); see the sketch after this list.
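For readers unfamiliar with the objective, the sketch below illustrates causal language modeling with full-parameter updates, as used in continued pretraining. It is a conceptual example, not the actual training script (which would add an optimizer, batching, and distributed training).

```python
# Conceptual sketch of the causal LM objective: predict each next token,
# with gradients flowing to all weights (nothing frozen).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # the starting checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.train()  # full-parameter training: no layers are frozen

batch = tokenizer("Пример реченица на македонски јазик.", return_tensors="pt")
# With labels equal to input_ids, transformers shifts the targets internally
# and computes the next-token cross-entropy loss.
outputs = model(**batch, labels=batch["input_ids"])
outputs.loss.backward()  # an optimizer.step() would follow in a real loop
print(float(outputs.loss))
```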
## ⚠️ Limitations
- Biases: The model may show biases present in the training dataset. Efforts were made to clean and deduplicate the corpus, but further bias mitigation might be necessary for sensitive applications.
- Domain Specificity: While the dataset covers diverse domains, performance may vary for niche or underrepresented topics. For example, the dataset is heavily skewed toward 'news'-themed texts, while domains such as 'science' or 'medicine' are less represented.
- Chat Capabilities: This is the base model, so its chat capabilities may be limited. If you would like to chat, use the instruct version.
## 📬 Contact
For inquiries, feedback, or contributions, please feel free to reach out to the core team.
## Citation

```bibtex
@misc{domestic-yak-8B,
  title={domestic-yak-8B: A Macedonian Language Model},
  author={Stefan Krsteski and Matea Tashkovska and Borjan Sazdov},
  year={2024},
  url={https://huggingface.co/LVSTCK/domestic-yak-8B},
  note={Macedonian adaptation of Llama 3.1 8B}
}
```