---
license: llama3.1
datasets:
  - LVSTCK/macedonian-corpus-raw-dedup
language:
  - mk
base_model:
  - meta-llama/Llama-3.1-8B-Instruct
tags:
  - mk
  - mkd
  - macedonia
---

# 🐂 domestic-yak, a Macedonian LM (base version)

## Model Summary

This model is a Macedonian language adaptation of Llama 3.1 8B. It was produced by continued pretraining for one epoch on a deduplicated version of the Macedonian Corpus Raw dataset, containing approximately 1.6 billion tokens. This makes it well-suited for Macedonian-language tasks such as text classification, language generation, and translation.
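Since this is a base (completion) model, it is queried with plain text prompts rather than chat messages. Below is a minimal inference sketch using the Hugging Face `transformers` API; it is not an official snippet from the authors, and the repo id is taken from the citation URL.

```python
# Minimal inference sketch (assumes `transformers` and `torch` are installed).
MODEL_ID = "LVSTCK/domestic-yak-8B"  # repo id from the citation URL

def generate(prompt: str, max_new_tokens: int = 50) -> str:
    """Continue `prompt` with the base model (plain completion, no chat template)."""
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

if __name__ == "__main__":
    # Macedonian prompt: "Skopje is the capital of"
    print(generate("Скопје е главниот град на"))
```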

## 📊 Results

The table below compares the performance of our model, domestic-yak-8B, with its base model, Llama 3.1 8B Instruct, evaluated using the macedonian-llm-eval benchmark.

As shown in the table, domestic-yak-8B consistently outperforms its foundational model on all tasks.

| Task (mk version) | domestic-yak-8B | Llama 3.1 8B Instruct |
|---|---|---|
| ARC Easy | 0.5244 ± 0.0102 | 0.4453 ± 0.0102 |
| ARC Challenge | 0.3183 ± 0.0136 | 0.2824 ± 0.0132 |
| BoolQ | 0.7676 ± 0.0074 | 0.7639 ± 0.0074 |
| HellaSwag | 0.4324 ± 0.0049 | 0.3740 ± 0.0048 |
| OpenBookQA | 0.2920 ± 0.0204 | 0.2520 ± 0.0194 |
| PIQA | 0.6687 ± 0.0110 | 0.5865 ± 0.0115 |
| NQ Open | 0.0416 ± 0.0033 | 0.0335 ± 0.0030 |
| WinoGrande | 0.6259 ± 0.0136 | 0.5683 ± 0.0139 |
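The reported ± values are standard errors, so the size of each gap can be read against them. The sketch below (not part of the original evaluation) compares each score gap to a ~95% interval, under the simplifying assumption that the two models' errors are independent and approximately normal:

```python
import math

# (score, stderr) pairs from the table above: (domestic-yak-8B, Llama 3.1 8B Instruct)
results = {
    "ARC Easy":      ((0.5244, 0.0102), (0.4453, 0.0102)),
    "ARC Challenge": ((0.3183, 0.0136), (0.2824, 0.0132)),
    "BoolQ":         ((0.7676, 0.0074), (0.7639, 0.0074)),
    "HellaSwag":     ((0.4324, 0.0049), (0.3740, 0.0048)),
    "OpenBookQA":    ((0.2920, 0.0204), (0.2520, 0.0194)),
    "PIQA":          ((0.6687, 0.0110), (0.5865, 0.0115)),
    "NQ Open":       ((0.0416, 0.0033), (0.0335, 0.0030)),
    "WinoGrande":    ((0.6259, 0.0136), (0.5683, 0.0139)),
}

def gap_exceeds_interval(yak, llama, z=1.96):
    """True if the score gap exceeds z combined standard errors
    (independence and normality are simplifying assumptions)."""
    (a, sa), (b, sb) = yak, llama
    return (a - b) > z * math.sqrt(sa**2 + sb**2)

for task, (yak, llama) in results.items():
    gap = yak[0] - llama[0]
    print(f"{task:14s} gap={gap:+.4f} clear={gap_exceeds_interval(yak, llama)}")
```

For example, the ARC Easy gap (+0.0791) is several combined standard errors wide, while the BoolQ gap (+0.0037) is within noise.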

## 🔑 Key Details

## ⚠️ Limitations

- **Biases:** The model may show biases present in the training dataset. Efforts were made to clean and deduplicate the corpus, but further bias mitigation might be necessary for sensitive applications.
- **Domain Specificity:** While the dataset covers diverse domains, performance may vary for niche or underrepresented topics. For example, the dataset is heavily skewed toward 'news'-themed texts, while domains such as 'science' or 'medicine' are less represented.
- **Chat Capabilities:** This version is the base model, so its chat capabilities may be limited. If you would like to chat, use the instruct version.

## 📬 Contact

For inquiries, feedback, or contributions, please feel free to reach out to the core team:

## Citation

@misc{domestic-yak-8B,
  title={domestic-yak-8B: A Macedonian Language Model},
  author={Stefan Krsteski and Matea Tashkovska and Borjan Sazdov},
  year={2024},
  url={https://huggingface.co/LVSTCK/domestic-yak-8B},
  note={Macedonian adaptation of Llama 3.1 8B.}
}