|
--- |
|
license: llama3.1 |
|
datasets: |
|
- LVSTCK/macedonian-corpus-raw-dedup |
|
language: |
|
- mk |
|
base_model: |
|
- meta-llama/Llama-3.1-8B-Instruct |
|
--- |
|
|
|
# 🐂 domestic-yak, a Macedonian LM (base version) |
|
|
|
## Model Summary |
|
This model is a Macedonian-language adaptation of the Llama 3.1 8B Instruct model. It was continually pretrained for one epoch on a deduplicated version of the Macedonian Corpus Raw dataset, containing approximately 1.6 billion tokens, making it well suited for Macedonian-language tasks such as text classification, language generation, and translation.
|
|
|
### 📊 Results |
|
The table below compares the performance of our model, domestic-yak-8B, with its base model, Llama 3.1 8B Instruct, evaluated using the [macedonian-llm-eval](https://github.com/LVSTCK/macedonian-llm-eval) benchmark. Scores are reported as mean ± standard error.
|
|
|
As the table shows, domestic-yak-8B outperforms its base model on all tasks.
|
|
|
| **Task (mk version)** | **domestic-yak-8B** | **Llama 3.1 8B Instruct** |
|
|-------------------------|---------------------------|-----------------------| |
|
| **ARC Easy** | **0.5244 ± 0.0102** | 0.4453 ± 0.0102 | |
|
| **ARC Challenge** | **0.3183 ± 0.0136** | 0.2824 ± 0.0132 | |
|
| **BoolQ** | **0.7676 ± 0.0074** | 0.7639 ± 0.0074 | |
|
| **HellaSwag** | **0.4324 ± 0.0049** | 0.3740 ± 0.0048 | |
|
| **Openbook QA** | **0.2920 ± 0.0204** | 0.2520 ± 0.0194 | |
|
| **PIQA** | **0.6687 ± 0.0110** | 0.5865 ± 0.0115 | |
|
| **NQ Open** | **0.0416 ± 0.0033** | 0.0335 ± 0.0030 | |
|
| **WinoGrande** | **0.6259 ± 0.0136** | 0.5683 ± 0.0139 | |
|
|
|
## 🔑 Key Details |
|
- **Language:** Macedonian (`mk`) |
|
- **Base Model:** [Meta Llama 3.1 8B Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) |
|
- **Dataset:** [LVSTCK/macedonian-corpus-raw-dedup](https://huggingface.co/datasets/LVSTCK/macedonian-corpus-cleaned-dedup) (deduplicated version) |
|
- **Training Tokens:** ~1.6 billion |
|
- **Pretraining Epochs:** 1 epoch |
|
- **Pretraining Objective:** Causal language modeling (full-parameter continued pretraining; all weights are updated). A simplified sketch of this setup is shown below.
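
For illustration only, the snippet below sketches what such a continued-pretraining setup could look like with Hugging Face Transformers. The model and dataset IDs are the ones listed above; the text column name, block size, batch size, and learning rate are placeholder assumptions, not the settings actually used to train domestic-yak-8B.

```python
# Illustrative continued-pretraining sketch; hyperparameters are placeholders,
# not the settings actually used to train domestic-yak-8B.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base_model_id = "meta-llama/Llama-3.1-8B-Instruct"
dataset_id = "LVSTCK/macedonian-corpus-raw-dedup"

tokenizer = AutoTokenizer.from_pretrained(base_model_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token

model = AutoModelForCausalLM.from_pretrained(base_model_id)  # all weights remain trainable

raw = load_dataset(dataset_id, split="train")  # assumes a single "train" split

def tokenize(batch):
    # Assumes the corpus exposes a "text" column; 4096 is an assumed block size.
    return tokenizer(batch["text"], truncation=True, max_length=4096)

tokenized = raw.map(tokenize, batched=True, remove_columns=raw.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="yak-continued-pretraining",
        num_train_epochs=1,                # one pass over the corpus, as described above
        per_device_train_batch_size=1,     # placeholder; real runs use larger effective batches
        gradient_accumulation_steps=16,    # placeholder
        learning_rate=2e-5,                # placeholder
        bf16=True,
    ),
    train_dataset=tokenized,
    # mlm=False gives the standard next-token (causal) objective.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```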
|
|
|
## Usage |
|
... |
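
A minimal text-generation sketch using the standard Hugging Face Transformers API; the repository ID `LVSTCK/domestic-yak-8B` is taken from the citation below, and the dtype and sampling settings are illustrative rather than tuned recommendations.

```python
# Minimal usage sketch (illustrative; generation settings are not tuned recommendations).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "LVSTCK/domestic-yak-8B"  # repository ID as listed in the citation below

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # assumes a GPU with bf16 support; use float16/float32 otherwise
    device_map="auto",
)

# This is the base (non-instruct) model, so prompt it as plain text continuation.
prompt = "Скопје е главниот град на"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```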
|
|
|
## 📬 Contact |
|
|
|
For inquiries, feedback, or contributions, please feel free to reach out to the core team: |
|
|
|
- [Stefan Krsteski](https://www.linkedin.com/in/stefan-krsteski-136abb235/) [📧](mailto:[email protected]) |
|
- [Matea Tashkovska](https://www.linkedin.com/in/matea-tashkovska-774603198/) [📧](mailto:[email protected]) |
|
- [Borjan Sazdov](https://www.linkedin.com/in/borjan-sazdov-4b2187211/) [📧](mailto:[email protected]) |
|
|
|
## ⚠️ Limitations |
|
- **Biases:** The model may show biases present in the training dataset. Efforts were made to clean and deduplicate the corpus, but further bias mitigation might be necessary for sensitive applications. |
|
- **Domain Specificity:** While the dataset covers diverse domains, performance may vary for niche or underrepresented topics. For example, the dataset is heavily skewed toward 'news'-themed texts, while domains such as 'science' or 'medicine' are less represented. |
|
- **Chat Capabilities:** This is the base model, so its chat capabilities may be limited. If you would like to chat, use the [instruct version](https://huggingface.co/LVSTCK/domestic-yak-8B-instruct).
|
|
|
## Citation |
|
``` |
|
@misc{domestic-yak-8B,
  title={domestic-yak-8B: A Macedonian Language Model},
  author={Stefan Krsteski and Matea Tashkovska and Borjan Sazdov},
  year={2024},
  url={https://huggingface.co/LVSTCK/domestic-yak-8B},
  note={Macedonian adaptation of Llama 3.1 8B.}
}
|
``` |