Update README.md
base_model:
- meta-llama/Llama-3.1-8B-Instruct
---

# 🐂 domestic-yak, a Macedonian LM (base version)

## Model Summary
This model is a Macedonian language adaptation of the Llama 3.1 8B Instruct model. It has undergone continued pretraining for one epoch on a deduplicated version of the Macedonian Corpus Raw dataset, which contains approximately 1.6 billion tokens, making it well-suited for Macedonian-language tasks such as text classification, language generation, and translation.

### 📊 Results
The table below compares the performance of our model, domestic-yak-8B, with its base model, Llama 3.1 8B Instruct, evaluated using the [macedonian-llm-eval](https://github.com/LVSTCK/macedonian-llm-eval) benchmark.

As shown in the table, domestic-yak-8B consistently outperforms its base model on all tasks.

| Benchmark | domestic-yak-8B | Llama 3.1 8B Instruct |
|:----------|:----------------|:----------------------|
| **NQ Open** | **0.0416 ± 0.0033** | 0.0335 ± 0.0030 |
| **WinoGrande** | **0.6259 ± 0.0136** | 0.5683 ± 0.0139 |

## 🔑 Key Details
- **Language:** Macedonian (`mk`)
- **Base Model:** [Meta Llama 3.1 8B Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)
- **Dataset:** [LVSTCK/macedonian-corpus-raw](https://huggingface.co/datasets/LVSTCK/macedonian-corpus-raw) (deduplicated version)
- **Pretraining Epochs:** 1 epoch
- **Pretraining Objective:** Causal Language Modeling (continued pretraining with all model weights updated); see the sketch after this list.
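
The exact training recipe is not published in this card, so the snippet below is only a rough sketch of what one epoch of full-weight causal-LM continued pretraining on the corpus could look like with Hugging Face `transformers` and `datasets`; the `text` column name, sequence length, and hyperparameters are illustrative assumptions, not the authors' actual configuration.

```python
import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token          # Llama tokenizers ship without a pad token
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16)

# Assumption: the deduplicated corpus exposes a plain "text" column in its train split.
corpus = load_dataset("LVSTCK/macedonian-corpus-raw", split="train")
corpus = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=4096),
    batched=True,
    remove_columns=corpus.column_names,
)

args = TrainingArguments(
    output_dir="domestic-yak-8B",
    num_train_epochs=1,                            # one epoch, as stated above
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=1e-5,
    bf16=True,
    logging_steps=100,
    save_strategy="epoch",
)

Trainer(
    model=model,
    args=args,
    train_dataset=corpus,
    # mlm=False gives the standard next-token (causal LM) objective
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
).train()
```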
## Usage
...
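
A usage snippet has not been filled in yet. As an illustration only, loading the base model for Macedonian text generation might look like the sketch below; the repository id `LVSTCK/domestic-yak-8B` is assumed from the model name used elsewhere in this card and may differ.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "LVSTCK/domestic-yak-8B"  # assumed repo id; adjust if the actual id differs
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "Скопје е главен град на"  # "Skopje is the capital of"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```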
## ⚠️ Limitations
- **Biases:** The model may show biases present in the training dataset. Efforts were made to clean and deduplicate the corpus, but further bias mitigation might be necessary for sensitive applications.
- **Domain Specificity:** While the dataset covers diverse domains, performance may vary for niche or underrepresented topics. For example, the dataset is heavily skewed toward 'news'-themed texts, while domains such as 'science' or 'medicine' are less represented.
- **Chat Capabilities:** This is the base model, so its chat capabilities may be limited. If you want to chat with the model, use the [instruct version](https://huggingface.co/LVSTCK/domestic-yak-8B-instruct); see the chat sketch after this list.
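
For chat, a minimal sketch with the instruct version is shown below; it assumes `LVSTCK/domestic-yak-8B-instruct` exposes a chat template through `tokenizer.apply_chat_template`, which this card does not confirm.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "LVSTCK/domestic-yak-8B-instruct"  # instruct variant linked above
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# "Write me a short poem about Ohrid."
messages = [{"role": "user", "content": "Напиши ми кратка песна за Охрид."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=200)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```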