language:
- da
license: llama2
library_name: transformers
base_model:
- meta-llama/Llama-2-7b-hf
pipeline_tag: text-generation
Model Details
SnakModel is a 7B-parameter model specifically designed for the Danish language. This is the base variant: SnakModel-7B (base). Our models build upon Llama 2, which we continuously pre-train on a diverse collection of Danish corpora comprising 350M documents and 13.6B words, before tuning it on 3.7M Danish instruction-answer pairs.
Model Developers
NLPnorth research unit at the IT University of Copenhagen, Denmark.
Variations
SnakModel comes in an instruction-tuned and a base version. In addition, each model includes intermediate checkpoints (under model revisions).
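As a minimal sketch of how a variant or an intermediate checkpoint might be loaded with the transformers library: the repository ID below is an assumption (this card does not state it), and no revision names are listed here, so `revision` is left as a parameter to be filled in from the model repository.

```python
from typing import Optional

# Assumption: the Hub repository ID is not given in this card; replace as needed.
REPO_ID = "NLPnorth/snakmodel-7b-base"


def load_kwargs(revision: Optional[str] = None) -> dict:
    """Build keyword arguments for `from_pretrained`.

    `revision` selects a branch, tag, or commit on the Hub, which is where
    the intermediate checkpoints are said to live.
    """
    kwargs = {"torch_dtype": "auto"}  # use the dtype stored in the checkpoint
    if revision is not None:
        kwargs["revision"] = revision
    return kwargs


def load_snakmodel(repo_id: str = REPO_ID, revision: Optional[str] = None):
    """Load tokenizer and model; downloads weights on first use."""
    # Imported lazily so `load_kwargs` works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(repo_id, revision=revision)
    model = AutoModelForCausalLM.from_pretrained(repo_id, **load_kwargs(revision))
    return tokenizer, model
```

Calling `load_snakmodel()` without a revision loads the default branch; passing a revision name from the repository should load the corresponding intermediate checkpoint.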
Input
Text only.
Output
Text only.
Model Architecture
SnakModel is an auto-regressive, transformer-based language model. The instruct version uses supervised fine-tuning (SFT) to enable instruction following in Danish.
Model Dates
SnakModel was trained between January 2024 and September 2024.
License
This model follows the original Llama 2 license agreement.
Research Paper
[Released in Q1 2025]
Intended Use & Limitations
Intended Use Cases
SnakModel is intended for use in Danish. The instruction-tuned variant is intended for assistant-like chat.
The instruct variant follows the Llama 2 (chat) instruction template, in which instructions are encapsulated in special tokens, i.e., [INST] {instruction} [/INST].
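The template described above can be sketched as a small helper. This is a simplified, single-turn illustration that assumes no system prompt and leaves the beginning-of-sequence token to the tokenizer; the function name is hypothetical. It applies to the instruct variant, not to this base model.

```python
def format_instruction(instruction: str) -> str:
    """Wrap a single-turn instruction in the [INST] ... [/INST] template."""
    return f"[INST] {instruction.strip()} [/INST]"


prompt = format_instruction("Hvad er hovedstaden i Danmark?")
print(prompt)  # [INST] Hvad er hovedstaden i Danmark? [/INST]
```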
Limitations
SnakModel variants are fine-tuned on Danish data. As such, use in other languages is out of scope. While we found SnakModel to be more proficient in Danish than other Llama 2-based models, it still frequently generates factually incorrect output. Make sure to carefully evaluate and weigh these factors before deploying the model. In addition, make sure to adhere to the original Llama 2 license agreement.
Hardware and Software
Training Factors
SnakModel is trained on private infrastructure with one node, containing four NVIDIA A100-PCIe 40GB GPUs. The node has an AMD Epyc 7662 128 Core Processor and 1TB of RAM.
Carbon Footprint
Total training time amounted to 8,928 GPU hours, with an average carbon efficiency of 0.122 kg CO2eq/kWh. This is equivalent to 272.3 kg CO2eq emitted, based on the Machine Learning Impact calculator.
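The reported figure can be reproduced with a short calculation. The per-GPU power draw is not stated in this card; the sketch below assumes 250 W per GPU (the rated power of the A100-PCIe 40GB), which happens to match the reported total.

```python
GPU_HOURS = 8928                 # from the card
CARBON_EFFICIENCY = 0.122        # kg CO2eq per kWh, from the card
ASSUMED_GPU_POWER_KW = 0.250     # assumption: 250 W per GPU, not stated in the card


def emissions_kg(gpu_hours: float, power_kw: float, kg_per_kwh: float) -> float:
    """Estimated emissions = GPU hours x power per GPU x carbon efficiency."""
    energy_kwh = gpu_hours * power_kw
    return energy_kwh * kg_per_kwh


total_kg = emissions_kg(GPU_HOURS, ASSUMED_GPU_POWER_KW, CARBON_EFFICIENCY)
print(f"{total_kg:.1f} kg CO2eq")  # 272.3 kg CO2eq
```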
Training Data
Overview
SnakModel was continuously pre-trained on a diverse collection of Danish corpora comprising 350M documents and 13.6B words. The instruct version is further tuned on 3.7M Danish instruction-answer pairs.
Data Freshness
The pre-training data has a cutoff of January 2024.
Evaluation Results
Model | LA (mF1) | NER (μF1) | Senti (mF1) | Summ (BERTScore) | CSR (Acc.) | QA (F1) | TM (Acc.) | CT (Acc.) | AVG |
---|---|---|---|---|---|---|---|---|---|
LLaMA2-7B_base | 33.43 | 22.31 | 61.54 | 65.50 | 29.76 | 63.54 | 38.69 | 57.05 | 46.48 |
LLaMA2-7B_chat | 47.42 | 24.63 | 62.35 | 66.15 | 32.24 | 61.34 | 46.67 | 55.18 | 49.50 |
LLaMA2-7B_base + INST_da | 36.10 | 28.48 | 62.86 | 66.43 | 29.04 | 64.40 | 49.10 | 58.46 | 49.35 |
LLaMA2-7B_chat + INST_da | 43.40 | 29.70 | 65.92 | 65.81 | 30.95 | 62.46 | 57.26 | 55.59 | 51.39 |
Viking-7B | 33.67 | 17.18 | 49.48 | 61.96 | 25.11 | 56.29 | 23.97 | 34.90 | 37.82 |
SnakModel-7B_base | 56.28 | 19.91 | 57.42 | 58.95 | 30.47 | 18.52 | 69.14 | 60.93 | 46.45 |
SnakModel-7B_inst | 52.91 | 29.76 | 66.70 | 66.61 | 29.46 | 64.66 | 71.05 | 71.88 | 56.63 |
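The AVG column appears to be the unweighted mean of the eight task scores. This is an assumption, since the card does not define it; the sketch below checks it for the two SnakModel rows, allowing for rounding in the last digit.

```python
# Task scores copied from the table above (LA, NER, Senti, Summ, CSR, QA, TM, CT),
# paired with the reported AVG value.
ROWS = {
    "SnakModel-7B_base": ([56.28, 19.91, 57.42, 58.95, 30.47, 18.52, 69.14, 60.93], 46.45),
    "SnakModel-7B_inst": ([52.91, 29.76, 66.70, 66.61, 29.46, 64.66, 71.05, 71.88], 56.63),
}


def unweighted_mean(scores: list) -> float:
    """Average across tasks, giving every task equal weight."""
    return sum(scores) / len(scores)


for name, (scores, reported) in ROWS.items():
    computed = unweighted_mean(scores)
    print(f"{name}: computed {computed:.2f}, reported {reported:.2f}")
```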
Citation
@inproceedings{zhang-etal-2025-snakmodel,
title = "{SnakModel}: {Lessons} Learned from Training an Open {Danish} Large Language Model",
author = {Zhang, Mike and
M{\"u}ller-Eberstein, Max and
Bassignana, Elisa and
Goot, Rob van der},
editor = "Johansson, Richard and
Stymne, Sara",
booktitle = "Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025)",
month = mar,
year = "2025",
address = "Tallinn, Estonia",
publisher = "University of Tartu Library",
url = "https://aclanthology.org/2025.nodalida-1.80/",
pages = "812--825",
ISBN = "978-9908-53-109-0",
abstract = "We present SnakModel, a Danish large language model (LLM) based on Llama2-7B, which we continuously pre-train on 13.6B Danish words, and further tune on 3.7M Danish instructions. As best practices for creating LLMs for smaller language communities have yet to be established, we examine the effects of early modeling and training decisions on downstream performance throughout the entire training pipeline, including (1) the creation of a strictly curated corpus of Danish text from diverse sources; (2) the language modeling and instruction-tuning training process itself, including the analysis of intermediate training dynamics, and ablations across different hyperparameters; (3) an evaluation on eight language and culturally-specific tasks. Across these experiments SnakModel achieves the highest overall performance, outperforming multiple contemporary Llama2-7B-based models. By making SnakModel, the majority of our pre-training corpus, and the associated code available under open licenses, we hope to foster further research and development in Danish Natural Language Processing, and establish training guidelines for languages with similar resource constraints."
}