Update README.md
# BübleLM

<div align="center">
<img src="https://pieter.ai/resources/buble-logo.png" alt="BübleLM Logo" />
</div>

BübleLM is a German language model based on Gemma-2B, adapted using [trans-tokenization](https://pieter.ai/trans-tokenization/) with a custom German SentencePiece tokenizer. The model demonstrates how language-specific tokenization can significantly improve performance while maintaining the base model's capabilities.

## Model Details

- **Architecture**: Based on the Gemma-2B decoder-only architecture
- **Parameters**: 2 billion
- **Tokenizer**: Custom German SentencePiece tokenizer (20k vocabulary)
  - Fertility rate: 1.78 tokens per word (see the sketch below)
  - Optimized for German morphological structures
  - Trained on the same corpus as the model
- **Context Length**: 8192 tokens
- **Training Hardware**: Single node with 4x NVIDIA A100-SXM4-80GB GPUs
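
The fertility figure can be checked directly against the tokenizer. The snippet below is only a minimal sketch: the Hub id `flair/bueble-lm-2b` is assumed (it is not stated in this excerpt), and fertility is measured simply as tokens per whitespace-separated word on a sample sentence.

```python
from transformers import AutoTokenizer

# Hub id assumed for illustration; substitute the actual repository name.
tokenizer = AutoTokenizer.from_pretrained("flair/bueble-lm-2b")

# Fertility = average number of tokens produced per whitespace-separated word.
text = "Die Donaudampfschifffahrtsgesellschaft stellt neue Kapitäninnen ein."
tokens = tokenizer.tokenize(text)
words = text.split()
print(f"{len(tokens)} tokens / {len(words)} words = {len(tokens) / len(words):.2f} tokens per word")
```
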
## Training Data

[...]

Key improvements over Gemma-2B baseline:

- ARC-DE: +41% (32.3% vs 22.9%)
- Average zero-shot: +40% (35.8% vs 25.5%)

BübleLM consistently outperforms both the base Gemma-2B and other German models like LLaMmlein-1B across most tasks.
## Safety & Ethics

### Toxicity

[...]

- Slight preference for gender-inclusive language (not statistically significant)
  - Example: "Lehrer" vs "Lehrer*innen" (∆PPL = -9.61)
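
A ∆PPL comparison of this kind can be illustrated by scoring a minimal pair with the model. The following is only a sketch: the Hub id `flair/bueble-lm-2b`, the example sentences, and the sign convention (perplexity of the inclusive form minus perplexity of the generic form) are assumptions, not the evaluation setup behind the reported number.

```python
import math
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Hub id assumed for illustration.
name = "flair/bueble-lm-2b"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)
model.eval()

def perplexity(text: str) -> float:
    # Exponentiated average next-token loss over the full sentence.
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        loss = model(**enc, labels=enc["input_ids"]).loss
    return math.exp(loss.item())

generic = perplexity("Die Lehrer planen den Unterricht für morgen.")
inclusive = perplexity("Die Lehrer*innen planen den Unterricht für morgen.")
print(f"PPL(inclusive) - PPL(generic) = {inclusive - generic:.2f}")
```
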
## Usage

**Note**: This is a base language model, not an instruction-tuned model. It is not optimized for chat or instruction following. For best results, use standard text completion rather than chat templates.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# The checkpoint name and device settings below are assumptions; point them
# at the published repository for this model.
tokenizer = AutoTokenizer.from_pretrained("flair/bueble-lm-2b")
model = AutoModelForCausalLM.from_pretrained(
    "flair/bueble-lm-2b",
    device_map="auto",
    torch_dtype=torch.bfloat16
)

# Basic text completion
text = "Berlin ist eine Stadt, die"
inputs = tokenizer(text, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0]))
```

For instruction-tuning experiments or chat applications, we recommend fine-tuning the model first with appropriate German instruction datasets.
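
As one possible starting point, such a fine-tuning run can be sketched with TRL's `SFTTrainer`. Everything named below is a placeholder or assumption: the dataset file, the output directory, and the Hub id of the base model; the data is assumed to be plain completion-style text in a `text` column.

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Placeholder dataset: any German instruction data rendered as plain text
# in a "text" column works with SFTTrainer's default handling.
train_ds = load_dataset("json", data_files="german_instructions.jsonl", split="train")

trainer = SFTTrainer(
    model="flair/bueble-lm-2b",              # assumed Hub id of the base model
    train_dataset=train_ds,
    args=SFTConfig(output_dir="bueble-2b-sft"),
)
trainer.train()
```

Whatever framework is used, the key point from the note above is to train on completion-style text rather than relying on a chat template.
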
## Limitations

- Limited vocabulary size (20k tokens) compared to multilingual models (250k for Gemma)
- Performance may vary on specialized domains not well-represented in training data
- Higher fertility rate (1.78 tokens per word) due to the smaller vocabulary
- Inherits base limitations from the Gemma architecture

## Citation

```bibtex
@article{delobelle2024buble,
  title={BübleLM: A small German LM},
  author={Delobelle, Pieter and Akbik, Alan and others},
  year={2024}
}
```