pdelobelle committed
Commit f3af836 · verified · 1 Parent(s): c3e5846

Update README.md

Files changed (1)
  1. README.md +29 -14
README.md CHANGED
@@ -13,19 +13,21 @@ license: apache-2.0
  # BübleLM

  <div align="center">
- <img src="/api/placeholder/400/200" alt="BübleLM Logo" />
+ <img src="https://pieter.ai/resources/buble-logo.png" alt="BübleLM Logo" />
  </div>

- BübleLM is a German language model based on Gemma-2B, adapted using trans-tokenization with a German-specific SentencePiece tokenizer. This 2B parameter model achieves state-of-the-art performance on German language tasks while maintaining strong safety properties.
+ BübleLM is a German language model based on Gemma-2B, adapted using [trans-tokenization](https://pieter.ai/trans-tokenization/) with a custom German SentencePiece tokenizer. The model demonstrates how language-specific tokenization can significantly improve performance while maintaining the base model's capabilities.

  ## Model Details

- - **Architecture**: Based on Gemma-2B
+ - **Architecture**: Based on the Gemma-2B decoder-only architecture
  - **Parameters**: 2 billion
- - **Training**: Trans-tokenization from Gemma-2B using German SentencePiece tokenizer (vocab size: 20k)
- - **Context Length**: Same as Gemma-2B
- - **Input**: Text (German)
- - **Output**: Text (German)
+ - **Tokenizer**: Custom German SentencePiece tokenizer (20k vocabulary)
+   - Fertility rate: 1.78 tokens per word
+   - Optimized for German morphological structures
+   - Trained on the same corpus as the model
+ - **Context Length**: 8192 tokens
+ - **Training Hardware**: Single node with 4x NVIDIA A100-SXM4-80GB GPUs

  ## Training Data

@@ -49,6 +51,8 @@ Key improvements over Gemma-2B baseline:
  - ARC-DE: +41% (32.3% vs 22.9%)
  - Average zero-shot: +40% (35.8% vs 25.5%)

+ BübleLM consistently outperforms both the base Gemma-2B and other German models such as LLaMmlein-1B across most tasks.
+
  ## Safety & Ethics

  ### Toxicity
@@ -60,8 +64,11 @@ Key improvements over Gemma-2B baseline:
  - Slight preference for gender-inclusive language (not statistically significant)
  - Example: "Lehrer" vs "Lehrer*innen" (∆PPL = -9.61)

+
  ## Usage

+ **Note**: This is a base language model, not an instruction-tuned model. It is not optimized for chat or instruction following. For best results, use standard text completion rather than chat templates.
+
  ```python
  from transformers import AutoTokenizer, AutoModelForCausalLM

@@ -72,21 +79,29 @@ model = AutoModelForCausalLM.from_pretrained(
      torch_dtype=torch.bfloat16
  )

- messages = [{"role": "user", "content": "Schreibe ein Gedicht über Berlin."}]
- input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt").to("cuda")
-
- outputs = model.generate(**input_ids, max_new_tokens=256)
+ # Basic text completion
+ text = "Berlin ist eine Stadt, die"
+ inputs = tokenizer(text, return_tensors="pt").to("cuda")
+ outputs = model.generate(**inputs, max_new_tokens=256)
  print(tokenizer.decode(outputs[0]))
  ```

+ For instruction-tuning experiments or chat applications, we recommend first fine-tuning the model on appropriate German instruction datasets.
+
+
  ## Limitations

- - Limited vocabulary size (20k tokens) compared to multilingual models
+ - Limited vocabulary size (20k tokens) compared to multilingual models (250k for Gemma)
  - Performance may vary on specialized domains not well-represented in training data
- - Model inherits base limitations from Gemma architecture
+ - Higher fertility rate (1.78 tokens per word) due to the smaller vocabulary
+ - Inherits base limitations from the Gemma architecture

  ## Citation

  ```bibtex
-
+ @article{delobelle2024buble,
+   title={BübleLM: A small German LM},
+   author={Delobelle, Pieter and Akbik, Alan and others},
+   year={2024}
+ }
  ```
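
The Model Details above quote a tokenizer fertility of 1.78 tokens per word. Below is a minimal sketch of how such a number can be estimated with the Hugging Face tokenizer; the repo ID and the sample sentence are placeholders (not taken from this model card), and a real measurement would average over a large German corpus rather than a single sentence.

```python
# Sketch: estimating tokenizer fertility (subword tokens per whitespace word).
# NOTE: "bueble-lm-2b" is a placeholder repo ID; use the actual Hub ID of this model.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bueble-lm-2b")

# Placeholder sample; a real estimate would run over a held-out German corpus.
sample = ("Die Universität veröffentlichte gestern einen ausführlichen Bericht "
          "über die Forschungsergebnisse des vergangenen Jahres.")

words = sample.split()
tokens = tokenizer.tokenize(sample)

fertility = len(tokens) / len(words)
print(f"{len(tokens)} tokens / {len(words)} words -> fertility {fertility:.2f}")
```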
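
The Safety & Ethics section reports a perplexity difference (∆PPL = -9.61) between "Lehrer" and "Lehrer*innen". The exact evaluation protocol is not specified in this README; the sketch below only illustrates how such a per-sentence perplexity comparison can be computed, again with a placeholder repo ID and placeholder sentences.

```python
# Sketch: comparing sentence perplexity for a generic-masculine vs. a
# gender-inclusive formulation. Repo ID and sentences are placeholders.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "bueble-lm-2b"  # placeholder repo ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to("cuda")
model.eval()

def perplexity(text: str) -> float:
    # exp of the mean token-level cross-entropy under the model
    inputs = tokenizer(text, return_tensors="pt").to("cuda")
    with torch.no_grad():
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return torch.exp(loss).item()

ppl_generic = perplexity("Die Lehrer der Schule treffen sich morgen zur Konferenz.")
ppl_inclusive = perplexity("Die Lehrer*innen der Schule treffen sich morgen zur Konferenz.")
print(f"Delta PPL (inclusive - generic): {ppl_inclusive - ppl_generic:.2f}")
```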
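
The Usage section recommends fine-tuning on German instruction data before using the model for chat. A minimal supervised fine-tuning sketch with the plain transformers Trainer is shown below; the repo ID, the prompt template, and the two toy example pairs are placeholders, and a real run would use a full German instruction dataset (and possibly parameter-efficient methods such as LoRA).

```python
# Sketch: supervised fine-tuning on instruction-style data with the HF Trainer.
# Repo ID, prompt template, and the toy examples below are placeholders.
import torch
from torch.utils.data import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_id = "bueble-lm-2b"  # placeholder repo ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # make sure padding is defined
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Toy stand-in for a German instruction dataset.
examples = [
    {"instruction": "Erkläre den Begriff Photosynthese in einem Satz.",
     "response": "Photosynthese ist der Prozess, bei dem Pflanzen Lichtenergie in chemische Energie umwandeln."},
    {"instruction": "Nenne drei deutsche Bundesländer.",
     "response": "Bayern, Sachsen und Hessen."},
]

class InstructionDataset(Dataset):
    """Formats instruction/response pairs into a single training text."""
    def __init__(self, rows, tokenizer, max_length=512):
        self.rows, self.tokenizer, self.max_length = rows, tokenizer, max_length

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, idx):
        row = self.rows[idx]
        text = (f"### Anweisung:\n{row['instruction']}\n\n"
                f"### Antwort:\n{row['response']}{self.tokenizer.eos_token}")
        return self.tokenizer(text, truncation=True, max_length=self.max_length)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bueble-sft", per_device_train_batch_size=1,
                           num_train_epochs=1, logging_steps=1, bf16=True),
    train_dataset=InstructionDataset(examples, tokenizer),
    # mlm=False gives the standard causal-LM objective (labels = padded input_ids).
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)
trainer.train()
```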
 