pdelobelle committed
Commit f3af836 · verified · 1 Parent(s): c3e5846

Update README.md

Files changed (1)
  1. README.md +29 -14
README.md CHANGED
@@ -13,19 +13,21 @@ license: apache-2.0
  # BübleLM

  <div align="center">
- <img src="/api/placeholder/400/200" alt="BübleLM Logo" />
+ <img src="https://pieter.ai/resources/buble-logo.png" alt="BübleLM Logo" />
  </div>

- BübleLM is a German language model based on Gemma-2B, adapted using trans-tokenization with a German-specific SentencePiece tokenizer. This 2B parameter model achieves state-of-the-art performance on German language tasks while maintaining strong safety properties.
+ BübleLM is a German language model based on Gemma-2B, adapted using [trans-tokenization](https://pieter.ai/trans-tokenization/) with a custom German SentencePiece tokenizer. The model demonstrates how language-specific tokenization can significantly improve performance while maintaining the base model's capabilities.

  ## Model Details

- - **Architecture**: Based on Gemma-2B
+ - **Architecture**: Based on the Gemma-2B decoder-only architecture
  - **Parameters**: 2 billion
- - **Training**: Trans-tokenization from Gemma-2B using German SentencePiece tokenizer (vocab size: 20k)
- - **Context Length**: Same as Gemma-2B
- - **Input**: Text (German)
- - **Output**: Text (German)
+ - **Tokenizer**: Custom German SentencePiece tokenizer (20k vocabulary)
+   - Fertility rate: 1.78 tokens per word
+   - Optimized for German morphological structures
+   - Trained on the same corpus as the model
+ - **Context Length**: 8192 tokens
+ - **Training Hardware**: Single node with 4x NVIDIA A100-SXM4-80GB GPUs

  ## Training Data

@@ -49,6 +51,8 @@ Key improvements over Gemma-2B baseline:
  - ARC-DE: +41% (32.3% vs 22.9%)
  - Average zero-shot: +40% (35.8% vs 25.5%)

+ BübleLM consistently outperforms both the base Gemma-2B and other German models such as LLaMmlein-1B across most tasks.
+
  ## Safety & Ethics

  ### Toxicity
@@ -60,8 +64,11 @@ Key improvements over Gemma-2B baseline:
  - Slight preference for gender-inclusive language (not statistically significant)
  - Example: "Lehrer" vs "Lehrer*innen" (∆PPL = -9.61)

+
  ## Usage

+ **Note**: This is a base language model, not an instruction-tuned model. It is not optimized for chat or instruction following. For best results, use standard text completion rather than chat templates.
+
  ```python
  from transformers import AutoTokenizer, AutoModelForCausalLM

@@ -72,21 +79,29 @@ model = AutoModelForCausalLM.from_pretrained(
      torch_dtype=torch.bfloat16
  )

- messages = [{"role": "user", "content": "Schreibe ein Gedicht über Berlin."}]
- input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt").to("cuda")
-
- outputs = model.generate(**input_ids, max_new_tokens=256)
+ # Basic text completion
+ text = "Berlin ist eine Stadt, die"
+ inputs = tokenizer(text, return_tensors="pt").to("cuda")
+ outputs = model.generate(**inputs, max_new_tokens=256)
  print(tokenizer.decode(outputs[0]))
  ```

+ For instruction-tuning experiments or chat applications, we recommend first fine-tuning the model on appropriate German instruction datasets.
+
+
  ## Limitations

- - Limited vocabulary size (20k tokens) compared to multilingual models
+ - Limited vocabulary size (20k tokens) compared to multilingual models (250k for Gemma)
  - Performance may vary on specialized domains not well-represented in training data
- - Model inherits base limitations from Gemma architecture
+ - Higher fertility rate (1.78 tokens per word) due to the smaller vocabulary
+ - Inherits base limitations from the Gemma architecture

  ## Citation

  ```bibtex
-
+ @article{delobelle2024buble,
+   title={BübleLM: A small German LM},
+   author={Delobelle, Pieter and Akbik, Alan and others},
+   year={2024}
+ }
  ```
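
The Model Details above quote a tokenizer fertility of 1.78 tokens per word. Below is a minimal sketch of how such a number can be estimated with the Hugging Face tokenizer; the repo ID and the sample sentence are placeholders (not taken from this model card), and a real measurement would average over a large German corpus rather than a single sentence.

```python
# Sketch: estimating tokenizer fertility (subword tokens per whitespace word).
# NOTE: "bueble-lm-2b" is a placeholder repo ID; use the actual Hub ID of this model.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bueble-lm-2b")

# Placeholder sample; a real estimate would run over a held-out German corpus.
sample = ("Die Universität veröffentlichte gestern einen ausführlichen Bericht "
          "über die Forschungsergebnisse des vergangenen Jahres.")

words = sample.split()
tokens = tokenizer.tokenize(sample)

fertility = len(tokens) / len(words)
print(f"{len(tokens)} tokens / {len(words)} words -> fertility {fertility:.2f}")
```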
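
The Safety & Ethics section reports a perplexity difference (∆PPL = -9.61) between "Lehrer" and "Lehrer*innen". The exact evaluation protocol is not specified in this README; the sketch below only illustrates how such a per-sentence perplexity comparison can be computed, again with a placeholder repo ID and placeholder sentences.

```python
# Sketch: comparing sentence perplexity for a generic-masculine vs. a
# gender-inclusive formulation. Repo ID and sentences are placeholders.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "bueble-lm-2b"  # placeholder repo ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to("cuda")
model.eval()

def perplexity(text: str) -> float:
    # exp of the mean token-level cross-entropy under the model
    inputs = tokenizer(text, return_tensors="pt").to("cuda")
    with torch.no_grad():
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return torch.exp(loss).item()

ppl_generic = perplexity("Die Lehrer der Schule treffen sich morgen zur Konferenz.")
ppl_inclusive = perplexity("Die Lehrer*innen der Schule treffen sich morgen zur Konferenz.")
print(f"Delta PPL (inclusive - generic): {ppl_inclusive - ppl_generic:.2f}")
```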
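
The Usage section recommends fine-tuning on German instruction data before using the model for chat. A minimal supervised fine-tuning sketch with the plain transformers Trainer is shown below; the repo ID, the prompt template, and the two toy example pairs are placeholders, and a real run would use a full German instruction dataset (and possibly parameter-efficient methods such as LoRA).

```python
# Sketch: supervised fine-tuning on instruction-style data with the HF Trainer.
# Repo ID, prompt template, and the toy examples below are placeholders.
import torch
from torch.utils.data import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_id = "bueble-lm-2b"  # placeholder repo ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # make sure padding is defined
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Toy stand-in for a German instruction dataset.
examples = [
    {"instruction": "Erkläre den Begriff Photosynthese in einem Satz.",
     "response": "Photosynthese ist der Prozess, bei dem Pflanzen Lichtenergie in chemische Energie umwandeln."},
    {"instruction": "Nenne drei deutsche Bundesländer.",
     "response": "Bayern, Sachsen und Hessen."},
]

class InstructionDataset(Dataset):
    """Formats instruction/response pairs into a single training text."""
    def __init__(self, rows, tokenizer, max_length=512):
        self.rows, self.tokenizer, self.max_length = rows, tokenizer, max_length

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, idx):
        row = self.rows[idx]
        text = (f"### Anweisung:\n{row['instruction']}\n\n"
                f"### Antwort:\n{row['response']}{self.tokenizer.eos_token}")
        return self.tokenizer(text, truncation=True, max_length=self.max_length)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bueble-sft", per_device_train_batch_size=1,
                           num_train_epochs=1, logging_steps=1, bf16=True),
    train_dataset=InstructionDataset(examples, tokenizer),
    # mlm=False gives the standard causal-LM objective (labels = padded input_ids).
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)
trainer.train()
```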
 