Geraldine/Gemini-Distill-Qwen2.5-0.5B-ead-GGUF


Geraldine committed 21 days ago

Commit

4dc093c


1 Parent(s): 1998e14

Create README.md


README.md +125 -0


	@@ -0,0 +1,125 @@

+---
+license: mit
+language:
+- fr
+- en
+base_model:
+- Geraldine/Gemini-Distill-Qwen2.5-0.5B-ead
+---
+# Gemini-Distill-Qwen2.5-0.5B-ead GGUF Quantized Versions (Distilled from Gemini-2.0-Flash-Thinking-Exp)
+## Model Description
+This repository contains **quantized versions** of the fine-tuned **Geraldine/Gemini-Distill-Qwen2.5-0.5B-ead** model, which was trained via knowledge distillation from **Gemini-2.0-Flash-Thinking-Exp**. Fine-tuning teaches the model to reason step by step before it generates its final **Encoded Archival Description (EAD/XML)** output.
+This repository provides various **GGUF quantized formats**, allowing efficient inference on different hardware setups, including CPUs and GPUs.
+---
+## Available GGUF Files
+The following quantized versions of the model were generated using **llama.cpp**:
+| File Name | Description |
+|-----------|-------------|
+| `Gemini-Distill-Qwen2.5-0.5B-ead-Q2_K.gguf` | Ultra-low precision (2-bit) for extreme compression |
+| `Gemini-Distill-Qwen2.5-0.5B-ead-Q3_K_M.gguf` | 3-bit quantization with mixed precision |
+| `Gemini-Distill-Qwen2.5-0.5B-ead-Q4_K_M.gguf` | 4-bit quantization with mixed precision |
+| `Gemini-Distill-Qwen2.5-0.5B-ead-Q5_K_M.gguf` | 5-bit quantization with mixed precision |
+| `Gemini-Distill-Qwen2.5-0.5B-ead-Q6_K.gguf` | 6-bit quantization |
+| `Gemini-Distill-Qwen2.5-0.5B-ead-Q8_0.gguf` | 8-bit quantization for a balance between speed and accuracy |
+| `Gemini-Distill-Qwen2.5-0.5B-ead-fp16.gguf` | 16-bit floating point (fp16) version |
+| `Gemini-Distill-Qwen2.5-0.5B-ead-fp32.gguf` | Full precision (fp32) version |
+---
+## How to Use the Quantized Model
+### **Running the Model with llama.cpp**
+To run the model with `llama.cpp`, use the following command (recent builds name the binary `llama-cli`; older builds call it `main`):
+```bash
+./llama-cli -m Gemini-Distill-Qwen2.5-0.5B-ead-Q4_K_M.gguf -p "Convert the following archival information into EAD/XML: ..."
+```
+Pick the quantized file that fits your available memory; the "Choosing the Right Quantization Format" section lists the trade-offs.
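
The GGUF file passed to `-m` has to exist locally first. As a minimal sketch, assuming the `huggingface_hub` package is installed, one file can be fetched from this repository as shown here. The helper names `gguf_filename` and `download_gguf` are illustrative and not part of the repository; the file names follow the table in "Available GGUF Files".

```python
REPO_ID = "Geraldine/Gemini-Distill-Qwen2.5-0.5B-ead-GGUF"
QUANTS = ("Q2_K", "Q3_K_M", "Q4_K_M", "Q5_K_M", "Q6_K", "Q8_0", "fp16", "fp32")


def gguf_filename(quant: str) -> str:
    """Build the file name this repository uses for a quantization tag."""
    if quant not in QUANTS:
        raise ValueError(f"unknown quantization {quant!r}; expected one of {QUANTS}")
    return f"Gemini-Distill-Qwen2.5-0.5B-ead-{quant}.gguf"


def download_gguf(quant: str = "Q4_K_M") -> str:
    """Download one GGUF file into the local Hugging Face cache; return its path."""
    # Imported lazily so gguf_filename() works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=REPO_ID, filename=gguf_filename(quant))
```

Calling `download_gguf("Q4_K_M")` returns a local path that can be passed to the `-m` option.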
+### **Running the Model with GPT4All**
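+To use GPT4All, load the local GGUF file with:
+```python
+from gpt4all import GPT4All
+
+# GPT4All takes the file name and the directory that contains it separately.
+model_name = "Gemini-Distill-Qwen2.5-0.5B-ead-Q4_K_M.gguf"
+model = GPT4All(model_name, model_path=".", allow_download=False)
+# max_tokens caps the reply length; adjust it to the size of the expected XML.
+response = model.generate("Convert the following archival information into EAD/XML:", max_tokens=512)
+print(response)
+```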
+### **Running the Model with Ollama**
+With Ollama, pull and run the model directly from the Hugging Face Hub:
+```bash
+ollama run hf.co/Geraldine/Gemini-Distill-Qwen2.5-0.5B-ead-GGUF:Q8_0
+```
+Once the model has been pulled, you can query it through Ollama's local HTTP API:
+```python
+import requests
+
+# Native Ollama chat endpoint; runtime settings such as num_ctx go under "options".
+url = "http://localhost:11434/api/chat"
+payload = {
+    "model": "hf.co/Geraldine/Gemini-Distill-Qwen2.5-0.5B-ead-GGUF:Q8_0",
+    "messages": [
+        {
+            "role": "system",
+            "content": "You are an archivist expert in EAD/XML format for archival records metadata.",
+        },
+        {
+            "role": "user",
+            "content": "Give me an example of <controlaccess> content.",
+        },
+    ],
+    "options": {"num_ctx": 4096, "temperature": 0.1},
+    "stream": False,
+}
+# json= serializes the payload and sets the Content-Type header.
+response = requests.post(url, json=payload, timeout=120)
+response.raise_for_status()
+# With streaming disabled, the generated text is in the "message" field.
+print(response.json()["message"]["content"])
+```
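
Because the model reasons before emitting XML, a reply usually mixes prose and markup. The sketch below, using only the Python standard library, pulls the first element with a given tag out of a reply and checks that it is well-formed. It is an illustration, not part of the model's tooling: the function names are hypothetical, the simple pattern does not handle an element nested inside another element of the same name, and well-formedness is not the same as validity against the EAD schema.

```python
import re
import xml.etree.ElementTree as ET


def extract_xml_fragment(text: str, tag: str):
    """Return the first <tag>...</tag> fragment found in text, or None."""
    name = re.escape(tag)
    pattern = re.compile(rf"<{name}\b[^>]*>.*?</{name}>", re.DOTALL)
    match = pattern.search(text)
    return match.group(0) if match else None


def is_well_formed(fragment: str) -> bool:
    """Return True if the fragment parses as XML."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


# Hand-written sample reply (not actual model output), for illustration only.
reply = "Reasoning...\n<controlaccess><persname>Hugo, Victor</persname></controlaccess>"
fragment = extract_xml_fragment(reply, "controlaccess")
print(fragment is not None and is_well_formed(fragment))
```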
+---
+## Choosing the Right Quantization Format
+- **Lower-bit models (Q2_K, Q3_K_M, Q4_K_M):** Best for low-memory devices, but may lose some accuracy.
+- **Mid-range (Q5_K_M, Q6_K):** Good trade-off between speed and precision.
+- **Higher precision (Q8_0, fp16, fp32):** Best for accuracy but requires more memory.
+
+For CPU inference, **Q4_K_M** or **Q5_K_M** is a reasonable balance between memory use and output quality.
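
The memory side of this trade-off can be estimated with rough arithmetic: parameters times bits per weight, divided by 8. The sketch below rests on two assumptions: a parameter count of about 0.5 billion, taken from the model name, and the nominal bit width of each format. Real GGUF files are somewhat larger, since quantized formats also store per-block scale data and may keep some tensors at higher precision, so treat the figures as lower bounds.

```python
# Nominal bits per weight for each file listed in this README (assumption:
# actual files carry extra scale data and are therefore somewhat larger).
NOMINAL_BITS = {
    "Q2_K": 2, "Q3_K_M": 3, "Q4_K_M": 4, "Q5_K_M": 5,
    "Q6_K": 6, "Q8_0": 8, "fp16": 16, "fp32": 32,
}

N_PARAMS = 0.5e9  # assumption: roughly 0.5 billion parameters, per the model name


def approx_size_mb(quant: str, n_params: float = N_PARAMS) -> float:
    """Lower-bound size in megabytes: parameters * bits per weight / 8."""
    return n_params * NOMINAL_BITS[quant] / 8 / 1e6


for quant in NOMINAL_BITS:
    print(f"{quant:7s} ~{approx_size_mb(quant):,.0f} MB")
```

By this estimate the 4-bit file needs about a quarter of the memory of the fp16 file.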
+---
+## Limitations & Future Improvements
+- **Inference Speed:** Ensure **Sliding Window Attention (SWA) is disabled**, as it may slow down inference.
+  - To disable it when loading the model with the Transformers library: `model.config.sliding_window = None`
+- **Future Work:**
+  - Further optimizations for CPU inference
+  - Additional fine-tuning on larger datasets
+  - Exploring LoRA/QLoRA for low-rank adaptation
+---
+## Citation & Acknowledgments
+If you use this model in research or production, please cite:
+```
+@misc{your-citation,
+  author = {Géraldine Geoffroy},
+  title = {Gemini-Distill-Qwen2.5-0.5B-ead GGUF Quantized Versions},
+  year = {2025},
+  publisher = {Hugging Face},
+  url = {https://huggingface.co/Geraldine/Gemini-Distill-Qwen2.5-0.5B-ead-GGUF}
+}
+```