---
license: mit
language:
  - fr
  - en
base_model:
  - Geraldine/Gemini-Distill-Qwen2.5-0.5B-ead
---

# Gemini-Distill-Qwen2.5-0.5B-ead GGUF Quantized Versions (Distilled from Gemini-2.0-Flash-Thinking-Exp)

## Model Description

This repository contains **quantized versions** of the fine-tuned **Geraldine/Gemini-Distill-Qwen2.5-0.5B-ead** model, which was trained via knowledge distillation from **Gemini-2.0-Flash-Thinking-Exp**. The fine-tuning teaches the model to reason through and generate **Encoded Archival Description (EAD/XML)** outputs, ensuring structured reasoning before the final archival XML is produced.

This repository provides several **GGUF quantized formats**, enabling efficient inference on different hardware setups, including CPUs and GPUs.

---

## Available GGUF Files

The following quantized versions of the model were generated using **llama.cpp**:

| File Name | Description |
|-----------|-------------|
| `Gemini-Distill-Qwen2.5-0.5B-ead-Q2_K.gguf` | Ultra-low precision (2-bit) for extreme compression |
| `Gemini-Distill-Qwen2.5-0.5B-ead-Q3_K_M.gguf` | 3-bit quantization with mixed precision |
| `Gemini-Distill-Qwen2.5-0.5B-ead-Q4_K_M.gguf` | 4-bit quantization with mixed precision |
| `Gemini-Distill-Qwen2.5-0.5B-ead-Q5_K_M.gguf` | 5-bit quantization with mixed precision |
| `Gemini-Distill-Qwen2.5-0.5B-ead-Q6_K.gguf` | 6-bit quantization |
| `Gemini-Distill-Qwen2.5-0.5B-ead-Q8_0.gguf` | 8-bit quantization balancing speed and accuracy |
| `Gemini-Distill-Qwen2.5-0.5B-ead-fp16.gguf` | 16-bit floating point (fp16) version |
| `Gemini-Distill-Qwen2.5-0.5B-ead-fp32.gguf` | Full precision (fp32) version |

---

## How to Use the Quantized Model

### **Running the Model with llama.cpp**

To run the model with `llama.cpp`, use the following command (in recent llama.cpp releases the `main` binary has been renamed `llama-cli`):

```bash
./main -m Gemini-Distill-Qwen2.5-0.5B-ead-Q4_K_M.gguf -p "Convert the following archival information into EAD/XML: ..."
```

For optimal performance, select the quantized version that matches your hardware capabilities.

### **Running the Model with GPT4All**

If using GPT4All, load the GGUF model with:

```python
from gpt4all import GPT4All

# The GGUF file must already be present locally: model_path points at the
# directory containing it, and allow_download=False prevents GPT4All from
# trying to fetch the file from its own model catalog.
model = GPT4All(
    model_name="Gemini-Distill-Qwen2.5-0.5B-ead-Q4_K_M.gguf",
    model_path=".",
    allow_download=False,
)
response = model.generate("Convert the following archival information into EAD/XML:")
print(response)
```

### **Running the Model with Ollama**

If using Ollama, load the GGUF model with:

```bash
ollama run hf.co/Geraldine/Gemini-Distill-Qwen2.5-0.5B-ead-GGUF:Q8_0
```

Then query it over Ollama's native chat API, which accepts an `options` object for parameters such as `num_ctx` and `temperature`:

```python
import requests
import json

url = "http://localhost:11434/api/chat"

payload = json.dumps({
    "model": "hf.co/Geraldine/Gemini-Distill-Qwen2.5-0.5B-ead-GGUF:Q8_0",
    "messages": [
        {
            "role": "system",
            "content": "You are an archivist expert in EAD/XML format for archival records metadata."
        },
        {
            "role": "user",
            "content": "Give me an example of content."
        }
    ],
    "options": {
        "num_ctx": 4096,
        "temperature": 0.1
    },
    "stream": False
})
headers = {
    "Content-Type": "application/json"
}

response = requests.post(url, headers=headers, data=payload)
print(response.text)
```

---

## Choosing the Right Quantization Format

- **Lower-bit models (Q2_K, Q3_K_M, Q4_K_M):** best for low-memory devices, but may lose some accuracy.
- **Mid-range (Q5_K_M, Q6_K):** a good trade-off between speed and precision.
- **Higher precision (Q8_0, fp16, fp32):** best for accuracy, but requires more memory.

For CPU inference, **Q4_K_M or Q5_K_M** is recommended as a balance between efficiency and output quality.
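If you prefer Python bindings to the `llama.cpp` CLI, the snippet below is a minimal sketch assuming the third-party `llama-cpp-python` package is installed (`pip install llama-cpp-python`); it is not otherwise referenced in this repository:

```python
from llama_cpp import Llama

# Load a quantized GGUF file; Q4_K_M is a reasonable default for CPU inference.
llm = Llama(
    model_path="Gemini-Distill-Qwen2.5-0.5B-ead-Q4_K_M.gguf",
    n_ctx=4096,  # context window size
)

response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are an archivist expert in EAD/XML format for archival records metadata."},
        {"role": "user", "content": "Convert the following archival information into EAD/XML: ..."},
    ],
    temperature=0.1,
)
print(response["choices"][0]["message"]["content"])
```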
---

## Limitations & Future Improvements

- **Inference speed:** Ensure **Sliding Window Attention (SWA) is disabled**, as it may slow down inference.
  - To disable it: `model.config.sliding_window = None` (see the appendix at the end of this card)
- **Future work:**
  - Further optimizations for CPU inference
  - Additional fine-tuning on larger datasets
  - Exploring LoRA/QLoRA for low-rank adaptation

---

## Citation & Acknowledgments

If you use this model in research or production, please cite:

```
@misc{geoffroy2025geminidistillqwen,
  author    = {GĂ©raldine Geoffroy},
  title     = {Gemini-Distill-Qwen2.5-0.5B-ead GGUF Quantized Versions},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/Geraldine/Gemini-Distill-Qwen2.5-0.5B-ead-GGUF}
}
```
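---

## Appendix: Disabling Sliding Window Attention

The SWA setting mentioned in the Limitations section applies when running the original safetensors checkpoint with Hugging Face `transformers` rather than the GGUF files. A minimal sketch, assuming the base model ID from the card header:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Geraldine/Gemini-Distill-Qwen2.5-0.5B-ead"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Disable Sliding Window Attention, which may slow down inference
model.config.sliding_window = None
```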