---
license: mit
language:
  - fr
  - en
base_model:
  - Geraldine/Gemini-Distill-Qwen2.5-0.5B-ead
---

# Gemini-Distill-Qwen2.5-0.5B-ead GGUF Quantized Versions (Distilled from Gemini-2.0-Flash-Thinking-Exp)

## Model Description

This repository contains **quantized versions** of the fine-tuned **Geraldine/Gemini-Distill-Qwen2.5-0.5B-ead** model, which was trained via knowledge distillation from **Gemini-2.0-Flash-Thinking-Exp**. The fine-tuning teaches the model to reason through and generate **Encoded Archival Description (EAD/XML)** outputs, ensuring structured reasoning before the final archival XML is produced.

This repository provides several **GGUF quantized formats**, enabling efficient inference on different hardware setups, including CPUs and GPUs.

---

## Available GGUF Files

The following quantized versions of the model were generated using **llama.cpp**:

| File Name | Description |
|-----------|-------------|
| `Gemini-Distill-Qwen2.5-0.5B-ead-Q2_K.gguf` | Ultra-low precision (2-bit) for extreme compression |
| `Gemini-Distill-Qwen2.5-0.5B-ead-Q3_K_M.gguf` | 3-bit quantization with mixed precision |
| `Gemini-Distill-Qwen2.5-0.5B-ead-Q4_K_M.gguf` | 4-bit quantization with mixed precision |
| `Gemini-Distill-Qwen2.5-0.5B-ead-Q5_K_M.gguf` | 5-bit quantization with mixed precision |
| `Gemini-Distill-Qwen2.5-0.5B-ead-Q6_K.gguf` | 6-bit quantization |
| `Gemini-Distill-Qwen2.5-0.5B-ead-Q8_0.gguf` | 8-bit quantization balancing speed and accuracy |
| `Gemini-Distill-Qwen2.5-0.5B-ead-fp16.gguf` | 16-bit floating point (fp16) version |
| `Gemini-Distill-Qwen2.5-0.5B-ead-fp32.gguf` | Full precision (fp32) version |

---

## How to Use the Quantized Model

### **Running the Model with llama.cpp**

To run the model with `llama.cpp`, use the following command (in recent llama.cpp releases the `main` binary has been renamed `llama-cli`):

```bash
./main -m Gemini-Distill-Qwen2.5-0.5B-ead-Q4_K_M.gguf -p "Convert the following archival information into EAD/XML: ..."
```

For optimal performance, select the quantized version that matches your hardware capabilities.

### **Running the Model with GPT4All**

If using GPT4All, load the GGUF model with:

```python
from gpt4all import GPT4All

# The GGUF file must already be present locally: model_path points at the
# directory containing it, and allow_download=False prevents GPT4All from
# trying to fetch the file from its own model catalog.
model = GPT4All(
    model_name="Gemini-Distill-Qwen2.5-0.5B-ead-Q4_K_M.gguf",
    model_path=".",
    allow_download=False,
)
response = model.generate("Convert the following archival information into EAD/XML:")
print(response)
```

### **Running the Model with Ollama**

If using Ollama, load the GGUF model with:

```bash
ollama run hf.co/Geraldine/Gemini-Distill-Qwen2.5-0.5B-ead-GGUF:Q8_0
```

Then query it over Ollama's native chat API, which accepts an `options` object for parameters such as `num_ctx` and `temperature`:

```python
import requests
import json

url = "http://localhost:11434/api/chat"

payload = json.dumps({
    "model": "hf.co/Geraldine/Gemini-Distill-Qwen2.5-0.5B-ead-GGUF:Q8_0",
    "messages": [
        {
            "role": "system",
            "content": "You are an archivist expert in EAD/XML format for archival records metadata."
        },
        {
            "role": "user",
            "content": "Give me an example of content."
        }
    ],
    "options": {
        "num_ctx": 4096,
        "temperature": 0.1
    },
    "stream": False
})
headers = {
    "Content-Type": "application/json"
}

response = requests.post(url, headers=headers, data=payload)
print(response.text)
```

---

## Choosing the Right Quantization Format

- **Lower-bit models (Q2_K, Q3_K_M, Q4_K_M):** best for low-memory devices, but may lose some accuracy.
- **Mid-range (Q5_K_M, Q6_K):** a good trade-off between speed and precision.
- **Higher precision (Q8_0, fp16, fp32):** best for accuracy, but requires more memory.

For CPU inference, **Q4_K_M or Q5_K_M** is recommended as a balance between efficiency and output quality.
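If you prefer Python bindings to the `llama.cpp` CLI, the snippet below is a minimal sketch assuming the third-party `llama-cpp-python` package is installed (`pip install llama-cpp-python`); it is not otherwise referenced in this repository:

```python
from llama_cpp import Llama

# Load a quantized GGUF file; Q4_K_M is a reasonable default for CPU inference.
llm = Llama(
    model_path="Gemini-Distill-Qwen2.5-0.5B-ead-Q4_K_M.gguf",
    n_ctx=4096,  # context window size
)

response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are an archivist expert in EAD/XML format for archival records metadata."},
        {"role": "user", "content": "Convert the following archival information into EAD/XML: ..."},
    ],
    temperature=0.1,
)
print(response["choices"][0]["message"]["content"])
```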
---

## Limitations & Future Improvements

- **Inference speed:** Ensure **Sliding Window Attention (SWA) is disabled**, as it may slow down inference.
  - To disable it: `model.config.sliding_window = None` (see the appendix at the end of this card)
- **Future work:**
  - Further optimizations for CPU inference
  - Additional fine-tuning on larger datasets
  - Exploring LoRA/QLoRA for low-rank adaptation

---

## Citation & Acknowledgments

If you use this model in research or production, please cite:

```
@misc{geoffroy2025geminidistillqwen,
  author    = {GĂ©raldine Geoffroy},
  title     = {Gemini-Distill-Qwen2.5-0.5B-ead GGUF Quantized Versions},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/Geraldine/Gemini-Distill-Qwen2.5-0.5B-ead-GGUF}
}
```
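---

## Appendix: Disabling Sliding Window Attention

The SWA setting mentioned in the Limitations section applies when running the original safetensors checkpoint with Hugging Face `transformers` rather than the GGUF files. A minimal sketch, assuming the base model ID from the card header:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Geraldine/Gemini-Distill-Qwen2.5-0.5B-ead"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Disable Sliding Window Attention, which may slow down inference
model.config.sliding_window = None
```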