Geraldine committed (verified)
Commit 4dc093c · 1 parent: 1998e14

Create README.md

Files changed (1): README.md (+125, −0)
 
---
license: mit
language:
- fr
- en
base_model:
- Geraldine/Gemini-Distill-Qwen2.5-0.5B-ead
---

# Gemini-Distill-Qwen2.5-0.5B-ead GGUF Quantized Versions (Distilled from Gemini-2.0-Flash-Thinking-Exp)

## Model Description
This repository contains **quantized versions** of the fine-tuned **Geraldine/Gemini-Distill-Qwen2.5-0.5B-ead** model, which was trained via knowledge distillation from **Gemini-2.0-Flash-Thinking-Exp**. The fine-tuning teaches the model to reason through a request and then generate **Encoded Archival Description (EAD/XML)** output, so that structured reasoning precedes the final archival XML.

The repository provides several **GGUF quantized formats**, allowing efficient inference on a range of hardware, including CPUs and GPUs.

---
## Available GGUF Files

The following quantized versions of the model were generated with **llama.cpp**:

| File Name | Description |
|-----------|-------------|
| `Gemini-Distill-Qwen2.5-0.5B-ead-Q2_K.gguf` | Ultra-low precision (2-bit) for extreme compression |
| `Gemini-Distill-Qwen2.5-0.5B-ead-Q3_K_M.gguf` | 3-bit quantization with mixed precision |
| `Gemini-Distill-Qwen2.5-0.5B-ead-Q4_K_M.gguf` | 4-bit quantization with mixed precision |
| `Gemini-Distill-Qwen2.5-0.5B-ead-Q5_K_M.gguf` | 5-bit quantization with mixed precision |
| `Gemini-Distill-Qwen2.5-0.5B-ead-Q6_K.gguf` | 6-bit quantization |
| `Gemini-Distill-Qwen2.5-0.5B-ead-Q8_0.gguf` | 8-bit quantization, balancing speed and accuracy |
| `Gemini-Distill-Qwen2.5-0.5B-ead-fp16.gguf` | 16-bit floating point (fp16) version |
| `Gemini-Distill-Qwen2.5-0.5B-ead-fp32.gguf` | Full precision (fp32) version |
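
To fetch one of these files locally, here is a minimal sketch using the `huggingface_hub` Python library (an extra dependency, `pip install huggingface_hub`, not otherwise required by this README):

```python
from huggingface_hub import hf_hub_download

# Download a single GGUF file from this repository into the local HF cache
gguf_path = hf_hub_download(
    repo_id="Geraldine/Gemini-Distill-Qwen2.5-0.5B-ead-GGUF",
    filename="Gemini-Distill-Qwen2.5-0.5B-ead-Q4_K_M.gguf",
)
print(gguf_path)  # pass this path to llama.cpp, GPT4All, etc.
```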

---
## How to Use the Quantized Model

### **Running the Model with llama.cpp**
To run the model using `llama.cpp`, use the following command:

```bash
./main -m Gemini-Distill-Qwen2.5-0.5B-ead-Q4_K_M.gguf -p "Convert the following archival information into EAD/XML: ..."
```

Note that recent llama.cpp builds ship this binary as `llama-cli` rather than `main`. For optimal performance, ensure you select the right quantized version based on your hardware capabilities.
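
The same GGUF file can also be driven from Python through the `llama-cpp-python` bindings; this is a minimal sketch and assumes the bindings are installed separately (`pip install llama-cpp-python`):

```python
from llama_cpp import Llama

# Load the quantized GGUF file (path to the downloaded file)
llm = Llama(model_path="Gemini-Distill-Qwen2.5-0.5B-ead-Q4_K_M.gguf", n_ctx=4096)

output = llm(
    "Convert the following archival information into EAD/XML: ...",
    max_tokens=512,
    temperature=0.1,
)
print(output["choices"][0]["text"])
```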

### **Running the Model with GPT4All**
If using GPT4All, load the GGUF model with:

```python
from gpt4all import GPT4All

model_path = "Gemini-Distill-Qwen2.5-0.5B-ead-Q4_K_M.gguf"
model = GPT4All(model_path)
response = model.generate("Convert the following archival information into EAD/XML:")
print(response)
```
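
To reproduce the archivist system prompt used in the Ollama example below, the GPT4All bindings also offer a session API; a minimal sketch, assuming the `.gguf` file sits in the current directory (`model_path` and `allow_download` are standard `GPT4All` constructor arguments):

```python
from gpt4all import GPT4All

# Load the local GGUF file without trying to download it from GPT4All's catalog
model = GPT4All(
    "Gemini-Distill-Qwen2.5-0.5B-ead-Q4_K_M.gguf",
    model_path=".",
    allow_download=False,
)

# A chat session keeps the system prompt and message history for its duration
with model.chat_session(system_prompt="You are an archivist expert in EAD/XML format for archival records metadata."):
    answer = model.generate("Give me an example of <controlaccess> content.", max_tokens=512)
    print(answer)
```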

### **Running the Model with Ollama**
If using Ollama, pull and run the GGUF model directly from the Hub with:

```bash
ollama run hf.co/Geraldine/Gemini-Distill-Qwen2.5-0.5B-ead-GGUF:Q8_0
```

The local Ollama server can then be queried through its OpenAI-compatible endpoint:

```python
import requests

url = "http://localhost:11434/v1/chat/completions"

payload = {
    "model": "hf.co/Geraldine/Gemini-Distill-Qwen2.5-0.5B-ead-GGUF:Q8_0",
    "messages": [
        {
            "role": "system",
            "content": "You are an archivist expert in EAD/XML format for archival records metadata."
        },
        {
            "role": "user",
            "content": "Give me an example of <controlaccess> content."
        }
    ],
    "temperature": 0.1,
    "stream": False
}
# Note: Ollama's OpenAI-compatible endpoint ignores runtime options such as
# num_ctx; set those via a Modelfile or the native /api/chat endpoint.

response = requests.post(url, json=payload)
print(response.json()["choices"][0]["message"]["content"])
```
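
If you need request-level control over `num_ctx` and other sampler options, Ollama's native `/api/chat` endpoint accepts them in an `options` object; a minimal sketch:

```python
import requests

url = "http://localhost:11434/api/chat"

payload = {
    "model": "hf.co/Geraldine/Gemini-Distill-Qwen2.5-0.5B-ead-GGUF:Q8_0",
    "messages": [
        {"role": "system", "content": "You are an archivist expert in EAD/XML format for archival records metadata."},
        {"role": "user", "content": "Give me an example of <controlaccess> content."},
    ],
    "options": {"num_ctx": 4096, "temperature": 0.1},
    "stream": False,
}

response = requests.post(url, json=payload)
# The native endpoint returns a single JSON object when stream is False
print(response.json()["message"]["content"])
```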

---
## Choosing the Right Quantization Format
- **Lower-bit models (Q2_K, Q3_K_M, Q4_K_M):** Best for low-memory devices, but may lose some accuracy.
- **Mid-range (Q5_K_M, Q6_K):** Good trade-off between speed and precision.
- **Higher precision (Q8_0, fp16, fp32):** Best for accuracy but requires more memory.

For CPU inference, **Q4_K_M or Q5_K_M** is recommended for a balance between efficiency and performance.
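
As a rough guide to memory requirements, file size scales with bits per weight; the sketch below is a back-of-envelope estimate for a 0.5B-parameter model, and the bits-per-weight figures are rounded assumptions rather than exact values:

```python
# Rough size estimate: parameters * bits-per-weight / 8 bits-per-byte
# (ignores GGUF metadata and the KV cache needed at runtime)
params = 0.5e9  # ~0.5B parameters

approx_bits_per_weight = {"Q2_K": 2.6, "Q4_K_M": 4.8, "Q8_0": 8.5, "fp16": 16, "fp32": 32}

for name, bpw in approx_bits_per_weight.items():
    size_gb = params * bpw / 8 / 1e9
    print(f"{name}: ~{size_gb:.2f} GB")
```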

---
## Limitations & Future Improvements
- **Inference Speed:** Ensure **Sliding Window Attention (SWA) is disabled**, as it may slow down inference.
  - To disable: `model.config.sliding_window = None` (see the sketch after this list)
- **Future Work:**
  - Further optimizations for CPU inference
  - Additional fine-tuning on larger datasets
  - Exploring LoRA/QLoRA for low-rank adaptation
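
The `sliding_window` setting applies when loading the original Transformers checkpoint rather than the GGUF files; a minimal sketch, assuming the base fine-tuned model **Geraldine/Gemini-Distill-Qwen2.5-0.5B-ead** is loaded with Hugging Face Transformers:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Geraldine/Gemini-Distill-Qwen2.5-0.5B-ead"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Disable sliding-window attention before running generation
model.config.sliding_window = None
```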

---
## Citation & Acknowledgments
If you use this model in research or production, please cite:
```bibtex
@misc{geoffroy2025gemini-distill-qwen-ead-gguf,
  author = {Géraldine Geoffroy},
  title = {Gemini-Distill-Qwen2.5-0.5B-ead GGUF Quantized Versions},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/Geraldine/Gemini-Distill-Qwen2.5-0.5B-ead-GGUF}
}
```