Geraldine committed · verified
Commit a616970 · 1 Parent(s): 7bfc0b3

Update README.md

Files changed (1): README.md (+82 -1)

README.md CHANGED
@@ -6,4 +6,85 @@ language:
base_model:
- Geraldine/Gemini-Distill-Qwen2.5-0.5B-ead
library_name: transformers.js
---

# Gemini-Distill-Qwen2.5-0.5B-ead-ONNX

## Model Description
This repository contains **ONNX-optimized versions** of the **Geraldine/Gemini-Distill-Qwen2.5-0.5B-ead** model, distilled from **Gemini-2.0-Flash-Thinking-Exp**. The fine-tuned model is designed specifically for structured **Encoded Archival Description (EAD/XML)** reasoning and generation.

ONNX conversion enables **faster inference** on a variety of hardware, including **CPUs, GPUs, and specialized inference accelerators**.

---

## Available ONNX Model Versions
The following ONNX variants are provided for different inference needs (a download sketch follows the table):

| File Name | Description |
|------------------------|---------------------------------------------------------|
| `model.onnx` | Full-precision (fp32) version |
| `model_fp16.onnx` | Half-precision (fp16) for optimized GPU inference |
| `model_bnb4.onnx` | Bitsandbytes 4-bit quantization |
| `model_int8.onnx` | 8-bit integer quantization for efficient CPU inference |
| `model_q4.onnx` | 4-bit quantization for low-memory scenarios |
| `model_q4f16.onnx` | 4-bit quantization with fp16 fallback |
| `model_uint8.onnx` | Unsigned 8-bit quantization |
| `model_quantized.onnx` | General-purpose quantized model (mixed precision) |
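
Any of these files can be fetched individually with `huggingface_hub` before wiring it into ONNX Runtime. The sketch below is illustrative: it assumes the file sits at the repository root under the name shown in the table (prepend `onnx/` to `filename` if the weights live in a subfolder).

```python
# Illustrative download sketch: grab a single ONNX variant from the Hub.
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="Geraldine/Gemini-Distill-Qwen2.5-0.5B-ead-ONNX",
    filename="model_int8.onnx",  # adjust to the variant (and subfolder) you need
)
print(model_path)  # local cached path, ready to pass to onnxruntime.InferenceSession
```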

---

## How to Use the ONNX Model

### **1. Install Dependencies**
Ensure you have the required dependencies for ONNX inference:

```bash
pip install onnxruntime
```

For GPU acceleration, install the GPU build instead:

```bash
pip install onnxruntime-gpu
```
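
Note that `onnxruntime` and `onnxruntime-gpu` ship different sets of execution providers. A quick way to confirm what your installed build supports before choosing a provider:

```python
# Sanity check: list the execution providers available in this onnxruntime build.
import onnxruntime as ort

print(ort.__version__)
print(ort.get_available_providers())
# A GPU build typically lists 'CUDAExecutionProvider' ahead of 'CPUExecutionProvider'.
```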

### **2. Load and Run Inference**
You can use `onnxruntime` directly to load the model and run a single forward pass:

```python
import onnxruntime as ort
import numpy as np

# Load the ONNX model (pick the file that matches your hardware)
session = ort.InferenceSession("model_fp16.onnx", providers=["CUDAExecutionProvider"])

# Prepare input data (example); replace [...] with tokenized input ids.
# The exact input names (e.g. attention_mask, position_ids, past_key_values.*)
# depend on the export, so inspect session.get_inputs() to see what is required.
input_data = {"input_ids": np.array([[...]])}

# Run inference
outputs = session.run(None, input_data)

# Print output
print(outputs)
```
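
Driving `session.run` by hand only gives you a single forward pass; autoregressive generation also needs tokenization, a decoding loop, and (depending on the export) extra inputs such as `attention_mask` and `past_key_values.*`. A simpler route is to let Hugging Face Optimum wrap the ONNX file. The following is a sketch under assumptions, not a verified recipe: it presumes `optimum[onnxruntime]` is installed, that the repository ships the tokenizer and `config.json` next to the ONNX weights, and that the file name matches the table above (prepend `onnx/` if needed).

```python
# Hedged sketch: end-to-end generation through Optimum's ONNX Runtime wrapper.
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

repo_id = "Geraldine/Gemini-Distill-Qwen2.5-0.5B-ead-ONNX"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = ORTModelForCausalLM.from_pretrained(repo_id, file_name="model_int8.onnx")

# Illustrative prompt; adapt it (or use tokenizer.apply_chat_template) to match
# the prompt format the base model was fine-tuned with.
prompt = "Generate a minimal EAD/XML <archdesc> element for a collection of 19th-century letters."
inputs = tokenizer(prompt, return_tensors="pt")

output_ids = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```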

---

## Why ONNX?
- **Faster Inference:** Optimized execution across different hardware.
- **Cross-Platform Compatibility:** Runs on CPUs, GPUs, and specialized accelerators.
- **Reduced Memory Usage:** Quantized versions cut the disk and memory footprint considerably (see the size check below).
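
As a rough illustration of the footprint difference, you can compare the on-disk size of two variants after downloading them (same caveat as above about file names and the `onnx/` subfolder; note that this downloads on the order of a gigabyte or two):

```python
# Illustrative size check: compare two ONNX variants on disk.
import os
from huggingface_hub import hf_hub_download

repo_id = "Geraldine/Gemini-Distill-Qwen2.5-0.5B-ead-ONNX"
for name in ("model_fp16.onnx", "model_int8.onnx"):
    path = hf_hub_download(repo_id=repo_id, filename=name)
    print(f"{name}: {os.path.getsize(path) / 1e6:.0f} MB")
```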

---

## Citation & Acknowledgments
If you use this model in research or production, please cite:
```bibtex
@misc{geoffroy2025gemini_distill_qwen25_ead_onnx,
  author    = {Géraldine Geoffroy},
  title     = {Gemini-Distill-Qwen2.5-0.5B-ead-ONNX},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/Geraldine/Gemini-Distill-Qwen2.5-0.5B-ead-ONNX}
}
```