Update README.md

README.md (CHANGED)

@@ -6,4 +6,85 @@ language:
base_model:
- Geraldine/Gemini-Distill-Qwen2.5-0.5B-ead
library_name: transformers.js
---

# Gemini-Distill-Qwen2.5-0.5B-ead-ONNX

## Model Description

This repository contains **ONNX-optimized versions** of the **Geraldine/Gemini-Distill-Qwen2.5-0.5B-ead** model, distilled from **Gemini-2.0-Flash-Thinking-Exp** and fine-tuned specifically for structured **Encoded Archival Description (EAD/XML)** reasoning and generation.

ONNX conversion enables **faster inference** on a variety of hardware, including **CPUs, GPUs, and specialized inference accelerators**.

---

## Available ONNX Model Versions

The following ONNX quantized versions are provided for different inference needs:

| File Name | Description |
|------------------------|-------------|
| `model.onnx` | Full precision (fp32) version |
| `model_fp16.onnx` | Half precision (fp16) for optimized GPU inference |
| `model_bnb4.onnx` | Bitsandbytes 4-bit quantization |
| `model_int8.onnx` | 8-bit integer quantization for efficient CPU inference |
| `model_q4.onnx` | 4-bit quantization (for low-memory scenarios) |
| `model_q4f16.onnx` | 4-bit quantization with fp16 fallback |
| `model_uint8.onnx` | Unsigned 8-bit quantization |
| `model_quantized.onnx` | General quantized model for mixed precision |
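
If you want to try a single variant outside the browser, you can fetch just that file from the Hub and open it with `onnxruntime`. The snippet below is a minimal sketch, not a prescribed workflow: the `onnx/` subfolder path and the choice of `model_int8.onnx` are assumptions, so adjust them to the repository's actual layout and to your hardware.

```python
# Minimal sketch: download one quantized variant and open it with ONNX Runtime.
# Assumption: the files live under an "onnx/" subfolder (common for transformers.js
# repositories); change `filename` if they sit at the repository root instead.
from huggingface_hub import hf_hub_download
import onnxruntime as ort

model_path = hf_hub_download(
    repo_id="Geraldine/Gemini-Distill-Qwen2.5-0.5B-ead-ONNX",
    filename="onnx/model_int8.onnx",  # pick the variant that matches your hardware
)

session = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
print(session.get_providers())  # confirms which execution provider was selected
```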

---

## How to Use the ONNX Model

### **1. Install Dependencies**

Ensure you have the required dependencies for ONNX inference:

```bash
pip install onnxruntime
```

For GPU acceleration, install the GPU build instead:

```bash
pip install onnxruntime-gpu
```
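
To confirm that the GPU build is actually being picked up, list the execution providers ONNX Runtime can see; `CUDAExecutionProvider` should appear when `onnxruntime-gpu` and a compatible CUDA installation are present.

```python
# Quick environment check: which execution providers are available?
import onnxruntime as ort

print(ort.get_available_providers())
# Expected with a working GPU setup: ['CUDAExecutionProvider', 'CPUExecutionProvider', ...]
# If only CPUExecutionProvider is listed, inference will silently fall back to the CPU.
```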
### **2. Load and Run Inference**

You can use `onnxruntime` directly to load the model and run a forward pass:

```python
import onnxruntime as ort
import numpy as np

# Load the ONNX model, falling back to CPU if CUDA is not available
session = ort.InferenceSession(
    "model_fp16.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

# Inspect which inputs the exported graph expects; causal-LM exports often
# require attention_mask (and sometimes position_ids or past_key_values)
# in addition to input_ids.
print([inp.name for inp in session.get_inputs()])

# Prepare input data (example) -- replace with real token IDs from the tokenizer
input_ids = np.array([[1, 2, 3]], dtype=np.int64)
input_data = {
    "input_ids": input_ids,
    "attention_mask": np.ones_like(input_ids),
}

# Run inference; the first output is typically the logits
outputs = session.run(None, input_data)
print(outputs[0].shape)
```
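
The raw `onnxruntime` API stops at a single forward pass; for end-to-end text generation (tokenization, KV-cache handling, decoding), a higher-level loader is usually easier. The sketch below uses Hugging Face Optimum's ONNX Runtime backend and is illustrative only: the chosen file name, the `onnx/` subfolder, the availability of tokenizer files in this repository, and the prompt wording are all assumptions (the tokenizer can also be loaded from the base model, Geraldine/Gemini-Distill-Qwen2.5-0.5B-ead).

```python
# Hedged sketch: text generation through Optimum's ONNX Runtime backend.
# Requires: pip install "optimum[onnxruntime]" transformers
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

repo_id = "Geraldine/Gemini-Distill-Qwen2.5-0.5B-ead-ONNX"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = ORTModelForCausalLM.from_pretrained(
    repo_id,
    file_name="model_int8.onnx",  # assumed name; add subfolder="onnx" if the files live in an onnx/ folder
)

# Hypothetical prompt -- phrase it however the fine-tuned model expects
prompt = "Generate an EAD/XML <did> element describing a 19th-century letter collection."
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```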

---

## Why ONNX?

- **Faster Inference:** Optimized execution across different hardware.
- **Cross-Platform Compatibility:** Run on CPUs, GPUs, and specialized accelerators.
- **Reduced Memory Usage:** Quantized versions provide significant efficiency gains.

---

## Citation & Acknowledgments

If you use this model in research or production, please cite:

```bibtex
@misc{geoffroy2025gemini_distill_qwen_ead_onnx,
  author    = {Géraldine Geoffroy},
  title     = {Gemini-Distill-Qwen2.5-0.5B-ead-ONNX},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/Geraldine/Gemini-Distill-Qwen2.5-0.5B-ead-ONNX}
}
```