# Gemini-Distill-Qwen2.5-0.5B-ead-ONNX
## Model Description
This repository contains **ONNX-optimized versions** of the **Geraldine/Gemini-Distill-Qwen2.5-0.5B-ead** model, distilled from **Gemini-2.0-Flash-Thinking-Exp**. This fine-tuned model is specifically designed for structured **Encoded Archival Description (EAD/XML)** reasoning and generation.
ONNX conversion enables **faster inference** on a variety of hardware, including **CPUs, GPUs, and specialized inference accelerators**.
---
## Available ONNX Model Versions
The following ONNX variants are provided for different precision and inference needs:
| File Name | Description |
|--------------------------|-------------|
| `model.onnx` | Full precision (fp32) version |
| `model_fp16.onnx` | Half precision (fp16) for optimized GPU inference |
| `model_bnb4.onnx` | Bitsandbytes 4-bit quantization |
| `model_int8.onnx` | 8-bit integer quantization for efficient CPU inference |
| `model_q4.onnx` | 4-bit quantization (for low-memory scenarios) |
| `model_q4f16.onnx` | 4-bit quantization with fp16 fallback |
| `model_uint8.onnx` | Unsigned 8-bit quantization |
| `model_quantized.onnx` | Default general-purpose quantized version |
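
If you only need one of these variants, you can fetch it individually with `huggingface_hub` instead of cloning the whole repository. A minimal sketch, using the int8 file as an arbitrary example (adjust `filename` to match the repository layout if the files live in a subfolder):

```python
from huggingface_hub import hf_hub_download

# Download a single ONNX variant from the Hub and return its local path
onnx_path = hf_hub_download(
    repo_id="Geraldine/Gemini-Distill-Qwen2.5-0.5B-ead-ONNX",
    filename="model_int8.onnx",
)
print(onnx_path)
```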
---
## How to Use the ONNX Model
### **1. Install Dependencies**
Ensure you have the required dependencies for ONNX inference:
```bash
pip install onnxruntime
```
For GPU acceleration, install the GPU build instead:
```bash
pip install onnxruntime-gpu
```
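Before loading a model, you can check which execution providers your `onnxruntime` build actually exposes:

```python
import onnxruntime as ort

# Lists the execution providers available in this environment,
# e.g. ["CUDAExecutionProvider", "CPUExecutionProvider"]
print(ort.get_available_providers())
```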
### **2. Load and Run Inference**
You can use `onnxruntime` to load and run inference with the model:
```python
import onnxruntime as ort
import numpy as np

# Load the ONNX model; CPU is used as a fallback if CUDA is unavailable
session = ort.InferenceSession(
    "model_fp16.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

# Inspect the input names the exported graph expects
# (decoder exports usually require input_ids plus attention_mask, etc.)
print([inp.name for inp in session.get_inputs()])

# Prepare input data (example)
input_data = {"input_ids": np.array([[...]])}  # Replace with tokenized input ids (int64)

# Run inference
outputs = session.run(None, input_data)

# Print output
print(outputs)
```
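For an end-to-end generation workflow with the original tokenizer, the model can also be loaded through Hugging Face `optimum` (`pip install optimum[onnxruntime] transformers`). The snippet below is a minimal sketch, assuming the repository layout is compatible with `optimum.onnxruntime` and that the `file_name` argument selects one of the files listed above; the prompt is only an illustration:

```python
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM

model_id = "Geraldine/Gemini-Distill-Qwen2.5-0.5B-ead-ONNX"

# Load the tokenizer and a specific ONNX variant (here the int8 build)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = ORTModelForCausalLM.from_pretrained(model_id, file_name="model_int8.onnx")

# Tokenize an example prompt and generate EAD/XML output
prompt = "Generate an EAD/XML description for a small photographic collection."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```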
---
## Why ONNX?
- **Faster Inference:** Optimized execution across different hardware.
- **Cross-Platform Compatibility:** Run on CPUs, GPUs, and specialized accelerators.
- **Reduced Memory Usage:** Quantized versions provide significant efficiency gains.
---
## Citation & Acknowledgments
If you use this model in research or production, please cite:
```
@misc{your-citation,
  author    = {Géraldine Geoffroy},
  title     = {Gemini-Distill-Qwen2.5-0.5B-ead-ONNX},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/Geraldine/Gemini-Distill-Qwen2.5-0.5B-ead-ONNX}
}
```