# Gemini-Distill-Qwen2.5-0.5B-ead-ONNX
## Model Description
This repository contains **ONNX-optimized versions** of the **Geraldine/Gemini-Distill-Qwen2.5-0.5B-ead** model, distilled from **Gemini-2.0-Flash-Thinking-Exp**. This fine-tuned model is specifically designed for structured **Encoded Archival Description (EAD/XML)** reasoning and generation.
ONNX conversion enables **faster inference** on a variety of hardware, including **CPUs, GPUs, and specialized inference accelerators**.
---
## Available ONNX Model Versions
The following ONNX variants are provided for different precision and inference needs:
| File Name | Description |
|--------------------------|-------------|
| `model.onnx` | Full precision (fp32) version |
| `model_fp16.onnx` | Half precision (fp16) for optimized GPU inference |
| `model_bnb4.onnx` | Bitsandbytes 4-bit quantization |
| `model_int8.onnx` | 8-bit integer quantization for efficient CPU inference |
| `model_q4.onnx` | 4-bit quantization (for low-memory scenarios) |
| `model_q4f16.onnx` | 4-bit quantization with fp16 fallback |
| `model_uint8.onnx` | Unsigned 8-bit quantization |
| `model_quantized.onnx` | Default general-purpose quantized version |
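
If you only need one of these variants, you can fetch it individually with `huggingface_hub` instead of cloning the whole repository. A minimal sketch, using the int8 file as an arbitrary example (adjust `filename` to match the repository layout if the files live in a subfolder):

```python
from huggingface_hub import hf_hub_download

# Download a single ONNX variant from the Hub and return its local path
onnx_path = hf_hub_download(
    repo_id="Geraldine/Gemini-Distill-Qwen2.5-0.5B-ead-ONNX",
    filename="model_int8.onnx",
)
print(onnx_path)
```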
---
## How to Use the ONNX Model
### **1. Install Dependencies**
Ensure you have the required dependencies for ONNX inference:
```bash
pip install onnxruntime
```
For GPU acceleration, install the GPU build instead:
```bash
pip install onnxruntime-gpu
```
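Before loading a model, you can check which execution providers your `onnxruntime` build actually exposes:

```python
import onnxruntime as ort

# Lists the execution providers available in this environment,
# e.g. ["CUDAExecutionProvider", "CPUExecutionProvider"]
print(ort.get_available_providers())
```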
### **2. Load and Run Inference**
You can use `onnxruntime` to load and run inference with the model:
```python
import onnxruntime as ort
import numpy as np

# Load the ONNX model; CPU is used as a fallback if CUDA is unavailable
session = ort.InferenceSession(
    "model_fp16.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

# Inspect the input names the exported graph expects
# (decoder exports usually require input_ids plus attention_mask, etc.)
print([inp.name for inp in session.get_inputs()])

# Prepare input data (example)
input_data = {"input_ids": np.array([[...]])}  # Replace with tokenized input ids (int64)

# Run inference
outputs = session.run(None, input_data)

# Print output
print(outputs)
```
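For an end-to-end generation workflow with the original tokenizer, the model can also be loaded through Hugging Face `optimum` (`pip install optimum[onnxruntime] transformers`). The snippet below is a minimal sketch, assuming the repository layout is compatible with `optimum.onnxruntime` and that the `file_name` argument selects one of the files listed above; the prompt is only an illustration:

```python
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM

model_id = "Geraldine/Gemini-Distill-Qwen2.5-0.5B-ead-ONNX"

# Load the tokenizer and a specific ONNX variant (here the int8 build)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = ORTModelForCausalLM.from_pretrained(model_id, file_name="model_int8.onnx")

# Tokenize an example prompt and generate EAD/XML output
prompt = "Generate an EAD/XML description for a small photographic collection."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```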
---
## Why ONNX?
- **Faster Inference:** Optimized execution across different hardware.
- **Cross-Platform Compatibility:** Run on CPUs, GPUs, and specialized accelerators.
- **Reduced Memory Usage:** Quantized versions provide significant efficiency gains.
---
## Citation & Acknowledgments
If you use this model in research or production, please cite:
```
@misc{your-citation,
  author    = {Géraldine Geoffroy},
  title     = {Gemini-Distill-Qwen2.5-0.5B-ead-ONNX},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/Geraldine/Gemini-Distill-Qwen2.5-0.5B-ead-ONNX}
}
```