Geraldine committed · verified
Commit a616970 · 1 Parent(s): 7bfc0b3

Update README.md

Files changed (1): README.md (+82 -1)

README.md CHANGED
@@ -6,4 +6,85 @@ language:
base_model:
- Geraldine/Gemini-Distill-Qwen2.5-0.5B-ead
library_name: transformers.js
---

# Gemini-Distill-Qwen2.5-0.5B-ead-ONNX

## Model Description
This repository contains **ONNX-optimized versions** of the **Geraldine/Gemini-Distill-Qwen2.5-0.5B-ead** model, distilled from **Gemini-2.0-Flash-Thinking-Exp**. The fine-tuned model is designed specifically for structured **Encoded Archival Description (EAD/XML)** reasoning and generation.

ONNX conversion enables **faster inference** on a variety of hardware, including **CPUs, GPUs, and specialized inference accelerators**.

---

## Available ONNX Model Versions
The following ONNX variants are provided for different inference needs (a download sketch follows the table):

| File Name | Description |
|------------------------|---------------------------------------------------------|
| `model.onnx` | Full-precision (fp32) version |
| `model_fp16.onnx` | Half-precision (fp16) for optimized GPU inference |
| `model_bnb4.onnx` | Bitsandbytes 4-bit quantization |
| `model_int8.onnx` | 8-bit integer quantization for efficient CPU inference |
| `model_q4.onnx` | 4-bit quantization for low-memory scenarios |
| `model_q4f16.onnx` | 4-bit quantization with fp16 fallback |
| `model_uint8.onnx` | Unsigned 8-bit quantization |
| `model_quantized.onnx` | General-purpose quantized model (mixed precision) |
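
Any of these files can be fetched individually with `huggingface_hub` before wiring it into ONNX Runtime. The sketch below is illustrative: it assumes the file sits at the repository root under the name shown in the table (prepend `onnx/` to `filename` if the weights live in a subfolder).

```python
# Illustrative download sketch: grab a single ONNX variant from the Hub.
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="Geraldine/Gemini-Distill-Qwen2.5-0.5B-ead-ONNX",
    filename="model_int8.onnx",  # adjust to the variant (and subfolder) you need
)
print(model_path)  # local cached path, ready to pass to onnxruntime.InferenceSession
```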

---

## How to Use the ONNX Model

### **1. Install Dependencies**
Ensure you have the required dependencies for ONNX inference:

```bash
pip install onnxruntime
```

For GPU acceleration, install the GPU build instead:

```bash
pip install onnxruntime-gpu
```
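
Note that `onnxruntime` and `onnxruntime-gpu` ship different sets of execution providers. A quick way to confirm what your installed build supports before choosing a provider:

```python
# Sanity check: list the execution providers available in this onnxruntime build.
import onnxruntime as ort

print(ort.__version__)
print(ort.get_available_providers())
# A GPU build typically lists 'CUDAExecutionProvider' ahead of 'CPUExecutionProvider'.
```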

### **2. Load and Run Inference**
You can use `onnxruntime` directly to load the model and run a single forward pass:

```python
import onnxruntime as ort
import numpy as np

# Load the ONNX model (pick the file that matches your hardware)
session = ort.InferenceSession("model_fp16.onnx", providers=["CUDAExecutionProvider"])

# Prepare input data (example); replace [...] with tokenized input ids.
# The exact input names (e.g. attention_mask, position_ids, past_key_values.*)
# depend on the export, so inspect session.get_inputs() to see what is required.
input_data = {"input_ids": np.array([[...]])}

# Run inference
outputs = session.run(None, input_data)

# Print output
print(outputs)
```
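
Driving `session.run` by hand only gives you a single forward pass; autoregressive generation also needs tokenization, a decoding loop, and (depending on the export) extra inputs such as `attention_mask` and `past_key_values.*`. A simpler route is to let Hugging Face Optimum wrap the ONNX file. The following is a sketch under assumptions, not a verified recipe: it presumes `optimum[onnxruntime]` is installed, that the repository ships the tokenizer and `config.json` next to the ONNX weights, and that the file name matches the table above (prepend `onnx/` if needed).

```python
# Hedged sketch: end-to-end generation through Optimum's ONNX Runtime wrapper.
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

repo_id = "Geraldine/Gemini-Distill-Qwen2.5-0.5B-ead-ONNX"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = ORTModelForCausalLM.from_pretrained(repo_id, file_name="model_int8.onnx")

# Illustrative prompt; adapt it (or use tokenizer.apply_chat_template) to match
# the prompt format the base model was fine-tuned with.
prompt = "Generate a minimal EAD/XML <archdesc> element for a collection of 19th-century letters."
inputs = tokenizer(prompt, return_tensors="pt")

output_ids = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```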

---

## Why ONNX?
- **Faster Inference:** Optimized execution across different hardware.
- **Cross-Platform Compatibility:** Runs on CPUs, GPUs, and specialized accelerators.
- **Reduced Memory Usage:** Quantized versions cut the disk and memory footprint considerably (see the size check below).
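
As a rough illustration of the footprint difference, you can compare the on-disk size of two variants after downloading them (same caveat as above about file names and the `onnx/` subfolder; note that this downloads on the order of a gigabyte or two):

```python
# Illustrative size check: compare two ONNX variants on disk.
import os
from huggingface_hub import hf_hub_download

repo_id = "Geraldine/Gemini-Distill-Qwen2.5-0.5B-ead-ONNX"
for name in ("model_fp16.onnx", "model_int8.onnx"):
    path = hf_hub_download(repo_id=repo_id, filename=name)
    print(f"{name}: {os.path.getsize(path) / 1e6:.0f} MB")
```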

---

## Citation & Acknowledgments
If you use this model in research or production, please cite:
```bibtex
@misc{geoffroy2025gemini_distill_qwen25_ead_onnx,
  author    = {Géraldine Geoffroy},
  title     = {Gemini-Distill-Qwen2.5-0.5B-ead-ONNX},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/Geraldine/Gemini-Distill-Qwen2.5-0.5B-ead-ONNX}
}
```