---
license: openrail
datasets:
- openbmb/VisRAG-Ret-Train-In-domain-data
base_model:
- openai/clip-vit-large-patch14
tags:
- Embeddings
- Multi-modal
- text2image
- text2text
---

# OmniEmb-v1: Multi-Modal Embeddings for Unified Retrieval

A compact multi-modal embedding model that maps text and images into a single unified space, enabling efficient cross-modal retrieval without intermediate VLM transformations.

## Features

* 1536-dimensional unified embedding space
* Text2Text, Text2Image, and Image2Image retrieval support (see the retrieval sketch below)
* Direct embedding without VLM conversion steps
* Layout preservation for image data
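
Because text and images share one embedding space, every retrieval direction reduces to nearest-neighbor search over that space. The snippet below is a minimal sketch of cosine-similarity retrieval; it assumes the query and corpus embeddings are PyTorch tensors such as those returned by `encode_texts`/`encode_images` in the Usage section below, and the helper name is ours, not part of the package.

```python
import torch
import torch.nn.functional as F

# Illustrative helper; not part of the sportsvision package.
def top_k_matches(query_emb: torch.Tensor, corpus_emb: torch.Tensor, k: int = 5):
    """Rank corpus items (text or image embeddings) for one query by cosine similarity."""
    query = F.normalize(query_emb.unsqueeze(0), dim=-1)    # (1, d)
    corpus = F.normalize(corpus_emb, dim=-1)                # (n, d)
    scores = (query @ corpus.T).squeeze(0)                  # (n,) cosine similarities
    k = min(k, corpus.size(0))
    values, indices = scores.topk(k)
    return indices.tolist(), values.tolist()
```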

## Performance

### Cross-Modal Retrieval (vs CLIP-ViT-B/32)

* Hits@1: 0.428 (+60.8%)
* Hits@5: 0.651 (+38.9%)
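
Hits@k is the fraction of queries whose ground-truth counterpart appears among the top-k retrieved items. The evaluation scripts are not yet released; the snippet below is only an illustrative sketch assuming a square similarity matrix in which entry (i, i) corresponds to the correct pair.

```python
import torch

def hits_at_k(similarity: torch.Tensor, k: int) -> float:
    """similarity: (n, n) matrix; row i is query i and column i is its true match."""
    topk = similarity.topk(k, dim=1).indices                   # (n, k) candidate indices per query
    targets = torch.arange(similarity.size(0)).unsqueeze(1)    # (n, 1) correct index per query
    return (topk == targets).any(dim=1).float().mean().item()
```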

### Correlation Metrics (vs LaBSE)

* STS-B Pearson: 0.800 (+9.7%)
* STS-B Spearman: 0.795 (+7.3%)
* SICK Pearson: 0.782 (+6.3%)

### Error Metrics (vs LaBSE)

* STS-B MSE: 3.222 (-19.6%)
* SICK MSE: 0.750 (-41.5%)
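
The STS-B and SICK figures follow the standard protocol of comparing model similarity scores for sentence pairs against human judgments. A rough sketch of the metric computation (the exact score scaling used here is not yet documented):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def sts_metrics(pred_scores: np.ndarray, gold_scores: np.ndarray) -> dict:
    """pred_scores: model similarity scores; gold_scores: human annotations."""
    return {
        "pearson": pearsonr(pred_scores, gold_scores)[0],
        "spearman": spearmanr(pred_scores, gold_scores)[0],
        "mse": float(np.mean((pred_scores - gold_scores) ** 2)),
    }
```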

## Installation & Usage

Install the package:

```bash
pip install sportsvision
```

Basic usage:

```python
import torch
from transformers import AutoConfig, AutoModel
from PIL import Image

from sportsvision.research.configs import UnifiedEmbedderConfig
from sportsvision.research.models import UnifiedEmbedderModel

# Register the custom configuration and model classes with transformers
AutoConfig.register("unified_embedder", UnifiedEmbedderConfig)
AutoModel.register(UnifiedEmbedderConfig, UnifiedEmbedderModel)

# Load the pretrained model from the Hugging Face Hub
emb_model = AutoModel.from_pretrained("sportsvision/omniemb-v1")

# Select a device and move the model onto it
device = "cuda" if torch.cuda.is_available() else "cpu"
emb_model = emb_model.to(device)

# Set the model to evaluation mode
emb_model.eval()

# Sample texts
texts = [
    "Playoff season is exciting!",
    "Injury updates for the team."
]

# Encode texts to obtain embeddings
text_embeddings = emb_model.encode_texts(texts)
print("Text Embeddings:", text_embeddings)

# Sample images
image_paths = [
    "path_to_image1.jpg",
    "path_to_image2.jpg"
]

# Load images with PIL
images = [Image.open(img_path).convert("RGB") for img_path in image_paths]

# Encode images to obtain embeddings
image_embeddings = emb_model.encode_images(images)
print("Image Embeddings:", image_embeddings)
```
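
Continuing the example, a text-to-image lookup is a cosine-similarity comparison between the two sets of embeddings. The snippet assumes `encode_texts`/`encode_images` return tensors (or arrays convertible to tensors) of shape `(n, 1536)`; check the actual return type in your environment.

```python
import torch
import torch.nn.functional as F

# Normalize both sets so the dot product equals cosine similarity
text_norm = F.normalize(torch.as_tensor(text_embeddings), dim=-1)
image_norm = F.normalize(torch.as_tensor(image_embeddings), dim=-1)

# similarity[i, j] = cosine similarity between text i and image j
similarity = text_norm @ image_norm.T
best_image_per_text = similarity.argmax(dim=1)
print("Best matching image per text:", best_image_per_text.tolist())
```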

## Training

* Fine-tuned CLIP architecture
* Trained on the VisRAG dataset with a contrastive loss (see the sketch below)
* Evaluation scripts and detailed methodology documentation coming soon
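
The full training recipe has not been published yet; the snippet below is only an illustrative sketch of a standard CLIP-style symmetric contrastive (InfoNCE) objective over matched text-image pairs, not the actual training code.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(text_emb: torch.Tensor, image_emb: torch.Tensor, temperature: float = 0.07):
    """Symmetric InfoNCE over a batch of matched (text_i, image_i) pairs."""
    text_emb = F.normalize(text_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    logits = text_emb @ image_emb.T / temperature             # (n, n) similarity logits
    targets = torch.arange(text_emb.size(0), device=text_emb.device)
    loss_t2i = F.cross_entropy(logits, targets)               # match each text to its image
    loss_i2t = F.cross_entropy(logits.T, targets)             # match each image to its text
    return (loss_t2i + loss_i2t) / 2
```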

## Limitations

* Benchmarking against ImageBind and other comparable models is still in progress
* Model extensions are under development

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{kodathala2024omniemb,
  author       = {Kodathala, Varun},
  title        = {OmniEmb-v1: Multi-Modal Embeddings for Unified Retrieval},
  year         = {2024},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/sportsvision/omniemb-v1}}
}
```