---
license: openrail
datasets:
- openbmb/VisRAG-Ret-Train-In-domain-data
base_model:
- openai/clip-vit-large-patch14
tags:
- Embeddings
- Multi-modal
- text2image
- text2text
---
# OmniEmb-v1: Multi-Modal Embeddings for Unified Retrieval
A compact multi-modal embedding model that creates unified embeddings for text and images, enabling efficient retrieval across modalities without intermediate VLM transformations.
## Features
* 1536-dimensional unified embedding space
* Text2Text, Text2Image, and Image2Image retrieval support
* Direct embedding without VLM conversion steps
* Layout preservation for image data
## Performance
Percentages in parentheses are relative changes versus the listed baseline; for the error metrics, a negative value indicates lower error.
### Cross-Modal Retrieval (vs CLIP-ViT-B/32)
* Hits@1: 0.428 (+60.8%)
* Hits@5: 0.651 (+38.9%)
### Correlation Metrics (vs LaBSE)
* STS-B Pearson: 0.800 (+9.7%)
* STS-B Spearman: 0.795 (+7.3%)
* SICK Pearson: 0.782 (+6.3%)
### Error Metrics (vs LaBSE)
* STS-B MSE: 3.222 (-19.6%)
* SICK MSE: 0.750 (-41.5%)
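For reference, Hits@k for text-to-image retrieval can be computed as the fraction of text queries whose paired image appears among the top-k most similar images. The sketch below is illustrative only and is not the evaluation code behind the numbers above; it assumes cosine similarity over L2-normalized embeddings and that `text_emb[i]` / `image_emb[i]` form a ground-truth pair.

```python
import torch
import torch.nn.functional as F

def hits_at_k(text_emb: torch.Tensor, image_emb: torch.Tensor, k: int = 5) -> float:
    """Fraction of text queries whose paired image is in the top-k results (inputs: [N, D])."""
    text_emb = F.normalize(text_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    sims = text_emb @ image_emb.T                       # [N, N] cosine similarities
    topk = sims.topk(k, dim=-1).indices                 # top-k image indices per text query
    targets = torch.arange(sims.size(0)).unsqueeze(-1)  # ground-truth image index per query
    return (topk == targets).any(dim=-1).float().mean().item()
```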
## Installation & Usage
Install the package:
```bash
pip install sportsvision
```
Basic usage:
```python
import torch
from PIL import Image
from transformers import AutoConfig, AutoModel

from sportsvision.research.configs import UnifiedEmbedderConfig
from sportsvision.research.models import UnifiedEmbedderModel

# Register the custom configuration and model classes with transformers
AutoConfig.register("unified_embedder", UnifiedEmbedderConfig)
AutoModel.register(UnifiedEmbedderConfig, UnifiedEmbedderModel)

# Load the pretrained model, move it to the available device, and switch to eval mode
emb_model = AutoModel.from_pretrained("sportsvision/omniemb-v1")
device = "cuda" if torch.cuda.is_available() else "cpu"
emb_model = emb_model.to(device)
emb_model.eval()

# Encode texts to obtain embeddings
texts = [
    "Playoff season is exciting!",
    "Injury updates for the team."
]
text_embeddings = emb_model.encode_texts(texts)
print("Text Embeddings:", text_embeddings)

# Load images with PIL and encode them to obtain embeddings
image_paths = [
    "path_to_image1.jpg",
    "path_to_image2.jpg"
]
images = [Image.open(img_path).convert("RGB") for img_path in image_paths]
image_embeddings = emb_model.encode_images(images)
print("Image Embeddings:", image_embeddings)
```
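Because text and image embeddings live in the same 1536-dimensional space, cross-modal retrieval reduces to a similarity search over the vectors produced above. The following sketch assumes `encode_texts` / `encode_images` return arrays or tensors of shape `[batch, 1536]`; `torch.as_tensor` is used so it works either way.

```python
import torch
import torch.nn.functional as F

# Continue from the snippet above: rank images for each text query by cosine similarity
text_q = F.normalize(torch.as_tensor(text_embeddings), dim=-1)
image_k = F.normalize(torch.as_tensor(image_embeddings), dim=-1)

similarity = text_q @ image_k.T          # [num_texts, num_images]
best_image = similarity.argmax(dim=-1)   # best-matching image index per text query

for i, text in enumerate(texts):
    score = similarity[i, best_image[i]].item()
    print(f"{text!r} -> image {best_image[i].item()} (score {score:.3f})")
```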
## Training
* Fine-tuned from the openai/clip-vit-large-patch14 architecture
* Trained on the VisRAG in-domain retrieval data with a contrastive objective (a minimal sketch of this style of loss follows below)
* Evaluation scripts and detailed methodology documentation coming soon
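The full training recipe has not been released yet. Purely as an illustration of the objective mentioned above, a CLIP-style symmetric contrastive (InfoNCE) loss over a batch of paired text/image embeddings typically looks like the sketch below; the temperature value and batch construction are assumptions, not published hyperparameters.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(text_emb: torch.Tensor, image_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over paired text/image embeddings of shape [B, D]."""
    text_emb = F.normalize(text_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    logits = (text_emb @ image_emb.T) / temperature        # [B, B] pairwise similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matching pairs lie on the diagonal; contrast text->image and image->text
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2
```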
## Limitations
* Benchmarks against ImageBind and other comparable models are still in progress
* Extensions to the model are under development
## Citation
If you use this model in your research, please cite:
```bibtex
@misc{kodathala2024omniemb,
  author       = {Kodathala, Varun},
  title        = {OmniEmb-v1: Multi-Modal Embeddings for Unified Retrieval},
  year         = {2024},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/sportsvision/omniemb-v1}}
}
```