---
license: openrail
datasets:
- openbmb/VisRAG-Ret-Train-In-domain-data
base_model:
- openai/clip-vit-large-patch14
tags:
- Embeddings
- Multi-modal
- text2image
- text2text
---

# OmniEmb-v1: Multi-Modal Embeddings for Unified Retrieval

A compact multi-modal embedding model that produces unified embeddings for text and images, enabling efficient retrieval across modalities without intermediate VLM transformations.

## Features

* 1536-dimensional unified embedding space
* Text2Text, Text2Image, and Image2Image retrieval support
* Direct embedding without VLM conversion steps
* Layout preservation for image data

## Performance

### Cross-Modal Retrieval (vs CLIP-ViT-B/32)

* Hits@1: 0.428 (+60.8%)
* Hits@5: 0.651 (+38.9%)

### Correlation Metrics (vs LaBSE)

* STS-B Pearson: 0.800 (+9.7%)
* STS-B Spearman: 0.795 (+7.3%)
* SICK Pearson: 0.782 (+6.3%)

### Error Metrics (vs LaBSE)

* STS-B MSE: 3.222 (-19.6%)
* SICK MSE: 0.750 (-41.5%)

## Installation & Usage

Install the package:

```bash
pip install sportsvision
```

Basic usage:

```python
import torch
from PIL import Image
from transformers import AutoConfig, AutoModel

from sportsvision.research.configs import UnifiedEmbedderConfig
from sportsvision.research.models import UnifiedEmbedderModel

# Register the custom configuration and model classes with transformers
AutoConfig.register("unified_embedder", UnifiedEmbedderConfig)
AutoModel.register(UnifiedEmbedderConfig, UnifiedEmbedderModel)

# Load the pretrained model from the Hub
emb_model = AutoModel.from_pretrained("sportsvision/omniemb-v1")

# Move the model to GPU if available and switch to evaluation mode
device = "cuda" if torch.cuda.is_available() else "cpu"
emb_model = emb_model.to(device)
emb_model.eval()

# Sample texts
texts = [
    "Playoff season is exciting!",
    "Injury updates for the team."
]

# Encode texts into the unified embedding space
text_embeddings = emb_model.encode_texts(texts)
print("Text Embeddings:", text_embeddings)

# Sample images (replace with paths to your own files)
image_paths = [
    "path_to_image1.jpg",
    "path_to_image2.jpg"
]

# Load images with PIL
images = [Image.open(img_path).convert("RGB") for img_path in image_paths]

# Encode images into the same embedding space
image_embeddings = emb_model.encode_images(images)
print("Image Embeddings:", image_embeddings)
```

## Training

* Fine-tuned CLIP architecture
* Trained on the VisRAG dataset with a contrastive loss
* Evaluation scripts and detailed methodology documentation coming soon

## Limitations

* Benchmarking against ImageBind and other comparable models is still in progress
* Model extensions are under development

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{kodathala2024omniemb,
  author = {Kodathala, Varun},
  title = {OmniEmb-v1: Multi-Modal Embeddings for Unified Retrieval},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/sportsvision/omniemb-v1}}
}
```
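
## Example: Text-to-Image Retrieval

As an illustrative sketch (not part of the model's documented API), the snippet below shows how the embeddings from the usage example could be used for text-to-image retrieval via cosine similarity. It assumes `encode_texts` and `encode_images` return `torch.Tensor` batches of shape `(N, 1536)`; if the model returns NumPy arrays instead, convert them with `torch.from_numpy` first.

```python
# Minimal cross-modal retrieval sketch, assuming text_embeddings and
# image_embeddings are torch tensors of shape (num_texts, 1536) and
# (num_images, 1536) produced by the usage example above.
import torch
import torch.nn.functional as F


def rank_images_by_text(text_embeddings: torch.Tensor,
                        image_embeddings: torch.Tensor) -> torch.Tensor:
    """For each text query, return image indices sorted by cosine similarity."""
    # L2-normalize both sides so the dot product equals cosine similarity
    text_norm = F.normalize(text_embeddings, dim=-1)
    image_norm = F.normalize(image_embeddings, dim=-1)

    # (num_texts, num_images) similarity matrix
    similarity = text_norm @ image_norm.T

    # Higher similarity first
    return similarity.argsort(dim=-1, descending=True)


# Usage: ranking[0] lists image indices for the first text query, best match first.
# ranking = rank_images_by_text(text_embeddings, image_embeddings)
```

Because text and images share the same 1536-dimensional space, the same pattern applies to Text2Text and Image2Image retrieval by swapping the inputs.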