---
license: cc-by-nc-4.0
tags:
- vidore
- colpali
- multimodal-embedding
- multilingual-embedding
- Text-to-Visual Document (T→VD) retrieval
- feature-extraction
- sentence-similarity
- mteb
language:
- multilingual
library_name: transformers
pipeline_tag: visual-document-retrieval
---
<br><br>
<p align="center">
<img src="https://huggingface.co/datasets/jinaai/documentation-images/resolve/main/logo.webp" alt="Jina AI: Your Search Foundation, Supercharged!" width="150px">
</p>
<p align="center">
<b>The embedding model trained by <a href="https://jina.ai/"><b>Jina AI</b></a>.</b>
</p>
# Jina Embeddings v4: Universal Embeddings for Multimodal Multilingual Retrieval
[Original Model](https://huggingface.co/jinaai/jina-embeddings-v4) | [Blog](https://jina.ai/news/jina-embeddings-v4-universal-embeddings-for-multimodal-multilingual-retrieval) | [Technical Report](https://arxiv.org/abs/2506.18902) | [API](https://jina.ai/embeddings)
## Model Overview
This repository hosts a vLLM-compatible version of [`jina-embeddings-v4`](https://huggingface.co/jinaai/jina-embeddings-v4) with the **code** adapter merged into the base `Qwen2.5-VL` weights. This architecture modification enables native compatibility with vLLM without requiring custom adapter-handling code.
## Usage
```python
import torch
from PIL import Image
from vllm import LLM
from vllm.config import PoolerConfig
from vllm.inputs.data import TextPrompt

# Initialize model
model = LLM(
model="jinaai/jina-embeddings-v4-vllm-code",
task="embed",
override_pooler_config=PoolerConfig(pooling_type="ALL", normalize=False),
dtype="float16",
)

# Create text prompts
query = "Find a function that prints a greeting message to the console"
query_prompt = TextPrompt(
prompt=f"Query: {query}"
)
passage = "def hello_world():\n print('Hello, World!')"
passage_prompt = TextPrompt(
prompt=f"Passage: {passage}"
)

# Create image prompt
image = Image.open("<path_to_image>")
image_prompt = TextPrompt(
prompt="<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>Describe the image.<|im_end|>\n",
multi_modal_data={"image": image},
)

# Encode all prompts
prompts = [query_prompt, passage_prompt, image_prompt]
outputs = model.encode(prompts)

# Pool the per-token outputs into one normalized embedding per prompt
def get_embeddings(outputs):
VISION_START_TOKEN_ID, VISION_END_TOKEN_ID = 151652, 151653
embeddings = []
for output in outputs:
if VISION_START_TOKEN_ID in output.prompt_token_ids:
# Gather only vision tokens
img_start_pos = torch.where(
torch.tensor(output.prompt_token_ids) == VISION_START_TOKEN_ID
)[0][0]
img_end_pos = torch.where(
torch.tensor(output.prompt_token_ids) == VISION_END_TOKEN_ID
)[0][0]
embeddings_tensor = output.outputs.data.detach().clone()[
img_start_pos : img_end_pos + 1
]
else:
# Use all tokens for text-only prompts
embeddings_tensor = output.outputs.data.detach().clone()
# Pool and normalize embeddings
pooled_output = (
embeddings_tensor.sum(dim=0, dtype=torch.float32)
/ embeddings_tensor.shape[0]
)
embeddings.append(torch.nn.functional.normalize(pooled_output, dim=-1))
return embeddings

embeddings = get_embeddings(outputs)
```
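
Because the pooled embeddings are L2-normalized, a dot product directly gives the cosine similarity. Below is a minimal sketch of scoring the passage and image against the query; it assumes the `embeddings` list produced by the snippet above, in the same order as `prompts` (query, passage, image).

```python
import torch

# `embeddings` comes from the snippet above, ordered [query, passage, image].
query_emb, passage_emb, image_emb = embeddings

# Vectors are already normalized, so dot product == cosine similarity.
text_score = torch.dot(query_emb, passage_emb).item()
image_score = torch.dot(query_emb, image_emb).item()

print(f"query <-> passage similarity: {text_score:.4f}")
print(f"query <-> image similarity:   {image_score:.4f}")
```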