---
pipeline_tag: image-text-to-text
library_name: transformers
license: apache-2.0
language:
- en
tags:
- Sentence Similarity
- Embedding
- zero-shot-image-classification
- video-text-to-text
base_model: BAAI/Aquila-VL-2B-llava-qwen
---

# LLaVE-2B

## Model Summary

LLaVE-2B is a 2B-parameter multimodal embedding model based on Aquila-VL-2B, with a context window of 4K tokens.

- **Repository:** [LLaVE](https://github.com/DeepLearnXMU/LLaVE)
- **Paper:** [LLaVE](https://huggingface.co/papers/2503.04812)

## Train/Eval Data
- Train data: https://huggingface.co/datasets/TIGER-Lab/MMEB-train
- Eval data: https://huggingface.co/datasets/TIGER-Lab/MMEB-eval
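
Both datasets can be browsed with the Hugging Face `datasets` library. The sketch below is a minimal example; the subset name `ImageNet_1K` is an assumption on our part, since MMEB is organized into per-task subsets (check the dataset cards for the exact names).

```python
# Minimal sketch of inspecting the MMEB training pairs.
# The subset name "ImageNet_1K" is an assumption; MMEB ships many
# per-task subsets, so check the dataset card for the full list.
from datasets import load_dataset

ds = load_dataset("TIGER-Lab/MMEB-train", "ImageNet_1K")
print(ds)  # shows the available splits and columns (query/target pairs)
```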

## Use

### Intended use

The model can embed texts, images, multi-image inputs, and videos.

## MMEB Leaderboard
We achieved the top ranking on the MMEB leaderboard using only a small amount of training data.

![MMEB Leaderboard](./figures/leaderboard.png)


## Model Performance
LLaVE-2B achieves strong performance on MMEB with fewer parameters than competing models and only 662K training pairs.
![MMEB](./figures/results.png)

Although LLaVE is trained only on image-text data, it generalizes to text-video retrieval tasks in a zero-shot manner and achieves strong performance, demonstrating its potential for transfer to other embedding tasks (a hedged frame-sampling sketch appears after the Quick Start below).
<img src="./figures/zero-shot-vr.png" alt="video-retrieve" width="400" height="auto">

### Quick Start

First, clone our GitHub repository:
```bash
git clone https://github.com/DeepLearnXMU/LLaVE
cd LLaVE
pip install -e ".[train]"
```

Below is a simple embedding example using our model. For more details, refer to the [GitHub repository](https://github.com/DeepLearnXMU/LLaVE).

```python
# pip install git+https://github.com/DeepLearnXMU/LLaVE

import torch
import copy
from PIL import Image
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from llava.conversation import conv_templates
from llava.model.builder import load_pretrained_model
from llava.mm_utils import tokenizer_image_token, process_images

pretrained = "zhibinlan/LLaVE-2B"
model_name = "llava_qwen"
device = "cuda"
device_map = "auto"
tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map)  # pass any extra llava_model_args here
model.eval()

# Image + Text -> Text
image = Image.open("figures/example.jpg")
image_tensor = process_images([image], image_processor, model.config)
image_tensor = [_image.to(dtype=torch.float16, device=device) for _image in image_tensor]
conv_template = "qwen_1_5"  # Make sure you use correct chat template for different models

question = DEFAULT_IMAGE_TOKEN + " Represent the given image with the following question: What is in the image"
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], "\n")
prompt_question = conv.get_prompt()
input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
attention_mask = input_ids.ne(tokenizer.pad_token_id)
image_sizes = [image.size]
query_embed = model.encode_multimodal_embeddings(input_ids, attention_mask=attention_mask, images=image_tensor, image_sizes=image_sizes)

# Encode a matching text target
target_string = "A cat and a dog"
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], target_string)
conv.append_message(conv.roles[1], "\n")
target_string = conv.get_prompt()
target_input_ids = tokenizer(target_string, return_tensors="pt").input_ids.to(device)
attention_mask = target_input_ids.ne(tokenizer.pad_token_id)
target_embed = model.encode_multimodal_embeddings(target_input_ids, attention_mask=attention_mask)

print("A cat and a dog similarity score: ", query_embed @ target_embed.T)
# 2B: A cat and a dog similarity score: tensor([[0.5132]])

# Encode a distractor text target
neg_string = "A cat and a tiger"
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], neg_string)
conv.append_message(conv.roles[1], "\n")
neg_string = conv.get_prompt()
neg_input_ids = tokenizer(neg_string, return_tensors="pt").input_ids.to(device)
attention_mask = neg_input_ids.ne(tokenizer.pad_token_id)
neg_embed = model.encode_multimodal_embeddings(neg_input_ids, attention_mask=attention_mask)
print("A cat and a tiger similarity score: ", query_embed @ neg_embed.T)
# 2B: A cat and a tiger similarity score: tensor([[0.3809]])
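
# (Illustrative addition, not part of the original script.) With several
# candidate embeddings, retrieval reduces to ranking by similarity score:
candidates = torch.cat([target_embed, neg_embed], dim=0)  # (2, dim)
scores = (query_embed @ candidates.T).squeeze(0)          # (2,)
print("Best match index:", scores.argmax().item())  # 0 -> "A cat and a dog"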


# Text -> Image
pos_string = "Find me an everyday image that matches the given caption: A cat and a dog."
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], pos_string)
conv.append_message(conv.roles[1], "\n")
pos_string = conv.get_prompt()
pos_input_ids = tokenizer(pos_string, return_tensors="pt").input_ids.to(device)
attention_mask = pos_input_ids.ne(tokenizer.pad_token_id)
pos_query_embed = model.encode_multimodal_embeddings(pos_input_ids, attention_mask=attention_mask)

target = DEFAULT_IMAGE_TOKEN + " Represent the given image."
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], target)
conv.append_message(conv.roles[1], "\n")
prompt_target = conv.get_prompt()
target_input_ids = tokenizer_image_token(prompt_target, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
attention_mask = target_input_ids.ne(tokenizer.pad_token_id)
target_image_sizes = [image.size]
target_embed = model.encode_multimodal_embeddings(target_input_ids, attention_mask=attention_mask, images=image_tensor, image_sizes=target_image_sizes)

print("A cat and a dog image similarity score: ", pos_query_embed @ target_embed.T)
# 2B: A cat and a dog image similarity score: tensor([[0.5225]])

neg_string = "Find me an everyday image that matches the given caption: A cat and a tiger."
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], neg_string)
conv.append_message(conv.roles[1], "\n")
neg_string = conv.get_prompt()
neg_input_ids = tokenizer(neg_string, return_tensors="pt").input_ids.to(device)
attention_mask = neg_input_ids.ne(tokenizer.pad_token_id)
neg_query_embed = model.encode_multimodal_embeddings(neg_input_ids, attention_mask=attention_mask)

print("A cat and a tiger image similarity score: ", neg_query_embed @ target_embed.T)
# 2B: A cat and a tiger image similarity score: tensor([[0.4141]])
```
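
Building on the zero-shot video transfer noted above, here is a hedged sketch of text-video retrieval. It is an assumption on our part, not the paper's exact protocol: we treat a video as a handful of sampled frames and embed them as a multi-image input, reusing `model`, `tokenizer`, `image_processor`, and `conv_template` from the Quick Start. The frame paths and the one-`<image>`-token-per-frame prompt layout are hypothetical; see the [GitHub repository](https://github.com/DeepLearnXMU/LLaVE) for the reference pipeline.

```python
# Hedged sketch: video -> frames -> multi-image embedding (assumptions noted above).
frames = [Image.open(f"frames/frame_{i}.jpg") for i in range(8)]  # hypothetical frame dump
frame_tensors = process_images(frames, image_processor, model.config)
frame_tensors = [f.to(dtype=torch.float16, device=device) for f in frame_tensors]

# One <image> token per frame, then an instruction (layout is an assumption).
video_prompt = DEFAULT_IMAGE_TOKEN * len(frames) + " Represent the given video."
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], video_prompt)
conv.append_message(conv.roles[1], "\n")
video_input_ids = tokenizer_image_token(conv.get_prompt(), tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
video_embed = model.encode_multimodal_embeddings(
    video_input_ids,
    attention_mask=video_input_ids.ne(tokenizer.pad_token_id),
    images=frame_tensors,
    image_sizes=[f.size for f in frames],
)

# Embed a text query and score it against the video embedding.
text_query = "Find the video that matches the given caption: A dog chasing a ball."
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], text_query)
conv.append_message(conv.roles[1], "\n")
query_ids = tokenizer(conv.get_prompt(), return_tensors="pt").input_ids.to(device)
text_embed = model.encode_multimodal_embeddings(query_ids, attention_mask=query_ids.ne(tokenizer.pad_token_id))
print("caption-video similarity:", (text_embed @ video_embed.T).item())
```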

## Hardware & Software
- **GPUs:** 8 × NVIDIA A100 (40GB), used for full-model training
- **Orchestration:** [Hugging Face Trainer](https://huggingface.co/docs/transformers/main_classes/trainer)
- **Neural networks:** [PyTorch](https://github.com/pytorch/pytorch)

## Citation
```
@article{lan2025llave,
  title={LLaVE: Large Language and Vision Embedding Models with Hardness-Weighted Contrastive Learning},
  author={Lan, Zhibin and Niu, Liqiang and Meng, Fandong and Zhou, Jie and Su, Jinsong},
  journal={arXiv preprint arXiv:2503.04812},
  year={2025}
}
```