Update README.md
README.md
CHANGED
@@ -20,6 +20,19 @@ The LLaVE models are 2B parameter multimodal embedding models based on the Aquil
The model has the ability to embed texts, images, multi-image inputs, and videos.

+## MMEB Leaderboard
+We achieved the top ranking on the MMEB leaderboard using only a small amount of data.
+
+[Figure: MMEB leaderboard ranking]
+
+## Model Performance
+LLaVE-7B achieved SOTA performance on MMEB using only 662K training pairs.
+
+[Figure: MMEB results]
+
+Although LLaVE is trained on image-text data, it generalizes to text-video retrieval tasks in a zero-shot manner and achieves strong performance, demonstrating its remarkable potential for transfer to other embedding tasks.
+<img src="./figures/zero-shot-vr.png" alt="video-retrieve" width="400" height="auto">
+
### Quick Start

First clone our GitHub repository.
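The capability line in the hunk above (embedding texts, images, multi-image inputs, and videos) implies the usual retrieval workflow: encode a query and a pool of candidates, then rank the candidates by cosine similarity. The sketch below illustrates only that ranking step; the `query_embedding` and `candidate_embeddings` arrays stand in for outputs of the LLaVE encoders, whose loading API is not shown in this diff and is therefore assumed.

```python
import numpy as np

def cosine_similarity(query: np.ndarray, candidates: np.ndarray) -> np.ndarray:
    """Cosine similarity between one query vector and a matrix of candidate vectors."""
    query = query / np.linalg.norm(query)
    candidates = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    return candidates @ query

# Stand-ins for real LLaVE embeddings (e.g. a text query vs. a pool of video clips);
# the model's encoding API is not part of this diff, so random vectors are used here.
rng = np.random.default_rng(0)
query_embedding = rng.standard_normal(768).astype(np.float32)
candidate_embeddings = rng.standard_normal((100, 768)).astype(np.float32)

scores = cosine_similarity(query_embedding, candidate_embeddings)
top5 = np.argsort(-scores)[:5]
print("Top-5 candidate indices:", top5, "scores:", scores[top5])
```

L2-normalizing both sides makes the dot product equal to cosine similarity, which is the scoring typically used in embedding-based retrieval benchmarks such as MMEB.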