Update README.md
README.md
CHANGED
@@ -20,6 +20,19 @@ The LLaVE models are 2B parameter multimodal embedding models based on the Aquil
The model has the ability to embed texts, images, multi-image inputs, and videos.

+## MMEB Leaderboard
+We achieved the top ranking on the MMEB leaderboard using only a small amount of data.
+
+[Figure: MMEB leaderboard ranking]
+
+## Model Performance
+LLaVE-7B achieved SOTA performance on MMEB using only 662K training pairs.
+
+[Figure: MMEB results]
+
+Although LLaVE is trained on image-text data, it generalizes to text-video retrieval tasks in a zero-shot manner and achieves strong performance, demonstrating its remarkable potential for transfer to other embedding tasks.
+<img src="./figures/zero-shot-vr.png" alt="video-retrieve" width="400" height="auto">
+
### Quick Start

First clone our GitHub repository.
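The capability line in the hunk above (embedding texts, images, multi-image inputs, and videos) implies the usual retrieval workflow: encode a query and a pool of candidates, then rank the candidates by cosine similarity. The sketch below illustrates only that ranking step; the `query_embedding` and `candidate_embeddings` arrays stand in for outputs of the LLaVE encoders, whose loading API is not shown in this diff and is therefore assumed.

```python
import numpy as np

def cosine_similarity(query: np.ndarray, candidates: np.ndarray) -> np.ndarray:
    """Cosine similarity between one query vector and a matrix of candidate vectors."""
    query = query / np.linalg.norm(query)
    candidates = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    return candidates @ query

# Stand-ins for real LLaVE embeddings (e.g. a text query vs. a pool of video clips);
# the model's encoding API is not part of this diff, so random vectors are used here.
rng = np.random.default_rng(0)
query_embedding = rng.standard_normal(768).astype(np.float32)
candidate_embeddings = rng.standard_normal((100, 768)).astype(np.float32)

scores = cosine_similarity(query_embedding, candidate_embeddings)
top5 = np.argsort(-scores)[:5]
print("Top-5 candidate indices:", top5, "scores:", scores[top5])
```

L2-normalizing both sides makes the dot product equal to cosine similarity, which is the scoring typically used in embedding-based retrieval benchmarks such as MMEB.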