LLaVE is a series of large language and vision embedding models trained on a variety of multimodal embedding datasets.
- zhibinlan/LLaVE-0.5B (Image-Text-to-Text)
- zhibinlan/LLaVE-2B (Image-Text-to-Text)
- zhibinlan/LLaVE-7B (Image-Text-to-Text)
- Paper: LLaVE: Large Language and Vision Embedding Models with Hardness-Weighted Contrastive Learning (arXiv:2503.04812)
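
The checkpoints are intended to produce joint image-text embeddings. The sketch below is a minimal, assumption-laden example of encoding an image-text pair: it assumes the repositories can be loaded through Hugging Face `transformers` with `trust_remote_code=True` and that mean pooling over the last hidden state yields a usable embedding; the exact entry point and pooling strategy may differ, so consult each model card for the official usage.

```python
# Minimal sketch, NOT the official LLaVE usage.
# Assumptions: the checkpoint loads via AutoModel/AutoProcessor with
# trust_remote_code=True, and mean pooling the last hidden state gives an embedding.
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

model_id = "zhibinlan/LLaVE-2B"
model = AutoModel.from_pretrained(model_id, trust_remote_code=True, torch_dtype=torch.float16)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("example.jpg")  # hypothetical local image
inputs = processor(images=image, text="A photo of a cat", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Assumption: pool token states into one vector and L2-normalize it for retrieval.
embedding = outputs.last_hidden_state.mean(dim=1)
embedding = torch.nn.functional.normalize(embedding, dim=-1)
print(embedding.shape)
```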