SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features • Paper 2502.14786 • Published Feb 2025
AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Understanding • Paper 2502.01341 • Published Feb 3, 2025
BigDocs: An Open and Permissively-Licensed Dataset for Training Multimodal Models on Document and Code Tasks • Paper 2412.04626 • Published Dec 5, 2024
ChartGemma: Visual Instruction-tuning for Chart Reasoning in the Wild • Paper 2407.04172 • Published Jul 4, 2024