EVEv2: Improved Baselines for Encoder-Free Vision-Language Models Paper • 2502.06788 • Published 1 day ago • 10
Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey Paper • 2412.18619 • Published Dec 16, 2024 • 54
VITATECS: A Diagnostic Dataset for Temporal Concept Understanding of Video-Language Models Paper • 2311.17404 • Published Nov 29, 2023