- Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models
  Paper • 2409.17146 • Published • 108
- Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
  Paper • 2409.12191 • Published • 76
- mistralai/Pixtral-12B-2409
  Image-Text-to-Text • Updated • • 622
- HuggingFaceTB/SmolVLM-Instruct
  Image-Text-to-Text • Updated • 62.9k • 408

Collections including paper arxiv:2409.12191

- Training Language Models to Self-Correct via Reinforcement Learning
  Paper • 2409.12917 • Published • 138
- Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
  Paper • 2409.12191 • Published • 76
- Expect the Unexpected: FailSafe Long Context QA for Finance
  Paper • 2502.06329 • Published • 126
- Competitive Programming with Large Reasoning Models
  Paper • 2502.06807 • Published • 67

- Mamba-YOLO-World: Marrying YOLO-World with Mamba for Open-Vocabulary Detection
  Paper • 2409.08513 • Published • 14
- Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale
  Paper • 2409.08264 • Published • 45
- Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
  Paper • 2409.12191 • Published • 76
- LLMs + Persona-Plug = Personalized LLMs
  Paper • 2409.11901 • Published • 32

- An Introduction to Vision-Language Modeling
  Paper • 2405.17247 • Published • 88
- Visual Instruction Tuning
  Paper • 2304.08485 • Published • 13
- Improved Baselines with Visual Instruction Tuning
  Paper • 2310.03744 • Published • 37
- PALO: A Polyglot Large Multimodal Model for 5B People
  Paper • 2402.14818 • Published • 24

- Building and better understanding vision-language models: insights and future directions
  Paper • 2408.12637 • Published • 126
- Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model
  Paper • 2408.11039 • Published • 59
- Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming
  Paper • 2408.16725 • Published • 53
- Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders
  Paper • 2408.15998 • Published • 86

- LongVILA: Scaling Long-Context Visual Language Models for Long Videos
  Paper • 2408.10188 • Published • 52
- xGen-MM (BLIP-3): A Family of Open Large Multimodal Models
  Paper • 2408.08872 • Published • 99
- Building and better understanding vision-language models: insights and future directions
  Paper • 2408.12637 • Published • 126
- Show-o: One Single Transformer to Unify Multimodal Understanding and Generation
  Paper • 2408.12528 • Published • 51

- LLaVA-OneVision: Easy Visual Task Transfer
  Paper • 2408.03326 • Published • 60
- VILA^2: VILA Augmented VILA
  Paper • 2407.17453 • Published • 40
- PaliGemma: A versatile 3B VLM for transfer
  Paper • 2407.07726 • Published • 69
- openbmb/MiniCPM-V-2_6
  Image-Text-to-Text • Updated • 73.4k • 953

- Qwen2.5-VL Technical Report
  Paper • 2502.13923 • Published • 164
- Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs
  Paper • 2404.05719 • Published • 82
- ShowUI: One Vision-Language-Action Model for GUI Visual Agent
  Paper • 2411.17465 • Published • 80
- Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
  Paper • 2409.12191 • Published • 76

- SelfEval: Leveraging the discriminative nature of generative models for evaluation
  Paper • 2311.10708 • Published • 17
- OmniGen: Unified Image Generation
  Paper • 2409.11340 • Published • 112
- NVLM: Open Frontier-Class Multimodal LLMs
  Paper • 2409.11402 • Published • 73
- Fine-Tuning Image-Conditional Diffusion Models is Easier than You Think
  Paper • 2409.11355 • Published • 29

- RLHF Workflow: From Reward Modeling to Online RLHF
  Paper • 2405.07863 • Published • 68
- Chameleon: Mixed-Modal Early-Fusion Foundation Models
  Paper • 2405.09818 • Published • 131
- Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models
  Paper • 2405.15574 • Published • 55
- An Introduction to Vision-Language Modeling
  Paper • 2405.17247 • Published • 88