Collections
Discover the best community collections!
Collections including paper arxiv:2409.12191

- NVLM: Open Frontier-Class Multimodal LLMs
  Paper • 2409.11402 • Published • 73
- BRAVE: Broadening the visual encoding of vision-language models
  Paper • 2404.07204 • Published • 19
- Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models
  Paper • 2403.18814 • Published • 47
- Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models
  Paper • 2409.17146 • Published • 106

- The Llama 3 Herd of Models
  Paper • 2407.21783 • Published • 111
- Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
  Paper • 2409.12191 • Published • 76
- Baichuan Alignment Technical Report
  Paper • 2410.14940 • Published • 50
- A Survey of Small Language Models
  Paper • 2410.20011 • Published • 40

- Qwen2.5-Coder Technical Report
  Paper • 2409.12186 • Published • 140
- Attention Heads of Large Language Models: A Survey
  Paper • 2409.03752 • Published • 89
- Loopy: Taming Audio-Driven Portrait Avatar with Long-Term Motion Dependency
  Paper • 2409.02634 • Published • 93
- OmniGen: Unified Image Generation
  Paper • 2409.11340 • Published • 111

- Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models
  Paper • 2409.17146 • Published • 106
- Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
  Paper • 2409.12191 • Published • 76
- mistralai/Pixtral-12B-2409
  Image-Text-to-Text • Updated • 599
- HuggingFaceTB/SmolVLM-Instruct
  Image-Text-to-Text • Updated • 111k • 375

- Mamba-YOLO-World: Marrying YOLO-World with Mamba for Open-Vocabulary Detection
  Paper • 2409.08513 • Published • 12
- Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale
  Paper • 2409.08264 • Published • 44
- Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
  Paper • 2409.12191 • Published • 76
- LLMs + Persona-Plug = Personalized LLMs
  Paper • 2409.11901 • Published • 32

- An Introduction to Vision-Language Modeling
  Paper • 2405.17247 • Published • 87
- Visual Instruction Tuning
  Paper • 2304.08485 • Published • 13
- Improved Baselines with Visual Instruction Tuning
  Paper • 2310.03744 • Published • 37
- PALO: A Polyglot Large Multimodal Model for 5B People
  Paper • 2402.14818 • Published • 23

- Building and better understanding vision-language models: insights and future directions
  Paper • 2408.12637 • Published • 125
- Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model
  Paper • 2408.11039 • Published • 59
- Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming
  Paper • 2408.16725 • Published • 53
- Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders
  Paper • 2408.15998 • Published • 86