-
EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
Paper • 2402.04252 • Published • 26 -
Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
Paper • 2402.03749 • Published • 13 -
ScreenAI: A Vision-Language Model for UI and Infographics Understanding
Paper • 2402.04615 • Published • 43 -
EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss
Paper • 2402.05008 • Published • 22
Collections
Discover the best community collections!
Collections including paper arxiv:2408.16500
-
Law of Vision Representation in MLLMs
Paper • 2408.16357 • Published • 93 -
CogVLM2: Visual Language Models for Image and Video Understanding
Paper • 2408.16500 • Published • 57 -
Learning to Move Like Professional Counter-Strike Players
Paper • 2408.13934 • Published • 23 -
Building and better understanding vision-language models: insights and future directions
Paper • 2408.12637 • Published • 126
-
LongVILA: Scaling Long-Context Visual Language Models for Long Videos
Paper • 2408.10188 • Published • 52 -
xGen-MM (BLIP-3): A Family of Open Large Multimodal Models
Paper • 2408.08872 • Published • 99 -
Building and better understanding vision-language models: insights and future directions
Paper • 2408.12637 • Published • 126 -
Show-o: One Single Transformer to Unify Multimodal Understanding and Generation
Paper • 2408.12528 • Published • 51
-
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
Paper • 2406.16860 • Published • 60 -
Understanding Alignment in Multimodal LLMs: A Comprehensive Study
Paper • 2407.02477 • Published • 23 -
LongVILA: Scaling Long-Context Visual Language Models for Long Videos
Paper • 2408.10188 • Published • 52 -
Building and better understanding vision-language models: insights and future directions
Paper • 2408.12637 • Published • 126
-
Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities
Paper • 2401.14405 • Published • 13 -
CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs
Paper • 2406.18521 • Published • 29 -
xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations
Paper • 2408.12590 • Published • 36 -
Law of Vision Representation in MLLMs
Paper • 2408.16357 • Published • 93
-
RLHF Workflow: From Reward Modeling to Online RLHF
Paper • 2405.07863 • Published • 68 -
Chameleon: Mixed-Modal Early-Fusion Foundation Models
Paper • 2405.09818 • Published • 131 -
Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models
Paper • 2405.15574 • Published • 55 -
An Introduction to Vision-Language Modeling
Paper • 2405.17247 • Published • 88
-
iVideoGPT: Interactive VideoGPTs are Scalable World Models
Paper • 2405.15223 • Published • 15 -
Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models
Paper • 2405.15574 • Published • 55 -
An Introduction to Vision-Language Modeling
Paper • 2405.17247 • Published • 88 -
Matryoshka Multimodal Models
Paper • 2405.17430 • Published • 32
-
Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding
Paper • 2405.08748 • Published • 24 -
Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection
Paper • 2405.10300 • Published • 29 -
Chameleon: Mixed-Modal Early-Fusion Foundation Models
Paper • 2405.09818 • Published • 131 -
OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework
Paper • 2405.11143 • Published • 38
-
Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models
Paper • 2402.17177 • Published • 88 -
EMO: Emote Portrait Alive - Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions
Paper • 2402.17485 • Published • 191 -
VisionLLaMA: A Unified LLaMA Interface for Vision Tasks
Paper • 2403.00522 • Published • 46 -
PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation
Paper • 2403.04692 • Published • 40
-
FinTral: A Family of GPT-4 Level Multimodal Financial Large Language Models
Paper • 2402.10986 • Published • 78 -
Aria Everyday Activities Dataset
Paper • 2402.13349 • Published • 31 -
Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference
Paper • 2403.04132 • Published • 39 -
SaulLM-7B: A pioneering Large Language Model for Law
Paper • 2403.03883 • Published • 80