- EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
  Paper • 2402.04252 • Published • 26
- Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
  Paper • 2402.03749 • Published • 13
- ScreenAI: A Vision-Language Model for UI and Infographics Understanding
  Paper • 2402.04615 • Published • 43
- EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss
  Paper • 2402.05008 • Published • 22
Collections
Collections including paper arxiv:2406.04325
- Chain-of-Thought Reasoning Without Prompting
  Paper • 2402.10200 • Published • 105
- Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing
  Paper • 2404.12253 • Published • 55
- Make Your LLM Fully Utilize the Context
  Paper • 2404.16811 • Published • 54
- ReFT: Representation Finetuning for Language Models
  Paper • 2404.03592 • Published • 94
- ShareGPT4Video: Improving Video Understanding and Generation with Better Captions
  Paper • 2406.04325 • Published • 74
- MoE-LLaVA: Mixture of Experts for Large Vision-Language Models
  Paper • 2401.15947 • Published • 51
- Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
  Paper • 2311.10122 • Published • 27
- Video-Bench: A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models
  Paper • 2311.16103 • Published • 1
- ShareGPT4Video: Improving Video Understanding and Generation with Better Captions
  Paper • 2406.04325 • Published • 74
- SF-V: Single Forward Video Generation Model
  Paper • 2406.04324 • Published • 25
- VideoTetris: Towards Compositional Text-to-Video Generation
  Paper • 2406.04277 • Published • 25
- Vript: A Video Is Worth Thousands of Words
  Paper • 2406.06040 • Published • 29
- Vript: A Video Is Worth Thousands of Words
  Paper • 2406.06040 • Published • 29
- ShareGPT4Video: Improving Video Understanding and Generation with Better Captions
  Paper • 2406.04325 • Published • 74
- MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark
  Paper • 2406.01574 • Published • 46
- Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
  Paper • 2405.21075 • Published • 24
- Husky: A Unified, Open-Source Language Agent for Multi-Step Reasoning
  Paper • 2406.06469 • Published • 28
- Mixture-of-Agents Enhances Large Language Model Capabilities
  Paper • 2406.04692 • Published • 58
- CRAG -- Comprehensive RAG Benchmark
  Paper • 2406.04744 • Published • 47
- ShareGPT4Video: Improving Video Understanding and Generation with Better Captions
  Paper • 2406.04325 • Published • 74
- ShareGPT4Video: Improving Video Understanding and Generation with Better Captions
  Paper • 2406.04325 • Published • 74
- SF-V: Single Forward Video Generation Model
  Paper • 2406.04324 • Published • 25
- I4VGen: Image as Stepping Stone for Text-to-Video Generation
  Paper • 2406.02230 • Published • 18
- Molar: Multimodal LLMs with Collaborative Filtering Alignment for Enhanced Sequential Recommendation
  Paper • 2412.18176 • Published • 15