[CVPR 2025] Docopilot: Improving Multimodal Models for Document-Level Understanding
AI & ML interests: Computer Vision
- InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
  Paper • 2504.10479 • Published • 280
- OpenGVLab/InternVL3-1B
  Image-Text-to-Text • 0.9B • Updated • 91.4k • 68
- OpenGVLab/InternVL3-2B
  Image-Text-to-Text • 2B • Updated • 54.3k • 30
- OpenGVLab/InternVL3-8B
  Image-Text-to-Text • 8B • Updated • 309k • 88
A Pioneering Monolithic MLLM
- Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training
  Paper • 2410.08202 • Published • 4
- Mono-InternVL-1.5: Towards Cheaper and Faster Monolithic Multimodal Large Language Models
  Paper • 2507.12566 • Published • 14
- OpenGVLab/Mono-InternVL-2B
  Image-Text-to-Text • 3B • Updated • 5.12k • 36
- OpenGVLab/Mono-InternVL-2B-S1-1
  Image-Text-to-Text • 3B • Updated • 9
[NeurIPS 2024 Spotlight] Parameter-Inverted Image Pyramid Networks
- Parameter-Inverted Image Pyramid Networks for Visual Perception and Multimodal Understanding
  Paper • 2501.07783 • Published • 7
- OpenGVLab/PIIP
  Object Detection • Updated • 5
- OpenGVLab/PIIP-LLaVA_CLIP-BL_512-256_7B
  Image-Text-to-Text • 7B • Updated • 3
- OpenGVLab/PIIP-LLaVA_ConvNeXt-B_CLIP-L_640-224_7B
  Image-Text-to-Text • 7B • Updated • 4
- VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking
  Paper • 2303.16727 • Published
- OpenGVLab/VideoMAEv2-Base
  Video Classification • 0.1B • Updated • 9.76k • 7
- OpenGVLab/VideoMAEv2-Large
  Video Classification • 0.3B • Updated • 18.4k • 1
- OpenGVLab/VideoMAEv2-Huge
  Video Classification • 0.6B • Updated • 5.78k • 1
Better than InternVL 2.0
- InternVL
  ⚡ Chat with an AI that understands text and images • 486
- Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
  Paper • 2412.05271 • Published • 161
- OpenGVLab/InternVL2_5-78B
  Image-Text-to-Text • 78B • Updated • 693 • 192
- OpenGVLab/InternVL2_5-78B-AWQ
  Image-Text-to-Text • Updated • 60 • 14
Expanding Performance Boundaries of Open-Source MLLM
Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
- InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
  Paper • 2312.14238 • Published • 20
- OpenGVLab/InternViT-6B-224px
  Image Feature Extraction • Updated • 491 • 24
- OpenGVLab/InternVL-14B-224px
  Image Feature Extraction • 14B • Updated • 786 • 35
- OpenGVLab/InternVL-Chat-V1-2-Plus
  Image-Text-to-Text • 40B • Updated • 42 • 34
Adaptation Models for Specific Domains
- OpenGVLab/Mini-InternVL2-4B-DA-DriveLM
  Image-Text-to-Text • 4B • Updated • 48 • 3
- OpenGVLab/Mini-InternVL2-4B-DA-Medical
  Image-Text-to-Text • 4B • Updated • 9 • 5
- OpenGVLab/Mini-InternVL2-4B-DA-BDD
  Image-Text-to-Text • 4B • Updated • 8
- OpenGVLab/Mini-InternVL2-2B-DA-DriveLM
  Image-Text-to-Text • 2B • Updated • 38
Chat-Centric Video Understanding
A Large-Scale Video-Text Dataset
Improved Baselines with Pyramid Vision Transformer
ZeroGUI: Automating Online GUI Learning at Zero Human Cost
- VisualPRM: An Effective Process Reward Model for Multimodal Reasoning
  Paper • 2503.10291 • Published • 37
- OpenGVLab/VisualPRM-8B
  Image-Text-to-Text • 8B • Updated • 1.28k • 15
- OpenGVLab/VisualPRM-8B-v1_1
  Image-Text-to-Text • 8B • Updated • 165 • 7
- OpenGVLab/VisualPRM400K
  Preview • Updated • 131 • 13
VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning
- OpenGVLab/InternVideo2_5_Chat_8B
  Video-Text-to-Text • 8B • Updated • 22.7k • 73
- OpenGVLab/InternVL_2_5_HiCo_R16
  Video-Text-to-Text • 8B • Updated • 2.66k • 4
- OpenGVLab/InternVL_2_5_HiCo_R64
  Video-Text-to-Text • 8B • Updated • 123 • 3
- InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling
  Paper • 2501.12386 • Published • 1
Faster and more powerful VideoChat.
- OpenGVLab/VideoChat-Flash-Qwen2_5-2B_res448
  Video-Text-to-Text • 2B • Updated • 933 • 23
- OpenGVLab/VideoChat-Flash-Qwen2-7B_res224
  Video-Text-to-Text • 8B • Updated • 36 • 7
- OpenGVLab/VideoChat-Flash-Qwen2-7B_res448
  Video-Text-to-Text • 8B • Updated • 2.25k • 12
- VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling
  Paper • 2501.00574 • Published • 6
Enhancing the Reasoning Ability of MLLMs via Mixed Preference Optimization
- Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization
  Paper • 2411.10442 • Published • 85
- OpenGVLab/InternVL2_5-78B-MPO
  Image-Text-to-Text • 78B • Updated • 220 • 54
- OpenGVLab/InternVL2_5-38B-MPO
  Image-Text-to-Text • 38B • Updated • 608 • 20
- OpenGVLab/InternVL2_5-26B-MPO
  Image-Text-to-Text • 26B • Updated • 450 • 14
A Pioneering Open-Source Alternative to GPT-4V
- How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
  Paper • 2404.16821 • Published • 58
- OpenGVLab/InternVL-Chat-V1-5
  Image-Text-to-Text • 26B • Updated • 2.93k • 412
- OpenGVLab/InternViT-6B-448px-V1-5
  Image Feature Extraction • 6B • Updated • 734 • 78
- OpenGVLab/InternViT-300M-448px
  Image Feature Extraction • 0.3B • Updated • 9.09k • 55
Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding
InternVideo2
- InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding
  Paper • 2403.15377 • Published • 27
- OpenGVLab/InternVideo2-Chat-8B
  Video-Text-to-Text • 8B • Updated • 500 • 23
- OpenGVLab/InternVideo2_chat_8B_HD
  Video-Text-to-Text • 8B • Updated • 112 • 18
- OpenGVLab/InternVideo2_Chat_8B_InternLM2_5
  Video-Text-to-Text • 9B • Updated • 44 • 7
State Space Model for Efficient Video Understanding
A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text
Exploring Large-Scale Vision Foundation Models with Deformable Convolutions
- InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions
  Paper • 2211.05778 • Published
- OpenGVLab/internimage_t_1k_224
  Image Classification • 0.0B • Updated • 177 • 1
- OpenGVLab/internimage_s_1k_224
  Image Classification • 0.1B • Updated • 8 • 1
- OpenGVLab/internimage_b_1k_224
  Image Classification • 0.1B • Updated • 103 • 1