Weiyun1025 committed
Commit 2c408ff · verified · 1 Parent(s): 0fb4585

Upload README.md with huggingface_hub

Files changed (1): README.md (+3 -3)
README.md CHANGED
@@ -19,7 +19,7 @@ tags:
 
 # InternVL3-1B
 
-[\[📂 GitHub\]](https://github.com/OpenGVLab/InternVL) [\[📜 InternVL 1.0\]](https://huggingface.co/papers/2312.14238) [\[📜 InternVL 1.5\]](https://huggingface.co/papers/2404.16821) [\[📜 InternVL 2.5\]](https://huggingface.co/papers/2412.05271) [\[📜 InternVL2.5-MPO\]](https://huggingface.co/papers/2411.10442) [\[📜 InternVL3\]](TBD)
+[\[📂 GitHub\]](https://github.com/OpenGVLab/InternVL) [\[📜 InternVL 1.0\]](https://huggingface.co/papers/2312.14238) [\[📜 InternVL 1.5\]](https://huggingface.co/papers/2404.16821) [\[📜 InternVL 2.5\]](https://huggingface.co/papers/2412.05271) [\[📜 InternVL2.5-MPO\]](https://huggingface.co/papers/2411.10442) [\[📜 InternVL3\]](https://huggingface.co/papers/2504.10479)
 
 [\[🆕 Blog\]](https://internvl.github.io/blog/) [\[🗨️ Chat Demo\]](https://internvl.opengvlab.com/) [\[🤗 HF Demo\]](https://huggingface.co/spaces/OpenGVLab/InternVL) [\[🚀 Quick Start\]](#quick-start) [\[📖 Documents\]](https://internvl.readthedocs.io/en/latest/)
 
@@ -66,9 +66,9 @@ Notably, in InternVL3, we integrate the [Variable Visual Position Encoding (V2PE
 
 ### Native Multimodal Pre-Training
 
-We propose a [Native Multimodal Pre-Training](TBD) approach that consolidates language and vision learning into a single pre-training stage.
+We propose a [Native Multimodal Pre-Training](https://huggingface.co/papers/2504.10479) approach that consolidates language and vision learning into a single pre-training stage.
 In contrast to standard paradigms that first train a language-only model and subsequently adapt it to handle additional modalities, our method interleaves multimodal data (e.g., image-text, video-text, or image-text interleaved sequences) with large-scale textual corpora. This unified training scheme allows the model to learn both linguistic and multimodal representations simultaneously, ultimately enhancing its capability to handle vision-language tasks without the need for separate alignment or bridging modules.
-Please see [our paper](TBD) for more details.
+Please see [our paper](https://huggingface.co/papers/2504.10479) for more details.
 
 ### Supervised Fine-Tuning
 
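To make the data-mixing idea in the changed paragraph concrete, below is a minimal, hypothetical Python sketch of how image-text and text-only samples might be interleaved into a single pre-training stream. All names (`text_corpus`, `multimodal_corpus`, `multimodal_ratio`) are illustrative assumptions, not part of the InternVL3 codebase or training recipe.

```python
import random
from typing import Dict, Iterator, List

# Hypothetical sketch: interleave multimodal (image-text) samples with
# text-only samples in one unified pre-training stream, instead of a
# separate language-only stage followed by a vision-alignment stage.
# The corpora and the mixing ratio are illustrative assumptions.
def interleaved_stream(
    text_corpus: List[Dict],        # e.g. {"text": "..."}
    multimodal_corpus: List[Dict],  # e.g. {"image": ..., "text": "..."}
    multimodal_ratio: float = 0.3,  # assumed fraction of multimodal samples
    seed: int = 0,
) -> Iterator[Dict]:
    """Yield a single mixed stream of text-only and multimodal samples."""
    rng = random.Random(seed)
    while True:
        if rng.random() < multimodal_ratio:
            yield rng.choice(multimodal_corpus)  # image-text / interleaved sample
        else:
            yield rng.choice(text_corpus)        # large-scale text-only sample

# Usage: every pre-training batch mixes both kinds of data, so linguistic and
# multimodal representations are learned in the same stage.
stream = interleaved_stream(
    text_corpus=[{"text": "example sentence"}],
    multimodal_corpus=[{"image": "img.png", "text": "a caption"}],
)
batch = [next(stream) for _ in range(8)]
```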