czczup committed
Commit d348ad6 · verified · 1 Parent(s): f19aeab

Upload folder using huggingface_hub

Files changed (1)
  1. README.md +19 -4
README.md CHANGED
@@ -1,6 +1,20 @@
 ---
 license: mit
 pipeline_tag: image-text-to-text
+library_name: transformers
+base_model:
+- OpenGVLab/InternViT-6B-448px-V1-2
+- NousResearch/Nous-Hermes-2-Yi-34B
+base_model_relation: merge
+language:
+- multilingual
+tags:
+- internvl
+- vision
+- ocr
+- multi-image
+- video
+- custom_code
 ---
 
 # InternVL-Chat-V1-2-Plus
@@ -27,10 +41,11 @@ InternVL-Chat-V1-2-Plus uses the same model architecture as [InternVL-Chat-V1-2]
 
 - **Training Strategy:**
 
-  - Pretraining Stage
-    - Learnable Component: MLP
-    - Data: Trained on 8192x4800=39.3M samples, including COYO, LAION, CC12M, CC3M, SBU, Wukong, GRIT, Objects365, OpenImages, and OCR data. In this stage, we first load the pre-trained weights of [InternViT-6B-448px-V1-0](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-0) and connect it to Nous-Hermes-2-Yi-34B. After pre-training, the extracted ViT is published as [InternViT-6B-448px-V1-2](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2). Moreover, in order to reduce the number of visual tokens, we use a pixel shuffle to reduce 1024 tokens to 256 tokens.
-  - Supervised Finetuning Stage
+  - Pre-training Stage
+    - Learnable Component: ViT + MLP
+    - Data: Trained on 8192x4800=39.3M samples, including COYO, LAION, CC12M, CC3M, SBU, Wukong, GRIT, Objects365, OpenImages, and OCR data.
+    - Note: In this stage, we first load the pre-trained weights of [InternViT-6B-448px-V1-0](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-0) and connect it to Nous-Hermes-2-Yi-34B. After pre-training, the extracted ViT is published as [InternViT-6B-448px-V1-2](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2). Moreover, in order to reduce the number of visual tokens, we use a pixel shuffle to reduce 1024 tokens to 256 tokens.
+  - Supervised Fine-tuning Stage
     - Learnable Component: ViT + MLP + LLM
     - Data: 12 million SFT samples.
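For context on the pixel-shuffle note in the diff above: the 1024 to 256 token reduction comes from folding each 2x2 block of ViT patch tokens into the channel dimension before the MLP connector that feeds the LLM. Below is a minimal PyTorch sketch of that idea; the hidden sizes (3200 for the ViT, 7168 for the LLM) and the exact MLP layout are illustrative assumptions, not taken from this commit.

```python
import torch
import torch.nn as nn

def pixel_shuffle(x: torch.Tensor, scale_factor: float = 0.5) -> torch.Tensor:
    """Fold neighbouring patch tokens into the channel dim (space-to-depth style).

    With scale_factor=0.5, a 32x32 grid of ViT tokens (1024 tokens) becomes a
    16x16 grid (256 tokens) whose channels are 4x wider.
    """
    n, h, w, c = x.shape
    x = x.view(n, h, int(w * scale_factor), int(c / scale_factor))
    x = x.permute(0, 2, 1, 3).contiguous()
    x = x.view(n, int(h * scale_factor), int(w * scale_factor),
               int(c / (scale_factor ** 2)))
    return x

# A 448x448 image with 14x14 patches gives a 32x32 token grid.
# Hidden sizes below are assumptions chosen for illustration only.
vit_hidden, llm_hidden = 3200, 7168
tokens = torch.randn(1, 32, 32, vit_hidden)   # 1024 visual tokens
tokens = pixel_shuffle(tokens)                # (1, 16, 16, 4 * vit_hidden)
tokens = tokens.flatten(1, 2)                 # (1, 256, 12800): 256 visual tokens

# A hypothetical MLP connector projecting the reduced tokens into the LLM embedding space.
mlp = nn.Sequential(
    nn.LayerNorm(vit_hidden * 4),
    nn.Linear(vit_hidden * 4, llm_hidden),
    nn.GELU(),
    nn.Linear(llm_hidden, llm_hidden),
)
print(mlp(tokens).shape)  # torch.Size([1, 256, 7168])
```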