czczup committed
Commit d348ad6 · verified · 1 Parent(s): f19aeab

Upload folder using huggingface_hub

Files changed (1)
  1. README.md +19 -4
README.md CHANGED
@@ -1,6 +1,20 @@
 ---
 license: mit
 pipeline_tag: image-text-to-text
+library_name: transformers
+base_model:
+- OpenGVLab/InternViT-6B-448px-V1-2
+- NousResearch/Nous-Hermes-2-Yi-34B
+base_model_relation: merge
+language:
+- multilingual
+tags:
+- internvl
+- vision
+- ocr
+- multi-image
+- video
+- custom_code
 ---
 
 # InternVL-Chat-V1-2-Plus
@@ -27,10 +41,11 @@ InternVL-Chat-V1-2-Plus uses the same model architecture as [InternVL-Chat-V1-2]
 
 - **Training Strategy:**
 
-  - Pretraining Stage
-    - Learnable Component: MLP
-    - Data: Trained on 8192x4800=39.3M samples, including COYO, LAION, CC12M, CC3M, SBU, Wukong, GRIT, Objects365, OpenImages, and OCR data. In this stage, we first load the pre-trained weights of [InternViT-6B-448px-V1-0](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-0) and connect it to Nous-Hermes-2-Yi-34B. After pre-training, the extracted ViT is published as [InternViT-6B-448px-V1-2](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2). Moreover, in order to reduce the number of visual tokens, we use a pixel shuffle to reduce 1024 tokens to 256 tokens.
-  - Supervised Finetuning Stage
+  - Pre-training Stage
+    - Learnable Component: ViT + MLP
+    - Data: Trained on 8192x4800=39.3M samples, including COYO, LAION, CC12M, CC3M, SBU, Wukong, GRIT, Objects365, OpenImages, and OCR data.
+    - Note: In this stage, we first load the pre-trained weights of [InternViT-6B-448px-V1-0](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-0) and connect it to Nous-Hermes-2-Yi-34B. After pre-training, the extracted ViT is published as [InternViT-6B-448px-V1-2](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2). Moreover, in order to reduce the number of visual tokens, we use a pixel shuffle to reduce 1024 tokens to 256 tokens.
+  - Supervised Fine-tuning Stage
     - Learnable Component: ViT + MLP + LLM
     - Data: 12 million SFT samples.
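For context on the pixel-shuffle note in the diff above: the 1024 to 256 token reduction comes from folding each 2x2 block of ViT patch tokens into the channel dimension before the MLP connector that feeds the LLM. Below is a minimal PyTorch sketch of that idea; the hidden sizes (3200 for the ViT, 7168 for the LLM) and the exact MLP layout are illustrative assumptions, not taken from this commit.

```python
import torch
import torch.nn as nn

def pixel_shuffle(x: torch.Tensor, scale_factor: float = 0.5) -> torch.Tensor:
    """Fold neighbouring patch tokens into the channel dim (space-to-depth style).

    With scale_factor=0.5, a 32x32 grid of ViT tokens (1024 tokens) becomes a
    16x16 grid (256 tokens) whose channels are 4x wider.
    """
    n, h, w, c = x.shape
    x = x.view(n, h, int(w * scale_factor), int(c / scale_factor))
    x = x.permute(0, 2, 1, 3).contiguous()
    x = x.view(n, int(h * scale_factor), int(w * scale_factor),
               int(c / (scale_factor ** 2)))
    return x

# A 448x448 image with 14x14 patches gives a 32x32 token grid.
# Hidden sizes below are assumptions chosen for illustration only.
vit_hidden, llm_hidden = 3200, 7168
tokens = torch.randn(1, 32, 32, vit_hidden)   # 1024 visual tokens
tokens = pixel_shuffle(tokens)                # (1, 16, 16, 4 * vit_hidden)
tokens = tokens.flatten(1, 2)                 # (1, 256, 12800): 256 visual tokens

# A hypothetical MLP connector projecting the reduced tokens into the LLM embedding space.
mlp = nn.Sequential(
    nn.LayerNorm(vit_hidden * 4),
    nn.Linear(vit_hidden * 4, llm_hidden),
    nn.GELU(),
    nn.Linear(llm_hidden, llm_hidden),
)
print(mlp(tokens).shape)  # torch.Size([1, 256, 7168])
```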