OpenGVLab/InternVL-Chat-V1-2

@@ -1,6 +1,20 @@
 ---
 license: mit
 pipeline_tag: image-text-to-text
 ---
 # InternVL-Chat-V1-2
@@ -31,13 +45,13 @@ For better training reproducibility, we follow the minimalist design and data ef
 - **Training Strategy:**
-  - Pretraining Stage
     - Learnable Component: ViT + MLP
-    - Data: Trained on 8192x4800=39.3M samples, including COYO, LAION, CC12M, CC3M, SBU, Wukong, GRIT, Objects365, OpenImages, and OCR-related datasets.
-    - Note: In this stage, we load the pretrained weights of [InternViT-6B-448px-V1-2](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2). Moreover, in order to reduce the number of visual tokens, we use a pixel shuffle to reduce 1024 tokens to 256 tokens.
-  - Supervised Finetuning Stage
     - Learnable Component: ViT + MLP + LLM
-    - Data: A simplified, fully open-source dataset, containing approximately 1.2 million samples.
 ## Performance
@@ -54,7 +68,6 @@ For better training reproducibility, we follow the minimalist design and data ef
 | LLaVA−NEXT−34B         | 672x672    | 51.1          | 44.7           | 46.5                    | 79.3          | 79.0             | -    | 1631/397 | 81.8                 | 87.7 | 69.5             | 75.9              | 63.8             | 67.1          |
 | InternVL−Chat<br>−V1-2 | 448x448    | 51.6          | 46.2           | 47.7                    | 82.2          | 81.2             | 56.7 | 1687/489 | 83.3                 | 88.0 | 72.5             | 75.6              | 60.0             | 64.0          |
-- Note that we use the [official evaluation server](https://huggingface.co/spaces/whyu/MM-Vet_Evaluator) to test the MMVet scores, with `GPT-4-0613` serving as the judge model. Using different versions of GPT-4 as the judge can result in significant score variations.
 - In most benchmarks, InternVL-Chat-V1-2 achieves better performance than LLaVA-NeXT-34B.
 Here, we have conducted only a simple performance comparison. For more detailed performance information and additional evaluation metrics, please refer to our performance summary table.
@@ -65,15 +78,15 @@ Here, we have conducted only a simple performance comparison. For more detailed
 Inspired by LLaVA-NeXT, we adopted a data-efficient SFT strategy to train InternVL-Chat-V1-2, using approximately 1.2M visual instruction tuning samples in total, all of which are fully open-source. Broadly, we built upon [ShareGPT-4V](https://github.com/InternLM/InternLM-XComposer/blob/main/projects/ShareGPT4V/docs/Data.md#prepare-images) and additionally integrated [LLaVA-ZH](https://huggingface.co/datasets/openbmb/llava_zh), [DVQA](https://github.com/kushalkafle/DVQA_dataset), [ChartQA](https://github.com/vis-nlp/ChartQA), [AI2D](https://allenai.org/data/diagrams), [DocVQA](https://www.docvqa.org/datasets), [GeoQA+](https://github.com/SCNU203/GeoQA-Plus), and [SynthDoG-EN](https://huggingface.co/datasets/naver-clova-ix/synthdog-en). Most of the data remains consistent with LLaVA-NeXT.
-For more details about data preparation, please see [here](https://github.com/OpenGVLab/InternVL/tree/main/internvl_chat#prepare-training-datasets).
-### Training (Supervised Finetuning)
-We provide [slurm scripts](https://github.com/OpenGVLab/InternVL/tree/main/internvl_chat/shell/hermes2_yi34b/internvl_chat_v1_2_hermes2_yi34b_448_finetune.sh) for multi-node multi-GPU training. You can use either 32 or 64 GPUs to train this model. If you use 64 GPUs, training will take approximately 18 hours.
-For more details about training, please see [here](https://github.com/OpenGVLab/InternVL/tree/main/internvl_chat#start-training).
-The hyperparameters used for finetuning are listed in the following table.
 | Hyperparameter         | Trainable Param  | Global Batch Size | Learning rate | Epochs | Max length | Weight decay |
 | ---------------------- | ---------------- | ----------------- | ------------- | ------ | ---------- | ------------ |

 ---
 license: mit
 pipeline_tag: image-text-to-text
+library_name: transformers
+base_model:
+  - OpenGVLab/InternViT-6B-448px-V1-2
+  - NousResearch/Nous-Hermes-2-Yi-34B
+base_model_relation: merge
+language:
+  - multilingual
+tags:
+  - internvl
+  - vision
+  - ocr
+  - multi-image
+  - video
+  - custom_code
 ---
 # InternVL-Chat-V1-2
 - **Training Strategy:**
+  - Pre-training Stage
     - Learnable Component: ViT + MLP
+    - Data: Trained on 8192x4800=39.3M samples, including COYO, LAION, CC12M, CC3M, SBU, Wukong, GRIT, Objects365, OpenImages, and OCR data.
+    - Note: In this stage, we first load the pre-trained weights of [InternViT-6B-448px-V1-0](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-0) and connect it to Nous-Hermes-2-Yi-34B. After pre-training, the extracted ViT is published as [InternViT-6B-448px-V1-2](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2). Moreover, in order to reduce the number of visual tokens, we use a pixel shuffle to reduce 1024 tokens to 256 tokens.
+  - Supervised Fine-tuning Stage
     - Learnable Component: ViT + MLP + LLM
+    - Data: A simplified, fully open-source dataset, containing approximately 1.2 million samples. You can download it from [here](https://huggingface.co/datasets/OpenGVLab/InternVL-Chat-V1-2-SFT-Data).
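
The token reduction described in the pre-training note can be illustrated with a small sketch. This is not the released implementation: it assumes a 14x14 patch size (so a 448x448 image yields a 32x32 grid of 1024 tokens), uses a toy feature width, and picks an arbitrary ordering for the merged channels. The helper name `pixel_shuffle_downsample` is hypothetical.

```python
def pixel_shuffle_downsample(tokens, side, factor=2):
    """Merge each factor x factor block of neighbouring tokens into one token.

    tokens: list of side*side feature vectors (lists), in row-major grid order.
    Returns a list of (side // factor) ** 2 vectors, each factor**2 times wider.
    """
    if len(tokens) != side * side:
        raise ValueError("tokens must fill a side x side grid")
    if side % factor != 0:
        raise ValueError("side must be divisible by factor")
    merged_tokens = []
    for block_row in range(0, side, factor):
        for block_col in range(0, side, factor):
            merged = []
            # Concatenate the features of every token inside this block.
            for i in range(factor):
                for j in range(factor):
                    merged.extend(tokens[(block_row + i) * side + (block_col + j)])
            merged_tokens.append(merged)
    return merged_tokens


PATCH_SIZE = 14            # assumption: not stated in this model card
SIDE = 448 // PATCH_SIZE   # 32 patches per side
CHANNELS = 4               # toy feature width, for illustration only

# Dummy tokens: token k carries the value k in every channel.
vit_tokens = [[float(k)] * CHANNELS for k in range(SIDE * SIDE)]
merged = pixel_shuffle_downsample(vit_tokens, SIDE, factor=2)

print(len(vit_tokens), "->", len(merged))      # 1024 -> 256
print(len(vit_tokens[0]), "->", len(merged[0]))  # 4 -> 16

# Pre-training sample count quoted in the Data bullet.
print(8192 * 4800)  # 39321600, i.e. about 39.3M
```

The token count drops by a factor of four while each remaining token becomes four times wider, so no feature values are discarded before the MLP projector.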
 ## Performance
 | LLaVA−NEXT−34B         | 672x672    | 51.1          | 44.7           | 46.5                    | 79.3          | 79.0             | -    | 1631/397 | 81.8                 | 87.7 | 69.5             | 75.9              | 63.8             | 67.1          |
 | InternVL−Chat<br>−V1-2 | 448x448    | 51.6          | 46.2           | 47.7                    | 82.2          | 81.2             | 56.7 | 1687/489 | 83.3                 | 88.0 | 72.5             | 75.6              | 60.0             | 64.0          |
 - In most benchmarks, InternVL-Chat-V1-2 achieves better performance than LLaVA-NeXT-34B.
 Here, we have conducted only a simple performance comparison. For more detailed performance information and additional evaluation metrics, please refer to our performance summary table.
 Inspired by LLaVA-NeXT, we adopted a data-efficient SFT strategy to train InternVL-Chat-V1-2, using approximately 1.2M visual instruction tuning samples in total, all of which are fully open-source. Broadly, we built upon [ShareGPT-4V](https://github.com/InternLM/InternLM-XComposer/blob/main/projects/ShareGPT4V/docs/Data.md#prepare-images) and additionally integrated [LLaVA-ZH](https://huggingface.co/datasets/openbmb/llava_zh), [DVQA](https://github.com/kushalkafle/DVQA_dataset), [ChartQA](https://github.com/vis-nlp/ChartQA), [AI2D](https://allenai.org/data/diagrams), [DocVQA](https://www.docvqa.org/datasets), [GeoQA+](https://github.com/SCNU203/GeoQA-Plus), and [SynthDoG-EN](https://huggingface.co/datasets/naver-clova-ix/synthdog-en). Most of the data remains consistent with LLaVA-NeXT.
+You can now download these datasets directly from [Hugging Face](https://huggingface.co/datasets/OpenGVLab/InternVL-Chat-V1-2-SFT-Data). For more details about data preparation, please see [here](https://github.com/OpenGVLab/InternVL/tree/main/internvl_chat#prepare-training-datasets).
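
The dataset download can also be scripted. The sketch below assumes the `huggingface_hub` package is installed and uses its `snapshot_download` function. The wrapper name `download_sft_data` and the default target directory are hypothetical, and the directory layout expected by the training scripts is described in the data preparation guide rather than here.

```python
# Dataset repository linked from this model card.
SFT_DATA_REPO = "OpenGVLab/InternVL-Chat-V1-2-SFT-Data"


def download_sft_data(local_dir="data/internvl_chat_v1_2_sft"):
    """Fetch a snapshot of the SFT dataset repository and return its local path.

    local_dir is a hypothetical default; choose any writable location.
    """
    # Imported lazily so that defining the helper needs no extra dependency.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=SFT_DATA_REPO,
        repo_type="dataset",  # the repository is a dataset, not a model
        local_dir=local_dir,
    )
```

Calling `download_sft_data()` starts the transfer. Check the available disk space first, since the repository bundles annotations for roughly 1.2M samples.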
+### Training (Supervised Fine-tuning)
+We provide [slurm scripts](https://github.com/OpenGVLab/InternVL/blob/main/internvl_chat/shell/internvl1.2/hermes2_yi34b/internvl_chat_v1_2_hermes2_yi34b_448_res_finetune.sh) for multi-node multi-GPU training. You can use either 32 or 64 GPUs to train this model. If you use 64 GPUs, training will take approximately 18 hours.
+For more details about training, please see [here](https://internvl.readthedocs.io/en/latest/internvl1.2/reproduce.html).
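
As a rough budgeting aid, the figures above translate into GPU-hours as follows. The 32-GPU estimate assumes ideal linear scaling, which the card does not claim, so treat it as a lower bound on wall-clock time rather than a measured value.

```python
def gpu_hours(num_gpus, wall_clock_hours):
    """Total compute, expressed as GPUs multiplied by hours."""
    return num_gpus * wall_clock_hours


# Stated in the card: 64 GPUs for approximately 18 hours.
budget = gpu_hours(64, 18)

# Assumption: perfect linear scaling when halving the GPU count.
estimated_hours_on_32 = budget / 32

print(budget)                 # 1152
print(estimated_hours_on_32)  # 36.0
```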
+The hyperparameters used for fine-tuning are listed in the following table.
 | Hyperparameter         | Trainable Param  | Global Batch Size | Learning rate | Epochs | Max length | Weight decay |
 | ---------------------- | ---------------- | ----------------- | ------------- | ------ | ---------- | ------------ |