Upload folder using huggingface_hub
README.md
CHANGED
|
@@ -1,6 +1,20 @@
|
|
| 1 |
---
|
| 2 |
license: mit
|
| 3 |
pipeline_tag: image-text-to-text
|
| 4 |
---
|
| 5 |
|
| 6 |
# InternVL-Chat-V1-2
|
|
@@ -31,13 +45,13 @@ For better training reproducibility, we follow the minimalist design and data ef
|
|
| 31 |
|
| 32 |
- **Training Strategy:**
|
| 33 |
|
| 34 |
-
-
|
| 35 |
- Learnable Component: ViT + MLP
|
| 36 |
-
- Data: Trained on 8192x4800=39.3M samples, including COYO, LAION, CC12M, CC3M, SBU, Wukong, GRIT, Objects365, OpenImages, and OCR
|
| 37 |
-
- Note: In this stage, we load the
|
| 38 |
-
- Supervised
|
| 39 |
- Learnable Component: ViT + MLP + LLM
|
| 40 |
-
- Data: A simplified, fully open-source dataset, containing approximately 1.2 million samples.
|
| 41 |
|
| 42 |
## Performance
|
| 43 |
|
|
@@ -54,7 +68,6 @@ For better training reproducibility, we follow the minimalist design and data ef
|
|
| 54 |
| LLaVA−NEXT−34B | 672x672 | 51.1 | 44.7 | 46.5 | 79.3 | 79.0 | - | 1631/397 | 81.8 | 87.7 | 69.5 | 75.9 | 63.8 | 67.1 |
|
| 55 |
| InternVL−Chat<br>−V1-2 | 448x448 | 51.6 | 46.2 | 47.7 | 82.2 | 81.2 | 56.7 | 1687/489 | 83.3 | 88.0 | 72.5 | 75.6 | 60.0 | 64.0 |
|
| 56 |
|
| 57 |
-
- Note that we use the [official evaluation server](https://huggingface.co/spaces/whyu/MM-Vet_Evaluator) to test the MMVet scores, with `GPT-4-0613` serving as the judge model. Using different versions of GPT-4 as the judge can result in significant score variations.
|
| 58 |
- In most benchmarks, InternVL-Chat-V1-2 achieves better performance than LLaVA-NeXT-34B.
|
| 59 |
|
| 60 |
Here, we have conducted only a simple performance comparison. For more detailed performance information and additional evaluation metrics, please refer to our performance summary table.
|
|
@@ -65,15 +78,15 @@ Here, we have conducted only a simple performance comparison. For more detailed
|
|
| 65 |
|
| 66 |
Inspired by LLaVA-NeXT, we adopted a data-efficient SFT strategy to train InternVL-Chat-V1-2, utilizing approximately 1.2M of visual instruction tuning samples in total, all of which are fully open-source. In a macro sense, we build upon [ShareGPT-4V](https://github.com/InternLM/InternLM-XComposer/blob/main/projects/ShareGPT4V/docs/Data.md#prepare-images) and additionally integrate [LLaVA-ZH](https://huggingface.co/datasets/openbmb/llava_zh), [DVQA](https://github.com/kushalkafle/DVQA_dataset), [ChartQA](https://github.com/vis-nlp/ChartQA), [AI2D](https://allenai.org/data/diagrams), [DocVQA](https://www.docvqa.org/datasets), [GeoQA+](https://github.com/SCNU203/GeoQA-Plus), and [SynthDoG-EN](https://huggingface.co/datasets/naver-clova-ix/synthdog-en). Most of the data remains consistent with LLaVA-NeXT.
|
| 67 |
|
| 68 |
-
For more details about data preparation, please see [here](https://github.com/OpenGVLab/InternVL/tree/main/internvl_chat#prepare-training-datasets).
|
| 69 |
|
| 70 |
-
### Training (Supervised
|
| 71 |
|
| 72 |
-
We provide [slurm scripts](https://github.com/OpenGVLab/InternVL/
|
| 73 |
|
| 74 |
-
For more details about training, please see [here](https://
|
| 75 |
|
| 76 |
-
The hyperparameters used for
|
| 77 |
|
| 78 |
| Hyperparameter | Trainable Param | Global Batch Size | Learning rate | Epochs | Max length | Weight decay |
|
| 79 |
| ---------------------- | ---------------- | ----------------- | ------------- | ------ | ---------- | ------------ |
|
|
|
|
| 1 |
---
|
| 2 |
license: mit
|
| 3 |
pipeline_tag: image-text-to-text
|
| 4 |
+
library_name: transformers
|
| 5 |
+
base_model:
|
| 6 |
+
- OpenGVLab/InternViT-6B-448px-V1-2
|
| 7 |
+
- NousResearch/Nous-Hermes-2-Yi-34B
|
| 8 |
+
base_model_relation: merge
|
| 9 |
+
language:
|
| 10 |
+
- multilingual
|
| 11 |
+
tags:
|
| 12 |
+
- internvl
|
| 13 |
+
- vision
|
| 14 |
+
- ocr
|
| 15 |
+
- multi-image
|
| 16 |
+
- video
|
| 17 |
+
- custom_code
|
| 18 |
---
|
| 19 |
|
| 20 |
# InternVL-Chat-V1-2
|
|
|
|
| 45 |
|
| 46 |
- **Training Strategy:**
|
| 47 |
|
| 48 |
+
- Pre-training Stage
|
| 49 |
- Learnable Component: ViT + MLP
|
| 50 |
+
- Data: Trained on 8192x4800=39.3M samples, including COYO, LAION, CC12M, CC3M, SBU, Wukong, GRIT, Objects365, OpenImages, and OCR data.
|
| 51 |
+
- Note: In this stage, we first load the pre-trained weights of [InternViT-6B-448px-V1-0](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-0) and connect it to Nous-Hermes-2-Yi-34B. After pre-training, the extracted ViT is published as [InternViT-6B-448px-V1-2](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2). Moreover, in order to reduce the number of visual tokens, we use a pixel shuffle to reduce 1024 tokens to 256 tokens.
|
| 52 |
+
- Supervised Fine-tuning Stage
|
| 53 |
- Learnable Component: ViT + MLP + LLM
|
| 54 |
+
- Data: A simplified, fully open-source dataset, containing approximately 1.2 million samples. You can download it from [here](https://huggingface.co/datasets/OpenGVLab/InternVL-Chat-V1-2-SFT-Data).
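The token reduction mentioned in the pre-training note can be sketched as follows. This is only an illustration, not the InternVL implementation: it assumes a ViT patch size of 14 (so a 448x448 input gives a 32x32 grid, i.e. 1024 tokens) and a 2x2 space-to-depth merge that concatenates channels. The function name `pixel_shuffle_tokens` and the toy channel width are hypothetical. The last lines also check the sample count quoted in the Data bullet.

```python
def pixel_shuffle_tokens(grid, factor=2):
    """Merge each factor x factor block of tokens into a single token.

    `grid` is a height x width nested list whose entries are feature
    vectors (lists of floats). The channels of the merged tokens are
    concatenated, so the token count drops by factor**2 while the
    channel width grows by factor**2.
    """
    height, width = len(grid), len(grid[0])
    if height % factor or width % factor:
        raise ValueError("grid size must be divisible by factor")
    merged = []
    for top in range(0, height, factor):
        row = []
        for left in range(0, width, factor):
            token = []
            for dy in range(factor):
                for dx in range(factor):
                    token.extend(grid[top + dy][left + dx])
            row.append(token)
        merged.append(row)
    return merged


IMAGE_SIZE = 448   # input resolution from the model card
PATCH_SIZE = 14    # assumption: ViT patch size, not stated in this diff
CHANNELS = 4       # hypothetical toy width, far smaller than a real ViT

side = IMAGE_SIZE // PATCH_SIZE
# Fill each token with its flat index so the merge order is easy to inspect.
grid = [[[float(y * side + x)] * CHANNELS for x in range(side)]
        for y in range(side)]

merged = pixel_shuffle_tokens(grid, factor=2)
tokens_before = side * side
tokens_after = len(merged) * len(merged[0])
channels_after = len(merged[0][0])
print(tokens_before, tokens_after, channels_after)

# Sample count stated in the Data bullet: 8192 x 4800, about 39.3M.
pretrain_samples = 8192 * 4800
print(pretrain_samples, round(pretrain_samples / 1e6, 1))
```

Under these assumptions the grid shrinks from 1024 to 256 tokens while each token becomes four times wider, so the number of tokens passed to the LLM drops without discarding any feature values.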
|
| 55 |
|
| 56 |
## Performance
|
| 57 |
|
|
|
|
| 68 |
| LLaVA−NEXT−34B | 672x672 | 51.1 | 44.7 | 46.5 | 79.3 | 79.0 | - | 1631/397 | 81.8 | 87.7 | 69.5 | 75.9 | 63.8 | 67.1 |
|
| 69 |
| InternVL−Chat<br>−V1-2 | 448x448 | 51.6 | 46.2 | 47.7 | 82.2 | 81.2 | 56.7 | 1687/489 | 83.3 | 88.0 | 72.5 | 75.6 | 60.0 | 64.0 |
|
| 70 |
|
|
|
|
| 71 |
- In most benchmarks, InternVL-Chat-V1-2 achieves better performance than LLaVA-NeXT-34B.
|
| 72 |
|
| 73 |
Here, we have conducted only a simple performance comparison. For more detailed performance information and additional evaluation metrics, please refer to our performance summary table.
|
|
|
|
| 78 |
|
| 79 |
Inspired by LLaVA-NeXT, we adopted a data-efficient SFT strategy to train InternVL-Chat-V1-2, utilizing approximately 1.2M of visual instruction tuning samples in total, all of which are fully open-source. In a macro sense, we build upon [ShareGPT-4V](https://github.com/InternLM/InternLM-XComposer/blob/main/projects/ShareGPT4V/docs/Data.md#prepare-images) and additionally integrate [LLaVA-ZH](https://huggingface.co/datasets/openbmb/llava_zh), [DVQA](https://github.com/kushalkafle/DVQA_dataset), [ChartQA](https://github.com/vis-nlp/ChartQA), [AI2D](https://allenai.org/data/diagrams), [DocVQA](https://www.docvqa.org/datasets), [GeoQA+](https://github.com/SCNU203/GeoQA-Plus), and [SynthDoG-EN](https://huggingface.co/datasets/naver-clova-ix/synthdog-en). Most of the data remains consistent with LLaVA-NeXT.
|
| 80 |
|
| 81 |
+
Now, you can download these datasets directly from [HuggingFace](https://huggingface.co/datasets/OpenGVLab/InternVL-Chat-V1-2-SFT-Data). For more details about data preparation, please see [here](https://github.com/OpenGVLab/InternVL/tree/main/internvl_chat#prepare-training-datasets).
|
| 82 |
|
| 83 |
+
### Training (Supervised Fine-tuning)
|
| 84 |
|
| 85 |
+
We provide [slurm scripts](https://github.com/OpenGVLab/InternVL/blob/main/internvl_chat/shell/internvl1.2/hermes2_yi34b/internvl_chat_v1_2_hermes2_yi34b_448_res_finetune.sh) for multi-node multi-GPU training. You can use either 32 or 64 GPUs to train this model. If you use 64 GPUs, training will take approximately 18 hours.
|
| 86 |
|
| 87 |
+
For more details about training, please see [here](https://internvl.readthedocs.io/en/latest/internvl1.2/reproduce.html).
|
| 88 |
|
| 89 |
+
The hyperparameters used for fine-tuning are listed in the following table.
|
| 90 |
|
| 91 |
| Hyperparameter | Trainable Param | Global Batch Size | Learning rate | Epochs | Max length | Weight decay |
|
| 92 |
| ---------------------- | ---------------- | ----------------- | ------------- | ------ | ---------- | ------------ |
|