czczup committed (verified)
Commit b67fbed · 1 Parent(s): 4a0c918

Upload folder using huggingface_hub

Files changed (1):
  1. README.md +18 -9
README.md CHANGED
@@ -1,9 +1,20 @@
 ---
 license: mit
 pipeline_tag: image-text-to-text
+library_name: transformers
 base_model:
-- OpenGVLab/InternViT-6B-448px
+- OpenGVLab/InternViT-6B-448px-V1-0
 - meta-llama/Llama-2-13b-hf
+base_model_relation: merge
+language:
+- multilingual
+tags:
+- internvl
+- vision
+- ocr
+- multi-image
+- video
+- custom_code
 ---

 # InternVL-Chat-V1-1
@@ -15,7 +26,7 @@ base_model:
 ## Introduction

 We released [🤗 InternVL-Chat-V1-1](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-1), featuring a structure similar to LLaVA, including a ViT, an MLP projector, and an LLM.
-As shown in the figure below, we connected our InternViT-6B to LLaMA2-13B through a simple MLP projector. Note that the LLaMA2-13B used here is not the original model but an internal chat version obtained by incrementally pre-training and fine-tuning the LLaMA2-13B base model for Chinese language tasks. Overall, our model has a total of 19 billion parameters.
+As shown in the figure below, we connected our [InternViT-6B](https://huggingface.co/OpenGVLab/InternViT-6B-448px) to LLaMA2-13B through a simple MLP projector. Note that the LLaMA2-13B used here is not the original model but an internal chat version obtained by incrementally pre-training and fine-tuning the LLaMA2-13B base model for Chinese language tasks. Overall, our model has a total of 19 billion parameters.

 <p align="center">
 <img src="https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/HD29tU-g0An9FpQn1yK8X.png" style="width: 100%;">
@@ -49,12 +60,12 @@ This model can also conduct an in-depth analysis of AAAI's official website and

 - **Training Strategy:**

-  - Pretraining Stage
-    - Learnable Component: InternViT-6B + LLaMA2-13B
+  - Pre-training Stage
+    - Learnable Component: ViT + MLP
     - Data: Trained on 72M samples, including COYO, LAION, CC12M, CC3M, SBU, Wukong, GRIT, Objects365, OpenImages, and OCR-related datasets.
-    - Note: In this stage, we load the pretrained weights of the original [InternViT-6B-224px](https://huggingface.co/OpenGVLab/InternViT-6B-224px) and interpolate its position embedding to the size corresponding to 448 x 448 pixels. Moreover, in order to reduce the number of visual tokens, we use a pixel shuffle operation to reduce 1024 tokens to 256 tokens.
-  - Supervised Finetuning Stage
-    - Learnable Component: MLP + LLaMA2-13B
+    - Note: In this stage, we load the pretrained weights of the original [InternViT-6B-224px](https://huggingface.co/OpenGVLab/InternViT-6B-224px) and interpolate its position embedding to the size corresponding to 448 x 448 pixels. Moreover, in order to reduce the number of visual tokens, we use a pixel shuffle (unshuffle) operation to reduce 1024 tokens to 256 tokens.
+  - Supervised Fine-tuning Stage
+    - Learnable Component: MLP + LLM
     - Data: A comprehensive collection of open-source datasets, along with their Chinese translation versions, totaling approximately 6M samples.

 ## Performance
@@ -79,8 +90,6 @@ This model can also conduct an in-depth analysis of AAAI's official website and

 - Note that we use the [official evaluation server](https://huggingface.co/spaces/whyu/MM-Vet_Evaluator) to test the MMVet scores, with `GPT-4-0613` serving as the judge model. Using different versions of GPT-4 as the judge can result in significant score variations.

-Here, we have conducted only a simple performance comparison. For more detailed performance information and additional evaluation metrics, please refer to our performance summary table.
-
 ## Quick Start

 We provide an example code to run InternVL-Chat-V1-1 using `transformers`.
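The training-strategy note in the diff mentions a pixel shuffle (unshuffle) step that merges the ViT's 1024 visual tokens into 256 before the MLP projector. Below is a minimal sketch of that idea; it assumes a 32 x 32 patch grid (448 x 448 input with 14 x 14 patches) and a 3200-dimensional hidden size, and the function name and shapes are illustrative rather than the repository's exact implementation.

```python
import torch

def pixel_unshuffle_tokens(vit_tokens: torch.Tensor, scale: int = 2) -> torch.Tensor:
    """Merge each scale x scale block of visual tokens into a single token.

    vit_tokens: (batch, num_tokens, channels), with num_tokens forming a square grid,
    e.g. 1024 tokens = 32 x 32 patches for a 448 x 448 image with 14 x 14 patches.
    Returns (batch, num_tokens / scale**2, channels * scale**2), e.g. 256 tokens.
    """
    b, n, c = vit_tokens.shape
    h = w = int(n ** 0.5)                       # 32 x 32 token grid
    x = vit_tokens.view(b, h, w, c)
    # Split the spatial axes into (h/scale, scale) and (w/scale, scale) blocks,
    # then fold each scale x scale neighborhood into the channel dimension.
    x = x.view(b, h // scale, scale, w // scale, scale, c)
    x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, (h // scale) * (w // scale), c * scale * scale)
    return x

tokens = torch.randn(1, 1024, 3200)             # hidden size assumed from InternViT-6B
print(pixel_unshuffle_tokens(tokens).shape)     # torch.Size([1, 256, 12800])
```

The merged tokens are four times wider, so the reduction trades sequence length for channel width rather than discarding visual information; the MLP projector then maps these 256 tokens into the LLM's embedding space.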
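The Quick Start section referenced in the diff points to an example based on `transformers`; the full snippet lives in the model card rather than in this hunk. As a hedged sketch of how such a checkpoint is typically loaded: the `chat(...)` helper comes from the repository's custom code (hence the `custom_code` tag added in this commit) and its exact signature should be taken from the model card.

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer, CLIPImageProcessor

path = "OpenGVLab/InternVL-Chat-V1-1"

# trust_remote_code is required because the repository ships custom modeling code.
model = AutoModel.from_pretrained(path, torch_dtype=torch.bfloat16,
                                  trust_remote_code=True).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
image_processor = CLIPImageProcessor.from_pretrained(path)

image = Image.open("./example.jpg").convert("RGB").resize((448, 448))
pixel_values = image_processor(images=image, return_tensors="pt").pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()

# `chat` is the conversational helper exposed by the custom code; call pattern assumed.
generation_config = dict(num_beams=1, max_new_tokens=512, do_sample=False)
response = model.chat(tokenizer, pixel_values, "Describe this image in detail.", generation_config)
print(response)
```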