wjpoom committed
Commit c469139 · verified · 1 Parent(s): e4b46ed

Update README.md

Files changed (1)
  1. README.md +74 -6
README.md CHANGED
@@ -9,7 +9,7 @@ metrics:
  - accuracy
  base_model:
  - liuhaotian/llava-v1.6-vicuna-7b
- pipeline_tag: image-text-to-text
  tags:
  - multimodal
  - fine-grained
@@ -20,7 +20,7 @@ model-index:
  - task:
  type: multimodal
  dataset:
- name: Inst-IT-Bench-I
  type: Open-Ended
  metrics:
  - type: accuracy
@@ -30,7 +30,7 @@ model-index:
  - task:
  type: multimodal
  dataset:
- name: Inst-IT-Bench-I
  type: Multi-Choice
  metrics:
  - type: accuracy
@@ -90,7 +90,7 @@ model-index:
  - task:
  type: multimodal
  dataset:
- name: Inst-IT-Bench-V
  type: Open-Ended
  metrics:
  - type: accuracy
@@ -100,7 +100,7 @@ model-index:
  - task:
  type: multimodal
  dataset:
- name: Inst-IT-Bench-V
  type: Multi-Choice
  metrics:
  - type: accuracy
@@ -158,4 +158,72 @@ model-index:
  name: accuracy
  verified: true

- ---
  - accuracy
  base_model:
  - liuhaotian/llava-v1.6-vicuna-7b
+ pipeline_tag: video-text-to-text
  tags:
  - multimodal
  - fine-grained

  - task:
  type: multimodal
  dataset:
+ name: Inst-IT-Bench-I-OE
  type: Open-Ended
  metrics:
  - type: accuracy

  - task:
  type: multimodal
  dataset:
+ name: Inst-IT-Bench-I-MC
  type: Multi-Choice
  metrics:
  - type: accuracy

  - task:
  type: multimodal
  dataset:
+ name: Inst-IT-Bench-V-OE
  type: Open-Ended
  metrics:
  - type: accuracy

  - task:
  type: multimodal
  dataset:
+ name: Inst-IT-Bench-V-MC
  type: Multi-Choice
  metrics:
  - type: accuracy

  name: accuracy
  verified: true
---

# LLaVA-Next-Inst-It-Vicuna-7B: A Multimodal Model that Excels at Instance-level Understanding

Introduced in the paper [Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning](https://huggingface.co/papers/2412.03565).

[**🌐 Homepage**](https://inst-it.github.io/) | [**Code**](https://github.com/inst-it/inst-it) | [**🤗 Paper**](https://huggingface.co/papers/2412.03565) | [**📖 arXiv**](https://arxiv.org/abs/2412.03565)

## Quick Start

**Install**

Our code is based on LLaVA-NeXT. Before running, please install LLaVA-NeXT to prepare the environment:

```shell
pip install git+https://github.com/LLaVA-VL/LLaVA-NeXT.git
```
**Load Model**

```python
from llava.model.builder import load_pretrained_model
from llava.constants import (
    DEFAULT_IM_END_TOKEN,
    DEFAULT_IM_START_TOKEN,
    DEFAULT_IMAGE_TOKEN,
    IGNORE_INDEX,
    IMAGE_TOKEN_INDEX,
)
from llava.mm_utils import (
    KeywordsStoppingCriteria,
    get_model_name_from_path,
    tokenizer_image_token,
)
from llava.conversation import SeparatorStyle, conv_templates

# Spatial-pooling settings required by this checkpoint
overwrite_config = {}
overwrite_config["mm_spatial_pool_stride"] = 2
overwrite_config["mm_spatial_pool_mode"] = 'bilinear'
overwrite_config["mm_pooling_position"] = 'after'
overwrite_config["mm_newline_position"] = 'no_token'

model_path = "Inst-IT/LLaVA-Next-Inst-It-Vicuna-7B"
model_name = get_model_name_from_path(model_path)

tokenizer, model, image_processor, max_length = load_pretrained_model(
    model_path=model_path,
    model_base=None,
    model_name=model_name,
    device_map="auto",
    torch_dtype='bfloat16',
    overwrite_config=overwrite_config,
    attn_implementation='sdpa',
)
```
**Image Inference**
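A minimal sketch of single-image inference, continuing from the **Load Model** snippet above. It assumes the standard LLaVA-NeXT helpers (`process_images`, `tokenizer_image_token`) and the `vicuna_v1` conversation template; the image path and question are placeholders to replace with your own:

```python
import torch
from PIL import Image

from llava.constants import DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX
from llava.conversation import conv_templates
from llava.mm_utils import process_images, tokenizer_image_token

# Placeholder image path -- replace with your own image.
image = Image.open("example.jpg").convert("RGB")
image_tensor = process_images([image], image_processor, model.config)
image_tensor = [t.to(model.device, dtype=torch.bfloat16) for t in image_tensor]

# Build a single-turn prompt with the image token prepended.
question = DEFAULT_IMAGE_TOKEN + "\nDescribe this image in detail."
conv = conv_templates["vicuna_v1"].copy()
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

input_ids = tokenizer_image_token(
    prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt"
).unsqueeze(0).to(model.device)

with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=image_tensor,
        image_sizes=[image.size],
        do_sample=False,
        max_new_tokens=512,
    )
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip())
```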
**Video Inference**
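Video inference follows the same pattern, again continuing from the **Load Model** snippet. This sketch assumes the LLaVA-NeXT video path, where sampled frames are preprocessed into one tensor and passed with `modalities=["video"]`; the frame files are placeholders, and frames should first be sampled from your video with a tool such as ffmpeg or decord:

```python
import torch
from PIL import Image

from llava.constants import DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX
from llava.conversation import conv_templates
from llava.mm_utils import tokenizer_image_token

# Placeholder frame files -- sample these from your video beforehand.
frame_paths = ["frame_0.jpg", "frame_1.jpg", "frame_2.jpg"]
frames = [Image.open(p).convert("RGB") for p in frame_paths]
video_tensor = image_processor.preprocess(frames, return_tensors="pt")[
    "pixel_values"
].to(model.device, dtype=torch.bfloat16)

# One image token stands in for the whole frame sequence.
question = DEFAULT_IMAGE_TOKEN + "\nDescribe this video in detail."
conv = conv_templates["vicuna_v1"].copy()
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

input_ids = tokenizer_image_token(
    prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt"
).unsqueeze(0).to(model.device)

with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=[video_tensor],
        modalities=["video"],
        do_sample=False,
        max_new_tokens=512,
    )
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip())
```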
## Contact

Feel free to contact us if you have any questions or suggestions:
- Email (Wujian Peng): [email protected]
- Email (Lingchen Meng): [email protected]

## Citation

```bibtex
@article{peng2024boosting,
  title={Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning},
  author={Peng, Wujian and Meng, Lingchen and Chen, Yitong and Xie, Yiweng and Liu, Yang and Gui, Tao and Hang, Xu and Qiu, Xipeng and Wu, Zuxuan and Jiang, Yu-Gang},
  journal={arXiv preprint arXiv:2412.03565},
  year={2024}
}
```