Update README.md
README.md CHANGED
@@ -9,7 +9,7 @@ metrics:
 - accuracy
 base_model:
 - liuhaotian/llava-v1.6-vicuna-7b
-pipeline_tag:
+pipeline_tag: video-text-to-text
 tags:
 - multimodal
 - fine-grained
@@ -20,7 +20,7 @@ model-index:
   - task:
       type: multimodal
     dataset:
-      name: Inst-IT-Bench-I
+      name: Inst-IT-Bench-I-OE
       type: Open-Ended
     metrics:
     - type: accuracy
@@ -30,7 +30,7 @@ model-index:
   - task:
       type: multimodal
     dataset:
-      name: Inst-IT-Bench-I
+      name: Inst-IT-Bench-I-MC
       type: Multi-Choice
     metrics:
     - type: accuracy
@@ -90,7 +90,7 @@ model-index:
   - task:
       type: multimodal
     dataset:
-      name: Inst-IT-Bench-V
+      name: Inst-IT-Bench-V-OE
       type: Open-Ended
     metrics:
     - type: accuracy
@@ -100,7 +100,7 @@ model-index:
   - task:
       type: multimodal
     dataset:
-      name: Inst-IT-Bench-V
+      name: Inst-IT-Bench-V-MC
       type: Multi-Choice
     metrics:
     - type: accuracy
@@ -158,4 +158,72 @@ model-index:
       name: accuracy
       verified: true
 
----
+---

# LLaVA-Next-Inst-It-Vicuna-7B: A Multimodal Model that Excels at Instance-level Understanding

This model was introduced in the paper [Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning](https://huggingface.co/papers/2412.03565).

[**🌐 Homepage**](https://inst-it.github.io/) | [**Code**](https://github.com/inst-it/inst-it) | [**🤗 Paper**](https://huggingface.co/papers/2412.03565) | [**📖 arXiv**](https://arxiv.org/abs/2412.03565)

## Quick Start

**Install**

Our code is based on LLaVA-NeXT. Before running, please install LLaVA-NeXT to set up the environment:

```shell
pip install git+https://github.com/LLaVA-VL/LLaVA-NeXT.git
```

**Load Model**

```python
from llava.model.builder import load_pretrained_model
from llava.constants import (
    DEFAULT_IM_END_TOKEN,
    DEFAULT_IM_START_TOKEN,
    DEFAULT_IMAGE_TOKEN,
    IGNORE_INDEX,
    IMAGE_TOKEN_INDEX,
)
from llava.mm_utils import (
    KeywordsStoppingCriteria,
    get_model_name_from_path,
    tokenizer_image_token,
)
from llava.conversation import SeparatorStyle, conv_templates

# Spatial-pooling settings the checkpoint expects; passed through to the
# model config at load time.
overwrite_config = {}
overwrite_config["mm_spatial_pool_stride"] = 2
overwrite_config["mm_spatial_pool_mode"] = "bilinear"
overwrite_config["mm_pooling_position"] = "after"
overwrite_config["mm_newline_position"] = "no_token"

model_path = "Inst-IT/LLaVA-Next-Inst-It-Vicuna-7B"
model_name = get_model_name_from_path(model_path)

tokenizer, model, image_processor, max_length = load_pretrained_model(
    model_path=model_path,
    model_base=None,
    model_name=model_name,
    device_map="auto",
    torch_dtype="bfloat16",
    overwrite_config=overwrite_config,
    attn_implementation="sdpa",
)
```

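The `mm_spatial_pool_*` and `mm_newline_position` overrides control how per-frame visual tokens are pooled when the model consumes multi-frame (video) input; the values above appear to match the training configuration, so they are best left as shown.
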
**Image Inference**

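A minimal sketch of single-image inference with the objects loaded above, assuming the standard LLaVA-NeXT generation API; the image URL, the question, and the `vicuna_v1` conversation template are illustrative placeholders, not values from this card.

```python
import requests
import torch
from PIL import Image

from llava.mm_utils import process_images

# Placeholder inputs -- substitute your own image and question.
url = "https://example.com/image.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
question = "Describe this image in detail."

# Preprocess the image; LLaVA-NeXT may return a list of tiles (anyres).
image_tensor = process_images([image], image_processor, model.config)
image_tensor = [t.to(dtype=torch.bfloat16, device=model.device) for t in image_tensor]

# Build the prompt: image token first, then the question.
conv = conv_templates["vicuna_v1"].copy()
conv.append_message(conv.roles[0], DEFAULT_IMAGE_TOKEN + "\n" + question)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

input_ids = (
    tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt")
    .unsqueeze(0)
    .to(model.device)
)

with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=image_tensor,
        image_sizes=[image.size],
        do_sample=False,
        max_new_tokens=512,
    )

print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip())
```
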
**Video Inference**

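A minimal sketch of video inference under the same assumptions, treating a video as a list of uniformly sampled frames; the dummy frames and frame count are placeholders (decode real frames with e.g. decord or opencv), and `modalities=["video"]` follows the LLaVA-NeXT generation API.

```python
import torch
from PIL import Image

# Placeholder frames -- replace with frames sampled from your video.
frames = [Image.new("RGB", (336, 336)) for _ in range(16)]

# Preprocess all frames as one batch; the mm_spatial_pool_* settings from
# load time downsample the per-frame visual tokens.
video_tensor = image_processor.preprocess(frames, return_tensors="pt")["pixel_values"]
video_tensor = video_tensor.to(dtype=torch.bfloat16, device=model.device)

question = "Describe what happens in this video."
conv = conv_templates["vicuna_v1"].copy()
conv.append_message(conv.roles[0], DEFAULT_IMAGE_TOKEN + "\n" + question)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

input_ids = (
    tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt")
    .unsqueeze(0)
    .to(model.device)
)

with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=[video_tensor],  # one stacked tensor of frames
        modalities=["video"],
        do_sample=False,
        max_new_tokens=512,
    )

print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip())
```
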
## Contact
Feel free to contact us if you have any questions or suggestions:

- Email (Wujian Peng): [email protected]
- Email (Lingchen Meng): [email protected]

## Citation
```bibtex
@article{peng2024boosting,
  title={Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning},
  author={Peng, Wujian and Meng, Lingchen and Chen, Yitong and Xie, Yiweng and Liu, Yang and Gui, Tao and Xu, Hang and Qiu, Xipeng and Wu, Zuxuan and Jiang, Yu-Gang},
  journal={arXiv preprint arXiv:2412.03565},
  year={2024}
}
```