Update README.md
README.md CHANGED
@@ -9,7 +9,7 @@ metrics:
 - accuracy
 base_model:
 - liuhaotian/llava-v1.6-vicuna-7b
-pipeline_tag:
+pipeline_tag: video-text-to-text
 tags:
 - multimodal
 - fine-grained
@@ -20,7 +20,7 @@ model-index:
   - task:
       type: multimodal
     dataset:
-      name: Inst-IT-Bench-I
+      name: Inst-IT-Bench-I-OE
       type: Open-Ended
     metrics:
     - type: accuracy
@@ -30,7 +30,7 @@ model-index:
   - task:
       type: multimodal
     dataset:
-      name: Inst-IT-Bench-I
+      name: Inst-IT-Bench-I-MC
       type: Multi-Choice
     metrics:
     - type: accuracy
@@ -90,7 +90,7 @@ model-index:
   - task:
       type: multimodal
     dataset:
-      name: Inst-IT-Bench-V
+      name: Inst-IT-Bench-V-OE
       type: Open-Ended
     metrics:
     - type: accuracy
@@ -100,7 +100,7 @@ model-index:
   - task:
       type: multimodal
     dataset:
-      name: Inst-IT-Bench-V
+      name: Inst-IT-Bench-V-MC
       type: Multi-Choice
     metrics:
     - type: accuracy
@@ -158,4 +158,72 @@ model-index:
       name: accuracy
       verified: true
 
----
+---

# LLaVA-Next-Inst-It-Vicuna-7B: A Multimodal Model that Excels at Instance-level Understanding

This model was introduced in the paper [Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning](https://huggingface.co/papers/2412.03565).

[**🌐 Homepage**](https://inst-it.github.io/) | [**Code**](https://github.com/inst-it/inst-it) | [**🤗 Paper**](https://huggingface.co/papers/2412.03565) | [**📖 arXiv**](https://arxiv.org/abs/2412.03565)

## Quick Start

**Install**

Our code is based on LLaVA-NeXT. Before running, please install LLaVA-NeXT to set up the environment:

```shell
pip install git+https://github.com/LLaVA-VL/LLaVA-NeXT.git
```

**Load Model**

```python
from llava.model.builder import load_pretrained_model
from llava.constants import (
    DEFAULT_IM_END_TOKEN,
    DEFAULT_IM_START_TOKEN,
    DEFAULT_IMAGE_TOKEN,
    IGNORE_INDEX,
    IMAGE_TOKEN_INDEX,
)
from llava.mm_utils import (
    KeywordsStoppingCriteria,
    get_model_name_from_path,
    tokenizer_image_token,
)
from llava.conversation import SeparatorStyle, conv_templates

# Spatial-pooling settings the checkpoint expects; passed through to the
# model config at load time.
overwrite_config = {}
overwrite_config["mm_spatial_pool_stride"] = 2
overwrite_config["mm_spatial_pool_mode"] = "bilinear"
overwrite_config["mm_pooling_position"] = "after"
overwrite_config["mm_newline_position"] = "no_token"

model_path = "Inst-IT/LLaVA-Next-Inst-It-Vicuna-7B"
model_name = get_model_name_from_path(model_path)

tokenizer, model, image_processor, max_length = load_pretrained_model(
    model_path=model_path,
    model_base=None,
    model_name=model_name,
    device_map="auto",
    torch_dtype="bfloat16",
    overwrite_config=overwrite_config,
    attn_implementation="sdpa",
)
```

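The `mm_spatial_pool_*` and `mm_newline_position` overrides control how per-frame visual tokens are pooled when the model consumes multi-frame (video) input; the values above appear to match the training configuration, so they are best left as shown.
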
**Image Inference**

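A minimal sketch of single-image inference with the objects loaded above, assuming the standard LLaVA-NeXT generation API; the image URL, the question, and the `vicuna_v1` conversation template are illustrative placeholders, not values from this card.

```python
import requests
import torch
from PIL import Image

from llava.mm_utils import process_images

# Placeholder inputs -- substitute your own image and question.
url = "https://example.com/image.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
question = "Describe this image in detail."

# Preprocess the image; LLaVA-NeXT may return a list of tiles (anyres).
image_tensor = process_images([image], image_processor, model.config)
image_tensor = [t.to(dtype=torch.bfloat16, device=model.device) for t in image_tensor]

# Build the prompt: image token first, then the question.
conv = conv_templates["vicuna_v1"].copy()
conv.append_message(conv.roles[0], DEFAULT_IMAGE_TOKEN + "\n" + question)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

input_ids = (
    tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt")
    .unsqueeze(0)
    .to(model.device)
)

with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=image_tensor,
        image_sizes=[image.size],
        do_sample=False,
        max_new_tokens=512,
    )

print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip())
```
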
**Video Inference**

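A minimal sketch of video inference under the same assumptions, treating a video as a list of uniformly sampled frames; the dummy frames and frame count are placeholders (decode real frames with e.g. decord or opencv), and `modalities=["video"]` follows the LLaVA-NeXT generation API.

```python
import torch
from PIL import Image

# Placeholder frames -- replace with frames sampled from your video.
frames = [Image.new("RGB", (336, 336)) for _ in range(16)]

# Preprocess all frames as one batch; the mm_spatial_pool_* settings from
# load time downsample the per-frame visual tokens.
video_tensor = image_processor.preprocess(frames, return_tensors="pt")["pixel_values"]
video_tensor = video_tensor.to(dtype=torch.bfloat16, device=model.device)

question = "Describe what happens in this video."
conv = conv_templates["vicuna_v1"].copy()
conv.append_message(conv.roles[0], DEFAULT_IMAGE_TOKEN + "\n" + question)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

input_ids = (
    tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt")
    .unsqueeze(0)
    .to(model.device)
)

with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=[video_tensor],  # one stacked tensor of frames
        modalities=["video"],
        do_sample=False,
        max_new_tokens=512,
    )

print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip())
```
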
## Contact
Feel free to contact us if you have any questions or suggestions:

- Email (Wujian Peng): [email protected]
- Email (Lingchen Meng): [email protected]

## Citation
```bibtex
@article{peng2024boosting,
  title={Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning},
  author={Peng, Wujian and Meng, Lingchen and Chen, Yitong and Xie, Yiweng and Liu, Yang and Gui, Tao and Xu, Hang and Qiu, Xipeng and Wu, Zuxuan and Jiang, Yu-Gang},
  journal={arXiv preprint arXiv:2412.03565},
  year={2024}
}
```