Inst-IT/LLaVA-Next-Inst-It-Vicuna-7B

@@ -1,167 +1,166 @@
----
-license: llama2
-datasets:
-- Inst-IT/Inst-IT-Dataset
-- lmms-lab/LLaVA-NeXT-Data
-language:
-- en
-metrics:
-- accuracy
-base_model:
-- liuhaotian/llava-v1.6-vicuna-7b
-pipeline_tag: video-text-to-text
-tags:
-- multimodal
-- fine-grained
-- instance-understanding
-model-index:
-- name: LLaVA-Next-Inst-It-Vicuna-7B
-  results:
-  - task:
-      type: multimodal
-    dataset:
-      name: Inst-IT-Bench-I-OE
-      type: Open-Ended
-    metrics:
-      - type: accuracy
-        value: 68.6
-        name: accuracy
-        verified: true
-  - task:
-      type: multimodal
-    dataset:
-      name: Inst-IT-Bench-I-MC
-      type: Multi-Choice
-    metrics:
-      - type: accuracy
-        value: 63.0
-        name: accuracy
-        verified: true
-  - task:
-      type: multimodal
-    dataset:
-      name: AI2D
-      type: ai2d
-    metrics:
-      - type: accuracy
-        value: 71.0
-        name: accuracy
-        verified: true
-  - task:
-      type: multimodal
-    dataset:
-      name: MMMU
-      type: mmmu
-    metrics:
-      - type: accuracy
-        value: 37.4
-        name: accuracy
-        verified: true
-  - task:
-      type: multimodal
-    dataset:
-      name: POPE
-      type: pope
-    metrics:
-      - type: accuracy
-        value: 87.2
-        name: accuracy
-        verified: true
-  - task:
-      type: multimodal
-    dataset:
-      name: GQA
-      type: gqa
-    metrics:
-      - type: accuracy
-        value: 65.9
-        name: accuracy
-        verified: true
-  - task:
-      type: multimodal
-    dataset:
-      name: MM-Vet
-      type: mm-vet
-    metrics:
-      - type: accuracy
-        value: 38.1
-        name: accuracy
-        verified: true
-  - task:
-      type: multimodal
-    dataset:
-      name: Inst-IT-Bench-V-OE
-      type: Open-Ended
-    metrics:
-      - type: accuracy
-        value: 49.3
-        name: accuracy
-        verified: true
-  - task:
-      type: multimodal
-    dataset:
-      name: Inst-IT-Bench-V-MC
-      type: Multi-Choice
-    metrics:
-      - type: accuracy
-        value: 42.1
-        name: accuracy
-        verified: true
-  - task:
-      type: multimodal
-    dataset:
-      name: ActNet-QA
-      type: actnet-qa
-    metrics:
-      - type: accuracy
-        value: 53.7
-        name: accuracy
-        verified: true
-  - task:
-      type: multimodal
-    dataset:
-      name: EgoSchema
-      type: egoschema
-    metrics:
-      - type: accuracy
-        value: 57.8
-        name: accuracy
-        verified: true
-  - task:
-      type: multimodal
-    dataset:
-      name: NextQA
-      type: nextqa
-    metrics:
-      - type: accuracy
-        value: 70.2
-        name: accuracy
-        verified: true
-  - task:
-      type: multimodal
-    dataset:
-      name: VideoMME
-      type: videomme
-    metrics:
-      - type: accuracy
-        value: 44.3
-        name: accuracy
-        verified: true
-  - task:
-      type: multimodal
-    dataset:
-      name: TempoCompass
-      type: tempocompass
-    metrics:
-      - type: accuracy
-        value: 59.8
-        name: accuracy
-        verified: true
----
 # LLaVA-Next-Inst-It-Vicuna-7B
-[**🌐 Homepage**](https://inst-it.github.io/) | [**Code**](https://github.com/inst-it/inst-it) | [**🤗 Paper**](https://huggingface.co/papers/2412.03565) | [**📖 arXiv**](https://arxiv.org/abs/2412.03565)
 LLaVA-Next-Inst-It-Vicuna-7B is a multimodal model that excels at instance-level understanding.
 It was introduced in the paper [Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning](https://huggingface.co/papers/2412.03565).
@@ -217,11 +216,10 @@ tokenizer, model, image_processor, max_length = load_pretrained_model(
 ```
 **Image Inference**
 <details>
 <summary>Inference without SoMs</summary>
-Our model can perform inference on images without [Set-of-Marks](https://arxiv.org/abs/2310.11441) visual prompts; in this case, it can be used in the same way as its base model, [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT).
 ```python
 import torch
 import requests
@@ -267,14 +265,12 @@ print(pred)
 ```
 </details>
-<details>
-<summary>Inference with SoMs</summary>
 Our model performs even better when [Set-of-Marks](https://arxiv.org/abs/2310.11441) visual prompts are provided.
 The following code is identical to the previous inference code except for the input image, which is visually prompted with Set-of-Marks.
 See the [SoM repository](https://github.com/microsoft/SoM) to learn how to generate SoMs for an image.
 ```python
 import torch
 import requests

+---
+license: apache-2.0
+datasets:
+- Inst-IT/Inst-IT-Dataset
+- lmms-lab/LLaVA-NeXT-Data
+language:
+- en
+metrics:
+- accuracy
+base_model:
+- liuhaotian/llava-v1.6-vicuna-7b
+pipeline_tag: video-text-to-text
+tags:
+- multimodal
+- fine-grained
+- instance-understanding
+model-index:
+- name: LLaVA-Next-Inst-It-Vicuna-7B
+  results:
+  - task:
+      type: multimodal
+    dataset:
+      name: Inst-IT-Bench-I-OE
+      type: Open-Ended
+    metrics:
+    - type: accuracy
+      value: 68.6
+      name: accuracy
+      verified: true
+  - task:
+      type: multimodal
+    dataset:
+      name: Inst-IT-Bench-I-MC
+      type: Multi-Choice
+    metrics:
+    - type: accuracy
+      value: 63
+      name: accuracy
+      verified: true
+  - task:
+      type: multimodal
+    dataset:
+      name: AI2D
+      type: ai2d
+    metrics:
+    - type: accuracy
+      value: 71
+      name: accuracy
+      verified: true
+  - task:
+      type: multimodal
+    dataset:
+      name: MMMU
+      type: mmmu
+    metrics:
+    - type: accuracy
+      value: 37.4
+      name: accuracy
+      verified: true
+  - task:
+      type: multimodal
+    dataset:
+      name: POPE
+      type: pope
+    metrics:
+    - type: accuracy
+      value: 87.2
+      name: accuracy
+      verified: true
+  - task:
+      type: multimodal
+    dataset:
+      name: GQA
+      type: gqa
+    metrics:
+    - type: accuracy
+      value: 65.9
+      name: accuracy
+      verified: true
+  - task:
+      type: multimodal
+    dataset:
+      name: MM-Vet
+      type: mm-vet
+    metrics:
+    - type: accuracy
+      value: 38.1
+      name: accuracy
+      verified: true
+  - task:
+      type: multimodal
+    dataset:
+      name: Inst-IT-Bench-V-OE
+      type: Open-Ended
+    metrics:
+    - type: accuracy
+      value: 49.3
+      name: accuracy
+      verified: true
+  - task:
+      type: multimodal
+    dataset:
+      name: Inst-IT-Bench-V-MC
+      type: Multi-Choice
+    metrics:
+    - type: accuracy
+      value: 42.1
+      name: accuracy
+      verified: true
+  - task:
+      type: multimodal
+    dataset:
+      name: ActNet-QA
+      type: actnet-qa
+    metrics:
+    - type: accuracy
+      value: 53.7
+      name: accuracy
+      verified: true
+  - task:
+      type: multimodal
+    dataset:
+      name: EgoSchema
+      type: egoschema
+    metrics:
+    - type: accuracy
+      value: 57.8
+      name: accuracy
+      verified: true
+  - task:
+      type: multimodal
+    dataset:
+      name: NextQA
+      type: nextqa
+    metrics:
+    - type: accuracy
+      value: 70.2
+      name: accuracy
+      verified: true
+  - task:
+      type: multimodal
+    dataset:
+      name: VideoMME
+      type: videomme
+    metrics:
+    - type: accuracy
+      value: 44.3
+      name: accuracy
+      verified: true
+  - task:
+      type: multimodal
+    dataset:
+      name: TempoCompass
+      type: tempocompass
+    metrics:
+    - type: accuracy
+      value: 59.8
+      name: accuracy
+      verified: true
+---
 # LLaVA-Next-Inst-It-Vicuna-7B
+[**Homepage**](https://inst-it.github.io/) | [**Code**](https://github.com/inst-it/inst-it) | [**Paper**](https://huggingface.co/papers/2412.03565) | [**arXiv**](https://arxiv.org/abs/2412.03565)
 LLaVA-Next-Inst-It-Vicuna-7B is a multimodal model that excels at instance-level understanding.
 It was introduced in the paper [Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning](https://huggingface.co/papers/2412.03565).
 ```
 **Image Inference**
+Our model can perform inference on images without [Set-of-Marks](https://arxiv.org/abs/2310.11441) visual prompts; in this case, it can be used in the same way as its base model, [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT).
 <details>
 <summary>Inference without SoMs</summary>
 ```python
 import torch
 import requests
 ```
 </details>
 Our model performs even better when [Set-of-Marks](https://arxiv.org/abs/2310.11441) visual prompts are provided.
 The following code is identical to the previous inference code except for the input image, which is visually prompted with Set-of-Marks.
 See the [SoM repository](https://github.com/microsoft/SoM) to learn how to generate SoMs for an image.
+<details>
+<summary>Inference with SoMs</summary>
 ```python
 import torch
 import requests