Inst-IT/LLaVA-Next-Inst-It-Vicuna-7B

@@ -1,163 +1,163 @@
----
-license: apache-2.0
-datasets:
-- Inst-IT/Inst-IT-Dataset
-- lmms-lab/LLaVA-NeXT-Data
-language:
-- en
-metrics:
-- accuracy
-base_model:
-- liuhaotian/llava-v1.6-vicuna-7b
-pipeline_tag: video-text-to-text
-tags:
-- multimodal
-- fine-grained
-- instance-understanding
-model-index:
-- name: LLaVA-Next-Inst-It-Vicuna-7B
-  results:
-  - task:
-      type: multimodal
-    dataset:
-      name: Inst-IT-Bench-I-OE
-      type: Open-Ended
-    metrics:
-    - type: accuracy
-      value: 68.6
-      name: accuracy
-      verified: true
-  - task:
-      type: multimodal
-    dataset:
-      name: Inst-IT-Bench-I-MC
-      type: Multi-Choice
-    metrics:
-    - type: accuracy
-      value: 63
-      name: accuracy
-      verified: true
-  - task:
-      type: multimodal
-    dataset:
-      name: AI2D
-      type: ai2d
-    metrics:
-    - type: accuracy
-      value: 71
-      name: accuracy
-      verified: true
-  - task:
-      type: multimodal
-    dataset:
-      name: MMMU
-      type: mmmu
-    metrics:
-    - type: accuracy
-      value: 37.4
-      name: accuracy
-      verified: true
-  - task:
-      type: multimodal
-    dataset:
-      name: POPE
-      type: pope
-    metrics:
-    - type: accuracy
-      value: 87.2
-      name: accuracy
-      verified: true
-  - task:
-      type: multimodal
-    dataset:
-      name: GQA
-      type: gqa
-    metrics:
-    - type: accuracy
-      value: 65.9
-      name: accuracy
-      verified: true
-  - task:
-      type: multimodal
-    dataset:
-      name: MM-Vet
-      type: mm-vet
-    metrics:
-    - type: accuracy
-      value: 38.1
-      name: accuracy
-      verified: true
-  - task:
-      type: multimodal
-    dataset:
-      name: Inst-IT-Bench-V-OE
-      type: Open-Ended
-    metrics:
-    - type: accuracy
-      value: 49.3
-      name: accuracy
-      verified: true
-  - task:
-      type: multimodal
-    dataset:
-      name: Inst-IT-Bench-V-MC
-      type: Multi-Choice
-    metrics:
-    - type: accuracy
-      value: 42.1
-      name: accuracy
-      verified: true
-  - task:
-      type: multimodal
-    dataset:
-      name: ActNet-QA
-      type: actnet-qa
-    metrics:
-    - type: accuracy
-      value: 53.7
-      name: accuracy
-      verified: true
-  - task:
-      type: multimodal
-    dataset:
-      name: EgoSchema
-      type: egoschema
-    metrics:
-    - type: accuracy
-      value: 57.8
-      name: accuracy
-      verified: true
-  - task:
-      type: multimodal
-    dataset:
-      name: NextQA
-      type: nextqa
-    metrics:
-    - type: accuracy
-      value: 70.2
-      name: accuracy
-      verified: true
-  - task:
-      type: multimodal
-    dataset:
-      name: VideoMME
-      type: videomme
-    metrics:
-    - type: accuracy
-      value: 44.3
-      name: accuracy
-      verified: true
-  - task:
-      type: multimodal
-    dataset:
-      name: TempoCompass
-      type: tempocompass
-    metrics:
-    - type: accuracy
-      value: 59.8
-      name: accuracy
-      verified: true
----
 # LLaVA-Next-Inst-It-Vicuna-7B
 [**Homepage**](https://inst-it.github.io/) | [**Code**](https://github.com/inst-it/inst-it) | [**Paper**](https://huggingface.co/papers/2412.03565) | [**arXiv**](https://arxiv.org/abs/2412.03565)
@@ -225,7 +225,7 @@ import torch
 import requests
 from PIL import Image
-img_url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
 image = Image.open(requests.get(img_url, stream=True).raw)
 image_tensor = process_images([image], image_processor, model.config).bfloat16()
 image_sizes = [image.size]
@@ -265,9 +265,10 @@ print(pred)
 ```
 </details>
-Our model performs even better when [Set-of-Marks](https://arxiv.org/abs/2310.11441) visual prompts are provided.
 Compared to the previous inference code, the following code is unchanged except for the input image, which is visually prompted with Set-of-Marks.
-You can refer to [this link](https://github.com/microsoft/SoM) to learn how to generate SoMs for an image.
 <details>
 <summary>Inference with SoMs</summary>
@@ -276,12 +277,13 @@ import torch
 import requests
 from PIL import Image
-img_url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
 image = Image.open(requests.get(img_url, stream=True).raw)
 image_tensor = process_images([image], image_processor, model.config).bfloat16()
 image_sizes = [image.size]
-question = "Describe this image."
 question = DEFAULT_IMAGE_TOKEN + "\n" + question
 conv_template = 'vicuna_v1'

@@ -1,163 +1,163 @@
+---
+license: apache-2.0
+datasets:
+- Inst-IT/Inst-IT-Dataset
+- lmms-lab/LLaVA-NeXT-Data
+language:
+- en
+metrics:
+- accuracy
+base_model:
+- liuhaotian/llava-v1.6-vicuna-7b
+pipeline_tag: video-text-to-text
+tags:
+- multimodal
+- fine-grained
+- instance-understanding
+model-index:
+- name: LLaVA-Next-Inst-It-Vicuna-7B
+  results:
+  - task:
+      type: multimodal
+    dataset:
+      name: Inst-IT-Bench-I-OE
+      type: Open-Ended
+    metrics:
+    - type: accuracy
+      value: 68.6
+      name: accuracy
+      verified: true
+  - task:
+      type: multimodal
+    dataset:
+      name: Inst-IT-Bench-I-MC
+      type: Multi-Choice
+    metrics:
+    - type: accuracy
+      value: 63
+      name: accuracy
+      verified: true
+  - task:
+      type: multimodal
+    dataset:
+      name: AI2D
+      type: ai2d
+    metrics:
+    - type: accuracy
+      value: 71
+      name: accuracy
+      verified: true
+  - task:
+      type: multimodal
+    dataset:
+      name: MMMU
+      type: mmmu
+    metrics:
+    - type: accuracy
+      value: 37.4
+      name: accuracy
+      verified: true
+  - task:
+      type: multimodal
+    dataset:
+      name: POPE
+      type: pope
+    metrics:
+    - type: accuracy
+      value: 87.2
+      name: accuracy
+      verified: true
+  - task:
+      type: multimodal
+    dataset:
+      name: GQA
+      type: gqa
+    metrics:
+    - type: accuracy
+      value: 65.9
+      name: accuracy
+      verified: true
+  - task:
+      type: multimodal
+    dataset:
+      name: MM-Vet
+      type: mm-vet
+    metrics:
+    - type: accuracy
+      value: 38.1
+      name: accuracy
+      verified: true
+  - task:
+      type: multimodal
+    dataset:
+      name: Inst-IT-Bench-V-OE
+      type: Open-Ended
+    metrics:
+    - type: accuracy
+      value: 49.3
+      name: accuracy
+      verified: true
+  - task:
+      type: multimodal
+    dataset:
+      name: Inst-IT-Bench-V-MC
+      type: Multi-Choice
+    metrics:
+    - type: accuracy
+      value: 42.1
+      name: accuracy
+      verified: true
+  - task:
+      type: multimodal
+    dataset:
+      name: ActNet-QA
+      type: actnet-qa
+    metrics:
+    - type: accuracy
+      value: 53.7
+      name: accuracy
+      verified: true
+  - task:
+      type: multimodal
+    dataset:
+      name: EgoSchema
+      type: egoschema
+    metrics:
+    - type: accuracy
+      value: 57.8
+      name: accuracy
+      verified: true
+  - task:
+      type: multimodal
+    dataset:
+      name: NextQA
+      type: nextqa
+    metrics:
+    - type: accuracy
+      value: 70.2
+      name: accuracy
+      verified: true
+  - task:
+      type: multimodal
+    dataset:
+      name: VideoMME
+      type: videomme
+    metrics:
+    - type: accuracy
+      value: 44.3
+      name: accuracy
+      verified: true
+  - task:
+      type: multimodal
+    dataset:
+      name: TempoCompass
+      type: tempocompass
+    metrics:
+    - type: accuracy
+      value: 59.8
+      name: accuracy
+      verified: true
+---
 # LLaVA-Next-Inst-It-Vicuna-7B
 [**Homepage**](https://inst-it.github.io/) | [**Code**](https://github.com/inst-it/inst-it) | [**Paper**](https://huggingface.co/papers/2412.03565) | [**arXiv**](https://arxiv.org/abs/2412.03565)
@@ -225,7 +225,7 @@ import torch
 import requests
 from PIL import Image
+img_url = "https://github.com/inst-it/inst-it/blob/main/assets/demo/image.jpg?raw=true"
 image = Image.open(requests.get(img_url, stream=True).raw)
 image_tensor = process_images([image], image_processor, model.config).bfloat16()
 image_sizes = [image.size]
@@ -265,9 +265,10 @@ print(pred)
 ```
 </details>
+Our model achieves more fine-grained understanding when [Set-of-Marks](https://arxiv.org/abs/2310.11441) visual prompts are provided.
+You can refer to the instances you are interested in by their IDs.
 Compared to the previous inference code, the following code is unchanged except for the input image, which is visually prompted with Set-of-Marks.
+Refer to [this link](https://github.com/microsoft/SoM) to learn how to generate SoMs for an image.
 <details>
 <summary>Inference with SoMs</summary>
@@ -276,12 +277,13 @@ import torch
 import requests
 from PIL import Image
+img_url = "https://github.com/inst-it/inst-it/blob/main/assets/demo/image_som.jpg?raw=true"
 image = Image.open(requests.get(img_url, stream=True).raw)
 image_tensor = process_images([image], image_processor, model.config).bfloat16()
 image_sizes = [image.size]
+# You can use [id] to refer to the instances that you are interested in
+question = "Describe [8] in detail."
 question = DEFAULT_IMAGE_TOKEN + "\n" + question
 conv_template = 'vicuna_v1'
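
The `[id]` convention introduced in this change can be wrapped in a small helper when several instances need to be referenced. The following is only a sketch: `build_instance_question` and its `template` parameter are hypothetical names that do not appear in the Inst-IT code, and `DEFAULT_IMAGE_TOKEN` is redefined here on the assumption that it equals LLaVA's `"<image>"` placeholder, so that the snippet runs without the LLaVA package installed.

```python
# Assumption: this mirrors the DEFAULT_IMAGE_TOKEN constant that the
# inference snippet imports from LLaVA; redefined so the sketch is standalone.
DEFAULT_IMAGE_TOKEN = "<image>"


def build_instance_question(instance_ids, template="Describe {refs} in detail."):
    """Build a prompt that refers to Set-of-Marks instances by numeric ID.

    Hypothetical helper for illustration; not part of the Inst-IT codebase.
    """
    if not instance_ids:
        raise ValueError("at least one instance ID is required")
    # Write each instance as [id], matching the marks drawn on the image.
    refs = " and ".join(f"[{int(i)}]" for i in instance_ids)
    question = template.format(refs=refs)
    # Prepend the image placeholder, as the inference snippet does.
    return DEFAULT_IMAGE_TOKEN + "\n" + question


question = build_instance_question([8])
print(question)
```

With a single ID and the default template, this produces the same string as the two `question` lines in the diff; passing a different `template` containing `{refs}` allows questions about several instances at once.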