Update README.md
README.md
```

**Image Inference**

<details>
<summary>Inference without SoMs</summary>

Our model can perform inference on images without [Set-of-Marks](https://arxiv.org/abs/2310.11441) visual prompts; in this case, it can be used in the same way as its base model, [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT).

```python
import torch
import requests
# ... (the rest of this example is unchanged and collapsed in this diff)
```
</details>

<details>
<summary>Inference with SoMs</summary>

Our model performs more fine-grained understanding when [Set-of-Marks](https://arxiv.org/abs/2310.11441) visual prompts are provided.
You can refer to the instances you are interested in by their IDs.
Compared to the previous inference code, the following code is unchanged except that the input image is visually prompted with Set-of-Marks.
Refer to [this link](https://github.com/microsoft/SoM) to learn how to generate SoMs for an image (a rough do-it-yourself approximation is sketched right after this section).

```python
import torch
# ... (the rest of this example is unchanged and collapsed in this diff)
```
</details>

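The SoM repository linked above provides the full mark-generation pipeline. If you only want a quick visual prompt to try the model with, a rough approximation is to draw numbered ID labels on the image yourself. The sketch below is a hypothetical PIL-only illustration, not the official SoM tool (which derives its marks from segmentation masks); the file names and coordinates are placeholders.

```python
# Hypothetical illustration only: overlay numeric ID marks on an image with PIL.
# The official SoM pipeline derives its marks from segmentation masks; here the
# instance centers are supplied by hand and the paths are placeholders.
from PIL import Image, ImageDraw

def draw_id_marks(image, centers, box=14):
    """Return a copy of `image` with a numbered label at each (x, y) center."""
    marked = image.copy()
    draw = ImageDraw.Draw(marked)
    for idx, (x, y) in enumerate(centers, start=1):
        draw.rectangle([x - box, y - box, x + box, y + box], fill="red")
        draw.text((x - box // 2, y - box // 2), str(idx), fill="white")
    return marked

# marked = draw_id_marks(Image.open("image.jpg"), [(120, 80), (300, 220)])
# marked.save("image_som.jpg")  # feed this marked image to the inference code above
```

Whatever IDs end up drawn on the image are the ones you can then reference in the prompt, e.g. `What is [2] doing?`.
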
**Video Inference**

For videos, we organize the sampled frames into a list (one way to build such a list from a local video file is sketched below). You can use the format \<t\> to refer to a specific timestamp (e.g. \<1\>).

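The examples in this section fetch eight pre-extracted demo frames by URL. If you are starting from a local video file instead, you first need to turn it into such a frame list; the following is a minimal sketch using OpenCV, where the file name, frame count, and uniform sampling strategy are assumptions rather than part of the original examples.

```python
# Minimal sketch (assumption): uniformly sample N frames from a local video with
# OpenCV and convert them to PIL images, matching the `video` list used below.
import cv2
from PIL import Image

def sample_frames(video_path, num_frames=8):
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = [int(i * (total - 1) / (num_frames - 1)) for i in range(num_frames)]
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            # OpenCV returns BGR; convert to RGB before wrapping in PIL
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    cap.release()
    return frames

# video = sample_frames("my_video.mp4")  # placeholder path; preprocess as shown below
```
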
<details>
<summary>Inference without SoMs</summary>

Our model can perform inference on videos without [Set-of-Marks](https://arxiv.org/abs/2310.11441) visual prompts; in this case, it can be used in the same way as its base model, [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT).

```python
import torch
import requests
from PIL import Image

# tokenizer, model, and image_processor come from the load_pretrained_model(...) call above
frame_urls = [
    "https://github.com/inst-it/inst-it/blob/main/assets/demo/frame_1.jpg?raw=true",
    "https://github.com/inst-it/inst-it/blob/main/assets/demo/frame_2.jpg?raw=true",
    "https://github.com/inst-it/inst-it/blob/main/assets/demo/frame_3.jpg?raw=true",
    "https://github.com/inst-it/inst-it/blob/main/assets/demo/frame_4.jpg?raw=true",
    "https://github.com/inst-it/inst-it/blob/main/assets/demo/frame_5.jpg?raw=true",
    "https://github.com/inst-it/inst-it/blob/main/assets/demo/frame_6.jpg?raw=true",
    "https://github.com/inst-it/inst-it/blob/main/assets/demo/frame_7.jpg?raw=true",
    "https://github.com/inst-it/inst-it/blob/main/assets/demo/frame_8.jpg?raw=true"
]
video = [Image.open(requests.get(frame_url, stream=True).raw) for frame_url in frame_urls]
video = image_processor.preprocess(video, return_tensors="pt")["pixel_values"].cuda()
video = video.bfloat16()
videos = [video]

# Pick one of the following prompts (the second assignment overrides the first):
question = "Describe the video."  # overall video caption
question = "What happens at frame <1>?"  # caption a specific moment
question = DEFAULT_IMAGE_TOKEN + "\n" + question

conv_template = 'vicuna_v1'
conv = conv_templates[conv_template].copy()
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).cuda()

pad_token_ids = tokenizer.pad_token_id if tokenizer.pad_token_id is not None else tokenizer.eos_token_id
attention_masks = input_ids.ne(pad_token_ids).long().cuda()

stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
keywords = [stop_str]
stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)

with torch.inference_mode():
    output_ids = model.generate(
        inputs=input_ids,
        images=videos,
        attention_mask=attention_masks,
        modalities="video",
        use_cache=True,
        stopping_criteria=[stopping_criteria],
        max_new_tokens=4096
    )

pred = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
print(pred)
```
</details>

<details>
<summary>Inference with SoMs</summary>

Our model performs more fine-grained understanding when [Set-of-Marks](https://arxiv.org/abs/2310.11441) visual prompts are provided.
You can refer to the instances you are interested in by their IDs.
Compared to the previous inference code, the following code is unchanged except that the input video is visually prompted with Set-of-Marks.
Refer to [SAM2](https://github.com/facebookresearch/sam2) and [SoM](https://github.com/microsoft/SoM) to learn how to generate SoMs for a video (a toy illustration of consistently numbered frames follows this section).

```python
import torch
import requests
from PIL import Image

frame_urls = [
    "https://github.com/inst-it/inst-it/blob/main/assets/demo/som_frame_1.jpg?raw=true",
    "https://github.com/inst-it/inst-it/blob/main/assets/demo/som_frame_2.jpg?raw=true",
    "https://github.com/inst-it/inst-it/blob/main/assets/demo/som_frame_3.jpg?raw=true",
    "https://github.com/inst-it/inst-it/blob/main/assets/demo/som_frame_4.jpg?raw=true",
    "https://github.com/inst-it/inst-it/blob/main/assets/demo/som_frame_5.jpg?raw=true",
    "https://github.com/inst-it/inst-it/blob/main/assets/demo/som_frame_6.jpg?raw=true",
    "https://github.com/inst-it/inst-it/blob/main/assets/demo/som_frame_7.jpg?raw=true",
    "https://github.com/inst-it/inst-it/blob/main/assets/demo/som_frame_8.jpg?raw=true"
]
video = [Image.open(requests.get(frame_url, stream=True).raw) for frame_url in frame_urls]
video = image_processor.preprocess(video, return_tensors="pt")["pixel_values"].cuda()
video = video.bfloat16()
videos = [video]

# You can use [id] to refer to the instances that you are interested in
question = "Is [3] visible at <1>?"
question = DEFAULT_IMAGE_TOKEN + "\n" + question

conv_template = 'vicuna_v1'
conv = conv_templates[conv_template].copy()
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).cuda()

pad_token_ids = tokenizer.pad_token_id if tokenizer.pad_token_id is not None else tokenizer.eos_token_id
attention_masks = input_ids.ne(pad_token_ids).long().cuda()

stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
keywords = [stop_str]
stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)

with torch.inference_mode():
    output_ids = model.generate(
        inputs=input_ids,
        images=videos,
        attention_mask=attention_masks,
        modalities="video",
        use_cache=True,
        stopping_criteria=[stopping_criteria],
        max_new_tokens=4096
    )

pred = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
print(pred)
```
</details>

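For video, the important property of the Set-of-Marks prompt is that each instance keeps the same numeric ID in every frame, which is what a tracker such as SAM2 provides. The sketch below only illustrates that end result: it stamps hypothetical, pre-computed per-frame centers with consistent IDs using PIL; it is not the SAM2/SoM pipeline itself.

```python
# Hypothetical illustration only: given per-frame instance centers (e.g. produced
# by any tracker), stamp every frame with consistent numeric IDs so that [1], [2], ...
# denote the same instance throughout the video. Not the official SAM2/SoM pipeline.
from PIL import Image, ImageDraw

def mark_video_frames(frames, tracks, box=14):
    """frames: list of PIL images; tracks: {instance_id: [(x, y) for each frame]}."""
    marked_frames = []
    for t, frame in enumerate(frames):
        marked = frame.copy()
        draw = ImageDraw.Draw(marked)
        for inst_id, centers in tracks.items():
            x, y = centers[t]
            draw.rectangle([x - box, y - box, x + box, y + box], fill="red")
            draw.text((x - box // 2, y - box // 2), str(inst_id), fill="white")
        marked_frames.append(marked)
    return marked_frames

# Placeholder usage: two instances tracked over the eight sampled frames
# video = mark_video_frames(video, {1: [(100, 80)] * 8, 2: [(300, 220)] * 8})
```
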
## Contact
Feel free to contact us if you have any questions or suggestions