Update README.md
## Quick Start
**Install**

Our code is based on LLaVA-NeXT. Before running, please install LLaVA-NeXT to prepare the environment:
```shell
pip install git+https://github.com/LLaVA-VL/LLaVA-NeXT.git
```

Then import the required utilities and load the pretrained model:

```python
from llava.model.builder import load_pretrained_model
from llava.constants import DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX
from llava.mm_utils import (
    KeywordsStoppingCriteria,
    get_model_name_from_path,
    tokenizer_image_token,
    process_images
)
from llava.conversation import SeparatorStyle, conv_templates

overwrite_config = {}
overwrite_config["mm_spatial_pool_stride"] = 2
overwrite_config["mm_spatial_pool_mode"] = 'bilinear'

tokenizer, model, image_processor, max_length = load_pretrained_model(
    ...  # checkpoint path and remaining arguments elided
)
```
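
The arguments elided above (`...`) include the checkpoint path. As a minimal sketch, assuming the LLaVA-NeXT builder signature and a placeholder checkpoint id (replace it with the Inst-IT checkpoint you actually use), the call might look like:

```python
pretrained = "Inst-IT/LLaVA-Next-Inst-It-Vicuna-7B"  # placeholder checkpoint id
model_name = get_model_name_from_path(pretrained)

tokenizer, model, image_processor, max_length = load_pretrained_model(
    pretrained,
    None,                               # no separate base model
    model_name,
    torch_dtype="bfloat16",
    device_map="auto",
    overwrite_config=overwrite_config,  # apply the spatial-pooling settings above
)
model.eval()
```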

**Image Inference**

<details>
<summary>Inference without SoMs</summary>

Our model can perform inference on images without [Set-of-Marks](https://arxiv.org/abs/2310.11441) visual prompts; in this case, it can be used in the same way as its base model, [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT).

```python
import torch
import requests
from PIL import Image

# load and preprocess the input image
img_url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
image = Image.open(requests.get(img_url, stream=True).raw)
image_tensor = process_images([image], image_processor, model.config).bfloat16()
image_sizes = [image.size]

# build the conversation prompt, prepending the image token
question = "Describe this image."
question = DEFAULT_IMAGE_TOKEN + "\n" + question

conv_template = 'vicuna_v1'
conv = conv_templates[conv_template].copy()
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).cuda()

pad_token_ids = tokenizer.pad_token_id if tokenizer.pad_token_id is not None else tokenizer.eos_token_id
attention_masks = input_ids.ne(pad_token_ids).long().cuda()

# stop generation at the conversation separator
stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
keywords = [stop_str]
stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)

with torch.inference_mode():
    output_ids = model.generate(
        inputs=input_ids,
        images=image_tensor,
        attention_mask=attention_masks,
        modalities="image",
        image_sizes=image_sizes,
        use_cache=True,
        stopping_criteria=[stopping_criteria],
        max_new_tokens=4096
    )

pred = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
print(pred)
```
</details>

<details>
<summary>Inference with SoMs</summary>

Our model performs even better when [Set-of-Marks](https://arxiv.org/abs/2310.11441) visual prompts are provided.
The code below is unchanged from the previous example except for the input image, which is visually prompted with Set-of-Marks.
You can refer to [this link](https://github.com/microsoft/SoM) to learn how to generate SoMs for an image; a sketch of an instance-specific question follows this example.

```python
import torch
import requests
from PIL import Image

# load and preprocess the input image; it should already have Set-of-Marks overlaid
img_url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
image = Image.open(requests.get(img_url, stream=True).raw)
image_tensor = process_images([image], image_processor, model.config).bfloat16()
image_sizes = [image.size]

# build the conversation prompt, prepending the image token
question = "Describe this image."
question = DEFAULT_IMAGE_TOKEN + "\n" + question

conv_template = 'vicuna_v1'
conv = conv_templates[conv_template].copy()
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).cuda()

pad_token_ids = tokenizer.pad_token_id if tokenizer.pad_token_id is not None else tokenizer.eos_token_id
attention_masks = input_ids.ne(pad_token_ids).long().cuda()

# stop generation at the conversation separator
stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
keywords = [stop_str]
stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)

with torch.inference_mode():
    output_ids = model.generate(
        inputs=input_ids,
        images=image_tensor,
        attention_mask=attention_masks,
        modalities="image",
        image_sizes=image_sizes,
        use_cache=True,
        stopping_criteria=[stopping_criteria],
        max_new_tokens=4096
    )

pred = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
print(pred)
```
</details>
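
With a SoM-annotated image, you can also ask about individual marked instances. The bracketed IDs in the sketch below are an assumed way of referring to the numeric marks, not a format prescribed by this README; only the question string changes relative to the code above:

```python
# Hypothetical instance-level question: "[1]" and "[2]" assume the image carries
# numeric Set-of-Marks labels; adjust the IDs to match your annotated image.
question = "What is the relationship between [1] and [2] in this image?"
question = DEFAULT_IMAGE_TOKEN + "\n" + question
# rebuild the prompt and call model.generate() exactly as in the example above
```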

**Video Inference**

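The video example itself is not shown in this excerpt. As a rough sketch only, assuming a frame-sampling helper, a placeholder `demo.mp4` path, and the `image_processor.preprocess` / `modalities=["video"]` usage from the base LLaVA-NeXT video recipe (none of which are confirmed by this README), video inference might look like:

```python
# Sketch following the base LLaVA-NeXT video pattern; Inst-IT's own example may differ.
import numpy as np
from decord import VideoReader, cpu  # assumed dependency for reading video frames

def load_video_frames(video_path, num_frames=16):
    """Uniformly sample num_frames frames as a (T, H, W, C) uint8 array."""
    vr = VideoReader(video_path, ctx=cpu(0))
    frame_idx = np.linspace(0, len(vr) - 1, num_frames, dtype=int).tolist()
    return vr.get_batch(frame_idx).asnumpy()

video_frames = load_video_frames("demo.mp4")  # placeholder path
video_tensor = image_processor.preprocess(video_frames, return_tensors="pt")["pixel_values"].cuda().bfloat16()

question = DEFAULT_IMAGE_TOKEN + "\n" + "Describe this video."
conv = conv_templates['vicuna_v1'].copy()
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
input_ids = tokenizer_image_token(conv.get_prompt(), tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).cuda()

with torch.inference_mode():
    output_ids = model.generate(
        inputs=input_ids,
        images=[video_tensor],    # one stacked-frame tensor per video
        modalities=["video"],     # list form, as in base LLaVA-NeXT video examples
        use_cache=True,
        max_new_tokens=4096,
    )
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip())
```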