|
--- |
|
license: apache-2.0 |
|
datasets: |
|
- Inst-IT/Inst-IT-Dataset |
|
- lmms-lab/LLaVA-NeXT-Data |
|
language: |
|
- en |
|
metrics: |
|
- accuracy |
|
pipeline_tag: video-text-to-text |
|
tags: |
|
- multimodal |
|
- fine-grained |
|
- instance-understanding |
|
model-index: |
|
- name: LLaVA-Next-Inst-It-Qwen2-7B |
|
results: |
|
- task: |
|
type: multimodal |
|
dataset: |
|
name: Inst-IT-Bench-I-OE |
|
type: Open-Ended |
|
metrics: |
|
- type: accuracy |
|
value: 67.9 |
|
name: accuracy |
|
verified: true |
|
- task: |
|
type: multimodal |
|
dataset: |
|
name: Inst-IT-Bench-I-MC |
|
type: Multi-Choice |
|
metrics: |
|
- type: accuracy |
|
value: 75.3 |
|
name: accuracy |
|
verified: true |
|
- task: |
|
type: multimodal |
|
dataset: |
|
name: AI2D |
|
type: ai2d |
|
metrics: |
|
- type: accuracy |
|
value: 78.7 |
|
name: accuracy |
|
verified: true |
|
- task: |
|
type: multimodal |
|
dataset: |
|
name: MMMU |
|
type: mmmu |
|
metrics: |
|
- type: accuracy |
|
value: 42.7 |
|
name: accuracy |
|
verified: true |
|
- task: |
|
type: multimodal |
|
dataset: |
|
name: POPE |
|
type: pope |
|
metrics: |
|
- type: accuracy |
|
value: 87.6 |
|
name: accuracy |
|
verified: true |
|
- task: |
|
type: multimodal |
|
dataset: |
|
name: GQA |
|
type: gqa |
|
metrics: |
|
- type: accuracy |
|
value: 65.5 |
|
name: accuracy |
|
verified: true |
|
- task: |
|
type: multimodal |
|
dataset: |
|
name: MM-Vet |
|
type: mm-vet |
|
metrics: |
|
- type: accuracy |
|
value: 44.7 |
|
name: accuracy |
|
verified: true |
|
- task: |
|
type: multimodal |
|
dataset: |
|
name: Inst-IT-Bench-V-OE |
|
type: Open-Ended |
|
metrics: |
|
- type: accuracy |
|
value: 45.7 |
|
name: accuracy |
|
verified: true |
|
- task: |
|
type: multimodal |
|
dataset: |
|
name: Inst-IT-Bench-V-MC |
|
type: Multi-Choice |
|
metrics: |
|
- type: accuracy |
|
value: 53.3 |
|
name: accuracy |
|
verified: true |
|
- task: |
|
type: multimodal |
|
dataset: |
|
name: ActNet-QA |
|
type: actnet-qa |
|
metrics: |
|
- type: accuracy |
|
value: 55.2 |
|
name: accuracy |
|
verified: true |
|
- task: |
|
type: multimodal |
|
dataset: |
|
name: EgoSchema |
|
type: egoschema |
|
metrics: |
|
- type: accuracy |
|
value: 50.4 |
|
name: accuracy |
|
verified: true |
|
- task: |
|
type: multimodal |
|
dataset: |
|
name: NextQA |
|
type: nextqa |
|
metrics: |
|
- type: accuracy |
|
value: 73.0 |
|
name: accuracy |
|
verified: true |
|
- task: |
|
type: multimodal |
|
dataset: |
|
name: VideoMME |
|
type: videomme |
|
metrics: |
|
- type: accuracy |
|
value: 54.0 |
|
name: accuracy |
|
verified: true |
|
- task: |
|
type: multimodal |
|
dataset: |
|
name: TempoCompass |
|
type: tempocompass |
|
metrics: |
|
- type: accuracy |
|
value: 63.9 |
|
name: accuracy |
|
verified: true |
|
--- |
|
|
|
# LLaVA-Next-Inst-It-Qwen2-7B |
|
[**Homepage**](https://inst-it.github.io/) | [**Code**](https://github.com/inst-it/inst-it) | [**Paper**](https://huggingface.co/papers/2412.03565) | [**arXiv**](https://arxiv.org/abs/2412.03565) |
|
|
|
LLaVA-Next-Inst-It-Qwen2-7B is a multimodal model that excels at instance-level understanding.

It was introduced in the paper [Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning](https://huggingface.co/papers/2412.03565).
|
|
|
* **Architecture**: siglip-so400m-patch14-384 + Qwen2-7B |
|
* **Data**: LLaVA-NeXT-Data / Inst-IT-Dataset |
|
* **Precision**: bfloat16 |
|
|
|
|
|
## Quick Start |
|
**Install** |
|
|
|
Our code is based on LLaVA-NeXT. Before running, please install LLaVA-NeXT to prepare the environment:
|
```shell |
|
pip install git+https://github.com/LLaVA-VL/LLaVA-NeXT.git |
|
``` |
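
You can optionally sanity-check the environment by importing the loader that is used in the next step:

```shell
python -c "from llava.model.builder import load_pretrained_model; print('LLaVA-NeXT is ready')"
```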
|
**Load Model** |
|
```python |
|
from llava.model.builder import load_pretrained_model |
|
from llava.constants import ( |
|
DEFAULT_IM_END_TOKEN, |
|
DEFAULT_IM_START_TOKEN, |
|
DEFAULT_IMAGE_TOKEN, |
|
IGNORE_INDEX, |
|
IMAGE_TOKEN_INDEX, |
|
) |
|
from llava.mm_utils import (

    KeywordsStoppingCriteria,

    get_model_name_from_path,

    process_images,  # used below to preprocess the input images

    tokenizer_image_token,

)
|
from llava.conversation import SeparatorStyle, conv_templates |
|
from llava.eval.model_vqa import preprocess_qwen |
|
|
|
overwrite_config = {} |
|
overwrite_config["mm_spatial_pool_stride"] = 2 |
|
overwrite_config["mm_spatial_pool_mode"] = 'bilinear' |
|
overwrite_config["mm_pooling_position"] = 'after' |
|
overwrite_config["mm_newline_position"] = 'no_token' |
|
|
|
model_path = "Inst-IT/LLaVA-Next-Inst-It-Qwen2-7B" |
|
model_name = get_model_name_from_path(model_path) |
|
|
|
tokenizer, model, image_processor, max_length = load_pretrained_model( |
|
model_path=model_path, |
|
model_base=None, |
|
model_name=model_name, |
|
device_map="auto", |
|
torch_dtype='bfloat16', |
|
overwrite_config=overwrite_config, |
|
attn_implementation='sdpa') |
|
``` |
|
**Image Inference** |
|
|
|
<details> |
|
<summary>Inference without SoMs</summary> |
|
|
|
Our model can perform inference on images without [Set-of-Marks](https://arxiv.org/abs/2310.11441) visual prompts. In this case, it can be used in the same way as its base model, [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT).
|
|
|
```python |
|
import torch |
|
import requests |
|
from PIL import Image |
|
|
|
img_url = "https://github.com/inst-it/inst-it/blob/main/assets/demo/image.jpg?raw=true" |
|
image = Image.open(requests.get(img_url, stream=True).raw) |
|
image_tensor = process_images([image], image_processor, model.config).bfloat16() |
|
image_sizes = [image.size] |
|
|
|
question = "Describe this image." |
|
question = DEFAULT_IMAGE_TOKEN + "\n" + question |
|
|
|
conv_template = 'qwen_1_5' |
|
conv = conv_templates[conv_template].copy() |
|
conv.append_message(conv.roles[0], question) |
|
conv.append_message(conv.roles[1], None) |
|
prompt = conv.get_prompt() |
|
|
|
input_ids = preprocess_qwen([{'from': 'human','value': question},{'from': 'gpt','value': None}], tokenizer, has_image=True).cuda() |
|
|
|
pad_token_ids = tokenizer.pad_token_id if tokenizer.pad_token_id is not None else tokenizer.eos_token_id |
|
attention_masks = input_ids.ne(pad_token_ids).long().cuda() |
|
|
|
stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2 |
|
keywords = [stop_str] |
|
stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids) |
|
|
|
with torch.inference_mode(): |
|
output_ids = model.generate( |
|
inputs=input_ids, |
|
images=image_tensor, |
|
attention_mask=attention_masks, |
|
modalities="image", |
|
image_sizes=image_sizes, |
|
use_cache=True, |
|
stopping_criteria=[stopping_criteria], |
|
max_new_tokens=4096 |
|
) |
|
|
|
pred = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip() |
|
print(pred) |
|
``` |
|
</details> |
|
|
|
<details> |
|
<summary>Inference with SoMs</summary> |
|
|
|
Our model achieves more fine-grained understanding when [Set-of-Marks](https://arxiv.org/abs/2310.11441) visual prompts are provided.

You can refer to the instances you are interested in by their IDs.

Compared to the previous inference code, the following code is unchanged except for the input image, which is annotated with Set-of-Marks visual prompts.

Refer to [SoM](https://github.com/microsoft/SoM) to learn how to generate SoMs for an image; a simplified overlay sketch is also shown right after this collapsible section.
|
|
|
```python |
|
import torch |
|
import requests |
|
from PIL import Image |
|
|
|
img_url = "https://github.com/inst-it/inst-it/blob/main/assets/demo/image_som.jpg?raw=true" |
|
image = Image.open(requests.get(img_url, stream=True).raw) |
|
image_tensor = process_images([image], image_processor, model.config).bfloat16() |
|
image_sizes = [image.size] |
|
|
|
# You can use [id] to refer to the instances that you are interested in |
|
question = "Describe [8] in detail." |
|
question = DEFAULT_IMAGE_TOKEN + "\n" + question |
|
|
|
conv_template = 'qwen_1_5' |
|
conv = conv_templates[conv_template].copy() |
|
conv.append_message(conv.roles[0], question) |
|
conv.append_message(conv.roles[1], None) |
|
prompt = conv.get_prompt() |
|
|
|
input_ids = preprocess_qwen([{'from': 'human','value': question},{'from': 'gpt','value': None}], tokenizer, has_image=True).cuda() |
|
|
|
pad_token_ids = tokenizer.pad_token_id if tokenizer.pad_token_id is not None else tokenizer.eos_token_id |
|
attention_masks = input_ids.ne(pad_token_ids).long().cuda() |
|
|
|
stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2 |
|
keywords = [stop_str] |
|
stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids) |
|
|
|
with torch.inference_mode(): |
|
output_ids = model.generate( |
|
inputs=input_ids, |
|
images=image_tensor, |
|
attention_mask=attention_masks, |
|
modalities="image", |
|
image_sizes=image_sizes, |
|
use_cache=True, |
|
stopping_criteria=[stopping_criteria], |
|
max_new_tokens=4096 |
|
) |
|
|
|
pred = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip() |
|
print(pred) |
|
``` |
|
</details> |
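
The SoM-annotated demo image used above was produced with the official [SoM](https://github.com/microsoft/SoM) pipeline. If you just want to try instance references on your own images, a rough stand-in (not the official pipeline) is to overlay numeric IDs on regions you already know; the boxes below are made-up placeholders:

```python
from PIL import Image, ImageDraw

# Hypothetical instance boxes (x0, y0, x1, y1). The official SoM pipeline derives
# these regions automatically from a segmentation model; here they are hard-coded
# purely for illustration.
boxes = {1: (40, 60, 220, 300), 2: (260, 80, 420, 310)}

image = Image.open("your_image.jpg").convert("RGB")
draw = ImageDraw.Draw(image)
for instance_id, (x0, y0, x1, y1) in boxes.items():
    draw.rectangle((x0, y0, x1, y1), outline="red", width=3)
    draw.text((x0 + 4, y0 + 4), str(instance_id), fill="red")
image.save("your_image_som.jpg")  # then ask, e.g., "Describe [1] in detail."
```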
|
|
|
**Video Inference** |
|
|
|
For videos, we organize the frames into a list. You can use the format `<t>` to refer to a specific timestamp (e.g. `<1>`).
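
The snippets below build this list from eight pre-extracted demo frames hosted on GitHub. If you start from a local video file instead, one way to obtain a comparable list of PIL frames is to sample them uniformly, e.g. with OpenCV (a minimal sketch; the file name and frame count are placeholders):

```python
import cv2
from PIL import Image

def sample_frames(video_path, num_frames=8):
    """Uniformly sample `num_frames` frames from a video as RGB PIL images."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for i in range(num_frames):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(i * total / num_frames))
        ok, frame = cap.read()
        if ok:
            # OpenCV decodes to BGR; convert to RGB before wrapping in PIL
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    cap.release()
    return frames

video_frames = sample_frames("your_video.mp4", num_frames=8)
```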
|
|
|
<details> |
|
<summary>Inference without SoMs</summary> |
|
|
|
Our model can perform inference on videos without [Set-of-Marks](https://arxiv.org/abs/2310.11441) visual prompts. In this case, it can be used in the same way as its base model, [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT).
|
|
|
```python |
|
import torch |
|
import requests |
|
from PIL import Image |
|
|
|
frame_urls = [ |
|
"https://github.com/inst-it/inst-it/blob/main/assets/demo/frame_1.jpg?raw=true", |
|
"https://github.com/inst-it/inst-it/blob/main/assets/demo/frame_2.jpg?raw=true", |
|
"https://github.com/inst-it/inst-it/blob/main/assets/demo/frame_3.jpg?raw=true", |
|
"https://github.com/inst-it/inst-it/blob/main/assets/demo/frame_4.jpg?raw=true", |
|
"https://github.com/inst-it/inst-it/blob/main/assets/demo/frame_5.jpg?raw=true", |
|
"https://github.com/inst-it/inst-it/blob/main/assets/demo/frame_6.jpg?raw=true", |
|
"https://github.com/inst-it/inst-it/blob/main/assets/demo/frame_7.jpg?raw=true", |
|
"https://github.com/inst-it/inst-it/blob/main/assets/demo/frame_8.jpg?raw=true" |
|
] |
|
video = [Image.open(requests.get(frame_url, stream=True).raw) for frame_url in frame_urls] |
|
video = image_processor.preprocess(video, return_tensors="pt")["pixel_values"].cuda() |
|
video = video.bfloat16() |
|
videos = [video] |
|
|
|
question = "Describe the video." # overall video caption |
|
question = "What happens at frame <1>?" # caption a specific moment |
|
question = DEFAULT_IMAGE_TOKEN + "\n" + question |
|
|
|
conv_template = 'qwen_1_5' |
|
conv = conv_templates[conv_template].copy() |
|
conv.append_message(conv.roles[0], question) |
|
conv.append_message(conv.roles[1], None) |
|
prompt = conv.get_prompt() |
|
|
|
input_ids = preprocess_qwen([{'from': 'human','value': question},{'from': 'gpt','value': None}], tokenizer, has_image=True).cuda() |
|
|
|
pad_token_ids = tokenizer.pad_token_id if tokenizer.pad_token_id is not None else tokenizer.eos_token_id |
|
attention_masks = input_ids.ne(pad_token_ids).long().cuda() |
|
|
|
stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2 |
|
keywords = [stop_str] |
|
stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids) |
|
|
|
with torch.inference_mode(): |
|
output_ids = model.generate( |
|
inputs=input_ids, |
|
images=videos, |
|
attention_mask=attention_masks, |
|
modalities="video", |
|
use_cache=True, |
|
stopping_criteria=[stopping_criteria], |
|
max_new_tokens=4096 |
|
) |
|
|
|
pred = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip() |
|
print(pred) |
|
``` |
|
</details> |
|
|
|
<details> |
|
<summary>Inference with SoMs</summary>
|
|
|
Our model achieves more fine-grained understanding when [Set-of-Marks](https://arxiv.org/abs/2310.11441) visual prompts are provided.

You can refer to the instances you are interested in by their IDs.

Compared to the previous inference code, the following code is unchanged except for the input video, whose frames are annotated with Set-of-Marks visual prompts.

Refer to [SAM2](https://github.com/facebookresearch/sam2) and [SoM](https://github.com/microsoft/SoM) to learn how to generate SoMs for a video.
|
|
|
```python |
|
import torch |
|
import requests |
|
from PIL import Image |
|
|
|
frame_urls = [ |
|
"https://github.com/inst-it/inst-it/blob/main/assets/demo/som_frame_1.jpg?raw=true", |
|
"https://github.com/inst-it/inst-it/blob/main/assets/demo/som_frame_2.jpg?raw=true", |
|
"https://github.com/inst-it/inst-it/blob/main/assets/demo/som_frame_3.jpg?raw=true", |
|
"https://github.com/inst-it/inst-it/blob/main/assets/demo/som_frame_4.jpg?raw=true", |
|
"https://github.com/inst-it/inst-it/blob/main/assets/demo/som_frame_5.jpg?raw=true", |
|
"https://github.com/inst-it/inst-it/blob/main/assets/demo/som_frame_6.jpg?raw=true", |
|
"https://github.com/inst-it/inst-it/blob/main/assets/demo/som_frame_7.jpg?raw=true", |
|
"https://github.com/inst-it/inst-it/blob/main/assets/demo/som_frame_8.jpg?raw=true" |
|
] |
|
video = [Image.open(requests.get(frame_url, stream=True).raw) for frame_url in frame_urls] |
|
video = image_processor.preprocess(video, return_tensors="pt")["pixel_values"].cuda() |
|
video = video.bfloat16() |
|
videos = [video] |
|
|
|
# You can use [id] to refer to the instances that you are interested in |
|
question = "Is [3] visible at <1>?" |
|
question = DEFAULT_IMAGE_TOKEN + "\n" + question |
|
|
|
conv_template = 'qwen_1_5' |
|
conv = conv_templates[conv_template].copy() |
|
conv.append_message(conv.roles[0], question) |
|
conv.append_message(conv.roles[1], None) |
|
prompt = conv.get_prompt() |
|
|
|
input_ids = preprocess_qwen([{'from': 'human','value': question},{'from': 'gpt','value': None}], tokenizer, has_image=True).cuda() |
|
|
|
pad_token_ids = tokenizer.pad_token_id if tokenizer.pad_token_id is not None else tokenizer.eos_token_id |
|
attention_masks = input_ids.ne(pad_token_ids).long().cuda() |
|
|
|
stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2 |
|
keywords = [stop_str] |
|
stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids) |
|
|
|
with torch.inference_mode(): |
|
output_ids = model.generate( |
|
inputs=input_ids, |
|
images=videos, |
|
attention_mask=attention_masks, |
|
modalities="video", |
|
use_cache=True, |
|
stopping_criteria=[stopping_criteria], |
|
max_new_tokens=4096 |
|
) |
|
|
|
pred = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip() |
|
print(pred) |
|
``` |
|
</details> |
|
|
|
## Contact |
|
Feel free to contact us if you have any questions or suggestions:
|
- Email (Wujian Peng): [email protected] |
|
- Email (Lingchen Meng): [email protected] |
|
|
|
## Citation |
|
```bibtex |
|
@article{peng2024boosting, |
|
title={Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning}, |
|
author={Peng, Wujian and Meng, Lingchen and Chen, Yitong and Xie, Yiweng and Liu, Yang and Gui, Tao and Xu, Hang and Qiu, Xipeng and Wu, Zuxuan and Jiang, Yu-Gang},
|
journal={arXiv preprint arXiv:2412.03565}, |
|
year={2024} |
|
} |
|
``` |