---
license: apache-2.0
datasets:
- VARGPT-family/VARGPT_datasets
language:
- en
metrics:
- accuracy
- f1
pipeline_tag: any-to-any
library_name: transformers
---
# VARGPT: Unified Understanding and Generation in a Visual Autoregressive Multimodal Large Language Model

VARGPT (7B+2B) models understanding and generation as two distinct paradigms within a unified model: **predicting the next token for visual understanding and predicting the next scale for visual generation**.

We provide a simple generation walkthrough below. For more details, refer to the GitHub repository: [VARGPT-v1](https://github.com/VARGPT-family/VARGPT).
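To make the two paradigms concrete, here is a toy sketch (illustrative only, with random stand-in logits; this is not the VARGPT implementation or API): understanding appends one token per step, while next-scale generation predicts an entire token map at each resolution, coarse to fine.

```python
import torch

# Toy next-token loop (understanding): one token is appended per step.
tokens = torch.tensor([[1, 5, 9]])               # hypothetical prompt token ids
for _ in range(4):
    logits = torch.randn(tokens.shape[0], 100)   # stand-in for a model forward pass
    next_token = logits.argmax(dim=-1, keepdim=True)
    tokens = torch.cat([tokens, next_token], dim=1)
print("next-token:", tokens.shape)               # sequence grows by one token per step

# Toy next-scale loop (generation): a whole token map per step, coarse to fine.
scales = [1, 2, 4, 8]                            # hypothetical map side lengths
maps = []
for s in scales:
    logits = torch.randn(s, s, 100)              # stand-in, conditioned on coarser maps
    maps.append(logits.argmax(dim=-1))           # predict the entire s*s map at once
print("next-scale:", [tuple(m.shape) for m in maps])
```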
### Multimodal Understanding
Below is an inference demo for **Multimodal Understanding**. Run the following code:
```python
from PIL import Image
import torch
from transformers import AutoTokenizer
from vargpt_llava.modeling_vargpt_llava import VARGPTLlavaForConditionalGeneration
from vargpt_llava.prepare_vargpt_llava import prepare_vargpt_llava
from vargpt_llava.processing_vargpt_llava import VARGPTLlavaProcessor
from patching_utils.patching import patching

model_id = "VARGPT_LLaVA-v1"
prepare_vargpt_llava(model_id)  # set up the VARGPT-LLaVA checkpoint (see the GitHub repo)

model = VARGPTLlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float32,
    low_cpu_mem_usage=True,
).to(0)  # move the model to GPU 0
patching(model)  # apply the VARGPT-specific runtime patches

tokenizer = AutoTokenizer.from_pretrained(model_id)
processor = VARGPTLlavaProcessor.from_pretrained(model_id)
# Define a chat history and use `apply_chat_template` to get correctly formatted prompt
# Each value in "content" has to be a list of dicts with types ("text", "image")
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Please explain the meme in detail."},
            {"type": "image"},
        ],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
print(prompt)

image_file = "./assets/llava_bench_demo.png"
raw_image = Image.open(image_file)
inputs = processor(images=raw_image, text=prompt, return_tensors='pt').to(0, torch.float32)

output = model.generate(
    **inputs,
    max_new_tokens=2048,
    do_sample=False,
)
print(processor.decode(output[0], skip_special_tokens=True))
```
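The call above decodes the full sequence, prompt included. If you only want the newly generated reply, a standard `transformers` slicing pattern (generic usage, not VARGPT-specific) is:

```python
# Slice off the prompt tokens before decoding; `inputs` and `output`
# are the variables from the demo above.
prompt_len = inputs["input_ids"].shape[1]
reply = processor.decode(output[0][prompt_len:], skip_special_tokens=True)
print(reply)
```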
### Multimodal Generation
Below is an inference demo for **Text-to-Image Generation**. Run the following code:
```python
import torch
from transformers import AutoTokenizer
from vargpt_llava.modeling_vargpt_llava import VARGPTLlavaForConditionalGeneration
from vargpt_llava.prepare_vargpt_llava import prepare_vargpt_llava
from vargpt_llava.processing_vargpt_llava import VARGPTLlavaProcessor
from patching_utils.patching import patching

model_id = "VARGPT_LLaVA-v1"
prepare_vargpt_llava(model_id)  # set up the VARGPT-LLaVA checkpoint (see the GitHub repo)

model = VARGPTLlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float32,
    low_cpu_mem_usage=True,
).to(0)  # move the model to GPU 0
patching(model)  # apply the VARGPT-specific runtime patches

tokenizer = AutoTokenizer.from_pretrained(model_id)
processor = VARGPTLlavaProcessor.from_pretrained(model_id)
# some instruction examples:
# Please design a drawing of a butterfly on a flower.
# Please create a painting of a black weasel is standing in the grass.
# Can you generate a rendered photo of a rabbit sitting in the grass.
# I need a designed photo of a lighthouse is seen in the distance.
# Please create a rendered drawing of an old photo of an aircraft carrier in the water.
# Please produce a designed photo of a squirrel is standing in the snow.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Please design a drawing of a butterfly on a flower."},
        ],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
print(prompt)

inputs = processor(text=prompt, return_tensors='pt').to(0, torch.float32)
model._IMAGE_GEN_PATH = "output.png"  # path where the generated image is saved

output = model.generate(
    **inputs,
    max_new_tokens=2048,
    do_sample=False,
)
print(processor.decode(output[0], skip_special_tokens=True))
```
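After generation finishes, the image is written to the path set on `model._IMAGE_GEN_PATH` above. A minimal way to inspect the result (plain Pillow, nothing VARGPT-specific):

```python
from PIL import Image

img = Image.open("output.png")  # path set via model._IMAGE_GEN_PATH above
print(img.size)
img.show()  # opens the image in the default system viewer
```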