---
license: apache-2.0
base_model:
- Qwen/Qwen2.5-3B-Instruct
- google/siglip-so400m-patch14-384
tags:
- multimodal
- llava
language:
- en
- zh
pipeline_tag: visual-question-answering
library_name: transformers
---

![logo.jpg](logo.jpg)

<code>Ivy-VL</code> is a lightweight multimodal model with only 3B parameters. 

It accepts both image and text inputs to generate text outputs. 

Thanks to its lightweight design, it can be deployed on edge devices such as AI glasses and smartphones, offering low memory usage and high speed while maintaining strong performance on multimodal tasks. Some well-known small models include [PaliGemma 3B](https://huggingface.co/google/paligemma-3b-mix-448), [Moondream2](https://huggingface.co/vikhyatk/moondream2), [Qwen2-VL-2B](https://huggingface.co/Qwen/Qwen2-VL-2B), [InternVL2-2B](https://huggingface.co/OpenGVLab/InternVL2-2B), and [InternVL2_5-2B](https://huggingface.co/OpenGVLab/InternVL2_5-2B). Ivy-VL outperforms them on multiple benchmarks.

# Model Summary:

*   Developed by: AI Safeguard, CMU, Stanford

*   Model type: Multi-modal model (image + text)

*   Language: English and Chinese

*   License: Apache 2.0

*   Architecture: Based on LLaVA-One-Vision (see the component sketch after this list)

*   LLM: Qwen/Qwen2.5-3B-Instruct

*   Vision Encoder: google/siglip-so400m-patch14-384

*   Notebook demo: [Ivy-VL-demo.ipynb](https://colab.research.google.com/drive/1D5_8sDRcP1HKlWtlqTH7s64xG8OH9NH0?usp=sharing)
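
For reference, the two base components listed above can be inspected on their own with `transformers`. This is only a minimal sketch for examining the building blocks; it does not load the fine-tuned Ivy-VL weights, which are loaded via LLaVA-NeXT in the "How to use" section below.

```python
# Optional sketch: inspect the two base components listed above on their own.
# Note: this does NOT load the Ivy-VL weights themselves; the full checkpoint
# is loaded through the LLaVA-NeXT code in the "How to use" section below.
from transformers import AutoModelForCausalLM, AutoTokenizer, SiglipVisionModel

llm = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-3B-Instruct")
llm_tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B-Instruct")
vision_tower = SiglipVisionModel.from_pretrained("google/siglip-so400m-patch14-384")

print(f"LLM parameters: {sum(p.numel() for p in llm.parameters()) / 1e9:.2f}B")
print(f"Vision encoder parameters: {sum(p.numel() for p in vision_tower.parameters()) / 1e6:.0f}M")
```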

# Evaluation:

![evaluation.jpg](evaluation.jpg)

Most of the reported numbers come from the VLMEvalKit leaderboard or the original papers; our own evaluations were run with VLMEvalKit. Due to differences in environments and in the LLMs used for evaluation, scores may vary slightly.

# How to use:


```python
# pip install git+https://github.com/LLaVA-VL/LLaVA-NeXT.git
from llava.model.builder import load_pretrained_model
from llava.mm_utils import process_images, tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from llava.conversation import conv_templates
from PIL import Image
import requests
import copy
import torch
import warnings

warnings.filterwarnings("ignore")

pretrained = "AI-Safeguard/Ivy-VL-llava"

model_name = "llava_qwen"
device = "cuda"
device_map = "auto"
tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map)  # pass any additional llava_model_args here if needed

model.eval()

# load image from url
url = "https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_v1_5_radar.jpg?raw=true"
image = Image.open(requests.get(url, stream=True).raw)

# or load an image from a local file instead
# image = Image.open("./local_image.jpg")

image_tensor = process_images([image], image_processor, model.config)
image_tensor = [_image.to(dtype=torch.float16, device=device) for _image in image_tensor]

conv_template = "qwen_1_5"  # Make sure you use the correct chat template for your model
question = DEFAULT_IMAGE_TOKEN + "\nWhat is shown in this image?"
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt_question = conv.get_prompt()

input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
image_sizes = [image.size]

cont = model.generate(
    input_ids,
    images=image_tensor,
    image_sizes=image_sizes,
    do_sample=False,
    temperature=0,
    max_new_tokens=4096,
)

text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)

print(text_outputs)
```
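
The conversation template keeps the full chat history, so the same session can continue with a follow-up question. The sketch below reuses the variables defined in the snippet above (`conv`, `image_tensor`, `image_sizes`, `text_outputs`); it illustrates the multi-turn pattern rather than an official API.

```python
# Continue the same conversation: record the model's first answer, then ask a
# follow-up about the same image. Reuses objects created in the snippet above.
conv.messages[-1][-1] = text_outputs[0]  # fill in the assistant's previous turn
conv.append_message(conv.roles[0], "Summarize your previous answer in one sentence.")
conv.append_message(conv.roles[1], None)

followup_ids = tokenizer_image_token(
    conv.get_prompt(), tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt"
).unsqueeze(0).to(device)

followup = model.generate(
    followup_ids,
    images=image_tensor,
    image_sizes=image_sizes,
    do_sample=False,
    max_new_tokens=512,
)
print(tokenizer.batch_decode(followup, skip_special_tokens=True))
```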

# Future Plan:

* We plan to release versions built on LLMs of different sizes.

* We will focus on improving performance on the video modality.

# Contact:
Feel free to contact us if you have any questions or suggestions 📧:
* Email (David Qiu): David[email protected]

# Citation:

If you find our work helpful, please consider citing our model:
```bibtex
@misc{ivy2024ivy-vl,
    title={Ivy-VL: Compact Vision-Language Models Achieving SOTA with Optimal Data},
    url={https://huggingface.co/AI-Safeguard/Ivy-VL-llava},
    author={Ivy Zhang, Jenny N, Theresa Yu and David Qiu},
    month={December},
    year={2024}
}
```