---
license: cc-by-nc-sa-4.0
datasets:
- PengxiangLi/MAT
language:
- en
metrics:
- accuracy
base_model:
- openbmb/MiniCPM-V-2_6
pipeline_tag: visual-question-answering
library_name: transformers
tags:
- minicpm-v
- vision
- ocr
- multi-image
- video
- custom_code
---
# Multi-modal Agent Tuning: Building a VLM-Driven Agent for Efficient Tool Usage

[GitHub](https://github.com/mat-agent/MAT-Agent.git) | [Project](https://mat-agent.github.io/)
## MAT-MiniCPM-V 2.6
This model is a fine-tuned version of [MiniCPM-V 2.6](https://huggingface.co/openbmb/MiniCPM-V-2_6) trained on the MM-Traj dataset. On the GTA and GAIA benchmarks, it improves over the non-fine-tuned baseline by 18.59% and 7.78%, respectively.
## Usage
Our model inherits the inference architecture of [MiniCPM-V-2.6](https://huggingface.co/openbmb/MiniCPM-V-2_6); the examples below are adapted from its original inference code and remain fully compatible with it.
Requirements (tested on Python 3.10):
```
Pillow==10.1.0
torch==2.1.2
torchvision==0.16.2
transformers==4.40.0
sentencepiece==0.1.99
decord
```
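The pins above can be installed in one step (a sketch; if you install a GPU build of torch, pick the wheel matching your CUDA version):
```shell
pip install Pillow==10.1.0 torch==2.1.2 torchvision==0.16.2 \
    transformers==4.40.0 sentencepiece==0.1.99 decord
```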
### Basic Inference
```python
# test.py
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer
# Load our fine-tuned model (based on MiniCPM-V-2.6 architecture)
model = AutoModel.from_pretrained('PengxiangLi/MAT', trust_remote_code=True,
attn_implementation='sdpa', torch_dtype=torch.bfloat16) # Maintain original implementation choices
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('PengxiangLi/MAT', trust_remote_code=True)
image = Image.open('xx.jpg').convert('RGB')
question = 'What is in the image?'
msgs = [{'role': 'user', 'content': [image, question]}]
# The chat interface follows MiniCPM's original implementation
response = model.chat(
image=None,
msgs=msgs,
tokenizer=tokenizer
)
print(response)
## Streaming output (inherited from MiniCPM's implementation)
response_stream = model.chat(
image=None,
msgs=msgs,
tokenizer=tokenizer,
sampling=True,
stream=True
)
generated_text = ""
for new_text in response_stream:
generated_text += new_text
print(new_text, flush=True, end='')
```
### Multi-image Chat
<details>
<summary>Implementation adapted from MiniCPM's original multi-image handling</summary>

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer
model = AutoModel.from_pretrained('PengxiangLi/MAT', trust_remote_code=True,
attn_implementation='sdpa', torch_dtype=torch.bfloat16)
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('PengxiangLi/MAT', trust_remote_code=True)
# The message format follows MiniCPM's original schema
image1 = Image.open('image1.jpg').convert('RGB')
image2 = Image.open('image2.jpg').convert('RGB')
question = 'Compare the two images...'
msgs = [{'role': 'user', 'content': [image1, image2, question]}]
# Using the original chat interface design
answer = model.chat(
image=None,
msgs=msgs,
tokenizer=tokenizer
)
print(answer)
```
</details>
### Few-shot Learning
<details>
<summary>Adapted from MiniCPM's few-shot implementation</summary>

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer
# Maintain original model loading parameters
model = AutoModel.from_pretrained('PengxiangLi/MAT', trust_remote_code=True,
attn_implementation='sdpa', torch_dtype=torch.bfloat16)
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('PengxiangLi/MAT', trust_remote_code=True)
# Following MiniCPM's message structure
question = "production date"
image1 = Image.open('example1.jpg').convert('RGB')
answer1 = "2023.08.04"
image2 = Image.open('example2.jpg').convert('RGB')
answer2 = "2007.04.24"
image_test = Image.open('test.jpg').convert('RGB')
msgs = [
{'role': 'user', 'content': [image1, question]},
{'role': 'assistant', 'content': [answer1]},
{'role': 'user', 'content': [image2, question]},
{'role': 'assistant', 'content': [answer2]},
{'role': 'user', 'content': [image_test, question]}
]
# Using the unmodified chat interface from original implementation
answer = model.chat(
image=None,
msgs=msgs,
tokenizer=tokenizer
)
print(answer)
```
</details>
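### Video Chat
The pinned `decord` dependency above is what the base model uses for video inputs. The sketch below is adapted from MiniCPM-V-2.6's documented video usage; since our model keeps the base inference code it should carry over, but treat it as a sketch. The frame-sampling helper and the `use_image_id`/`max_slice_nums` parameters follow the upstream model card, and `video.mp4` is a placeholder path:
```python
import torch
from PIL import Image
from decord import VideoReader, cpu
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('PengxiangLi/MAT', trust_remote_code=True,
    attn_implementation='sdpa', torch_dtype=torch.bfloat16)
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('PengxiangLi/MAT', trust_remote_code=True)

MAX_NUM_FRAMES = 64  # upstream default; reduce if GPU memory is tight

def encode_video(video_path):
    """Sample frames at roughly 1 fps, capped at MAX_NUM_FRAMES, as PIL images."""
    def uniform_sample(seq, n):
        gap = len(seq) / n
        return [seq[int(i * gap + gap / 2)] for i in range(n)]
    vr = VideoReader(video_path, ctx=cpu(0))
    sample_fps = round(vr.get_avg_fps() / 1)  # one frame per second
    frame_idx = list(range(0, len(vr), sample_fps))
    if len(frame_idx) > MAX_NUM_FRAMES:
        frame_idx = uniform_sample(frame_idx, MAX_NUM_FRAMES)
    frames = vr.get_batch(frame_idx).asnumpy()
    return [Image.fromarray(f.astype('uint8')) for f in frames]

frames = encode_video('video.mp4')
question = 'Describe the video.'
msgs = [{'role': 'user', 'content': frames + [question]}]

# Per the upstream card, video inputs disable per-image IDs and cap image slicing
answer = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer,
    use_image_id=False,
    max_slice_nums=2  # use 1 if frames are high-resolution and memory runs out
)
print(answer)
```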
#### Implementation Notes:
1. All core inference logic is directly inherited from [MiniCPM-V-2.6](https://huggingface.co/openbmb/MiniCPM-V-2_6)
2. The `chat()` interface remains unchanged from the original implementation
3. Model loading parameters maintain compatibility with the base architecture
4. Message formatting follows MiniCPM's original schema
## License
#### Model License
- The code in this repository is licensed under the [Apache-2.0 License](https://github.com/OpenBMB/MiniCPM/blob/main/LICENSE).
- Usage of our fine-tuned MiniCPM-based model weights must strictly adhere to the [MiniCPM Model License](https://github.com/OpenBMB/MiniCPM/blob/main/MiniCPM%20Model%20License.md).
#### Usage Terms
- **Academic Research**: The model weights are freely available for academic use without restriction.
- **Commercial Use**:
  - After completing the official [registration questionnaire](https://modelbest.feishu.cn/share/base/form/shrcnpV5ZT9EJ6xYjh3Kx0J6v8g) and obtaining authorization, the MiniCPM-V 2.6 based weights (including our fine-tuned version) are available for commercial use free of charge.
  - Commercial users must maintain compliance with all terms outlined in the MiniCPM Model License.
#### Inheritance Clause
As a derivative work of MiniCPM, our model inherits and is bound by all original licensing requirements from the base model. Users are responsible for ensuring compliance with both our terms and the upstream MiniCPM license terms.
## Citation
If you find our work helpful, please consider citing our paper and liking this project!
```bibtex
@article{gao2024multi,
title={Multi-modal Agent Tuning: Building a VLM-Driven Agent for Efficient Tool Usage},
author={Gao, Zhi and Zhang, Bofei and Li, Pengxiang and Ma, Xiaojian and Yuan, Tao and Fan, Yue and Wu, Yuwei and Jia, Yunde and Zhu, Song-Chun and Li, Qing},
journal={arXiv preprint arXiv:2412.15606},
year={2024}
}
``` |