|
---
license: cc-by-nc-sa-4.0
datasets:
- PengxiangLi/MAT
language:
- en
metrics:
- accuracy
base_model:
- openbmb/MiniCPM-V-2_6
pipeline_tag: visual-question-answering
library_name: transformers
tags:
- minicpm-v
- vision
- ocr
- multi-image
- video
- custom_code
---
|
|
|
# Multi-modal Agent Tuning: Building a VLM-Driven Agent for Efficient Tool Usage
|
|
|
[GitHub](https://github.com/mat-agent/MAT-Agent.git) | [Project](https://mat-agent.github.io/)
|
|
|
## MAT-MiniCPM-V 2.6 |
|
|
|
This model is a fine-tuned version of [MiniCPM-V-2.6](https://huggingface.co/openbmb/MiniCPM-V-2_6) on the MM-Traj dataset. On the GTA and GAIA benchmarks, it improves over the non-fine-tuned baseline by 18.59% and 7.78%, respectively.
|
|
|
## Usage |
|
Our model inherits the inference architecture of [MiniCPM-V-2.6](https://huggingface.co/openbmb/MiniCPM-V-2_6); the examples below are adapted from its original inference code and remain fully compatible with it.
|
|
|
Requirements (tested on Python 3.10): |
|
```
Pillow==10.1.0
torch==2.1.2
torchvision==0.16.2
transformers==4.40.0
sentencepiece==0.1.99
decord
```
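
The pins above can be installed in one step with pip (a minimal sketch; saving the list as a `requirements.txt` file is our suggestion, not a file shipped with this repository):

```
pip install -r requirements.txt
```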
|
|
|
### Basic Inference |
|
```python
# test.py
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

# Load our fine-tuned model (based on the MiniCPM-V-2.6 architecture)
model = AutoModel.from_pretrained('PengxiangLi/MAT', trust_remote_code=True,
                                  attn_implementation='sdpa', torch_dtype=torch.bfloat16)  # maintain original implementation choices
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('PengxiangLi/MAT', trust_remote_code=True)

image = Image.open('xx.jpg').convert('RGB')
question = 'What is in the image?'
msgs = [{'role': 'user', 'content': [image, question]}]

# The chat interface follows MiniCPM's original implementation
response = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer
)
print(response)

# Streaming output (inherited from MiniCPM's implementation)
response_stream = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    stream=True
)

generated_text = ""
for new_text in response_stream:
    generated_text += new_text
    print(new_text, flush=True, end='')
```
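
The `attn_implementation='sdpa'` and `torch.bfloat16` settings above mirror the base model's defaults. As an untested assumption on our part (not something stated for this fine-tune), GPUs without bfloat16 support can usually fall back to half precision:

```python
# Assumption: float16 fallback for GPUs without bfloat16 support (untested)
model = AutoModel.from_pretrained('PengxiangLi/MAT', trust_remote_code=True,
                                  attn_implementation='sdpa', torch_dtype=torch.float16)
```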
|
|
|
### Multi-image Chat |
|
<details> |
|
<summary>Implementation adapted from MiniCPM's original multi-image handling</summary> |
|
|
|
```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('PengxiangLi/MAT', trust_remote_code=True,
                                  attn_implementation='sdpa', torch_dtype=torch.bfloat16)
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('PengxiangLi/MAT', trust_remote_code=True)

# The message format follows MiniCPM's original schema:
# multiple images are passed as a list alongside the question
image1 = Image.open('image1.jpg').convert('RGB')
image2 = Image.open('image2.jpg').convert('RGB')
question = 'Compare the two images...'

msgs = [{'role': 'user', 'content': [image1, image2, question]}]

# Using the original chat interface design
answer = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer
)
print(answer)
```
|
</details> |
|
|
|
### Few-shot Learning |
|
<details> |
|
<summary>Adapted from MiniCPM's few-shot implementation</summary> |
|
|
|
```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

# Maintain original model loading parameters
model = AutoModel.from_pretrained('PengxiangLi/MAT', trust_remote_code=True,
                                  attn_implementation='sdpa', torch_dtype=torch.bfloat16)
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('PengxiangLi/MAT', trust_remote_code=True)

# Following MiniCPM's message structure: alternate user/assistant turns
# provide in-context examples before the test image
question = "production date"
image1 = Image.open('example1.jpg').convert('RGB')
answer1 = "2023.08.04"
image2 = Image.open('example2.jpg').convert('RGB')
answer2 = "2007.04.24"
image_test = Image.open('test.jpg').convert('RGB')

msgs = [
    {'role': 'user', 'content': [image1, question]},
    {'role': 'assistant', 'content': [answer1]},
    {'role': 'user', 'content': [image2, question]},
    {'role': 'assistant', 'content': [answer2]},
    {'role': 'user', 'content': [image_test, question]}
]

# Using the unmodified chat interface from the original implementation
answer = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer
)
print(answer)
```
|
</details> |
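
### Video Chat
<details>
<summary>Hedged sketch following the base model's video inference pattern</summary>

`decord` appears in the requirements because the base model accepts video as a list of sampled frames. The sketch below adapts the frame-sampling pattern from MiniCPM-V-2.6's original video example; we have not separately validated it on this fine-tune, and `MAX_NUM_FRAMES`, `use_image_id`, and `max_slice_nums` follow the base model's documented video settings.

```python
import torch
from PIL import Image
from decord import VideoReader, cpu
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('PengxiangLi/MAT', trust_remote_code=True,
                                  attn_implementation='sdpa', torch_dtype=torch.bfloat16)
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('PengxiangLi/MAT', trust_remote_code=True)

MAX_NUM_FRAMES = 64  # cap on sampled frames, per the base model's example

def encode_video(video_path):
    # Sample roughly one frame per second, then subsample uniformly if over the cap
    vr = VideoReader(video_path, ctx=cpu(0))
    sample_fps = max(1, round(vr.get_avg_fps()))
    frame_idx = list(range(0, len(vr), sample_fps))
    if len(frame_idx) > MAX_NUM_FRAMES:
        step = len(frame_idx) / MAX_NUM_FRAMES
        frame_idx = [frame_idx[int(i * step)] for i in range(MAX_NUM_FRAMES)]
    frames = vr.get_batch(frame_idx).asnumpy()
    return [Image.fromarray(f.astype('uint8')) for f in frames]

frames = encode_video('video_test.mp4')  # hypothetical input path
question = 'Describe the video'
msgs = [{'role': 'user', 'content': frames + [question]}]

# Video-specific settings from the base model's example
answer = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer,
    use_image_id=False,  # per-image IDs are disabled for frame sequences
    max_slice_nums=2     # use fewer image slices to keep the token count manageable
)
print(answer)
```
</details>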
|
|
|
#### Implementation Notes
1. All core inference logic is directly inherited from [MiniCPM-V-2.6](https://huggingface.co/openbmb/MiniCPM-V-2_6).
2. The `chat()` interface remains unchanged from the original implementation.
3. Model loading parameters maintain compatibility with the base architecture.
4. Message formatting follows MiniCPM's original schema.
|
|
## License |
|
|
|
#### Model License |
|
- The code in this repository is licensed under the [Apache-2.0 License](https://github.com/OpenBMB/MiniCPM/blob/main/LICENSE).
- Usage of our fine-tuned MiniCPM-based model weights must strictly adhere to the [MiniCPM Model License](https://github.com/OpenBMB/MiniCPM/blob/main/MiniCPM%20Model%20License.md).
|
|
|
#### Usage Terms |
|
- **Academic Research**: The model weights are freely available for academic use without restriction.
- **Commercial Use**:
  - After completing the official [registration questionnaire](https://modelbest.feishu.cn/share/base/form/shrcnpV5ZT9EJ6xYjh3Kx0J6v8g) and obtaining authorization, the MiniCPM-V 2.6 based weights (including our fine-tuned version) are available for commercial use free of charge.
  - Commercial users must maintain compliance with all terms outlined in the MiniCPM Model License.
|
|
|
#### Inheritance Clause |
|
As a derivative work of MiniCPM, our model inherits and is bound by all original licensing requirements from the base model. Users are responsible for ensuring compliance with both our terms and the upstream MiniCPM license terms. |
|
|
## Citation |
|
|
|
If you find our work helpful, please consider citing our paper and giving this project a like ❤️!
|
|
|
```bib |
|
@article{gao2024multi, |
|
title={Multi-modal Agent Tuning: Building a VLM-Driven Agent for Efficient Tool Usage}, |
|
author={Gao, Zhi and Zhang, Bofei and Li, Pengxiang and Ma, Xiaojian and Yuan, Tao and Fan, Yue and Wu, Yuwei and Jia, Yunde and Zhu, Song-Chun and Li, Qing}, |
|
journal={arXiv preprint arXiv:2412.15606}, |
|
year={2024} |
|
} |
|
``` |