PengxiangLi committed (verified)
Commit 89ee2a5 · 1 Parent(s): 1e942e2

Update README.md

Files changed (1)
  1. README.md +203 -1

README.md CHANGED
@@ -9,4 +9,206 @@ metrics:
  base_model:
  - openbmb/MiniCPM-V-2_6
  pipeline_tag: visual-question-answering
- ---
+ library_name: transformers
+ tags:
+ - minicpm-v
+ - vision
+ - ocr
+ - multi-image
+ - video
+ - custom_code
+ ---
+
+ <h1>Multi-modal Agent Tuning: Building a VLM-Driven Agent for Efficient Tool Usage</h1>
+
+ [GitHub](https://github.com/mat-agent/MAT-Agent.git) | [Project](https://mat-agent.github.io/)
+
+ ## MAT-MiniCPM-V 2.6
+
+ This model is a fine-tuned version of [MiniCPM-V 2.6](https://huggingface.co/openbmb/MiniCPM-V-2_6) trained on the MM-traj dataset. It achieves improvements of **18.59%** and **7.78%** on the GTA and GAIA benchmarks, respectively, over the non-fine-tuned baseline.
+
+ ## Usage
+ Our model inherits its inference architecture from [MiniCPM-V-2.6](https://huggingface.co/openbmb/MiniCPM-V-2_6). The following examples are adapted from the original inference code and remain fully compatible with it.
+
+ Requirements (tested on Python 3.10):
+ ```
+ Pillow==10.1.0
+ torch==2.1.2
+ torchvision==0.16.2
+ transformers==4.40.0
+ sentencepiece==0.1.99
+ decord
+ ```
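+
+ These pins can be installed in one step, e.g. (a minimal sketch; you may need a `torch` build matching your CUDA version):
+ ```
+ pip install Pillow==10.1.0 torch==2.1.2 torchvision==0.16.2 transformers==4.40.0 sentencepiece==0.1.99 decord
+ ```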
+
+ ### Basic Inference
+ ```python
+ # test.py
+ import torch
+ from PIL import Image
+ from transformers import AutoModel, AutoTokenizer
+
+ # Load our fine-tuned model (based on the MiniCPM-V-2.6 architecture)
+ model = AutoModel.from_pretrained('PengxiangLi/MAT', trust_remote_code=True,
+     attn_implementation='sdpa', torch_dtype=torch.bfloat16)  # maintain original implementation choices
+ model = model.eval().cuda()
+ tokenizer = AutoTokenizer.from_pretrained('PengxiangLi/MAT', trust_remote_code=True)
+
+ image = Image.open('xx.jpg').convert('RGB')
+ question = 'What is in the image?'
+ msgs = [{'role': 'user', 'content': [image, question]}]
+
+ # The chat interface follows MiniCPM's original implementation
+ response = model.chat(
+     image=None,
+     msgs=msgs,
+     tokenizer=tokenizer
+ )
+ print(response)
+
+ # Streaming output (inherited from MiniCPM's implementation)
+ response_stream = model.chat(
+     image=None,
+     msgs=msgs,
+     tokenizer=tokenizer,
+     sampling=True,
+     stream=True
+ )
+
+ generated_text = ""
+ for new_text in response_stream:
+     generated_text += new_text
+     print(new_text, flush=True, end='')
+ ```
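+
+ The same `chat()` call can be looped over a directory of images. A minimal sketch using only the interface shown above (the `images/` folder name is illustrative):
+ ```python
+ import os
+ import torch
+ from PIL import Image
+ from transformers import AutoModel, AutoTokenizer
+
+ model = AutoModel.from_pretrained('PengxiangLi/MAT', trust_remote_code=True,
+     attn_implementation='sdpa', torch_dtype=torch.bfloat16)
+ model = model.eval().cuda()
+ tokenizer = AutoTokenizer.from_pretrained('PengxiangLi/MAT', trust_remote_code=True)
+
+ question = 'What is in the image?'
+ for name in sorted(os.listdir('images')):  # hypothetical folder of .jpg/.png files
+     if not name.lower().endswith(('.jpg', '.png')):
+         continue
+     image = Image.open(os.path.join('images', name)).convert('RGB')
+     msgs = [{'role': 'user', 'content': [image, question]}]
+     # Same chat() interface as in the basic example
+     answer = model.chat(image=None, msgs=msgs, tokenizer=tokenizer)
+     print(f'{name}: {answer}')
+ ```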
+
+ ### Multi-image Chat
+ <details>
+ <summary>Implementation adapted from MiniCPM's original multi-image handling</summary>
+
+ ```python
+ import torch
+ from PIL import Image
+ from transformers import AutoModel, AutoTokenizer
+
+ model = AutoModel.from_pretrained('PengxiangLi/MAT', trust_remote_code=True,
+     attn_implementation='sdpa', torch_dtype=torch.bfloat16)
+ model = model.eval().cuda()
+ tokenizer = AutoTokenizer.from_pretrained('PengxiangLi/MAT', trust_remote_code=True)
+
+ # The message format follows MiniCPM's original schema
+ image1 = Image.open('image1.jpg').convert('RGB')
+ image2 = Image.open('image2.jpg').convert('RGB')
+ question = 'Compare the two images...'
+
+ msgs = [{'role': 'user', 'content': [image1, image2, question]}]
+
+ # Using the original chat interface design
+ answer = model.chat(
+     image=None,
+     msgs=msgs,
+     tokenizer=tokenizer
+ )
+ print(answer)
+ ```
+ </details>
+
+ ### Few-shot Learning
+ <details>
+ <summary>Adapted from MiniCPM's few-shot implementation</summary>
+
+ ```python
+ import torch
+ from PIL import Image
+ from transformers import AutoModel, AutoTokenizer
+
+ # Maintain the original model loading parameters
+ model = AutoModel.from_pretrained('PengxiangLi/MAT', trust_remote_code=True,
+     attn_implementation='sdpa', torch_dtype=torch.bfloat16)
+ model = model.eval().cuda()
+ tokenizer = AutoTokenizer.from_pretrained('PengxiangLi/MAT', trust_remote_code=True)
+
+ # Following MiniCPM's message structure
+ question = "production date"
+ image1 = Image.open('example1.jpg').convert('RGB')
+ answer1 = "2023.08.04"
+ image2 = Image.open('example2.jpg').convert('RGB')
+ answer2 = "2007.04.24"
+ image_test = Image.open('test.jpg').convert('RGB')
+
+ msgs = [
+     {'role': 'user', 'content': [image1, question]},
+     {'role': 'assistant', 'content': [answer1]},
+     {'role': 'user', 'content': [image2, question]},
+     {'role': 'assistant', 'content': [answer2]},
+     {'role': 'user', 'content': [image_test, question]}
+ ]
+
+ # Using the unmodified chat interface from the original implementation
+ answer = model.chat(
+     image=None,
+     msgs=msgs,
+     tokenizer=tokenizer
+ )
+ print(answer)
+ ```
+ </details>
+
+ #### Implementation Notes
+ 1. All core inference logic is inherited directly from [MiniCPM-V-2.6](https://huggingface.co/openbmb/MiniCPM-V-2_6).
+ 2. The `chat()` interface is unchanged from the original implementation.
+ 3. Model loading parameters remain compatible with the base architecture.
+ 4. Message formatting follows MiniCPM's original schema, so multi-turn histories work as in the sketch below.
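+
+ For example, a multi-turn exchange can reuse the same schema by appending each reply back into `msgs` as an assistant turn. A minimal sketch built only on the calls shown above (the follow-up question is illustrative):
+ ```python
+ import torch
+ from PIL import Image
+ from transformers import AutoModel, AutoTokenizer
+
+ model = AutoModel.from_pretrained('PengxiangLi/MAT', trust_remote_code=True,
+     attn_implementation='sdpa', torch_dtype=torch.bfloat16)
+ model = model.eval().cuda()
+ tokenizer = AutoTokenizer.from_pretrained('PengxiangLi/MAT', trust_remote_code=True)
+
+ image = Image.open('xx.jpg').convert('RGB')
+ msgs = [{'role': 'user', 'content': [image, 'What is in the image?']}]
+ first_answer = model.chat(image=None, msgs=msgs, tokenizer=tokenizer)
+
+ # Append the reply as an assistant turn, then ask a follow-up question
+ msgs.append({'role': 'assistant', 'content': [first_answer]})
+ msgs.append({'role': 'user', 'content': ['Describe it in one sentence.']})
+ second_answer = model.chat(image=None, msgs=msgs, tokenizer=tokenizer)
+ print(second_answer)
+ ```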
+
+ ## License
+
+ #### Model License
+ - The code in this repository is licensed under the **[Apache-2.0 License](https://github.com/OpenBMB/MiniCPM/blob/main/LICENSE)**.
+ - Usage of our fine-tuned MiniCPM-based model weights must strictly adhere to the **[MiniCPM Model License](https://github.com/OpenBMB/MiniCPM/blob/main/MiniCPM%20Model%20License.md)**.
+
+ #### Usage Terms
+ - **Academic Research**: The model weights are freely available for academic use without restrictions.
+ - **Commercial Use**:
+   - After completing the official **[registration questionnaire](https://modelbest.feishu.cn/share/base/form/shrcnpV5ZT9EJ6xYjh3Kx0J6v8g)** and obtaining authorization, the MiniCPM-V 2.6 based weights (including our fine-tuned version) are available for commercial use free of charge.
+   - Commercial users must remain compliant with all terms of the MiniCPM Model License.
+
+ #### Inheritance Clause
+ As a derivative work of MiniCPM, our model inherits and is bound by all licensing requirements of the base model. Users are responsible for complying with both our terms and the upstream MiniCPM license.
+
+ ## Citation
+
+ If you find our work helpful, please consider citing our paper 📝 and liking this project ❤️!
+
+ ```bibtex
+ @article{gao2024multi,
+   title={Multi-modal Agent Tuning: Building a VLM-Driven Agent for Efficient Tool Usage},
+   author={Gao, Zhi and Zhang, Bofei and Li, Pengxiang and Ma, Xiaojian and Yuan, Tao and Fan, Yue and Wu, Yuwei and Jia, Yunde and Zhu, Song-Chun and Li, Qing},
+   journal={arXiv preprint arXiv:2412.15606},
+   year={2024}
+ }
+ ```