Safetensors
English
qwen2_vl
remote-sensing
File size: 4,826 Bytes
494c349
 
 
 
 
 
 
 
0399d20
 
494c349
 
 
04d4d19
494c349
455dfe2
494c349
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
---
license: apache-2.0
language:
- en
base_model:
- Qwen/Qwen2-VL-2B-Instruct
tags:
- remote-sensing
datasets:
- AdaptLLM/remote-sensing-visual-instructions
---
# Adapting Multimodal Large Language Models to Domains via Post-Training

This repos contains the **remote sensing MLLM developed from Qwen-2-VL-2B-Instruct** in our paper: [On Domain-Specific Post-Training for Multimodal Large Language Models](https://huggingface.co/papers/2411.19930). The correspoding training dataset is in [remote-sensing-visual-instructions](https://huggingface.co/datasets/AdaptLLM/remote-sensing-visual-instructions).

The main project page is: [Adapt-MLLM-to-Domains](https://huggingface.co/AdaptLLM/Adapt-MLLM-to-Domains)

## 1. To Chat with AdaMLLM  

Our model architecture aligns with the base model: Qwen-2-VL-Instruct. We provide a usage example below, and you may refer to the official [Qwen-2-VL-Instruct repository](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct) for more advanced usage instructions.

**Note:** For AdaMLLM, always place the image at the beginning of the input instruction in the messages.  

<details>
<summary> Click to expand </summary>

1. Set up
```bash
pip install qwen-vl-utils
```
2. Inference
```python
from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info

# default: Load the model on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "AdaptLLM/food-Qwen2-VL-2B-Instruct", torch_dtype="auto", device_map="auto"
)

# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
# model = Qwen2VLForConditionalGeneration.from_pretrained(
#     "AdaptLLM/food-Qwen2-VL-2B-Instruct",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# default processer
processor = AutoProcessor.from_pretrained("AdaptLLM/remote-sensing-Qwen2-VL-2B-Instruct")

# The default range for the number of visual tokens per image in the model is 4-16384. You can set min_pixels and max_pixels according to your needs, such as a token count range of 256-1280, to balance speed and memory usage.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("AdaptLLM/remote-sensing-Qwen2-VL-2B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)

# NOTE: For AdaMLLM, always place the image at the beginning of the input instruction in the messages.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```

</details>  

## 2. To Evaluate Any MLLM on Domain-Specific Benchmarks  

Refer to the [remote-sensing-VQA-benchmark](https://huggingface.co/datasets/AdaptLLM/remote-sensing-VQA-benchmark) to reproduce our results and evaluate many other MLLMs on domain-specific benchmarks.  

## 3. To Reproduce this Domain-Adapted MLLM  

See [Post-Train Guide](https://github.com/bigai-ai/QA-Synthesizer/blob/main/docs/Post_Train.md) to adapt MLLMs to domains.


## Citation
If you find our work helpful, please cite us.

[AdaMLLM](https://huggingface.co/papers/2411.19930)
```bibtex
@article{adamllm,
  title={On Domain-Specific Post-Training for Multimodal Large Language Models},
  author={Cheng, Daixuan and Huang, Shaohan and Zhu, Ziyu and Zhang, Xintong and Zhao, Wayne Xin and Luan, Zhongzhi and Dai, Bo and Zhang, Zhenliang},
  journal={arXiv preprint arXiv:2411.19930},
  year={2024}
}
```

[Adapt LLM to Domains](https://huggingface.co/papers/2309.09530) (ICLR 2024)
```bibtex
@inproceedings{
cheng2024adapting,
title={Adapting Large Language Models via Reading Comprehension},
author={Daixuan Cheng and Shaohan Huang and Furu Wei},
booktitle={The Twelfth International Conference on Learning Representations},
year={2024},
url={https://openreview.net/forum?id=y886UXPEZ0}
}
```