---
license: apache-2.0
language:
- en
base_model:
- Qwen/Qwen2-VL-2B-Instruct
tags:
- remote-sensing
---
# Adapting Multimodal Large Language Models to Domains via Post-Training

This repo contains the **remote sensing MLLM developed from Qwen2-VL-2B-Instruct** in our paper: [On Domain-Specific Post-Training for Multimodal Large Language Models](https://huggingface.co/papers/2411.19930).

The main project page is: [Adapt-MLLM-to-Domains](https://huggingface.co/AdaptLLM/Adapt-MLLM-to-Domains)

## Resources
**🤗 We share our data and models with example usages. Feel free to open any issues or discussions! 🤗**

| Model | Repo ID in HF 🤗 | Domain | Base Model | Training Data | Evaluation Benchmark |
|:----|:----|:----|:----|:----|:----|
| [Visual Instruction Synthesizer](https://huggingface.co/AdaptLLM/visual-instruction-synthesizer) | AdaptLLM/visual-instruction-synthesizer | - | open-llava-next-llama3-8b | VisionFLAN and ALLaVA | - |
| [AdaMLLM-med-2B](https://huggingface.co/AdaptLLM/biomed-Qwen2-VL-2B-Instruct) | AdaptLLM/biomed-Qwen2-VL-2B-Instruct | Biomedicine | Qwen2-VL-2B-Instruct | [biomed-visual-instructions](https://huggingface.co/datasets/AdaptLLM/biomed-visual-instructions) | [biomed-VQA-benchmark](https://huggingface.co/datasets/AdaptLLM/biomed-VQA-benchmark) |
| [AdaMLLM-food-2B](https://huggingface.co/AdaptLLM/food-Qwen2-VL-2B-Instruct) | AdaptLLM/food-Qwen2-VL-2B-Instruct | Food | Qwen2-VL-2B-Instruct | [food-visual-instructions](https://huggingface.co/datasets/AdaptLLM/food-visual-instructions) | [food-VQA-benchmark](https://huggingface.co/datasets/AdaptLLM/food-VQA-benchmark) |
| [AdaMLLM-remote-sensing-2B](https://huggingface.co/AdaptLLM/remote-sensing-Qwen2-VL-2B-Instruct) | AdaptLLM/remote-sensing-Qwen2-VL-2B-Instruct | Remote Sensing | Qwen2-VL-2B-Instruct | [remote-sensing-visual-instructions](https://huggingface.co/datasets/AdaptLLM/remote-sensing-visual-instructions) | [remote-sensing-VQA-benchmark](https://huggingface.co/datasets/AdaptLLM/remote-sensing-VQA-benchmark) |
| [AdaMLLM-med-8B](https://huggingface.co/AdaptLLM/biomed-LLaVA-NeXT-Llama3-8B) | AdaptLLM/biomed-LLaVA-NeXT-Llama3-8B | Biomedicine | open-llava-next-llama3-8b | [biomed-visual-instructions](https://huggingface.co/datasets/AdaptLLM/biomed-visual-instructions) | [biomed-VQA-benchmark](https://huggingface.co/datasets/AdaptLLM/biomed-VQA-benchmark) |
| [AdaMLLM-food-8B](https://huggingface.co/AdaptLLM/food-LLaVA-NeXT-Llama3-8B) | AdaptLLM/food-LLaVA-NeXT-Llama3-8B | Food | open-llava-next-llama3-8b | [food-visual-instructions](https://huggingface.co/datasets/AdaptLLM/food-visual-instructions) | [food-VQA-benchmark](https://huggingface.co/datasets/AdaptLLM/food-VQA-benchmark) |
| [AdaMLLM-remote-sensing-8B](https://huggingface.co/AdaptLLM/remote-sensing-LLaVA-NeXT-Llama3-8B) | AdaptLLM/remote-sensing-LLaVA-NeXT-Llama3-8B | Remote Sensing | open-llava-next-llama3-8b | [remote-sensing-visual-instructions](https://huggingface.co/datasets/AdaptLLM/remote-sensing-visual-instructions) | [remote-sensing-VQA-benchmark](https://huggingface.co/datasets/AdaptLLM/remote-sensing-VQA-benchmark) |
| [AdaMLLM-med-11B](https://huggingface.co/AdaptLLM/biomed-Llama-3.2-11B-Vision-Instruct) | AdaptLLM/biomed-Llama-3.2-11B-Vision-Instruct | Biomedicine | Llama-3.2-11B-Vision-Instruct | [biomed-visual-instructions](https://huggingface.co/datasets/AdaptLLM/biomed-visual-instructions) | [biomed-VQA-benchmark](https://huggingface.co/datasets/AdaptLLM/biomed-VQA-benchmark) |
| [AdaMLLM-food-11B](https://huggingface.co/AdaptLLM/food-Llama-3.2-11B-Vision-Instruct) | AdaptLLM/food-Llama-3.2-11B-Vision-Instruct | Food | Llama-3.2-11B-Vision-Instruct | [food-visual-instructions](https://huggingface.co/datasets/AdaptLLM/food-visual-instructions) | [food-VQA-benchmark](https://huggingface.co/datasets/AdaptLLM/food-VQA-benchmark) |
| [AdaMLLM-remote-sensing-11B](https://huggingface.co/AdaptLLM/remote-sensing-Llama-3.2-11B-Vision-Instruct) | AdaptLLM/remote-sensing-Llama-3.2-11B-Vision-Instruct | Remote Sensing | Llama-3.2-11B-Vision-Instruct | [remote-sensing-visual-instructions](https://huggingface.co/datasets/AdaptLLM/remote-sensing-visual-instructions) | [remote-sensing-VQA-benchmark](https://huggingface.co/datasets/AdaptLLM/remote-sensing-VQA-benchmark) |

**Code**: [https://github.com/bigai-ai/QA-Synthesizer](https://github.com/bigai-ai/QA-Synthesizer)

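The checkpoints and datasets in the table can be pulled directly from the Hub. Below is a minimal sketch using `huggingface_hub` (the repo IDs come from the table above; the use of `snapshot_download` is our suggestion, not part of the official instructions):

```python
from huggingface_hub import snapshot_download

# Download this remote-sensing MLLM's weights (repo ID from the table above).
model_dir = snapshot_download(repo_id="AdaptLLM/remote-sensing-Qwen2-VL-2B-Instruct")

# Download the matching evaluation benchmark (a dataset repo, hence repo_type="dataset").
benchmark_dir = snapshot_download(
    repo_id="AdaptLLM/remote-sensing-VQA-benchmark",
    repo_type="dataset",
)
print(model_dir, benchmark_dir)
```
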
## 1. To Chat with AdaMLLM

Our model architecture aligns with the base model: Qwen2-VL-Instruct. We provide a usage example below; refer to the official [Qwen2-VL-2B-Instruct repository](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct) for more advanced usage instructions.

**Note:** For AdaMLLM, always place the image at the beginning of the input instruction in the messages.

<details>
<summary> Click to expand </summary>

1. Set up
```bash
pip install qwen-vl-utils
```
2. Inference
```python
from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info

# default: Load the model on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "AdaptLLM/remote-sensing-Qwen2-VL-2B-Instruct", torch_dtype="auto", device_map="auto"
)

# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
# model = Qwen2VLForConditionalGeneration.from_pretrained(
#     "AdaptLLM/remote-sensing-Qwen2-VL-2B-Instruct",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# default processor
processor = AutoProcessor.from_pretrained("AdaptLLM/remote-sensing-Qwen2-VL-2B-Instruct")

# The default range for the number of visual tokens per image in the model is 4-16384.
# You can set min_pixels and max_pixels according to your needs, such as a token count range of 256-1280, to balance speed and memory usage.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("AdaptLLM/remote-sensing-Qwen2-VL-2B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)

# NOTE: For AdaMLLM, always place the image at the beginning of the input instruction in the messages.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
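
The same chat format works with a local image. A minimal variation of the example above (it reuses the `model` and `processor` objects already created; the file path and question text are placeholders to replace with your own remote-sensing image and prompt):

```python
# Hypothetical local file path; replace with your own remote-sensing image.
messages = [
    {
        "role": "user",
        "content": [
            # Per the note above, keep the image entry first.
            {"type": "image", "image": "file:///path/to/your/remote_sensing_image.png"},
            {"type": "text", "text": "Describe the land cover visible in this scene."},
        ],
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt"
).to("cuda")
generated_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(
    [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)],
    skip_special_tokens=True,
))
```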

</details>

## 2. To Evaluate Any MLLM on Domain-Specific Benchmarks

Refer to the [remote-sensing-VQA-benchmark](https://huggingface.co/datasets/AdaptLLM/remote-sensing-VQA-benchmark) to reproduce our results and evaluate many other MLLMs on domain-specific benchmarks.

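For a quick look at the benchmark data before running the full evaluation pipeline, the dataset repo can also be opened with the `datasets` library. This is a sketch under our own assumptions (the benchmark may require a config name or a different loading path, so check the dataset card):

```python
from datasets import load_dataset

# Load the remote-sensing VQA benchmark from the Hub.
# NOTE: depending on how the repo is organized, you may need to pass a config name,
# e.g. load_dataset("AdaptLLM/remote-sensing-VQA-benchmark", "<config>").
benchmark = load_dataset("AdaptLLM/remote-sensing-VQA-benchmark")
print(benchmark)
```
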
## 3. To Reproduce this Domain-Adapted MLLM

See the [Post-Train Guide](https://github.com/bigai-ai/QA-Synthesizer/blob/main/docs/Post_Train.md) to adapt MLLMs to domains.

## Citation
If you find our work helpful, please cite us.

[AdaMLLM](https://huggingface.co/papers/2411.19930)
```bibtex
@article{adamllm,
  title={On Domain-Specific Post-Training for Multimodal Large Language Models},
  author={Cheng, Daixuan and Huang, Shaohan and Zhu, Ziyu and Zhang, Xintong and Zhao, Wayne Xin and Luan, Zhongzhi and Dai, Bo and Zhang, Zhenliang},
  journal={arXiv preprint arXiv:2411.19930},
  year={2024}
}
```

[Adapt LLM to Domains](https://huggingface.co/papers/2309.09530) (ICLR 2024)
```bibtex
@inproceedings{
  cheng2024adapting,
  title={Adapting Large Language Models via Reading Comprehension},
  author={Daixuan Cheng and Shaohan Huang and Furu Wei},
  booktitle={The Twelfth International Conference on Learning Representations},
  year={2024},
  url={https://openreview.net/forum?id=y886UXPEZ0}
}
```