chenkq committed
Commit 821a16f · verified · 1 Parent(s): 649c61c

Update README.md

Files changed (1):
  1. README.md +26 -5
README.md CHANGED
@@ -102,25 +102,25 @@ from qwen_vl_utils import process_vision_info

# default: Load the model on the available device(s)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
-     "Qwen/Qwen2.5-VL-3B-Instruct", torch_dtype="auto", device_map="auto"
+     "Qwen/Qwen2.5-VL-3B-Instruct-AWQ", torch_dtype="auto", device_map="auto"
)

# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
# model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
- #     "Qwen/Qwen2.5-VL-3B-Instruct",
+ #     "Qwen/Qwen2.5-VL-3B-Instruct-AWQ",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# default processor
- processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")
+ processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct-AWQ")

# The default range for the number of visual tokens per image in the model is 4-16384.
# You can set min_pixels and max_pixels according to your needs, such as a token range of 256-1280, to balance performance and cost.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
- # processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)
+ # processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct-AWQ", min_pixels=min_pixels, max_pixels=max_pixels)

messages = [
    {
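For reviewers trying the change locally: a minimal sketch of loading the AWQ checkpoint with the FlashAttention-2 option that the hunk keeps commented out. It only recombines lines already shown above and assumes the flash-attn package is installed; `torch_dtype="auto"` picks up the dtype stored in the quantized checkpoint.

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor

# AWQ checkpoint with FlashAttention-2, per the commented block in the hunk above.
# Requires the flash-attn package; drop attn_implementation to fall back to the
# default attention backend.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct-AWQ",
    torch_dtype="auto",
    attn_implementation="flash_attention_2",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct-AWQ")
```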
@@ -209,7 +209,7 @@ The model supports a wide range of resolution inputs. By default, it uses the na
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained(
-     "Qwen/Qwen2.5-VL-3B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels
+     "Qwen/Qwen2.5-VL-3B-Instruct-AWQ", min_pixels=min_pixels, max_pixels=max_pixels
)
```

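A note on the numbers in this hunk: the processor budgets visual tokens in 28×28-pixel units, so the `min_pixels`/`max_pixels` values above correspond directly to the "token range of 256-1280" mentioned earlier in the README. A quick sanity check (plain arithmetic, nothing model-specific):

```python
# min_pixels/max_pixels are expressed in pixels; dividing by 28*28 (the pixel area
# covered by one visual token) recovers the intended per-image token budget.
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
print(min_pixels // (28 * 28), max_pixels // (28 * 28))  # 256 1280
```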
@@ -277,6 +277,27 @@ However, it should be noted that this method has a significant impact on the per

At the same time, for long video inputs, since MRoPE itself is more economical with position ids, the max_position_embeddings can be directly modified to a larger value, such as 64k.

+ ### Benchmark
+ #### Performance of Quantized Models
+ This section reports the generation performance of quantized models (including GPTQ and AWQ) of the Qwen2.5-VL series. Specifically, we report:
+ 
+ - MMMU_VAL (Accuracy)
+ - DocVQA_VAL (Accuracy)
+ - MMBench_DEV_EN (Accuracy)
+ - MathVista_MINI (Accuracy)
+ 
+ We use [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) to evaluate all models.
+ 
+ | Model Size | Quantization | MMMU_VAL | DocVQA_VAL | MMBench_DEV_EN | MathVista_MINI |
+ | --- | --- | --- | --- | --- | --- |
+ | Qwen2.5-VL-72B-Instruct | BF16<br><sup>([🤗](https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct)[🤖](https://modelscope.cn/models/qwen/Qwen2.5-VL-72B-Instruct)) | 70.0 | 96.1 | 88.2 | 75.3 |
+ | | AWQ<br><sup>([🤗](https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct-AWQ)[🤖](https://modelscope.cn/models/qwen/Qwen2.5-VL-72B-Instruct-AWQ)) | 69.1 | 96.0 | 87.9 | 73.8 |
+ | Qwen2.5-VL-7B-Instruct | BF16<br><sup>([🤗](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct)[🤖](https://modelscope.cn/models/qwen/Qwen2.5-VL-7B-Instruct)) | 58.4 | 94.9 | 84.1 | 67.9 |
+ | | AWQ<br><sup>([🤗](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct-AWQ)[🤖](https://modelscope.cn/models/qwen/Qwen2.5-VL-7B-Instruct-AWQ)) | 55.6 | 94.6 | 84.2 | 64.7 |
+ | Qwen2.5-VL-3B-Instruct | BF16<br><sup>([🤗](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct)[🤖](https://modelscope.cn/models/qwen/Qwen2.5-VL-3B-Instruct)) | 51.7 | 93.0 | 79.8 | 61.4 |
+ | | AWQ<br><sup>([🤗](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct-AWQ)[🤖](https://modelscope.cn/models/qwen/Qwen2.5-VL-3B-Instruct-AWQ)) | 49.1 | 91.8 | 78.0 | 58.8 |
+ 
+ 


## Citation
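Following the full diff, a quick read on the benchmark table it adds, focused on the 3B checkpoint this repository ships: the AWQ-vs-BF16 gap stays within a few points on every benchmark. The deltas below are computed from the table values only.

```python
# Accuracy deltas (AWQ minus BF16) for Qwen2.5-VL-3B-Instruct, values copied from the table above.
bf16 = {"MMMU_VAL": 51.7, "DocVQA_VAL": 93.0, "MMBench_DEV_EN": 79.8, "MathVista_MINI": 61.4}
awq = {"MMMU_VAL": 49.1, "DocVQA_VAL": 91.8, "MMBench_DEV_EN": 78.0, "MathVista_MINI": 58.8}
for name in bf16:
    print(f"{name}: {awq[name] - bf16[name]:+.1f}")
# MMMU_VAL: -2.6, DocVQA_VAL: -1.2, MMBench_DEV_EN: -1.8, MathVista_MINI: -2.6
```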
 