imjliao committed
Commit 3c8925a · verified · 1 Parent(s): 910b1b5

Update README.md

Files changed (1)
  1. README.md +55 -51
README.md CHANGED
@@ -1,5 +1,8 @@
  ---
- license: apache-2.0
  language:
  - en
  pipeline_tag: image-text-to-text
@@ -41,13 +44,12 @@ We extend dynamic resolution to the temporal dimension by adopting dynamic FPS s
  <img src="https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-VL/qwen2.5vl_arc.jpeg" width="80%"/>
  <p>

-
  * **Streamlined and Efficient Vision Encoder**

  We enhance both training and inference speeds by strategically implementing window attention into the ViT. The ViT architecture is further optimized with SwiGLU and RMSNorm, aligning it with the structure of the Qwen2.5 LLM.

- We have three models with 3, 7 and 72 billion parameters. This repo contains the instruction-tuned 7B Qwen2.5-VL model. For more information, visit our [Blog](https://qwenlm.github.io/blog/qwen2.5-vl/) and [GitHub](https://github.com/QwenLM/Qwen2.5-VL).

@@ -55,50 +57,51 @@ We have three models with 3, 7 and 72 billion parameters. This repo contains the

  ### Image benchmark

- | Benchmark | InternVL2.5-8B | MiniCPM-o 2.6 | GPT-4o-mini | Qwen2-VL-7B | **Qwen2.5-VL-7B** |
- | :--- | :---: | :---: | :---: | :---: | :---: |
- | MMMU<sub>val</sub> | 56 | 50.4 | **60** | 54.1 | 58.6 |
- | MMMU-Pro<sub>val</sub> | 34.3 | - | 37.6 | 30.5 | 41.0 |
- | DocVQA<sub>test</sub> | 93 | 93 | - | 94.5 | **95.7** |
- | InfoVQA<sub>test</sub> | 77.6 | - | - | 76.5 | **82.6** |
- | ChartQA<sub>test</sub> | 84.8 | - | - | 83.0 | **87.3** |
- | TextVQA<sub>val</sub> | 79.1 | 80.1 | - | 84.3 | **84.9** |
- | OCRBench | 822 | 852 | 785 | 845 | **864** |
- | CC_OCR | 57.7 | | | 61.6 | **77.8** |
- | MMStar | 62.8 | | | 60.7 | **63.9** |
- | MMBench-V1.1-En<sub>test</sub> | 79.4 | 78.0 | 76.0 | 80.7 | **82.6** |
- | MMT-Bench<sub>test</sub> | - | - | - | **63.7** | 63.6 |
- | MMStar | **61.5** | 57.5 | 54.8 | 60.7 | 63.9 |
- | MMVet<sub>GPT-4-Turbo</sub> | 54.2 | 60.0 | 66.9 | 62.0 | **67.1** |
- | HallBench<sub>avg</sub> | 45.2 | 48.1 | 46.1 | 50.6 | **52.9** |
- | MathVista<sub>testmini</sub> | 58.3 | 60.6 | 52.4 | 58.2 | **68.2** |
- | MathVision | - | - | - | 16.3 | **25.07** |
-
- ### Video Benchmarks
-
- | Benchmark | Qwen2-VL-7B | **Qwen2.5-VL-7B** |
- | :--- | :---: | :---: |
- | MVBench | 67.0 | **69.6** |
- | PerceptionTest<sub>test</sub> | 66.9 | **70.5** |
- | Video-MME<sub>wo/w subs</sub> | 63.3/69.0 | **65.1**/**71.6** |
- | LVBench | | 45.3 |
- | LongVideoBench | | 54.7 |
- | MMBench-Video | 1.44 | 1.79 |
- | TempCompass | | 71.7 |
- | MLVU | | 70.2 |
- | CharadesSTA/mIoU | | 43.6 |

  ### Agent benchmark
- | Benchmarks | Qwen2.5-VL-7B |
- |-------------------------|---------------|
- | ScreenSpot | 84.7 |
- | ScreenSpot Pro | 29.0 |
- | AITZ_EM | 81.9 |
- | Android Control High_EM | 60.1 |
- | Android Control Low_EM | 93.7 |
- | AndroidWorld_SR | 25.5 |
- | MobileMiniWob++_SR | 91.4 |

  ## Requirements
  The code of Qwen2.5-VL has been in the latest Hugging face transformers and we advise you to build from source with command:
@@ -144,25 +147,25 @@ from qwen_vl_utils import process_vision_info

  # default: Load the model on the available device(s)
  model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
-     "Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
  )

  # We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
  # model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
- #     "Qwen/Qwen2.5-VL-7B-Instruct",
  #     torch_dtype=torch.bfloat16,
  #     attn_implementation="flash_attention_2",
  #     device_map="auto",
  # )

  # default processer
- processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

  # The default range for the number of visual tokens per image in the model is 4-16384.
  # You can set min_pixels and max_pixels according to your needs, such as a token range of 256-1280, to balance performance and cost.
  # min_pixels = 256*28*28
  # max_pixels = 1280*28*28
- # processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)

  messages = [
      {
@@ -431,7 +434,7 @@ The model supports a wide range of resolution inputs. By default, it uses the na
  min_pixels = 256 * 28 * 28
  max_pixels = 1280 * 28 * 28
  processor = AutoProcessor.from_pretrained(
-     "Qwen/Qwen2.5-VL-7B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels
  )
  ```

@@ -481,6 +484,7 @@ To handle extensive inputs exceeding 32,768 tokens, we utilize [YaRN](https://ar

  For supported frameworks, you could add the following to `config.json` to enable YaRN:

  {
      ...,
      "type": "yarn",
@@ -492,6 +496,7 @@ For supported frameworks, you could add the following to `config.json` to enable
      "factor": 4,
      "original_max_position_embeddings": 32768
  }

  However, it should be noted that this method has a significant impact on the performance of temporal and spatial localization tasks, and is therefore not recommended for use.

@@ -499,7 +504,6 @@ At the same time, for long video inputs, since MRoPE itself is more economical w

-
  ## Citation

  If you find our work helpful, feel free to give us a cite.
@@ -526,4 +530,4 @@ If you find our work helpful, feel free to give us a cite.
      journal={arXiv preprint arXiv:2308.12966},
      year={2023}
  }
- ```

+
  ---
+ license: other
+ license_name: qwen
+ license_link: https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct/blob/main/LICENSE
  language:
  - en
  pipeline_tag: image-text-to-text

  <img src="https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-VL/qwen2.5vl_arc.jpeg" width="80%"/>
  <p>

  * **Streamlined and Efficient Vision Encoder**

  We enhance both training and inference speeds by strategically implementing window attention into the ViT. The ViT architecture is further optimized with SwiGLU and RMSNorm, aligning it with the structure of the Qwen2.5 LLM.

+ We have three models with 3, 7 and 72 billion parameters. This repo contains the instruction-tuned 72B Qwen2.5-VL model. For more information, visit our [Blog](https://qwenlm.github.io/blog/qwen2.5-vl/) and [GitHub](https://github.com/QwenLM/Qwen2.5-VL).
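
For readers who want to see what the SwiGLU block mentioned above looks like in code, here is a minimal PyTorch sketch; the class name, layer names, and dimensions are illustrative and are not taken from the actual Qwen2.5-VL ViT implementation.

```python
import torch
from torch import nn

# Hedged sketch of a SwiGLU feed-forward block of the kind referenced above.
# Layer names and sizes are illustrative, not the Qwen2.5-VL ViT source.
class SwiGLU(nn.Module):
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.gate_proj = nn.Linear(dim, hidden_dim, bias=False)
        self.up_proj = nn.Linear(dim, hidden_dim, bias=False)
        self.down_proj = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU: a SiLU-gated linear unit followed by a down projection.
        return self.down_proj(nn.functional.silu(self.gate_proj(x)) * self.up_proj(x))
```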

  ### Image benchmark

+ | Benchmarks | GPT4o | Claude3.5 Sonnet | Gemini-2-flash | InternVL2.5-78B | Qwen2-VL-72B | Qwen2.5-VL-72B |
+ |------------|-------|------------------|----------------|-----------------|--------------|----------------|
+ | MMMU<sub>val</sub> | 70.3 | 70.4 | 70.7 | 70.1 | 64.5 | 70.2 |
+ | MMMU_Pro | 54.5 | 54.7 | 57.0 | 48.6 | 46.2 | 51.1 |
+ | MathVista_MINI | 63.8 | 65.4 | 73.1 | 76.6 | 70.5 | 74.8 |
+ | MathVision_FULL | 30.4 | 38.3 | 41.3 | 32.2 | 25.9 | 38.1 |
+ | Hallusion Bench | 55.0 | 55.16 | | 57.4 | 58.1 | 55.16 |
+ | MMBench_DEV_EN_V11 | 82.1 | 83.4 | 83.0 | 88.5 | 86.6 | 88 |
+ | AI2D_TEST | 84.6 | 81.2 | | 89.1 | 88.1 | 88.4 |
+ | ChartQA_TEST | 86.7 | 90.8 | 85.2 | 88.3 | 88.3 | 89.5 |
+ | DocVQA_VAL | 91.1 | 95.2 | 92.1 | 96.5 | 96.1 | 96.4 |
+ | MMStar | 64.7 | 65.1 | 69.4 | 69.5 | 68.3 | 70.8 |
+ | MMVet_turbo | 69.1 | 70.1 | | 72.3 | 74.0 | 76.19 |
+ | OCRBench | 736 | 788 | | 854 | 877 | 885 |
+ | OCRBench-V2 (en/zh) | 46.5/32.3 | 45.2/39.6 | 51.9/43.1 | 45/46.2 | 47.8/46.1 | 61.5/63.7 |
+ | CC-OCR | 66.6 | 62.7 | 73.0 | 64.7 | 68.7 | 79.8 |
+
+ ### Video benchmark
+
+ | Benchmarks | GPT4o | Gemini-1.5-Pro | InternVL2.5-78B | Qwen2VL-72B | Qwen2.5VL-72B |
+ |------------|-------|----------------|-----------------|-------------|---------------|
+ | VideoMME w/o sub. | 71.9 | 75.0 | 72.1 | 71.2 | 73.3 |
+ | VideoMME w sub. | 77.2 | 81.3 | 74.0 | 77.8 | 79.1 |
+ | MVBench | 64.6 | 60.5 | 76.4 | 73.6 | 70.4 |
+ | MMBench-Video | 1.63 | 1.30 | 1.97 | 1.70 | 2.02 |
+ | LVBench | 30.8 | 33.1 | - | 41.3 | 47.3 |
+ | EgoSchema | 72.2 | 71.2 | - | 77.9 | 76.2 |
+ | PerceptionTest_test | - | - | - | 68.0 | 73.2 |
+ | MLVU_M-Avg_dev | 64.6 | - | 75.7 | | 74.6 |
+ | TempCompass_overall | 73.8 | - | - | | 74.8 |
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
91
 
92
  ### Agent benchmark
93
+
94
+ | Benchmarks | GPT4o | Gemini 2.0 | Claude | Aguvis-72B | Qwen2VL-72B | Qwen2.5VL-72B |
95
+ |-------------------------|-------------|------------|--------|------------|-------------|---------------|
96
+ | ScreenSpot | 18.1 | 84.0 | 83.0 | | | 87.1 |
97
+ | ScreenSpot Pro | | | 17.1 | | 1.6 | 43.6 |
98
+ | AITZ_EM | 35.3 | | | | 72.8 | 83.2 |
99
+ | Android Control High_EM | | | | 66.4 | 59.1 | 67.36 |
100
+ | Android Control Low_EM | | | | 84.4 | 59.2 | 93.7 |
101
+ | AndroidWorld_SR | 34.5% (SoM) | | 27.9% | 26.1% | | 35% |
102
+ | MobileMiniWob++_SR | | | | 66% | | 68% |
103
+ | OSWorld | | | 14.90 | 10.26 | | 8.83 |
104
+
105
 
106
  ## Requirements
107
  The code of Qwen2.5-VL has been in the latest Hugging face transformers and we advise you to build from source with command:
 
147
 
148
  # default: Load the model on the available device(s)
149
  model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
150
+ "Qwen/Qwen2.5-VL-72B-Instruct", torch_dtype="auto", device_map="auto"
151
  )
152
 
153
  # We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
154
  # model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
155
+ # "Qwen/Qwen2.5-VL-72B-Instruct",
156
  # torch_dtype=torch.bfloat16,
157
  # attn_implementation="flash_attention_2",
158
  # device_map="auto",
159
  # )
160
 
161
  # default processer
162
+ processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-72B-Instruct")
163
 
164
  # The default range for the number of visual tokens per image in the model is 4-16384.
165
  # You can set min_pixels and max_pixels according to your needs, such as a token range of 256-1280, to balance performance and cost.
166
  # min_pixels = 256*28*28
167
  # max_pixels = 1280*28*28
168
+ # processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-72B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)
169
 
170
  messages = [
171
  {
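
The quoted snippet stops at `messages = [` because the remaining lines fall outside this hunk. For context, here is a minimal sketch of how the loaded `model` and `processor` are typically used end to end, following the standard Qwen2.5-VL usage pattern; the image URL and prompt below are placeholders, not part of this commit.

```python
# Hedged sketch: illustrative continuation of the snippet above.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/demo.jpeg"},  # placeholder image
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Build the chat prompt and collect the vision inputs.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

# Generate, then strip the prompt tokens from each output before decoding.
generated_ids = model.generate(**inputs, max_new_tokens=128)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True))
```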
 
  min_pixels = 256 * 28 * 28
  max_pixels = 1280 * 28 * 28
  processor = AutoProcessor.from_pretrained(
+     "Qwen/Qwen2.5-VL-72B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels
  )
  ```
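
Beyond the processor-level `min_pixels`/`max_pixels` shown above, the Qwen model cards also describe per-image overrides that `qwen_vl_utils` reads from the message itself. A small sketch of that pattern follows; the file path is a placeholder, and the exact field handling should be verified against the version of `qwen_vl_utils` you install.

```python
# Sketch: per-image resolution control. Each visual token covers a 28x28 pixel
# patch, so a budget of N tokens maps to roughly N * 28 * 28 input pixels.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "file:///path/to/your/image.jpg",  # placeholder path
                "min_pixels": 256 * 28 * 28,   # at least ~256 visual tokens
                "max_pixels": 1280 * 28 * 28,  # at most ~1280 visual tokens
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
# Alternatively, request an exact size; dimensions are rounded to multiples of 28:
# {"type": "image", "image": "file:///path/to/your/image.jpg",
#  "resized_height": 280, "resized_width": 420}
```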

  For supported frameworks, you could add the following to `config.json` to enable YaRN:

+ ```json
  {
      ...,
      "type": "yarn",

      "factor": 4,
      "original_max_position_embeddings": 32768
  }
+ ```
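
If you would rather script this change than hand-edit the file, here is a minimal sketch that merges only the keys quoted above into an existing `config.json`; the checkpoint path is a placeholder, and any keys elided by the `...,` above are left untouched.

```python
import json
from pathlib import Path

# Sketch: merge the YaRN fields shown above into a local config.json.
# "path/to/checkpoint" is a placeholder for your downloaded model directory.
cfg_path = Path("path/to/checkpoint") / "config.json"
cfg = json.loads(cfg_path.read_text())
cfg.setdefault("rope_scaling", {}).update(
    {
        "type": "yarn",
        "factor": 4,
        "original_max_position_embeddings": 32768,
    }
)
cfg_path.write_text(json.dumps(cfg, indent=2))
```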

  However, it should be noted that this method has a significant impact on the performance of temporal and spatial localization tasks, and is therefore not recommended for use.

  ## Citation

  If you find our work helpful, feel free to give us a cite.

      journal={arXiv preprint arXiv:2308.12966},
      year={2023}
  }
+ ```