Update README.md
README.md
CHANGED
@@ -1,5 +1,8 @@
+
 ---
-license:
+license: other
+license_name: qwen
+license_link: https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct/blob/main/LICENSE
 language:
 - en
 pipeline_tag: image-text-to-text
@@ -41,13 +44,12 @@ We extend dynamic resolution to the temporal dimension by adopting dynamic FPS s
 <img src="https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-VL/qwen2.5vl_arc.jpeg" width="80%"/>
 <p>
 
-
 * **Streamlined and Efficient Vision Encoder**
 
 We enhance both training and inference speeds by strategically implementing window attention into the ViT. The ViT architecture is further optimized with SwiGLU and RMSNorm, aligning it with the structure of the Qwen2.5 LLM.
 
 
-We have three models with 3, 7 and 72 billion parameters. This repo contains the instruction-tuned
+We have three models with 3, 7 and 72 billion parameters. This repo contains the instruction-tuned 72B Qwen2.5-VL model. For more information, visit our [Blog](https://qwenlm.github.io/blog/qwen2.5-vl/) and [GitHub](https://github.com/QwenLM/Qwen2.5-VL).
 
 
 
@@ -55,50 +57,51 @@ We have three models with 3, 7 and 72 billion parameters. This repo contains the
 
 ### Image benchmark
 
-| Benchmark | InternVL2.5-8B | MiniCPM-o 2.6 | GPT-4o-mini | Qwen2-VL-7B |**Qwen2.5-VL-7B** |
-| :--- | :---: | :---: | :---: | :---: | :---: |
-| MMMU<sub>val</sub> | 56 | 50.4 | **60**| 54.1 | 58.6|
-| MMMU-Pro<sub>val</sub> | 34.3 | - | 37.6| 30.5 | 41.0|
-| DocVQA<sub>test</sub> | 93 | 93 | - | 94.5 | **95.7** |
-| InfoVQA<sub>test</sub> | 77.6 | - | - |76.5 | **82.6** |
-| ChartQA<sub>test</sub> | 84.8 | - |- | 83.0 |**87.3** |
-| TextVQA<sub>val</sub> | 79.1 | 80.1 | -| 84.3 | **84.9**|
-| OCRBench | 822 | 852 | 785 | 845 | **864** |
-| CC_OCR | 57.7 | | | 61.6 | **77.8**|
-| MMStar | 62.8| | |60.7| **63.9**|
-| MMBench-V1.1-En<sub>test</sub> | 79.4 | 78.0 | 76.0| 80.7 | **82.6** |
-| MMT-Bench<sub>test</sub> | - | - | - |**63.7** |63.6 |
-| MMStar | **61.5** | 57.5 | 54.8 | 60.7 |63.9 |
-| MMVet<sub>GPT-4-Turbo</sub> | 54.2 | 60.0 | 66.9 | 62.0 | **67.1**|
-| HallBench<sub>avg</sub> | 45.2 | 48.1 | 46.1| 50.6 | **52.9**|
-| MathVista<sub>testmini</sub> | 58.3 | 60.6 | 52.4 | 58.2 | **68.2**|
-| MathVision | - | - | - | 16.3 | **25.07** |
-
-### Video Benchmarks
-
-| Benchmark | Qwen2-VL-7B | **Qwen2.5-VL-7B** |
-| :--- | :---: | :---: |
-| MVBench | 67.0 | **69.6** |
-| PerceptionTest<sub>test</sub> | 66.9 | **70.5** |
-| Video-MME<sub>wo/w subs</sub> | 63.3/69.0 | **65.1**/**71.6** |
-| LVBench | | 45.3 |
-| LongVideoBench | | 54.7 |
-| MMBench-Video | 1.44 | 1.79 |
-| TempCompass | | 71.7 |
-| MLVU | | 70.2 |
-| CharadesSTA/mIoU | 43.6|
+| Benchmarks | GPT4o | Claude3.5 Sonnet | Gemini-2-flash | InternVL2.5-78B | Qwen2-VL-72B | Qwen2.5-VL-72B |
+|-----------------------|-----------|-------------------|-----------------|-----------------|--------------|----------------|
+| MMMU<sub>val</sub> | 70.3 | 70.4 | 70.7 | 70.1 | 64.5 | 70.2 |
+| MMMU_Pro | 54.5 | 54.7 | 57.0 | 48.6 | 46.2 | 51.1 |
+| MathVista_MINI | 63.8 | 65.4 | 73.1 | 76.6 | 70.5 | 74.8 |
+| MathVision_FULL | 30.4 | 38.3 | 41.3 | 32.2 | 25.9 | 38.1 |
+| Hallusion Bench | 55.0 | 55.16 | | 57.4 | 58.1 | 55.16 |
+| MMBench_DEV_EN_V11 | 82.1 | 83.4 | 83.0 | 88.5 | 86.6 | 88 |
+| AI2D_TEST | 84.6 | 81.2 | | 89.1 | 88.1 | 88.4 |
+| ChartQA_TEST | 86.7 | 90.8 | 85.2 | 88.3 | 88.3 | 89.5 |
+| DocVQA_VAL | 91.1 | 95.2 | 92.1 | 96.5 | 96.1 | 96.4 |
+| MMStar | 64.7 | 65.1 | 69.4 | 69.5 | 68.3 | 70.8 |
+| MMVet_turbo | 69.1 | 70.1 | | 72.3 | 74.0 | 76.19 |
+| OCRBench | 736 | 788 | | 854 | 877 | 885 |
+| OCRBench-V2(en/zh) | 46.5/32.3 | 45.2/39.6 | 51.9/43.1 | 45/46.2 | 47.8/46.1 | 61.5/63.7 |
+| CC-OCR | 66.6 | 62.7 | 73.0 | 64.7 | 68.7 | 79.8 |
+
+
+### Video benchmark
+| Benchmarks | GPT4o | Gemini-1.5-Pro | InternVL2.5-78B | Qwen2VL-72B | Qwen2.5VL-72B |
+|---------------------|-------|----------------|-----------------|-------------|---------------|
+| VideoMME w/o sub. | 71.9 | 75.0 | 72.1 | 71.2 | 73.3 |
+| VideoMME w sub. | 77.2 | 81.3 | 74.0 | 77.8 | 79.1 |
+| MVBench | 64.6 | 60.5 | 76.4 | 73.6 | 70.4 |
+| MMBench-Video | 1.63 | 1.30 | 1.97 | 1.70 | 2.02 |
+| LVBench | 30.8 | 33.1 | - | 41.3 | 47.3 |
+| EgoSchema | 72.2 | 71.2 | - | 77.9 | 76.2 |
+| PerceptionTest_test | - | - | - | 68.0 | 73.2 |
+| MLVU_M-Avg_dev | 64.6 | - | 75.7 | | 74.6 |
+| TempCompass_overall | 73.8 | - | - | | 74.8 |
 
 
 ### Agent benchmark
-
-
-
-| ScreenSpot
-
-
-| Android Control
-
-
+
+| Benchmarks | GPT4o | Gemini 2.0 | Claude | Aguvis-72B | Qwen2VL-72B | Qwen2.5VL-72B |
+|-------------------------|-------------|------------|--------|------------|-------------|---------------|
+| ScreenSpot | 18.1 | 84.0 | 83.0 | | | 87.1 |
+| ScreenSpot Pro | | | 17.1 | | 1.6 | 43.6 |
+| AITZ_EM | 35.3 | | | | 72.8 | 83.2 |
+| Android Control High_EM | | | | 66.4 | 59.1 | 67.36 |
+| Android Control Low_EM | | | | 84.4 | 59.2 | 93.7 |
+| AndroidWorld_SR | 34.5% (SoM) | | 27.9% | 26.1% | | 35% |
+| MobileMiniWob++_SR | | | | 66% | | 68% |
+| OSWorld | | | 14.90 | 10.26 | | 8.83 |
+
 
 ## Requirements
 The code of Qwen2.5-VL has been in the latest Hugging face transformers and we advise you to build from source with command:
@@ -144,25 +147,25 @@ from qwen_vl_utils import process_vision_info
 
 # default: Load the model on the available device(s)
 model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
-    "Qwen/Qwen2.5-VL-
+    "Qwen/Qwen2.5-VL-72B-Instruct", torch_dtype="auto", device_map="auto"
 )
 
 # We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
 # model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
-#     "Qwen/Qwen2.5-VL-
+#     "Qwen/Qwen2.5-VL-72B-Instruct",
 #     torch_dtype=torch.bfloat16,
 #     attn_implementation="flash_attention_2",
 #     device_map="auto",
 # )
 
 # default processer
-processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-
+processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-72B-Instruct")
 
 # The default range for the number of visual tokens per image in the model is 4-16384.
 # You can set min_pixels and max_pixels according to your needs, such as a token range of 256-1280, to balance performance and cost.
 # min_pixels = 256*28*28
 # max_pixels = 1280*28*28
-# processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-
+# processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-72B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)
 
 messages = [
     {
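The diff above cuts off at the start of the `messages` list. For context, below is a minimal sketch of the rest of the chat-template / `process_vision_info` flow as it is commonly used with this model family; the image URL and `max_new_tokens` value are placeholders, and the call signatures should be checked against the installed versions of `transformers` and `qwen_vl_utils`.

```python
# Hedged sketch of the end-to-end inference loop; verify argument names against
# your installed transformers / qwen_vl_utils versions.
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-72B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-72B-Instruct")

# A single-image chat turn; the URL is a placeholder.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/demo.jpeg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Render the chat template, collect the vision inputs, and batch everything together.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt"
).to(model.device)

# Generate, strip the prompt tokens, and decode only the newly generated text.
generated_ids = model.generate(**inputs, max_new_tokens=128)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```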
@@ -431,7 +434,7 @@ The model supports a wide range of resolution inputs. By default, it uses the na
 min_pixels = 256 * 28 * 28
 max_pixels = 1280 * 28 * 28
 processor = AutoProcessor.from_pretrained(
-    "Qwen/Qwen2.5-VL-
+    "Qwen/Qwen2.5-VL-72B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels
 )
 ```
 
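As a rough rule of thumb implied by the 28×28 factors above, one visual token corresponds to roughly a 28×28-pixel patch after resizing, so `min_pixels` and `max_pixels` effectively bound the per-image token count. The sketch below estimates that budget; it is an approximation for capacity planning, not the processor's exact resizing logic.

```python
# Rough estimate of visual tokens per image under a min/max pixel budget.
# One token ~= one 28x28 pixel patch after resizing; this mirrors the
# 256*28*28 / 1280*28*28 style budgets above, not the exact resize code.
def estimate_visual_tokens(height: int, width: int,
                           min_pixels: int = 256 * 28 * 28,
                           max_pixels: int = 1280 * 28 * 28) -> int:
    pixels = height * width
    # The processor rescales the image so its area lands inside [min_pixels, max_pixels].
    pixels = max(min_pixels, min(max_pixels, pixels))
    return pixels // (28 * 28)

print(estimate_visual_tokens(1080, 1920))  # ~1280 tokens: clipped at max_pixels
print(estimate_visual_tokens(300, 300))    # ~256 tokens: raised to min_pixels
```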
@@ -481,6 +484,7 @@ To handle extensive inputs exceeding 32,768 tokens, we utilize [YaRN](https://ar
 
 For supported frameworks, you could add the following to `config.json` to enable YaRN:
 
+```json
 {
     ...,
     "type": "yarn",
@@ -492,6 +496,7 @@ For supported frameworks, you could add the following to `config.json` to enable
     "factor": 4,
     "original_max_position_embeddings": 32768
 }
+```
 
 However, it should be noted that this method has a significant impact on the performance of temporal and spatial localization tasks, and is therefore not recommended for use.
 
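If the long-context setting is needed despite the caveat above, the `config.json` edit has to reach the model before it is loaded. One scripted way to do that is to patch the file in a local snapshot, as in the sketch below; the snapshot path is a placeholder, and only the keys visible in this diff are written, so any other `rope_scaling` fields already present in the checkpoint are kept rather than overwritten.

```python
# Hedged sketch: enable YaRN by patching a locally downloaded config.json.
# The path is a placeholder; only the keys shown in the diff are set, and
# existing rope_scaling entries are merged rather than replaced wholesale.
import json
from pathlib import Path

config_path = Path("/path/to/local/Qwen2.5-VL-72B-Instruct/config.json")  # placeholder
config = json.loads(config_path.read_text())

rope_scaling = config.get("rope_scaling") or {}
rope_scaling.update(
    {
        "type": "yarn",
        "factor": 4,
        "original_max_position_embeddings": 32768,
    }
)
config["rope_scaling"] = rope_scaling

config_path.write_text(json.dumps(config, indent=2))
```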
@@ -499,7 +504,6 @@ At the same time, for long video inputs, since MRoPE itself is more economical w
 
 
 
-
 ## Citation
 
 If you find our work helpful, feel free to give us a cite.
@@ -526,4 +530,4 @@ If you find our work helpful, feel free to give us a cite.
 journal={arXiv preprint arXiv:2308.12966},
 year={2023}
 }
-```
+```