zijian.kang committed · Commit 979314b · Parent(s): f59e1ec · fix readme

README.md CHANGED
In a word, SAIL-VL is a foundational VLM for vision-language applications. Welcome to explore its capabilities, and feel free to contact us with any questions or opportunities.

## News🚀🚀🚀

- 2025-2-19: 📖 We released our 8B model, check it out at [🤗SAIL-VL-8B](https://huggingface.co/BytedanceDouyinContent/SAIL-VL-8B)!
- 2025-1-10: 📖 We released our paper on arXiv: [Scalable Vision Language Model Training via High Quality Data Curation](https://arxiv.org/abs/2501.05952)
- 2024-12-25: 🚀 We ranked 1st on the [OpenCompass Multi-modal Leaderboard](https://rank.opencompass.org.cn/leaderboard-multimodal/?m=REALTIME) among models of 2B parameters.

| Architecture | ViT | LLM | Adapter | Token Merge | Resolution |
| --- | --- | --- | --- | --- | --- |
| [🤗SAIL-VL-2B](https://huggingface.co/BytedanceDouyinContent/SAIL-VL-2B) | [🤗InternViT-300M](https://huggingface.co/OpenGVLab/InternViT-300M-448px) | [🤗Qwen2.5-1.5B](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct) | 2-layer MLP | 2x2 | 448x448xN |
| [🤗SAIL-VL-8B](https://huggingface.co/BytedanceDouyinContent/SAIL-VL-8B) | [🤗InternViT-300M](https://huggingface.co/OpenGVLab/InternViT-300M-448px) | [🤗Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) | 2-layer MLP | 2x2 | 448x448xN |
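The table above fixes the vision token budget: each 448x448 tile is patchified by the ViT, then 2x2 token merging reduces the token count fourfold before the MLP adapter. A minimal sketch of that arithmetic, assuming a patch size of 14 as in the InternViT-300M-448px release (the patch size is our assumption, not stated in this table):

```python
# Vision-token budget per 448x448 tile, assuming a ViT patch size of 14
# (as in the InternViT-300M-448px release; not stated in the table above).

def vision_tokens(resolution: int = 448, patch: int = 14, merge: int = 2) -> int:
    """Tokens the LLM sees per image tile after token merging."""
    patches_per_side = resolution // patch  # 448 / 14 = 32
    vit_tokens = patches_per_side ** 2      # 32 * 32 = 1024 patch tokens
    return vit_tokens // (merge * merge)    # 2x2 merge -> 1024 / 4 = 256

print(vision_tokens())      # tokens per tile
print(vision_tokens() * 4)  # e.g. a 448x448x4 dynamic-resolution input
```

Under these assumptions, the "448x448xN" resolution column translates to 256 tokens per tile, so the LLM's vision context grows linearly with the number of tiles N.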
### Training Recipes Overview:

SAIL-VL benefits from high-quality data and carefully curated training recipes.

## Evaluation

SAIL-VL not only outperforms the Qwen2-VL and InternVL2 series of models of comparable size, but is also competitive with recently released SoTAs.

### Detail Evaluations:

| **Hallucination** | *68.7* | *67.5* | *69.7* | *65.3* |
| **General VQA** | | | | |
| MMStar | 64.2 | 58.3 | 65.3 | 57.7 |
| MMBench_DEV | 79.5 | 79.5 | 83.3 | 78.1 |
| MMMU_VAL | 48.2 | 50.9 | 52.8 | 47.6 |
| MME | 2244 | 2321 | 2321 | 2149 |
| SEEDBench_IMG | 75.5 | 75.3 | 76.9 | 76.8 |
| RealWorldQA | 71.9 | 69.7 | 70.2 | 70.2 |
| MMVET | 58.3 | 62.6 | 66.8 | 60.3 |
| **OCR VQA** | | | | |
| AI2D_TEST | 83.7 | 82.9 | 84.1 | 82.0 |
| DocVQA_Val | 92.2 | 93.7 | 92.1 | 92.3 |
| InfoVQA_Val | 75.2 | 75.9 | 76.2 | 72.5 |
| ChartQA_Test | 84.6 | 81.6 | 77.6 | 84.6 |
| TextVQA_Val | 77.7 | 83.8 | 79.2 | 83.3 |
| OCRVQA_Test | 61.4 | 56.2 | 36.7 | 54.5 |
| OCRBench | 835 | 833 | 880 | 834 |
| **Math&Knowledge** | | | | |
| MathVistaMini | 68.4 | 57.3 | 68.5 | 61.8 |
| ScienceQA_Val | 98.2 | 84.6 | 97.9 | 96.2 |
| **Hallucination** | | | | |
| HallucinationBench | 52.2 | 48.5 | 50.3 | 41.2 |
| POPE | 85.2 | 86.5 | 89.1 | 89.4 |

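The italicized Hallucination row near the top of the table is consistent with the mean of the two hallucination benchmarks in the detail rows below; a quick check (this relationship is our inference, the README does not state how the summary row is computed):

```python
# Per model column, the italic Hallucination summary row matches the mean of
# HallucinationBench and POPE from the detail table (our inference).
halluc_bench = [52.2, 48.5, 50.3, 41.2]  # HallucinationBench, per column
pope         = [85.2, 86.5, 89.1, 89.4]  # POPE, per column

averages = [round((h + p) / 2, 1) for h, p in zip(halluc_bench, pope)]
print(averages)  # matches the italic summary row: 68.7, 67.5, 69.7, 65.3
```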
## Demo Cases

We visualize some examples to show the capabilities of our model. It is able to give detailed and complex answers to a variety of questions.

| Image | Question | Answer |

Our model is built upon numerous outstanding open-source projects, and we are grateful for their contributions.

```
@misc{sailvl,
    title = {SAIL-VL: Scalable Vision Language Model Training with High Quality Data Curation},
    url = {https://huggingface.co/BytedanceDouyinContent/SAIL-VL-8B/},
    author = {Bytedance Douyin Content Team},
    month = {December},
    year = {2024}
}
```

```
@article{dong2025scalable,
    title={Scalable vision language model training via high quality data curation},
    author={Dong, Hongyuan and Kang, Zijian and Yin, Weijie and Liang, Xiao and Feng, Chao and Ran, Jiao},
    journal={arXiv preprint arXiv:2501.05952},
    year={2025}
}
```

## Contributions