zijian.kang committed
Commit 85a99e2
Parent(s): 0ce0168
update readme
README.md CHANGED
@@ -10,17 +10,15 @@ base_model:



SAIL-VL is a state-of-the-art vision-language model (VLM) developed by the Bytedance Douyin Content Team. The goal of SAIL-VL is to develop a high-performance vision-language model that is easy to deploy on mobile devices and remains accessible and affordable to a broad audience. Through careful tuning of data and training recipes, SAIL-VL demonstrates that even a small VLM can benefit significantly from data scaling. Our model outperforms Qwen2-VL, InternVL2, and even recent SoTA models of comparable size.

In a word, SAIL-VL is a foundational VLM for vision-language applications. Welcome to explore its capabilities, and feel free to contact us with any questions or opportunities.
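
For readers who want to try the model right away, the snippet below is a minimal loading sketch. It assumes the repository ships custom modeling code loadable through the usual Hugging Face `trust_remote_code` pattern (as InternVL-style VLMs typically do); the exact image preprocessing and chat entry point are defined by the files in the model repository, so treat this as a starting point rather than the official usage example.

```python
# Minimal loading sketch. Assumption: SAIL-VL loads via trust_remote_code,
# like other InternVL-style models; check the repo files for the exact
# preprocessing and chat/generate interface.
import torch
from transformers import AutoModel, AutoTokenizer

model_path = "BytedanceDouyinContent/SAIL-VL-2B"  # or BytedanceDouyinContent/SAIL-VL-8B

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,  # bfloat16 keeps the memory footprint of the 2B model small
    trust_remote_code=True,
).eval()

# Quick sanity check that the weights loaded.
print(sum(p.numel() for p in model.parameters()) / 1e9, "B parameters")
```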

## News🚀🚀🚀

- 2025-2-19: 📖 We released our 8B model, check it out at [🤗SAIL-VL-8B](https://huggingface.co/BytedanceDouyinContent/SAIL-VL-8B) ~
- 2025-1-10: 📖 We released our paper on arXiv: [Scalable Vision Language Model Training via High Quality Data Curation](https://arxiv.org/abs/2501.05952)
- 2024-12-25: 🚀 We ranked 1st on the [OpenCompass Multi-modal Leaderboard](https://rank.opencompass.org.cn/leaderboard-multimodal/?m=REALTIME) among models of 2B parameters.

## Model Card

### Model Architecture:
@@ -28,6 +26,7 @@ In a word, SAIL-VL is a foundational VLM for vision-language applications. Welco

| Architecture | ViT | LLM | Adapter | Token Merge | Resolution |
| --- | --- | --- | --- | --- | --- |
| [🤗SAIL-VL-2B](https://huggingface.co/BytedanceDouyinContent/SAIL-VL-2B) | [🤗InternViT-300M](https://huggingface.co/OpenGVLab/InternViT-300M-448px) | [🤗Qwen2.5-1.5B](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct) | 2-layer MLP | 2x2 | 448x448xN |
| [🤗SAIL-VL-8B](https://huggingface.co/BytedanceDouyinContent/SAIL-VL-8B) | [🤗InternViT-300M](https://huggingface.co/OpenGVLab/InternViT-300M-448px) | [🤗Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) | 2-layer MLP | 2x2 | 448x448xN |
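
The Token Merge and Resolution columns together set how many visual tokens each image contributes to the LLM context. The arithmetic below is a back-of-the-envelope sketch, assuming the InternViT-300M-448px encoder uses 14x14 patches (a 32x32 grid per 448x448 tile) and that the 2x2 merge collapses each 2x2 group of patch tokens into one; both are common choices in InternVL-style stacks rather than details stated in this table.

```python
# Back-of-the-envelope visual token count for one image, under the
# assumptions stated above (14x14 ViT patches, 2x2 token merge per tile).
def visual_tokens(num_tiles: int, tile_size: int = 448, patch_size: int = 14, merge: int = 2) -> int:
    """Tokens handed to the LLM for an image split into `num_tiles` 448x448 tiles."""
    patches_per_side = tile_size // patch_size        # 448 / 14 = 32
    patch_tokens = patches_per_side ** 2              # 32 * 32 = 1024 tokens per tile
    merged_tokens = patch_tokens // (merge * merge)   # 2x2 merge -> 1024 / 4 = 256 per tile
    return merged_tokens * num_tiles                  # "448x448xN": N tiles per image

if __name__ == "__main__":
    for n in (1, 4, 12):
        print(f"{n} tile(s): {visual_tokens(n)} visual tokens")
```

Under these assumptions, every 448x448 tile costs a fixed 256 visual tokens, so the prompt length grows linearly with the number of tiles N.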

### Training Recipes Overview:

@@ -36,7 +35,6 @@ Sail-VL benefits from high-quality data and carefully curated training recipes.



## Evaluation

SAIL-VL not only outperforms the Qwen2-VL and InternVL2 series of models of comparable size, but is also competitive with recently released SoTA models, Aquila and InternVL-2.5.
@@ -71,6 +69,7 @@ The result is evaluated by our team with a VLMEvalKit variant.

| DocVQA_VAL | 86.23 | 85.38 | 74.31 | 87.67 | 86.06 |
| TextVQA_VAL | 73.48 | 79.66 | 76.27 | 76.76 | 77.21 |

Details for average performance section:
- OpenCompass-Avg includes publicly available validation sets from OpenCompass: AI2D_TEST, HallusionBench, MMBench_DEV_CN_V11, MMBench_DEV_EN_V11, MME, MMMU_DEV_VAL, MMStar, MMVet, MathVista_MINI, evaluated by our team.
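
As a small illustration of how such an average can be assembled from per-benchmark scores, here is a hypothetical helper. It assumes OpenCompass-Avg is a plain unweighted mean over the nine listed validation sets, with every score already on a 0-100 scale; the team's exact normalization is not spelled out here.

```python
# Hypothetical helper: unweighted mean over the nine OpenCompass validation sets
# listed above (assumption: every score is already on a 0-100 scale).
OPENCOMPASS_SETS = [
    "AI2D_TEST", "HallusionBench", "MMBench_DEV_CN_V11", "MMBench_DEV_EN_V11",
    "MME", "MMMU_DEV_VAL", "MMStar", "MMVet", "MathVista_MINI",
]

def opencompass_avg(scores: dict[str, float]) -> float:
    """Average the nine benchmark scores; raise if any benchmark is missing."""
    missing = [name for name in OPENCOMPASS_SETS if name not in scores]
    if missing:
        raise ValueError(f"missing scores for: {missing}")
    return sum(scores[name] for name in OPENCOMPASS_SETS) / len(OPENCOMPASS_SETS)

# Example with made-up numbers, not results from the paper:
# opencompass_avg({name: 70.0 for name in OPENCOMPASS_SETS}) -> 70.0
```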

@@ -225,6 +224,14 @@ Our model is built upon numerous outstanding open-source projects, and we are gr

## Citation

```
@article{dong2025scalable,
  title={Scalable vision language model training via high quality data curation},
  author={Dong, Hongyuan and Kang, Zijian and Yin, Weijie and Liang, Xiao and Feng, Chao and Ran, Jiao},
  journal={arXiv preprint arXiv:2501.05952},
  year={2025}
}
```

```
@misc{sailvl,
  title = {SAIL-VL: Scalable Vision Language Model Training with High Quality Data Curation},
@@ -233,13 +240,11 @@
  month = {December},
  year = {2024}
}
```

## Contributions

This work is conducted by the Bytedance Douyin Content Team, authored by:

```
{Hongyuan Dong, Zijian Kang, Weijie Yin}, Xiao Liang, Chao Feng, Jiao Ran

{*} Equal Contributions.
```