Commit 85a99e2 by zijian.kang (parent: 0ce0168)

update readme

Files changed (1): README.md (+13 -8)

README.md CHANGED
 
![lidar_map](statics/sail.png)

SAIL-VL is a state-of-the-art vision-language model (VLM) developed by the Bytedance Douyin Content Team. The goal of SAIL-VL is to develop a high-performance vision-language model that is easy to deploy on mobile devices and remains accessible and affordable to a broad audience. Through careful tuning of data and training recipes, SAIL-VL demonstrates that even a small VLM can benefit significantly from data scaling. Our model outperforms Qwen2-VL, InternVL2, and even recent SoTA models of comparable size.

In short, SAIL-VL is a foundational VLM for vision-language applications. Welcome to explore its capabilities, and feel free to contact us with any questions or opportunities.

## News🚀🚀🚀
- 2025-2-19: 📖 We released our 8B model; check it out at [🤗SAIL-VL-8B](https://huggingface.co/BytedanceDouyinContent/SAIL-VL-8B)~
- 2025-1-10: 📖 We released our paper on arXiv: [Scalable Vision Language Model Training via High Quality Data Curation](https://arxiv.org/abs/2501.05952)
- 2024-12-25: 🚀 We ranked 1st in the [OpenCompass Multi-modal Leaderboard](https://rank.opencompass.org.cn/leaderboard-multimodal/?m=REALTIME) among models with 2B parameters.
 
 
## Model Card

### Model Architecture:

| Architecture | ViT | LLM | Adapter | Token Merge | Resolution |
| --- | --- | --- | --- | --- | --- |
| [🤗SAIL-VL-2B](https://huggingface.co/BytedanceDouyinContent/SAIL-VL-2B) | [🤗InternViT-300M](https://huggingface.co/OpenGVLab/InternViT-300M-448px) | [🤗Qwen2.5-1.5B](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct) | 2-layer MLP | 2x2 | 448x448xN |
| [🤗SAIL-VL-8B](https://huggingface.co/BytedanceDouyinContent/SAIL-VL-8B) | [🤗InternViT-300M](https://huggingface.co/OpenGVLab/InternViT-300M-448px) | [🤗Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) | 2-layer MLP | 2x2 | 448x448xN |
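
The table's settings imply the per-tile token budget: InternViT-300M-448px uses 14x14 patches, so one 448x448 tile yields (448/14)^2 = 1024 patch tokens, and the 2x2 token merge compresses them to 256 visual tokens per tile. Below is a minimal loading sketch; it assumes SAIL-VL ships InternVL-style remote code (an `AutoModel` with a `chat` method and repo-provided image preprocessing), so treat it as illustrative rather than the official demo.

```python
# Minimal loading sketch. Assumption: SAIL-VL exposes an InternVL-style
# `trust_remote_code` interface; the repository's own demo code is the
# authoritative reference for preprocessing and the chat API.
import torch
from transformers import AutoModel, AutoTokenizer

path = "BytedanceDouyinContent/SAIL-VL-2B"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,  # model code is bundled with the checkpoint
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)

# Each input image is tiled into N crops of 448x448 (the "448x448xN" above);
# after the 2x2 token merge, each tile contributes 256 visual tokens.
# The lines below assume a repo-provided `load_image` helper and a `chat`
# method, both hypothetical here:
# pixel_values = load_image("example.jpg").to(torch.bfloat16).cuda()
# print(model.chat(tokenizer, pixel_values, "<image>\nDescribe this image.",
#                  generation_config=dict(max_new_tokens=256)))
```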

### Training Recipes Overview:

SAIL-VL benefits from high-quality data and carefully curated training recipes.

![](statics/paper_page.png)

## Evaluation

SAIL-VL not only outperforms the Qwen2-VL and InternVL2 series at comparable sizes, but is also competitive with recently released SoTA models such as Aquila and InternVL2.5.

The results are evaluated by our team with a VLMEvalKit variant.
 
| DocVQA_VAL | 86.23 | 85.38 | 74.31 | 87.67 | 86.06 |
| TextVQA_VAL | 73.48 | 79.66 | 76.27 | 76.76 | 77.21 |

Details for the average performance section:
- OpenCompass-Avg includes publicly available validation sets from OpenCompass: AI2D_TEST, HallusionBench, MMBench_DEV_CN_V11, MMBench_DEV_EN_V11, MME, MMMU_DEV_VAL, MMStar, MMVet, MathVista_MINI, evaluated by our team; a sketch of the aggregation is given below.
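
As an illustration of the aggregation, the sketch below forms OpenCompass-Avg as an unweighted mean of per-benchmark scores on a common 0-100 scale. Both the unweighted mean and the placeholder scores are assumptions made for the example; OpenCompass's own rules are authoritative, and benchmarks such as MME report raw totals that must be rescaled before averaging.

```python
# Sketch: OpenCompass-Avg as an unweighted mean over the listed benchmarks.
# The scores here are placeholders, NOT reported results; replace them with
# real per-benchmark accuracies on a 0-100 scale.
from statistics import mean

scores = {
    "AI2D_TEST": 0.0,
    "HallusionBench": 0.0,
    "MMBench_DEV_CN_V11": 0.0,
    "MMBench_DEV_EN_V11": 0.0,
    "MME": 0.0,  # rescale MME's raw total to 0-100 before averaging
    "MMMU_DEV_VAL": 0.0,
    "MMStar": 0.0,
    "MMVet": 0.0,
    "MathVista_MINI": 0.0,
}

print(f"OpenCompass-Avg: {mean(scores.values()):.2f}")
```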

Our model is built upon numerous outstanding open-source projects, and we are grateful for their contributions.

## Citation
```
@article{dong2025scalable,
  title = {Scalable vision language model training via high quality data curation},
  author = {Dong, Hongyuan and Kang, Zijian and Yin, Weijie and Liang, Xiao and Feng, Chao and Ran, Jiao},
  journal = {arXiv preprint arXiv:2501.05952},
  year = {2025}
}
```
```
@misc{sailvl,
  title = {SAIL-VL: Scalable Vision Language Model Training with High Quality Data Curation},
  month = {December},
  year = {2024}
}
```
 
## Contributions
This work is conducted by the Bytedance Douyin Content Team, authored by:
```
{Hongyuan Dong, Zijian Kang, Weijie Yin}, Xiao Liang, Chao Feng, Jiao Ran

{*} Equal Contributions.
```