zijian.kang committed
Commit 85a99e2
Parent(s): 0ce0168
update readme
README.md CHANGED
@@ -10,17 +10,15 @@ base_model:



SAIL-VL is a state-of-the-art vision-language model (VLM) developed by the Bytedance Douyin Content Team. The goal of SAIL-VL is to develop a high-performance vision-language model that is easy to deploy on mobile devices and remains accessible and affordable to a broad audience. Through careful tuning of data and training recipes, SAIL-VL demonstrates that even a small VLM can benefit significantly from data scaling. Our model outperforms Qwen2-VL, InternVL2, and even recent SoTA models of comparable size.

In a word, SAIL-VL is a foundational VLM for vision-language applications. Welcome to explore its capabilities, and feel free to contact us with any questions or opportunities.
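
For readers who want to try the model right away, the snippet below is a minimal loading sketch. It assumes the repository ships custom modeling code loadable through the usual Hugging Face `trust_remote_code` pattern (as InternVL-style VLMs typically do); the exact image preprocessing and chat entry point are defined by the files in the model repository, so treat this as a starting point rather than the official usage example.

```python
# Minimal loading sketch. Assumption: SAIL-VL loads via trust_remote_code,
# like other InternVL-style models; check the repo files for the exact
# preprocessing and chat/generate interface.
import torch
from transformers import AutoModel, AutoTokenizer

model_path = "BytedanceDouyinContent/SAIL-VL-2B"  # or BytedanceDouyinContent/SAIL-VL-8B

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,  # bfloat16 keeps the memory footprint of the 2B model small
    trust_remote_code=True,
).eval()

# Quick sanity check that the weights loaded.
print(sum(p.numel() for p in model.parameters()) / 1e9, "B parameters")
```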

## News🚀🚀🚀

- 2025-2-19: 📖 We released our 8B model, check it out at [🤗SAIL-VL-8B](https://huggingface.co/BytedanceDouyinContent/SAIL-VL-8B) ~
- 2025-1-10: 📖 We released our paper on arXiv: [Scalable Vision Language Model Training via High Quality Data Curation](https://arxiv.org/abs/2501.05952)
- 2024-12-25: 🚀 We ranked 1st on the [OpenCompass Multi-modal Leaderboard](https://rank.opencompass.org.cn/leaderboard-multimodal/?m=REALTIME) among models of 2B parameters.

## Model Card

### Model Architecture:
@@ -28,6 +26,7 @@ In a word, SAIL-VL is a foundational VLM for vision-language applications. Welco

| Architecture | ViT | LLM | Adapter | Token Merge | Resolution |
| --- | --- | --- | --- | --- | --- |
| [🤗SAIL-VL-2B](https://huggingface.co/BytedanceDouyinContent/SAIL-VL-2B) | [🤗InternViT-300M](https://huggingface.co/OpenGVLab/InternViT-300M-448px) | [🤗Qwen2.5-1.5B](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct) | 2-layer MLP | 2x2 | 448x448xN |
| [🤗SAIL-VL-8B](https://huggingface.co/BytedanceDouyinContent/SAIL-VL-8B) | [🤗InternViT-300M](https://huggingface.co/OpenGVLab/InternViT-300M-448px) | [🤗Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) | 2-layer MLP | 2x2 | 448x448xN |
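
The Token Merge and Resolution columns together set how many visual tokens each image contributes to the LLM context. The arithmetic below is a back-of-the-envelope sketch, assuming the InternViT-300M-448px encoder uses 14x14 patches (a 32x32 grid per 448x448 tile) and that the 2x2 merge collapses each 2x2 group of patch tokens into one; both are common choices in InternVL-style stacks rather than details stated in this table.

```python
# Back-of-the-envelope visual token count for one image, under the
# assumptions stated above (14x14 ViT patches, 2x2 token merge per tile).
def visual_tokens(num_tiles: int, tile_size: int = 448, patch_size: int = 14, merge: int = 2) -> int:
    """Tokens handed to the LLM for an image split into `num_tiles` 448x448 tiles."""
    patches_per_side = tile_size // patch_size        # 448 / 14 = 32
    patch_tokens = patches_per_side ** 2              # 32 * 32 = 1024 tokens per tile
    merged_tokens = patch_tokens // (merge * merge)   # 2x2 merge -> 1024 / 4 = 256 per tile
    return merged_tokens * num_tiles                  # "448x448xN": N tiles per image

if __name__ == "__main__":
    for n in (1, 4, 12):
        print(f"{n} tile(s): {visual_tokens(n)} visual tokens")
```

Under these assumptions, every 448x448 tile costs a fixed 256 visual tokens, so the prompt length grows linearly with the number of tiles N.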

### Training Recipes Overview:

@@ -36,7 +35,6 @@ Sail-VL benefits from high-quality data and carefully curated training recipes.



## Evaluation

SAIL-VL not only outperforms the Qwen2-VL and InternVL2 series of models of comparable size, but is also competitive with recently released SoTA models, Aquila and InternVL-2.5.
@@ -71,6 +69,7 @@ The result is evaluated by our team with a VLMEvalKit variant.

| DocVQA_VAL | 86.23 | 85.38 | 74.31 | 87.67 | 86.06 |
| TextVQA_VAL | 73.48 | 79.66 | 76.27 | 76.76 | 77.21 |

Details for average performance section:
- OpenCompass-Avg includes publicly available validation sets from OpenCompass: AI2D_TEST, HallusionBench, MMBench_DEV_CN_V11, MMBench_DEV_EN_V11, MME, MMMU_DEV_VAL, MMStar, MMVet, MathVista_MINI, evaluated by our team.
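
As a small illustration of how such an average can be assembled from per-benchmark scores, here is a hypothetical helper. It assumes OpenCompass-Avg is a plain unweighted mean over the nine listed validation sets, with every score already on a 0-100 scale; the team's exact normalization is not spelled out here.

```python
# Hypothetical helper: unweighted mean over the nine OpenCompass validation sets
# listed above (assumption: every score is already on a 0-100 scale).
OPENCOMPASS_SETS = [
    "AI2D_TEST", "HallusionBench", "MMBench_DEV_CN_V11", "MMBench_DEV_EN_V11",
    "MME", "MMMU_DEV_VAL", "MMStar", "MMVet", "MathVista_MINI",
]

def opencompass_avg(scores: dict[str, float]) -> float:
    """Average the nine benchmark scores; raise if any benchmark is missing."""
    missing = [name for name in OPENCOMPASS_SETS if name not in scores]
    if missing:
        raise ValueError(f"missing scores for: {missing}")
    return sum(scores[name] for name in OPENCOMPASS_SETS) / len(OPENCOMPASS_SETS)

# Example with made-up numbers, not results from the paper:
# opencompass_avg({name: 70.0 for name in OPENCOMPASS_SETS}) -> 70.0
```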

@@ -225,6 +224,14 @@ Our model is built upon numerous outstanding open-source projects, and we are gr

## Citation

```
@article{dong2025scalable,
  title={Scalable vision language model training via high quality data curation},
  author={Dong, Hongyuan and Kang, Zijian and Yin, Weijie and Liang, Xiao and Feng, Chao and Ran, Jiao},
  journal={arXiv preprint arXiv:2501.05952},
  year={2025}
}
```

```
@misc{sailvl,
  title = {SAIL-VL: Scalable Vision Language Model Training with High Quality Data Curation},
@@ -233,13 +240,11 @@
  month = {December},
  year = {2024}
}
```

## Contributions

This work is conducted by the Bytedance Douyin Content Team, authored by:

```
{Hongyuan Dong, Zijian Kang, Weijie Yin}, Xiao Liang, Chao Feng, Jiao Ran

{*} Equal Contributions.
```