zijian.kang committed · Commit 979314b · Parent(s): f59e1ec · fix readme

README.md CHANGED
In a word, SAIL-VL is a foundational VLM for vision-language applications. Welcome to explore its capabilities, and feel free to contact us with any questions or opportunities.

## News🚀🚀🚀

- 2025-2-19: 📖 We released our 8B model, check it out at [🤗SAIL-VL-8B](https://huggingface.co/BytedanceDouyinContent/SAIL-VL-8B)!
- 2025-1-10: 📖 We released our paper on arXiv: [Scalable Vision Language Model Training via High Quality Data Curation](https://arxiv.org/abs/2501.05952)
- 2024-12-25: 🚀 We ranked 1st on the [OpenCompass Multi-modal Leaderboard](https://rank.opencompass.org.cn/leaderboard-multimodal/?m=REALTIME) among models of 2B parameters.

| Architecture | ViT | LLM | Adapter | Token Merge | Resolution |
| --- | --- | --- | --- | --- | --- |
| [🤗SAIL-VL-2B](https://huggingface.co/BytedanceDouyinContent/SAIL-VL-2B) | [🤗InternViT-300M](https://huggingface.co/OpenGVLab/InternViT-300M-448px) | [🤗Qwen2.5-1.5B](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct) | 2-layer MLP | 2x2 | 448x448xN |
| [🤗SAIL-VL-8B](https://huggingface.co/BytedanceDouyinContent/SAIL-VL-8B) | [🤗InternViT-300M](https://huggingface.co/OpenGVLab/InternViT-300M-448px) | [🤗Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) | 2-layer MLP | 2x2 | 448x448xN |
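The table above fixes the vision token budget: each 448x448 tile is patchified by the ViT, then 2x2 token merging reduces the token count fourfold before the MLP adapter. A minimal sketch of that arithmetic, assuming a patch size of 14 as in the InternViT-300M-448px release (the patch size is our assumption, not stated in this table):

```python
# Vision-token budget per 448x448 tile, assuming a ViT patch size of 14
# (as in the InternViT-300M-448px release; not stated in the table above).

def vision_tokens(resolution: int = 448, patch: int = 14, merge: int = 2) -> int:
    """Tokens the LLM sees per image tile after token merging."""
    patches_per_side = resolution // patch  # 448 / 14 = 32
    vit_tokens = patches_per_side ** 2      # 32 * 32 = 1024 patch tokens
    return vit_tokens // (merge * merge)    # 2x2 merge -> 1024 / 4 = 256

print(vision_tokens())      # tokens per tile
print(vision_tokens() * 4)  # e.g. a 448x448x4 dynamic-resolution input
```

Under these assumptions, the "448x448xN" resolution column translates to 256 tokens per tile, so the LLM's vision context grows linearly with the number of tiles N.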
### Training Recipes Overview:

SAIL-VL benefits from high-quality data and carefully curated training recipes.

## Evaluation

SAIL-VL not only outperforms the Qwen2-VL and InternVL2 series of models of comparable size, but is also competitive with recently released SoTAs.

### Detail Evaluations:

| **Hallucination** | *68.7* | *67.5* | *69.7* | *65.3* |
| **General VQA** | | | | |
| MMStar | 64.2 | 58.3 | 65.3 | 57.7 |
| MMBench_DEV | 79.5 | 79.5 | 83.3 | 78.1 |
| MMMU_VAL | 48.2 | 50.9 | 52.8 | 47.6 |
| MME | 2244 | 2321 | 2321 | 2149 |
| SEEDBench_IMG | 75.5 | 75.3 | 76.9 | 76.8 |
| RealWorldQA | 71.9 | 69.7 | 70.2 | 70.2 |
| MMVET | 58.3 | 62.6 | 66.8 | 60.3 |
| **OCR VQA** | | | | |
| AI2D_TEST | 83.7 | 82.9 | 84.1 | 82.0 |
| DocVQA_Val | 92.2 | 93.7 | 92.1 | 92.3 |
| InfoVQA_Val | 75.2 | 75.9 | 76.2 | 72.5 |
| ChartQA_Test | 84.6 | 81.6 | 77.6 | 84.6 |
| TextVQA_Val | 77.7 | 83.8 | 79.2 | 83.3 |
| OCRVQA_Test | 61.4 | 56.2 | 36.7 | 54.5 |
| OCRBench | 835 | 833 | 880 | 834 |
| **Math&Knowledge** | | | | |
| MathVistaMini | 68.4 | 57.3 | 68.5 | 61.8 |
| ScienceQA_Val | 98.2 | 84.6 | 97.9 | 96.2 |
| **Hallucination** | | | | |
| HallucinationBench | 52.2 | 48.5 | 50.3 | 41.2 |
| POPE | 85.2 | 86.5 | 89.1 | 89.4 |

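The italicized Hallucination row near the top of the table is consistent with the mean of the two hallucination benchmarks in the detail rows below; a quick check (this relationship is our inference, the README does not state how the summary row is computed):

```python
# Per model column, the italic Hallucination summary row matches the mean of
# HallucinationBench and POPE from the detail table (our inference).
halluc_bench = [52.2, 48.5, 50.3, 41.2]  # HallucinationBench, per column
pope         = [85.2, 86.5, 89.1, 89.4]  # POPE, per column

averages = [round((h + p) / 2, 1) for h, p in zip(halluc_bench, pope)]
print(averages)  # matches the italic summary row: 68.7, 67.5, 69.7, 65.3
```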
## Demo Cases

We visualize some examples to show the capabilities of our model. It is able to give detailed and complex answers to a variety of questions.

| Image | Question | Answer |

Our model is built upon numerous outstanding open-source projects, and we are grateful for their contributions.

```
@misc{sailvl,
    title = {SAIL-VL: Scalable Vision Language Model Training with High Quality Data Curation},
    url = {https://huggingface.co/BytedanceDouyinContent/SAIL-VL-8B/},
    author = {Bytedance Douyin Content Team},
    month = {December},
    year = {2024}
}
```

```
@article{dong2025scalable,
    title={Scalable vision language model training via high quality data curation},
    author={Dong, Hongyuan and Kang, Zijian and Yin, Weijie and Liang, Xiao and Feng, Chao and Ran, Jiao},
    journal={arXiv preprint arXiv:2501.05952},
    year={2025}
}
```

## Contributions