zijian.kang committed · Commit 979314b · 1 Parent(s): f59e1ec

fix readme

Files changed (1): README.md +25 -17
README.md CHANGED
@@ -16,7 +16,7 @@ SAIL-VL is a state-of-the-art vision-language model (VLM) developed by the Byted
  In a word, SAIL-VL is a foundational VLM for vision-language applications. Welcome to explore its capabilities and feel free to contact us for any questions or opportunities.
 
  ## News🚀🚀🚀
- - 2024-2-19: 📖 We released our 8B model, checkout at [🤗SAIL-VL-8B](https://huggingface.co/BytedanceDouyinContent/SAIL-VL-8B) ~
+ - 2025-2-19: 📖 We released our 8B model, check it out at [🤗SAIL-VL-8B](https://huggingface.co/BytedanceDouyinContent/SAIL-VL-8B) ~
  - 2025-1-10: 📖 We released our paper on arXiv: [Scalable Vision Language Model Training via High Quality Data Curation
  ](https://arxiv.org/abs/2501.05952)
  - 2024-12-25: 🚀 We ranked 1st on the [OpenCompass Multi-modal Leaderboard](https://rank.opencompass.org.cn/leaderboard-multimodal/?m=REALTIME) among models of 2B parameters.
@@ -29,7 +29,7 @@ In a word, SAIL-VL is a foundational VLM for vision-language applications. Welco
  | Architecture | ViT | LLM | Adapter | Token Merge | Resolution |
  | --- | --- | --- | --- | --- | --- |
  | [🤗SAIL-VL-2B](https://huggingface.co/BytedanceDouyinContent/SAIL-VL-2B) | [🤗InternViT-300M](https://huggingface.co/OpenGVLab/InternViT-300M-448px) | [🤗Qwen2.5-1.5B](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct) | 2-layer MLP | 2x2 | 448x448xN |
- | [🤗SAIL-VL-8B](https://huggingface.co/BytedanceDouyinContent/SAIL-VL-8B) | [🤗InternViT-300M](https://huggingface.co/OpenGVLab/InternViT-300M-448px) | [🤗Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct) | 2-layer MLP | 2x2 | 448x448xN |
+ | [🤗SAIL-VL-8B](https://huggingface.co/BytedanceDouyinContent/SAIL-VL-8B) | [🤗InternViT-300M](https://huggingface.co/OpenGVLab/InternViT-300M-448px) | [🤗Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) | 2-layer MLP | 2x2 | 448x448xN |
 
  ### Training Recipes Overview:
 
@@ -40,7 +40,8 @@ Sail-VL benefits from high-quality data and carefully curated training recipes.
 
  ## Evaluation
 
- SAIL-VL not only outperforms the Qwen2-VL and InternVL2 series of models of comparable size, but is also competitive compared with recently released SoTAs.
+ SAIL-VL not only outperforms the Qwen2-VL and InternVL2
+ series of models of comparable size, but is also competitive with recently released SoTAs.
 
  ### Detailed Evaluations:
 
@@ -53,30 +54,30 @@ SAIL-VL not only outperforms the Qwen2-VL and InternVL2 series of models of comp
  | **Hallucination** | *68.7* | *67.5* | *69.7* | *65.3* |
  | **General VQA** | | | | |
  | MMStar | 64.2 | 58.3 | 65.3 | 57.7 |
- | MMBenchDEV | 79.5 | 79.5 | 83.3 | 78.1 |
- | MMUVAL | 48.2 | 50.9 | 52.8 | 47.6 |
+ | MMBench_DEV | 79.5 | 79.5 | 83.3 | 78.1 |
+ | MMMU_VAL | 48.2 | 50.9 | 52.8 | 47.6 |
  | MME | 2244 | 2321 | 2321 | 2149 |
- | SEEDBenchMG | 75.5 | 75.3 | 76.9 | 76.8 |
+ | SEEDBench_IMG | 75.5 | 75.3 | 76.9 | 76.8 |
  | RealWorldQA | 71.9 | 69.7 | 70.2 | 70.2 |
- | MMVQATest | 58.3 | 62.6 | 66.8 | 60.3 |
+ | MMVET | 58.3 | 62.6 | 66.8 | 60.3 |
  | **OCR VQA** | | | | |
- | AI2DTEST | 83.7 | 82.9 | 84.1 | 82.0 |
- | DocVQAVal | 92.2 | 93.7 | 92.1 | 92.3 |
- | InfoVQAVal | 75.2 | 75.9 | 76.2 | 72.5 |
- | ChartQATest | 84.6 | 81.6 | 77.6 | 84.6 |
- | TextVQAVal | 77.7 | 83.8 | 79.2 | 83.3 |
- | OCRVQATest | 61.4 | 56.2 | 36.7 | 54.5 |
+ | AI2D_TEST | 83.7 | 82.9 | 84.1 | 82.0 |
+ | DocVQA_Val | 92.2 | 93.7 | 92.1 | 92.3 |
+ | InfoVQA_Val | 75.2 | 75.9 | 76.2 | 72.5 |
+ | ChartQA_Test | 84.6 | 81.6 | 77.6 | 84.6 |
+ | TextVQA_Val | 77.7 | 83.8 | 79.2 | 83.3 |
+ | OCRVQA_Test | 61.4 | 56.2 | 36.7 | 54.5 |
  | OCRBench | 835 | 833 | 880 | 834 |
  | **Math&Knowledge** | | | | |
  | MathVistaMini | 68.4 | 57.3 | 68.5 | 61.8 |
- | ScienceQAVal | 98.2 | 84.6 | 97.9 | 96.2 |
+ | ScienceQA_Val | 98.2 | 84.6 | 97.9 | 96.2 |
  | **Hallucination** | | | | |
  | HallucinationBench | 52.2 | 48.5 | 50.3 | 41.2 |
  | POPE | 85.2 | 86.5 | 89.1 | 89.4 |
 
 
  ## Demo Cases
- We visualize some of examples from LLaVA-Bench to show the capabilities of our model. Our model is able to give detail and complex answer for a variety of questions.
+ We visualize some examples to show the capabilities of our model. Our model is able to give detailed and complex answers to a variety of questions.
 
 
  | Image | Question | Answer |
@@ -226,12 +227,19 @@ Our model is built upon numerous outstanding open-source projects, and we are gr
  @misc{
  sailvl,
  title = {SAIL-VL: Scalable Vision Language Model Training with High Quality Data Curation},
- url = {https://huggingface.co/BytedanceDouyinContent/SAIL-VL-2B/},
+ url = {https://huggingface.co/BytedanceDouyinContent/SAIL-VL-8B/},
  author = {Bytedance Douyin Content Team},
  month = {December},
  year = {2024}
  }
- 
  ```
+ ```
+ @article{dong2025scalable,
+ title={Scalable vision language model training via high quality data curation},
+ author={Dong, Hongyuan and Kang, Zijian and Yin, Weijie and Liang, Xiao and Feng, Chao and Ran, Jiao},
+ journal={arXiv preprint arXiv:2501.05952},
+ year={2025}
+ }
+ ```
 
  ## Contributions
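
As a side note on the architecture table touched by this diff: the "2-layer MLP" adapter with a "2x2" token merge can be sketched in a few lines of NumPy. This is a minimal illustration, not the released implementation; the shapes assume InternViT-300M patchifies a 448x448 tile into a 32x32 token grid, and the toy hidden sizes and ReLU activation are stand-ins chosen only for the example.

```python
import numpy as np

def merge_2x2(tokens: np.ndarray, grid: int = 32) -> np.ndarray:
    """Space-to-depth merge: concatenate each 2x2 neighborhood of tokens along
    the channel axis, so 1024 tokens become 256 with 4x the channel width."""
    n, c = tokens.shape
    assert n == grid * grid, "expects a square token grid in row-major order"
    x = tokens.reshape(grid // 2, 2, grid // 2, 2, c)  # carve grid into 2x2 blocks
    x = x.transpose(0, 2, 1, 3, 4)                     # gather each block's 4 tokens
    return x.reshape((grid // 2) ** 2, 4 * c)

def mlp_adapter(x, w1, b1, w2, b2):
    """2-layer MLP projecting merged visual tokens into the LLM embedding space
    (ReLU here is a stand-in for whatever activation the real adapter uses)."""
    return np.maximum(x @ w1 + b1, 0.0) @ w2 + b2

# Toy dimensions for illustration only.
vit_dim, llm_dim = 64, 96
rng = np.random.default_rng(0)
tokens = rng.standard_normal((32 * 32, vit_dim))   # one 448x448 tile's tokens
merged = merge_2x2(tokens)                         # (256, 4 * vit_dim)
w1 = rng.standard_normal((4 * vit_dim, llm_dim)) * 0.02
b1 = np.zeros(llm_dim)
w2 = rng.standard_normal((llm_dim, llm_dim)) * 0.02
b2 = np.zeros(llm_dim)
out = mlp_adapter(merged, w1, b1, w2, b2)          # (256, llm_dim) for the LLM
```

With N tiles per image ("448x448xN" in the table), this would run per tile, keeping the visual-token budget at 256 per tile instead of 1024.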