mfarre (HF staff) committed
Commit ac099c8 · verified · 1 Parent(s): 459eb79

Update README.md

Files changed (1): README.md (+57 -2)
README.md CHANGED
@@ -203,12 +203,67 @@ You can cite us in the following way:
   url = {https://huggingface.co/blog/smolvlm2}
 }
 ```
+
 ## Training Data
 SmolVLM2 used 3.3M samples for training, originally from ten different datasets: [LlaVa Onevision](https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data), [M4-Instruct](https://huggingface.co/datasets/lmms-lab/M4-Instruct-Data), [Mammoth](https://huggingface.co/datasets/MAmmoTH-VL/MAmmoTH-VL-Instruct-12M), [LlaVa Video 178K](https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K), [FineVideo](https://huggingface.co/datasets/HuggingFaceFV/finevideo), [VideoStar](https://huggingface.co/datasets/orrzohar/Video-STaR), [VRipt](https://huggingface.co/datasets/Mutonix/Vript), [Vista-400K](https://huggingface.co/datasets/TIGER-Lab/VISTA-400K), [MovieChat](https://huggingface.co/datasets/Enxin/MovieChat-1K_train) and [ShareGPT4Video](https://huggingface.co/datasets/ShareGPT4Video/ShareGPT4Video).
 In the following plots we give a general overview of the samples across modalities and the source of those samples.
-
+ <!--
 <center><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/smolvlm2_data_split.png" width="auto" height="auto" alt="Image description">
 </center>
 
 ### Details
- <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/smolvlm2_datadetails.png" width="auto" height="auto" alt="Image description">
+ <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/smolvlm2_datadetails.png" width="auto" height="auto" alt="Image description"> -->
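All ten source datasets listed under "## Training Data" above are hosted on the Hugging Face Hub. As an illustrative sketch (not part of this commit), one of them can typically be streamed with the `datasets` library; note that several of these repos are gated or split into multiple configs, so you may need to accept access terms on the Hub and/or pass a config name.

```python
from datasets import load_dataset

# Stream one of the source datasets instead of downloading it in full.
# FineVideo is used here purely as an example; other repos in the list may
# require accepting access terms on the Hub or an explicit config name.
ds = load_dataset("HuggingFaceFV/finevideo", split="train", streaming=True)

sample = next(iter(ds))   # first training sample
print(sample.keys())      # inspect the available fields
```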
+
+ ## Data Split per modality
+
+ | Data Type | Percentage |
+ |--------------|------------|
+ | Image | 34.4% |
+ | Text | 20.2% |
+ | Video | 33.0% |
+ | Multi-image | 12.3% |
+
+
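To put the split above in absolute terms, here is a rough back-of-the-envelope conversion of the rounded percentages against the 3.3M total; this is only an approximation, since the published shares sum to about 99.9%.

```python
# Approximate per-modality sample counts implied by the table above.
# The shares are rounded to one decimal place, so they sum to ~99.9%.
total_samples = 3_300_000
split = {"Image": 34.4, "Text": 20.2, "Video": 33.0, "Multi-image": 12.3}

for modality, share in split.items():
    print(f"{modality}: ~{round(total_samples * share / 100):,} samples")
```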
+ ## Granular dataset slices per modality
+
+ ### Text Datasets
+ | Dataset | Percentage |
+ |--------------------------------------------|------------|
+ | llava-onevision/magpie_pro_ft3_80b_mt | 6.8% |
+ | llava-onevision/magpie_pro_ft3_80b_tt | 6.8% |
+ | llava-onevision/magpie_pro_qwen2_72b_tt | 5.8% |
+ | llava-onevision/mathqa | 0.9% |
+
+ ### Multi-image Datasets
+ | Dataset | Percentage |
+ |--------------------------------------------|------------|
+ | m4-instruct-data/m4_instruct_multiimage | 10.4% |
+ | mammoth/multiimage-cap6 | 1.9% |
+
+ ### Image Datasets
+ | Dataset | Percentage |
+ |--------------------------------------------|------------|
+ | llava-onevision/other | 17.4% |
+ | llava-onevision/vision_flan | 3.9% |
+ | llava-onevision/mavis_math_metagen | 2.6% |
+ | llava-onevision/mavis_math_rule_geo | 2.5% |
+ | llava-onevision/sharegpt4o | 1.7% |
+ | llava-onevision/sharegpt4v_coco | 1.5% |
+ | llava-onevision/image_textualization | 1.3% |
+ | llava-onevision/sharegpt4v_llava | 0.9% |
+ | llava-onevision/mapqa | 0.9% |
+ | llava-onevision/qa | 0.8% |
+ | llava-onevision/textocr | 0.8% |
+
+ ### Video Datasets
+ | Dataset | Percentage |
+ |--------------------------------------------|------------|
+ | llava-video-178k/1-2m | 7.3% |
+ | llava-video-178k/2-3m | 7.0% |
+ | other-video/combined | 5.7% |
+ | llava-video-178k/hound | 4.4% |
+ | llava-video-178k/0-30s | 2.4% |
+ | video-star/starb | 2.2% |
+ | vista-400k/combined | 2.2% |
+ | vript/long | 1.0% |
+ | ShareGPT4Video/all | 0.8% |
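As a quick consistency check (again an illustrative sketch, not part of the commit), the granular slices can be summed per modality and compared against the "Data Split per modality" table; small gaps of about 0.1 percentage point are expected from rounding.

```python
# Sum the granular dataset slices per modality and compare with the
# per-modality split reported earlier; values are rounded to one decimal
# place, so totals may differ slightly.
slices = {
    "Text":        [6.8, 6.8, 5.8, 0.9],
    "Multi-image": [10.4, 1.9],
    "Image":       [17.4, 3.9, 2.6, 2.5, 1.7, 1.5, 1.3, 0.9, 0.9, 0.8, 0.8],
    "Video":       [7.3, 7.0, 5.7, 4.4, 2.4, 2.2, 2.2, 1.0, 0.8],
}
reported = {"Text": 20.2, "Multi-image": 12.3, "Image": 34.4, "Video": 33.0}

for modality, parts in slices.items():
    print(f"{modality}: slices sum to {sum(parts):.1f}% vs. {reported[modality]}% reported")
```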