mfarre (HF staff) committed
Commit ac099c8 · verified · 1 Parent(s): 459eb79

Update README.md

Files changed (1): README.md (+57 -2)
README.md CHANGED
@@ -203,12 +203,67 @@ You can cite us in the following way:
   url = {https://huggingface.co/blog/smolvlm2}
 }
 ```
+
 ## Training Data
 SmolVLM2 used 3.3M samples for training, originally from ten different datasets: [LlaVa Onevision](https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data), [M4-Instruct](https://huggingface.co/datasets/lmms-lab/M4-Instruct-Data), [Mammoth](https://huggingface.co/datasets/MAmmoTH-VL/MAmmoTH-VL-Instruct-12M), [LlaVa Video 178K](https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K), [FineVideo](https://huggingface.co/datasets/HuggingFaceFV/finevideo), [VideoStar](https://huggingface.co/datasets/orrzohar/Video-STaR), [VRipt](https://huggingface.co/datasets/Mutonix/Vript), [Vista-400K](https://huggingface.co/datasets/TIGER-Lab/VISTA-400K), [MovieChat](https://huggingface.co/datasets/Enxin/MovieChat-1K_train) and [ShareGPT4Video](https://huggingface.co/datasets/ShareGPT4Video/ShareGPT4Video).
 In the following plots we give a general overview of the samples across modalities and the source of those samples.
-
+ <!--
 <center><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/smolvlm2_data_split.png" width="auto" height="auto" alt="Image description">
 </center>
 
 ### Details
- <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/smolvlm2_datadetails.png" width="auto" height="auto" alt="Image description">
+ <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/smolvlm2_datadetails.png" width="auto" height="auto" alt="Image description"> -->
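All ten source datasets listed under "## Training Data" above are hosted on the Hugging Face Hub. As an illustrative sketch (not part of this commit), one of them can typically be streamed with the `datasets` library; note that several of these repos are gated or split into multiple configs, so you may need to accept access terms on the Hub and/or pass a config name.

```python
from datasets import load_dataset

# Stream one of the source datasets instead of downloading it in full.
# FineVideo is used here purely as an example; other repos in the list may
# require accepting access terms on the Hub or an explicit config name.
ds = load_dataset("HuggingFaceFV/finevideo", split="train", streaming=True)

sample = next(iter(ds))   # first training sample
print(sample.keys())      # inspect the available fields
```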
+
+ ## Data Split per modality
+
+ | Data Type | Percentage |
+ |--------------|------------|
+ | Image | 34.4% |
+ | Text | 20.2% |
+ | Video | 33.0% |
+ | Multi-image | 12.3% |
+
+
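To put the split above in absolute terms, here is a rough back-of-the-envelope conversion of the rounded percentages against the 3.3M total; this is only an approximation, since the published shares sum to about 99.9%.

```python
# Approximate per-modality sample counts implied by the table above.
# The shares are rounded to one decimal place, so they sum to ~99.9%.
total_samples = 3_300_000
split = {"Image": 34.4, "Text": 20.2, "Video": 33.0, "Multi-image": 12.3}

for modality, share in split.items():
    print(f"{modality}: ~{round(total_samples * share / 100):,} samples")
```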
+ ## Granular dataset slices per modality
+
+ ### Text Datasets
+ | Dataset | Percentage |
+ |--------------------------------------------|------------|
+ | llava-onevision/magpie_pro_ft3_80b_mt | 6.8% |
+ | llava-onevision/magpie_pro_ft3_80b_tt | 6.8% |
+ | llava-onevision/magpie_pro_qwen2_72b_tt | 5.8% |
+ | llava-onevision/mathqa | 0.9% |
+
+ ### Multi-image Datasets
+ | Dataset | Percentage |
+ |--------------------------------------------|------------|
+ | m4-instruct-data/m4_instruct_multiimage | 10.4% |
+ | mammoth/multiimage-cap6 | 1.9% |
+
+ ### Image Datasets
+ | Dataset | Percentage |
+ |--------------------------------------------|------------|
+ | llava-onevision/other | 17.4% |
+ | llava-onevision/vision_flan | 3.9% |
+ | llava-onevision/mavis_math_metagen | 2.6% |
+ | llava-onevision/mavis_math_rule_geo | 2.5% |
+ | llava-onevision/sharegpt4o | 1.7% |
+ | llava-onevision/sharegpt4v_coco | 1.5% |
+ | llava-onevision/image_textualization | 1.3% |
+ | llava-onevision/sharegpt4v_llava | 0.9% |
+ | llava-onevision/mapqa | 0.9% |
+ | llava-onevision/qa | 0.8% |
+ | llava-onevision/textocr | 0.8% |
+
+ ### Video Datasets
+ | Dataset | Percentage |
+ |--------------------------------------------|------------|
+ | llava-video-178k/1-2m | 7.3% |
+ | llava-video-178k/2-3m | 7.0% |
+ | other-video/combined | 5.7% |
+ | llava-video-178k/hound | 4.4% |
+ | llava-video-178k/0-30s | 2.4% |
+ | video-star/starb | 2.2% |
+ | vista-400k/combined | 2.2% |
+ | vript/long | 1.0% |
+ | ShareGPT4Video/all | 0.8% |
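As a quick consistency check (again an illustrative sketch, not part of the commit), the granular slices can be summed per modality and compared against the "Data Split per modality" table; small gaps of about 0.1 percentage point are expected from rounding.

```python
# Sum the granular dataset slices per modality and compare with the
# per-modality split reported earlier; values are rounded to one decimal
# place, so totals may differ slightly.
slices = {
    "Text":        [6.8, 6.8, 5.8, 0.9],
    "Multi-image": [10.4, 1.9],
    "Image":       [17.4, 3.9, 2.6, 2.5, 1.7, 1.5, 1.3, 0.9, 0.9, 0.8, 0.8],
    "Video":       [7.3, 7.0, 5.7, 4.4, 2.4, 2.2, 2.2, 1.0, 0.8],
}
reported = {"Text": 20.2, "Multi-image": 12.3, "Image": 34.4, "Video": 33.0}

for modality, parts in slices.items():
    print(f"{modality}: slices sum to {sum(parts):.1f}% vs. {reported[modality]}% reported")
```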