# Dataset Prepare

- [Dataset Prepare](#dataset-prepare)
  - [HuggingFace datasets](#huggingface-datasets)
  - [Others](#others)
    - [Arxiv Gentitle](#arxiv-gentitle)
    - [MOSS-003-SFT](#moss-003-sft)
    - [Chinese Lawyer](#chinese-lawyer)
    - [LLaVA dataset](#llava-dataset)
      - [File structure](#file-structure)
      - [Pretrain](#pretrain)
      - [Finetune](#finetune)
    - [RefCOCO dataset](#refcoco-dataset)
      - [File structure](#file-structure-1)

## HuggingFace datasets

Datasets hosted on the HuggingFace Hub, such as [alpaca](https://huggingface.co/datasets/tatsu-lab/alpaca), can be used directly. For more details, please refer to [single_turn_conversation.md](./single_turn_conversation.md) and [multi_turn_conversation.md](./multi_turn_conversation.md).

## Others

### Arxiv Gentitle

The Arxiv dataset is not released on the HuggingFace Hub, but you can download it from Kaggle.

**Step 0**, download the raw data from https://kaggle.com/datasets/Cornell-University/arxiv.

**Step 1**, process the data with `xtuner preprocess arxiv ${DOWNLOADED_DATA} ${SAVE_DATA_PATH} [optional arguments]`.

For example, to get all `cs.AI`, `cs.CL`, and `cs.CV` papers from `2020-01-01` onwards:

```shell
xtuner preprocess arxiv ${DOWNLOADED_DATA} ${SAVE_DATA_PATH} --categories cs.AI cs.CL cs.CV --start-date 2020-01-01
```

**Step 2**, all Arxiv Gentitle configs assume the dataset path to be `./data/arxiv_data.json`. You can either move and rename your data, or change the path in these configs.

### MOSS-003-SFT

The MOSS-003-SFT dataset can be downloaded from https://huggingface.co/datasets/fnlp/moss-003-sft-data.

**Step 0**, download the data.

```shell
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/datasets/fnlp/moss-003-sft-data
```

**Step 1**, unzip the archives.

```shell
cd moss-003-sft-data
unzip moss-003-sft-no-tools.jsonl.zip
unzip moss-003-sft-with-tools-no-text2image.zip
```

**Step 2**, all moss-003-sft configs assume the dataset paths to be `./data/moss-003-sft-no-tools.jsonl` and `./data/conversations_with_tools_with_inner_instruction_no_text2image_train_all_random_meta0.5_0.1_0.01_moss_0709.jsonl`. You can either move and rename your data, or change the paths in these configs.

### Chinese Lawyer

The Chinese Lawyer dataset consists of two sub-datasets, which can be downloaded from https://github.com/LiuHC0428/LAW-GPT.

All lawyer configs assume the dataset paths to be `./data/CrimeKgAssitant清洗后_52k.json` and `./data/训练数据_带法律依据_92k.json`. You can either move and rename your data, or change the paths in these configs.

### LLaVA dataset

#### File structure

```
./data/llava_data
├── LLaVA-Pretrain
│   ├── blip_laion_cc_sbu_558k.json
│   ├── blip_laion_cc_sbu_558k_meta.json
│   └── images
├── LLaVA-Instruct-150K
│   └── llava_v1_5_mix665k.json
└── llava_images
    ├── coco
    │   └── train2017
    ├── gqa
    │   └── images
    ├── ocr_vqa
    │   └── images
    ├── textvqa
    │   └── train_images
    └── vg
        ├── VG_100K
        └── VG_100K_2
```

#### Pretrain

LLaVA-Pretrain

```shell
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain --depth=1
```

#### Finetune

1. Text data

   1. LLaVA-Instruct-150K

      ```shell
      # Make sure you have git-lfs installed (https://git-lfs.com)
      git lfs install
      git clone https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K --depth=1
      ```

2. Image data

   1. COCO (coco): [train2017](http://images.cocodataset.org/zips/train2017.zip)
   2. GQA (gqa): [images](https://downloads.cs.stanford.edu/nlp/data/gqa/images.zip)
   3. OCR-VQA (ocr_vqa): [download script](https://drive.google.com/drive/folders/1_GYPY5UkUy7HIcR0zq3ZCFgeZN7BAfm_?usp=sharing)

      1. ⚠️ Rename OCR-VQA's images so that they all use the `.jpg` extension!

         ```shell
         #!/bin/bash
         # Set this to the directory that contains the downloaded OCR-VQA images
         ocr_vqa_path=""

         find "$ocr_vqa_path" -type f | while read file; do
             extension="${file##*.}"
             if [ "$extension" != "jpg" ]
             then
                 cp -- "$file" "${file%.*}.jpg"
             fi
         done
         ```

   4. TextVQA (textvqa): [train_val_images](https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip)

   5. VisualGenome (VG): [part1](https://cs.stanford.edu/people/rak248/VG_100K_2/images.zip), [part2](https://cs.stanford.edu/people/rak248/VG_100K_2/images2.zip)

### RefCOCO dataset

#### File structure

```
./data
├── refcoco_annotations
│   ├── refcoco
│   │   ├── instances.json
│   │   ├── refs(google).p
│   │   └── refs(unc).p
│   ├── refcoco+
│   │   ├── instances.json
│   │   └── refs(unc).p
│   └── refcocog
│       ├── instances.json
│       ├── refs(google).p
│       └── refs(umd).p
├── coco_images
│   ├── *.jpg
...
```

Download the RefCOCO, RefCOCO+, and RefCOCOg annotation files using the links below. Both COCO train2017 and train2014 images are valid for `coco_images`.

| Image source | Download path |
| ------------ | :-----------: |
| RefCOCO      |  annotations  |
| RefCOCO+     |  annotations  |
| RefCOCOg     |  annotations  |

After downloading the annotations, unzip the files and place them in the `./data/refcoco_annotations` directory. Then, convert the annotations to json format using the command below. This command saves the converted json files in the `./data/llava_data/RefCOCOJson/` directory.

```shell
xtuner preprocess refcoco --ann-path $RefCOCO_ANN_PATH --image-path $COCO_IMAGE_PATH \
--save-path $SAVE_PATH # ./data/llava_data/RefCOCOJson/
```
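For reference, the end-to-end RefCOCO preparation could look like the minimal sketch below. It assumes `$RefCOCO_ANN_PATH` refers to the `./data/refcoco_annotations` directory and `$COCO_IMAGE_PATH` to `./data/coco_images` from the file structure above; the annotation archive names (`refcoco.zip`, `refcoco+.zip`, `refcocog.zip`) are assumptions, so adjust them to match the files you actually downloaded.

```shell
# A minimal sketch of the RefCOCO preparation flow described above.
# The archive names below are assumptions; rename them to match your downloads.
mkdir -p ./data/refcoco_annotations ./data/coco_images

# Unzip the downloaded annotation archives into ./data/refcoco_annotations
for ann in refcoco refcoco+ refcocog; do
    unzip -o "${ann}.zip" -d ./data/refcoco_annotations
done

# Place (or symlink) COCO train2017 / train2014 images under ./data/coco_images
# ln -s /path/to/coco/train2017/*.jpg ./data/coco_images/

# Convert the annotations to json; results are saved to ./data/llava_data/RefCOCOJson/
xtuner preprocess refcoco \
  --ann-path ./data/refcoco_annotations \
  --image-path ./data/coco_images \
  --save-path ./data/llava_data/RefCOCOJson/
```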