# Dataset Prepare
- [Dataset Prepare](#dataset-prepare)
- [HuggingFace datasets](#huggingface-datasets)
- [Others](#others)
- [Arxiv Gentitle](#arxiv-gentitle)
- [MOSS-003-SFT](#moss-003-sft)
- [Chinese Lawyer](#chinese-lawyer)
- [LLaVA dataset](#llava-dataset)
- [File structure](#file-structure)
- [Pretrain](#pretrain)
- [Finetune](#finetune)
- [RefCOCO dataset](#refcoco-dataset)
- [File structure](#file-structure-1)
## HuggingFace datasets
For datasets on HuggingFace Hub, such as [alpaca](https://huggingface.co/datasets/tatsu-lab/alpaca), you can quickly utilize them. For more details, please refer to [single_turn_conversation.md](./single_turn_conversation.md) and [multi_turn_conversation.md](./multi_turn_conversation.md).
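For example, with an XTuner config that already points to the alpaca dataset, fine-tuning can be launched directly. A minimal sketch (the pattern filter and the exact config name are illustrative and may differ across XTuner versions):
```shell
# List built-in configs related to alpaca
xtuner list-cfg -p alpaca
# Copy one locally, adjust the dataset path or prompt template if needed, then train
xtuner copy-cfg internlm_7b_qlora_alpaca_e3 .
xtuner train ./internlm_7b_qlora_alpaca_e3_copy.py
```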
## Others
### Arxiv Gentitle
The Arxiv dataset is not released on the HuggingFace Hub, but you can download it from Kaggle.
**Step 0**, download raw data from https://kaggle.com/datasets/Cornell-University/arxiv.
**Step 1**, process data by `xtuner preprocess arxiv ${DOWNLOADED_DATA} ${SAVE_DATA_PATH} [optional arguments]`.
For example, get all `cs.AI`, `cs.CL`, `cs.CV` papers from `2020-01-01`:
```shell
xtuner preprocess arxiv ${DOWNLOADED_DATA} ${SAVE_DATA_PATH} --categories cs.AI cs.CL cs.CV --start-date 2020-01-01
```
**Step 2**, all Arxiv Gentitle configs assume the dataset path to be `./data/arxiv_data.json`. You can move and rename your data, or make changes to these configs.
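For example, assuming Step 1 saved the processed data to `${SAVE_DATA_PATH}`, it can be placed at the expected path like this:
```shell
mkdir -p ./data
mv ${SAVE_DATA_PATH} ./data/arxiv_data.json
```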
### MOSS-003-SFT
The MOSS-003-SFT dataset can be downloaded from https://huggingface.co/datasets/fnlp/moss-003-sft-data.
**Step 0**, download data.
```shell
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/datasets/fnlp/moss-003-sft-data
```
**Step 1**, unzip.
```shell
cd moss-003-sft-data
unzip moss-003-sft-no-tools.jsonl.zip
unzip moss-003-sft-with-tools-no-text2image.zip
```
**Step 2**, all moss-003-sft configs assume the dataset path to be `./data/moss-003-sft-no-tools.jsonl` and `./data/conversations_with_tools_with_inner_instruction_no_text2image_train_all_random_meta0.5_0.1_0.01_moss_0709.jsonl`. You can move and rename your data, or make changes to these configs.
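For example, assuming the repository was cloned and unzipped in the current directory (the exact paths of the unzipped `.jsonl` files may differ depending on the archive layout):
```shell
mkdir -p ./data
mv moss-003-sft-data/moss-003-sft-no-tools.jsonl ./data/
mv moss-003-sft-data/conversations_with_tools_with_inner_instruction_no_text2image_train_all_random_meta0.5_0.1_0.01_moss_0709.jsonl ./data/
```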
### Chinese Lawyer
The Chinese Lawyer dataset consists of two sub-datasets, which can be downloaded from https://github.com/LiuHC0428/LAW-GPT.
All lawyer configs assume the dataset path to be `./data/CrimeKgAssitant清洗后_52k.json` and `./data/训练数据_带法律依据_92k.json`. You can move and rename your data, or make changes to these configs.
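For example, assuming the two JSON files have already been downloaded from the repository above into the current directory:
```shell
mkdir -p ./data
mv CrimeKgAssitant清洗后_52k.json 训练数据_带法律依据_92k.json ./data/
```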
### LLaVA dataset
#### File structure
```
./data/llava_data
├── LLaVA-Pretrain
│   ├── blip_laion_cc_sbu_558k.json
│   ├── blip_laion_cc_sbu_558k_meta.json
│   └── images
├── LLaVA-Instruct-150K
│   └── llava_v1_5_mix665k.json
└── llava_images
    ├── coco
    │   └── train2017
    ├── gqa
    │   └── images
    ├── ocr_vqa
    │   └── images
    ├── textvqa
    │   └── train_images
    └── vg
        ├── VG_100K
        └── VG_100K_2
```
#### Pretrain
LLaVA-Pretrain
```shell
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain --depth=1
```
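After cloning, the dataset folder can be moved into the layout shown above, for example:
```shell
mkdir -p ./data/llava_data
mv LLaVA-Pretrain ./data/llava_data/
```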
#### Finetune
1. Text data
1. LLaVA-Instruct-150K
```shell
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K --depth=1
```
2. Image data
1. COCO (coco): [train2017](http://images.cocodataset.org/zips/train2017.zip)
2. GQA (gqa): [images](https://downloads.cs.stanford.edu/nlp/data/gqa/images.zip)
3. OCR-VQA (ocr_vqa): [download script](https://drive.google.com/drive/folders/1_GYPY5UkUy7HIcR0zq3ZCFgeZN7BAfm_?usp=sharing)
1. ⚠️ Rename OCR-VQA's images so that every file keeps the `.jpg` extension! A script for this is given below.
```shell
#!/bin/bash
# Copy every non-.jpg image in the OCR-VQA directory to a .jpg file with the same base name
ocr_vqa_path="<your-directory-path>"
find "$ocr_vqa_path" -type f | while read -r file; do
    extension="${file##*.}"
    if [ "$extension" != "jpg" ]
    then
        cp -- "$file" "${file%.*}.jpg"
    fi
done
```
4. TextVQA (textvqa): [train_val_images](https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip)
5. VisualGenome (VG): [part1](https://cs.stanford.edu/people/rak248/VG_100K_2/images.zip), [part2](https://cs.stanford.edu/people/rak248/VG_100K_2/images2.zip)
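The image archives can then be downloaded and unpacked into the layout shown in the file structure above. A minimal sketch using the URLs listed above (the extracted folder names may need renaming to match the tree, e.g. `gqa/images`, `textvqa/train_images`):
```shell
cd ./data/llava_data
mkdir -p llava_images/coco llava_images/gqa llava_images/ocr_vqa llava_images/textvqa llava_images/vg

# COCO train2017
wget http://images.cocodataset.org/zips/train2017.zip
unzip train2017.zip -d llava_images/coco

# GQA images
wget https://downloads.cs.stanford.edu/nlp/data/gqa/images.zip -O gqa_images.zip
unzip gqa_images.zip -d llava_images/gqa

# TextVQA train_val_images
wget https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip
unzip train_val_images.zip -d llava_images/textvqa

# Visual Genome, both parts
wget https://cs.stanford.edu/people/rak248/VG_100K_2/images.zip -O vg_part1.zip
wget https://cs.stanford.edu/people/rak248/VG_100K_2/images2.zip -O vg_part2.zip
unzip vg_part1.zip -d llava_images/vg
unzip vg_part2.zip -d llava_images/vg

# OCR-VQA images are fetched with the Google Drive download script linked above,
# then placed under llava_images/ocr_vqa/images
```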
### RefCOCO dataset
#### File structure
```
./data
├── refcoco_annotations
│   ├── refcoco
│   │   ├── instances.json
│   │   ├── refs(google).p
│   │   └── refs(unc).p
│   ├── refcoco+
│   │   ├── instances.json
│   │   └── refs(unc).p
│   └── refcocog
│       ├── instances.json
│       ├── refs(google).p
│       └── refs(umd).p
├── coco_images
│   └── *.jpg
...
```
Download the RefCOCO, RefCOCO+, and RefCOCOg annotation files using the links below.
Both COCO train2017 and train2014 images are valid for `coco_images`.
| Image source | Download path |
| ------------ | :------------------------------------------------------------------------------------------: |
| RefCOCO | <a href="https://bvisionweb1.cs.unc.edu/licheng/referit/data/refcoco.zip"> annotations </a> |
| RefCOCO+ | <a href="https://bvisionweb1.cs.unc.edu/licheng/referit/data/refcoco+.zip"> annotations </a> |
| RefCOCOg | <a href="https://bvisionweb1.cs.unc.edu/licheng/referit/data/refcocog.zip"> annotations </a> |
After downloading the annotations, unzip the files and place them in the `./data/refcoco_annotations` directory.
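For example, using the annotation URLs from the table above (a sketch; each archive is expected to unpack into its own `refcoco*` folder, matching the file structure above):
```shell
mkdir -p ./data/refcoco_annotations
cd ./data/refcoco_annotations
wget https://bvisionweb1.cs.unc.edu/licheng/referit/data/refcoco.zip
wget https://bvisionweb1.cs.unc.edu/licheng/referit/data/refcoco+.zip
wget https://bvisionweb1.cs.unc.edu/licheng/referit/data/refcocog.zip
unzip refcoco.zip && unzip refcoco+.zip && unzip refcocog.zip
```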
Then, convert the annotations to JSON format using the command below. This command saves the converted JSON files in the `./data/llava_data/RefCOCOJson/` directory.
```shell
xtuner preprocess refcoco --ann-path $RefCOCO_ANN_PATH --image-path $COCO_IMAGE_PATH \
--save-path $SAVE_PATH # ./data/llava_data/RefCOCOJson/
```
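For example, with the directory layout shown above, the command would look like this:
```shell
xtuner preprocess refcoco --ann-path ./data/refcoco_annotations \
    --image-path ./data/coco_images \
    --save-path ./data/llava_data/RefCOCOJson/
```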