# Dataset Prepare

- [Dataset Prepare](#dataset-prepare)
  - [HuggingFace datasets](#huggingface-datasets)
  - [Others](#others)
    - [Arxiv Gentitle](#arxiv-gentitle)
    - [MOSS-003-SFT](#moss-003-sft)
    - [Chinese Lawyer](#chinese-lawyer)
    - [LLaVA dataset](#llava-dataset)
      - [File structure](#file-structure)
      - [Pretrain](#pretrain)
      - [Finetune](#finetune)
    - [RefCOCO dataset](#refcoco-dataset)
      - [File structure](#file-structure-1)

## HuggingFace datasets

Datasets hosted on the HuggingFace Hub, such as [alpaca](https://huggingface.co/datasets/tatsu-lab/alpaca), can be used directly. For more details, please refer to [single_turn_conversation.md](./single_turn_conversation.md) and [multi_turn_conversation.md](./multi_turn_conversation.md).
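
For instance, a quick sanity check (assuming the `datasets` package is installed) that the alpaca dataset is reachable from your environment:

```shell
python -c "from datasets import load_dataset; print(load_dataset('tatsu-lab/alpaca', split='train')[0])"
```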

## Others

### Arxiv Gentitle

The Arxiv dataset is not released on HuggingFace Hub, but it can be downloaded from Kaggle.

**Step 0**, download raw data from https://kaggle.com/datasets/Cornell-University/arxiv.
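
If you prefer the command line, the Kaggle CLI can fetch the same archive. This sketch assumes a configured Kaggle API token; the dataset slug is taken from the URL above:

```shell
pip install kaggle
kaggle datasets download -d Cornell-University/arxiv
unzip arxiv.zip  # extracts the arXiv metadata snapshot (a single large JSON file)
```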

**Step 1**, process data by `xtuner preprocess arxiv ${DOWNLOADED_DATA} ${SAVE_DATA_PATH} [optional arguments]`.

For example, to get all `cs.AI`, `cs.CL`, and `cs.CV` papers from `2020-01-01` onward:

```shell
xtuner preprocess arxiv ${DOWNLOADED_DATA} ${SAVE_DATA_PATH} --categories cs.AI cs.CL cs.CV --start-date 2020-01-01
```

**Step 2**, all Arxiv Gentitle configs assume the dataset path to be `./data/arxiv_data.json`. You can either move and rename your data accordingly, or modify the path in these configs.
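
For example, a minimal sketch that moves the file produced in Step 1 (here `${SAVE_DATA_PATH}`) to the path the configs expect:

```shell
mkdir -p ./data
mv ${SAVE_DATA_PATH} ./data/arxiv_data.json
```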

### MOSS-003-SFT

The MOSS-003-SFT dataset can be downloaded from https://huggingface.co/datasets/fnlp/moss-003-sft-data.

**Step 0**, download data.

```shell
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/datasets/fnlp/moss-003-sft-data
```

**Step 1**, unzip.

```shell
cd moss-003-sft-data
unzip moss-003-sft-no-tools.jsonl.zip
unzip moss-003-sft-with-tools-no-text2image.zip
```

**Step 2**, all moss-003-sft configs assume the dataset paths to be `./data/moss-003-sft-no-tools.jsonl` and `./data/conversations_with_tools_with_inner_instruction_no_text2image_train_all_random_meta0.5_0.1_0.01_moss_0709.jsonl`. You can either move and rename your data accordingly, or modify the paths in these configs.
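
For example, a minimal sketch assuming the repository was cloned and unzipped in the current directory as in Steps 0 and 1 (adjust the source paths if the archives extracted into subdirectories):

```shell
mkdir -p ./data
cp moss-003-sft-data/moss-003-sft-no-tools.jsonl ./data/
cp moss-003-sft-data/conversations_with_tools_with_inner_instruction_no_text2image_train_all_random_meta0.5_0.1_0.01_moss_0709.jsonl ./data/
```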

### Chinese Lawyer

The Chinese Lawyer dataset has two sub-datasets, and can be downloaded from https://github.com/LiuHC0428/LAW-GPT.

All lawyer configs assume the dataset paths to be `./data/CrimeKgAssitant清洗后_52k.json` and `./data/训练数据_带法律依据_92k.json`. You can either move and rename your data accordingly, or modify the paths in these configs.
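
For example, assuming the two JSON files have been downloaded into the current directory:

```shell
mkdir -p ./data
mv CrimeKgAssitant清洗后_52k.json 训练数据_带法律依据_92k.json ./data/
```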

### LLaVA dataset

#### File structure

```
./data/llava_data
├── LLaVA-Pretrain
│   ├── blip_laion_cc_sbu_558k.json
│   ├── blip_laion_cc_sbu_558k_meta.json
│   └── images
├── LLaVA-Instruct-150K
│   └── llava_v1_5_mix665k.json
└── llava_images
    ├── coco
    │   └── train2017
    ├── gqa
    │   └── images
    ├── ocr_vqa
    │   └── images
    ├── textvqa
    │   └── train_images
    └── vg
        ├── VG_100K
        └── VG_100K_2
```

#### Pretrain

LLaVA-Pretrain

```shell
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain --depth=1
```
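
A minimal sketch for placing the cloned repository at the path shown in the file structure above, assuming the clone was run from the project root:

```shell
mkdir -p ./data/llava_data
mv LLaVA-Pretrain ./data/llava_data/
```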

#### Finetune

1. Text data

   1. LLaVA-Instruct-150K

      ```shell
      # Make sure you have git-lfs installed (https://git-lfs.com)
      git lfs install
      git clone https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K --depth=1
      ```

2. Image data (see the combined download sketch after this list)

   1. COCO (coco): [train2017](http://images.cocodataset.org/zips/train2017.zip)

   2. GQA (gqa): [images](https://downloads.cs.stanford.edu/nlp/data/gqa/images.zip)

   3. OCR-VQA (ocr_vqa): [download script](https://drive.google.com/drive/folders/1_GYPY5UkUy7HIcR0zq3ZCFgeZN7BAfm_?usp=sharing)

      1. ⚠️ Rename OCR-VQA's images so that they all keep the `.jpg` extension!

         ```shell
         #!/bin/bash
         # Directory containing the downloaded OCR-VQA images (replace with your path)
         ocr_vqa_path="<your-directory-path>"

         # Copy every image that is not already a .jpg to a .jpg file with the same basename
         find "$ocr_vqa_path" -type f | while read -r file; do
             extension="${file##*.}"
             if [ "$extension" != "jpg" ]
             then
                 cp -- "$file" "${file%.*}.jpg"
             fi
         done
         ```

   4. TextVQA (textvqa): [train_val_images](https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip)

   5. VisualGenome (VG): [part1](https://cs.stanford.edu/people/rak248/VG_100K_2/images.zip), [part2](https://cs.stanford.edu/people/rak248/VG_100K_2/images2.zip)
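
The following is a rough download-and-layout sketch for the finetune data listed above, under the assumption that the commands run from the project root and that each archive extracts into the subfolder name shown in the file structure (verify after unzipping); OCR-VQA has to be fetched with its own download script, so only its target directory is created here.

```shell
# Text data: place the cloned LLaVA-Instruct-150K repo alongside LLaVA-Pretrain
mkdir -p ./data/llava_data
mv LLaVA-Instruct-150K ./data/llava_data/

# Image data
mkdir -p ./data/llava_data/llava_images
cd ./data/llava_data/llava_images

# COCO -> coco/train2017
wget http://images.cocodataset.org/zips/train2017.zip
mkdir -p coco && unzip train2017.zip -d coco

# GQA -> gqa/images
wget https://downloads.cs.stanford.edu/nlp/data/gqa/images.zip
mkdir -p gqa && unzip images.zip -d gqa

# OCR-VQA -> ocr_vqa/images (use the official download script, then apply the rename step above)
mkdir -p ocr_vqa/images

# TextVQA -> textvqa/train_images
wget https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip
mkdir -p textvqa && unzip train_val_images.zip -d textvqa

# VisualGenome -> vg/VG_100K and vg/VG_100K_2
wget https://cs.stanford.edu/people/rak248/VG_100K_2/images.zip -O vg_part1.zip
wget https://cs.stanford.edu/people/rak248/VG_100K_2/images2.zip -O vg_part2.zip
mkdir -p vg && unzip vg_part1.zip -d vg && unzip vg_part2.zip -d vg
```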

### RefCOCO dataset

#### File structure

```

./data
├── refcoco_annotations
│   ├── refcoco
│   │   ├── instances.json
│   │   ├── refs(google).p
│   │   └── refs(unc).p
│   ├── refcoco+
│   │   ├── instances.json
│   │   └── refs(unc).p
│   └── refcocog
│       ├── instances.json
│       ├── refs(google).p
│       └── refs(umd).p
├── coco_images
│   ├── *.jpg
...
```

Download the RefCOCO, RefCOCO+, and RefCOCOg annotation files using the links below.
Both COCO train2017 and train2014 images are valid for `coco_images`.

| Dataset  |                                  Download link                                   |
| -------- | :-------------------------------------------------------------------------------: |
| RefCOCO  |  [annotations](https://bvisionweb1.cs.unc.edu/licheng/referit/data/refcoco.zip)   |
| RefCOCO+ |  [annotations](https://bvisionweb1.cs.unc.edu/licheng/referit/data/refcoco+.zip)  |
| RefCOCOg |  [annotations](https://bvisionweb1.cs.unc.edu/licheng/referit/data/refcocog.zip)  |

After downloading the annotations, unzip the files and place them in the `./data/refcoco_annotations` directory.
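
A minimal sketch, assuming the three zip files were downloaded into the current directory and extract into `refcoco/`, `refcoco+/`, and `refcocog/` subfolders matching the layout above:

```shell
mkdir -p ./data/refcoco_annotations
for f in refcoco.zip refcoco+.zip refcocog.zip; do
    unzip "$f" -d ./data/refcoco_annotations
done
```
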
Then, convert the annotations to JSON format with the command below, which saves the converted files in the `./data/llava_data/RefCOCOJson/` directory.

```shell
xtuner preprocess refcoco --ann-path $RefCOCO_ANN_PATH --image-path $COCO_IMAGE_PATH \
--save-path $SAVE_PATH # ./data/llava_data/RefCOCOJson/
```