Spaces: Vision-CAIR / MiniGPT-v2

Vision-CAIR committed on Oct 15, 2023

Commit

db67f95

1 Parent(s): 51fc377

Delete dataset

Files changed (6)

dataset/README_1_STAGE.md +0 -96
dataset/README_2_STAGE.md +0 -19
dataset/convert_cc_sbu.py +0 -20
dataset/convert_laion.py +0 -20
dataset/download_cc_sbu.sh +0 -6
dataset/download_laion.sh +0 -6

dataset/README_1_STAGE.md DELETED

@@ -1,96 +0,0 @@
-## Download the filtered Conceptual Captions, SBU, LAION datasets
-### Pre-training datasets download:
-We use the filtered synthetic captions prepared by BLIP. For more details about the dataset, please refer to [BLIP](https://github.com/salesforce/BLIP).
-Storing the LAION and CC3M+CC12M+SBU datasets requires ~2.3 TB of disk space.
-Image source | Filtered synthetic caption by ViT-L
---- | :---:
-CC3M+CC12M+SBU | <a href="https://storage.googleapis.com/sfr-vision-language-research/BLIP/datasets/ccs_synthetic_filtered_large.json">Download</a>
-LAION115M |  <a href="https://storage.googleapis.com/sfr-vision-language-research/BLIP/datasets/laion_synthetic_filtered_large.json">Download</a>
-Downloading both links yields two JSON files:
-```
-ccs_synthetic_filtered_large.json
-laion_synthetic_filtered_large.json
-```
-## Prepare the data step by step
-### Set up the dataset folder and move the annotation files to the data storage folder
-```
-export MINIGPT4_DATASET=/YOUR/PATH/FOR/LARGE/DATASET/
-mkdir -p ${MINIGPT4_DATASET}/cc_sbu
-mkdir -p ${MINIGPT4_DATASET}/laion
-mv ccs_synthetic_filtered_large.json ${MINIGPT4_DATASET}/cc_sbu
-mv laion_synthetic_filtered_large.json ${MINIGPT4_DATASET}/laion
-```
-### Copy the scripts to the data storage folder
-```
-cp convert_cc_sbu.py ${MINIGPT4_DATASET}/cc_sbu
-cp download_cc_sbu.sh ${MINIGPT4_DATASET}/cc_sbu
-cp convert_laion.py ${MINIGPT4_DATASET}/laion
-cp download_laion.sh ${MINIGPT4_DATASET}/laion
-```
-### Convert the laion and cc_sbu annotation files to the img2dataset format
-```
-cd ${MINIGPT4_DATASET}/cc_sbu
-python convert_cc_sbu.py
-cd ${MINIGPT4_DATASET}/laion
-python convert_laion.py
-```
-### Download the datasets with img2dataset
-```
-cd ${MINIGPT4_DATASET}/cc_sbu
-sh download_cc_sbu.sh
-cd ${MINIGPT4_DATASET}/laion
-sh download_laion.sh
-```
-The final dataset structure:
-```
-.
-├── ${MINIGPT4_DATASET}
-│   ├── cc_sbu
-│       ├── convert_cc_sbu.py
-│       ├── download_cc_sbu.sh
-│       ├── ccs_synthetic_filtered_large.json
-│       ├── ccs_synthetic_filtered_large.tsv
-│       └── cc_sbu_dataset
-│           ├── 00000.tar
-│           ├── 00000.parquet
-│           ...
-│   ├── laion
-│       ├── convert_laion.py
-│       ├── download_laion.sh
-│       ├── laion_synthetic_filtered_large.json
-│       ├── laion_synthetic_filtered_large.tsv
-│       └── laion_dataset
-│           ├── 00000.tar
-│           ├── 00000.parquet
-│           ...
-...
-```
-## Set up the dataset configuration files
-Then, set the LAION dataset loading path on Line 5 of
-[minigpt4/configs/datasets/laion/defaults.yaml](../minigpt4/configs/datasets/laion/defaults.yaml#L5) to
-${MINIGPT4_DATASET}/laion/laion_dataset/{00000..10488}.tar
-and the Conceptual Captions and SBU dataset loading path on Line 5 of
-[minigpt4/configs/datasets/cc_sbu/defaults.yaml](../minigpt4/configs/datasets/cc_sbu/defaults.yaml#L5) to
-${MINIGPT4_DATASET}/cc_sbu/cc_sbu_dataset/{00000..01255}.tar
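
The loading paths in the deleted README use a bash-style numeric range such as `{00000..10488}`. As a minimal sketch (not part of the original commit), the helper below expands such a range into concrete shard paths so a download can be checked for missing `.tar` files. The function names `expand_shards` and `missing_shards` are hypothetical, and the sketch assumes the pattern contains at most one range.

```python
import re
from pathlib import Path
from typing import List

# Matches one bash-style numeric range, e.g. {00000..10488}
_RANGE = re.compile(r"\{(\d+)\.\.(\d+)\}")


def expand_shards(pattern: str) -> List[str]:
    """Expand a single {start..end} range into concrete paths (hypothetical helper)."""
    match = _RANGE.search(pattern)
    if match is None:
        return [pattern]
    start, end = match.group(1), match.group(2)
    width = max(len(start), len(end))  # preserve the zero padding of the range
    prefix, suffix = pattern[: match.start()], pattern[match.end():]
    return [
        prefix + str(i).zfill(width) + suffix
        for i in range(int(start), int(end) + 1)
    ]


def missing_shards(pattern: str) -> List[str]:
    """Return the expanded shard paths that do not exist on disk."""
    return [p for p in expand_shards(pattern) if not Path(p).is_file()]
```

Calling `missing_shards` on the pattern written into the config file, with `${MINIGPT4_DATASET}` replaced by the real path, lists any shard that the config expects but the download did not produce.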

dataset/README_2_STAGE.md DELETED

@@ -1,19 +0,0 @@
-## Second Stage Data Preparation
-Our second-stage dataset can be downloaded from
-[here](https://drive.google.com/file/d/1nJXhoEcy3KTExr17I7BXqY5Y9Lx_-n-9/view?usp=share_link).
-After extraction, you will get a data folder with the following structure:
-```
-cc_sbu_align
-├── filter_cap.json
-└── image
-    ├── 2.jpg
-    ├── 3.jpg
-    ...
-```
-Put the folder at any path you want.
-Then, set the dataset path on Line 5 of the dataset config file
-[here](../minigpt4/configs/datasets/cc_sbu/align.yaml#L5).
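
Before pointing the config file at the extracted folder, it can help to confirm that the folder matches the layout shown in the deleted README. The following is a minimal sketch under that assumption only: it checks for `filter_cap.json` and an `image` directory containing `.jpg` files, and does not inspect the JSON contents. The function name `check_align_folder` is hypothetical.

```python
from pathlib import Path
from typing import List


def check_align_folder(root: str) -> List[str]:
    """Return a list of layout problems for a cc_sbu_align folder (empty if none)."""
    base = Path(root)
    problems: List[str] = []
    # The README shows filter_cap.json directly under the folder root
    if not (base / "filter_cap.json").is_file():
        problems.append("missing filter_cap.json")
    # Images are shown under image/ with .jpg extensions
    image_dir = base / "image"
    if not image_dir.is_dir():
        problems.append("missing image/ directory")
    elif not any(image_dir.glob("*.jpg")):
        problems.append("image/ contains no .jpg files")
    return problems
```

An empty return value means the folder has the expected shape; any other result names what is missing.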

dataset/convert_cc_sbu.py DELETED

@@ -1,20 +0,0 @@
-import json
-import csv
-# specify input and output file paths
-input_file = 'ccs_synthetic_filtered_large.json'
-output_file = 'ccs_synthetic_filtered_large.tsv'
-# load JSON data from input file
-with open(input_file, 'r') as f:
-    data = json.load(f)
-# extract header and data from JSON
-header = data[0].keys()
-rows = [x.values() for x in data]
-# write data to TSV file
-with open(output_file, 'w', newline='') as f:
-    writer = csv.writer(f, delimiter='\t')
-    writer.writerow(header)
-    writer.writerows(rows)
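
The deleted script takes its header from the keys of the first JSON record, so the column order depends on the input. A minimal alternative sketch, assuming every record carries the `url` and `caption` keys that the download scripts pass to `--url_col` and `--caption_col`, writes those two columns explicitly with `csv.DictWriter`. The function name `write_rows` is hypothetical and this is not the authors' code.

```python
import csv
from typing import IO, Iterable, Mapping, Sequence


def write_rows(
    records: Iterable[Mapping[str, object]],
    out: IO[str],
    columns: Sequence[str] = ("url", "caption"),
) -> int:
    """Write records as TSV with a fixed column order; return the row count."""
    writer = csv.DictWriter(
        out,
        fieldnames=list(columns),
        delimiter="\t",
        extrasaction="ignore",  # silently drop keys that are not in `columns`
        lineterminator="\n",
    )
    writer.writeheader()
    count = 0
    for record in records:
        writer.writerow(record)
        count += 1
    return count
```

To use it on the real data, load the JSON list with `json.load` and pass an output file opened with `newline=""`, as the `csv` module documentation recommends.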

dataset/convert_laion.py DELETED

@@ -1,20 +0,0 @@
-import json
-import csv
-# specify input and output file paths
-input_file = 'laion_synthetic_filtered_large.json'
-output_file = 'laion_synthetic_filtered_large.tsv'
-# load JSON data from input file
-with open(input_file, 'r') as f:
-    data = json.load(f)
-# extract header and data from JSON
-header = data[0].keys()
-rows = [x.values() for x in data]
-# write data to TSV file
-with open(output_file, 'w', newline='') as f:
-    writer = csv.writer(f, delimiter='\t')
-    writer.writerow(header)
-    writer.writerows(rows)
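
Because the converted TSV is consumed by img2dataset with `--url_col "url"` and `--caption_col "caption"`, a quick header check before starting a long download can catch a mismatched file early. This is a minimal sketch with a hypothetical function name, `missing_columns`; it inspects only the header line.

```python
import csv
from typing import List, Sequence


def missing_columns(
    header_line: str,
    required: Sequence[str] = ("url", "caption"),
) -> List[str]:
    """Return the required column names absent from a TSV header line."""
    # Parse the single header line with the same delimiter the converter uses
    header = next(csv.reader([header_line], delimiter="\t"), [])
    return [name for name in required if name not in header]
```

Reading the first line of `laion_synthetic_filtered_large.tsv` and passing it to `missing_columns` should return an empty list when the file is ready for the download step.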

dataset/download_cc_sbu.sh DELETED

@@ -1,6 +0,0 @@
-#!/bin/bash
-img2dataset --url_list ccs_synthetic_filtered_large.tsv --input_format "tsv"\
-         --url_col "url" --caption_col "caption" --output_format webdataset\
-           --output_folder cc_sbu_dataset --processes_count 16 --thread_count 128 --image_size 224 \
-             --enable_wandb True

dataset/download_laion.sh DELETED

@@ -1,6 +0,0 @@
-#!/bin/bash
-img2dataset --url_list laion_synthetic_filtered_large.tsv --input_format "tsv"\
-         --url_col "url" --caption_col "caption" --output_format webdataset\
-           --output_folder laion_dataset --processes_count 16 --thread_count 128 --image_size 224 \
-             --enable_wandb True
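
The two deleted shell scripts differ only in the input TSV and the output folder. As a sketch, the same settings can be expressed once in Python and passed to img2dataset's `download` function, which accepts keyword arguments matching the CLI flags used above. The helper names `download_kwargs` and `run_download` and the `DATASETS` table are hypothetical, and img2dataset must be installed separately for `run_download` to work.

```python
from typing import Dict, Tuple

# Input TSV and output folder per dataset, taken from the two shell scripts
DATASETS: Dict[str, Tuple[str, str]] = {
    "cc_sbu": ("ccs_synthetic_filtered_large.tsv", "cc_sbu_dataset"),
    "laion": ("laion_synthetic_filtered_large.tsv", "laion_dataset"),
}


def download_kwargs(name: str) -> dict:
    """Build the keyword arguments mirroring the CLI flags in the scripts."""
    url_list, output_folder = DATASETS[name]
    return {
        "url_list": url_list,
        "input_format": "tsv",
        "url_col": "url",
        "caption_col": "caption",
        "output_format": "webdataset",
        "output_folder": output_folder,
        "processes_count": 16,
        "thread_count": 128,
        "image_size": 224,
        "enable_wandb": True,
    }


def run_download(name: str) -> None:
    """Start the download; imports img2dataset lazily (third-party package)."""
    from img2dataset import download

    download(**download_kwargs(name))
```

Running `run_download("cc_sbu")` from inside `${MINIGPT4_DATASET}/cc_sbu` would correspond to `sh download_cc_sbu.sh`, since both paths in the table are relative.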