---
license: apache-2.0
---
This is the official repository for the paper [Supervised Fine-tuning *in turn* Improves Visual Foundation Models]().

<div align="center">

📃[**Paper (arXiv)**]() **|** [**Code**]() **|** 🤗[**Hugging Face**](https://huggingface.co/TencentARC/ViSFT)

</div>

## News
* [2024/01/17] We open-source [ViSFT](), including training scripts and weights. Evaluation code will be released soon.

## Introduction
Image-text training such as CLIP has dominated the pretraining of vision foundation models in recent years. Subsequent efforts have been made to introduce region-level visual learning into CLIP's pretraining, but they face scalability challenges due to the lack of large-scale region-level datasets. Drawing inspiration from supervised fine-tuning (SFT) in natural language processing, such as instruction tuning, we explore the potential of fine-grained SFT in enhancing the generalization of vision foundation models after their pretraining. A two-stage method, **ViSFT** (**Vi**sion **SFT**), is thus proposed to unleash the fine-grained knowledge of vision foundation models. In ViSFT, the vision foundation model is enhanced by performing visual joint learning on several in-domain tasks and is then tested on out-of-domain benchmarks. After updating with ViSFT on 8 V100 GPUs in less than 2 days, a vision transformer with over 4.4B parameters shows improvements across various out-of-domain benchmarks, including vision and vision-linguistic scenarios.

## Installation

### Create a conda environment
```
conda create -n ViSFT python=3.8

conda activate ViSFT
```
### Install PyTorch
We use torch 1.12 with CUDA 11.3 on 8 NVIDIA Volta V100-SXM2-32GB GPUs.
```
pip install --extra-index-url https://download.pytorch.org/whl/cu113 torch==1.12.0

pip install --extra-index-url https://download.pytorch.org/whl/cu113 torchvision==0.13.0

pip install --extra-index-url https://download.pytorch.org/whl/cu113 torchaudio==0.12.0
```

### Install xformers

Flash attention is required for running EVA-ViT-E; please refer to [xformers](https://github.com/facebookresearch/xformers) for installation instructions.

### Install loralib

```
pip install --user git+https://github.com/microsoft/LoRA
```

### Compile MSDeform ops for the Mask2Former head
```
cd ./mmf/models/visft/ops
sudo sh make.sh
# back to the root dir
cd ../../../../
```

### Install other packages
```
pip install -r requirements.txt
```

## Dataset Preparation

Set your data root first: `export DATA_PATH=your_data_path`

### Image captioning
Generate HDF5 files for image captioning following [create_input_files.py](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Image-Captioning/blob/master/create_input_files.py).

File structure:

```
DATA_PATH/
└── processed_datasets/
    └── coco_caption_hdf5_files
        ├── TEST_CAPLENS_coco_5_cap_per_img_5_min_word_freq.json
        ├── TEST_CAPTIONS_coco_5_cap_per_img_5_min_word_freq.json
        ├── TEST_IMAGES_coco_5_cap_per_img_5_min_word_freq.hdf5
        ├── TRAIN_CAPLENS_coco_5_cap_per_img_5_min_word_freq.json
        ├── TRAIN_CAPTIONS_coco_5_cap_per_img_5_min_word_freq.json
        ├── TRAIN_IMAGES_coco_5_cap_per_img_5_min_word_freq.hdf5
        ├── VAL_CAPLENS_coco_5_cap_per_img_5_min_word_freq.json
        ├── VAL_CAPTIONS_coco_5_cap_per_img_5_min_word_freq.json
        ├── VAL_IMAGES_coco_5_cap_per_img_5_min_word_freq.hdf5
        └── WORDMAP_coco_5_cap_per_img_5_min_word_freq.json
```
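
For reference, here is a minimal sketch of generating these files with the `create_input_files` helper from the linked tutorial. The Karpathy-split JSON path and the exact argument values are assumptions; adjust all paths to your setup.

```python
# Hypothetical driver for the linked tutorial's create_input_files helper.
# captions_per_image=5 and min_word_freq=5 match the
# "coco_5_cap_per_img_5_min_word_freq" file names expected above.
import os
from utils import create_input_files  # utils.py from the a-PyTorch-Tutorial-to-Image-Captioning repo

DATA_PATH = os.environ["DATA_PATH"]

create_input_files(
    dataset="coco",
    karpathy_json_path="path/dataset_coco.json",  # Karpathy-split annotations (assumed location)
    image_folder=os.path.join(DATA_PATH, "public_datasets/coco"),
    captions_per_image=5,
    min_word_freq=5,
    output_folder=os.path.join(DATA_PATH, "processed_datasets/coco_caption_hdf5_files"),
    max_len=50,  # caption length cap used in the tutorial's own driver
)
```
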
### Detection & Segmentation

File structure:

```
DATA_PATH/
└── public_datasets/
    └── coco
        ├── train2017
        ├── val2017
        ├── test2017
        └── annotations
            ├── instances_train2017.json
            ├── instances_val2017.json
            └── image_info_test-dev2017.json
```

## Training
### Stage 1
Stage 1 trains compatible in-domain task heads. We use 8 NVIDIA Volta V100-SXM2-32GB GPUs for each in-domain task head.

**For eva-vit-g**

Prepare the weights from [LAVIS](https://github.com/salesforce/LAVIS):
```
wget https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/BLIP2/eva_vit_g.pth
```
Add your weights path to the configs under `./projects/visft/configs/stage1/eva_g/`:
```
backbone_dir: path/eva_vit_g.pth
```
Launch training:
```
bash ./scripts/stage1_train/eva_g/caption.sh
bash ./scripts/stage1_train/eva_g/detection.sh
bash ./scripts/stage1_train/eva_g/segment.sh
```

**For eva-vit-e**

Prepare the EVA-CLIP weights from [EVA](https://huggingface.co/QuanSun/EVA-CLIP/blob/main/EVA02_CLIP_E_psz14_plus_s9B.pt), then extract the ViT weights:
```
python ./scripts/preprocess/extract_eva_e_vit.py
```
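
For reference, a hedged sketch of what such an extraction amounts to, assuming the EVA-CLIP checkpoint stores the vision tower under a `visual.` key prefix; this illustrates the idea and is not necessarily what `extract_eva_e_vit.py` does.

```python
# Illustrative only: keep the vision-tower parameters from the full EVA-CLIP checkpoint.
# The "module" nesting and the "visual." prefix are assumptions about the state-dict
# layout; inspect the checkpoint's keys before relying on this.
import torch

ckpt = torch.load("path/EVA02_CLIP_E_psz14_plus_s9B.pt", map_location="cpu")
state = ckpt.get("module", ckpt) if isinstance(ckpt, dict) else ckpt

visual_state = {k[len("visual."):]: v for k, v in state.items() if k.startswith("visual.")}
torch.save(visual_state, "path/EVA02_CLIP_E_psz14_plus_s9B_Visual.pt")
```
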
Add your weights path to the configs under `./projects/visft/configs/stage1/eva_e/`:
```
backbone_dir: path/EVA02_CLIP_E_psz14_plus_s9B_Visual.pt
```
Launch training:
```
# can be executed in parallel
bash ./scripts/stage1_train/eva_e/caption.sh
bash ./scripts/stage1_train/eva_e/detection.sh
bash ./scripts/stage1_train/eva_e/segment.sh
```

Or you can use the weights we provide:

| In-domain heads | EVA-G | EVA-E |
|----------|:-------------:|:-------------:|
| Caption head | [weights](https://huggingface.co/TencentARC/ViSFT/blob/main/eva_g_caption_heads.ckpt) | [weights](https://huggingface.co/TencentARC/ViSFT/blob/main/eva_e_caption_heads.ckpt) |
| Segment head | [weights](https://huggingface.co/TencentARC/ViSFT/blob/main/eva_g_segment_heads.ckpt) | [weights](https://huggingface.co/TencentARC/ViSFT/blob/main/eva_e_segment_heads.ckpt) |
| Detection head | [weights](https://huggingface.co/TencentARC/ViSFT/blob/main/eva_g_detection_heads.ckpt) | [weights](https://huggingface.co/TencentARC/ViSFT/blob/main/eva_e_detection_heads.ckpt) |
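
To fetch these checkpoints programmatically instead of through the browser, here is a small sketch using `huggingface_hub` (file names come from the table above; the download lands in the local Hugging Face cache):

```python
# Download one of the provided in-domain head checkpoints from TencentARC/ViSFT.
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="TencentARC/ViSFT",
    filename="eva_g_caption_heads.ckpt",  # any file name from the table works
)
print(ckpt_path)  # local path to reference from the stage-2 config
```
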

### Stage 2

**For eva-vit-g**

Add your weights paths to the config `./projects/visft/configs/stage2/eva_g/stage2.yaml`:
```
backbone_dir: path/eva_vit_g.pth
caption_ckpt_path: 'path/eva_g_caption_heads.ckpt'
segment_ckpt_path: 'path/eva_g_segment_heads.ckpt'
detection_ckpt_path: 'path/eva_g_detection_heads.ckpt'
```
Launch training:
```
bash ./scripts/stage2_train/eva_g/stage2.sh
```

**For eva-vit-e**

Add your weights paths to the config `./projects/visft/configs/stage2/eva_e/stage2.yaml`:
```
backbone_dir: path/EVA02_CLIP_E_psz14_plus_s9B_Visual.pt
caption_ckpt_path: 'path/eva_e_caption_heads.ckpt'
segment_ckpt_path: 'path/eva_e_segment_heads.ckpt'
detection_ckpt_path: 'path/eva_e_detection_heads.ckpt'
```
Launch training:
```
bash ./scripts/stage2_train/eva_e/stage2.sh
```
### Get LoRA Weights
You can extract the expected LoRA weights with:

```
python ./scripts/postprocess/extract_lora_weights.py
```

Or use the LoRA weights we provide:

| Iters | EVA-G | EVA-E |
|----------|:-------------:|:-------------:|
| 5k | [weights](https://huggingface.co/TencentARC/ViSFT/blob/main/eva_g_lora_5000.pt) | [weights](https://huggingface.co/TencentARC/ViSFT/blob/main/eva_e_lora_5000.pt) |
| 10k | [weights](https://huggingface.co/TencentARC/ViSFT/blob/main/eva_g_lora_10000.pt) | [weights](https://huggingface.co/TencentARC/ViSFT/blob/main/eva_e_lora_10000.pt) |
| 15k | [weights](https://huggingface.co/TencentARC/ViSFT/blob/main/eva_g_lora_15000.pt) | [weights](https://huggingface.co/TencentARC/ViSFT/blob/main/eva_e_lora_15000.pt) |
| 20k | [weights](https://huggingface.co/TencentARC/ViSFT/blob/main/eva_g_lora_20000.pt) | [weights](https://huggingface.co/TencentARC/ViSFT/blob/main/eva_e_lora_20000.pt) |
| 50k | [weights](https://huggingface.co/TencentARC/ViSFT/blob/main/eva_g_lora_50000.pt) | [weights](https://huggingface.co/TencentARC/ViSFT/blob/main/eva_e_lora_50000.pt) |
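
The extracted checkpoints contain only the LoRA parameters. Below is a minimal, self-contained sketch of the save/load mechanics with `loralib`; `TinyBlock` is a stand-in module for illustration, not the actual EVA-ViT block.

```python
# Toy example of how LoRA-only checkpoints are saved and re-applied with loralib.
import torch
import torch.nn as nn
import loralib as lora

class TinyBlock(nn.Module):
    """Stand-in for a transformer block whose qkv projection carries LoRA adapters."""
    def __init__(self, dim: int = 64, rank: int = 4):
        super().__init__()
        self.qkv = lora.Linear(dim, dim * 3, r=rank)  # low-rank adapters live here
        self.proj = nn.Linear(dim, dim)               # ordinary layer, no adapters

model = TinyBlock()

# Save only the LoRA A/B matrices (this is what a LoRA-extraction step keeps).
torch.save(lora.lora_state_dict(model), "lora_only.pt")

# Re-apply them to a freshly built model; strict=False because the checkpoint
# intentionally omits every non-LoRA parameter.
fresh = TinyBlock()
fresh.load_state_dict(torch.load("lora_only.pt"), strict=False)
```
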

## Evaluation Benchmarks
- [ ] Zero-shot Image Classification
- [ ] Zero-shot Image-Text Retrieval
- [ ] OCR
- [ ] Grounded Object Identification
- [ ] VQA
- [ ] Image Captioning on NoCaps

## Acknowledgement
The code of ViSFT builds on the official implementations of [mmf](https://github.com/facebookresearch/mmf), [EVA](https://github.com/baaivision/EVA/tree/master), and [LAVIS](https://github.com/salesforce/LAVIS/tree/main).

## Citation