x-lai committed

Commit be3b66a · 2 Parent(s): 3d9fba4 c5bcdef

Release training script


Former-commit-id: 147c3800d440932a62df8b099a4315b18ea4aa05

README.md CHANGED
@@ -1,17 +1,76 @@
 # LISA: Reasoning Segmentation via Large Language Model
 
-<font size=10><div align='center'><b>LISA</b>: Large <b>L</b>anguage <b>I</b>nstructed <b>S</b>egmentation <b>A</b>ssistant</div></font>
-
-<font size=10><div align='center' > <a href=https://arxiv.org/pdf/2308.00692.pdf>**Paper**</a> | <a href="https://huggingface.co/xinlai">**Models**</a> | [**Inference**](#inference) | [**Dataset**](#dataset) | <a href="http://103.170.5.190:7860/">**Online Demo**</a></div></font>
-
+<font size=7><div align='center'><b>LISA</b>: Large <b>L</b>anguage <b>I</b>nstructed <b>S</b>egmentation <b>A</b>ssistant</div></font>
+
+<font size=7><div align='center' > <a href=https://arxiv.org/pdf/2308.00692.pdf>**Paper**</a> | <a href="https://huggingface.co/xinlai">**Models**</a> | **Training** (Coming Soon) | [**Inference**](#inference) | [**Dataset**](#dataset) | <a href="http://103.170.5.190:7860/">**Online Demo**</a></div></font>
+
+<!-- <p align="center"> <img src="imgs/teaser.jpg" width="100%"> </p> -->
+
+<table class="center">
+<tr>
+<td style="text-align:center;"><b>Input</b></td>
+<td style="text-align:center;"><b>Output</b></td>
+<td style="text-align:center;"><b>Input</b></td>
+<td style="text-align:center;"><b>Output</b></td>
+</tr>
+<tr>
+<td><img src="imgs/obama.jpg"></td>
+<td><img src="vis_output/obama.jpg"></td>
+<td><img src="imgs/trump.jpg"></td>
+<td><img src="vis_output/trump.jpg"></td>
+</tr>
+<tr>
+<td width=25% style="text-align:center;color:gray;">"Who was <ins>the president of the US</ins> in this image? Please output segmentation mask and explain the reason."</td>
+<td width=25% style="text-align:center;">"Sure, the segmentation result is [SEG]. The President of the United States in the image is <ins>President Obama</ins>."</td>
+<td width=25% style="text-align:center;color:gray">"Who was <ins>the president of the US</ins> in this image? Please output segmentation mask and explain why."</td>
+<td width=25% style="text-align:center;">"Sure, [SEG]. In the image, the President of the United States is <ins>President Trump</ins>."</td>
+</tr>
+
+<tr>
+<td><img src="imgs/stand_higher.jpg"></td>
+<td><img src="vis_output/stand_higher.jpg"></td>
+<td><img src="imgs/camera_lens.jpg"></td>
+<td><img src="vis_output/camera_lens.jpg"></td>
+</tr>
+<tr>
+<td width=25% style="text-align:center;color:gray;">"What can <ins>make the woman stand higher</ins>? Please output segmentation mask and explain why."</td>
+<td width=25% style="text-align:center;">"Sure, [SEG]. The woman is <ins>standing higher by using a ladder</ins>..."</td>
+<td width=25% style="text-align:center;color:gray">"Can you segment <ins>the camera lens that is more suitable for photographing nearby objects</ins> in this image?"</td>
+<td width=25% style="text-align:center;">"Sure, [SEG]."</td>
+</tr>
+
+<tr>
+<td><img src="imgs/dog_with_horn.jpg"></td>
+<td><img src="vis_output/dog_with_horn.jpg"></td>
+<td><img src="imgs/wash_hands.jpg"></td>
+<td><img src="vis_output/wash_hands.jpg"></td>
+</tr>
+<tr>
+<td width=25% style="text-align:center;color:gray;">"Can you segment <ins>the unusual part</ins> in this image and explain why?"</td>
+<td width=25% style="text-align:center;">"Sure, it is [SEG]. In the image, the unusual part is <ins>the dog wearing a reindeer antler headband</ins>..."</td>
+<td width=25% style="text-align:center;color:gray">"Where to <ins>wash hands</ins> in this image? Please output segmentation mask."</td>
+<td width=25% style="text-align:center;">"Sure, [SEG]."</td>
+</tr>
+
+<tr>
+<td><img src="imgs/jackma.jpg"></td>
+<td><img src="vis_output/jackma.jpg"></td>
+<td><img src="imgs/blackpink.jpg"></td>
+<td><img src="vis_output/blackpink.jpg"></td>
+</tr>
+<tr>
+<td width=25% style="text-align:center;color:gray;">"Can you segment <ins>the founder of Alibaba</ins> in this image and explain why?"</td>
+<td width=25% style="text-align:center;">"Sure, it is [SEG]. In the image, the man wearing a red tie and sitting on one of the chairs is <ins>Jack Ma</ins>, the co-founder of Alibaba Group..."</td>
+<td width=25% style="text-align:center;color:gray">"Please segment <ins>Lisa</ins> in this figure."</td>
+<td width=25% style="text-align:center;">"Sure, [SEG]."</td>
+</tr>
+</table>
 
 <p align="center"> <img src="imgs/fig_overview.jpg" width="100%"> </p>
 
-<p align="center"> <img src="imgs/teaser.jpg" width="100%"> </p>
-
 ## News
 - [x] [2023.8.4] [Online Demo](http://103.170.5.190:7860/) is released!
-- [x] [2023.8.4] [*ReasonSeg* Dataset](https://drive.google.com/drive/folders/125mewyg5Ao6tZ3ZdJ-1-E3n04LGVELqy?usp=sharing) and the [LISA-13B-llama2-v0-explainatory](https://huggingface.co/xinlai/LISA-13B-llama2-v0-explainatory) model are released!
+- [x] [2023.8.4] [*ReasonSeg* Dataset](https://drive.google.com/drive/folders/125mewyg5Ao6tZ3ZdJ-1-E3n04LGVELqy?usp=sharing) and the [LISA-13B-llama2-v0-explanatory](https://huggingface.co/xinlai/LISA-13B-llama2-v0-explanatory) model are released!
 - [x] [2023.8.3] Inference code and the [LISA-13B-llama2-v0](https://huggingface.co/xinlai/LISA-13B-llama2-v0) model are released. Welcome to check out!
 - [x] [2023.8.2] [Paper](https://arxiv.org/pdf/2308.00692.pdf) is released and GitHub repo is created.
 
@@ -126,7 +185,7 @@ deepspeed --master_port=24999 train_ds.py --version="PATH_TO_LLaVA_Weights" --da
 
 
 ## Inference
-To chat with [LISA-13B-llama2-v0](https://huggingface.co/xinlai/LISA-13B-llama2-v0) or [LISA-13B-llama2-v0-explainatory](https://huggingface.co/xinlai/LISA-13B-llama2-v0-explainatory): (Note that LISA-13B-llama2-v0 currently does not support explanatory answers.)
+To chat with [LISA-13B-llama2-v0](https://huggingface.co/xinlai/LISA-13B-llama2-v0) or [LISA-13B-llama2-v0-explanatory](https://huggingface.co/xinlai/LISA-13B-llama2-v0-explanatory): (Note that LISA-13B-llama2-v0 currently does not support explanatory answers.)
 ```
 CUDA_VISIBLE_DEVICES=0 python3 chat.py --version='xinlai/LISA-13B-llama2-v0'
 ```
@@ -182,7 +241,7 @@ Besides, we leveraged GPT-3.5 for rephrasing instructions, so images in the trai
 If you find this project useful in your research, please consider citing:
 
 ```
-@article{reason-seg,
+@article{reason_seg,
   title={LISA: Reasoning Segmentation via Large Language Model},
   author={Xin Lai and Zhuotao Tian and Yukang Chen and Yanwei Li and Yuhui Yuan and Shu Liu and Jiaya Jia},
   journal={arXiv:2308.00692},
model/LISA.py CHANGED
@@ -6,6 +6,7 @@ import torch.nn.functional as F
 from peft import (LoraConfig, get_peft_model)
 from transformers import BitsAndBytesConfig, CLIPVisionModel
 
+from transformers import LlamaForCausalLM, CLIPVisionModel, BitsAndBytesConfig
 from .llava.model.llava import LlavaLlamaForCausalLM
 from .segment_anything import build_sam_vit_h
 from utils.utils import (DEFAULT_IM_END_TOKEN, DEFAULT_IM_START_TOKEN,
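
For orientation: the surrounding imports are the LoRA plumbing the released training script builds on, and the added line pulls `LlamaForCausalLM` in from `transformers`. Below is a minimal, self-contained sketch of the `LoraConfig`/`get_peft_model` pattern these imports enable; the checkpoint path and every hyperparameter are illustrative assumptions, not LISA's actual training configuration.

```python
# Minimal sketch of the LoRA pattern behind these imports (assumed
# hyperparameters and a placeholder checkpoint path, for illustration only).
from peft import LoraConfig, get_peft_model
from transformers import LlamaForCausalLM

base_model = LlamaForCausalLM.from_pretrained("PATH_TO_LLAMA_WEIGHTS")

lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank adapter matrices
    lora_alpha=16,                        # scaling applied to adapter updates
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

# Wrap the base model; only the injected adapter weights stay trainable.
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
```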
requirements.txt CHANGED
@@ -6,9 +6,7 @@ markdown2==2.4.10
 numpy==1.24.2
 openai==0.27.8
 opencv_python==4.8.0.74
-peft==0.3.0
 Pillow==9.4.0
-Pillow==10.0.0
 pycocotools==2.0.6
 ray==2.6.1
 Requests==2.31.0
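
Context for the removed lines: the file previously pinned Pillow twice (`Pillow==9.4.0` and `Pillow==10.0.0`), and pip will refuse to resolve a requirements file that pins one package at two conflicting versions. A small standalone helper (hypothetical, not part of this repo) that flags such duplicate pins before install:

```python
# Hypothetical standalone helper, not part of the LISA repo: flag packages
# pinned more than once in a requirements file, such as the
# Pillow==9.4.0 / Pillow==10.0.0 pair this commit removes.
import re
from collections import Counter

def duplicate_pins(path="requirements.txt"):
    counts = Counter()
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blanks and comments
            # The package name is everything before the first version operator.
            name = re.split(r"[=<>!~]", line, maxsplit=1)[0].strip().lower()
            counts[name] += 1
    return sorted(name for name, n in counts.items() if n > 1)

if __name__ == "__main__":
    print(duplicate_pins())  # before this commit: ['pillow']
```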
vis_output/dog_with_horn.jpg ADDED
vis_output/example1_mask_0.jpg CHANGED
vis_output/example1_masked_img_0.jpg CHANGED
vis_output/example2_mask_0.jpg DELETED (binary file, 13.8 kB)
vis_output/example2_masked_img_0.jpg DELETED (binary file, 182 kB)