Release training script

Former-commit-id: 147c3800d440932a62df8b099a4315b18ea4aa05
- README.md +68 -9
- model/LISA.py +1 -0
- requirements.txt +0 -2
- vis_output/dog_with_horn.jpg +0 -0
- vis_output/example1_mask_0.jpg +0 -0
- vis_output/example1_masked_img_0.jpg +0 -0
- vis_output/example2_mask_0.jpg +0 -0
- vis_output/example2_masked_img_0.jpg +0 -0
README.md
CHANGED
@@ -1,17 +1,76 @@
 # LISA: Reasoning Segmentation via Large Language Model
 
-<font size=
-
-<font size=
-
+<font size=7><div align='center'><b>LISA</b>: Large <b>L</b>anguage <b>I</b>nstructed <b>S</b>egmentation <b>A</b>ssistant</div></font>
+
+<font size=7><div align='center' > <a href=https://arxiv.org/pdf/2308.00692.pdf>**Paper**</a> | <a href="https://huggingface.co/xinlai">**Models**</a> | **Training** (Coming Soon) | [**Inference**](#inference) | [**Dataset**](#dataset) | <a href="http://103.170.5.190:7860/">**Online Demo**</a></div></font>
+
+<!-- <p align="center"> <img src="imgs/teaser.jpg" width="100%"> </p> -->
+
+<table class="center">
+<tr>
+<td style="text-align:center;"><b>Input</b></td>
+<td style="text-align:center;"><b>Output</b></td>
+<td style="text-align:center;"><b>Input</b></td>
+<td style="text-align:center;"><b>Output</b></td>
+</tr>
+<tr>
+<td><img src="imgs/obama.jpg"></td>
+<td><img src="vis_output/obama.jpg"></td>
+<td><img src="imgs/trump.jpg"></td>
+<td><img src="vis_output/trump.jpg"></td>
+</tr>
+<tr>
+<td width=25% style="text-align:center;color:gray;">"Who was <ins>the president of the US</ins> in this image? Please output segmentation mask and explain the reason."</td>
+<td width=25% style="text-align:center;">"Sure, the segmentation result is [SEG]. The President of the United States in the image is <ins>President Obama.</ins>"</td>
+<td width=25% style="text-align:center;color:gray">"Who was <ins>the president of the US</ins> in this image? Please output segmentation mask and explain why."</td>
+<td width=25% style="text-align:center;">"Sure, [SEG]. In the image, the President of the United States is <ins>President Trump.</ins>"</td>
+</tr>
+
+<tr>
+<td><img src="imgs/stand_higher.jpg"></td>
+<td><img src="vis_output/stand_higher.jpg"></td>
+<td><img src="imgs/camera_lens.jpg"></td>
+<td><img src="vis_output/camera_lens.jpg"></td>
+</tr>
+<tr>
+<td width=25% style="text-align:center;color:gray;">"What can <ins>make the woman stand higher</ins>? Please output segmentation mask and explain why."</td>
+<td width=25% style="text-align:center;">"Sure, [SEG]. The woman is <ins>standing higher by using a ladder</ins>..."</td>
+<td width=25% style="text-align:center;color:gray">"Can you segment <ins>the camera lens that is more suitable for photographing nearby objects</ins> in this image?"</td>
+<td width=25% style="text-align:center;">"Sure, [SEG]."</td>
+</tr>
+
+<tr>
+<td><img src="imgs/dog_with_horn.jpg"></td>
+<td><img src="vis_output/dog_with_horn.jpg"></td>
+<td><img src="imgs/wash_hands.jpg"></td>
+<td><img src="vis_output/wash_hands.jpg"></td>
+</tr>
+<tr>
+<td width=25% style="text-align:center;color:gray;">"Can you segment <ins>the unusual part</ins> in this image and explain why."</td>
+<td width=25% style="text-align:center;">"Sure, it is [SEG]. In the image, the unusual part is <ins>the dog wearing a reindeer antler headband</ins>..."</td>
+<td width=25% style="text-align:center;color:gray">"Where to <ins>wash hands</ins> in this image? Please output segmentation mask."</td>
+<td width=25% style="text-align:center;">"Sure, [SEG]."</td>
+</tr>
+
+<tr>
+<td><img src="imgs/jackma.jpg"></td>
+<td><img src="vis_output/jackma.jpg"></td>
+<td><img src="imgs/blackpink.jpg"></td>
+<td><img src="vis_output/blackpink.jpg"></td>
+</tr>
+<tr>
+<td width=25% style="text-align:center;color:gray;">"Can you segment <ins>the founder of Alibaba</ins> in this image and explain why?"</td>
+<td width=25% style="text-align:center;">"Sure, it is [SEG]. In the image, the man wearing a red tie and sitting on one of the chairs is <ins>Jack Ma</ins>, the co-founder of Alibaba Group..."</td>
+<td width=25% style="text-align:center;color:gray">"Please segment <ins>Lisa</ins> in this figure."</td>
+<td width=25% style="text-align:center;">"Sure, [SEG]."</td>
+</tr>
+</table>
 
 <p align="center"> <img src="imgs/fig_overview.jpg" width="100%"> </p>
 
-<p align="center"> <img src="imgs/teaser.jpg" width="100%"> </p>
-
 ## News
 - [x] [2023.8.4] [Online Demo](http://103.170.5.190:7860/) is released!
-- [x] [2023.8.4] [*ReasonSeg* Dataset](https://drive.google.com/drive/folders/125mewyg5Ao6tZ3ZdJ-1-E3n04LGVELqy?usp=sharing) and the [LISA-13B-llama2-v0-
+- [x] [2023.8.4] [*ReasonSeg* Dataset](https://drive.google.com/drive/folders/125mewyg5Ao6tZ3ZdJ-1-E3n04LGVELqy?usp=sharing) and the [LISA-13B-llama2-v0-explanatory](https://huggingface.co/xinlai/LISA-13B-llama2-v0-explanatory) model are released!
 - [x] [2023.8.3] Inference code and the [LISA-13B-llama2-v0](https://huggingface.co/xinlai/LISA-13B-llama2-v0) model are released. Welcome to check out!
 - [x] [2023.8.2] [Paper](https://arxiv.org/pdf/2308.00692.pdf) is released and GitHub repo is created.
 
@@ -126,7 +185,7 @@ deepspeed --master_port=24999 train_ds.py --version="PATH_TO_LLaVA_Wegihts" --da
 
 
 ## Inference
-To chat with [LISA-13B-llama2-v0](https://huggingface.co/xinlai/LISA-13B-llama2-v0) or [LISA-13B-llama2-v0-
+To chat with [LISA-13B-llama2-v0](https://huggingface.co/xinlai/LISA-13B-llama2-v0) or [LISA-13B-llama2-v0-explanatory](https://huggingface.co/xinlai/LISA-13B-llama2-v0-explanatory): (Note that LISA-13B-llama2-v0 currently does not support explanatory answers.)
 ```
 CUDA_VISIBLE_DEVICES=0 python3 chat.py --version='xinlai/LISA-13B-llama2-v0'
 ```
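The chat entry point resolves `xinlai/LISA-13B-llama2-v0` from the Hugging Face Hub. If you want the weights cached locally before launching `chat.py`, a minimal sketch, assuming the `huggingface_hub` client is installed (it is not among the pinned requirements shown here):

```python
# Sketch: pre-download the released LISA checkpoint so that chat.py
# can resolve it from the local Hugging Face cache.
from huggingface_hub import snapshot_download

local_path = snapshot_download("xinlai/LISA-13B-llama2-v0")
print(local_path)  # directory containing the downloaded files
```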
@@ -182,7 +241,7 @@ Besides, we leveraged GPT-3.5 for rephrasing instructions, so images in the trai
 If you find this project useful in your research, please consider citing:
 
 ```
-@article{
+@article{reason_seg,
   title={LISA: Reasoning Segmentation via Large Language Model},
   author={Xin Lai and Zhuotao Tian and Yukang Chen and Yanwei Li and Yuhui Yuan and Shu Liu and Jiaya Jia},
   journal={arXiv:2308.00692},
model/LISA.py
CHANGED
@@ -6,6 +6,7 @@ import torch.nn.functional as F
 from peft import (LoraConfig, get_peft_model)
 from transformers import BitsAndBytesConfig, CLIPVisionModel
 
+from transformers import LlamaForCausalLM, CLIPVisionModel, BitsAndBytesConfig
 from .llava.model.llava import LlavaLlamaForCausalLM
 from .segment_anything import build_sam_vit_h
 from utils.utils import (DEFAULT_IM_END_TOKEN, DEFAULT_IM_START_TOKEN,
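The added line re-imports `CLIPVisionModel` and `BitsAndBytesConfig`, which are already imported two lines above. Python tolerates repeated imports, so this is harmless; a deduplicated equivalent (a suggestion, not what the commit contains) would be:

```python
# Single combined import; LlamaForCausalLM is the only new name needed.
from transformers import (BitsAndBytesConfig, CLIPVisionModel,
                          LlamaForCausalLM)
```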
requirements.txt
CHANGED
@@ -6,9 +6,7 @@ markdown2==2.4.10
 numpy==1.24.2
 openai==0.27.8
 opencv_python==4.8.0.74
-peft==0.3.0
 Pillow==9.4.0
-Pillow==10.0.0
 pycocotools==2.0.6
 ray==2.6.1
 Requests==2.31.0
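These removals fix a genuine conflict: the old file pinned Pillow at both 9.4.0 and 10.0.0, which pip cannot satisfy simultaneously; only the 9.4.0 pin survives. (The `peft==0.3.0` pin is dropped as well, even though model/LISA.py still imports `peft`, so it presumably gets installed by other means.) A quick sanity check after `pip install -r requirements.txt`, a sketch assuming a standard pip environment:

```python
# Sketch: confirm that the single surviving Pillow pin is what got installed.
import PIL

assert PIL.__version__ == "9.4.0", f"unexpected Pillow version: {PIL.__version__}"
print("Pillow", PIL.__version__)
```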
vis_output/dog_with_horn.jpg
ADDED
Binary file
vis_output/example1_mask_0.jpg
CHANGED
Binary file
vis_output/example1_masked_img_0.jpg
CHANGED
Binary file
vis_output/example2_mask_0.jpg
DELETED
Binary file (13.8 kB)
vis_output/example2_masked_img_0.jpg
DELETED
Binary file (182 kB)