LISA: Reasoning Segmentation via Large Language Model
News
- [2023.8.3] Inference code and the LISA-13B-llama2-v0 model are released. Welcome to check them out!
- [2023.8.2] Paper is released and GitHub repo is created.
TODO
- Hugging Face Demo
- ReasonSeg Dataset Release
- Training Code Release
LISA: Reasoning Segmentation Via Large Language Model [Paper]
Xin Lai,
Zhuotao Tian,
Yukang Chen,
Yanwei Li,
Yuhui Yuan,
Shu Liu,
Jiaya Jia
Abstract
In this work, we propose a new segmentation task: reasoning segmentation. The task is designed to output a segmentation mask given a complex and implicit query text. We establish a benchmark comprising over one thousand image-instruction pairs, incorporating intricate reasoning and world knowledge for evaluation purposes. We then present LISA: Large-language Instructed Segmentation Assistant, which inherits the language generation capabilities of multi-modal Large Language Models (LLMs) while also being able to produce segmentation masks. For more details, please refer to the paper.
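To make the task concrete, the sketch below shows what a single image-instruction pair in such a benchmark could look like. It is purely illustrative: the class name, fields, and polygon encoding are assumptions, not the actual ReasonSeg annotation format.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Illustrative only: field names and the polygon encoding are assumptions,
# not the actual ReasonSeg annotation schema.
@dataclass
class ReasoningSegSample:
    image_path: str                            # input image
    query: str                                 # complex, implicit instruction
    target_polygon: List[Tuple[float, float]]  # ground-truth region as (x, y) vertices

sample = ReasoningSegSample(
    image_path="imgs/example2.jpg",
    query="Can you segment the food that tastes spicy and hot?",
    target_polygon=[(120.0, 80.0), (340.0, 80.0), (340.0, 260.0), (120.0, 260.0)],  # placeholder vertices
)
print(sample.query)
```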
Highlights
LISA unlocks new segmentation capabilities for multi-modal LLMs and can handle cases involving:
- complex reasoning;
- world knowledge;
- explanatory answers;
- multi-turn conversation.
LISA also demonstrates robust zero-shot capability when trained exclusively on reasoning-free datasets. In addition, fine-tuning the model with merely 239 reasoning segmentation image-instruction pairs results in further performance enhancement.
Experimental results
Installation
pip install -r requirements.txt
Inference
To chat with LISA-13B-llama2-v0 (note that this model does not currently support explanatory answers):
CUDA_VISIBLE_DEVICES=0 python3 chat.py --version='xinlai/LISA-13B-llama2-v0'
To use the bfloat16 data type for inference:
CUDA_VISIBLE_DEVICES=0 python3 chat.py --version='xinlai/LISA-13B-llama2-v0' --precision='bf16'
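As a rough, back-of-the-envelope note (not taken from this repository): bfloat16 stores each weight in 2 bytes versus 4 bytes for float32, so it roughly halves the memory needed to hold a 13B-parameter model's weights, while keeping float32's exponent range and thus being more robust to overflow than float16.

```python
# Rough weight-memory estimate for a 13B-parameter model
# (activations and KV cache not included).
n_params = 13e9
print(f"fp32: ~{n_params * 4 / 1e9:.0f} GB")  # ~52 GB
print(f"bf16: ~{n_params * 2 / 1e9:.0f} GB")  # ~26 GB
```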
After that, input the text prompt and then the image path. For example,
- Please input your prompt: Where can the driver see the car speed in this image? Please output segmentation mask.
- Please input the image path: imgs/example1.jpg
- Please input your prompt: Can you segment the food that tastes spicy and hot?
- Please input the image path: imgs/example2.jpg
The model then responds with text and the corresponding segmentation mask for the queried object.
Citation
If you find this project useful in your research, please consider citing:
@article{reason-seg,
title={LISA: Reasoning Segmentation Via Large Language Model},
author={Xin Lai and Zhuotao Tian and Yukang Chen and Yanwei Li and Yuhui Yuan and Shu Liu and Jiaya Jia},
journal={arXiv:2308.00692},
year={2023}
}