Open-Vocabulary Audio-Visual Semantic Segmentation (ACM MM 24 Oral)

Ruohao Guo, Liao Qu, Dantong Niu, Yanyu Qi, Wenzhen Yue, Ji Shi, Bowei Xing, Xianghua Ying*

PDF | CODE | Cite

News

25.07.2024: Our paper is accepted by ACM MM 2024 as Oral!

Introduction


Audio-visual semantic segmentation (AVSS) aims to segment and classify sounding objects in videos using acoustic cues. However, most approaches operate under the closed-set assumption and only identify pre-defined categories from the training data, lacking the ability to generalize to novel categories in practical applications. In this paper, we introduce a new task: open-vocabulary audio-visual semantic segmentation, extending the AVSS task to open-world scenarios beyond the annotated label space. This is a more challenging task that requires recognizing all categories, even those that have never been seen nor heard during training. Moreover, we propose the first open-vocabulary AVSS framework, OV-AVSS, which mainly consists of two parts: 1) a universal sound source localization module that performs audio-visual fusion and locates all potential sounding objects, and 2) an open-vocabulary classification module that predicts categories with the help of prior knowledge from large-scale pre-trained vision-language models. To properly evaluate open-vocabulary AVSS, we split zero-shot training and testing subsets based on the AVSBench-semantic benchmark, namely AVSBench-OV. Extensive experiments demonstrate the strong segmentation and zero-shot generalization ability of our model on all categories. On the AVSBench-OV dataset, OV-AVSS achieves 55.43% mIoU on base categories and 29.14% mIoU on novel categories, exceeding the state-of-the-art zero-shot method by 41.88%/20.61% and the open-vocabulary method by 10.2%/11.6%.
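
To make the two-module design concrete, here is a minimal PyTorch sketch of the pipeline; the class, the localizer interface, the projection layer, and all shapes are illustrative assumptions rather than the actual OV-AVSS implementation:

    import torch.nn as nn
    import torch.nn.functional as F

    class OVAVSSSketch(nn.Module):
        """Two-stage pipeline: (1) a universal sound source localization
        module fuses audio and visual features and proposes class-agnostic
        masks; (2) an open-vocabulary head scores each mask embedding
        against CLIP text embeddings of arbitrary category names."""

        def __init__(self, localizer: nn.Module, d_model: int = 256, clip_dim: int = 512):
            super().__init__()
            self.localizer = localizer                # audio-visual fusion + mask proposals
            self.proj = nn.Linear(d_model, clip_dim)  # map mask embeddings into CLIP space

        def forward(self, frames, audio, text_embeds):
            # Stage 1: locate all potential sounding objects (class-agnostic).
            masks, mask_embeds = self.localizer(frames, audio)  # (N, H, W), (N, d_model)
            # Stage 2: classify each mask by cosine similarity to the text
            # embeddings, so novel categories only need new text prompts.
            q = F.normalize(self.proj(mask_embeds), dim=-1)     # (N, clip_dim)
            t = F.normalize(text_embeds, dim=-1)                # (num_categories, clip_dim)
            return masks, q @ t.t()                             # masks + per-category logits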


Installation

Example 1

conda create -n ov_avss python==3.8 -y
conda activate ov_avss

git clone https://github.com/ruohaoguo/ovavss
cd ovavss

pip install torch torchvision
git clone https://github.com/facebookresearch/detectron2.git
python -m pip install -e detectron2

pip install -r requirements.txt

pip install git+https://github.com/openai/CLIP.git
pip install -e third_parties/mask_adapted_clip 

cd ov_avss/modeling/pixel_decoder/ops
bash make.sh

pip install git+https://github.com/sennnnn/TrackEval.git

Example 2

conda create -n ov_avss python==3.8 -y
conda activate ov_avss

git clone https://github.com/ruohaoguo/ovavss
cd ovavss

conda install pytorch==1.9.0 torchvision==0.10.0 cudatoolkit=11.1 -c pytorch -c nvidia
pip install -U opencv-python
git clone https://github.com/facebookresearch/detectron2
cd detectron2
pip install -e .

cd ..
pip install -r requirements.txt
pip install git+https://github.com/openai/CLIP.git
pip install -e third_parties/mask_adapted_clip

cd ov_avss/modeling/pixel_decoder/ops
bash make.sh

pip install git+https://github.com/sennnnn/TrackEval.git
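
Whichever example you follow, a short sanity check (hypothetical, not part of the repo) can confirm that the key dependencies import and that a CUDA build of PyTorch is visible:

    import torch, detectron2, clip

    # Post-install check: versions and CUDA availability.
    print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
    print("detectron2:", detectron2.__version__)
    print("CLIP models:", clip.available_models())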

Setup

  1. Download pretrained weight model_final_3c8ec9.pkl and put it in ./pre_models.
  2. Download pretrained weight model_final_83d103.pkl and put it in ./pre_models.
  3. Download pretrained weight vggish-10086976.pth (code: 1234) and put it in ./pre_models.
  4. Download and unzip the datasets (code: 1234) and put them in ./datasets. (A quick sanity check for these downloads follows this list.)
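
Before training, it may help to confirm that the downloads from steps 1-4 landed where the configs expect them; this is a hypothetical check, assuming the paths listed above:

    from pathlib import Path

    # Verify that each pretrained weight and the dataset folder are in place.
    for p in [Path("pre_models/model_final_3c8ec9.pkl"),
              Path("pre_models/model_final_83d103.pkl"),
              Path("pre_models/vggish-10086976.pth"),
              Path("datasets")]:
        print(f"{p}: {'found' if p.exists() else 'MISSING'}")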

Training

  • For ResNet-50 backbone: Run the following command

    python train_net.py \
        --config-file configs/avsbench/OV_AVSS_R50.yaml \
        --num-gpus 1
    
  • For Swin-base backbone: Run the following command

    python train_net.py \
        --config-file configs/avsbench/swin/OV_AVSS_SwinB.yaml \
        --num-gpus 1
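
If train_net.py follows detectron2's standard launcher (its --config-file and --num-gpus flags suggest it does), it likely also accepts --resume and trailing key-value config overrides; the following is an unverified example, with the OUTPUT_DIR override shown as an assumption:

    python train_net.py \
        --config-file configs/avsbench/OV_AVSS_R50.yaml \
        --num-gpus 1 \
        --resume \
        OUTPUT_DIR ./output_ov_avss_r50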
    

Inference & Evaluation

  • For ResNet-50 backbone: Download the trained model model_ov_avss_r50.pth (code: 1234), put it in ./pre_models, and run the following command

    cd demo_video
    python test_net_video_avsbench_r50.py
    
  • For Swin-base backbone: Download the trained model model_ov_avss_swinb.pth (code: 1234), put it in ./pre_models, and run the following command

    cd demo_video
    python test_net_video_avsbench_swinb.py
    
  • Note: Environments installed via Example 1 and Example 2 produce slightly different results.
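
The scores reported for this model are mean IoU (mIoU). As a reference for what that metric measures, here is a simplified, hypothetical per-class computation; the repo's actual evaluation protocol may differ in details such as class handling:

    import numpy as np

    def miou(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> float:
        """Mean IoU over classes that appear in prediction or ground truth.
        pred, gt: integer label maps of identical shape."""
        ious = []
        for c in range(num_classes):
            p, g = (pred == c), (gt == c)
            union = np.logical_or(p, g).sum()
            if union == 0:
                continue  # class absent from both maps; skip it
            ious.append(np.logical_and(p, g).sum() / union)
        return float(np.mean(ious)) if ious else 0.0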

FAQ

If you have suggestions for improving usability or any other advice, please feel free to contact us directly ([email protected]).

Citation

Please consider citing our paper in your publications if this project helps your research. The BibTeX reference is as follows.

@article{guo2024open,
  title={Open-Vocabulary Audio-Visual Semantic Segmentation},
  author={Guo, Ruohao and Qu, Liao and Niu, Dantong and Qi, Yanyu and Yue, Wenzhen and Shi, Ji and Xing, Bowei and Ying, Xianghua},
  journal={arXiv preprint arXiv:2407.21721},
  year={2024}
}

Acknowledgement

This repo is based on OpenVIS, Mask2Former, and detectron2. Thanks for their wonderful work.
