(Below is from https://github.com/yuhuacheng/clap_musicgen)

πŸ‘πŸ» CLAP-MusicGen 🎡

CLAP-MusicGen is a contrastive audio-text embedding model that combines the strengths of Contrastive Language-Audio Pretraining (CLAP) with Meta's MusicGen as the audio encoder. Users can generate latent embeddings for any given audio or text, enabling downstream tasks like music similarity search and audio classification.

Note that this is a proof-of-concept personal project; it is not aimed at providing the highest-quality embeddings but rather at demonstrating the idea.

Table of Contents

  • πŸ‘¨β€πŸ« Overview
  • πŸ—οΈ Model Architecture
  • πŸ“€ Training Data
  • πŸ’» Quick Start
  • 🎧 Similarity Search Demo
  • 🀿 Training / Evaluation Deep Dives
  • πŸͺͺ License
  • πŸ–‡οΈ Citation

πŸ‘¨β€πŸ« Overview

CLAP-MusicGen is a multimodal model designed to enhance music retrieval capabilities. By embedding both audio and text into a shared space, it enables efficient music-to-music and text-to-music search. Unlike traditional models limited to predefined categories, CLAP-MusicGen supports zero-shot classification, retrieval, and embedding extraction, making it a valuable tool for exploring and organizing music collections.

Key Capabilities:

  • MusicGen-based Audio Encoding: Uses MusicGen to extract high-quality audio embeddings.
  • Two-way Retrieval: Supports retrieving audio using either an audio query or a text query.

πŸ—οΈ Model Architecture

CLAP-MusicGen consists of:

  1. Audio Encoder: Uses MusicGen’s decoder to extract features from the audio token sequences produced by EnCodec.

  2. Text Encoder: A pretrained RoBERTa fine-tuned on the music style/genre text with a masked language modeling (MLM) objective.

  3. Projection Head: A multi-layer perceptron (MLP) that projects both text and audio embeddings into the same space.

  4. Contrastive(ish) Learning: Trained with a listwise ranking loss rather than a traditional contrastive loss to align text and audio embeddings, improving retrieval performance for tasks like music similarity search (a rough sketch of one possible formulation follows this list).
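
The repository describes the objective only as a listwise ranking loss, so the exact formulation isn't reproduced here. As a rough illustration, the sketch below uses a ListNet-style loss in which each clip's predicted similarities over a batch of captions are matched against a target relevance distribution; the relevance matrix, temperature, and L2 normalization are assumptions made for the example, not details taken from the source.

import torch
import torch.nn.functional as F

def listwise_ranking_loss(audio_emb, text_emb, relevance, temperature=0.07):
    """ListNet-style listwise ranking loss (a sketch, not necessarily the repo's exact loss).

    audio_emb:  (B, D) audio embeddings from the projection head
    text_emb:   (B, D) text embeddings from the projection head
    relevance:  (B, B) target relevance of each caption for each audio clip
                (e.g. 1.0 for the paired caption, graded values otherwise)
    """
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    scores = audio_emb @ text_emb.T / temperature               # predicted (B, B) similarities
    target_dist = F.softmax(relevance / temperature, dim=-1)    # target listwise distribution per clip
    log_pred_dist = F.log_softmax(scores, dim=-1)               # predicted listwise distribution per clip
    # Cross-entropy between target and predicted rankings, averaged over the batch.
    return -(target_dist * log_pred_dist).sum(dim=-1).mean()

# Example: a batch of 8 paired embeddings with one-hot relevance (identity matrix).
a, t = torch.randn(8, 1024), torch.randn(8, 1024)
print(listwise_ranking_loss(a, t, relevance=torch.eye(8)))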

πŸ“€ Training Data

The model is trained on the nyuuzyou/suno dataset from Hugging Face. This dataset includes approximately 10K curated audio-caption pairs, split into 80% training, 10% validation, and 10% evaluation. Captions are derived from the metadata.tags field, which describes musical styles and genres. Note that the full prompt from metadata.prompt can be included alongside the style tags during training to obtain even richer audio/text embeddings supervised by the full captions.
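
As a rough sketch of how such captions might be assembled with the πŸ€— datasets library (the actual preprocessing lives in the repository; the nested metadata/tags/prompt field layout below follows the description above and is otherwise an assumption):

from datasets import load_dataset

# Stream the dataset to avoid a full download; split names may differ.
ds = load_dataset("nyuuzyou/suno", split="train", streaming=True)

def build_caption(example, use_full_prompt=False):
    # Assumes each example carries a nested `metadata` record with `tags` and
    # `prompt` fields, per the description above; adjust to the actual schema.
    meta = example["metadata"]
    caption = meta.get("tags") or ""
    if use_full_prompt and meta.get("prompt"):
        caption = f"{caption}. {meta['prompt']}" if caption else meta["prompt"]
    return caption

for example in ds.take(3):
    print(build_caption(example))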

Note: since our CLAP model is trained on AI-generated music-caption pairs from Suno, it forms a synthetic data loop in which one AI learns from another AI’s outputs. This introduces the potential biases of training on AI-generated data and leaves room for further refinement by incorporating human-annotated music datasets.

πŸ’» Quick Start

Installation

To install the necessary dependencies, run:

pip install torch torchvision torchaudio transformers

Loading the Model from πŸ€— Hugging Face

First, clone the project repository and navigate to the project directory so that the src package is importable. Then load the pretrained model and tokenizer from the Hub:

from src.modules.clap_model import CLAPModel
from transformers import RobertaTokenizer

model = CLAPModel.from_pretrained("yuhuacheng/clap-musicgen")
tokenizer = RobertaTokenizer.from_pretrained("yuhuacheng/clap-roberta-finetuned")
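
Assuming CLAPModel behaves like a standard torch.nn.Module (an assumption, not stated explicitly above), it is worth switching to evaluation mode before extracting embeddings:

model.eval()  # disable dropout for deterministic inference; optionally move the model (and inputs) to a GPU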

Extracting Embeddings

From Audio

import torch 

with torch.no_grad():
    waveform = torch.rand(1, 1, 32000) # 1 sec waveform at 32kHz sample rate
    audio_embeddings = model.audio_encoder(ids=None, waveform=waveform)
    print(audio_embeddings.shape) # (1, 1024)

From Text

sample_captions = [
    'positive jazzy lofi',
    'fast house edm',
    'gangsta rap',
    'dark metal'
]

with torch.no_grad():
    tokenized_captions = tokenizer(sample_captions, return_tensors="pt", padding=True, truncation=True)
    text_embeddings = model.text_encoder(ids=None, **tokenized_captions)
    print(text_embeddings.shape) # (4, 1024)
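
Because both encoders project into the same 1024-dimensional space, the embeddings above can be compared directly. Below is a minimal sketch that reuses audio_embeddings, text_embeddings, and sample_captions from the previous snippets and assumes cosine similarity as the metric (typical for CLAP-style models, though the repository's demo may score differently):

import torch.nn.functional as F

# Cosine similarity between the 1-second clip and each caption: shape (1, 4)
similarity = F.normalize(audio_embeddings, dim=-1) @ F.normalize(text_embeddings, dim=-1).T
best = similarity.argmax(dim=-1).item()
print(f"closest caption: {sample_captions[best]} (score {similarity[0, best].item():.3f})")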

🎧 Similarity Search Demo

Please refer to the demo notebook, which demonstrates both audio-to-audio and text-to-audio search.

(Result snapshots)

🎡 Audio-to-Audio Search

πŸ’¬ Text-to-Audio Search
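
At their core, both searches reduce to a nearest-neighbour lookup over precomputed embeddings, roughly like the sketch below; the notebook's actual implementation may differ, and library_embeddings is a hypothetical tensor of embeddings for an audio collection.

import torch
import torch.nn.functional as F

def top_k_matches(query_embedding, library_embeddings, k=5):
    """Indices of the k most similar library items by cosine similarity.

    query_embedding:    (1, 1024) embedding of the query audio or text
    library_embeddings: (N, 1024) precomputed embeddings of an audio collection
    """
    query = F.normalize(query_embedding, dim=-1)
    library = F.normalize(library_embeddings, dim=-1)
    scores = (query @ library.T).squeeze(0)   # (N,) cosine similarities
    return torch.topk(scores, k=min(k, library.size(0))).indices

An audio-to-audio query passes an audio embedding as the query, while a text-to-audio query passes a text embedding; everything else stays the same.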

🀿 Training / Evaluation Deep Dives

(Coming soon)

πŸͺͺ License

  • The code in this repository is released under the MIT license as found in the LICENSE file.
  • Since the model was trained from the pretrained MusicGen weights, the model weights in this repository are released under the CC-BY-NC 4.0 license as found in the LICENSE_weights file.

πŸ–‡οΈ Citation

@inproceedings{copet2023simple,
  title = {Simple and Controllable Music Generation},
  author = {Jade Copet and Felix Kreuk and Itai Gat and Tal Remez and David Kant and Gabriel Synnaeve and Yossi Adi and Alexandre DΓ©fossez},
  booktitle = {Thirty-seventh Conference on Neural Information Processing Systems},
  year = {2023}
}
@inproceedings{laionclap2023,
  title = {Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation},
  author = {Wu*, Yusong and Chen*, Ke and Zhang*, Tianyu and Hui*, Yuchen and Berg-Kirkpatrick, Taylor and Dubnov, Shlomo},
  booktitle = {IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP},
  year = {2023}
}
@inproceedings{htsatke2022,
  author = {Ke Chen and Xingjian Du and Bilei Zhu and Zejun Ma and Taylor Berg-Kirkpatrick and Shlomo Dubnov},
  title = {HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection},
  booktitle = {IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP},
  year = {2022}
}