(Below is from https://github.com/yuhuacheng/clap_musicgen)
CLAP-MusicGen
CLAP-MusicGen is a contrastive audio-text embedding model that combines the Contrastive Language-Audio Pretraining (CLAP) approach with Meta's MusicGen as the audio encoder. It produces latent embeddings for any given audio or text, enabling downstream tasks such as music similarity search and audio classification.
Note that this is a proof-of-concept personal project: it aims to demonstrate the idea rather than to provide the highest-quality embeddings.
Table of Contents
- Overview
- Model Architecture
- Training Data
- Quick Start
- Similarity Search Demo
- Training / Evaluation Deep Dives
- License
- Citation
Overview
CLAP-MusicGen is a multimodal model designed to enhance music retrieval capabilities. By embedding both audio and text into a shared space, it enables efficient music-to-music and text-to-music search. Unlike traditional models limited to predefined categories, CLAP-MusicGen supports zero-shot classification, retrieval, and embedding extraction, making it a valuable tool for exploring and organizing music collections.
Key Capabilities:
- MusicGen-based Audio Encoding: Uses MusicGen to extract high-quality audio embeddings.
- Two-way Retrieval: Supports searching for audio using either an audio clip or a text query.
Model Architecture
CLAP-MusicGen consists of:
- Audio Encoder: Uses MusicGen's decoder to extract features from the token inputs produced by EnCodec.
- Text Encoder: A pretrained RoBERTa fine-tuned on music style/genre text with an MLM objective.
- Projection Head: A multi-layer perceptron (MLP) that projects both text and audio embeddings into the same space.
- Contrastive(ish) Learning: Trained with a listwise ranking loss rather than a traditional contrastive loss to align text and audio embeddings, improving retrieval performance for tasks like music similarity search (see the sketch below).
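To make the components concrete, here is a minimal sketch of a projection head and one possible listwise ranking objective (a ListNet-style KL loss over per-clip similarity distributions). The layer sizes, temperature, and the way relevance targets are built are illustrative assumptions, not the exact configuration used in this repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """MLP that maps an encoder output into the shared embedding space (sizes assumed)."""
    def __init__(self, in_dim: int, out_dim: int = 1024, hidden_dim: int = 2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Unit-normalize so that dot products act as cosine similarities.
        return F.normalize(self.net(x), dim=-1)

def listwise_ranking_loss(audio_emb, text_emb, relevance, temperature=0.07):
    """ListNet-style loss: per audio clip, match the softmax distribution over
    caption similarities to a target relevance distribution (e.g. tag overlap)."""
    scores = audio_emb @ text_emb.t() / temperature   # (B, B) similarity scores
    pred = F.log_softmax(scores, dim=-1)              # predicted ranking distribution
    target = F.softmax(relevance, dim=-1)             # target ranking distribution
    return F.kl_div(pred, target, reduction="batchmean")
```

With a one-hot relevance target this reduces to the familiar softmax cross-entropy used in contrastive training; graded relevance targets are what make the objective listwise.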
Training Data
The model is trained on the nyuuzyou/suno dataset from Hugging Face. This dataset includes approximately 10K curated audio-caption pairs, split into 80% training, 10% validation, and 10% evaluation. Captions are derived from the metadata.tags field, which describes musical styles and genres. One could also include the full prompt from metadata.prompt alongside the style tags during training to learn even richer audio/text embeddings supervised by the full captions.
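A rough sketch of how the data could be prepared is shown below. The exact column layout of nyuuzyou/suno (e.g. whether the tags live under a nested metadata field), the split name, and the seed are assumptions based on the description above; the repository's own preprocessing may differ.

```python
from datasets import load_dataset

# Load the Suno audio-caption dataset from the Hugging Face Hub (split name assumed).
dataset = load_dataset("nyuuzyou/suno", split="train")

def build_caption(example):
    # Captions come from the style/genre tags; metadata.prompt could be appended
    # for richer supervision (field names assumed from the description above).
    example["caption"] = example["metadata"]["tags"]
    return example

dataset = dataset.map(build_caption)

# 80% train / 10% validation / 10% evaluation.
split = dataset.train_test_split(test_size=0.2, seed=42)
holdout = split["test"].train_test_split(test_size=0.5, seed=42)
train_ds, val_ds, eval_ds = split["train"], holdout["train"], holdout["test"]
```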
Note: because this CLAP model is trained on AI-generated music-caption pairs from Suno, it forms a synthetic data loop in which one AI learns from another AI's outputs. This introduces the potential biases of training on AI-generated data and leaves room for further refinement by incorporating human-annotated music datasets.
Quick Start
Installation
To install the necessary dependencies, run:
```bash
pip install torch torchvision torchaudio transformers
```
Loading the Model from Hugging Face
First, clone the project repository and navigate to the project directory, since CLAPModel is imported from the repository's src package. Then load the pretrained weights and tokenizer from the Hugging Face Hub:
```python
from src.modules.clap_model import CLAPModel
from transformers import RobertaTokenizer

model = CLAPModel.from_pretrained("yuhuacheng/clap-musicgen")
tokenizer = RobertaTokenizer.from_pretrained("yuhuacheng/clap-roberta-finetuned")
```
Extracting Embeddings
From Audio
```python
import torch

with torch.no_grad():
    waveform = torch.rand(1, 1, 32000)  # 1-second waveform at a 32 kHz sample rate
    audio_embeddings = model.audio_encoder(ids=None, waveform=waveform)
    print(audio_embeddings.shape)  # (1, 1024)
```
From Text
```python
sample_captions = [
    'positive jazzy lofi',
    'fast house edm',
    'gangsta rap',
    'dark metal'
]

with torch.no_grad():
    tokenized_captions = tokenizer(sample_captions, return_tensors="pt", padding=True, truncation=True)
    text_embeddings = model.text_encoder(ids=None, **tokenized_captions)
    print(text_embeddings.shape)  # (4, 1024)
```
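Since audio and text land in the same 1024-dimensional space, the embeddings above can be compared directly, which is what enables zero-shot matching. Below is a minimal sketch that scores the audio clip against the sample captions with cosine similarity; the explicit normalization is an assumption, in case the model does not already return unit-norm embeddings.

```python
import torch.nn.functional as F

# Cosine similarity between the audio clip and each candidate caption.
audio_norm = F.normalize(audio_embeddings, dim=-1)  # (1, 1024)
text_norm = F.normalize(text_embeddings, dim=-1)    # (4, 1024)
scores = audio_norm @ text_norm.t()                 # (1, 4)

best = scores.argmax(dim=-1).item()
print(f"Best matching caption: {sample_captions[best]}")
```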
Similarity Search Demo
Please refer to the demo notebook, which demonstrates audio-to-audio as well as text-to-audio search.
(Result snapshots)
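As a rough illustration of what the notebook does, the sketch below indexes a small catalog of audio embeddings in a plain tensor and retrieves the top-k nearest tracks by cosine similarity. A query can be either an audio embedding or a text embedding, since both live in the shared space; the helper names and the brute-force index are illustrative, not the notebook's exact implementation.

```python
import torch
import torch.nn.functional as F

def build_index(embeddings: torch.Tensor) -> torch.Tensor:
    """Stack unit-normalized embeddings into a (num_tracks, dim) search matrix."""
    return F.normalize(embeddings, dim=-1)

def search(index: torch.Tensor, query: torch.Tensor, k: int = 5):
    """Return the scores and indices of the k tracks most similar to the query."""
    query = F.normalize(query, dim=-1)          # (1, dim)
    scores = (query @ index.t()).squeeze(0)     # (num_tracks,)
    return torch.topk(scores, k=min(k, index.size(0)))

# Toy audio-to-audio example over a catalog of 100 random "track" embeddings;
# in practice the catalog would hold model.audio_encoder outputs for real tracks.
catalog = build_index(torch.randn(100, 1024))
top_scores, top_indices = search(catalog, audio_embeddings, k=5)
print(top_indices.tolist(), top_scores.tolist())
```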
Training / Evaluation Deep Dives
(Coming soon)
License
- The code in this repository is released under the MIT license as found in the LICENSE file.
- Since the model was trained from the pretrained MusicGen weights, the model weights in this repository are released under the CC-BY-NC 4.0 license as found in the LICENSE_weights file.
Citation
```bibtex
@inproceedings{copet2023simple,
  title     = {Simple and Controllable Music Generation},
  author    = {Jade Copet and Felix Kreuk and Itai Gat and Tal Remez and David Kant and Gabriel Synnaeve and Yossi Adi and Alexandre D{\'e}fossez},
  booktitle = {Thirty-seventh Conference on Neural Information Processing Systems},
  year      = {2023},
}

@inproceedings{laionclap2023,
  title     = {Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation},
  author    = {Wu*, Yusong and Chen*, Ke and Zhang*, Tianyu and Hui*, Yuchen and Berg-Kirkpatrick, Taylor and Dubnov, Shlomo},
  booktitle = {IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP},
  year      = {2023},
}

@inproceedings{htsatke2022,
  author    = {Ke Chen and Xingjian Du and Bilei Zhu and Zejun Ma and Taylor Berg-Kirkpatrick and Shlomo Dubnov},
  title     = {HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection},
  booktitle = {IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP},
  year      = {2022},
}
```