---
language:
- zh
- en
pipeline_tag: text-generation
---

# JMLA

<p>
  <a href="https://arxiv.org/pdf/2310.10159.pdf">Paper</a>
</p>
<br>

Music tagging is the task of predicting the tags of music recordings. However, previous music tagging research focuses primarily on closed-set tagging, which cannot generalize to new tags. In this work, we propose a zero-shot music tagging system modeled by a joint music and language attention (**JMLA**) model to address the open-set music tagging problem. The **JMLA** model consists of an audio encoder, modeled by a pretrained masked autoencoder, and a decoder, modeled by Falcon-7B.
We introduce a perceiver resampler to convert arbitrary-length audio into fixed-length embeddings, and dense attention connections between encoder and decoder layers to improve the information flow between them. We collect a large-scale music and description dataset from the internet and use ChatGPT to convert the raw descriptions into formalized and diverse descriptions for training the **JMLA** models. Our proposed **JMLA** system achieves a zero-shot audio tagging accuracy of $64.82\%$ on the GTZAN dataset, outperforming previous zero-shot systems, and achieves results comparable to previous systems on the FMA and MagnaTagATune datasets.
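The perceiver resampler is the piece that lets a fixed-size language-model prompt consume audio of any length: a small set of learned latent queries cross-attends to the encoder output and produces a fixed number of embeddings. The sketch below only illustrates that idea; the layer sizes, number of latents, and module names are assumptions, not the configuration used in the released checkpoint.

```python
import torch
import torch.nn as nn

class PerceiverResampler(nn.Module):
    """Minimal sketch: learned latent queries cross-attend to a variable-length
    sequence of audio embeddings and return a fixed-length sequence."""
    def __init__(self, dim=768, num_latents=64, num_heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, audio_embeds):                       # (batch, time, dim), time arbitrary
        q = self.latents.unsqueeze(0).expand(audio_embeds.size(0), -1, -1)
        x, _ = self.cross_attn(q, audio_embeds, audio_embeds)
        x = self.norm(x + q)
        return x + self.ffn(x)                             # (batch, num_latents, dim), fixed length

# A short clip and a long clip both map to the same number of latent embeddings.
resampler = PerceiverResampler()
print(resampler(torch.randn(1, 1500, 768)).shape)          # torch.Size([1, 64, 768])
print(resampler(torch.randn(1, 9000, 768)).shape)          # torch.Size([1, 64, 768])
```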


## Requirements
* conda create -n SpectPrompt python=3.9
* pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
* pip install transformers datasets librosa einops_exts einops mmcls peft ipdb torchlibrosa nnAudio
* pip install -U openmim
* mim install mmcv==1.7.1
  <br>
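
After installing, a quick optional sanity check is to import the main dependencies and confirm that PyTorch sees the GPU:

```python
import torch, transformers, librosa, mmcv

print('torch', torch.__version__, '| CUDA available:', torch.cuda.is_available())
print('transformers', transformers.__version__)
print('librosa', librosa.__version__)
print('mmcv', mmcv.__version__)
```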

## Quickstart
Below, we provide a simple example showing how to use **JMLA** with 🤗 Transformers.

#### 🤗 Transformers

To run inference with JMLA, only a few lines of code are needed, as demonstrated below.

```python
from transformers import AutoModel, AutoTokenizer
import torch
import numpy as np

model = AutoModel.from_pretrained('Tabgac/SpectPrompt', trust_remote_code=True)
device = model.device
# sample rate: 16k
music_path = '/path/to/music.wav'

# extract logmel spectrogram
# 1. parameters
class FFT_parameters:
  sample_rate = 16000
  window_size = 400
  n_fft = 400
  hop_size = 160
  n_mels = 80
  f_min = 50
  f_max = 8000
prms = FFT_parameters()
# 2. extract
import nnAudio.Spectrogram
import librosa
to_spec = nnAudio.Spectrogram.MelSpectrogram(
  sr=prms.sample_rate,
  n_fft=prms.n_fft,
  win_length=prms.window_size,
  hop_length=prms.hop_size,
  n_mels=prms.n_mels,
  fmin=prms.f_min,
  fmax=prms.f_max,
  center=True,
  power=2,
  verbose=False,
)
wav, _ = librosa.load(music_path, mono=True, sr=prms.sample_rate)  # load and resample to 16 kHz mono
lms = to_spec(torch.tensor(wav))
lms = (lms + torch.finfo().eps).log()  # keep on CPU: the transforms below operate on numpy arrays
# 3. processing
import os
from torch.nn.utils.rnn import pad_sequence
import random
# get the file transforms.py from https://github.com/taugastcn/SpectPrompt.git
from transforms import Normalize, SpecRandomCrop, SpecPadding, SpecRepeat


transforms = [ Normalize(-4.5, 4.5), SpecRandomCrop(target_len=2992), SpecPadding(target_len=2992), SpecRepeat() ]
lms = lms.numpy()

for trans in transforms:
  lms = trans(lms)

# template of input
input = dict()
input['filenames'] = [music_path.split('/')[-1]]
input['ans_crds'] = [0]
input['audio_crds'] = [0]
input['attention_mask'] = torch.tensor([[1, 1, 1, 1, 1]]).to(device)
input['input_ids'] = torch.tensor([[1, 694, 5777, 683, 13]]).to(device)
input['spectrogram'] = torch.from_numpy(lms).unsqueeze(dim=0).to(device)
# generation
model.eval()
gen_ids = model.forward_test(input)
gen_text = model.neck.tokenizer.batch_decode(gen_ids.clip(0))

```
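
`batch_decode` returns a list of strings that may still contain the tokenizer's special tokens. The exact tokens depend on the Falcon tokenizer shipped with the checkpoint, so the token strings below are assumptions; adjust them to what you actually see in the output.

```python
# Strip common special tokens before displaying the predicted description/tags.
text = gen_text[0]
for tok in ('<s>', '</s>', '<pad>', '<|endoftext|>'):
    text = text.replace(tok, '')
print(text.strip())
```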


## Citation
If you find our paper and code useful in your research, please consider giving us a star and citing our work:

```BibTeX
@article{JMLA,
  title={Joint Music and Language Attention Models for Zero-shot Music Tagging},
  author={Xingjian Du and Zhesong Yu and Jiaju Lin and Bilei Zhu and Qiuqiang Kong},
  journal={arXiv preprint arXiv:2310.10159},
  year={2023}
}
```
<br>