---
license: mit
datasets:
- M2UGen/MUCaps
- M2UGen/MUEdit
- M2UGen/MUImage
- M2UGen/MUVideo
---
# M<sup>2</sup>UGen Model with MusicGen-medium

The M<sup>2</sup>UGen model is a Music Understanding and Generation model capable of Music Question Answering as well as Music Generation 
from texts, images, videos and audio, and Music Editing. It uses MERT as the music encoder, ViT as the image encoder 
and ViViT as the video encoder, with MusicGen/AudioLDM2 serving as the music decoder. Adapters project the encoder outputs 
into a LLaMA 2 model, which unifies these abilities in a single model. 
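
To make the data flow concrete, here is a minimal illustrative sketch of how these components fit together; all class and function names are hypothetical stand-ins, not the repository's actual API.

```python
# Minimal, illustrative sketch of the M2UGen data flow. All names here
# are hypothetical stand-ins for the repository's actual classes.

def m2ugen_forward(inputs, encoders, adapters, llama, music_decoder):
    """Route each modality through its encoder and adapter into LLaMA 2,
    then decode music tokens if the request calls for generation/editing."""
    extra_embeddings = []
    for modality, data in inputs.items():
        if modality == "text":
            continue  # text goes to LLaMA 2 directly as the prompt
        hidden = encoders[modality](data)                    # MERT / ViT / ViViT
        extra_embeddings.append(adapters[modality](hidden))  # project to LLM space
    out = llama(prompt=inputs.get("text", ""), embeddings=extra_embeddings)
    if out.music_tokens is not None:            # generation / editing request
        return music_decoder(out.music_tokens)  # MusicGen or AudioLDM2
    return out.text                             # music QA answer
```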

M<sup>2</sup>UGen was published in [M<sup>2</sup>UGen: Multi-modal Music Understanding and Generation with the Power of Large Language Models](https://arxiv.org/abs/2311.11255) by *Atin Sakkeer Hussain, Shansong Liu, Chenshuo Sun and Ying Shan*.

The code repository for the model is available at [crypto-code/M2UGen](https://github.com/crypto-code/M2UGen). Clone the repository, download the checkpoint, and run the following for a model demo:
```bash
python gradio_app.py --model ./ckpts/M2UGen-MusicGen-medium/checkpoint.pth --llama_dir ./ckpts/LLaMA-2 --music_decoder musicgen --music_decoder_path facebook/musicgen-medium
```
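
If you prefer scripting the checkpoint download, a hedged sketch using `huggingface_hub` is below; the `repo_id` and `filename` are assumptions, so verify them against this model card's file listing.

```python
# Hypothetical checkpoint download via huggingface_hub; repo_id and
# filename are assumptions -- verify against the model card's files tab.
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="M2UGen/M2UGen-MusicGen-medium",  # assumed repo id
    filename="checkpoint.pth",                # assumed filename
)
print(ckpt_path)  # move or symlink into ./ckpts/M2UGen-MusicGen-medium/
```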

## Citation

If you find this model useful, please consider citing: 

```bibtex
@article{hussain2023m,
  title={{M$^{2}$UGen: Multi-modal Music Understanding and Generation with the Power of Large Language Models}},
  author={Hussain, Atin Sakkeer and Liu, Shansong and Sun, Chenshuo and Shan, Ying},
  journal={arXiv preprint arXiv:2311.11255},
  year={2023}
}
```