EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer

🟣 EzAudio is a diffusion-based text-to-audio generation model. Designed for real-world audio applications, EzAudio combines high-quality audio synthesis with lower computational demands.

🎛 Play with EzAudio for text-to-audio generation, editing, and inpainting: EzAudio

🎮 EzAudio-ControlNet is available: EzAudio-ControlNet

We want to thank Hugging Face Spaces and Gradio for providing an incredible demo platform.

Installation

Clone the repository:

git clone git@github.com:haidog-yaqub/EzAudio.git

Install the dependencies:

cd EzAudio
pip install -r requirements.txt

Download checkpoints from: https://huggingface.co/OpenSound/EzAudio
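If you prefer to script the download, the following is a minimal sketch using `snapshot_download` from the `huggingface_hub` package (install it with `pip install huggingface_hub` if it is not already present). The helper name `download_checkpoints` and the default target directory are hypothetical. The sketch also assumes nothing about the file layout inside the repository, so check the downloaded folder and adjust the paths in the usage code accordingly.

```python
def download_checkpoints(local_dir="ezaudio_download"):
    """Fetch every file in the OpenSound/EzAudio repo into local_dir.

    The name and default directory are illustrative, not part of EzAudio.
    """
    # Imported lazily so that defining the helper does not require the package.
    from huggingface_hub import snapshot_download

    # Returns the local folder that now contains the repository files.
    return snapshot_download(repo_id="OpenSound/EzAudio", local_dir=local_dir)
```

Calling `download_checkpoints()` returns the local folder path. Compare its contents with the `ckpts/...` paths used in the usage code before running inference.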

Usage

You can use the model with the following code:

import torch

from api.ezaudio import load_models, generate_audio

# model and config paths
config_name = 'ckpts/ezaudio-xl.yml'
ckpt_path = 'ckpts/s3/ezaudio_s3_xl.pt'
vae_path = 'ckpts/vae/1m.pt'
# save_path = 'output/'
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# load model
(autoencoder, unet, tokenizer,
 text_encoder, noise_scheduler, params) = load_models(config_name, ckpt_path,
                                                      vae_path, device)

prompt = "a dog barking in the distance"
sr, audio = generate_audio(prompt, autoencoder, unet, tokenizer, text_encoder, noise_scheduler, params, device)
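The snippet above leaves the result in memory. As a minimal sketch of writing it to disk, the helper below saves mono samples as a 16-bit PCM WAV file using only the Python standard library. The helper name `save_wav` is hypothetical, and it assumes `audio` is a one-dimensional sequence of floats in the range [-1.0, 1.0]; the actual return type of `generate_audio` is not stated here, so you may need to flatten or convert the output first.

```python
import array
import sys
import wave


def save_wav(path, sample_rate, samples):
    """Write mono float samples in [-1.0, 1.0] to a 16-bit PCM WAV file."""
    # Clip each sample to [-1, 1], then scale to the signed 16-bit range.
    pcm = array.array(
        "h",
        (int(max(-1.0, min(1.0, float(s))) * 32767) for s in samples),
    )
    # WAV data is little-endian; swap bytes on big-endian machines.
    if sys.byteorder == "big":
        pcm.byteswap()
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)          # mono (assumption)
        wav_file.setsampwidth(2)          # 2 bytes = 16-bit samples
        wav_file.setframerate(int(sample_rate))
        wav_file.writeframes(pcm.tobytes())
```

With the variables from the usage code, a call such as `save_wav('output.wav', sr, audio)` would write the generated clip, provided `audio` matches the assumed shape.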

Todo

  • Release Gradio Demo along with checkpoints: EzAudio Space
  • Release ControlNet Demo along with checkpoints: EzAudio ControlNet Space
  • Release inference code
  • Release checkpoints for stage1 and stage2
  • Release training pipeline and dataset

Reference

If you find the code useful for your research, please consider citing:

@article{hai2024ezaudio,
  title={EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer},
  author={Hai, Jiarui and Xu, Yong and Zhang, Hao and Li, Chenxing and Wang, Helin and Elhilali, Mounya and Yu, Dong},
  journal={arXiv preprint arXiv:2409.10819},
  year={2024}
}

Acknowledgement

Some code is borrowed from or inspired by: U-ViT, PixArt, Hunyuan-DiT, and Stable Audio.
