Stable Audio
Stable Audio was proposed in Stable Audio Open by Zach Evans et al. . it takes a text prompt as input and predicts the corresponding sound or music sample.
Stable Audio Open generates variable-length (up to 47s) stereo audio at 44.1kHz from text prompts. It comprises three components: an autoencoder that compresses waveforms into a manageable sequence length, a T5-based text embedding for text conditioning, and a transformer-based diffusion (DiT) model that operates in the latent space of the autoencoder.
Stable Audio is trained on a corpus of around 48k audio recordings, where around 47k are from Freesound and the rest are from the Free Music Archive (FMA). All audio files are licensed under CC0, CC BY, or CC Sampling+. This data is used to train the autoencoder and the DiT.
The abstract of the paper is the following: Open generative models are vitally important for the community, allowing for fine-tunes and serving as baselines when presenting new models. However, most current text-to-audio models are private and not accessible for artists and researchers to build upon. Here we describe the architecture and training process of a new open-weights text-to-audio model trained with Creative Commons data. Our evaluation shows that the model's performance is competitive with the state-of-the-art across various metrics. Notably, the reported FDopenl3 results (measuring the realism of the generations) showcase its potential for high-quality stereo sound synthesis at 44.1kHz.
This pipeline was contributed by Yoach Lacombe. The original codebase can be found at Stability-AI/stable-audio-tools.
Tips
When constructing a prompt, keep in mind:
- Descriptive prompt inputs work best; use adjectives to describe the sound (for example, "high quality" or "clear") and make the prompt context specific where possible (e.g. "melodic techno with a fast beat and synths" works better than "techno").
- Using a negative prompt can significantly improve the quality of the generated audio. Try using a negative prompt of "low quality, average quality".
During inference:
- The quality of the generated audio sample can be controlled by the
num_inference_steps
argument; higher steps give higher quality audio at the expense of slower inference. - Multiple waveforms can be generated in one go: set
num_waveforms_per_prompt
to a value greater than 1 to enable. Automatic scoring will be performed between the generated waveforms and prompt text, and the audios ranked from best to worst accordingly.
StableAudioPipeline
[[autodoc]] StableAudioPipeline - all - call