Step-Audio-TTS-3B
Step-Audio-TTS-3B is the industry's first Text-to-Speech (TTS) model trained on a large-scale synthetic dataset using the LLM-Chat paradigm. It achieves state-of-the-art (SOTA) Character Error Rate (CER) results on the SEED TTS Eval benchmark. The model supports multiple languages, a range of emotional expressions, and diverse voice style controls. Step-Audio-TTS-3B is also the first TTS model in the industry capable of generating rap and humming.
This repository provides the model weights for Step-Audio-TTS-3B, an LLM (Large Language Model) trained with a dual-codebook approach for text-to-speech synthesis. It also includes a vocoder trained with the dual-codebook approach and a specialized vocoder optimized for humming generation. Together, these resources enable high-quality speech synthesis and humming generation.
Content consistency (CER/WER) of Step-Audio compared with GLM-4-Voice and MinMo.
Model | test-zh CER (%) ↓ | test-en WER (%) ↓ |
---|---|---|
GLM-4-Voice | 2.19 | 2.91 |
MinMo | 2.48 | 2.90 |
Step-Audio | 1.53 | 2.71 |
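
The CER and WER columns are edit-distance error rates: the number of substitutions, insertions, and deletions needed to turn the recognized text into the reference, divided by the reference length. The sketch below illustrates only that arithmetic. It is not the SEED TTS Eval scoring script: the benchmark transcribes the synthesized audio with an ASR model and applies its own text normalization, neither of which is reproduced here. The function names `edit_distance`, `error_rate`, `cer`, and `wer` are hypothetical helpers introduced for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (characters or words)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning i reference units into an empty hypothesis costs i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference unit
                curr[j - 1] + 1,     # insertion of a hypothesis unit
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def error_rate(ref_units, hyp_units):
    """Error rate in percent, normalized by the reference length."""
    if not ref_units:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref_units, hyp_units) / len(ref_units)


def cer(reference, hypothesis):
    """Character error rate; spaces are dropped (assumed normalization)."""
    return error_rate(list(reference.replace(" ", "")),
                      list(hypothesis.replace(" ", "")))


def wer(reference, hypothesis):
    """Word error rate; lowercased and whitespace-tokenized (assumed normalization)."""
    return error_rate(reference.lower().split(), hypothesis.lower().split())


if __name__ == "__main__":
    # One substituted character out of six.
    print(f"CER: {cer('今天天气很好', '今天天气真好'):.2f}%")
    # One substituted word out of six.
    print(f"WER: {wer('the cat sat on the mat', 'the cat sat on a mat'):.2f}%")
```

Because the rate is normalized by the reference length, a hypothesis with many insertions can exceed 100%.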
Results of TTS Models on SEED Test Sets.
- Step-Audio-TTS-3B-Single denotes the dual-codebook backbone with a single-codebook vocoder.
Model | test-zh CER (%) ↓ | test-zh SS ↑ | test-en WER (%) ↓ | test-en SS ↑ |
---|---|---|---|---|
FireRedTTS | 1.51 | 0.630 | 3.82 | 0.460 |
MaskGCT | 2.27 | 0.774 | 2.62 | 0.774 |
CosyVoice | 3.63 | 0.775 | 4.29 | 0.699 |
CosyVoice 2 | 1.45 | 0.806 | 2.57 | 0.736 |
CosyVoice 2-S | 1.45 | 0.812 | 2.38 | 0.743 |
Step-Audio-TTS-3B-Single | 1.37 | 0.802 | 2.52 | 0.704 |
Step-Audio-TTS-3B | 1.31 | 0.733 | 2.31 | 0.660 |
Step-Audio-TTS | 1.17 | 0.73 | 2.0 | 0.660 |
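
The SS column is not defined on this page. It is read here as speaker similarity, which is typically computed as the cosine similarity between a speaker embedding of the reference prompt and one of the synthesized audio; treat that reading as an assumption. The sketch below shows only the similarity step on plain Python lists. The speaker-embedding model is out of scope, and `cosine_similarity` is a hypothetical helper name.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Toy 3-dimensional vectors standing in for real speaker embeddings.
    prompt_embedding = [0.2, 0.5, 0.1]
    synthesized_embedding = [0.25, 0.45, 0.05]
    print(f"SS: {cosine_similarity(prompt_embedding, synthesized_embedding):.3f}")
```

Values closer to 1 indicate that the synthesized voice is closer to the prompt speaker, which is why higher is better in that column.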
Performance comparison of dual-codebook resynthesis with CosyVoice.
Token | test-zh CER (%) ↓ | test-zh SS ↑ | test-en WER (%) ↓ | test-en SS ↑ |
---|---|---|---|---|
Groundtruth | 0.972 | - | 2.156 | - |
CosyVoice | 2.857 | 0.849 | 4.519 | 0.807 |
Step-Audio-TTS-3B | 2.192 | 0.784 | 3.585 | 0.742 |
More information
For more information, please refer to our repository: Step-Audio.