Speech Generation Works Sometimes, But Fails Randomly

#20
by psk - opened

When it works, it performs very well. However, generation sometimes fails at random, and when I retry with the same input it works again. How can I ensure that the speech is always generated correctly?

HKUST Audio org

You could generate multiple outputs using num_return_sequences=N, then run a Whisper model to select one with the correct text.
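The selection step suggested above could be sketched roughly as follows. This is only an illustration, not the model authors' code: `pick_best` and `word_error_rate` are hypothetical helper names, and `transcribe` stands in for whatever speech recognizer you use (for example a Whisper model). It is passed in as a plain function, so the sketch makes no assumption about a specific Whisper API. The idea is to transcribe each of the N candidates and keep the one whose transcript has the lowest word error rate against the input text.

```python
import re
from typing import Any, Callable, Sequence


def normalize(text: str) -> list[str]:
    """Lowercase, replace punctuation with spaces, and split into words."""
    return re.sub(r"[^\w\s']", " ", text.lower()).split()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        return 0.0 if not hyp else 1.0
    # Levenshtein distance over words, keeping one row of the table at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def pick_best(target_text: str,
              candidates: Sequence[Any],
              transcribe: Callable[[Any], str]) -> tuple[int, float]:
    """Return (index, WER) of the candidate whose transcript best matches the text.

    `transcribe` is an assumption: any function mapping one generated
    audio clip to its recognized text.
    """
    scores = [word_error_rate(target_text, transcribe(c)) for c in candidates]
    best = min(range(len(candidates)), key=scores.__getitem__)
    return best, scores[best]
```

A silent clip transcribes to an empty string (WER 1.0) and a clip stuck on one syllable usually transcribes to repeated words, so both tend to score worse than a correct generation. The returned WER can also be compared against a threshold to decide whether to regenerate.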

Yeah, that works, but I want to use it for real-time streaming. Do I have to fine-tune it further if I want to always get the correct output?

HKUST Audio org

Yes, I believe fine-tuning will work. We used a large-scale training dataset, which inevitably contains some noisy data, and that can make the model's output less stable. If you have a dataset that is extremely clean, meaning the text and speech are perfectly aligned with no errors, then fine-tuning on that data could improve the model's accuracy.

I have the same issue. It does amazingly well about 60% of the time, but the rest of the time it will just generate silence or an excessively long clip where the speaker gets stuck on a syllable. It's frustrating both because of how well it does when it does work, and because of how difficult it is to find a quality TTS model that is still being maintained and can be easily integrated into a custom project.

HKUST Audio org

Hi! Could you provide some examples of failed cases? That would help me analyze the issue. Also, have you tried setting a repetition penalty? Does it have any effect on the problem?
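For anyone unfamiliar with the setting, here is a minimal sketch of what a repetition penalty does to the next-token scores, written in plain Python for illustration. It follows the common formulation used by the `repetition_penalty` argument of `generate` in Hugging Face transformers; whether it cures the stuck-syllable failures for this model is untested here, and `apply_repetition_penalty` is a hypothetical name, not part of any library.

```python
from typing import Sequence


def apply_repetition_penalty(logits: Sequence[float],
                             generated_ids: Sequence[int],
                             penalty: float) -> list[float]:
    """Make already-generated tokens less likely to be picked again.

    Positive scores are divided by the penalty and negative scores are
    multiplied by it, so with penalty > 1 every previously generated
    token moves toward a lower score. penalty == 1.0 changes nothing.
    """
    out = list(logits)
    for token_id in set(generated_ids):
        score = out[token_id]
        out[token_id] = score / penalty if score > 0 else score * penalty
    return out


# Tokens 0 and 1 were already generated; token 2 was not.
print(apply_repetition_penalty([2.0, -1.0, 0.5], [0, 1, 0], penalty=2.0))
```

In practice you would not implement this by hand but pass a value slightly above 1.0 to the generation call, assuming the model is driven through that API, and check whether the looping clips become less frequent.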
