Speech Generation Works Sometimes, But Fails Randomly

#20

by psk - opened 6 days ago

psk

6 days ago

When it works, it performs very well. However, generation fails at random: sometimes the speech is not generated correctly, and when I retry, it works again. How can I ensure that the speech is always generated correctly?

HKUST-Audio

HKUST Audio org 5 days ago

You could generate multiple outputs using num_return_sequences=N, then run a Whisper model to select one with the correct text.
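As a rough illustration of the selection step, here is a minimal sketch that assumes each candidate has already been transcribed (for example by a Whisper model) and picks the one whose transcript has the lowest word error rate against the target text. The helper names `normalize`, `word_error_rate`, and `pick_best` are hypothetical and not part of any library, and the transcripts are made-up examples.

```python
import re

def normalize(text):
    """Lowercase, drop punctuation, and split into words."""
    return re.sub(r"[^a-z0-9' ]+", " ", text.lower()).split()

def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        return 0.0 if not hyp else 1.0
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

def pick_best(target_text, transcripts):
    """Return (index, wer) of the transcript closest to the target text."""
    scores = [word_error_rate(target_text, t) for t in transcripts]
    best = min(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]

# Hypothetical transcripts of three generated candidates:
# a stuck repetition, a silent clip, and a correct reading.
transcripts = ["hello hello hello hello", "", "Hello, world!"]
best_index, best_wer = pick_best("Hello world", transcripts)
```

In practice you would probably also set a maximum acceptable error rate and regenerate if no candidate passes it.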

psk

5 days ago

Yeah, that works, but I want to use it for real-time streaming. Do I have to fine-tune further if I want to always get the correct output?

HKUST-Audio

HKUST Audio org 5 days ago

Yes, I believe fine-tuning will work. We used a large-scale training dataset, which inevitably contains some noisy data, and that can make the model's output less stable. If you have an extremely clean dataset, meaning the text and speech are perfectly aligned with no errors, then fine-tuning on that data could improve the model's accuracy.

kmbarr

2 days ago

I have the same issue. It does amazingly well about 60% of the time, but then it will just generate silence or an excessively long clip where the speaker gets stuck on a syllable. It's frustrating both because of how well it does when it does work, and because of how difficult it is to find a quality TTS model that is still being maintained and can be easily integrated into a custom project.
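The two failure modes described here (silence, or a clip far longer than the text warrants) can be caught with a cheap check before playback, so a retry can be triggered automatically. This is only a sketch: the function name `detect_failure` and both thresholds (`silence_rms`, `max_sec_per_char`) are assumptions chosen for illustration, not values from the model authors, and the samples are assumed to be floats in the range -1.0 to 1.0.

```python
import math

def detect_failure(samples, sample_rate, text,
                   silence_rms=0.01, max_sec_per_char=0.25):
    """Return a failure label, or None if the clip looks plausible.

    Both thresholds are illustrative guesses and should be tuned on real output.
    """
    if not samples:
        return "empty"
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms < silence_rms:
        return "silence"
    duration = len(samples) / sample_rate
    if duration > max_sec_per_char * max(len(text), 1):
        return "too_long"
    return None

sample_rate = 16000
text = "Hello world"

def tone(seconds):
    """Synthetic 220 Hz tone standing in for generated speech."""
    n = int(seconds * sample_rate)
    return [0.5 * math.sin(2 * math.pi * 220 * i / sample_rate) for i in range(n)]

silent_result = detect_failure([0.0] * sample_rate, sample_rate, text)
normal_result = detect_failure(tone(1.0), sample_rate, text)
long_result = detect_failure(tone(10.0), sample_rate, text)
```

A check like this will not catch mispronounced or skipped words; it only flags the obvious failures.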

ZhenYe234

HKUST Audio org 1 day ago

Hi! Could you provide some examples of failed cases? That would help me analyze the issue. Also, have you tried setting a repetition penalty? Does it have any effect on the problem?
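For context on what a repetition penalty does, here is a minimal sketch of the common formulation, in which the logit of every token already generated is pushed down before the next token is sampled. It assumes the model is sampled token by token from a list of logits; the function name `apply_repetition_penalty` and the example values are hypothetical. If generation goes through Hugging Face transformers, the `repetition_penalty` argument to `generate()` applies the same idea.

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Lower the score of tokens that were already generated.

    Positive logits are divided by the penalty and negative logits are
    multiplied by it, so a penalty above 1.0 makes repeats less likely.
    """
    out = list(logits)
    for token_id in set(generated_ids):
        if out[token_id] > 0:
            out[token_id] = out[token_id] / penalty
        else:
            out[token_id] = out[token_id] * penalty
    return out

# Toy vocabulary of three tokens; tokens 0 and 1 were already generated.
logits = [2.0, -1.0, 0.5]
penalized = apply_repetition_penalty(logits, generated_ids=[0, 1, 0], penalty=2.0)
```

A value slightly above 1.0 may help with the stuck-syllable case, though setting it too high could hurt speech that legitimately repeats sounds.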
