Space: mrfakename/E2-F5-TTS
mrfakename committed 9 days ago

Commit 1583e1c (verified) · 1 Parent(s): 34d3b0e

Sync from GitHub repo

This Space is synced from the GitHub repo: https://github.com/SWivid/F5-TTS. Please submit contributions to the Space there.

Files changed (2)

src/f5_tts/infer/README.md

@@ -4,16 +4,17 @@ The pretrained model checkpoints can be reached at [🤗 Hugging Face](https://h
 **More checkpoints with whole community efforts can be found in [SHARED.md](SHARED.md), supporting more languages.**
-Currently support **30s for a single** generation, which is the **total length** including both prompt and output audio. However, you can provide `infer_cli` and `infer_gradio` with longer text, will automatically do chunk generation. Long reference audio will be **clip short to ~15s**.
 To avoid possible inference failures, make sure you have seen through the following instructions.
-- Use reference audio <15s and leave some silence (e.g. 1s) at the end. Otherwise there is a risk of truncating in the middle of word, leading to suboptimal generation.
-- Uppercased letters will be uttered letter by letter, so use lowercased letters for normal words.
-- Add some spaces (blank: " ") or punctuations (e.g. "," ".") to explicitly introduce some pauses.
-- Preprocess numbers to Chinese letters if you want to have them read in Chinese, otherwise in English.
-- If the generation output is blank (pure silence), check for ffmpeg installation (various tutorials online, blogs, videos, etc.).
-- Try turn off use_ema if using an early-stage finetuned checkpoint (which goes just few updates).
 ## Gradio App

 **More checkpoints with whole community efforts can be found in [SHARED.md](SHARED.md), supporting more languages.**
+Currently a single generation supports up to **30s**, which is the **total length** of prompt plus output audio (the same logic applies with `fix_duration`). However, `infer_cli` and `infer_gradio` will automatically do chunk generation for longer text input. Long reference audio will be **clipped to ~12s**.
 To avoid possible inference failures, make sure you have seen through the following instructions.
+- Use reference audio <12s and leave some silence (e.g. 1s) at the end. Otherwise there is a risk of truncating in the middle of a word, leading to suboptimal generation.
+- **Uppercased** letters (best in a form like K.F.C.) will be uttered letter by letter, so use lowercased letters for common words.
+- Add some spaces (blank: " ") or punctuation (e.g. "," ".") to explicitly introduce some **pauses**.
+- If English punctuation marks the end of a sentence, make sure there is a space " " after it. Otherwise it is not regarded as a sentence end when chunking.
+- Preprocess **numbers** to Chinese characters if you want them read in Chinese; otherwise they are read in English.
+- If the generation output is blank (pure silence), check the **ffmpeg** installation.
+- Try turning off **use_ema** if using an early-stage finetuned checkpoint (one that has gone through only a few updates).
 ## Gradio App
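
The note about a space after sentence-ending punctuation can be illustrated with a minimal sketch. It assumes a chunker that splits only where `.`, `!` or `?` is followed by whitespace; `split_sentences` is a hypothetical helper written for this illustration, not the function F5-TTS uses.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Hypothetical splitter: break only where . ! or ? is followed by whitespace."""
    # The lookbehind keeps the punctuation attached to the preceding sentence.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


# "you?Fine" has no space after the question mark, so no boundary is found there.
print(split_sentences("Hello there. How are you?Fine, thanks."))
```

Under this assumption, a missing space leaves two sentences glued into one chunk, which is the behavior the README line warns about.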

src/f5_tts/train/README.md

@@ -51,7 +51,9 @@ Discussion board for Finetuning [#57](https://github.com/SWivid/F5-TTS/discussio
 Gradio UI training/finetuning with `src/f5_tts/train/finetune_gradio.py` see [#143](https://github.com/SWivid/F5-TTS/discussions/143).
-The `use_ema = True` is harmful for early-stage finetuned checkpoints (which goes just few updates, thus ema weights still dominated by pretrained ones), try turn it off and see if provide better results.
 ### 3. W&B Logging

 Gradio UI training/finetuning with `src/f5_tts/train/finetune_gradio.py` see [#143](https://github.com/SWivid/F5-TTS/discussions/143).
+The **`use_ema = True` might be harmful for early-stage finetuned checkpoints** (which have gone through only a few updates, so the EMA weights are still dominated by the pretrained ones); try turning it off (`load_model(..., use_ema=False)`) and see if it gives better results.
+If using tensorboard as the logger, install it first with `pip install tensorboard`.
 ### 3. W&B Logging
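
The claim that EMA weights stay dominated by the pretrained ones after only a few updates can be checked with a little arithmetic. The sketch below assumes a plain EMA update `ema = decay * ema + (1 - decay) * param` with a fixed decay and no warmup; the decay value 0.9999 and the helper `pretrained_share` are illustrative assumptions, not settings taken from the repo.

```python
def pretrained_share(decay: float, steps: int) -> float:
    """Weight the initial (pretrained) parameters still carry in the EMA after `steps` updates."""
    # Unrolling ema = decay * ema + (1 - decay) * param leaves decay**steps on the starting value.
    return decay ** steps


# Assumed decay of 0.9999, chosen only for illustration.
for steps in (100, 1_000, 10_000):
    print(steps, round(pretrained_share(0.9999, steps), 3))
```

With these assumed numbers, the pretrained weights still account for roughly 90% of the EMA after 1,000 updates, which is why loading the non-EMA weights can sound better for a checkpoint early in finetuning.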