Sync from GitHub repo
This Space is synced from the GitHub repo: https://github.com/SWivid/F5-TTS. Please submit contributions to the Space there.
- src/f5_tts/infer/README.md +8 -7
- src/f5_tts/train/README.md +3 -1
src/f5_tts/infer/README.md
CHANGED
@@ -4,16 +4,17 @@ The pretrained model checkpoints can be reached at [🤗 Hugging Face](https://h

 **More checkpoints with whole community efforts can be found in [SHARED.md](SHARED.md), supporting more languages.**

-Currently support **30s for a single** generation, which is the **total length** including both prompt and output audio. However,
+Currently supports **30s for a single** generation, which is the **total length** (the same logic applies if `fix_duration` is used) including both prompt and output audio. However, `infer_cli` and `infer_gradio` will automatically do chunk generation for longer text input. Long reference audio will be **clipped short to ~12s**.

 To avoid possible inference failures, make sure you have read through the following instructions.

-- Use reference audio <
-- Uppercased letters will be uttered letter by letter,
-- Add some spaces (blank: " ") or punctuations (e.g. "," ".") to explicitly introduce some pauses
--
--
--
+- Use reference audio <12s and leave proper silence (e.g. 1s) at the end. Otherwise there is a risk of truncating in the middle of a word, leading to suboptimal generation.
+- **Uppercased** letters (best in a form like K.F.C.) will be uttered letter by letter; use lowercase letters for common words.
+- Add some spaces (blank: " ") or punctuation (e.g. "," ".") to explicitly introduce some **pauses**.
+- If English punctuation marks the end of a sentence, make sure there is a space " " after it. Otherwise it is not regarded as a sentence end when chunking.
+- Preprocess **numbers** to Chinese characters if you want them read in Chinese; otherwise they are read in English.
+- If the generation output is blank (pure silence), check the **ffmpeg** installation.
+- Try turning off **use_ema** if using an early-stage finetuned checkpoint (one that has gone through only a few updates).


 ## Gradio App
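The tip about leaving a space after sentence-ending English punctuation can be illustrated with a small sketch. This is not the chunking code used by `infer_cli` or `infer_gradio`; `split_sentences` is a hypothetical helper that only shows why a missing space can keep two sentences in the same chunk, assuming a splitter that breaks on punctuation followed by whitespace.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Hypothetical splitter: break after '.', '!' or '?' only when whitespace follows."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


# A space after the period gives the splitter a boundary to work with.
print(split_sentences("Hello there. How are you?"))

# Without the space, the period is not followed by whitespace, so nothing is split.
print(split_sentences("Hello there.How are you?"))
```

Under this assumption, the second input stays a single chunk, which matches the behaviour the added README line warns about.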
src/f5_tts/train/README.md
CHANGED
@@ -51,7 +51,9 @@ Discussion board for Finetuning [#57](https://github.com/SWivid/F5-TTS/discussio

 Gradio UI training/finetuning with `src/f5_tts/train/finetune_gradio.py`, see [#143](https://github.com/SWivid/F5-TTS/discussions/143).

-The
+The **`use_ema = True` setting might be harmful for early-stage finetuned checkpoints** (those that have gone through only a few updates, so the EMA weights are still dominated by the pretrained ones); try turning it off (`load_model(..., use_ema=False)`) and see if that gives better results.
+
+If you use TensorBoard as the logger, install it first with `pip install tensorboard`.

 ### 3. W&B Logging

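The remark that EMA weights stay dominated by the pretrained ones after only a few updates follows from simple arithmetic, sketched below. It assumes a plain exponential moving average, `ema = decay * ema + (1 - decay) * weights`, started from the pretrained weights. The decay value 0.999 and the helper `pretrained_share` are illustrative assumptions, not F5-TTS settings or API, and real EMA implementations may add warmup schedules that change the numbers.

```python
def pretrained_share(decay: float, updates: int) -> float:
    """Fraction of the EMA weights still contributed by the initial (pretrained) weights.

    With ema_0 = pretrained and ema_t = decay * ema_{t-1} + (1 - decay) * w_t,
    the coefficient on ema_0 after n updates is decay ** n.
    """
    return decay ** updates


# Assumed decay of 0.999, for illustration only.
for n in (100, 1000, 10000):
    print(f"{n:>6} updates -> pretrained share {pretrained_share(0.999, n):.4f}")
```

With these assumed numbers, roughly 90% of the EMA weights still come from the pretrained model after 100 updates, which is why loading the non-EMA weights can sound closer to the finetuning data early on.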