mrfakename committed
Commit 1a19e0f · verified · 1 parent: 40eed23

Sync from GitHub repo


This Space is synced from the GitHub repo: https://github.com/SWivid/F5-TTS. Please submit contributions there.

Files changed (47)
  1. .gitattributes +5 -0
  2. .github/workflows/publish-pypi.yaml +66 -0
  3. README_REPO.md +99 -24
  4. app.py +52 -13
  5. ckpts/README.md +5 -3
  6. pyproject.toml +1 -1
  7. src/f5_tts/api.py +59 -60
  8. src/f5_tts/configs/E2TTS_Base.yaml +49 -0
  9. src/f5_tts/configs/E2TTS_Small.yaml +49 -0
  10. src/f5_tts/configs/F5TTS_Base.yaml +52 -0
  11. src/f5_tts/configs/F5TTS_Small.yaml +52 -0
  12. src/f5_tts/configs/F5TTS_v1_Base.yaml +53 -0
  13. src/f5_tts/eval/eval_infer_batch.py +22 -27
  14. src/f5_tts/eval/eval_infer_batch.sh +11 -6
  15. src/f5_tts/eval/eval_librispeech_test_clean.py +21 -27
  16. src/f5_tts/eval/eval_seedtts_testset.py +21 -27
  17. src/f5_tts/eval/eval_utmos.py +15 -17
  18. src/f5_tts/eval/utils_eval.py +11 -6
  19. src/f5_tts/infer/README.md +38 -80
  20. src/f5_tts/infer/SHARED.md +19 -9
  21. src/f5_tts/infer/examples/basic/basic.toml +2 -2
  22. src/f5_tts/infer/examples/basic/basic_ref_en.wav +0 -0
  23. src/f5_tts/infer/examples/basic/basic_ref_zh.wav +0 -0
  24. src/f5_tts/infer/examples/multi/country.flac +0 -0
  25. src/f5_tts/infer/examples/multi/main.flac +0 -0
  26. src/f5_tts/infer/examples/multi/story.toml +2 -2
  27. src/f5_tts/infer/examples/multi/town.flac +0 -0
  28. src/f5_tts/infer/infer_cli.py +26 -31
  29. src/f5_tts/infer/speech_edit.py +35 -28
  30. src/f5_tts/infer/utils_infer.py +114 -72
  31. src/f5_tts/model/backbones/README.md +2 -2
  32. src/f5_tts/model/backbones/dit.py +63 -8
  33. src/f5_tts/model/backbones/mmdit.py +52 -9
  34. src/f5_tts/model/backbones/unett.py +36 -5
  35. src/f5_tts/model/cfm.py +9 -11
  36. src/f5_tts/model/dataset.py +21 -10
  37. src/f5_tts/model/modules.py +115 -42
  38. src/f5_tts/model/trainer.py +143 -72
  39. src/f5_tts/model/utils.py +4 -3
  40. src/f5_tts/scripts/count_max_epoch.py +3 -3
  41. src/f5_tts/socket_client.py +61 -0
  42. src/f5_tts/socket_server.py +176 -99
  43. src/f5_tts/train/README.md +5 -5
  44. src/f5_tts/train/datasets/prepare_csv_wavs.py +188 -43
  45. src/f5_tts/train/finetune_cli.py +63 -21
  46. src/f5_tts/train/finetune_gradio.py +272 -250
  47. src/f5_tts/train/train.py +12 -11
.gitattributes CHANGED
@@ -33,3 +33,8 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+src/f5_tts/infer/examples/basic/basic_ref_en.wav filter=lfs diff=lfs merge=lfs -text
+src/f5_tts/infer/examples/basic/basic_ref_zh.wav filter=lfs diff=lfs merge=lfs -text
+src/f5_tts/infer/examples/multi/country.flac filter=lfs diff=lfs merge=lfs -text
+src/f5_tts/infer/examples/multi/main.flac filter=lfs diff=lfs merge=lfs -text
+src/f5_tts/infer/examples/multi/town.flac filter=lfs diff=lfs merge=lfs -text
.github/workflows/publish-pypi.yaml ADDED
@@ -0,0 +1,66 @@
+# This workflow uses actions that are not certified by GitHub.
+# They are provided by a third-party and are governed by
+# separate terms of service, privacy policy, and support
+# documentation.
+
+# GitHub recommends pinning actions to a commit SHA.
+# To get a newer version, you will need to update the SHA.
+# You can also reference a tag or branch, but the action may change without warning.
+
+name: Upload Python Package
+
+on:
+  release:
+    types: [published]
+
+permissions:
+  contents: read
+
+jobs:
+  release-build:
+    runs-on: ubuntu-latest
+
+    steps:
+      - uses: actions/checkout@v4
+
+      - uses: actions/setup-python@v5
+        with:
+          python-version: "3.x"
+
+      - name: Build release distributions
+        run: |
+          # NOTE: put your own distribution build steps here.
+          python -m pip install build
+          python -m build
+
+      - name: Upload distributions
+        uses: actions/upload-artifact@v4
+        with:
+          name: release-dists
+          path: dist/
+
+  pypi-publish:
+    runs-on: ubuntu-latest
+
+    needs:
+      - release-build
+
+    permissions:
+      # IMPORTANT: this permission is mandatory for trusted publishing
+      id-token: write
+
+    # Dedicated environments with protections for publishing are strongly recommended.
+    environment:
+      name: pypi
+      # OPTIONAL: uncomment and update to include your PyPI project URL in the deployment status:
+      # url: https://pypi.org/p/YOURPROJECT
+
+    steps:
+      - name: Retrieve release distributions
+        uses: actions/download-artifact@v4
+        with:
+          name: release-dists
+          path: dist/
+
+      - name: Publish release distributions to PyPI
+        uses: pypa/gh-action-pypi-publish@release/v1
README_REPO.md CHANGED
@@ -6,7 +6,8 @@
 [![hfspace](https://img.shields.io/badge/🤗-Space%20demo-yellow)](https://huggingface.co/spaces/mrfakename/E2-F5-TTS)
 [![msspace](https://img.shields.io/badge/🤖-Space%20demo-blue)](https://modelscope.cn/studios/modelscope/E2-F5-TTS)
 [![lab](https://img.shields.io/badge/X--LANCE-Lab-grey?labelColor=lightgrey)](https://x-lance.sjtu.edu.cn/)
-<img src="https://github.com/user-attachments/assets/12d7749c-071a-427c-81bf-b87b91def670" alt="Watermark" style="width: 40px; height: auto">
+[![lab](https://img.shields.io/badge/Peng%20Cheng-Lab-grey?labelColor=lightgrey)](https://www.pcl.ac.cn)
+<!-- <img src="https://github.com/user-attachments/assets/12d7749c-071a-427c-81bf-b87b91def670" alt="Watermark" style="width: 40px; height: auto"> -->
 
 **F5-TTS**: Diffusion Transformer with ConvNeXt V2, faster trained and inference.
 
@@ -17,40 +18,84 @@
 ### Thanks to all the contributors !
 
 ## News
+- **2025/03/12**: 🔥 F5-TTS v1 base model with better training and inference performance. [Few demo](https://swivid.github.io/F5-TTS_updates).
 - **2024/10/08**: F5-TTS & E2 TTS base models on [🤗 Hugging Face](https://huggingface.co/SWivid/F5-TTS), [🤖 Model Scope](https://www.modelscope.cn/models/SWivid/F5-TTS_Emilia-ZH-EN), [🟣 Wisemodel](https://wisemodel.cn/models/SJTU_X-LANCE/F5-TTS_Emilia-ZH-EN).
 
 ## Installation
 
+### Create a separate environment if needed
+
 ```bash
 # Create a python 3.10 conda env (you could also use virtualenv)
 conda create -n f5-tts python=3.10
 conda activate f5-tts
+```
 
-# NVIDIA GPU: install pytorch with your CUDA version, e.g.
-pip install torch==2.3.0+cu118 torchaudio==2.3.0+cu118 --extra-index-url https://download.pytorch.org/whl/cu118
+### Install PyTorch with matched device
 
-# AMD GPU: install pytorch with your ROCm version, e.g.
-pip install torch==2.5.1+rocm6.2 torchaudio==2.5.1+rocm6.2 --extra-index-url https://download.pytorch.org/whl/rocm6.2
-```
+<details>
+<summary>NVIDIA GPU</summary>
 
-Then you can choose from a few options below:
+> ```bash
+> # Install pytorch with your CUDA version, e.g.
+> pip install torch==2.4.0+cu124 torchaudio==2.4.0+cu124 --extra-index-url https://download.pytorch.org/whl/cu124
+> ```
 
-### 1. As a pip package (if just for inference)
+</details>
 
-```bash
-pip install git+https://github.com/SWivid/F5-TTS.git
-```
+<details>
+<summary>AMD GPU</summary>
 
-### 2. Local editable (if also do training, finetuning)
+> ```bash
+> # Install pytorch with your ROCm version (Linux only), e.g.
+> pip install torch==2.5.1+rocm6.2 torchaudio==2.5.1+rocm6.2 --extra-index-url https://download.pytorch.org/whl/rocm6.2
+> ```
 
-```bash
-git clone https://github.com/SWivid/F5-TTS.git
-cd F5-TTS
-# git submodule update --init --recursive # (optional, if need bigvgan)
-pip install -e .
-```
+</details>
+
+<details>
+<summary>Intel GPU</summary>
+
+> ```bash
+> # Install pytorch with your XPU version, e.g.
+> # Intel® Deep Learning Essentials or Intel® oneAPI Base Toolkit must be installed
+> pip install torch torchaudio --index-url https://download.pytorch.org/whl/test/xpu
+>
+> # Intel GPU support is also available through IPEX (Intel® Extension for PyTorch)
+> # IPEX does not require the Intel® Deep Learning Essentials or Intel® oneAPI Base Toolkit
+> # See: https://pytorch-extension.intel.com/installation?request=platform
+> ```
+
+</details>
+
+<details>
+<summary>Apple Silicon</summary>
+
+> ```bash
+> # Install the stable pytorch, e.g.
+> pip install torch torchaudio
+> ```
 
-### 3. Docker usage
+</details>
+
+### Then you can choose one from below:
+
+> ### 1. As a pip package (if just for inference)
+>
+> ```bash
+> pip install f5-tts
+> ```
+>
+> ### 2. Local editable (if also do training, finetuning)
+>
+> ```bash
+> git clone https://github.com/SWivid/F5-TTS.git
+> cd F5-TTS
+> # git submodule update --init --recursive # (optional, if need > bigvgan)
+> pip install -e .
+> ```
+
+### Docker usage also available
 ```bash
 # Build from Dockerfile
 docker build -t f5tts:v1 .
@@ -82,14 +127,40 @@ f5-tts_infer-gradio --port 7860 --host 0.0.0.0
 f5-tts_infer-gradio --share
 ```
 
+<details>
+<summary>NVIDIA device docker compose file example</summary>
+
+```yaml
+services:
+  f5-tts:
+    image: ghcr.io/swivid/f5-tts:main
+    ports:
+      - "7860:7860"
+    environment:
+      GRADIO_SERVER_PORT: 7860
+    entrypoint: ["f5-tts_infer-gradio", "--port", "7860", "--host", "0.0.0.0"]
+    deploy:
+      resources:
+        reservations:
+          devices:
+            - driver: nvidia
+              count: 1
+              capabilities: [gpu]
+
+volumes:
+  f5-tts:
+    driver: local
+```
+
+</details>
+
 ### 2. CLI Inference
 
 ```bash
 # Run with flags
 # Leave --ref_text "" will have ASR model transcribe (extra GPU memory usage)
-f5-tts_infer-cli \
---model "F5-TTS" \
---ref_audio "ref_audio.wav" \
+f5-tts_infer-cli --model F5TTS_v1_Base \
+--ref_audio "provide_prompt_wav_path_here.wav" \
 --ref_text "The content, subtitle or transcription of reference audio." \
 --gen_text "Some text you want TTS model generate for you."
 
@@ -110,15 +181,19 @@ f5-tts_infer-cli -c src/f5_tts/infer/examples/multi/story.toml
 
 ## Training
 
-### 1. Gradio App
+### 1. With Hugging Face Accelerate
 
-Read [training & finetuning guidance](src/f5_tts/train) for more instructions.
+Refer to [training & finetuning guidance](src/f5_tts/train) for best practice.
+
+### 2. With Gradio App
 
 ```bash
 # Quick start with Gradio web interface
 f5-tts_finetune-gradio
 ```
 
+Read [training & finetuning guidance](src/f5_tts/train) for more instructions.
+
 
 ## [Evaluation](src/f5_tts/eval)
 
app.py CHANGED
@@ -41,12 +41,12 @@ from f5_tts.infer.utils_infer import (
 )
 
 
-DEFAULT_TTS_MODEL = "F5-TTS"
+DEFAULT_TTS_MODEL = "F5-TTS_v1"
 tts_model_choice = DEFAULT_TTS_MODEL
 
 DEFAULT_TTS_MODEL_CFG = [
-    "hf://SWivid/F5-TTS/F5TTS_Base/model_1200000.safetensors",
-    "hf://SWivid/F5-TTS/F5TTS_Base/vocab.txt",
+    "hf://SWivid/F5-TTS/F5TTS_v1_Base/model_1250000.safetensors",
+    "hf://SWivid/F5-TTS/F5TTS_v1_Base/vocab.txt",
     json.dumps(dict(dim=1024, depth=22, heads=16, ff_mult=2, text_dim=512, conv_layers=4)),
 ]
 
@@ -56,13 +56,15 @@ DEFAULT_TTS_MODEL_CFG = [
 vocoder = load_vocoder()
 
 
-def load_f5tts(ckpt_path=str(cached_path("hf://SWivid/F5-TTS/F5TTS_Base/model_1200000.safetensors"))):
-    F5TTS_model_cfg = dict(dim=1024, depth=22, heads=16, ff_mult=2, text_dim=512, conv_layers=4)
+def load_f5tts():
+    ckpt_path = str(cached_path(DEFAULT_TTS_MODEL_CFG[0]))
+    F5TTS_model_cfg = json.loads(DEFAULT_TTS_MODEL_CFG[2])
     return load_model(DiT, F5TTS_model_cfg, ckpt_path)
 
 
-def load_e2tts(ckpt_path=str(cached_path("hf://SWivid/E2-TTS/E2TTS_Base/model_1200000.safetensors"))):
-    E2TTS_model_cfg = dict(dim=1024, depth=24, heads=16, ff_mult=4)
+def load_e2tts():
+    ckpt_path = str(cached_path("hf://SWivid/E2-TTS/E2TTS_Base/model_1200000.safetensors"))
+    E2TTS_model_cfg = dict(dim=1024, depth=24, heads=16, ff_mult=4, text_mask_padding=False, pe_attn_head=1)
     return load_model(UNetT, E2TTS_model_cfg, ckpt_path)
 
 
@@ -73,7 +75,7 @@ def load_custom(ckpt_path: str, vocab_path="", model_cfg=None):
     if vocab_path.startswith("hf://"):
         vocab_path = str(cached_path(vocab_path))
     if model_cfg is None:
-        model_cfg = dict(dim=1024, depth=22, heads=16, ff_mult=2, text_dim=512, conv_layers=4)
+        model_cfg = json.loads(DEFAULT_TTS_MODEL_CFG[2])
     return load_model(DiT, model_cfg, ckpt_path, vocab_file=vocab_path)
 
 
@@ -130,7 +132,7 @@ def infer(
 
     ref_audio, ref_text = preprocess_ref_audio_text(ref_audio_orig, ref_text, show_info=show_info)
 
-    if model == "F5-TTS":
+    if model == DEFAULT_TTS_MODEL:
         ema_model = F5TTS_ema_model
     elif model == "E2-TTS":
         global E2TTS_ema_model
@@ -762,7 +764,7 @@ If you're having issues, try converting your reference audio to WAV or MP3, clip
 """
     )
 
-    last_used_custom = files("f5_tts").joinpath("infer/.cache/last_used_custom_model_info.txt")
+    last_used_custom = files("f5_tts").joinpath("infer/.cache/last_used_custom_model_info_v1.txt")
 
     def load_last_used_custom():
         try:
@@ -821,7 +823,30 @@ If you're having issues, try converting your reference audio to WAV or MP3, clip
             custom_model_cfg = gr.Dropdown(
                 choices=[
                     DEFAULT_TTS_MODEL_CFG[2],
-                    json.dumps(dict(dim=768, depth=18, heads=12, ff_mult=2, text_dim=512, conv_layers=4)),
+                    json.dumps(
+                        dict(
+                            dim=1024,
+                            depth=22,
+                            heads=16,
+                            ff_mult=2,
+                            text_dim=512,
+                            text_mask_padding=False,
+                            conv_layers=4,
+                            pe_attn_head=1,
+                        )
+                    ),
+                    json.dumps(
+                        dict(
+                            dim=768,
+                            depth=18,
+                            heads=12,
+                            ff_mult=2,
+                            text_dim=512,
+                            text_mask_padding=False,
+                            conv_layers=4,
+                            pe_attn_head=1,
+                        )
+                    ),
                 ],
                 value=load_last_used_custom()[2],
                 allow_custom_value=True,
@@ -875,10 +900,24 @@ If you're having issues, try converting your reference audio to WAV or MP3, clip
     type=str,
     help='The root path (or "mount point") of the application, if it\'s not served from the root ("/") of the domain. Often used when the application is behind a reverse proxy that forwards requests to the application, e.g. set "/myapp" or full URL for application served at "https://example.com/myapp".',
 )
-def main(port, host, share, api, root_path):
+@click.option(
+    "--inbrowser",
+    "-i",
+    is_flag=True,
+    default=False,
+    help="Automatically launch the interface in the default web browser",
+)
+def main(port, host, share, api, root_path, inbrowser):
     global app
     print("Starting app...")
-    app.queue(api_open=api).launch(server_name=host, server_port=port, share=share, show_api=api, root_path=root_path)
+    app.queue(api_open=api).launch(
+        server_name=host,
+        server_port=port,
+        share=share,
+        show_api=api,
+        root_path=root_path,
+        inbrowser=inbrowser,
+    )
 
 
 if __name__ == "__main__":
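The app.py change above adds an `--inbrowser` click flag and forwards it into `app.queue(...).launch(...)`. Below is a minimal, hypothetical sketch of that pattern in isolation (the placeholder Blocks app and option defaults are illustrative and not part of the commit); it assumes `gradio` and `click` are installed.

```python
# Minimal sketch: forwarding a click flag into Gradio's launch(),
# mirroring the new --inbrowser option in app.py. The demo app is a placeholder.
import click
import gradio as gr

with gr.Blocks() as demo:
    gr.Markdown("F5-TTS demo placeholder")


@click.command()
@click.option("--port", "-p", default=7860, type=int, help="Port to serve on")
@click.option("--inbrowser", "-i", is_flag=True, default=False,
              help="Automatically launch the interface in the default web browser")
def main(port, inbrowser):
    # inbrowser=True makes Gradio open a browser tab once the server is up
    demo.queue().launch(server_port=port, inbrowser=inbrowser)


if __name__ == "__main__":
    main()
```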
ckpts/README.md CHANGED
@@ -3,8 +3,10 @@ Pretrained model ckpts. https://huggingface.co/SWivid/F5-TTS
 
 ```
 ckpts/
-    E2TTS_Base/
-        model_1200000.pt
+    F5TTS_v1_Base/
+        model_1250000.safetensors
     F5TTS_Base/
-        model_1200000.pt
+        model_1200000.safetensors
+    E2TTS_Base/
+        model_1200000.safetensors
 ```
pyproject.toml CHANGED
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 
 [project]
 name = "f5-tts"
-version = "0.3.4"
+version = "1.0.1"
 description = "F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching"
 readme = "README.md"
 license = {text = "MIT License"}
src/f5_tts/api.py CHANGED
@@ -5,84 +5,84 @@ from importlib.resources import files
 import soundfile as sf
 import tqdm
 from cached_path import cached_path
+from omegaconf import OmegaConf
 
 from f5_tts.infer.utils_infer import (
-    hop_length,
-    infer_process,
     load_model,
     load_vocoder,
+    transcribe,
     preprocess_ref_audio_text,
+    infer_process,
     remove_silence_for_generated_wav,
     save_spectrogram,
-    transcribe,
-    target_sample_rate,
 )
-from f5_tts.model import DiT, UNetT
+from f5_tts.model import DiT, UNetT  # noqa: F401. used for config
 from f5_tts.model.utils import seed_everything
 
 
 class F5TTS:
     def __init__(
         self,
-        model_type="F5-TTS",
+        model="F5TTS_v1_Base",
         ckpt_file="",
         vocab_file="",
         ode_method="euler",
         use_ema=True,
-        vocoder_name="vocos",
-        local_path=None,
+        vocoder_local_path=None,
         device=None,
         hf_cache_dir=None,
     ):
-        # Initialize parameters
-        self.final_wave = None
-        self.target_sample_rate = target_sample_rate
-        self.hop_length = hop_length
-        self.seed = -1
-        self.mel_spec_type = vocoder_name
-
-        # Set device
+        model_cfg = OmegaConf.load(str(files("f5_tts").joinpath(f"configs/{model}.yaml")))
+        model_cls = globals()[model_cfg.model.backbone]
+        model_arc = model_cfg.model.arch
+
+        self.mel_spec_type = model_cfg.model.mel_spec.mel_spec_type
+        self.target_sample_rate = model_cfg.model.mel_spec.target_sample_rate
+
+        self.ode_method = ode_method
+        self.use_ema = use_ema
+
         if device is not None:
             self.device = device
         else:
             import torch
 
-            self.device = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"
+            self.device = (
+                "cuda"
+                if torch.cuda.is_available()
+                else "xpu"
+                if torch.xpu.is_available()
+                else "mps"
+                if torch.backends.mps.is_available()
+                else "cpu"
+            )
 
         # Load models
-        self.load_vocoder_model(vocoder_name, local_path=local_path, hf_cache_dir=hf_cache_dir)
-        self.load_ema_model(
-            model_type, ckpt_file, vocoder_name, vocab_file, ode_method, use_ema, hf_cache_dir=hf_cache_dir
+        self.vocoder = load_vocoder(
+            self.mel_spec_type, vocoder_local_path is not None, vocoder_local_path, self.device, hf_cache_dir
         )
 
-    def load_vocoder_model(self, vocoder_name, local_path=None, hf_cache_dir=None):
-        self.vocoder = load_vocoder(vocoder_name, local_path is not None, local_path, self.device, hf_cache_dir)
-
-    def load_ema_model(self, model_type, ckpt_file, mel_spec_type, vocab_file, ode_method, use_ema, hf_cache_dir=None):
-        if model_type == "F5-TTS":
-            if not ckpt_file:
-                if mel_spec_type == "vocos":
-                    ckpt_file = str(
-                        cached_path("hf://SWivid/F5-TTS/F5TTS_Base/model_1200000.safetensors", cache_dir=hf_cache_dir)
-                    )
-                elif mel_spec_type == "bigvgan":
-                    ckpt_file = str(
-                        cached_path("hf://SWivid/F5-TTS/F5TTS_Base_bigvgan/model_1250000.pt", cache_dir=hf_cache_dir)
-                    )
-            model_cfg = dict(dim=1024, depth=22, heads=16, ff_mult=2, text_dim=512, conv_layers=4)
-            model_cls = DiT
-        elif model_type == "E2-TTS":
-            if not ckpt_file:
-                ckpt_file = str(
-                    cached_path("hf://SWivid/E2-TTS/E2TTS_Base/model_1200000.safetensors", cache_dir=hf_cache_dir)
-                )
-            model_cfg = dict(dim=1024, depth=24, heads=16, ff_mult=4)
-            model_cls = UNetT
+        repo_name, ckpt_step, ckpt_type = "F5-TTS", 1250000, "safetensors"
+
+        # override for previous models
+        if model == "F5TTS_Base":
+            if self.mel_spec_type == "vocos":
+                ckpt_step = 1200000
+            elif self.mel_spec_type == "bigvgan":
+                model = "F5TTS_Base_bigvgan"
+                ckpt_type = "pt"
+        elif model == "E2TTS_Base":
+            repo_name = "E2-TTS"
+            ckpt_step = 1200000
         else:
-            raise ValueError(f"Unknown model type: {model_type}")
+            raise ValueError(f"Unknown model type: {model}")
 
+        if not ckpt_file:
+            ckpt_file = str(
+                cached_path(f"hf://SWivid/{repo_name}/{model}/model_{ckpt_step}.{ckpt_type}", cache_dir=hf_cache_dir)
+            )
         self.ema_model = load_model(
-            model_cls, model_cfg, ckpt_file, mel_spec_type, vocab_file, ode_method, use_ema, self.device
+            model_cls, model_arc, ckpt_file, self.mel_spec_type, vocab_file, self.ode_method, self.use_ema, self.device
         )
 
     def transcribe(self, ref_audio, language=None):
@@ -94,8 +94,8 @@ class F5TTS:
         if remove_silence:
             remove_silence_for_generated_wav(file_wave)
 
-    def export_spectrogram(self, spect, file_spect):
-        save_spectrogram(spect, file_spect)
+    def export_spectrogram(self, spec, file_spec):
+        save_spectrogram(spec, file_spec)
 
     def infer(
         self,
@@ -113,17 +113,16 @@ class F5TTS:
         fix_duration=None,
         remove_silence=False,
         file_wave=None,
-        file_spect=None,
-        seed=-1,
+        file_spec=None,
+        seed=None,
     ):
-        if seed == -1:
-            seed = random.randint(0, sys.maxsize)
-        seed_everything(seed)
-        self.seed = seed
+        if seed is None:
+            self.seed = random.randint(0, sys.maxsize)
+        seed_everything(self.seed)
 
         ref_file, ref_text = preprocess_ref_audio_text(ref_file, ref_text, device=self.device)
 
-        wav, sr, spect = infer_process(
+        wav, sr, spec = infer_process(
             ref_file,
             ref_text,
             gen_text,
@@ -145,22 +144,22 @@ class F5TTS:
         if file_wave is not None:
            self.export_wav(wav, file_wave, remove_silence)
 
-        if file_spect is not None:
-            self.export_spectrogram(spect, file_spect)
+        if file_spec is not None:
+            self.export_spectrogram(spec, file_spec)
 
-        return wav, sr, spect
+        return wav, sr, spec
 
 
 if __name__ == "__main__":
     f5tts = F5TTS()
 
-    wav, sr, spect = f5tts.infer(
+    wav, sr, spec = f5tts.infer(
         ref_file=str(files("f5_tts").joinpath("infer/examples/basic/basic_ref_en.wav")),
         ref_text="some call me nature, others call me mother nature.",
         gen_text="""I don't really care what you call me. I've been a silent spectator, watching species evolve, empires rise and fall. But always remember, I am mighty and enduring. Respect me and I'll nurture you; ignore me and you shall face the consequences.""",
         file_wave=str(files("f5_tts").joinpath("../../tests/api_out.wav")),
-        file_spect=str(files("f5_tts").joinpath("../../tests/api_out.png")),
-        seed=-1,  # random seed = -1
+        file_spec=str(files("f5_tts").joinpath("../../tests/api_out.png")),
+        seed=None,
    )
 
     print("seed :", f5tts.seed)
src/f5_tts/configs/E2TTS_Base.yaml ADDED
@@ -0,0 +1,49 @@
+hydra:
+  run:
+    dir: ckpts/${model.name}_${model.mel_spec.mel_spec_type}_${model.tokenizer}_${datasets.name}/${now:%Y-%m-%d}/${now:%H-%M-%S}
+
+datasets:
+  name: Emilia_ZH_EN # dataset name
+  batch_size_per_gpu: 38400 # 8 GPUs, 8 * 38400 = 307200
+  batch_size_type: frame # frame | sample
+  max_samples: 64 # max sequences per batch if use frame-wise batch_size. we set 32 for small models, 64 for base models
+  num_workers: 16
+
+optim:
+  epochs: 11
+  learning_rate: 7.5e-5
+  num_warmup_updates: 20000 # warmup updates
+  grad_accumulation_steps: 1 # note: updates = steps / grad_accumulation_steps
+  max_grad_norm: 1.0 # gradient clipping
+  bnb_optimizer: False # use bnb 8bit AdamW optimizer or not
+
+model:
+  name: E2TTS_Base
+  tokenizer: pinyin
+  tokenizer_path: null # if 'custom' tokenizer, define the path want to use (should be vocab.txt)
+  backbone: UNetT
+  arch:
+    dim: 1024
+    depth: 24
+    heads: 16
+    ff_mult: 4
+    text_mask_padding: False
+    pe_attn_head: 1
+  mel_spec:
+    target_sample_rate: 24000
+    n_mel_channels: 100
+    hop_length: 256
+    win_length: 1024
+    n_fft: 1024
+    mel_spec_type: vocos # vocos | bigvgan
+  vocoder:
+    is_local: False # use local offline ckpt or not
+    local_path: null # local vocoder path
+
+ckpts:
+  logger: wandb # wandb | tensorboard | null
+  log_samples: True # infer random sample per save checkpoint. wip, normal to fail with extra long samples
+  save_per_updates: 50000 # save checkpoint per updates
+  keep_last_n_checkpoints: -1 # -1 to keep all, 0 to not save intermediate, > 0 to keep last N checkpoints
+  last_per_updates: 5000 # save last checkpoint per updates
+  save_dir: ckpts/${model.name}_${model.mel_spec.mel_spec_type}_${model.tokenizer}_${datasets.name}
src/f5_tts/configs/E2TTS_Small.yaml ADDED
@@ -0,0 +1,49 @@
+hydra:
+  run:
+    dir: ckpts/${model.name}_${model.mel_spec.mel_spec_type}_${model.tokenizer}_${datasets.name}/${now:%Y-%m-%d}/${now:%H-%M-%S}
+
+datasets:
+  name: Emilia_ZH_EN
+  batch_size_per_gpu: 38400 # 8 GPUs, 8 * 38400 = 307200
+  batch_size_type: frame # frame | sample
+  max_samples: 64 # max sequences per batch if use frame-wise batch_size. we set 32 for small models, 64 for base models
+  num_workers: 16
+
+optim:
+  epochs: 11
+  learning_rate: 7.5e-5
+  num_warmup_updates: 20000 # warmup updates
+  grad_accumulation_steps: 1 # note: updates = steps / grad_accumulation_steps
+  max_grad_norm: 1.0
+  bnb_optimizer: False
+
+model:
+  name: E2TTS_Small
+  tokenizer: pinyin
+  tokenizer_path: null # if 'custom' tokenizer, define the path want to use (should be vocab.txt)
+  backbone: UNetT
+  arch:
+    dim: 768
+    depth: 20
+    heads: 12
+    ff_mult: 4
+    text_mask_padding: False
+    pe_attn_head: 1
+  mel_spec:
+    target_sample_rate: 24000
+    n_mel_channels: 100
+    hop_length: 256
+    win_length: 1024
+    n_fft: 1024
+    mel_spec_type: vocos # vocos | bigvgan
+  vocoder:
+    is_local: False # use local offline ckpt or not
+    local_path: null # local vocoder path
+
+ckpts:
+  logger: wandb # wandb | tensorboard | null
+  log_samples: True # infer random sample per save checkpoint. wip, normal to fail with extra long samples
+  save_per_updates: 50000 # save checkpoint per updates
+  keep_last_n_checkpoints: -1 # -1 to keep all, 0 to not save intermediate, > 0 to keep last N checkpoints
+  last_per_updates: 5000 # save last checkpoint per updates
+  save_dir: ckpts/${model.name}_${model.mel_spec.mel_spec_type}_${model.tokenizer}_${datasets.name}
src/f5_tts/configs/F5TTS_Base.yaml ADDED
@@ -0,0 +1,52 @@
+hydra:
+  run:
+    dir: ckpts/${model.name}_${model.mel_spec.mel_spec_type}_${model.tokenizer}_${datasets.name}/${now:%Y-%m-%d}/${now:%H-%M-%S}
+
+datasets:
+  name: Emilia_ZH_EN # dataset name
+  batch_size_per_gpu: 38400 # 8 GPUs, 8 * 38400 = 307200
+  batch_size_type: frame # frame | sample
+  max_samples: 64 # max sequences per batch if use frame-wise batch_size. we set 32 for small models, 64 for base models
+  num_workers: 16
+
+optim:
+  epochs: 11
+  learning_rate: 7.5e-5
+  num_warmup_updates: 20000 # warmup updates
+  grad_accumulation_steps: 1 # note: updates = steps / grad_accumulation_steps
+  max_grad_norm: 1.0 # gradient clipping
+  bnb_optimizer: False # use bnb 8bit AdamW optimizer or not
+
+model:
+  name: F5TTS_Base # model name
+  tokenizer: pinyin # tokenizer type
+  tokenizer_path: null # if 'custom' tokenizer, define the path want to use (should be vocab.txt)
+  backbone: DiT
+  arch:
+    dim: 1024
+    depth: 22
+    heads: 16
+    ff_mult: 2
+    text_dim: 512
+    text_mask_padding: False
+    conv_layers: 4
+    pe_attn_head: 1
+    checkpoint_activations: False # recompute activations and save memory for extra compute
+  mel_spec:
+    target_sample_rate: 24000
+    n_mel_channels: 100
+    hop_length: 256
+    win_length: 1024
+    n_fft: 1024
+    mel_spec_type: vocos # vocos | bigvgan
+  vocoder:
+    is_local: False # use local offline ckpt or not
+    local_path: null # local vocoder path
+
+ckpts:
+  logger: wandb # wandb | tensorboard | null
+  log_samples: True # infer random sample per save checkpoint. wip, normal to fail with extra long samples
+  save_per_updates: 50000 # save checkpoint per updates
+  keep_last_n_checkpoints: -1 # -1 to keep all, 0 to not save intermediate, > 0 to keep last N checkpoints
+  last_per_updates: 5000 # save last checkpoint per updates
+  save_dir: ckpts/${model.name}_${model.mel_spec.mel_spec_type}_${model.tokenizer}_${datasets.name}
src/f5_tts/configs/F5TTS_Small.yaml ADDED
@@ -0,0 +1,52 @@
+hydra:
+  run:
+    dir: ckpts/${model.name}_${model.mel_spec.mel_spec_type}_${model.tokenizer}_${datasets.name}/${now:%Y-%m-%d}/${now:%H-%M-%S}
+
+datasets:
+  name: Emilia_ZH_EN
+  batch_size_per_gpu: 38400 # 8 GPUs, 8 * 38400 = 307200
+  batch_size_type: frame # frame | sample
+  max_samples: 64 # max sequences per batch if use frame-wise batch_size. we set 32 for small models, 64 for base models
+  num_workers: 16
+
+optim:
+  epochs: 11
+  learning_rate: 7.5e-5
+  num_warmup_updates: 20000 # warmup updates
+  grad_accumulation_steps: 1 # note: updates = steps / grad_accumulation_steps
+  max_grad_norm: 1.0 # gradient clipping
+  bnb_optimizer: False # use bnb 8bit AdamW optimizer or not
+
+model:
+  name: F5TTS_Small
+  tokenizer: pinyin
+  tokenizer_path: null # if 'custom' tokenizer, define the path want to use (should be vocab.txt)
+  backbone: DiT
+  arch:
+    dim: 768
+    depth: 18
+    heads: 12
+    ff_mult: 2
+    text_dim: 512
+    text_mask_padding: False
+    conv_layers: 4
+    pe_attn_head: 1
+    checkpoint_activations: False # recompute activations and save memory for extra compute
+  mel_spec:
+    target_sample_rate: 24000
+    n_mel_channels: 100
+    hop_length: 256
+    win_length: 1024
+    n_fft: 1024
+    mel_spec_type: vocos # vocos | bigvgan
+  vocoder:
+    is_local: False # use local offline ckpt or not
+    local_path: null # local vocoder path
+
+ckpts:
+  logger: wandb # wandb | tensorboard | null
+  log_samples: True # infer random sample per save checkpoint. wip, normal to fail with extra long samples
+  save_per_updates: 50000 # save checkpoint per updates
+  keep_last_n_checkpoints: -1 # -1 to keep all, 0 to not save intermediate, > 0 to keep last N checkpoints
+  last_per_updates: 5000 # save last checkpoint per updates
+  save_dir: ckpts/${model.name}_${model.mel_spec.mel_spec_type}_${model.tokenizer}_${datasets.name}
src/f5_tts/configs/F5TTS_v1_Base.yaml ADDED
@@ -0,0 +1,53 @@
+hydra:
+  run:
+    dir: ckpts/${model.name}_${model.mel_spec.mel_spec_type}_${model.tokenizer}_${datasets.name}/${now:%Y-%m-%d}/${now:%H-%M-%S}
+
+datasets:
+  name: Emilia_ZH_EN # dataset name
+  batch_size_per_gpu: 38400 # 8 GPUs, 8 * 38400 = 307200
+  batch_size_type: frame # frame | sample
+  max_samples: 64 # max sequences per batch if use frame-wise batch_size. we set 32 for small models, 64 for base models
+  num_workers: 16
+
+optim:
+  epochs: 11
+  learning_rate: 7.5e-5
+  num_warmup_updates: 20000 # warmup updates
+  grad_accumulation_steps: 1 # note: updates = steps / grad_accumulation_steps
+  max_grad_norm: 1.0 # gradient clipping
+  bnb_optimizer: False # use bnb 8bit AdamW optimizer or not
+
+model:
+  name: F5TTS_v1_Base # model name
+  tokenizer: pinyin # tokenizer type
+  tokenizer_path: null # if 'custom' tokenizer, define the path want to use (should be vocab.txt)
+  backbone: DiT
+  arch:
+    dim: 1024
+    depth: 22
+    heads: 16
+    ff_mult: 2
+    text_dim: 512
+    text_mask_padding: True
+    qk_norm: null # null | rms_norm
+    conv_layers: 4
+    pe_attn_head: null
+    checkpoint_activations: False # recompute activations and save memory for extra compute
+  mel_spec:
+    target_sample_rate: 24000
+    n_mel_channels: 100
+    hop_length: 256
+    win_length: 1024
+    n_fft: 1024
+    mel_spec_type: vocos # vocos | bigvgan
+  vocoder:
+    is_local: False # use local offline ckpt or not
+    local_path: null # local vocoder path
+
+ckpts:
+  logger: wandb # wandb | tensorboard | null
+  log_samples: True # infer random sample per save checkpoint. wip, normal to fail with extra long samples
+  save_per_updates: 50000 # save checkpoint per updates
+  keep_last_n_checkpoints: -1 # -1 to keep all, 0 to not save intermediate, > 0 to keep last N checkpoints
+  last_per_updates: 5000 # save last checkpoint per updates
+  save_dir: ckpts/${model.name}_${model.mel_spec.mel_spec_type}_${model.tokenizer}_${datasets.name}
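These new YAML configs are what the refactored scripts read at runtime: `api.py` above (and `eval_infer_batch.py` below) load them with OmegaConf, look up the backbone class by name, and pass `model.arch` on as keyword arguments. The snippet below is a hedged sketch of that pattern, with an illustrative config name, not code from the commit itself.

```python
# Sketch of the config-driven model selection used in this commit.
from importlib.resources import files

from omegaconf import OmegaConf

from f5_tts.model import DiT, UNetT  # noqa: F401. resolved by name below

model_cfg = OmegaConf.load(str(files("f5_tts").joinpath("configs/F5TTS_v1_Base.yaml")))
model_cls = globals()[model_cfg.model.backbone]            # e.g. "DiT" -> DiT
model_arc = OmegaConf.to_container(model_cfg.model.arch)   # dim, depth, heads, ... as a dict
print(model_cls.__name__, model_arc)
```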
src/f5_tts/eval/eval_infer_batch.py CHANGED
@@ -10,6 +10,7 @@ from importlib.resources import files
 import torch
 import torchaudio
 from accelerate import Accelerator
+from omegaconf import OmegaConf
 from tqdm import tqdm
 
 from f5_tts.eval.utils_eval import (
@@ -18,36 +19,26 @@ from f5_tts.eval.utils_eval import (
     get_seedtts_testset_metainfo,
 )
 from f5_tts.infer.utils_infer import load_checkpoint, load_vocoder
-from f5_tts.model import CFM, DiT, UNetT
+from f5_tts.model import CFM, DiT, UNetT  # noqa: F401. used for config
 from f5_tts.model.utils import get_tokenizer
 
 accelerator = Accelerator()
 device = f"cuda:{accelerator.process_index}"
 
 
-# --------------------- Dataset Settings -------------------- #
-
-target_sample_rate = 24000
-n_mel_channels = 100
-hop_length = 256
-win_length = 1024
-n_fft = 1024
+use_ema = True
 target_rms = 0.1
 
+
 rel_path = str(files("f5_tts").joinpath("../../"))
 
 
 def main():
-    # ---------------------- infer setting ---------------------- #
-
     parser = argparse.ArgumentParser(description="batch inference")
 
     parser.add_argument("-s", "--seed", default=None, type=int)
-    parser.add_argument("-d", "--dataset", default="Emilia_ZH_EN")
     parser.add_argument("-n", "--expname", required=True)
-    parser.add_argument("-c", "--ckptstep", default=1200000, type=int)
-    parser.add_argument("-m", "--mel_spec_type", default="vocos", type=str, choices=["bigvgan", "vocos"])
-    parser.add_argument("-to", "--tokenizer", default="pinyin", type=str, choices=["pinyin", "char"])
+    parser.add_argument("-c", "--ckptstep", default=1250000, type=int)
 
     parser.add_argument("-nfe", "--nfestep", default=32, type=int)
     parser.add_argument("-o", "--odemethod", default="euler")
@@ -58,12 +49,8 @@ def main():
     args = parser.parse_args()
 
     seed = args.seed
-    dataset_name = args.dataset
     exp_name = args.expname
     ckpt_step = args.ckptstep
-    ckpt_path = rel_path + f"/ckpts/{exp_name}/model_{ckpt_step}.pt"
-    mel_spec_type = args.mel_spec_type
-    tokenizer = args.tokenizer
 
     nfe_step = args.nfestep
     ode_method = args.odemethod
@@ -77,13 +64,19 @@ def main():
     use_truth_duration = False
     no_ref_audio = False
 
-    if exp_name == "F5TTS_Base":
-        model_cls = DiT
-        model_cfg = dict(dim=1024, depth=22, heads=16, ff_mult=2, text_dim=512, conv_layers=4)
+    model_cfg = OmegaConf.load(str(files("f5_tts").joinpath(f"configs/{exp_name}.yaml")))
+    model_cls = globals()[model_cfg.model.backbone]
+    model_arc = model_cfg.model.arch
 
-    elif exp_name == "E2TTS_Base":
-        model_cls = UNetT
-        model_cfg = dict(dim=1024, depth=24, heads=16, ff_mult=4)
+    dataset_name = model_cfg.datasets.name
+    tokenizer = model_cfg.model.tokenizer
+
+    mel_spec_type = model_cfg.model.mel_spec.mel_spec_type
+    target_sample_rate = model_cfg.model.mel_spec.target_sample_rate
+    n_mel_channels = model_cfg.model.mel_spec.n_mel_channels
+    hop_length = model_cfg.model.mel_spec.hop_length
+    win_length = model_cfg.model.mel_spec.win_length
+    n_fft = model_cfg.model.mel_spec.n_fft
 
     if testset == "ls_pc_test_clean":
         metalst = rel_path + "/data/librispeech_pc_test_clean_cross_sentence.lst"
@@ -111,8 +104,6 @@ def main():
 
     # -------------------------------------------------#
 
-    use_ema = True
-
     prompts_all = get_inference_prompt(
         metainfo,
         speed=speed,
@@ -139,7 +130,7 @@ def main():
 
     # Model
     model = CFM(
-        transformer=model_cls(**model_cfg, text_num_embeds=vocab_size, mel_dim=n_mel_channels),
+        transformer=model_cls(**model_arc, text_num_embeds=vocab_size, mel_dim=n_mel_channels),
         mel_spec_kwargs=dict(
             n_fft=n_fft,
             hop_length=hop_length,
@@ -154,6 +145,10 @@ def main():
         vocab_char_map=vocab_char_map,
     ).to(device)
 
+    ckpt_path = rel_path + f"/ckpts/{exp_name}/model_{ckpt_step}.pt"
+    if not os.path.exists(ckpt_path):
+        print("Loading from self-organized training checkpoints rather than released pretrained.")
+        ckpt_path = rel_path + f"/{model_cfg.ckpts.save_dir}/model_{ckpt_step}.pt"
     dtype = torch.float32 if mel_spec_type == "bigvgan" else None
     model = load_checkpoint(model, ckpt_path, device, dtype=dtype, use_ema=use_ema)
 
src/f5_tts/eval/eval_infer_batch.sh CHANGED
@@ -1,13 +1,18 @@
 #!/bin/bash
 
 # e.g. F5-TTS, 16 NFE
-accelerate launch src/f5_tts/eval/eval_infer_batch.py -s 0 -n "F5TTS_Base" -t "seedtts_test_zh" -nfe 16
-accelerate launch src/f5_tts/eval/eval_infer_batch.py -s 0 -n "F5TTS_Base" -t "seedtts_test_en" -nfe 16
-accelerate launch src/f5_tts/eval/eval_infer_batch.py -s 0 -n "F5TTS_Base" -t "ls_pc_test_clean" -nfe 16
+accelerate launch src/f5_tts/eval/eval_infer_batch.py -s 0 -n "F5TTS_v1_Base" -t "seedtts_test_zh" -nfe 16
+accelerate launch src/f5_tts/eval/eval_infer_batch.py -s 0 -n "F5TTS_v1_Base" -t "seedtts_test_en" -nfe 16
+accelerate launch src/f5_tts/eval/eval_infer_batch.py -s 0 -n "F5TTS_v1_Base" -t "ls_pc_test_clean" -nfe 16
 
 # e.g. Vanilla E2 TTS, 32 NFE
-accelerate launch src/f5_tts/eval/eval_infer_batch.py -s 0 -n "E2TTS_Base" -t "seedtts_test_zh" -o "midpoint" -ss 0
-accelerate launch src/f5_tts/eval/eval_infer_batch.py -s 0 -n "E2TTS_Base" -t "seedtts_test_en" -o "midpoint" -ss 0
-accelerate launch src/f5_tts/eval/eval_infer_batch.py -s 0 -n "E2TTS_Base" -t "ls_pc_test_clean" -o "midpoint" -ss 0
+accelerate launch src/f5_tts/eval/eval_infer_batch.py -s 0 -n "E2TTS_Base" -c 1200000 -t "seedtts_test_zh" -o "midpoint" -ss 0
+accelerate launch src/f5_tts/eval/eval_infer_batch.py -s 0 -n "E2TTS_Base" -c 1200000 -t "seedtts_test_en" -o "midpoint" -ss 0
+accelerate launch src/f5_tts/eval/eval_infer_batch.py -s 0 -n "E2TTS_Base" -c 1200000 -t "ls_pc_test_clean" -o "midpoint" -ss 0
+
+# e.g. evaluate F5-TTS 16 NFE result on Seed-TTS test-zh
+python src/f5_tts/eval/eval_seedtts_testset.py -e wer -l zh --gen_wav_dir results/F5TTS_v1_Base_1250000/seedtts_test_zh/seed0_euler_nfe32_vocos_ss-1_cfg2.0_speed1.0 --gpu_nums 8
+python src/f5_tts/eval/eval_seedtts_testset.py -e sim -l zh --gen_wav_dir results/F5TTS_v1_Base_1250000/seedtts_test_zh/seed0_euler_nfe32_vocos_ss-1_cfg2.0_speed1.0 --gpu_nums 8
+python src/f5_tts/eval/eval_utmos.py --audio_dir results/F5TTS_v1_Base_1250000/seedtts_test_zh/seed0_euler_nfe32_vocos_ss-1_cfg2.0_speed1.0
 
 # etc.
src/f5_tts/eval/eval_librispeech_test_clean.py CHANGED
@@ -53,43 +53,37 @@ def main():
     asr_ckpt_dir = ""  # auto download to cache dir
     wavlm_ckpt_dir = "../checkpoints/UniSpeech/wavlm_large_finetune.pth"
 
-    # --------------------------- WER ---------------------------
+    # --------------------------------------------------------------------------
 
-    if eval_task == "wer":
-        wer_results = []
-        wers = []
+    full_results = []
+    metrics = []
 
+    if eval_task == "wer":
         with mp.Pool(processes=len(gpus)) as pool:
             args = [(rank, lang, sub_test_set, asr_ckpt_dir) for (rank, sub_test_set) in test_set]
             results = pool.map(run_asr_wer, args)
             for r in results:
-                wer_results.extend(r)
-
-        wer_result_path = f"{gen_wav_dir}/{lang}_wer_results.jsonl"
-        with open(wer_result_path, "w") as f:
-            for line in wer_results:
-                wers.append(line["wer"])
-                json_line = json.dumps(line, ensure_ascii=False)
-                f.write(json_line + "\n")
-
-        wer = round(np.mean(wers) * 100, 3)
-        print(f"\nTotal {len(wers)} samples")
-        print(f"WER : {wer}%")
-        print(f"Results have been saved to {wer_result_path}")
-
-    # --------------------------- SIM ---------------------------
-
-    if eval_task == "sim":
-        sims = []
+                full_results.extend(r)
+    elif eval_task == "sim":
         with mp.Pool(processes=len(gpus)) as pool:
             args = [(rank, sub_test_set, wavlm_ckpt_dir) for (rank, sub_test_set) in test_set]
             results = pool.map(run_sim, args)
             for r in results:
-                sims.extend(r)
-
-        sim = round(sum(sims) / len(sims), 3)
-        print(f"\nTotal {len(sims)} samples")
-        print(f"SIM : {sim}")
+                full_results.extend(r)
+    else:
+        raise ValueError(f"Unknown metric type: {eval_task}")
+
+    result_path = f"{gen_wav_dir}/_{eval_task}_results.jsonl"
+    with open(result_path, "w") as f:
+        for line in full_results:
+            metrics.append(line[eval_task])
+            f.write(json.dumps(line, ensure_ascii=False) + "\n")
+        metric = round(np.mean(metrics), 5)
+        f.write(f"\n{eval_task.upper()}: {metric}\n")
+
+    print(f"\nTotal {len(metrics)} samples")
+    print(f"{eval_task.upper()}: {metric}")
+    print(f"{eval_task.upper()} results saved to {result_path}")
 
 
 if __name__ == "__main__":
src/f5_tts/eval/eval_seedtts_testset.py CHANGED
@@ -52,43 +52,37 @@ def main():
     asr_ckpt_dir = ""  # auto download to cache dir
     wavlm_ckpt_dir = "../checkpoints/UniSpeech/wavlm_large_finetune.pth"
 
-    # --------------------------- WER ---------------------------
+    # --------------------------------------------------------------------------
 
-    if eval_task == "wer":
-        wer_results = []
-        wers = []
+    full_results = []
+    metrics = []
 
+    if eval_task == "wer":
         with mp.Pool(processes=len(gpus)) as pool:
             args = [(rank, lang, sub_test_set, asr_ckpt_dir) for (rank, sub_test_set) in test_set]
             results = pool.map(run_asr_wer, args)
             for r in results:
-                wer_results.extend(r)
-
-        wer_result_path = f"{gen_wav_dir}/{lang}_wer_results.jsonl"
-        with open(wer_result_path, "w") as f:
-            for line in wer_results:
-                wers.append(line["wer"])
-                json_line = json.dumps(line, ensure_ascii=False)
-                f.write(json_line + "\n")
-
-        wer = round(np.mean(wers) * 100, 3)
-        print(f"\nTotal {len(wers)} samples")
-        print(f"WER : {wer}%")
-        print(f"Results have been saved to {wer_result_path}")
-
-    # --------------------------- SIM ---------------------------
-
-    if eval_task == "sim":
-        sims = []
+                full_results.extend(r)
+    elif eval_task == "sim":
         with mp.Pool(processes=len(gpus)) as pool:
             args = [(rank, sub_test_set, wavlm_ckpt_dir) for (rank, sub_test_set) in test_set]
             results = pool.map(run_sim, args)
             for r in results:
-                sims.extend(r)
-
-        sim = round(sum(sims) / len(sims), 3)
-        print(f"\nTotal {len(sims)} samples")
-        print(f"SIM : {sim}")
+                full_results.extend(r)
+    else:
+        raise ValueError(f"Unknown metric type: {eval_task}")
+
+    result_path = f"{gen_wav_dir}/_{eval_task}_results.jsonl"
+    with open(result_path, "w") as f:
+        for line in full_results:
+            metrics.append(line[eval_task])
+            f.write(json.dumps(line, ensure_ascii=False) + "\n")
+        metric = round(np.mean(metrics), 5)
+        f.write(f"\n{eval_task.upper()}: {metric}\n")
+
+    print(f"\nTotal {len(metrics)} samples")
+    print(f"{eval_task.upper()}: {metric}")
+    print(f"{eval_task.upper()} results saved to {result_path}")
 
 
 if __name__ == "__main__":
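With this change both eval scripts write one JSON object per sample to `_{wer|sim}_results.jsonl`, followed by a plain-text summary line. Below is a small illustrative sketch, not part of the commit, for re-reading such a file while skipping the non-JSON summary line; the path is taken from the example in `eval_infer_batch.sh`.

```python
# Re-read a _wer_results.jsonl produced by the eval script and recompute the mean WER.
import json

path = "results/F5TTS_v1_Base_1250000/seedtts_test_zh/seed0_euler_nfe32_vocos_ss-1_cfg2.0_speed1.0/_wer_results.jsonl"
scores = []
with open(path, encoding="utf-8") as f:
    for raw in f:
        raw = raw.strip()
        if not raw:
            continue
        try:
            scores.append(json.loads(raw)["wer"])  # one record per sample
        except json.JSONDecodeError:
            continue  # trailing "WER: ..." summary line is not JSON
print(f"{len(scores)} samples, mean WER {sum(scores) / len(scores):.5f}")
```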
src/f5_tts/eval/eval_utmos.py CHANGED
@@ -13,31 +13,29 @@ def main():
     parser.add_argument("--ext", type=str, default="wav", help="Audio extension.")
     args = parser.parse_args()
 
-    device = "cuda" if torch.cuda.is_available() else "cpu"
+    device = "cuda" if torch.cuda.is_available() else "xpu" if torch.xpu.is_available() else "cpu"
 
     predictor = torch.hub.load("tarepan/SpeechMOS:v1.2.0", "utmos22_strong", trust_repo=True)
     predictor = predictor.to(device)
 
     audio_paths = list(Path(args.audio_dir).rglob(f"*.{args.ext}"))
-    utmos_results = {}
     utmos_score = 0
 
-    for audio_path in tqdm(audio_paths, desc="Processing"):
-        wav_name = audio_path.stem
-        wav, sr = librosa.load(audio_path, sr=None, mono=True)
-        wav_tensor = torch.from_numpy(wav).to(device).unsqueeze(0)
-        score = predictor(wav_tensor, sr)
-        utmos_results[str(wav_name)] = score.item()
-        utmos_score += score.item()
-
-    avg_score = utmos_score / len(audio_paths) if len(audio_paths) > 0 else 0
-    print(f"UTMOS: {avg_score}")
-
-    utmos_result_path = Path(args.audio_dir) / "utmos_results.json"
+    utmos_result_path = Path(args.audio_dir) / "_utmos_results.jsonl"
     with open(utmos_result_path, "w", encoding="utf-8") as f:
-        json.dump(utmos_results, f, ensure_ascii=False, indent=4)
-
-    print(f"Results have been saved to {utmos_result_path}")
+        for audio_path in tqdm(audio_paths, desc="Processing"):
+            wav, sr = librosa.load(audio_path, sr=None, mono=True)
+            wav_tensor = torch.from_numpy(wav).to(device).unsqueeze(0)
+            score = predictor(wav_tensor, sr)
+            line = {}
+            line["wav"], line["utmos"] = str(audio_path.stem), score.item()
+            utmos_score += score.item()
+            f.write(json.dumps(line, ensure_ascii=False) + "\n")
+        avg_score = utmos_score / len(audio_paths) if len(audio_paths) > 0 else 0
+        f.write(f"\nUTMOS: {avg_score:.4f}\n")
+
+    print(f"UTMOS: {avg_score:.4f}")
+    print(f"UTMOS results saved to {utmos_result_path}")
 
 
 if __name__ == "__main__":
src/f5_tts/eval/utils_eval.py CHANGED
@@ -389,10 +389,10 @@ def run_sim(args):
389
  model = model.cuda(device)
390
  model.eval()
391
 
392
- sims = []
393
- for wav1, wav2, truth in tqdm(test_set):
394
- wav1, sr1 = torchaudio.load(wav1)
395
- wav2, sr2 = torchaudio.load(wav2)
396
 
397
  resample1 = torchaudio.transforms.Resample(orig_freq=sr1, new_freq=16000)
398
  resample2 = torchaudio.transforms.Resample(orig_freq=sr2, new_freq=16000)
@@ -408,6 +408,11 @@ def run_sim(args):
408
 
409
  sim = F.cosine_similarity(emb1, emb2)[0].item()
410
  # print(f"VSim score between two audios: {sim:.4f} (-1.0, 1.0).")
411
- sims.append(sim)
412
 
413
- return sims
 
389
  model = model.cuda(device)
390
  model.eval()
391
 
392
+ sim_results = []
393
+ for gen_wav, prompt_wav, truth in tqdm(test_set):
394
+ wav1, sr1 = torchaudio.load(gen_wav)
395
+ wav2, sr2 = torchaudio.load(prompt_wav)
396
 
397
  resample1 = torchaudio.transforms.Resample(orig_freq=sr1, new_freq=16000)
398
  resample2 = torchaudio.transforms.Resample(orig_freq=sr2, new_freq=16000)
 
408
 
409
  sim = F.cosine_similarity(emb1, emb2)[0].item()
410
  # print(f"VSim score between two audios: {sim:.4f} (-1.0, 1.0).")
411
+ sim_results.append(
412
+ {
413
+ "wav": Path(gen_wav).stem,
414
+ "sim": sim,
415
+ }
416
+ )
417
 
418
+ return sim_results
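`run_sim` now returns a list of per-utterance dicts instead of bare floats. A hedged sketch of aggregating that return value into the same jsonl-plus-summary format the other eval scripts write (the helper name and output path are illustrative):

```python
# Sketch of consuming run_sim's new return value: a list of {"wav": ..., "sim": ...} dicts.
# Mirrors the jsonl-plus-summary layout used by the WER/UTMOS scripts; the path is illustrative.
import json
import numpy as np

def summarize_sim(sim_results, result_path="results/_sim_results.jsonl"):
    sims = [item["sim"] for item in sim_results]
    avg = round(float(np.mean(sims)), 5) if sims else 0.0
    with open(result_path, "w", encoding="utf-8") as f:
        for item in sim_results:
            f.write(json.dumps(item, ensure_ascii=False) + "\n")
        f.write(f"\nSIM: {avg}\n")
    return avg
```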
src/f5_tts/infer/README.md CHANGED
@@ -23,12 +23,24 @@ Currently supported features:
23
  - Basic TTS with Chunk Inference
24
  - Multi-Style / Multi-Speaker Generation
25
  - Voice Chat powered by Qwen2.5-3B-Instruct
 
26
 
27
  The CLI command `f5-tts_infer-gradio` is equivalent to `python src/f5_tts/infer/infer_gradio.py`, which launches a Gradio app (web interface) for inference.
28
 
29
  The script will load model checkpoints from Hugging Face. You can also manually download the files and update the path passed to `load_model()` in `infer_gradio.py`. Only the TTS model is loaded at first; the ASR model is loaded to transcribe the reference audio if `ref_text` is not provided, and the LLM is loaded only if Voice Chat is used.
30
 
31
- Could also be used as a component for larger application.
 
 
 
 
 
 
 
 
 
 
 
32
  ```python
33
  import gradio as gr
34
  from f5_tts.infer.infer_gradio import app
@@ -56,14 +68,16 @@ Basically you can inference with flags:
56
  ```bash
57
  # Leave --ref_text "" to have the ASR model transcribe the reference audio (extra GPU memory usage)
58
  f5-tts_infer-cli \
59
- --model "F5-TTS" \
60
  --ref_audio "ref_audio.wav" \
61
  --ref_text "The content, subtitle or transcription of reference audio." \
62
  --gen_text "Some text you want TTS model generate for you."
63
 
64
- # Choose Vocoder
65
- f5-tts_infer-cli --vocoder_name bigvgan --load_vocoder_from_local --ckpt_file <YOUR_CKPT_PATH, eg:ckpts/F5TTS_Base_bigvgan/model_1250000.pt>
66
- f5-tts_infer-cli --vocoder_name vocos --load_vocoder_from_local --ckpt_file <YOUR_CKPT_PATH, eg:ckpts/F5TTS_Base/model_1200000.safetensors>
 
 
67
 
68
  # More instructions
69
  f5-tts_infer-cli --help
@@ -78,8 +92,8 @@ f5-tts_infer-cli -c custom.toml
78
  For example, you can use a `.toml` file to pass in variables; refer to `src/f5_tts/infer/examples/basic/basic.toml`:
79
 
80
  ```toml
81
- # F5-TTS | E2-TTS
82
- model = "F5-TTS"
83
  ref_audio = "infer/examples/basic/basic_ref_en.wav"
84
  # If an empty "", transcribes the reference audio automatically.
85
  ref_text = "Some call me nature, others call me mother nature."
@@ -93,8 +107,8 @@ output_dir = "tests"
93
  You can also leverage a `.toml` file to do multi-style generation; refer to `src/f5_tts/infer/examples/multi/story.toml`.
94
 
95
  ```toml
96
- # F5-TTS | E2-TTS
97
- model = "F5-TTS"
98
  ref_audio = "infer/examples/multi/main.flac"
99
  # If an empty "", transcribes the reference audio automatically.
100
  ref_text = ""
@@ -114,83 +128,27 @@ ref_text = ""
114
  ```
115
  Mark the voice with `[main]`, `[town]`, or `[country]` whenever you want to switch voices; refer to `src/f5_tts/infer/examples/multi/story.txt`.
116
 
117
- ## Speech Editing
118
 
119
- To test speech editing capabilities, use the following command:
120
 
121
  ```bash
122
- python src/f5_tts/infer/speech_edit.py
123
- ```
124
 
125
- ## Socket Realtime Client
 
 
126
 
127
- To communicate with socket server you need to run
128
- ```bash
129
- python src/f5_tts/socket_server.py
130
  ```
131
 
132
- <details>
133
- <summary>Then create client to communicate</summary>
134
-
135
- ``` python
136
- import socket
137
- import numpy as np
138
- import asyncio
139
- import pyaudio
140
-
141
- async def listen_to_voice(text, server_ip='localhost', server_port=9999):
142
- client_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
143
- client_socket.connect((server_ip, server_port))
144
-
145
- async def play_audio_stream():
146
- buffer = b''
147
- p = pyaudio.PyAudio()
148
- stream = p.open(format=pyaudio.paFloat32,
149
- channels=1,
150
- rate=24000, # Ensure this matches the server's sampling rate
151
- output=True,
152
- frames_per_buffer=2048)
153
-
154
- try:
155
- while True:
156
- chunk = await asyncio.get_event_loop().run_in_executor(None, client_socket.recv, 1024)
157
- if not chunk: # End of stream
158
- break
159
- if b"END_OF_AUDIO" in chunk:
160
- buffer += chunk.replace(b"END_OF_AUDIO", b"")
161
- if buffer:
162
- audio_array = np.frombuffer(buffer, dtype=np.float32).copy() # Make a writable copy
163
- stream.write(audio_array.tobytes())
164
- break
165
- buffer += chunk
166
- if len(buffer) >= 4096:
167
- audio_array = np.frombuffer(buffer[:4096], dtype=np.float32).copy() # Make a writable copy
168
- stream.write(audio_array.tobytes())
169
- buffer = buffer[4096:]
170
- finally:
171
- stream.stop_stream()
172
- stream.close()
173
- p.terminate()
174
-
175
- try:
176
- # Send only the text to the server
177
- await asyncio.get_event_loop().run_in_executor(None, client_socket.sendall, text.encode('utf-8'))
178
- await play_audio_stream()
179
- print("Audio playback finished.")
180
-
181
- except Exception as e:
182
- print(f"Error in listen_to_voice: {e}")
183
-
184
- finally:
185
- client_socket.close()
186
-
187
- # Example usage: Replace this with your actual server IP and port
188
- async def main():
189
- await listen_to_voice("my name is jenny..", server_ip='localhost', server_port=9998)
190
-
191
- # Run the main async function
192
- asyncio.run(main())
193
- ```
194
 
195
- </details>
 
 
 
 
196
 
 
23
  - Basic TTS with Chunk Inference
24
  - Multi-Style / Multi-Speaker Generation
25
  - Voice Chat powered by Qwen2.5-3B-Instruct
26
+ - [Custom inference with more language support](src/f5_tts/infer/SHARED.md)
27
 
28
  The CLI command `f5-tts_infer-gradio` is equivalent to `python src/f5_tts/infer/infer_gradio.py`, which launches a Gradio app (web interface) for inference.
29
 
30
  The script will load model checkpoints from Hugging Face. You can also manually download the files and update the path passed to `load_model()` in `infer_gradio.py`. Only the TTS model is loaded at first; the ASR model is loaded to transcribe the reference audio if `ref_text` is not provided, and the LLM is loaded only if Voice Chat is used.
31
 
32
+ More flag options:
33
+
34
+ ```bash
35
+ # Automatically launch the interface in the default web browser
36
+ f5-tts_infer-gradio --inbrowser
37
+
38
+ # Set the root path of the application, if it's not served from the root ("/") of the domain
39
+ # For example, if the application is served at "https://example.com/myapp"
40
+ f5-tts_infer-gradio --root_path "/myapp"
41
+ ```
42
+
43
+ Can also be used as a component of a larger application:
44
  ```python
45
  import gradio as gr
46
  from f5_tts.infer.infer_gradio import app
 
68
  ```bash
69
  # Leave --ref_text "" to have the ASR model transcribe the reference audio (extra GPU memory usage)
70
  f5-tts_infer-cli \
71
+ --model F5TTS_v1_Base \
72
  --ref_audio "ref_audio.wav" \
73
  --ref_text "The content, subtitle or transcription of reference audio." \
74
  --gen_text "Some text you want TTS model generate for you."
75
 
76
+ # Use BigVGAN as the vocoder. Currently only supported with F5TTS_Base.
77
+ f5-tts_infer-cli --model F5TTS_Base --vocoder_name bigvgan --load_vocoder_from_local
78
+
79
+ # Use a checkpoint from a custom path, e.g.
80
+ f5-tts_infer-cli --ckpt_file ckpts/F5TTS_v1_Base/model_1250000.safetensors
81
 
82
  # More instructions
83
  f5-tts_infer-cli --help
 
92
  For example, you can use a `.toml` file to pass in variables; refer to `src/f5_tts/infer/examples/basic/basic.toml`:
93
 
94
  ```toml
95
+ # F5TTS_v1_Base | E2TTS_Base
96
+ model = "F5TTS_v1_Base"
97
  ref_audio = "infer/examples/basic/basic_ref_en.wav"
98
  # If an empty "", transcribes the reference audio automatically.
99
  ref_text = "Some call me nature, others call me mother nature."
 
107
  You can also leverage a `.toml` file to do multi-style generation; refer to `src/f5_tts/infer/examples/multi/story.toml`.
108
 
109
  ```toml
110
+ # F5TTS_v1_Base | E2TTS_Base
111
+ model = "F5TTS_v1_Base"
112
  ref_audio = "infer/examples/multi/main.flac"
113
  # If an empty "", transcribes the reference audio automatically.
114
  ref_text = ""
 
128
  ```
129
  Mark the voice with `[main]`, `[town]`, or `[country]` whenever you want to switch voices; refer to `src/f5_tts/infer/examples/multi/story.txt`.
130
 
131
+ ## Socket Real-time Service
132
 
133
+ Real-time voice output with chunked streaming:
134
 
135
  ```bash
136
+ # Start socket server
137
+ python src/f5_tts/socket_server.py
138
 
139
+ # If PyAudio is not installed
140
+ sudo apt-get install portaudio19-dev
141
+ pip install pyaudio
142
 
143
+ # Communicate with socket client
144
+ python src/f5_tts/socket_client.py
 
145
  ```
146
 
147
+ ## Speech Editing
 
148
 
149
+ To test speech editing capabilities, use the following command:
150
+
151
+ ```bash
152
+ python src/f5_tts/infer/speech_edit.py
153
+ ```
154
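The multi-voice flow above switches voices wherever a `[main]`/`[town]`/`[country]` tag appears in the story text. A simplified sketch of that tag splitting (the idea only, not the exact parser `infer_cli.py` uses):

```python
# Simplified illustration of splitting a tag-marked story text into (voice, text) segments.
# This is a sketch of the concept only, not the exact regex/parser used by infer_cli.py.
import re

def split_story_by_voice(story_text, default_voice="main"):
    segments = []
    voice = default_voice
    for chunk in re.split(r"(\[\w+\])", story_text):  # keep the [tag] markers
        if not chunk.strip():
            continue
        if re.fullmatch(r"\[\w+\]", chunk):
            voice = chunk[1:-1]  # switch the active voice
        else:
            segments.append((voice, chunk.strip()))
    return segments

print(split_story_by_voice("A long journey began. [town]Welcome, traveler! [main]he heard someone say."))
```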
 
src/f5_tts/infer/SHARED.md CHANGED
@@ -16,7 +16,7 @@
16
  <!-- omit in toc -->
17
  ### Supported Languages
18
  - [Multilingual](#multilingual)
19
- - [F5-TTS Base @ zh \& en @ F5-TTS](#f5-tts-base--zh--en--f5-tts)
20
  - [English](#english)
21
  - [Finnish](#finnish)
22
  - [F5-TTS Base @ fi @ AsmoKoskinen](#f5-tts-base--fi--asmokoskinen)
@@ -37,7 +37,17 @@
37
 
38
  ## Multilingual
39
 
40
- #### F5-TTS Base @ zh & en @ F5-TTS
 
 
 
 
 
 
 
 
 
 
41
  |Model|🤗Hugging Face|Data (Hours)|Model License|
42
  |:---:|:------------:|:-----------:|:-------------:|
43
  |F5-TTS Base|[ckpt & vocab](https://huggingface.co/SWivid/F5-TTS/tree/main/F5TTS_Base)|[Emilia 95K zh&en](https://huggingface.co/datasets/amphion/Emilia-Dataset/tree/fc71e07)|cc-by-nc-4.0|
@@ -45,7 +55,7 @@
45
  ```bash
46
  Model: hf://SWivid/F5-TTS/F5TTS_Base/model_1200000.safetensors
47
  Vocab: hf://SWivid/F5-TTS/F5TTS_Base/vocab.txt
48
- Config: {"dim": 1024, "depth": 22, "heads": 16, "ff_mult": 2, "text_dim": 512, "conv_layers": 4}
49
  ```
50
 
51
  *Other infos, e.g. Author info, Github repo, Link to some sampled results, Usage instruction, Tutorial (Blog, Video, etc.) ...*
@@ -64,7 +74,7 @@ Config: {"dim": 1024, "depth": 22, "heads": 16, "ff_mult": 2, "text_dim": 512, "
64
  ```bash
65
  Model: hf://AsmoKoskinen/F5-TTS_Finnish_Model/model_common_voice_fi_vox_populi_fi_20241206.safetensors
66
  Vocab: hf://AsmoKoskinen/F5-TTS_Finnish_Model/vocab.txt
67
- Config: {"dim": 1024, "depth": 22, "heads": 16, "ff_mult": 2, "text_dim": 512, "conv_layers": 4}
68
  ```
69
 
70
 
@@ -78,7 +88,7 @@ Config: {"dim": 1024, "depth": 22, "heads": 16, "ff_mult": 2, "text_dim": 512, "
78
  ```bash
79
  Model: hf://RASPIAUDIO/F5-French-MixedSpeakers-reduced/model_last_reduced.pt
80
  Vocab: hf://RASPIAUDIO/F5-French-MixedSpeakers-reduced/vocab.txt
81
- Config: {"dim": 1024, "depth": 22, "heads": 16, "ff_mult": 2, "text_dim": 512, "conv_layers": 4}
82
  ```
83
 
84
  - [Online Inference with Hugging Face Space](https://huggingface.co/spaces/RASPIAUDIO/f5-tts_french).
@@ -96,7 +106,7 @@ Config: {"dim": 1024, "depth": 22, "heads": 16, "ff_mult": 2, "text_dim": 512, "
96
  ```bash
97
  Model: hf://SPRINGLab/F5-Hindi-24KHz/model_2500000.safetensors
98
  Vocab: hf://SPRINGLab/F5-Hindi-24KHz/vocab.txt
99
- Config: {"dim": 768, "depth": 18, "heads": 12, "ff_mult": 2, "text_dim": 512, "conv_layers": 4}
100
  ```
101
 
102
  - Authors: SPRING Lab, Indian Institute of Technology, Madras
@@ -113,7 +123,7 @@ Config: {"dim": 768, "depth": 18, "heads": 12, "ff_mult": 2, "text_dim": 512, "c
113
  ```bash
114
  Model: hf://alien79/F5-TTS-italian/model_159600.safetensors
115
  Vocab: hf://alien79/F5-TTS-italian/vocab.txt
116
- Config: {"dim": 1024, "depth": 22, "heads": 16, "ff_mult": 2, "text_dim": 512, "conv_layers": 4}
117
  ```
118
 
119
  - Trained by [Mithril Man](https://github.com/MithrilMan)
@@ -131,7 +141,7 @@ Config: {"dim": 1024, "depth": 22, "heads": 16, "ff_mult": 2, "text_dim": 512, "
131
  ```bash
132
  Model: hf://Jmica/F5TTS/JA_25498980/model_25498980.pt
133
  Vocab: hf://Jmica/F5TTS/JA_25498980/vocab_updated.txt
134
- Config: {"dim": 1024, "depth": 22, "heads": 16, "ff_mult": 2, "text_dim": 512, "conv_layers": 4}
135
  ```
136
 
137
 
@@ -148,7 +158,7 @@ Config: {"dim": 1024, "depth": 22, "heads": 16, "ff_mult": 2, "text_dim": 512, "
148
  ```bash
149
  Model: hf://hotstone228/F5-TTS-Russian/model_last.safetensors
150
  Vocab: hf://hotstone228/F5-TTS-Russian/vocab.txt
151
- Config: {"dim": 1024, "depth": 22, "heads": 16, "ff_mult": 2, "text_dim": 512, "conv_layers": 4}
152
  ```
153
  - Finetuned by [HotDro4illa](https://github.com/HotDro4illa)
154
  - Any improvements are welcome
 
16
  <!-- omit in toc -->
17
  ### Supported Languages
18
  - [Multilingual](#multilingual)
19
+ - [F5-TTS v1 v0 Base @ zh \& en @ F5-TTS](#f5-tts-v1-v0-base--zh--en--f5-tts)
20
  - [English](#english)
21
  - [Finnish](#finnish)
22
  - [F5-TTS Base @ fi @ AsmoKoskinen](#f5-tts-base--fi--asmokoskinen)
 
37
 
38
  ## Multilingual
39
 
40
+ #### F5-TTS v1 v0 Base @ zh & en @ F5-TTS
41
+ |Model|🤗Hugging Face|Data (Hours)|Model License|
42
+ |:---:|:------------:|:-----------:|:-------------:|
43
+ |F5-TTS v1 Base|[ckpt & vocab](https://huggingface.co/SWivid/F5-TTS/tree/main/F5TTS_v1_Base)|[Emilia 95K zh&en](https://huggingface.co/datasets/amphion/Emilia-Dataset/tree/fc71e07)|cc-by-nc-4.0|
44
+
45
+ ```bash
46
+ Model: hf://SWivid/F5-TTS/F5TTS_v1_Base/model_1250000.safetensors
47
+ Vocab: hf://SWivid/F5-TTS/F5TTS_v1_Base/vocab.txt
48
+ Config: {"dim": 1024, "depth": 22, "heads": 16, "ff_mult": 2, "text_dim": 512, "conv_layers": 4}
49
+ ```
50
+
51
  |Model|🤗Hugging Face|Data (Hours)|Model License|
52
  |:---:|:------------:|:-----------:|:-------------:|
53
  |F5-TTS Base|[ckpt & vocab](https://huggingface.co/SWivid/F5-TTS/tree/main/F5TTS_Base)|[Emilia 95K zh&en](https://huggingface.co/datasets/amphion/Emilia-Dataset/tree/fc71e07)|cc-by-nc-4.0|
 
55
  ```bash
56
  Model: hf://SWivid/F5-TTS/F5TTS_Base/model_1200000.safetensors
57
  Vocab: hf://SWivid/F5-TTS/F5TTS_Base/vocab.txt
58
+ Config: {"dim": 1024, "depth": 22, "heads": 16, "ff_mult": 2, "text_dim": 512, "text_mask_padding": False, "conv_layers": 4, "pe_attn_head": 1}
59
  ```
60
 
61
  *Other infos, e.g. Author info, Github repo, Link to some sampled results, Usage instruction, Tutorial (Blog, Video, etc.) ...*
 
74
  ```bash
75
  Model: hf://AsmoKoskinen/F5-TTS_Finnish_Model/model_common_voice_fi_vox_populi_fi_20241206.safetensors
76
  Vocab: hf://AsmoKoskinen/F5-TTS_Finnish_Model/vocab.txt
77
+ Config: {"dim": 1024, "depth": 22, "heads": 16, "ff_mult": 2, "text_dim": 512, "text_mask_padding": False, "conv_layers": 4, "pe_attn_head": 1}
78
  ```
79
 
80
 
 
88
  ```bash
89
  Model: hf://RASPIAUDIO/F5-French-MixedSpeakers-reduced/model_last_reduced.pt
90
  Vocab: hf://RASPIAUDIO/F5-French-MixedSpeakers-reduced/vocab.txt
91
+ Config: {"dim": 1024, "depth": 22, "heads": 16, "ff_mult": 2, "text_dim": 512, "text_mask_padding": False, "conv_layers": 4, "pe_attn_head": 1}
92
  ```
93
 
94
  - [Online Inference with Hugging Face Space](https://huggingface.co/spaces/RASPIAUDIO/f5-tts_french).
 
106
  ```bash
107
  Model: hf://SPRINGLab/F5-Hindi-24KHz/model_2500000.safetensors
108
  Vocab: hf://SPRINGLab/F5-Hindi-24KHz/vocab.txt
109
+ Config: {"dim": 768, "depth": 18, "heads": 12, "ff_mult": 2, "text_dim": 512, "text_mask_padding": False, "conv_layers": 4, "pe_attn_head": 1}
110
  ```
111
 
112
  - Authors: SPRING Lab, Indian Institute of Technology, Madras
 
123
  ```bash
124
  Model: hf://alien79/F5-TTS-italian/model_159600.safetensors
125
  Vocab: hf://alien79/F5-TTS-italian/vocab.txt
126
+ Config: {"dim": 1024, "depth": 22, "heads": 16, "ff_mult": 2, "text_dim": 512, "text_mask_padding": False, "conv_layers": 4, "pe_attn_head": 1}
127
  ```
128
 
129
  - Trained by [Mithril Man](https://github.com/MithrilMan)
 
141
  ```bash
142
  Model: hf://Jmica/F5TTS/JA_25498980/model_25498980.pt
143
  Vocab: hf://Jmica/F5TTS/JA_25498980/vocab_updated.txt
144
+ Config: {"dim": 1024, "depth": 22, "heads": 16, "ff_mult": 2, "text_dim": 512, "text_mask_padding": False, "conv_layers": 4, "pe_attn_head": 1}
145
  ```
146
 
147
 
 
158
  ```bash
159
  Model: hf://hotstone228/F5-TTS-Russian/model_last.safetensors
160
  Vocab: hf://hotstone228/F5-TTS-Russian/vocab.txt
161
+ Config: {"dim": 1024, "depth": 22, "heads": 16, "ff_mult": 2, "text_dim": 512, "text_mask_padding": False, "conv_layers": 4, "pe_attn_head": 1}
162
  ```
163
  - Finetuned by [HotDro4illa](https://github.com/HotDro4illa)
164
  - Any improvements are welcome
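To try one of the community checkpoints listed above, the `hf://` paths can be resolved to local files and passed to the CLI. A hedged sketch (the Italian model is just an example; `--ckpt_file`/`--vocab_file` are the flags exposed by `infer_cli.py`, and the reference clip and texts are illustrative):

```python
# Sketch: resolve a community checkpoint from the table above and hand it to f5-tts_infer-cli.
# cached_path resolves hf:// URIs into the local Hugging Face cache, as infer_cli.py itself does.
import subprocess
from cached_path import cached_path

ckpt = str(cached_path("hf://alien79/F5-TTS-italian/model_159600.safetensors"))
vocab = str(cached_path("hf://alien79/F5-TTS-italian/vocab.txt"))

subprocess.run(
    [
        "f5-tts_infer-cli",
        "--ckpt_file", ckpt,
        "--vocab_file", vocab,
        "--ref_audio", "ref_audio.wav",  # your reference clip
        "--ref_text", "",                # empty -> automatic transcription
        "--gen_text", "Ciao, questo è un piccolo esempio.",
    ],
    check=True,
)
```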
src/f5_tts/infer/examples/basic/basic.toml CHANGED
@@ -1,5 +1,5 @@
1
- # F5-TTS | E2-TTS
2
- model = "F5-TTS"
3
  ref_audio = "infer/examples/basic/basic_ref_en.wav"
4
  # If an empty "", transcribes the reference audio automatically.
5
  ref_text = "Some call me nature, others call me mother nature."
 
1
+ # F5TTS_v1_Base | E2TTS_Base
2
+ model = "F5TTS_v1_Base"
3
  ref_audio = "infer/examples/basic/basic_ref_en.wav"
4
  # If an empty "", transcribes the reference audio automatically.
5
  ref_text = "Some call me nature, others call me mother nature."
src/f5_tts/infer/examples/basic/basic_ref_en.wav CHANGED
Binary files a/src/f5_tts/infer/examples/basic/basic_ref_en.wav and b/src/f5_tts/infer/examples/basic/basic_ref_en.wav differ
 
src/f5_tts/infer/examples/basic/basic_ref_zh.wav CHANGED
Binary files a/src/f5_tts/infer/examples/basic/basic_ref_zh.wav and b/src/f5_tts/infer/examples/basic/basic_ref_zh.wav differ
 
src/f5_tts/infer/examples/multi/country.flac CHANGED
Binary files a/src/f5_tts/infer/examples/multi/country.flac and b/src/f5_tts/infer/examples/multi/country.flac differ
 
src/f5_tts/infer/examples/multi/main.flac CHANGED
Binary files a/src/f5_tts/infer/examples/multi/main.flac and b/src/f5_tts/infer/examples/multi/main.flac differ
 
src/f5_tts/infer/examples/multi/story.toml CHANGED
@@ -1,5 +1,5 @@
1
- # F5-TTS | E2-TTS
2
- model = "F5-TTS"
3
  ref_audio = "infer/examples/multi/main.flac"
4
  # If an empty "", transcribes the reference audio automatically.
5
  ref_text = ""
 
1
+ # F5TTS_v1_Base | E2TTS_Base
2
+ model = "F5TTS_v1_Base"
3
  ref_audio = "infer/examples/multi/main.flac"
4
  # If an empty "", transcribes the reference audio automatically.
5
  ref_text = ""
src/f5_tts/infer/examples/multi/town.flac CHANGED
Binary files a/src/f5_tts/infer/examples/multi/town.flac and b/src/f5_tts/infer/examples/multi/town.flac differ
 
src/f5_tts/infer/infer_cli.py CHANGED
@@ -27,7 +27,7 @@ from f5_tts.infer.utils_infer import (
27
  preprocess_ref_audio_text,
28
  remove_silence_for_generated_wav,
29
  )
30
- from f5_tts.model import DiT, UNetT
31
 
32
 
33
  parser = argparse.ArgumentParser(
@@ -50,7 +50,7 @@ parser.add_argument(
50
  "-m",
51
  "--model",
52
  type=str,
53
- help="The model name: F5-TTS | E2-TTS",
54
  )
55
  parser.add_argument(
56
  "-mc",
@@ -172,8 +172,7 @@ config = tomli.load(open(args.config, "rb"))
172
 
173
  # command-line interface parameters
174
 
175
- model = args.model or config.get("model", "F5-TTS")
176
- model_cfg = args.model_cfg or config.get("model_cfg", str(files("f5_tts").joinpath("configs/F5TTS_Base_train.yaml")))
177
  ckpt_file = args.ckpt_file or config.get("ckpt_file", "")
178
  vocab_file = args.vocab_file or config.get("vocab_file", "")
179
 
@@ -245,36 +244,32 @@ vocoder = load_vocoder(vocoder_name=vocoder_name, is_local=load_vocoder_from_loc
245
 
246
  # load TTS model
247
 
248
- if model == "F5-TTS":
249
- model_cls = DiT
250
- model_cfg = OmegaConf.load(model_cfg).model.arch
251
- if not ckpt_file: # path not specified, download from repo
252
- if vocoder_name == "vocos":
253
- repo_name = "F5-TTS"
254
- exp_name = "F5TTS_Base"
255
- ckpt_step = 1200000
256
- ckpt_file = str(cached_path(f"hf://SWivid/{repo_name}/{exp_name}/model_{ckpt_step}.safetensors"))
257
- # ckpt_file = f"ckpts/{exp_name}/model_{ckpt_step}.pt" # .pt | .safetensors; local path
258
- elif vocoder_name == "bigvgan":
259
- repo_name = "F5-TTS"
260
- exp_name = "F5TTS_Base_bigvgan"
261
- ckpt_step = 1250000
262
- ckpt_file = str(cached_path(f"hf://SWivid/{repo_name}/{exp_name}/model_{ckpt_step}.pt"))
263
-
264
- elif model == "E2-TTS":
265
- assert args.model_cfg is None, "E2-TTS does not support custom model_cfg yet"
266
- assert vocoder_name == "vocos", "E2-TTS only supports vocoder vocos yet"
267
- model_cls = UNetT
268
- model_cfg = dict(dim=1024, depth=24, heads=16, ff_mult=4)
269
- if not ckpt_file: # path not specified, download from repo
270
- repo_name = "E2-TTS"
271
- exp_name = "E2TTS_Base"
272
  ckpt_step = 1200000
273
- ckpt_file = str(cached_path(f"hf://SWivid/{repo_name}/{exp_name}/model_{ckpt_step}.safetensors"))
274
- # ckpt_file = f"ckpts/{exp_name}/model_{ckpt_step}.pt" # .pt | .safetensors; local path
 
 
 
 
 
 
 
275
 
276
  print(f"Using {model}...")
277
- ema_model = load_model(model_cls, model_cfg, ckpt_file, mel_spec_type=vocoder_name, vocab_file=vocab_file)
278
 
279
 
280
  # inference process
 
27
  preprocess_ref_audio_text,
28
  remove_silence_for_generated_wav,
29
  )
30
+ from f5_tts.model import DiT, UNetT # noqa: F401. used for config
31
 
32
 
33
  parser = argparse.ArgumentParser(
 
50
  "-m",
51
  "--model",
52
  type=str,
53
+ help="The model name: F5TTS_v1_Base | F5TTS_Base | E2TTS_Base | etc.",
54
  )
55
  parser.add_argument(
56
  "-mc",
 
172
 
173
  # command-line interface parameters
174
 
175
+ model = args.model or config.get("model", "F5TTS_v1_Base")
 
176
  ckpt_file = args.ckpt_file or config.get("ckpt_file", "")
177
  vocab_file = args.vocab_file or config.get("vocab_file", "")
178
 
 
244
 
245
  # load TTS model
246
 
247
+ model_cfg = OmegaConf.load(
248
+ args.model_cfg or config.get("model_cfg", str(files("f5_tts").joinpath(f"configs/{model}.yaml")))
249
+ ).model
250
+ model_cls = globals()[model_cfg.backbone]
251
+
252
+ repo_name, ckpt_step, ckpt_type = "F5-TTS", 1250000, "safetensors"
253
+
254
+ if model != "F5TTS_Base":
255
+ assert vocoder_name == model_cfg.mel_spec.mel_spec_type
256
+
257
+ # override for previous models
258
+ if model == "F5TTS_Base":
259
+ if vocoder_name == "vocos":
 
 
 
 
 
 
 
 
 
 
 
260
  ckpt_step = 1200000
261
+ elif vocoder_name == "bigvgan":
262
+ model = "F5TTS_Base_bigvgan"
263
+ ckpt_type = "pt"
264
+ elif model == "E2TTS_Base":
265
+ repo_name = "E2-TTS"
266
+ ckpt_step = 1200000
267
+
268
+ if not ckpt_file:
269
+ ckpt_file = str(cached_path(f"hf://SWivid/{repo_name}/{model}/model_{ckpt_step}.{ckpt_type}"))
270
 
271
  print(f"Using {model}...")
272
+ ema_model = load_model(model_cls, model_cfg.arch, ckpt_file, mel_spec_type=vocoder_name, vocab_file=vocab_file)
273
 
274
 
275
  # inference process
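Model construction above is now driven by the `configs/<model>.yaml` files: the YAML names the backbone class and architecture, and the class is looked up by name. A condensed sketch of the same flow outside the CLI (the checkpoint path is illustrative):

```python
# Condensed sketch of the config-driven loading used above. The backbone class is resolved
# from the name stored in configs/<model>.yaml; the checkpoint path here is illustrative.
from importlib.resources import files

from omegaconf import OmegaConf

from f5_tts.infer.utils_infer import load_model, load_vocoder
from f5_tts.model import DiT, UNetT  # noqa: F401. resolved by name from the config

model_name = "F5TTS_v1_Base"
model_cfg = OmegaConf.load(str(files("f5_tts").joinpath(f"configs/{model_name}.yaml"))).model

model_cls = {"DiT": DiT, "UNetT": UNetT}[model_cfg.backbone]  # same idea as globals()[...]
vocoder = load_vocoder(vocoder_name=model_cfg.mel_spec.mel_spec_type)
ema_model = load_model(
    model_cls,
    model_cfg.arch,
    "ckpts/F5TTS_v1_Base/model_1250000.safetensors",  # or a path resolved via cached_path
    mel_spec_type=model_cfg.mel_spec.mel_spec_type,
)
```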
src/f5_tts/infer/speech_edit.py CHANGED
@@ -1,56 +1,63 @@
1
  import os
2
 
3
- os.environ["PYTOCH_ENABLE_MPS_FALLBACK"] = "1" # for MPS device compatibility
 
 
4
 
5
  import torch
6
  import torch.nn.functional as F
7
  import torchaudio
 
8
 
9
  from f5_tts.infer.utils_infer import load_checkpoint, load_vocoder, save_spectrogram
10
- from f5_tts.model import CFM, DiT, UNetT
11
  from f5_tts.model.utils import convert_char_to_pinyin, get_tokenizer
12
 
13
- device = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"
14
-
15
-
16
- # --------------------- Dataset Settings -------------------- #
17
-
18
- target_sample_rate = 24000
19
- n_mel_channels = 100
20
- hop_length = 256
21
- win_length = 1024
22
- n_fft = 1024
23
- mel_spec_type = "vocos" # 'vocos' or 'bigvgan'
24
- target_rms = 0.1
25
-
26
- tokenizer = "pinyin"
27
- dataset_name = "Emilia_ZH_EN"
28
 
29
 
30
  # ---------------------- infer setting ---------------------- #
31
 
32
  seed = None # int | None
33
 
34
- exp_name = "F5TTS_Base" # F5TTS_Base | E2TTS_Base
35
- ckpt_step = 1200000
36
 
37
  nfe_step = 32 # 16, 32
38
  cfg_strength = 2.0
39
  ode_method = "euler" # euler | midpoint
40
  sway_sampling_coef = -1.0
41
  speed = 1.0
 
 
 
 
 
 
42
 
43
- if exp_name == "F5TTS_Base":
44
- model_cls = DiT
45
- model_cfg = dict(dim=1024, depth=22, heads=16, ff_mult=2, text_dim=512, conv_layers=4)
46
 
47
- elif exp_name == "E2TTS_Base":
48
- model_cls = UNetT
49
- model_cfg = dict(dim=1024, depth=24, heads=16, ff_mult=4)
 
 
 
50
 
51
- ckpt_path = f"ckpts/{exp_name}/model_{ckpt_step}.safetensors"
 
52
  output_dir = "tests"
53
 
 
54
  # [leverage https://github.com/MahmoudAshraf97/ctc-forced-aligner to get char level alignment]
55
  # pip install git+https://github.com/MahmoudAshraf97/ctc-forced-aligner.git
56
  # [write the origin_text into a file, e.g. tests/test_edit.txt]
@@ -59,7 +66,7 @@ output_dir = "tests"
59
  # [--language "zho" for Chinese, "eng" for English]
60
  # [if local ckpt, set --alignment_model "../checkpoints/mms-300m-1130-forced-aligner"]
61
 
62
- audio_to_edit = "src/f5_tts/infer/examples/basic/basic_ref_en.wav"
63
  origin_text = "Some call me nature, others call me mother nature."
64
  target_text = "Some call me optimist, others call me realist."
65
  parts_to_edit = [
@@ -98,7 +105,7 @@ vocab_char_map, vocab_size = get_tokenizer(dataset_name, tokenizer)
98
 
99
  # Model
100
  model = CFM(
101
- transformer=model_cls(**model_cfg, text_num_embeds=vocab_size, mel_dim=n_mel_channels),
102
  mel_spec_kwargs=dict(
103
  n_fft=n_fft,
104
  hop_length=hop_length,
 
1
  import os
2
 
3
+ os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1" # for MPS device compatibility
4
+
5
+ from importlib.resources import files
6
 
7
  import torch
8
  import torch.nn.functional as F
9
  import torchaudio
10
+ from omegaconf import OmegaConf
11
 
12
  from f5_tts.infer.utils_infer import load_checkpoint, load_vocoder, save_spectrogram
13
+ from f5_tts.model import CFM, DiT, UNetT # noqa: F401. used for config
14
  from f5_tts.model.utils import convert_char_to_pinyin, get_tokenizer
15
 
16
+ device = (
17
+ "cuda"
18
+ if torch.cuda.is_available()
19
+ else "xpu"
20
+ if torch.xpu.is_available()
21
+ else "mps"
22
+ if torch.backends.mps.is_available()
23
+ else "cpu"
24
+ )
 
 
 
 
 
 
25
 
26
 
27
  # ---------------------- infer setting ---------------------- #
28
 
29
  seed = None # int | None
30
 
31
+ exp_name = "F5TTS_v1_Base" # F5TTS_v1_Base | E2TTS_Base
32
+ ckpt_step = 1250000
33
 
34
  nfe_step = 32 # 16, 32
35
  cfg_strength = 2.0
36
  ode_method = "euler" # euler | midpoint
37
  sway_sampling_coef = -1.0
38
  speed = 1.0
39
+ target_rms = 0.1
40
+
41
+
42
+ model_cfg = OmegaConf.load(str(files("f5_tts").joinpath(f"configs/{exp_name}.yaml")))
43
+ model_cls = globals()[model_cfg.model.backbone]
44
+ model_arc = model_cfg.model.arch
45
 
46
+ dataset_name = model_cfg.datasets.name
47
+ tokenizer = model_cfg.model.tokenizer
 
48
 
49
+ mel_spec_type = model_cfg.model.mel_spec.mel_spec_type
50
+ target_sample_rate = model_cfg.model.mel_spec.target_sample_rate
51
+ n_mel_channels = model_cfg.model.mel_spec.n_mel_channels
52
+ hop_length = model_cfg.model.mel_spec.hop_length
53
+ win_length = model_cfg.model.mel_spec.win_length
54
+ n_fft = model_cfg.model.mel_spec.n_fft
55
 
56
+
57
+ ckpt_path = str(files("f5_tts").joinpath("../../")) + f"ckpts/{exp_name}/model_{ckpt_step}.safetensors"
58
  output_dir = "tests"
59
 
60
+
61
  # [leverage https://github.com/MahmoudAshraf97/ctc-forced-aligner to get char level alignment]
62
  # pip install git+https://github.com/MahmoudAshraf97/ctc-forced-aligner.git
63
  # [write the origin_text into a file, e.g. tests/test_edit.txt]
 
66
  # [--language "zho" for Chinese, "eng" for English]
67
  # [if local ckpt, set --alignment_model "../checkpoints/mms-300m-1130-forced-aligner"]
68
 
69
+ audio_to_edit = str(files("f5_tts").joinpath("infer/examples/basic/basic_ref_en.wav"))
70
  origin_text = "Some call me nature, others call me mother nature."
71
  target_text = "Some call me optimist, others call me realist."
72
  parts_to_edit = [
 
105
 
106
  # Model
107
  model = CFM(
108
+ transformer=model_cls(**model_arc, text_num_embeds=vocab_size, mel_dim=n_mel_channels),
109
  mel_spec_kwargs=dict(
110
  n_fft=n_fft,
111
  hop_length=hop_length,
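For reference, the `parts_to_edit` start/end times are mapped onto mel-spectrogram frames before masking; with `hop_length=256` at 24 kHz that is simply `seconds * sample_rate / hop_length` (about 93.75 frames per second). A small sketch of the conversion, with illustrative time ranges:

```python
# Quick sketch: convert (start, end) second ranges into mel-frame indices, the unit the
# editing mask operates on. The time ranges below are illustrative, not taken from the script.
def seconds_to_frames(parts_to_edit, sample_rate=24000, hop_length=256):
    return [
        (round(start * sample_rate / hop_length), round(end * sample_rate / hop_length))
        for start, end in parts_to_edit
    ]

print(seconds_to_frames([[1.0, 2.2], [4.0, 4.8]]))  # -> [(94, 206), (375, 450)]
```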
src/f5_tts/infer/utils_infer.py CHANGED
@@ -2,8 +2,9 @@
2
  # Make adjustments inside functions, and consider both gradio and cli scripts if need to change func output format
3
  import os
4
  import sys
 
5
 
6
- os.environ["PYTOCH_ENABLE_MPS_FALLBACK"] = "1" # for MPS device compatibility
7
  sys.path.append(f"{os.path.dirname(os.path.abspath(__file__))}/../../third_party/BigVGAN/")
8
 
9
  import hashlib
@@ -33,7 +34,15 @@ from f5_tts.model.utils import (
33
 
34
  _ref_audio_cache = {}
35
 
36
- device = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"
 
 
 
 
 
 
 
 
37
 
38
  # -----------------------------------------
39
 
@@ -292,19 +301,19 @@ def preprocess_ref_audio_text(ref_audio_orig, ref_text, clip_short=True, show_in
292
  )
293
  non_silent_wave = AudioSegment.silent(duration=0)
294
  for non_silent_seg in non_silent_segs:
295
- if len(non_silent_wave) > 6000 and len(non_silent_wave + non_silent_seg) > 15000:
296
  show_info("Audio is over 15s, clipping short. (1)")
297
  break
298
  non_silent_wave += non_silent_seg
299
 
300
  # 2. try to find short silence for clipping if 1. failed
301
- if len(non_silent_wave) > 15000:
302
  non_silent_segs = silence.split_on_silence(
303
  aseg, min_silence_len=100, silence_thresh=-40, keep_silence=1000, seek_step=10
304
  )
305
  non_silent_wave = AudioSegment.silent(duration=0)
306
  for non_silent_seg in non_silent_segs:
307
- if len(non_silent_wave) > 6000 and len(non_silent_wave + non_silent_seg) > 15000:
308
  show_info("Audio is over 15s, clipping short. (2)")
309
  break
310
  non_silent_wave += non_silent_seg
@@ -312,8 +321,8 @@ def preprocess_ref_audio_text(ref_audio_orig, ref_text, clip_short=True, show_in
312
  aseg = non_silent_wave
313
 
314
  # 3. if no proper silence found for clipping
315
- if len(aseg) > 15000:
316
- aseg = aseg[:15000]
317
  show_info("Audio is over 15s, clipping short. (3)")
318
 
319
  aseg = remove_silence_edges(aseg) + AudioSegment.silent(duration=50)
@@ -374,29 +383,31 @@ def infer_process(
374
  ):
375
  # Split the input text into batches
376
  audio, sr = torchaudio.load(ref_audio)
377
- max_chars = int(len(ref_text.encode("utf-8")) / (audio.shape[-1] / sr) * (25 - audio.shape[-1] / sr))
378
  gen_text_batches = chunk_text(gen_text, max_chars=max_chars)
379
  for i, gen_text in enumerate(gen_text_batches):
380
  print(f"gen_text {i}", gen_text)
381
  print("\n")
382
 
383
  show_info(f"Generating audio in {len(gen_text_batches)} batches...")
384
- return infer_batch_process(
385
- (audio, sr),
386
- ref_text,
387
- gen_text_batches,
388
- model_obj,
389
- vocoder,
390
- mel_spec_type=mel_spec_type,
391
- progress=progress,
392
- target_rms=target_rms,
393
- cross_fade_duration=cross_fade_duration,
394
- nfe_step=nfe_step,
395
- cfg_strength=cfg_strength,
396
- sway_sampling_coef=sway_sampling_coef,
397
- speed=speed,
398
- fix_duration=fix_duration,
399
- device=device,
 
 
400
  )
401
 
402
 
@@ -419,6 +430,8 @@ def infer_batch_process(
419
  speed=1,
420
  fix_duration=None,
421
  device=None,
 
 
422
  ):
423
  audio, sr = ref_audio
424
  if audio.shape[0] > 1:
@@ -437,7 +450,12 @@ def infer_batch_process(
437
 
438
  if len(ref_text[-1].encode("utf-8")) == 1:
439
  ref_text = ref_text + " "
440
- for i, gen_text in enumerate(progress.tqdm(gen_text_batches)):
 
 
 
 
 
441
  # Prepare the text
442
  text_list = [ref_text + gen_text]
443
  final_text_list = convert_char_to_pinyin(text_list)
@@ -449,7 +467,7 @@ def infer_batch_process(
449
  # Calculate duration
450
  ref_text_len = len(ref_text.encode("utf-8"))
451
  gen_text_len = len(gen_text.encode("utf-8"))
452
- duration = ref_audio_len + int(ref_audio_len / ref_text_len * gen_text_len / speed)
453
 
454
  # inference
455
  with torch.inference_mode():
@@ -461,64 +479,88 @@ def infer_batch_process(
461
  cfg_strength=cfg_strength,
462
  sway_sampling_coef=sway_sampling_coef,
463
  )
 
464
 
465
- generated = generated.to(torch.float32)
466
  generated = generated[:, ref_audio_len:, :]
467
- generated_mel_spec = generated.permute(0, 2, 1)
468
  if mel_spec_type == "vocos":
469
- generated_wave = vocoder.decode(generated_mel_spec)
470
  elif mel_spec_type == "bigvgan":
471
- generated_wave = vocoder(generated_mel_spec)
472
  if rms < target_rms:
473
  generated_wave = generated_wave * rms / target_rms
474
 
475
  # wav -> numpy
476
  generated_wave = generated_wave.squeeze().cpu().numpy()
477
 
478
- generated_waves.append(generated_wave)
479
- spectrograms.append(generated_mel_spec[0].cpu().numpy())
480
-
481
- # Combine all generated waves with cross-fading
482
- if cross_fade_duration <= 0:
483
- # Simply concatenate
484
- final_wave = np.concatenate(generated_waves)
 
 
 
 
 
485
  else:
486
- final_wave = generated_waves[0]
487
- for i in range(1, len(generated_waves)):
488
- prev_wave = final_wave
489
- next_wave = generated_waves[i]
490
-
491
- # Calculate cross-fade samples, ensuring it does not exceed wave lengths
492
- cross_fade_samples = int(cross_fade_duration * target_sample_rate)
493
- cross_fade_samples = min(cross_fade_samples, len(prev_wave), len(next_wave))
494
-
495
- if cross_fade_samples <= 0:
496
- # No overlap possible, concatenate
497
- final_wave = np.concatenate([prev_wave, next_wave])
498
- continue
499
-
500
- # Overlapping parts
501
- prev_overlap = prev_wave[-cross_fade_samples:]
502
- next_overlap = next_wave[:cross_fade_samples]
503
-
504
- # Fade out and fade in
505
- fade_out = np.linspace(1, 0, cross_fade_samples)
506
- fade_in = np.linspace(0, 1, cross_fade_samples)
507
-
508
- # Cross-faded overlap
509
- cross_faded_overlap = prev_overlap * fade_out + next_overlap * fade_in
510
-
511
- # Combine
512
- new_wave = np.concatenate(
513
- [prev_wave[:-cross_fade_samples], cross_faded_overlap, next_wave[cross_fade_samples:]]
514
- )
515
-
516
- final_wave = new_wave
 
517
 
518
- # Create a combined spectrogram
519
- combined_spectrogram = np.concatenate(spectrograms, axis=1)
520
-
521
- return final_wave, target_sample_rate, combined_spectrogram
522
 
523
 
524
  # remove silence from generated wav
 
2
  # Make adjustments inside functions, and consider both gradio and cli scripts if need to change func output format
3
  import os
4
  import sys
5
+ from concurrent.futures import ThreadPoolExecutor
6
 
7
+ os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1" # for MPS device compatibility
8
  sys.path.append(f"{os.path.dirname(os.path.abspath(__file__))}/../../third_party/BigVGAN/")
9
 
10
  import hashlib
 
34
 
35
  _ref_audio_cache = {}
36
 
37
+ device = (
38
+ "cuda"
39
+ if torch.cuda.is_available()
40
+ else "xpu"
41
+ if torch.xpu.is_available()
42
+ else "mps"
43
+ if torch.backends.mps.is_available()
44
+ else "cpu"
45
+ )
46
 
47
  # -----------------------------------------
48
 
 
301
  )
302
  non_silent_wave = AudioSegment.silent(duration=0)
303
  for non_silent_seg in non_silent_segs:
304
+ if len(non_silent_wave) > 6000 and len(non_silent_wave + non_silent_seg) > 12000:
305
  show_info("Audio is over 15s, clipping short. (1)")
306
  break
307
  non_silent_wave += non_silent_seg
308
 
309
  # 2. try to find short silence for clipping if 1. failed
310
+ if len(non_silent_wave) > 12000:
311
  non_silent_segs = silence.split_on_silence(
312
  aseg, min_silence_len=100, silence_thresh=-40, keep_silence=1000, seek_step=10
313
  )
314
  non_silent_wave = AudioSegment.silent(duration=0)
315
  for non_silent_seg in non_silent_segs:
316
+ if len(non_silent_wave) > 6000 and len(non_silent_wave + non_silent_seg) > 12000:
317
  show_info("Audio is over 15s, clipping short. (2)")
318
  break
319
  non_silent_wave += non_silent_seg
 
321
  aseg = non_silent_wave
322
 
323
  # 3. if no proper silence found for clipping
324
+ if len(aseg) > 12000:
325
+ aseg = aseg[:12000]
326
  show_info("Audio is over 15s, clipping short. (3)")
327
 
328
  aseg = remove_silence_edges(aseg) + AudioSegment.silent(duration=50)
 
383
  ):
384
  # Split the input text into batches
385
  audio, sr = torchaudio.load(ref_audio)
386
+ max_chars = int(len(ref_text.encode("utf-8")) / (audio.shape[-1] / sr) * (22 - audio.shape[-1] / sr))
387
  gen_text_batches = chunk_text(gen_text, max_chars=max_chars)
388
  for i, gen_text in enumerate(gen_text_batches):
389
  print(f"gen_text {i}", gen_text)
390
  print("\n")
391
 
392
  show_info(f"Generating audio in {len(gen_text_batches)} batches...")
393
+ return next(
394
+ infer_batch_process(
395
+ (audio, sr),
396
+ ref_text,
397
+ gen_text_batches,
398
+ model_obj,
399
+ vocoder,
400
+ mel_spec_type=mel_spec_type,
401
+ progress=progress,
402
+ target_rms=target_rms,
403
+ cross_fade_duration=cross_fade_duration,
404
+ nfe_step=nfe_step,
405
+ cfg_strength=cfg_strength,
406
+ sway_sampling_coef=sway_sampling_coef,
407
+ speed=speed,
408
+ fix_duration=fix_duration,
409
+ device=device,
410
+ )
411
  )
412
 
413
 
 
430
  speed=1,
431
  fix_duration=None,
432
  device=None,
433
+ streaming=False,
434
+ chunk_size=2048,
435
  ):
436
  audio, sr = ref_audio
437
  if audio.shape[0] > 1:
 
450
 
451
  if len(ref_text[-1].encode("utf-8")) == 1:
452
  ref_text = ref_text + " "
453
+
454
+ def process_batch(gen_text):
455
+ local_speed = speed
456
+ if len(gen_text.encode("utf-8")) < 10:
457
+ local_speed = 0.3
458
+
459
  # Prepare the text
460
  text_list = [ref_text + gen_text]
461
  final_text_list = convert_char_to_pinyin(text_list)
 
467
  # Calculate duration
468
  ref_text_len = len(ref_text.encode("utf-8"))
469
  gen_text_len = len(gen_text.encode("utf-8"))
470
+ duration = ref_audio_len + int(ref_audio_len / ref_text_len * gen_text_len / local_speed)
471
 
472
  # inference
473
  with torch.inference_mode():
 
479
  cfg_strength=cfg_strength,
480
  sway_sampling_coef=sway_sampling_coef,
481
  )
482
+ del _
483
 
484
+ generated = generated.to(torch.float32) # generated mel spectrogram
485
  generated = generated[:, ref_audio_len:, :]
486
+ generated = generated.permute(0, 2, 1)
487
  if mel_spec_type == "vocos":
488
+ generated_wave = vocoder.decode(generated)
489
  elif mel_spec_type == "bigvgan":
490
+ generated_wave = vocoder(generated)
491
  if rms < target_rms:
492
  generated_wave = generated_wave * rms / target_rms
493
 
494
  # wav -> numpy
495
  generated_wave = generated_wave.squeeze().cpu().numpy()
496
 
497
+ if streaming:
498
+ for j in range(0, len(generated_wave), chunk_size):
499
+ yield generated_wave[j : j + chunk_size], target_sample_rate
500
+ else:
501
+ generated_cpu = generated[0].cpu().numpy()
502
+ del generated
503
+ yield generated_wave, generated_cpu
504
+
505
+ if streaming:
506
+ for gen_text in progress.tqdm(gen_text_batches) if progress is not None else gen_text_batches:
507
+ for chunk in process_batch(gen_text):
508
+ yield chunk
509
  else:
510
+ with ThreadPoolExecutor() as executor:
511
+ futures = [executor.submit(process_batch, gen_text) for gen_text in gen_text_batches]
512
+ for future in progress.tqdm(futures) if progress is not None else futures:
513
+ result = future.result()
514
+ if result:
515
+ generated_wave, generated_mel_spec = next(result)
516
+ generated_waves.append(generated_wave)
517
+ spectrograms.append(generated_mel_spec)
518
+
519
+ if generated_waves:
520
+ if cross_fade_duration <= 0:
521
+ # Simply concatenate
522
+ final_wave = np.concatenate(generated_waves)
523
+ else:
524
+ # Combine all generated waves with cross-fading
525
+ final_wave = generated_waves[0]
526
+ for i in range(1, len(generated_waves)):
527
+ prev_wave = final_wave
528
+ next_wave = generated_waves[i]
529
+
530
+ # Calculate cross-fade samples, ensuring it does not exceed wave lengths
531
+ cross_fade_samples = int(cross_fade_duration * target_sample_rate)
532
+ cross_fade_samples = min(cross_fade_samples, len(prev_wave), len(next_wave))
533
+
534
+ if cross_fade_samples <= 0:
535
+ # No overlap possible, concatenate
536
+ final_wave = np.concatenate([prev_wave, next_wave])
537
+ continue
538
+
539
+ # Overlapping parts
540
+ prev_overlap = prev_wave[-cross_fade_samples:]
541
+ next_overlap = next_wave[:cross_fade_samples]
542
+
543
+ # Fade out and fade in
544
+ fade_out = np.linspace(1, 0, cross_fade_samples)
545
+ fade_in = np.linspace(0, 1, cross_fade_samples)
546
+
547
+ # Cross-faded overlap
548
+ cross_faded_overlap = prev_overlap * fade_out + next_overlap * fade_in
549
+
550
+ # Combine
551
+ new_wave = np.concatenate(
552
+ [prev_wave[:-cross_fade_samples], cross_faded_overlap, next_wave[cross_fade_samples:]]
553
+ )
554
+
555
+ final_wave = new_wave
556
+
557
+ # Create a combined spectrogram
558
+ combined_spectrogram = np.concatenate(spectrograms, axis=1)
559
+
560
+ yield final_wave, target_sample_rate, combined_spectrogram
561
 
562
+ else:
563
+ yield None, target_sample_rate, None
 
 
564
 
565
 
566
  # remove silence from generated wav
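`infer_batch_process` is now a generator: with `streaming=True` it yields `(audio_chunk, sample_rate)` pairs as each batch is vocoded, otherwise it yields a single `(wave, sample_rate, spectrogram)` tuple, which is why `infer_process` wraps it in `next(...)`. A minimal consumption sketch, assuming the model, vocoder, and reference audio have already been prepared:

```python
# Minimal sketch of consuming the streaming path added above. Model/vocoder loading and
# reference preprocessing (load_model / load_vocoder / preprocess_ref_audio_text) are assumed
# to have happened elsewhere; chunks could instead be pushed to a socket or playback buffer.
import numpy as np

from f5_tts.infer.utils_infer import infer_batch_process

def synthesize_streaming(ref_audio_tensor, sr, ref_text, gen_text_batches, model, vocoder):
    chunks, sample_rate = [], 24000  # fallback sample rate if nothing is yielded
    for chunk, sample_rate in infer_batch_process(
        (ref_audio_tensor, sr),
        ref_text,
        gen_text_batches,
        model,
        vocoder,
        streaming=True,
        chunk_size=2048,
    ):
        chunks.append(chunk)
    return (np.concatenate(chunks) if chunks else np.array([], dtype=np.float32)), sample_rate
```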
src/f5_tts/model/backbones/README.md CHANGED
@@ -4,7 +4,7 @@
4
  ### unett.py
5
  - flat unet transformer
6
  - structure same as in e2-tts & voicebox paper except using rotary pos emb
7
- - update: allow possible abs pos emb & convnextv2 blocks for embedded text before concat
8
 
9
  ### dit.py
10
  - adaln-zero dit
@@ -14,7 +14,7 @@
14
  - possible long skip connection (first layer to last layer)
15
 
16
  ### mmdit.py
17
- - sd3 structure
18
  - timestep as condition
19
  - left stream: text embedded, with an abs pos emb applied
20
  - right stream: masked_cond & noised_input concatenated, with the same conv pos emb as unett
 
4
  ### unett.py
5
  - flat unet transformer
6
  - structure same as in e2-tts & voicebox paper except using rotary pos emb
7
+ - possible abs pos emb & convnextv2 blocks for embedded text before concat
8
 
9
  ### dit.py
10
  - adaln-zero dit
 
14
  - possible long skip connection (first layer to last layer)
15
 
16
  ### mmdit.py
17
+ - stable diffusion 3 block structure
18
  - timestep as condition
19
  - left stream: text embedded, with an abs pos emb applied
20
  - right stream: masked_cond & noised_input concatenated, with the same conv pos emb as unett
src/f5_tts/model/backbones/dit.py CHANGED
@@ -20,7 +20,7 @@ from f5_tts.model.modules import (
20
  ConvNeXtV2Block,
21
  ConvPositionEmbedding,
22
  DiTBlock,
23
- AdaLayerNormZero_Final,
24
  precompute_freqs_cis,
25
  get_pos_embed_indices,
26
  )
@@ -30,10 +30,12 @@ from f5_tts.model.modules import (
30
 
31
 
32
  class TextEmbedding(nn.Module):
33
- def __init__(self, text_num_embeds, text_dim, conv_layers=0, conv_mult=2):
34
  super().__init__()
35
  self.text_embed = nn.Embedding(text_num_embeds + 1, text_dim) # use 0 as filler token
36
 
 
 
37
  if conv_layers > 0:
38
  self.extra_modeling = True
39
  self.precompute_max_pos = 4096 # ~44s of 24khz audio
@@ -49,6 +51,8 @@ class TextEmbedding(nn.Module):
49
  text = text[:, :seq_len] # curtail if character tokens are more than the mel spec tokens
50
  batch, text_len = text.shape[0], text.shape[1]
51
  text = F.pad(text, (0, seq_len - text_len), value=0)
 
 
52
 
53
  if drop_text: # cfg for text
54
  text = torch.zeros_like(text)
@@ -64,7 +68,13 @@ class TextEmbedding(nn.Module):
64
  text = text + text_pos_embed
65
 
66
  # convnextv2 blocks
67
- text = self.text_blocks(text)
 
 
 
 
 
 
68
 
69
  return text
70
 
@@ -103,7 +113,10 @@ class DiT(nn.Module):
103
  mel_dim=100,
104
  text_num_embeds=256,
105
  text_dim=None,
 
 
106
  conv_layers=0,
 
107
  long_skip_connection=False,
108
  checkpoint_activations=False,
109
  ):
@@ -112,7 +125,10 @@ class DiT(nn.Module):
112
  self.time_embed = TimestepEmbedding(dim)
113
  if text_dim is None:
114
  text_dim = mel_dim
115
- self.text_embed = TextEmbedding(text_num_embeds, text_dim, conv_layers=conv_layers)
 
 
 
116
  self.input_embed = InputEmbedding(mel_dim, text_dim, dim)
117
 
118
  self.rotary_embed = RotaryEmbedding(dim_head)
@@ -121,15 +137,40 @@ class DiT(nn.Module):
121
  self.depth = depth
122
 
123
  self.transformer_blocks = nn.ModuleList(
124
- [DiTBlock(dim=dim, heads=heads, dim_head=dim_head, ff_mult=ff_mult, dropout=dropout) for _ in range(depth)]
 
 
 
 
 
 
 
 
 
 
 
125
  )
126
  self.long_skip_connection = nn.Linear(dim * 2, dim, bias=False) if long_skip_connection else None
127
 
128
- self.norm_out = AdaLayerNormZero_Final(dim) # final modulation
129
  self.proj_out = nn.Linear(dim, mel_dim)
130
 
131
  self.checkpoint_activations = checkpoint_activations
132
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
133
  def ckpt_wrapper(self, module):
134
  # https://github.com/chuanyangjin/fast-DiT/blob/main/models.py
135
  def ckpt_forward(*inputs):
@@ -138,6 +179,9 @@ class DiT(nn.Module):
138
 
139
  return ckpt_forward
140
 
 
 
 
141
  def forward(
142
  self,
143
  x: float["b n d"], # noised input audio # noqa: F722
@@ -147,14 +191,25 @@ class DiT(nn.Module):
147
  drop_audio_cond, # cfg for cond audio
148
  drop_text, # cfg for text
149
  mask: bool["b n"] | None = None, # noqa: F722
 
150
  ):
151
  batch, seq_len = x.shape[0], x.shape[1]
152
  if time.ndim == 0:
153
  time = time.repeat(batch)
154
 
155
- # t: conditioning time, c: context (text + masked cond audio), x: noised input audio
156
  t = self.time_embed(time)
157
- text_embed = self.text_embed(text, seq_len, drop_text=drop_text)
 
 
 
 
 
 
 
 
 
 
158
  x = self.input_embed(x, cond, text_embed, drop_audio_cond=drop_audio_cond)
159
 
160
  rope = self.rotary_embed.forward_from_seq_len(seq_len)
 
20
  ConvNeXtV2Block,
21
  ConvPositionEmbedding,
22
  DiTBlock,
23
+ AdaLayerNorm_Final,
24
  precompute_freqs_cis,
25
  get_pos_embed_indices,
26
  )
 
30
 
31
 
32
  class TextEmbedding(nn.Module):
33
+ def __init__(self, text_num_embeds, text_dim, mask_padding=True, conv_layers=0, conv_mult=2):
34
  super().__init__()
35
  self.text_embed = nn.Embedding(text_num_embeds + 1, text_dim) # use 0 as filler token
36
 
37
+ self.mask_padding = mask_padding # mask filler and batch padding tokens or not
38
+
39
  if conv_layers > 0:
40
  self.extra_modeling = True
41
  self.precompute_max_pos = 4096 # ~44s of 24khz audio
 
51
  text = text[:, :seq_len] # curtail if character tokens are more than the mel spec tokens
52
  batch, text_len = text.shape[0], text.shape[1]
53
  text = F.pad(text, (0, seq_len - text_len), value=0)
54
+ if self.mask_padding:
55
+ text_mask = text == 0
56
 
57
  if drop_text: # cfg for text
58
  text = torch.zeros_like(text)
 
68
  text = text + text_pos_embed
69
 
70
  # convnextv2 blocks
71
+ if self.mask_padding:
72
+ text = text.masked_fill(text_mask.unsqueeze(-1).expand(-1, -1, text.size(-1)), 0.0)
73
+ for block in self.text_blocks:
74
+ text = block(text)
75
+ text = text.masked_fill(text_mask.unsqueeze(-1).expand(-1, -1, text.size(-1)), 0.0)
76
+ else:
77
+ text = self.text_blocks(text)
78
 
79
  return text
80
 
 
113
  mel_dim=100,
114
  text_num_embeds=256,
115
  text_dim=None,
116
+ text_mask_padding=True,
117
+ qk_norm=None,
118
  conv_layers=0,
119
+ pe_attn_head=None,
120
  long_skip_connection=False,
121
  checkpoint_activations=False,
122
  ):
 
125
  self.time_embed = TimestepEmbedding(dim)
126
  if text_dim is None:
127
  text_dim = mel_dim
128
+ self.text_embed = TextEmbedding(
129
+ text_num_embeds, text_dim, mask_padding=text_mask_padding, conv_layers=conv_layers
130
+ )
131
+ self.text_cond, self.text_uncond = None, None # text cache
132
  self.input_embed = InputEmbedding(mel_dim, text_dim, dim)
133
 
134
  self.rotary_embed = RotaryEmbedding(dim_head)
 
137
  self.depth = depth
138
 
139
  self.transformer_blocks = nn.ModuleList(
140
+ [
141
+ DiTBlock(
142
+ dim=dim,
143
+ heads=heads,
144
+ dim_head=dim_head,
145
+ ff_mult=ff_mult,
146
+ dropout=dropout,
147
+ qk_norm=qk_norm,
148
+ pe_attn_head=pe_attn_head,
149
+ )
150
+ for _ in range(depth)
151
+ ]
152
  )
153
  self.long_skip_connection = nn.Linear(dim * 2, dim, bias=False) if long_skip_connection else None
154
 
155
+ self.norm_out = AdaLayerNorm_Final(dim) # final modulation
156
  self.proj_out = nn.Linear(dim, mel_dim)
157
 
158
  self.checkpoint_activations = checkpoint_activations
159
 
160
+ self.initialize_weights()
161
+
162
+ def initialize_weights(self):
163
+ # Zero-out AdaLN layers in DiT blocks:
164
+ for block in self.transformer_blocks:
165
+ nn.init.constant_(block.attn_norm.linear.weight, 0)
166
+ nn.init.constant_(block.attn_norm.linear.bias, 0)
167
+
168
+ # Zero-out output layers:
169
+ nn.init.constant_(self.norm_out.linear.weight, 0)
170
+ nn.init.constant_(self.norm_out.linear.bias, 0)
171
+ nn.init.constant_(self.proj_out.weight, 0)
172
+ nn.init.constant_(self.proj_out.bias, 0)
173
+
174
  def ckpt_wrapper(self, module):
175
  # https://github.com/chuanyangjin/fast-DiT/blob/main/models.py
176
  def ckpt_forward(*inputs):
 
179
 
180
  return ckpt_forward
181
 
182
+ def clear_cache(self):
183
+ self.text_cond, self.text_uncond = None, None
184
+
185
  def forward(
186
  self,
187
  x: float["b n d"], # noised input audio # noqa: F722
 
191
  drop_audio_cond, # cfg for cond audio
192
  drop_text, # cfg for text
193
  mask: bool["b n"] | None = None, # noqa: F722
194
+ cache=False,
195
  ):
196
  batch, seq_len = x.shape[0], x.shape[1]
197
  if time.ndim == 0:
198
  time = time.repeat(batch)
199
 
200
+ # t: conditioning time, text: text, x: noised audio + cond audio + text
201
  t = self.time_embed(time)
202
+ if cache:
203
+ if drop_text:
204
+ if self.text_uncond is None:
205
+ self.text_uncond = self.text_embed(text, seq_len, drop_text=True)
206
+ text_embed = self.text_uncond
207
+ else:
208
+ if self.text_cond is None:
209
+ self.text_cond = self.text_embed(text, seq_len, drop_text=False)
210
+ text_embed = self.text_cond
211
+ else:
212
+ text_embed = self.text_embed(text, seq_len, drop_text=drop_text)
213
  x = self.input_embed(x, cond, text_embed, drop_audio_cond=drop_audio_cond)
214
 
215
  rope = self.rotary_embed.forward_from_seq_len(seq_len)
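The new `cache=True` path stores the conditional and unconditional text embeddings so classifier-free guidance does not recompute them at every ODE step; `clear_cache()` is meant to be called once an utterance is finished. A rough usage sketch (the exact guidance weighting lives in `cfm.py`):

```python
# Rough sketch of how the text-embedding cache above is meant to be used during sampling:
# the transformer is called twice per step (conditional + unconditional) with cache=True,
# and clear_cache() is called when the utterance is done. The combination shown is one
# common CFG form; the exact weighting used by the repo lives in cfm.py.
def cfg_step(transformer, x, cond, text, time, cfg_strength=2.0):
    pred_cond = transformer(
        x=x, cond=cond, text=text, time=time, drop_audio_cond=False, drop_text=False, cache=True
    )
    pred_uncond = transformer(
        x=x, cond=cond, text=text, time=time, drop_audio_cond=True, drop_text=True, cache=True
    )
    return pred_cond + (pred_cond - pred_uncond) * cfg_strength

# after the ODE solve for one utterance:
# transformer.clear_cache()
```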
src/f5_tts/model/backbones/mmdit.py CHANGED
@@ -18,7 +18,7 @@ from f5_tts.model.modules import (
18
  TimestepEmbedding,
19
  ConvPositionEmbedding,
20
  MMDiTBlock,
21
- AdaLayerNormZero_Final,
22
  precompute_freqs_cis,
23
  get_pos_embed_indices,
24
  )
@@ -28,18 +28,24 @@ from f5_tts.model.modules import (
28
 
29
 
30
  class TextEmbedding(nn.Module):
31
- def __init__(self, out_dim, text_num_embeds):
32
  super().__init__()
33
  self.text_embed = nn.Embedding(text_num_embeds + 1, out_dim) # will use 0 as filler token
34
 
 
 
35
  self.precompute_max_pos = 1024
36
  self.register_buffer("freqs_cis", precompute_freqs_cis(out_dim, self.precompute_max_pos), persistent=False)
37
 
38
  def forward(self, text: int["b nt"], drop_text=False) -> int["b nt d"]: # noqa: F722
39
- text = text + 1
40
- if drop_text:
 
 
 
41
  text = torch.zeros_like(text)
42
- text = self.text_embed(text)
 
43
 
44
  # sinus pos emb
45
  batch_start = torch.zeros((text.shape[0],), dtype=torch.long)
@@ -49,6 +55,9 @@ class TextEmbedding(nn.Module):
49
 
50
  text = text + text_pos_embed
51
 
 
 
 
52
  return text
53
 
54
 
@@ -83,13 +92,16 @@ class MMDiT(nn.Module):
83
  dim_head=64,
84
  dropout=0.1,
85
  ff_mult=4,
86
- text_num_embeds=256,
87
  mel_dim=100,
 
 
 
88
  ):
89
  super().__init__()
90
 
91
  self.time_embed = TimestepEmbedding(dim)
92
- self.text_embed = TextEmbedding(dim, text_num_embeds)
 
93
  self.audio_embed = AudioEmbedding(mel_dim, dim)
94
 
95
  self.rotary_embed = RotaryEmbedding(dim_head)
@@ -106,13 +118,33 @@ class MMDiT(nn.Module):
106
  dropout=dropout,
107
  ff_mult=ff_mult,
108
  context_pre_only=i == depth - 1,
 
109
  )
110
  for i in range(depth)
111
  ]
112
  )
113
- self.norm_out = AdaLayerNormZero_Final(dim) # final modulation
114
  self.proj_out = nn.Linear(dim, mel_dim)
115
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
116
  def forward(
117
  self,
118
  x: float["b n d"], # noised input audio # noqa: F722
@@ -122,6 +154,7 @@ class MMDiT(nn.Module):
122
  drop_audio_cond, # cfg for cond audio
123
  drop_text, # cfg for text
124
  mask: bool["b n"] | None = None, # noqa: F722
 
125
  ):
126
  batch = x.shape[0]
127
  if time.ndim == 0:
@@ -129,7 +162,17 @@ class MMDiT(nn.Module):
129
 
130
  # t: conditioning (time), c: context (text + masked cond audio), x: noised input audio
131
  t = self.time_embed(time)
132
- c = self.text_embed(text, drop_text=drop_text)
 
 
 
 
 
 
 
 
 
 
133
  x = self.audio_embed(x, cond, drop_audio_cond=drop_audio_cond)
134
 
135
  seq_len = x.shape[1]
 
18
  TimestepEmbedding,
19
  ConvPositionEmbedding,
20
  MMDiTBlock,
21
+ AdaLayerNorm_Final,
22
  precompute_freqs_cis,
23
  get_pos_embed_indices,
24
  )
 
28
 
29
 
30
  class TextEmbedding(nn.Module):
31
+ def __init__(self, out_dim, text_num_embeds, mask_padding=True):
32
  super().__init__()
33
  self.text_embed = nn.Embedding(text_num_embeds + 1, out_dim) # will use 0 as filler token
34
 
35
+ self.mask_padding = mask_padding # mask filler and batch padding tokens or not
36
+
37
  self.precompute_max_pos = 1024
38
  self.register_buffer("freqs_cis", precompute_freqs_cis(out_dim, self.precompute_max_pos), persistent=False)
39
 
40
  def forward(self, text: int["b nt"], drop_text=False) -> int["b nt d"]: # noqa: F722
41
+ text = text + 1 # use 0 as filler token. preprocess of batch pad -1, see list_str_to_idx()
42
+ if self.mask_padding:
43
+ text_mask = text == 0
44
+
45
+ if drop_text: # cfg for text
46
  text = torch.zeros_like(text)
47
+
48
+ text = self.text_embed(text) # b nt -> b nt d
49
 
50
  # sinus pos emb
51
  batch_start = torch.zeros((text.shape[0],), dtype=torch.long)
 
55
 
56
  text = text + text_pos_embed
57
 
58
+ if self.mask_padding:
59
+ text = text.masked_fill(text_mask.unsqueeze(-1).expand(-1, -1, text.size(-1)), 0.0)
60
+
61
  return text
62
 
63
 
 
92
  dim_head=64,
93
  dropout=0.1,
94
  ff_mult=4,
 
95
  mel_dim=100,
96
+ text_num_embeds=256,
97
+ text_mask_padding=True,
98
+ qk_norm=None,
99
  ):
100
  super().__init__()
101
 
102
  self.time_embed = TimestepEmbedding(dim)
103
+ self.text_embed = TextEmbedding(dim, text_num_embeds, mask_padding=text_mask_padding)
104
+ self.text_cond, self.text_uncond = None, None # text cache
105
  self.audio_embed = AudioEmbedding(mel_dim, dim)
106
 
107
  self.rotary_embed = RotaryEmbedding(dim_head)
 
118
  dropout=dropout,
119
  ff_mult=ff_mult,
120
  context_pre_only=i == depth - 1,
121
+ qk_norm=qk_norm,
122
  )
123
  for i in range(depth)
124
  ]
125
  )
126
+ self.norm_out = AdaLayerNorm_Final(dim) # final modulation
127
  self.proj_out = nn.Linear(dim, mel_dim)
128
 
129
+ self.initialize_weights()
130
+
131
+ def initialize_weights(self):
132
+ # Zero-out AdaLN layers in MMDiT blocks:
133
+ for block in self.transformer_blocks:
134
+ nn.init.constant_(block.attn_norm_x.linear.weight, 0)
135
+ nn.init.constant_(block.attn_norm_x.linear.bias, 0)
136
+ nn.init.constant_(block.attn_norm_c.linear.weight, 0)
137
+ nn.init.constant_(block.attn_norm_c.linear.bias, 0)
138
+
139
+ # Zero-out output layers:
140
+ nn.init.constant_(self.norm_out.linear.weight, 0)
141
+ nn.init.constant_(self.norm_out.linear.bias, 0)
142
+ nn.init.constant_(self.proj_out.weight, 0)
143
+ nn.init.constant_(self.proj_out.bias, 0)
144
+
145
+ def clear_cache(self):
146
+ self.text_cond, self.text_uncond = None, None
147
+
148
  def forward(
149
  self,
150
  x: float["b n d"], # noised input audio # noqa: F722
 
154
  drop_audio_cond, # cfg for cond audio
155
  drop_text, # cfg for text
156
  mask: bool["b n"] | None = None, # noqa: F722
157
+ cache=False,
158
  ):
159
  batch = x.shape[0]
160
  if time.ndim == 0:
 
162
 
163
  # t: conditioning (time), c: context (text + masked cond audio), x: noised input audio
164
  t = self.time_embed(time)
165
+ if cache:
166
+ if drop_text:
167
+ if self.text_uncond is None:
168
+ self.text_uncond = self.text_embed(text, drop_text=True)
169
+ c = self.text_uncond
170
+ else:
171
+ if self.text_cond is None:
172
+ self.text_cond = self.text_embed(text, drop_text=False)
173
+ c = self.text_cond
174
+ else:
175
+ c = self.text_embed(text, drop_text=drop_text)
176
  x = self.audio_embed(x, cond, drop_audio_cond=drop_audio_cond)
177
 
178
  seq_len = x.shape[1]
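Both text embedders now zero out filler/padding positions (`text == 0` after the `+1` shift of batch-padded `-1` entries) before and after the extra modeling layers, so padding cannot leak through attention. A toy illustration of the masking itself:

```python
# Toy illustration of the text_mask_padding behavior: embedded rows at filler/padding
# positions (token id 0 after the +1 shift) are zeroed with masked_fill.
import torch
import torch.nn as nn

text = torch.tensor([[3, 7, 0, 0]])  # last two positions are filler/padding
embedded = nn.Embedding(10, 4)(text)  # (b, nt, d)

text_mask = text == 0  # True where padding
embedded = embedded.masked_fill(text_mask.unsqueeze(-1).expand(-1, -1, embedded.size(-1)), 0.0)
print(embedded[0, 2:].abs().sum().item())  # 0.0 -> padded rows contribute nothing
```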
src/f5_tts/model/backbones/unett.py CHANGED
@@ -33,10 +33,12 @@ from f5_tts.model.modules import (
33
 
34
 
35
  class TextEmbedding(nn.Module):
36
- def __init__(self, text_num_embeds, text_dim, conv_layers=0, conv_mult=2):
37
  super().__init__()
38
  self.text_embed = nn.Embedding(text_num_embeds + 1, text_dim) # use 0 as filler token
39
 
 
 
40
  if conv_layers > 0:
41
  self.extra_modeling = True
42
  self.precompute_max_pos = 4096 # ~44s of 24khz audio
@@ -52,6 +54,8 @@ class TextEmbedding(nn.Module):
52
  text = text[:, :seq_len] # curtail if character tokens are more than the mel spec tokens
53
  batch, text_len = text.shape[0], text.shape[1]
54
  text = F.pad(text, (0, seq_len - text_len), value=0)
 
 
55
 
56
  if drop_text: # cfg for text
57
  text = torch.zeros_like(text)
@@ -67,7 +71,13 @@ class TextEmbedding(nn.Module):
67
  text = text + text_pos_embed
68
 
69
  # convnextv2 blocks
70
- text = self.text_blocks(text)
 
 
 
 
 
 
71
 
72
  return text
73
 
@@ -106,7 +116,10 @@ class UNetT(nn.Module):
106
  mel_dim=100,
107
  text_num_embeds=256,
108
  text_dim=None,
 
 
109
  conv_layers=0,
 
110
  skip_connect_type: Literal["add", "concat", "none"] = "concat",
111
  ):
112
  super().__init__()
@@ -115,7 +128,10 @@ class UNetT(nn.Module):
115
  self.time_embed = TimestepEmbedding(dim)
116
  if text_dim is None:
117
  text_dim = mel_dim
118
- self.text_embed = TextEmbedding(text_num_embeds, text_dim, conv_layers=conv_layers)
 
 
 
119
  self.input_embed = InputEmbedding(mel_dim, text_dim, dim)
120
 
121
  self.rotary_embed = RotaryEmbedding(dim_head)
@@ -134,11 +150,12 @@ class UNetT(nn.Module):
134
 
135
  attn_norm = RMSNorm(dim)
136
  attn = Attention(
137
- processor=AttnProcessor(),
138
  dim=dim,
139
  heads=heads,
140
  dim_head=dim_head,
141
  dropout=dropout,
 
142
  )
143
 
144
  ff_norm = RMSNorm(dim)
@@ -161,6 +178,9 @@ class UNetT(nn.Module):
161
  self.norm_out = RMSNorm(dim)
162
  self.proj_out = nn.Linear(dim, mel_dim)
163
 
 
 
 
164
  def forward(
165
  self,
166
  x: float["b n d"], # noised input audio # noqa: F722
@@ -170,6 +190,7 @@ class UNetT(nn.Module):
170
  drop_audio_cond, # cfg for cond audio
171
  drop_text, # cfg for text
172
  mask: bool["b n"] | None = None, # noqa: F722
 
173
  ):
174
  batch, seq_len = x.shape[0], x.shape[1]
175
  if time.ndim == 0:
@@ -177,7 +198,17 @@ class UNetT(nn.Module):
177
 
178
  # t: conditioning time, c: context (text + masked cond audio), x: noised input audio
179
  t = self.time_embed(time)
180
- text_embed = self.text_embed(text, seq_len, drop_text=drop_text)
 
 
 
 
 
 
 
 
 
 
181
  x = self.input_embed(x, cond, text_embed, drop_audio_cond=drop_audio_cond)
182
 
183
  # postfix time t to input x, [b n d] -> [b n+1 d]
 
33
 
34
 
35
  class TextEmbedding(nn.Module):
36
+ def __init__(self, text_num_embeds, text_dim, mask_padding=True, conv_layers=0, conv_mult=2):
37
  super().__init__()
38
  self.text_embed = nn.Embedding(text_num_embeds + 1, text_dim) # use 0 as filler token
39
 
40
+ self.mask_padding = mask_padding # whether to mask filler and batch padding tokens
41
+
42
  if conv_layers > 0:
43
  self.extra_modeling = True
44
  self.precompute_max_pos = 4096 # ~44s of 24khz audio
 
54
  text = text[:, :seq_len] # curtail if character tokens are more than the mel spec tokens
55
  batch, text_len = text.shape[0], text.shape[1]
56
  text = F.pad(text, (0, seq_len - text_len), value=0)
57
+ if self.mask_padding:
58
+ text_mask = text == 0
59
 
60
  if drop_text: # cfg for text
61
  text = torch.zeros_like(text)
 
71
  text = text + text_pos_embed
72
 
73
  # convnextv2 blocks
74
+ if self.mask_padding:
75
+ text = text.masked_fill(text_mask.unsqueeze(-1).expand(-1, -1, text.size(-1)), 0.0)
76
+ for block in self.text_blocks:
77
+ text = block(text)
78
+ text = text.masked_fill(text_mask.unsqueeze(-1).expand(-1, -1, text.size(-1)), 0.0)
79
+ else:
80
+ text = self.text_blocks(text)
81
 
82
  return text
83
 
 
116
  mel_dim=100,
117
  text_num_embeds=256,
118
  text_dim=None,
119
+ text_mask_padding=True,
120
+ qk_norm=None,
121
  conv_layers=0,
122
+ pe_attn_head=None,
123
  skip_connect_type: Literal["add", "concat", "none"] = "concat",
124
  ):
125
  super().__init__()
 
128
  self.time_embed = TimestepEmbedding(dim)
129
  if text_dim is None:
130
  text_dim = mel_dim
131
+ self.text_embed = TextEmbedding(
132
+ text_num_embeds, text_dim, mask_padding=text_mask_padding, conv_layers=conv_layers
133
+ )
134
+ self.text_cond, self.text_uncond = None, None # text cache
135
  self.input_embed = InputEmbedding(mel_dim, text_dim, dim)
136
 
137
  self.rotary_embed = RotaryEmbedding(dim_head)
 
150
 
151
  attn_norm = RMSNorm(dim)
152
  attn = Attention(
153
+ processor=AttnProcessor(pe_attn_head=pe_attn_head),
154
  dim=dim,
155
  heads=heads,
156
  dim_head=dim_head,
157
  dropout=dropout,
158
+ qk_norm=qk_norm,
159
  )
160
 
161
  ff_norm = RMSNorm(dim)
 
178
  self.norm_out = RMSNorm(dim)
179
  self.proj_out = nn.Linear(dim, mel_dim)
180
 
181
+ def clear_cache(self):
182
+ self.text_cond, self.text_uncond = None, None
183
+
184
  def forward(
185
  self,
186
  x: float["b n d"], # noised input audio # noqa: F722
 
190
  drop_audio_cond, # cfg for cond audio
191
  drop_text, # cfg for text
192
  mask: bool["b n"] | None = None, # noqa: F722
193
+ cache=False,
194
  ):
195
  batch, seq_len = x.shape[0], x.shape[1]
196
  if time.ndim == 0:
 
198
 
199
  # t: conditioning time, c: context (text + masked cond audio), x: noised input audio
200
  t = self.time_embed(time)
201
+ if cache:
202
+ if drop_text:
203
+ if self.text_uncond is None:
204
+ self.text_uncond = self.text_embed(text, seq_len, drop_text=True)
205
+ text_embed = self.text_uncond
206
+ else:
207
+ if self.text_cond is None:
208
+ self.text_cond = self.text_embed(text, seq_len, drop_text=False)
209
+ text_embed = self.text_cond
210
+ else:
211
+ text_embed = self.text_embed(text, seq_len, drop_text=drop_text)
212
  x = self.input_embed(x, cond, text_embed, drop_audio_cond=drop_audio_cond)
213
 
214
  # postfix time t to input x, [b n d] -> [b n+1 d]
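
With mask_padding enabled above, filler/padding positions are zeroed again after every ConvNeXt block, because a 1-D convolution mixes neighbouring frames and would otherwise smear padding into real tokens. A small sketch of that re-masking loop, assuming (b, nt, d) embeddings and a (b, nt) boolean mask:

    import torch

    def run_blocks_with_padding_mask(text, text_mask, blocks):
        # text: (b, nt, d) embeddings; text_mask: (b, nt) bool, True on filler/padding
        pad = text_mask.unsqueeze(-1).expand(-1, -1, text.size(-1))
        text = text.masked_fill(pad, 0.0)
        for block in blocks:  # any iterable of sequence-to-sequence modules
            text = block(text)
            text = text.masked_fill(pad, 0.0)  # re-zero positions the block may have filled in
        return text
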
src/f5_tts/model/cfm.py CHANGED
@@ -120,10 +120,6 @@ class CFM(nn.Module):
120
  text = list_str_to_tensor(text).to(device)
121
  assert text.shape[0] == batch
122
 
123
- if exists(text):
124
- text_lens = (text != -1).sum(dim=-1)
125
- lens = torch.maximum(text_lens, lens) # make sure lengths are at least those of the text characters
126
-
127
  # duration
128
 
129
  cond_mask = lens_to_mask(lens)
@@ -133,7 +129,9 @@ class CFM(nn.Module):
133
  if isinstance(duration, int):
134
  duration = torch.full((batch,), duration, device=device, dtype=torch.long)
135
 
136
- duration = torch.maximum(lens + 1, duration) # just add one token so something is generated
 
 
137
  duration = duration.clamp(max=max_duration)
138
  max_duration = duration.amax()
139
 
@@ -142,6 +140,9 @@ class CFM(nn.Module):
142
  test_cond = F.pad(cond, (0, 0, cond_seq_len, max_duration - 2 * cond_seq_len), value=0.0)
143
 
144
  cond = F.pad(cond, (0, 0, 0, max_duration - cond_seq_len), value=0.0)
 
 
 
145
  cond_mask = F.pad(cond_mask, (0, max_duration - cond_mask.shape[-1]), value=False)
146
  cond_mask = cond_mask.unsqueeze(-1)
147
  step_cond = torch.where(
@@ -153,10 +154,6 @@ class CFM(nn.Module):
153
  else: # save memory and speed up, as single inference need no mask currently
154
  mask = None
155
 
156
- # test for no ref audio
157
- if no_ref_audio:
158
- cond = torch.zeros_like(cond)
159
-
160
  # neural ode
161
 
162
  def fn(t, x):
@@ -165,13 +162,13 @@ class CFM(nn.Module):
165
 
166
  # predict flow
167
  pred = self.transformer(
168
- x=x, cond=step_cond, text=text, time=t, mask=mask, drop_audio_cond=False, drop_text=False
169
  )
170
  if cfg_strength < 1e-5:
171
  return pred
172
 
173
  null_pred = self.transformer(
174
- x=x, cond=step_cond, text=text, time=t, mask=mask, drop_audio_cond=True, drop_text=True
175
  )
176
  return pred + (pred - null_pred) * cfg_strength
177
 
@@ -198,6 +195,7 @@ class CFM(nn.Module):
198
  t = t + sway_sampling_coef * (torch.cos(torch.pi / 2 * t) - 1 + t)
199
 
200
  trajectory = odeint(fn, y0, t, **self.odeint_kwargs)
 
201
 
202
  sampled = trajectory[-1]
203
  out = sampled
 
120
  text = list_str_to_tensor(text).to(device)
121
  assert text.shape[0] == batch
122
 
 
 
 
 
123
  # duration
124
 
125
  cond_mask = lens_to_mask(lens)
 
129
  if isinstance(duration, int):
130
  duration = torch.full((batch,), duration, device=device, dtype=torch.long)
131
 
132
+ duration = torch.maximum(
133
+ torch.maximum((text != -1).sum(dim=-1), lens) + 1, duration
134
+ ) # duration at least text/audio prompt length plus one token, so something is generated
135
  duration = duration.clamp(max=max_duration)
136
  max_duration = duration.amax()
137
 
 
140
  test_cond = F.pad(cond, (0, 0, cond_seq_len, max_duration - 2 * cond_seq_len), value=0.0)
141
 
142
  cond = F.pad(cond, (0, 0, 0, max_duration - cond_seq_len), value=0.0)
143
+ if no_ref_audio:
144
+ cond = torch.zeros_like(cond)
145
+
146
  cond_mask = F.pad(cond_mask, (0, max_duration - cond_mask.shape[-1]), value=False)
147
  cond_mask = cond_mask.unsqueeze(-1)
148
  step_cond = torch.where(
 
154
  else: # save memory and speed up, as single inference need no mask currently
155
  mask = None
156
 
 
 
 
 
157
  # neural ode
158
 
159
  def fn(t, x):
 
162
 
163
  # predict flow
164
  pred = self.transformer(
165
+ x=x, cond=step_cond, text=text, time=t, mask=mask, drop_audio_cond=False, drop_text=False, cache=True
166
  )
167
  if cfg_strength < 1e-5:
168
  return pred
169
 
170
  null_pred = self.transformer(
171
+ x=x, cond=step_cond, text=text, time=t, mask=mask, drop_audio_cond=True, drop_text=True, cache=True
172
  )
173
  return pred + (pred - null_pred) * cfg_strength
174
 
 
195
  t = t + sway_sampling_coef * (torch.cos(torch.pi / 2 * t) - 1 + t)
196
 
197
  trajectory = odeint(fn, y0, t, **self.odeint_kwargs)
198
+ self.transformer.clear_cache()
199
 
200
  sampled = trajectory[-1]
201
  out = sampled
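
Two details of the sampling loop above are easy to check in isolation: the CFG update is pred + (pred - null_pred) * cfg_strength, and the sway-sampling coefficient warps the uniform time grid before it reaches odeint, so a negative coefficient concentrates steps near t = 0 where the flow changes fastest. A small sketch of the warp (values rounded):

    import torch

    def sway_sample(nfe: int, coef: float = -1.0) -> torch.Tensor:
        t = torch.linspace(0, 1, nfe + 1)
        return t + coef * (torch.cos(torch.pi / 2 * t) - 1 + t)

    # sway_sample(4, 0.0)  -> [0.00, 0.25, 0.50, 0.75, 1.00]  (plain uniform grid)
    # sway_sample(4, -1.0) -> [0.00, 0.08, 0.29, 0.62, 1.00]  (denser near t = 0)
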
src/f5_tts/model/dataset.py CHANGED
@@ -1,5 +1,4 @@
1
  import json
2
- import random
3
  from importlib.resources import files
4
 
5
  import torch
@@ -170,14 +169,17 @@ class DynamicBatchSampler(Sampler[list[int]]):
170
  in a batch to ensure that the total number of frames are less
171
  than a certain threshold.
172
  2. Make sure the padding efficiency in the batch is high.
 
173
  """
174
 
175
  def __init__(
176
- self, sampler: Sampler[int], frames_threshold: int, max_samples=0, random_seed=None, drop_last: bool = False
177
  ):
178
  self.sampler = sampler
179
  self.frames_threshold = frames_threshold
180
  self.max_samples = max_samples
 
 
181
 
182
  indices, batches = [], []
183
  data_source = self.sampler.data_source
@@ -206,21 +208,30 @@ class DynamicBatchSampler(Sampler[list[int]]):
206
  batch = []
207
  batch_frames = 0
208
 
209
- if not drop_last and len(batch) > 0:
210
  batches.append(batch)
211
 
212
  del indices
 
213
 
214
- # if want to have different batches between epochs, may just set a seed and log it in ckpt
215
- # cuz during multi-gpu training, although the batch on per gpu not change between epochs, the formed general minibatch is different
216
- # e.g. for epoch n, use (random_seed + n)
217
- random.seed(random_seed)
218
- random.shuffle(batches)
219
 
220
- self.batches = batches
 
 
221
 
222
  def __iter__(self):
223
- return iter(self.batches)
 
 
 
 
 
 
 
 
 
224
 
225
  def __len__(self):
226
  return len(self.batches)
 
1
  import json
 
2
  from importlib.resources import files
3
 
4
  import torch
 
169
  in a batch to ensure that the total number of frames are less
170
  than a certain threshold.
171
  2. Make sure the padding efficiency in the batch is high.
172
+ 3. Shuffle batches each epoch while maintaining reproducibility.
173
  """
174
 
175
  def __init__(
176
+ self, sampler: Sampler[int], frames_threshold: int, max_samples=0, random_seed=None, drop_residual: bool = False
177
  ):
178
  self.sampler = sampler
179
  self.frames_threshold = frames_threshold
180
  self.max_samples = max_samples
181
+ self.random_seed = random_seed
182
+ self.epoch = 0
183
 
184
  indices, batches = [], []
185
  data_source = self.sampler.data_source
 
208
  batch = []
209
  batch_frames = 0
210
 
211
+ if not drop_residual and len(batch) > 0:
212
  batches.append(batch)
213
 
214
  del indices
215
+ self.batches = batches
216
 
217
+ # Ensure even batches with accelerate BatchSamplerShard cls under frame_per_batch setting
218
+ self.drop_last = True
 
 
 
219
 
220
+ def set_epoch(self, epoch: int) -> None:
221
+ """Sets the epoch for this sampler."""
222
+ self.epoch = epoch
223
 
224
  def __iter__(self):
225
+ # Use both random_seed and epoch for deterministic but different shuffling per epoch
226
+ if self.random_seed is not None:
227
+ g = torch.Generator()
228
+ g.manual_seed(self.random_seed + self.epoch)
229
+ # Use PyTorch's random permutation for better reproducibility across PyTorch versions
230
+ indices = torch.randperm(len(self.batches), generator=g).tolist()
231
+ batches = [self.batches[i] for i in indices]
232
+ else:
233
+ batches = self.batches
234
+ return iter(batches)
235
 
236
  def __len__(self):
237
  return len(self.batches)
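
The epoch-aware shuffle added above is the usual set_epoch pattern: seeding a fresh torch.Generator with random_seed + epoch gives an ordering that differs every epoch yet is reproducible across restarts. Condensed to just that piece:

    import torch

    def shuffle_batches(batches, random_seed: int, epoch: int):
        g = torch.Generator()
        g.manual_seed(random_seed + epoch)
        order = torch.randperm(len(batches), generator=g).tolist()
        return [batches[i] for i in order]

The trainer is expected to call batch_sampler.set_epoch(epoch) once per epoch, which the training-loop change further down in this commit does.
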
src/f5_tts/model/modules.py CHANGED
@@ -269,11 +269,36 @@ class ConvNeXtV2Block(nn.Module):
269
  return residual + x
270
 
271
 
272
- # AdaLayerNormZero
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
273
  # return with modulated x for attn input, and params for later mlp modulation
274
 
275
 
276
- class AdaLayerNormZero(nn.Module):
277
  def __init__(self, dim):
278
  super().__init__()
279
 
@@ -290,11 +315,11 @@ class AdaLayerNormZero(nn.Module):
290
  return x, gate_msa, shift_mlp, scale_mlp, gate_mlp
291
 
292
 
293
- # AdaLayerNormZero for final layer
294
  # return only with modulated x for attn input, cuz no more mlp modulation
295
 
296
 
297
- class AdaLayerNormZero_Final(nn.Module):
298
  def __init__(self, dim):
299
  super().__init__()
300
 
@@ -341,7 +366,8 @@ class Attention(nn.Module):
341
  dim_head: int = 64,
342
  dropout: float = 0.0,
343
  context_dim: Optional[int] = None, # if not None -> joint attention
344
- context_pre_only=None,
 
345
  ):
346
  super().__init__()
347
 
@@ -362,18 +388,32 @@ class Attention(nn.Module):
362
  self.to_k = nn.Linear(dim, self.inner_dim)
363
  self.to_v = nn.Linear(dim, self.inner_dim)
364
 
 
 
 
 
 
 
 
 
 
365
  if self.context_dim is not None:
 
366
  self.to_k_c = nn.Linear(context_dim, self.inner_dim)
367
  self.to_v_c = nn.Linear(context_dim, self.inner_dim)
368
- if self.context_pre_only is not None:
369
- self.to_q_c = nn.Linear(context_dim, self.inner_dim)
 
 
 
 
370
 
371
  self.to_out = nn.ModuleList([])
372
  self.to_out.append(nn.Linear(self.inner_dim, dim))
373
  self.to_out.append(nn.Dropout(dropout))
374
 
375
- if self.context_pre_only is not None and not self.context_pre_only:
376
- self.to_out_c = nn.Linear(self.inner_dim, dim)
377
 
378
  def forward(
379
  self,
@@ -393,8 +433,11 @@ class Attention(nn.Module):
393
 
394
 
395
  class AttnProcessor:
396
- def __init__(self):
397
- pass
 
 
 
398
 
399
  def __call__(
400
  self,
@@ -405,19 +448,11 @@ class AttnProcessor:
405
  ) -> torch.FloatTensor:
406
  batch_size = x.shape[0]
407
 
408
- # `sample` projections.
409
  query = attn.to_q(x)
410
  key = attn.to_k(x)
411
  value = attn.to_v(x)
412
 
413
- # apply rotary position embedding
414
- if rope is not None:
415
- freqs, xpos_scale = rope
416
- q_xpos_scale, k_xpos_scale = (xpos_scale, xpos_scale**-1.0) if xpos_scale is not None else (1.0, 1.0)
417
-
418
- query = apply_rotary_pos_emb(query, freqs, q_xpos_scale)
419
- key = apply_rotary_pos_emb(key, freqs, k_xpos_scale)
420
-
421
  # attention
422
  inner_dim = key.shape[-1]
423
  head_dim = inner_dim // attn.heads
@@ -425,6 +460,25 @@ class AttnProcessor:
425
  key = key.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
426
  value = value.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
427
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
428
  # mask. e.g. inference got a batch with different target durations, mask out the padding
429
  if mask is not None:
430
  attn_mask = mask
@@ -470,16 +524,36 @@ class JointAttnProcessor:
470
 
471
  batch_size = c.shape[0]
472
 
473
- # `sample` projections.
474
  query = attn.to_q(x)
475
  key = attn.to_k(x)
476
  value = attn.to_v(x)
477
 
478
- # `context` projections.
479
  c_query = attn.to_q_c(c)
480
  c_key = attn.to_k_c(c)
481
  c_value = attn.to_v_c(c)
482
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
483
  # apply rope for context and noised input independently
484
  if rope is not None:
485
  freqs, xpos_scale = rope
@@ -492,16 +566,10 @@ class JointAttnProcessor:
492
  c_query = apply_rotary_pos_emb(c_query, freqs, q_xpos_scale)
493
  c_key = apply_rotary_pos_emb(c_key, freqs, k_xpos_scale)
494
 
495
- # attention
496
- query = torch.cat([query, c_query], dim=1)
497
- key = torch.cat([key, c_key], dim=1)
498
- value = torch.cat([value, c_value], dim=1)
499
-
500
- inner_dim = key.shape[-1]
501
- head_dim = inner_dim // attn.heads
502
- query = query.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
503
- key = key.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
504
- value = value.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
505
 
506
  # mask. e.g. inference got a batch with different target durations, mask out the padding
507
  if mask is not None:
@@ -540,16 +608,17 @@ class JointAttnProcessor:
540
 
541
 
542
  class DiTBlock(nn.Module):
543
- def __init__(self, dim, heads, dim_head, ff_mult=4, dropout=0.1):
544
  super().__init__()
545
 
546
- self.attn_norm = AdaLayerNormZero(dim)
547
  self.attn = Attention(
548
- processor=AttnProcessor(),
549
  dim=dim,
550
  heads=heads,
551
  dim_head=dim_head,
552
  dropout=dropout,
 
553
  )
554
 
555
  self.ff_norm = nn.LayerNorm(dim, elementwise_affine=False, eps=1e-6)
@@ -585,26 +654,30 @@ class MMDiTBlock(nn.Module):
585
  context_pre_only: last layer only do prenorm + modulation cuz no more ffn
586
  """
587
 
588
- def __init__(self, dim, heads, dim_head, ff_mult=4, dropout=0.1, context_pre_only=False):
 
 
589
  super().__init__()
590
-
 
591
  self.context_pre_only = context_pre_only
592
 
593
- self.attn_norm_c = AdaLayerNormZero_Final(dim) if context_pre_only else AdaLayerNormZero(dim)
594
- self.attn_norm_x = AdaLayerNormZero(dim)
595
  self.attn = Attention(
596
  processor=JointAttnProcessor(),
597
  dim=dim,
598
  heads=heads,
599
  dim_head=dim_head,
600
  dropout=dropout,
601
- context_dim=dim,
602
  context_pre_only=context_pre_only,
 
603
  )
604
 
605
  if not context_pre_only:
606
- self.ff_norm_c = nn.LayerNorm(dim, elementwise_affine=False, eps=1e-6)
607
- self.ff_c = FeedForward(dim=dim, mult=ff_mult, dropout=dropout, approximate="tanh")
608
  else:
609
  self.ff_norm_c = None
610
  self.ff_c = None
 
269
  return residual + x
270
 
271
 
272
+ # RMSNorm
273
+
274
+
275
+ class RMSNorm(nn.Module):
276
+ def __init__(self, dim: int, eps: float):
277
+ super().__init__()
278
+ self.eps = eps
279
+ self.weight = nn.Parameter(torch.ones(dim))
280
+ self.native_rms_norm = float(torch.__version__[:3]) >= 2.4
281
+
282
+ def forward(self, x):
283
+ if self.native_rms_norm:
284
+ if self.weight.dtype in [torch.float16, torch.bfloat16]:
285
+ x = x.to(self.weight.dtype)
286
+ x = F.rms_norm(x, normalized_shape=(x.shape[-1],), weight=self.weight, eps=self.eps)
287
+ else:
288
+ variance = x.to(torch.float32).pow(2).mean(-1, keepdim=True)
289
+ x = x * torch.rsqrt(variance + self.eps)
290
+ if self.weight.dtype in [torch.float16, torch.bfloat16]:
291
+ x = x.to(self.weight.dtype)
292
+ x = x * self.weight
293
+
294
+ return x
295
+
296
+
297
+ # AdaLayerNorm
298
  # return with modulated x for attn input, and params for later mlp modulation
299
 
300
 
301
+ class AdaLayerNorm(nn.Module):
302
  def __init__(self, dim):
303
  super().__init__()
304
 
 
315
  return x, gate_msa, shift_mlp, scale_mlp, gate_mlp
316
 
317
 
318
+ # AdaLayerNorm for final layer
319
  # return only with modulated x for attn input, cuz no more mlp modulation
320
 
321
 
322
+ class AdaLayerNorm_Final(nn.Module):
323
  def __init__(self, dim):
324
  super().__init__()
325
 
 
366
  dim_head: int = 64,
367
  dropout: float = 0.0,
368
  context_dim: Optional[int] = None, # if not None -> joint attention
369
+ context_pre_only: bool = False,
370
+ qk_norm: Optional[str] = None,
371
  ):
372
  super().__init__()
373
 
 
388
  self.to_k = nn.Linear(dim, self.inner_dim)
389
  self.to_v = nn.Linear(dim, self.inner_dim)
390
 
391
+ if qk_norm is None:
392
+ self.q_norm = None
393
+ self.k_norm = None
394
+ elif qk_norm == "rms_norm":
395
+ self.q_norm = RMSNorm(dim_head, eps=1e-6)
396
+ self.k_norm = RMSNorm(dim_head, eps=1e-6)
397
+ else:
398
+ raise ValueError(f"Unimplemented qk_norm: {qk_norm}")
399
+
400
  if self.context_dim is not None:
401
+ self.to_q_c = nn.Linear(context_dim, self.inner_dim)
402
  self.to_k_c = nn.Linear(context_dim, self.inner_dim)
403
  self.to_v_c = nn.Linear(context_dim, self.inner_dim)
404
+ if qk_norm is None:
405
+ self.c_q_norm = None
406
+ self.c_k_norm = None
407
+ elif qk_norm == "rms_norm":
408
+ self.c_q_norm = RMSNorm(dim_head, eps=1e-6)
409
+ self.c_k_norm = RMSNorm(dim_head, eps=1e-6)
410
 
411
  self.to_out = nn.ModuleList([])
412
  self.to_out.append(nn.Linear(self.inner_dim, dim))
413
  self.to_out.append(nn.Dropout(dropout))
414
 
415
+ if self.context_dim is not None and not self.context_pre_only:
416
+ self.to_out_c = nn.Linear(self.inner_dim, context_dim)
417
 
418
  def forward(
419
  self,
 
433
 
434
 
435
  class AttnProcessor:
436
+ def __init__(
437
+ self,
438
+ pe_attn_head: int | None = None, # number of attention heads to apply RoPE to; None applies it to all heads
439
+ ):
440
+ self.pe_attn_head = pe_attn_head
441
 
442
  def __call__(
443
  self,
 
448
  ) -> torch.FloatTensor:
449
  batch_size = x.shape[0]
450
 
451
+ # `sample` projections
452
  query = attn.to_q(x)
453
  key = attn.to_k(x)
454
  value = attn.to_v(x)
455
 
 
 
 
 
 
 
 
 
456
  # attention
457
  inner_dim = key.shape[-1]
458
  head_dim = inner_dim // attn.heads
 
460
  key = key.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
461
  value = value.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
462
 
463
+ # qk norm
464
+ if attn.q_norm is not None:
465
+ query = attn.q_norm(query)
466
+ if attn.k_norm is not None:
467
+ key = attn.k_norm(key)
468
+
469
+ # apply rotary position embedding
470
+ if rope is not None:
471
+ freqs, xpos_scale = rope
472
+ q_xpos_scale, k_xpos_scale = (xpos_scale, xpos_scale**-1.0) if xpos_scale is not None else (1.0, 1.0)
473
+
474
+ if self.pe_attn_head is not None:
475
+ pn = self.pe_attn_head
476
+ query[:, :pn, :, :] = apply_rotary_pos_emb(query[:, :pn, :, :], freqs, q_xpos_scale)
477
+ key[:, :pn, :, :] = apply_rotary_pos_emb(key[:, :pn, :, :], freqs, k_xpos_scale)
478
+ else:
479
+ query = apply_rotary_pos_emb(query, freqs, q_xpos_scale)
480
+ key = apply_rotary_pos_emb(key, freqs, k_xpos_scale)
481
+
482
  # mask. e.g. inference got a batch with different target durations, mask out the padding
483
  if mask is not None:
484
  attn_mask = mask
 
524
 
525
  batch_size = c.shape[0]
526
 
527
+ # `sample` projections
528
  query = attn.to_q(x)
529
  key = attn.to_k(x)
530
  value = attn.to_v(x)
531
 
532
+ # `context` projections
533
  c_query = attn.to_q_c(c)
534
  c_key = attn.to_k_c(c)
535
  c_value = attn.to_v_c(c)
536
 
537
+ # attention
538
+ inner_dim = key.shape[-1]
539
+ head_dim = inner_dim // attn.heads
540
+ query = query.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
541
+ key = key.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
542
+ value = value.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
543
+ c_query = c_query.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
544
+ c_key = c_key.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
545
+ c_value = c_value.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
546
+
547
+ # qk norm
548
+ if attn.q_norm is not None:
549
+ query = attn.q_norm(query)
550
+ if attn.k_norm is not None:
551
+ key = attn.k_norm(key)
552
+ if attn.c_q_norm is not None:
553
+ c_query = attn.c_q_norm(c_query)
554
+ if attn.c_k_norm is not None:
555
+ c_key = attn.c_k_norm(c_key)
556
+
557
  # apply rope for context and noised input independently
558
  if rope is not None:
559
  freqs, xpos_scale = rope
 
566
  c_query = apply_rotary_pos_emb(c_query, freqs, q_xpos_scale)
567
  c_key = apply_rotary_pos_emb(c_key, freqs, k_xpos_scale)
568
 
569
+ # joint attention
570
+ query = torch.cat([query, c_query], dim=2)
571
+ key = torch.cat([key, c_key], dim=2)
572
+ value = torch.cat([value, c_value], dim=2)
 
 
 
 
 
 
573
 
574
  # mask. e.g. inference got a batch with different target durations, mask out the padding
575
  if mask is not None:
 
608
 
609
 
610
  class DiTBlock(nn.Module):
611
+ def __init__(self, dim, heads, dim_head, ff_mult=4, dropout=0.1, qk_norm=None, pe_attn_head=None):
612
  super().__init__()
613
 
614
+ self.attn_norm = AdaLayerNorm(dim)
615
  self.attn = Attention(
616
+ processor=AttnProcessor(pe_attn_head=pe_attn_head),
617
  dim=dim,
618
  heads=heads,
619
  dim_head=dim_head,
620
  dropout=dropout,
621
+ qk_norm=qk_norm,
622
  )
623
 
624
  self.ff_norm = nn.LayerNorm(dim, elementwise_affine=False, eps=1e-6)
 
654
  context_pre_only: last layer only do prenorm + modulation cuz no more ffn
655
  """
656
 
657
+ def __init__(
658
+ self, dim, heads, dim_head, ff_mult=4, dropout=0.1, context_dim=None, context_pre_only=False, qk_norm=None
659
+ ):
660
  super().__init__()
661
+ if context_dim is None:
662
+ context_dim = dim
663
  self.context_pre_only = context_pre_only
664
 
665
+ self.attn_norm_c = AdaLayerNorm_Final(context_dim) if context_pre_only else AdaLayerNorm(context_dim)
666
+ self.attn_norm_x = AdaLayerNorm(dim)
667
  self.attn = Attention(
668
  processor=JointAttnProcessor(),
669
  dim=dim,
670
  heads=heads,
671
  dim_head=dim_head,
672
  dropout=dropout,
673
+ context_dim=context_dim,
674
  context_pre_only=context_pre_only,
675
+ qk_norm=qk_norm,
676
  )
677
 
678
  if not context_pre_only:
679
+ self.ff_norm_c = nn.LayerNorm(context_dim, elementwise_affine=False, eps=1e-6)
680
+ self.ff_c = FeedForward(dim=context_dim, mult=ff_mult, dropout=dropout, approximate="tanh")
681
  else:
682
  self.ff_norm_c = None
683
  self.ff_c = None
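
The new pe_attn_head option restricts rotary position embedding to the first few attention heads, leaving the remaining heads position-agnostic. A sketch of the idea, assuming a (b, heads, n, d) layout and the common rotate-half RoPE formulation (the repo's apply_rotary_pos_emb may differ in detail):

    import torch

    def rotate_half(x: torch.Tensor) -> torch.Tensor:
        x1, x2 = x.chunk(2, dim=-1)
        return torch.cat((-x2, x1), dim=-1)

    def apply_rope_partial(q, cos, sin, pe_attn_head=None):
        # q: (b, heads, n, d); cos/sin: (n, d), broadcast over batch and heads
        if pe_attn_head is None:
            return q * cos + rotate_half(q) * sin
        out = q.clone()
        head = q[:, :pe_attn_head]
        out[:, :pe_attn_head] = head * cos + rotate_half(head) * sin
        return out
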
src/f5_tts/model/trainer.py CHANGED
@@ -1,6 +1,7 @@
1
  from __future__ import annotations
2
 
3
  import gc
 
4
  import os
5
 
6
  import torch
@@ -29,8 +30,9 @@ class Trainer:
29
  learning_rate,
30
  num_warmup_updates=20000,
31
  save_per_updates=1000,
 
32
  checkpoint_path=None,
33
- batch_size=32,
34
  batch_size_type: str = "sample",
35
  max_samples=32,
36
  grad_accumulation_steps=1,
@@ -38,23 +40,23 @@ class Trainer:
38
  noise_scheduler: str | None = None,
39
  duration_predictor: torch.nn.Module | None = None,
40
  logger: str | None = "wandb", # "wandb" | "tensorboard" | None
41
- wandb_project="test_e2-tts",
42
  wandb_run_name="test_run",
43
  wandb_resume_id: str = None,
44
  log_samples: bool = False,
45
- last_per_steps=None,
46
  accelerate_kwargs: dict = dict(),
47
  ema_kwargs: dict = dict(),
48
  bnb_optimizer: bool = False,
49
  mel_spec_type: str = "vocos", # "vocos" | "bigvgan"
50
  is_local_vocoder: bool = False, # use local path vocoder
51
  local_vocoder_path: str = "", # local vocoder path
 
52
  ):
53
  ddp_kwargs = DistributedDataParallelKwargs(find_unused_parameters=True)
54
 
55
  if logger == "wandb" and not wandb.api.api_key:
56
  logger = None
57
- print(f"Using logger: {logger}")
58
  self.log_samples = log_samples
59
 
60
  self.accelerator = Accelerator(
@@ -71,21 +73,23 @@ class Trainer:
71
  else:
72
  init_kwargs = {"wandb": {"resume": "allow", "name": wandb_run_name}}
73
 
74
- self.accelerator.init_trackers(
75
- project_name=wandb_project,
76
- init_kwargs=init_kwargs,
77
- config={
78
  "epochs": epochs,
79
  "learning_rate": learning_rate,
80
  "num_warmup_updates": num_warmup_updates,
81
- "batch_size": batch_size,
82
  "batch_size_type": batch_size_type,
83
  "max_samples": max_samples,
84
  "grad_accumulation_steps": grad_accumulation_steps,
85
  "max_grad_norm": max_grad_norm,
86
- "gpus": self.accelerator.num_processes,
87
  "noise_scheduler": noise_scheduler,
88
- },
 
 
 
 
 
89
  )
90
 
91
  elif self.logger == "tensorboard":
@@ -99,13 +103,20 @@ class Trainer:
99
  self.ema_model = EMA(model, include_online_model=False, **ema_kwargs)
100
  self.ema_model.to(self.accelerator.device)
101
 
 
 
 
 
 
 
102
  self.epochs = epochs
103
  self.num_warmup_updates = num_warmup_updates
104
  self.save_per_updates = save_per_updates
105
- self.last_per_steps = default(last_per_steps, save_per_updates * grad_accumulation_steps)
106
- self.checkpoint_path = default(checkpoint_path, "ckpts/test_e2-tts")
 
107
 
108
- self.batch_size = batch_size
109
  self.batch_size_type = batch_size_type
110
  self.max_samples = max_samples
111
  self.grad_accumulation_steps = grad_accumulation_steps
@@ -132,7 +143,7 @@ class Trainer:
132
  def is_main(self):
133
  return self.accelerator.is_main_process
134
 
135
- def save_checkpoint(self, step, last=False):
136
  self.accelerator.wait_for_everyone()
137
  if self.is_main:
138
  checkpoint = dict(
@@ -140,21 +151,38 @@ class Trainer:
140
  optimizer_state_dict=self.accelerator.unwrap_model(self.optimizer).state_dict(),
141
  ema_model_state_dict=self.ema_model.state_dict(),
142
  scheduler_state_dict=self.scheduler.state_dict(),
143
- step=step,
144
  )
145
  if not os.path.exists(self.checkpoint_path):
146
  os.makedirs(self.checkpoint_path)
147
  if last:
148
  self.accelerator.save(checkpoint, f"{self.checkpoint_path}/model_last.pt")
149
- print(f"Saved last checkpoint at step {step}")
150
  else:
151
- self.accelerator.save(checkpoint, f"{self.checkpoint_path}/model_{step}.pt")
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
152
 
153
  def load_checkpoint(self):
154
  if (
155
  not exists(self.checkpoint_path)
156
  or not os.path.exists(self.checkpoint_path)
157
- or not any(filename.endswith(".pt") for filename in os.listdir(self.checkpoint_path))
158
  ):
159
  return 0
160
 
@@ -162,12 +190,34 @@ class Trainer:
162
  if "model_last.pt" in os.listdir(self.checkpoint_path):
163
  latest_checkpoint = "model_last.pt"
164
  else:
165
- latest_checkpoint = sorted(
166
- [f for f in os.listdir(self.checkpoint_path) if f.endswith(".pt")],
167
- key=lambda x: int("".join(filter(str.isdigit, x))),
168
- )[-1]
169
- # checkpoint = torch.load(f"{self.checkpoint_path}/{latest_checkpoint}", map_location=self.accelerator.device) # rather use accelerator.load_state ಥ_ಥ
170
- checkpoint = torch.load(f"{self.checkpoint_path}/{latest_checkpoint}", weights_only=True, map_location="cpu")
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
171
 
172
  # patch for backward compatibility, 305e3ea
173
  for key in ["ema_model.mel_spec.mel_stft.mel_scale.fb", "ema_model.mel_spec.mel_stft.spectrogram.window"]:
@@ -177,7 +227,14 @@ class Trainer:
177
  if self.is_main:
178
  self.ema_model.load_state_dict(checkpoint["ema_model_state_dict"])
179
 
180
- if "step" in checkpoint:
 
 
 
 
 
 
 
181
  # patch for backward compatibility, 305e3ea
182
  for key in ["mel_spec.mel_stft.mel_scale.fb", "mel_spec.mel_stft.spectrogram.window"]:
183
  if key in checkpoint["model_state_dict"]:
@@ -187,19 +244,19 @@ class Trainer:
187
  self.accelerator.unwrap_model(self.optimizer).load_state_dict(checkpoint["optimizer_state_dict"])
188
  if self.scheduler:
189
  self.scheduler.load_state_dict(checkpoint["scheduler_state_dict"])
190
- step = checkpoint["step"]
191
  else:
192
  checkpoint["model_state_dict"] = {
193
  k.replace("ema_model.", ""): v
194
  for k, v in checkpoint["ema_model_state_dict"].items()
195
- if k not in ["initted", "step"]
196
  }
197
  self.accelerator.unwrap_model(self.model).load_state_dict(checkpoint["model_state_dict"])
198
- step = 0
199
 
200
  del checkpoint
201
  gc.collect()
202
- return step
203
 
204
  def train(self, train_dataset: Dataset, num_workers=16, resumable_with_seed: int = None):
205
  if self.log_samples:
@@ -225,7 +282,7 @@ class Trainer:
225
  num_workers=num_workers,
226
  pin_memory=True,
227
  persistent_workers=True,
228
- batch_size=self.batch_size,
229
  shuffle=True,
230
  generator=generator,
231
  )
@@ -233,7 +290,11 @@ class Trainer:
233
  self.accelerator.even_batches = False
234
  sampler = SequentialSampler(train_dataset)
235
  batch_sampler = DynamicBatchSampler(
236
- sampler, self.batch_size, max_samples=self.max_samples, random_seed=resumable_with_seed, drop_last=False
 
 
 
 
237
  )
238
  train_dataloader = DataLoader(
239
  train_dataset,
@@ -248,25 +309,26 @@ class Trainer:
248
 
249
  # accelerator.prepare() dispatches batches to devices;
250
  # which means the length of dataloader calculated before, should consider the number of devices
251
- warmup_steps = (
252
  self.num_warmup_updates * self.accelerator.num_processes
253
  ) # keep a fixed number of warmup steps while using accelerate multi-gpu ddp
254
  # otherwise by default with split_batches=False, warmup steps change with num_processes
255
- total_steps = len(train_dataloader) * self.epochs / self.grad_accumulation_steps
256
- decay_steps = total_steps - warmup_steps
257
- warmup_scheduler = LinearLR(self.optimizer, start_factor=1e-8, end_factor=1.0, total_iters=warmup_steps)
258
- decay_scheduler = LinearLR(self.optimizer, start_factor=1.0, end_factor=1e-8, total_iters=decay_steps)
259
  self.scheduler = SequentialLR(
260
- self.optimizer, schedulers=[warmup_scheduler, decay_scheduler], milestones=[warmup_steps]
261
  )
262
  train_dataloader, self.scheduler = self.accelerator.prepare(
263
  train_dataloader, self.scheduler
264
- ) # actual steps = 1 gpu steps / gpus
265
- start_step = self.load_checkpoint()
266
- global_step = start_step
267
 
268
  if exists(resumable_with_seed):
269
  orig_epoch_step = len(train_dataloader)
 
270
  skipped_epoch = int(start_step // orig_epoch_step)
271
  skipped_batch = start_step % orig_epoch_step
272
  skipped_dataloader = self.accelerator.skip_first_batches(train_dataloader, num_batches=skipped_batch)
@@ -276,23 +338,25 @@ class Trainer:
276
  for epoch in range(skipped_epoch, self.epochs):
277
  self.model.train()
278
  if exists(resumable_with_seed) and epoch == skipped_epoch:
279
- progress_bar = tqdm(
280
- skipped_dataloader,
281
- desc=f"Epoch {epoch+1}/{self.epochs}",
282
- unit="step",
283
- disable=not self.accelerator.is_local_main_process,
284
- initial=skipped_batch,
285
- total=orig_epoch_step,
286
- )
287
  else:
288
- progress_bar = tqdm(
289
- train_dataloader,
290
- desc=f"Epoch {epoch+1}/{self.epochs}",
291
- unit="step",
292
- disable=not self.accelerator.is_local_main_process,
293
- )
 
 
 
 
 
 
 
 
294
 
295
- for batch in progress_bar:
296
  with self.accelerator.accumulate(self.model):
297
  text_inputs = batch["text"]
298
  mel_spec = batch["mel"].permute(0, 2, 1)
@@ -301,7 +365,7 @@ class Trainer:
301
  # TODO. add duration predictor training
302
  if self.duration_predictor is not None and self.accelerator.is_local_main_process:
303
  dur_loss = self.duration_predictor(mel_spec, lens=batch.get("durations"))
304
- self.accelerator.log({"duration loss": dur_loss.item()}, step=global_step)
305
 
306
  loss, cond, pred = self.model(
307
  mel_spec, text=text_inputs, lens=mel_lengths, noise_scheduler=self.noise_scheduler
@@ -315,21 +379,24 @@ class Trainer:
315
  self.scheduler.step()
316
  self.optimizer.zero_grad()
317
 
318
- if self.is_main and self.accelerator.sync_gradients:
319
- self.ema_model.update()
 
320
 
321
- global_step += 1
 
 
322
 
323
  if self.accelerator.is_local_main_process:
324
- self.accelerator.log({"loss": loss.item(), "lr": self.scheduler.get_last_lr()[0]}, step=global_step)
 
 
325
  if self.logger == "tensorboard":
326
- self.writer.add_scalar("loss", loss.item(), global_step)
327
- self.writer.add_scalar("lr", self.scheduler.get_last_lr()[0], global_step)
328
-
329
- progress_bar.set_postfix(step=str(global_step), loss=loss.item())
330
 
331
- if global_step % (self.save_per_updates * self.grad_accumulation_steps) == 0:
332
- self.save_checkpoint(global_step)
333
 
334
  if self.log_samples and self.accelerator.is_local_main_process:
335
  ref_audio_len = mel_lengths[0]
@@ -355,12 +422,16 @@ class Trainer:
355
  gen_audio = vocoder(gen_mel_spec).squeeze(0).cpu()
356
  ref_audio = vocoder(ref_mel_spec).squeeze(0).cpu()
357
 
358
- torchaudio.save(f"{log_samples_path}/step_{global_step}_gen.wav", gen_audio, target_sample_rate)
359
- torchaudio.save(f"{log_samples_path}/step_{global_step}_ref.wav", ref_audio, target_sample_rate)
 
 
 
 
360
 
361
- if global_step % self.last_per_steps == 0:
362
- self.save_checkpoint(global_step, last=True)
363
 
364
- self.save_checkpoint(global_step, last=True)
365
 
366
  self.accelerator.end_training()
 
1
  from __future__ import annotations
2
 
3
  import gc
4
+ import math
5
  import os
6
 
7
  import torch
 
30
  learning_rate,
31
  num_warmup_updates=20000,
32
  save_per_updates=1000,
33
+ keep_last_n_checkpoints: int = -1, # -1 to keep all, 0 to not save intermediate, > 0 to keep last N checkpoints
34
  checkpoint_path=None,
35
+ batch_size_per_gpu=32,
36
  batch_size_type: str = "sample",
37
  max_samples=32,
38
  grad_accumulation_steps=1,
 
40
  noise_scheduler: str | None = None,
41
  duration_predictor: torch.nn.Module | None = None,
42
  logger: str | None = "wandb", # "wandb" | "tensorboard" | None
43
+ wandb_project="test_f5-tts",
44
  wandb_run_name="test_run",
45
  wandb_resume_id: str = None,
46
  log_samples: bool = False,
47
+ last_per_updates=None,
48
  accelerate_kwargs: dict = dict(),
49
  ema_kwargs: dict = dict(),
50
  bnb_optimizer: bool = False,
51
  mel_spec_type: str = "vocos", # "vocos" | "bigvgan"
52
  is_local_vocoder: bool = False, # use local path vocoder
53
  local_vocoder_path: str = "", # local vocoder path
54
+ cfg_dict: dict = dict(), # training config
55
  ):
56
  ddp_kwargs = DistributedDataParallelKwargs(find_unused_parameters=True)
57
 
58
  if logger == "wandb" and not wandb.api.api_key:
59
  logger = None
 
60
  self.log_samples = log_samples
61
 
62
  self.accelerator = Accelerator(
 
73
  else:
74
  init_kwargs = {"wandb": {"resume": "allow", "name": wandb_run_name}}
75
 
76
+ if not cfg_dict:
77
+ cfg_dict = {
 
 
78
  "epochs": epochs,
79
  "learning_rate": learning_rate,
80
  "num_warmup_updates": num_warmup_updates,
81
+ "batch_size_per_gpu": batch_size_per_gpu,
82
  "batch_size_type": batch_size_type,
83
  "max_samples": max_samples,
84
  "grad_accumulation_steps": grad_accumulation_steps,
85
  "max_grad_norm": max_grad_norm,
 
86
  "noise_scheduler": noise_scheduler,
87
+ }
88
+ cfg_dict["gpus"] = self.accelerator.num_processes
89
+ self.accelerator.init_trackers(
90
+ project_name=wandb_project,
91
+ init_kwargs=init_kwargs,
92
+ config=cfg_dict,
93
  )
94
 
95
  elif self.logger == "tensorboard":
 
103
  self.ema_model = EMA(model, include_online_model=False, **ema_kwargs)
104
  self.ema_model.to(self.accelerator.device)
105
 
106
+ print(f"Using logger: {logger}")
107
+ if grad_accumulation_steps > 1:
108
+ print(
109
+ "Gradient accumulation checkpointing now counts per_updates; the old per_steps logic applies to checkpoints saved before f992c4e"
110
+ )
111
+
112
  self.epochs = epochs
113
  self.num_warmup_updates = num_warmup_updates
114
  self.save_per_updates = save_per_updates
115
+ self.keep_last_n_checkpoints = keep_last_n_checkpoints
116
+ self.last_per_updates = default(last_per_updates, save_per_updates)
117
+ self.checkpoint_path = default(checkpoint_path, "ckpts/test_f5-tts")
118
 
119
+ self.batch_size_per_gpu = batch_size_per_gpu
120
  self.batch_size_type = batch_size_type
121
  self.max_samples = max_samples
122
  self.grad_accumulation_steps = grad_accumulation_steps
 
143
  def is_main(self):
144
  return self.accelerator.is_main_process
145
 
146
+ def save_checkpoint(self, update, last=False):
147
  self.accelerator.wait_for_everyone()
148
  if self.is_main:
149
  checkpoint = dict(
 
151
  optimizer_state_dict=self.accelerator.unwrap_model(self.optimizer).state_dict(),
152
  ema_model_state_dict=self.ema_model.state_dict(),
153
  scheduler_state_dict=self.scheduler.state_dict(),
154
+ update=update,
155
  )
156
  if not os.path.exists(self.checkpoint_path):
157
  os.makedirs(self.checkpoint_path)
158
  if last:
159
  self.accelerator.save(checkpoint, f"{self.checkpoint_path}/model_last.pt")
160
+ print(f"Saved last checkpoint at update {update}")
161
  else:
162
+ if self.keep_last_n_checkpoints == 0:
163
+ return
164
+ self.accelerator.save(checkpoint, f"{self.checkpoint_path}/model_{update}.pt")
165
+ if self.keep_last_n_checkpoints > 0:
166
+ # Updated logic to exclude pretrained model from rotation
167
+ checkpoints = [
168
+ f
169
+ for f in os.listdir(self.checkpoint_path)
170
+ if f.startswith("model_")
171
+ and not f.startswith("pretrained_") # Exclude pretrained models
172
+ and f.endswith(".pt")
173
+ and f != "model_last.pt"
174
+ ]
175
+ checkpoints.sort(key=lambda x: int(x.split("_")[1].split(".")[0]))
176
+ while len(checkpoints) > self.keep_last_n_checkpoints:
177
+ oldest_checkpoint = checkpoints.pop(0)
178
+ os.remove(os.path.join(self.checkpoint_path, oldest_checkpoint))
179
+ print(f"Removed old checkpoint: {oldest_checkpoint}")
180
 
181
  def load_checkpoint(self):
182
  if (
183
  not exists(self.checkpoint_path)
184
  or not os.path.exists(self.checkpoint_path)
185
+ or not any(filename.endswith((".pt", ".safetensors")) for filename in os.listdir(self.checkpoint_path))
186
  ):
187
  return 0
188
 
 
190
  if "model_last.pt" in os.listdir(self.checkpoint_path):
191
  latest_checkpoint = "model_last.pt"
192
  else:
193
+ # Updated to consider pretrained models for loading but prioritize training checkpoints
194
+ all_checkpoints = [
195
+ f
196
+ for f in os.listdir(self.checkpoint_path)
197
+ if (f.startswith("model_") or f.startswith("pretrained_")) and f.endswith((".pt", ".safetensors"))
198
+ ]
199
+
200
+ # First try to find regular training checkpoints
201
+ training_checkpoints = [f for f in all_checkpoints if f.startswith("model_") and f != "model_last.pt"]
202
+ if training_checkpoints:
203
+ latest_checkpoint = sorted(
204
+ training_checkpoints,
205
+ key=lambda x: int("".join(filter(str.isdigit, x))),
206
+ )[-1]
207
+ else:
208
+ # If no training checkpoints, use pretrained model
209
+ latest_checkpoint = next(f for f in all_checkpoints if f.startswith("pretrained_"))
210
+
211
+ if latest_checkpoint.endswith(".safetensors"): # always a pretrained checkpoint
212
+ from safetensors.torch import load_file
213
+
214
+ checkpoint = load_file(f"{self.checkpoint_path}/{latest_checkpoint}", device="cpu")
215
+ checkpoint = {"ema_model_state_dict": checkpoint}
216
+ elif latest_checkpoint.endswith(".pt"):
217
+ # checkpoint = torch.load(f"{self.checkpoint_path}/{latest_checkpoint}", map_location=self.accelerator.device) # rather use accelerator.load_state ಥ_ಥ
218
+ checkpoint = torch.load(
219
+ f"{self.checkpoint_path}/{latest_checkpoint}", weights_only=True, map_location="cpu"
220
+ )
221
 
222
  # patch for backward compatibility, 305e3ea
223
  for key in ["ema_model.mel_spec.mel_stft.mel_scale.fb", "ema_model.mel_spec.mel_stft.spectrogram.window"]:
 
227
  if self.is_main:
228
  self.ema_model.load_state_dict(checkpoint["ema_model_state_dict"])
229
 
230
+ if "update" in checkpoint or "step" in checkpoint:
231
+ # patch for backward compatibility with checkpoints saved before f992c4e
232
+ if "step" in checkpoint:
233
+ checkpoint["update"] = checkpoint["step"] // self.grad_accumulation_steps
234
+ if self.grad_accumulation_steps > 1 and self.is_main:
235
+ print(
236
+ "F5-TTS WARNING: Loading a checkpoint saved with the per_steps logic (before f992c4e); converting to per_updates using the grad_accumulation_steps setting, which may cause unexpected behaviour."
237
+ )
238
  # patch for backward compatibility, 305e3ea
239
  for key in ["mel_spec.mel_stft.mel_scale.fb", "mel_spec.mel_stft.spectrogram.window"]:
240
  if key in checkpoint["model_state_dict"]:
 
244
  self.accelerator.unwrap_model(self.optimizer).load_state_dict(checkpoint["optimizer_state_dict"])
245
  if self.scheduler:
246
  self.scheduler.load_state_dict(checkpoint["scheduler_state_dict"])
247
+ update = checkpoint["update"]
248
  else:
249
  checkpoint["model_state_dict"] = {
250
  k.replace("ema_model.", ""): v
251
  for k, v in checkpoint["ema_model_state_dict"].items()
252
+ if k not in ["initted", "update", "step"]
253
  }
254
  self.accelerator.unwrap_model(self.model).load_state_dict(checkpoint["model_state_dict"])
255
+ update = 0
256
 
257
  del checkpoint
258
  gc.collect()
259
+ return update
260
 
261
  def train(self, train_dataset: Dataset, num_workers=16, resumable_with_seed: int = None):
262
  if self.log_samples:
 
282
  num_workers=num_workers,
283
  pin_memory=True,
284
  persistent_workers=True,
285
+ batch_size=self.batch_size_per_gpu,
286
  shuffle=True,
287
  generator=generator,
288
  )
 
290
  self.accelerator.even_batches = False
291
  sampler = SequentialSampler(train_dataset)
292
  batch_sampler = DynamicBatchSampler(
293
+ sampler,
294
+ self.batch_size_per_gpu,
295
+ max_samples=self.max_samples,
296
+ random_seed=resumable_with_seed, # This enables reproducible shuffling
297
+ drop_residual=False,
298
  )
299
  train_dataloader = DataLoader(
300
  train_dataset,
 
309
 
310
  # accelerator.prepare() dispatches batches to devices;
311
  # which means the length of dataloader calculated before, should consider the number of devices
312
+ warmup_updates = (
313
  self.num_warmup_updates * self.accelerator.num_processes
314
  ) # keep a fixed number of warmup steps while using accelerate multi-gpu ddp
315
  # otherwise by default with split_batches=False, warmup steps change with num_processes
316
+ total_updates = math.ceil(len(train_dataloader) / self.grad_accumulation_steps) * self.epochs
317
+ decay_updates = total_updates - warmup_updates
318
+ warmup_scheduler = LinearLR(self.optimizer, start_factor=1e-8, end_factor=1.0, total_iters=warmup_updates)
319
+ decay_scheduler = LinearLR(self.optimizer, start_factor=1.0, end_factor=1e-8, total_iters=decay_updates)
320
  self.scheduler = SequentialLR(
321
+ self.optimizer, schedulers=[warmup_scheduler, decay_scheduler], milestones=[warmup_updates]
322
  )
323
  train_dataloader, self.scheduler = self.accelerator.prepare(
324
  train_dataloader, self.scheduler
325
+ ) # actual multi_gpu updates = single_gpu updates / gpu nums
326
+ start_update = self.load_checkpoint()
327
+ global_update = start_update
328
 
329
  if exists(resumable_with_seed):
330
  orig_epoch_step = len(train_dataloader)
331
+ start_step = start_update * self.grad_accumulation_steps
332
  skipped_epoch = int(start_step // orig_epoch_step)
333
  skipped_batch = start_step % orig_epoch_step
334
  skipped_dataloader = self.accelerator.skip_first_batches(train_dataloader, num_batches=skipped_batch)
 
338
  for epoch in range(skipped_epoch, self.epochs):
339
  self.model.train()
340
  if exists(resumable_with_seed) and epoch == skipped_epoch:
341
+ progress_bar_initial = math.ceil(skipped_batch / self.grad_accumulation_steps)
342
+ current_dataloader = skipped_dataloader
 
 
 
 
 
 
343
  else:
344
+ progress_bar_initial = 0
345
+ current_dataloader = train_dataloader
346
+
347
+ # Set epoch for the batch sampler if it exists
348
+ if hasattr(train_dataloader, "batch_sampler") and hasattr(train_dataloader.batch_sampler, "set_epoch"):
349
+ train_dataloader.batch_sampler.set_epoch(epoch)
350
+
351
+ progress_bar = tqdm(
352
+ range(math.ceil(len(train_dataloader) / self.grad_accumulation_steps)),
353
+ desc=f"Epoch {epoch+1}/{self.epochs}",
354
+ unit="update",
355
+ disable=not self.accelerator.is_local_main_process,
356
+ initial=progress_bar_initial,
357
+ )
358
 
359
+ for batch in current_dataloader:
360
  with self.accelerator.accumulate(self.model):
361
  text_inputs = batch["text"]
362
  mel_spec = batch["mel"].permute(0, 2, 1)
 
365
  # TODO. add duration predictor training
366
  if self.duration_predictor is not None and self.accelerator.is_local_main_process:
367
  dur_loss = self.duration_predictor(mel_spec, lens=batch.get("durations"))
368
+ self.accelerator.log({"duration loss": dur_loss.item()}, step=global_update)
369
 
370
  loss, cond, pred = self.model(
371
  mel_spec, text=text_inputs, lens=mel_lengths, noise_scheduler=self.noise_scheduler
 
379
  self.scheduler.step()
380
  self.optimizer.zero_grad()
381
 
382
+ if self.accelerator.sync_gradients:
383
+ if self.is_main:
384
+ self.ema_model.update()
385
 
386
+ global_update += 1
387
+ progress_bar.update(1)
388
+ progress_bar.set_postfix(update=str(global_update), loss=loss.item())
389
 
390
  if self.accelerator.is_local_main_process:
391
+ self.accelerator.log(
392
+ {"loss": loss.item(), "lr": self.scheduler.get_last_lr()[0]}, step=global_update
393
+ )
394
  if self.logger == "tensorboard":
395
+ self.writer.add_scalar("loss", loss.item(), global_update)
396
+ self.writer.add_scalar("lr", self.scheduler.get_last_lr()[0], global_update)
 
 
397
 
398
+ if global_update % self.save_per_updates == 0 and self.accelerator.sync_gradients:
399
+ self.save_checkpoint(global_update)
400
 
401
  if self.log_samples and self.accelerator.is_local_main_process:
402
  ref_audio_len = mel_lengths[0]
 
422
  gen_audio = vocoder(gen_mel_spec).squeeze(0).cpu()
423
  ref_audio = vocoder(ref_mel_spec).squeeze(0).cpu()
424
 
425
+ torchaudio.save(
426
+ f"{log_samples_path}/update_{global_update}_gen.wav", gen_audio, target_sample_rate
427
+ )
428
+ torchaudio.save(
429
+ f"{log_samples_path}/update_{global_update}_ref.wav", ref_audio, target_sample_rate
430
+ )
431
 
432
+ if global_update % self.last_per_updates == 0 and self.accelerator.sync_gradients:
433
+ self.save_checkpoint(global_update, last=True)
434
 
435
+ self.save_checkpoint(global_update, last=True)
436
 
437
  self.accelerator.end_training()
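
The checkpoint rotation added to save_checkpoint boils down to a simple policy: keep only the newest keep_last_n_checkpoints intermediate model_<update>.pt files, skipping model_last.pt and any pretrained_* seed checkpoint. A standalone sketch of that policy:

    import os

    def rotate_checkpoints(ckpt_dir: str, keep_last_n: int) -> None:
        if keep_last_n <= 0:  # -1 keeps everything; with 0 no intermediates were written at all
            return
        ckpts = [
            f for f in os.listdir(ckpt_dir)
            if f.startswith("model_") and f.endswith(".pt") and f != "model_last.pt"
        ]
        ckpts.sort(key=lambda name: int(name.split("_")[1].split(".")[0]))
        for old in ckpts[:-keep_last_n]:  # empty slice when fewer than keep_last_n exist
            os.remove(os.path.join(ckpt_dir, old))
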
src/f5_tts/model/utils.py CHANGED
@@ -133,11 +133,12 @@ def get_tokenizer(dataset_name, tokenizer: str = "pinyin"):
133
 
134
  # convert char to pinyin
135
 
136
- jieba.initialize()
137
- print("Word segmentation module jieba initialized.\n")
138
-
139
 
140
  def convert_char_to_pinyin(text_list, polyphone=True):
 
 
 
 
141
  final_text_list = []
142
  custom_trans = str.maketrans(
143
  {";": ",", "“": '"', "”": '"', "‘": "'", "’": "'"}
 
133
 
134
  # convert char to pinyin
135
 
 
 
 
136
 
137
  def convert_char_to_pinyin(text_list, polyphone=True):
138
+ if jieba.dt.initialized is False:
139
+ jieba.default_logger.setLevel(50) # CRITICAL
140
+ jieba.initialize()
141
+
142
  final_text_list = []
143
  custom_trans = str.maketrans(
144
  {";": ",", "“": '"', "”": '"', "‘": "'", "’": "'"}
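
The change above defers jieba.initialize() to the first call of convert_char_to_pinyin and silences the startup banner (level 50 is logging.CRITICAL). The same lazy, quiet initialization can be wrapped as a tiny guard:

    import logging

    import jieba

    def ensure_jieba_initialized() -> None:
        if not jieba.dt.initialized:
            jieba.default_logger.setLevel(logging.CRITICAL)  # hide the prefix-dict banner
            jieba.initialize()
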
src/f5_tts/scripts/count_max_epoch.py CHANGED
@@ -9,7 +9,7 @@ mel_hop_length = 256
9
  mel_sampling_rate = 24000
10
 
11
  # target
12
- wanted_max_updates = 1000000
13
 
14
  # train params
15
  gpus = 8
@@ -20,13 +20,13 @@ grad_accum = 1
20
  mini_batch_frames = frames_per_gpu * grad_accum * gpus
21
  mini_batch_hours = mini_batch_frames * mel_hop_length / mel_sampling_rate / 3600
22
  updates_per_epoch = total_hours / mini_batch_hours
23
- steps_per_epoch = updates_per_epoch * grad_accum
24
 
25
  # result
26
  epochs = wanted_max_updates / updates_per_epoch
27
  print(f"epochs should be set to: {epochs:.0f} ({epochs/grad_accum:.1f} x gd_acum {grad_accum})")
28
  print(f"progress_bar should show approx. 0/{updates_per_epoch:.0f} updates")
29
- print(f" or approx. 0/{steps_per_epoch:.0f} steps")
30
 
31
  # others
32
  print(f"total {total_hours:.0f} hours")
 
9
  mel_sampling_rate = 24000
10
 
11
  # target
12
+ wanted_max_updates = 1200000
13
 
14
  # train params
15
  gpus = 8
 
20
  mini_batch_frames = frames_per_gpu * grad_accum * gpus
21
  mini_batch_hours = mini_batch_frames * mel_hop_length / mel_sampling_rate / 3600
22
  updates_per_epoch = total_hours / mini_batch_hours
23
+ # steps_per_epoch = updates_per_epoch * grad_accum
24
 
25
  # result
26
  epochs = wanted_max_updates / updates_per_epoch
27
  print(f"epochs should be set to: {epochs:.0f} ({epochs/grad_accum:.1f} x gd_acum {grad_accum})")
28
  print(f"progress_bar should show approx. 0/{updates_per_epoch:.0f} updates")
29
+ # print(f" or approx. 0/{steps_per_epoch:.0f} steps")
30
 
31
  # others
32
  print(f"total {total_hours:.0f} hours")
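
For a concrete feel of the arithmetic in this script, here is a worked example with hypothetical numbers; frames_per_gpu and total_hours below are placeholders, not the script's actual settings:

    frames_per_gpu = 38400  # hypothetical
    total_hours = 950       # hypothetical
    mel_hop_length, mel_sampling_rate = 256, 24000
    gpus, grad_accum = 8, 1
    wanted_max_updates = 1_200_000

    mini_batch_frames = frames_per_gpu * grad_accum * gpus  # 307,200 frames
    mini_batch_hours = mini_batch_frames * mel_hop_length / mel_sampling_rate / 3600  # ~0.91 h of audio
    updates_per_epoch = total_hours / mini_batch_hours  # ~1,044 updates per epoch
    epochs = wanted_max_updates / updates_per_epoch  # ~1,150 epochs
    print(f"epochs should be set to: {epochs:.0f}")
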
src/f5_tts/socket_client.py ADDED
@@ -0,0 +1,61 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import socket
2
+ import asyncio
3
+ import pyaudio
4
+ import numpy as np
5
+ import logging
6
+ import time
7
+
8
+ logging.basicConfig(level=logging.INFO)
9
+ logger = logging.getLogger(__name__)
10
+
11
+
12
+ async def listen_to_F5TTS(text, server_ip="localhost", server_port=9998):
13
+ client_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
14
+ await asyncio.get_event_loop().run_in_executor(None, client_socket.connect, (server_ip, int(server_port)))
15
+
16
+ start_time = time.time()
17
+ first_chunk_time = None
18
+
19
+ async def play_audio_stream():
20
+ nonlocal first_chunk_time
21
+ p = pyaudio.PyAudio()
22
+ stream = p.open(format=pyaudio.paFloat32, channels=1, rate=24000, output=True, frames_per_buffer=2048)
23
+
24
+ try:
25
+ while True:
26
+ data = await asyncio.get_event_loop().run_in_executor(None, client_socket.recv, 8192)
27
+ if not data:
28
+ break
29
+ if data == b"END":
30
+ logger.info("End of audio received.")
31
+ break
32
+
33
+ audio_array = np.frombuffer(data, dtype=np.float32)
34
+ stream.write(audio_array.tobytes())
35
+
36
+ if first_chunk_time is None:
37
+ first_chunk_time = time.time()
38
+
39
+ finally:
40
+ stream.stop_stream()
41
+ stream.close()
42
+ p.terminate()
43
+
44
+ logger.info(f"Total time taken: {time.time() - start_time:.4f} seconds")
45
+
46
+ try:
47
+ data_to_send = f"{text}".encode("utf-8")
48
+ await asyncio.get_event_loop().run_in_executor(None, client_socket.sendall, data_to_send)
49
+ await play_audio_stream()
50
+
51
+ except Exception as e:
52
+ logger.error(f"Error in listen_to_F5TTS: {e}")
53
+
54
+ finally:
55
+ client_socket.close()
56
+
57
+
58
+ if __name__ == "__main__":
59
+ text_to_send = "As a Reader assistant, I'm familiar with new technologies, which are key to improved performance in terms of both training speed and inference efficiency. Let's break down the components."
60
+
61
+ asyncio.run(listen_to_F5TTS(text_to_send))
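
The client above treats the stream as raw float32 PCM at 24 kHz terminated by an END sentinel. Under the same framing assumption (and the same simplification that the sentinel arrives as its own packet), the chunks can just as easily be collected into a WAV file instead of being played back with PyAudio:

    import socket

    import numpy as np
    import soundfile as sf

    def save_stream_to_wav(text: str, path: str, host: str = "localhost", port: int = 9998) -> None:
        chunks = []
        with socket.create_connection((host, port)) as sock:
            sock.sendall(text.encode("utf-8"))
            while True:
                data = sock.recv(8192)
                if not data or data == b"END":
                    break
                chunks.append(np.frombuffer(data, dtype=np.float32))
        if chunks:
            sf.write(path, np.concatenate(chunks), 24000)
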
src/f5_tts/socket_server.py CHANGED
@@ -1,142 +1,213 @@
1
  import argparse
2
  import gc
 
 
 
3
  import socket
4
  import struct
5
- import torch
6
- import torchaudio
7
  import traceback
 
8
  from importlib.resources import files
9
- from threading import Thread
10
 
11
- from cached_path import cached_path
12
-
13
- from infer.utils_infer import infer_batch_process, preprocess_ref_audio_text, load_vocoder, load_model
14
- from model.backbones.dit import DiT
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
15
 
16
 
17
  class TTSStreamingProcessor:
18
- def __init__(self, ckpt_file, vocab_file, ref_audio, ref_text, device=None, dtype=torch.float32):
19
  self.device = device or (
20
- "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"
 
 
 
 
 
 
21
  )
 
 
 
 
 
 
 
 
 
 
 
 
 
22
 
23
- # Load the model using the provided checkpoint and vocab files
24
- self.model = load_model(
25
- model_cls=DiT,
26
- model_cfg=dict(dim=1024, depth=22, heads=16, ff_mult=2, text_dim=512, conv_layers=4),
27
  ckpt_path=ckpt_file,
28
- mel_spec_type="vocos", # or "bigvgan" depending on vocoder
29
  vocab_file=vocab_file,
30
  ode_method="euler",
31
  use_ema=True,
32
  device=self.device,
33
  ).to(self.device, dtype=dtype)
34
 
35
- # Load the vocoder
36
- self.vocoder = load_vocoder(is_local=False)
37
 
38
- # Set sampling rate for streaming
39
- self.sampling_rate = 24000 # Consistency with client
 
40
 
41
- # Set reference audio and text
42
- self.ref_audio = ref_audio
43
- self.ref_text = ref_text
44
-
45
- # Warm up the model
46
- self._warm_up()
47
 
48
  def _warm_up(self):
49
- """Warm up the model with a dummy input to ensure it's ready for real-time processing."""
50
- print("Warming up the model...")
51
- ref_audio, ref_text = preprocess_ref_audio_text(self.ref_audio, self.ref_text)
52
- audio, sr = torchaudio.load(ref_audio)
53
  gen_text = "Warm-up text for the model."
54
-
55
- # Pass the vocoder as an argument here
56
- infer_batch_process((audio, sr), ref_text, [gen_text], self.model, self.vocoder, device=self.device)
57
- print("Warm-up completed.")
58
-
59
- def generate_stream(self, text, play_steps_in_s=0.5):
60
- """Generate audio in chunks and yield them in real-time."""
61
- # Preprocess the reference audio and text
62
- ref_audio, ref_text = preprocess_ref_audio_text(self.ref_audio, self.ref_text)
63
-
64
- # Load reference audio
65
- audio, sr = torchaudio.load(ref_audio)
66
-
67
- # Run inference for the input text
68
- audio_chunk, final_sample_rate, _ = infer_batch_process(
69
- (audio, sr),
70
- ref_text,
71
- [text],
72
  self.model,
73
  self.vocoder,
74
- device=self.device, # Pass vocoder here
 
 
 
 
 
 
75
  )
76
 
77
- # Break the generated audio into chunks and send them
78
- chunk_size = int(final_sample_rate * play_steps_in_s)
79
-
80
- if len(audio_chunk) < chunk_size:
81
- packed_audio = struct.pack(f"{len(audio_chunk)}f", *audio_chunk)
82
- yield packed_audio
83
- return
84
-
85
- for i in range(0, len(audio_chunk), chunk_size):
86
- chunk = audio_chunk[i : i + chunk_size]
87
-
88
- # Check if it's the final chunk
89
- if i + chunk_size >= len(audio_chunk):
90
- chunk = audio_chunk[i:]
91
-
92
- # Send the chunk if it is not empty
93
- if len(chunk) > 0:
94
- packed_audio = struct.pack(f"{len(chunk)}f", *chunk)
95
- yield packed_audio
96
 
 
 
 
97
 
98
- def handle_client(client_socket, processor):
99
- try:
100
- while True:
101
- # Receive data from the client
102
- data = client_socket.recv(1024).decode("utf-8")
103
- if not data:
104
- break
105
 
106
- try:
107
- # The client sends the text input
108
- text = data.strip()
109
 
110
- # Generate and stream audio chunks
111
- for audio_chunk in processor.generate_stream(text):
112
- client_socket.sendall(audio_chunk)
113
 
114
- # Send end-of-audio signal
115
- client_socket.sendall(b"END_OF_AUDIO")
116
 
117
- except Exception as inner_e:
118
- print(f"Error during processing: {inner_e}")
119
- traceback.print_exc() # Print the full traceback to diagnose the issue
120
- break
121
 
 
 
 
 
 
 
122
  except Exception as e:
123
- print(f"Error handling client: {e}")
124
  traceback.print_exc()
125
- finally:
126
- client_socket.close()
127
 
128
 
129
  def start_server(host, port, processor):
130
- server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
131
- server.bind((host, port))
132
- server.listen(5)
133
- print(f"Server listening on {host}:{port}")
134
-
135
- while True:
136
- client_socket, addr = server.accept()
137
- print(f"Accepted connection from {addr}")
138
- client_handler = Thread(target=handle_client, args=(client_socket, processor))
139
- client_handler.start()
140
 
141
 
142
  if __name__ == "__main__":
@@ -145,9 +216,14 @@ if __name__ == "__main__":
145
  parser.add_argument("--host", default="0.0.0.0")
146
  parser.add_argument("--port", default=9998)
147
 
 
 
 
 
 
148
  parser.add_argument(
149
  "--ckpt_file",
150
- default=str(cached_path("hf://SWivid/F5-TTS/F5TTS_Base/model_1200000.safetensors")),
151
  help="Path to the model checkpoint file",
152
  )
153
  parser.add_argument(
@@ -175,6 +251,7 @@ if __name__ == "__main__":
175
  try:
176
  # Initialize the processor with the model and vocoder
177
  processor = TTSStreamingProcessor(
 
178
  ckpt_file=args.ckpt_file,
179
  vocab_file=args.vocab_file,
180
  ref_audio=args.ref_audio,
 
1
  import argparse
2
  import gc
3
+ import logging
4
+ import numpy as np
5
+ import queue
6
  import socket
7
  import struct
8
+ import threading
 
9
  import traceback
10
+ import wave
11
  from importlib.resources import files
 
12
 
13
+ import torch
14
+ import torchaudio
15
+ from huggingface_hub import hf_hub_download
16
+ from omegaconf import OmegaConf
17
+
18
+ from f5_tts.model.backbones.dit import DiT # noqa: F401. used for config
19
+ from f5_tts.infer.utils_infer import (
20
+ chunk_text,
21
+ preprocess_ref_audio_text,
22
+ load_vocoder,
23
+ load_model,
24
+ infer_batch_process,
25
+ )
26
+
27
+ logging.basicConfig(level=logging.INFO)
28
+ logger = logging.getLogger(__name__)
29
+
30
+
31
+ class AudioFileWriterThread(threading.Thread):
32
+ """Threaded file writer to avoid blocking the TTS streaming process."""
33
+
34
+ def __init__(self, output_file, sampling_rate):
35
+ super().__init__()
36
+ self.output_file = output_file
37
+ self.sampling_rate = sampling_rate
38
+ self.queue = queue.Queue()
39
+ self.stop_event = threading.Event()
40
+ self.audio_data = []
41
+
42
+ def run(self):
43
+ """Process queued audio data and write it to a file."""
44
+ logger.info("AudioFileWriterThread started.")
45
+ with wave.open(self.output_file, "wb") as wf:
46
+ wf.setnchannels(1)
47
+ wf.setsampwidth(2)
48
+ wf.setframerate(self.sampling_rate)
49
+
50
+ while not self.stop_event.is_set() or not self.queue.empty():
51
+ try:
52
+ chunk = self.queue.get(timeout=0.1)
53
+ if chunk is not None:
54
+ chunk = np.int16(chunk * 32767)
55
+ self.audio_data.append(chunk)
56
+ wf.writeframes(chunk.tobytes())
57
+ except queue.Empty:
58
+ continue
59
+
60
+ def add_chunk(self, chunk):
61
+ """Add a new chunk to the queue."""
62
+ self.queue.put(chunk)
63
+
64
+ def stop(self):
65
+ """Stop writing and ensure all queued data is written."""
66
+ self.stop_event.set()
67
+ self.join()
68
+ logger.info("Audio writing completed.")
69
 
70
 
71
  class TTSStreamingProcessor:
72
+ def __init__(self, model, ckpt_file, vocab_file, ref_audio, ref_text, device=None, dtype=torch.float32):
73
  self.device = device or (
74
+ "cuda"
75
+ if torch.cuda.is_available()
76
+ else "xpu"
77
+ if torch.xpu.is_available()
78
+ else "mps"
79
+ if torch.backends.mps.is_available()
80
+ else "cpu"
81
  )
82
+ model_cfg = OmegaConf.load(str(files("f5_tts").joinpath(f"configs/{model}.yaml")))
83
+ self.model_cls = globals()[model_cfg.model.backbone]
84
+ self.model_arc = model_cfg.model.arch
85
+ self.mel_spec_type = model_cfg.model.mel_spec.mel_spec_type
86
+ self.sampling_rate = model_cfg.model.mel_spec.target_sample_rate
87
+
88
+ self.model = self.load_ema_model(ckpt_file, vocab_file, dtype)
89
+ self.vocoder = self.load_vocoder_model()
90
+
91
+ self.update_reference(ref_audio, ref_text)
92
+ self._warm_up()
93
+ self.file_writer_thread = None
94
+ self.first_package = True
95
 
96
+ def load_ema_model(self, ckpt_file, vocab_file, dtype):
97
+ return load_model(
98
+ self.model_cls,
99
+ self.model_arc,
100
  ckpt_path=ckpt_file,
101
+ mel_spec_type=self.mel_spec_type,
102
  vocab_file=vocab_file,
103
  ode_method="euler",
104
  use_ema=True,
105
  device=self.device,
106
  ).to(self.device, dtype=dtype)
107
 
108
+ def load_vocoder_model(self):
109
+ return load_vocoder(vocoder_name=self.mel_spec_type, is_local=False, local_path=None, device=self.device)
110
 
111
+ def update_reference(self, ref_audio, ref_text):
112
+ self.ref_audio, self.ref_text = preprocess_ref_audio_text(ref_audio, ref_text)
113
+ self.audio, self.sr = torchaudio.load(self.ref_audio)
114
 
115
+ ref_audio_duration = self.audio.shape[-1] / self.sr
116
+ ref_text_byte_len = len(self.ref_text.encode("utf-8"))
117
+ self.max_chars = int(ref_text_byte_len / (ref_audio_duration) * (25 - ref_audio_duration))
118
+ self.few_chars = int(ref_text_byte_len / (ref_audio_duration) * (25 - ref_audio_duration) / 2)
119
+ self.min_chars = int(ref_text_byte_len / (ref_audio_duration) * (25 - ref_audio_duration) / 4)
 
120
 
121
  def _warm_up(self):
122
+ logger.info("Warming up the model...")
 
 
 
123
  gen_text = "Warm-up text for the model."
124
+ for _ in infer_batch_process(
125
+ (self.audio, self.sr),
126
+ self.ref_text,
127
+ [gen_text],
 
 
 
 
 
 
 
 
 
 
 
 
 
 
128
  self.model,
129
  self.vocoder,
130
+ progress=None,
131
+ device=self.device,
132
+ streaming=True,
133
+ ):
134
+ pass
135
+ logger.info("Warm-up completed.")
136
+
137
+ def generate_stream(self, text, conn):
138
+ text_batches = chunk_text(text, max_chars=self.max_chars)
139
+ if self.first_package:
140
+ text_batches = chunk_text(text_batches[0], max_chars=self.few_chars) + text_batches[1:]
141
+ text_batches = chunk_text(text_batches[0], max_chars=self.min_chars) + text_batches[1:]
142
+ self.first_package = False
143
+
144
+ audio_stream = infer_batch_process(
145
+ (self.audio, self.sr),
146
+ self.ref_text,
147
+ text_batches,
148
+ self.model,
149
+ self.vocoder,
150
+ progress=None,
151
+ device=self.device,
152
+ streaming=True,
153
+ chunk_size=2048,
154
  )
155
 
156
+ # Reset the file writer thread
157
+ if self.file_writer_thread is not None:
158
+ self.file_writer_thread.stop()
159
+ self.file_writer_thread = AudioFileWriterThread("output.wav", self.sampling_rate)
160
+ self.file_writer_thread.start()
 
 
 
 
 
 
 
 
 
 
 
 
 
 
161
 
162
+ for audio_chunk, _ in audio_stream:
163
+ if len(audio_chunk) > 0:
164
+ logger.info(f"Generated audio chunk of size: {len(audio_chunk)}")
165
 
166
+ # Send audio chunk via socket
167
+ conn.sendall(struct.pack(f"{len(audio_chunk)}f", *audio_chunk))
 
 
 
 
 
168
 
169
+ # Write to file asynchronously
170
+ self.file_writer_thread.add_chunk(audio_chunk)
 
171
 
172
+ logger.info("Finished sending audio stream.")
173
+ conn.sendall(b"END") # Send end signal
 
174
 
175
+ # Ensure all audio data is written before exiting
176
+ self.file_writer_thread.stop()
177
 
 
 
 
 
178
 
179
+ def handle_client(conn, processor):
180
+ try:
181
+ with conn:
182
+ conn.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
183
+ while True:
184
+ data = conn.recv(1024)
185
+ if not data:
186
+ processor.first_package = True
187
+ break
188
+ data_str = data.decode("utf-8").strip()
189
+ logger.info(f"Received text: {data_str}")
190
+
191
+ try:
192
+ processor.generate_stream(data_str, conn)
193
+ except Exception as inner_e:
194
+ logger.error(f"Error during processing: {inner_e}")
195
+ traceback.print_exc()
196
+ break
197
  except Exception as e:
198
+ logger.error(f"Error handling client: {e}")
199
  traceback.print_exc()
 
 
200
 
201
 
202
  def start_server(host, port, processor):
203
+ with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
204
+ s.bind((host, port))
205
+ s.listen()
206
+ logger.info(f"Server started on {host}:{port}")
207
+ while True:
208
+ conn, addr = s.accept()
209
+ logger.info(f"Connected by {addr}")
210
+ handle_client(conn, processor)
 
 
211
 
212
 
213
  if __name__ == "__main__":
 
216
  parser.add_argument("--host", default="0.0.0.0")
217
  parser.add_argument("--port", default=9998)
218
 
219
+ parser.add_argument(
220
+ "--model",
221
+ default="F5TTS_v1_Base",
222
+ help="The model name, e.g. F5TTS_v1_Base",
223
+ )
224
  parser.add_argument(
225
  "--ckpt_file",
226
+ default=str(hf_hub_download(repo_id="SWivid/F5-TTS", filename="F5TTS_v1_Base/model_1250000.safetensors")),
227
  help="Path to the model checkpoint file",
228
  )
229
  parser.add_argument(
 
251
  try:
252
  # Initialize the processor with the model and vocoder
253
  processor = TTSStreamingProcessor(
254
+ model=args.model,
255
  ckpt_file=args.ckpt_file,
256
  vocab_file=args.vocab_file,
257
  ref_audio=args.ref_audio,
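
A minimal launch sketch for this streaming server, using only the flags visible in the argument parser above (the reference audio/text passed into `TTSStreamingProcessor` come from the remaining parser options not shown in this hunk, and the checkpoint is fetched from the Hugging Face Hub unless `--ckpt_file` is overridden):

```
python src/f5_tts/socket_server.py --model F5TTS_v1_Base --host 0.0.0.0 --port 9998
```

The client shown earlier can then connect, send UTF-8 text, and receive float32 PCM chunks followed by the `END` terminator, while a copy of the audio is written to `output.wav` by the file-writer thread.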
src/f5_tts/train/README.md CHANGED
@@ -40,10 +40,10 @@ Once your datasets are prepared, you can start the training process.
40
  accelerate config
41
 
42
  # .yaml files are under src/f5_tts/configs directory
43
- accelerate launch src/f5_tts/train/train.py --config-name F5TTS_Base_train.yaml
44
 
45
  # possible to overwrite accelerate and hydra config
46
- accelerate launch --mixed_precision=fp16 src/f5_tts/train/train.py --config-name F5TTS_Small_train.yaml ++datasets.batch_size_per_gpu=19200
47
  ```
48
 
49
  ### 2. Finetuning practice
@@ -53,7 +53,7 @@ Gradio UI training/finetuning with `src/f5_tts/train/finetune_gradio.py` see [#1
53
 
54
The `use_ema = True` setting can be harmful for early-stage finetuned checkpoints (trained for only a few updates, so the EMA weights are still dominated by the pretrained ones); try turning it off and see if it gives better results.
55
 
56
- ### 3. Wandb Logging
57
 
58
  The `wandb/` dir will be created under path you run training/finetuning scripts.
59
 
@@ -62,7 +62,7 @@ By default, the training script does NOT use logging (assuming you didn't manual
62
  To turn on wandb logging, you can either:
63
 
64
  1. Manually login with `wandb login`: Learn more [here](https://docs.wandb.ai/ref/cli/wandb-login)
65
- 2. Automatically login programmatically by setting an environment variable: Get an API KEY at https://wandb.ai/site/ and set the environment variable as follows:
66
 
67
  On Mac & Linux:
68
 
@@ -75,7 +75,7 @@ On Windows:
75
  ```
76
  set WANDB_API_KEY=<YOUR WANDB API KEY>
77
  ```
78
- Moreover, if you couldn't access Wandb and want to log metrics offline, you can the environment variable as follows:
79
 
80
  ```
81
  export WANDB_MODE=offline
 
40
  accelerate config
41
 
42
  # .yaml files are under src/f5_tts/configs directory
43
+ accelerate launch src/f5_tts/train/train.py --config-name F5TTS_v1_Base.yaml
44
 
45
  # possible to overwrite accelerate and hydra config
46
+ accelerate launch --mixed_precision=fp16 src/f5_tts/train/train.py --config-name F5TTS_v1_Base.yaml ++datasets.batch_size_per_gpu=19200
47
  ```
48
 
49
  ### 2. Finetuning practice
 
53
 
54
The `use_ema = True` setting can be harmful for early-stage finetuned checkpoints (trained for only a few updates, so the EMA weights are still dominated by the pretrained ones); try turning it off and see if it gives better results.
55
 
56
+ ### 3. W&B Logging
57
 
58
  The `wandb/` dir will be created under path you run training/finetuning scripts.
59
 
 
62
  To turn on wandb logging, you can either:
63
 
64
  1. Manually login with `wandb login`: Learn more [here](https://docs.wandb.ai/ref/cli/wandb-login)
65
+ 2. Automatically login programmatically by setting an environment variable: Get an API KEY at https://wandb.ai/authorize and set the environment variable as follows:
66
 
67
  On Mac & Linux:
68
 
 
75
  ```
76
  set WANDB_API_KEY=<YOUR WANDB API KEY>
77
  ```
78
+ Moreover, if you couldn't access W&B and want to log metrics offline, you can set the environment variable as follows:
79
 
80
  ```
81
  export WANDB_MODE=offline
src/f5_tts/train/datasets/prepare_csv_wavs.py CHANGED
@@ -1,12 +1,17 @@
1
  import os
2
  import sys
 
 
 
 
 
 
3
 
4
  sys.path.append(os.getcwd())
5
 
6
  import argparse
7
  import csv
8
  import json
9
- import shutil
10
  from importlib.resources import files
11
  from pathlib import Path
12
 
@@ -29,32 +34,157 @@ def is_csv_wavs_format(input_dataset_dir):
29
  return metadata.exists() and metadata.is_file() and wavs.exists() and wavs.is_dir()
30
 
31
 
32
- def prepare_csv_wavs_dir(input_dir):
 
 
 
 
 
 
 
33
  assert is_csv_wavs_format(input_dir), f"not csv_wavs format: {input_dir}"
34
  input_dir = Path(input_dir)
35
  metadata_path = input_dir / "metadata.csv"
36
  audio_path_text_pairs = read_audio_text_pairs(metadata_path.as_posix())
37
 
38
- sub_result, durations = [], []
39
- vocab_set = set()
40
  polyphone = True
41
- for audio_path, text in audio_path_text_pairs:
42
- if not Path(audio_path).exists():
43
- print(f"audio {audio_path} not found, skipping")
44
- continue
45
- audio_duration = get_audio_duration(audio_path)
46
- # assume tokenizer = "pinyin" ("pinyin" | "char")
47
- text = convert_char_to_pinyin([text], polyphone=polyphone)[0]
48
- sub_result.append({"audio_path": audio_path, "text": text, "duration": audio_duration})
49
- durations.append(audio_duration)
50
- vocab_set.update(list(text))
 
 
 
 
 
 
 
51
 
52
  return sub_result, durations, vocab_set
53
 
54
 
55
- def get_audio_duration(audio_path):
56
- audio, sample_rate = torchaudio.load(audio_path)
57
- return audio.shape[1] / sample_rate
 
 
 
 
 
 
 
58
 
59
 
60
  def read_audio_text_pairs(csv_file_path):
@@ -76,36 +206,27 @@ def read_audio_text_pairs(csv_file_path):
76
 
77
  def save_prepped_dataset(out_dir, result, duration_list, text_vocab_set, is_finetune):
78
  out_dir = Path(out_dir)
79
- # save preprocessed dataset to disk
80
  out_dir.mkdir(exist_ok=True, parents=True)
81
  print(f"\nSaving to {out_dir} ...")
82
 
83
- # dataset = Dataset.from_dict({"audio_path": audio_path_list, "text": text_list, "duration": duration_list}) # oom
84
- # dataset.save_to_disk(f"{out_dir}/raw", max_shard_size="2GB")
85
  raw_arrow_path = out_dir / "raw.arrow"
86
- with ArrowWriter(path=raw_arrow_path.as_posix(), writer_batch_size=1) as writer:
87
  for line in tqdm(result, desc="Writing to raw.arrow ..."):
88
  writer.write(line)
89
 
90
- # dup a json separately saving duration in case for DynamicBatchSampler ease
91
  dur_json_path = out_dir / "duration.json"
92
  with open(dur_json_path.as_posix(), "w", encoding="utf-8") as f:
93
  json.dump({"duration": duration_list}, f, ensure_ascii=False)
94
 
95
- # vocab map, i.e. tokenizer
96
- # add alphabets and symbols (optional, if plan to ft on de/fr etc.)
97
- # if tokenizer == "pinyin":
98
- # text_vocab_set.update([chr(i) for i in range(32, 127)] + [chr(i) for i in range(192, 256)])
99
  voca_out_path = out_dir / "vocab.txt"
100
- with open(voca_out_path.as_posix(), "w") as f:
101
- for vocab in sorted(text_vocab_set):
102
- f.write(vocab + "\n")
103
-
104
  if is_finetune:
105
  file_vocab_finetune = PRETRAINED_VOCAB_PATH.as_posix()
106
  shutil.copy2(file_vocab_finetune, voca_out_path)
107
  else:
108
- with open(voca_out_path, "w") as f:
109
  for vocab in sorted(text_vocab_set):
110
  f.write(vocab + "\n")
111
 
@@ -115,24 +236,48 @@ def save_prepped_dataset(out_dir, result, duration_list, text_vocab_set, is_fine
115
  print(f"For {dataset_name}, total {sum(duration_list)/3600:.2f} hours")
116
 
117
 
118
- def prepare_and_save_set(inp_dir, out_dir, is_finetune: bool = True):
119
  if is_finetune:
120
  assert PRETRAINED_VOCAB_PATH.exists(), f"pretrained vocab.txt not found: {PRETRAINED_VOCAB_PATH}"
121
- sub_result, durations, vocab_set = prepare_csv_wavs_dir(inp_dir)
122
  save_prepped_dataset(out_dir, sub_result, durations, vocab_set, is_finetune)
123
 
124
 
125
  def cli():
126
- # finetune: python scripts/prepare_csv_wavs.py /path/to/input_dir /path/to/output_dir_pinyin
127
- # pretrain: python scripts/prepare_csv_wavs.py /path/to/output_dir_pinyin --pretrain
128
- parser = argparse.ArgumentParser(description="Prepare and save dataset.")
129
- parser.add_argument("inp_dir", type=str, help="Input directory containing the data.")
130
- parser.add_argument("out_dir", type=str, help="Output directory to save the prepared data.")
131
- parser.add_argument("--pretrain", action="store_true", help="Enable for new pretrain, otherwise is a fine-tune")
132
-
133
- args = parser.parse_args()
134
-
135
- prepare_and_save_set(args.inp_dir, args.out_dir, is_finetune=not args.pretrain)
 
 
 
 
 
 
 
136
 
137
 
138
  if __name__ == "__main__":
 
1
  import os
2
  import sys
3
+ import signal
4
+ import subprocess # For invoking ffprobe
5
+ import shutil
6
+ import concurrent.futures
7
+ import multiprocessing
8
+ from contextlib import contextmanager
9
 
10
  sys.path.append(os.getcwd())
11
 
12
  import argparse
13
  import csv
14
  import json
 
15
  from importlib.resources import files
16
  from pathlib import Path
17
 
 
34
  return metadata.exists() and metadata.is_file() and wavs.exists() and wavs.is_dir()
35
 
36
 
37
+ # Configuration constants
38
+ BATCH_SIZE = 100 # Batch size for text conversion
39
+ MAX_WORKERS = max(1, multiprocessing.cpu_count() - 1) # Leave one CPU free
40
+ THREAD_NAME_PREFIX = "AudioProcessor"
41
+ CHUNK_SIZE = 100 # Number of files to process per worker batch
42
+
43
+ executor = None # Global executor for cleanup
44
+
45
+
46
+ @contextmanager
47
+ def graceful_exit():
48
+ """Context manager for graceful shutdown on signals"""
49
+
50
+ def signal_handler(signum, frame):
51
+ print("\nReceived signal to terminate. Cleaning up...")
52
+ if executor is not None:
53
+ print("Shutting down executor...")
54
+ executor.shutdown(wait=False, cancel_futures=True)
55
+ sys.exit(1)
56
+
57
+ # Set up signal handlers
58
+ signal.signal(signal.SIGINT, signal_handler)
59
+ signal.signal(signal.SIGTERM, signal_handler)
60
+
61
+ try:
62
+ yield
63
+ finally:
64
+ if executor is not None:
65
+ executor.shutdown(wait=False)
66
+
67
+
68
+ def process_audio_file(audio_path, text, polyphone):
69
+ """Process a single audio file by checking its existence and extracting duration."""
70
+ if not Path(audio_path).exists():
71
+ print(f"audio {audio_path} not found, skipping")
72
+ return None
73
+ try:
74
+ audio_duration = get_audio_duration(audio_path)
75
+ if audio_duration <= 0:
76
+ raise ValueError(f"Duration {audio_duration} is non-positive.")
77
+ return (audio_path, text, audio_duration)
78
+ except Exception as e:
79
+ print(f"Warning: Failed to process {audio_path} due to error: {e}. Skipping corrupt file.")
80
+ return None
81
+
82
+
83
+ def batch_convert_texts(texts, polyphone, batch_size=BATCH_SIZE):
84
+ """Convert a list of texts to pinyin in batches."""
85
+ converted_texts = []
86
+ for i in range(0, len(texts), batch_size):
87
+ batch = texts[i : i + batch_size]
88
+ converted_batch = convert_char_to_pinyin(batch, polyphone=polyphone)
89
+ converted_texts.extend(converted_batch)
90
+ return converted_texts
91
+
92
+
93
+ def prepare_csv_wavs_dir(input_dir, num_workers=None):
94
+ global executor
95
  assert is_csv_wavs_format(input_dir), f"not csv_wavs format: {input_dir}"
96
  input_dir = Path(input_dir)
97
  metadata_path = input_dir / "metadata.csv"
98
  audio_path_text_pairs = read_audio_text_pairs(metadata_path.as_posix())
99
 
 
 
100
  polyphone = True
101
+ total_files = len(audio_path_text_pairs)
102
+
103
+ # Use provided worker count or calculate optimal number
104
+ worker_count = num_workers if num_workers is not None else min(MAX_WORKERS, total_files)
105
+ print(f"\nProcessing {total_files} audio files using {worker_count} workers...")
106
+
107
+ with graceful_exit():
108
+ # Initialize thread pool with optimized settings
109
+ with concurrent.futures.ThreadPoolExecutor(
110
+ max_workers=worker_count, thread_name_prefix=THREAD_NAME_PREFIX
111
+ ) as exec:
112
+ executor = exec
113
+ results = []
114
+
115
+ # Process files in chunks for better efficiency
116
+ for i in range(0, len(audio_path_text_pairs), CHUNK_SIZE):
117
+ chunk = audio_path_text_pairs[i : i + CHUNK_SIZE]
118
+ # Submit futures in order
119
+ chunk_futures = [executor.submit(process_audio_file, pair[0], pair[1], polyphone) for pair in chunk]
120
+
121
+ # Iterate over futures in the original submission order to preserve ordering
122
+ for future in tqdm(
123
+ chunk_futures,
124
+ total=len(chunk),
125
+ desc=f"Processing chunk {i//CHUNK_SIZE + 1}/{(total_files + CHUNK_SIZE - 1)//CHUNK_SIZE}",
126
+ ):
127
+ try:
128
+ result = future.result()
129
+ if result is not None:
130
+ results.append(result)
131
+ except Exception as e:
132
+ print(f"Error processing file: {e}")
133
+
134
+ executor = None
135
+
136
+ # Filter out failed results
137
+ processed = [res for res in results if res is not None]
138
+ if not processed:
139
+ raise RuntimeError("No valid audio files were processed!")
140
+
141
+ # Batch process text conversion
142
+ raw_texts = [item[1] for item in processed]
143
+ converted_texts = batch_convert_texts(raw_texts, polyphone, batch_size=BATCH_SIZE)
144
+
145
+ # Prepare final results
146
+ sub_result = []
147
+ durations = []
148
+ vocab_set = set()
149
+
150
+ for (audio_path, _, duration), conv_text in zip(processed, converted_texts):
151
+ sub_result.append({"audio_path": audio_path, "text": conv_text, "duration": duration})
152
+ durations.append(duration)
153
+ vocab_set.update(list(conv_text))
154
 
155
  return sub_result, durations, vocab_set
156
 
157
 
158
+ def get_audio_duration(audio_path, timeout=5):
159
+ """
160
+ Get the duration of an audio file in seconds using ffmpeg's ffprobe.
161
+ Falls back to torchaudio.load() if ffprobe fails.
162
+ """
163
+ try:
164
+ cmd = [
165
+ "ffprobe",
166
+ "-v",
167
+ "error",
168
+ "-show_entries",
169
+ "format=duration",
170
+ "-of",
171
+ "default=noprint_wrappers=1:nokey=1",
172
+ audio_path,
173
+ ]
174
+ result = subprocess.run(
175
+ cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True, check=True, timeout=timeout
176
+ )
177
+ duration_str = result.stdout.strip()
178
+ if duration_str:
179
+ return float(duration_str)
180
+ raise ValueError("Empty duration string from ffprobe.")
181
+ except (subprocess.TimeoutExpired, subprocess.SubprocessError, ValueError) as e:
182
+ print(f"Warning: ffprobe failed for {audio_path} with error: {e}. Falling back to torchaudio.")
183
+ try:
184
+ audio, sample_rate = torchaudio.load(audio_path)
185
+ return audio.shape[1] / sample_rate
186
+ except Exception as e:
187
+ raise RuntimeError(f"Both ffprobe and torchaudio failed for {audio_path}: {e}")
188
 
189
 
190
  def read_audio_text_pairs(csv_file_path):
 
206
 
207
  def save_prepped_dataset(out_dir, result, duration_list, text_vocab_set, is_finetune):
208
  out_dir = Path(out_dir)
 
209
  out_dir.mkdir(exist_ok=True, parents=True)
210
  print(f"\nSaving to {out_dir} ...")
211
 
212
+ # Save dataset with improved batch size for better I/O performance
 
213
  raw_arrow_path = out_dir / "raw.arrow"
214
+ with ArrowWriter(path=raw_arrow_path.as_posix(), writer_batch_size=100) as writer:
215
  for line in tqdm(result, desc="Writing to raw.arrow ..."):
216
  writer.write(line)
217
 
218
+ # Save durations to JSON
219
  dur_json_path = out_dir / "duration.json"
220
  with open(dur_json_path.as_posix(), "w", encoding="utf-8") as f:
221
  json.dump({"duration": duration_list}, f, ensure_ascii=False)
222
 
223
+ # Handle vocab file - write only once based on finetune flag
 
 
 
224
  voca_out_path = out_dir / "vocab.txt"
 
 
 
 
225
  if is_finetune:
226
  file_vocab_finetune = PRETRAINED_VOCAB_PATH.as_posix()
227
  shutil.copy2(file_vocab_finetune, voca_out_path)
228
  else:
229
+ with open(voca_out_path.as_posix(), "w") as f:
230
  for vocab in sorted(text_vocab_set):
231
  f.write(vocab + "\n")
232
 
 
236
  print(f"For {dataset_name}, total {sum(duration_list)/3600:.2f} hours")
237
 
238
 
239
+ def prepare_and_save_set(inp_dir, out_dir, is_finetune: bool = True, num_workers: int = None):
240
  if is_finetune:
241
  assert PRETRAINED_VOCAB_PATH.exists(), f"pretrained vocab.txt not found: {PRETRAINED_VOCAB_PATH}"
242
+ sub_result, durations, vocab_set = prepare_csv_wavs_dir(inp_dir, num_workers=num_workers)
243
  save_prepped_dataset(out_dir, sub_result, durations, vocab_set, is_finetune)
244
 
245
 
246
  def cli():
247
+ try:
248
+ # Before processing, check if ffprobe is available.
249
+ if shutil.which("ffprobe") is None:
250
+ print(
251
+ "Warning: ffprobe is not available. Duration extraction will rely on torchaudio (which may be slower)."
252
+ )
253
+
254
+ # Usage examples in help text
255
+ parser = argparse.ArgumentParser(
256
+ description="Prepare and save dataset.",
257
+ epilog="""
258
+ Examples:
259
+ # For fine-tuning (default):
260
+ python prepare_csv_wavs.py /input/dataset/path /output/dataset/path
261
+
262
+ # For pre-training:
263
+ python prepare_csv_wavs.py /input/dataset/path /output/dataset/path --pretrain
264
+
265
+ # With custom worker count:
266
+ python prepare_csv_wavs.py /input/dataset/path /output/dataset/path --workers 4
267
+ """,
268
+ )
269
+ parser.add_argument("inp_dir", type=str, help="Input directory containing the data.")
270
+ parser.add_argument("out_dir", type=str, help="Output directory to save the prepared data.")
271
+ parser.add_argument("--pretrain", action="store_true", help="Enable for new pretrain, otherwise is a fine-tune")
272
+ parser.add_argument("--workers", type=int, help=f"Number of worker threads (default: {MAX_WORKERS})")
273
+ args = parser.parse_args()
274
+
275
+ prepare_and_save_set(args.inp_dir, args.out_dir, is_finetune=not args.pretrain, num_workers=args.workers)
276
+ except KeyboardInterrupt:
277
+ print("\nOperation cancelled by user. Cleaning up...")
278
+ if executor is not None:
279
+ executor.shutdown(wait=False, cancel_futures=True)
280
+ sys.exit(1)
281
 
282
 
283
  if __name__ == "__main__":
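
For reference, the ffprobe call assembled in `get_audio_duration` is equivalent to running the following directly (the audio path is a placeholder):

```
ffprobe -v error -show_entries format=duration -of default=noprint_wrappers=1:nokey=1 /path/to/audio.wav
```

If ffprobe is unavailable or times out, the function falls back to loading the file with torchaudio and computing `num_samples / sample_rate`.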
src/f5_tts/train/finetune_cli.py CHANGED
@@ -1,12 +1,13 @@
1
  import argparse
2
  import os
3
  import shutil
 
4
 
5
  from cached_path import cached_path
 
6
  from f5_tts.model import CFM, UNetT, DiT, Trainer
7
  from f5_tts.model.utils import get_tokenizer
8
  from f5_tts.model.dataset import load_dataset
9
- from importlib.resources import files
10
 
11
 
12
  # -------------------------- Dataset Settings --------------------------- #
@@ -20,19 +21,14 @@ mel_spec_type = "vocos" # 'vocos' or 'bigvgan'
20
 
21
  # -------------------------- Argument Parsing --------------------------- #
22
  def parse_args():
23
- # batch_size_per_gpu = 1000 settting for gpu 8GB
24
- # batch_size_per_gpu = 1600 settting for gpu 12GB
25
- # batch_size_per_gpu = 2000 settting for gpu 16GB
26
- # batch_size_per_gpu = 3200 settting for gpu 24GB
27
-
28
- # num_warmup_updates = 300 for 5000 sample about 10 hours
29
-
30
- # change save_per_updates , last_per_steps change this value what you need ,
31
-
32
  parser = argparse.ArgumentParser(description="Train CFM Model")
33
 
34
  parser.add_argument(
35
- "--exp_name", type=str, default="F5TTS_Base", choices=["F5TTS_Base", "E2TTS_Base"], help="Experiment name"
 
 
 
 
36
  )
37
  parser.add_argument("--dataset_name", type=str, default="Emilia_ZH_EN", help="Name of the dataset to use")
38
  parser.add_argument("--learning_rate", type=float, default=1e-5, help="Learning rate for training")
@@ -44,9 +40,15 @@ def parse_args():
44
  parser.add_argument("--grad_accumulation_steps", type=int, default=1, help="Gradient accumulation steps")
45
  parser.add_argument("--max_grad_norm", type=float, default=1.0, help="Max gradient norm for clipping")
46
  parser.add_argument("--epochs", type=int, default=100, help="Number of training epochs")
47
- parser.add_argument("--num_warmup_updates", type=int, default=300, help="Warmup steps")
48
- parser.add_argument("--save_per_updates", type=int, default=10000, help="Save checkpoint every X steps")
49
- parser.add_argument("--last_per_steps", type=int, default=50000, help="Save last checkpoint every X steps")
 
 
 
 
 
 
50
  parser.add_argument("--finetune", action="store_true", help="Use Finetune")
51
  parser.add_argument("--pretrain", type=str, default=None, help="the path to the checkpoint")
52
  parser.add_argument(
@@ -61,7 +63,7 @@ def parse_args():
61
  parser.add_argument(
62
  "--log_samples",
63
  action="store_true",
64
- help="Log inferenced samples per ckpt save steps",
65
  )
66
  parser.add_argument("--logger", type=str, default=None, choices=["wandb", "tensorboard"], help="logger")
67
  parser.add_argument(
@@ -82,19 +84,54 @@ def main():
82
  checkpoint_path = str(files("f5_tts").joinpath(f"../../ckpts/{args.dataset_name}"))
83
 
84
  # Model parameters based on experiment name
85
- if args.exp_name == "F5TTS_Base":
 
86
  wandb_resume_id = None
87
  model_cls = DiT
88
- model_cfg = dict(dim=1024, depth=22, heads=16, ff_mult=2, text_dim=512, conv_layers=4)
 
 
 
 
 
 
 
89
  if args.finetune:
90
  if args.pretrain is None:
91
  ckpt_path = str(cached_path("hf://SWivid/F5-TTS/F5TTS_Base/model_1200000.pt"))
92
  else:
93
  ckpt_path = args.pretrain
 
94
  elif args.exp_name == "E2TTS_Base":
95
  wandb_resume_id = None
96
  model_cls = UNetT
97
- model_cfg = dict(dim=1024, depth=24, heads=16, ff_mult=4)
 
 
 
 
 
 
 
98
  if args.finetune:
99
  if args.pretrain is None:
100
  ckpt_path = str(cached_path("hf://SWivid/E2-TTS/E2TTS_Base/model_1200000.pt"))
@@ -105,12 +142,16 @@ def main():
105
  if not os.path.isdir(checkpoint_path):
106
  os.makedirs(checkpoint_path, exist_ok=True)
107
 
108
- file_checkpoint = os.path.join(checkpoint_path, os.path.basename(ckpt_path))
 
 
 
109
  if not os.path.isfile(file_checkpoint):
110
  shutil.copy2(ckpt_path, file_checkpoint)
111
  print("copy checkpoint for finetune")
112
 
113
  # Use the tokenizer and tokenizer_path provided in the command line arguments
 
114
  tokenizer = args.tokenizer
115
  if tokenizer == "custom":
116
  if not args.tokenizer_path:
@@ -145,8 +186,9 @@ def main():
145
  args.learning_rate,
146
  num_warmup_updates=args.num_warmup_updates,
147
  save_per_updates=args.save_per_updates,
 
148
  checkpoint_path=checkpoint_path,
149
- batch_size=args.batch_size_per_gpu,
150
  batch_size_type=args.batch_size_type,
151
  max_samples=args.max_samples,
152
  grad_accumulation_steps=args.grad_accumulation_steps,
@@ -156,7 +198,7 @@ def main():
156
  wandb_run_name=args.exp_name,
157
  wandb_resume_id=wandb_resume_id,
158
  log_samples=args.log_samples,
159
- last_per_steps=args.last_per_steps,
160
  bnb_optimizer=args.bnb_optimizer,
161
  )
162
 
 
1
  import argparse
2
  import os
3
  import shutil
4
+ from importlib.resources import files
5
 
6
  from cached_path import cached_path
7
+
8
  from f5_tts.model import CFM, UNetT, DiT, Trainer
9
  from f5_tts.model.utils import get_tokenizer
10
  from f5_tts.model.dataset import load_dataset
 
11
 
12
 
13
  # -------------------------- Dataset Settings --------------------------- #
 
21
 
22
  # -------------------------- Argument Parsing --------------------------- #
23
  def parse_args():
 
 
 
 
 
 
 
 
 
24
  parser = argparse.ArgumentParser(description="Train CFM Model")
25
 
26
  parser.add_argument(
27
+ "--exp_name",
28
+ type=str,
29
+ default="F5TTS_v1_Base",
30
+ choices=["F5TTS_v1_Base", "F5TTS_Base", "E2TTS_Base"],
31
+ help="Experiment name",
32
  )
33
  parser.add_argument("--dataset_name", type=str, default="Emilia_ZH_EN", help="Name of the dataset to use")
34
  parser.add_argument("--learning_rate", type=float, default=1e-5, help="Learning rate for training")
 
40
  parser.add_argument("--grad_accumulation_steps", type=int, default=1, help="Gradient accumulation steps")
41
  parser.add_argument("--max_grad_norm", type=float, default=1.0, help="Max gradient norm for clipping")
42
  parser.add_argument("--epochs", type=int, default=100, help="Number of training epochs")
43
+ parser.add_argument("--num_warmup_updates", type=int, default=300, help="Warmup updates")
44
+ parser.add_argument("--save_per_updates", type=int, default=10000, help="Save checkpoint every X updates")
45
+ parser.add_argument(
46
+ "--keep_last_n_checkpoints",
47
+ type=int,
48
+ default=-1,
49
+ help="-1 to keep all, 0 to not save intermediate, > 0 to keep last N checkpoints",
50
+ )
51
+ parser.add_argument("--last_per_updates", type=int, default=50000, help="Save last checkpoint every X updates")
52
  parser.add_argument("--finetune", action="store_true", help="Use Finetune")
53
  parser.add_argument("--pretrain", type=str, default=None, help="the path to the checkpoint")
54
  parser.add_argument(
 
63
  parser.add_argument(
64
  "--log_samples",
65
  action="store_true",
66
+ help="Log inferenced samples per ckpt save updates",
67
  )
68
  parser.add_argument("--logger", type=str, default=None, choices=["wandb", "tensorboard"], help="logger")
69
  parser.add_argument(
 
84
  checkpoint_path = str(files("f5_tts").joinpath(f"../../ckpts/{args.dataset_name}"))
85
 
86
  # Model parameters based on experiment name
87
+
88
+ if args.exp_name == "F5TTS_v1_Base":
89
  wandb_resume_id = None
90
  model_cls = DiT
91
+ model_cfg = dict(
92
+ dim=1024,
93
+ depth=22,
94
+ heads=16,
95
+ ff_mult=2,
96
+ text_dim=512,
97
+ conv_layers=4,
98
+ )
99
+ if args.finetune:
100
+ if args.pretrain is None:
101
+ ckpt_path = str(cached_path("hf://SWivid/F5-TTS/F5TTS_v1_Base/model_1250000.safetensors"))
102
+ else:
103
+ ckpt_path = args.pretrain
104
+
105
+ elif args.exp_name == "F5TTS_Base":
106
+ wandb_resume_id = None
107
+ model_cls = DiT
108
+ model_cfg = dict(
109
+ dim=1024,
110
+ depth=22,
111
+ heads=16,
112
+ ff_mult=2,
113
+ text_dim=512,
114
+ text_mask_padding=False,
115
+ conv_layers=4,
116
+ pe_attn_head=1,
117
+ )
118
  if args.finetune:
119
  if args.pretrain is None:
120
  ckpt_path = str(cached_path("hf://SWivid/F5-TTS/F5TTS_Base/model_1200000.pt"))
121
  else:
122
  ckpt_path = args.pretrain
123
+
124
  elif args.exp_name == "E2TTS_Base":
125
  wandb_resume_id = None
126
  model_cls = UNetT
127
+ model_cfg = dict(
128
+ dim=1024,
129
+ depth=24,
130
+ heads=16,
131
+ ff_mult=4,
132
+ text_mask_padding=False,
133
+ pe_attn_head=1,
134
+ )
135
  if args.finetune:
136
  if args.pretrain is None:
137
  ckpt_path = str(cached_path("hf://SWivid/E2-TTS/E2TTS_Base/model_1200000.pt"))
 
142
  if not os.path.isdir(checkpoint_path):
143
  os.makedirs(checkpoint_path, exist_ok=True)
144
 
145
+ file_checkpoint = os.path.basename(ckpt_path)
146
+ if not file_checkpoint.startswith("pretrained_"): # Change: Add 'pretrained_' prefix to copied model
147
+ file_checkpoint = "pretrained_" + file_checkpoint
148
+ file_checkpoint = os.path.join(checkpoint_path, file_checkpoint)
149
  if not os.path.isfile(file_checkpoint):
150
  shutil.copy2(ckpt_path, file_checkpoint)
151
  print("copy checkpoint for finetune")
152
 
153
  # Use the tokenizer and tokenizer_path provided in the command line arguments
154
+
155
  tokenizer = args.tokenizer
156
  if tokenizer == "custom":
157
  if not args.tokenizer_path:
 
186
  args.learning_rate,
187
  num_warmup_updates=args.num_warmup_updates,
188
  save_per_updates=args.save_per_updates,
189
+ keep_last_n_checkpoints=args.keep_last_n_checkpoints,
190
  checkpoint_path=checkpoint_path,
191
+ batch_size_per_gpu=args.batch_size_per_gpu,
192
  batch_size_type=args.batch_size_type,
193
  max_samples=args.max_samples,
194
  grad_accumulation_steps=args.grad_accumulation_steps,
 
198
  wandb_run_name=args.exp_name,
199
  wandb_resume_id=wandb_resume_id,
200
  log_samples=args.log_samples,
201
+ last_per_updates=args.last_per_updates,
202
  bnb_optimizer=args.bnb_optimizer,
203
  )
204
 
src/f5_tts/train/finetune_gradio.py CHANGED
@@ -1,36 +1,36 @@
1
- import threading
2
- import queue
3
- import re
4
-
5
  import gc
6
  import json
 
7
  import os
8
  import platform
9
  import psutil
 
10
  import random
 
11
  import signal
12
  import shutil
13
  import subprocess
14
  import sys
15
  import tempfile
 
16
  import time
17
  from glob import glob
 
 
18
 
19
  import click
20
  import gradio as gr
21
  import librosa
22
- import numpy as np
23
  import torch
24
  import torchaudio
 
25
  from datasets import Dataset as Dataset_
26
  from datasets.arrow_writer import ArrowWriter
27
- from safetensors.torch import save_file
28
- from scipy.io import wavfile
29
- from cached_path import cached_path
30
  from f5_tts.api import F5TTS
31
  from f5_tts.model.utils import convert_char_to_pinyin
32
  from f5_tts.infer.utils_infer import transcribe
33
- from importlib.resources import files
34
 
35
 
36
  training_process = None
@@ -46,7 +46,15 @@ path_data = str(files("f5_tts").joinpath("../../data"))
46
  path_project_ckpts = str(files("f5_tts").joinpath("../../ckpts"))
47
  file_train = str(files("f5_tts").joinpath("train/finetune_cli.py"))
48
 
49
- device = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"
 
 
 
 
 
 
 
 
50
 
51
 
52
  # Save settings from a JSON file
@@ -62,7 +70,8 @@ def save_settings(
62
  epochs,
63
  num_warmup_updates,
64
  save_per_updates,
65
- last_per_steps,
 
66
  finetune,
67
  file_checkpoint_train,
68
  tokenizer_type,
@@ -86,7 +95,8 @@ def save_settings(
86
  "epochs": epochs,
87
  "num_warmup_updates": num_warmup_updates,
88
  "save_per_updates": save_per_updates,
89
- "last_per_steps": last_per_steps,
 
90
  "finetune": finetune,
91
  "file_checkpoint_train": file_checkpoint_train,
92
  "tokenizer_type": tokenizer_type,
@@ -106,73 +116,56 @@ def load_settings(project_name):
106
  path_project = os.path.join(path_project_ckpts, project_name)
107
  file_setting = os.path.join(path_project, "setting.json")
108
 
109
- if not os.path.isfile(file_setting):
110
- settings = {
111
- "exp_name": "F5TTS_Base",
112
- "learning_rate": 1e-05,
113
- "batch_size_per_gpu": 1000,
114
- "batch_size_type": "frame",
115
- "max_samples": 64,
116
- "grad_accumulation_steps": 1,
117
- "max_grad_norm": 1,
118
- "epochs": 100,
119
- "num_warmup_updates": 2,
120
- "save_per_updates": 300,
121
- "last_per_steps": 100,
122
- "finetune": True,
123
- "file_checkpoint_train": "",
124
- "tokenizer_type": "pinyin",
125
- "tokenizer_file": "",
126
- "mixed_precision": "none",
127
- "logger": "wandb",
128
- "bnb_optimizer": False,
129
- }
130
- return (
131
- settings["exp_name"],
132
- settings["learning_rate"],
133
- settings["batch_size_per_gpu"],
134
- settings["batch_size_type"],
135
- settings["max_samples"],
136
- settings["grad_accumulation_steps"],
137
- settings["max_grad_norm"],
138
- settings["epochs"],
139
- settings["num_warmup_updates"],
140
- settings["save_per_updates"],
141
- settings["last_per_steps"],
142
- settings["finetune"],
143
- settings["file_checkpoint_train"],
144
- settings["tokenizer_type"],
145
- settings["tokenizer_file"],
146
- settings["mixed_precision"],
147
- settings["logger"],
148
- settings["bnb_optimizer"],
149
- )
150
 
151
- with open(file_setting, "r") as f:
152
- settings = json.load(f)
153
- if "logger" not in settings:
154
- settings["logger"] = "wandb"
155
- if "bnb_optimizer" not in settings:
156
- settings["bnb_optimizer"] = False
157
  return (
158
- settings["exp_name"],
159
- settings["learning_rate"],
160
- settings["batch_size_per_gpu"],
161
- settings["batch_size_type"],
162
- settings["max_samples"],
163
- settings["grad_accumulation_steps"],
164
- settings["max_grad_norm"],
165
- settings["epochs"],
166
- settings["num_warmup_updates"],
167
- settings["save_per_updates"],
168
- settings["last_per_steps"],
169
- settings["finetune"],
170
- settings["file_checkpoint_train"],
171
- settings["tokenizer_type"],
172
- settings["tokenizer_file"],
173
- settings["mixed_precision"],
174
- settings["logger"],
175
- settings["bnb_optimizer"],
 
176
  )
177
 
178
 
@@ -369,17 +362,18 @@ def terminate_process(pid):
369
 
370
  def start_training(
371
  dataset_name="",
372
- exp_name="F5TTS_Base",
373
- learning_rate=1e-4,
374
- batch_size_per_gpu=400,
375
- batch_size_type="frame",
376
  max_samples=64,
377
- grad_accumulation_steps=1,
378
  max_grad_norm=1.0,
379
- epochs=11,
380
- num_warmup_updates=200,
381
- save_per_updates=400,
382
- last_per_steps=800,
 
383
  finetune=True,
384
  file_checkpoint_train="",
385
  tokenizer_type="pinyin",
@@ -438,18 +432,19 @@ def start_training(
438
  fp16 = ""
439
 
440
  cmd = (
441
- f"accelerate launch {fp16} {file_train} --exp_name {exp_name} "
442
- f"--learning_rate {learning_rate} "
443
- f"--batch_size_per_gpu {batch_size_per_gpu} "
444
- f"--batch_size_type {batch_size_type} "
445
- f"--max_samples {max_samples} "
446
- f"--grad_accumulation_steps {grad_accumulation_steps} "
447
- f"--max_grad_norm {max_grad_norm} "
448
- f"--epochs {epochs} "
449
- f"--num_warmup_updates {num_warmup_updates} "
450
- f"--save_per_updates {save_per_updates} "
451
- f"--last_per_steps {last_per_steps} "
452
- f"--dataset_name {dataset_name}"
 
453
  )
454
 
455
  if finetune:
@@ -482,7 +477,8 @@ def start_training(
482
  epochs,
483
  num_warmup_updates,
484
  save_per_updates,
485
- last_per_steps,
 
486
  finetune,
487
  file_checkpoint_train,
488
  tokenizer_type,
@@ -548,7 +544,7 @@ def start_training(
548
  output = stdout_queue.get_nowait()
549
  print(output, end="")
550
  match = re.search(
551
- r"Epoch (\d+)/(\d+):\s+(\d+)%\|.*\[(\d+:\d+)<.*?loss=(\d+\.\d+), step=(\d+)", output
552
  )
553
  if match:
554
  current_epoch = match.group(1)
@@ -556,13 +552,13 @@ def start_training(
556
  percent_complete = match.group(3)
557
  elapsed_time = match.group(4)
558
  loss = match.group(5)
559
- current_step = match.group(6)
560
  message = (
561
  f"Epoch: {current_epoch}/{total_epochs}, "
562
  f"Progress: {percent_complete}%, "
563
  f"Elapsed Time: {elapsed_time}, "
564
  f"Loss: {loss}, "
565
- f"Step: {current_step}"
566
  )
567
  yield message, gr.update(interactive=False), gr.update(interactive=True)
568
  elif output.strip():
@@ -801,14 +797,14 @@ def create_metadata(name_project, ch_tokenizer, progress=gr.Progress()):
801
  print(f"Error processing {file_audio}: {e}")
802
  continue
803
 
804
- if duration < 1 or duration > 25:
805
- if duration > 25:
806
- error_files.append([file_audio, "duration > 25 sec"])
807
  if duration < 1:
808
  error_files.append([file_audio, "duration < 1 sec "])
809
  continue
810
  if len(text) < 3:
811
- error_files.append([file_audio, "very small text len 3"])
812
  continue
813
 
814
  text = clear_text(text)
@@ -875,40 +871,37 @@ def check_user(value):
875
 
876
  def calculate_train(
877
  name_project,
 
 
 
878
  batch_size_type,
879
  max_samples,
880
- learning_rate,
881
  num_warmup_updates,
882
- save_per_updates,
883
- last_per_steps,
884
  finetune,
885
  ):
886
  path_project = os.path.join(path_data, name_project)
887
- file_duraction = os.path.join(path_project, "duration.json")
888
 
889
- if not os.path.isfile(file_duraction):
 
 
 
890
  return (
891
- 1000,
 
 
892
  max_samples,
893
  num_warmup_updates,
894
- save_per_updates,
895
- last_per_steps,
896
  "project not found !",
897
- learning_rate,
898
  )
899
 
900
- with open(file_duraction, "r") as file:
901
  data = json.load(file)
902
 
903
  duration_list = data["duration"]
904
- samples = len(duration_list)
905
- hours = sum(duration_list) / 3600
906
-
907
- # if torch.cuda.is_available():
908
- # gpu_properties = torch.cuda.get_device_properties(0)
909
- # total_memory = gpu_properties.total_memory / (1024**3)
910
- # elif torch.backends.mps.is_available():
911
- # total_memory = psutil.virtual_memory().available / (1024**3)
912
 
913
  if torch.cuda.is_available():
914
  gpu_count = torch.cuda.device_count()
@@ -916,57 +909,39 @@ def calculate_train(
916
  for i in range(gpu_count):
917
  gpu_properties = torch.cuda.get_device_properties(i)
918
  total_memory += gpu_properties.total_memory / (1024**3) # in GB
919
-
 
 
 
 
 
920
  elif torch.backends.mps.is_available():
921
  gpu_count = 1
922
  total_memory = psutil.virtual_memory().available / (1024**3)
923
 
 
 
 
924
  if batch_size_type == "frame":
925
- batch = int(total_memory * 0.5)
926
- batch = (lambda num: num + 1 if num % 2 != 0 else num)(batch)
927
- batch_size_per_gpu = int(38400 / batch)
928
- else:
929
- batch_size_per_gpu = int(total_memory / 8)
930
- batch_size_per_gpu = (lambda num: num + 1 if num % 2 != 0 else num)(batch_size_per_gpu)
931
- batch = batch_size_per_gpu
932
 
933
- if batch_size_per_gpu <= 0:
934
- batch_size_per_gpu = 1
935
 
936
- if samples < 64:
937
- max_samples = int(samples * 0.25)
938
- else:
939
- max_samples = 64
940
-
941
- num_warmup_updates = int(samples * 0.05)
942
- save_per_updates = int(samples * 0.10)
943
- last_per_steps = int(save_per_updates * 0.25)
944
-
945
- max_samples = (lambda num: num + 1 if num % 2 != 0 else num)(max_samples)
946
- num_warmup_updates = (lambda num: num + 1 if num % 2 != 0 else num)(num_warmup_updates)
947
- save_per_updates = (lambda num: num + 1 if num % 2 != 0 else num)(save_per_updates)
948
- last_per_steps = (lambda num: num + 1 if num % 2 != 0 else num)(last_per_steps)
949
- if last_per_steps <= 0:
950
- last_per_steps = 2
951
-
952
- total_hours = hours
953
- mel_hop_length = 256
954
- mel_sampling_rate = 24000
955
-
956
- # target
957
- wanted_max_updates = 1000000
958
-
959
- # train params
960
- gpus = gpu_count
961
- frames_per_gpu = batch_size_per_gpu # 8 * 38400 = 307200
962
- grad_accum = 1
963
-
964
- # intermediate
965
- mini_batch_frames = frames_per_gpu * grad_accum * gpus
966
- mini_batch_hours = mini_batch_frames * mel_hop_length / mel_sampling_rate / 3600
967
- updates_per_epoch = total_hours / mini_batch_hours
968
- # steps_per_epoch = updates_per_epoch * grad_accum
969
- epochs = wanted_max_updates / updates_per_epoch
970
 
971
  if finetune:
972
  learning_rate = 1e-5
@@ -974,20 +949,18 @@ def calculate_train(
974
  learning_rate = 7.5e-5
975
 
976
  return (
 
 
977
  batch_size_per_gpu,
978
  max_samples,
979
  num_warmup_updates,
980
- save_per_updates,
981
- last_per_steps,
982
- samples,
983
- learning_rate,
984
- int(epochs),
985
  )
986
 
987
 
988
  def extract_and_save_ema_model(checkpoint_path: str, new_checkpoint_path: str, safetensors: bool) -> str:
989
  try:
990
- checkpoint = torch.load(checkpoint_path)
991
  print("Original Checkpoint Keys:", checkpoint.keys())
992
 
993
  ema_model_state_dict = checkpoint.get("ema_model_state_dict", None)
@@ -1018,7 +991,11 @@ def expand_model_embeddings(ckpt_path, new_ckpt_path, num_new_tokens=42):
1018
  torch.backends.cudnn.deterministic = True
1019
  torch.backends.cudnn.benchmark = False
1020
 
1021
- ckpt = torch.load(ckpt_path, map_location="cpu")
 
 
 
 
1022
 
1023
  ema_sd = ckpt.get("ema_model_state_dict", {})
1024
  embed_key_ema = "ema_model.transformer.text_embed.text_embed.weight"
@@ -1086,9 +1063,11 @@ def vocab_extend(project_name, symbols, model_type):
1086
  with open(file_vocab_project, "w", encoding="utf-8") as f:
1087
  f.write("\n".join(vocab))
1088
 
1089
- if model_type == "F5-TTS":
 
 
1090
  ckpt_path = str(cached_path("hf://SWivid/F5-TTS/F5TTS_Base/model_1200000.pt"))
1091
- else:
1092
  ckpt_path = str(cached_path("hf://SWivid/E2-TTS/E2TTS_Base/model_1200000.pt"))
1093
 
1094
  vocab_size_new = len(miss_symbols)
@@ -1096,7 +1075,9 @@ def vocab_extend(project_name, symbols, model_type):
1096
  dataset_name = name_project.replace("_pinyin", "").replace("_char", "")
1097
  new_ckpt_path = os.path.join(path_project_ckpts, dataset_name)
1098
  os.makedirs(new_ckpt_path, exist_ok=True)
1099
- new_ckpt_file = os.path.join(new_ckpt_path, "model_1200000.pt")
 
 
1100
 
1101
  size = expand_model_embeddings(ckpt_path, new_ckpt_file, num_new_tokens=vocab_size_new)
1102
 
@@ -1226,21 +1207,21 @@ def infer(
1226
  vocab_file = os.path.join(path_data, project, "vocab.txt")
1227
 
1228
  tts_api = F5TTS(
1229
- model_type=exp_name, ckpt_file=file_checkpoint, vocab_file=vocab_file, device=device_test, use_ema=use_ema
1230
  )
1231
 
1232
  print("update >> ", device_test, file_checkpoint, use_ema)
1233
 
1234
  with tempfile.NamedTemporaryFile(delete=False, suffix=".wav") as f:
1235
  tts_api.infer(
1236
- gen_text=gen_text.lower().strip(),
1237
- ref_text=ref_text.lower().strip(),
1238
  ref_file=ref_audio,
 
 
1239
  nfe_step=nfe_step,
1240
- file_wave=f.name,
1241
  speed=speed,
1242
- seed=seed,
1243
  remove_silence=remove_silence,
 
 
1244
  )
1245
  return f.name, tts_api.device, str(tts_api.seed)
1246
 
@@ -1256,12 +1237,22 @@ def get_checkpoints_project(project_name, is_gradio=True):
1256
 
1257
  if os.path.isdir(path_project_ckpts):
1258
  files_checkpoints = glob(os.path.join(path_project_ckpts, project_name, "*.pt"))
1259
- files_checkpoints = sorted(
1260
- files_checkpoints,
1261
- key=lambda x: int(os.path.basename(x).split("_")[1].split(".")[0])
1262
- if os.path.basename(x) != "model_last.pt"
1263
- else float("inf"),
 
 
 
 
 
 
 
1264
  )
 
 
 
1265
  else:
1266
  files_checkpoints = []
1267
 
@@ -1312,7 +1303,21 @@ def get_gpu_stats():
1312
  f"Allocated GPU memory (GPU {i}): {allocated_memory:.2f} MB\n"
1313
  f"Reserved GPU memory (GPU {i}): {reserved_memory:.2f} MB\n\n"
1314
  )
 
 
 
 
 
 
 
 
1315
 
 
 
 
 
 
 
1316
  elif torch.backends.mps.is_available():
1317
  gpu_count = 1
1318
  gpu_stats += "MPS GPU\n"
@@ -1375,14 +1380,14 @@ def get_audio_select(file_sample):
1375
  with gr.Blocks() as app:
1376
  gr.Markdown(
1377
  """
1378
- # E2/F5 TTS Automatic Finetune
1379
 
1380
- This is a local web UI for F5 TTS with advanced batch processing support. This app supports the following TTS models:
1381
 
1382
  * [F5-TTS](https://arxiv.org/abs/2410.06885) (A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching)
1383
  * [E2 TTS](https://arxiv.org/abs/2406.18009) (Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS)
1384
 
1385
- The checkpoints support English and Chinese.
1386
 
1387
  For tutorial and updates check here (https://github.com/SWivid/F5-TTS/discussions/143)
1388
  """
@@ -1459,7 +1464,9 @@ Check the vocabulary for fine-tuning Emilia_ZH_EN to ensure all symbols are incl
1459
  Using the extended model, you can finetune to a new language that is missing symbols in the vocab. This creates a new model with a new vocabulary size and saves it in your ckpts/project folder.
1460
  ```""")
1461
 
1462
- exp_name_extend = gr.Radio(label="Model", choices=["F5-TTS", "E2-TTS"], value="F5-TTS")
 
 
1463
 
1464
  with gr.Row():
1465
  txt_extend = gr.Textbox(
@@ -1528,9 +1535,9 @@ Skip this step if you have your dataset, raw.arrow, duration.json, and vocab.txt
1528
  fn=get_random_sample_prepare, inputs=[cm_project], outputs=[random_text_prepare, random_audio_prepare]
1529
  )
1530
 
1531
- with gr.TabItem("Train Data"):
1532
  gr.Markdown("""```plaintext
1533
- The auto-setting is still experimental. Please make sure that the epochs, save per updates, and last per steps are set correctly, or change them manually as needed.
1534
  If you encounter a memory error, try reducing the batch size per GPU to a smaller number.
1535
  ```""")
1536
  with gr.Row():
@@ -1544,11 +1551,13 @@ If you encounter a memory error, try reducing the batch size per GPU to a smalle
1544
  file_checkpoint_train = gr.Textbox(label="Path to the Pretrained Checkpoint", value="")
1545
 
1546
  with gr.Row():
1547
- exp_name = gr.Radio(label="Model", choices=["F5TTS_Base", "E2TTS_Base"], value="F5TTS_Base")
 
 
1548
  learning_rate = gr.Number(label="Learning Rate", value=1e-5, step=1e-5)
1549
 
1550
  with gr.Row():
1551
- batch_size_per_gpu = gr.Number(label="Batch Size per GPU", value=1000)
1552
  max_samples = gr.Number(label="Max Samples", value=64)
1553
 
1554
  with gr.Row():
@@ -1556,59 +1565,70 @@ If you encounter a memory error, try reducing the batch size per GPU to a smalle
1556
  max_grad_norm = gr.Number(label="Max Gradient Norm", value=1.0)
1557
 
1558
  with gr.Row():
1559
- epochs = gr.Number(label="Epochs", value=10)
1560
- num_warmup_updates = gr.Number(label="Warmup Updates", value=2)
1561
 
1562
  with gr.Row():
1563
- save_per_updates = gr.Number(label="Save per Updates", value=300)
1564
- last_per_steps = gr.Number(label="Last per Steps", value=100)
 
 
 
 
 
 
 
1565
 
1566
  with gr.Row():
1567
  ch_8bit_adam = gr.Checkbox(label="Use 8-bit Adam optimizer")
1568
- mixed_precision = gr.Radio(label="mixed_precision", choices=["none", "fp16", "bf16"], value="none")
1569
  cd_logger = gr.Radio(label="logger", choices=["wandb", "tensorboard"], value="wandb")
1570
  start_button = gr.Button("Start Training")
1571
  stop_button = gr.Button("Stop Training", interactive=False)
1572
 
1573
  if projects_selelect is not None:
1574
  (
1575
- exp_namev,
1576
- learning_ratev,
1577
- batch_size_per_gpuv,
1578
- batch_size_typev,
1579
- max_samplesv,
1580
- grad_accumulation_stepsv,
1581
- max_grad_normv,
1582
- epochsv,
1583
- num_warmupv_updatesv,
1584
- save_per_updatesv,
1585
- last_per_stepsv,
1586
- finetunev,
1587
- file_checkpoint_trainv,
1588
- tokenizer_typev,
1589
- tokenizer_filev,
1590
- mixed_precisionv,
1591
- cd_loggerv,
1592
- ch_8bit_adamv,
 
1593
  ) = load_settings(projects_selelect)
1594
- exp_name.value = exp_namev
1595
- learning_rate.value = learning_ratev
1596
- batch_size_per_gpu.value = batch_size_per_gpuv
1597
- batch_size_type.value = batch_size_typev
1598
- max_samples.value = max_samplesv
1599
- grad_accumulation_steps.value = grad_accumulation_stepsv
1600
- max_grad_norm.value = max_grad_normv
1601
- epochs.value = epochsv
1602
- num_warmup_updates.value = num_warmupv_updatesv
1603
- save_per_updates.value = save_per_updatesv
1604
- last_per_steps.value = last_per_stepsv
1605
- ch_finetune.value = finetunev
1606
- file_checkpoint_train.value = file_checkpoint_trainv
1607
- tokenizer_type.value = tokenizer_typev
1608
- tokenizer_file.value = tokenizer_filev
1609
- mixed_precision.value = mixed_precisionv
1610
- cd_logger.value = cd_loggerv
1611
- ch_8bit_adam.value = ch_8bit_adamv
 
 
 
1612
 
1613
  ch_stream = gr.Checkbox(label="Stream Output Experiment", value=True)
1614
  txt_info_train = gr.Text(label="Info", value="")
@@ -1659,7 +1679,8 @@ If you encounter a memory error, try reducing the batch size per GPU to a smalle
1659
  epochs,
1660
  num_warmup_updates,
1661
  save_per_updates,
1662
- last_per_steps,
 
1663
  ch_finetune,
1664
  file_checkpoint_train,
1665
  tokenizer_type,
@@ -1677,23 +1698,21 @@ If you encounter a memory error, try reducing the batch size per GPU to a smalle
1677
  fn=calculate_train,
1678
  inputs=[
1679
  cm_project,
 
 
 
1680
  batch_size_type,
1681
  max_samples,
1682
- learning_rate,
1683
  num_warmup_updates,
1684
- save_per_updates,
1685
- last_per_steps,
1686
  ch_finetune,
1687
  ],
1688
  outputs=[
 
 
1689
  batch_size_per_gpu,
1690
  max_samples,
1691
  num_warmup_updates,
1692
- save_per_updates,
1693
- last_per_steps,
1694
  lb_samples,
1695
- learning_rate,
1696
- epochs,
1697
  ],
1698
  )
1699
 
@@ -1713,15 +1732,16 @@ If you encounter a memory error, try reducing the batch size per GPU to a smalle
1713
  epochs,
1714
  num_warmup_updates,
1715
  save_per_updates,
1716
- last_per_steps,
 
1717
  ch_finetune,
1718
  file_checkpoint_train,
1719
  tokenizer_type,
1720
  tokenizer_file,
1721
  mixed_precision,
1722
  cd_logger,
 
1723
  ]
1724
-
1725
  return output_components
1726
 
1727
  outputs = setup_load_settings()
@@ -1742,7 +1762,9 @@ If you encounter a memory error, try reducing the batch size per GPU to a smalle
1742
  gr.Markdown("""```plaintext
1743
  SOS: Check the use_ema setting (True or False) for your model to see which works best for you. Use seed -1 for a random seed.
1744
  ```""")
1745
- exp_name = gr.Radio(label="Model", choices=["F5-TTS", "E2-TTS"], value="F5-TTS")
 
 
1746
  list_checkpoints, checkpoint_select = get_checkpoints_project(projects_selelect, False)
1747
 
1748
  with gr.Row():
@@ -1796,9 +1818,9 @@ SOS: Check the use_ema setting (True or False) for your model to see what works
1796
  bt_checkpoint_refresh.click(fn=get_checkpoints_project, inputs=[cm_project], outputs=[cm_checkpoint])
1797
  cm_project.change(fn=get_checkpoints_project, inputs=[cm_project], outputs=[cm_checkpoint])
1798
 
1799
- with gr.TabItem("Reduce Checkpoint"):
1800
  gr.Markdown("""```plaintext
1801
- Reduce the model size from 5GB to 1.3GB. The new checkpoint can be used for inference or fine-tuning afterward, but it cannot be used to continue training.
1802
  ```""")
1803
  txt_path_checkpoint = gr.Text(label="Path to Checkpoint:")
1804
  txt_path_checkpoint_small = gr.Text(label="Path to Output:")
 
 
 
 
 
1
  import gc
2
  import json
3
+ import numpy as np
4
  import os
5
  import platform
6
  import psutil
7
+ import queue
8
  import random
9
+ import re
10
  import signal
11
  import shutil
12
  import subprocess
13
  import sys
14
  import tempfile
15
+ import threading
16
  import time
17
  from glob import glob
18
+ from importlib.resources import files
19
+ from scipy.io import wavfile
20
 
21
  import click
22
  import gradio as gr
23
  import librosa
 
24
  import torch
25
  import torchaudio
26
+ from cached_path import cached_path
27
  from datasets import Dataset as Dataset_
28
  from datasets.arrow_writer import ArrowWriter
29
+ from safetensors.torch import load_file, save_file
30
+
 
31
  from f5_tts.api import F5TTS
32
  from f5_tts.model.utils import convert_char_to_pinyin
33
  from f5_tts.infer.utils_infer import transcribe
 
34
 
35
 
36
  training_process = None
 
46
  path_project_ckpts = str(files("f5_tts").joinpath("../../ckpts"))
47
  file_train = str(files("f5_tts").joinpath("train/finetune_cli.py"))
48
 
49
+ device = (
50
+ "cuda"
51
+ if torch.cuda.is_available()
52
+ else "xpu"
53
+ if torch.xpu.is_available()
54
+ else "mps"
55
+ if torch.backends.mps.is_available()
56
+ else "cpu"
57
+ )
58
 
59
 
60
  # Save settings from a JSON file
 
70
  epochs,
71
  num_warmup_updates,
72
  save_per_updates,
73
+ keep_last_n_checkpoints,
74
+ last_per_updates,
75
  finetune,
76
  file_checkpoint_train,
77
  tokenizer_type,
 
95
  "epochs": epochs,
96
  "num_warmup_updates": num_warmup_updates,
97
  "save_per_updates": save_per_updates,
98
+ "keep_last_n_checkpoints": keep_last_n_checkpoints,
99
+ "last_per_updates": last_per_updates,
100
  "finetune": finetune,
101
  "file_checkpoint_train": file_checkpoint_train,
102
  "tokenizer_type": tokenizer_type,
 
116
  path_project = os.path.join(path_project_ckpts, project_name)
117
  file_setting = os.path.join(path_project, "setting.json")
118
 
119
+ # Default settings
120
+ default_settings = {
121
+ "exp_name": "F5TTS_v1_Base",
122
+ "learning_rate": 1e-5,
123
+ "batch_size_per_gpu": 1,
124
+ "batch_size_type": "sample",
125
+ "max_samples": 64,
126
+ "grad_accumulation_steps": 4,
127
+ "max_grad_norm": 1,
128
+ "epochs": 100,
129
+ "num_warmup_updates": 100,
130
+ "save_per_updates": 500,
131
+ "keep_last_n_checkpoints": -1,
132
+ "last_per_updates": 100,
133
+ "finetune": True,
134
+ "file_checkpoint_train": "",
135
+ "tokenizer_type": "pinyin",
136
+ "tokenizer_file": "",
137
+ "mixed_precision": "none",
138
+ "logger": "wandb",
139
+ "bnb_optimizer": False,
140
+ }
141
+
142
+ # Load settings from file if it exists
143
+ if os.path.isfile(file_setting):
144
+ with open(file_setting, "r") as f:
145
+ file_settings = json.load(f)
146
+ default_settings.update(file_settings)
 
 
 
 
 
 
 
 
 
 
 
 
 
147
 
148
+ # Return as a tuple in the correct order
 
 
 
 
 
149
  return (
150
+ default_settings["exp_name"],
151
+ default_settings["learning_rate"],
152
+ default_settings["batch_size_per_gpu"],
153
+ default_settings["batch_size_type"],
154
+ default_settings["max_samples"],
155
+ default_settings["grad_accumulation_steps"],
156
+ default_settings["max_grad_norm"],
157
+ default_settings["epochs"],
158
+ default_settings["num_warmup_updates"],
159
+ default_settings["save_per_updates"],
160
+ default_settings["keep_last_n_checkpoints"],
161
+ default_settings["last_per_updates"],
162
+ default_settings["finetune"],
163
+ default_settings["file_checkpoint_train"],
164
+ default_settings["tokenizer_type"],
165
+ default_settings["tokenizer_file"],
166
+ default_settings["mixed_precision"],
167
+ default_settings["logger"],
168
+ default_settings["bnb_optimizer"],
169
  )
170
 
171
 
 
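A quick usage sketch of the settings loader above (the project name is a placeholder; the unpacking order mirrors the returned tuple, falling back to the defaults when no setting.json exists yet):

```python
# Hypothetical usage of load_settings defined above; "my_project_pinyin" is a placeholder.
(
    exp_name, learning_rate, batch_size_per_gpu, batch_size_type, max_samples,
    grad_accumulation_steps, max_grad_norm, epochs, num_warmup_updates,
    save_per_updates, keep_last_n_checkpoints, last_per_updates, finetune,
    file_checkpoint_train, tokenizer_type, tokenizer_file, mixed_precision,
    logger, bnb_optimizer,
) = load_settings("my_project_pinyin")
```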
362
 
363
  def start_training(
364
  dataset_name="",
365
+ exp_name="F5TTS_v1_Base",
366
+ learning_rate=1e-5,
367
+ batch_size_per_gpu=1,
368
+ batch_size_type="sample",
369
  max_samples=64,
370
+ grad_accumulation_steps=4,
371
  max_grad_norm=1.0,
372
+ epochs=100,
373
+ num_warmup_updates=100,
374
+ save_per_updates=500,
375
+ keep_last_n_checkpoints=-1,
376
+ last_per_updates=100,
377
  finetune=True,
378
  file_checkpoint_train="",
379
  tokenizer_type="pinyin",
 
432
  fp16 = ""
433
 
434
  cmd = (
435
+ f"accelerate launch {fp16} {file_train} --exp_name {exp_name}"
436
+ f" --learning_rate {learning_rate}"
437
+ f" --batch_size_per_gpu {batch_size_per_gpu}"
438
+ f" --batch_size_type {batch_size_type}"
439
+ f" --max_samples {max_samples}"
440
+ f" --grad_accumulation_steps {grad_accumulation_steps}"
441
+ f" --max_grad_norm {max_grad_norm}"
442
+ f" --epochs {epochs}"
443
+ f" --num_warmup_updates {num_warmup_updates}"
444
+ f" --save_per_updates {save_per_updates}"
445
+ f" --keep_last_n_checkpoints {keep_last_n_checkpoints}"
446
+ f" --last_per_updates {last_per_updates}"
447
+ f" --dataset_name {dataset_name}"
448
  )
449
 
450
  if finetune:
 
477
  epochs,
478
  num_warmup_updates,
479
  save_per_updates,
480
+ keep_last_n_checkpoints,
481
+ last_per_updates,
482
  finetune,
483
  file_checkpoint_train,
484
  tokenizer_type,
 
544
  output = stdout_queue.get_nowait()
545
  print(output, end="")
546
  match = re.search(
547
+ r"Epoch (\d+)/(\d+):\s+(\d+)%\|.*\[(\d+:\d+)<.*?loss=(\d+\.\d+), update=(\d+)", output
548
  )
549
  if match:
550
  current_epoch = match.group(1)
 
552
  percent_complete = match.group(3)
553
  elapsed_time = match.group(4)
554
  loss = match.group(5)
555
+ current_update = match.group(6)
556
  message = (
557
  f"Epoch: {current_epoch}/{total_epochs}, "
558
  f"Progress: {percent_complete}%, "
559
  f"Elapsed Time: {elapsed_time}, "
560
  f"Loss: {loss}, "
561
+ f"Update: {current_update}"
562
  )
563
  yield message, gr.update(interactive=False), gr.update(interactive=True)
564
  elif output.strip():
 
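As a sanity check on the progress-parsing pattern above, a line in that format can be matched as follows; the sample string is constructed to fit the regex, not captured from real training output:

```python
import re

# Pattern copied from the stdout parser above; the sample line is fabricated to
# show which fields the six capture groups pick out.
pattern = r"Epoch (\d+)/(\d+):\s+(\d+)%\|.*\[(\d+:\d+)<.*?loss=(\d+\.\d+), update=(\d+)"
sample = "Epoch 2/100:  45%|####      | [12:34<15:00, loss=0.123, update=4600]"

m = re.search(pattern, sample)
assert m is not None
assert m.groups() == ("2", "100", "45", "12:34", "0.123", "4600")
```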
797
  print(f"Error processing {file_audio}: {e}")
798
  continue
799
 
800
+ if duration < 1 or duration > 30:
801
+ if duration > 30:
802
+ error_files.append([file_audio, "duration > 30 sec"])
803
  if duration < 1:
804
  error_files.append([file_audio, "duration < 1 sec "])
805
  continue
806
  if len(text) < 3:
807
+ error_files.append([file_audio, "text shorter than 3 characters"])
808
  continue
809
 
810
  text = clear_text(text)
 
871
 
872
  def calculate_train(
873
  name_project,
874
+ epochs,
875
+ learning_rate,
876
+ batch_size_per_gpu,
877
  batch_size_type,
878
  max_samples,
 
879
  num_warmup_updates,
 
 
880
  finetune,
881
  ):
882
  path_project = os.path.join(path_data, name_project)
883
+ file_duration = os.path.join(path_project, "duration.json")
884
 
885
+ hop_length = 256
886
+ sampling_rate = 24000
887
+
888
+ if not os.path.isfile(file_duration):
889
  return (
890
+ epochs,
891
+ learning_rate,
892
+ batch_size_per_gpu,
893
  max_samples,
894
  num_warmup_updates,
 
 
895
  "project not found !",
 
896
  )
897
 
898
+ with open(file_duration, "r") as file:
899
  data = json.load(file)
900
 
901
  duration_list = data["duration"]
902
+ max_sample_length = max(duration_list) * sampling_rate / hop_length
903
+ total_samples = len(duration_list)
904
+ total_duration = sum(duration_list)
 
 
 
 
 
905
 
906
  if torch.cuda.is_available():
907
  gpu_count = torch.cuda.device_count()
 
909
  for i in range(gpu_count):
910
  gpu_properties = torch.cuda.get_device_properties(i)
911
  total_memory += gpu_properties.total_memory / (1024**3) # in GB
912
+ elif torch.xpu.is_available():
913
+ gpu_count = torch.xpu.device_count()
914
+ total_memory = 0
915
+ for i in range(gpu_count):
916
+ gpu_properties = torch.xpu.get_device_properties(i)
917
+ total_memory += gpu_properties.total_memory / (1024**3)
918
  elif torch.backends.mps.is_available():
919
  gpu_count = 1
920
  total_memory = psutil.virtual_memory().available / (1024**3)
921
 
922
+ avg_gpu_memory = total_memory / gpu_count
923
+
924
+ # rough estimate of batch size
925
  if batch_size_type == "frame":
926
+ batch_size_per_gpu = max(int(38400 * (avg_gpu_memory - 5) / 75), int(max_sample_length))
927
+ elif batch_size_type == "sample":
928
+ batch_size_per_gpu = int(200 / (total_duration / total_samples))
 
 
 
 
929
 
930
+ if total_samples < 64:
931
+ max_samples = int(total_samples * 0.25)
932
 
933
+ num_warmup_updates = max(num_warmup_updates, int(total_samples * 0.05))
934
+
935
+ # take 1.2M updates as the maximum
936
+ max_updates = 1200000
937
+
938
+ if batch_size_type == "frame":
939
+ mini_batch_duration = batch_size_per_gpu * gpu_count * hop_length / sampling_rate
940
+ updates_per_epoch = total_duration / mini_batch_duration
941
+ elif batch_size_type == "sample":
942
+ updates_per_epoch = total_samples / batch_size_per_gpu / gpu_count
943
+
944
+ epochs = int(max_updates / updates_per_epoch)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
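To make the heuristic above concrete, here is the arithmetic for a hypothetical single 24 GB GPU with 10 hours of data and frame-based batching (all inputs are assumptions, and max_sample_length is assumed to be below the memory-based estimate):

```python
# Hypothetical inputs: one 24 GB GPU, 10 hours of audio, batch_size_type == "frame".
avg_gpu_memory = 24          # GB
gpu_count = 1
total_duration = 10 * 3600   # seconds of audio in the dataset
hop_length, sampling_rate = 256, 24000

batch_size_per_gpu = int(38400 * (avg_gpu_memory - 5) / 75)                         # 9728 frames
mini_batch_duration = batch_size_per_gpu * gpu_count * hop_length / sampling_rate   # ~103.8 s
updates_per_epoch = total_duration / mini_batch_duration                            # ~347 updates
epochs = int(1_200_000 / updates_per_epoch)                                          # ~3458, from the 1.2M-update cap
```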
945
 
946
  if finetune:
947
  learning_rate = 1e-5
 
949
  learning_rate = 7.5e-5
950
 
951
  return (
952
+ epochs,
953
+ learning_rate,
954
  batch_size_per_gpu,
955
  max_samples,
956
  num_warmup_updates,
957
+ total_samples,
 
 
 
 
958
  )
959
 
960
 
961
  def extract_and_save_ema_model(checkpoint_path: str, new_checkpoint_path: str, safetensors: bool) -> str:
962
  try:
963
+ checkpoint = torch.load(checkpoint_path, weights_only=True)
964
  print("Original Checkpoint Keys:", checkpoint.keys())
965
 
966
  ema_model_state_dict = checkpoint.get("ema_model_state_dict", None)
 
991
  torch.backends.cudnn.deterministic = True
992
  torch.backends.cudnn.benchmark = False
993
 
994
+ if ckpt_path.endswith(".safetensors"):
995
+ ckpt = load_file(ckpt_path, device="cpu")
996
+ ckpt = {"ema_model_state_dict": ckpt}
997
+ elif ckpt_path.endswith(".pt"):
998
+ ckpt = torch.load(ckpt_path, map_location="cpu")
999
 
1000
  ema_sd = ckpt.get("ema_model_state_dict", {})
1001
  embed_key_ema = "ema_model.transformer.text_embed.text_embed.weight"
 
1063
  with open(file_vocab_project, "w", encoding="utf-8") as f:
1064
  f.write("\n".join(vocab))
1065
 
1066
+ if model_type == "F5TTS_v1_Base":
1067
+ ckpt_path = str(cached_path("hf://SWivid/F5-TTS/F5TTS_v1_Base/model_1250000.safetensors"))
1068
+ elif model_type == "F5TTS_Base":
1069
  ckpt_path = str(cached_path("hf://SWivid/F5-TTS/F5TTS_Base/model_1200000.pt"))
1070
+ elif model_type == "E2TTS_Base":
1071
  ckpt_path = str(cached_path("hf://SWivid/E2-TTS/E2TTS_Base/model_1200000.pt"))
1072
 
1073
  vocab_size_new = len(miss_symbols)
 
1075
  dataset_name = name_project.replace("_pinyin", "").replace("_char", "")
1076
  new_ckpt_path = os.path.join(path_project_ckpts, dataset_name)
1077
  os.makedirs(new_ckpt_path, exist_ok=True)
1078
+
1079
+ # Add a pretrained_ prefix to the copied checkpoint for consistency with finetune_cli.py
1080
+ new_ckpt_file = os.path.join(new_ckpt_path, "pretrained_" + os.path.basename(ckpt_path))
1081
 
1082
  size = expand_model_embeddings(ckpt_path, new_ckpt_file, num_new_tokens=vocab_size_new)
1083
 
 
1207
  vocab_file = os.path.join(path_data, project, "vocab.txt")
1208
 
1209
  tts_api = F5TTS(
1210
+ model=exp_name, ckpt_file=file_checkpoint, vocab_file=vocab_file, device=device_test, use_ema=use_ema
1211
  )
1212
 
1213
  print("update >> ", device_test, file_checkpoint, use_ema)
1214
 
1215
  with tempfile.NamedTemporaryFile(delete=False, suffix=".wav") as f:
1216
  tts_api.infer(
 
 
1217
  ref_file=ref_audio,
1218
+ ref_text=ref_text.lower().strip(),
1219
+ gen_text=gen_text.lower().strip(),
1220
  nfe_step=nfe_step,
 
1221
  speed=speed,
 
1222
  remove_silence=remove_silence,
1223
+ file_wave=f.name,
1224
+ seed=seed,
1225
  )
1226
  return f.name, tts_api.device, str(tts_api.seed)
1227
 
 
1237
 
1238
  if os.path.isdir(path_project_ckpts):
1239
  files_checkpoints = glob(os.path.join(path_project_ckpts, project_name, "*.pt"))
1240
+ # Separate pretrained and regular checkpoints
1241
+ pretrained_checkpoints = [f for f in files_checkpoints if "pretrained_" in os.path.basename(f)]
1242
+ regular_checkpoints = [
1243
+ f
1244
+ for f in files_checkpoints
1245
+ if "pretrained_" not in os.path.basename(f) and "model_last.pt" not in os.path.basename(f)
1246
+ ]
1247
+ last_checkpoint = [f for f in files_checkpoints if "model_last.pt" in os.path.basename(f)]
1248
+
1249
+ # Sort regular checkpoints by number
1250
+ regular_checkpoints = sorted(
1251
+ regular_checkpoints, key=lambda x: int(os.path.basename(x).split("_")[1].split(".")[0])
1252
  )
1253
+
1254
+ # Combine in order: pretrained, regular, last
1255
+ files_checkpoints = pretrained_checkpoints + regular_checkpoints + last_checkpoint
1256
  else:
1257
  files_checkpoints = []
1258
 
 
1303
  f"Allocated GPU memory (GPU {i}): {allocated_memory:.2f} MB\n"
1304
  f"Reserved GPU memory (GPU {i}): {reserved_memory:.2f} MB\n\n"
1305
  )
1306
+ elif torch.xpu.is_available():
1307
+ gpu_count = torch.xpu.device_count()
1308
+ for i in range(gpu_count):
1309
+ gpu_name = torch.xpu.get_device_name(i)
1310
+ gpu_properties = torch.xpu.get_device_properties(i)
1311
+ total_memory = gpu_properties.total_memory / (1024**3) # in GB
1312
+ allocated_memory = torch.xpu.memory_allocated(i) / (1024**2) # in MB
1313
+ reserved_memory = torch.xpu.memory_reserved(i) / (1024**2) # in MB
1314
 
1315
+ gpu_stats += (
1316
+ f"GPU {i} Name: {gpu_name}\n"
1317
+ f"Total GPU memory (GPU {i}): {total_memory:.2f} GB\n"
1318
+ f"Allocated GPU memory (GPU {i}): {allocated_memory:.2f} MB\n"
1319
+ f"Reserved GPU memory (GPU {i}): {reserved_memory:.2f} MB\n\n"
1320
+ )
1321
  elif torch.backends.mps.is_available():
1322
  gpu_count = 1
1323
  gpu_stats += "MPS GPU\n"
 
1380
  with gr.Blocks() as app:
1381
  gr.Markdown(
1382
  """
1383
+ # F5 TTS Automatic Finetune
1384
 
1385
+ This is a local web UI for F5 TTS finetuning support. This app supports the following TTS models:
1386
 
1387
  * [F5-TTS](https://arxiv.org/abs/2410.06885) (A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching)
1388
  * [E2 TTS](https://arxiv.org/abs/2406.18009) (Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS)
1389
 
1390
+ The pretrained checkpoints support English and Chinese.
1391
 
1392
  For tutorials and updates, check here (https://github.com/SWivid/F5-TTS/discussions/143)
1393
  """
 
1464
  Using the extended model, you can finetune on a new language whose symbols are missing from the vocab. This creates a new model with a new vocabulary size and saves it in your ckpts/project folder.
1465
  ```""")
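The heavy lifting here is done by expand_model_embeddings further down; conceptually it amounts to appending rows to the checkpoint's text-embedding matrix for the newly added symbols. A minimal sketch of that idea, not the exact implementation:

```python
import torch

def expand_text_embedding(weight: torch.Tensor, num_new_tokens: int) -> torch.Tensor:
    # weight: (vocab_size, dim) text-embedding matrix taken from the checkpoint.
    # Existing rows are kept untouched; new symbols get small random rows appended.
    vocab_size, dim = weight.shape
    new_rows = torch.randn(num_new_tokens, dim) * weight.std()
    return torch.cat([weight, new_rows], dim=0)  # (vocab_size + num_new_tokens, dim)
```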
1466
 
1467
+ exp_name_extend = gr.Radio(
1468
+ label="Model", choices=["F5TTS_v1_Base", "F5TTS_Base", "E2TTS_Base"], value="F5TTS_v1_Base"
1469
+ )
1470
 
1471
  with gr.Row():
1472
  txt_extend = gr.Textbox(
 
1535
  fn=get_random_sample_prepare, inputs=[cm_project], outputs=[random_text_prepare, random_audio_prepare]
1536
  )
1537
 
1538
+ with gr.TabItem("Train Model"):
1539
  gr.Markdown("""```plaintext
1540
+ The auto-setting is still experimental. If you are unsure, set a large number of epochs, and keep only the last N checkpoints if disk space is limited.
1541
  If you encounter a memory error, try reducing the batch size per GPU to a smaller number.
1542
  ```""")
1543
  with gr.Row():
 
1551
  file_checkpoint_train = gr.Textbox(label="Path to the Pretrained Checkpoint", value="")
1552
 
1553
  with gr.Row():
1554
+ exp_name = gr.Radio(
1555
+ label="Model", choices=["F5TTS_v1_Base", "F5TTS_Base", "E2TTS_Base"], value="F5TTS_v1_Base"
1556
+ )
1557
  learning_rate = gr.Number(label="Learning Rate", value=1e-5, step=1e-5)
1558
 
1559
  with gr.Row():
1560
+ batch_size_per_gpu = gr.Number(label="Batch Size per GPU", value=3200)
1561
  max_samples = gr.Number(label="Max Samples", value=64)
1562
 
1563
  with gr.Row():
 
1565
  max_grad_norm = gr.Number(label="Max Gradient Norm", value=1.0)
1566
 
1567
  with gr.Row():
1568
+ epochs = gr.Number(label="Epochs", value=100)
1569
+ num_warmup_updates = gr.Number(label="Warmup Updates", value=100)
1570
 
1571
  with gr.Row():
1572
+ save_per_updates = gr.Number(label="Save per Updates", value=500)
1573
+ keep_last_n_checkpoints = gr.Number(
1574
+ label="Keep Last N Checkpoints",
1575
+ value=-1,
1576
+ step=1,
1577
+ precision=0,
1578
+ info="-1 to keep all, 0 to not save intermediate, > 0 to keep last N checkpoints",
1579
+ )
1580
+ last_per_updates = gr.Number(label="Last per Updates", value=100)
1581
 
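The retention control above follows the semantics spelled out in its info text; a small sketch of that policy (illustrative only, not the Trainer's actual code):

```python
def checkpoints_to_keep(saved_checkpoints: list, keep_last_n: int) -> list:
    # -1 keeps every intermediate checkpoint, 0 keeps none of them,
    # and N > 0 keeps only the most recent N (model_last is handled separately).
    if keep_last_n < 0:
        return saved_checkpoints
    if keep_last_n == 0:
        return []
    return saved_checkpoints[-keep_last_n:]
```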
1582
  with gr.Row():
1583
  ch_8bit_adam = gr.Checkbox(label="Use 8-bit Adam optimizer")
1584
+ mixed_precision = gr.Radio(label="mixed_precision", choices=["none", "fp16", "bf16"], value="fp16")
1585
  cd_logger = gr.Radio(label="logger", choices=["wandb", "tensorboard"], value="wandb")
1586
  start_button = gr.Button("Start Training")
1587
  stop_button = gr.Button("Stop Training", interactive=False)
1588
 
1589
  if projects_selelect is not None:
1590
  (
1591
+ exp_name_value,
1592
+ learning_rate_value,
1593
+ batch_size_per_gpu_value,
1594
+ batch_size_type_value,
1595
+ max_samples_value,
1596
+ grad_accumulation_steps_value,
1597
+ max_grad_norm_value,
1598
+ epochs_value,
1599
+ num_warmup_updates_value,
1600
+ save_per_updates_value,
1601
+ keep_last_n_checkpoints_value,
1602
+ last_per_updates_value,
1603
+ finetune_value,
1604
+ file_checkpoint_train_value,
1605
+ tokenizer_type_value,
1606
+ tokenizer_file_value,
1607
+ mixed_precision_value,
1608
+ logger_value,
1609
+ bnb_optimizer_value,
1610
  ) = load_settings(projects_selelect)
1611
+
1612
+ # Assigning values to the respective components
1613
+ exp_name.value = exp_name_value
1614
+ learning_rate.value = learning_rate_value
1615
+ batch_size_per_gpu.value = batch_size_per_gpu_value
1616
+ batch_size_type.value = batch_size_type_value
1617
+ max_samples.value = max_samples_value
1618
+ grad_accumulation_steps.value = grad_accumulation_steps_value
1619
+ max_grad_norm.value = max_grad_norm_value
1620
+ epochs.value = epochs_value
1621
+ num_warmup_updates.value = num_warmup_updates_value
1622
+ save_per_updates.value = save_per_updates_value
1623
+ keep_last_n_checkpoints.value = keep_last_n_checkpoints_value
1624
+ last_per_updates.value = last_per_updates_value
1625
+ ch_finetune.value = finetune_value
1626
+ file_checkpoint_train.value = file_checkpoint_train_value
1627
+ tokenizer_type.value = tokenizer_type_value
1628
+ tokenizer_file.value = tokenizer_file_value
1629
+ mixed_precision.value = mixed_precision_value
1630
+ cd_logger.value = logger_value
1631
+ ch_8bit_adam.value = bnb_optimizer_value
1632
 
1633
  ch_stream = gr.Checkbox(label="Stream Output Experiment", value=True)
1634
  txt_info_train = gr.Text(label="Info", value="")
 
1679
  epochs,
1680
  num_warmup_updates,
1681
  save_per_updates,
1682
+ keep_last_n_checkpoints,
1683
+ last_per_updates,
1684
  ch_finetune,
1685
  file_checkpoint_train,
1686
  tokenizer_type,
 
1698
  fn=calculate_train,
1699
  inputs=[
1700
  cm_project,
1701
+ epochs,
1702
+ learning_rate,
1703
+ batch_size_per_gpu,
1704
  batch_size_type,
1705
  max_samples,
 
1706
  num_warmup_updates,
 
 
1707
  ch_finetune,
1708
  ],
1709
  outputs=[
1710
+ epochs,
1711
+ learning_rate,
1712
  batch_size_per_gpu,
1713
  max_samples,
1714
  num_warmup_updates,
 
 
1715
  lb_samples,
 
 
1716
  ],
1717
  )
1718
 
 
1732
  epochs,
1733
  num_warmup_updates,
1734
  save_per_updates,
1735
+ keep_last_n_checkpoints,
1736
+ last_per_updates,
1737
  ch_finetune,
1738
  file_checkpoint_train,
1739
  tokenizer_type,
1740
  tokenizer_file,
1741
  mixed_precision,
1742
  cd_logger,
1743
+ ch_8bit_adam,
1744
  ]
 
1745
  return output_components
1746
 
1747
  outputs = setup_load_settings()
 
1762
  gr.Markdown("""```plaintext
1763
  SOS: Check the use_ema setting (True or False) for your model to see which works best for you. Use seed -1 for a random seed.
1764
  ```""")
1765
+ exp_name = gr.Radio(
1766
+ label="Model", choices=["F5TTS_v1_Base", "F5TTS_Base", "E2TTS_Base"], value="F5TTS_v1_Base"
1767
+ )
1768
  list_checkpoints, checkpoint_select = get_checkpoints_project(projects_selelect, False)
1769
 
1770
  with gr.Row():
 
1818
  bt_checkpoint_refresh.click(fn=get_checkpoints_project, inputs=[cm_project], outputs=[cm_checkpoint])
1819
  cm_project.change(fn=get_checkpoints_project, inputs=[cm_project], outputs=[cm_checkpoint])
1820
 
1821
+ with gr.TabItem("Prune Checkpoint"):
1822
  gr.Markdown("""```plaintext
1823
+ Reduce the Base model checkpoint size from 5GB to 1.3GB. The new checkpoint prunes out the optimizer state and other training-only data; it can be used for inference or finetuning afterward, but it cannot be used to resume pretraining.
1824
  ```""")
1825
  txt_path_checkpoint = gr.Text(label="Path to Checkpoint:")
1826
  txt_path_checkpoint_small = gr.Text(label="Path to Output:")
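For context on the Prune Checkpoint tab above: pruning keeps only the EMA weights and drops the optimizer state, which is why the result can still be used for inference or finetuning but not to resume pretraining. A minimal sketch of the idea, assuming the checkpoint stores its EMA weights under "ema_model_state_dict" as extract_and_save_ema_model expects (paths are placeholders):

```python
import torch

ckpt = torch.load("ckpts/my_project/model_last.pt", weights_only=True)  # full training checkpoint
ema_sd = ckpt["ema_model_state_dict"]                                   # keep only the EMA weights
torch.save({"ema_model_state_dict": ema_sd}, "ckpts/my_project/model_pruned.pt")
```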
src/f5_tts/train/train.py CHANGED
@@ -4,8 +4,9 @@ import os
4
  from importlib.resources import files
5
 
6
  import hydra
 
7
 
8
- from f5_tts.model import CFM, DiT, Trainer, UNetT
9
  from f5_tts.model.dataset import load_dataset
10
  from f5_tts.model.utils import get_tokenizer
11
 
@@ -14,9 +15,13 @@ os.chdir(str(files("f5_tts").joinpath("../.."))) # change working directory to
14
 
15
  @hydra.main(version_base="1.3", config_path=str(files("f5_tts").joinpath("configs")), config_name=None)
16
  def main(cfg):
 
 
17
  tokenizer = cfg.model.tokenizer
18
  mel_spec_type = cfg.model.mel_spec.mel_spec_type
 
19
  exp_name = f"{cfg.model.name}_{mel_spec_type}_{cfg.model.tokenizer}_{cfg.datasets.name}"
 
20
 
21
  # set text tokenizer
22
  if tokenizer != "custom":
@@ -26,14 +31,8 @@ def main(cfg):
26
  vocab_char_map, vocab_size = get_tokenizer(tokenizer_path, tokenizer)
27
 
28
  # set model
29
- if "F5TTS" in cfg.model.name:
30
- model_cls = DiT
31
- elif "E2TTS" in cfg.model.name:
32
- model_cls = UNetT
33
- wandb_resume_id = None
34
-
35
  model = CFM(
36
- transformer=model_cls(**cfg.model.arch, text_num_embeds=vocab_size, mel_dim=cfg.model.mel_spec.n_mel_channels),
37
  mel_spec_kwargs=cfg.model.mel_spec,
38
  vocab_char_map=vocab_char_map,
39
  )
@@ -45,8 +44,9 @@ def main(cfg):
45
  learning_rate=cfg.optim.learning_rate,
46
  num_warmup_updates=cfg.optim.num_warmup_updates,
47
  save_per_updates=cfg.ckpts.save_per_updates,
 
48
  checkpoint_path=str(files("f5_tts").joinpath(f"../../{cfg.ckpts.save_dir}")),
49
- batch_size=cfg.datasets.batch_size_per_gpu,
50
  batch_size_type=cfg.datasets.batch_size_type,
51
  max_samples=cfg.datasets.max_samples,
52
  grad_accumulation_steps=cfg.optim.grad_accumulation_steps,
@@ -55,12 +55,13 @@ def main(cfg):
55
  wandb_project="CFM-TTS",
56
  wandb_run_name=exp_name,
57
  wandb_resume_id=wandb_resume_id,
58
- last_per_steps=cfg.ckpts.last_per_steps,
59
- log_samples=True,
60
  bnb_optimizer=cfg.optim.bnb_optimizer,
61
  mel_spec_type=mel_spec_type,
62
  is_local_vocoder=cfg.model.vocoder.is_local,
63
  local_vocoder_path=cfg.model.vocoder.local_path,
 
64
  )
65
 
66
  train_dataset = load_dataset(cfg.datasets.name, tokenizer, mel_spec_kwargs=cfg.model.mel_spec)
 
4
  from importlib.resources import files
5
 
6
  import hydra
7
+ from omegaconf import OmegaConf
8
 
9
+ from f5_tts.model import CFM, DiT, UNetT, Trainer # noqa: F401. used for config
10
  from f5_tts.model.dataset import load_dataset
11
  from f5_tts.model.utils import get_tokenizer
12
 
 
15
 
16
  @hydra.main(version_base="1.3", config_path=str(files("f5_tts").joinpath("configs")), config_name=None)
17
  def main(cfg):
18
+ model_cls = globals()[cfg.model.backbone]
19
+ model_arc = cfg.model.arch
20
  tokenizer = cfg.model.tokenizer
21
  mel_spec_type = cfg.model.mel_spec.mel_spec_type
22
+
23
  exp_name = f"{cfg.model.name}_{mel_spec_type}_{cfg.model.tokenizer}_{cfg.datasets.name}"
24
+ wandb_resume_id = None
25
 
26
  # set text tokenizer
27
  if tokenizer != "custom":
 
31
  vocab_char_map, vocab_size = get_tokenizer(tokenizer_path, tokenizer)
32
 
33
  # set model
 
 
 
 
 
 
34
  model = CFM(
35
+ transformer=model_cls(**model_arc, text_num_embeds=vocab_size, mel_dim=cfg.model.mel_spec.n_mel_channels),
36
  mel_spec_kwargs=cfg.model.mel_spec,
37
  vocab_char_map=vocab_char_map,
38
  )
 
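train.py now resolves the backbone class by name from the Hydra config instead of hard-coding it; a minimal sketch of that lookup (the example value is an assumption about what cfg.model.backbone may contain, matching the old F5TTS-to-DiT mapping):

```python
from f5_tts.model import CFM, DiT, UNetT, Trainer  # noqa: F401

backbone_name = "DiT"                 # e.g. cfg.model.backbone from a config such as F5TTS_v1_Base.yaml
model_cls = globals()[backbone_name]  # resolve the class object by its imported name
assert model_cls is DiT
```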
44
  learning_rate=cfg.optim.learning_rate,
45
  num_warmup_updates=cfg.optim.num_warmup_updates,
46
  save_per_updates=cfg.ckpts.save_per_updates,
47
+ keep_last_n_checkpoints=cfg.ckpts.keep_last_n_checkpoints,
48
  checkpoint_path=str(files("f5_tts").joinpath(f"../../{cfg.ckpts.save_dir}")),
49
+ batch_size_per_gpu=cfg.datasets.batch_size_per_gpu,
50
  batch_size_type=cfg.datasets.batch_size_type,
51
  max_samples=cfg.datasets.max_samples,
52
  grad_accumulation_steps=cfg.optim.grad_accumulation_steps,
 
55
  wandb_project="CFM-TTS",
56
  wandb_run_name=exp_name,
57
  wandb_resume_id=wandb_resume_id,
58
+ last_per_updates=cfg.ckpts.last_per_updates,
59
+ log_samples=cfg.ckpts.log_samples,
60
  bnb_optimizer=cfg.optim.bnb_optimizer,
61
  mel_spec_type=mel_spec_type,
62
  is_local_vocoder=cfg.model.vocoder.is_local,
63
  local_vocoder_path=cfg.model.vocoder.local_path,
64
+ cfg_dict=OmegaConf.to_container(cfg, resolve=True),
65
  )
66
 
67
  train_dataset = load_dataset(cfg.datasets.name, tokenizer, mel_spec_kwargs=cfg.model.mel_spec)