win10 committed (verified) · Commit 0894981 · Parent(s): 857affb

Update README.md

Files changed (1): README.md (+164, -3)

README.md CHANGED
@@ -1,3 +1,164 @@
- ---
- license: apache-2.0
- ---
---
license: cc-by-nc-4.0
datasets:
- facebook/multilingual_librispeech
- parler-tts/libritts_r_filtered
- amphion/Emilia-Dataset
- parler-tts/mls_eng
language:
- en
- zh
- ja
- ko
pipeline_tag: text-to-speech
---
## Model Description

OuteTTS-0.2-500M is our improved successor to the v0.1 release.
The model maintains the same approach of using audio prompts, without architectural changes to the foundation model itself.
Built upon Qwen-2.5-0.5B, this version was trained on larger and more diverse datasets, resulting in significant improvements across all aspects of performance.

Special thanks to **Hugging Face** for providing a GPU grant that supported the training of this model.

## Key Improvements

- **Enhanced Accuracy**: Significantly improved prompt following and output coherence compared to the previous version
- **Natural Speech**: Produces more natural and fluid speech synthesis
- **Expanded Vocabulary**: Trained on over 5 billion audio prompt tokens
- **Voice Cloning**: Improved voice cloning capabilities with greater diversity and accuracy
- **Multilingual Support**: New experimental support for Chinese, Japanese, and Korean

## Speech Demo

<video width="1280" height="720" controls>
  <source src="https://huggingface.co/OuteAI/OuteTTS-0.2-500M-GGUF/resolve/main/media/demo.mp4" type="video/mp4">
  Your browser does not support the video tag.
</video>

## Usage

### Installation

[![GitHub](https://img.shields.io/badge/GitHub-OuteTTS-181717?logo=github)](https://github.com/edwko/OuteTTS)

```bash
pip install outetts
```
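If you plan to use the GGUF build shown below, the GGUF interface is backed by llama.cpp and may additionally require `llama-cpp-python`; treat the line below as an assumption and check the OuteTTS repository for the exact requirement on your platform.

```bash
# Assumed optional dependency for the GGUF interface (verify against the OuteTTS docs)
pip install llama-cpp-python
```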
### Interface Usage

```python
import outetts

# Configure the model
model_config = outetts.HFModelConfig_v1(
    model_path="OuteAI/OuteTTS-0.2-500M",
    language="en",  # Supported languages in v0.2: en, zh, ja, ko
)

# Initialize the interface
interface = outetts.InterfaceHF(model_version="0.2", cfg=model_config)

# Optional: Create a speaker profile (use a 10-15 second audio clip)
# speaker = interface.create_speaker(
#     audio_path="path/to/audio/file",
#     transcript="Transcription of the audio file."
# )

# Optional: Save and load speaker profiles
# interface.save_speaker(speaker, "speaker.json")
# speaker = interface.load_speaker("speaker.json")

# Optional: Load a speaker from the default presets
interface.print_default_speakers()
speaker = interface.load_default_speaker(name="male_1")

output = interface.generate(
    text="Speech synthesis is the artificial production of human speech. A computer system used for this purpose is called a speech synthesizer, and it can be implemented in software or hardware products.",
    # Lower temperature values may result in a more stable tone,
    # while higher values can introduce varied and expressive speech
    temperature=0.1,
    repetition_penalty=1.1,
    max_length=4096,

    # Optional: Use a speaker profile for consistent voice characteristics
    # Without a speaker profile, the model will generate a voice with random characteristics
    speaker=speaker,
)

# Save the synthesized speech to a file
output.save("output.wav")

# Optional: Play the synthesized speech
# output.play()
```
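Because the interface and speaker objects are plain Python values, they can be reused across several generations. The loop below is a minimal sketch built only from the calls shown above; the sentence list and output file names are illustrative placeholders.

```python
# Reuse the interface and speaker loaded above for several sentences.
sentences = [
    "Thanks for listening to this short demo.",
    "Each sentence is synthesized with the same speaker profile.",
]

for i, sentence in enumerate(sentences):
    output = interface.generate(
        text=sentence,
        temperature=0.1,
        repetition_penalty=1.1,
        max_length=4096,
        speaker=speaker,
    )
    output.save(f"output_{i}.wav")  # one WAV file per sentence
```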
## Using GGUF Model

```python
# Configure the GGUF model
model_config = outetts.GGUFModelConfig_v1(
    model_path="local/path/to/model.gguf",
    language="en",  # Supported languages in v0.2: en, zh, ja, ko
    n_gpu_layers=0,
)

# Initialize the GGUF interface
interface = outetts.InterfaceGGUF(model_version="0.2", cfg=model_config)
```
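Assuming the GGUF interface exposes the same `generate` and speaker methods as the Hugging Face interface above (the snippets in this card suggest this but do not show it explicitly), generation would look like the following sketch:

```python
# Sketch: generation with the GGUF interface, assuming API parity with InterfaceHF
speaker = interface.load_default_speaker(name="male_1")

output = interface.generate(
    text="This sentence is synthesized with the GGUF build of the model.",
    temperature=0.1,
    repetition_penalty=1.1,
    max_length=4096,
    speaker=speaker,
)
output.save("output_gguf.wav")
```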
## Configure the model with bfloat16 and flash attention

```python
import outetts
import torch

model_config = outetts.HFModelConfig_v1(
    model_path="OuteAI/OuteTTS-0.2-500M",
    language="en",  # Supported languages in v0.2: en, zh, ja, ko
    dtype=torch.bfloat16,
    additional_model_config={
        'attn_implementation': "flash_attention_2"
    }
)
```
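Note that `flash_attention_2` relies on the separate `flash-attn` package and a GPU/CUDA setup that supports it; if it is unavailable, omitting `additional_model_config` falls back to the default attention implementation. The command below is the package's usual install recipe, included here as an assumption to verify for your environment.

```bash
# Assumed prerequisite for attn_implementation="flash_attention_2" (verify for your setup)
pip install flash-attn --no-build-isolation
```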
## Creating a Speaker for Voice Cloning

To achieve the best results when creating a speaker profile, consider the following recommendations (a minimal end-to-end sketch follows the list):

1. **Audio Clip Duration:**
   - Use an audio clip of around **10-15 seconds**.
   - This duration provides sufficient data for the model to learn the speaker's characteristics while keeping the input manageable. The model's context length is 4096 tokens, allowing it to generate around 54 seconds of audio in total. However, when a speaker profile is included, this capacity is reduced proportionally to the length of the speaker's audio clip.

2. **Audio Quality:**
   - Ensure the audio is **clear and noise-free**. Background noise or distortions can reduce the model's ability to extract accurate voice features.

3. **Accurate Transcription:**
   - Provide a highly **accurate transcription** of the audio clip. Mismatches between the audio and transcription can lead to suboptimal results.

4. **Speaker Familiarity:**
   - The model performs best with voices that are similar to those seen during training. Using a voice that is **significantly different from typical training samples** (e.g., unique accents, rare vocal characteristics) might result in inaccurate replication.
   - In such cases, you may need to **fine-tune the model** specifically on your target speaker's voice to achieve a better representation.

5. **Parameter Adjustments:**
   - Adjust parameters like `temperature` in the `generate` function to refine the expressive quality and consistency of the synthesized voice.
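Below is a minimal end-to-end sketch that applies these recommendations using the `create_speaker`, `save_speaker`, and `generate` calls shown earlier. The file paths and transcript are placeholders, and the ~75 audio tokens per second used in the budget estimate is an assumption back-derived from "4096 tokens ≈ 54 seconds", not an official figure.

```python
import outetts

model_config = outetts.HFModelConfig_v1(
    model_path="OuteAI/OuteTTS-0.2-500M",
    language="en",
)
interface = outetts.InterfaceHF(model_version="0.2", cfg=model_config)

# Build a speaker profile from a clean 10-15 second reference clip
# with an accurate transcript (path and transcript are placeholders).
speaker = interface.create_speaker(
    audio_path="reference_clip.wav",
    transcript="Exact transcription of the reference clip.",
)
interface.save_speaker(speaker, "my_speaker.json")  # reuse later via load_speaker

# Rough generation budget: 4096 tokens at ~75 audio tokens/second (assumption,
# implied by "4096 tokens ~= 54 seconds"), minus the reference clip's share.
reference_seconds = 12
remaining_seconds = (4096 / 75) - reference_seconds
print(f"Roughly {remaining_seconds:.0f} seconds left for generated audio.")

output = interface.generate(
    text="A short sentence spoken in the cloned voice.",
    temperature=0.1,          # lower = more stable tone
    repetition_penalty=1.1,
    max_length=4096,
    speaker=speaker,
)
output.save("cloned_voice.wav")
```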
## Model Specifications
- **Base Model**: Qwen-2.5-0.5B
- **Parameter Count**: 500M
- **Language Support**:
  - Primary: English
  - Experimental: Chinese, Japanese, Korean
- **License**: CC BY NC 4.0

## Training Datasets
- Emilia-Dataset (CC BY NC 4.0)
- LibriTTS-R (CC BY 4.0)
- Multilingual LibriSpeech (MLS) (CC BY 4.0)

## Credits & References
- [WavTokenizer](https://github.com/jishengpeng/WavTokenizer)
- [CTC Forced Alignment](https://pytorch.org/audio/stable/tutorials/ctc_forced_alignment_api_tutorial.html)
- [Qwen-2.5-0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B)
- [OuteAI/OuteTTS-0.2-500M](https://huggingface.co/OuteAI/OuteTTS-0.2-500M)