Update app.py
app.py
CHANGED
@@ -237,33 +237,48 @@ def create_interface():
237   # """)
238
239   # Evaluation Criteria
240 - with gr.Row():
241 -
242 -
243 -
244 -
245 -
246 -
247 -
248 -
249 -
250 -
251 -
252 -
253 -
254 -
255 -
256 -
257 -
258 -
259 -
260 -
261 -
262 -
263 -
264 -
265
266 - gr.Markdown("---")
267
268   # Search and Filter Section
269   with gr.Row():

237   # """)
238
239   # Evaluation Criteria
240 + # with gr.Row():
241 + # with gr.Column():
242 + # gr.HTML("""
243 + # <div style="text-align: center; padding: 1rem; background: rgba(102, 126, 234, 0.1); border-radius: 8px;">
244 + # <div style="font-size: 2rem;">🎭</div>
245 + # <strong>Naturalness</strong><br>
246 + # <small>Human-like quality & emotional expression</small>
247 + # </div>
248 + # """)
249 + # with gr.Column():
250 + # gr.HTML("""
251 + # <div style="text-align: center; padding: 1rem; background: rgba(102, 126, 234, 0.1); border-radius: 8px;">
252 + # <div style="font-size: 2rem;">🗣️</div>
253 + # <strong>Intelligibility</strong><br>
254 + # <small>Clarity & pronunciation accuracy</small>
255 + # </div>
256 + # """)
257 + # with gr.Column():
258 + # gr.HTML("""
259 + # <div style="text-align: center; padding: 1rem; background: rgba(102, 126, 234, 0.1); border-radius: 8px;">
260 + # <div style="font-size: 2rem;">🎛️</div>
261 + # <strong>Controllability</strong><br>
262 + # <small>Tone, pace & parameter flexibility</small>
263 + # </div>
264 + # """)
265
266 + # gr.Markdown("---")
267 + gr.Markdown("""
268 + ## 🔑 Key Findings
269 +
270 + 1. **Outstanding Speech Quality**
271 + Several models (**Kokoro-82M**, **csm-1b**, **Spark-TTS-0.5B**, **Orpheus-3b-0.1-ft**, **F5-TTS**, and **Llasa-3B**) delivered exceptionally natural, clear, and realistic synthesized speech. Among these, **csm-1b** and **F5-TTS** stood out as the most well-rounded: they combined top-tier naturalness and intelligibility with solid controllability.
272 +
273 + 2. **Superior Controllability**
274 + **Zonos-v0.1-transformer** emerged as the leader in fine-grained control: it offers detailed adjustments for prosody, emotion, and audio quality, making it ideal for use cases that demand precise voice modulation.
275 +
276 + 3. **Performance vs. Footprint Trade-off**
277 + Smaller models (e.g., **Kokoro-82M** at 82 million parameters) can still achieve “Good” or “Excellent” ratings in many scenarios, which makes them a strong choice when efficient inference or low VRAM usage is critical. Larger models (1 billion to 3 billion+ parameters) generally offer more versatility, such as multilingual synthesis, zero-shot voice cloning, and multi-speaker generation, but require heavier compute resources.
278 +
279 + 4. **Special Notes on Multilingual & Cloning Capabilities**
280 + **Spark-TTS-0.5B** and **XTTS-v2** excel at cross-lingual and zero-shot voice cloning, making them strong candidates for projects that need multi-language support or short-clip cloning. **Llama-OuteTTS-1.0-1B** and **MegaTTS3** also offer multilingual input handling, though they may require careful tuning of sampling parameters to achieve optimal results.
281 + """)
282
283   # Search and Filter Section
284   with gr.Row():
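
The three commented-out criteria cards in this hunk share one `<div>` template and differ only in icon, title, and caption. If they are re-enabled later, a small helper could generate them instead of repeating the markup. The sketch below is illustrative only: `criterion_card`, `CARD_STYLE`, and `CRITERIA` are hypothetical names that do not appear in app.py, and escaping `&` as `&amp;` is an added choice that the original markup does not make.

```python
from html import escape

# Inline style copied from the commented-out cards in the diff.
CARD_STYLE = (
    "text-align: center; padding: 1rem; "
    "background: rgba(102, 126, 234, 0.1); border-radius: 8px;"
)

# (icon, title, caption) for each criterion, taken from the diff.
CRITERIA = [
    ("🎭", "Naturalness", "Human-like quality & emotional expression"),
    ("🗣️", "Intelligibility", "Clarity & pronunciation accuracy"),
    ("🎛️", "Controllability", "Tone, pace & parameter flexibility"),
]


def criterion_card(icon: str, title: str, caption: str) -> str:
    """Build the HTML for one card (hypothetical helper, not part of app.py)."""
    return (
        f'<div style="{CARD_STYLE}">'
        f'<div style="font-size: 2rem;">{icon}</div>'
        f"<strong>{escape(title)}</strong><br>"
        f"<small>{escape(caption)}</small>"
        "</div>"
    )


# One HTML string per criterion, in the order they appear in the diff.
cards = [criterion_card(*criterion) for criterion in CRITERIA]

# Inside the Blocks layout, the cards could then be rendered roughly as:
# with gr.Row():
#     for card in cards:
#         with gr.Column():
#             gr.HTML(card)
```

Keeping the card data in one list means a fourth criterion would be a one-line change, and the shared style lives in a single place.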