---
license: mit
base_model: bilalfaye/speecht5_tts-wolof
tags:
- generated_from_trainer
model-index:
- name: speecht5_tts-wolof-v0.2
  results: []
language:
- wo
- fr
pipeline_tag: text-to-speech
---

# **speecht5_tts-wolof-v0.2**

This model is a fine-tuned version of [speecht5_tts-wolof](https://huggingface.co/bilalfaye/speecht5_tts-wolof) that extends text-to-speech (TTS) synthesis to both **Wolof and French**. It is based on Microsoft's [SpeechT5](https://huggingface.co/microsoft/speecht5_tts) and uses a **custom tokenizer** together with additional fine-tuning to improve synthesis quality in both languages.

## **Model Description**

This model builds on the `SpeechT5` architecture, a shared encoder-decoder framework that handles both speech recognition and speech synthesis. Fine-tuning adapts the original Wolof model so that it can **generate natural speech in both Wolof and French**. The overall architecture is unchanged; the additional training is intended to give a **more robust alignment** between input text and generated speech, improving pronunciation and fluency in both languages.
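
---

## **Installation Instructions for Users**

To install the necessary dependencies, run the following command (`sentencepiece` is needed by the SpeechT5 tokenizer):

```bash
pip install transformers datasets torch sentencepiece
```

The example below also uses `IPython` for audio playback, which is already available in Jupyter and Colab notebooks.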

## **Model Loading and Speech Generation Code**

```python
import torch
from transformers import SpeechT5ForTextToSpeech, SpeechT5Processor, SpeechT5HifiGan
from datasets import load_dataset
from IPython.display import Audio, display


def load_speech_model(checkpoint="bilalfaye/speecht5_tts-wolof-v0.2", vocoder_checkpoint="microsoft/speecht5_hifigan"):
    """Load the SpeechT5 model, processor, and vocoder for text-to-speech."""
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    processor = SpeechT5Processor.from_pretrained(checkpoint)
    model = SpeechT5ForTextToSpeech.from_pretrained(checkpoint).to(device)
    vocoder = SpeechT5HifiGan.from_pretrained(vocoder_checkpoint).to(device)

    return processor, model, vocoder, device


# Load the model
processor, model, vocoder, device = load_speech_model()

# Load speaker embeddings (x-vectors pretrained on the CMU ARCTIC dataset)
embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
speaker_embedding = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)


def generate_speech_from_text(text, speaker_embedding=speaker_embedding, processor=processor, model=model, vocoder=vocoder):
    """Generate speech from input text using SpeechT5 and the HiFi-GAN vocoder."""
    # Tokenize the text and move the tensors to the model's device
    inputs = processor(text=text, return_tensors="pt", padding=True, truncation=True, max_length=model.config.max_text_positions)
    inputs = {key: value.to(model.device) for key, value in inputs.items()}

    speech = model.generate(
        inputs["input_ids"],
        speaker_embeddings=speaker_embedding.to(model.device),
        vocoder=vocoder,
        num_beams=7,
        temperature=0.6,
        no_repeat_ngram_size=3,
        repetition_penalty=1.5,
    )

    # Move the waveform to the CPU and play it at 16 kHz
    speech = speech.detach().cpu().numpy()
    display(Audio(speech, rate=16000))


# Example usage: French
text = "Bonjour, bienvenue dans le modèle de synthèse vocale Wolof et Français."
generate_speech_from_text(text)

# Example usage: Wolof
text = "ñu ne ñoom ñooy nattukaay satélite yi"
generate_speech_from_text(text)
```
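
The example above plays audio inline in a notebook. Outside Jupyter, it can be more convenient to write the waveform to disk. The following is a minimal sketch and not part of the original model card: `save_wav` is a hypothetical helper built only on the Python standard library, and it assumes the waveform is a one-dimensional sequence of floats in [-1.0, 1.0] sampled at 16 kHz (the rate used in the example above). A synthetic tone stands in for model output so the snippet runs without downloading the model.

```python
import math
import struct
import wave


def save_wav(samples, path, sample_rate=16000):
    """Write mono float samples in [-1.0, 1.0] to a 16-bit PCM WAV file.

    Hypothetical helper for illustration; returns the duration in seconds.
    """
    # Clip each sample to the valid range, then scale to signed 16-bit integers.
    pcm = [int(max(-1.0, min(1.0, float(s))) * 32767) for s in samples]
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)            # mono
        wav_file.setsampwidth(2)            # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        # WAV stores PCM samples as little-endian integers.
        wav_file.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))
    return len(pcm) / sample_rate


# Stand-in for model output: one second of a 440 Hz tone at 16 kHz.
tone = [0.3 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
duration = save_wav(tone, "example.wav")
print(duration)
```

To use this with the model, `generate_speech_from_text` would need to return `speech` (as written, it only displays the audio); the returned one-dimensional NumPy array could then be passed directly, for example `save_wav(speech, "output.wav")`.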

---

## **Intended Uses & Limitations**

### **Intended Uses**

- **Multilingual TTS:** Converts **Wolof and French** text into natural-sounding speech.
- **Voice Assistants & Speech Interfaces:** Can be used for **audio-based applications** supporting both languages.
- **Linguistic Research:** Facilitates speech synthesis research in low-resource languages.

### **Limitations**

- **Data Dependency:** The quality of synthesized speech depends on the dataset used for fine-tuning.
- **Pronunciation Variations:** Some complex or uncommon words may be mispronounced.
- **Limited Speaker Variety:** The model was trained on a single speaker embedding and may not generalize well to different voice profiles.

---

## **Training and Evaluation Data**

The model was fine-tuned on an extended dataset containing text in both **Wolof and French**, to improve synthesis quality in both languages.

---

## **Training Procedure**

### **Training Hyperparameters**

| Hyperparameter              | Value                             |
|-----------------------------|-----------------------------------|
| Learning Rate               | 1e-05                             |
| Training Batch Size         | 8                                 |
| Evaluation Batch Size       | 2                                 |
| Gradient Accumulation Steps | 8                                 |
| Total Train Batch Size      | 64                                |
| Optimizer                   | Adam (β1=0.9, β2=0.999, ϵ=1e-08)  |
| Learning Rate Scheduler     | Linear                            |
| Warmup Steps                | 500                               |
| Training Steps              | 25,500                            |
| Mixed Precision Training    | AMP (Automatic Mixed Precision)   |
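
The batch-size and scheduler rows can be checked with a few lines of arithmetic. The sketch below assumes a single training device and that "Linear" means the usual linear-warmup, linear-decay schedule (the shape produced by `get_linear_schedule_with_warmup` in Transformers); the model card states neither explicitly. The name `linear_schedule_lr` is illustrative.

```python
# Values copied from the hyperparameter table.
BASE_LR = 1e-5
PER_DEVICE_BATCH = 8
GRAD_ACCUM_STEPS = 8
WARMUP_STEPS = 500
TOTAL_STEPS = 25_500
NUM_DEVICES = 1  # assumption: not stated in the model card

# Number of samples contributing to each optimizer update.
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS * NUM_DEVICES


def linear_schedule_lr(step):
    """Learning rate at an optimizer step: linear warmup, then linear decay to zero."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS
    remaining = max(0, TOTAL_STEPS - step)
    return BASE_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)


print(effective_batch)  # 64, matching "Total Train Batch Size"
for step in (0, 250, 500, 13_000, 25_500):
    print(step, linear_schedule_lr(step))
```

Under these assumptions the learning rate peaks at 1e-05 at step 500, is about half of that at step 13,000, and reaches zero at step 25,500.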

### **Training Results**

| Training Loss | Epoch   | Step  | Validation Loss |
|:-------------:|:-------:|:-----:|:---------------:|
| 0.5372        | 0.9995  | 954   | 0.4398          |
| 0.4646        | 2.0     | 1909  | 0.4214          |
| 0.4505        | 2.9995  | 2863  | 0.4163          |
| 0.4443        | 4.0     | 3818  | 0.4109          |
| 0.4403        | 4.9995  | 4772  | 0.4080          |
| 0.4368        | 6.0     | 5727  | 0.4057          |
| 0.4343        | 6.9995  | 6681  | 0.4034          |
| 0.4315        | 8.0     | 7636  | 0.4018          |
| 0.4311        | 8.9995  | 8590  | 0.4015          |
| 0.4273        | 10.0    | 9545  | 0.4017          |
| 0.4282        | 10.9995 | 10499 | 0.3990          |
| 0.4249        | 12.0    | 11454 | 0.3986          |
| 0.4242        | 12.9995 | 12408 | 0.3973          |
| 0.4225        | 14.0    | 13363 | 0.3966          |
| 0.4217        | 14.9995 | 14317 | 0.3951          |
| 0.4208        | 16.0    | 15272 | 0.3950          |
| 0.4200        | 16.9995 | 16226 | 0.3950          |
| 0.4202        | 18.0    | 17181 | 0.3952          |
| 0.4200        | 18.9995 | 18135 | 0.3943          |
| 0.4183        | 20.0    | 19090 | 0.3962          |
| 0.4175        | 20.9995 | 20044 | 0.3937          |
| 0.4161        | 22.0    | 20999 | 0.3940          |
| 0.4193        | 22.9995 | 21953 | 0.3932          |
| 0.4177        | 24.0    | 22908 | 0.3939          |
| 0.4166        | 24.9995 | 23862 | 0.3936          |
| 0.4156        | 26.0    | 24817 | 0.3938          |
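
As a reading aid for the table above, the following sketch picks out the evaluation point with the lowest validation loss and derives the number of optimizer steps per epoch. All numbers are copied from the table; the variable names are illustrative.

```python
# (epoch, step, validation_loss) rows copied from the results table.
rows = [
    (0.9995, 954, 0.4398), (2.0, 1909, 0.4214), (2.9995, 2863, 0.4163),
    (4.0, 3818, 0.4109), (4.9995, 4772, 0.4080), (6.0, 5727, 0.4057),
    (6.9995, 6681, 0.4034), (8.0, 7636, 0.4018), (8.9995, 8590, 0.4015),
    (10.0, 9545, 0.4017), (10.9995, 10499, 0.3990), (12.0, 11454, 0.3986),
    (12.9995, 12408, 0.3973), (14.0, 13363, 0.3966), (14.9995, 14317, 0.3951),
    (16.0, 15272, 0.3950), (16.9995, 16226, 0.3950), (18.0, 17181, 0.3952),
    (18.9995, 18135, 0.3943), (20.0, 19090, 0.3962), (20.9995, 20044, 0.3937),
    (22.0, 20999, 0.3940), (22.9995, 21953, 0.3932), (24.0, 22908, 0.3939),
    (24.9995, 23862, 0.3936), (26.0, 24817, 0.3938),
]

# Row with the lowest validation loss.
best_epoch, best_step, best_loss = min(rows, key=lambda row: row[2])

# Optimizer steps per epoch, taken from the last logged row.
last_epoch, last_step, _ = rows[-1]
steps_per_epoch = last_step / last_epoch

print(best_step, best_loss)  # 21953 0.3932
print(steps_per_epoch)       # 954.5
```

From step 14,317 onward the validation loss stays between 0.3932 and 0.3962, so most of the improvement happens in the first half of training.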

---

## **Framework Versions**

- **Transformers**: 4.41.2
- **PyTorch**: 2.4.0+cu121
- **Datasets**: 3.2.0
- **Tokenizers**: 0.19.1

---

## **Author**

- **Bilal FAYE**

This model contributes to **enhancing TTS accessibility** for Wolof and French, making it a valuable resource for multilingual voice applications. 🚀