---
license: mit
base_model: bilalfaye/speecht5_tts-wolof
tags:
- generated_from_trainer
model-index:
- name: speecht5_tts-wolof-v0.2
  results: []
language:
- wo
- fr
pipeline_tag: text-to-speech
---

# **speecht5_tts-wolof-v0.2**  

This model is a fine-tuned version of [speecht5_tts-wolof](https://huggingface.co/bilalfaye/speecht5_tts-wolof) that enhances Text-to-Speech (TTS) synthesis for both **Wolof and French**. It is based on Microsoft's [SpeechT5](https://huggingface.co/microsoft/speecht5_tts) and incorporates a **custom tokenizer** and additional fine-tuning to improve performance across these two languages.  

## **Model Description**  

This model builds on the `SpeechT5` architecture, a shared encoder-decoder framework that covers both speech recognition and speech synthesis. Fine-tuning extends the original Wolof model so that it **generates speech in both Wolof and French**. The architecture itself is unchanged; the additional training targets **a more robust alignment** between input text and generated speech, with the aim of improving pronunciation and fluency in both languages.  

---  

## **Installation Instructions for Users**  

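The usage example below imports `IPython.display` for audio playback, so it is meant to run in a notebook such as Jupyter or Colab. To install the necessary dependencies, run the following command:  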

```bash
pip install transformers datasets torch sentencepiece
```

## **Model Loading and Speech Generation Code**  

```python
import torch
from transformers import SpeechT5ForTextToSpeech, SpeechT5Processor, SpeechT5HifiGan
from datasets import load_dataset
from IPython.display import Audio, display

def load_speech_model(checkpoint="bilalfaye/speecht5_tts-wolof-v0.2", vocoder_checkpoint="microsoft/speecht5_hifigan"):
    """ Load the SpeechT5 model, processor, and vocoder for text-to-speech. """
    
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    
    processor = SpeechT5Processor.from_pretrained(checkpoint)
    model = SpeechT5ForTextToSpeech.from_pretrained(checkpoint).to(device)
    vocoder = SpeechT5HifiGan.from_pretrained(vocoder_checkpoint).to(device)

    return processor, model, vocoder, device

# Load the model
processor, model, vocoder, device = load_speech_model()

# Load a speaker embedding (x-vector) precomputed from the CMU ARCTIC dataset
embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
speaker_embedding = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)

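def generate_speech_from_text(text, speaker_embedding=speaker_embedding, processor=processor, model=model, vocoder=vocoder):  
    """ Generate speech from input text with SpeechT5 and the HiFi-GAN vocoder, then play it. """  

    # Tokenize the text; inputs longer than the model's maximum text length are truncated.
    inputs = processor(text=text, return_tensors="pt", padding=True, truncation=True, max_length=model.config.max_text_positions)
    inputs = {key: value.to(model.device) for key, value in inputs.items()}

    # NOTE: SpeechT5 decodes spectrogram frames rather than text tokens, so the
    # text-generation options below (num_beams, temperature, ...) are kept from the
    # original example but are likely ignored by this generate() implementation.
    speech = model.generate(
        inputs["input_ids"],
        speaker_embeddings=speaker_embedding.to(model.device),
        vocoder=vocoder,
        num_beams=7,
        temperature=0.6,
        no_repeat_ngram_size=3,
        repetition_penalty=1.5,
    )

    # Move the waveform to the CPU and play it at the vocoder's 16 kHz sampling rate.
    speech = speech.detach().cpu().numpy()
    display(Audio(speech, rate=16000))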

# Example usage: French
text = "Bonjour, bienvenue dans le modèle de synthèse vocale Wolof et Français."
generate_speech_from_text(text)

# Example usage: Wolof
text = "ñu ne ñoom ñooy nattukaay satélite yi"
generate_speech_from_text(text)
```
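
The example above only plays audio inside a notebook. To keep the result as a file, the waveform can be written to disk. The following is a minimal sketch using only the Python standard library; the helper name `save_wav` is hypothetical (it is not part of this model or of `transformers`), and it assumes a mono waveform of floats in the range [-1, 1] sampled at 16 kHz, as in the playback call above.

```python
import sys
import wave
from array import array


def save_wav(waveform, path, sample_rate=16000):
    """Write a mono float waveform in [-1, 1] to a 16-bit PCM WAV file.

    `waveform` can be any 1-D sequence of floats, including a NumPy array.
    Returns the number of samples written.
    """
    pcm = array("h")  # signed 16-bit integers
    for sample in waveform:
        # Clip to the valid range before scaling to avoid integer overflow.
        clipped = max(-1.0, min(1.0, float(sample)))
        pcm.append(int(round(clipped * 32767)))

    # WAV files store little-endian samples; swap bytes on big-endian hosts.
    if sys.byteorder == "big":
        pcm.byteswap()

    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)        # mono
        wav_file.setsampwidth(2)        # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm.tobytes())
    return len(pcm)
```

One way to use it is to call `save_wav(speech, "output.wav")` on the `speech` array inside `generate_speech_from_text`, for instance just before the `display` call.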

---  

## **Intended Uses & Limitations**  

### **Intended Uses**  
- **Multilingual TTS:** Converts **Wolof and French** text into natural-sounding speech.  
- **Voice Assistants & Speech Interfaces:** Can be used for **audio-based applications** supporting both languages.  
- **Linguistic Research:** Facilitates speech synthesis research in low-resource languages.  

### **Limitations**  
- **Data Dependency:** The quality of synthesized speech depends on the dataset used for fine-tuning.  
- **Pronunciation Variations:** Some complex or uncommon words may be mispronounced.  
- **Limited Speaker Variety:** The model was trained on a single speaker embedding and may not generalize well to different voice profiles.  

---  

## **Training and Evaluation Data**  

The model was fine-tuned on an extended dataset containing text in both **Wolof and French**, ensuring improved synthesis capabilities across these two languages.  

---

## **Training Procedure**  

### **Training Hyperparameters**  

| Hyperparameter             | Value   |
|----------------------------|---------|
| Learning Rate              | 1e-05   |
| Training Batch Size        | 8       |
| Evaluation Batch Size      | 2       |
| Gradient Accumulation Steps| 8       |
| Total Train Batch Size     | 64      |
| Optimizer                  | Adam (β1=0.9, β2=0.999, ϵ=1e-08) |
| Learning Rate Scheduler    | Linear  |
| Warmup Steps               | 500     |
| Training Steps             | 25,500  |
| Mixed Precision Training   | AMP (Automatic Mixed Precision) |
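
To make the table concrete, here is a small sketch of how the batch-size entries combine and how a linear schedule with warmup would evolve over training. It assumes the common behavior of such schedules (a linear ramp from 0 to the peak rate over the warmup steps, then a linear decay to 0 at the final step) and a single training device; the function name `linear_lr` is hypothetical, and the source does not confirm the exact schedule implementation.

```python
PEAK_LR = 1e-5          # "Learning Rate" in the table
WARMUP_STEPS = 500      # "Warmup Steps"
TOTAL_STEPS = 25_500    # "Training Steps"
TRAIN_BATCH_SIZE = 8    # "Training Batch Size"
GRAD_ACCUM_STEPS = 8    # "Gradient Accumulation Steps"

# Examples consumed per optimizer step (assumption: one device).
effective_batch = TRAIN_BATCH_SIZE * GRAD_ACCUM_STEPS


def linear_lr(step: int) -> float:
    """Learning rate at optimizer step `step` under the assumed schedule."""
    if step < WARMUP_STEPS:
        # Warmup: ramp linearly from 0 up to the peak rate.
        return PEAK_LR * step / WARMUP_STEPS
    # Decay: fall linearly from the peak rate to 0 at TOTAL_STEPS.
    remaining = max(0, TOTAL_STEPS - step)
    return PEAK_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)


if __name__ == "__main__":
    print("effective batch size:", effective_batch)
    for step in (0, 250, 500, 13_000, 25_500):
        print(f"step {step:>6}: lr = {linear_lr(step):.2e}")
```

Under these assumptions the product of batch size and accumulation steps matches the "Total Train Batch Size" row of 64.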

### **Training Results**  

| Training Loss | Epoch   | Step  | Validation Loss |
|:-------------:|:-------:|:-----:|:---------------:|
| 0.5372        | 0.9995  | 954   | 0.4398          |
| 0.4646        | 2.0     | 1909  | 0.4214          |
| 0.4505        | 2.9995  | 2863  | 0.4163          |
| 0.4443        | 4.0     | 3818  | 0.4109          |
| 0.4403        | 4.9995  | 4772  | 0.4080          |
| 0.4368        | 6.0     | 5727  | 0.4057          |
| 0.4343        | 6.9995  | 6681  | 0.4034          |
| 0.4315        | 8.0     | 7636  | 0.4018          |
| 0.4311        | 8.9995  | 8590  | 0.4015          |
| 0.4273        | 10.0    | 9545  | 0.4017          |
| 0.4282        | 10.9995 | 10499 | 0.3990          |
| 0.4249        | 12.0    | 11454 | 0.3986          |
| 0.4242        | 12.9995 | 12408 | 0.3973          |
| 0.4225        | 14.0    | 13363 | 0.3966          |
| 0.4217        | 14.9995 | 14317 | 0.3951          |
| 0.4208        | 16.0    | 15272 | 0.3950          |
| 0.4200        | 16.9995 | 16226 | 0.3950          |
| 0.4202        | 18.0    | 17181 | 0.3952          |
| 0.4200        | 18.9995 | 18135 | 0.3943          |
| 0.4183        | 20.0    | 19090 | 0.3962          |
| 0.4175        | 20.9995 | 20044 | 0.3937          |
| 0.4161        | 22.0    | 20999 | 0.3940          |
| 0.4193        | 22.9995 | 21953 | 0.3932          |
| 0.4177        | 24.0    | 22908 | 0.3939          |
| 0.4166        | 24.9995 | 23862 | 0.3936          |
| 0.4156        | 26.0    | 24817 | 0.3938          |

---

## **Framework Versions**  

- **Transformers**: 4.41.2  
- **PyTorch**: 2.4.0+cu121  
- **Datasets**: 3.2.0  
- **Tokenizers**: 0.19.1  

---

## **Author**  

- **Bilal FAYE**  

This model contributes to **enhancing TTS accessibility** for Wolof and French, making it a valuable resource for multilingual voice applications. 🚀