---
license: mit
base_model: bilalfaye/speecht5_tts-wolof
tags:
- generated_from_trainer
model-index:
- name: speecht5_tts-wolof-v0.2
  results: []
language:
- wo
- fr
pipeline_tag: text-to-speech
---
# **speecht5_tts-wolof-v0.2**
This model is a fine-tuned version of [speecht5_tts-wolof](https://huggingface.co/bilalfaye/speecht5_tts-wolof) that enhances Text-to-Speech (TTS) synthesis for both **Wolof and French**. It is based on Microsoft's [SpeechT5](https://huggingface.co/microsoft/speecht5_tts) and incorporates a **custom tokenizer** and additional fine-tuning to improve performance across these two languages.
## **Model Description**
This model builds on the `SpeechT5` architecture, which unifies speech recognition and synthesis. Fine-tuning the original Wolof model enables it to **generate natural speech in both Wolof and French**. The overall structure is unchanged, but the model **learns a more robust alignment** between input text and generated speech, improving pronunciation and fluency in both languages.

---
## **Installation Instructions for Users**
To install the necessary dependencies, run the following command:
```bash
pip install transformers datasets torch
```
## **Model Loading and Speech Generation Code**
```python
import torch
from transformers import SpeechT5ForTextToSpeech, SpeechT5Processor, SpeechT5HifiGan
from datasets import load_dataset
from IPython.display import Audio, display


def load_speech_model(checkpoint="bilalfaye/speecht5_tts-wolof-v0.2", vocoder_checkpoint="microsoft/speecht5_hifigan"):
    """Load the SpeechT5 model, processor, and vocoder for text-to-speech."""
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    processor = SpeechT5Processor.from_pretrained(checkpoint)
    model = SpeechT5ForTextToSpeech.from_pretrained(checkpoint).to(device)
    vocoder = SpeechT5HifiGan.from_pretrained(vocoder_checkpoint).to(device)
    return processor, model, vocoder, device


# Load the model
processor, model, vocoder, device = load_speech_model()

# Load speaker embeddings (x-vectors pretrained on the CMU Arctic dataset)
embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
speaker_embedding = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)


def generate_speech_from_text(text, speaker_embedding=speaker_embedding, processor=processor, model=model, vocoder=vocoder):
    """Generate speech from input text using SpeechT5 and the HiFi-GAN vocoder."""
    # Tokenize, truncating to the longest text the model accepts
    inputs = processor(text=text, return_tensors="pt", padding=True, truncation=True, max_length=model.config.max_text_positions)
    inputs = {key: value.to(model.device) for key, value in inputs.items()}

    speech = model.generate(
        inputs["input_ids"],
        speaker_embeddings=speaker_embedding.to(model.device),
        vocoder=vocoder,
        num_beams=7,
        temperature=0.6,
        no_repeat_ngram_size=3,
        repetition_penalty=1.5,
    )

    # Move the waveform to the CPU and play it in the notebook at 16 kHz
    speech = speech.detach().cpu().numpy()
    display(Audio(speech, rate=16000))


# Example usage: French
text = "Bonjour, bienvenue dans le modèle de synthèse vocale Wolof et Français."
generate_speech_from_text(text)

# Example usage: Wolof
text = "ñu ne ñoom ñooy nattukaay satélite yi"
generate_speech_from_text(text)
```
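The snippet above plays audio through `IPython.display`, which only works inside a notebook. Outside a notebook, one option is to write the waveform to a WAV file. The following is a minimal sketch, not part of the original model card: `save_wav` is a hypothetical helper, and it assumes the waveform is a one-dimensional float array in the range [-1, 1] sampled at 16 kHz (the rate used above). A synthetic tone stands in for real model output so the example runs without downloading the model.

```python
import wave

import numpy as np


def save_wav(speech, path, rate=16000):
    """Write a mono float waveform in [-1, 1] to a 16-bit PCM WAV file.

    Hypothetical helper for illustration; not part of the model's API.
    """
    samples = np.asarray(speech, dtype=np.float32).reshape(-1)
    # Clip before scaling so out-of-range samples do not wrap around.
    pcm = (np.clip(samples, -1.0, 1.0) * 32767.0).astype("<i2")
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)     # mono
        wav_file.setsampwidth(2)     # 2 bytes per sample = 16-bit PCM
        wav_file.setframerate(rate)  # assumed 16 kHz, as in the playback call
        wav_file.writeframes(pcm.tobytes())
    return path


# Stand-in waveform (one second of a 440 Hz tone) instead of real model output.
t = np.arange(16000) / 16000.0
demo_path = save_wav(0.5 * np.sin(2 * np.pi * 440.0 * t), "demo_tone.wav")
```

To use it with the model, `generate_speech_from_text` would need to return `speech` instead of only displaying it; the result could then be passed to `save_wav(speech, "output.wav")`.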
---
## **Intended Uses & Limitations**
### **Intended Uses**
- **Multilingual TTS:** Converts **Wolof and French** text into natural-sounding speech.
- **Voice Assistants & Speech Interfaces:** Can be used for **audio-based applications** supporting both languages.
- **Linguistic Research:** Facilitates speech synthesis research in low-resource languages.
### **Limitations**
- **Data Dependency:** The quality of synthesized speech depends on the dataset used for fine-tuning.
- **Pronunciation Variations:** Some complex or uncommon words may be mispronounced.
- **Limited Speaker Variety:** The model was trained on a single speaker embedding and may not generalize well to different voice profiles.
---
## **Training and Evaluation Data**
The model was fine-tuned on an extended dataset covering both **Wolof and French**, with the aim of improving synthesis quality in both languages.

---
## **Training Procedure**
### **Training Hyperparameters**
| Hyperparameter | Value |
|----------------------------|---------|
| Learning Rate | 1e-05 |
| Training Batch Size | 8 |
| Evaluation Batch Size | 2 |
| Gradient Accumulation Steps| 8 |
| Total Train Batch Size | 64 |
| Optimizer | Adam (β1=0.9, β2=0.999, ϵ=1e-08) |
| Learning Rate Scheduler | Linear |
| Warmup Steps | 500 |
| Training Steps | 25,500 |
| Mixed Precision Training | AMP (Automatic Mixed Precision) |
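To make the interplay of these values concrete, the sketch below recomputes the total train batch size and traces the learning rate over training. It assumes that "Linear" means linear warmup followed by linear decay to zero, and that training ran on a single device (which is what 8 × 8 = 64 implies); neither is stated explicitly in this card. The function name `linear_schedule_lr` is illustrative only.

```python
def linear_schedule_lr(step, base_lr=1e-5, warmup_steps=500, total_steps=25_500):
    """Learning rate at `step`: linear warmup, then linear decay to zero.

    Assumed schedule shape; the card only says "Linear" with 500 warmup steps.
    """
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)


per_device_batch = 8
grad_accum_steps = 8
# Gradients from 8 micro-batches of 8 examples are summed before each update.
effective_batch = per_device_batch * grad_accum_steps

print("total train batch size:", effective_batch)
for step in (0, 250, 500, 13_000, 25_500):
    print(step, linear_schedule_lr(step))
```

Under these assumptions the rate rises from zero to the peak of 1e-05 at step 500, falls back to half the peak at step 13,000, and reaches zero at step 25,500.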
### **Training Results**
| Training Loss | Epoch | Step | Validation Loss |
|:-------------:|:-------:|:-----:|:---------------:|
| 0.5372 | 0.9995 | 954 | 0.4398 |
| 0.4646 | 2.0 | 1909 | 0.4214 |
| 0.4505 | 2.9995 | 2863 | 0.4163 |
| 0.4443 | 4.0 | 3818 | 0.4109 |
| 0.4403 | 4.9995 | 4772 | 0.4080 |
| 0.4368 | 6.0 | 5727 | 0.4057 |
| 0.4343 | 6.9995 | 6681 | 0.4034 |
| 0.4315 | 8.0 | 7636 | 0.4018 |
| 0.4311 | 8.9995 | 8590 | 0.4015 |
| 0.4273 | 10.0 | 9545 | 0.4017 |
| 0.4282 | 10.9995 | 10499 | 0.3990 |
| 0.4249 | 12.0 | 11454 | 0.3986 |
| 0.4242 | 12.9995 | 12408 | 0.3973 |
| 0.4225 | 14.0 | 13363 | 0.3966 |
| 0.4217 | 14.9995 | 14317 | 0.3951 |
| 0.4208 | 16.0 | 15272 | 0.3950 |
| 0.4200 | 16.9995 | 16226 | 0.3950 |
| 0.4202 | 18.0 | 17181 | 0.3952 |
| 0.4200 | 18.9995 | 18135 | 0.3943 |
| 0.4183 | 20.0 | 19090 | 0.3962 |
| 0.4175 | 20.9995 | 20044 | 0.3937 |
| 0.4161 | 22.0 | 20999 | 0.3940 |
| 0.4193 | 22.9995 | 21953 | 0.3932 |
| 0.4177 | 24.0 | 22908 | 0.3939 |
| 0.4166 | 24.9995 | 23862 | 0.3936 |
| 0.4156 | 26.0 | 24817 | 0.3938 |
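The validation loss flattens out after roughly epoch 15, so the last few evaluations differ only in the fourth decimal place. The short sketch below, using values copied from the final six rows of the table, picks the evaluation with the lowest validation loss and estimates the number of optimizer steps and training examples per epoch. The examples-per-epoch figure is an estimate that assumes every step consumes a full batch of 64; the card does not state the dataset size or which checkpoint was released.

```python
# (epoch, step, validation_loss) for the last six evaluations in the table
final_evals = [
    (20.9995, 20044, 0.3937),
    (22.0, 20999, 0.3940),
    (22.9995, 21953, 0.3932),
    (24.0, 22908, 0.3939),
    (24.9995, 23862, 0.3936),
    (26.0, 24817, 0.3938),
]

# Evaluation with the lowest validation loss among these rows
best_epoch, best_step, best_loss = min(final_evals, key=lambda row: row[2])

# Optimizer steps per epoch, estimated from the final row (step / epoch)
steps_per_epoch = 24817 / 26.0

# Rough examples per epoch, assuming a full batch of 64 at every step
examples_per_epoch = steps_per_epoch * 64

print(best_step, best_loss, steps_per_epoch, examples_per_epoch)
```

The earlier rows all have a validation loss of 0.3943 or higher, so the minimum over these six rows is also the minimum over the whole table.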
---
## **Framework Versions**
- **Transformers**: 4.41.2
- **PyTorch**: 2.4.0+cu121
- **Datasets**: 3.2.0
- **Tokenizers**: 0.19.1
---
## **Author**
- **Bilal FAYE**

This model contributes to **enhancing TTS accessibility** for Wolof and French, making it a valuable resource for multilingual voice applications. 🚀