---
license: mit
base_model: bilalfaye/speecht5_tts-wolof
tags:
- generated_from_trainer
model-index:
- name: speecht5_tts-wolof-v0.2
  results: []
language:
- wo
- fr
pipeline_tag: text-to-speech
---

# **speecht5_tts-wolof-v0.2**  

This model is a fine-tuned version of [speecht5_tts-wolof](https://huggingface.co/bilalfaye/speecht5_tts-wolof) that enhances Text-to-Speech (TTS) synthesis for both **Wolof and French**. It is based on Microsoft's [SpeechT5](https://huggingface.co/microsoft/speecht5_tts) and incorporates a **custom tokenizer** and additional fine-tuning to improve performance across these two languages.  

## **Model Description**  

This model builds on the `SpeechT5` architecture, which unifies speech recognition and synthesis in a single encoder-decoder framework. Fine-tuning introduced modifications to the original Wolof model, enabling it to **generate natural speech in both Wolof and French**. The model keeps the same general structure but **learns a more robust alignment** between text inputs and the generated speech, improving pronunciation and fluency in both languages.  
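
Since the processor checkpoint bundles the custom tokenizer, a quick sanity check is to load it and inspect how it handles Wolof text. This is a minimal sketch (the checkpoint name is the one used in the usage example below); the printed vocabulary size and tokenization are only for inspection.

```python
from transformers import SpeechT5Processor

# Load the processor shipped with this checkpoint; its tokenizer carries the
# custom Wolof/French vocabulary described above.
processor = SpeechT5Processor.from_pretrained("bilalfaye/speecht5_tts-wolof-v0.2")

tokenizer = processor.tokenizer
print("Vocabulary size:", tokenizer.vocab_size)
# See how a Wolof sentence is split into tokens.
print(tokenizer.tokenize("ñu ne ñoom ñooy nattukaay satélite yi"))
```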

---  

## **Installation Instructions for Users**  

To install the necessary dependencies, run the following command (`sentencepiece` is required by the SpeechT5 tokenizer):  

```bash
pip install transformers datasets torch sentencepiece
```

## **Model Loading and Speech Generation Code**  

```python
import torch
from transformers import SpeechT5ForTextToSpeech, SpeechT5Processor, SpeechT5HifiGan
from datasets import load_dataset
from IPython.display import Audio, display

def load_speech_model(checkpoint="bilalfaye/speecht5_tts-wolof-v0.2", vocoder_checkpoint="microsoft/speecht5_hifigan"):
    """ Load the SpeechT5 model, processor, and vocoder for text-to-speech. """
    
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    
    processor = SpeechT5Processor.from_pretrained(checkpoint)
    model = SpeechT5ForTextToSpeech.from_pretrained(checkpoint).to(device)
    vocoder = SpeechT5HifiGan.from_pretrained(vocoder_checkpoint).to(device)

    return processor, model, vocoder, device

# Load the model
processor, model, vocoder, device = load_speech_model()

# Load speaker embeddings (pretrained from CMU Arctic dataset)
embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
speaker_embedding = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)

def generate_speech_from_text(text, speaker_embedding=speaker_embedding, processor=processor, model=model, vocoder=vocoder):  
    """ Generates speech from input text using SpeechT5 and HiFi-GAN vocoder. """  

    inputs = processor(text=text, return_tensors="pt", padding=True, truncation=True, max_length=model.config.max_text_positions)
    inputs = {key: value.to(model.device) for key, value in inputs.items()}

    speech = model.generate(
        inputs["input_ids"],
        speaker_embeddings=speaker_embedding.to(model.device),
        vocoder=vocoder,
        num_beams=7,
        temperature=0.6,
        no_repeat_ngram_size=3,
        repetition_penalty=1.5,
    )

    speech = speech.detach().cpu().numpy()
    display(Audio(speech, rate=16000))

# Example usage French
text = "Bonjour, bienvenue dans le modèle de synthèse vocale Wolof et Français."
generate_speech_from_text(text)

# Example usage Wolof
text = "ñu ne ñoom ñooy nattukaay satélite yi"
generate_speech_from_text(text)
```
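
The example above plays audio inline with `IPython.display`, which only works in a notebook. Outside a notebook you can write the waveform to disk instead. The sketch below is not part of the original card: it reuses the `processor`, `model`, `vocoder`, and `speaker_embedding` defined above and assumes the `soundfile` package is installed (`pip install soundfile`).

```python
import soundfile as sf

def save_speech_to_wav(text, path="output.wav"):
    """Generate speech as above, but write it to a 16 kHz WAV file instead of playing it."""
    inputs = processor(text=text, return_tensors="pt")
    inputs = {key: value.to(model.device) for key, value in inputs.items()}

    speech = model.generate(
        inputs["input_ids"],
        speaker_embeddings=speaker_embedding.to(model.device),
        vocoder=vocoder,
    )

    # SpeechT5 with the HiFi-GAN vocoder produces 16 kHz mono audio.
    sf.write(path, speech.detach().cpu().numpy(), samplerate=16000)

save_speech_to_wav("ñu ne ñoom ñooy nattukaay satélite yi", "wolof_example.wav")
```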

---  

## **Intended Uses & Limitations**  

### **Intended Uses**  
- **Multilingual TTS:** Converts **Wolof and French** text into natural-sounding speech.  
- **Voice Assistants & Speech Interfaces:** Can be used for **audio-based applications** supporting both languages.  
- **Linguistic Research:** Facilitates speech synthesis research in low-resource languages.  

### **Limitations**  
- **Data Dependency:** The quality of synthesized speech depends on the dataset used for fine-tuning.  
- **Pronunciation Variations:** Some complex or uncommon words may be mispronounced.  
- **Limited Speaker Variety:** The model was trained on a single speaker embedding and may not generalize well to different voice profiles.  

---  

## **Training and Evaluation Data**  

The model was fine-tuned on an extended dataset containing text in both **Wolof and French**, ensuring improved synthesis capabilities across these two languages.  

---

## **Training Procedure**  

### **Training Hyperparameters**  

| Hyperparameter             | Value   |
|----------------------------|---------|
| Learning Rate              | 1e-05   |
| Training Batch Size        | 8       |
| Evaluation Batch Size      | 2       |
| Gradient Accumulation Steps| 8       |
| Total Train Batch Size     | 64      |
| Optimizer                  | Adam (β1=0.9, β2=0.999, ϵ=1e-08) |
| Learning Rate Scheduler    | Linear  |
| Warmup Steps               | 500     |
| Training Steps             | 25,500  |
| Mixed Precision Training   | AMP (Automatic Mixed Precision) |
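
For readers who want to approximate this setup with the `transformers` `Seq2SeqTrainer`, the table maps onto a configuration like the one below. This is an illustrative sketch, not the original training script; the output directory is a placeholder, and the Adam betas/epsilon are the library defaults, which match the values above.

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="speecht5_tts-wolof-v0.2",   # placeholder, not the original path
    learning_rate=1e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=8,          # 8 x 8 = effective train batch size of 64 on one device
    lr_scheduler_type="linear",
    warmup_steps=500,
    max_steps=25_500,
    fp16=True,                              # automatic mixed precision (AMP)
)
```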

### **Training Results**  

| Training Loss | Epoch   | Step  | Validation Loss |
|:-------------:|:-------:|:-----:|:---------------:|
| 0.5372        | 0.9995  | 954   | 0.4398          |
| 0.4646        | 2.0     | 1909  | 0.4214          |
| 0.4505        | 2.9995  | 2863  | 0.4163          |
| 0.4443        | 4.0     | 3818  | 0.4109          |
| 0.4403        | 4.9995  | 4772  | 0.4080          |
| 0.4368        | 6.0     | 5727  | 0.4057          |
| 0.4343        | 6.9995  | 6681  | 0.4034          |
| 0.4315        | 8.0     | 7636  | 0.4018          |
| 0.4311        | 8.9995  | 8590  | 0.4015          |
| 0.4273        | 10.0    | 9545  | 0.4017          |
| 0.4282        | 10.9995 | 10499 | 0.3990          |
| 0.4249        | 12.0    | 11454 | 0.3986          |
| 0.4242        | 12.9995 | 12408 | 0.3973          |
| 0.4225        | 14.0    | 13363 | 0.3966          |
| 0.4217        | 14.9995 | 14317 | 0.3951          |
| 0.4208        | 16.0    | 15272 | 0.3950          |
| 0.4200        | 16.9995 | 16226 | 0.3950          |
| 0.4202        | 18.0    | 17181 | 0.3952          |
| 0.4200        | 18.9995 | 18135 | 0.3943          |
| 0.4183        | 20.0    | 19090 | 0.3962          |
| 0.4175        | 20.9995 | 20044 | 0.3937          |
| 0.4161        | 22.0    | 20999 | 0.3940          |
| 0.4193        | 22.9995 | 21953 | 0.3932          |
| 0.4177        | 24.0    | 22908 | 0.3939          |
| 0.4166        | 24.9995 | 23862 | 0.3936          |
| 0.4156        | 26.0    | 24817 | 0.3938          |

---

## **Framework Versions**  

- **Transformers**: 4.41.2  
- **PyTorch**: 2.4.0+cu121  
- **Datasets**: 3.2.0  
- **Tokenizers**: 0.19.1  

---

## **Author**  

- **Bilal FAYE**  

This model contributes to **enhancing TTS accessibility** for Wolof and French, making it a valuable resource for multilingual voice applications. 🚀