bilalfaye committed · verified
Commit e3317c8 · 1 Parent(s): 1ab6db4

Update README.md

Files changed (1)
  1. README.md +120 -34
README.md CHANGED
@@ -6,47 +6,124 @@ tags:
  model-index:
  - name: speecht5_tts-wolof-v0.2
    results: []
  ---

- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->

- # speecht5_tts-wolof-v0.2

- This model is a fine-tuned version of [bilalfaye/speecht5_tts-wolof](https://huggingface.co/bilalfaye/speecht5_tts-wolof) on an unknown dataset.
- It achieves the following results on the evaluation set:
- - Loss: 0.3938

- ## Model description

- More information needed

- ## Intended uses & limitations

- More information needed

- ## Training and evaluation data

- More information needed

- ## Training procedure

- ### Training hyperparameters

- The following hyperparameters were used during training:
- - learning_rate: 1e-05
- - train_batch_size: 16
- - eval_batch_size: 8
- - seed: 42
- - gradient_accumulation_steps: 2
- - total_train_batch_size: 32
- - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- - lr_scheduler_type: linear
- - lr_scheduler_warmup_steps: 500
- - num_epochs: 30
- - mixed_precision_training: Native AMP

- ### Training results

  | Training Loss | Epoch | Step | Validation Loss |
  |:-------------:|:-------:|:-----:|:---------------:|
@@ -66,9 +143,9 @@ The following hyperparameters were used during training:
  | 0.4225 | 14.0 | 13363 | 0.3966 |
  | 0.4217 | 14.9995 | 14317 | 0.3951 |
  | 0.4208 | 16.0 | 15272 | 0.3950 |
- | 0.42 | 16.9995 | 16226 | 0.3950 |
  | 0.4202 | 18.0 | 17181 | 0.3952 |
- | 0.42 | 18.9995 | 18135 | 0.3943 |
  | 0.4183 | 20.0 | 19090 | 0.3962 |
  | 0.4175 | 20.9995 | 20044 | 0.3937 |
  | 0.4161 | 22.0 | 20999 | 0.3940 |
@@ -77,10 +154,19 @@ The following hyperparameters were used during training:
  | 0.4166 | 24.9995 | 23862 | 0.3936 |
  | 0.4156 | 26.0 | 24817 | 0.3938 |

- ### Framework versions

- - Transformers 4.41.2
- - Pytorch 2.4.0+cu121
- - Datasets 3.2.0
- - Tokenizers 0.19.1

  model-index:
  - name: speecht5_tts-wolof-v0.2
    results: []
+ language:
+ - wo
+ - en
+ pipeline_tag: text-to-speech
  ---

+ # **speecht5_tts-wolof-v0.2**
+
+ This model is a fine-tuned version of [speecht5_tts-wolof](https://huggingface.co/bilalfaye/speecht5_tts-wolof) that enhances text-to-speech (TTS) synthesis for both **Wolof and French**. It is based on Microsoft's [SpeechT5](https://huggingface.co/microsoft/speecht5_tts) and incorporates a **custom tokenizer** and additional fine-tuning to improve performance in both languages.
+
+ ## **Model Description**
+
+ This model builds upon the `SpeechT5` architecture, which unifies speech recognition and synthesis. The fine-tuning process introduced modifications to the original Wolof model, enabling it to **generate natural speech in both Wolof and French**. The model keeps the same general structure but **learns a more robust alignment** between text inputs and synthesized speech, improving pronunciation and fluency in both languages.
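+
+ As a small illustration of the custom tokenizer mentioned above (a sketch that assumes it is exposed through the standard `SpeechT5Processor` interface and that its sentencepiece dependency is installed), the tokenizer can be inspected directly:
+
+ ```python
+ from transformers import SpeechT5Processor
+
+ # Load only the processor to look at the tokenizer bundled with this checkpoint.
+ processor = SpeechT5Processor.from_pretrained("bilalfaye/speecht5_tts-wolof-v0.2")
+ print(type(processor.tokenizer).__name__)  # tokenizer class wrapped by the processor
+ print(len(processor.tokenizer))            # vocabulary size seen by the model
+ ```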
+
+ ---
+
+ ## **Installation Instructions**
+
+ To install the necessary dependencies, run the following command:
+
+ ```bash
+ pip install transformers datasets torch
+ ```
+
+ ## **Model Loading and Speech Generation Code**
+
+ ```python
+ import torch
+ from transformers import SpeechT5ForTextToSpeech, SpeechT5Processor, SpeechT5HifiGan
+ from datasets import load_dataset
+ from IPython.display import Audio, display
+
+ def load_speech_model(checkpoint="bilalfaye/speecht5_tts-wolof-v0.2", vocoder_checkpoint="microsoft/speecht5_hifigan"):
+     """Load the SpeechT5 model, processor, and vocoder for text-to-speech."""
+     device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+
+     processor = SpeechT5Processor.from_pretrained(checkpoint)
+     model = SpeechT5ForTextToSpeech.from_pretrained(checkpoint).to(device)
+     vocoder = SpeechT5HifiGan.from_pretrained(vocoder_checkpoint).to(device)
+
+     return processor, model, vocoder, device
+
+ # Load the model
+ processor, model, vocoder, device = load_speech_model()
+
+ # Load speaker embeddings (pretrained from CMU Arctic dataset)
+ embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
+ speaker_embedding = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)
+
+ def generate_speech_from_text(text, speaker_embedding=speaker_embedding, processor=processor, model=model, vocoder=vocoder):
+     """Generate speech from input text using SpeechT5 and the HiFi-GAN vocoder."""
+     inputs = processor(text=text, return_tensors="pt", padding=True, truncation=True, max_length=model.config.max_text_positions)
+     inputs = {key: value.to(model.device) for key, value in inputs.items()}
+
+     speech = model.generate(
+         inputs["input_ids"],
+         speaker_embeddings=speaker_embedding.to(model.device),
+         vocoder=vocoder,
+         num_beams=7,
+         temperature=0.6,
+         no_repeat_ngram_size=3,
+         repetition_penalty=1.5,
+     )
+
+     speech = speech.detach().cpu().numpy()
+     display(Audio(speech, rate=16000))
+
+ # Example usage: French
+ text = "Bonjour, bienvenue dans le modèle de synthèse vocale Wolof et Français."
+ generate_speech_from_text(text)
+
+ # Example usage: Wolof
+ text = "ñu ne ñoom ñooy nattukaay satélite yi"
+ generate_speech_from_text(text)
+ ```
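+
+ If the audio should be saved rather than only played inline, a minimal variation (assuming the optional `soundfile` package is installed; it is not part of the install command above) writes the 16 kHz waveform to a WAV file:
+
+ ```python
+ import soundfile as sf
+
+ # Reuse the processor, model, vocoder, and speaker embedding loaded above.
+ def save_speech_to_wav(text, path="output.wav", speaker_embedding=speaker_embedding):
+     inputs = processor(text=text, return_tensors="pt")
+     speech = model.generate(
+         inputs["input_ids"].to(model.device),
+         speaker_embeddings=speaker_embedding.to(model.device),
+         vocoder=vocoder,
+     )
+     sf.write(path, speech.detach().cpu().numpy(), samplerate=16000)
+
+ save_speech_to_wav("ñu ne ñoom ñooy nattukaay satélite yi", "wolof_example.wav")
+ ```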
+
+ ---
+
+ ## **Intended Uses & Limitations**
+
+ ### **Intended Uses**
+ - **Multilingual TTS:** Converts **Wolof and French** text into natural-sounding speech.
+ - **Voice Assistants & Speech Interfaces:** Can be used in **audio-based applications** that support both languages.
+ - **Linguistic Research:** Facilitates speech-synthesis research in low-resource languages.
+
+ ### **Limitations**
+ - **Data Dependency:** The quality of synthesized speech depends on the dataset used for fine-tuning.
+ - **Pronunciation Variations:** Some complex or uncommon words may be mispronounced.
+ - **Limited Speaker Variety:** The model was trained with a single speaker embedding and may not generalize well to other voice profiles.
+
+ ---
+
+ ## **Training and Evaluation Data**
+
+ The model was fine-tuned on an extended dataset containing text in both **Wolof and French**, improving synthesis quality in these two languages.
+
+ ---
+
+ ## **Training Procedure**
+
+ ### **Training Hyperparameters**
+
+ | Hyperparameter              | Value   |
+ |-----------------------------|---------|
+ | Learning Rate               | 1e-05   |
+ | Training Batch Size         | 8       |
+ | Evaluation Batch Size       | 2       |
+ | Gradient Accumulation Steps | 8       |
+ | Total Train Batch Size      | 64      |
+ | Optimizer                   | Adam (β1=0.9, β2=0.999, ϵ=1e-08) |
+ | Learning Rate Scheduler     | Linear  |
+ | Warmup Steps                | 500     |
+ | Training Steps              | 25,500  |
+ | Mixed Precision Training    | AMP (Automatic Mixed Precision) |
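+
+ As a reference point, these values roughly correspond to the following `Seq2SeqTrainingArguments` sketch (the `output_dir` and any option not listed in the table are placeholders, not values taken from the actual training run):
+
+ ```python
+ from transformers import Seq2SeqTrainingArguments
+
+ # Illustrative mapping of the hyperparameter table onto transformers' training arguments.
+ training_args = Seq2SeqTrainingArguments(
+     output_dir="speecht5_tts-wolof-v0.2",  # placeholder
+     learning_rate=1e-5,
+     per_device_train_batch_size=8,
+     per_device_eval_batch_size=2,
+     gradient_accumulation_steps=8,         # 8 x 8 = effective batch size of 64
+     lr_scheduler_type="linear",
+     warmup_steps=500,
+     max_steps=25500,
+     fp16=True,                             # mixed precision (AMP)
+     # Adam betas/epsilon are left at the transformers defaults (0.9, 0.999, 1e-8).
+ )
+ ```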
+
+ ### **Training Results**

  | Training Loss | Epoch | Step | Validation Loss |
  |:-------------:|:-------:|:-----:|:---------------:|
 
  | 0.4225 | 14.0 | 13363 | 0.3966 |
  | 0.4217 | 14.9995 | 14317 | 0.3951 |
  | 0.4208 | 16.0 | 15272 | 0.3950 |
+ | 0.4200 | 16.9995 | 16226 | 0.3950 |
  | 0.4202 | 18.0 | 17181 | 0.3952 |
+ | 0.4200 | 18.9995 | 18135 | 0.3943 |
  | 0.4183 | 20.0 | 19090 | 0.3962 |
  | 0.4175 | 20.9995 | 20044 | 0.3937 |
  | 0.4161 | 22.0 | 20999 | 0.3940 |
 
  | 0.4166 | 24.9995 | 23862 | 0.3936 |
  | 0.4156 | 26.0 | 24817 | 0.3938 |

+ ---
+
+ ## **Framework Versions**
+
+ - **Transformers**: 4.41.2
+ - **PyTorch**: 2.4.0+cu121
+ - **Datasets**: 3.2.0
+ - **Tokenizers**: 0.19.1
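+
+ A quick way to compare a local environment against these versions (a generic check, nothing model-specific):
+
+ ```python
+ # Print the locally installed versions for comparison with the list above.
+ import torch, transformers, datasets, tokenizers
+ for name, module in [("Transformers", transformers), ("PyTorch", torch), ("Datasets", datasets), ("Tokenizers", tokenizers)]:
+     print(name, module.__version__)
+ ```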
+
+ ---
+
+ ## **Author**
+
+ - **Bilal FAYE**
+
+ This model contributes to **enhancing TTS accessibility** for Wolof and French, making it a valuable resource for multilingual voice applications. 🚀