Update README.md
README.md
CHANGED
@@ -6,47 +6,124 @@ tags:
 model-index:
 - name: speecht5_tts-wolof-v0.2
   results: []
 ---
-<!-- This model card has been generated automatically according to the information the Trainer had access to. You
-should probably proofread and complete it, then remove this comment. -->
-# speecht5_tts-wolof-v0.2
-This model is a fine-tuned version of […](…) on the None dataset.
-It achieves the following results on the evaluation set:
-- Loss: 0.3938
-## Model description
-More information needed
-## Intended uses & limitations
-More information needed
-## Training and evaluation data
-More information needed
-## Training procedure
-### Training hyperparameters
-The following hyperparameters were used during training:
-- learning_rate: 1e-05
-- train_batch_size: 16
-- eval_batch_size: 8
-- seed: 42
-- gradient_accumulation_steps: 2
-- total_train_batch_size: 32
-- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
-- lr_scheduler_type: linear
-- lr_scheduler_warmup_steps: 500
-- num_epochs: 30
-- mixed_precision_training: Native AMP
-### Training results
 | Training Loss | Epoch | Step | Validation Loss |
 |:-------------:|:-------:|:-----:|:---------------:|
@@ -66,9 +143,9 @@ The following hyperparameters were used during training:
 | 0.4225 | 14.0 | 13363 | 0.3966 |
 | 0.4217 | 14.9995 | 14317 | 0.3951 |
 | 0.4208 | 16.0 | 15272 | 0.3950 |
-| 0.…
 | 0.4202 | 18.0 | 17181 | 0.3952 |
-| 0.…
 | 0.4183 | 20.0 | 19090 | 0.3962 |
 | 0.4175 | 20.9995 | 20044 | 0.3937 |
 | 0.4161 | 22.0 | 20999 | 0.3940 |
@@ -77,10 +154,19 @@ The following hyperparameters were used during training:
 | 0.4166 | 24.9995 | 23862 | 0.3936 |
 | 0.4156 | 26.0 | 24817 | 0.3938 |
-### Framework versions
-- Transformers 4.41.2
-- Pytorch 2.4.0+cu121
-- Datasets 3.2.0
-- Tokenizers 0.19.1

The updated README:

model-index:
- name: speecht5_tts-wolof-v0.2
  results: []
language:
- wo
- fr
pipeline_tag: text-to-speech
---

# **speecht5_tts-wolof-v0.2**

This model is a fine-tuned version of [speecht5_tts-wolof](https://huggingface.co/bilalfaye/speecht5_tts-wolof) that enhances Text-to-Speech (TTS) synthesis for both **Wolof and French**. It is based on Microsoft's [SpeechT5](https://huggingface.co/microsoft/speecht5_tts) and incorporates a **custom tokenizer** and additional fine-tuning to improve performance across these two languages.

## **Model Description**

This model builds upon the `SpeechT5` architecture, which unifies speech recognition and synthesis. The fine-tuning process introduced modifications to the original Wolof model, enabling it to **generate natural speech in both Wolof and French**. The model maintains the same general structure but **learns a more robust alignment** between textual inputs and speech synthesis, improving pronunciation and fluency in both languages.
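
Since the custom tokenizer is central to the bilingual support, a quick sanity check is to confirm that Wolof- and French-specific characters are tokenized rather than mapped to unknown tokens. A minimal sketch, not part of the original card, using the generic `tokenize` API:

```python
from transformers import SpeechT5Processor

# Inspect how the custom tokenizer segments accented Wolof and French text.
processor = SpeechT5Processor.from_pretrained("bilalfaye/speecht5_tts-wolof-v0.2")
tokens = processor.tokenizer.tokenize("ñu ne ñoom, synthèse vocale")
print(tokens)  # characters such as ñ and è should not come back as <unk>
```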

---

## **Installation**

To install the necessary dependencies, run the following command:

```bash
pip install transformers datasets torch
```

The example below additionally uses `IPython.display` for inline audio playback, which is available in Jupyter environments.

## **Model Loading and Speech Generation**

```python
import torch
from transformers import SpeechT5ForTextToSpeech, SpeechT5Processor, SpeechT5HifiGan
from datasets import load_dataset
from IPython.display import Audio, display


def load_speech_model(checkpoint="bilalfaye/speecht5_tts-wolof-v0.2",
                      vocoder_checkpoint="microsoft/speecht5_hifigan"):
    """Load the SpeechT5 model, processor, and vocoder for text-to-speech."""
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    processor = SpeechT5Processor.from_pretrained(checkpoint)
    model = SpeechT5ForTextToSpeech.from_pretrained(checkpoint).to(device)
    vocoder = SpeechT5HifiGan.from_pretrained(vocoder_checkpoint).to(device)

    return processor, model, vocoder, device


# Load the model
processor, model, vocoder, device = load_speech_model()

# Load speaker embeddings (pretrained from the CMU Arctic dataset)
embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
speaker_embedding = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)


def generate_speech_from_text(text, speaker_embedding=speaker_embedding,
                              processor=processor, model=model, vocoder=vocoder):
    """Generate speech from input text using SpeechT5 and the HiFi-GAN vocoder."""
    inputs = processor(text=text, return_tensors="pt", padding=True,
                       truncation=True, max_length=model.config.max_text_positions)
    inputs = {key: value.to(model.device) for key, value in inputs.items()}

    speech = model.generate(
        inputs["input_ids"],
        speaker_embeddings=speaker_embedding.to(model.device),
        vocoder=vocoder,
        num_beams=7,
        temperature=0.6,
        no_repeat_ngram_size=3,
        repetition_penalty=1.5,
    )

    speech = speech.detach().cpu().numpy()
    display(Audio(speech, rate=16000))


# Example usage: French
text = "Bonjour, bienvenue dans le modèle de synthèse vocale Wolof et Français."
generate_speech_from_text(text)

# Example usage: Wolof
text = "ñu ne ñoom ñooy nattukaay satélite yi"
generate_speech_from_text(text)
```
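
The function above plays audio inline, which only works in notebook environments. A minimal sketch for writing the waveform to disk instead, assuming the extra `soundfile` dependency and reusing the objects loaded above (`output.wav` is a placeholder path):

```python
import soundfile as sf

# Synthesize a waveform directly and save it as a 16 kHz mono WAV file.
inputs = processor(text="ñu ne ñoom ñooy nattukaay satélite yi", return_tensors="pt")
waveform = model.generate(
    inputs["input_ids"].to(model.device),
    speaker_embeddings=speaker_embedding.to(model.device),
    vocoder=vocoder,
)
sf.write("output.wav", waveform.detach().cpu().numpy(), samplerate=16000)
```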

---

## **Intended Uses & Limitations**

### **Intended Uses**

- **Multilingual TTS:** Converts **Wolof and French** text into natural-sounding speech.
- **Voice Assistants & Speech Interfaces:** Can be used for **audio-based applications** supporting both languages.
- **Linguistic Research:** Facilitates speech synthesis research in low-resource languages.
### **Limitations**

- **Data Dependency:** The quality of synthesized speech depends on the dataset used for fine-tuning.
- **Pronunciation Variations:** Some complex or uncommon words may be mispronounced.
- **Limited Speaker Variety:** The model was trained on a single speaker embedding and may not generalize well to different voice profiles; the sketch below shows how to experiment with other embeddings.
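
A minimal sketch for probing the speaker limitation, reusing the objects loaded in the code above; the indices are arbitrary picks from the CMU Arctic validation split, not recommendations:

```python
# Listen to the same sentence under different x-vector speaker embeddings.
for idx in (1138, 4077, 7306):  # 7306 is the embedding used in the card's example
    emb = torch.tensor(embeddings_dataset[idx]["xvector"]).unsqueeze(0)
    generate_speech_from_text("Bonjour, bienvenue.", speaker_embedding=emb)
```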
---

## **Training and Evaluation Data**

The model was fine-tuned on an extended dataset containing text in both **Wolof and French**, improving synthesis quality across the two languages.

---

## **Training Procedure**

### **Training Hyperparameters**
| Hyperparameter               | Value                            |
|------------------------------|----------------------------------|
| Learning Rate                | 1e-05                            |
| Training Batch Size          | 8                                |
| Evaluation Batch Size        | 2                                |
| Gradient Accumulation Steps  | 8                                |
| Total Train Batch Size       | 64                               |
| Optimizer                    | Adam (β1=0.9, β2=0.999, ε=1e-08) |
| Learning Rate Scheduler      | Linear                           |
| Warmup Steps                 | 500                              |
| Training Steps               | 25,500                           |
| Mixed Precision Training     | AMP (Automatic Mixed Precision)  |
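
These settings map naturally onto Hugging Face `Seq2SeqTrainingArguments`. A sketch of that mapping, assuming the standard `Trainer` setup was used (the actual training script, data collator, and output paths are not part of this card):

```python
from transformers import Seq2SeqTrainingArguments

# Values transcribed from the table above; output_dir is a placeholder.
training_args = Seq2SeqTrainingArguments(
    output_dir="speecht5_tts-wolof-v0.2",
    learning_rate=1e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=8,  # 8 x 8 = 64 effective train batch size
    lr_scheduler_type="linear",
    warmup_steps=500,
    max_steps=25500,
    fp16=True,                      # native AMP mixed precision
)
```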

### **Training Results**

| Training Loss | Epoch | Step | Validation Loss |
|:-------------:|:-------:|:-----:|:---------------:|
| … | … | … | … |
| 0.4225 | 14.0 | 13363 | 0.3966 |
| 0.4217 | 14.9995 | 14317 | 0.3951 |
| 0.4208 | 16.0 | 15272 | 0.3950 |
| 0.4200 | 16.9995 | 16226 | 0.3950 |
| 0.4202 | 18.0 | 17181 | 0.3952 |
| 0.4200 | 18.9995 | 18135 | 0.3943 |
| 0.4183 | 20.0 | 19090 | 0.3962 |
| 0.4175 | 20.9995 | 20044 | 0.3937 |
| 0.4161 | 22.0 | 20999 | 0.3940 |
| … | … | … | … |
| 0.4166 | 24.9995 | 23862 | 0.3936 |
| 0.4156 | 26.0 | 24817 | 0.3938 |

---

## **Framework Versions**

- **Transformers**: 4.41.2
- **PyTorch**: 2.4.0+cu121
- **Datasets**: 3.2.0
- **Tokenizers**: 0.19.1

---

## **Author**

- **Bilal FAYE**

This model contributes to **enhancing TTS accessibility** for Wolof and French, making it a valuable resource for multilingual voice applications. 🚀