---
license: mit
base_model: bilalfaye/speecht5_tts-wolof
tags:
- generated_from_trainer
model-index:
- name: speecht5_tts-wolof-v0.2
  results: []
language:
- wo
- fr
pipeline_tag: text-to-speech
---

# **speecht5_tts-wolof-v0.2**

This model is a fine-tuned version of [speecht5_tts-wolof](https://huggingface.co/bilalfaye/speecht5_tts-wolof) that extends text-to-speech (TTS) synthesis to both **Wolof and French**. It is based on Microsoft's [SpeechT5](https://huggingface.co/microsoft/speecht5_tts) and incorporates a **custom tokenizer** and additional fine-tuning to improve performance across these two languages.

## **Model Description**

This model builds upon the `SpeechT5` architecture, which unifies speech recognition and synthesis. The fine-tuning process introduced modifications to the original Wolof model, enabling it to **generate natural speech in both Wolof and French**. The model keeps the same general structure but **learns a more robust alignment** between the input text and the generated speech, improving pronunciation and fluency in both languages.

---

## **Installation Instructions for Users**

To install the necessary dependencies, run the following command:

```bash
pip install transformers datasets torch
```

## **Model Loading and Speech Generation Code**

The example plays audio with `IPython.display`, so it is meant to be run in a notebook environment such as Jupyter or Colab.

```python
import torch
from transformers import SpeechT5ForTextToSpeech, SpeechT5Processor, SpeechT5HifiGan
from datasets import load_dataset
from IPython.display import Audio, display


def load_speech_model(checkpoint="bilalfaye/speecht5_tts-wolof-v0.2", vocoder_checkpoint="microsoft/speecht5_hifigan"):
    """
    Load the SpeechT5 model, processor, and vocoder for text-to-speech.
    """
    # Use the GPU when one is available, otherwise fall back to the CPU
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    processor = SpeechT5Processor.from_pretrained(checkpoint)
    model = SpeechT5ForTextToSpeech.from_pretrained(checkpoint).to(device)
    vocoder = SpeechT5HifiGan.from_pretrained(vocoder_checkpoint).to(device)
    return processor, model, vocoder, device


# Load the model
processor, model, vocoder, device = load_speech_model()

# Load a speaker embedding (x-vector) from the CMU ARCTIC x-vectors dataset
embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
speaker_embedding = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)


def generate_speech_from_text(text,
                              speaker_embedding=speaker_embedding,
                              processor=processor,
                              model=model,
                              vocoder=vocoder):
    """
    Generate speech from input text using SpeechT5 and the HiFi-GAN vocoder.
    """
    # Tokenize the text, truncating to the model's maximum input length
    inputs = processor(text=text, return_tensors="pt", padding=True, truncation=True,
                       max_length=model.config.max_text_positions)
    inputs = {key: value.to(model.device) for key, value in inputs.items()}

    # Synthesize the waveform; the vocoder converts the spectrogram to audio
    speech = model.generate(
        inputs["input_ids"],
        speaker_embeddings=speaker_embedding.to(model.device),
        vocoder=vocoder,
        num_beams=7,
        temperature=0.6,
        no_repeat_ngram_size=3,
        repetition_penalty=1.5,
    )

    # Move the waveform to the CPU and play it at 16 kHz
    speech = speech.detach().cpu().numpy()
    display(Audio(speech, rate=16000))


# Example usage: French
text = "Bonjour, bienvenue dans le modèle de synthèse vocale Wolof et Français."
generate_speech_from_text(text)

# Example usage: Wolof
text = "ñu ne ñoom ñooy nattukaay satélite yi"
generate_speech_from_text(text)
```

---

## **Intended Uses & Limitations**

### **Intended Uses**

- **Multilingual TTS:** Converts **Wolof and French** text into natural-sounding speech.
- **Voice Assistants & Speech Interfaces:** Can be used for **audio-based applications** supporting both languages.
- **Linguistic Research:** Facilitates speech synthesis research in low-resource languages.

### **Limitations**

- **Data Dependency:** The quality of synthesized speech depends on the dataset used for fine-tuning.
- **Pronunciation Variations:** Some complex or uncommon words may be mispronounced.
- **Limited Speaker Variety:** The model was trained on a single speaker embedding and may not generalize well to different voice profiles.

---

## **Training and Evaluation Data**

The model was fine-tuned on an extended dataset containing text in both **Wolof and French**, ensuring improved synthesis capabilities across these two languages.

---

## **Training Procedure**

### **Training Hyperparameters**

| Hyperparameter              | Value                            |
|-----------------------------|----------------------------------|
| Learning Rate               | 1e-05                            |
| Training Batch Size         | 8                                |
| Evaluation Batch Size       | 2                                |
| Gradient Accumulation Steps | 8                                |
| Total Train Batch Size      | 64                               |
| Optimizer                   | Adam (β1=0.9, β2=0.999, ϵ=1e-08) |
| Learning Rate Scheduler     | Linear                           |
| Warmup Steps                | 500                              |
| Training Steps              | 25,500                           |
| Mixed Precision Training    | AMP (Automatic Mixed Precision)  |

### **Training Results**

| Training Loss | Epoch   | Step  | Validation Loss |
|:-------------:|:-------:|:-----:|:---------------:|
| 0.5372        | 0.9995  | 954   | 0.4398          |
| 0.4646        | 2.0     | 1909  | 0.4214          |
| 0.4505        | 2.9995  | 2863  | 0.4163          |
| 0.4443        | 4.0     | 3818  | 0.4109          |
| 0.4403        | 4.9995  | 4772  | 0.4080          |
| 0.4368        | 6.0     | 5727  | 0.4057          |
| 0.4343        | 6.9995  | 6681  | 0.4034          |
| 0.4315        | 8.0     | 7636  | 0.4018          |
| 0.4311        | 8.9995  | 8590  | 0.4015          |
| 0.4273        | 10.0    | 9545  | 0.4017          |
| 0.4282        | 10.9995 | 10499 | 0.3990          |
| 0.4249        | 12.0    | 11454 | 0.3986          |
| 0.4242        | 12.9995 | 12408 | 0.3973          |
| 0.4225        | 14.0    | 13363 | 0.3966          |
| 0.4217        | 14.9995 | 14317 | 0.3951          |
| 0.4208        | 16.0    | 15272 | 0.3950          |
| 0.4200        | 16.9995 | 16226 | 0.3950          |
| 0.4202        | 18.0    | 17181 | 0.3952          |
| 0.4200        | 18.9995 | 18135 | 0.3943          |
| 0.4183        | 20.0    | 19090 | 0.3962          |
| 0.4175        | 20.9995 | 20044 | 0.3937          |
| 0.4161        | 22.0    | 20999 | 0.3940          |
| 0.4193        | 22.9995 | 21953 | 0.3932          |
| 0.4177        | 24.0    | 22908 | 0.3939          |
| 0.4166        | 24.9995 | 23862 | 0.3936          |
| 0.4156        | 26.0    | 24817 | 0.3938          |

---

## **Framework Versions**

- **Transformers**: 4.41.2
- **PyTorch**: 2.4.0+cu121
- **Datasets**: 3.2.0
- **Tokenizers**: 0.19.1

---

## **Author**

- **Bilal FAYE**

This model contributes to **enhancing TTS accessibility** for Wolof and French, making it a valuable resource for multilingual voice applications. 🚀