---
license: mit
base_model: bilalfaye/speecht5_tts-wolof
tags:
- generated_from_trainer
model-index:
- name: speecht5_tts-wolof-v0.2
  results: []
language:
- wo
- fr
pipeline_tag: text-to-speech
---

	# speecht5_tts-wolof-v0.2

	This model is a fine-tuned version of [speecht5_tts-wolof](https://huggingface.co/bilalfaye/speecht5_tts-wolof) that enhances Text-to-Speech (TTS) synthesis for both Wolof and French. It is based on Microsoft's [SpeechT5](https://huggingface.co/microsoft/speecht5_tts) and incorporates a custom tokenizer and additional fine-tuning to improve performance across these two languages.

	## Model Description

This model builds on the `SpeechT5` architecture, a shared encoder-decoder framework for speech recognition and speech synthesis. Fine-tuning extends the original Wolof model so that it generates speech in both Wolof and French. The overall structure is kept; the additional training is intended to improve the alignment between input text and generated speech, and with it pronunciation and fluency in both languages.

	---

	## Installation Instructions for Users

	To install the necessary dependencies, run the following command:

	```bash
	pip install transformers datasets torch
	```

	## Model Loading and Speech Generation Code

```python
import torch
from transformers import SpeechT5ForTextToSpeech, SpeechT5Processor, SpeechT5HifiGan
from datasets import load_dataset
from IPython.display import Audio, display


def load_speech_model(checkpoint="bilalfaye/speecht5_tts-wolof-v0.2", vocoder_checkpoint="microsoft/speecht5_hifigan"):
    """Load the SpeechT5 model, processor, and vocoder for text-to-speech."""
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    processor = SpeechT5Processor.from_pretrained(checkpoint)
    model = SpeechT5ForTextToSpeech.from_pretrained(checkpoint).to(device)
    vocoder = SpeechT5HifiGan.from_pretrained(vocoder_checkpoint).to(device)

    return processor, model, vocoder, device


# Load the model
processor, model, vocoder, device = load_speech_model()

# Load speaker embeddings (pretrained from CMU Arctic dataset)
embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
speaker_embedding = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)


def generate_speech_from_text(text, speaker_embedding=speaker_embedding, processor=processor, model=model, vocoder=vocoder):
    """Generate speech from input text with SpeechT5 and the HiFi-GAN vocoder, then play it."""
    # Tokenize the text, truncating to the model's maximum input length
    inputs = processor(text=text, return_tensors="pt", padding=True, truncation=True, max_length=model.config.max_text_positions)
    inputs = {key: value.to(model.device) for key, value in inputs.items()}

    speech = model.generate(
        inputs["input_ids"],
        speaker_embeddings=speaker_embedding.to(model.device),
        vocoder=vocoder,
        num_beams=7,
        temperature=0.6,
        no_repeat_ngram_size=3,
        repetition_penalty=1.5,
    )

    # Move the waveform to the CPU and play it at 16 kHz
    speech = speech.detach().cpu().numpy()
    display(Audio(speech, rate=16000))


# Example usage French
text = "Bonjour, bienvenue dans le modèle de synthèse vocale Wolof et Français."
generate_speech_from_text(text)

# Example usage Wolof
text = "ñu ne ñoom ñooy nattukaay satélite yi"
generate_speech_from_text(text)
```
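
`display(Audio(...))` only plays audio inside a notebook. Outside one, the waveform can be written to disk instead. The following is a minimal sketch that uses only the Python standard library and assumes mono float samples in the range [-1.0, 1.0] at 16 kHz, matching the `rate=16000` used above. `save_waveform_as_wav` is a hypothetical helper, not part of `transformers`, and the sine tone is a stand-in for real model output.

```python
import math
import os
import sys
import tempfile
import wave
from array import array


def save_waveform_as_wav(samples, path, sample_rate=16000):
    """Write mono float samples in [-1.0, 1.0] to a 16-bit PCM WAV file."""
    # Clip each sample, then scale it to the signed 16-bit integer range.
    pcm = array("h", (int(max(-1.0, min(1.0, float(s))) * 32767) for s in samples))
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV files store samples as little-endian
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm.tobytes())
    return len(pcm)


# Stand-in for model output: one second of a 440 Hz tone at 16 kHz.
tone = [0.3 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
example_path = os.path.join(tempfile.gettempdir(), "example_tone.wav")
save_waveform_as_wav(tone, example_path)
```

If `generate_speech_from_text` is adapted to return the `speech` array, the same helper could be called as `save_waveform_as_wav(speech, "output.wav")`.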

	---

	## Intended Uses & Limitations

	### Intended Uses
	- Multilingual TTS: Converts Wolof and French text into natural-sounding speech.
	- Voice Assistants & Speech Interfaces: Can be used for audio-based applications supporting both languages.
	- Linguistic Research: Facilitates speech synthesis research in low-resource languages.

	### Limitations
	- Data Dependency: The quality of synthesized speech depends on the dataset used for fine-tuning.
	- Pronunciation Variations: Some complex or uncommon words may be mispronounced.
	- Limited Speaker Variety: The model was trained on a single speaker embedding and may not generalize well to different voice profiles.

	---

	## Training and Evaluation Data

	The model was fine-tuned on an extended dataset containing text in both Wolof and French, ensuring improved synthesis capabilities across these two languages.

	---

	## Training Procedure

	### Training Hyperparameters

| Hyperparameter              | Value                             |
|-----------------------------|-----------------------------------|
| Learning Rate               | 1e-05                             |
| Training Batch Size         | 8                                 |
| Evaluation Batch Size       | 2                                 |
| Gradient Accumulation Steps | 8                                 |
| Total Train Batch Size      | 64                                |
| Optimizer                   | Adam (β1=0.9, β2=0.999, ϵ=1e-08)  |
| Learning Rate Scheduler     | Linear                            |
| Warmup Steps                | 500                               |
| Training Steps              | 25,500                            |
| Mixed Precision Training    | AMP (Automatic Mixed Precision)   |
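
The batch-size and scheduler rows can be read together. The sketch below recomputes the total train batch size (8 × 8 = 64, assuming a single device) and evaluates a linear schedule with 500 warmup steps over 25,500 training steps. The warmup-then-decay-to-zero shape is an assumption based on the common linear-with-warmup scheduler, not something taken from the training script, and the function names are illustrative.

```python
LEARNING_RATE = 1e-5
PER_DEVICE_BATCH_SIZE = 8
GRADIENT_ACCUMULATION_STEPS = 8
WARMUP_STEPS = 500
TRAINING_STEPS = 25_500


def total_train_batch_size(per_device, accumulation_steps, num_devices=1):
    """Number of examples that contribute to one optimizer update."""
    return per_device * accumulation_steps * num_devices


def linear_lr(step, peak_lr=LEARNING_RATE, warmup_steps=WARMUP_STEPS, total_steps=TRAINING_STEPS):
    """Linear warmup from 0 to peak_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)


print("total train batch size:", total_train_batch_size(PER_DEVICE_BATCH_SIZE, GRADIENT_ACCUMULATION_STEPS))
for step in (0, 250, 500, 13_000, 25_500):
    print(f"step {step:>6}: lr = {linear_lr(step):.2e}")
```

Under these assumptions the learning rate peaks at step 500 and is back to half its peak value at step 13,000.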

	### Training Results

| Training Loss | Epoch   | Step  | Validation Loss |
|:-------------:|:-------:|:-----:|:---------------:|
| 0.5372        | 0.9995  | 954   | 0.4398          |
| 0.4646        | 2.0     | 1909  | 0.4214          |
| 0.4505        | 2.9995  | 2863  | 0.4163          |
| 0.4443        | 4.0     | 3818  | 0.4109          |
| 0.4403        | 4.9995  | 4772  | 0.4080          |
| 0.4368        | 6.0     | 5727  | 0.4057          |
| 0.4343        | 6.9995  | 6681  | 0.4034          |
| 0.4315        | 8.0     | 7636  | 0.4018          |
| 0.4311        | 8.9995  | 8590  | 0.4015          |
| 0.4273        | 10.0    | 9545  | 0.4017          |
| 0.4282        | 10.9995 | 10499 | 0.3990          |
| 0.4249        | 12.0    | 11454 | 0.3986          |
| 0.4242        | 12.9995 | 12408 | 0.3973          |
| 0.4225        | 14.0    | 13363 | 0.3966          |
| 0.4217        | 14.9995 | 14317 | 0.3951          |
| 0.4208        | 16.0    | 15272 | 0.3950          |
| 0.4200        | 16.9995 | 16226 | 0.3950          |
| 0.4202        | 18.0    | 17181 | 0.3952          |
| 0.4200        | 18.9995 | 18135 | 0.3943          |
| 0.4183        | 20.0    | 19090 | 0.3962          |
| 0.4175        | 20.9995 | 20044 | 0.3937          |
| 0.4161        | 22.0    | 20999 | 0.3940          |
| 0.4193        | 22.9995 | 21953 | 0.3932          |
| 0.4177        | 24.0    | 22908 | 0.3939          |
| 0.4166        | 24.9995 | 23862 | 0.3936          |
| 0.4156        | 26.0    | 24817 | 0.3938          |
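
The log above can be summarized with a few lines of Python. The sketch below copies the (step, validation loss) pairs from the table, finds the evaluation with the lowest validation loss, and estimates the number of optimizer steps per epoch from the step count at epoch 2.0. Multiplying by the total train batch size of 64 gives a rough number of training examples per epoch; that figure is an inference from the table, not a number reported by the author.

```python
# (step, validation loss) pairs copied from the training results table.
VALIDATION_LOG = [
    (954, 0.4398), (1909, 0.4214), (2863, 0.4163), (3818, 0.4109),
    (4772, 0.4080), (5727, 0.4057), (6681, 0.4034), (7636, 0.4018),
    (8590, 0.4015), (9545, 0.4017), (10499, 0.3990), (11454, 0.3986),
    (12408, 0.3973), (13363, 0.3966), (14317, 0.3951), (15272, 0.3950),
    (16226, 0.3950), (17181, 0.3952), (18135, 0.3943), (19090, 0.3962),
    (20044, 0.3937), (20999, 0.3940), (21953, 0.3932), (22908, 0.3939),
    (23862, 0.3936), (24817, 0.3938),
]
TOTAL_TRAIN_BATCH_SIZE = 64

# Evaluation point with the lowest validation loss.
best_step, best_loss = min(VALIDATION_LOG, key=lambda row: row[1])

# Step 1909 corresponds to epoch 2.0, so one epoch is about 954.5 steps.
steps_per_epoch = 1909 / 2.0
approx_examples_per_epoch = steps_per_epoch * TOTAL_TRAIN_BATCH_SIZE

print(f"best validation loss {best_loss} at step {best_step}")
print(f"about {steps_per_epoch} steps and {approx_examples_per_epoch:.0f} examples per epoch")
```

The validation loss changes by less than 0.002 after roughly step 15,000, which suggests most of the improvement happens in the first half of training.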

	---

	## Framework Versions

	- Transformers: 4.41.2
	- PyTorch: 2.4.0+cu121
	- Datasets: 3.2.0
	- Tokenizers: 0.19.1
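
To compare a local environment against these versions, one option is to query the installed package metadata. This is a minimal sketch using only the standard library; the helper name `installed_versions` is illustrative, and the distribution names are assumed to be the usual PyPI names (`transformers`, `torch`, `datasets`, `tokenizers`).

```python
from importlib.metadata import PackageNotFoundError, version

# Versions listed in this model card.
REFERENCE_VERSIONS = {
    "transformers": "4.41.2",
    "torch": "2.4.0+cu121",
    "datasets": "3.2.0",
    "tokenizers": "0.19.1",
}


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, installed in installed_versions(REFERENCE_VERSIONS).items():
    print(f"{name}: installed={installed}, model card={REFERENCE_VERSIONS[name]}")
```

Different versions may still work; the list only records what was used during training.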

	---

	## Author

	- Bilal FAYE

	This model contributes to enhancing TTS accessibility for Wolof and French, making it a valuable resource for multilingual voice applications. 🚀
