metadata
library_name: transformers
datasets:
  - alakxender/dv_syn_speech_md
language:
  - dv
pipeline_tag: text-to-audio
license: mit
base_model:
  - facebook/mms-tts-div
tags:
  - dhivehi-tts

Divehi TTS – Female Voice (VITS-based)

This is a fine-tuned VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech) model for Divehi speech synthesis. It produces female-voice audio from Divehi text written in Thaana script. It was fine-tuned from Meta’s MMS-TTS Divehi checkpoint (facebook/mms-tts-div) on a curated dataset of synthetic Divehi speech (alakxender/dv_syn_speech_md).

Model Details

| Field | Value |
|---|---|
| Model ID | alakxender/mms-tts-div-finetuned-md-f01 |
| Base Architecture | MMS-TTS (VITS) |
| Language | Divehi (dv) |
| Voice | Female |
| Sampling Rate | 16 kHz |
| Tokenizer | VitsTokenizer |
| Inference Engine | Transformers (🤗 Hugging Face) |

Usage

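import torch
import torchaudio
from transformers import VitsModel, VitsTokenizer

model_id = "alakxender/mms-tts-div-finetuned-md-f01"
tokenizer = VitsTokenizer.from_pretrained(model_id)
model = VitsModel.from_pretrained(model_id)

text = "މޫސުން ވަރަށް ގޯސްވެ، ފުވައްމުލަކުން ފެށިގެން އައްޑުއަށް އޮރެންޖް އެލާޓް ނެރެފި"
inputs = tokenizer(text, return_tensors="pt")

# VITS synthesizes in a single forward pass; the output waveform has
# shape (batch, num_samples), so [0] selects the only utterance.
with torch.no_grad():
    waveform = model(**inputs).waveform[0]

# torchaudio.save expects (channels, num_samples); the rate is 16 kHz.
torchaudio.save("output.wav", waveform.unsqueeze(0), model.config.sampling_rate)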
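
If torchaudio is not available, the waveform can also be written with Python's standard library. The sketch below is one possible approach, not part of the original model card: the helper names `float_to_pcm16` and `write_wav` are hypothetical, and it assumes the samples are floats in roughly [-1.0, 1.0] at the 16 kHz rate listed under Model Details.

```python
import struct
import wave


def float_to_pcm16(samples):
    """Convert floats to little-endian 16-bit PCM bytes, clipping to [-1.0, 1.0]."""
    clipped = (max(-1.0, min(1.0, float(s))) for s in samples)
    ints = [int(round(s * 32767)) for s in clipped]
    return struct.pack("<%dh" % len(ints), *ints)


def write_wav(path, samples, sample_rate=16000):
    """Write a mono 16-bit WAV file and return the number of frames written."""
    pcm = float_to_pcm16(samples)
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)       # mono
        wav_file.setsampwidth(2)       # 2 bytes = 16-bit samples
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm)
    return len(pcm) // 2
```

With the snippet above, `write_wav("output.wav", waveform.detach().tolist())` should produce an equivalent file, assuming `waveform` is the one-dimensional tensor for a single utterance.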

Evaluation Summary

  • Model: alakxender/mms-tts-div-finetuned-md-f01
  • Evaluated Samples: 3
  • Avg Estimated MOS (UTMOS): 2.349 (between "Poor" and "Fair" on the 1–5 scale below)
    {
      "5": "Excellent (very natural)",
      "4": "Good (mostly natural)",
      "3": "Fair (some robotic quality)",
      "2": "Poor (noticeably unnatural)",
      "1": "Bad (unintelligible or very synthetic)"
    }
    
  • Artifacts:
    • 🎵 Audio: outputs/audio/
    • 📊 Spectrograms: outputs/spectrograms/
    • 📄 Report: outputs/report.txt
    • 📈 MOS Scores: outputs/mos_scores.txt
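
To make the scale above easier to apply, here is a minimal sketch that maps a MOS value to its label. Rounding to the nearest integer band (half up) and clamping to 1–5 are assumptions made for illustration. The evaluation report may bucket scores differently, and `mos_label` is a hypothetical helper name.

```python
# Labels copied from the scale in the Evaluation Summary.
MOS_SCALE = {
    5: "Excellent (very natural)",
    4: "Good (mostly natural)",
    3: "Fair (some robotic quality)",
    2: "Poor (noticeably unnatural)",
    1: "Bad (unintelligible or very synthetic)",
}


def mos_label(score):
    """Return the scale label for the band nearest to `score` (assumed rule)."""
    band = int(score + 0.5)          # round half up for positive scores
    band = max(1, min(5, band))      # clamp to the 1-5 scale
    return MOS_SCALE[band]


print(mos_label(2.349))  # Poor (noticeably unnatural)
```

Under this rounding rule the reported average of 2.349 falls in the "Poor" band. With only 3 evaluated samples, the figure is best read as a rough indication.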

Acknowledgements