|
--- |
|
library_name: transformers |
|
datasets: |
|
- Sunbird/salt |
|
language: |
|
- lg |
|
- en |
|
- nyn |
|
- ach |
|
- teo |
|
- lgg |
|
base_model: |
|
- openai/whisper-large-v2 |
|
--- |
|
|
|
# Whisper large for Ugandan languages |
|
|
|
This model is an adaptation of whisper-large-v2 for the following languages widely spoken in Uganda: |
|
Luganda, Acholi, Lugbara, Ateso, Runyankole and English (Ugandan accent). |
|
|
|
## Training |
|
|
|
The model was trained with the SALT dataset, Common Voice (Luganda) and FLEURS datasets. |
|
To help with generalisation in practical settings, training used addition of random noise |
|
and random downsampling to 8kHz to simulate phone speech. |
|
|
|
# Usage |
|
|
|
The model is used in a similar way to the base Whisper model. |
|
The model will attempt to auto-detect the language and provide a transcription. |
|
However, note that language detection is not always accurate and results may be |
|
improved by specifying it instead. The languages in this model are not supported |
|
by the base Whisper model, so the format is slightly different: |
|
|
|
|
|
```python |
|
import transformers |
|
import datasets |
|
import torch |
|
|
|
processor = transformers.WhisperProcessor.from_pretrained( |
|
"Sunbird/asr-whisper-large-v2-salt") |
|
model = transformers.WhisperForConditionalGeneration.from_pretrained( |
|
"Sunbird/asr-whisper-large-v2-salt") |
|
|
|
SALT_LANGUAGE_TOKENS_WHISPER = { |
|
'eng': 50259, # English (Ugandan) |
|
'ach': 50357, # Acholi |
|
'lgg': 50356, # Lugbara |
|
'lug': 50355, # Luganda |
|
'nyn': 50354, # Runyankole |
|
'teo': 50353, # Ateso |
|
} |
|
|
|
# Get some test audio |
|
ds = datasets.load_dataset('Sunbird/salt', 'multispeaker-lug', split='test') |
|
audio = ds[0]['audio'] |
|
sample_rate = ds[0]['sample_rate'] |
|
|
|
# Specify a language from one of the above. |
|
lang = 'lug' |
|
|
|
# Apply the model |
|
device = torch.device("cuda" if torch.cuda.is_available() else "cpu") |
|
input_features = processor( |
|
audio, sampling_rate=sample_rate, return_tensors="pt").input_features |
|
input_features = input_features.to(device) |
|
predicted_ids = model.to(device).generate( |
|
input_features, |
|
# Optionally set language=None here instead to auto-detect. |
|
language=processor.tokenizer.decode(SALT_LANGUAGE_TOKENS_WHISPER[lang]), |
|
forced_decoder_ids=None) |
|
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True) |
|
|
|
print(transcription) |
|
# Ekikoola kya kasooli kya kyenvu wabula langi yaakyo etera okuba eya kitaka wansi. |
|
``` |
|
|
|
|
|
#### Performance Metrics |
|
|
|
|
|
| Lang | CER | WER | |
|
|----------|-----|-----| |
|
| eng | 0.005 | 0.013 | |
|
| lug | 0.020 | 0.095 | |
|
| ach | 0.059 | 0.242 | |
|
| lgg | 0.059 | 0.227 | |
|
| teo | 0.069 | 0.256 | |
|
| nyn | 0.079 | 0.316 | |
|
| xog | 0.100 | 0.461 | |
|
| myx | 0.119 | 0.475 | |
|
| swa | 0.183 | 0.249 | |
|
| kin | 0.216 | 0.474 | |
|
| mean | 0.091 | 0.281 | |