  HIYACCENT: An Improved Nigerian-Accented Speech Recognition System Based on Contrastive Learning

The global objective of this research was to develop a more robust model for Nigerian English speakers, whose English pronunciation is heavily affected by their mother tongue. To this end, the Wav2Vec-HIYACCENT model was proposed, which introduces a new layer into Facebook's Wav2Vec2 to capture the disparity between the baseline model and Nigerian-accented English speech. A CTC loss was also inserted on top of the model, which adds flexibility to the speech-text alignment. This resulted in an improvement of over 20% in performance on Nigerian Accented English (NAE).
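
For intuition, the CTC objective scores every monotonic alignment between the frame-level predictions and the target character sequence, so no frame-by-frame labels are needed. The toy PyTorch sketch below only illustrates how a CTC loss consumes frame-level log-probabilities; it is not the HIYACCENT training code, and every shape in it is arbitrary:

```python
import torch

# Toy illustration only: T=50 frames, N=2 utterances, C=32 characters (blank = index 0).
# None of these shapes or values come from the HIYACCENT setup.
log_probs = torch.randn(50, 2, 32).log_softmax(dim=-1)  # (T, N, C) frame-level log-probs
targets = torch.randint(1, 32, (2, 10))                 # dummy target character ids
input_lengths = torch.full((2,), 50, dtype=torch.long)
target_lengths = torch.full((2,), 10, dtype=torch.long)

ctc_loss = torch.nn.CTCLoss(blank=0)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
print(loss.item())
```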

This model is facebook/wav2vec2-large fine-tuned on English using the UISpeech Corpus. When using this model, make sure that your speech input is sampled at 16 kHz.
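
If your recordings are at a different rate, they can be resampled on load. A minimal sketch using librosa (the file path is a placeholder):

```python
import librosa

# librosa resamples to the requested rate while loading;
# "/path/to/file.wav" is a placeholder for your own audio.
speech, sampling_rate = librosa.load("/path/to/file.wav", sr=16_000)
assert sampling_rate == 16_000
```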

The script used for training can be found here: https://github.com/amceejay/HIYACCENT-NE-Speech-Recognition-System

Usage

The model can be used directly (without a language model) in either of the two ways below.

Using the ASRecognition library:

```python
from asrecognition import ASREngine

# Load the engine for English and point it at the HIYACCENT checkpoint.
asr = ASREngine("en", model_path="codeceejay/HIYACCENT_Wav2Vec2")

audio_paths = ["/path/to/file.mp3", "/path/to/another_file.wav"]
transcriptions = asr.transcribe(audio_paths)
```
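
ASRecognition returns one result per input file; printing them might look like the following, assuming its documented list-of-dicts format (verify the keys against your installed version):

```python
# Expected format: [{"path": ..., "transcription": ...}, ...]
for item in transcriptions:
    print(item["path"], "->", item["transcription"])
```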

Writing your own inference script:

```python
import torch
import librosa
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

LANG_ID = "en"
MODEL_ID = "codeceejay/HIYACCENT_Wav2Vec2"
SAMPLES = 10

# You can use common_voice or timit; Nigerian-accented speech can also be found here: https://openslr.org/70/
test_dataset = load_dataset("common_voice", LANG_ID, split=f"test[:{SAMPLES}]")

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)

# Preprocessing the datasets.
# We need to read the audio files as arrays, resampled to 16 kHz.
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
    batch["speech"] = speech_array
    batch["sentence"] = batch["sentence"].upper()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)
predicted_sentences = processor.batch_decode(predicted_ids)

for i, predicted_sentence in enumerate(predicted_sentences):
    print("-" * 100)
    print("Reference:", test_dataset[i]["sentence"])
    print("Prediction:", predicted_sentence)
```
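
To put a number on the comparison above, the word error rate (WER) can be computed from the reference/prediction pairs. A minimal sketch assuming the jiwer package, which the scripts above do not use:

```python
from jiwer import wer

# Compare the upper-cased references against the model's predictions.
references = [test_dataset[i]["sentence"] for i in range(len(predicted_sentences))]
print(f"WER: {wer(references, predicted_sentences):.2%}")
```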