---
language: en
datasets:
- common_voice
metrics:
- wer
- cer
tags:
- audio
- automatic-speech-recognition
- speech
- xlsr-fine-tuning-week
license: apache-2.0
model-index:
- name: Wav2Vec2 English by Jonatas Grosman
  results:
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice en
      type: common_voice
      args: en
    metrics:
    - name: Test WER
      type: wer
      value: 21.53
    - name: Test CER
      type: cer
      value: 9.66
---

# Wav2Vec2-Large-English

Fine-tuned [facebook/wav2vec2-large](https://huggingface.co/facebook/wav2vec2-large) on English using the [Common Voice](https://huggingface.co/datasets/common_voice) dataset.
When using this model, make sure that your speech input is sampled at 16 kHz.
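
If your recordings use a different sampling rate, resample them on load. A minimal sketch using librosa (the same library the scripts below rely on; the file path is a placeholder):

```python
import librosa

# librosa resamples to the requested rate on load; the model expects 16 kHz.
speech_array, sampling_rate = librosa.load("/path/to/file.mp3", sr=16_000)
```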

This model has been fine-tuned thanks to the GPU credits generously provided by [OVHcloud](https://www.ovhcloud.com/en/public-cloud/ai-training/) :)

The script used for training can be found here: https://github.com/jonatasgrosman/wav2vec2-sprint

## Usage

The model can be used directly (without a language model) as follows...

Using the [ASRecognition](https://github.com/jonatasgrosman/asrecognition) library:

```python
from asrecognition import ASREngine

# Create an English ASR engine backed by this model.
asr = ASREngine("en", model_path="jonatasgrosman/wav2vec2-large-english")

audio_paths = ["/path/to/file.mp3", "/path/to/another_file.wav"]
transcriptions = asr.transcribe(audio_paths)
```
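
Each item returned by `transcribe` pairs an input path with its decoded text; a quick way to inspect the results (the `path`/`transcription` field names follow the ASRecognition README, so treat them as assumptions):

```python
# Print each file next to its transcription.
for item in transcriptions:
    print(item["path"], "->", item["transcription"])
```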

Writing your own inference script:

```python
import torch
import librosa
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

LANG_ID = "en"
MODEL_ID = "jonatasgrosman/wav2vec2-large-english"
SAMPLES = 10

test_dataset = load_dataset("common_voice", LANG_ID, split=f"test[:{SAMPLES}]")

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)

# Preprocessing the datasets.
# We need to read the audio files as arrays.
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
    batch["speech"] = speech_array
    batch["sentence"] = batch["sentence"].upper()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)
predicted_sentences = processor.batch_decode(predicted_ids)

for i, predicted_sentence in enumerate(predicted_sentences):
    print("-" * 100)
    print("Reference:", test_dataset[i]["sentence"])
    print("Prediction:", predicted_sentence)
```

| Reference | Prediction |
| ------------- | ------------- |
| "SHE'LL BE ALL RIGHT." | SHELL BE ALL RIGHT |
| SIX | SIX |
| "ALL'S WELL THAT ENDS WELL." | ALLAS WELL THAT ENDS WELL |
| DO YOU MEAN IT? | W MEAN IT |
| THE NEW PATCH IS LESS INVASIVE THAN THE OLD ONE, BUT STILL CAUSES REGRESSIONS. | THE NEW PATCH IS LESS INVASIVE THAN THE OLD ONE BUT STILL CAUSES REGRESTION |
| HOW IS MOZILLA GOING TO HANDLE AMBIGUITIES LIKE QUEUE AND CUE? | HOW IS MOSILLA GOING TO BANDL AND BE WHIT IS LIKE QU AND QU |
| "I GUESS YOU MUST THINK I'M KINDA BATTY." | RUSTION AS HAME AK AN THE POT |
| NO ONE NEAR THE REMOTE MACHINE YOU COULD RING? | NO ONE NEAR THE REMOTE MACHINE YOU COULD RING |
| SAUCE FOR THE GOOSE IS SAUCE FOR THE GANDER. | SAUCE FOR THE GUCE IS SAUCE FOR THE GONDER |
| GROVES STARTED WRITING SONGS WHEN SHE WAS FOUR YEARS OLD. | GRAFS STARTED WRITING SONGS WHEN SHE WAS FOUR YEARS OLD |

## Evaluation

The model can be evaluated as follows on the English (en) test data of Common Voice.

```python
import torch
import re
import warnings
import librosa
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

LANG_ID = "en"
MODEL_ID = "jonatasgrosman/wav2vec2-large-english"
DEVICE = "cuda"

CHARS_TO_IGNORE = [",", "?", "¿", ".", "!", "¡", ";", ";", ":", '""', "%", '"', "�", "ʿ", "·", "჻", "~", "՞",
                   "؟", "،", "।", "॥", "«", "»", "„", "“", "”", "「", "」", "‘", "’", "《", "》", "(", ")", "[", "]",
                   "{", "}", "=", "`", "_", "+", "<", ">", "…", "–", "°", "´", "ʾ", "‹", "›", "©", "®", "—", "→", "。",
                   "、", "﹂", "﹁", "‧", "~", "﹏", ",", "{", "}", "(", ")", "[", "]", "【", "】", "‥", "〽",
                   "『", "』", "〝", "〟", "⟨", "⟩", "〜", ":", "!", "?", "♪", "؛", "/", "\\", "º", "−", "^", "ʻ", "ˆ"]

test_dataset = load_dataset("common_voice", LANG_ID, split="test")

wer = load_metric("wer.py")  # https://github.com/jonatasgrosman/wav2vec2-sprint/blob/main/wer.py
cer = load_metric("cer.py")  # https://github.com/jonatasgrosman/wav2vec2-sprint/blob/main/cer.py

chars_to_ignore_regex = f"[{re.escape(''.join(CHARS_TO_IGNORE))}]"

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)
model.to(DEVICE)

# Preprocessing the datasets.
# We need to read the audio files as arrays.
def speech_file_to_array_fn(batch):
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
    batch["speech"] = speech_array
    batch["sentence"] = re.sub(chars_to_ignore_regex, "", batch["sentence"]).upper()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Run batched inference and collect the predicted strings.
def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values.to(DEVICE), attention_mask=inputs.attention_mask.to(DEVICE)).logits
    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)

predictions = [x.upper() for x in result["pred_strings"]]
references = [x.upper() for x in result["sentence"]]

print(f"WER: {wer.compute(predictions=predictions, references=references, chunk_size=1000) * 100}")
print(f"CER: {cer.compute(predictions=predictions, references=references, chunk_size=1000) * 100}")
```
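
As a cross-check, similar numbers can be computed with the general-purpose [jiwer](https://github.com/jitsi/jiwer) library instead of the custom wer.py/cer.py scripts linked above; a minimal sketch, assuming the `predictions` and `references` lists from the script above:

```python
import jiwer

# jiwer computes word and character error rates directly from string lists.
wer_score = jiwer.wer(references, predictions)
cer_score = jiwer.cer(references, predictions)
print(f"WER: {wer_score * 100:.2f}%, CER: {cer_score * 100:.2f}%")
```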

**Test Result**:

In the table below I report the Word Error Rate (WER) and the Character Error Rate (CER) of the model. I ran the same evaluation script described above on other models as well (on 2021-06-17). Note that the table below may show results that differ from those already reported elsewhere; this may be due to specifics of the other evaluation scripts used.

| Model | WER | CER |
| ------------- | ------------- | ------------- |
| jonatasgrosman/wav2vec2-large-xlsr-53-english | **18.98%** | **8.29%** |
| jonatasgrosman/wav2vec2-large-english | 21.53% | 9.66% |
| facebook/wav2vec2-large-960h-lv60-self | 22.03% | 10.39% |
| facebook/wav2vec2-large-960h-lv60 | 23.97% | 11.14% |
| boris/xlsr-en-punctuation | 29.10% | 10.75% |
| facebook/wav2vec2-large-960h | 32.79% | 16.03% |
| facebook/wav2vec2-base-960h | 39.86% | 19.89% |
| facebook/wav2vec2-base-100h | 51.06% | 25.06% |
| elgeish/wav2vec2-large-lv60-timit-asr | 59.96% | 34.28% |
| facebook/wav2vec2-base-10k-voxpopuli-ft-en | 66.41% | 36.76% |
| elgeish/wav2vec2-base-timit-asr | 68.78% | 36.81% |

## Citation

If you want to cite this model, you can use this:

```bibtex
@misc{grosman2021wav2vec2-large-english,
  title={Wav2Vec2 English by Jonatas Grosman},
  author={Grosman, Jonatas},
  publisher={Hugging Face},
  journal={Hugging Face Hub},
  howpublished={\url{https://huggingface.co/jonatasgrosman/wav2vec2-large-english}},
  year={2021}
}
```