codeceejay committed · Commit 425ed92 · 1 Parent(s): 70f6d06

Delete Readme2

Files changed (1)
  1. Readme2 +0 -179
Readme2 DELETED
@@ -1,179 +0,0 @@
- language: en
- datasets:
- - common_voice
- metrics:
- - wer
- - cer
- tags:
- - audio
- - automatic-speech-recognition
- - speech
- - xlsr-fine-tuning-week
- license: apache-2.0
- model-index:
- - name: Wav2Vec2 English by Jonatas Grosman
-   results:
-   - task:
-       name: Speech Recognition
-       type: automatic-speech-recognition
-     dataset:
-       name: Common Voice en
-       type: common_voice
-       args: en
-     metrics:
-     - name: Test WER
-       type: wer
-       value: 21.53
-     - name: Test CER
-       type: cer
-       value: 9.66
- ---
- # Wav2vec2-Large-English
-
- Fine-tuned [facebook/wav2vec2-large](https://huggingface.co/facebook/wav2vec2-large) on English using the [Common Voice](https://huggingface.co/datasets/common_voice) dataset.
- When using this model, make sure that your speech input is sampled at 16 kHz.
-
- This model was fine-tuned thanks to the GPU credits generously provided by [OVHcloud](https://www.ovhcloud.com/en/public-cloud/ai-training/) :)
-
- The script used for training can be found here: https://github.com/jonatasgrosman/wav2vec2-sprint
-
- ## Usage
-
- The model can be used directly (without a language model) as follows.
-
- Using the [ASRecognition](https://github.com/jonatasgrosman/asrecognition) library:
-
- ```python
- from asrecognition import ASREngine
- asr = ASREngine("en", model_path="jonatasgrosman/wav2vec2-large-english")
- audio_paths = ["/path/to/file.mp3", "/path/to/another_file.wav"]
- transcriptions = asr.transcribe(audio_paths)
- ```
-
- Writing your own inference script:
-
- ```python
- import torch
- import librosa
- from datasets import load_dataset
- from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
- LANG_ID = "en"
- MODEL_ID = "jonatasgrosman/wav2vec2-large-english"
- SAMPLES = 10
- test_dataset = load_dataset("common_voice", LANG_ID, split=f"test[:{SAMPLES}]")
- processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
- model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)
- # Preprocessing the datasets.
- # We need to read the audio files as arrays
- def speech_file_to_array_fn(batch):
-     speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
-     batch["speech"] = speech_array
-     batch["sentence"] = batch["sentence"].upper()
-     return batch
- test_dataset = test_dataset.map(speech_file_to_array_fn)
- inputs = processor(test_dataset["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
- with torch.no_grad():
-     logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
- predicted_ids = torch.argmax(logits, dim=-1)
- predicted_sentences = processor.batch_decode(predicted_ids)
- for i, predicted_sentence in enumerate(predicted_sentences):
-     print("-" * 100)
-     print("Reference:", test_dataset[i]["sentence"])
-     print("Prediction:", predicted_sentence)
- ```
-
- | Reference | Prediction |
- | ------------- | ------------- |
- | "SHE'LL BE ALL RIGHT." | SHELL BE ALL RIGHT |
- | SIX | SIX |
- | "ALL'S WELL THAT ENDS WELL." | ALLAS WELL THAT ENDS WELL |
- | DO YOU MEAN IT? | W MEAN IT |
- | THE NEW PATCH IS LESS INVASIVE THAN THE OLD ONE, BUT STILL CAUSES REGRESSIONS. | THE NEW PATCH IS LESS INVASIVE THAN THE OLD ONE BUT STILL CAUSES REGRESTION |
- | HOW IS MOZILLA GOING TO HANDLE AMBIGUITIES LIKE QUEUE AND CUE? | HOW IS MOSILLA GOING TO BANDL AND BE WHIT IS LIKE QU AND QU |
- | "I GUESS YOU MUST THINK I'M KINDA BATTY." | RUSTION AS HAME AK AN THE POT |
- | NO ONE NEAR THE REMOTE MACHINE YOU COULD RING? | NO ONE NEAR THE REMOTE MACHINE YOU COULD RING |
- | SAUCE FOR THE GOOSE IS SAUCE FOR THE GANDER. | SAUCE FOR THE GUCE IS SAUCE FOR THE GONDER |
- | GROVES STARTED WRITING SONGS WHEN SHE WAS FOUR YEARS OLD. | GRAFS STARTED WRITING SONGS WHEN SHE WAS FOUR YEARS OLD |
-
- ## Evaluation
-
- The model can be evaluated as follows on the English (en) test data of Common Voice.
-
- ```python
- import torch
- import re
- import warnings
- import librosa
- from datasets import load_dataset, load_metric
- from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
- LANG_ID = "en"
- MODEL_ID = "jonatasgrosman/wav2vec2-large-english"
- DEVICE = "cuda"
- CHARS_TO_IGNORE = [",", "?", "¿", ".", "!", "¡", ";", ";", ":", '""', "%", '"', "�", "ʿ", "·", "჻", "~", "՞",
- "؟", "،", "।", "॥", "«", "»", "„", "“", "”", "「", "」", "‘", "’", "《", "》", "(", ")", "[", "]",
- "{", "}", "=", "`", "_", "+", "<", ">", "…", "–", "°", "´", "ʾ", "‹", "›", "©", "®", "—", "→", "。",
- "、", "﹂", "﹁", "‧", "~", "﹏", ",", "{", "}", "(", ")", "[", "]", "【", "】", "‥", "〽",
- "『", "』", "〝", "〟", "⟨", "⟩", "〜", ":", "!", "?", "♪", "؛", "/", "\\", "º", "−", "^", "ʻ", "ˆ"]
- test_dataset = load_dataset("common_voice", LANG_ID, split="test")
- wer = load_metric("wer.py") # https://github.com/jonatasgrosman/wav2vec2-sprint/blob/main/wer.py
- cer = load_metric("cer.py") # https://github.com/jonatasgrosman/wav2vec2-sprint/blob/main/cer.py
- chars_to_ignore_regex = f"[{re.escape(''.join(CHARS_TO_IGNORE))}]"
- processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
- model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)
- model.to(DEVICE)
- # Preprocessing the datasets.
- # We need to read the audio files as arrays
- def speech_file_to_array_fn(batch):
-     with warnings.catch_warnings():
-         warnings.simplefilter("ignore")
-         speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
-     batch["speech"] = speech_array
-     batch["sentence"] = re.sub(chars_to_ignore_regex, "", batch["sentence"]).upper()
-     return batch
- test_dataset = test_dataset.map(speech_file_to_array_fn)
- # Running inference in batches.
- # We take the argmax over the logits and decode the predicted ids to text
- def evaluate(batch):
-     inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
-     with torch.no_grad():
-         logits = model(inputs.input_values.to(DEVICE), attention_mask=inputs.attention_mask.to(DEVICE)).logits
-     pred_ids = torch.argmax(logits, dim=-1)
-     batch["pred_strings"] = processor.batch_decode(pred_ids)
-     return batch
- result = test_dataset.map(evaluate, batched=True, batch_size=8)
- predictions = [x.upper() for x in result["pred_strings"]]
- references = [x.upper() for x in result["sentence"]]
- print(f"WER: {wer.compute(predictions=predictions, references=references, chunk_size=1000) * 100}")
- print(f"CER: {cer.compute(predictions=predictions, references=references, chunk_size=1000) * 100}")
- ```
-
- **Test Result**:
-
- In the table below I report the Word Error Rate (WER) and the Character Error Rate (CER) of the model. I ran the evaluation script described above on other models as well (on 2021-06-17). Note that the table below may show results that differ from those already reported; this may be due to specifics of the other evaluation scripts used.
-
- | Model | WER | CER |
- | ------------- | ------------- | ------------- |
- | jonatasgrosman/wav2vec2-large-xlsr-53-english | **18.98%** | **8.29%** |
- | jonatasgrosman/wav2vec2-large-english | 21.53% | 9.66% |
- | facebook/wav2vec2-large-960h-lv60-self | 22.03% | 10.39% |
- | facebook/wav2vec2-large-960h-lv60 | 23.97% | 11.14% |
- | boris/xlsr-en-punctuation | 29.10% | 10.75% |
- | facebook/wav2vec2-large-960h | 32.79% | 16.03% |
- | facebook/wav2vec2-base-960h | 39.86% | 19.89% |
- | facebook/wav2vec2-base-100h | 51.06% | 25.06% |
- | elgeish/wav2vec2-large-lv60-timit-asr | 59.96% | 34.28% |
- | facebook/wav2vec2-base-10k-voxpopuli-ft-en | 66.41% | 36.76% |
- | elgeish/wav2vec2-base-timit-asr | 68.78% | 36.81% |
-
- ## Citation
- If you want to cite this model you can use this:
-
- ```bibtex
- @misc{grosman2021wav2vec2-large-english,
-   title={Wav2Vec2 English by Jonatas Grosman},
-   author={Grosman, Jonatas},
-   publisher={Hugging Face},
-   journal={Hugging Face Hub},
-   howpublished={\url{https://huggingface.co/jonatasgrosman/wav2vec2-large-english}},
-   year={2021}
- }
- ```
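
Note on the metrics in the deleted card: WER and CER are both edit-distance ratios, computed over words and over characters respectively. The sketch below is a minimal, dependency-free illustration of that idea and is not the `wer.py` / `cer.py` referenced in the card. The helper names (`edit_distance`, `error_rate`, `wer`, `cer`) are hypothetical, and it assumes that errors are summed over the whole corpus before dividing and that spaces count as characters for CER; the referenced scripts may differ on both points.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference token
                curr[j - 1] + 1,     # insert a hypothesis token
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def error_rate(references, predictions, tokenize):
    """Total edit distance divided by total reference length (assumed corpus-level)."""
    errors = 0
    total = 0
    for ref, hyp in zip(references, predictions):
        ref_tokens, hyp_tokens = tokenize(ref), tokenize(hyp)
        errors += edit_distance(ref_tokens, hyp_tokens)
        total += len(ref_tokens)
    return errors / total


def wer(references, predictions):
    """Word error rate: tokens are whitespace-separated words."""
    return error_rate(references, predictions, str.split)


def cer(references, predictions):
    """Character error rate: tokens are single characters, spaces included."""
    return error_rate(references, predictions, list)


if __name__ == "__main__":
    # One row from the card's example table, with punctuation stripped.
    refs = ["SAUCE FOR THE GOOSE IS SAUCE FOR THE GANDER"]
    preds = ["SAUCE FOR THE GUCE IS SAUCE FOR THE GONDER"]
    print(f"WER: {wer(refs, preds) * 100:.2f}")
    print(f"CER: {cer(refs, preds) * 100:.2f}")
```

In that example two of the nine reference words are wrong, so the word-level rate is 2/9, while the character-level rate is much lower because most characters in the two misrecognized words still match.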