acharyasagar committed on
Commit 0ecdc8f · verified · 1 Parent(s): 9e3ce12

Upload folder using huggingface_hub

Files changed (1)
  1. README.md +9 -413
README.md CHANGED
@@ -1,416 +1,12 @@
1
- # whisper-timestamped
2
 
3
- Multilingual Automatic Speech Recognition with word-level timestamps and confidence.
 
 
 
 
 
 
4
 
5
- * [Description](#description)
6
- * [Notes on other approaches](#notes-on-other-approaches)
7
- * [Installation](#installation)
8
- * [First installation](#first-installation)
9
- * [Additional packages that might be needed](#additional-packages-that-might-be-needed)
10
- * [Docker](#docker)
11
- * [Light installation for CPU](#light-installation-for-cpu)
12
- * [Upgrade to the latest version](#upgrade-to-the-latest-version)
13
- * [Usage](#usage)
14
- * [Python](#python)
15
- * [Command line](#command-line)
16
- * [Plot of word alignment](#plot-of-word-alignment)
17
- * [Example output](#example-output)
18
- * [Options that may improve results](#options-that-may-improve-results)
19
- * [Accurate Whisper transcription](#accurate-whisper-transcription)
20
- * [Running Voice Activity Detection (VAD) before sending to Whisper](#running-voice-activity-detection-vad-before-sending-to-whisper)
21
- * [Detecting disfluencies](#detecting-disfluencies)
22
- * [Acknowledgment](#acknowledgment)
23
- * [Citations](#citations)
24
 
25
- ## Description
26
-
27
- [Whisper](https://openai.com/blog/whisper/) is a set of multi-lingual, robust speech recognition models trained by OpenAI that achieve state-of-the-art results in many languages. Whisper models were trained to predict approximate timestamps on speech segments (most of the time with 1-second accuracy), but they cannot originally predict word timestamps. This repository proposes an implementation to **predict word timestamps and provide a more accurate estimation of speech segments when transcribing with Whisper models**.
28
- Besides, a confidence score is assigned to each word and each segment.
29
-
30
- The approach is based on Dynamic Time Warping (DTW) applied to cross-attention weights, as demonstrated by [this notebook by Jong Wook Kim](https://github.com/openai/whisper/blob/f82bc59f5ea234d4b97fb2860842ed38519f7e65/notebooks/Multilingual_ASR.ipynb). There are some additions to this notebook:
31
- * The start/end estimation is more accurate.
32
- * Confidence scores are assigned to each word.
33
- * **If possible (without beam search...)**, no additional inference steps are required to predict word timestamps (word alignment is done on the fly after each speech segment is decoded).
34
- * Special care has been taken regarding memory usage: `whisper-timestamped` is able to process long files with little additional memory compared to the regular use of the Whisper model.
35
-
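As a rough illustration of the DTW-on-attention idea described above (a toy sketch only, not the repository's actual implementation), the `dtw-python` package can align decoded tokens to audio frames given a token-by-frame cost derived from cross-attention weights; the attention matrix and sizes below are made up for the example:

```python
# Illustration only: a toy DTW alignment between decoded tokens and audio frames,
# NOT the actual whisper-timestamped implementation.
import numpy as np
from dtw import dtw  # pip3 install dtw-python

num_tokens, num_frames = 12, 300                      # made-up sizes
attention = np.random.rand(num_tokens, num_frames)    # stand-in for cross-attention weights
attention /= attention.sum(axis=-1, keepdims=True)    # normalize per token

cost = -np.log(attention + 1e-9)   # high attention -> low cost
alignment = dtw(cost)              # DTW over the local cost matrix

# Each Whisper encoder frame covers 0.02 s of audio; the first frame matched
# to a token gives a rough start time for that token.
starts = {}
for tok, frame in zip(alignment.index1, alignment.index2):
    starts.setdefault(int(tok), float(frame) * 0.02)
print(starts)
```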
36
- `whisper-timestamped` is an extension of the [`openai-whisper`](https://pypi.org/project/whisper-openai/) Python package and is meant to be compatible with any version of `openai-whisper`.
37
- It provides more efficient/accurate word timestamps, along with those additional features:
38
- * Voice Activity Detection (VAD) can be run before applying the Whisper model,
39
- to avoid hallucinations due to errors in the training data (for instance, predicting "Thank you for watching!" on pure silence).
40
- Several VAD methods are available: silero (default), silero:v3.1, auditok
41
- * When the language is not specified, the language probabilities are provided among the outputs.
42
-
43
- ### Notes on other approaches
44
-
45
- An alternative relevant approach to recovering word-level timestamps involves using wav2vec models that predict characters, as successfully implemented in [whisperX](https://github.com/m-bain/whisperX). However, these approaches have several drawbacks that are not present in approaches based on cross-attention weights such as `whisper_timestamped`. These drawbacks include:
46
- * The need to find one wav2vec model per language to support, which does not scale well with the multi-lingual capabilities of Whisper.
47
- * The need to handle (at least) one additional neural network (wav2vec model), which consumes memory.
48
- * The need to normalize characters in Whisper transcription to match the character set of the wav2vec model. This involves awkward language-dependent conversions, such as converting numbers to words ("2" -> "two"), symbols to words ("%" -> "percent", "€" -> "euro(s)")...
49
- * The lack of robustness around speech disfluencies (fillers, hesitations, repeated words...) that are usually removed by Whisper.
50
-
51
- An alternative approach that does not require an additional model is to look at the probabilities of timestamp tokens estimated by the Whisper model after each (sub)word token is predicted. This was implemented, for instance, in whisper.cpp and stable-ts. However, this approach lacks robustness because Whisper models have not been trained to output meaningful timestamps after each word. Whisper models tend to predict timestamps only after a certain number of words have been predicted (typically at the end of a sentence), and the probability distribution of timestamps outside this condition may be inaccurate. In practice, these methods can produce results that are totally out-of-sync on some periods of time (we observed this especially when there is jingle music). Also, the timestamp precision of Whisper models tends to be rounded to 1 second (as in many video subtitles), which is too inaccurate for words, and reaching better accuracy is tricky.
52
-
53
- ## Installation
54
-
55
- ### First installation
56
-
57
- Requirements:
58
- * `python3` (version 3.7 or higher; at least 3.9 is recommended)
59
- * `ffmpeg` (see instructions for installation on the [whisper repository](https://github.com/openai/whisper))
60
-
61
- You can install `whisper-timestamped` either by using pip:
62
- ```bash
63
- pip3 install whisper-timestamped
64
- ```
65
-
66
- or by cloning this repository and running installation:
67
- ```bash
68
- git clone https://github.com/linto-ai/whisper-timestamped
69
- cd whisper-timestamped/
70
- python3 setup.py install
71
- ```
72
-
73
- #### Additional packages that might be needed
74
-
75
- If you want to plot alignment between audio timestamps and words (as in [this section](#plot-of-word-alignment)), you also need matplotlib:
76
- ```bash
77
- pip3 install matplotlib
78
- ```
79
-
80
- If you want to use the VAD option (Voice Activity Detection before running the Whisper model), you also need torchaudio and onnxruntime:
81
- ```bash
82
- pip3 install onnxruntime torchaudio
83
- ```
84
-
85
- If you want to use finetuned Whisper models from the Hugging Face Hub, you also need transformers:
86
- ```bash
87
- pip3 install transformers
88
- ```
89
-
90
- #### Docker
91
-
92
- A docker image of about 9GB can be built using:
93
- ```bash
94
- git clone https://github.com/linto-ai/whisper-timestamped
95
- cd whisper-timestamped/
96
- docker build -t whisper_timestamped:latest .
97
- ```
98
-
99
- ### Light installation for CPU
100
-
101
- If you don't have a GPU (or don't want to use it), then you don't need to install the CUDA dependencies. You should then just install a light version of torch **before** installing whisper-timestamped, for instance as follows:
102
- ```bash
103
- pip3 install \
104
- torch==1.13.1+cpu \
105
- torchaudio==0.13.1+cpu \
106
- -f https://download.pytorch.org/whl/torch_stable.html
107
- ```
108
-
109
- A specific docker image of about 3.5GB can also be built using:
110
- ```bash
111
- git clone https://github.com/linto-ai/whisper-timestamped
112
- cd whisper-timestamped/
113
- docker build -t whisper_timestamped_cpu:latest -f Dockerfile.cpu .
114
- ```
115
-
116
- ### Upgrade to the latest version
117
-
118
- When using pip, the library can be updated to the latest version using:
119
- ```bash
120
- pip3 install --upgrade --no-deps --force-reinstall git+https://github.com/linto-ai/whisper-timestamped
121
- ```
122
-
123
- A specific version of `openai-whisper` can be used by running, for example:
124
- ```bash
125
- pip3 install openai-whisper==20230124
126
- ```
127
-
128
- ## Usage
129
-
130
- ### Python
131
-
132
- In Python, you can use the function `whisper_timestamped.transcribe()`, which is similar to the function `whisper.transcribe()`:
133
- ```python
134
- import whisper_timestamped
135
- help(whisper_timestamped.transcribe)
136
- ```
137
- The main difference with `whisper.transcribe()` is that the output includes a key `"words"` for all segments, with the start and end position of each word. Note that words include punctuation marks. See the example [below](#example-output).
138
-
139
- Besides, the default decoding options are different, in order to favour efficient decoding (greedy decoding instead of beam search, and no temperature sampling fallback). To get the same defaults as in `whisper`, use ```beam_size=5, best_of=5, temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0)```.
140
-
141
- There are also additional options related to word alignment.
142
-
143
- In general, if you import `whisper_timestamped` instead of `whisper` in your Python script and use `transcribe(model, ...)` instead of `model.transcribe(...)`, it should do the job:
144
- ```python
145
- import whisper_timestamped as whisper
146
-
147
- audio = whisper.load_audio("AUDIO.wav")
148
-
149
- model = whisper.load_model("tiny", device="cpu")
150
-
151
- result = whisper.transcribe(model, audio, language="fr")
152
-
153
- import json
154
- print(json.dumps(result, indent = 2, ensure_ascii = False))
155
- ```
156
-
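For instance, here is a minimal sketch (assuming the `result` dictionary returned by the call above, with the keys shown in the [example output](#example-output) below) that prints each word with its timestamps and confidence:

```python
# A minimal sketch: read back word-level timestamps and confidence scores
# from the `result` dictionary returned by whisper.transcribe() above.
for segment in result["segments"]:
    for word in segment["words"]:
        print(f'{word["start"]:6.2f}s -> {word["end"]:6.2f}s  '
              f'{word["text"]} (confidence: {word["confidence"]:.2f})')
```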
157
- Note that you can use a finetuned Whisper model from HuggingFace or a local folder by using the `load_model` method of `whisper_timestamped`. For instance, if you want to use [whisper-large-v2-nob](https://huggingface.co/NbAiLab/whisper-large-v2-nob), you can simply do the following:
158
- ```python
159
- import whisper_timestamped as whisper
160
-
161
- model = whisper.load_model("NbAiLab/whisper-large-v2-nob", device="cpu")
162
-
163
- # ...
164
- ```
165
-
166
- ### Command line
167
-
168
- You can also use `whisper_timestamped` on the command line, similarly to `whisper`. See help with:
169
- ```bash
170
- whisper_timestamped --help
171
- ```
172
-
173
- The main differences with `whisper` CLI are:
174
- * Output files:
175
- * The output JSON contains word timestamps and confidence scores. See example [below](#example-output).
176
- * There is an additional CSV output format.
177
- * For SRT, VTT, TSV formats, there will be additional files saved with word timestamps.
178
- * Some default options are different:
179
- * By default, no output folder is set: use `--output_dir .` for the Whisper default.
180
- * By default, verbose mode is off: use `--verbose True` for the Whisper default.
181
- * By default, beam search decoding and temperature sampling fallback are disabled, to favour efficient decoding.
182
- To use the same settings as the Whisper default, you can use `--accurate` (an alias for ```--beam_size 5 --temperature_increment_on_fallback 0.2 --best_of 5```).
183
- * There are some additional specific options:
184
- <!-- * `--efficient` to use a faster greedy decoding (without beam search neither several sampling at each step),
185
- which enables a special path where word timestamps are computed on the fly (no need to run inference twice).
186
- Note that transcription results might be significantly worse on challenging audios with this option. -->
187
- * `--compute_confidence` to enable/disable the computation of confidence scores for each word.
188
- * `--punctuations_with_words` to decide whether punctuation marks should be included with the preceding words or not.
189
-
190
- An example command to process several files using the `tiny` model and output the results in the current folder, as would be done by default with whisper, is as follows:
191
- ```bash
192
- whisper_timestamped audio1.flac audio2.mp3 audio3.wav --model tiny --output_dir .
193
- ```
194
-
195
- Note that you can use a fine-tuned Whisper model from HuggingFace or a local folder. For instance, if you want to use the [whisper-large-v2-nob](https://huggingface.co/NbAiLab/whisper-large-v2-nob) model, you can simply do the following:
196
- ```bash
197
- whisper_timestamped --model NbAiLab/whisper-large-v2-nob <...>
198
- ```
199
-
200
- ### Plot of word alignment
201
-
202
- Note that you can use the `plot_word_alignment` option of the `whisper_timestamped.transcribe()` Python function or the `--plot` option of the `whisper_timestamped` CLI to see the word alignment for each segment.
203
-
204
- ![Example alignment](figs/example_alignement_plot.png)
205
-
206
- * The upper plot represents the transformation of cross-attention weights used for alignment with Dynamic Time Warping. The abscissa represents time, and the ordinate represents the predicted tokens, with special timestamp tokens at the beginning and end, and (sub)words and punctuation in the middle.
207
- * The lower plot is an MFCC representation of the input signal (features used by Whisper, based on Mel-frequency cepstrum).
208
- * The vertical dotted red lines show where the word boundaries are found (with punctuation marks "glued" to the previous word).
209
-
210
- ### Example output
211
-
212
- The output of the `whisper_timestamped.transcribe()` function is a Python dictionary,
213
- which can be viewed in JSON format using the CLI.
214
-
215
- The JSON schema can be seen in [tests/json_schema.json](tests/json_schema.json).
216
-
217
- Here is an example output:
218
- ```bash
219
- whisper_timestamped AUDIO_FILE.wav --model tiny --language fr
220
- ```
221
- ```json
222
- {
223
- "text": " Bonjour! Est-ce que vous allez bien?",
224
- "segments": [
225
- {
226
- "id": 0,
227
- "seek": 0,
228
- "start": 0.5,
229
- "end": 1.2,
230
- "text": " Bonjour!",
231
- "tokens": [ 25431, 2298 ],
232
- "temperature": 0.0,
233
- "avg_logprob": -0.6674491882324218,
234
- "compression_ratio": 0.8181818181818182,
235
- "no_speech_prob": 0.10241222381591797,
236
- "confidence": 0.51,
237
- "words": [
238
- {
239
- "text": "Bonjour!",
240
- "start": 0.5,
241
- "end": 1.2,
242
- "confidence": 0.51
243
- }
244
- ]
245
- },
246
- {
247
- "id": 1,
248
- "seek": 200,
249
- "start": 2.02,
250
- "end": 4.48,
251
- "text": " Est-ce que vous allez bien?",
252
- "tokens": [ 50364, 4410, 12, 384, 631, 2630, 18146, 3610, 2506, 50464 ],
253
- "temperature": 0.0,
254
- "avg_logprob": -0.43492694334550336,
255
- "compression_ratio": 0.7714285714285715,
256
- "no_speech_prob": 0.06502953916788101,
257
- "confidence": 0.595,
258
- "words": [
259
- {
260
- "text": "Est-ce",
261
- "start": 2.02,
262
- "end": 3.78,
263
- "confidence": 0.441
264
- },
265
- {
266
- "text": "que",
267
- "start": 3.78,
268
- "end": 3.84,
269
- "confidence": 0.948
270
- },
271
- {
272
- "text": "vous",
273
- "start": 3.84,
274
- "end": 4.0,
275
- "confidence": 0.935
276
- },
277
- {
278
- "text": "allez",
279
- "start": 4.0,
280
- "end": 4.14,
281
- "confidence": 0.347
282
- },
283
- {
284
- "text": "bien?",
285
- "start": 4.14,
286
- "end": 4.48,
287
- "confidence": 0.998
288
- }
289
- ]
290
- }
291
- ],
292
- "language": "fr"
293
- }
294
- ```
295
- If the language is not specified (e.g. without the option `--language fr` in the CLI), you will find an additional key with the language probabilities:
296
- ```json
297
- {
298
- ...
299
- "language": "fr",
300
- "language_probs": {
301
- "en": 0.027954353019595146,
302
- "zh": 0.02743500843644142,
303
- ...
304
- "fr": 0.9196318984031677,
305
- ...
306
- "su": 3.0119704064190955e-08,
307
- "yue": 2.2565967810805887e-05
308
- }
309
- }
310
- ```
311
-
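As a small illustration (assuming a `result` dictionary shaped like the output above), the most probable language can be read from this key:

```python
# A minimal sketch: read the detected language and its probability from `result`
# (the "language_probs" key is only present when no language was forced).
probs = result.get("language_probs", {})
if probs:
    best = max(probs, key=probs.get)
    print(f'Detected language: {best} (probability {probs[best]:.2%})')
else:
    print(f'Language was set to: {result["language"]}')
```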
312
- ### Options that may improve results
313
-
314
- Here are some options that are not enabled by default but might improve results.
315
-
316
- #### Accurate Whisper transcription
317
-
318
- As mentioned earlier, some decoding options are disabled by default to offer better efficiency. However, this can impact the quality of the transcription. To run with the settings that have the best chance of producing a good transcription, use the following options.
319
- * In Python:
320
- ```python
321
- results = whisper_timestamped.transcribe(model, audio, beam_size=5, best_of=5, temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0), ...)
322
- ```
323
- * On the command line:
324
- ```bash
325
- whisper_timestamped --accurate ...
326
- ```
327
-
328
- #### Running Voice Activity Detection (VAD) before sending to Whisper
329
-
330
- Whisper models can "hallucinate" text when given a segment without speech. This can be avoided by running VAD and gluing speech segments together before transcribing with the Whisper model. This is possible with `whisper-timestamped`.
331
- * In Python:
332
- ```python
333
- results = whisper_timestamped.transcribe(model, audio, vad=True, ...)
334
- ```
335
- * On the command line:
336
- ```bash
337
- whisper_timestamped --vad True ...
338
- ```
339
-
340
- By default, the VAD method used is [silero](https://github.com/snakers4/silero-vad).
341
- But other methods are available, such as earlier versions of silero, or [auditok](https://github.com/amsehili/auditok).
342
- Those methods were introduced because the latest versions of silero VAD can produce many false alarms on some audio files (speech detected during silence).
343
- * In Python:
344
- ```python
345
- results = whisper_timestamped.transcribe(model, audio, vad="silero:v3.1", ...)
346
- results = whisper_timestamped.transcribe(model, audio, vad="auditok", ...)
347
- ```
348
- * On the command line:
349
- ```bash
350
- whisper_timestamped --vad silero:v3.1 ...
351
- whisper_timestamped --vad auditok ...
352
- ```
353
-
354
- To inspect the VAD results, you can use the `--plot` option of the `whisper_timestamped` CLI,
355
- or the `plot_word_alignment` option of the `whisper_timestamped.transcribe()` Python function.
356
- It will show the VAD results on the input audio signal as follows (the x-axis is time in seconds):
357
- | **vad="silero:v4.0"** | **vad="silero:v3.1"** | **vad="auditok"** |
358
- | :---: | :---: | :---: |
359
- | ![Example VAD](figs/VAD_silero_v4.0.png) | ![Example VAD](figs/VAD_silero_v3.1.png) | ![Example VAD](figs/VAD_auditok.png) |
360
-
361
- #### Detecting disfluencies
362
-
363
- Whisper models tend to remove speech disfluencies (filler words, hesitations, repetitions, etc.). Without precautions, disfluencies that are not transcribed affect the timestamp of the following word: the timestamp of the beginning of that word is actually the timestamp of the beginning of the disfluency. `whisper-timestamped` applies some heuristics to avoid this.
364
- * In Python:
365
- ```python
366
- results = whisper_timestamped.transcribe(model, audio, detect_disfluencies=True, ...)
367
- ```
368
- * On the command line:
369
- ```bash
370
- whisper_timestamped --detect_disfluencies True ...
371
- ```
372
- **Important:** Note that when using these options, possible disfluencies will appear in the transcription as a special "`[*]`" word.
373
-
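For instance, a small post-processing sketch (assuming `result` was obtained with `detect_disfluencies=True`) that removes these placeholder words:

```python
# A minimal post-processing sketch (assuming `result` was produced with
# detect_disfluencies=True): drop the special "[*]" placeholder words,
# keeping track of where the disfluencies occurred.
for segment in result["segments"]:
    kept, disfluencies = [], []
    for word in segment["words"]:
        if word["text"] == "[*]":
            disfluencies.append((word["start"], word["end"]))
        else:
            kept.append(word)
    segment["words"] = kept
    if disfluencies:
        print(f"Disfluencies detected at: {disfluencies}")
```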
374
-
375
- ## Acknowledgment
376
- * [whisper](https://github.com/openai/whisper): Whisper speech recognition (License MIT).
377
- * [dtw-python](https://pypi.org/project/dtw-python): Dynamic Time Warping (License GPL v3).
378
-
379
- ## Citations
380
- If you use this in your research, please cite the repo:
381
-
382
- ```bibtex
383
- @misc{lintoai2023whispertimestamped,
384
- title={whisper-timestamped},
385
- author={Louradour, J{\'e}r{\^o}me},
386
- journal={GitHub repository},
387
- year={2023},
388
- publisher={GitHub},
389
- howpublished = {\url{https://github.com/linto-ai/whisper-timestamped}}
390
- }
391
- ```
392
-
393
- as well as the OpenAI Whisper paper:
394
-
395
- ```bibtex
396
- @article{radford2022robust,
397
- title={Robust speech recognition via large-scale weak supervision},
398
- author={Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
399
- journal={arXiv preprint arXiv:2212.04356},
400
- year={2022}
401
- }
402
- ```
403
-
404
- and this paper for Dynamic Time Warping:
405
-
406
- ```bibtex
407
- @article{JSSv031i07,
408
- title={Computing and Visualizing Dynamic Time Warping Alignments in R: The dtw Package},
409
- author={Giorgino, Toni},
410
- journal={Journal of Statistical Software},
411
- year={2009},
412
- volume={31},
413
- number={7},
414
- doi={10.18637/jss.v031.i07}
415
- }
416
- ```
 
 
1
 
2
+ ---
3
+ title: "My Fine-tuned Whisper Model"
4
+ tags:
5
+ - whisper
6
+ - fine-tuning
7
+ - ASR
8
+ ---
9
 
10
+ # My Fine-tuned Whisper Model

11
 
12
+ This model is fine-tuned for [specific use cases].