Automatic Speech Recognition
ESPnet
multilingual
audio
speech-translation
language-identification
pyf98 committed (verified)
Commit 9318a3b · 1 Parent(s): 90c1868

Update README.md

Files changed (1)
  1. README.md +3 -102
README.md CHANGED
@@ -16,16 +16,11 @@ It is trained on 180k hours of public audio data for multilingual speech recogni
 
  Due to time constraint, the model used in the paper was trained for 40 "epochs". The new model trained for 45 "epochs" (approximately three entire passes on the full data) is also added in this repo in order to match the setup of encoder-decoder OWSM. It can have better performance than the old one in many test sets.
 
- Currently, the code for OWSM-CTC has not been merged into ESPnet main branch. Instead, it is available as follows:
- - PR in ESPnet: https://github.com/espnet/espnet/pull/5933
- - Code in my repo: https://github.com/pyf98/espnet/tree/owsm-ctc
- - Current model on HF: https://huggingface.co/pyf98/owsm_ctc_v3.1_1B
-
- To use the pre-trained model, you need to install `espnet` and `espnet_model_zoo`. The requirements are:
+ To use the pre-trained model, please install `espnet` and `espnet_model_zoo`. The requirements are:
  ```
  librosa
  torch
- espnet @ git+https://github.com/pyf98/espnet@owsm-ctc
+ espnet
  espnet_model_zoo
  ```
 
@@ -34,98 +29,4 @@ We use FlashAttention during training, but we do not need it during inference. P
  pip install flash-attn --no-build-isolation
  ```
 
- ### Example script for short-form ASR/ST
-
- ```python
- import soundfile as sf
- import numpy as np
- import librosa
- import kaldiio
- from espnet2.bin.s2t_inference_ctc import Speech2TextGreedySearch
-
-
- s2t = Speech2TextGreedySearch.from_pretrained(
-     "pyf98/owsm_ctc_v3.1_1B",
-     device="cuda",
-     generate_interctc_outputs=False,
-     lang_sym='<eng>',
-     task_sym='<asr>',
- )
-
- speech, rate = sf.read(
-     "xxx.wav"
- )
- speech = librosa.util.fix_length(speech, size=(16000 * 30))
-
- res = s2t(speech)[0]
- print(res)
- ```
-
- ### Example script for long-form ASR/ST
-
- ```python
- import soundfile as sf
- import torch
- from espnet2.bin.s2t_inference_ctc import Speech2TextGreedySearch
-
-
- context_len_in_secs = 4   # left and right context when doing buffered inference
- batch_size = 32   # depends on the GPU memory
- s2t = Speech2TextGreedySearch.from_pretrained(
-     "pyf98/owsm_ctc_v3.1_1B",
-     device='cuda' if torch.cuda.is_available() else 'cpu',
-     generate_interctc_outputs=False,
-     lang_sym='<eng>',
-     task_sym='<asr>',
- )
-
- speech, rate = sf.read(
-     "xxx.wav"
- )
-
- text = s2t.decode_long_batched_buffered(
-     speech,
-     batch_size=batch_size,
-     context_len_in_secs=context_len_in_secs,
-     frames_per_sec=12.5,   # 80ms shift, model-dependent, don't change
- )
- print(text)
- ```
-
- ### Example for CTC forced alignment using `ctc-segmentation`
-
- It can be efficiently applied to audio of an arbitrary length.
- For model downloading, please refer to https://github.com/espnet/espnet?tab=readme-ov-file#ctc-segmentation-demo
-
- ```python
- import soundfile as sf
- from espnet2.bin.s2t_ctc_align import CTCSegmentation
-
-
- ## Please download model first
- aligner = CTCSegmentation(
-     s2t_model_file="exp/s2t_train_s2t_multitask-ctc_ebf27_conv2d8_size1024_raw_bpe50000/valid.total_count.ave_5best.till45epoch.pth",
-     fs=16000,
-     ngpu=1,
-     batch_size=16,   # batched parallel decoding; reduce it if your GPU memory is smaller
-     kaldi_style_text=True,
-     time_stamps="fixed",
-     samples_to_frames_ratio=1280,   # 80ms time shift; don't change as it depends on the pre-trained model
-     lang_sym="<eng>",
-     task_sym="<asr>",
-     context_len_in_secs=2,   # left and right context in buffered decoding
-     frames_per_sec=12.5,   # 80ms time shift; don't change as it depends on the pre-trained model
- )
-
- speech, rate = sf.read(
-     "example.wav"
- )
- print(f"speech duration: {len(speech) / rate : .2f} seconds")
- text = '''
- utt1 hello there
- utt2 welcome to this repo
- '''
-
- segments = aligner(speech, text)
- print(segments)
- ```
+ Example usage can be found in ESPnet: https://github.com/espnet/espnet/tree/master/egs2/owsm_ctc_v3.1/s2t1
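For quick reference, below is a minimal short-form ASR sketch adapted from the example removed in this commit. It assumes the `Speech2TextGreedySearch` interface in the merged ESPnet code matches the old README shown above, and `xxx.wav` is a placeholder for your own audio file.

```python
# Minimal short-form ASR sketch, adapted from the example removed above.
# Assumptions: espnet, espnet_model_zoo, librosa, and soundfile are installed,
# and the merged ESPnet code keeps the interface shown in the old README.
import librosa
import soundfile as sf
from espnet2.bin.s2t_inference_ctc import Speech2TextGreedySearch

s2t = Speech2TextGreedySearch.from_pretrained(
    "pyf98/owsm_ctc_v3.1_1B",
    device="cuda",                    # use "cpu" if no GPU is available
    generate_interctc_outputs=False,
    lang_sym="<eng>",                 # language token of the input speech
    task_sym="<asr>",                 # <asr> selects speech recognition
)

# The short-form interface expects a fixed 30-second window at 16 kHz,
# so the waveform is padded (or trimmed) to 30 s before decoding.
speech, rate = sf.read("xxx.wav")     # "xxx.wav" is a placeholder file name
speech = librosa.util.fix_length(speech, size=16000 * 30)

res = s2t(speech)[0]
print(res)
```

For long-form transcription and CTC forced alignment, see the linked `egs2/owsm_ctc_v3.1/s2t1` recipe.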