Update README.md
README.md
CHANGED
@@ -16,16 +16,11 @@ It is trained on 180k hours of public audio data for multilingual speech recogni

Due to time constraints, the model used in the paper was trained for 40 "epochs". The new model trained for 45 "epochs" (approximately three entire passes over the full data) has also been added to this repo to match the setup of encoder-decoder OWSM. It can achieve better performance than the old one on many test sets.

-
-- PR in ESPnet: https://github.com/espnet/espnet/pull/5933
-- Code in my repo: https://github.com/pyf98/espnet/tree/owsm-ctc
-- Current model on HF: https://huggingface.co/pyf98/owsm_ctc_v3.1_1B
-
-To use the pre-trained model, you need to install `espnet` and `espnet_model_zoo`. The requirements are:
+To use the pre-trained model, please install `espnet` and `espnet_model_zoo`. The requirements are:
```
librosa
torch
-espnet
+espnet
espnet_model_zoo
```

@@ -34,98 +29,4 @@ We use FlashAttention during training, but we do not need it during inference. P
pip install flash-attn --no-build-isolation
```

-
-
-```python
-import soundfile as sf
-import numpy as np
-import librosa
-import kaldiio
-from espnet2.bin.s2t_inference_ctc import Speech2TextGreedySearch
-
-
-s2t = Speech2TextGreedySearch.from_pretrained(
-    "pyf98/owsm_ctc_v3.1_1B",
-    device="cuda",
-    generate_interctc_outputs=False,
-    lang_sym='<eng>',
-    task_sym='<asr>',
-)
-
-speech, rate = sf.read(
-    "xxx.wav"
-)
-speech = librosa.util.fix_length(speech, size=(16000 * 30))
-
-res = s2t(speech)[0]
-print(res)
-```
-
-### Example script for long-form ASR/ST
-
-```python
-import soundfile as sf
-import torch
-from espnet2.bin.s2t_inference_ctc import Speech2TextGreedySearch
-
-
-context_len_in_secs = 4  # left and right context when doing buffered inference
-batch_size = 32  # depends on the GPU memory
-s2t = Speech2TextGreedySearch.from_pretrained(
-    "pyf98/owsm_ctc_v3.1_1B",
-    device='cuda' if torch.cuda.is_available() else 'cpu',
-    generate_interctc_outputs=False,
-    lang_sym='<eng>',
-    task_sym='<asr>',
-)
-
-speech, rate = sf.read(
-    "xxx.wav"
-)
-
-text = s2t.decode_long_batched_buffered(
-    speech,
-    batch_size=batch_size,
-    context_len_in_secs=context_len_in_secs,
-    frames_per_sec=12.5,  # 80ms shift, model-dependent, don't change
-)
-print(text)
-```
-
-### Example for CTC forced alignment using `ctc-segmentation`
-
-It can be efficiently applied to audio of an arbitrary length.
-For model downloading, please refer to https://github.com/espnet/espnet?tab=readme-ov-file#ctc-segmentation-demo
-
-```python
-import soundfile as sf
-from espnet2.bin.s2t_ctc_align import CTCSegmentation
-
-
-## Please download model first
-aligner = CTCSegmentation(
-    s2t_model_file="exp/s2t_train_s2t_multitask-ctc_ebf27_conv2d8_size1024_raw_bpe50000/valid.total_count.ave_5best.till45epoch.pth",
-    fs=16000,
-    ngpu=1,
-    batch_size=16,  # batched parallel decoding; reduce it if your GPU memory is smaller
-    kaldi_style_text=True,
-    time_stamps="fixed",
-    samples_to_frames_ratio=1280,  # 80ms time shift; don't change as it depends on the pre-trained model
-    lang_sym="<eng>",
-    task_sym="<asr>",
-    context_len_in_secs=2,  # left and right context in buffered decoding
-    frames_per_sec=12.5,  # 80ms time shift; don't change as it depends on the pre-trained model
-)
-
-speech, rate = sf.read(
-    "example.wav"
-)
-print(f"speech duration: {len(speech) / rate : .2f} seconds")
-text = '''
-utt1 hello there
-utt2 welcome to this repo
-'''
-
-segments = aligner(speech, text)
-print(segments)
-```
+Example usage can be found in ESPnet: https://github.com/espnet/espnet/tree/master/egs2/owsm_ctc_v3.1/s2t1
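For a quick start without opening the recipe, the short-form example that this commit removes from the README can be condensed to the sketch below. The model name, special symbols, and the 30-second padding are taken from the removed snippet; `xxx.wav` is a placeholder for a 16 kHz audio file, and `device="cuda"` assumes a GPU is available.

```python
import librosa
import soundfile as sf
from espnet2.bin.s2t_inference_ctc import Speech2TextGreedySearch

# Load the pre-trained OWSM-CTC model for greedy CTC decoding.
s2t = Speech2TextGreedySearch.from_pretrained(
    "pyf98/owsm_ctc_v3.1_1B",
    device="cuda",
    generate_interctc_outputs=False,
    lang_sym="<eng>",
    task_sym="<asr>",
)

# Read the audio and pad/trim it to the fixed 30-second input window.
speech, rate = sf.read("xxx.wav")
speech = librosa.util.fix_length(speech, size=(16000 * 30))

# The first entry of the returned list is the decoded hypothesis.
res = s2t(speech)[0]
print(res)
```

For long-form ASR/ST and CTC forced alignment, see the linked ESPnet recipe.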