Automatic Speech Recognition · ESPnet · multilingual · audio · speech-translation · language-identification

pyf98 committed · Commit e1f251b (verified) · Parent: b794c0c

Update README.md (README.md: +127 -2)

language: multilingual
datasets:
- owsm_v3.2_ctc
base_model:
- espnet/owsm_ctc_v3.2_ft_1B
license: cc-by-4.0
---

espnet_model_zoo
```

**The recipe can be found in ESPnet:** https://github.com/espnet/espnet/tree/master/egs2/owsm_ctc_v3.1/s2t1

### Example script for batched inference

`Speech2TextGreedySearch` now provides a unified batched inference method, `batch_decode`, which performs CTC greedy decoding on a batch of short-form or long-form audios. An audio shorter than 30s is padded to 30s; a longer one is split into overlapping segments (the same approach as the "long-form ASR/ST" method below).

```python
from espnet2.bin.s2t_inference_ctc import Speech2TextGreedySearch

s2t = Speech2TextGreedySearch.from_pretrained(
    "espnet/owsm_ctc_v3.2_ft_1B",
    device="cuda",
    use_flash_attn=False,  # set to True for better efficiency if flash attn is installed and dtype is float16 or bfloat16
    lang_sym='<eng>',
    task_sym='<asr>',
)

res = s2t.batch_decode(
    "audio.wav",  # a single audio (path or 1-D array/tensor) as input
    batch_size=16,
    context_len_in_secs=4,
)  # res is a single str, i.e., the predicted text without special tokens

res = s2t.batch_decode(
    ["audio1.wav", "audio2.wav", "audio3.wav"],  # a list of audios as input
    batch_size=16,
    context_len_in_secs=4,
)  # res is a list of str

# Please check the code of `batch_decode` for all supported inputs
```
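
The comment above notes that `batch_decode` also accepts in-memory waveforms ("path or 1-D array/tensor"). A minimal sketch under that assumption, reusing the `s2t` object from the previous block and loading the audio at the model's 16kHz rate:

```python
import librosa

# Load two recordings as 1-D float arrays at 16kHz (the model's training rate).
wav1, _ = librosa.load("audio1.wav", sr=16000)
wav2, _ = librosa.load("audio2.wav", sr=16000)

res = s2t.batch_decode(
    [wav1, wav2],  # 1-D arrays instead of file paths
    batch_size=16,
    context_len_in_secs=4,
)  # res is a list of str, one hypothesis per input
```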

### Example script for short-form ASR/ST/LID

Our models are trained on 16kHz audio with a fixed duration of 30s. When using the pre-trained model, please ensure the input speech is 16kHz and pad or truncate it to 30s.

```python
import librosa
from espnet2.bin.s2t_inference_ctc import Speech2TextGreedySearch

s2t = Speech2TextGreedySearch.from_pretrained(
    "espnet/owsm_ctc_v3.2_ft_1B",
    device="cuda",
    generate_interctc_outputs=False,
    lang_sym='<eng>',
    task_sym='<asr>',
)

# NOTE: OWSM-CTC is trained on 16kHz audio with a fixed 30s duration. Please ensure your input
# has the correct sample rate; otherwise resample it to 16k before feeding it to the model
speech, rate = librosa.load("xxx.wav", sr=16000)
speech = librosa.util.fix_length(speech, size=(16000 * 30))

res = s2t(speech)[0]
print(res)
```
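
The heading also covers ST and LID. A minimal sketch of the ST case, assuming the OWSM-style task token `<st_eng>` (translate into English) used in other OWSM model cards; `german.wav` is a hypothetical input file, and only `lang_sym` and `task_sym` change relative to the ASR example:

```python
import librosa
from espnet2.bin.s2t_inference_ctc import Speech2TextGreedySearch

# Assumption: `<st_eng>` follows the OWSM task-token convention for
# X-to-English speech translation; `<deu>` declares the source language.
s2t_st = Speech2TextGreedySearch.from_pretrained(
    "espnet/owsm_ctc_v3.2_ft_1B",
    device="cuda",
    generate_interctc_outputs=False,
    lang_sym='<deu>',
    task_sym='<st_eng>',
)

speech, rate = librosa.load("german.wav", sr=16000)  # hypothetical input file
speech = librosa.util.fix_length(speech, size=(16000 * 30))
print(s2t_st(speech)[0])
```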

### Example script for long-form ASR/ST

```python
import soundfile as sf
import torch
from espnet2.bin.s2t_inference_ctc import Speech2TextGreedySearch

context_len_in_secs = 4  # left and right context when doing buffered inference
batch_size = 32  # depends on the GPU memory
s2t = Speech2TextGreedySearch.from_pretrained(
    "espnet/owsm_ctc_v3.2_ft_1B",
    device='cuda' if torch.cuda.is_available() else 'cpu',
    generate_interctc_outputs=False,
    lang_sym='<eng>',
    task_sym='<asr>',
)

speech, rate = sf.read("xxx.wav")

text = s2t.decode_long_batched_buffered(
    speech,
    batch_size=batch_size,
    context_len_in_secs=context_len_in_secs,
)
print(text)
```
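
`soundfile` reads at the file's native rate and does not resample, and it returns 2-D arrays for multi-channel files. A minimal sketch of normalizing such input to the mono 16kHz waveform the model expects, reusing `s2t` and the parameters above; the resampling step via `librosa` is an assumption beyond the original imports:

```python
import librosa
import soundfile as sf

speech, rate = sf.read("xxx.wav")
if speech.ndim > 1:
    speech = speech.mean(axis=1)  # downmix multi-channel audio to mono
if rate != 16000:
    speech = librosa.resample(speech, orig_sr=rate, target_sr=16000)  # model expects 16kHz
    rate = 16000

text = s2t.decode_long_batched_buffered(
    speech,
    batch_size=batch_size,
    context_len_in_secs=context_len_in_secs,
)
```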

### Example of CTC forced alignment using `ctc-segmentation`

CTC segmentation can be applied efficiently to audio of arbitrary length.

```python
import soundfile as sf
from espnet2.bin.s2t_ctc_align import CTCSegmentation
from espnet_model_zoo.downloader import ModelDownloader

# Download the model first
d = ModelDownloader()
downloaded = d.download_and_unpack("espnet/owsm_ctc_v3.2_ft_1B")

aligner = CTCSegmentation(
    **downloaded,
    fs=16000,
    ngpu=1,
    batch_size=32,  # batched parallel decoding; reduce it if your GPU memory is smaller
    kaldi_style_text=True,
    time_stamps="auto",  # "auto" can be more accurate than "fixed" when converting token indices to timestamps
    lang_sym="<eng>",
    task_sym="<asr>",
    context_len_in_secs=2,  # left and right context in buffered decoding
)

speech, rate = sf.read("./test_utils/ctc_align_test.wav")
print(f"speech duration: {len(speech) / rate:.2f} seconds")

text = """
utt1 THE SALE OF THE HOTELS
utt2 IS PART OF HOLIDAY'S STRATEGY
utt3 TO SELL OFF ASSETS
utt4 AND CONCENTRATE ON PROPERTY MANAGEMENT
"""

segments = aligner(speech, text)
print(segments)
```
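
`print(segments)` emits kaldi-style segment lines. A minimal sketch of parsing that printed form, assuming each line has the shape `<utt-id> <name> <start> <end> <score> <text...>` as in typical ESPnet CTC segmentation output:

```python
# Assumption: each printed line looks like
#   utt1 utt 0.26 1.73 -0.02 THE SALE OF THE HOTELS
# i.e. "<utt-id> <name> <start> <end> <score> <text...>".
for line in str(segments).strip().splitlines():
    utt_id, name, start, end, score, *words = line.split()
    print(f"{utt_id}: {float(start):6.2f}s - {float(end):6.2f}s  {' '.join(words)}")
```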