MahmoudAshraf committed on
Commit
3bd2831
·
verified ·
1 Parent(s): e88adb8

Update README.md

  1. README.md +73 -1
README.md CHANGED
@@ -162,4 +162,76 @@ license: cc-by-nc-4.0
  tags:
  - mms
  - wav2vec2
- ---
+ ---
+
+ # Forced Alignment with Hugging Face CTC Models
+ This Python package provides an efficient way to perform forced alignment between text and audio using Hugging Face's pretrained models. It also features an improved implementation that uses much less memory than the TorchAudio forced alignment API.
+
+ The model checkpoint uploaded here is a conversion of the MMS-300M checkpoint, trained on a forced alignment dataset, from torchaudio to HF Transformers.
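To make "forced alignment" concrete, the following is a minimal pure-Python sketch of CTC Viterbi alignment: given per-frame log-probabilities and a known label sequence, it finds the most likely frame-by-frame labelling. It is an illustration only and not this package's implementation; the function name `ctc_forced_align`, its arguments, and the toy vocabulary are assumptions made for the example.

```python
import math


def ctc_forced_align(log_probs, tokens, blank=0):
    """Return the most likely CTC path as one label id per frame.

    log_probs: list of T frames, each a list of V log-probabilities.
    tokens:    target label ids without blanks (at least one label).
    """
    # Interleave blanks around the labels: [blank, t1, blank, t2, ..., blank]
    states = [blank]
    for tok in tokens:
        states += [tok, blank]
    num_frames, num_states = len(log_probs), len(states)

    score = [[-math.inf] * num_states for _ in range(num_frames)]
    back = [[0] * num_states for _ in range(num_frames)]
    # A path may start on the leading blank or on the first label.
    score[0][0] = log_probs[0][states[0]]
    score[0][1] = log_probs[0][states[1]]

    for t in range(1, num_frames):
        for s in range(num_states):
            # Predecessors: stay, advance by one state, or skip the blank
            # between two different labels.
            candidates = [s]
            if s >= 1:
                candidates.append(s - 1)
            if s >= 2 and states[s] != blank and states[s] != states[s - 2]:
                candidates.append(s - 2)
            best = max(candidates, key=lambda p: score[t - 1][p])
            score[t][s] = score[t - 1][best] + log_probs[t][states[s]]
            back[t][s] = best

    # A valid path ends on the last label or on the trailing blank.
    last = num_frames - 1
    s = num_states - 1
    if score[last][num_states - 2] > score[last][s]:
        s = num_states - 2
    if score[last][s] == -math.inf:
        raise ValueError("too few frames to align the given tokens")

    # Walk the back-pointers from the last frame to the first.
    path = [blank] * num_frames
    for t in range(last, -1, -1):
        path[t] = states[s]
        s = back[t][s]
    return path


# Toy example (made-up probabilities): vocabulary 0 = blank, 1 = "a", 2 = "b".
frames = [
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
    [0.8, 0.1, 0.1],
]
toy_log_probs = [[math.log(p) for p in frame] for frame in frames]
print(ctc_forced_align(toy_log_probs, [1, 2]))
```

Runs of the same label in the returned path mark the frames assigned to that label; multiplying a frame index by the frame duration yields a timestamp.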
+
+ ## Installation
+
+ ```bash
+ pip install git+https://github.com/MahmoudAshraf97/ctc-forced-aligner.git
+ ```
+ ## Usage
+
+ ```python
+ import torch
+ from ctc_forced_aligner import (
+     load_audio,
+     load_alignment_model,
+     generate_emissions,
+     preprocess_text,
+     get_alignments,
+     get_spans,
+     postprocess_results,
+ )
+
+ audio_path = "your/audio/path"
+ text_path = "your/text/path"
+ language = "eng"  # placeholder: language code of the text (assumed ISO 639-3)
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+ batch_size = 16  # placeholder: adjust to the available memory
+
+ # load the model first: the audio is loaded with the model's dtype and device
+ alignment_model, alignment_tokenizer, alignment_dictionary = load_alignment_model(
+     device,
+     dtype=torch.float16 if device == "cuda" else torch.float32,
+     model_path="MahmoudAshraf/mms-300m-1130-forced-aligner",
+ )
+ # also compatible with other Wav2Vec2 checkpoints such as
+ # "jonatasgrosman/wav2vec2-large-xlsr-53-arabic"
+
+ audio_waveform = load_audio(audio_path, alignment_model.dtype, alignment_model.device)
+
+ with open(text_path, "r") as f:
+     lines = f.readlines()
+ text = "".join(line for line in lines).replace("\n", " ").strip()
+
+ emissions, stride = generate_emissions(
+     alignment_model, audio_waveform, batch_size=batch_size
+ )
+
+ # romanization should be enabled when using multilingual models
+ # it should be changed to `False` when using models that support the
+ # native vocabulary of the text
+ tokens_starred, text_starred = preprocess_text(
+     text,
+     romanize=True,
+     language=language,
+ )
+
+ segments, blank_id = get_alignments(
+     emissions,
+     tokens_starred,
+     alignment_dictionary,
+ )
+
+ spans = get_spans(tokens_starred, segments, alignment_tokenizer.decode(blank_id))
+
+ word_timestamps = postprocess_results(text_starred, spans, stride)
+ ```
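As a possible next step, the word timestamps can be written out as subtitles. The sketch below assumes that `word_timestamps` is a list of dictionaries with `start` and `end` times in seconds and a `text` key; this commit's README does not state the structure, so check the actual output before relying on it. The helpers `format_timestamp` and `to_srt` and the sample data are hypothetical and not part of the package.

```python
def format_timestamp(seconds):
    """Format a time in seconds as HH:MM:SS,mmm (SubRip timestamp layout)."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(word_timestamps):
    """Render one numbered SubRip cue per aligned word."""
    blocks = []
    for index, word in enumerate(word_timestamps, start=1):
        start = format_timestamp(word["start"])
        end = format_timestamp(word["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{word['text']}\n")
    return "\n".join(blocks)


# Hypothetical sample standing in for real aligner output (assumed structure).
sample = [
    {"start": 0.0, "end": 0.5, "text": "hello"},
    {"start": 0.5, "end": 1.25, "text": "world"},
]
print(to_srt(sample))
```

One cue per word keeps the example short; for readable subtitles, consecutive words would usually be grouped into phrases before rendering.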