HuBERT / examples /speech_to_text /docs /librispeech_example.md

aliabd

full working demo

d5175d3 about 4 years ago

	[[Back]](..)

	# S2T Example: Speech Recognition (ASR) on LibriSpeech
	[LibriSpeech](https://www.danielpovey.com/files/2015_icassp_librispeech.pdf) is a de facto standard English ASR
	benchmark. We provide competitive vanilla
	[Transformer](https://papers.nips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf) baselines.

	## Data preparation
	Download and preprocess LibriSpeech data with
	```bash
	# additional Python packages for S2T data processing/model training
	pip install pandas torchaudio sentencepiece

	python examples/speech_to_text/prep_librispeech_data.py \
	  --output-root "${LS_ROOT}" --vocab-type unigram --vocab-size 10000
	```
	where `LS_ROOT` is the root path for downloaded data as well as generated files (manifest, features, vocabulary and
	data configuration).
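
As a quick sanity check on the generated manifests, the sketch below counts utterances and sums frame counts from a TSV manifest. The column names (`id`, `audio`, `n_frames`, `tgt_text`, `speaker`) and the example rows are assumptions made for illustration; check the header of your own generated `.tsv` files before relying on them. `summarize_manifest` is a hypothetical helper, not part of fairseq.

```python
import csv
import io


def summarize_manifest(tsv_text):
    """Return (number of utterances, total n_frames) for a TSV manifest.

    Assumes a header row containing an `n_frames` column (an assumption,
    not something this document specifies).
    """
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    n_utts, total_frames = 0, 0
    for row in reader:
        n_utts += 1
        total_frames += int(row["n_frames"])
    return n_utts, total_frames


# Hypothetical two-row manifest; real rows come from the generated files.
example = (
    "id\taudio\tn_frames\ttgt_text\tspeaker\n"
    "utt-0001\tfeatures.zip:100:2000\t520\thello world\t103\n"
    "utt-0002\tfeatures.zip:2100:3000\t780\tgood morning\t103\n"
)

print(summarize_manifest(example))
```

To run this on a real file, read it with `open(path, encoding="utf-8").read()` and pass the text to `summarize_manifest`.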

	[Download](https://dl.fbaipublicfiles.com/fairseq/s2t/librispeech_vocab_unigram10000.zip) our vocabulary files
	if you want to use our pre-trained models.

	## Training
	```bash
	fairseq-train "${LS_ROOT}" --save-dir "${SAVE_DIR}" \
	  --config-yaml config.yaml \
	  --train-subset train-clean-100,train-clean-360,train-other-500 \
	  --valid-subset dev-clean,dev-other \
	  --num-workers 4 --max-tokens 40000 --max-update 300000 \
	  --task speech_to_text --criterion label_smoothed_cross_entropy --label-smoothing 0.1 --report-accuracy \
	  --arch s2t_transformer_s --share-decoder-input-output-embed \
	  --optimizer adam --lr 2e-3 --lr-scheduler inverse_sqrt --warmup-updates 10000 \
	  --clip-norm 10.0 --seed 1 --update-freq 8
	```
	where `SAVE_DIR` is the checkpoint root path. Here we use `--arch s2t_transformer_s` (31M parameters) as an example.
	For better performance, you may switch to `s2t_transformer_m` (71M, with `--lr 1e-3`) or `s2t_transformer_l`
	(268M, with `--lr 5e-4`). We set `--update-freq 8` to accumulate gradients over 8 batches, which simulates 8 GPUs with 1 GPU.
	When using more than 1 GPU, you may want to lower it accordingly.
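
The arithmetic behind adjusting `--update-freq` can be sketched as follows. This is a minimal illustration, assuming the goal is to keep the number of tokens per optimizer update roughly constant and that `--max-tokens` is a per-GPU upper bound on batch size. The helper names `update_freq_for` and `max_tokens_per_update` are hypothetical and not part of fairseq.

```python
def update_freq_for(num_gpus, reference_gpus=8):
    """Gradient-accumulation steps needed so that `num_gpus` GPUs
    match the 8-GPU setup simulated by `--update-freq 8` on 1 GPU."""
    if num_gpus < 1 or reference_gpus % num_gpus != 0:
        raise ValueError("num_gpus must be a positive divisor of reference_gpus")
    return reference_gpus // num_gpus


def max_tokens_per_update(max_tokens, num_gpus, update_freq):
    """Upper bound on tokens contributing to one optimizer update."""
    return max_tokens * num_gpus * update_freq


# With --max-tokens 40000, every configuration below has the same bound.
for n in (1, 2, 4, 8):
    uf = update_freq_for(n)
    print(f"{n} GPU(s): --update-freq {uf}, "
          f"up to {max_tokens_per_update(40000, n, uf)} tokens per update")
```

Under these assumptions, a 4-GPU run would use `--update-freq 2`, leaving the other flags unchanged.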

	## Inference & Evaluation
	Average the last 10 checkpoints and evaluate on the 4 splits
	(`dev-clean`, `dev-other`, `test-clean` and `test-other`):
	```bash
	CHECKPOINT_FILENAME=avg_last_10_checkpoint.pt
	python scripts/average_checkpoints.py --inputs "${SAVE_DIR}" \
	  --num-epoch-checkpoints 10 \
	  --output "${SAVE_DIR}/${CHECKPOINT_FILENAME}"
	for SUBSET in dev-clean dev-other test-clean test-other; do
	  fairseq-generate "${LS_ROOT}" --config-yaml config.yaml --gen-subset "${SUBSET}" \
	    --task speech_to_text --path "${SAVE_DIR}/${CHECKPOINT_FILENAME}" \
	    --max-tokens 50000 --beam 5 --scoring wer
	done
	```
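
To make the idea of checkpoint averaging concrete, here is a minimal sketch of element-wise parameter averaging. It is a simplification for illustration, not the code in `scripts/average_checkpoints.py`: parameters are plain lists of floats rather than tensors, and `average_checkpoints` and the toy checkpoints are hypothetical.

```python
def average_checkpoints(state_dicts):
    """Average parameters element-wise across checkpoints.

    Each checkpoint is a dict mapping a parameter name to a flat list
    of floats (a stand-in for real tensors, assumed for this sketch).
    """
    if not state_dicts:
        raise ValueError("need at least one checkpoint")
    keys = state_dicts[0].keys()
    for sd in state_dicts[1:]:
        if sd.keys() != keys:
            raise KeyError("checkpoints have different parameter names")
    n = len(state_dicts)
    averaged = {}
    for key in keys:
        # zip(*) groups the i-th value of this parameter from every checkpoint.
        columns = zip(*(sd[key] for sd in state_dicts))
        averaged[key] = [sum(col) / n for col in columns]
    return averaged


# Two toy checkpoints with hypothetical parameter names.
ckpts = [
    {"w": [1.0, 2.0], "b": [0.0]},
    {"w": [3.0, 4.0], "b": [1.0]},
]
print(average_checkpoints(ckpts))
```

Averaging the last several checkpoints this way tends to smooth out noise from individual updates, which is the usual motivation for evaluating the averaged model rather than the final checkpoint alone.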

	## Interactive Decoding
	Launch the interactive console via
	```bash
	fairseq-interactive "${LS_ROOT}" --config-yaml config.yaml --task speech_to_text \
	  --path "${SAVE_DIR}/${CHECKPOINT_FILENAME}" --max-tokens 50000 --beam 5
	```
	Type in WAV/FLAC/OGG audio paths (one per line) after the prompt.

	## Results

	| --arch | Params | dev-clean | dev-other | test-clean | test-other | Model |
	|---|---|---|---|---|---|---|
	| s2t_transformer_s | 30M | 3.8 | 8.9 | 4.4 | 9.0 | [Download](https://dl.fbaipublicfiles.com/fairseq/s2t/librispeech_transformer_s.pt) |
	| s2t_transformer_m | 71M | 3.2 | 8.0 | 3.4 | 7.9 | [Download](https://dl.fbaipublicfiles.com/fairseq/s2t/librispeech_transformer_m.pt) |
	| s2t_transformer_l | 268M | 3.0 | 7.5 | 3.2 | 7.5 | [Download](https://dl.fbaipublicfiles.com/fairseq/s2t/librispeech_transformer_l.pt) |
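
The per-split numbers above are presumably word error rates in percent, matching the `--scoring wer` flag used during evaluation. As a reminder of how that metric is defined, here is a minimal sketch that computes corpus-level WER as word-level edit distance divided by the number of reference words. It is an illustration only, not fairseq's scorer, and it applies no text normalization; `word_edit_distance`, `wer`, and the example sentences are hypothetical.

```python
def word_edit_distance(ref, hyp):
    """Levenshtein distance between two word lists
    (substitutions, insertions and deletions all cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(pairs):
    """Corpus-level WER in percent over (reference, hypothesis) pairs."""
    errors = sum(word_edit_distance(r.split(), h.split()) for r, h in pairs)
    ref_words = sum(len(r.split()) for r, _ in pairs)
    return 100.0 * errors / ref_words


pairs = [
    ("the cat sat on the mat", "the cat sat on mat"),  # one deletion
    ("hello world", "hello word"),                     # one substitution
]
print(wer(pairs))  # 2 errors over 8 reference words -> 25.0
```

Note that errors are pooled over the whole set before dividing, so long utterances weigh more than short ones.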

	[[Back]](..)