	---
	tags:
	- espnet
	- audio
	- automatic-speech-recognition
	- speech-translation
	- language-identification
	language: multilingual
	datasets:
	- owsm_v3.2_ctc
	base_model:
	- espnet/owsm_ctc_v3.2_ft_1B
	license: cc-by-4.0
	---

	[OWSM-CTC](https://aclanthology.org/2024.acl-long.549/) (Peng et al., ACL 2024) is an encoder-only speech foundation model based on hierarchical multi-task self-conditioned CTC.

	This model is trained on 180k hours of public audio data for multilingual speech recognition, any-to-any speech translation, and language identification, following the design of the [Open Whisper-style Speech Model (OWSM)](https://www.wavlab.org/activities/2024/owsm/) project.

	This model is initialized with [OWSM-CTC v3.1](https://huggingface.co/pyf98/owsm_ctc_v3.1_1B) and then fine-tuned on [v3.2 data](https://arxiv.org/abs/2406.09282) for 225k steps.

	To use the pre-trained model, please install `espnet` and `espnet_model_zoo`. The requirements are:
	```
	librosa
	torch
	espnet
	espnet_model_zoo
	```
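Before loading the model, it can help to confirm that these packages are importable. The following is a minimal sketch using only the standard library. The helper name `missing_packages` is hypothetical, and `espnet2` is assumed to be the import name provided by the `espnet` package (it is the module imported by the example scripts in this README).

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    # Import names assumed from the requirements list and the example scripts.
    required = ["librosa", "torch", "espnet2", "espnet_model_zoo"]
    missing = missing_packages(required)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```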


	The recipe can be found in ESPnet: https://github.com/espnet/espnet/tree/master/egs2/owsm_ctc_v3.1/s2t1



	### Example script for batched inference

	`Speech2TextGreedySearch` now provides a unified batched inference method, `batch_decode`, which performs CTC greedy decoding for a batch of short-form or long-form audio inputs. An input shorter than 30s is padded to 30s; a longer input is split into overlapping segments (the same method as in the long-form ASR/ST example below).

	```python
	from espnet2.bin.s2t_inference_ctc import Speech2TextGreedySearch

	s2t = Speech2TextGreedySearch.from_pretrained(
	    "espnet/owsm_ctc_v3.2_ft_1B",
	    device="cuda",
	    use_flash_attn=False,  # set to True for better efficiency if flash attn is installed and dtype is float16 or bfloat16
	    lang_sym="<eng>",
	    task_sym="<asr>",
	)

	res = s2t.batch_decode(
	    "audio.wav",  # a single audio (path or 1-D array/tensor) as input
	    batch_size=16,
	    context_len_in_secs=4,
	)  # res is a single str, i.e., the predicted text without special tokens

	res = s2t.batch_decode(
	    ["audio1.wav", "audio2.wav", "audio3.wav"],  # a list of audios as input
	    batch_size=16,
	    context_len_in_secs=4,
	)  # res is a list of str

	# Please check the code of `batch_decode` for all supported inputs
	```
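To make the padding and splitting behaviour concrete, here is an illustrative sketch of how segment boundaries could be planned. It is not ESPnet's implementation: the function name `plan_segments` and the choice of advancing each window by the window length minus the left and right context are assumptions made purely for illustration. Only the 16kHz rate and the 30s window come from this README.

```python
SAMPLE_RATE = 16000  # model input rate stated in this README
WINDOW_SECS = 30     # fixed input duration stated in this README


def plan_segments(num_samples, context_len_in_secs=4,
                  sample_rate=SAMPLE_RATE, window_secs=WINDOW_SECS):
    """Return (start, end, pad) tuples, in samples, covering the whole audio.

    Hypothetical helper: short audio yields one padded window; long audio
    yields overlapping windows whose overlap equals twice the context.
    """
    window = window_secs * sample_rate
    if num_samples <= window:
        # Short-form case: a single window, zero-padded up to 30s.
        return [(0, num_samples, window - num_samples)]

    context = context_len_in_secs * sample_rate
    hop = window - 2 * context  # assumed stride between consecutive windows
    if hop <= 0:
        raise ValueError("context_len_in_secs is too large for the window")

    segments = []
    start = 0
    while True:
        end = min(start + window, num_samples)
        # The last window may be shorter than 30s and needs padding.
        segments.append((start, end, window - (end - start)))
        if end == num_samples:
            break
        start += hop
    return segments


if __name__ == "__main__":
    for segment in plan_segments(60 * SAMPLE_RATE):
        print(segment)
```

Under these assumptions, a larger context gives each window more surrounding audio but increases the number of windows to decode.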

	### Example script for short-form ASR/ST/LID

	Our models are trained on 16kHz audio with a fixed duration of 30s. When using the pre-trained model, please ensure the input speech is 16kHz and pad or truncate it to 30s.

	```python
	import librosa
	from espnet2.bin.s2t_inference_ctc import Speech2TextGreedySearch

	s2t = Speech2TextGreedySearch.from_pretrained(
	    "espnet/owsm_ctc_v3.2_ft_1B",
	    device="cuda",
	    generate_interctc_outputs=False,
	    lang_sym="<eng>",
	    task_sym="<asr>",
	)

	# NOTE: OWSM-CTC is trained on 16kHz audio with a fixed 30s duration.
	# librosa.load resamples the file to 16kHz when sr=16000 is given.
	speech, rate = librosa.load("xxx.wav", sr=16000)
	# Zero-pad or truncate the waveform to exactly 30s.
	speech = librosa.util.fix_length(speech, size=(16000 * 30))

	res = s2t(speech)[0]
	print(res)
	```

	### Example script for long-form ASR/ST

	```python
	import soundfile as sf
	import torch
	from espnet2.bin.s2t_inference_ctc import Speech2TextGreedySearch

	context_len_in_secs = 4  # left and right context when doing buffered inference
	batch_size = 32  # depends on the GPU memory
	s2t = Speech2TextGreedySearch.from_pretrained(
	    "espnet/owsm_ctc_v3.2_ft_1B",
	    device="cuda" if torch.cuda.is_available() else "cpu",
	    generate_interctc_outputs=False,
	    lang_sym="<eng>",
	    task_sym="<asr>",
	)

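	# sf.read returns the file's native sample rate and does not resample;
	# the model expects 16kHz audio, so resample beforehand if `rate` differs.
	speech, rate = sf.read("xxx.wav")

	text = s2t.decode_long_batched_buffered(
	    speech,
	    batch_size=batch_size,
	    context_len_in_secs=context_len_in_secs,
	)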
	print(text)
	```
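`soundfile.read` returns the audio at the file's native sample rate, so it is worth confirming the rate before decoding. A minimal sketch using the standard-library `wave` module follows. It assumes an uncompressed PCM WAV file, and the helper names `wav_info` and `needs_resampling` are hypothetical.

```python
import wave


def wav_info(path):
    """Return (sample_rate, duration_in_seconds) read from a PCM WAV header."""
    with wave.open(path, "rb") as wav_file:
        rate = wav_file.getframerate()
        frames = wav_file.getnframes()
    return rate, frames / rate


def needs_resampling(path, expected_rate=16000):
    """Return True if the file's sample rate differs from the expected rate."""
    rate, _ = wav_info(path)
    return rate != expected_rate
```

If `needs_resampling` returns `True`, the audio can be loaded with `librosa.load(path, sr=16000)` instead, as in the short-form example.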

	### Example of CTC forced alignment using `ctc-segmentation`

	CTC segmentation can be efficiently applied to audio of an arbitrary length.

	```python
	import soundfile as sf
	from espnet2.bin.s2t_ctc_align import CTCSegmentation
	from espnet_model_zoo.downloader import ModelDownloader

	# Download model first
	d = ModelDownloader()
	downloaded = d.download_and_unpack("espnet/owsm_ctc_v3.2_ft_1B")

	aligner = CTCSegmentation(
	    **downloaded,
	    fs=16000,
	    ngpu=1,
	    batch_size=32,  # batched parallel decoding; reduce it if your GPU memory is smaller
	    kaldi_style_text=True,
	    time_stamps="auto",  # "auto" can be more accurate than "fixed" when converting token index to timestamp
	    lang_sym="<eng>",
	    task_sym="<asr>",
	    context_len_in_secs=2,  # left and right context in buffered decoding
	)

	speech, rate = sf.read("./test_utils/ctc_align_test.wav")
	print(f"speech duration: {len(speech) / rate:.2f} seconds")
	text = """
	utt1 THE SALE OF THE HOTELS
	utt2 IS PART OF HOLIDAY'S STRATEGY
	utt3 TO SELL OFF ASSETS
	utt4 AND CONCENTRATE ON PROPERTY MANAGEMENT
	"""

	segments = aligner(speech, text)
	print(segments)
	```
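With `kaldi_style_text=True`, each line of `text` begins with an utterance ID followed by the transcript, as in the example above. The sketch below builds such a string from (ID, transcript) pairs. The helper name `to_kaldi_style_text` is hypothetical and not part of ESPnet.

```python
def to_kaldi_style_text(utterances):
    """Join (utt_id, transcript) pairs into one 'utt_id transcript' line each."""
    lines = []
    for utt_id, transcript in utterances:
        utt_id = utt_id.strip()
        if not utt_id or any(ch.isspace() for ch in utt_id):
            raise ValueError(f"utterance ID must be a single token: {utt_id!r}")
        # Collapse internal whitespace so each utterance stays on one line.
        lines.append(f"{utt_id} {' '.join(transcript.split())}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    text = to_kaldi_style_text([
        ("utt1", "THE SALE OF THE HOTELS"),
        ("utt2", "IS PART OF HOLIDAY'S STRATEGY"),
    ])
    print(text)
```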

	## Citations

	#### OWSM-CTC

	```BibTex
	@inproceedings{owsm-ctc,
	title = "{OWSM}-{CTC}: An Open Encoder-Only Speech Foundation Model for Speech Recognition, Translation, and Language Identification",
	author = "Peng, Yifan and
	Sudo, Yui and
	Shakeel, Muhammad and
	Watanabe, Shinji",
	booktitle = "Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL)",
	year = "2024",
	month= {8},
	url = "https://aclanthology.org/2024.acl-long.549",
	}
	```

	#### OWSM v3.1 and v3.2

	```BibTex
	@inproceedings{owsm-v32,
	title={On the Effects of Heterogeneous Data Sources on Speech-to-Text Foundation Models},
	author={Jinchuan Tian and Yifan Peng and William Chen and Kwanghee Choi and Karen Livescu and Shinji Watanabe},
	booktitle={Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH)},
	year={2024},
	month={9},
	pdf="https://arxiv.org/pdf/2406.09282"
	}
	@inproceedings{owsm-v31,
	title={{OWSM v3.1: Better and Faster Open Whisper-Style Speech Models based on E-Branchformer}},
	author={Yifan Peng and Jinchuan Tian and William Chen and Siddhant Arora and Brian Yan and Yui Sudo and Muhammad Shakeel and Kwanghee Choi and Jiatong Shi and Xuankai Chang and Jee-weon Jung and Shinji Watanabe},
	booktitle={Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH)},
	year={2024},
	month={9},
	pdf="https://arxiv.org/pdf/2401.16658",
	}
	```

	#### Initial OWSM (v1, v2, v3)

	```BibTex
	@inproceedings{owsm,
	title={Reproducing Whisper-Style Training Using An Open-Source Toolkit And Publicly Available Data},
	author={Yifan Peng and Jinchuan Tian and Brian Yan and Dan Berrebbi and Xuankai Chang and Xinjian Li and Jiatong Shi and Siddhant Arora and William Chen and Roshan Sharma and Wangyou Zhang and Yui Sudo and Muhammad Shakeel and Jee-weon Jung and Soumi Maiti and Shinji Watanabe},
	booktitle={Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)},
	year={2023},
	month={12},
	pdf="https://arxiv.org/pdf/2309.13876",
	}
	```