language: ja
library_name: transformers
license: apache-2.0
pipeline_tag: automatic-speech-recognition
tags:
- audio
- automatic-speech-recognition
- hf-asr-leaderboard
widget:
- example_title: Sample 1
src: >-
https://huggingface.co/kotoba-tech/kotoba-whisper-v2.2/resolve/main/sample_audio/sample_diarization_japanese.mp3
Kotoba-Whisper-v2.2
Kotoba-Whisper-v2.2 is a Japanese ASR model based on kotoba-tech/kotoba-whisper-v2.0, with additional postprocessing stacks integrated as a pipeline. The new features include (i) improved timestamps obtained with stable-ts and (ii) punctuation insertion with punctuators.
These libraries are merged into Kotoba-Whisper-v2.2 via the pipeline and are applied seamlessly to the transcription predicted by kotoba-tech/kotoba-whisper-v2.0.
The pipeline has been developed through a collaboration between Asahi Ushio and Kotoba Technologies.
Transformers Usage
Kotoba-Whisper-v2.2 is supported in the Hugging Face 🤗 Transformers library from version 4.39 onwards. To run the model, first install the latest version of Transformers.
pip install --upgrade pip
pip install --upgrade transformers accelerate torchaudio
pip install "punctuators==0.0.5"
pip install "pyannote.audio"
pip install git+https://github.com/huggingface/diarizers.git
Also, since the pipeline relies on pyannote.audio models hosted on the Hugging Face Hub, accept the user conditions of the pyannote models and log in with your Hugging Face token (for example via huggingface-cli login) before running it.
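Before moving on, a minimal environment check can save debugging time; the snippet below simply confirms the installed Transformers version (4.39 or later is required) and whether a CUDA device is visible:
import torch
import transformers

# Kotoba-Whisper-v2.2 requires transformers>=4.39
print("transformers:", transformers.__version__)
# a GPU is optional, but enables fp16 inference and Flash Attention
print("CUDA available:", torch.cuda.is_available())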
Transcription
The model can be used with the pipeline class to transcribe audio files as follows:
import torch
from transformers import pipeline
from datasets import load_dataset
# config
model_id = "kotoba-tech/kotoba-whisper-v2.1"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}
generate_kwargs = {"language": "ja", "task": "transcribe"}
# load model
pipe = pipeline(
model=model_id,
torch_dtype=torch_dtype,
device=device,
model_kwargs=model_kwargs,
chunk_length_s=15,
batch_size=16,
trust_remote_code=True,
stable_ts=True,
punctuator=True
)
# load sample audio
dataset = load_dataset("japanese-asr/ja_asr.reazonspeech_test", split="test")
sample = dataset[0]["audio"]
# run inference
result = pipe(sample, return_timestamps=True, generate_kwargs=generate_kwargs)
print(result)
- To transcribe a local audio file, simply pass the path to your audio file when you call the pipeline:
- result = pipe(sample, return_timestamps=True, generate_kwargs=generate_kwargs)
+ result = pipe("audio.mp3", return_timestamps=True, generate_kwargs=generate_kwargs)
- To deactivate stable-ts:
- stable_ts=True,
+ stable_ts=False,
- To deactivate punctuator:
- punctuator=True,
+ punctuator=False,
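Since the example above is run with return_timestamps=True, the returned dictionary contains segment-level timestamps in addition to the full transcription. The following sketch assumes the standard Transformers ASR pipeline output format (a "text" field plus a "chunks" list) and prints each segment with its time span in seconds:
# print each transcribed segment with its start and end time
for chunk in result["chunks"]:
    start, end = chunk["timestamp"]
    print(f"[{start} - {end}] {chunk['text']}")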
Flash Attention 2
We recommend using Flash-Attention 2 if your GPU allows for it. To do so, you first need to install Flash Attention:
pip install flash-attn --no-build-isolation
Then pass attn_implementation="flash_attention_2" to from_pretrained:
- model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}
+ model_kwargs = {"attn_implementation": "flash_attention_2"} if torch.cuda.is_available() else {}
Acknowledgements
- OpenAI for the Whisper model.
- Hugging Face 🤗 Transformers for the model integration.
- Hugging Face 🤗 for the Distil-Whisper codebase.
- Reazon Human Interaction Lab for the ReazonSpeech dataset.