---
language: ja
library_name: transformers
license: apache-2.0
pipeline_tag: automatic-speech-recognition
tags:
- audio
- automatic-speech-recognition
- hf-asr-leaderboard
widget:
- example_title: Sample 1
src: https://huggingface.co/kotoba-tech/kotoba-whisper-v2.2/resolve/main/sample_audio/sample_diarization_japanese.mp3
---
# Kotoba-Whisper-v2.2
_Kotoba-Whisper-v2.2_ is a Japanese ASR model based on [kotoba-tech/kotoba-whisper-v2.0](https://huggingface.co/kotoba-tech/kotoba-whisper-v2.0), with
additional postprocessing stacks integrated as a [`pipeline`](https://huggingface.co/docs/transformers/en/main_classes/pipelines). The new features include
(i) speaker diarization with [diarizers](https://huggingface.co/diarizers-community/speaker-segmentation-fine-tuned-callhome-jpn)
and (ii) punctuation restoration with [punctuators](https://github.com/1-800-BAD-CODE/punctuators/tree/main).
The pipeline was developed through a collaboration between [Asahi Ushio](https://asahiushio.com) and [Kotoba Technologies](https://twitter.com/kotoba_tech).
## Transformers Usage
Kotoba-Whisper-v2.2 is supported in the Hugging Face 🤗 Transformers library from version 4.39 onwards. To run the model, first
install the latest version of Transformers along with the other required packages:
```bash
pip install --upgrade pip
pip install --upgrade transformers accelerate torchaudio
pip install "punctuators==0.0.5"
pip install "pyannote.audio"
pip install git+https://github.com/huggingface/diarizers.git
```
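To verify that the installed Transformers version meets the 4.39 requirement, a quick check is:
```python
import transformers

# Kotoba-Whisper-v2.2 requires Transformers 4.39 or later.
print(transformers.__version__)
```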
To load pre-trained diarization models from the Hub, you'll first need to accept the terms-of-use for the following two models:
1. [pyannote/segmentation-3.0](https://hf.co/pyannote/segmentation-3.0)
2. [pyannote/speaker-diarization-3.1](https://hf.co/pyannote/speaker-diarization-3.1)
Then log in with your Hugging Face authentication token:
```bash
huggingface-cli login
```
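Alternatively, you can authenticate programmatically with `huggingface_hub` (the token string below is a placeholder for your own access token):
```python
from huggingface_hub import login

# Replace with a token from https://huggingface.co/settings/tokens.
login(token="hf_xxx")
```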
### Transcription with Diarization
The model can be used with the [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline).
- Download an audio sample.
```shell
wget https://huggingface.co/kotoba-tech/kotoba-whisper-v2.2/resolve/main/sample_audio/sample_diarization_japanese.mp3
```
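If `wget` is unavailable, the same file can be fetched with `huggingface_hub` (a sketch; the returned path points into the local HF cache):
```python
from huggingface_hub import hf_hub_download

# Downloads the sample into the local Hugging Face cache and returns its path.
path = hf_hub_download(
    repo_id="kotoba-tech/kotoba-whisper-v2.2",
    filename="sample_audio/sample_diarization_japanese.mp3",
)
print(path)
```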
- Run the model via pipeline.
```python
import torch
from transformers import pipeline
# config
model_id = "kotoba-tech/kotoba-whisper-v2.2"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}
generate_kwargs = {"language": "ja", "task": "transcribe"}
# load model
pipe = pipeline(
model=model_id,
torch_dtype=torch_dtype,
device=device,
model_kwargs=model_kwargs,
chunk_length_s=15,
batch_size=16,
trust_remote_code=True,
)
# run inference
result = pipe(
"sample_diarization_japanese.mp3",
add_punctuation=False,
return_unique_speaker=True,
generate_kwargs=generate_kwargs
)
print(result)
>>>
{'chunks': [{'speaker': ['SPEAKER_02'],
'text': 'そうですねこれも先ほどがずっと言っている自分の感覚的には大丈夫ですけれども',
'timestamp': (0.0, 5.0)},
{'speaker': ['SPEAKER_02'],
'text': '今は屋外の気温',
'timestamp': (5.0, 7.6)},
{'speaker': ['SPEAKER_02'],
'text': '昼も夜も上がってますので空気の入れ替えだけでは',
'timestamp': (7.6, 11.72)},
{'speaker': ['SPEAKER_02'],
'text': 'かえって人が上がってきます',
'timestamp': (11.72, 13.54)},
{'speaker': ['SPEAKER_02'],
'text': 'やっぱり愚直にやっぱりその街の良さをアピールしていくっていう',
'timestamp': (13.54, 17.24)},
{'speaker': ['SPEAKER_00'],
'text': 'そういう姿勢が基本にあった上だのこういうPR作戦だと思うんです',
'timestamp': (17.24, 23.84)}],
'chunks/SPEAKER_00': [{'speaker': ['SPEAKER_00'],
'text': 'そういう姿勢が基本にあった上だのこういうPR作戦だと思うんです',
'timestamp': (17.24, 23.84)}],
'chunks/SPEAKER_02': [{'speaker': ['SPEAKER_02'],
'text': 'そうですねこれも先ほどがずっと言っている自分の感覚的には大丈夫ですけれども',
'timestamp': (0.0, 5.0)},
{'speaker': ['SPEAKER_02'],
'text': '今は屋外の気温',
'timestamp': (5.0, 7.6)},
{'speaker': ['SPEAKER_02'],
'text': '昼も夜も上がってますので空気の入れ替えだけでは',
'timestamp': (7.6, 11.72)},
{'speaker': ['SPEAKER_02'],
'text': 'かえって人が上がってきます',
'timestamp': (11.72, 13.54)},
{'speaker': ['SPEAKER_02'],
'text': 'やっぱり愚直にやっぱりその街の良さをアピールしていくっていう',
'timestamp': (13.54, 17.24)}],
'speakers': ['SPEAKER_00', 'SPEAKER_02'],
'text': 'そうですねこれも先ほどがずっと言っている自分の感覚的には大丈夫ですけれども今は屋外の気温昼も夜も上がってますので空気の入れ替えだけではかえって人が上がってきますやっぱり愚直にやっぱりその街の良さをアピールしていくっていうそういう姿勢が基本にあった上だのこういうPR作戦だと思うんです',
'text/SPEAKER_00': 'そういう姿勢が基本にあった上だのこういうPR作戦だと思うんです',
'text/SPEAKER_02': 'そうですねこれも先ほどがずっと言っている自分の感覚的には大丈夫ですけれども今は屋外の気温昼も夜も上がってますので空気の入れ替えだけではかえって人が上がってきますやっぱり愚直にやっぱりその街の良さをアピールしていくっていう'}
```
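- To work with the returned dictionary, for example printing each chunk with its speaker label and timestamps (a minimal sketch based on the output structure shown above):
```python
# Assumes `result` from the pipeline call above.
for chunk in result["chunks"]:
    start, end = chunk["timestamp"]
    print(f"[{start:.2f}-{end:.2f}] {chunk['speaker'][0]}: {chunk['text']}")

# Per-speaker transcripts are also available under `text/<SPEAKER_ID>` keys.
for speaker in result["speakers"]:
    print(speaker, result[f"text/{speaker}"])
```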
- To activate the punctuator:
```diff
- add_punctuation=False,
+ add_punctuation=True,
```
- To allow chunks to be attributed to more than a single speaker:
```diff
- return_unique_speaker=True
+ return_unique_speaker=False
```
- To control the number of speakers (see [here](https://huggingface.co/pyannote/speaker-diarization-3.1#controlling-the-number-of-speakers)):
```diff
result = pipe(
"sample_diarization_japanese.mp3",
+ num_speakers=2,
add_punctuation=False,
return_unique_speaker=True,
generate_kwargs=generate_kwargs
)
```
or
```diff
result = pipe(
"sample_diarization_japanese.mp3",
+ min_speakers=2,
+ max_speakers=5,
add_punctuation=False,
return_unique_speaker=True,
generate_kwargs=generate_kwargs
)
```
### Flash Attention 2
We recommend using [Flash-Attention 2](https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#flashattention-2)
if your GPU supports it. To do so, first install [Flash Attention](https://github.com/Dao-AILab/flash-attention):
```bash
pip install flash-attn --no-build-isolation
```
Then pass `attn_implementation="flash_attention_2"` to the model via `model_kwargs`:
```diff
- model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}
+ model_kwargs = {"attn_implementation": "flash_attention_2"} if torch.cuda.is_available() else {}
```
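If you are unsure whether Flash Attention 2 is actually installed, one way to fall back gracefully (a sketch using the `is_flash_attn_2_available` utility from Transformers) is:
```python
import torch
from transformers.utils import is_flash_attn_2_available

# Prefer Flash Attention 2 when installed, otherwise fall back to SDPA on GPU.
if torch.cuda.is_available():
    attn = "flash_attention_2" if is_flash_attn_2_available() else "sdpa"
    model_kwargs = {"attn_implementation": attn}
else:
    model_kwargs = {}
```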
## Acknowledgements
* [OpenAI](https://openai.com/) for the Whisper [model](https://huggingface.co/openai/whisper-large-v3).
* Hugging Face 🤗 [Transformers](https://github.com/huggingface/transformers) for the model integration.
* Hugging Face 🤗 for the [Distil-Whisper codebase](https://github.com/huggingface/distil-whisper).
* [Reazon Human Interaction Lab](https://research.reazon.jp/) for the [ReazonSpeech dataset](https://huggingface.co/datasets/reazon-research/reazonspeech). |