---
language: ja
library_name: transformers
license: apache-2.0
pipeline_tag: automatic-speech-recognition
tags:
- audio
- automatic-speech-recognition
- hf-asr-leaderboard
widget:
- example_title: Sample 1
  src: https://huggingface.co/kotoba-tech/kotoba-whisper-v2.2/resolve/main/sample_audio/sample_diarization_japanese.mp3
---

# Kotoba-Whisper-v2.2
_Kotoba-Whisper-v2.2_ is a Japanese ASR model based on [kotoba-tech/kotoba-whisper-v2.0](https://huggingface.co/kotoba-tech/kotoba-whisper-v2.0), with
additional postprocessing stacks integrated as a [`pipeline`](https://huggingface.co/docs/transformers/en/main_classes/pipelines). The new features include
(i) speaker diarization with [diarizers](https://huggingface.co/diarizers-community/speaker-segmentation-fine-tuned-callhome-jpn)
and (ii) punctuation restoration with [punctuators](https://github.com/1-800-BAD-CODE/punctuators/tree/main).
The pipeline was developed through a collaboration between [Asahi Ushio](https://asahiushio.com) and [Kotoba Technologies](https://twitter.com/kotoba_tech).

## Transformers Usage
Kotoba-Whisper-v2.2 is supported in the Hugging Face 🤗 Transformers library from version 4.39 onwards. To run the model, first
install the required packages:

```bash
pip install --upgrade pip
pip install --upgrade transformers accelerate torchaudio
pip install "punctuators==0.0.5"
pip install "pyannote.audio"
pip install git+https://github.com/huggingface/diarizers.git
```
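
As a quick sanity check, you can confirm that the installed version meets the 4.39 requirement:

```python
import transformers

# Kotoba-Whisper-v2.2 requires Transformers >= 4.39
print(transformers.__version__)
```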

To load pre-trained diarization models from the Hub, you'll first need to accept the terms-of-use for the following two models:
1. [pyannote/segmentation-3.0](https://hf.co/pyannote/segmentation-3.0)
2. [pyannote/speaker-diarization-3.1](https://hf.co/pyannote/speaker-diarization-3.1)

Then log in with your Hugging Face authentication token:

```bash
huggingface-cli login
```
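
If you prefer to authenticate from Python rather than the CLI, a minimal sketch using `huggingface_hub` (the token string below is a placeholder for your own access token):

```python
from huggingface_hub import login

# Log in programmatically; replace the placeholder with your HF access token,
# or call login() with no arguments for an interactive prompt
login(token="hf_...")
```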


### Transcription with Diarization
The model can be used with the [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline).

- Download an audio sample.
```shell
wget https://huggingface.co/kotoba-tech/kotoba-whisper-v2.2/resolve/main/sample_audio/sample_diarization_japanese.mp3
```
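
If `wget` is unavailable, the same file can be fetched with `huggingface_hub` (a sketch; `hf_hub_download` returns a local cache path that can be passed to the pipeline):

```python
from huggingface_hub import hf_hub_download

# Download the sample audio from the model repo into the local HF cache
audio_path = hf_hub_download(
    repo_id="kotoba-tech/kotoba-whisper-v2.2",
    filename="sample_audio/sample_diarization_japanese.mp3",
)
```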

- Run the model via the pipeline.

```python
import torch
from transformers import pipeline

# config
model_id = "kotoba-tech/kotoba-whisper-v2.2"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}
generate_kwargs = {"language": "ja", "task": "transcribe"}

# load model
pipe = pipeline(
    model=model_id,
    torch_dtype=torch_dtype,
    device=device,
    model_kwargs=model_kwargs,
    chunk_length_s=15,
    batch_size=16,
    trust_remote_code=True,
)

# run inference
result = pipe(
    "sample_diarization_japanese.mp3",
    add_punctuation=False,
    return_unique_speaker=True,
    generate_kwargs=generate_kwargs,
)
print(result)
>>>
{'chunks': [{'speaker': ['SPEAKER_02'],
             'text': 'そうですねこれも先ほどがずっと言っている自分の感覚的には大丈夫ですけれども',
             'timestamp': (0.0, 5.0)},
            {'speaker': ['SPEAKER_02'],
             'text': '今は屋外の気温',
             'timestamp': (5.0, 7.6)},
            {'speaker': ['SPEAKER_02'],
             'text': '昼も夜も上がってますので空気の入れ替えだけでは',
             'timestamp': (7.6, 11.72)},
            {'speaker': ['SPEAKER_02'],
             'text': 'かえって人が上がってきます',
             'timestamp': (11.72, 13.54)},
            {'speaker': ['SPEAKER_02'],
             'text': 'やっぱり愚直にやっぱりその街の良さをアピールしていくっていう',
             'timestamp': (13.54, 17.24)},
            {'speaker': ['SPEAKER_00'],
             'text': 'そういう姿勢が基本にあった上だのこういうPR作戦だと思うんです',
             'timestamp': (17.24, 23.84)}],
 'chunks/SPEAKER_00': [{'speaker': ['SPEAKER_00'],
                        'text': 'そういう姿勢が基本にあった上だのこういうPR作戦だと思うんです',
                        'timestamp': (17.24, 23.84)}],
 'chunks/SPEAKER_02': [{'speaker': ['SPEAKER_02'],
                        'text': 'そうですねこれも先ほどがずっと言っている自分の感覚的には大丈夫ですけれども',
                        'timestamp': (0.0, 5.0)},
                       {'speaker': ['SPEAKER_02'],
                        'text': '今は屋外の気温',
                        'timestamp': (5.0, 7.6)},
                       {'speaker': ['SPEAKER_02'],
                        'text': '昼も夜も上がってますので空気の入れ替えだけでは',
                        'timestamp': (7.6, 11.72)},
                       {'speaker': ['SPEAKER_02'],
                        'text': 'かえって人が上がってきます',
                        'timestamp': (11.72, 13.54)},
                       {'speaker': ['SPEAKER_02'],
                        'text': 'やっぱり愚直にやっぱりその街の良さをアピールしていくっていう',
                        'timestamp': (13.54, 17.24)}],
 'speakers': ['SPEAKER_00', 'SPEAKER_02'],
 'text': 'そうですねこれも先ほどがずっと言っている自分の感覚的には大丈夫ですけれども今は屋外の気温昼も夜も上がってますので空気の入れ替えだけではかえって人が上がってきますやっぱり愚直にやっぱりその街の良さをアピールしていくっていうそういう姿勢が基本にあった上だのこういうPR作戦だと思うんです',
 'text/SPEAKER_00': 'そういう姿勢が基本にあった上だのこういうPR作戦だと思うんです',
 'text/SPEAKER_02': 'そうですねこれも先ほどがずっと言っている自分の感覚的には大丈夫ですけれども今は屋外の気温昼も夜も上がってますので空気の入れ替えだけではかえって人が上がってきますやっぱり愚直にやっぱりその街の良さをアピールしていくっていう'}
```
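
Since the output groups chunks per speaker under `chunks/<SPEAKER_ID>`, a minimal sketch for printing a per-speaker transcript (key names follow the example output above):

```python
# Iterate over detected speakers and print their time-stamped chunks
for speaker in result["speakers"]:
    print(f"--- {speaker} ---")
    for chunk in result[f"chunks/{speaker}"]:
        start, end = chunk["timestamp"]
        print(f"[{start:.2f}s - {end:.2f}s] {chunk['text']}")
```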

- To activate the punctuator:
```diff
-    add_punctuation=False,
+    add_punctuation=True,
```

- To allow more than a single speaker per chunk:
```diff
-    return_unique_speaker=True,
+    return_unique_speaker=False,
```

- To control the number of speakers (see [here](https://huggingface.co/pyannote/speaker-diarization-3.1#controlling-the-number-of-speakers)):
```diff
result = pipe(
     "sample_diarization_japanese.mp3",
+    num_speakers=2,
     add_punctuation=False,
     return_unique_speaker=True,
     generate_kwargs=generate_kwargs
)
```
or
```diff
result = pipe(
     "sample_diarization_japanese.mp3",
+    min_speakers=2,
+    max_speakers=5,
     add_punctuation=False,
     return_unique_speaker=True,
     generate_kwargs=generate_kwargs
)
```
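
Putting the options together, a call might look like this (the values are illustrative, not recommendations):

```python
# Combined example: fixed speaker count, punctuation enabled,
# and all speakers returned per chunk
result = pipe(
    "sample_diarization_japanese.mp3",
    num_speakers=2,
    add_punctuation=True,
    return_unique_speaker=False,
    generate_kwargs=generate_kwargs,
)
print(result["text"])
```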

### Flash Attention 2
We recommend using [Flash-Attention 2](https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#flashattention-2)
if your GPU supports it. To use it, first install [Flash Attention](https://github.com/Dao-AILab/flash-attention):

```bash
pip install flash-attn --no-build-isolation
```

Then pass `attn_implementation="flash_attention_2"` in `model_kwargs` when loading the pipeline:

```diff
- model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}
+ model_kwargs = {"attn_implementation": "flash_attention_2"} if torch.cuda.is_available() else {}
```
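
If you want a script that degrades gracefully on machines without flash-attn, one possible pattern (an assumption, not part of the official example) is to detect the package at runtime:

```python
import importlib.util

import torch

# Fall back to SDPA when flash-attn is not installed or no GPU is available
has_flash_attn = importlib.util.find_spec("flash_attn") is not None
attn_implementation = (
    "flash_attention_2" if torch.cuda.is_available() and has_flash_attn else "sdpa"
)
model_kwargs = {"attn_implementation": attn_implementation}
```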


## Acknowledgements
* [OpenAI](https://openai.com/) for the Whisper [model](https://huggingface.co/openai/whisper-large-v3).
* Hugging Face 🤗 [Transformers](https://github.com/huggingface/transformers) for the model integration.
* Hugging Face 🤗 for the [Distil-Whisper codebase](https://github.com/huggingface/distil-whisper).
* [Reazon Human Interaction Lab](https://research.reazon.jp/) for the [ReazonSpeech dataset](https://huggingface.co/datasets/reazon-research/reazonspeech).