---
language: ja
library_name: transformers
license: apache-2.0
pipeline_tag: automatic-speech-recognition
tags:
- audio
- automatic-speech-recognition
- hf-asr-leaderboard
widget:
- example_title: Sample 1
  src: https://huggingface.co/kotoba-tech/kotoba-whisper-v2.2/resolve/main/sample_audio/sample_diarization_japanese.mp3
---

# Kotoba-Whisper-v2.2
_Kotoba-Whisper-v2.2_ is a Japanese ASR model based on [kotoba-tech/kotoba-whisper-v2.0](https://huggingface.co/kotoba-tech/kotoba-whisper-v2.0), with
additional postprocessing stacks integrated as a [`pipeline`](https://huggingface.co/docs/transformers/en/main_classes/pipelines). The new features include
(i) speaker diarization with [diarizers](https://huggingface.co/diarizers-community/speaker-segmentation-fine-tuned-callhome-jpn)
and (ii) punctuation restoration with [punctuators](https://github.com/1-800-BAD-CODE/punctuators/tree/main).
The pipeline was developed through a collaboration between [Asahi Ushio](https://asahiushio.com) and [Kotoba Technologies](https://twitter.com/kotoba_tech).

## Transformers Usage
Kotoba-Whisper-v2.2 is supported in the Hugging Face 🤗 Transformers library from version 4.39 onwards. To run the model, first
install the required packages:

```bash
pip install --upgrade pip
pip install --upgrade transformers accelerate torchaudio
pip install "punctuators==0.0.5"
pip install "pyannote.audio"
pip install git+https://github.com/huggingface/diarizers.git
```
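
As a quick sanity check, you can confirm that the installed version meets the 4.39 requirement:

```python
import transformers

# Kotoba-Whisper-v2.2 requires Transformers >= 4.39
print(transformers.__version__)
```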

To load pre-trained diarization models from the Hub, you'll first need to accept the terms-of-use for the following two models:
1. [pyannote/segmentation-3.0](https://hf.co/pyannote/segmentation-3.0)
2. [pyannote/speaker-diarization-3.1](https://hf.co/pyannote/speaker-diarization-3.1)

Then log in with your Hugging Face authentication token:

```bash
huggingface-cli login
```
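
If you prefer to authenticate from Python rather than the CLI, a minimal sketch using `huggingface_hub` (the token string below is a placeholder for your own access token):

```python
from huggingface_hub import login

# Log in programmatically; replace the placeholder with your HF access token,
# or call login() with no arguments for an interactive prompt
login(token="hf_...")
```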


### Transcription with Diarization
The model can be used with the [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline).

- Download an audio sample.
```shell
wget https://huggingface.co/kotoba-tech/kotoba-whisper-v2.2/resolve/main/sample_audio/sample_diarization_japanese.mp3
```
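
If `wget` is unavailable, the same file can be fetched with `huggingface_hub` (a sketch; `hf_hub_download` returns a local cache path that can be passed to the pipeline):

```python
from huggingface_hub import hf_hub_download

# Download the sample audio from the model repo into the local HF cache
audio_path = hf_hub_download(
    repo_id="kotoba-tech/kotoba-whisper-v2.2",
    filename="sample_audio/sample_diarization_japanese.mp3",
)
```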

- Run the model via the pipeline.

```python
import torch
from transformers import pipeline

# config
model_id = "kotoba-tech/kotoba-whisper-v2.2"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}
generate_kwargs = {"language": "ja", "task": "transcribe"}

# load model
pipe = pipeline(
    model=model_id,
    torch_dtype=torch_dtype,
    device=device,
    model_kwargs=model_kwargs,
    chunk_length_s=15,
    batch_size=16,
    trust_remote_code=True,
)

# run inference
result = pipe(
    "sample_diarization_japanese.mp3",
    add_punctuation=False,
    return_unique_speaker=True,
    generate_kwargs=generate_kwargs,
)
print(result)
>>>
{'chunks': [{'speaker': ['SPEAKER_02'],
             'text': 'そうですねこれも先ほどがずっと言っている自分の感覚的には大丈夫ですけれども',
             'timestamp': (0.0, 5.0)},
            {'speaker': ['SPEAKER_02'],
             'text': '今は屋外の気温',
             'timestamp': (5.0, 7.6)},
            {'speaker': ['SPEAKER_02'],
             'text': '昼も夜も上がってますので空気の入れ替えだけでは',
             'timestamp': (7.6, 11.72)},
            {'speaker': ['SPEAKER_02'],
             'text': 'かえって人が上がってきます',
             'timestamp': (11.72, 13.54)},
            {'speaker': ['SPEAKER_02'],
             'text': 'やっぱり愚直にやっぱりその街の良さをアピールしていくっていう',
             'timestamp': (13.54, 17.24)},
            {'speaker': ['SPEAKER_00'],
             'text': 'そういう姿勢が基本にあった上だのこういうPR作戦だと思うんです',
             'timestamp': (17.24, 23.84)}],
 'chunks/SPEAKER_00': [{'speaker': ['SPEAKER_00'],
                        'text': 'そういう姿勢が基本にあった上だのこういうPR作戦だと思うんです',
                        'timestamp': (17.24, 23.84)}],
 'chunks/SPEAKER_02': [{'speaker': ['SPEAKER_02'],
                        'text': 'そうですねこれも先ほどがずっと言っている自分の感覚的には大丈夫ですけれども',
                        'timestamp': (0.0, 5.0)},
                       {'speaker': ['SPEAKER_02'],
                        'text': '今は屋外の気温',
                        'timestamp': (5.0, 7.6)},
                       {'speaker': ['SPEAKER_02'],
                        'text': '昼も夜も上がってますので空気の入れ替えだけでは',
                        'timestamp': (7.6, 11.72)},
                       {'speaker': ['SPEAKER_02'],
                        'text': 'かえって人が上がってきます',
                        'timestamp': (11.72, 13.54)},
                       {'speaker': ['SPEAKER_02'],
                        'text': 'やっぱり愚直にやっぱりその街の良さをアピールしていくっていう',
                        'timestamp': (13.54, 17.24)}],
 'speakers': ['SPEAKER_00', 'SPEAKER_02'],
 'text': 'そうですねこれも先ほどがずっと言っている自分の感覚的には大丈夫ですけれども今は屋外の気温昼も夜も上がってますので空気の入れ替えだけではかえって人が上がってきますやっぱり愚直にやっぱりその街の良さをアピールしていくっていうそういう姿勢が基本にあった上だのこういうPR作戦だと思うんです',
 'text/SPEAKER_00': 'そういう姿勢が基本にあった上だのこういうPR作戦だと思うんです',
 'text/SPEAKER_02': 'そうですねこれも先ほどがずっと言っている自分の感覚的には大丈夫ですけれども今は屋外の気温昼も夜も上がってますので空気の入れ替えだけではかえって人が上がってきますやっぱり愚直にやっぱりその街の良さをアピールしていくっていう'}
```
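
Since the output groups chunks per speaker under `chunks/<SPEAKER_ID>`, a minimal sketch for printing a per-speaker transcript (key names follow the example output above):

```python
# Iterate over detected speakers and print their time-stamped chunks
for speaker in result["speakers"]:
    print(f"--- {speaker} ---")
    for chunk in result[f"chunks/{speaker}"]:
        start, end = chunk["timestamp"]
        print(f"[{start:.2f}s - {end:.2f}s] {chunk['text']}")
```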

- To activate the punctuator:
```diff
-    add_punctuation=False,
+    add_punctuation=True,
```

- To allow more than a single speaker per chunk:
```diff
-    return_unique_speaker=True,
+    return_unique_speaker=False,
```

- To control the number of speakers (see [here](https://huggingface.co/pyannote/speaker-diarization-3.1#controlling-the-number-of-speakers)):
```diff
result = pipe(
     "sample_diarization_japanese.mp3",
+    num_speakers=2,
     add_punctuation=False,
     return_unique_speaker=True,
     generate_kwargs=generate_kwargs
)
```
or
```diff
result = pipe(
     "sample_diarization_japanese.mp3",
+    min_speakers=2,
+    max_speakers=5,
     add_punctuation=False,
     return_unique_speaker=True,
     generate_kwargs=generate_kwargs
)
```
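
Putting the options together, a call might look like this (the values are illustrative, not recommendations):

```python
# Combined example: fixed speaker count, punctuation enabled,
# and all speakers returned per chunk
result = pipe(
    "sample_diarization_japanese.mp3",
    num_speakers=2,
    add_punctuation=True,
    return_unique_speaker=False,
    generate_kwargs=generate_kwargs,
)
print(result["text"])
```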

### Flash Attention 2
We recommend using [Flash-Attention 2](https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#flashattention-2)
if your GPU supports it. To use it, first install [Flash Attention](https://github.com/Dao-AILab/flash-attention):

```bash
pip install flash-attn --no-build-isolation
```

Then pass `attn_implementation="flash_attention_2"` in `model_kwargs` when loading the pipeline:

```diff
- model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}
+ model_kwargs = {"attn_implementation": "flash_attention_2"} if torch.cuda.is_available() else {}
```
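
If you want a script that degrades gracefully on machines without flash-attn, one possible pattern (an assumption, not part of the official example) is to detect the package at runtime:

```python
import importlib.util

import torch

# Fall back to SDPA when flash-attn is not installed or no GPU is available
has_flash_attn = importlib.util.find_spec("flash_attn") is not None
attn_implementation = (
    "flash_attention_2" if torch.cuda.is_available() and has_flash_attn else "sdpa"
)
model_kwargs = {"attn_implementation": attn_implementation}
```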


## Acknowledgements
* [OpenAI](https://openai.com/) for the Whisper [model](https://huggingface.co/openai/whisper-large-v3).
* Hugging Face 🤗 [Transformers](https://github.com/huggingface/transformers) for the model integration.
* Hugging Face 🤗 for the [Distil-Whisper codebase](https://github.com/huggingface/distil-whisper).
* [Reazon Human Interaction Lab](https://research.reazon.jp/) for the [ReazonSpeech dataset](https://huggingface.co/datasets/reazon-research/reazonspeech).