Update README.md
We employ OpenAI's [Whisper large-v3](https://huggingface.co/openai/whisper-large-v3) as the teacher model: the student keeps the full encoder of the
teacher large-v3 model, and its two-layer decoder is initialized from the first and last layers of the large-v3 model.
Kotoba-Whisper is **6.3x faster than large-v3**, while retaining an error rate as low as that of large-v3.
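
The decoder initialization can be pictured with a short sketch. This is a minimal illustration of the setup described above, not the actual training code (that lives in the kotoba-whisper repository linked below); it assumes the distil-whisper convention of copying the teacher's first and last decoder layers:

```python
import copy

from transformers import WhisperConfig, WhisperForConditionalGeneration

# Teacher: whisper-large-v3 (32 decoder layers).
teacher = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3")

# Student: same architecture, but only 2 decoder layers.
student_config = WhisperConfig.from_pretrained("openai/whisper-large-v3", decoder_layers=2)
student = WhisperForConditionalGeneration(student_config)

# Keep the teacher's full encoder.
student.model.encoder = copy.deepcopy(teacher.model.encoder)

# Initialize the student's 2 decoder layers from the teacher's first and last decoder layers.
# (The full recipe would also copy embeddings and layer norms from the teacher.)
student.model.decoder.layers[0].load_state_dict(teacher.model.decoder.layers[0].state_dict())
student.model.decoder.layers[1].load_state_dict(teacher.model.decoder.layers[-1].state_dict())
```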

As the successor to our first model, [kotoba-whisper-v1.0](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.0), we release ***kotoba-whisper-v2.0***, trained on the `all` subset of [ReazonSpeech](https://huggingface.co/datasets/reazon-research/reazonspeech)
(the largest speech-transcription paired dataset in Japanese, extracted from Japanese TV audio recordings),
which amounts to 7,203,957 audio clips (5 seconds of audio with 18 text tokens on average) after
transcriptions with a WER higher than 10 are removed (see [WER Filter](https://huggingface.co/distil-whisper/distil-large-v3#wer-filter) for details).
The model was trained for 8 epochs with a batch size of 256 at a 16kHz sampling rate, and the training and evaluation code to reproduce kotoba-whisper is available at [https://github.com/kotoba-tech/kotoba-whisper](https://github.com/kotoba-tech/kotoba-whisper).
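
The WER filter can be sketched as follows. This is a hypothetical illustration, not the actual pipeline code: it assumes pairs of ground-truth transcriptions and teacher-generated pseudo-labels, both pre-segmented into words, and drops pairs whose WER exceeds 10%:

```python
from evaluate import load

wer_metric = load("wer")

def keep_sample(reference: str, pseudo_label: str, threshold: float = 10.0) -> bool:
    # WER (in %) between the ground-truth transcript and the teacher's pseudo-label;
    # samples above the threshold are discarded from the training set.
    wer = 100 * wer_metric.compute(references=[reference], predictions=[pseudo_label])
    return wer <= threshold

# Toy pairs, already segmented into space-separated words.
pairs = [
    ("今日 は いい 天気", "今日 は いい 天気"),  # WER 0  -> kept
    ("今日 は いい 天気", "昨日 は 悪い 天気"),  # WER 50 -> dropped
]
filtered = [pair for pair in pairs if keep_sample(*pair)]
```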

Kotoba-whisper-v2.0 achieves better CER and WER than [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) on the in-domain held-out test set
from ReazonSpeech, and achieves competitive CER and WER on the out-of-domain test sets including [JSUT basic 5000](https://sites.google.com/site/shinnosuketakamichi/publication/jsut) and
the Japanese subset of [CommonVoice 8.0](https://huggingface.co/datasets/common_voice) (see [Evaluation](#evaluation) for details).

- ***CER***

| Model                                                                                          | [CommonVoice 8.0](https://huggingface.co/datasets/japanese-asr/ja_asr.common_voice_8_0) | [JSUT basic5000](https://huggingface.co/datasets/japanese-asr/ja_asr.jsut_basic5000) | [ReazonSpeech Test](https://huggingface.co/datasets/japanese-asr/ja_asr.reazonspeech_test) |
|:-----------------------------------------------------------------------------------------------|-------------------:|-----------------:|--------------------:|
| [**kotoba-tech/kotoba-whisper-v2.0**](https://huggingface.co/kotoba-tech/kotoba-whisper-v2.0)   | 9.20               | 8.40             | **11.63**           |
| [kotoba-tech/kotoba-whisper-v1.0](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.0)       | 9.44               | 8.48             | 12.60               |
| [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3)                       | **8.52**           | **7.18**         | 15.18               |
| [openai/whisper-medium](https://huggingface.co/openai/whisper-medium)                           | 11.34              | 9.87             | 29.56               |
| [openai/whisper-small](https://huggingface.co/openai/whisper-small)                             | 15.26              | 14.22            | 34.29               |
| [openai/whisper-tiny](https://huggingface.co/openai/whisper-tiny)                               | 46.86              | 35.69            | 96.69               |

- ***WER***

| Model                                                                                             | [CommonVoice 8.0](https://huggingface.co/datasets/japanese-asr/ja_asr.common_voice_8_0) | [JSUT basic5000](https://huggingface.co/datasets/japanese-asr/ja_asr.jsut_basic5000) | [ReazonSpeech Test](https://huggingface.co/datasets/japanese-asr/ja_asr.reazonspeech_test) |
|:--------------------------------------------------------------------------------------------------|---------------------------:|----------------:|------------------:|
| [**kotoba-tech/kotoba-whisper-v2.0**](https://huggingface.co/kotoba-tech/kotoba-whisper-v2.0)      | 58.8                        | 63.7            | **55.6**          |
| [kotoba-tech/kotoba-whisper-v1.0](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.0)          | 59.27                       | 64.36           | 56.62             |
| [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3)                          | **55.41**                   | **59.34**       | 60.23             |
| [openai/whisper-medium](https://huggingface.co/openai/whisper-medium)                              | 63.64                       | 69.52           | 76.04             |
| [openai/whisper-small](https://huggingface.co/openai/whisper-small)                                | 74.21                       | 82.02           | 82.99             |
| [openai/whisper-tiny](https://huggingface.co/openai/whisper-tiny)                                  | 93.78                       | 97.72           | 94.85             |
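
On the metrics: CER is computed over characters, so raw Japanese strings can be scored directly, while WER is computed over whitespace-delimited words, so Japanese text has to be segmented first. A minimal sketch with the `evaluate` library (the strings are made-up examples; the spaces for WER are inserted by hand purely for illustration):

```python
from evaluate import load

cer_metric = load("cer")
wer_metric = load("wer")

# CER: character-level edit distance over raw Japanese strings.
cer = cer_metric.compute(references=["今日はいい天気です"], predictions=["今日は良い天気です"])

# WER: word-level edit distance; tokens must be separated by whitespace.
wer = wer_metric.compute(references=["今日 は いい 天気 です"], predictions=["今日 は 良い 天気 です"])

print(f"CER: {100 * cer:.1f}%, WER: {100 * wer:.1f}%")
```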

Since kotoba-whisper-v2.0 shares the distil-whisper architecture, it inherits the benefit of the improved latency compared to [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3):

| Model | Params / M | Rel. Latency |
|----------------------------------------------------------------------------------------------|------------|--------------|
| **[kotoba-tech/kotoba-whisper-v2.0](https://huggingface.co/kotoba-tech/kotoba-whisper-v2.0)** | **756**    | **6.3**      |
| [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3)                      | 1550       | 1.0          |
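
Rel. Latency is the teacher's wall-clock transcription time divided by the model's, so higher is faster. A rough, hypothetical way to estimate such a number (not the distil-whisper benchmark protocol; a real measurement would use a GPU, real speech, and more runs):

```python
import time

import numpy as np
from transformers import pipeline

def mean_latency(model_id: str, audio: dict, runs: int = 3) -> float:
    """Average wall-clock seconds per transcription for one checkpoint."""
    pipe = pipeline("automatic-speech-recognition", model=model_id)
    pipe(audio)  # warm-up, excluded from timing
    start = time.perf_counter()
    for _ in range(runs):
        pipe(audio)
    return (time.perf_counter() - start) / runs

# 30 seconds of synthetic audio as a stand-in input.
audio = {"raw": np.random.randn(16000 * 30).astype(np.float32), "sampling_rate": 16000}
t_student = mean_latency("kotoba-tech/kotoba-whisper-v2.0", audio)
t_teacher = mean_latency("openai/whisper-large-v3", audio)
print(f"Relative latency: {t_teacher / t_student:.1f}x")
```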

In the usage examples, the configuration now points at the v2.0 checkpoint:

```python
import torch
from transformers import pipeline
from datasets import load_dataset

# config
model_id = "kotoba-tech/kotoba-whisper-v2.0"
torch_dtype = torch.bfloat16 if torch.cuda.is_available() else torch.float32
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}
```
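
Continuing from the snippet above, a minimal sketch of how this configuration is consumed (the pipeline call follows the standard transformers ASR API; the audio file name is a placeholder):

```python
pipe = pipeline(
    "automatic-speech-recognition",
    model=model_id,
    torch_dtype=torch_dtype,
    device=device,
    model_kwargs=model_kwargs,
)
# Transcribe a local audio file (placeholder name).
result = pipe("sample_ja.wav")
print(result["text"])
```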

The same configuration block appears in the second usage example:

```python
import torch
from transformers import pipeline
from datasets import load_dataset

# config
model_id = "kotoba-tech/kotoba-whisper-v2.0"
torch_dtype = torch.bfloat16 if torch.cuda.is_available() else torch.float32
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}
```

And in the evaluation example:

```python
import torch
from evaluate import load
from transformers.models.whisper.english_normalizer import BasicTextNormalizer

# model config
model_id = "kotoba-tech/kotoba-whisper-v2.0"
torch_dtype = torch.bfloat16 if torch.cuda.is_available() else torch.float32
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}
```
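
Continuing from the configuration above, a sketch of how a CER score like those in the table can be reproduced. The dataset split and column names are assumptions about the linked japanese-asr datasets, and the official evaluation script is in the kotoba-whisper repository:

```python
from datasets import load_dataset
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model=model_id,
    torch_dtype=torch_dtype,
    device=device,
    model_kwargs=model_kwargs,
)
normalizer = BasicTextNormalizer()
cer_metric = load("cer")

# Assumed layout: an "audio" column and a "transcription" column in a "test" split.
dataset = load_dataset("japanese-asr/ja_asr.jsut_basic5000", split="test")
predictions = [normalizer(pipe(sample["audio"])["text"]) for sample in dataset]
references = [normalizer(sample["transcription"]) for sample in dataset]
print(f"CER: {100 * cer_metric.compute(predictions=predictions, references=references):.2f}")
```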