---
sdk: gradio
sdk_version: 5.16.0
---
# Whisper-WebUI
A Gradio-based browser interface for Whisper
## Features
- Select the Whisper implementation you want to use between:
  - openai/whisper
  - SYSTRAN/faster-whisper (used by default)
  - Vaibhavs10/insanely-fast-whisper
- Generate transcriptions from various sources, including files & microphone
- Currently supported output formats: csv, srt & txt
- Speech to Text Translation:
  - From other languages to English (this is Whisper's end-to-end speech-to-text translation feature)
  - Translate transcription files using Facebook NLLB models
- Pre-processing audio input with Silero VAD
- Post-processing with speaker diarization using the pyannote model:
  - To download the pyannote model, you need to have a Hugging Face token and manually accept the model terms on the pyannote model pages on Hugging Face (see the authentication sketch after this list)
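After accepting the terms, the token also has to be available on the machine running the WebUI. As a minimal sketch (how you supply the token is an assumption, not something this README specifies), you can cache it with the Hugging Face CLI:

```sh
# Logs in interactively and caches the token for huggingface_hub to pick up
huggingface-cli login
```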
## Installation and Running
### Run Locally
#### Prerequisite
To run this WebUI, you need `git`, `python` version 3.8 ~ 3.10, and `FFmpeg`.

If you're not using an Nvidia GPU, or are using a CUDA version other than 12.4, edit `requirements.txt` to match your environment (see the sketch after the list below).

Please follow the links below to install the necessary software:
- git : https://git-scm.com/downloads
- python : https://www.python.org/downloads/
- FFmpeg : https://ffmpeg.org/download.html
- CUDA : https://developer.nvidia.com/cuda-downloads
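PyTorch publishes wheels per CUDA version under separate index URLs, so the edit usually comes down to changing the index line. A minimal sketch, assuming the file pins `torch` via an `--extra-index-url` line (the actual contents of `requirements.txt` may differ):

```txt
# For CUDA 12.1 instead of 12.4:
--extra-index-url https://download.pytorch.org/whl/cu121
torch

# For a CPU-only machine, use the CPU index instead:
# --extra-index-url https://download.pytorch.org/whl/cpu
```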
After installing FFmpeg, make sure to add the `FFmpeg/bin` folder to your system `PATH`.
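To confirm `PATH` is set correctly, open a new terminal and run:

```sh
# Prints FFmpeg version and build info if FFmpeg/bin is on PATH
ffmpeg -version
```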
#### Installation using the script files
- Download the repository and extract its contents
- Run `install.bat` or `install.sh` to install dependencies (it will create a `venv` directory and install dependencies there; see the manual equivalent after this list)
- Start the WebUI with `start-webui.bat` or `start-webui.sh` (it will run `python app.py` after activating the venv)
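If you prefer to perform the same steps by hand, the scripts roughly amount to the following. A minimal sketch for Linux/macOS; on Windows, activate the venv with `venv\Scripts\activate` instead:

```sh
# What install.sh does: create a venv and install dependencies into it
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

# What start-webui.sh does: run the app inside the activated venv
python app.py
```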
### Running with Docker

1. Install and launch Docker-Desktop
2. Get the repository
3. If needed, update `docker-compose.yaml` to match your environment (see the sketch at the end of this section)

Docker commands:

Build the image (the image is about 7 GB):

```sh
docker compose build
```

Run the container:

```sh
docker compose up
```

Connect to the WebUI with your browser at http://localhost:7860
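Two common reasons to edit `docker-compose.yaml` are changing the published port and exposing a GPU to the container. A hypothetical excerpt, assuming a service named `whisper-webui` (the actual service name and keys in the project's compose file may differ):

```yaml
services:
  whisper-webui:        # assumed service name; check docker-compose.yaml
    ports:
      - "7860:7860"     # host:container; change the left side to serve on another port
    deploy:
      resources:
        reservations:
          devices:      # pass an Nvidia GPU through to the container
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```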
## VRAM Usages
This project is integrated with faster-whisper by default for better VRAM usage and transcription speed.
According to faster-whisper, the efficiency of the optimized Whisper model is as follows:

| Implementation | Precision | Beam size | Time | Max. GPU memory | Max. CPU memory |
|----------------|-----------|-----------|------|-----------------|-----------------|
| openai/whisper | fp16 | 5 | 4m30s | 11325MB | 9439MB |
| faster-whisper | fp16 | 5 | 54s | 4755MB | 3244MB |

Whisper's original VRAM usage table for available models:
| Size | Parameters | English-only model | Multilingual model | Required VRAM | Relative speed |
|--------|------------|--------------------|--------------------|---------------|----------------|
| tiny | 39 M | `tiny.en` | `tiny` | ~1 GB | ~32x |
| base | 74 M | `base.en` | `base` | ~1 GB | ~16x |
| small | 244 M | `small.en` | `small` | ~2 GB | ~6x |
| medium | 769 M | `medium.en` | `medium` | ~5 GB | ~2x |
| large | 1550 M | N/A | `large` | ~10 GB | 1x |

Note: `.en` models are for English only, and you can use the `Translate to English` option with the other models.