---
sdk: gradio
sdk_version: 5.16.0
---
# Whisper-WebUI
A Gradio-based browser interface for Whisper
## Features
- Select the Whisper implementation you want to use between:
  - [openai/whisper](https://github.com/openai/whisper)
  - [SYSTRAN/faster-whisper](https://github.com/SYSTRAN/faster-whisper) (used by default)
  - [Vaibhavs10/insanely-fast-whisper](https://github.com/Vaibhavs10/insanely-fast-whisper)
- Generate transcriptions from various sources, including files & microphone (see the transcription sketch after this list)
- Currently supported output formats: csv, srt & txt
- Speech to Text Translation:
  - From other languages to English (this is Whisper's end-to-end speech-to-text translation feature)
  - Translate transcription files using Facebook NLLB models (see the NLLB sketch after this list)
- Pre-processing audio input with Silero VAD (see the VAD sketch after this list)
- Post-processing with speaker diarization using the pyannote model (see the diarization sketch after this list):
  - To download the pyannote model, you need a Hugging Face token and must manually accept the terms on the pyannote model pages below:
    - [https://huggingface.co/pyannote/speaker-diarization-3.1](https://huggingface.co/pyannote/speaker-diarization-3.1)
    - [https://huggingface.co/pyannote/segmentation-3.0](https://huggingface.co/pyannote/segmentation-3.0)
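To make the default backend concrete, here is a minimal faster-whisper transcription sketch; the model size, device, and file name are illustrative choices, not the exact settings Whisper-WebUI applies.

```python
# Minimal faster-whisper sketch (illustrative settings, not Whisper-WebUI's exact code)
from faster_whisper import WhisperModel

model = WhisperModel("large-v2", device="cuda", compute_type="float16")
segments, info = model.transcribe("audio.mp3", beam_size=5)

print(f"Detected language: {info.language} (p={info.language_probability:.2f})")
for segment in segments:  # `segments` is a generator; iterating runs the transcription
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```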
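The NLLB translation step can likewise be sketched with the Hugging Face transformers pipeline; the checkpoint and FLORES-200 language codes below are assumptions for illustration.

```python
# Sketch: translating transcribed text with an NLLB checkpoint via transformers
# (model name and language codes are illustrative assumptions)
from transformers import pipeline

translator = pipeline(
    "translation",
    model="facebook/nllb-200-distilled-600M",
    src_lang="kor_Hang",  # source language, FLORES-200 code
    tgt_lang="eng_Latn",  # target language, FLORES-200 code
)
print(translator("안녕하세요, 반갑습니다.")[0]["translation_text"])
```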
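For the VAD pre-processing, Silero VAD is typically loaded through torch.hub as below; this is a generic usage sketch rather than the WebUI's exact code path.

```python
# Sketch: finding speech regions with Silero VAD before transcription
import torch

model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, _, _ = utils

wav = read_audio("audio.wav", sampling_rate=16000)
# Each entry is a {'start': ..., 'end': ...} speech span (in samples)
speech_timestamps = get_speech_timestamps(wav, model, sampling_rate=16000)
print(speech_timestamps)
```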
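And the diarization post-processing boils down to a pyannote.audio pipeline call like the following; replace `HF_TOKEN` with your own Hugging Face token after accepting the model terms on the pages above.

```python
# Sketch: speaker diarization with pyannote.audio (needs accepted terms + an HF token)
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="HF_TOKEN",  # your Hugging Face access token
)
diarization = pipeline("audio.wav")
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:.1f}s - {turn.end:.1f}s: {speaker}")
```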
## Installation and Running
### Run Locally
#### Prerequisite
To run this WebUI, you need to have `git`, `python` (version 3.8 ~ 3.10), and `FFmpeg`.
If you're not using an Nvidia GPU, or are using a CUDA version other than 12.4, edit `requirements.txt` to match your environment.

Please follow the links below to install the necessary software:
- git : [https://git-scm.com/downloads](https://git-scm.com/downloads)
- python : [https://www.python.org/downloads/](https://www.python.org/downloads/)
- FFmpeg : [https://ffmpeg.org/download.html](https://ffmpeg.org/download.html)
- CUDA : [https://developer.nvidia.com/cuda-downloads](https://developer.nvidia.com/cuda-downloads)
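For instance, switching to a different CUDA build of PyTorch usually means changing the wheel index that `requirements.txt` points at; the exact lines in this project's file may differ, so treat this excerpt as a hypothetical illustration.

```
# Hypothetical requirements.txt excerpt: swap the index to match your setup
# (e.g. cu121 or cu118 instead of cu124, or cpu for machines without an Nvidia GPU)
--extra-index-url https://download.pytorch.org/whl/cu121
torch
torchaudio
```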
After installing FFmpeg, make sure to add the `FFmpeg/bin` folder to your system PATH.

#### Installation Using the Script Files
1. Download the repository and extract its contents.
2. Run `install.bat` or `install.sh` to install dependencies (it will create a `venv` directory and install dependencies there).
3. Start the WebUI with `start-webui.bat` or `start-webui.sh` (it will run `python app.py` after activating the venv).
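If you'd rather not use the scripts, the equivalent manual steps look roughly like this (Linux/macOS shown; on Windows, activate with `venv\Scripts\activate`):

```sh
# Roughly what install.sh and start-webui.sh do, as manual steps
cd Whisper-WebUI                 # the extracted repository directory
python -m venv venv              # create the virtual environment the scripts use
source venv/bin/activate         # on Windows: venv\Scripts\activate
pip install -r requirements.txt
python app.py                    # launches the Gradio WebUI
```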
### Running with Docker
- Install and launch Docker-Desktop
- Get the repository
- Build the image (the image is about ~7GB):

  ```sh
  docker compose build
  ```

- Run the container:

  ```sh
  docker compose up
  ```

- Connect to the WebUI with your browser at http://localhost:7860

Note: If needed, update the `docker-compose.yaml` to match your environment.
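Common tweaks include the published port and the GPU reservation; the service name below is an assumption, so match it to the one actually used in the repository's file.

```yaml
# Hypothetical docker-compose.yaml tweaks; the service name is an assumption
services:
  whisper-webui:
    ports:
      - "8080:7860"              # serve the WebUI on host port 8080 instead of 7860
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia     # drop this block to run CPU-only
              count: all
              capabilities: [gpu]
```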
## VRAM Usages
This project is integrated with faster-whisper by default for better VRAM usage and transcription speed.
According to faster-whisper, the efficiency of the optimized whisper model is as follows:

| Implementation | Precision | Beam size | Time  | Max. GPU memory | Max. CPU memory |
|----------------|-----------|-----------|-------|-----------------|-----------------|
| openai/whisper | fp16      | 5         | 4m30s | 11325MB         | 9439MB          |
| faster-whisper | fp16      | 5         | 54s   | 4755MB          | 3244MB          |

Whisper's original VRAM usage table for available models:
| Size   | Parameters | English-only model | Multilingual model | Required VRAM | Relative speed |
|--------|------------|--------------------|--------------------|---------------|----------------|
| tiny   | 39 M       | tiny.en            | tiny               | ~1 GB         | ~32x           |
| base   | 74 M       | base.en            | base               | ~1 GB         | ~16x           |
| small  | 244 M      | small.en           | small              | ~2 GB         | ~6x            |
| medium | 769 M      | medium.en          | medium             | ~5 GB         | ~2x            |
| large  | 1550 M     | N/A                | large              | ~10 GB        | 1x             |
Note: `.en` models are for English only, and you can use the `Translate to English` option with the multilingual models.
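If the figures above don't fit your GPU, faster-whisper can also load quantized weights; a minimal sketch, with the model size and compute type as illustrative choices:

```python
# Sketch: trading precision for VRAM via faster-whisper quantization
from faster_whisper import WhisperModel

# int8_float16 stores weights in int8 and computes in fp16, which
# typically needs far less GPU memory than the fp16 rows above.
model = WhisperModel("medium", device="cuda", compute_type="int8_float16")
segments, _ = model.transcribe("audio.mp3")
print(" ".join(segment.text.strip() for segment in segments))
```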