---
sdk: gradio
sdk_version: 5.16.0
---
# Whisper-WebUI
A Gradio-based browser interface for [Whisper](https://github.com/openai/whisper)
# Features
- Select the Whisper implementation you want to use:
- [openai/whisper](https://github.com/openai/whisper)
- [SYSTRAN/faster-whisper](https://github.com/SYSTRAN/faster-whisper) (used by default)
- [Vaibhavs10/insanely-fast-whisper](https://github.com/Vaibhavs10/insanely-fast-whisper)
- Generate transcriptions from various sources, including **files** & **microphone**
- Currently supported output formats: **csv**, **srt** & **txt**
- Speech-to-Text translation (see the first sketch after this list):
  - From other languages to English (this is Whisper's end-to-end speech-to-text translation feature)
  - Translate transcription files using Facebook's NLLB models
- Pre-processing audio input with [Silero VAD](https://github.com/snakers4/silero-vad) (sketched after this list)
- Post-processing with speaker diarization using the [pyannote](https://huggingface.co/pyannote/speaker-diarization-3.1) model (sketched after this list):
  - To download the pyannote model, you need a Huggingface token and must manually accept the terms on the pages below:
    1. https://huggingface.co/pyannote/speaker-diarization-3.1
    2. https://huggingface.co/pyannote/segmentation-3.0
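For reference, Whisper's end-to-end translation mentioned above maps to a single argument when the model is called from Python. A minimal sketch, assuming the `openai-whisper` package is installed (the audio file name is a placeholder):

```python
import whisper

# Any multilingual model works; the ".en" models cannot translate.
model = whisper.load_model("medium")

# task="translate" makes Whisper output English text regardless of
# the language spoken in the audio.
result = model.transcribe("speech.mp3", task="translate")
print(result["text"])
```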
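The Silero VAD pre-processing step filters out non-speech audio before it reaches Whisper. The WebUI wires this up internally; a minimal sketch of typical Silero VAD usage, assuming `torch` and `torchaudio` are installed (the input file name is a placeholder):

```python
import torch

# Fetch the pretrained VAD model and its helper functions via torch.hub.
model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad", trust_repo=True)
get_speech_timestamps, save_audio, read_audio, VADIterator, collect_chunks = utils

# Silero VAD expects 16 kHz mono audio.
wav = read_audio("input.wav", sampling_rate=16000)
timestamps = get_speech_timestamps(wav, model, sampling_rate=16000)
print(timestamps)  # e.g. [{'start': 0, 'end': 32000}, ...] in samples
```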
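Once the pyannote terms above have been accepted, the diarization pipeline loads with your Huggingface token. A minimal sketch; the token value and file name are placeholders:

```python
from pyannote.audio import Pipeline

# Requires a Huggingface token whose account has accepted the terms
# on both model pages listed above.
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="YOUR_HF_TOKEN",
)

diarization = pipeline("audio.wav")
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:.1f}s - {turn.end:.1f}s: {speaker}")
```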
# Installation and Running
- ## Run Locally
### Prerequisite
To run this WebUI, you need `git`, `python` version 3.8 ~ 3.10, and `FFmpeg`.<BR>If you're not using an Nvidia GPU, or are using a CUDA version other than 12.4, edit `requirements.txt` to match your environment.
Please follow the links below to install the necessary software:
- git : [https://git-scm.com/downloads](https://git-scm.com/downloads)
- python : [https://www.python.org/downloads/](https://www.python.org/downloads/)
- FFmpeg : [https://ffmpeg.org/download.html](https://ffmpeg.org/download.html)
- CUDA : [https://developer.nvidia.com/cuda-downloads](https://developer.nvidia.com/cuda-downloads)
After installing `FFmpeg`, make sure to **add** the `FFmpeg/bin` folder to your system `PATH`
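If you installed CUDA, you can also verify that PyTorch was built against a matching CUDA version. A quick sanity check, run inside the project's environment:

```python
import torch

# If is_available() prints False on an Nvidia machine, requirements.txt
# likely needs to be adjusted to your CUDA version (or to CPU-only use).
print(torch.cuda.is_available())  # True if a usable GPU is detected
print(torch.version.cuda)         # CUDA version PyTorch was built with
```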
### Installation using the script files
1. Download the repository and extract its contents
2. Run `install.bat` or `install.sh` (it creates a `venv` directory and installs the dependencies into it)
3. Start the WebUI with `start-webui.bat` or `start-webui.sh` (it activates the venv and runs `python app.py`)
- ## Running with Docker
1. Install and launch [Docker-Desktop](https://www.docker.com/products/docker-desktop/)
2. Clone or download the repository
3. If needed, update the `docker-compose.yaml` to match your environment
4. Docker commands:
Build the image (it is about 7 GB)
```sh
docker compose build
```
Run the container
```sh
docker compose up
```
5. Connect to the WebUI with your browser at `http://localhost:7860`
# VRAM Usage
- This project is integrated with [faster-whisper](https://github.com/guillaumekln/faster-whisper) by default for lower VRAM usage and faster transcription (a low-VRAM sketch follows the tables below).<BR>According to faster-whisper, the optimized Whisper model performs as follows:
| Implementation | Precision | Beam size | Time | Max. GPU memory | Max. CPU memory |
|-------------------|-----------|-----------|-------|-----------------|-----------------|
| openai/whisper | fp16 | 5 | 4m30s | 11325MB | 9439MB |
| faster-whisper | fp16 | 5 | 54s | 4755MB | 3244MB |
- Whisper's original VRAM usage table for available models:
| Size | Parameters | English-only model | Multilingual model | Required VRAM | Relative speed |
|:------:|:----------:|:------------------:|:------------------:|:-------------:|:--------------:|
| tiny | 39 M | `tiny.en` | `tiny` | ~1 GB | ~32x |
| base | 74 M | `base.en` | `base` | ~1 GB | ~16x |
| small | 244 M | `small.en` | `small` | ~2 GB | ~6x |
| medium | 769 M | `medium.en` | `medium` | ~5 GB | ~2x |
| large | 1550 M | N/A | `large` | ~10 GB | 1x |
Note: `.en` models are English-only; with the multilingual models you can use the `Translate to English` option.
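If the figures above put your GPU out of reach, faster-whisper also supports quantized inference, which cuts memory use further. A minimal sketch, assuming the `faster-whisper` package is installed; the model size and file name are illustrative:

```python
from faster_whisper import WhisperModel

# int8 quantization reduces memory use well below the fp16 figures
# above; device="cpu" also works, trading speed for VRAM.
model = WhisperModel("medium", device="cuda", compute_type="int8")

segments, info = model.transcribe("audio.mp3", beam_size=5)
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```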