---
sdk: gradio
sdk_version: 5.16.0
---
# Whisper-WebUI
A Gradio-based browser interface for [Whisper](https://github.com/openai/whisper)

# Features
- Select the Whisper implementation you want to use from:
   - [openai/whisper](https://github.com/openai/whisper)
   - [SYSTRAN/faster-whisper](https://github.com/SYSTRAN/faster-whisper) (used by default)
   - [Vaibhavs10/insanely-fast-whisper](https://github.com/Vaibhavs10/insanely-fast-whisper)
- Generate transcriptions from various sources, including **files** & **microphone**
- Currently supported output formats: **csv**, **srt** & **txt**
- Speech to Text Translation:
  - From other languages to English (This is Whisper's end-to-end speech-to-text translation feature)
  - Translate transcription files using Facebook NLLB models
- Pre-processing audio input with [Silero VAD](https://github.com/snakers4/silero-vad)
- Post-processing with speaker diarization using the [pyannote](https://huggingface.co/pyannote/speaker-diarization-3.1) model:
   - To download the pyannote model, you need a Hugging Face token and must manually accept the terms on the pages below:
      1. https://huggingface.co/pyannote/speaker-diarization-3.1
      2. https://huggingface.co/pyannote/segmentation-3.0
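Once you have accepted the terms, the token must be available the first time the model is downloaded. One way to provide it (a sketch; the token below is a placeholder, and the WebUI also lets you paste the token in its interface):

```sh
# Make your Hugging Face token available in the current shell.
# Replace the placeholder with your own token from
# https://huggingface.co/settings/tokens
export HF_TOKEN="hf_xxxxxxxxxxxxxxxx"

# Alternatively, log in once with the huggingface_hub CLI
# (pip install huggingface_hub), which caches the token:
#   huggingface-cli login
```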

# Installation and Running

- ## Run Locally

    ### Prerequisite
    To run this WebUI, you need `git`, `python` version 3.8 ~ 3.10, and `FFmpeg`.<BR>If you're not using an Nvidia GPU, or your `CUDA` version is not 12.4, edit `requirements.txt` to match your environment.
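    For example, a CPU-only setup can point `pip` at PyTorch's CPU wheel index by adding an extra index line near the `torch` entries in `requirements.txt` (a sketch; keep the exact version pins your release specifies):

    ```
    --extra-index-url https://download.pytorch.org/whl/cpu
    torch
    torchaudio
    ```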
      
    Please follow the links below to install the necessary software:
    - git : [https://git-scm.com/downloads](https://git-scm.com/downloads)
    - python : [https://www.python.org/downloads/](https://www.python.org/downloads/)
    - FFmpeg :  [https://ffmpeg.org/download.html](https://ffmpeg.org/download.html)
    - CUDA : [https://developer.nvidia.com/cuda-downloads](https://developer.nvidia.com/cuda-downloads)
    
    After installing `FFmpeg`, make sure to **add** the `FFmpeg/bin` folder to your system `PATH`.
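    To confirm everything is visible on `PATH`, you can run a quick check like this (a sketch):

    ```sh
    # Report which prerequisites can be found on PATH.
    for tool in git python ffmpeg; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "$tool: found"
        else
            echo "$tool: MISSING - install it or fix PATH"
        fi
    done
    ```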

    ### Installation using the script files
    
    1. Download the repository and extract its contents
    2. Run `install.bat` or `install.sh` to install dependencies (It will create a `venv` directory and install dependencies there)
    3. Start WebUI with `start-webui.bat` or `start-webui.sh` (It will run `python app.py` after activating the venv)

- ## Running with Docker

    1. Install and launch [Docker-Desktop](https://www.docker.com/products/docker-desktop/)

    2. Get the repository

    3. If needed, update the `docker-compose.yaml` to match your environment

    4. Docker commands:

        Build the image (about 7 GB)
        ```sh
        docker compose build
        ```

        Run the container 
        ```sh
        docker compose up
        ```

    5. Connect to the WebUI with your browser at `http://localhost:7860`
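    If you want the container to use an Nvidia GPU, `docker-compose.yaml` needs a device reservation. A sketch of the relevant fragment (the service name here is an assumption; this also requires the NVIDIA Container Toolkit on the host):

    ```yaml
    services:
      whisper-webui:          # use the service name from your docker-compose.yaml
        deploy:
          resources:
            reservations:
              devices:
                - driver: nvidia
                  count: 1
                  capabilities: [gpu]
    ```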
    
# VRAM Usages
- This project is integrated with [faster-whisper](https://github.com/SYSTRAN/faster-whisper) by default for better VRAM usage and transcription speed.<BR>According to faster-whisper, the efficiency of the optimized Whisper model is as follows:
    | Implementation    | Precision | Beam size | Time  | Max. GPU memory | Max. CPU memory |
    |-------------------|-----------|-----------|-------|-----------------|-----------------|
    | openai/whisper    | fp16      | 5         | 4m30s | 11325MB         | 9439MB          |
    | faster-whisper    | fp16      | 5         | 54s   | 4755MB          | 3244MB          |

- Whisper's original VRAM usage table for available models:
    |  Size  | Parameters | English-only model | Multilingual model | Required VRAM | Relative speed |
    |:------:|:----------:|:------------------:|:------------------:|:-------------:|:--------------:|
    |  tiny  |    39 M    |     `tiny.en`      |       `tiny`       |     ~1 GB     |      ~32x      |
    |  base  |    74 M    |     `base.en`      |       `base`       |     ~1 GB     |      ~16x      |
    | small  |   244 M    |     `small.en`     |      `small`       |     ~2 GB     |      ~6x       |
    | medium |   769 M    |    `medium.en`     |      `medium`      |     ~5 GB     |      ~2x       |
    | large  |   1550 M   |        N/A         |      `large`       |    ~10 GB     |       1x       |

    Note: `.en` models are English-only; with the multilingual models you can use the `Translate to English` option.