Speech Recognition AI: Fine-Tuned Whisper and Wav2Vec2 for Real-Time Audio


This project fine-tunes OpenAI's Whisper (whisper-small) and Facebook's Wav2Vec2 (wav2vec2-base-960h) models for real-time speech recognition using live audio recordings. It’s designed for dynamic environments where low-latency transcription is key, such as live conversations or streaming audio.

Model Description

This is a fine-tuned version of OpenAI's Whisper small model and Facebook's Wav2Vec2 base model, optimized for real-time speech-to-text transcription. The models were trained on live 16kHz mono audio recordings, improving transcription accuracy over their base versions for continuous input scenarios.

Features

  • Real-time audio recording: Captures live 16kHz mono audio via microphone input.
  • Continuous fine-tuning: Updates model weights incrementally during live sessions.
  • Speech-to-text transcription: Converts audio to text with high accuracy.
  • Model saving/loading: Automatically saves fine-tuned models with timestamps.
  • Dual model support: Choose between Whisper and Wav2Vec2 architectures.

Note: Currently supports English-only transcription.

Installation

Clone the repository and install the dependencies:

git clone https://github.com/bniladridas/speech-model.git
cd speech-model
pip install -r requirements.txt

Optional: on Linux, install the system audio libraries (sounddevice requires PortAudio; libsndfile is used for reading and writing audio files):

sudo apt-get install libportaudio2 libsndfile1

Usage

Start Fine-Tuning

Fine-tune the model on live audio:

# For Whisper model
python main.py --model_type whisper

# For Wav2Vec2 model
python main.py --model_type wav2vec2

The script records audio in real time and updates the model continuously. Press Ctrl+C to stop training; the model is saved automatically.
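One incremental update in this loop can be sketched as follows. This is a hedged sketch, not the project's actual training code: `finetune_step` is a hypothetical helper, and the caller is assumed to supply a loaded Whisper model, its processor, an optimizer, a recorded 16 kHz clip, and a reference transcript:

```python
import torch

def finetune_step(model, processor, optimizer, audio, text):
    """One optimizer update on a single (16 kHz mono clip, transcript) pair."""
    inputs = processor(audio, sampling_rate=16_000, return_tensors="pt")
    labels = processor.tokenizer(text, return_tensors="pt").input_ids
    model.train()
    loss = model(input_features=inputs.input_features, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

# Optimizer setup matching the hyperparameters listed under "Model Details":
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
```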

Transcription

Test the fine-tuned model:

# For Whisper model
python test_transcription.py --model_type whisper

# For Wav2Vec2 model
python test_transcription.py --model_type wav2vec2

The script records 5 seconds of audio (configurable in code) and generates a transcription.
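For the Whisper variant, the transcription step amounts to feature extraction followed by autoregressive generation. A minimal sketch, assuming a loaded model/processor and a recorded 16 kHz mono float32 clip (`transcribe_whisper` is an illustrative helper, not the project's API):

```python
import torch

def transcribe_whisper(model, processor, audio):
    """Transcribe a single 16 kHz mono clip with a Whisper model."""
    inputs = processor(audio, sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        predicted_ids = model.generate(inputs.input_features)
    return processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
```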

Model Storage

Models are saved by default to:

models/speech_recognition_ai_fine_tune_[model_type]_[timestamp]

Example: models/speech_recognition_ai_fine_tune_whisper_20250225

To customize the save path:

export MODEL_SAVE_PATH="/your/custom/path"
python main.py --model_type [whisper|wav2vec2]
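The save-path convention above can be reproduced with a few lines of standard-library code. A sketch, assuming MODEL_SAVE_PATH overrides the default models/ directory and the timestamp is a YYYYMMDD date as in the example above:

```python
import os
from datetime import datetime

def save_dir(model_type):
    """Build the timestamped save directory for a fine-tuned model."""
    base = os.environ.get("MODEL_SAVE_PATH", "models")
    stamp = datetime.now().strftime("%Y%m%d")
    return os.path.join(base, f"speech_recognition_ai_fine_tune_{model_type}_{stamp}")

path = save_dir("whisper")
```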

Requirements

  • Python 3.8+
  • PyTorch (torch==2.0.1 recommended)
  • Transformers (transformers==4.35.0 recommended)
  • Sounddevice (sounddevice==0.4.6)
  • Torchaudio (torchaudio==2.0.1)

A GPU is recommended for faster fine-tuning. See requirements.txt for the full list.

Model Details

  • Task: Automatic Speech Recognition (ASR)
  • Base Models:
    • Whisper: openai/whisper-small
    • Wav2Vec2: facebook/wav2vec2-base-960h
  • Fine-tuning: Trained on live 16kHz mono audio recordings with a batch size of 8, using the Adam optimizer (learning rate 1e-5).
  • Input: 16kHz mono audio
  • Output: Text transcription
  • Language: English
  • Size: ~242M parameters (F32 safetensors, Whisper variant)

Loading the Model (Hugging Face)

To load the models from Hugging Face:

from transformers import WhisperForConditionalGeneration, WhisperProcessor
model = WhisperForConditionalGeneration.from_pretrained("bniladridas/speech-recognition-ai-fine-tune")
processor = WhisperProcessor.from_pretrained("bniladridas/speech-recognition-ai-fine-tune")
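The Wav2Vec2 variant is decoded differently: it is a CTC model, so transcription is a greedy argmax over per-frame logits rather than autoregressive generation. A sketch, assuming a locally saved Wav2Vec2 checkpoint loaded with `Wav2Vec2ForCTC`/`Wav2Vec2Processor` (the helper name is illustrative):

```python
import torch

def transcribe_wav2vec2(model, processor, audio):
    """Greedy CTC decode of a 16 kHz mono float32 clip with a Wav2Vec2 model."""
    inputs = processor(audio, sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits  # (batch, frames, vocab)
    predicted_ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(predicted_ids)[0]
```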

Repository Structure

speech-model/
├── dataset.py              # Audio recording and preprocessing
├── train.py                # Training pipeline
├── test_transcription.py   # Transcription testing
├── main.py                 # Main script for fine-tuning
├── README.md               # This file
└── requirements.txt        # Dependencies

Training Data

The models are fine-tuned on live audio recordings collected during runtime. No pre-existing dataset is required—users generate their own data via microphone input.

Evaluation Results

Quantitative results are not yet available; future updates will report Word Error Rate (WER) compared to the base models.

License

Licensed under the MIT License. See the LICENSE file for details.
