# Speech Recognition AI: Fine-Tuned Whisper and Wav2Vec2 for Real-Time Audio
This project fine-tunes OpenAI's Whisper (`whisper-small`) and Facebook's Wav2Vec2 (`wav2vec2-base-960h`) models for real-time speech recognition using live audio recordings. It's designed for dynamic environments where low-latency transcription is key, such as live conversations or streaming audio.
## Model Description
This is a fine-tuned version of OpenAI's Whisper small model and Facebook's Wav2Vec2 base model, optimized for real-time speech-to-text transcription. The models were trained on live 16kHz mono audio recordings, improving transcription accuracy over their base versions for continuous input scenarios.
## Features

- Real-time audio recording: Captures live 16kHz mono audio via microphone input (see the sketch after this list).
- Continuous fine-tuning: Updates model weights incrementally during live sessions.
- Speech-to-text transcription: Converts audio to text with high accuracy.
- Model saving/loading: Automatically saves fine-tuned models with timestamps.
- Dual model support: Choose between Whisper and Wav2Vec2 architectures.
Note: Currently supports English-only transcription.
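
For reference, capturing live 16kHz mono audio with sounddevice looks roughly like this (a sketch of the recording step, not the exact code in `dataset.py`):

```python
import sounddevice as sd

SAMPLE_RATE = 16000  # 16kHz mono, the input format both models expect

def record(seconds: float):
    """Record from the default microphone; returns a 1-D float32 NumPy array."""
    audio = sd.rec(int(seconds * SAMPLE_RATE), samplerate=SAMPLE_RATE,
                   channels=1, dtype="float32")
    sd.wait()  # block until the recording finishes
    return audio.squeeze()
```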
## Installation

Clone the repository and install the dependencies:

```bash
git clone https://github.com/bniladridas/speech-model.git
cd speech-model
pip install -r requirements.txt
```
Optional: install system audio libraries on Linux (sounddevice uses PortAudio; libsndfile handles audio file I/O):

```bash
sudo apt-get install libportaudio2 libsndfile1
```
## Usage

### Start Fine-Tuning

Fine-tune the model on live audio:

```bash
# For Whisper model
python main.py --model_type whisper

# For Wav2Vec2 model
python main.py --model_type wav2vec2
```
Records audio in real-time and updates the model continuously. Press Ctrl+C to stop training and save the model automatically.
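
Conceptually, the stop-and-save behavior works like the sketch below (the helper names `next_batch` and `compute_loss` are illustrative, not the actual functions in `main.py`):

```python
import torch

def fine_tune_live(model, next_batch, compute_loss, save_path):
    """Illustrative loop: update weights on live audio until Ctrl+C, then save."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)  # settings from Model Details
    try:
        while True:
            batch = next_batch()               # hypothetical: record and preprocess live audio
            loss = compute_loss(model, batch)  # hypothetical: forward pass + loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    except KeyboardInterrupt:
        model.save_pretrained(save_path)  # timestamped path; see Model Storage below
```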
### Transcription

Test the fine-tuned model:

```bash
# For Whisper model
python test_transcription.py --model_type whisper

# For Wav2Vec2 model
python test_transcription.py --model_type wav2vec2
```
Records 5 seconds of audio (configurable in code) and generates a transcription.
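
The capture-then-decode flow can be sketched as follows (the checkpoint path is a placeholder for wherever your fine-tuned model was saved; this is not the exact contents of `test_transcription.py`):

```python
import sounddevice as sd
from transformers import WhisperForConditionalGeneration, WhisperProcessor

path = "models/speech_recognition_ai_fine_tune_whisper_20250225"  # example save path
model = WhisperForConditionalGeneration.from_pretrained(path)
processor = WhisperProcessor.from_pretrained(path)

# Record 5 seconds of 16kHz mono audio from the default microphone.
audio = sd.rec(int(5 * 16000), samplerate=16000, channels=1, dtype="float32")
sd.wait()

# Convert raw audio to log-mel features, then decode to text.
inputs = processor(audio.squeeze(), sampling_rate=16000, return_tensors="pt")
ids = model.generate(inputs.input_features)
print(processor.batch_decode(ids, skip_special_tokens=True)[0])
```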
## Model Storage

Models are saved by default to:

```
models/speech_recognition_ai_fine_tune_[model_type]_[timestamp]
```

Example: `models/speech_recognition_ai_fine_tune_whisper_20250225`

To customize the save path:

```bash
export MODEL_SAVE_PATH="/your/custom/path"
python main.py --model_type [whisper|wav2vec2]
```
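
The path assembly amounts to something like this (variable names are illustrative; the timestamp format matches the example above):

```python
import os
import time

model_type = "whisper"  # or "wav2vec2"
base = os.environ.get("MODEL_SAVE_PATH", "models")
timestamp = time.strftime("%Y%m%d")
save_path = os.path.join(base, f"speech_recognition_ai_fine_tune_{model_type}_{timestamp}")
# e.g. models/speech_recognition_ai_fine_tune_whisper_20250225
```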
## Requirements

- Python 3.8+
- PyTorch (`torch==2.0.1` recommended)
- Transformers (`transformers==4.35.0` recommended)
- Sounddevice (`sounddevice==0.4.6`)
- Torchaudio (`torchaudio==2.0.1`)

A GPU is recommended for faster fine-tuning. See `requirements.txt` for the full list.
## Model Details

- Task: Automatic Speech Recognition (ASR)
- Base Models:
  - Whisper: `openai/whisper-small`
  - Wav2Vec2: `facebook/wav2vec2-base-960h`
- Fine-tuning: Trained on live 16kHz mono audio recordings with a batch size of 8, using the Adam optimizer (learning rate 1e-5).
- Input: 16kHz mono audio (see the conversion sketch below)
- Output: Text transcription
- Language: English
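
If your source audio is not already 16kHz mono, a typical torchaudio conversion looks like this (a sketch; the file name is illustrative):

```python
import torchaudio

waveform, sr = torchaudio.load("recording.wav")  # shape: (channels, samples)
waveform = waveform.mean(dim=0, keepdim=True)    # downmix to mono
if sr != 16000:
    waveform = torchaudio.transforms.Resample(orig_freq=sr, new_freq=16000)(waveform)
```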
## Loading the Model (Hugging Face)

To load the models from Hugging Face:

```python
from transformers import WhisperForConditionalGeneration, WhisperProcessor

model = WhisperForConditionalGeneration.from_pretrained("bniladridas/speech-recognition-ai-fine-tune")
processor = WhisperProcessor.from_pretrained("bniladridas/speech-recognition-ai-fine-tune")
```
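
If you fine-tune the Wav2Vec2 variant locally, it loads analogously with the CTC classes (the path below follows the save-path pattern from Model Storage and is illustrative):

```python
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

path = "models/speech_recognition_ai_fine_tune_wav2vec2_20250225"  # example local path
model = Wav2Vec2ForCTC.from_pretrained(path)
processor = Wav2Vec2Processor.from_pretrained(path)
```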
## Repository Structure

```
speech-model/
├── dataset.py             # Audio recording and preprocessing
├── train.py               # Training pipeline
├── test_transcription.py  # Transcription testing
├── main.py                # Main script for fine-tuning
├── README.md              # This file
└── requirements.txt       # Dependencies
```
## Training Data

The models are fine-tuned on live audio recordings collected during runtime. No pre-existing dataset is required; users generate their own data via microphone input.
## Evaluation Results

Evaluation metrics are not yet available. Future updates will report Word Error Rate (WER) compared to the base models.
## License
Licensed under the MIT License. See the LICENSE file for details.