---
language: en
tags:
- audio
- speech
- emotion-recognition
- wav2vec2
datasets:
- TESS
- CREMA-D
- SAVEE
- RAVDESS
license: mit
metrics:
- accuracy
- f1
---

# wav2vec2-emotion-recognition

This model is a Wav2Vec2-based model fine-tuned for speech emotion recognition. It classifies speech into 8 emotions and returns a confidence score for each.

## Model Description

- **Model Architecture:** Wav2Vec2 with a sequence classification head (see the loading sketch below the list)
- **Language:** English
- **Task:** Speech Emotion Recognition
- **Fine-tuned from:** facebook/wav2vec2-base
- **Datasets:** Combined emotion datasets
  - [TESS](https://www.kaggle.com/datasets/ejlok1/toronto-emotional-speech-set-tess)
  - [CREMA-D](https://www.kaggle.com/datasets/ejlok1/cremad)
  - [SAVEE](https://www.kaggle.com/datasets/barelydedicated/savee-database)
  - [RAVDESS](https://www.kaggle.com/datasets/uwrfkaggler/ravdess-emotional-speech-audio)

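As a rough sketch of this setup (not the exact fine-tuning code), the base checkpoint can be loaded with an 8-way classification head as follows; the fine-tuned weights themselves are published as `Dpngtm/wav2vec2-emotion-recognition`:

```python
from transformers import Wav2Vec2ForSequenceClassification

# Sketch only: facebook/wav2vec2-base with an 8-way sequence classification head,
# matching the architecture described above. num_labels corresponds to the
# emotions listed under "Supported Emotions".
model = Wav2Vec2ForSequenceClassification.from_pretrained(
    "facebook/wav2vec2-base",
    num_labels=8,
)
```
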
## Performance Metrics

- **Accuracy:** 79.57%
- **F1 Score:** 79.43%

## Supported Emotions

- 😠 Angry
- 😌 Calm
- 🤢 Disgust
- 😨 Fearful
- 😊 Happy
- 😐 Neutral
- 😢 Sad
- 😲 Surprised

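These classes correspond to the model's eight output indices. A minimal sketch of the index-to-label mapping, assuming the same alphabetical label order used in the Usage example at the end of this card:

```python
# Index-to-emotion mapping assumed by the Usage example below.
emotion_labels = ["angry", "calm", "disgust", "fearful", "happy", "neutral", "sad", "surprised"]
id2label = {i: label for i, label in enumerate(emotion_labels)}
```
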
## Training Details

The model was trained with the following configuration (sketched in code after the list):

- **Epochs:** 15
- **Batch Size:** 16
- **Learning Rate:** 5e-5
- **Optimizer:** AdamW
- **Weight Decay:** 0.03
- **Gradient Accumulation Steps:** 2
- **Mixed Precision:** fp16

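The exact training loop lives in the notebook linked below; as a minimal sketch, the configuration above maps onto the standard `transformers` `TrainingArguments` roughly like this (`output_dir` is a placeholder, and AdamW is the `Trainer` default optimizer):

```python
from transformers import TrainingArguments

# Hypothetical reconstruction of the hyperparameters listed above; any argument
# not mentioned in the list is a placeholder or left at its default.
training_args = TrainingArguments(
    output_dir="wav2vec2-emotion-recognition",  # placeholder
    num_train_epochs=15,
    per_device_train_batch_size=16,
    learning_rate=5e-5,
    weight_decay=0.03,
    gradient_accumulation_steps=2,
    fp16=True,
)
```
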
For the detailed training process, check out the [Fine-tuning Notebook](https://colab.research.google.com/drive/1VNhIjY7gW29d0uKGNDGN0eOp-pxr_pFL?usp=drive_link).

## Limitations

### Audio Requirements

- Sampling rate: 16 kHz (audio will be automatically resampled)
- Maximum duration: 1 minute (see the preprocessing sketch below)
- Clear speech with minimal background noise recommended

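A minimal sketch of how an input clip can be brought in line with these requirements before inference (the file path is a placeholder; the resampling step mirrors the Usage example below):

```python
import torch
import torchaudio

# Resample to 16 kHz, mix down to mono, and truncate to at most 1 minute.
waveform, sample_rate = torchaudio.load("path_to_audio.wav")  # placeholder path
if sample_rate != 16000:
    waveform = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)(waveform)
if waveform.shape[0] > 1:
    waveform = torch.mean(waveform, dim=0, keepdim=True)
max_samples = 16000 * 60  # 1-minute cap at 16 kHz
waveform = waveform[:, :max_samples]
```
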
### Performance Considerations

- Best results with clear speech audio
- Performance may vary with different accents
- Background noise can affect accuracy

## Demo

https://huggingface.co/spaces/Dpngtm/Audio-Emotion-Recognition

## Contact

* **GitHub**: [DGautam11](https://github.com/DGautam11)
* **LinkedIn**: [Deepan Gautam](https://www.linkedin.com/in/deepan-gautam)
* **Hugging Face**: [@Dpngtm](https://huggingface.co/Dpngtm)

For issues and questions, feel free to:

1. Open an issue on the [Model Repository](https://huggingface.co/Dpngtm/wav2vec2-emotion-recognition)
2. Comment on the [Demo Space](https://huggingface.co/spaces/Dpngtm/Audio-Emotion-Recognition)

## Usage

```python
from transformers import Wav2Vec2ForSequenceClassification, Wav2Vec2Processor
import torch
import torchaudio

# Load model and processor
model = Wav2Vec2ForSequenceClassification.from_pretrained("Dpngtm/wav2vec2-emotion-recognition")
processor = Wav2Vec2Processor.from_pretrained("Dpngtm/wav2vec2-emotion-recognition")

# Load and preprocess audio
speech_array, sampling_rate = torchaudio.load("path_to_audio.wav")
if sampling_rate != 16000:
    resampler = torchaudio.transforms.Resample(orig_freq=sampling_rate, new_freq=16000)
    speech_array = resampler(speech_array)

# Convert to mono if stereo
if speech_array.shape[0] > 1:
    speech_array = torch.mean(speech_array, dim=0, keepdim=True)

speech_array = speech_array.squeeze().numpy()

# Process through model
inputs = processor(speech_array, sampling_rate=16000, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)

# Get predicted emotion
emotion_labels = ["angry", "calm", "disgust", "fearful", "happy", "neutral", "sad", "surprised"]
predicted_emotion = emotion_labels[predictions.argmax().item()]
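
# Optional: report the confidence score for each emotion alongside the prediction
confidence_scores = {
    label: round(float(score), 4)
    for label, score in zip(emotion_labels, predictions.squeeze().tolist())
}
print(f"Predicted emotion: {predicted_emotion}")
print(f"Confidence scores: {confidence_scores}")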