---
language: en
tags:
- audio
- speech
- emotion-recognition
- wav2vec2
datasets:
- TESS
- CREMA-D
- SAVEE
- RAVDESS
license: mit
metrics:
- accuracy
- f1
---

# wav2vec2-emotion-recognition

This model fine-tunes the Wav2Vec2 architecture for speech emotion recognition. It classifies English speech into 8 emotions, each with a corresponding confidence score.

## Model Description

- **Model Architecture:** Wav2Vec2 with sequence classification head
- **Language:** English
- **Task:** Speech Emotion Recognition
- **Fine-tuned from:** facebook/wav2vec2-base
- **Datasets:** Combined emotion datasets
  - [TESS](https://www.kaggle.com/datasets/ejlok1/toronto-emotional-speech-set-tess)
  - [CREMA-D](https://www.kaggle.com/datasets/ejlok1/cremad)
  - [SAVEE](https://www.kaggle.com/datasets/barelydedicated/savee-database)
  - [RAVDESS](https://www.kaggle.com/datasets/uwrfkaggler/ravdess-emotional-speech-audio)

## Performance Metrics
- **Accuracy:** 79.57%
- **F1 Score:** 79.43%

## Supported Emotions
- 😠 Angry
- 😌 Calm
- 🀒 Disgust
- 😨 Fearful
- 😊 Happy
- 😐 Neutral
- 😒 Sad
- 😲 Surprised

## Training Details

The model was trained with the following configuration (an equivalent `TrainingArguments` sketch follows the list):
- **Epochs:** 15
- **Batch Size:** 16
- **Learning Rate:** 5e-5
- **Optimizer:** AdamW
- **Weight Decay:** 0.03
- **Gradient Accumulation Steps:** 2
- **Mixed Precision:** fp16
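
For reference, here is a minimal sketch of an equivalent `TrainingArguments` setup mirroring the values above. The `output_dir` is an illustrative assumption, not taken from this card; see the notebook linked below for the actual training code.

```python
from transformers import TrainingArguments

# Sketch only: reproduces the hyperparameters listed above.
# AdamW is the default optimizer in transformers; fp16=True requires a GPU.
training_args = TrainingArguments(
    output_dir="wav2vec2-emotion-recognition",  # assumed, for illustration
    num_train_epochs=15,
    per_device_train_batch_size=16,
    learning_rate=5e-5,
    weight_decay=0.03,
    gradient_accumulation_steps=2,
    fp16=True,  # mixed precision
)
```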

For details on the training process, see the [Fine-tuning Notebook](https://colab.research.google.com/drive/1VNhIjY7gW29d0uKGNDGN0eOp-pxr_pFL?usp=drive_link).

## Limitations

### Audio Requirements

- Sampling rate: 16 kHz (other rates are resampled automatically)
- Maximum duration: 1 minute (see the truncation sketch below)
- Clear speech with minimal background noise is recommended
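
If your clips may exceed the one-minute cap, a simple pre-check can truncate them before inference. This is a sketch, not part of the model's official code, and `"path_to_audio.wav"` is a placeholder path:

```python
import torchaudio

MAX_SECONDS = 60  # the model card's recommended maximum duration

speech_array, sr = torchaudio.load("path_to_audio.wav")
max_samples = sr * MAX_SECONDS
if speech_array.shape[-1] > max_samples:
    speech_array = speech_array[..., :max_samples]  # keep only the first minute
```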

### Performance Considerations

- Best results with clear speech audio
- Performance may vary with different accents
- Background noise can affect accuracy


## Demo
Try the model live in the [Demo Space](https://huggingface.co/spaces/Dpngtm/Audio-Emotion-Recognition).

## Contact
* **GitHub**: [DGautam11](https://github.com/DGautam11)
* **LinkedIn**: [Deepan Gautam](https://www.linkedin.com/in/deepan-gautam)  
* **Hugging Face**: [@Dpngtm](https://huggingface.co/Dpngtm)

For issues and questions, feel free to:
1. Open an issue on the [Model Repository](https://huggingface.co/Dpngtm/wav2vec2-emotion-recognition)
2. Comment on the [Demo Space](https://huggingface.co/spaces/Dpngtm/Audio-Emotion-Recognition)

## Usage

```python
from transformers import Wav2Vec2ForSequenceClassification, Wav2Vec2Processor
import torch
import torchaudio

# Load model and processor
model = Wav2Vec2ForSequenceClassification.from_pretrained("Dpngtm/wav2vec2-emotion-recognition")
processor = Wav2Vec2Processor.from_pretrained("Dpngtm/wav2vec2-emotion-recognition")

# Load audio and resample to the 16 kHz the model expects
speech_array, sampling_rate = torchaudio.load("path_to_audio.wav")
if sampling_rate != 16000:
    resampler = torchaudio.transforms.Resample(orig_freq=sampling_rate, new_freq=16000)
    speech_array = resampler(speech_array)

# Convert to mono if stereo
if speech_array.shape[0] > 1:
    speech_array = torch.mean(speech_array, dim=0, keepdim=True)

speech_array = speech_array.squeeze().numpy()

# Run the model and convert logits to probabilities
inputs = processor(speech_array, sampling_rate=16000, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)

# Map the top prediction back to an emotion label
emotion_labels = ["angry", "calm", "disgust", "fearful", "happy", "neutral", "sad", "surprised"]
predicted_emotion = emotion_labels[predictions.argmax().item()]
confidence = predictions.max().item()
print(f"{predicted_emotion} ({confidence:.2%})")
```
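
If the repository ships a complete preprocessing config, the high-level `pipeline` API should also work. This is an untested assumption, and the label names it returns depend on the `id2label` mapping stored in the model config:

```python
from transformers import pipeline

# Assumes the repo's config is complete enough for the audio-classification pipeline.
classifier = pipeline("audio-classification", model="Dpngtm/wav2vec2-emotion-recognition")
print(classifier("path_to_audio.wav"))  # list of {"label": ..., "score": ...} dicts
```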