---
language: en
tags:
- audio
- speech
- emotion-recognition
- wav2vec2
datasets:
- TESS
- CREMA-D
- SAVEE
- RAVDESS
license: mit
metrics:
- accuracy
- f1
---

# wav2vec2-emotion-recognition

This model fine-tunes the Wav2Vec2 architecture for speech emotion recognition. It classifies English speech into 8 emotions, each with a corresponding confidence score.

## Model Description

- **Model Architecture:** Wav2Vec2 with sequence classification head
- **Language:** English
- **Task:** Speech Emotion Recognition
- **Fine-tuned from:** facebook/wav2vec2-base
- **Datasets:** Combined emotion datasets
  - [TESS](https://www.kaggle.com/datasets/ejlok1/toronto-emotional-speech-set-tess)
  - [CREMA-D](https://www.kaggle.com/datasets/ejlok1/cremad)
  - [SAVEE](https://www.kaggle.com/datasets/barelydedicated/savee-database)
  - [RAVDESS](https://www.kaggle.com/datasets/uwrfkaggler/ravdess-emotional-speech-audio)

## Performance Metrics
- **Accuracy:** 79.57%
- **F1 Score:** 79.43%

## Supported Emotions
- 😠 Angry
- 😌 Calm
- 🀒 Disgust
- 😨 Fearful
- 😊 Happy
- 😐 Neutral
- 😒 Sad
- 😲 Surprised

## Training Details

The model was trained with the following configuration (an equivalent `TrainingArguments` sketch follows the list):
- **Epochs:** 15
- **Batch Size:** 16
- **Learning Rate:** 5e-5
- **Optimizer:** AdamW
- **Weight Decay:** 0.03
- **Gradient Accumulation Steps:** 2
- **Mixed Precision:** fp16
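
For reference, here is a minimal sketch of an equivalent `TrainingArguments` setup mirroring the values above. The `output_dir` is an illustrative assumption, not taken from this card; see the notebook linked below for the actual training code.

```python
from transformers import TrainingArguments

# Sketch only: reproduces the hyperparameters listed above.
# AdamW is the default optimizer in transformers; fp16=True requires a GPU.
training_args = TrainingArguments(
    output_dir="wav2vec2-emotion-recognition",  # assumed, for illustration
    num_train_epochs=15,
    per_device_train_batch_size=16,
    learning_rate=5e-5,
    weight_decay=0.03,
    gradient_accumulation_steps=2,
    fp16=True,  # mixed precision
)
```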

For details on the training process, see the [Fine-tuning Notebook](https://colab.research.google.com/drive/1VNhIjY7gW29d0uKGNDGN0eOp-pxr_pFL?usp=drive_link).

## Limitations

### Audio Requirements

- Sampling rate: 16 kHz (other rates are resampled automatically)
- Maximum duration: 1 minute (see the truncation sketch below)
- Clear speech with minimal background noise is recommended
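
If your clips may exceed the one-minute cap, a simple pre-check can truncate them before inference. This is a sketch, not part of the model's official code, and `"path_to_audio.wav"` is a placeholder path:

```python
import torchaudio

MAX_SECONDS = 60  # the model card's recommended maximum duration

speech_array, sr = torchaudio.load("path_to_audio.wav")
max_samples = sr * MAX_SECONDS
if speech_array.shape[-1] > max_samples:
    speech_array = speech_array[..., :max_samples]  # keep only the first minute
```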

### Performance Considerations

- Best results with clear speech audio
- Performance may vary with different accents
- Background noise can affect accuracy


## Demo
Try the model live in the [Demo Space](https://huggingface.co/spaces/Dpngtm/Audio-Emotion-Recognition).

## Contact
* **GitHub**: [DGautam11](https://github.com/DGautam11)
* **LinkedIn**: [Deepan Gautam](https://www.linkedin.com/in/deepan-gautam)  
* **Hugging Face**: [@Dpngtm](https://huggingface.co/Dpngtm)

For issues and questions, feel free to:
1. Open an issue on the [Model Repository](https://huggingface.co/Dpngtm/wav2vec2-emotion-recognition)
2. Comment on the [Demo Space](https://huggingface.co/spaces/Dpngtm/Audio-Emotion-Recognition)

## Usage

```python
from transformers import Wav2Vec2ForSequenceClassification, Wav2Vec2Processor
import torch
import torchaudio

# Load model and processor
model = Wav2Vec2ForSequenceClassification.from_pretrained("Dpngtm/wav2vec2-emotion-recognition")
processor = Wav2Vec2Processor.from_pretrained("Dpngtm/wav2vec2-emotion-recognition")

# Load audio and resample to the 16 kHz the model expects
speech_array, sampling_rate = torchaudio.load("path_to_audio.wav")
if sampling_rate != 16000:
    resampler = torchaudio.transforms.Resample(orig_freq=sampling_rate, new_freq=16000)
    speech_array = resampler(speech_array)

# Convert to mono if stereo
if speech_array.shape[0] > 1:
    speech_array = torch.mean(speech_array, dim=0, keepdim=True)

speech_array = speech_array.squeeze().numpy()

# Run the model and convert logits to probabilities
inputs = processor(speech_array, sampling_rate=16000, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)

# Map the top prediction back to an emotion label
emotion_labels = ["angry", "calm", "disgust", "fearful", "happy", "neutral", "sad", "surprised"]
predicted_emotion = emotion_labels[predictions.argmax().item()]
confidence = predictions.max().item()
print(f"{predicted_emotion} ({confidence:.2%})")
```
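
If the repository ships a complete preprocessing config, the high-level `pipeline` API should also work. This is an untested assumption, and the label names it returns depend on the `id2label` mapping stored in the model config:

```python
from transformers import pipeline

# Assumes the repo's config is complete enough for the audio-classification pipeline.
classifier = pipeline("audio-classification", model="Dpngtm/wav2vec2-emotion-recognition")
print(classifier("path_to_audio.wav"))  # list of {"label": ..., "score": ...} dicts
```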