---
language: en
tags:
- audio
- speech
- emotion-recognition
- wav2vec2
datasets:
- TESS
- CREMA-D
- SAVEE
- RAVDESS
license: mit
metrics:
- accuracy
- f1
---
# wav2vec2-emotion-recognition
This model is a fine-tuned version of [facebook/wav2vec2-base](https://huggingface.co/facebook/wav2vec2-base) for speech emotion recognition. It classifies speech into 8 emotions and returns a confidence score for each.
## Model Description
- **Model Architecture:** Wav2Vec2 with sequence classification head
- **Language:** English
- **Task:** Speech Emotion Recognition
- **Fine-tuned from:** facebook/wav2vec2-base
- **Datasets:** Combined emotion datasets
- [TESS](https://www.kaggle.com/datasets/ejlok1/toronto-emotional-speech-set-tess)
- [CREMA-D](https://www.kaggle.com/datasets/ejlok1/cremad)
- [SAVEE](https://www.kaggle.com/datasets/barelydedicated/savee-database)
- [RAVDESS](https://www.kaggle.com/datasets/uwrfkaggler/ravdess-emotional-speech-audio)
## Performance Metrics
- **Accuracy:** 79.57%
- **F1 Score:** 79.43%
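The card does not include the evaluation code, but a `compute_metrics` function in the style the Hugging Face `Trainer` expects would look roughly like the sketch below. The F1 averaging mode (`weighted` here) is an assumption, since it is not stated above.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Sketch only: the F1 averaging mode is assumed, not taken from the card.
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1": f1_score(labels, preds, average="weighted"),
    }
```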
## Supported Emotions
- 😠 Angry
- 😌 Calm
- 🤢 Disgust
- 😨 Fearful
- 😊 Happy
- 😐 Neutral
- 😢 Sad
- 😲 Surprised
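The usage example at the end of this card assumes the alphabetical label order above (index 0 = `angry`, ..., index 7 = `surprised`). If the checkpoint's config stores an `id2label` mapping, you can verify the ordering directly; this is a hedged check, since the card does not confirm the config is populated:

```python
from transformers import Wav2Vec2ForSequenceClassification

model = Wav2Vec2ForSequenceClassification.from_pretrained(
    "Dpngtm/wav2vec2-emotion-recognition"
)
# Prints e.g. {0: 'angry', 1: 'calm', ...} if id2label was set at training time;
# otherwise it falls back to generic LABEL_0 ... LABEL_7 names.
print(model.config.id2label)
```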
## Training Details
The model was trained with the following configuration:
- **Epochs:** 15
- **Batch Size:** 16
- **Learning Rate:** 5e-5
- **Optimizer:** AdamW
- **Weight Decay:** 0.03
- **Gradient Accumulation Steps:** 2
- **Mixed Precision:** fp16
For a detailed walkthrough of the training process, see the [Fine-tuning Notebook](https://colab.research.google.com/drive/1VNhIjY7gW29d0uKGNDGN0eOp-pxr_pFL?usp=drive_link).
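As a rough illustration (not the notebook's exact code), the hyperparameters above map onto `transformers.TrainingArguments` along these lines; `output_dir` is a placeholder:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="wav2vec2-emotion-recognition",  # placeholder path
    num_train_epochs=15,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,   # effective batch size of 32
    learning_rate=5e-5,
    weight_decay=0.03,
    fp16=True,                       # mixed precision
    # AdamW is the Trainer's default optimizer, so no extra flag is needed.
)
```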
## Limitations
### Audio Requirements
- Sampling rate: 16 kHz (audio at other rates is resampled automatically)
- Maximum duration: 1 minute
- Clear speech with minimal background noise recommended
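A small helper that enforces these requirements before inference might look like the sketch below; `load_clip` is a hypothetical name, not part of the model's API.

```python
import torchaudio

TARGET_SR = 16000   # required sampling rate
MAX_SECONDS = 60    # maximum supported duration

def load_clip(path: str):
    """Load audio, resample to 16 kHz, and truncate to 60 seconds."""
    waveform, sr = torchaudio.load(path)
    if sr != TARGET_SR:
        resampler = torchaudio.transforms.Resample(orig_freq=sr, new_freq=TARGET_SR)
        waveform = resampler(waveform)
    return waveform[..., : MAX_SECONDS * TARGET_SR]
```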
### Performance Considerations
- Best results with clear speech audio
- Performance may vary with different accents
- Background noise can affect accuracy
## Demo
Try the model interactively in the [Audio-Emotion-Recognition Space](https://huggingface.co/spaces/Dpngtm/Audio-Emotion-Recognition).
## Contact
* **GitHub**: [DGautam11](https://github.com/DGautam11)
* **LinkedIn**: [Deepan Gautam](https://www.linkedin.com/in/deepan-gautam)
* **Hugging Face**: [@Dpngtm](https://huggingface.co/Dpngtm)
For issues and questions, feel free to:
1. Open an issue on the [Model Repository](https://huggingface.co/Dpngtm/wav2vec2-emotion-recognition)
2. Comment on the [Demo Space](https://huggingface.co/spaces/Dpngtm/Audio-Emotion-Recognition)
## Usage
```python
from transformers import Wav2Vec2ForSequenceClassification, Wav2Vec2Processor
import torch
import torchaudio
# Load model and processor
model = Wav2Vec2ForSequenceClassification.from_pretrained("Dpngtm/wav2vec2-emotion-recognition")
processor = Wav2Vec2Processor.from_pretrained("Dpngtm/wav2vec2-emotion-recognition")
# Load and preprocess audio
speech_array, sampling_rate = torchaudio.load("path_to_audio.wav")
if sampling_rate != 16000:
    resampler = torchaudio.transforms.Resample(orig_freq=sampling_rate, new_freq=16000)
    speech_array = resampler(speech_array)

# Convert to mono if stereo
if speech_array.shape[0] > 1:
    speech_array = torch.mean(speech_array, dim=0, keepdim=True)
speech_array = speech_array.squeeze().numpy()

# Process through model
inputs = processor(speech_array, sampling_rate=16000, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)

# Get predicted emotion (labels ordered by class index, matching the list above)
emotion_labels = ["angry", "calm", "disgust", "fearful", "happy", "neutral", "sad", "surprised"]
predicted_emotion = emotion_labels[predictions.argmax().item()]
```
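To inspect the full probability distribution rather than just the top label, you can extend the example above (a usage sketch continuing from the same variables):

```python
# Print a confidence score for each of the 8 emotions.
for label, score in zip(emotion_labels, predictions.squeeze().tolist()):
    print(f"{label}: {score:.3f}")
print("Predicted emotion:", predicted_emotion)
```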