yuvraj108c committed
Commit aadbd46 · verified · 1 Parent(s): 1104015

Upload folder using huggingface_hub

wav2vec-english-speech-emotion-recognition/.gitattributes ADDED
@@ -0,0 +1,32 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
wav2vec-english-speech-emotion-recognition/README.md ADDED
@@ -0,0 +1,86 @@
+ ---
+ license: apache-2.0
+ tags:
+ - generated_from_trainer
+ metrics:
+ - accuracy
+ model_index:
+   name: wav2vec-english-speech-emotion-recognition
+ ---
+ # Speech Emotion Recognition By Fine-Tuning Wav2Vec 2.0
+ The model is a fine-tuned version of [jonatasgrosman/wav2vec2-large-xlsr-53-english](https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-english) for a Speech Emotion Recognition (SER) task.
+ 
+ Several datasets were used to fine-tune the original model:
+ - Surrey Audio-Visual Expressed Emotion [(SAVEE)](http://kahlan.eps.surrey.ac.uk/savee/Database.html) - 480 audio files from 4 male actors
+ - Ryerson Audio-Visual Database of Emotional Speech and Song [(RAVDESS)](https://zenodo.org/record/1188976) - 1440 audio files from 24 professional actors (12 female, 12 male)
+ - Toronto emotional speech set [(TESS)](https://tspace.library.utoronto.ca/handle/1807/24487) - 2800 audio files from 2 female actors
+ 
+ Seven labels/emotions were used as classification labels:
+ ```python
+ emotions = ['angry', 'disgust', 'fear', 'happy', 'neutral', 'sad', 'surprise']
+ ```
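+ These string labels correspond to the integer class ids 0-6 stored in the model's `config.json` (shown further down in this commit as `id2label`/`label2id`). As a minimal sketch, not part of the original card, the mapping can be inspected directly from the hub config:
+ ```python
+ from transformers import AutoConfig
+ 
+ # Load only the configuration and print the class-id -> emotion mapping.
+ config = AutoConfig.from_pretrained("r-f/wav2vec-english-speech-emotion-recognition")
+ print(config.id2label)  # {0: 'angry', 1: 'disgust', 2: 'fear', 3: 'happy', 4: 'neutral', 5: 'sad', 6: 'surprise'}
+ ```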
+ It achieves the following results on the evaluation set:
+ - Loss: 0.104075
+ - Accuracy: 0.97463
+ 
+ ## Model Usage
+ ```bash
+ pip install transformers librosa torch
+ ```
+ ```python
+ from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ForCTC
+ import librosa
+ import torch
+ 
+ feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("r-f/wav2vec-english-speech-emotion-recognition")
+ model = Wav2Vec2ForCTC.from_pretrained("r-f/wav2vec-english-speech-emotion-recognition")
+ 
+ def predict_emotion(audio_path):
+     audio, rate = librosa.load(audio_path, sr=16000)
+     inputs = feature_extractor(audio, sampling_rate=rate, return_tensors="pt", padding=True)
+ 
+     with torch.no_grad():
+         outputs = model(inputs.input_values)
+         predictions = torch.nn.functional.softmax(outputs.logits.mean(dim=1), dim=-1)  # average over the sequence length
+         predicted_label = torch.argmax(predictions, dim=-1)
+         emotion = model.config.id2label[predicted_label.item()]
+     return emotion
+ 
+ emotion = predict_emotion("example_audio.wav")
+ print(f"Predicted emotion: {emotion}")
+ # >> Predicted emotion: angry
+ ```
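+ 
+ The helper above returns only the top-scoring label. As a small variation that is not part of the original card, the same forward pass can report the probability for every emotion; `predict_emotion_scores` below is a hypothetical name and reuses the `feature_extractor`, `model`, `librosa`, and `torch` objects defined in the snippet above.
+ ```python
+ def predict_emotion_scores(audio_path):
+     # Same preprocessing as predict_emotion: load and resample to 16 kHz.
+     audio, rate = librosa.load(audio_path, sr=16000)
+     inputs = feature_extractor(audio, sampling_rate=rate, return_tensors="pt", padding=True)
+     with torch.no_grad():
+         logits = model(inputs.input_values).logits.mean(dim=1)  # pool over the sequence length
+     probs = torch.nn.functional.softmax(logits, dim=-1)[0]
+     # Map each class index to its emotion name via the model config.
+     return {model.config.id2label[i]: round(p.item(), 4) for i, p in enumerate(probs)}
+ ```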
+ 
+ 
+ ## Training procedure
+ ### Training hyperparameters
+ The following hyperparameters were used during training (an illustrative `TrainingArguments` sketch follows the list):
+ - learning_rate: 0.0001
+ - train_batch_size: 4
+ - eval_batch_size: 4
+ - eval_steps: 500
+ - seed: 42
+ - gradient_accumulation_steps: 2
+ - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
+ - num_epochs: 4
+ - max_steps: 7500
+ - save_steps: 1500
+ 
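+ The original training script is not included in this upload; purely as an illustration of how the values above could map onto `transformers.TrainingArguments`, here is a sketch. The `output_dir` and `evaluation_strategy="steps"` choices are assumptions, not taken from the card.
+ ```python
+ from transformers import TrainingArguments
+ 
+ training_args = TrainingArguments(
+     output_dir="wav2vec-english-speech-emotion-recognition",  # assumed
+     learning_rate=1e-4,
+     per_device_train_batch_size=4,
+     per_device_eval_batch_size=4,
+     gradient_accumulation_steps=2,
+     evaluation_strategy="steps",  # assumed, so that eval_steps/save_steps apply
+     eval_steps=500,
+     save_steps=1500,
+     num_train_epochs=4,
+     max_steps=7500,
+     seed=42,
+     adam_beta1=0.9,
+     adam_beta2=0.999,
+     adam_epsilon=1e-8,
+ )
+ ```
+ 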
+ ### Training results
+ | Step | Training Loss | Validation Loss | Accuracy |
+ | ---- | ------------- | --------------- | -------- |
+ | 500  | 1.8124 | 1.365212 | 0.486258 |
+ | 1000 | 0.8872 | 0.773145 | 0.79704  |
+ | 1500 | 0.7035 | 0.574954 | 0.852008 |
+ | 2000 | 0.6879 | 1.286738 | 0.775899 |
+ | 2500 | 0.6498 | 0.697455 | 0.832981 |
+ | 3000 | 0.5696 | 0.33724  | 0.892178 |
+ | 3500 | 0.4218 | 0.307072 | 0.911205 |
+ | 4000 | 0.3088 | 0.374443 | 0.930233 |
+ | 4500 | 0.2688 | 0.260444 | 0.936575 |
+ | 5000 | 0.2973 | 0.302985 | 0.92389  |
+ | 5500 | 0.1765 | 0.165439 | 0.961945 |
+ | 6000 | 0.1475 | 0.170199 | 0.961945 |
+ | 6500 | 0.1274 | 0.15531  | 0.966173 |
+ | 7000 | 0.0699 | 0.103882 | 0.976744 |
+ | 7500 | 0.083  | 0.104075 | 0.97463  |
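+ 
+ For context only (not from the original card): the Accuracy column above is the kind of value a `compute_metrics` callback passed to `transformers.Trainer` would report. A hedged sketch, assuming the eval logits are pooled over the sequence length exactly as in the usage example:
+ ```python
+ import numpy as np
+ 
+ def compute_metrics(eval_pred):
+     logits, labels = eval_pred
+     predictions = np.argmax(logits.mean(axis=1), axis=-1)  # pool over sequence length, then take the top class
+     return {"accuracy": float((predictions == labels).mean())}
+ ```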
wav2vec-english-speech-emotion-recognition/config.json ADDED
@@ -0,0 +1,137 @@
+ {
+   "_name_or_path": "jonatasgrosman/wav2vec2-large-xlsr-53-english",
+   "processor_class": "Wav2Vec2CTCTokenizer",
+   "activation_dropout": 0.05,
+   "adapter_kernel_size": 3,
+   "adapter_stride": 2,
+   "add_adapter": false,
+   "apply_spec_augment": true,
+   "architectures": [
+     "Wav2Vec2ForCTC"
+   ],
+   "attention_dropout": 0.1,
+   "bos_token_id": 1,
+   "classifier_proj_size": 256,
+   "codevector_dim": 256,
+   "contrastive_logits_temperature": 0.1,
+   "conv_bias": true,
+   "conv_dim": [
+     512,
+     512,
+     512,
+     512,
+     512,
+     512,
+     512
+   ],
+   "conv_kernel": [
+     10,
+     3,
+     3,
+     3,
+     3,
+     2,
+     2
+   ],
+   "conv_stride": [
+     5,
+     2,
+     2,
+     2,
+     2,
+     2,
+     2
+   ],
+   "ctc_loss_reduction": "mean",
+   "ctc_zero_infinity": true,
+   "diversity_loss_weight": 0.1,
+   "do_stable_layer_norm": true,
+   "eos_token_id": 2,
+   "feat_extract_activation": "gelu",
+   "feat_extract_dropout": 0.0,
+   "feat_extract_norm": "layer",
+   "feat_proj_dropout": 0.05,
+   "feat_quantizer_dropout": 0.0,
+   "final_dropout": 0.0,
+   "finetuning_task": "wav2vec2_clf",
+   "hidden_act": "gelu",
+   "hidden_dropout": 0.05,
+   "hidden_size": 1024,
+   "id2label": {
+     "0": "angry",
+     "1": "disgust",
+     "2": "fear",
+     "3": "happy",
+     "4": "neutral",
+     "5": "sad",
+     "6": "surprise"
+   },
+   "initializer_range": 0.02,
+   "intermediate_size": 4096,
+   "label2id": {
+     "angry": 0,
+     "disgust": 1,
+     "fear": 2,
+     "happy": 3,
+     "neutral": 4,
+     "sad": 5,
+     "surprise": 6
+   },
+   "layer_norm_eps": 1e-05,
+   "layerdrop": 0.05,
+   "mask_channel_length": 10,
+   "mask_channel_min_space": 1,
+   "mask_channel_other": 0.0,
+   "mask_channel_prob": 0.0,
+   "mask_channel_selection": "static",
+   "mask_feature_length": 10,
+   "mask_feature_min_masks": 0,
+   "mask_feature_prob": 0.0,
+   "mask_time_length": 10,
+   "mask_time_min_masks": 2,
+   "mask_time_min_space": 1,
+   "mask_time_other": 0.0,
+   "mask_time_prob": 0.05,
+   "mask_time_selection": "static",
+   "model_type": "wav2vec2",
+   "num_adapter_layers": 3,
+   "num_attention_heads": 16,
+   "num_codevector_groups": 2,
+   "num_codevectors_per_group": 320,
+   "num_conv_pos_embedding_groups": 16,
+   "num_conv_pos_embeddings": 128,
+   "num_feat_extract_layers": 7,
+   "num_hidden_layers": 24,
+   "num_negatives": 100,
+   "output_hidden_size": 1024,
+   "pad_token_id": 0,
+   "pooling_mode": "mean",
+   "problem_type": "single_label_classification",
+   "proj_codevector_dim": 256,
+   "tdnn_dilation": [
+     1,
+     2,
+     3,
+     1,
+     1
+   ],
+   "tdnn_dim": [
+     512,
+     512,
+     512,
+     512,
+     1500
+   ],
+   "tdnn_kernel": [
+     5,
+     3,
+     3,
+     1,
+     1
+   ],
+   "torch_dtype": "float32",
+   "transformers_version": "4.22.1",
+   "use_weighted_layer_sum": false,
+   "vocab_size": 33,
+   "xvector_output_dim": 512
+ }
wav2vec-english-speech-emotion-recognition/preprocessor_config.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "do_normalize": true,
+   "feature_extractor_type": "Wav2Vec2FeatureExtractor",
+   "feature_size": 1,
+   "padding_side": "right",
+   "padding_value": 0.0,
+   "processor_class": "Wav2Vec2ProcessorWithLM",
+   "return_attention_mask": true,
+   "sampling_rate": 16000
+ }
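
This preprocessor configuration is what `Wav2Vec2FeatureExtractor.from_pretrained` reads; in particular, `sampling_rate: 16000` means audio must be (re)sampled to 16 kHz before inference, as the README's usage example does with `librosa.load(..., sr=16000)`. A minimal sketch (not part of the upload) to confirm the loaded feature extractor matches this file:

```python
from transformers import Wav2Vec2FeatureExtractor

feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("r-f/wav2vec-english-speech-emotion-recognition")
assert feature_extractor.sampling_rate == 16000  # matches preprocessor_config.json above
```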
wav2vec-english-speech-emotion-recognition/pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f6470434ecf20ae93b22284ac83877984fb8765e332037c36a54df6607e3a206
+ size 1266126445
wav2vec-english-speech-emotion-recognition/training_args.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:6b7a4b18e6dd098bbeba86991ea3a66623c19570bf00ab392b2b8e7e72ee8598
+ size 3439