---
library_name: tf-keras
tags:
- audio-classification
- accent-classification
---
## Model description
This model classifies UK & Ireland accents using feature extraction from [YAMNet](https://tfhub.dev/google/yamnet/1).
### YAMNet Model
YAMNet is an audio event classifier trained on the AudioSet dataset to predict audio events from the AudioSet ontology. It is available on TensorFlow Hub.
YAMNet accepts a 1-D float32 tensor of audio samples with a sample rate of 16 kHz.
As output, the model returns a 3-tuple:
- Scores of shape `(N, 521)` representing the scores of the 521 classes.
- Embeddings of shape `(N, 1024)`.
- The log-mel spectrogram of the entire audio frame.
We use the embeddings, which are the features extracted from the audio samples, as the input to our dense model.
For more detailed information about YAMNet, please refer to its [TensorFlow Hub](https://tfhub.dev/google/yamnet/1) page.
### Dense Model
The dense model consists of:
- An input layer that takes the embedding output of the YAMNet classifier.
- 4 dense hidden layers and 4 dropout layers.
- An output dense layer.
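The card does not specify the layer widths or dropout rates, so the sketch below is a hypothetical reconstruction of that shape only: an input taking the 1024-d YAMNet embedding, four Dense/Dropout pairs with assumed sizes, and a softmax output over the six accent classes.

```python
from tensorflow import keras

NUM_CLASSES = 6  # Southern England, Midlands, Northern England, Wales, Scotland, Ireland

def build_dense_model(embedding_dim=1024, num_classes=NUM_CLASSES):
    inputs = keras.Input(shape=(embedding_dim,), name="yamnet_embedding")
    x = inputs
    # 4 dense hidden layers and 4 dropout layers; widths and rates are assumptions.
    for units in (256, 256, 192, 192):
        x = keras.layers.Dense(units, activation="relu")(x)
        x = keras.layers.Dropout(0.5)(x)
    outputs = keras.layers.Dense(num_classes, activation="softmax")(x)
    return keras.Model(inputs, outputs, name="accent_classifier")

model = build_dense_model()
model.summary()
```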
## Training and evaluation data
The dataset used is the
[Crowdsourced high-quality UK and Ireland English Dialect speech data set](https://openslr.org/83/)
which consists of a total of 17,877 high-quality audio wav files.
This dataset includes over 31 hours of recordings from 120 volunteers who self-identify as
native speakers of Southern England, Midlands, Northern England, Wales, Scotland and Ireland.
For more information, please refer to the link above or to the following paper:
[Open-source Multi-speaker Corpora of the English Accents in the British Isles](https://aclanthology.org/2020.lrec-1.804.pdf)
## Training procedure
### Training hyperparameters
| Optimizer | learning_rate | decay | beta_1 | beta_2 | epsilon | amsgrad | training_precision |
|----|-------------|-----|------|------|-------|-------|------------------|
|Adam|1.9644e-05|0.0|0.9|0.999|1e-07|False|float32|
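Recreated in code, the optimizer configuration above looks like the following (a `decay` of 0.0 is the default in tf-keras, so it is not passed explicitly):

```python
from tensorflow import keras

# Hyperparameters taken from the table above.
optimizer = keras.optimizers.Adam(
    learning_rate=1.9644e-05,
    beta_1=0.9,
    beta_2=0.999,
    epsilon=1e-07,
    amsgrad=False,
)
```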
## Training Metrics
| Epochs | Training Loss | Training Accuracy | Training AUC |
|--- |--- |--- |--- |
| 1| 10.614| 0.343| 0.759|
| 2| 9.378| 0.396| 0.806|
| 3| 8.993| 0.422| 0.821|
| 4| 8.768| 0.433| 0.829|
| 5| 8.636| 0.438| 0.833|
| 6| 8.514| 0.442| 0.837|
| 7| 8.432| 0.444| 0.839|
| 8| 8.339| 0.446| 0.841|
| 9| 8.270| 0.448| 0.843|
| 10| 8.202| 0.449| 0.845|
| 11| 8.141| 0.451| 0.847|
| 12| 8.095| 0.452| 0.849|
| 13| 8.029| 0.454| 0.851|
| 14| 7.982| 0.454| 0.852|
| 15| 7.935| 0.456| 0.853|
| 16| 7.896| 0.456| 0.854|
| 17| 7.846| 0.459| 0.856|
| 18| 7.809| 0.460| 0.857|
| 19| 7.763| 0.460| 0.858|
| 20| 7.720| 0.462| 0.860|
| 21| 7.688| 0.463| 0.860|
| 22| 7.640| 0.464| 0.861|
| 23| 7.593| 0.467| 0.863|
| 24| 7.579| 0.467| 0.863|
| 25| 7.552| 0.468| 0.864|
| 26| 7.512| 0.468| 0.865|
| 27| 7.477| 0.469| 0.866|
| 28| 7.434| 0.470| 0.867|
| 29| 7.420| 0.471| 0.868|
| 30| 7.374| 0.471| 0.868|
| 31| 7.352| 0.473| 0.869|
| 32| 7.323| 0.474| 0.870|
| 33| 7.274| 0.475| 0.871|
| 34| 7.253| 0.476| 0.871|
| 35| 7.221| 0.477| 0.872|
| 36| 7.179| 0.480| 0.873|
| 37| 7.155| 0.481| 0.874|
| 38| 7.141| 0.481| 0.874|
| 39| 7.108| 0.482| 0.875|
| 40| 7.067| 0.483| 0.876|
| 41| 7.060| 0.483| 0.876|
| 42| 7.019| 0.485| 0.877|
| 43| 6.998| 0.484| 0.877|
| 44| 6.974| 0.486| 0.878|
| 45| 6.947| 0.487| 0.878|
| 46| 6.921| 0.488| 0.879|
| 47| 6.875| 0.490| 0.880|
| 48| 6.860| 0.490| 0.880|
| 49| 6.843| 0.491| 0.881|
| 50| 6.811| 0.492| 0.881|
| 51| 6.783| 0.494| 0.882|
| 52| 6.764| 0.494| 0.882|
| 53| 6.719| 0.497| 0.883|
| 54| 6.693| 0.497| 0.884|
| 55| 6.682| 0.498| 0.884|
| 56| 6.653| 0.497| 0.884|
| 57| 6.630| 0.499| 0.885|
| 58| 6.596| 0.500| 0.885|
| 59| 6.577| 0.500| 0.886|
| 60| 6.546| 0.501| 0.886|
| 61| 6.517| 0.502| 0.887|
| 62| 6.514| 0.504| 0.887|
| 63| 6.483| 0.504| 0.888|
| 64| 6.428| 0.506| 0.888|
| 65| 6.424| 0.507| 0.889|
| 66| 6.412| 0.508| 0.889|
| 67| 6.388| 0.507| 0.889|
| 68| 6.342| 0.509| 0.890|
| 69| 6.309| 0.510| 0.891|
| 70| 6.300| 0.510| 0.891|
| 71| 6.279| 0.512| 0.892|
| 72| 6.258| 0.510| 0.892|
| 73| 6.242| 0.513| 0.892|
| 74| 6.206| 0.514| 0.893|
| 75| 6.189| 0.516| 0.893|
| 76| 6.164| 0.517| 0.894|
| 77| 6.134| 0.517| 0.894|
| 78| 6.120| 0.517| 0.894|
| 79| 6.081| 0.520| 0.895|
| 80| 6.090| 0.518| 0.895|
| 81| 6.052| 0.521| 0.896|
| 82| 6.028| 0.521| 0.896|
| 83| 5.991| 0.521| 0.897|
| 84| 5.974| 0.524| 0.897|
| 85| 5.964| 0.524| 0.897|
| 86| 5.951| 0.524| 0.897|
| 87| 5.940| 0.524| 0.898|
| 88| 5.891| 0.525| 0.899|
| 89| 5.870| 0.526| 0.899|
| 90| 5.856| 0.528| 0.899|
| 91| 5.831| 0.528| 0.900|
| 92| 5.808| 0.529| 0.900|
| 93| 5.796| 0.529| 0.900|
| 94| 5.770| 0.530| 0.901|
| 95| 5.763| 0.529| 0.901|
| 96| 5.749| 0.530| 0.901|
| 97| 5.742| 0.530| 0.901|
| 98| 5.705| 0.531| 0.902|
| 99| 5.694| 0.533| 0.902|
| 100| 5.671| 0.534| 0.903|
## Model Plot
<details>
<summary>View Model Plot</summary>

</details>
## Validation Results
The model achieved the following results on the validation dataset:
Results | Validation
-----------|------------
Accuracy | 50%
AUC | 0.8909
d-prime | 1.742
And the confusion matrix for the validation set is:

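The d-prime figure above follows from the AUC under the standard equal-variance Gaussian assumption, d' = √2 · Φ⁻¹(AUC). A quick check (requires `scipy`):

```python
import math
from scipy.stats import norm

def d_prime(auc):
    # d' = sqrt(2) times the inverse CDF of the standard normal at the AUC.
    return math.sqrt(2.0) * norm.ppf(auc)

print(d_prime(0.8909))  # roughly 1.74, consistent with the table above
```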
## Credits
Author: [Fadi Badine](https://twitter.com/fadibadine).
Based on the Keras example [English speaker accent recognition using Transfer Learning](https://keras.io/examples/audio/uk_ireland_accent_recognition), also by Fadi Badine.
Check out the demo space [here](https://huggingface.co/spaces/keras-io/english-speaker-accent-recognition-using-transfer-learning).