## Use Cases
### Command Recognition
Command recognition or keyword spotting classifies utterances into a predefined set of commands. This is often done on-device for fast response time.
As an example, using the Google Speech Commands dataset, given an input audio clip, a model can classify which of the following commands the user is saying:
```
'yes', 'no', 'up', 'down', 'left', 'right', 'on', 'off', 'stop', 'go', 'unknown', 'silence'
```
SpeechBrain models can easily perform this task with just a couple of lines of code!
```python
from speechbrain.pretrained import EncoderClassifier

# Load a pretrained x-vector classifier trained on Google Speech Commands
model = EncoderClassifier.from_hparams(
    "speechbrain/google_speech_command_xvector"
)
# Returns the class probabilities along with the predicted command label
model.classify_file("file.wav")
```
### Language Identification
Datasets such as VoxLingua107 allow anyone to train language identification models for up to 107 languages! This can be extremely useful as a preprocessing step for other systems. Here's an example [model](https://huggingface.co/TalTechNLP/voxlingua107-epaca-tdnn) trained on VoxLingua107.
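As a rough sketch, this model can be loaded through SpeechBrain's `EncoderClassifier` interface, just like the keyword-spotting model above (the `savedir` path and audio file name here are placeholders; the model card documents the exact usage):
```python
from speechbrain.pretrained import EncoderClassifier

# Illustrative sketch: load the VoxLingua107 language ID model and
# classify a local audio file ("speech.wav" is a placeholder path)
language_id = EncoderClassifier.from_hparams(
    source="TalTechNLP/voxlingua107-epaca-tdnn",
    savedir="tmp_voxlingua107",  # arbitrary local cache directory
)
# classify_file returns probabilities, the best score and index, and the
# predicted language label
prediction = language_id.classify_file("speech.wav")
print(prediction[3])
```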
### Emotion Recognition
Emotion recognition classifies an utterance by the emotion expressed by the speaker, such as happiness or anger. In addition to trying the widgets, you can use the Inference API to perform audio classification. Here is a simple example that uses a [HuBERT](https://huggingface.co/superb/hubert-large-superb-er) model fine-tuned for this task.
```python
import json

import requests

API_URL = "https://api-inference.huggingface.co/models/superb/hubert-large-superb-er"
headers = {"Authorization": f"Bearer {API_TOKEN}"}  # API_TOKEN is your Hugging Face token

def query(filename):
    # Send the raw audio bytes to the Inference API and parse the JSON response
    with open(filename, "rb") as f:
        data = f.read()
    response = requests.request("POST", API_URL, headers=headers, data=data)
    return json.loads(response.content.decode("utf-8"))

data = query("sample1.flac")
# [{'label': 'neu', 'score': 0.60},
#  {'label': 'hap', 'score': 0.20},
#  {'label': 'ang', 'score': 0.13},
#  {'label': 'sad', 'score': 0.07}]
```
You can use [huggingface.js](https://github.com/huggingface/huggingface.js) to infer with audio classification models on Hugging Face Hub.
```javascript
import { HfInference } from "@huggingface/inference";
const inference = new HfInference(HF_ACCESS_TOKEN);
await inference.audioClassification({
data: await (await fetch("sample.flac")).blob(),
model: "facebook/mms-lid-126",
});
```
### Speaker Identification
Speaker Identification classifies which speaker, out of a usually predefined set, is talking in an audio clip. You can try out this task with [this model](https://huggingface.co/superb/wav2vec2-base-superb-sid). A useful dataset for this task is VoxCeleb1.
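For instance, here is a minimal sketch using the `transformers` audio classification pipeline with the model linked above (the audio file path is a placeholder):
```python
from transformers import pipeline

# Illustrative sketch: run the SUPERB speaker identification model on a
# local audio file ("speech.wav" is a placeholder path)
classifier = pipeline(
    "audio-classification", model="superb/wav2vec2-base-superb-sid"
)
predictions = classifier("speech.wav")
# Each prediction is a dict like {"label": "id10003", "score": 0.95},
# where the label is a VoxCeleb1 speaker ID
print(predictions)
```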
## Solving audio classification for your own data
We have some great news! You can fine-tune (transfer learning) pretrained models such as Wav2Vec2 and HuBERT to train a well-performing model without requiring much labeled data. [Facebook's Wav2Vec2 XLS-R model](https://ai.facebook.com/blog/wav2vec-20-learning-the-structure-of-speech-from-raw-audio/) is a large multilingual model trained on 436K hours of speech across 128 languages.
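As a rough sketch of what fine-tuning looks like with `transformers` (the encoder choice, dataset slice, and hyperparameters below are illustrative assumptions; see the task guide linked below for a complete walkthrough):
```python
from datasets import Audio, load_dataset
from transformers import (
    AutoFeatureExtractor,
    AutoModelForAudioClassification,
    Trainer,
    TrainingArguments,
)

# Illustrative choices: a pretrained speech encoder and a small slice of
# the Speech Commands data (the "ks" config of SUPERB)
model_name = "facebook/wav2vec2-base"
dataset = load_dataset("superb", "ks", split="train[:512]")
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))
num_labels = dataset.features["label"].num_classes

feature_extractor = AutoFeatureExtractor.from_pretrained(model_name)

def preprocess(batch):
    # Convert raw waveforms into padded, fixed-length model inputs
    audio_arrays = [audio["array"] for audio in batch["audio"]]
    inputs = feature_extractor(
        audio_arrays,
        sampling_rate=16_000,
        max_length=16_000,
        padding="max_length",
        truncation=True,
    )
    inputs["labels"] = batch["label"]
    return inputs

dataset = dataset.map(preprocess, batched=True, remove_columns=dataset.column_names)

# Attach a fresh classification head on top of the pretrained encoder
model = AutoModelForAudioClassification.from_pretrained(model_name, num_labels=num_labels)
args = TrainingArguments(
    output_dir="wav2vec2-keyword-spotting",
    per_device_train_batch_size=8,
    learning_rate=3e-5,
    num_train_epochs=1,
)
Trainer(model=model, args=args, train_dataset=dataset).train()
```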
## Useful Resources
Would you like to learn more about the topic? Awesome! Here you can find some curated resources that you may find helpful!
### Notebooks
- [PyTorch](https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/audio_classification.ipynb)
### Scripts for training
- [PyTorch](https://github.com/huggingface/transformers/tree/main/examples/pytorch/audio-classification)
### Documentation
- [Audio classification task guide](https://huggingface.co/docs/transformers/tasks/audio_classification)