UniversalAlgorithmic committed on
Commit 3d72a4d · verified · 1 Parent(s): 9e27f12

Delete examples/audio-classification/README.md

examples/audio-classification/README.md DELETED
@@ -1,148 +0,0 @@
- <!---
- Copyright 2021 The HuggingFace Team. All rights reserved.
-
- Licensed under the Apache License, Version 2.0 (the "License");
- you may not use this file except in compliance with the License.
- You may obtain a copy of the License at
-
-     http://www.apache.org/licenses/LICENSE-2.0
-
- Unless required by applicable law or agreed to in writing, software
- distributed under the License is distributed on an "AS IS" BASIS,
- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- See the License for the specific language governing permissions and
- limitations under the License.
- -->
-
- # Audio classification examples
-
- The following examples showcase how to fine-tune `Wav2Vec2` for audio classification using PyTorch.
-
- Speech recognition models that have been pretrained in an unsupervised fashion on audio data alone,
- *e.g.* [Wav2Vec2](https://huggingface.co/transformers/main/model_doc/wav2vec2.html),
- [HuBERT](https://huggingface.co/transformers/main/model_doc/hubert.html),
- [XLSR-Wav2Vec2](https://huggingface.co/transformers/main/model_doc/xlsr_wav2vec2.html), have been shown to require only
- very little annotated data to yield good performance on speech classification datasets.
27
- ## Single-GPU
28
-
29
- The following command shows how to fine-tune [wav2vec2-base](https://huggingface.co/facebook/wav2vec2-base) on the 🗣️ [Keyword Spotting subset](https://huggingface.co/datasets/superb#ks) of the SUPERB dataset.
30
-
31
- ```bash
32
- python run_audio_classification.py \
33
- --model_name_or_path facebook/wav2vec2-base \
34
- --dataset_name superb \
35
- --dataset_config_name ks \
36
- --output_dir wav2vec2-base-ft-keyword-spotting \
37
- --overwrite_output_dir \
38
- --remove_unused_columns False \
39
- --do_train \
40
- --do_eval \
41
- --fp16 \
42
- --learning_rate 3e-5 \
43
- --max_length_seconds 1 \
44
- --attention_mask False \
45
- --warmup_ratio 0.1 \
46
- --num_train_epochs 5 \
47
- --per_device_train_batch_size 32 \
48
- --gradient_accumulation_steps 4 \
49
- --per_device_eval_batch_size 32 \
50
- --dataloader_num_workers 4 \
51
- --logging_strategy steps \
52
- --logging_steps 10 \
53
- --eval_strategy epoch \
54
- --save_strategy epoch \
55
- --load_best_model_at_end True \
56
- --metric_for_best_model accuracy \
57
- --save_total_limit 3 \
58
- --seed 0 \
59
- --push_to_hub
60
- ```
61
-
62
- On a single V100 GPU (16GB), this script should run in ~14 minutes and yield accuracy of **98.26%**.
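As a rough illustration of what `--max_length_seconds 1` means in terms of raw samples, the sketch below crops a waveform to a fixed duration at a random offset. It assumes 16 kHz audio (the rate wav2vec2-base was pretrained on). `random_crop` is a hypothetical helper written for this note, not the function used by `run_audio_classification.py`.

```python
import random

SAMPLE_RATE = 16_000  # assumption: 16 kHz audio


def random_crop(wav, max_length_seconds, sample_rate=SAMPLE_RATE, rng=random):
    """Return a random window of at most max_length_seconds from wav (a sequence of samples)."""
    target_len = int(round(sample_rate * max_length_seconds))
    if len(wav) <= target_len:
        return wav  # short clips are returned unchanged
    # randint is inclusive on both ends, so the window always fits inside wav
    offset = rng.randint(0, len(wav) - target_len)
    return wav[offset:offset + target_len]


clip = [0.0] * (3 * SAMPLE_RATE)  # a 3-second dummy clip
cropped = random_crop(clip, 1)
print(len(cropped))  # 16000 samples, i.e. 1 second at 16 kHz
```

Under these assumptions, every training example in the keyword spotting run is at most 16,000 samples long, which helps keep a batch of 32 within a 16GB GPU.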
-
- 👀 See the results here: [anton-l/wav2vec2-base-ft-keyword-spotting](https://huggingface.co/anton-l/wav2vec2-base-ft-keyword-spotting)
-
- > If the classification head of your model does not match the number of labels in the dataset, you can specify `--ignore_mismatched_sizes` to adapt it.
-
- ## Multi-GPU
-
- The following command shows how to fine-tune [wav2vec2-base](https://huggingface.co/facebook/wav2vec2-base) for 🌎 **Language Identification** on the [CommonLanguage dataset](https://huggingface.co/datasets/anton-l/common_language).
-
- ```bash
- python run_audio_classification.py \
-     --model_name_or_path facebook/wav2vec2-base \
-     --dataset_name common_language \
-     --audio_column_name audio \
-     --label_column_name language \
-     --output_dir wav2vec2-base-lang-id \
-     --overwrite_output_dir \
-     --remove_unused_columns False \
-     --do_train \
-     --do_eval \
-     --fp16 \
-     --learning_rate 3e-4 \
-     --max_length_seconds 16 \
-     --attention_mask False \
-     --warmup_ratio 0.1 \
-     --num_train_epochs 10 \
-     --per_device_train_batch_size 8 \
-     --gradient_accumulation_steps 4 \
-     --per_device_eval_batch_size 1 \
-     --dataloader_num_workers 8 \
-     --logging_strategy steps \
-     --logging_steps 10 \
-     --eval_strategy epoch \
-     --save_strategy epoch \
-     --load_best_model_at_end True \
-     --metric_for_best_model accuracy \
-     --save_total_limit 3 \
-     --seed 0 \
-     --push_to_hub
- ```
-
- On 4 V100 GPUs (16GB), this script should run in ~1 hour and yield accuracy of **79.45%**.
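The two commands use different per-device batch sizes but end up with the same number of samples per optimizer step. The sketch below works through that arithmetic, assuming the usual accounting in which the effective batch size is the per-device batch size times the gradient accumulation steps times the number of GPUs. `effective_batch_size` is a hypothetical helper for illustration only.

```python
def effective_batch_size(per_device_batch_size, gradient_accumulation_steps, num_gpus):
    """Number of samples that contribute to a single optimizer step."""
    return per_device_batch_size * gradient_accumulation_steps * num_gpus


# Values taken from the Single-GPU and Multi-GPU commands in this README.
single_gpu = effective_batch_size(32, 4, 1)
multi_gpu = effective_batch_size(8, 4, 4)
print(single_gpu, multi_gpu)  # 128 128
```

The smaller per-device batch in the multi-GPU run is consistent with its longer clips (`--max_length_seconds 16` versus 1), which take more memory per example.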
-
- 👀 See the results here: [anton-l/wav2vec2-base-lang-id](https://huggingface.co/anton-l/wav2vec2-base-lang-id)
-
- ## Sharing your model on 🤗 Hub
-
- 0. If you haven't already, [sign up](https://huggingface.co/join) for a 🤗 account.
-
- 1. Make sure you have `git-lfs` installed and git set up.
-
- ```bash
- $ apt install git-lfs
- ```
-
- 2. Log in with your HuggingFace account credentials using `huggingface-cli`:
-
- ```bash
- $ huggingface-cli login
- # ...follow the prompts
- ```
-
- 3. When running the script, pass the following arguments:
-
- ```bash
- python run_audio_classification.py \
-     --push_to_hub \
-     --hub_model_id <username/model_id> \
-     ...
- ```
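Once a checkpoint is on the Hub, it can be loaded back for inference. The following is a minimal sketch, assuming `transformers` is installed along with an audio backend that can decode files (such as `ffmpeg`). `classify` is a hypothetical wrapper, and the default `model_id` is the keyword spotting checkpoint linked earlier in this README; substitute your own `<username/model_id>`.

```python
def classify(audio_path, model_id="anton-l/wav2vec2-base-ft-keyword-spotting", top_k=3):
    """Return the top_k predictions for an audio file using a checkpoint from the Hub."""
    # Imported lazily so that defining this helper does not require transformers.
    from transformers import pipeline

    classifier = pipeline("audio-classification", model=model_id)
    # The pipeline returns a list of dicts with "label" and "score" keys.
    return classifier(audio_path, top_k=top_k)


# Example usage ("sample.wav" is a placeholder path, not a file shipped with the examples):
# for prediction in classify("sample.wav"):
#     print(f"{prediction['label']}: {prediction['score']:.3f}")
```

The first call downloads the model weights, so it needs network access.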
-
- ### Examples
-
- It has been verified that the script works for the following datasets:
-
- - [SUPERB Keyword Spotting](https://huggingface.co/datasets/superb#ks)
- - [Common Language](https://huggingface.co/datasets/common_language)
-
- The following table shows a couple of demonstration fine-tuning runs:
-
- | Dataset | Pretrained Model | # transformer layers | Accuracy on eval | GPU setup | Training time | Fine-tuned Model & Logs |
- |---------|------------------|----------------------|------------------|-----------|---------------|--------------------------|
- | Keyword Spotting | [ntu-spml/distilhubert](https://huggingface.co/ntu-spml/distilhubert) | 2 | 0.9706 | 1 V100 GPU | 11min | [here](https://huggingface.co/anton-l/distilhubert-ft-keyword-spotting) |
- | Keyword Spotting | [facebook/wav2vec2-base](https://huggingface.co/facebook/wav2vec2-base) | 12 | 0.9826 | 1 V100 GPU | 14min | [here](https://huggingface.co/anton-l/wav2vec2-base-ft-keyword-spotting) |
- | Keyword Spotting | [facebook/hubert-base-ls960](https://huggingface.co/facebook/hubert-base-ls960) | 12 | 0.9819 | 1 V100 GPU | 14min | [here](https://huggingface.co/anton-l/hubert-base-ft-keyword-spotting) |
- | Keyword Spotting | [asapp/sew-mid-100k](https://huggingface.co/asapp/sew-mid-100k) | 24 | 0.9757 | 1 V100 GPU | 15min | [here](https://huggingface.co/anton-l/sew-mid-100k-ft-keyword-spotting) |
- | Common Language | [facebook/wav2vec2-base](https://huggingface.co/facebook/wav2vec2-base) | 12 | 0.7945 | 4 V100 GPUs | 1h10m | [here](https://huggingface.co/anton-l/wav2vec2-base-lang-id) |