<div align="center">
<h1> Neural Source-Filter BigVGAN </h1>
Just For Fun
</div>

## Dataset preparation
Put the dataset into the data_raw directory according to the following file structure:
```shell
data_raw
├───speaker0
│   ├───000001.wav
│   ├───...
│   └───000xxx.wav
└───speaker1
    ├───000001.wav
    ├───...
    └───000xxx.wav
```
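Before preprocessing, it can help to confirm the layout matches the tree above. A minimal stdlib-only sketch (the function name `list_dataset` is illustrative, not part of this repo):

```python
import os

def list_dataset(root="data_raw"):
    """Collect the .wav files for each speaker directory under root."""
    dataset = {}
    for speaker in sorted(os.listdir(root)):
        spk_dir = os.path.join(root, speaker)
        if not os.path.isdir(spk_dir):
            continue  # skip stray files at the top level
        wavs = sorted(f for f in os.listdir(spk_dir) if f.endswith(".wav"))
        dataset[speaker] = wavs
    return dataset

if __name__ == "__main__":
    for speaker, wavs in list_dataset().items():
        print(f"{speaker}: {len(wavs)} wav files")
```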
## Install dependencies
- 1，install software dependencies
> pip install -r requirements.txt
- 2，download the [release](https://github.com/PlayVoice/NSF-BigVGAN/releases/tag/debug) model, and run a test
> python nsf_bigvgan_inference.py --config configs/nsf_bigvgan.yaml --model nsf_bigvgan_g.pth --wave test.wav
## Data preprocessing
- 1，re-sample to 32kHz
> python prepare/preprocess_a.py -w ./data_raw -o ./data_bigvgan/waves-32k
- 2，extract pitch
> python prepare/preprocess_f0.py -w data_bigvgan/waves-32k/ -p data_bigvgan/pitch
- 3，extract mel: [100, length]
> python prepare/preprocess_spec.py -w data_bigvgan/waves-32k/ -s data_bigvgan/mel
- 4，generate the training index
> python prepare/preprocess_train.py
```shell
data_bigvgan/
│
├── waves-32k
│   ├── speaker0
│   │   ├── 000001.wav
│   │   └── 000xxx.wav
│   └── speaker1
│       ├── 000001.wav
│       └── 000xxx.wav
├── pitch
│   ├── speaker0
│   │   ├── 000001.pit.npy
│   │   └── 000xxx.pit.npy
│   └── speaker1
│       ├── 000001.pit.npy
│       └── 000xxx.pit.npy
└── mel
    ├── speaker0
    │   ├── 000001.mel.pt
    │   └── 000xxx.mel.pt
    └── speaker1
        ├── 000001.mel.pt
        └── 000xxx.mel.pt
```
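After preprocessing, every resampled wav should have a matching pitch (.pit.npy) and mel (.mel.pt) file, as in the tree above. A small stdlib-only sketch that reports missing counterparts (the function name `check_features` is illustrative, not part of this repo):

```python
import os

def check_features(root="data_bigvgan"):
    """Return the list of expected pitch/mel files that are missing."""
    missing = []
    wav_root = os.path.join(root, "waves-32k")
    for speaker in sorted(os.listdir(wav_root)):
        for wav in sorted(os.listdir(os.path.join(wav_root, speaker))):
            if not wav.endswith(".wav"):
                continue
            stem = wav[:-4]  # strip ".wav"
            pit = os.path.join(root, "pitch", speaker, stem + ".pit.npy")
            mel = os.path.join(root, "mel", speaker, stem + ".mel.pt")
            for path in (pit, mel):
                if not os.path.isfile(path):
                    missing.append(path)
    return missing

if __name__ == "__main__":
    for path in check_features():
        print("missing:", path)
```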
## Train
- 1，start training
> python nsf_bigvgan_trainer.py -c configs/nsf_bigvgan.yaml -n nsf_bigvgan
- 2，resume training
> python nsf_bigvgan_trainer.py -c configs/nsf_bigvgan.yaml -n nsf_bigvgan -p chkpt/nsf_bigvgan/***.pth
- 3，view logs
> tensorboard --logdir logs/
## Inference
- 1，export the inference model
> python nsf_bigvgan_export.py --config configs/nsf_bigvgan.yaml --checkpoint_path chkpt/nsf_bigvgan/***.pth
- 2，extract mel
> python spec/inference.py -w test.wav -m test.mel.pt
- 3，extract F0
> python pitch/inference.py -w test.wav -p test.csv
- 4，infer
> python nsf_bigvgan_inference.py --config configs/nsf_bigvgan.yaml --model nsf_bigvgan_g.pth --wave test.wav

or

> python nsf_bigvgan_inference.py --config configs/nsf_bigvgan.yaml --model nsf_bigvgan_g.pth --mel test.mel.pt --pit test.csv
## Augmentation of mel
To counter the over-smoothed output of acoustic models, we apply Gaussian blur to the mel spectrogram when training the vocoder.
```python
# build a fixed Gaussian-blur module
model_b = get_gaussian_kernel(kernel_size=5, sigma=2, channels=1).to(device)
# blur the mel: add a channel dim, convolve, squeeze the dim back out
mel_b = mel[:, None, :, :]
mel_b = model_b(mel_b)
mel_b = torch.squeeze(mel_b, 1)
# randomly mix blurred and clean mel; the blurred weight lies in [0.5, 1.0]
mel_r = torch.rand(1).to(device) * 0.5
mel_b = (1 - mel_r) * mel_b + mel_r * mel
# feed the augmented mel to the generator
optim_g.zero_grad()
fake_audio = model_g(mel_b, pit)
```
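`get_gaussian_kernel` is not defined in the snippet above. A minimal sketch of one common implementation, a depthwise `nn.Conv2d` with fixed, normalized Gaussian weights; the name and signature come from the snippet, the internals are an assumption:

```python
import torch
import torch.nn as nn

def get_gaussian_kernel(kernel_size=5, sigma=2, channels=1):
    """Build a non-trainable depthwise Conv2d that applies Gaussian blur."""
    # 1-D Gaussian centered on the kernel
    coords = torch.arange(kernel_size, dtype=torch.float32) - (kernel_size - 1) / 2
    g = torch.exp(-(coords ** 2) / (2 * sigma ** 2))
    g = g / g.sum()
    # outer product -> 2-D kernel, still summing to 1
    kernel_2d = torch.outer(g, g)
    kernel_2d = kernel_2d.expand(channels, 1, kernel_size, kernel_size)
    blur = nn.Conv2d(channels, channels, kernel_size,
                     groups=channels, padding=kernel_size // 2, bias=False)
    blur.weight.data = kernel_2d.clone()
    blur.weight.requires_grad = False
    return blur
```

With `padding=kernel_size // 2` (odd kernel sizes), the output keeps the input's shape, so the blurred mel can be mixed with the clean one elementwise as in the training snippet.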
 | |
## Source of code and References | |
https://github.com/nii-yamagishilab/project-NN-Pytorch-scripts/tree/master/project/01-nsf | |
https://github.com/mindslab-ai/univnet [[paper]](https://arxiv.org/abs/2106.07889) | |
https://github.com/NVIDIA/BigVGAN [[paper]](https://arxiv.org/abs/2206.04658) |