<div align="center">
<h1> Neural Source-Filter BigVGAN </h1>
Just For Fun
</div>

## Dataset preparation
Put the dataset into the `data_raw` directory with the following file structure:
```shell
data_raw
├── speaker0
│   ├── 000001.wav
│   ├── ...
│   └── 000xxx.wav
└── speaker1
    ├── 000001.wav
    ├── ...
    └── 000xxx.wav
```
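To confirm the layout before preprocessing, a minimal sketch (the `list_dataset` helper is hypothetical, not part of this repo) that walks `data_raw` and counts the wav files per speaker:

```python
import os

def list_dataset(root="data_raw"):
    """Return {speaker: sorted wav filenames} for each speaker subdirectory."""
    dataset = {}
    for speaker in sorted(os.listdir(root)):
        spk_dir = os.path.join(root, speaker)
        if not os.path.isdir(spk_dir):
            continue  # skip stray files at the top level
        dataset[speaker] = sorted(f for f in os.listdir(spk_dir)
                                  if f.endswith(".wav"))
    return dataset

if __name__ == "__main__":
    for speaker, wavs in list_dataset().items():
        print(f"{speaker}: {len(wavs)} wav files")
```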
## Install dependencies
- 1. install the software dependencies
> pip install -r requirements.txt
- 2. download the [release](https://github.com/PlayVoice/NSF-BigVGAN/releases/tag/debug) model and run a test
> python nsf_bigvgan_inference.py --config configs/nsf_bigvgan.yaml --model nsf_bigvgan_g.pth --wave test.wav
## Data preprocessing
- 1. re-sample to 32kHz
> python prepare/preprocess_a.py -w ./data_raw -o ./data_bigvgan/waves-32k
- 2. extract pitch
> python prepare/preprocess_f0.py -w data_bigvgan/waves-32k/ -p data_bigvgan/pitch
- 3. extract mel: shape [100, length]
> python prepare/preprocess_spec.py -w data_bigvgan/waves-32k/ -s data_bigvgan/mel
- 4. generate the training index
> python prepare/preprocess_train.py
```shell
data_bigvgan/
├── waves-32k
│   ├── speaker0
│   │   ├── 000001.wav
│   │   └── 000xxx.wav
│   └── speaker1
│       ├── 000001.wav
│       └── 000xxx.wav
├── pitch
│   ├── speaker0
│   │   ├── 000001.pit.npy
│   │   └── 000xxx.pit.npy
│   └── speaker1
│       ├── 000001.pit.npy
│       └── 000xxx.pit.npy
└── mel
    ├── speaker0
    │   ├── 000001.mel.pt
    │   └── 000xxx.mel.pt
    └── speaker1
        ├── 000001.mel.pt
        └── 000xxx.mel.pt
```
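Before training, it is worth verifying that every wav produced by preprocessing has its matching pitch and mel file; a minimal sketch (the `check_features` helper is hypothetical) based on the tree above:

```python
import os

def check_features(root="data_bigvgan"):
    """For each wav under waves-32k, check that the matching .pit.npy and
    .mel.pt files exist; return a list of the missing feature paths."""
    missing = []
    wave_root = os.path.join(root, "waves-32k")
    for speaker in sorted(os.listdir(wave_root)):
        for wav in sorted(os.listdir(os.path.join(wave_root, speaker))):
            if not wav.endswith(".wav"):
                continue
            stem = wav[:-4]  # "000001.wav" -> "000001"
            for subdir, ext in (("pitch", ".pit.npy"), ("mel", ".mel.pt")):
                path = os.path.join(root, subdir, speaker, stem + ext)
                if not os.path.exists(path):
                    missing.append(path)
    return missing
```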
## Train
- 1. start training
> python nsf_bigvgan_trainer.py -c configs/nsf_bigvgan.yaml -n nsf_bigvgan
- 2. resume training
> python nsf_bigvgan_trainer.py -c configs/nsf_bigvgan.yaml -n nsf_bigvgan -p chkpt/nsf_bigvgan/***.pth
- 3. view the logs
> tensorboard --logdir logs/
## Inference
- 1. export the inference model
> python nsf_bigvgan_export.py --config configs/nsf_bigvgan.yaml --checkpoint_path chkpt/nsf_bigvgan/***.pt
- 2. extract mel
> python spec/inference.py -w test.wav -m test.mel.pt
- 3. extract F0
> python pitch/inference.py -w test.wav -p test.csv
- 4. run inference
> python nsf_bigvgan_inference.py --config configs/nsf_bigvgan.yaml --model nsf_bigvgan_g.pth --wave test.wav

or

> python nsf_bigvgan_inference.py --config configs/nsf_bigvgan.yaml --model nsf_bigvgan_g.pth --mel test.mel.pt --pit test.csv
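The mel-extraction, F0-extraction, and inference steps above can be chained from Python; a sketch that assembles the exact commands listed above (the `build_steps`/`run_all` helper names are hypothetical):

```python
import subprocess

def build_steps(wav="test.wav", mel="test.mel.pt", pit="test.csv",
                config="configs/nsf_bigvgan.yaml", model="nsf_bigvgan_g.pth"):
    """Assemble the three inference commands from the README for one wav."""
    return [
        ["python", "spec/inference.py", "-w", wav, "-m", mel],
        ["python", "pitch/inference.py", "-w", wav, "-p", pit],
        ["python", "nsf_bigvgan_inference.py", "--config", config,
         "--model", model, "--mel", mel, "--pit", pit],
    ]

def run_all(wav="test.wav"):
    for cmd in build_steps(wav):
        subprocess.run(cmd, check=True)  # abort on the first failing step
```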
## Augmentation of mel
Because acoustic models tend to produce over-smoothed mel spectrograms, we apply Gaussian blur to the mel input when training the vocoder:
```python
# build a Gaussian blur module (kernel_size=5, sigma=2) for single-channel input
model_b = get_gaussian_kernel(kernel_size=5, sigma=2, channels=1).to(device)
# blur the mel: add a channel dim, convolve, then squeeze the channel back out
mel_b = mel[:, None, :, :]
mel_b = model_b(mel_b)
mel_b = torch.squeeze(mel_b, 1)
# randomly mix blurred and original mel (blur weight drawn from [0.5, 1.0])
mel_r = torch.rand(1).to(device) * 0.5
mel_b = (1 - mel_r) * mel_b + mel_r * mel
# feed the augmented mel to the generator
optim_g.zero_grad()
fake_audio = model_g(mel_b, pit)
```
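`get_gaussian_kernel` is not shown above; presumably it wraps a normalized 2-D Gaussian kernel in a fixed-weight depthwise convolution. A pure-Python sketch of just the kernel weights it would build (the function name and defaults mirror the call above, but this is an assumption, not the repo's implementation):

```python
import math

def gaussian_kernel_2d(kernel_size=5, sigma=2.0):
    """Build a normalized 2-D Gaussian kernel as a list of lists."""
    center = (kernel_size - 1) / 2.0
    kernel = [[math.exp(-((x - center) ** 2 + (y - center) ** 2)
                        / (2 * sigma ** 2))
               for x in range(kernel_size)]
              for y in range(kernel_size)]
    # normalize so the blur preserves the overall mel energy
    total = sum(sum(row) for row in kernel)
    return [[v / total for v in row] for row in kernel]
```

In the real module these weights would be loaded into a `torch.nn.Conv2d` with `groups=channels` and `requires_grad=False`, so the blur is applied per channel and never trained.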

## Source of code and References
https://github.com/nii-yamagishilab/project-NN-Pytorch-scripts/tree/master/project/01-nsf
https://github.com/mindslab-ai/univnet [[paper]](https://arxiv.org/abs/2106.07889)
https://github.com/NVIDIA/BigVGAN [[paper]](https://arxiv.org/abs/2206.04658)