Update README.md
README.md
CHANGED
@@ -4,36 +4,9 @@ emoji: 🐻❄️
 colorFrom: green
 colorTo: purple
 sdk: gradio
-sdk_version:
+sdk_version: 5.5.0
 app_file: app.py
 pinned: false
 license: agpl-3.0
 ---
 
-Recommended settings for stable speech:
-* `NAR levels`: 7; fewer NAR levels reduce the quality of the final waveform (this may also be strictly because, when EnCodec is fed a sequence with fewer RVQ bin levels than it was initialized with, it'll sound worse).
-* `Temperature (AR)`: [0.85, 1.1]; it's ***really*** tough to find a one-size-fits-all value.
-* `Temperature (NAR)`: [0.15, 0.85]; decent values are even harder to nail here. Too high and you'll hear artifacts from the NAR; too low and the acoustic detail might not be recreated.
-* `Dynamic Temperature`: checked; dynamic temperature definitely seems to help resolve issues with a model that is not strongly trained. Pairable with every other sampling technique.
-* `Top P`: [0.85, 0.95] || 1; I feel this is cope.
-* `Top K`: [768, 1024] || 0; I also feel this is cope.
-* `Beam Width`: 0 || 16; beam search helps find potentially better candidates, but I'm not sure how well it helps in the realm of audio. Incompatible with mirostat.
-* `Repetition Penalty`: 1.35; this and the length decay are, miraculously, what help stabilize output; I have my theories.
-* `Repetition Penalty Length Decay`: 0.2; this keeps the repetition penalty from severely dampening the model's output.
-* `Length Penalty`: 0; this should only be messed with if your output is consistently too short or too long. The AR is trained decently enough to know when to emit a STOP token.
-* `Mirostat (Tau)`: [2.0, 8.0]; the "surprise value" when performing mirostat sampling, which seems much more favorable than typical top-k/top-p or beam search sampling. The "best" values are still unknown.
-* `Mirostat (Eta)`: [0.05, 0.3]; the "learning rate" (decay value?) applied each step for mirostat sampling.
-
-This Space:
-* houses experimental models and the necessary inferencing code for my [VALL-E](https://git.ecker.tech/mrq/vall-e) implementation. I hope to gain some critical feedback on the outputs.
-* utilizes a T4 with a narcoleptic 5-minute sleep timer, as I do not have another system to (easily) host this myself with a 6800XT (or two) while I'm training off my 4070Ti and 7900XTX.
-
-The model is:
-* utilizing a RetNet for faster training/inferencing with conforming dimensionality (1024 dim, 4096 FFN dim, 16 heads, 12 layers), targeting the full eight RVQ bins (albeit the model was originally trained at two, then four).
-* trained on a ~12.5K-hour dataset composed of LibriTTS-R, LibriLight (`small`+`medium`+`duplicated`), generously donated audiobooks, and vidya voice clip rips (including some Japanese kusoge gacha clips).
-* a "monolithic" approach to sharing the retention-based transformer weights between AR and NAR tasks, for no immediately discernible penalties (besides retraining).
-* utilizing DeepSpeed for inferencing with its int8 quantization (allegedly), and Vocos for better output quality.
-  - I do need to add a toggle between different dtypes to gauge any perceptible quality/throughput gains/losses.
-* currently still being trained, and any updates to it will be pushed back to this repo.
-
-I am also currently training an experimental model with double the layers (24 instead of 12) to gauge its performance. Depending on how well it performs, I may pivot to it, but for now, I'm starting to doubt the time investment in training it.
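
For convenience, here's a minimal sketch collecting the recommended settings removed above into one place. The key names are hypothetical stand-ins for the UI labels, not the Space's actual identifiers:

```python
# Hypothetical bundle of the recommended defaults; key names are made up
# to mirror the UI labels in the list above.
RECOMMENDED_SETTINGS = {
    "nar_levels": 7,              # fewer levels degrades the decoded waveform
    "ar_temperature": 0.95,       # somewhere in [0.85, 1.1]
    "nar_temperature": 0.5,       # somewhere in [0.15, 0.85]
    "dynamic_temperature": True,
    "top_p": 1.0,                 # or [0.85, 0.95]
    "top_k": 0,                   # or [768, 1024]
    "beam_width": 0,              # or 16; incompatible with mirostat
    "repetition_penalty": 1.35,
    "repetition_penalty_decay": 0.2,
    "length_penalty": 0.0,
    "mirostat_tau": 3.0,          # [2.0, 8.0] when mirostat is enabled
    "mirostat_eta": 0.1,          # somewhere in [0.05, 0.3]
}
```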
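In community samplers, "dynamic temperature" usually means scaling the temperature with the entropy of the next-token distribution; whether this Space uses that exact mapping is an assumption, but a sketch of the idea looks like:

```python
import numpy as np

def dynamic_temperature(logits: np.ndarray, t_min: float = 0.5,
                        t_max: float = 1.5) -> float:
    """Map distribution entropy to a temperature: confident (low-entropy)
    steps sample cold, uncertain (high-entropy) steps sample hot."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    entropy = -np.sum(probs * np.log(np.clip(probs, 1e-12, None)))
    max_entropy = np.log(len(probs))  # entropy of a uniform distribution
    return t_min + (t_max - t_min) * (entropy / max_entropy)
```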
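One plausible reading of the repetition penalty's length decay is that the penalty fades toward 1.0 the further back a token sits in the history; this is a guess at the mechanism, not the implementation used here, and the function name and signature are made up:

```python
import numpy as np

def repetition_penalize(logits: np.ndarray, history: list[int],
                        penalty: float = 1.35, decay: float = 0.2) -> np.ndarray:
    """Penalize previously emitted tokens, decaying the penalty with distance
    so that only recent repeats are strongly suppressed."""
    out = logits.copy()
    seen: set[int] = set()
    for distance, token in enumerate(reversed(history)):
        if token in seen:  # penalize each token once, at its most recent use
            continue
        seen.add(token)
        effective = 1.0 + (penalty - 1.0) * np.exp(-decay * distance)
        # sign-aware penalty: shrink positive logits, grow negative ones
        out[token] = out[token] / effective if out[token] > 0 else out[token] * effective
    return out
```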
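And since the list leans on mirostat, here is a rough sketch of a mirostat (v2-style) sampling step, showing where `Tau` (target surprise) and `Eta` (update rate) enter; it illustrates the algorithm in general and is not this Space's code:

```python
import numpy as np

def mirostat_v2_step(logits: np.ndarray, mu: float, tau: float = 3.0,
                     eta: float = 0.1) -> tuple[int, float]:
    """One mirostat step: drop candidates whose surprise exceeds mu, sample
    from the rest, then steer mu toward the target surprise tau."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(-probs)          # tokens from most to least probable
    surprise = -np.log2(probs[order])
    keep = surprise <= mu
    keep[0] = True                      # never prune the top candidate
    kept = order[keep]
    kept_probs = probs[kept] / probs[kept].sum()
    token = int(np.random.choice(kept, p=kept_probs))
    mu -= eta * (-np.log2(probs[token]) - tau)  # feedback update toward tau
    return token, mu

# mu is conventionally initialized to 2 * tau before decoding begins.
```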