# Inference
Inference is supported via the command line, HTTP API, and WebUI.
!!! note
    Overall, inference consists of several stages:

    1. Encode a given ~10 second clip of voice using VQGAN.
    2. Feed the encoded semantic tokens and the corresponding text into the language model as a prompt example.
    3. Given a new piece of text, let the model generate the corresponding semantic tokens.
    4. Feed the generated semantic tokens into VITS / VQGAN to decode and generate the corresponding voice.
## Command Line Inference
Download the required `vqgan` and `llama` models from our Hugging Face repository.
```bash
huggingface-cli download fishaudio/fish-speech-1.2-sft --local-dir checkpoints/fish-speech-1.2-sft
```
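If you prefer to fetch the checkpoints from Python instead of the CLI, the `huggingface_hub` package provides `snapshot_download`; a minimal sketch equivalent to the command above:
```python
# Download the same repository via the huggingface_hub Python API.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="fishaudio/fish-speech-1.2-sft",
    local_dir="checkpoints/fish-speech-1.2-sft",
)
```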
### 1. Generate prompt from voice:
!!! note
    If you plan to let the model randomly choose a voice timbre, you can skip this step.
```bash
python tools/vqgan/inference.py \
-i "paimon.wav" \
--checkpoint-path "checkpoints/fish-speech-1.2-sft/firefly-gan-vq-fsq-4x1024-42hz-generator.pth"
```
You should get a `fake.npy` file.
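If you want to sanity-check the prompt tokens, the `.npy` file can be opened with NumPy; a minimal sketch (the array shape and dtype depend on the model, so treat the printout as informational only):
```python
# Quick sanity check of the generated prompt tokens.
import numpy as np

tokens = np.load("fake.npy")
print(tokens.shape, tokens.dtype)
```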
### 2. Generate semantic tokens from text:
```bash
python tools/llama/generate.py \
--text "The text you want to convert" \
--prompt-text "Your reference text" \
--prompt-tokens "fake.npy" \
--checkpoint-path "checkpoints/fish-speech-1.2-sft" \
--num-samples 2 \
--compile
```
This command will create a `codes_N` file in the working directory, where N is an integer starting from 0.
!!! note
    You may want to use `--compile` to fuse CUDA kernels for faster inference (~30 tokens/second -> ~500 tokens/second).
    Correspondingly, if you do not plan to use acceleration, you can omit the `--compile` parameter.
!!! info
    For GPUs that do not support bf16, you may need to use the `--half` parameter.
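Since each run of step 2 writes another `codes_N` file, a small helper can be handy for picking the most recent one; a hedged sketch (not part of the repository) that assumes the files are written as `codes_N.npy` in the working directory:
```python
# Hypothetical helper (not part of the repository): find the most recent
# codes_N.npy file produced by tools/llama/generate.py.
from pathlib import Path

codes = sorted(Path(".").glob("codes_*.npy"), key=lambda p: int(p.stem.split("_")[1]))
if codes:
    print("latest:", codes[-1])
```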
### 3. Generate speech from semantic tokens:
#### VQGAN Decoder
```bash
python tools/vqgan/inference.py \
-i "codes_0.npy" \
--checkpoint-path "checkpoints/fish-speech-1.2-sft/firefly-gan-vq-fsq-4x1024-42hz-generator.pth"
```
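To confirm the decoded output without an audio player, you can inspect it with the Python standard library. The output file name used below (`fake.wav`) is an assumption; check the script's log for the actual path it writes:
```python
# Inspect the decoded audio; the file name fake.wav is an assumption --
# check the inference script's log for the real output path.
import wave

with wave.open("fake.wav", "rb") as wav:
    frames, rate = wav.getnframes(), wav.getframerate()
    print(f"{frames / rate:.2f} s at {rate} Hz, {wav.getnchannels()} channel(s)")
```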
## HTTP API Inference
We provide an HTTP API for inference. You can use the following command to start the server:
```bash
python -m tools.api \
--listen 0.0.0.0:8080 \
--llama-checkpoint-path "checkpoints/fish-speech-1.2-sft" \
--decoder-checkpoint-path "checkpoints/fish-speech-1.2-sft/firefly-gan-vq-fsq-4x1024-42hz-generator.pth" \
--decoder-config-name firefly_gan_vq
```
If you want to speed up inference, you can add the `--compile` parameter.
After that, you can view and test the API at http://127.0.0.1:8080/.
Below is an example of sending a request using `tools/post_api.py`.
```bash
python -m tools.post_api \
--text "Text to be input" \
--reference_audio "Path to reference audio" \
--reference_text "Text content of the reference audio" \
--streaming True
```
The above command synthesizes the desired audio based on the reference audio information and returns it in a streaming fashion.
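If you would rather call the HTTP endpoint directly instead of going through `tools/post_api.py`, the response can be streamed to disk with `requests`. Note that the URL path and payload field names below are placeholders, not the confirmed schema; consult the API docs served at http://127.0.0.1:8080/ (or read `tools/post_api.py`) for the actual contract. A hedged sketch:
```python
# Hedged sketch: stream a synthesis response straight to a file with requests.
# The URL path and JSON fields are PLACEHOLDERS; check http://127.0.0.1:8080/
# or tools/post_api.py for the real request schema.
import requests

payload = {
    "text": "Text to be input",
    "reference_text": "Text content of the reference audio",
    "streaming": True,
}
with requests.post("http://127.0.0.1:8080/v1/invoke", json=payload, stream=True) as resp:
    resp.raise_for_status()
    with open("output.wav", "wb") as f:
        for chunk in resp.iter_content(chunk_size=8192):
            f.write(chunk)
```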
If you need to randomly select reference audio based on `{SPEAKER}` and `{EMOTION}`, configure it according to the following steps:
### 1. Create a `ref_data` folder in the root directory of the project.
### 2. Create a directory structure similar to the following within the `ref_data` folder.
```
.
├── SPEAKER1
│   ├── EMOTION1
│   │   ├── 21.15-26.44.lab
│   │   ├── 21.15-26.44.wav
│   │   ├── 27.51-29.98.lab
│   │   ├── 27.51-29.98.wav
│   │   ├── 30.1-32.71.lab
│   │   └── 30.1-32.71.flac
│   └── EMOTION2
│       ├── 30.1-32.71.lab
│       └── 30.1-32.71.mp3
└── SPEAKER2
    └── EMOTION3
        ├── 30.1-32.71.lab
        └── 30.1-32.71.mp3
```
That is, first place `{SPEAKER}` folders in `ref_data`, then place `{EMOTION}` folders under each speaker, and place any number of `audio-text pairs` under each emotion folder.
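To catch layout mistakes before running `tools/gen_ref.py`, a small check that every audio file has a matching `.lab` transcript next to it can help; a minimal sketch (not part of the repository):
```python
# Hypothetical validation helper (not part of the repository): verify that
# every audio file under ref_data has a matching .lab transcript beside it.
from pathlib import Path

AUDIO_EXTS = {".wav", ".flac", ".mp3"}
for audio in Path("ref_data").rglob("*"):
    if audio.suffix.lower() in AUDIO_EXTS and not audio.with_suffix(".lab").exists():
        print("missing transcript for", audio)
```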
### 3. Enter the following command in the virtual environment.
```bash
python tools/gen_ref.py
```
### 4. Call the API.
```bash
python -m tools.post_api \
--text "Text to be input" \
--speaker "${SPEAKER1}" \
--emotion "${EMOTION1}" \
--streaming True
```
The above example is for testing purposes only.
## WebUI Inference
You can start the WebUI using the following command:
```bash
python -m tools.webui \
--llama-checkpoint-path "checkpoints/fish-speech-1.2-sft" \
--decoder-checkpoint-path "checkpoints/fish-speech-1.2-sft/firefly-gan-vq-fsq-4x1024-42hz-generator.pth" \
--decoder-config-name firefly_gan_vq
```
!!! note
    You can use Gradio environment variables, such as `GRADIO_SHARE`, `GRADIO_SERVER_PORT`, and `GRADIO_SERVER_NAME`, to configure the WebUI.
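For example, a hedged Python launcher that sets these variables before starting the WebUI with the same flags shown above (adjust the values to your setup):
```python
# Sketch: set Gradio environment variables from Python, then launch the WebUI
# with the same flags shown above. GRADIO_SERVER_NAME / GRADIO_SERVER_PORT are
# standard Gradio variables.
import os
import subprocess

env = dict(os.environ, GRADIO_SERVER_NAME="0.0.0.0", GRADIO_SERVER_PORT="7860")
subprocess.run(
    [
        "python", "-m", "tools.webui",
        "--llama-checkpoint-path", "checkpoints/fish-speech-1.2-sft",
        "--decoder-checkpoint-path",
        "checkpoints/fish-speech-1.2-sft/firefly-gan-vq-fsq-4x1024-42hz-generator.pth",
        "--decoder-config-name", "firefly_gan_vq",
    ],
    env=env,
)
```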
Enjoy!