Inference
Inference is supported via the command line, an HTTP API, and a web UI.
!!! note Overall, inference consists of several parts:
1. Encode a reference voice clip of about 10 seconds using VQGAN.
2. Input the encoded semantic tokens and the corresponding text into the language model as an example.
3. Given a new piece of text, let the model generate the corresponding semantic tokens.
4. Input the generated semantic tokens into VITS / VQGAN to decode and generate the corresponding voice.
Command Line Inference
Download the required vqgan and llama models from our Hugging Face repository.
huggingface-cli download fishaudio/fish-speech-1.2-sft --local-dir checkpoints/fish-speech-1.2-sft
1. Generate prompt from voice:
!!! note If you plan to let the model randomly choose a voice timbre, you can skip this step.
python tools/vqgan/inference.py \
-i "paimon.wav" \
--checkpoint-path "checkpoints/fish-speech-1.2-sft/firefly-gan-vq-fsq-4x1024-42hz-generator.pth"
You should get a fake.npy file.
2. Generate semantic tokens from text:
python tools/llama/generate.py \
--text "The text you want to convert" \
--prompt-text "Your reference text" \
--prompt-tokens "fake.npy" \
--checkpoint-path "checkpoints/fish-speech-1.2-sft" \
--num-samples 2 \
--compile
This command creates a codes_N file in the working directory, where N is an integer starting from 0.
!!! note
You may want to use --compile to fuse CUDA kernels for faster inference (~30 tokens/second -> ~500 tokens/second).
If you do not plan to use this acceleration, omit the --compile parameter.
!!! info
For GPUs that do not support bf16, you may need to use the --half parameter.
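Because step 2 writes one codes_N file per sample, it can help to list the outputs in numeric order before decoding. The following is a minimal sketch using only the standard library; the `.npy` extension is an assumption inferred from the `codes_0.npy` name used in the decoding step, and `find_codes` is a hypothetical helper, not part of the project.

```python
import re
from pathlib import Path

# Assumed naming: codes_0.npy, codes_1.npy, ... (extension inferred from step 3).
CODES_RE = re.compile(r"^codes_(\d+)\.npy$")


def find_codes(directory="."):
    """Return the codes_N.npy files in `directory`, sorted by N (not by name)."""
    found = []
    for path in Path(directory).iterdir():
        match = CODES_RE.match(path.name)
        if match and path.is_file():
            found.append((int(match.group(1)), path))
    # Sort numerically so codes_10.npy comes after codes_2.npy.
    found.sort(key=lambda item: item[0])
    return [path for _, path in found]


if __name__ == "__main__":
    for codes in find_codes("."):
        print(codes)
```

Each returned path can then be passed to the decoder with `-i`.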
3. Generate audio from semantic tokens:
VQGAN Decoder
python tools/vqgan/inference.py \
-i "codes_0.npy" \
--checkpoint-path "checkpoints/fish-speech-1.2-sft/firefly-gan-vq-fsq-4x1024-42hz-generator.pth"
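The three command-line steps above can be chained from a script. The sketch below only builds the argument lists, reusing the flags and paths shown in this section; the helper names (`encode_cmd`, `generate_cmd`, `decode_cmd`, `pipeline`) are hypothetical and not part of the project. It assumes the commands are run from the repository root.

```python
import shlex

CKPT_DIR = "checkpoints/fish-speech-1.2-sft"
DECODER = CKPT_DIR + "/firefly-gan-vq-fsq-4x1024-42hz-generator.pth"


def encode_cmd(wav_path):
    """Step 1: encode the reference audio (the docs say this yields fake.npy)."""
    return ["python", "tools/vqgan/inference.py",
            "-i", wav_path, "--checkpoint-path", DECODER]


def generate_cmd(text, prompt_text, prompt_tokens="fake.npy",
                 num_samples=1, compile_model=False):
    """Step 2: generate semantic tokens for `text` (writes codes_N files)."""
    cmd = ["python", "tools/llama/generate.py",
           "--text", text,
           "--prompt-text", prompt_text,
           "--prompt-tokens", prompt_tokens,
           "--checkpoint-path", CKPT_DIR,
           "--num-samples", str(num_samples)]
    if compile_model:
        cmd.append("--compile")  # optional kernel fusion for faster inference
    return cmd


def decode_cmd(codes_path="codes_0.npy"):
    """Step 3: decode semantic tokens back into audio."""
    return ["python", "tools/vqgan/inference.py",
            "-i", codes_path, "--checkpoint-path", DECODER]


def pipeline(wav_path, prompt_text, text):
    """Return the three commands in the order they should run."""
    return [encode_cmd(wav_path), generate_cmd(text, prompt_text), decode_cmd()]


if __name__ == "__main__":
    for cmd in pipeline("paimon.wav", "Your reference text",
                        "The text you want to convert"):
        print(shlex.join(cmd))
```

To execute rather than print, each list can be passed to `subprocess.run(cmd, check=True)` so that a failing step stops the chain.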
HTTP API Inference
We provide an HTTP API for inference. Use the following command to start the server:
python -m tools.api \
--listen 0.0.0.0:8080 \
--llama-checkpoint-path "checkpoints/fish-speech-1.2-sft" \
--decoder-checkpoint-path "checkpoints/fish-speech-1.2-sft/firefly-gan-vq-fsq-4x1024-42hz-generator.pth" \
--decoder-config-name firefly_gan_vq
If you want to speed up inference, you can add the --compile parameter.
After that, you can view and test the API at http://127.0.0.1:8080/.
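The server listens on 0.0.0.0 (all interfaces), while a local client connects through 127.0.0.1. The sketch below makes that mapping explicit and adds a simple reachability check using only the standard library. Both `client_url` and `is_up` are hypothetical helpers, and IPv6 listen addresses are not handled.

```python
import urllib.error
import urllib.request


def client_url(listen):
    """Turn a --listen value such as "0.0.0.0:8080" into a URL a local client can open."""
    host, _, port = listen.rpartition(":")
    # 0.0.0.0 means "bind every interface"; it is not an address to connect to.
    if host in ("0.0.0.0", ""):
        host = "127.0.0.1"
    return f"http://{host}:{port}/"


def is_up(url, timeout=2.0):
    """Return True if something answers HTTP at `url`, False otherwise."""
    try:
        urllib.request.urlopen(url, timeout=timeout)
        return True
    except urllib.error.HTTPError:
        return True  # the server answered, even if with an error status
    except OSError:
        return False  # connection refused, timeout, DNS failure, ...


if __name__ == "__main__":
    url = client_url("0.0.0.0:8080")
    print(url, "reachable" if is_up(url) else "not reachable")
```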
Below is an example of sending a request using tools/post_api.py.
python -m tools.post_api \
--text "Text to be input" \
--reference_audio "Path to reference audio" \
--reference_text "Text content of the reference audio" \
--streaming True
The command above synthesizes the desired audio from the reference audio and its text, and returns it as a stream.
If you need to randomly select reference audio based on {SPEAKER} and {EMOTION}, configure it according to the following steps:
1. Create a ref_data folder in the root directory of the project.
2. Create a directory structure similar to the following within the ref_data folder.
.
├── SPEAKER1
│    ├── EMOTION1
│    │    ├── 21.15-26.44.lab
│    │    ├── 21.15-26.44.wav
│    │    ├── 27.51-29.98.lab
│    │    ├── 27.51-29.98.wav
│    │    ├── 30.1-32.71.lab
│    │    └── 30.1-32.71.flac
│    └── EMOTION2
│         ├── 30.1-32.71.lab
│         └── 30.1-32.71.mp3
└── SPEAKER2
     └── EMOTION3
          ├── 30.1-32.71.lab
          └── 30.1-32.71.mp3
That is, first place {SPEAKER} folders in ref_data, then place {EMOTION} folders under each speaker, and place any number of audio-text pairs under each emotion folder.
3. Run the following command in the virtual environment:
python tools/gen_ref.py
4. Call the API.
python -m tools.post_api \
--text "Text to be input" \
--speaker "${SPEAKER1}" \
--emotion "${EMOTION1}" \
--streaming True
The above example is for testing purposes only.
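Before running tools/gen_ref.py, it may be worth checking that the ref_data layout from step 2 is complete, that is, that every audio file has a matching .lab file and vice versa. The following is a minimal sketch under stated assumptions: the audio extensions are just the ones that appear in the example tree, and `check_ref_data` is a hypothetical helper, not part of the project.

```python
from pathlib import Path

# Assumption: only the extensions shown in the example tree count as audio.
AUDIO_EXTS = {".wav", ".flac", ".mp3"}


def check_ref_data(root="ref_data"):
    """Walk root/SPEAKER/EMOTION and pair each audio file with its .lab file.

    Returns (pairs, problems): pairs are (speaker, emotion, audio, lab) name
    tuples, problems are human-readable strings for unmatched files.
    """
    pairs, problems = [], []
    for speaker in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        for emotion in sorted(p for p in speaker.iterdir() if p.is_dir()):
            files = [f for f in emotion.iterdir() if f.is_file()]
            # Path.stem drops only the last suffix, so "30.1-32.71.wav" -> "30.1-32.71".
            audio = {f.stem: f for f in files if f.suffix.lower() in AUDIO_EXTS}
            labs = {f.stem: f for f in files if f.suffix.lower() == ".lab"}
            for stem in sorted(set(audio) | set(labs)):
                where = f"{speaker.name}/{emotion.name}/{stem}"
                if stem in audio and stem in labs:
                    pairs.append((speaker.name, emotion.name,
                                  audio[stem].name, labs[stem].name))
                elif stem in audio:
                    problems.append(f"{where}: missing .lab file")
                else:
                    problems.append(f"{where}: missing audio file")
    return pairs, problems


if __name__ == "__main__":
    if Path("ref_data").is_dir():
        ok, bad = check_ref_data("ref_data")
        print(f"{len(ok)} audio-text pairs found")
        for line in bad:
            print("WARNING:", line)
```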
WebUI Inference
You can start the WebUI using the following command:
python -m tools.webui \
--llama-checkpoint-path "checkpoints/fish-speech-1.2-sft" \
--decoder-checkpoint-path "checkpoints/fish-speech-1.2-sft/firefly-gan-vq-fsq-4x1024-42hz-generator.pth" \
--decoder-config-name firefly_gan_vq
!!! note
You can use Gradio environment variables, such as GRADIO_SHARE, GRADIO_SERVER_PORT, and GRADIO_SERVER_NAME, to configure the WebUI.
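As a rough illustration, the Gradio variables named above can be set from a launcher script. The sketch below builds the environment and reuses the WebUI command from this section; `webui_env` is a hypothetical helper, and the string values used for GRADIO_SHARE are an assumption about how Gradio parses that variable.

```python
import os

# WebUI command copied from this section.
WEBUI_CMD = [
    "python", "-m", "tools.webui",
    "--llama-checkpoint-path", "checkpoints/fish-speech-1.2-sft",
    "--decoder-checkpoint-path",
    "checkpoints/fish-speech-1.2-sft/firefly-gan-vq-fsq-4x1024-42hz-generator.pth",
    "--decoder-config-name", "firefly_gan_vq",
]


def webui_env(port=7860, host="127.0.0.1", share=False, base=None):
    """Return a copy of the environment with the Gradio settings applied."""
    env = dict(os.environ if base is None else base)
    env["GRADIO_SERVER_PORT"] = str(port)   # environment values must be strings
    env["GRADIO_SERVER_NAME"] = host        # "0.0.0.0" exposes it on the network
    env["GRADIO_SHARE"] = "True" if share else "False"  # assumed accepted spelling
    return env


if __name__ == "__main__":
    env = webui_env(port=7860)
    print({k: v for k, v in env.items() if k.startswith("GRADIO_")})
    # To actually launch the WebUI from the repository root:
    #   import subprocess
    #   subprocess.run(WEBUI_CMD, env=env, check=True)
```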
Enjoy!