From Llasa to Llasagna 🍕: Finetuning LLaSA to generate speech in Italian and other languages
LLaSA (“LLaMA-based Speech Synthesis”) is a framework originally designed for large-scale text-to-speech synthesis built on LLaMA. As detailed in the original article, the project began with a training pipeline developed by zhenye234. Building on a notable fork by SebastianBodza, we made some further modifications to the script and recently finetuned the model to generate natural-sounding Italian and German speech.
1. Background: Llasa, a Unified LLM-based TTS Framework
Llasa is a Transformer-based text-to-speech system designed to fully align with LLM paradigms like Llama. Unlike conventional TTS systems that rely on separate acoustic models, prosody predictors, and vocoders, Llasa uses a single Transformer trained in an autoregressive next-token prediction framework—similar to how LLMs process text. This results in a simpler, more scalable, and highly flexible architecture that can be fine-tuned for various speech synthesis tasks.
1.1 The Role of Xcodec2 in Llasa
One of the key innovations behind Llasa is its speech tokenizer, known as Xcodec2. This component converts raw audio waveforms into discrete speech tokens, allowing the Transformer to model speech as it does text. Unlike traditional audio codecs that use multi-layer vector quantization (VQ), Xcodec2 employs a single-layer vector quantizer for efficient, causal, and autoregressive speech token modeling. This approach ensures that all speech characteristics—including content, prosody, and timbre—are captured within the tokenized representation, enabling high-quality speech synthesis.
If you are interested in Xcodec, I wrote a small X thread detailing the architecture and the benchmarks.
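To make the tokenizer's role concrete, here is a minimal round-trip sketch using the xcodec2 Python package. The checkpoint repo id and the 16 kHz mono input file are assumptions here, so adjust them to your setup; the point is simply that one waveform becomes one stream of discrete tokens and back.

import torch
import soundfile as sf
from xcodec2.modeling_xcodec2 import XCodec2Model

# Load the pretrained codec (repo id assumed; runs on CPU here, move to GPU for speed)
codec = XCodec2Model.from_pretrained("HKUST-Audio/xcodec2").eval()

# Load a 16 kHz mono waveform and add a batch dimension: (1, num_samples)
wav, sr = sf.read("sample_16khz.wav")
wav = torch.from_numpy(wav).float().unsqueeze(0)

with torch.no_grad():
    codes = codec.encode_code(input_waveform=wav)  # discrete speech tokens, shape (1, 1, T)
    recon = codec.decode_code(codes)               # waveform reconstructed from the tokens

sf.write("reconstructed.wav", recon[0, 0].cpu().numpy(), 16000)
print(codes.shape)  # a single token stream, thanks to the single-layer quantizer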
1.2 Scaling Train-Time Compute: Bigger Models, Better Speech
Llasa comes in increasing model sizes (1B → 3B → 8B parameters) trained on increasing amounts of data (80k → 250k hours), which leads to better alignment with human-like speech patterns. Larger models capture deeper semantic understanding, allowing them to generate more emotionally expressive and contextually accurate speech.
2. Preprocessing and Audio Tokenization with xcodec2
2.1 Requirements
To get started with the preprocessing pipeline, simply clone the repository using the following command:
git clone https://github.com/Deep-unlearning/LLaSA_training.git
Python Packages:
torch transformers wandb datasets accelerate>=0.26.0 deepspeed xcodec2==0.1.3 flash-attn liger_kernel bitsandbytes
2.2 Hardware Requirements
GPU VRAM: A minimum of 24 GB of VRAM is recommended for fine-tuning smaller models (e.g., 1B parameters), while larger models (e.g., 3B or 8B) may require even more VRAM.
2.3 Preprocessing with create_dataset.py
A dedicated preprocessing script, create_dataset.py, automates the conversion of raw audio files into tokenized representations. It handles the following tasks:
- Audio Conversion: It loads raw audio recordings and applies the xcodec2 codec (with the recommended version xcodec2==0.1.3) to compress each waveform into a concise sequence of tokens. During this stage, the codec extracts important acoustic features, such as pitch, tone, and rhythm, so that the essential characteristics of the audio signal are retained.
- Data Alignment and Formatting: Once the audio is tokenized, the script aligns these tokens with the corresponding text data. By combining the speech tokens with text tokens (using special boundary markers like <|TEXT_GENERATION_START|> and <|SPEECH_GENERATION_START|>), the pipeline creates a unified dataset. This integration is crucial, as it allows the model to learn the mapping between the textual input and its corresponding speech output in a seamless, end-to-end manner.
You can use your own dataset (any dataset on the Hugging Face Hub works):
python create_dataset.py \
    --dataset_name your_dataset \
    --output_dir output
2.4 Audio Tokenization with xcodec2
The xcodec2 codec is at the heart of the tokenization process:
- Compression and Encoding: xcodec2 efficiently compresses the audio by transforming the continuous speech waveform into a discrete sequence of tokens. This encoding process captures the salient acoustic details in a compact form, reducing the computational burden during training while preserving the quality of the original audio.
- Unified Token Representation: After encoding, the resulting speech tokens are remapped by adding an offset to their numerical values. This offset is typically set to the size of the text vocabulary plus a few additional special tokens. By doing so, both text and speech tokens are integrated into a single token space. This unified representation enables the model to learn cross-modal correlations effectively, as it can process a single sequence that contains both the text input and the associated speech output.
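As a rough illustration of that remapping, here is a short sketch. The numbers are placeholders (the real vocabulary size, special-token count, and offset come from the tokenizer and codec actually used in the pipeline):

# Illustrative sketch of the token-space unification described above.
text_vocab_size = 128_256          # e.g. the size of a Llama 3 text vocabulary
num_special_tokens = 8             # boundary markers such as <|SPEECH_GENERATION_START|>
speech_offset = text_vocab_size + num_special_tokens

text_token_ids = [9906, 1917, 0]          # ids produced by the text tokenizer
speech_codes = [412, 9031, 77, 65012]     # raw xcodec2 codes (0 .. codebook_size - 1)
speech_token_ids = [c + speech_offset for c in speech_codes]

# Text and speech now share one vocabulary, so a single sequence can hold both
# and be trained with ordinary next-token prediction.
unified_sequence = text_token_ids + speech_token_ids
print(unified_sequence)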
3. Finetuning Llasa
One of the improvements in our updated script is the switch to AutoLigerKernelForCausalLM, together with an 8-bit Adam optimizer from bitsandbytes. This duo not only boosts performance, thanks to features like flash attention and Liger's fused kernels, but also helps squeeze every bit of efficiency out of your hardware.
import torch
import bitsandbytes as bnb
from liger_kernel.transformers import AutoLigerKernelForCausalLM

model = AutoLigerKernelForCausalLM.from_pretrained(
    model_args.llm_model_name_or_path,
    cache_dir=model_args.cache_dir,
    attn_implementation="flash_attention_2",  # fast, memory-efficient attention computation
    torch_dtype=torch.bfloat16,
)

# The optimizer setup using bitsandbytes' 8-bit Adam saves memory and speeds up training.
adam_8bit_optim = bnb.optim.Adam8bit(
    optimizer_grouped_parameters,
    betas=(training_args.adam_beta1, training_args.adam_beta2),
    eps=training_args.adam_epsilon,
    lr=training_args.learning_rate,
)
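The snippet above references optimizer_grouped_parameters without defining it. A common way to build it is the usual convention of excluding biases and norm weights from weight decay; the sketch below follows that convention and may differ from the exact grouping in train_tts.py:

no_decay = ("bias", "norm")  # substrings marking parameters that skip weight decay
optimizer_grouped_parameters = [
    {
        "params": [p for n, p in model.named_parameters()
                   if p.requires_grad and not any(k in n.lower() for k in no_decay)],
        "weight_decay": training_args.weight_decay,
    },
    {
        "params": [p for n, p in model.named_parameters()
                   if p.requires_grad and any(k in n.lower() for k in no_decay)],
        "weight_decay": 0.0,
    },
]

The resulting optimizer can then be handed to the Hugging Face Trainer through its optimizers=(adam_8bit_optim, None) argument, so the rest of the training loop stays unchanged.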
So even if your GPU wallet is looking a bit "poor," these enhancements ensure you're still getting top-tier performance without breaking the bank!
We used the Italian subset of CML-TTS to fine-tune our Llasagna! You can run the training script train_tts.py simply with Python, or with torchrun if you have multiple GPUs. All key hyperparameters are specified in the config.json file, so feel free to tweak them to suit your setup.
Come try the model!
You can try Llasagna here: https://huggingface.co/spaces/Steveeeeeeen/Llasagna-1b-tts
And Kartoffel here: https://huggingface.co/spaces/SebastianBodza/Kartoffel-1B-v0.1-llasa-1b-tts
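If you would rather run the model locally than in a Space, the sketch below follows the inference recipe from the original Llasa model cards: generate <|s_NNN|> speech tokens with the LLM, then decode them with xcodec2. The repo ids, special-token names, and sampling settings are assumptions carried over from those cards, so double-check them against the checkpoint you actually use.

import torch
import soundfile as sf
from transformers import AutoTokenizer, AutoModelForCausalLM
from xcodec2.modeling_xcodec2 import XCodec2Model

llm_id = "HKUSTAudio/Llasa-1B"     # placeholder: swap in the fine-tuned checkpoint you want
codec_id = "HKUST-Audio/xcodec2"   # placeholder: the codec used by Llasa

tokenizer = AutoTokenizer.from_pretrained(llm_id)
model = AutoModelForCausalLM.from_pretrained(llm_id, torch_dtype=torch.bfloat16).eval().cuda()
codec = XCodec2Model.from_pretrained(codec_id).eval().cuda()

text = "Ciao! Benvenuti alla nostra Llasagna."
formatted = f"<|TEXT_UNDERSTANDING_START|>{text}<|TEXT_UNDERSTANDING_END|>"
chat = [
    {"role": "user", "content": "Convert the text to speech:" + formatted},
    {"role": "assistant", "content": "<|SPEECH_GENERATION_START|>"},
]
input_ids = tokenizer.apply_chat_template(
    chat, tokenize=True, return_tensors="pt", continue_final_message=True
).cuda()

with torch.no_grad():
    out = model.generate(
        input_ids,
        max_length=2048,
        eos_token_id=tokenizer.convert_tokens_to_ids("<|SPEECH_GENERATION_END|>"),
        do_sample=True,
        temperature=0.8,
        top_p=1.0,
    )

# Drop the prompt and the end token, then map strings like <|s_23456|> back to codec codes.
speech_tokens = tokenizer.convert_ids_to_tokens(out[0][input_ids.shape[1]:-1].tolist())
codes = [int(t[4:-2]) for t in speech_tokens if t.startswith("<|s_") and t.endswith("|>")]
codes = torch.tensor(codes).cuda().unsqueeze(0).unsqueeze(0)

wav = codec.decode_code(codes)  # (1, 1, num_samples) waveform at 16 kHz
sf.write("llasagna.wav", wav[0, 0].cpu().numpy(), 16000)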
What's next?
With the Llasa-1b-multilingual model already released, the possibilities are endless! What additional languages, features, or improvements would you like to see? We invite you to share your ideas and join us in exploring new frontiers in text-to-speech synthesis. Let's build the future of multilingual TTS together!
Acknowledgment
- SebastianBodza for his invaluable contributions to this project. His improvements in streamlining the training process have played a key role in enhancing our LLaSA finetuning pipeline.