---
license: cc-by-nc-4.0
tags:
- audio-to-audio
pipeline_tag: audio-to-audio
---

[![arXiv](https://img.shields.io/badge/arXiv-Paper-b31b1b.svg)](https://arxiv.org/abs/2502.04128)

**Update (2025-02-13):** Added the [Llasa fine-tuning instructions](https://github.com/zhenye234/LLaSA_training/tree/main/finetune).

**Update (2025-02-07):** Our paper has been released!


## Papers

LLaSA: Scaling Train Time and Inference Time Compute for LLaMA based Speech Synthesis

Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model (AAAI 2025, xcodec 1.0)


# Getting Started with XCodec2 on Hugging Face
XCodec2 is a speech tokenizer that offers the following key features:

1. **Single Vector Quantization**
2. **50 Tokens per Second**
3. **Multilingual Speech Semantic Support and High-Quality Speech Reconstruction**
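
A practical implication of features 1 and 2: because there is a single codebook at a fixed 50 frames per second, the token budget for a clip is simply its duration times 50. The sketch below illustrates this arithmetic; `estimated_token_count` is a hypothetical helper for illustration, not part of the `xcodec2` package.

```python
TOKENS_PER_SECOND = 50  # XCodec2's fixed frame rate (single codebook)

def estimated_token_count(duration_seconds: float) -> int:
    """Number of codec tokens produced for a clip of the given length."""
    return int(duration_seconds * TOKENS_PER_SECOND)

print(estimated_token_count(10.0))  # a 10-second clip -> 500 tokens
```

This is handy when budgeting LLM context length for speech-synthesis fine-tuning, since each second of audio costs exactly 50 tokens.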


To use `xcodec2`, ensure you have it installed. You can install it with the following commands:

```bash
conda create -n xcodec2 python=3.9
conda activate xcodec2
pip install xcodec2==0.1.5
```

Use `xcodec2==0.1.5` for codec inference and Llasa fine-tuning: unnecessary dependencies have been removed, and it works fine in our testing, though other problems may still arise. If you prefer more stability, use `xcodec2==0.1.3`, which exactly matches the environment used during codec training.
Then,
```python
import torch
import soundfile as sf

from xcodec2.modeling_xcodec2 import XCodec2Model

model_path = "HKUSTAudio/xcodec2"

model = XCodec2Model.from_pretrained(model_path)
model.eval().cuda()

wav, sr = sf.read("test.wav")  # expects 16 kHz mono speech
wav_tensor = torch.from_numpy(wav).float().unsqueeze(0)  # Shape: (1, T)

with torch.no_grad():
    # Only 16 kHz speech is supported.
    # Only single inputs are supported; for batch inference, see the repository linked below.
    vq_code = model.encode_code(input_waveform=wav_tensor)
    print("Code:", vq_code)

    recon_wav = model.decode_code(vq_code).cpu()  # Shape: (1, 1, T')

sf.write("reconstructed.wav", recon_wav[0, 0, :].numpy(), sr)
print("Done! Check reconstructed.wav")
```
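
The snippet above assumes `test.wav` is already 16 kHz mono. If your audio is stereo or at another sample rate, you need to preprocess it first. Below is a minimal numpy-only sketch; the `preprocess` helper is hypothetical, and its linear-interpolation resampling is a rough approximation — for production quality, use a proper resampler such as `torchaudio.functional.resample` or `librosa.resample`.

```python
import numpy as np

TARGET_SR = 16000  # XCodec2 expects 16 kHz input

def preprocess(wav: np.ndarray, sr: int) -> np.ndarray:
    """Downmix to mono and resample to 16 kHz (crude linear interpolation)."""
    if wav.ndim == 2:            # (T, channels), as returned by soundfile
        wav = wav.mean(axis=1)   # simple mono downmix
    if sr != TARGET_SR:
        n_out = int(round(len(wav) * TARGET_SR / sr))
        x_old = np.linspace(0.0, 1.0, num=len(wav), endpoint=False)
        x_new = np.linspace(0.0, 1.0, num=n_out, endpoint=False)
        wav = np.interp(x_new, x_old, wav)
    return wav.astype(np.float32)
```

You would call this between `sf.read` and building `wav_tensor`, e.g. `wav = preprocess(wav, sr)` followed by `sr = 16000`.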

If you want to train your own XCodec2, run batch inference, or do large-scale code extraction, the code is released [here](https://github.com/zhenye234/X-Codec-2.0).