---
license: cc-by-nc-4.0
tags:
- audio-to-audio
pipeline_tag: audio-to-audio
---
[arXiv:2502.04128](https://arxiv.org/abs/2502.04128)
**Update (2025-02-13):** Added the [Llasa fine-tuning instructions](https://github.com/zhenye234/LLaSA_training/tree/main/finetune).
**Update (2025-02-07):** Our paper has been released!
## Papers
- LLaSA: Scaling Train Time and Inference Time Compute for LLaMA based Speech Synthesis
- Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model (AAAI 2025, XCodec 1.0)
# Getting Started with XCodec2 on Hugging Face
XCodec2 is a speech tokenizer that offers the following key features:
1. **Single Vector Quantization**
2. **50 Tokens per Second**
3. **Multilingual Speech Semantic Support and High-Quality Speech Reconstruction**
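Because XCodec2 uses a single codebook at a fixed 50 tokens per second, the token budget for a clip is easy to estimate. A minimal sketch (the helper name is ours, not part of the `xcodec2` API):

```python
def xcodec2_token_count(duration_s: float, tokens_per_second: int = 50) -> int:
    """Estimate the number of XCodec2 tokens produced for a clip.

    With a single vector-quantized codebook, each frame is exactly one
    token, so the count is just duration times the frame rate.
    """
    return round(duration_s * tokens_per_second)


# A 10-second utterance yields about 500 tokens.
print(xcodec2_token_count(10.0))  # 500
```

This single-token-per-frame property is what makes the codec convenient as a speech tokenizer for LLM-style training: sequence length grows linearly and predictably with audio duration.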
To use `xcodec2`, first install it:
```bash
conda create -n xcodec2 python=3.9
conda activate xcodec2
pip install xcodec2==0.1.5
```
Use `xcodec2==0.1.5` for codec inference and Llasa fine-tuning; unnecessary dependencies have been removed, and it works fine in my testing, although other problems may still arise. If you prefer more stability, use `xcodec2==0.1.3`, which exactly matches the environment used during codec training.
Then, encode and reconstruct a speech clip:
```python
import torch
import soundfile as sf
from xcodec2.modeling_xcodec2 import XCodec2Model

model_path = "HKUSTAudio/xcodec2"
model = XCodec2Model.from_pretrained(model_path)
model.eval().cuda()

wav, sr = sf.read("test.wav")  # 16 kHz speech only
wav_tensor = torch.from_numpy(wav).float().unsqueeze(0)  # Shape: (1, T)

with torch.no_grad():
    # Only single inputs are supported; for batch inference, see the link below.
    vq_code = model.encode_code(input_waveform=wav_tensor)
    print("Code:", vq_code)
    recon_wav = model.decode_code(vq_code).cpu()  # Shape: (1, 1, T')

sf.write("reconstructed.wav", recon_wav[0, 0, :].numpy(), sr)
print("Done! Check reconstructed.wav")
```
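Since the model only accepts 16 kHz mono speech, other input files should be downmixed and resampled first. A minimal NumPy sketch (the helper is ours; the linear interpolation is a rough stand-in, and a proper resampler such as `torchaudio` or `librosa` is preferable for quality):

```python
import numpy as np

TARGET_SR = 16_000  # XCodec2 expects 16 kHz speech


def to_16k_mono(wav: np.ndarray, sr: int) -> np.ndarray:
    """Downmix multichannel audio to mono and resample to 16 kHz.

    soundfile returns multichannel audio as (T, C), so we average over
    the channel axis, then linearly interpolate to the target rate.
    """
    if wav.ndim == 2:
        wav = wav.mean(axis=1)
    if sr != TARGET_SR:
        n_out = int(round(len(wav) * TARGET_SR / sr))
        x_old = np.linspace(0.0, 1.0, num=len(wav), endpoint=False)
        x_new = np.linspace(0.0, 1.0, num=n_out, endpoint=False)
        wav = np.interp(x_new, x_old, wav)
    return wav.astype(np.float32)


# Example: 1 second of 48 kHz stereo becomes 16,000 mono samples.
stereo = np.random.randn(48_000, 2)
mono16k = to_16k_mono(stereo, 48_000)
print(mono16k.shape)  # (16000,)
```

The resulting array can then be wrapped with `torch.from_numpy(...).float().unsqueeze(0)` exactly as in the example above.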
If you want to train your own XCodec2, run batch inference, or do large-scale code extraction, the code is released [here](https://github.com/zhenye234/X-Codec-2.0).