# TinyWave Base Speech 2B
**TinyWave Base Speech 2B** is a compact speech-to-speech generation model distilled from the 7B SPIRIT-LM-Base teacher. It uses HuBERT-based phonetic tokens for efficient, high-quality speech generation and is optimized for **fast inference** on **commodity hardware**.
This model focuses on generating semantically coherent speech continuations without expressive modulation (e.g., pitch/style tokens). It is ideal for **low-resource speech agents**, **instruction-following speech bots**, and **embedded systems**.
> See the [TinyWave paper (arXiv:2506.23670)](https://arxiv.org/abs/2506.23670) and [demo site](https://mohammadmahdinoori.github.io/tinywave-landing/) for more details.
---
## Usage
This model requires **SPIRIT-LM's base speech tokenizer**, which uses HuBERT units without pitch/style tokens.
### 1. Clone SPIRIT-LM and Install Requirements
```bash
git clone https://github.com/facebookresearch/spiritlm
cd spiritlm
pip install -e '.[eval]'
```
---
### 2. Load Tokenizer
```python
from spiritlm.speech_tokenizer import spiritlm_base
speech_tokenizer = spiritlm_base()
```
---
### 3. Inference Code (Speech-to-Speech)
```python
from transformers import LlamaForCausalLM, AutoTokenizer
from spiritlm.speech_tokenizer import spiritlm_base
import torchaudio
import torch

# Load model and tokenizer
MODEL_PATH = "tinywave/speech-base-2b"
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = LlamaForCausalLM.from_pretrained(MODEL_PATH, device_map="auto", torch_dtype=torch.bfloat16)

# Load base speech tokenizer
speech_tokenizer = spiritlm_base()

def get_inference(audio_path):
    # Load the prompt waveform; the file's sample rate is ignored here
    audio, _ = torchaudio.load(audio_path)
    # Reshape to (batch, channel, samples) on the HuBERT model's device
    input_values = audio.view(1, 1, -1).to(speech_tokenizer.hubert_model.device).float()
    # Encode the waveform into a string of HuBERT unit tokens
    tokens = speech_tokenizer.encode_string(input_values)
    input_ids = tokenizer(tokens, return_tensors="pt").input_ids.to(model.device)
    # Sample a continuation of up to 256 new tokens
    output = model.generate(input_ids, max_new_tokens=256, top_p=0.9, temperature=0.9, do_sample=True)
    return tokenizer.decode(output[0])
```
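The string returned by `get_inference` still carries text-tokenizer artifacts (spaces and the `<s>`/`</s>` markers) around the speech units. The sketch below strips them and lists the unit IDs for inspection. It assumes units are written as `[Hu<number>]`, which is our understanding of SPIRIT-LM's string format rather than something this card states, and `clean_units` and `unit_ids` are hypothetical helper names.

```python
import re

# Assumed unit format: "[Hu<digits>]". Adjust the pattern if your tokenizer differs.
HU_PATTERN = re.compile(r"\[Hu(\d+)\]")

def clean_units(generated: str) -> str:
    """Remove spaces and <s>/</s> markers from a generated unit string."""
    for junk in ("<s>", "</s>", " "):
        generated = generated.replace(junk, "")
    return generated

def unit_ids(generated: str) -> list[int]:
    """Return the HuBERT unit IDs in order of appearance."""
    return [int(m) for m in HU_PATTERN.findall(clean_units(generated))]

sample = "<s> [Hu12] [Hu7][Hu301] </s>"
print(clean_units(sample))  # [Hu12][Hu7][Hu301]
print(unit_ids(sample))     # [12, 7, 301]
```

Counting the IDs gives a quick sanity check that generation produced speech units at all before the string is passed to the speech decoder.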
---
### 4. Decode to WAV
```python
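import numpy as np
from scipy.io.wavfile import write

def save_array_to_wav_int16(audio_array: np.ndarray, sampling_rate=16000, filename="output.wav"):
    # Peak-normalize to the int16 range before writing
    scaled = np.int16(audio_array / np.max(np.abs(audio_array)) * 32767)
    write(filename, sampling_rate, scaled)

# Generate a continuation with get_inference from step 3
generated_output = get_inference("simple_prompt.wav")

# Strip spaces and <s>/</s> markers, then synthesize audio from the remaining units
decoded_audio = speech_tokenizer.decode(generated_output.replace(" ", "").replace("<s>", "").replace("</s>", ""), speaker_id=2)
save_array_to_wav_int16(decoded_audio, filename="generated.wav")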
```
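The peak normalization above divides by the largest absolute sample, so an all-zero (silent) array would trigger a division by zero. As a hedged alternative, here is a minimal standard-library sketch that guards against that case and writes mono 16-bit PCM without SciPy. The name `write_wav_int16` is hypothetical, and it rounds to the nearest integer where the NumPy version truncates.

```python
import struct
import wave

def write_wav_int16(samples, sampling_rate=16000, filename="output.wav"):
    """Peak-normalize float samples and write them as mono 16-bit PCM."""
    peak = max((abs(s) for s in samples), default=0.0)
    scale = 32767 / peak if peak > 0 else 0.0  # silent input stays silent
    pcm = [int(round(s * scale)) for s in samples]
    with wave.open(filename, "wb") as wav:
        wav.setnchannels(1)              # mono
        wav.setsampwidth(2)              # 2 bytes = 16 bits per sample
        wav.setframerate(sampling_rate)
        # Little-endian signed 16-bit integers, one per sample
        wav.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))
    return pcm
```

If `decoded_audio` is a 1-D NumPy array, something like `write_wav_int16(decoded_audio.tolist(), filename="generated.wav")` should work in place of the SciPy-based helper.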
---
## Inference Example
### Basic Speech Continuation
Input: `simple_prompt.wav`
Output: Semantically consistent speech continuation without expressive variation.
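The inference code reshapes the loaded waveform without resampling or downmixing, so a prompt that is not mono 16 kHz may tokenize poorly. The sketch below checks a file up front using only the standard library. The 16 kHz mono expectation is an assumption based on HuBERT's usual input format and the 16 kHz default used when writing the output WAV, and `describe_wav` and `looks_like_valid_prompt` are hypothetical helper names.

```python
import wave

def describe_wav(path):
    """Return (sample_rate_hz, channels, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        duration = wav.getnframes() / rate
    return rate, channels, duration

def looks_like_valid_prompt(path, expected_rate=16000):
    """True if the file is mono at the expected rate (an assumed requirement)."""
    rate, channels, _ = describe_wav(path)
    return rate == expected_rate and channels == 1
```

For example, `looks_like_valid_prompt("simple_prompt.wav")` returning `False` would suggest resampling or downmixing the prompt before calling `get_inference`. Note that the `wave` module only reads uncompressed PCM WAV files.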
---
## Model Details
| Feature | Description |
| ------------------- | ------------------------------------------------ |
| Architecture | 2B parameter distilled transformer |
| Tokenizer | SPIRIT-LM Base (HuBERT phonetic tokens) |
| Input Type | Discrete HuBERT tokens only (speech-only) |
| Output Type | Discrete audio tokens |
| Teacher Model | SPIRIT-LM-Base 7B |
| Tasks | Speech continuation |
| Distillation Method | Layer-aligned (hidden states, attention, logits) |
---
## Citation
```bibtex
@article{nouriborji2025tinywave,
title={Efficient Interleaved Speech Modeling through Knowledge Distillation},
author={Nouriborji, Mohammadmahdi and Rohanian, Morteza},
journal={arXiv preprint arXiv:2506.23670},
year={2025}
}
```
---
## Resources
* [Project Page](https://mohammadmahdinoori.github.io/tinywave-landing/)
* [Demo Samples](https://mohammadmahdinoori.github.io/tinywave-landing/#samples)
* [Training & Codebase](https://github.com/mohammadmahdinoori/TinyWave)