# TinyWave Base Speech 2B

**TinyWave Base Speech 2B** is a compact speech-to-speech generation model distilled from the 7B SPIRIT-LM-Base teacher. It uses HuBERT-based phonetic tokens for efficient, high-quality speech generation and is optimized for **fast inference** on **commodity hardware**.

This model focuses on generating semantically coherent speech continuations without expressive modulation (e.g., pitch/style tokens). It is ideal for **low-resource speech agents**, **instruction-following speech bots**, and **embedded systems**.

> 📖 See the [TinyWave paper (arXiv:2506.23670)](https://arxiv.org/abs/2506.23670) and [demo site](https://mohammadmahdinoori.github.io/tinywave-landing/) for more details.

---

## 🔧 Usage

This model requires **SPIRIT-LM's base speech tokenizer**, which uses HuBERT units without pitch/style tokens.

### 1. Clone SPIRIT-LM and Install Requirements

```bash
git clone https://github.com/facebookresearch/spiritlm
cd spiritlm
pip install -e '.[eval]'
```

---

### 2. Load Tokenizer

```python
from spiritlm.speech_tokenizer import spiritlm_base
speech_tokenizer = spiritlm_base()
```

---

### 3. Inference Code (Speech-to-Speech)

```python
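from transformers import LlamaForCausalLM, AutoTokenizer
from spiritlm.speech_tokenizer import spiritlm_base
import torchaudio
import torch

# Load model and tokenizer
MODEL_PATH = "tinywave/speech-base-2b"
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = LlamaForCausalLM.from_pretrained(MODEL_PATH, device_map="auto", torch_dtype=torch.bfloat16)

# Load base speech tokenizer
speech_tokenizer = spiritlm_base()

def get_inference(audio_path):
    # Load the prompt, downmix to mono, and resample to the 16 kHz rate HuBERT expects
    audio, sample_rate = torchaudio.load(audio_path)
    audio = audio.mean(dim=0, keepdim=True)
    if sample_rate != 16000:
        audio = torchaudio.functional.resample(audio, sample_rate, 16000)
    input_values = audio.view(1, 1, -1).to(speech_tokenizer.hubert_model.device).float()
    # Encode the waveform into a string of HuBERT unit tokens
    tokens = speech_tokenizer.encode_string(input_values)
    input_ids = tokenizer(tokens, return_tensors="pt").input_ids.to(model.device)
    # Sample a continuation; the returned sequence includes the prompt tokens
    output = model.generate(input_ids, max_new_tokens=256, top_p=0.9, temperature=0.9, do_sample=True)
    return tokenizer.decode(output[0])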
```
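
The string returned by `get_inference` still contains special tokens and spaces, which step 4 strips before decoding. The sketch below wraps that cleanup in a helper and adds a quick unit count as a sanity check. It assumes the base tokenizer writes units as `[Hu<number>]`; the helper names `clean_generated_units` and `count_units` are hypothetical and not part of SPIRIT-LM or TinyWave.

```python
import re

# Assumption: SPIRIT-LM base units are written as "[Hu<number>]", e.g. "[Hu99]".
UNIT_PATTERN = re.compile(r"\[Hu\d+\]")
SPECIAL_TOKENS = ("<s>", "</s>")


def clean_generated_units(text: str) -> str:
    """Remove special tokens and all whitespace from a generated string."""
    for token in SPECIAL_TOKENS:
        text = text.replace(token, "")
    # Unlike replace(" ", ""), this also drops tabs and newlines.
    return "".join(text.split())


def count_units(text: str) -> int:
    """Count HuBERT unit tokens in a string (hypothetical sanity check)."""
    return len(UNIT_PATTERN.findall(text))


if __name__ == "__main__":
    raw = "<s> [Hu99] [Hu49] [Hu38]</s>"
    units = clean_generated_units(raw)
    print(units, count_units(units))
```

A unit count of zero after cleaning would suggest that the model produced no speech tokens, which is worth checking before calling the decoder.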

---

### 4. Decode to WAV

```python
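import numpy as np
from scipy.io.wavfile import write

def save_array_to_wav_int16(audio_array: np.ndarray, sampling_rate=16000, filename="output.wav"):
    # Peak-normalize to the int16 range; the floor avoids dividing by zero on silent audio
    peak = np.max(np.abs(audio_array))
    scaled = np.int16(audio_array / max(peak, 1e-8) * 32767)
    write(filename, sampling_rate, scaled)

# Generate a continuation with get_inference from step 3
generated_output = get_inference("simple_prompt.wav")

# Strip spaces and special tokens so only the speech units remain, then synthesize audio
units = generated_output.replace(" ", "").replace("<s>", "").replace("</s>", "")
decoded_audio = speech_tokenizer.decode(units, speaker_id=2)
save_array_to_wav_int16(decoded_audio, filename="generated.wav")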
```
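
If SciPy is not available on the target device, the same peak normalization and 16-bit export can be done with the Python standard library alone. This is a minimal sketch under that assumption: `write_wav_int16` is a hypothetical helper, it writes mono audio only, and it expects a plain sequence of floats (for a NumPy array, pass `decoded_audio.tolist()`).

```python
import struct
import wave
from typing import Sequence


def write_wav_int16(samples: Sequence[float], filename: str, sampling_rate: int = 16000) -> None:
    """Peak-normalize float samples and write them as 16-bit mono PCM."""
    peak = max((abs(s) for s in samples), default=0.0)
    # Silent or empty input has no peak to normalize against, so it stays silent.
    scale = 32767.0 / peak if peak > 0 else 0.0
    pcm = [int(round(s * scale)) for s in samples]
    with wave.open(filename, "wb") as wav_file:
        wav_file.setnchannels(1)      # mono
        wav_file.setsampwidth(2)      # 2 bytes per sample = int16
        wav_file.setframerate(sampling_rate)
        wav_file.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))
```

The explicit check on `peak` keeps an all-zero waveform from triggering a division by zero.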

---

## πŸ—£οΈ Inference Example

### 🎧 Basic Speech Continuation

* Input: `simple_prompt.wav`
* Output: Semantically consistent speech continuation without expressive variation.

---

## 🧠 Model Details

| Feature             | Description                                      |
| ------------------- | ------------------------------------------------ |
| Architecture        | 2B parameter distilled transformer               |
| Tokenizer           | SPIRIT-LM Base (HuBERT phonetic tokens)          |
| Input Type          | Discrete HuBERT tokens only (speech-only)        |
| Output Type         | Discrete audio tokens                            |
| Teacher Model       | SPIRIT-LM-Base 7B                                |
| Tasks               | Speech continuation                              |
| Distillation Method | Layer-aligned (hidden states, attention, logits) |
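
The table lists logits among the signals aligned during distillation. One common way to match logits in knowledge distillation is a temperature-softened KL divergence between teacher and student distributions. The sketch below illustrates that generic formulation in plain Python; it is not the TinyWave training loss, and the function names and the default temperature are hypothetical.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable log-softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    shift = max(scaled)
    log_total = math.log(sum(math.exp(x - shift) for x in scaled))
    return [x - shift - log_total for x in scaled]


def logit_distillation_loss(
    student_logits: Sequence[float],
    teacher_logits: Sequence[float],
    temperature: float = 2.0,  # hypothetical value, chosen for illustration
) -> float:
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    log_p = log_softmax(teacher_logits, temperature)
    log_q = log_softmax(student_logits, temperature)
    kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return kl * temperature ** 2
```

The loss is zero when the student reproduces the teacher's logits up to a constant shift and grows as the two distributions diverge.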

---

## 📎 Citation

```bibtex
@article{nouriborji2025tinywave,
  title={Efficient Interleaved Speech Modeling through Knowledge Distillation},
  author={Nouriborji, Mohammadmahdi and Rohanian, Morteza},
  journal={arXiv preprint arXiv:2506.23670},
  year={2025}
}
```

---

## 📂 Resources

* 🔗 [Project Page](https://mohammadmahdinoori.github.io/tinywave-landing/)
* 💬 [Demo Samples](https://mohammadmahdinoori.github.io/tinywave-landing/#samples)
* 🧠 [Training & Codebase](https://github.com/mohammadmahdinoori/TinyWave)