---
library_name: transformers
license: mit
datasets:
- openslr/librispeech_asr
- slprl/SpokenSwag
- slprl/sTinyStories
base_model:
- Qwen/Qwen2.5-0.5B
pipeline_tag: audio-to-audio
---

# Model Card for SLAM

This is a Speech Language Model trained for generating speech continuations over discrete [Hubert tokens](https://huggingface.co/slprl/mhubert-base-25hz).


## Model Details

### Model Description
This is a Speech Language Model, introduced in "[_Slamming_: Training a Speech Language Model on One GPU in a Day](https://arxiv.org/abs/2502.15814)", which focuses on efficient training. 
It was fine-tuned from [Qwen/Qwen2.5-0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B) over a vocabulary of 500 speech tokens extracted from 
the 11th layer of [mhubert-25hz](https://huggingface.co/slprl/mhubert-base-25hz). For a stronger version of the model trained with 
slightly more compute (2×A100 for 2 days), see [slam_scaled](https://huggingface.co/slprl/slam_scaled).

The model was first trained with next-token prediction over a subset of LibriSpeech, Libri-Light and the synthetic dataset 
[sTinyStories](https://huggingface.co/datasets/slprl/sTinyStories). It was then trained with DPO over 
[SpokenSwag](https://huggingface.co/datasets/slprl/SpokenSwag).

- **Developed by:** [SLP-RL](https://huggingface.co/slprl)
- **Model type:** SpeechLM
- **License:** MIT
- **Finetuned from model:** [Qwen/Qwen2.5-0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B)

### Model Sources

- **Repository:** [https://github.com/slp-rl/slamkit](https://github.com/slp-rl/slamkit)
- **Paper:** [https://arxiv.org/abs/2502.15814](https://arxiv.org/abs/2502.15814)
- **Demo:** [https://pages.cs.huji.ac.il/adiyoss-lab/slamming/](https://pages.cs.huji.ac.il/adiyoss-lab/slamming/)

## Uses
This is a base SpeechLM and as such can be used to generate continuations for speech segments, or as a base for further tuning. See the _SlamKit_
[codebase](https://github.com/slp-rl/slamkit) for more details on usage, and check out the [demo page](https://pages.cs.huji.ac.il/adiyoss-lab/slamming/) for some generation examples.

### Out-of-Scope Use
This model was trained on curated speech datasets that mainly contain audiobooks and stories; as such, its outputs should not be treated as factual in any way.



## How to Get Started with the Model
We refer users to the official repository for full usage explanations: [github](https://github.com/slp-rl/slamkit). A minimal loading sketch is shown below.
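
Since the checkpoint is published in 🤗 Transformers / Safetensors format, it can in principle be loaded like a standard causal LM. The snippet below is a minimal sketch under that assumption, not the official API: the repo id and unit ids are placeholders, and converting real audio to and from discrete speech tokens is handled by SlamKit.

```python
# Minimal sketch (not the official API); the supported pipeline, including
# audio <-> unit conversion, lives at https://github.com/slp-rl/slamkit.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("slprl/slam")  # assumed repo id

# Placeholder unit ids; real inputs are de-duplicated mhubert-25hz tokens
# (vocabulary of 500), as described under Preprocessing below.
unit_ids = torch.tensor([[17, 42, 301, 250, 7]])

with torch.no_grad():
    continuation = model.generate(unit_ids, max_new_tokens=64, do_sample=True)
print(continuation[0].tolist())  # the continuation, as discrete speech tokens
```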


## Training Details
We highly encourage users to read the full [paper](https://arxiv.org/abs/2502.15814) for complete training details; a brief overview is provided below.


### Training Data
This model was trained on a subset of the [LibriSpeech](https://huggingface.co/datasets/openslr/librispeech_asr) train split, 
[Libri-Light](https://ai.meta.com/tools/libri-light/) and the synthetic dataset
[sTinyStories](https://huggingface.co/datasets/slprl/sTinyStories) for the pre-training phase. It was then trained with DPO on the synthetic
dataset [SpokenSwag](https://huggingface.co/datasets/slprl/SpokenSwag).

### Training Procedure
This model was trained with next-token prediction over several datasets, and then trained with DPO over [SpokenSwag](https://huggingface.co/datasets/slprl/SpokenSwag).
Please refer to the [paper](https://arxiv.org/abs/2502.15814) or [code](https://github.com/slp-rl/slamkit) for the full training recipes; a sketch of the DPO objective appears below.
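
For intuition, here is a sketch of the standard DPO objective used in the second phase. This illustrates the loss itself, not the authors' training code, and `beta` is an illustrative value.

```python
import torch.nn.functional as F

def dpo_loss(pi_chosen_logp, pi_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss over per-sequence log-probabilities: prefer the
    chosen continuation over the rejected one (e.g. the correct vs. wrong
    SpokenSwag ending), relative to a frozen reference model."""
    margin = (pi_chosen_logp - ref_chosen_logp) - (pi_rejected_logp - ref_rejected_logp)
    return -F.logsigmoid(beta * margin).mean()
```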

#### Preprocessing
Speech tokens are extracted from the audio using [Hubert-25hz](https://huggingface.co/slprl/mhubert-base-25hz), and quantised using the 
official k-means released with the model in [textlesslib](https://github.com/facebookresearch/textlesslib/tree/main). Units are de-duplicated (consecutive repeats are collapsed).
We encourage you to explore the official repository for full details: [github](https://github.com/slp-rl/slamkit).
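
As a rough sketch of this pipeline, assuming textlesslib's `SpeechEncoder` interface (the model and quantizer names below are assumptions; check textlesslib and SlamKit for the exact identifiers):

```python
import torchaudio
from textless.data.speech_encoder import SpeechEncoder

# Assumed registration names for the mhubert-25hz checkpoint and its k-means.
encoder = SpeechEncoder.by_name(
    dense_model_name="mhubert-base-25hz",
    quantizer_model_name="kmeans",
    vocab_size=500,
    deduplicate=True,  # collapse consecutive repeated units, as noted above
).eval()

waveform, sample_rate = torchaudio.load("speech_prompt.wav")
units = encoder(waveform)["units"]  # discrete speech tokens fed to the LM
```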


## Evaluation
The paper provides the full results; we give a subset here, and refer to the [demo page](https://pages.cs.huji.ac.il/adiyoss-lab/slamming/) to listen to some generated samples.
| Model                                    | Compute (GPU days) | Parameters | sBLIMP ↑ | sStoryCloze ↑ | tStoryCloze ↑ | GenPPL ↓ | Auto-BLEU ↓ |
|------------------------------------------|--------------------|------------|----------|--------------|--------------|---------|------------|
| [TWIST-1.3B](https://pages.cs.huji.ac.il/adiyoss-lab/twist/)                      | 160xV100          | 1B         | 57.00    | 52.4         | 70.6         | 131.8   | 3.20       |
| [TWIST-7B](https://pages.cs.huji.ac.il/adiyoss-lab/twist/)                        | ?                 | 7B         | 59.00    | 55.3         | 74.1         | 93.7    | 3.06       |
| [TWIST-13B](https://pages.cs.huji.ac.il/adiyoss-lab/twist/)                       | ?                 | 13B        | 59.20    | 55.4         | 76.4         | -       | -          |
| [Scaled Optimal](https://arxiv.org/abs/2404.00685)      | ?                 | 823M       | **61.3** | **56.7**     | **78.0**     | -       | -          |
| [Predicted Optimal](https://arxiv.org/abs/2404.00685)   | 1xA5000           | 78M        | 56.85    | 54.09        | 70.49        | -       | -          |
| TWIST-350M (Original recipe)            | 1xA5000           | 305M       | 51.52 ± .19 | 53.65 ± .57 | 68.80 ± .47 | 259.2 ± 6.7 | 3.26 ± .46 |
| *Slam (-DPO) (ours)*                    | 1xA5000           | 358M       | *56.45* ± .17 | *55.59* ± .30 | *78.01* ± .27 | *88.3* ± 1.0 | 3.47 ± .17 |
| **Slam (ours)**                       | 1xA5000           | 358M       | **58.86** ± .20 | **58.04** ± .51 | **82.04** ± .21 | **62.8** ± 4.1 | 3.88 ± .11 |



### Compute Infrastructure
This model was trained as part of ["*Slamming*: Training a Speech Language Model on One GPU in a Day"](https://arxiv.org/abs/2502.15814), focusing on efficient training.

#### Hardware
This model was trained using **only a single Nvidia A5000 GPU**, 16 CPU cores and 24 GB of RAM for **24 hours**.

#### Software
The model was trained using the [*SlamKit*](https://github.com/slp-rl/slamkit) codebase, which builds upon 🤗 Transformers, extending it to support
easy and efficient training of Speech Language Models.

## Citation

**BibTeX:**
```
@misc{maimon2025slamming,
      title={Slamming: Training a Speech Language Model on One GPU in a Day}, 
      author={Gallil Maimon and Avishai Elmakies and Yossi Adi},
      year={2025},
      eprint={2502.15814},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2502.15814}, 
}
```