	---
	library_name: transformers
	license: mit
	datasets:
	- openslr/librispeech_asr
	- slprl/SpokenSwag
	- slprl/sTinyStories
	base_model:
	- Qwen/Qwen2.5-0.5B
	pipeline_tag: audio-to-audio
	---

	# Model Card for SLAM

	This is a Speech Language Model trained for generating speech continuations over discrete [Hubert tokens](https://huggingface.co/slprl/mhubert-base-25hz).


	## Model Details

	### Model Description
	This is a Speech Language Model, introduced in "[_Slamming_: Training a Speech Language Model on One GPU in a Day](https://arxiv.org/abs/2502.15814)", focusing on efficient training.
	It was fine-tuned from [Qwen/Qwen2.5-0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B) over a vocabulary of 500 speech tokens extracted from
	the 11th layer of [mhubert-25hz](https://huggingface.co/slprl/mhubert-base-25hz). For a stronger version of the model trained with
	slightly more compute (2 A100 GPUs for 2 days), see [slam_scaled](https://huggingface.co/slprl/slam_scaled).

	The model was trained by next-token prediction over a subset of LibriSpeech, Libri-Light and the synthetic dataset
	[sTinyStories](https://huggingface.co/datasets/slprl/sTinyStories). It was then trained with DPO over
	[SpokenSwag](https://huggingface.co/datasets/slprl/SpokenSwag).

	- Developed by: [SLP-RL](https://huggingface.co/slprl)
	- Model type: SpeechLM
	- License: MIT
	- Finetuned from model: [Qwen/Qwen2.5-0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B)

	### Model Sources

	- Repository: [https://github.com/slp-rl/slamkit](https://github.com/slp-rl/slamkit)
	- Paper: [https://arxiv.org/abs/2502.15814](https://arxiv.org/abs/2502.15814)
	- Demo: [https://pages.cs.huji.ac.il/adiyoss-lab/slamming/](https://pages.cs.huji.ac.il/adiyoss-lab/slamming/)

	## Uses
	This is a base SpeechLM and, as such, can be used to generate continuations for speech segments or as a base for further tuning. See the _SlamKit_
	[codebase](https://github.com/slp-rl/slamkit) for more details on usage, and check out the [demo page](https://pages.cs.huji.ac.il/adiyoss-lab/slamming/) for some generation examples.

	### Out-of-Scope Use
	This model was trained on curated speech datasets that contain mainly audiobooks and stories; as such, the outputs should not be treated as factual in any way.



	## How to Get Started with the Model
	We refer users to the official repository for full usage explanations - [github](https://github.com/slp-rl/slamkit).


	## Training Details
	We highly encourage users to read the full [paper](https://arxiv.org/abs/2502.15814) for complete training details; a brief overview is provided below.


	### Training Data
	This model was trained on a subset of [LibriSpeech](https://huggingface.co/datasets/openslr/librispeech_asr) train,
	[Libri-Light](https://ai.meta.com/tools/libri-light/) and the synthetic dataset
	[sTinyStories](https://huggingface.co/datasets/slprl/sTinyStories) for the pre-training phase. It was also trained with DPO on the synthetic
	dataset [SpokenSwag](https://huggingface.co/datasets/slprl/SpokenSwag).

	### Training Procedure
	This model was trained by next-token prediction over several datasets, and then trained with DPO over [SpokenSwag](https://huggingface.co/datasets/slprl/SpokenSwag).
	Please refer to the [paper](https://arxiv.org/abs/2502.15814) or [code](https://github.com/slp-rl/slamkit) for the full training recipes.
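
For readers unfamiliar with DPO, the following is a minimal sketch of the standard DPO loss for a single preference pair, written in plain Python. It is illustrative only and is not taken from SlamKit: the function name `dpo_loss`, its arguments, and the default `beta=0.1` are assumptions, and the actual recipe (batching, hyperparameters, reference model handling) is described in the paper and code.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Each argument is the summed log-probability of a full continuation
    under the trained policy or the frozen reference model.
    `beta` is a hypothetical default, not the value used for this model.
    """
    # How much more the policy likes each continuation than the reference does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # loss = -log(sigmoid(margin)), computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# When the policy equals the reference, the margin is 0 and the loss is log(2).
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
# Raising the chosen continuation's likelihood relative to the reference lowers the loss.
print(dpo_loss(-8.0, -12.0, -10.0, -12.0))
```

The loss decreases as the policy assigns relatively more probability to the preferred continuation than the reference model does, which is the behaviour the DPO stage relies on.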

	#### Preprocessing
	Speech tokens are extracted from the audio using [Hubert-25hz](https://huggingface.co/slprl/mhubert-base-25hz), and quantised using the
	official kmeans released with the model in [textlesslib](https://github.com/facebookresearch/textlesslib/tree/main). Units are de-duplicated.
	We encourage you to explore the official repository for full details - [github](https://github.com/slp-rl/slamkit).
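
As an illustration of the de-duplication step, here is a minimal sketch that assumes "de-duplicated" means collapsing runs of identical consecutive units into one, as is common for discrete speech units. The helper name `deduplicate_units` and the example unit IDs are hypothetical; the official tokenisation pipeline lives in SlamKit.

```python
from itertools import groupby


def deduplicate_units(units):
    """Collapse each run of identical consecutive units into a single unit."""
    return [unit for unit, _ in groupby(units)]


# Hypothetical unit IDs drawn from a 500-entry codebook (values 0..499).
units = [17, 17, 17, 204, 204, 17, 499, 499]
print(deduplicate_units(units))  # [17, 204, 17, 499]
```

Note that only adjacent repeats are merged: unit 17 appears twice in the output because its two runs are separated by another unit.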


	## Evaluation
	The paper provides full results; a selection is given here. You can also listen to some samples on the [demo page](https://pages.cs.huji.ac.il/adiyoss-lab/slamming/).

	| Model | Compute (GPU days) | Parameters | sBLIMP ↑ | sStoryCloze ↑ | tStoryCloze ↑ | GenPPL ↓ | Auto-BLEU ↓ |
	|------------------------------------------|--------------------|------------|----------|--------------|--------------|---------|------------|
	| [TWIST-1.3B](https://pages.cs.huji.ac.il/adiyoss-lab/twist/) | 160xV100 | 1B | 57.00 | 52.4 | 70.6 | 131.8 | 3.20 |
	| [TWIST-7B](https://pages.cs.huji.ac.il/adiyoss-lab/twist/) | ? | 7B | 59.00 | 55.3 | 74.1 | 93.7 | 3.06 |
	| [TWIST-13B](https://pages.cs.huji.ac.il/adiyoss-lab/twist/) | ? | 13B | 59.20 | 55.4 | 76.4 | - | - |
	| [Scaled Optimal](https://arxiv.org/abs/2404.00685) | ? | 823M | 61.3 | 56.7 | 78.0 | - | - |
	| [Predicted Optimal](https://arxiv.org/abs/2404.00685) | 1xA5000 | 78M | 56.85 | 54.09 | 70.49 | - | - |
	| TWIST-350M (Original recipe) | 1xA5000 | 305M | 51.52 ± .19 | 53.65 ± .57 | 68.80 ± .47 | 259.2 ± 6.7 | 3.26 ± .46 |
	| Slam (-DPO) (ours) | 1xA5000 | 358M | 56.45 ± .17 | 55.59 ± .30 | 78.01 ± .27 | 88.3 ± 1.0 | 3.47 ± .17 |
	| Slam (ours) | 1xA5000 | 358M | 58.86 ± .20 | 58.04 ± .51 | 82.04 ± .21 | 62.8 ± 4.1 | 3.88 ± .11 |



	### Compute Infrastructure
	This model was trained as part of ["Slamming: Training a Speech Language Model on One GPU in a Day"](https://arxiv.org/abs/2502.15814), focusing on efficient training.

	#### Hardware
	This model was trained using only a single Nvidia A5000 GPU, 16 CPU cores and 24 GB of RAM for 24 hours.

	#### Software
	The model was trained using the [SlamKit](https://github.com/slp-rl/slamkit) codebase, which builds upon 🤗transformers, extending it to support
	easy and efficient training of Speech Language Models.

	## Citation

	BibTeX:
	```
	@misc{maimon2025slamming,
	title={Slamming: Training a Speech Language Model on One GPU in a Day},
	author={Gallil Maimon and Avishai Elmakies and Yossi Adi},
	year={2025},
	eprint={2502.15814},
	archivePrefix={arXiv},
	primaryClass={cs.LG},
	url={https://arxiv.org/abs/2502.15814},
	}
	```