---
datasets:
- homebrewltd/Ichigo-tokenized-v0.1
language:
- en
- vi
license: apache-2.0
tags:
- sound language model
- audio-text-to-text
- torchtune
- whisperspeech
---

## Speechless
Speechless is a compact, open-source text-to-semantics model (1B parameters) that generates semantic representations of audio directly as discrete tokens, bypassing the need for a text-to-speech (TTS) model. Unlike traditional pipelines, which generate audio and then process it (TTS → ASR), Speechless converts text directly into semantic speech tokens. This simplifies training, saves resources, and scales better, especially for low-resource languages.
Trained on about 400 hours of English and about 1,000 hours of Vietnamese data, Speechless is a core component of the Ichigo v0.5 family.
For more details, check out our official [blog post]().
### Model Summary
**Developed by:** Homebrew Research.
**Model Architecture:** Llama
**Model type:** Text to Semantics
**Language(s):** English and Vietnamese
**License:** Apache 2.0
### Resources
**Blog:** [Blog post]()
## Intended Use
**Intended Use Cases** This model is primarily designed for research purposes. This version focuses on generating direct semantic representations of audio as discrete tokens, eliminating the need for a text-to-speech (TTS) model.
**Out-of-scope** The use of this model in any manner that violates applicable laws or regulations is strictly prohibited.
## How to Get Started
Use the example code below to load the model and generate semantic tokens from text.
```python
import torch
from transformers import pipeline

model_id = "homebrewltd/Speechless-llama3.2-v0.1"

# Load the model as a standard text-generation pipeline.
pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# The prompt is prefixed with the special token used in this example.
output = pipe("<|reserved_special_token_69|>I’m Speechless – A Model Developed by Homebrew Research")
print(output)
# Example output:
# [{'generated_text': '<|reserved_special_token_69|>I’m Speechless – A Model Developed by Homebrew Research.assistant\n\n<|sound_1968|><|sound_0464|><|sound_0642|><|duration_02|><|sound_0634|><|sound_0105|><|duration_02|><|sound_1745|><|duration_02|><|sound_1345|><|sound_0210|><|sound_1312|><|sound_1312|>'}]
```
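The generated text interleaves `<|sound_NNNN|>` and `<|duration_NN|>` tokens. As a minimal sketch, the helper below extracts them into `(kind, value)` pairs for downstream processing. The function name `parse_semantic_tokens` and the regular expression are illustrative assumptions based only on the sample output above, not part of the released code, and the sketch does not interpret what a duration value means.

```python
import re

# Matches the two token kinds visible in the sample output above.
# Assumption: every semantic token follows the <|sound_N|> / <|duration_N|> shape.
TOKEN_RE = re.compile(r"<\|(sound|duration)_(\d+)\|>")


def parse_semantic_tokens(generated_text: str) -> list[tuple[str, int]]:
    """Return (kind, value) pairs in order of appearance (hypothetical helper)."""
    return [(kind, int(value)) for kind, value in TOKEN_RE.findall(generated_text)]


sample = "<|reserved_special_token_69|>Hi<|sound_1968|><|sound_0464|><|duration_02|><|sound_0634|>"
tokens = parse_semantic_tokens(sample)
print(tokens)
```

The prompt prefix `<|reserved_special_token_69|>` is skipped because it matches neither token kind.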
## Training Specs
| **Parameter** | **Value** |
|----------------------------|-------------------------|
| **Epochs** | 2 |
| **Global Batch Size** | 144 |
| **Learning Rate** | 3e-4 |
| **Learning Scheduler** | Cosine |
| **Optimizer** | AdamW |
| **Warmup Ratio** | 0.05 |
| **Weight Decay** | 0.01 |
| **Max Sequence Length** | 512 |
| **Clip Grad Norm** | 1.0 |
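To make the learning-rate settings concrete, here is a minimal sketch of a cosine schedule with warmup using the peak rate (3e-4) and warmup ratio (0.05) from the table. The linear warmup shape, the decay to zero, and the `total_steps` value are assumptions for illustration; the actual training configuration may differ.

```python
import math

PEAK_LR = 3e-4       # "Learning Rate" from the table
WARMUP_RATIO = 0.05  # "Warmup Ratio" from the table


def lr_at_step(step: int, total_steps: int,
               peak_lr: float = PEAK_LR,
               warmup_ratio: float = WARMUP_RATIO) -> float:
    """Learning rate at a given step (hypothetical helper).

    Assumes linear warmup to peak_lr, then cosine decay to zero.
    """
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Linear ramp that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup phase completed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


# total_steps is an illustrative value, not taken from the training run.
for step in (0, 49, 50, 500, 1000):
    print(step, lr_at_step(step, total_steps=1000))
```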
## Evaluation
1. Vietnamese
| Model Name | Test dataset | Test samples | WER |
|------------|--------------|--------------|-----|
| **Speechless v0.1** | viet_bud500 | 7500 | **3.99** |
2. English
| Model Name | Test dataset | Test samples | WER |
|------------|--------------|--------------|-----|
| **Speechless v0.1** | librispeech_asr | 2620 | **3.27** |
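For readers unfamiliar with the metric, word error rate (WER) is the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The sketch below is a generic implementation for illustration only; the evaluation pipeline that produced the numbers above is not described here, and the function returns a fraction, so whether the table reports percentages is left as an assumption.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors / 6 words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```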
## Citation Information
**BibTeX:**
```
@article{Speechless2024,
  title={Speechless},
  author={Homebrew Research},
  year={2024},
  month={December},
  url={https://huggingface.co/homebrewltd/Speechless-llama3.2-v0.1}
}
```
## Acknowledgement
- **[WhisperSpeech](https://github.com/collabora/WhisperSpeech)**
- **[Llama3.2](https://huggingface.co/meta-llama/Meta-Llama-3.2-1B-Base)**