---
datasets:
- homebrewltd/Ichigo-tokenized-v0.1
language:
- en
- vi
license: apache-2.0
tags:
- sound language model
- audio-text-to-text
- torchtune
- whisperspeech
---

![image/png](https://cdn-uploads.huggingface.co/production/uploads/65713d70f56f9538679e5a56/BjNGSPCF5z-tp9aAGsZN9.png)

## Speechless

Speechless is a compact, open-source text-to-semantics model (1B parameters) that generates semantic representations of audio directly as discrete tokens, bypassing the need for a text-to-speech (TTS) model. Traditional pipelines synthesize audio and then process it (TTS → ASR); Speechless instead converts text straight into semantic speech tokens, which simplifies training, saves resources, and makes scaling easier, especially for low-resource languages.

Trained on ~400 hours of English and ~1000 hours of Vietnamese data, Speechless is a core component of the Ichigo v0.5 family.

For more details, check out our official [blog post]().

### Model Summary

**Developed by:** Homebrew Research.

**Model Architecture:** Llama

**Model type:** Text to Semantics

**Language(s):** English and Vietnamese

**License:** Apache 2.0

### Resources

**Blog:** [Blog post]()

## Intended Use

**Intended Use Cases** This model is primarily designed for research purposes. This version focuses on generating direct semantic representations of audio as discrete tokens, eliminating the need for a text-to-speech (TTS) model.

**Out-of-scope** The use of Speechless in any manner that violates applicable laws or regulations is strictly prohibited.

## How to Get Started

You can use the example code below to load the model and generate semantic tokens from text.

```python
import torch
from transformers import pipeline

model_id = "homebrewltd/Speechless-llama3.2-v0.1"

# Load the model as a standard text-generation pipeline.
pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# The prompt is prefixed with the special token used for text-to-semantics input.
output = pipe("<|reserved_special_token_69|>I’m Speechless – A Model Developed by Homebrew Research")
print(output)

# Example output:
# [{'generated_text': '<|reserved_special_token_69|>I’m Speechless – A Model Developed by Homebrew Research.assistant\n\n<|sound_1968|><|sound_0464|><|sound_0642|><|duration_02|><|sound_0634|><|sound_0105|><|duration_02|><|sound_1745|><|duration_02|><|sound_1345|><|sound_0210|><|sound_1312|><|sound_1312|>'}]
```
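
The generated text mixes the prompt with `<|sound_NNNN|>` and `<|duration_NN|>` tokens. The snippet below is a minimal sketch of how those tokens could be pulled out of the string for downstream use. The helper name `parse_semantic_tokens` is hypothetical and not part of the model's tooling, and the token format is inferred only from the sample output above; how duration tokens should be interpreted is not specified here.

```python
import re

# Matches the two token families visible in the sample output.
# Assumption: every token has the form <|sound_DIGITS|> or <|duration_DIGITS|>.
TOKEN_RE = re.compile(r"<\|(sound|duration)_(\d+)\|>")


def parse_semantic_tokens(generated_text):
    """Return (kind, value) pairs for each sound/duration token, in order."""
    return [(kind, int(value)) for kind, value in TOKEN_RE.findall(generated_text)]


# A short excerpt of the sample output shown above.
sample = "<|sound_1968|><|sound_0464|><|sound_0642|><|duration_02|><|sound_0634|>"
tokens = parse_semantic_tokens(sample)
sound_ids = [value for kind, value in tokens if kind == "sound"]

print(tokens)
print(sound_ids)
```

Keeping the token kind alongside its value preserves the original ordering, so a consumer can decide later how to treat the duration markers.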


## Training Specs

| **Parameter**              | **Value**               |
|----------------------------|-------------------------|
| **Epochs**                 | 2                       |
| **Global Batch Size**      | 144                     |
| **Learning Rate**          | 3e-4                    |
| **Learning Rate Scheduler** | Cosine                  |
| **Optimizer**              | AdamW                   |
| **Warmup Ratio**           | 0.05                    |
| **Weight Decay**           | 0.01                    |
| **Max Sequence Length**    | 512                     |
| **Clip Grad Norm**         | 1.0                     |
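
To illustrate how the learning rate, scheduler, and warmup ratio in the table interact, here is a minimal sketch of a cosine schedule with warmup. It assumes linear warmup from zero and decay to zero, which the table does not confirm; the function name `lr_at_step` and the `total_steps` value are hypothetical, since the actual number of training steps is not stated.

```python
import math


def lr_at_step(step, total_steps, peak_lr=3e-4, warmup_ratio=0.05):
    """Learning rate at `step`, assuming linear warmup then cosine decay to zero."""
    warmup_steps = max(1, round(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate.
        return peak_lr * step / warmup_steps
    # Cosine decay over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total_steps = 1000  # hypothetical; not taken from the training run
for step in (0, 25, 50, 525, 1000):
    print(step, lr_at_step(step, total_steps))
```

With these assumptions the rate peaks at 3e-4 once warmup ends (5% of the run) and falls smoothly to zero by the final step.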

## Evaluation

1. Vietnamese

| Model Name | Test dataset | Test samples | WER |
|------------|--------------|--------------|-----|
| **Speechless v0.1** | viet_bud500 | 7500 | **3.99** |

2. English

| Model Name | Test dataset | Test samples | WER |
|------------|--------------|--------------|-----|
| **Speechless v0.1** | librispeech_asr | 2620 | **3.27** |
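
For readers unfamiliar with the metric, word error rate (WER) is conventionally the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The sketch below shows that standard definition only; this card does not describe the evaluation pipeline used for the tables above, and the function name `word_error_rate` and the example sentences are illustrative.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Word-level Levenshtein distance, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"{wer:.4f}")
```

The function returns a fraction; multiply by 100 to express it as a percentage.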

## Citation Information

**BibTeX:**

```
@article{Speechless2024,
  title={Speechless},
  author={Homebrew Research},
  year={2024},
  month={December},
  url={https://huggingface.co/homebrewltd/Speechless-llama3.2-v0.1}
}
```

## Acknowledgement

- **[WhisperSpeech](https://github.com/collabora/WhisperSpeech)**

- **[Llama3.2](https://huggingface.co/meta-llama/Meta-Llama-3.2-1B-Base)**