---
datasets:
- homebrewltd/Ichigo-tokenized-v0.1
language:
- en
- vi
license: apache-2.0
tags:
- sound language model
- audio-text-to-text
- torchtune
- whisperspeech
---

![image/png](https://cdn-uploads.huggingface.co/production/uploads/65713d70f56f9538679e5a56/BjNGSPCF5z-tp9aAGsZN9.png)

## Speechless

Speechless is a compact, open-source text-to-semantics model (1B parameters) that converts text directly into semantic representations of audio as discrete tokens, bypassing the need for a text-to-speech (TTS) model. Traditional pipelines generate audio and then process it (TTS → ASR); Speechless skips that round trip, which simplifies training, saves resources, and scales more easily, especially for low-resource languages.

Trained on roughly 400 hours of English and roughly 1,000 hours of Vietnamese data, Speechless is a core component of the Ichigo v0.5 family.

For more details, check out our official [blog post]().

### Model Summary

**Developed by:** Homebrew Research.

**Model Architecture:** Llama

**Model type:** Text to Semantics

**Language(s):** English and Vietnamese

**License:** Apache 2.0

### Resources

**Blog:** [Blog post]()

## Intended Use

**Intended Use Cases** This model is primarily designed for research purposes. This version focuses on generating direct semantic representations of audio as discrete tokens, eliminating the need for a text-to-speech (TTS) model.

**Out-of-scope** The use of Speechless in any manner that violates applicable laws or regulations is strictly prohibited.

## How to Get Started

The example below loads the model and generates semantic tokens from a text prompt.

```python
import torch
from transformers import pipeline

model_id = "homebrewltd/Speechless-llama3.2-v0.1"

# Load the model as a standard text-generation pipeline.
pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# The prompt is prefixed with <|reserved_special_token_69|>, as in this example.
output = pipe("<|reserved_special_token_69|>I’m Speechless – A Model Developed by Homebrew Research")
print(output)

# Example output:
# [{'generated_text': '<|reserved_special_token_69|>I’m Speechless – A Model Developed by Homebrew Research.assistant\n\n<|sound_1968|><|sound_0464|><|sound_0642|><|duration_02|><|sound_0634|><|sound_0105|><|duration_02|><|sound_1745|><|duration_02|><|sound_1345|><|sound_0210|><|sound_1312|><|sound_1312|>'}]
```
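
The generated text mixes the prompt with `<|sound_NNNN|>` and `<|duration_NN|>` tokens. As a minimal sketch of how that string could be split into token IDs for downstream use, the snippet below uses a regular expression. The helper name `parse_semantic_tokens` is hypothetical and not part of the model's tooling, and the card does not document what the duration tokens encode, so the sketch only separates the two kinds without interpreting them.

```python
import re

# Matches tokens such as <|sound_1968|> or <|duration_02|>.
TOKEN_RE = re.compile(r"<\|(sound|duration)_(\d+)\|>")


def parse_semantic_tokens(generated_text):
    """Return (kind, value) pairs for every sound/duration token, in order.

    Hypothetical helper for illustration; other special tokens and plain
    text in the string are ignored.
    """
    return [(kind, int(value)) for kind, value in TOKEN_RE.findall(generated_text)]


# Shortened sample in the same shape as the pipeline output.
sample = "Hello.assistant\n\n<|sound_1968|><|sound_0464|><|duration_02|><|sound_0634|>"
tokens = parse_semantic_tokens(sample)
print(tokens)
```

With the real pipeline, the input would be `output[0]["generated_text"]` rather than the shortened sample.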


## Training Specs

| **Parameter**              | **Value**               |
|----------------------------|-------------------------|
| **Epochs**                 | 2                       |
| **Global Batch Size**      | 144                     |
| **Learning Rate**          | 3e-4                    |
| **Learning Rate Scheduler** | Cosine                  |
| **Optimizer**              | AdamW                   |
| **Warmup Ratio**           | 0.05                    |
| **Weight Decay**           | 0.01                    |
| **Max Sequence Length**    | 512                     |
| **Clip Grad Norm**         | 1.0                     |
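
To make the schedule settings concrete, here is a rough sketch of a cosine learning-rate schedule with warmup, using the peak rate (3e-4) and warmup ratio (0.05) from the table. It assumes linear warmup from zero and decay to zero, neither of which the table specifies; the function name and the step count of 1,000 are illustrative only, and the actual training code may differ.

```python
import math


def lr_at_step(step, total_steps, peak_lr=3e-4, warmup_ratio=0.05):
    """Learning rate at a given step (assumed: linear warmup, cosine decay to 0)."""
    warmup_steps = int(total_steps * warmup_ratio)
    if warmup_steps > 0 and step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate.
        return peak_lr * step / warmup_steps
    # Cosine decay from the peak down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total_steps = 1000  # hypothetical; the real step count is not given in the card
for step in (0, 25, 50, 525, 1000):
    print(step, lr_at_step(step, total_steps))
```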

## Evaluation

1. Vietnamese

| Model Name | Test dataset | Test samples | WER |
|------------|--------------|--------------|-----|
| **Speechless v0.1** | viet_bud500 | 7500 | **3.99** |

2. English

| Model Name | Test dataset | Test samples | WER |
|------------|--------------|--------------|-----|
| **Speechless v0.1** | librispeech_asr | 2620 | **3.27** |
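
For reference, word error rate (WER) is conventionally the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The sketch below implements that standard definition; the card does not describe the evaluation pipeline or any text normalization, so this is not necessarily how the numbers above were produced. If the tables report percentages, the returned value would be multiplied by 100.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One missing word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
print(f"{wer * 100:.2f}%")
```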

## Citation Information

**BibTeX:**

```
@article{Speechless2024,
  title={Speechless},
  author={Homebrew Research},
  year={2024},
  month={December},
  url={https://huggingface.co/homebrewltd/Speechless-llama3.2-v0.1}
}
```

## Acknowledgement

- **[WhisperSpeech](https://github.com/collabora/WhisperSpeech)**

- **[Llama3.2](https://huggingface.co/meta-llama/Meta-Llama-3.2-1B-Base)**