---
library_name: transformers
tags:
- unsloth
- text-to-audio
- s2s
license: cc-by-sa-4.0
datasets:
- KandirResearch/Speech2Speech
language:
- en
base_model:
- OuteAI/OuteTTS-0.3-500M
pipeline_tag: text-to-audio
---
# CiSiMi: A Text-to-Speech (TTS) Model

[![Buy Me A Coffee](https://img.shields.io/badge/Ko--fi-Support%20My%20Work-FF5E5B?style=for-the-badge&logo=ko-fi&logoColor=white)](https://ko-fi.com/lyte)
[![Dataset](https://img.shields.io/badge/Dataset-KandirResearch/Speech2Speech-blue)](https://huggingface.co/datasets/KandirResearch/Speech2Speech)
[![Model](https://img.shields.io/badge/Model-KandirResearch/CiSiMi--v0.1-green)](https://huggingface.co/KandirResearch/CiSiMi-v0.1)
[![Demo](https://img.shields.io/badge/Demo-KandirResearch/CiSiMi--At--Home-orange)](https://huggingface.co/spaces/KandirResearch/CiSiMi-At-Home)

## Overview

CiSiMi is an early prototype of a text-to-audio model that can process text inputs and respond with both text and audio. Built for resource-constrained environments, it's designed to run efficiently on CPU using llama.cpp, making advanced speech synthesis accessible even without powerful GPUs.

*"Being GPU poor and slightly disappointed with the csm release and my inability to run it, having to wait for time it takes me to run an ASR+LLM+TTS combo, I decided to ask Mom and Mom gave me CiSiMi At Home!"*

This project demonstrates the power of open-source tools to create accessible speech technology. While still in its early stages, it represents a step toward democratizing advanced text-to-audio capabilities.

## Technical Details

### Model Specifications
- **Architecture**: Based on OuteTTS-0.3-500M
- **Languages**: English
- **Pipeline**: Text-to-audio
- **Parameters**: 500M
- **Training Dataset Size**: ~15k samples
- **Future Goals**: Scale to a 200k-500k sample dataset with multi-turn conversations, train both 500M and 1B parameter model variants, and add streaming for real-time use

### Training Methodology

1. **Dataset Preparation**:
   - Started with [gruhit-patel/alpaca_speech_instruct](https://huggingface.co/datasets/gruhit-patel/alpaca_speech_instruct)
   - Cleaned by removing code, mathematical expressions, and non-English content
   - Filtered to keep only entries whose combined input and output text is 256 tokens or less (see the filtering sketch after this list)

2. **Audio Generation**:
   - Converted text outputs to speech using [hexgrad/Kokoro-82M](https://huggingface.co/hexgrad/Kokoro-82M)
   - Verified each generated clip with [OpenAI Whisper](https://github.com/openai/whisper) (a verification sketch follows this list)
   - Published the resulting dataset as [KandirResearch/Speech2Speech](https://huggingface.co/datasets/KandirResearch/Speech2Speech)

3. **Model Training**:
   - Preprocessed the dataset using a modified OuteTTS methodology ([training details](https://github.com/edwko/OuteTTS/blob/8eb0fa369df6f3c062f7084ddc33d10bc28992be/examples/training/OuteTTS-0.3/train.md))
   - Fine-tuned [OuteAI/OuteTTS-0.3-500M](https://huggingface.co/OuteAI/OuteTTS-0.3-500M) using Unsloth SFT (a training sketch follows this list)
   - Trained for 6 epochs reaching a loss of 2.27 as a proof of concept
   - ~~Trained for 3 epochs reaching a loss of 2.42 as a proof of concept~~
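
The token-length filter from step 1 can be reproduced roughly as follows. This is a minimal sketch: the `input`/`output` column names and the choice of the base model's tokenizer for counting are assumptions, not details taken from the original pipeline.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Tokenizer used only to count tokens; the original pipeline may have counted differently.
tokenizer = AutoTokenizer.from_pretrained("OuteAI/OuteTTS-0.3-500M")

dataset = load_dataset("gruhit-patel/alpaca_speech_instruct", split="train")

def short_enough(example):
    # Keep samples whose combined input and output text fits in 256 tokens.
    combined = example["input"] + " " + example["output"]
    return len(tokenizer.encode(combined)) <= 256

filtered = dataset.filter(short_enough)
print(f"Kept {len(filtered)} of {len(dataset)} samples")
```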
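
The Whisper verification in step 2 can be approximated by transcribing each generated clip and keeping it only when the transcript is close enough to the source text. The similarity measure and the 0.8 threshold below are illustrative choices, not the ones used to build the dataset.

```python
import difflib

import whisper  # pip install openai-whisper

asr = whisper.load_model("base")

def passes_verification(wav_path, reference_text, threshold=0.8):
    # Transcribe the generated audio and compare it against the text it was synthesized from.
    transcript = asr.transcribe(wav_path)["text"]
    similarity = difflib.SequenceMatcher(
        None, transcript.lower().strip(), reference_text.lower().strip()
    ).ratio()
    return similarity >= threshold

# Example: keep or drop a Kokoro-generated clip based on how faithfully it matches its text.
# keep = passes_verification("sample_000.wav", "Explain to me how gravity works!")
```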
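
The fine-tune in step 3 follows the usual Unsloth + TRL recipe. The sketch below is illustrative only: it assumes the samples have already been rendered into OuteTTS-style prompt strings in a `text` column (per the linked training notes), it uses LoRA adapters for brevity where the original run may have updated all weights, and the hyperparameters are placeholders rather than the values used for CiSiMi.

```python
from unsloth import FastLanguageModel  # import unsloth first so its patches apply
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer

# Load the base model; max_seq_length must cover the interleaved text + audio-token prompts.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="OuteAI/OuteTTS-0.3-500M",
    max_seq_length=4096,
    load_in_4bit=False,
)

# LoRA adapters shown for brevity; the original run may have fine-tuned all weights instead.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Assumes the samples were already formatted into OuteTTS prompt strings under a "text" column.
dataset = load_dataset("KandirResearch/Speech2Speech", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,  # older trl API; newer releases use processing_class / SFTConfig
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=4096,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        num_train_epochs=6,
        learning_rate=2e-5,
        output_dir="cisimi-sft",
    ),
)
trainer.train()
```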

## Usage Guide

### Sample

```
Explain to me how gravity works!
```

<audio controls><source src="https://huggingface.co/KandirResearch/CiSiMi-v0.1/resolve/main/sample.wav" type="audio/wav"></audio>

### Installation

```bash
pip install outetts llama-cpp-python --upgrade
pip install huggingface_hub sounddevice
```

### Implementation

```python
import sys
import torch
import outetts
import numpy as np
from huggingface_hub import hf_hub_download
from outetts.wav_tokenizer.audio_codec import AudioCodec
from outetts.version.v2.prompt_processor import PromptProcessor
from outetts.version.playback import ModelOutput

# Download the model
model_path = hf_hub_download(
    repo_id="KandirResearch/CiSiMi-v0.1",
    filename="unsloth.Q8_0.gguf",
)

# Configure the model
model_config = outetts.GGUFModelConfig_v2(
    model_path=model_path,
    tokenizer_path="KandirResearch/CiSiMi-v0.1",
)

# Initialize components
interface = outetts.InterfaceGGUF(model_version="0.3", cfg=model_config)
audio_codec = AudioCodec()
prompt_processor = PromptProcessor("KandirResearch/CiSiMi-v0.1")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
gguf_model = interface.get_model()

# Helper function to extract audio from tokens
def get_audio(tokens):
    outputs = prompt_processor.extract_audio_from_tokens(tokens)
    if not outputs:
        return None
    audio_tensor = audio_codec.decode(torch.tensor([[outputs]], dtype=torch.int64).to(device))
    return ModelOutput(audio_tensor, audio_codec.sr)

# Helper function to clean text output
def extract_text_from_tts_output(tts_output):
    text = ""
    for line in tts_output.strip().split('\n'):
        if '<|audio_end|>' in line or '<|im_end|>' in line:
            continue
        if '<|' in line:
            word = line.split('<|')[0].strip()
            if word:
                text += word + " "
        else:
            text += line.strip() + " "
    return text.strip()

# Generate response function
def generate_response(instruction):
    prompt = f"<|im_start|>\nInstructions:\n{instruction}\n<|im_end|>\nAnswer:\n"
    gen_cfg = outetts.GenerationConfig(
        text=prompt, 
        temperature=0.6, 
        repetition_penalty=1.1, 
        max_length=4096, 
        speaker=None
    )
    
    input_ids = prompt_processor.tokenizer.encode(prompt)
    tokens = gguf_model.generate(input_ids, gen_cfg)
    
    output_text = prompt_processor.tokenizer.decode(tokens, skip_special_tokens=False)
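    # The decoded output interleaves text and audio: each line carries a word followed by
    # '<|...|>' audio tokens, and the audio span is delimited by <|audio_start|> / <|audio_end|>.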
    
    if "<|audio_end|>" in output_text:
        first_part, _, _ = output_text.partition("<|audio_end|>")
        
        if "<|audio_end|>\n<|im_end|>\n" not in first_part:
            first_part += "<|audio_end|>\n<|im_end|>\n"
            
        extracted_text = extract_text_from_tts_output(first_part)
        
        audio_start_pos = first_part.find("<|audio_start|>\n") + len("<|audio_start|>\n")
        audio_end_pos = first_part.find("<|audio_end|>\n<|im_end|>\n") + len("<|audio_end|>\n<|im_end|>\n")
        
        if audio_start_pos >= len("<|audio_start|>\n") and audio_end_pos > audio_start_pos:
            audio_tokens_text = first_part[audio_start_pos:audio_end_pos]
            audio_tokens = prompt_processor.tokenizer.encode(audio_tokens_text)
            audio_output = get_audio(audio_tokens)
            
            if audio_output is not None and hasattr(audio_output, 'audio') and audio_output.audio is not None:
                audio_numpy = audio_output.audio.cpu().numpy()
                if audio_numpy.ndim > 1:
                    audio_numpy = audio_numpy.squeeze()
                
                return extracted_text, (audio_output.sr, audio_numpy)
    
    return output_text, None

# Example usage
question = "What is the meaning of life?"
response_text, response_audio = generate_response(question)
print(response_text)

# Play audio if available
if response_audio is not None:
    if "ipykernel" in sys.modules:
        from IPython.display import display, Audio
        display(Audio(response_audio[1], rate=response_audio[0], autoplay=True))
    else:
        import sounddevice as sd
        sd.play(response_audio[1], samplerate=response_audio[0])
        sd.wait()
```

## Limitations & Future Work

This early prototype has several areas for improvement:
- Limited training data (~15k samples)
- Basic prompt/chat template structure
- Opportunity to optimize training hyperparameters
- Potential for multi-turn conversation capabilities

**Potential Limitation**: Because responses carry audio tokens alongside text, this type of model fills up the context window quickly, which makes smaller models generally more practical to deploy.

## Acknowledgments & Citations

This model builds on the following open-source projects:

1. [OuteAI/OuteTTS-0.3-500M](https://huggingface.co/OuteAI/OuteTTS-0.3-500M) - Base model
2. [gruhit-patel/alpaca_speech_instruct](https://huggingface.co/datasets/gruhit-patel/alpaca_speech_instruct) - Initial dataset
3. [hexgrad/Kokoro-82M](https://huggingface.co/hexgrad/Kokoro-82M) - TTS generation
4. [OpenAI Whisper](https://github.com/openai/whisper) - Speech verification
5. [Unsloth](https://github.com/unslothai/unsloth) - Training optimization