---
library_name: transformers
tags:
- unsloth
- text-to-audio
- s2s
license: cc-by-sa-4.0
datasets:
- KandirResearch/Speech2Speech
language:
- en
base_model:
- OuteAI/OuteTTS-0.3-500M
pipeline_tag: text-to-audio
---
# CiSiMi: A Text-to-Speech (TTS) Model

[![Buy Me A Coffee](https://img.shields.io/badge/Ko--fi-Support%20My%20Work-FF5E5B?style=for-the-badge&logo=ko-fi&logoColor=white)](https://ko-fi.com/lyte)
[![Dataset](https://img.shields.io/badge/Dataset-KandirResearch/Speech2Speech-blue)](https://huggingface.co/datasets/KandirResearch/Speech2Speech)
[![Model](https://img.shields.io/badge/Model-KandirResearch/CiSiMi--v0.1-green)](https://huggingface.co/KandirResearch/CiSiMi-v0.1)
[![Demo](https://img.shields.io/badge/Demo-KandirResearch/CiSiMi--At--Home-orange)](https://huggingface.co/spaces/KandirResearch/CiSiMi-At-Home)

## Overview

CiSiMi is an early prototype of a text-to-audio model that can process text inputs and respond with both text and audio. Built for resource-constrained environments, it's designed to run efficiently on CPU using llama.cpp, making advanced speech synthesis accessible even without powerful GPUs.

*"Being GPU poor and slightly disappointed with the csm release and my inability to run it, having to wait for time it takes me to run an ASR+LLM+TTS combo, I decided to ask Mom and Mom gave me CiSiMi At Home!"*

This project demonstrates the power of open-source tools to create accessible speech technology. While still in its early stages, it represents a step toward democratizing advanced text-to-audio capabilities.

## Technical Details

### Model Specifications
- **Architecture**: Based on OuteTTS-0.3-500M
- **Languages**: English
- **Pipeline**: Text-to-audio
- **Parameters**: 500M
- **Training Dataset Size**: ~15k samples
- **Future Goals**: Scale to a 200k-500k sample dataset with multi-turn conversations, train both 500M and 1B parameter model variants, and add streaming for real-time use

### Training Methodology

1. **Dataset Preparation**:
   - Started with [gruhit-patel/alpaca_speech_instruct](https://huggingface.co/datasets/gruhit-patel/alpaca_speech_instruct)
   - Cleaned by removing code, mathematical expressions, and non-English content
   - Filtered to keep only entries whose combined input and output text is 256 tokens or less (see the filtering sketch after this list)

2. **Audio Generation**:
   - Converted text outputs to speech using [hexgrad/Kokoro-82M](https://huggingface.co/hexgrad/Kokoro-82M)
   - Verified each generated clip with [OpenAI Whisper](https://github.com/openai/whisper) (a verification sketch follows this list)
   - Published the resulting dataset as [KandirResearch/Speech2Speech](https://huggingface.co/datasets/KandirResearch/Speech2Speech)

3. **Model Training**:
   - Preprocessed the dataset using a modified OuteTTS methodology ([training details](https://github.com/edwko/OuteTTS/blob/8eb0fa369df6f3c062f7084ddc33d10bc28992be/examples/training/OuteTTS-0.3/train.md))
   - Fine-tuned [OuteAI/OuteTTS-0.3-500M](https://huggingface.co/OuteAI/OuteTTS-0.3-500M) using Unsloth SFT (a training sketch follows this list)
   - Trained for 6 epochs reaching a loss of 2.27 as a proof of concept
   - ~~Trained for 3 epochs reaching a loss of 2.42 as a proof of concept~~
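
The token-length filter from step 1 can be reproduced roughly as follows. This is a minimal sketch: the `input`/`output` column names and the choice of the base model's tokenizer for counting are assumptions, not details taken from the original pipeline.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Tokenizer used only to count tokens; the original pipeline may have counted differently.
tokenizer = AutoTokenizer.from_pretrained("OuteAI/OuteTTS-0.3-500M")

dataset = load_dataset("gruhit-patel/alpaca_speech_instruct", split="train")

def short_enough(example):
    # Keep samples whose combined input and output text fits in 256 tokens.
    combined = example["input"] + " " + example["output"]
    return len(tokenizer.encode(combined)) <= 256

filtered = dataset.filter(short_enough)
print(f"Kept {len(filtered)} of {len(dataset)} samples")
```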
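
The Whisper verification in step 2 can be approximated by transcribing each generated clip and keeping it only when the transcript is close enough to the source text. The similarity measure and the 0.8 threshold below are illustrative choices, not the ones used to build the dataset.

```python
import difflib

import whisper  # pip install openai-whisper

asr = whisper.load_model("base")

def passes_verification(wav_path, reference_text, threshold=0.8):
    # Transcribe the generated audio and compare it against the text it was synthesized from.
    transcript = asr.transcribe(wav_path)["text"]
    similarity = difflib.SequenceMatcher(
        None, transcript.lower().strip(), reference_text.lower().strip()
    ).ratio()
    return similarity >= threshold

# Example: keep or drop a Kokoro-generated clip based on how faithfully it matches its text.
# keep = passes_verification("sample_000.wav", "Explain to me how gravity works!")
```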
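
The fine-tune in step 3 follows the usual Unsloth + TRL recipe. The sketch below is illustrative only: it assumes the samples have already been rendered into OuteTTS-style prompt strings in a `text` column (per the linked training notes), it uses LoRA adapters for brevity where the original run may have updated all weights, and the hyperparameters are placeholders rather than the values used for CiSiMi.

```python
from unsloth import FastLanguageModel  # import unsloth first so its patches apply
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer

# Load the base model; max_seq_length must cover the interleaved text + audio-token prompts.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="OuteAI/OuteTTS-0.3-500M",
    max_seq_length=4096,
    load_in_4bit=False,
)

# LoRA adapters shown for brevity; the original run may have fine-tuned all weights instead.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Assumes the samples were already formatted into OuteTTS prompt strings under a "text" column.
dataset = load_dataset("KandirResearch/Speech2Speech", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,  # older trl API; newer releases use processing_class / SFTConfig
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=4096,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        num_train_epochs=6,
        learning_rate=2e-5,
        output_dir="cisimi-sft",
    ),
)
trainer.train()
```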

## Usage Guide

### Sample

```
Explain to me how gravity works!
```

<audio controls><source src="https://huggingface.co/KandirResearch/CiSiMi-v0.1/resolve/main/sample.wav" type="audio/wav"></audio>

### Installation

```bash
pip install outetts llama-cpp-python --upgrade
pip install huggingface_hub sounddevice
```

### Implementation

```python
import sys
import torch
import outetts
import numpy as np
from huggingface_hub import hf_hub_download
from outetts.wav_tokenizer.audio_codec import AudioCodec
from outetts.version.v2.prompt_processor import PromptProcessor
from outetts.version.playback import ModelOutput

# Download the model
model_path = hf_hub_download(
    repo_id="KandirResearch/CiSiMi-v0.1",
    filename="unsloth.Q8_0.gguf",
)

# Configure the model
model_config = outetts.GGUFModelConfig_v2(
    model_path=model_path,
    tokenizer_path="KandirResearch/CiSiMi-v0.1",
)

# Initialize components
interface = outetts.InterfaceGGUF(model_version="0.3", cfg=model_config)
audio_codec = AudioCodec()
prompt_processor = PromptProcessor("KandirResearch/CiSiMi-v0.1")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
gguf_model = interface.get_model()

# Helper function to extract audio from tokens
def get_audio(tokens):
    outputs = prompt_processor.extract_audio_from_tokens(tokens)
    if not outputs:
        return None
    audio_tensor = audio_codec.decode(torch.tensor([[outputs]], dtype=torch.int64).to(device))
    return ModelOutput(audio_tensor, audio_codec.sr)

# Helper function to clean text output
def extract_text_from_tts_output(tts_output):
    text = ""
    for line in tts_output.strip().split('\n'):
        if '<|audio_end|>' in line or '<|im_end|>' in line:
            continue
        if '<|' in line:
            word = line.split('<|')[0].strip()
            if word:
                text += word + " "
        else:
            text += line.strip() + " "
    return text.strip()

# Generate response function
def generate_response(instruction):
    prompt = f"<|im_start|>\nInstructions:\n{instruction}\n<|im_end|>\nAnswer:\n"
    gen_cfg = outetts.GenerationConfig(
        text=prompt, 
        temperature=0.6, 
        repetition_penalty=1.1, 
        max_length=4096, 
        speaker=None
    )
    
    input_ids = prompt_processor.tokenizer.encode(prompt)
    tokens = gguf_model.generate(input_ids, gen_cfg)
    
    output_text = prompt_processor.tokenizer.decode(tokens, skip_special_tokens=False)
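    # The decoded output interleaves text and audio: each line carries a word followed by
    # '<|...|>' audio tokens, and the audio span is delimited by <|audio_start|> / <|audio_end|>.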
    
    if "<|audio_end|>" in output_text:
        first_part, _, _ = output_text.partition("<|audio_end|>")
        
        if "<|audio_end|>\n<|im_end|>\n" not in first_part:
            first_part += "<|audio_end|>\n<|im_end|>\n"
            
        extracted_text = extract_text_from_tts_output(first_part)
        
        audio_start_pos = first_part.find("<|audio_start|>\n") + len("<|audio_start|>\n")
        audio_end_pos = first_part.find("<|audio_end|>\n<|im_end|>\n") + len("<|audio_end|>\n<|im_end|>\n")
        
        if audio_start_pos >= len("<|audio_start|>\n") and audio_end_pos > audio_start_pos:
            audio_tokens_text = first_part[audio_start_pos:audio_end_pos]
            audio_tokens = prompt_processor.tokenizer.encode(audio_tokens_text)
            audio_output = get_audio(audio_tokens)
            
            if audio_output is not None and hasattr(audio_output, 'audio') and audio_output.audio is not None:
                audio_numpy = audio_output.audio.cpu().numpy()
                if audio_numpy.ndim > 1:
                    audio_numpy = audio_numpy.squeeze()
                
                return extracted_text, (audio_output.sr, audio_numpy)
    
    return output_text, None

# Example usage
question = "What is the meaning of life?"
response_text, response_audio = generate_response(question)
print(response_text)

# Play audio if available
if response_audio is not None:
    if "ipykernel" in sys.modules:
        from IPython.display import display, Audio
        display(Audio(response_audio[1], rate=response_audio[0], autoplay=True))
    else:
        import sounddevice as sd
        sd.play(response_audio[1], samplerate=response_audio[0])
        sd.wait()
```

## Limitations & Future Work

This early prototype has several areas for improvement:
- Limited training data (~15k samples)
- Basic prompt/chat template structure
- Opportunity to optimize training hyperparameters
- Potential for multi-turn conversation capabilities

**Potential Limitation**: Because responses carry audio tokens alongside text, this type of model fills up the context window quickly, which makes smaller models generally more practical to deploy.

## Acknowledgments & Citations

This model builds on the following open-source projects:

1. [OuteAI/OuteTTS-0.3-500M](https://huggingface.co/OuteAI/OuteTTS-0.3-500M) - Base model
2. [gruhit-patel/alpaca_speech_instruct](https://huggingface.co/datasets/gruhit-patel/alpaca_speech_instruct) - Initial dataset
3. [hexgrad/Kokoro-82M](https://huggingface.co/hexgrad/Kokoro-82M) - TTS generation
4. [OpenAI Whisper](https://github.com/openai/whisper) - Speech verification
5. [Unsloth](https://github.com/unslothai/unsloth) - Training optimization