1. Introduction
This work introduces MonoSpeech, a novel approach that integrates autoregression and flow matching within a transformer-based framework for unified speech understanding and generation. MonoSpeech achieves both speech comprehension and speech generation with a single model trained in one stage. Our experiments show that MonoSpeech delivers strong performance on both automatic speech recognition and zero-shot speech synthesis. By combining autoregression and flow matching, MonoSpeech lays a foundation for extending this paradigm to additional audio understanding and generation tasks.
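For readers unfamiliar with flow matching, the sketch below shows the standard conditional flow-matching training objective in generic PyTorch form. It is only an illustration of the general technique named above; the `model` signature, conditioning input, and batch layout are hypothetical placeholders and do not reflect MonoSpeech's actual implementation.

```python
# Generic conditional flow-matching loss (illustrative sketch, not the MonoSpeech code).
import torch

def flow_matching_loss(model, x1, cond):
    """Regress the model's predicted velocity onto the straight-line
    velocity (x1 - x0) at a uniformly sampled time t in [0, 1]."""
    x0 = torch.randn_like(x1)                      # noise sample
    t = torch.rand(x1.shape[0], device=x1.device)  # per-example time step
    t_ = t.view(-1, *([1] * (x1.dim() - 1)))       # broadcast t over feature dims
    xt = (1 - t_) * x0 + t_ * x1                   # point on the linear interpolation path
    v_target = x1 - x0                             # target velocity field
    v_pred = model(xt, t, cond)                    # hypothetical: model predicts velocity given context
    return torch.mean((v_pred - v_target) ** 2)
```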

2. Quick Start
Please refer to the GitHub repository for setup and inference instructions.
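As a starting point before following the repository instructions, the snippet below is a minimal sketch that downloads the checkpoint files from this model repository using the standard `huggingface_hub` API; the inference entry points themselves live in the GitHub repository.

```python
# Minimal sketch: fetch the MonoSpeech checkpoint from the Hugging Face Hub.
# Inference is driven by the code in the GitHub repository, not by a
# transformers pipeline (any-to-any models are not supported there).
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="guanwenhao/MonoSpeech")
print(f"Checkpoint downloaded to: {local_dir}")
```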
3. Model Tree
Base model: HuggingFaceTB/SmolLM2-360M-Instruct