1. Introduction

This work introduces MonoSpeech, a novel approach that integrates autoregression and flow matching within a transformer-based framework for unified speech understanding and generation. MonoSpeech achieves both speech comprehension and generation capabilities with a single unified model trained in one stage. Our experiments demonstrate that MonoSpeech delivers strong performance on both automatic speech recognition and zero-shot speech synthesis tasks. By combining autoregression and flow matching, MonoSpeech establishes a foundation for extending this paradigm to additional audio understanding and generation tasks in the future.
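The sketch below illustrates, at toy scale, how a single transformer backbone can be shared between an autoregressive next-token head (for understanding tasks such as ASR) and a flow-matching head that predicts a velocity field over mel-spectrogram frames (for generation). All class names, layer sizes, and the conditioning scheme are illustrative assumptions and do not reflect the actual MonoSpeech implementation; see the repository for the real code.

```python
import torch
import torch.nn as nn

# Illustrative sketch only: names and shapes here are assumptions, not the
# MonoSpeech implementation. It shows the general idea of pairing an
# autoregressive head (understanding, e.g. ASR) with a flow-matching head
# (generation, e.g. speech synthesis) on one shared transformer backbone.

class ToyUnifiedSpeechModel(nn.Module):
    def __init__(self, vocab_size=1000, d_model=256, n_mels=80):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        # A real autoregressive model would use causal masking; omitted for brevity.
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        # Autoregressive head: next-token prediction over text tokens.
        self.lm_head = nn.Linear(d_model, vocab_size)
        # Flow-matching head: predicts a velocity field for noisy mel frames.
        self.vel_head = nn.Linear(d_model + n_mels + 1, n_mels)

    def ar_logits(self, tokens):
        h = self.backbone(self.token_emb(tokens))
        return self.lm_head(h)  # trained with cross-entropy on shifted targets

    def flow_velocity(self, tokens, noisy_mel, t):
        # Condition the velocity prediction on backbone states, the noisy
        # mel frames x_t, and the flow time t in [0, 1].
        h = self.backbone(self.token_emb(tokens))
        t_feat = t.view(-1, 1, 1).expand(-1, noisy_mel.size(1), 1)
        return self.vel_head(torch.cat([h, noisy_mel, t_feat], dim=-1))


# Toy flow-matching training step with the straight-line (rectified-flow) path:
# x_t = (1 - t) * noise + t * mel, target velocity = mel - noise.
model = ToyUnifiedSpeechModel()
tokens = torch.randint(0, 1000, (2, 16))
mel = torch.randn(2, 16, 80)
noise = torch.randn_like(mel)
t = torch.rand(2)
x_t = (1 - t.view(-1, 1, 1)) * noise + t.view(-1, 1, 1) * mel
v_pred = model.flow_velocity(tokens, x_t, t)
fm_loss = ((v_pred - (mel - noise)) ** 2).mean()
fm_loss.backward()
```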

GitHub Repository


2. Quick Start

Please refer to the GitHub Repository.
