1. Step-Audio-Chat

This repository contains the Multimodal Large Language Model (LLM) component of Step-Audio: a 130-billion-parameter multimodal LLM responsible for understanding and generating human speech. The model is designed to seamlessly integrate speech recognition, semantic understanding, dialogue management, voice cloning, and speech generation.

2. Evaluation

2.1 LLM-judge metrics (GPT-4o) on StepEval-Audio-360

Comparison of fundamental voice-chat capabilities on StepEval-Audio-360.
| Model | Factuality (% ↑) | Relevance (% ↑) | Chat Score ↑ |
|---|---|---|---|
| GLM4-Voice | 54.7 | 66.4 | 3.49 |
| Qwen2-Audio | 22.6 | 26.3 | 2.27 |
| Moshi* | 1.0 | 0 | 1.49 |
| Step-Audio-Chat | 66.4 | 75.2 | 4.11 |

*Note: Moshi results (marked with "*") should be considered for reference only.
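The card does not spell out how the GPT-4o judge verdicts are aggregated. As a rough illustration only, the sketch below shows a typical LLM-as-judge aggregation loop; the `query_judge` callable, the verdict schema, and the 1-5 chat-score scale are assumptions, not the authors' actual StepEval-Audio-360 protocol.

```python
# Illustrative LLM-as-judge aggregation sketch (NOT the official protocol).
# `query_judge` is a hypothetical stand-in for a GPT-4o API call returning a
# per-sample verdict dict: {"factual": bool, "relevant": bool, "chat_score": int}.

def aggregate_judge_scores(samples, query_judge):
    """Average per-sample judge verdicts into corpus-level metrics.

    Factuality and Relevance are reported as percentages of samples
    judged correct; the chat score is the mean judge rating.
    """
    verdicts = [query_judge(s["question"], s["answer"]) for s in samples]
    n = len(verdicts)
    return {
        "factuality_pct": 100.0 * sum(v["factual"] for v in verdicts) / n,
        "relevance_pct": 100.0 * sum(v["relevant"] for v in verdicts) / n,
        "chat_score": sum(v["chat_score"] for v in verdicts) / n,
    }


if __name__ == "__main__":
    # Toy run with canned verdicts instead of a real GPT-4o call.
    canned = iter([
        {"factual": True, "relevant": True, "chat_score": 4},
        {"factual": False, "relevant": True, "chat_score": 3},
    ])
    samples = [{"question": "q1", "answer": "a1"},
               {"question": "q2", "answer": "a2"}]
    print(aggregate_judge_scores(samples, lambda q, a: next(canned)))
    # → {'factuality_pct': 50.0, 'relevance_pct': 100.0, 'chat_score': 3.5}
```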

2.2 Public Test Set

| Model | Llama Question | Web Questions | TriviaQA* | ComplexBench | HSK-6 |
|---|---|---|---|---|---|
| GLM4-Voice | 64.7 | 32.2 | 39.1 | 66.0 | 74.0 |
| Moshi | 62.3 | 26.6 | 22.8 | - | - |
| Freeze-Omni | 72.0 | 44.7 | 53.9 | - | - |
| LUCY | 59.7 | 29.3 | 27.0 | - | - |
| MinMo | 78.9 | 55.0 | 48.3 | - | - |
| Qwen2-Audio | 52.0 | 27.0 | 37.3 | 54.0 | - |
| Step-Audio-Chat | 81.0 | 75.1 | 58.0 | 74.0 | 86.0 |

*Note: Results on the TriviaQA dataset (marked with "*") are for reference only.
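As a quick way to compare models across these benchmarks, the snippet below (an illustration, not part of the evaluation) computes each model's mean score over the test sets it reports, using the numbers from the table above and skipping missing entries.

```python
# Mean score per model over the public test sets it reports.
# Numbers copied from the table above; None marks an unreported entry.
results = {
    "GLM4-Voice":      [64.7, 32.2, 39.1, 66.0, 74.0],
    "Moshi":           [62.3, 26.6, 22.8, None, None],
    "Freeze-Omni":     [72.0, 44.7, 53.9, None, None],
    "LUCY":            [59.7, 29.3, 27.0, None, None],
    "MinMo":           [78.9, 55.0, 48.3, None, None],
    "Qwen2-Audio":     [52.0, 27.0, 37.3, 54.0, None],
    "Step-Audio-Chat": [81.0, 75.1, 58.0, 74.0, 86.0],
}

def mean_reported(scores):
    """Average only the benchmarks a model actually reports."""
    vals = [s for s in scores if s is not None]
    return round(sum(vals) / len(vals), 1)

for model, scores in results.items():
    print(f"{model}: {mean_reported(scores)}")
```

Note that the means are not directly comparable across models that report different benchmark subsets, which is why the table keeps the per-benchmark breakdown.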

2.3 Audio instruction following

| Category | Instruction Following (GLM-4-Voice) | Instruction Following (Step-Audio) | Audio Quality (GLM-4-Voice) | Audio Quality (Step-Audio) |
|---|---|---|---|---|
| Languages | 1.9 | 3.8 | 2.9 | 3.3 |
| Role-playing | 3.8 | 4.2 | 3.2 | 3.6 |
| Singing / RAP | 2.1 | 2.4 | 2.4 | 4.0 |
| Voice Control | 3.6 | 4.4 | 3.3 | 4.1 |

3. More information

For more information, please refer to our repository: Step-Audio.
