|
--- |
|
license: other |
|
license_name: tencent-hunyuan-a13b |
|
license_link: LICENSE |
|
--- |
|
|
|
|
|
|
|
<p align="center"> |
|
<img src="https://dscache.tencent-cloud.cn/upload/uploader/hunyuan-64b418fd052c033b228e04bc77bbc4b54fd7f5bc.png" width="400"/> <br> |
|
</p><p></p> |
|
|
|
<p align="center"> |
|
 <a href="https://github.com/Tencent/Hunyuan-A13B"><b>GITHUB</b></a>   |
|
|
|
|
|
## Model Introduction |
|
|
|
This release from Tencent Hunyuan includes the A13B models [Tencent-Hunyuan-A13B-Pretrain](https://huggingface.co/tencent/Hunyuan-A13B-Pretrain), [Tencent-Hunyuan-A13B-Instruct](https://huggingface.co/tencent/Hunyuan-A13B-Instruct), and [Tencent-Hunyuan-A13B-Instruct-FP8](https://huggingface.co/tencent/Tencent-Hunyuan-A13B-Instruct-FP8). Built with improved data allocation and training, they deliver strong performance and strike a good balance between compute cost and quality. Hunyuan-A13B stands out among large-scale language models and is currently one of the strongest Chinese Mixture of Experts (MoE) models, featuring a total of 80 billion parameters and 13 billion active parameters.
|
|
|
### Introduction to Technical Advantages |
|
|
|
**Model** |
|
|
|
- **High-Quality Synthetic Data**: By enhancing training with synthetic data, Hunyuan-A13B is able to learn richer representations, handle long-context inputs, and generalize better to unseen data. |
|
|
|
- **KV Cache Compression**: Utilizing Grouped Query Attention (GQA) and Cross-Layer Attention (CLA) strategies, it significantly reduces the memory usage and computational overhead of the KV cache, thereby improving inference throughput (a rough sizing sketch follows this list).
|
|
|
- **Expert-Specific Learning Rate Scaling**: Different learning rates are assigned to different experts, ensuring that each sub-model can effectively learn from the data and contribute to overall performance. |
|
|
|
- **Long-Context Processing Capability**: Both the pre-trained model and the instruction-tuned model support text sequences of up to 256K tokens, significantly enhancing the ability to handle long-context tasks. |
|
|
|
- **Extensive Benchmarking**: Extensive experiments across multiple languages and tasks have validated the practical effectiveness and safety of Hunyuan-A13B. |
|
|
|
- **Hybrid Reasoning Capability**: It supports both fast thinking and slow thinking inference modes. |
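
To make the KV-cache point above concrete, here is a minimal back-of-the-envelope sketch. The head counts, head dimension, and layer-sharing factor are illustrative assumptions, not the published Hunyuan-A13B attention configuration; the point is only to show why GQA and CLA shrink the cache.

```python
# Rough KV-cache sizing. All numbers below are hypothetical and chosen only to
# illustrate the effect of GQA (fewer KV heads) and CLA (KV shared across layers).
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # Factor of 2 accounts for storing both keys and values.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

seq_len = 256 * 1024  # 256K-token context

mha = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=seq_len)       # baseline MHA
gqa = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=seq_len)        # GQA: grouped KV heads
gqa_cla = kv_cache_bytes(layers=16, kv_heads=8, head_dim=128, seq_len=seq_len)    # + CLA: KV shared by layer pairs

for name, size in [("MHA", mha), ("GQA", gqa), ("GQA + CLA", gqa_cla)]:
    print(f"{name:10s}: {size / 2**30:.1f} GiB per 256K-token sequence")
```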
|
|
|
|
|
|
|
**Architecture** |
|
|
|
Hunyuan-A13B adopts a Fine-grained Mixture of Experts (Fine-grained MoE) architecture, comprising a total of 80 billion parameters with 13 billion active parameters. The model was trained on over 20 trillion tokens and supports a context length of up to 256K tokens. The detailed specifications of the model architecture are listed below, followed by an illustrative routing sketch:
|
|
|
- **Total Parameters**: 80B |
|
- **Active Parameters**: 13B |
|
- **Number of Layers**: 32 |
|
- **Attention Heads**: 32 |
|
- **Number of Shared Experts**: 1 |
|
- **Number of Non-Shared Experts**: 64 |
|
- **Routing Strategy**: Top-8 |
|
- **Activation Function**: SwiGLU |
|
- **Hidden Layer Dimension**: 4096 |
|
- **Expert Hidden Layer Dimension**: 3072 |
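
The specifications above can be read as a routing recipe: every token always passes through the single shared expert, and a router additionally sends it to 8 of the 64 non-shared SwiGLU experts. The toy sketch below (plain PyTorch, with shrunken dimensions so it runs anywhere) illustrates that pattern; it is not the actual Hunyuan-A13B implementation, and the class and variable names are invented for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUExpert(nn.Module):
    """A single SwiGLU feed-forward expert."""
    def __init__(self, hidden_dim, expert_dim):
        super().__init__()
        self.gate = nn.Linear(hidden_dim, expert_dim, bias=False)
        self.up = nn.Linear(hidden_dim, expert_dim, bias=False)
        self.down = nn.Linear(expert_dim, hidden_dim, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))

class FineGrainedMoE(nn.Module):
    """1 shared expert + top-k routing over many small non-shared experts."""
    def __init__(self, hidden_dim, expert_dim, n_experts=64, top_k=8):
        super().__init__()
        self.router = nn.Linear(hidden_dim, n_experts, bias=False)
        self.experts = nn.ModuleList(SwiGLUExpert(hidden_dim, expert_dim) for _ in range(n_experts))
        self.shared_expert = SwiGLUExpert(hidden_dim, expert_dim)
        self.top_k = top_k

    def forward(self, x):                                  # x: [tokens, hidden_dim]
        weights, idx = self.router(x).softmax(-1).topk(self.top_k, dim=-1)  # top-8 routing
        weights = weights / weights.sum(-1, keepdim=True)  # renormalise the selected gates
        out = self.shared_expert(x)                        # the shared expert sees every token
        routed = torch.zeros_like(x)
        for k in range(self.top_k):                        # naive dispatch, fine for a demo
            for e in idx[:, k].unique():
                mask = idx[:, k] == e
                routed[mask] += weights[mask, k].unsqueeze(-1) * self.experts[int(e)](x[mask])
        return out + routed

# The real model uses hidden_dim=4096 and expert_dim=3072; tiny dims keep the demo cheap.
moe = FineGrainedMoE(hidden_dim=256, expert_dim=192)
print(moe(torch.randn(4, 256)).shape)                      # -> torch.Size([4, 256])
```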
|
|
|
|
|
|
|
|
|
## Related News |
|
* 2025.6.27 We have open-sourced **Hunyuan-A13B-Pretrain**, **Hunyuan-A13B-Instruct**, and **Hunyuan-A13B-Instruct-FP8** on Hugging Face.
|
<br> |
|
|
|
|
|
## Benchmark |
|
|
|
Note: The following benchmarks were evaluated with the TRT-LLM backend.
|
|
|
| Model | Hunyuan-Large | Qwen2.5-72B | Qwen3-32B | Qwen3-A22B | Hunyuan-A13B | |
|
|------------------|---------------|--------------|---------------|-------------|---------------| |
|
| MMLU | 88.4 | 86.1 | 83.61 | 87.81 | 88.17 | |
|
| MMLU-Pro | 60.20 | 58.10 | 65.54 | 68.18 | 67.23 | |
|
| MMLU-Redux | 87.47 | 83.90 | 83.41 | 87.40 | 87.67 | |
|
| BBH | 86.30 | 85.8 | 87.38 | 88.87 | 87.56 | |
|
| SuperGPQA | 38.90 | 37.84 * | 39.78 | 44.06 | 41.32 | |
|
| EvalPlus | 75.69 | 66.05 | 72.05 | 77.60 | 78.64 | |
|
| MultiPL-E | 59.13 | 61.00 | 67.06 | 65.94 | 69.33 | |
|
| MBPP | 72.60 | 84.70 | 78.20 | 81.40 | 83.86 | |
|
| CRUX-O | 60.63 | 56.00 * | 72.50 | 79.00 | 77.00 | |
|
| MATH | 69.80 | 62.1 | 61.62 | 71.84 | 72.35 | |
|
| GSM8k | 92.80 | 91.5 | 93.40 | 94.39 | 91.83 | |
|
| GPQA | - | 45.9 | 47.97 | 47.47 | 43.44 | |
|
| INCLUDE | 66.48 | 76.98 * | 67.97 | 73.46 | 74.90 | |
|
| MGSM | 67.52 | 79.53 * | 82.68 | 83.53 | 76.00 | |
|
| MMMLU | 76.89 | 79.28 * | 83.83 | 86.70 | 84.68 | |
|
|
|
|
|
|
|
|
|
|
|
| Topic | Bench | OpenAI-o1-1217 | DeepSeek R1 | Qwen3-A22B | Hunyuan-A13B-Instruct | |
|
|:-------------------:|:-----------------------------:|:-------------:|:------------:|:-----------:|:---------------------:| |
|
| **Mathematics** | AIME 2024<br>AIME 2025<br>MATH | 74.3<br>79.2<br>96.4 | 79.8<br>70<br>94.9 | 85.7<br>81.5<br>94.0 | 87.3<br>76.8<br>94.3 | |
|
| **Science** | GPQA-Diamond<br>OlympiadBench | 78<br>83.1 | 71.5<br>82.4 | 71.1<br>85.7 | 71.2<br>82.7 | |
|
| **Coding** | Livecodebench<br>Fullstackbench<br>ArtifactsBench | 63.9<br>64.6<br>38.6 | 65.9<br>71.6<br>44.6 | 70.7<br>65.6<br>44.6 | 63.9<br>67.8<br>43 | |
|
| **Reasoning** | BBH<br>DROP<br>ZebraLogic | 80.4<br>90.2<br>81 | 83.7<br>92.2<br>78.7 | 88.9<br>90.3<br>80.3 | 89.1<br>91.1<br>84.7 | |
|
| **Instruction<br>Following** | IF-Eval<br>SysBench | 91.8<br>82.5 | 88.3<br>77.7 | 83.4<br>74.2 | 84.7<br>76.1 | |
|
| **Text<br>Creation**| LengthCtrl<br>InsCtrl | 60.1<br>74.8 | 55.9<br>69 | 53.3<br>73.7 | 55.4<br>71.9 | |
|
| **NLU** | ComplexNLU<br>Word-Task | 64.7<br>67.1 | 64.5<br>81.8 | 59.8<br>56.4 | 61.2<br>62.9 | |
|
| **Agent** | BFCL v3<br> $\tau$-bench<br>ComplexFuncBench<br> $C^3$-Bench | 67.8<br>60.4<br>47.6<br>58.8 | 63.8<br>58.7<br>n/a<br>55.3 | 70.8<br>46.7<br>n/a<br>51.7 | 78.3<br>54.7<br>51.2<br>63.5 |
|
|
|
|
|
|
|
|
|
|
|
|
## Quick Start |
|
|
|
You can refer to the content in [Hunyuan-A13B](https://github.com/Tencent-Hunyuan/Hunyuan-A13B) to get started quickly. The training and inference code can use the versions provided in that GitHub repository.
|
|
|
|
|
### Transformers
|
|
|
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import os


def main():
    model_name_or_path = os.environ['MODEL_PATH']

    tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True)
    # You may want to pass torch_dtype="auto" (or bfloat16) here; device_map="auto"
    # spreads the weights across the available GPUs.
    model = AutoModelForCausalLM.from_pretrained(model_name_or_path, device_map="auto",
                                                 trust_remote_code=True)

    # Optional: inspect the parameter layout of the loaded model.
    for name, param in model.named_parameters():
        print(f"{name}: {param.size()}")

    messages = [
        {
            "role": "system",
            "content": "You are a helpful assistant.",
        },
        {"role": "user", "content": "Write a short summary of the benefits of regular exercise."},
    ]
    # Apply the chat template, generate a response, and decode it.
    tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")
    outputs = model.generate(tokenized_chat.to(model.device), max_new_tokens=100, do_sample=True)
    print(tokenizer.decode(outputs[0]))


if __name__ == '__main__':
    main()
```
|
|
|
|
|
## Deployment |
|
|
|
For deployment, you can use frameworks such as *vLLM*, *SGLang*, or *TensorRT-LLM* to serve the model and create an OpenAI-compatible API endpoint. |
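
Whichever backend you pick, the resulting endpoint can be queried with the standard OpenAI Python client. In the sketch below the base URL, port, and served model name are assumptions; adjust them to match how you launched your server.

```python
# Minimal OpenAI-compatible client example. base_url, port, and model name are
# placeholders for whatever your vLLM / SGLang / TensorRT-LLM server exposes.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="tencent/Hunyuan-A13B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a short summary of the benefits of regular exercise."},
    ],
    max_tokens=256,
)
print(response.choices[0].message.content)
```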
|
|
|
|
|
### vLLM
|
|
|
#### Docker Image |
|
We provide a pre-built Docker image containing vLLM 0.8.5 with full support for this model; official upstream support is still under development.
|
|
|
|
|
- To get started, pull the Docker image:

```
docker pull xxx
```
|
|
|
- Start the API server: |
|
|
|
``` |
|
docker start xxx |
|
``` |
|
|
|
|
|
#### Source Code |
|
|
|
Support for this model has been added to the vLLM project via this PR: https://github.com/vllm-project/vllm/pull/20114.
|
You can build and run vLLM from source after merging this pull request into your local repository. |
|
|
|
After applying the changes, you can start the API server by following the standard vLLM setup instructions. |
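
As a reference, a typical invocation looks like the sketch below; the model path and tensor-parallel degree are assumptions and should be adapted to your hardware.

```
vllm serve tencent/Hunyuan-A13B-Instruct \
    --tensor-parallel-size 4 \
    --trust-remote-code
```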
|
|
|
|
|
### SGLang
|
|
|
#### Docker Image |
|
|
|
We also provide a pre-built Docker image based on the latest version of SGLang. |
|
|
|
To get started: |
|
|
|
- Pull the Docker image |
|
|
|
``` |
|
docker pull xxx |
|
``` |
|
|
|
- Start the API server: |
|
|
|
``` |
|
docker run --gpus all \ |
|
--shm-size 32g \ |
|
-p 30000:30000 \ |
|
--ipc=host \ |
|
xxx \ |
|
python3 -m sglang.launch_server --model-path hunyuan/huanyuan_A13B --tp 4 --trust-remote-code --host 0.0.0.0 --port 30000 |
|
``` |
|
|
|
|
|
#### Source Code |
|
|
|
The necessary integration has already been merged into SGLang's main branch via this PR: https://github.com/sgl-project/sglang/pull/7549.
Once you have cloned or updated your local SGLang repository, you can build it and start the API server by following the standard SGLang setup instructions, for example:
|
|
|
``` |
|
python3 -m sglang.launch_server --model-path hunyuan/huanyuan_A13B --tp 4 --trust-remote-code --host 0.0.0.0 --port 30000 |
|
``` |
|
|
|
|
|
|
|
### TensorRT-LLM |
|
|
|
|
|
#### Docker Image |
|
|
|
We also provide a pre-built Docker image based on the latest version of TensorRT-LLM. |
|
|
|
To get started: |
|
|
|
- Pull the Docker image |
|
|
|
``` |
|
docker pull xxx |
|
``` |
|
|
|
- Start the API server (the example below uses TensorRT-LLM's `trtllm-serve` OpenAI-compatible entry point as an illustration; exact flags may differ across versions):
|
|
|
``` |
|
docker run --gpus all \ |
|
--shm-size 32g \ |
|
-p 30000:30000 \ |
|
--ipc=host \ |
|
xxx \ |
|
trtllm-serve hunyuan/huanyuan_A13B --host 0.0.0.0 --port 30000
|
``` |
|
|
|
#### Source Code |
|
|
|
The necessary integration has already been merged into the main branch via this PR (xxx).
Once you have cloned or updated your local TensorRT-LLM repository, you can build it and start the API server by following the standard TensorRT-LLM setup instructions.
|
|
|
|
|
|
|
## Inference Performance |
|
|
|
This section presents the efficiency test results of deploying various models using vLLM, including inference speed (tokens/s) under different batch sizes. |
|
|
|
|
|
Evaluation Script: |
|
```
|
python3 benchmark_throughput.py --backend vllm \ |
|
--input-len 2048 \ |
|
--output-len 14336 \ |
|
--model $MODEL_PATH \ |
|
--tensor-parallel-size $TP \ |
|
--use-v2-block-manager \ |
|
--async-engine \ |
|
--trust-remote-code \ |
|
--num_prompts $BATCH_SIZE \ |
|
--max-num-seqs $BATCH_SIZE |
|
``` |
|
|
|
| Inference Framework | Model | Number of GPUs (GPU product A) | input_length | batch=1 (tokens/s) | batch=16 (tokens/s) | batch=32 (tokens/s) |
|
|------|-----------------------------|-----------|-------------------------|---------------------|----------------------|----------------------| |
|
| vLLM | Hunyuan-A13B-Instruct | 8 | 2048 | 190.84 | 1246.54 | 1981.99 | |
|
| vLLM | Hunyuan-A13B-Instruct | 4 | 2048 | 158.90 | 779.10 | 1301.75 | |
|
| vLLM | Hunyuan-A13B-Instruct | 2 | 2048 | 111.72 | 327.31 | 346.54 | |
|
| vLLM | Hunyuan-A13B-Instruct(int8 weight only) | 2 | 2048 | 109.10 | 444.17 | 721.93 | |
|
| vLLM | Hunyuan-A13B-Instruct(W8A8C8-FP8) | 2 | 2048 | 91.83 | 372.01 | 617.70 | |
|
| vLLM | Hunyuan-A13B-Instruct(W8A8C8-FP8) | 1 | 2048 | 60.07 | 148.80 | 160.41 | |
|
|
|
|
|
|
|
## Contact Us |
|
|
|
If you would like to leave a message for our R&D or product teams, you are welcome to contact our open-source team. You can also reach us via email (hunyuan_[email protected]).