# Lingzhi LLM - Industry Experts for Vertical Domains

🌐 [Official website](https://ailingzhi.com)

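## ✨ Highlights
- Reproduced Qwen2-chat starting from Qwen2-base, and released the training data used to do so;
- When trained on vertical domains, Lingzhi models improve domain performance while retaining general-domain performance;
- Summarized eight training paradigms (for example, direct instruction fine-tuning, or continual pre-training followed by instruction fine-tuning) and applied the best-performing paradigm for each model size;
- Open-sourced eight Lingzhi models: `Lingzhi-0.5B-chat`, `Lingzhi-0.8B-chat`, `Lingzhi-1.5B-chat`, `Lingzhi-2.7B-chat`, `Lingzhi-7B-chat`, `Lingzhi-10B-chat`, `Lingzhi-57MOE14B-chat`, `Lingzhi-72B-chat`.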

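## 📄 Abstract
In practice, **continual training** is common when the original pre-training data is unavailable. However, continual training often strengthens domain-specific skills at the cost of catastrophic forgetting of the general capabilities of large language models (LLMs). In this work, we first conduct an empirical study of common continual-training paradigms and then select the best one to train the Lingzhi model series. Experiments show that Lingzhi improves domain-specific performance while preserving general capabilities. We have open-sourced all models, training data, and benchmarks so that users can apply them to their own domains.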

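## 📘 Introduction
Large language models (LLMs) have attracted wide attention in recent years for their strong performance on a variety of real-world downstream tasks. Although existing LLMs perform well in general domains, they may underperform in the specific domains users care about (such as accounting, law, and finance) because they see too little domain-specific material during pre-training or instruction fine-tuning.

To improve LLM performance in a specific domain, we need to collect corresponding data for continual learning, such as continual pre-training (CPT) or supervised fine-tuning (SFT). However, we observe that continual learning on domain data alone can cause catastrophic forgetting of general capabilities such as planning, instruction following, mathematics, coding, and natural language understanding.

To retain both general and domain-specific capabilities, a common approach is to deploy an unmodified original model for general tasks alongside a fine-tuned model for specialized tasks. Serving two models places heavy demands on compute hardware (such as GPUs and memory) and hinders commercial deployment. This is a well-known and difficult problem in industry. It raises a question worth studying: how can continual learning improve domain-specific performance without harming general capabilities?

To answer this question, we conducted an empirical study that explores various continual-learning paradigms and summarizes their strengths and weaknesses. Based on that study, we selected the best learning paradigm and training data and performed continual learning on Qwen2-base, producing the Lingzhi model series. In extensive experiments, Lingzhi performs well across multiple specific domains while performing comparably to the original Qwen2-chat models on general capabilities.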

## 📋 Examples
1. Hugging Face example code
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

lingzhi_model_path = "Lingzhi-AI/Lingzhi-7B-chat"

# Load the weights in their stored dtype and let device_map="auto"
# place them on the available device(s).
model = AutoModelForCausalLM.from_pretrained(
    lingzhi_model_path,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(lingzhi_model_path)

# Prompt (in Chinese): "Please introduce the Lingzhi model."
prompt = "帮我介绍一下灵智大模型。"
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
# Render the conversation with the model's chat template and
# append the marker that opens the assistant's turn.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
# Move the inputs to the device the model was placed on.
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Unpack the inputs so the attention mask is passed along with the input IDs.
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)
# Drop the prompt tokens so that only newly generated tokens are decoded.
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```

2. ModelScope example code
```python
from modelscope import AutoModelForCausalLM, AutoTokenizer

lingzhi_model_path = "LingzhiLLM/Lingzhi-7B-chat"

# Load the weights in their stored dtype and let device_map="auto"
# place them on the available device(s).
model = AutoModelForCausalLM.from_pretrained(
    lingzhi_model_path,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(lingzhi_model_path)

# Prompt (in Chinese): "Please introduce the Lingzhi model."
prompt = "帮我介绍一下灵智大模型。"
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
# Render the conversation with the model's chat template and
# append the marker that opens the assistant's turn.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
# Move the inputs to the device the model was placed on.
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Unpack the inputs so the attention mask is passed along with the input IDs.
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)
# Drop the prompt tokens so that only newly generated tokens are decoded.
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
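
Both examples end by slicing each generated sequence before decoding it. For a decoder-only model, `generate` returns the prompt tokens followed by the new tokens, so the slice removes the prompt. The sketch below reproduces that step with plain Python lists instead of tensors; the token IDs are invented for illustration and do not come from the Lingzhi tokenizer.

```python
# Toy stand-ins for model_inputs.input_ids and the output of model.generate.
# The ID values are made up; only the lengths matter here.
prompt_ids = [[11, 12, 13]]
full_output_ids = [[11, 12, 13, 21, 22, 23, 24]]

# Same slicing as in the examples: skip the first len(prompt) tokens
# of each output so that only the newly generated tokens remain.
new_token_ids = [
    output[len(prompt):] for prompt, output in zip(prompt_ids, full_output_ids)
]

print(new_token_ids)  # [[21, 22, 23, 24]]
```

Without this step, the decoded response would begin with the rendered chat template and the user's prompt.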

## 📊 Results

> Note: All Qwen2 baseline results were measured by us in the same unified evaluation environment.

| **Model**             | **English: MMLU** | **English: BBH** | **Chinese: C-Eval** | **Chinese: CMMLU** | **Math: GSM8K** | **Math: MathQA** | **Code: HumanEval** | **Code: MBPP** | **Domain: Account** | **Domain: Law** | **Avg.** |
| :-------------------- | :---------------- | :--------------- | :------------------ | :----------------- | :-------------- | :--------------- | :------------------ | :------------- | :------------------ | :-------------- | :------- |
| ***Baselines***       |                   |                  |                     |                    |                 |                  |                     |                |                     |                 |          |
| Qwen2-0.5B-chat       | 43.30             | 10.35            | 54.16               | 53.57              | 33.97           | 25.76            | 20.73               | 12.40          | 17.01               | 25.00           | 29.62    |
| Qwen2-1.5B-chat       | 55.73             | 9.55             | 69.32               | 70.13              | 54.21           | 32.93            | 42.68               | 20.60          | 32.65               | 42.07           | 42.99    |
| Qwen2-7B-chat         | 69.82             | 30.56            | 81.58               | 81.77              | 66.26           | 44.09            | 72.56               | 42.20          | 55.10               | 59.15           | 60.31    |
| Qwen2-57MOE14B-chat   |                   |                  |                     |                    |                 |                  |                     |                |                     |                 |          |
| Qwen2-72B-chat        |                   |                  |                     |                    |                 |                  |                     |                |                     |                 |          |
| ***Lingzhi Models***  |                   |                  |                     |                    |                 |                  |                     |                |                     |                 |          |
| Lingzhi-0.5B-chat     | 44.25             | 25.65            | 55.05               | 53.74              | 29.34           | 29.18            | 25.00               | 22.40          | 25.85               | 40.24           | 35.07    |
| Lingzhi-0.8B-chat     | 42.93             | 27.77            | 53.34               | 50.98              | 21.00           | 28.84            | 28.66               | 18.00          | 24.49               | 40.85           | 33.69    |
| Lingzhi-1.5B-chat     | 55.35             | 33.67            | 69.47               | 69.10              | 49.58           | 35.31            | 39.02               | 31.00          | 37.41               | 42.68           | 46.26    |
| Lingzhi-2.7B-chat     | 53.65             | 36.77            | 67.09               | 67.39              | 46.02           | 34.51            | 40.85               | 30.00          | 38.10               | 60.98           | 47.54    |
| Lingzhi-7B-chat       | 69.06             | 58.95            | 82.69               | 83.05              | 74.22           | 45.59            | 56.10               | 49.80          | 72.79               | 89.02           | 68.13    |
| Lingzhi-10B-chat      | 69.37             | 64.37            | 81.50               | 82.27              | 76.19           | 46.00            | 60.98               | 50.40          | 70.07               | 82.93           | 68.41    |
| Lingzhi-57MOE14B-chat |                   |                  |                     |                    |                 |                  |                     |                |                     |                 |          |
| Lingzhi-72B-chat      |                   |                  |                     |                    |                 |                  |                     |                |                     |                 |          |
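
The **Avg.** column appears to be the unweighted mean of the ten benchmark scores in each row. The README does not state this; it is inferred from the numbers. The sketch below checks that reading for the two 7B rows and prints the per-benchmark difference between them. The scores are copied from the table; the variable names are illustrative.

```python
from statistics import mean

benchmarks = ["MMLU", "BBH", "C-Eval", "CMMLU", "GSM8K",
              "MathQA", "HumanEval", "MBPP", "Account", "Law"]

# Scores copied from the 7B rows of the results table.
scores = {
    "Qwen2-7B-chat":   [69.82, 30.56, 81.58, 81.77, 66.26,
                        44.09, 72.56, 42.20, 55.10, 59.15],
    "Lingzhi-7B-chat": [69.06, 58.95, 82.69, 83.05, 74.22,
                        45.59, 56.10, 49.80, 72.79, 89.02],
}

# Assumption: Avg. is the plain mean over all ten columns, rounded to 2 places.
averages = {name: round(mean(vals), 2) for name, vals in scores.items()}
print(averages)  # {'Qwen2-7B-chat': 60.31, 'Lingzhi-7B-chat': 68.13}

# Per-benchmark change of Lingzhi-7B-chat relative to Qwen2-7B-chat.
deltas = {
    bench: round(lingzhi - qwen, 2)
    for bench, qwen, lingzhi in zip(
        benchmarks, scores["Qwen2-7B-chat"], scores["Lingzhi-7B-chat"]
    )
}
for bench, delta in deltas.items():
    print(f"{bench:>10}: {delta:+.2f}")
```

Looking at per-benchmark differences alongside the average is useful because an average can hide a drop on an individual benchmark; in the 7B rows, for instance, HumanEval is lower for Lingzhi-7B-chat than for Qwen2-7B-chat even though the average is higher.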


## 📚 Citation
<span style="color:orange;">⚠️ **Warning**</span> If you use our models or data, please cite the following reference.
```
@misc{lingzhi,
      title={Lingzhi: Improving Domain-Specific Performance without Compromising General Capabilities}, 
      author={Daoguang Zan and Lei Yu and Ailun Yu and Zhirong Huang and Zongshuai Ruan and Pengjie Huang},
      year={2024},
      note={All authors contributed equally. The computational power required to train the Lingzhi models (12*8 H800 80G) was provided by Lingzhi AI. Special thanks to them.}
}
```