# Lingzhi LLM: Vertical-Domain Industry Expert

🌐 [Official website](https://ailingzhi.com)

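## ✨ Highlights
- Faithfully reproduced Qwen2-chat starting from Qwen2-base, and released the training data;
- When trained on vertical-domain data, Lingzhi models improve domain performance while retaining general-domain performance;
- Surveyed eight training paradigms (for example, direct instruction fine-tuning, or continual pre-training followed by instruction fine-tuning) and applied the best-performing paradigm for each model size;
- Open-sourced eight Lingzhi models: `Lingzhi-0.5B-chat`, `Lingzhi-0.8B-chat`, `Lingzhi-1.5B-chat`, `Lingzhi-2.7B-chat`, `Lingzhi-7B-chat`, `Lingzhi-10B-chat`, `Lingzhi-57MOE14B-chat`, `Lingzhi-72B-chat`.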

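## 📄 Abstract
In real-world applications, continual training is common when the original pre-training data is unavailable. However, continual training often strengthens domain-specific skills at the cost of catastrophic forgetting of a large language model's (LLM's) general capabilities. In this work, we first conduct an empirical study of common continual-training paradigms, then select the best paradigm to train the Lingzhi model series. Experiments show that Lingzhi improves domain-specific performance while preserving general capabilities. We have open-sourced all models, training data, and benchmarks so that users can apply them to their own domains.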

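## 📘 Introduction
Large language models (LLMs) have attracted wide attention in recent years for their strong performance on a range of real-world downstream tasks. Although existing LLMs perform well in general domains, they may underperform in the specific domains users need (such as accounting, law, and finance), because they see too little domain-specific data during pre-training or instruction fine-tuning.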

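To improve LLM performance in a specific domain, we need to collect domain data for continual learning, such as continual pre-training (CPT) or supervised fine-tuning (SFT). However, we observe that continual learning on domain data alone can cause catastrophic forgetting of general capabilities such as planning, instruction following, mathematics, coding, and natural language understanding.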

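To retain both general and domain-specific capabilities, a common practice is to deploy an unmodified original model for general tasks alongside a fine-tuned model for specialized tasks. This places heavy demands on compute hardware (such as GPUs and memory) and hinders commercial deployment. It is a well-known and difficult problem in industry, and it raises a question worth studying: how can domain-specific performance be improved during continual learning without harming general capabilities?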

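To address this question, we conducted an empirical study that explores various continual-learning paradigms and summarizes their strengths and weaknesses. Based on that study, we selected the best learning paradigm and training data, and performed continual learning on Qwen2-base to produce the Lingzhi model series. In extensive experiments, Lingzhi performs well in multiple specific domains while matching the original Qwen2-chat models in general capabilities.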

## 📋 Examples
1. Hugging Face example code
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Use a GPU when one is available, otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

lingzhi_model_path = "Lingzhi-AI/Lingzhi-7B-chat"

# Load the model and tokenizer; dtype and device placement are chosen automatically.
model = AutoModelForCausalLM.from_pretrained(
    lingzhi_model_path,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(lingzhi_model_path)

# Prompt: "Please introduce the Lingzhi large model."
prompt = "帮我介绍一下灵智大模型。"
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
# Render the conversation as a single string using the model's chat template.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(device)

# Pass the attention mask together with the input IDs.
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)
# Keep only the newly generated tokens by dropping the prompt tokens.
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```

2. ModelScope example code
```python
from modelscope import AutoModelForCausalLM, AutoTokenizer
import torch

# Use a GPU when one is available, otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

lingzhi_model_path = "LingzhiLLM/Lingzhi-7B-chat"

# Load the model and tokenizer; dtype and device placement are chosen automatically.
model = AutoModelForCausalLM.from_pretrained(
    lingzhi_model_path,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(lingzhi_model_path)

# Prompt: "Please introduce the Lingzhi large model."
prompt = "帮我介绍一下灵智大模型。"
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
# Render the conversation as a single string using the model's chat template.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(device)

# Pass the attention mask together with the input IDs.
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)
# Keep only the newly generated tokens by dropping the prompt tokens.
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
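
Both examples slice the output of `generate` before decoding, because a decoder-only model returns the prompt tokens followed by the continuation. The sketch below illustrates that slicing step with plain Python lists. The helper name `strip_prompt` and the token IDs are made up for illustration and are not part of the Lingzhi or Transformers APIs.

```python
def strip_prompt(input_ids, output_ids):
    """Return only the newly generated token IDs for each sequence in a batch."""
    # Each output sequence starts with its prompt, so skip len(prompt) tokens.
    return [out[len(inp):] for inp, out in zip(input_ids, output_ids)]


# Hypothetical token IDs; real IDs come from the tokenizer and the model.
prompt_batch = [[101, 7, 8, 9]]
generated_batch = [[101, 7, 8, 9, 42, 43, 2]]

new_tokens = strip_prompt(prompt_batch, generated_batch)
print(new_tokens)  # [[42, 43, 2]]
```

Without this step, the decoded response would begin with the rendered chat template and the user prompt.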

## 📊 Results

> Note: All Qwen2 baseline results were obtained by evaluating the models in our unified environment.

| Base Model | General | | | | | | | | Domains | | Avg. |
| :-------------------- | :---------- | :----- | :---------- | :----- | :------- | :----- | :-------- | :----- | :---------- | :----- | :------- |
| | English | | Chinese | | Math | | Code | | | | |
| | MMLU | BBH | C-Eval | CMMLU | GSM8K | MathQA | HumanEval | MBPP | Account | Law | |
| *Baselines* | | | | | | | | | | | |
| Qwen2-0.5B-chat | 43.30 | 10.35 | 54.16 | 53.57 | 33.97 | 25.76 | 20.73 | 12.40 | 17.01 | 25.00 | 29.62 |
| Qwen2-1.5B-chat | 55.73 | 9.55 | 69.32 | 70.13 | 54.21 | 32.93 | 42.68 | 20.60 | 32.65 | 42.07 | 42.99 |
| Qwen2-7B-chat | 69.82 | 30.56 | 81.58 | 81.77 | 66.26 | 44.09 | 72.56 | 42.20 | 55.10 | 59.15 | 60.31 |
| Qwen2-57MOE14B-chat | | | | | | | | | | | |
| Qwen2-72B-chat | | | | | | | | | | | |
| *Lingzhi Models* | | | | | | | | | | | |
| Lingzhi-0.5B-chat | 44.25 | 25.65 | 55.05 | 53.74 | 29.34 | 29.18 | 25.00 | 22.40 | 25.85 | 40.24 | 35.07 |
| Lingzhi-0.8B-chat | 42.93 | 27.77 | 53.34 | 50.98 | 21.00 | 28.84 | 28.66 | 18.00 | 24.49 | 40.85 | 33.69 |
| Lingzhi-1.5B-chat | 55.35 | 33.67 | 69.47 | 69.10 | 49.58 | 35.31 | 39.02 | 31.00 | 37.41 | 42.68 | 46.26 |
| Lingzhi-2.7B-chat | 53.65 | 36.77 | 67.09 | 67.39 | 46.02 | 34.51 | 40.85 | 30.00 | 38.10 | 60.98 | 47.54 |
| Lingzhi-7B-chat | 69.06 | 58.95 | 82.69 | 83.05 | 74.22 | 45.59 | 56.10 | 49.80 | 72.79 | 89.02 | 68.13 |
| Lingzhi-10B-chat | 69.37 | 64.37 | 81.50 | 82.27 | 76.19 | 46.00 | 60.98 | 50.40 | 70.07 | 82.93 | 68.41 |
| Lingzhi-57MOE14B-chat | | | | | | | | | | | |
| Lingzhi-72B-chat | | | | | | | | | | | |
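
The Avg. column appears to be the unweighted mean of the ten benchmark scores in each row. This is inferred from the numbers in the table and is not stated in the README. A minimal sketch that recomputes it for the two 7B rows (the helper name `mean_score` is hypothetical):

```python
def mean_score(scores):
    """Unweighted mean of benchmark scores, rounded to two decimals."""
    return round(sum(scores) / len(scores), 2)


# Scores copied from the table, in column order:
# MMLU, BBH, C-Eval, CMMLU, GSM8K, MathQA, HumanEval, MBPP, Account, Law
qwen2_7b = [69.82, 30.56, 81.58, 81.77, 66.26, 44.09, 72.56, 42.20, 55.10, 59.15]
lingzhi_7b = [69.06, 58.95, 82.69, 83.05, 74.22, 45.59, 56.10, 49.80, 72.79, 89.02]

print(mean_score(qwen2_7b))    # 60.31
print(mean_score(lingzhi_7b))  # 68.13
```

Because the mean is unweighted, the two domain benchmarks (Account and Law) contribute only two of the ten terms, so the gap in Avg. understates the gap on the domain columns alone.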


## 📚 Citation
<span style="color:orange;">⚠️ Warning</span> If you use our models or data, please cite the following reference.
```
@misc{lingzhi,
  title={Lingzhi: Improving Domain-Specific Performance without Compromising General Capabilities},
  author={Daoguang Zan and Lei Yu and Ailun Yu and Zhirong Huang and Zongshuai Ruan and Pengjie Huang},
  year={2024},
  note={All authors contributed equally. The computational power required to train the Lingzhi models (12*8 H800 80G) was provided by Lingzhi AI. Special thanks to them.}
}
```