=============================================
Open-Source Instruction-Tuning Datasets (LLM)
=============================================
The HuggingFace Hub hosts many excellent open-source datasets. This
section takes the
`timdettmers/openassistant-guanaco <https://huggingface.co/datasets/timdettmers/openassistant-guanaco>`__
instruction-tuning dataset as an example to walk through training. For
ease of exposition, the discussion is based on the
`internlm2_chat_7b_qlora_oasst1_e3 <https://github.com/InternLM/xtuner/blob/main/xtuner/configs/internlm/internlm2_chat_7b/internlm2_chat_7b_qlora_oasst1_e3.py>`__
config file.
Adapting an Open-Source Dataset
===============================
Different open-source datasets differ in how they are loaded and in
their column formats, so some adaptation is needed for whichever
dataset you use.
Loading
-------
XTuner loads data through ``load_dataset``, the unified loading
interface of the upstream ``datasets`` library.
.. code:: python

   data_path = 'timdettmers/openassistant-guanaco'
   train_dataset = dict(
       type=process_hf_dataset,
       dataset=dict(type=load_dataset, path=data_path),
       ...)
.. tip::

   In general, to switch to a different open-source dataset you only
   need to change the ``path`` argument in
   ``dataset=dict(type=load_dataset, path=data_path)``.

   To use an openMind dataset, replace the ``type`` in
   ``dataset=dict(type=load_dataset, path=data_path)`` with
   ``openmind.OmDataset``.
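As a sketch, a config adapted to a different dataset (here assuming the
``tatsu-lab/alpaca`` dataset and its matching built-in ``alpaca_map_fn``;
the import paths follow the pattern used elsewhere in this document)
might look like:

```python
# Hypothetical config fragment: switch the training data to the
# tatsu-lab/alpaca dataset and pair it with XTuner's alpaca_map_fn.
from datasets import load_dataset

from xtuner.dataset import process_hf_dataset
from xtuner.dataset.map_fns import alpaca_map_fn

data_path = 'tatsu-lab/alpaca'
train_dataset = dict(
    type=process_hf_dataset,
    dataset=dict(type=load_dataset, path=data_path),
    dataset_map_fn=alpaca_map_fn,
    ...)
```

The remaining ``train_dataset`` keys stay as in the base config.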
Column Format
-------------
To accommodate the varying column formats of open-source datasets,
XTuner provides a ``map_fn`` mechanism that converts each dataset into
a unified column format.
.. code:: python

   from xtuner.dataset.map_fns import oasst1_map_fn
   train_dataset = dict(
       type=process_hf_dataset,
       ...
       dataset_map_fn=oasst1_map_fn,
       ...)
XTuner ships with many built-in map_fns
(`see here <https://github.com/InternLM/xtuner/tree/main/xtuner/dataset/map_fns/dataset_map_fns>`__),
which cover most open-source datasets. Some commonly used map_fns,
their expected source columns, and reference datasets are listed below:
+------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------+
| map_fn | Columns | Reference Datasets |
+====================================================================================================================================+===================================================+=======================================================================================================================+
| `alpaca_map_fn <https://github.com/InternLM/xtuner/blob/main/xtuner/dataset/map_fns/dataset_map_fns/alpaca_map_fn.py>`__ | ['instruction', 'input', 'output', ...] | `tatsu-lab/alpaca <https://huggingface.co/datasets/tatsu-lab/alpaca>`__ |
+------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------+
| `alpaca_zh_map_fn <https://github.com/InternLM/xtuner/blob/main/xtuner/dataset/map_fns/dataset_map_fns/alpaca_zh_map_fn.py>`__ | ['instruction_zh', 'input_zh', 'output_zh', ...] | `silk-road/alpaca-data-gpt4-chinese <https://huggingface.co/datasets/silk-road/alpaca-data-gpt4-chinese>`__ |
+------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------+
| `oasst1_map_fn <https://github.com/InternLM/xtuner/blob/main/xtuner/dataset/map_fns/dataset_map_fns/oasst1_map_fn.py>`__ | ['text', ...] | `timdettmers/openassistant-guanaco <https://huggingface.co/datasets/timdettmers/openassistant-guanaco>`__ |
+------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------+
| `openai_map_fn <https://github.com/InternLM/xtuner/blob/main/xtuner/dataset/map_fns/dataset_map_fns/openai_map_fn.py>`__ | ['messages', ...] | `DavidLanz/fine_tuning_datraset_4_openai <https://huggingface.co/datasets/DavidLanz/fine_tuning_datraset_4_openai>`__ |
+------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------+
| `code_alpaca_map_fn <https://github.com/InternLM/xtuner/blob/main/xtuner/dataset/map_fns/dataset_map_fns/code_alpaca_map_fn.py>`__ | ['prompt', 'completion', ...] | `HuggingFaceH4/CodeAlpaca_20K <https://huggingface.co/datasets/HuggingFaceH4/CodeAlpaca_20K>`__ |
+------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------+
| `medical_map_fn <https://github.com/InternLM/xtuner/blob/main/xtuner/dataset/map_fns/dataset_map_fns/medical_map_fn.py>`__ | ['instruction', 'input', 'output', ...] | `shibing624/medical <https://huggingface.co/datasets/shibing624/medical>`__ |
+------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------+
| `tiny_codes_map_fn <https://github.com/InternLM/xtuner/blob/main/xtuner/dataset/map_fns/dataset_map_fns/tiny_codes_map_fn.py>`__ | ['prompt', 'response', ...] | `nampdn-ai/tiny-codes <https://huggingface.co/datasets/nampdn-ai/tiny-codes>`__ |
+------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------+
| `default_map_fn <https://github.com/InternLM/xtuner/blob/main/xtuner/dataset/map_fns/dataset_map_fns/default_map_fn.py>`__ | ['input', 'output', ...] | / |
+------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------+
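To make the column mapping concrete, here is a simplified sketch of
how an OpenAI-style ``messages`` column can be folded into the unified
``conversation`` format. This is an illustrative stand-in, not XTuner's
actual ``openai_map_fn`` implementation; it assumes strictly
alternating user/assistant turns and skips other roles:

```python
# Simplified sketch (not the exact XTuner implementation): pair each
# user message with the assistant reply that immediately follows it.
def openai_style_map_fn(example):
    messages = [m for m in example['messages']
                if m['role'] in ('user', 'assistant')]
    conversation = []
    for i in range(0, len(messages) - 1, 2):
        if (messages[i]['role'] == 'user'
                and messages[i + 1]['role'] == 'assistant'):
            conversation.append({'input': messages[i]['content'],
                                 'output': messages[i + 1]['content']})
    return {'conversation': conversation}


example = {'messages': [
    {'role': 'user', 'content': 'Hi'},
    {'role': 'assistant', 'content': 'Hello!'},
]}
print(openai_style_map_fn(example))
# {'conversation': [{'input': 'Hi', 'output': 'Hello!'}]}
```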
For example, for the ``timdettmers/openassistant-guanaco`` dataset,
XTuner provides the built-in ``oasst1_map_fn`` to unify its column
format. The implementation is as follows:
.. code:: python

   def oasst1_map_fn(example):
       r"""Example before preprocessing:
           example['text'] = ('### Human: Can you explain xxx'
                              '### Assistant: Sure! xxx'
                              '### Human: I didn't understand how xxx'
                              '### Assistant: It has to do with a process xxx.')

       Example after preprocessing:
           example['conversation'] = [
               {
                   'input': 'Can you explain xxx',
                   'output': 'Sure! xxx'
               },
               {
                   'input': 'I didn't understand how xxx',
                   'output': 'It has to do with a process xxx.'
               }
           ]
       """
       data = []
       for sentence in example['text'].strip().split('###'):
           sentence = sentence.strip()
           if sentence[:6] == 'Human:':
               data.append(sentence[6:].strip())
           elif sentence[:10] == 'Assistant:':
               data.append(sentence[10:].strip())
       if len(data) % 2:
           # The last round of conversation solely consists of input
           # without any output.
           # Discard the input part of the last round, as this part is
           # ignored in the loss calculation.
           data.pop()
       conversation = []
       for i in range(0, len(data), 2):
           single_turn_conversation = {'input': data[i], 'output': data[i + 1]}
           conversation.append(single_turn_conversation)
       return {'conversation': conversation}
As the code shows, ``oasst1_map_fn`` processes the ``text`` field of
the raw data and constructs a ``conversation`` field, ensuring that the
subsequent data-processing pipeline stays uniform.
Note that if an open-source dataset requires a special map_fn, you will
need to develop your own, using the provided map_fns as references, to
align the column format.
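As an illustration, a minimal custom map_fn for a hypothetical
single-turn dataset whose samples carry ``question`` and ``answer``
columns (names invented for this example) could be:

```python
# Hypothetical example: map a dataset whose samples carry 'question'
# and 'answer' columns into XTuner's unified 'conversation' format.
def custom_qa_map_fn(example):
    return {
        'conversation': [{
            'input': example['question'],
            'output': example['answer'],
        }]
    }


sample = {'question': 'What is 2 + 2?', 'answer': '4'}
print(custom_qa_map_fn(sample))
# {'conversation': [{'input': 'What is 2 + 2?', 'output': '4'}]}
```

The function is then passed in the config via
``dataset_map_fn=custom_qa_map_fn``, just like the built-in map_fns.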
Training
========
You can launch training with ``xtuner train``. The examples below
assume the config file is located at ``./config.py`` and that DeepSpeed
ZeRO-2 optimization is used.
Single Node, Single GPU
-----------------------
.. code:: console

   $ xtuner train ./config.py --deepspeed deepspeed_zero2
Single Node, Multiple GPUs
--------------------------
.. code:: console

   $ NPROC_PER_NODE=${GPU_NUM} xtuner train ./config.py --deepspeed deepspeed_zero2
Multiple Nodes, Multiple GPUs (e.g., 2 \* 8 GPUs)
-------------------------------------------------
**Option 1: torchrun**
.. code:: console

   $ # execute on node 0
   $ NPROC_PER_NODE=8 NNODES=2 PORT=$PORT ADDR=$NODE_0_ADDR NODE_RANK=0 xtuner train mixtral_8x7b_instruct_full_oasst1_e3 --deepspeed deepspeed_zero2
   $ # execute on node 1
   $ NPROC_PER_NODE=8 NNODES=2 PORT=$PORT ADDR=$NODE_0_ADDR NODE_RANK=1 xtuner train mixtral_8x7b_instruct_full_oasst1_e3 --deepspeed deepspeed_zero2
.. note::

   ``$PORT`` is the communication port and ``$NODE_0_ADDR`` is the IP
   address of node 0. Neither is a built-in environment variable;
   replace both with the actual values for your setup.
**Option 2: slurm**
.. code:: console

   $ srun -p $PARTITION --nodes=2 --gres=gpu:8 --ntasks-per-node=8 xtuner train internlm2_chat_7b_qlora_oasst1_e3 --launcher slurm --deepspeed deepspeed_zero2
Model Conversion
================
After training, the model is automatically saved as a PTH checkpoint
(e.g., ``iter_500.pth``). Use ``xtuner convert pth_to_hf`` to convert
it into a HuggingFace model for downstream use:
.. code:: console

   $ xtuner convert pth_to_hf ${CONFIG_NAME_OR_PATH} ${PTH} ${SAVE_PATH}
   $ # e.g., xtuner convert pth_to_hf ./config.py ./iter_500.pth ./iter_500_hf
.. _model-merging-optional:

Model Merging (Optional)
========================
If you fine-tuned with LoRA / QLoRA, the conversion yields only the
adapter parameters, not the original LLM weights. To obtain merged
model weights, use ``xtuner convert merge``:
.. code:: console

   $ xtuner convert merge ${LLM} ${ADAPTER_PATH} ${SAVE_PATH}
   $ # e.g., xtuner convert merge internlm/internlm2-chat-7b ./iter_500_hf ./iter_500_merged_llm
Chat
====
You can chat with the fine-tuned model via ``xtuner chat``:
.. code:: console

   $ xtuner chat ${NAME_OR_PATH_TO_LLM} --adapter ${NAME_OR_PATH_TO_ADAPTER} --prompt-template ${PROMPT_TEMPLATE} [optional arguments]
.. tip::

   For example:

   .. code:: console

      $ xtuner chat internlm/internlm2-chat-7b --adapter ./iter_500_hf --prompt-template internlm2_chat
      $ xtuner chat ./iter_500_merged_llm --prompt-template internlm2_chat