chore: update readme to be more clear (#1326) [skip ci]
README.md
CHANGED
|
@@ -22,7 +22,7 @@ Features:
|
|
| 22 |
- [Introduction](#axolotl)
|
| 23 |
- [Supported Features](#axolotl-supports)
|
| 24 |
- [Quickstart](#quickstart-)
|
| 25 |
-
- [
|
| 26 |
- [Docker](#docker)
|
| 27 |
- [Conda/Pip venv](#condapip-venv)
|
| 28 |
- [Cloud GPU](#cloud-gpu) - Latitude.sh, RunPod
|
|
@@ -87,25 +87,20 @@ Features:
|
|
| 87 |
| phi | ✅ | ✅ | ✅ | ❓ | ❓ | ❓ | ❓ |
|
| 88 |
| RWKV | ✅ | ❓ | ❓ | ❓ | ❓ | ❓ | ❓ |
|
| 89 |
| Qwen | ✅ | ✅ | ✅ | ❓ | ❓ | ❓ | ❓ |
|
|
|
|
| 90 |
|
| 91 |
|
| 92 |
## Quickstart ⚡
|
| 93 |
|
| 94 |
Get started with Axolotl in just a few steps! This quickstart guide will walk you through setting up and running a basic fine-tuning task.
|
| 95 |
|
| 96 |
-
**Requirements**: Python >=3.9 and Pytorch >=2.
|
| 97 |
|
| 98 |
`pip3 install "axolotl[flash-attn,deepspeed] @ git+https://github.com/OpenAccess-AI-Collective/axolotl"`
|
| 99 |
|
| 100 |
-
### For developers
|
| 101 |
-
```bash
|
| 102 |
-
git clone https://github.com/OpenAccess-AI-Collective/axolotl
|
| 103 |
-
cd axolotl
|
| 104 |
-
|
| 105 |
-
pip3 install packaging
|
| 106 |
-
pip3 install -e '.[flash-attn,deepspeed]'
|
| 107 |
-
```
|
| 108 |
-
|
| 109 |
### Usage
|
| 110 |
```bash
|
| 111 |
# preprocess datasets - optional but recommended
|
|
@@ -127,13 +122,14 @@ accelerate launch -m axolotl.cli.inference examples/openllama-3b/lora.yml \
|
|
| 127 |
accelerate launch -m axolotl.cli.train https://raw.githubusercontent.com/OpenAccess-AI-Collective/axolotl/main/examples/openllama-3b/lora.yml
|
| 128 |
```
|
| 129 |
|
| 130 |
-
##
|
| 131 |
|
| 132 |
### Environment
|
| 133 |
|
| 134 |
#### Docker
|
|
|
|
| 135 |
```bash
|
| 136 |
-
docker run --gpus '"all"' --rm -it winglian/axolotl:main-
|
| 137 |
```
|
| 138 |
|
| 139 |
Or run on the current files for development:
|
|
@@ -152,7 +148,7 @@ accelerate launch -m axolotl.cli.train https://raw.githubusercontent.com/OpenAcc
|
|
| 152 |
A more powerful Docker command to run would be this:
|
| 153 |
|
| 154 |
```bash
|
| 155 |
-
docker run --privileged --gpus '"all"' --shm-size 10g --rm -it --name axolotl --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --mount type=bind,src="${PWD}",target=/workspace/axolotl -v ${HOME}/.cache/huggingface:/root/.cache/huggingface winglian/axolotl:main-
|
| 156 |
```
|
| 157 |
|
| 158 |
It additionally:
|
|
@@ -242,15 +238,18 @@ Please use WSL or Docker!
|
|
| 242 |
|
| 243 |
#### Launching on public clouds via SkyPilot
|
| 244 |
To launch on GPU instances (both on-demand and spot instances) on 7+ clouds (GCP, AWS, Azure, OCI, and more), you can use [SkyPilot](https://skypilot.readthedocs.io/en/latest/index.html):
|
|
|
|
| 245 |
```bash
|
| 246 |
pip install "skypilot-nightly[gcp,aws,azure,oci,lambda,kubernetes,ibm,scp]" # choose your clouds
|
| 247 |
sky check
|
| 248 |
```
|
|
|
|
| 249 |
Get the [example YAMLs](https://github.com/skypilot-org/skypilot/tree/master/llm/axolotl) of using Axolotl to finetune `mistralai/Mistral-7B-v0.1`:
|
| 250 |
```
|
| 251 |
git clone https://github.com/skypilot-org/skypilot.git
|
| 252 |
cd skypilot/llm/axolotl
|
| 253 |
```
|
|
|
|
| 254 |
Use one command to launch:
|
| 255 |
```bash
|
| 256 |
# On-demand
|
|
@@ -260,32 +259,33 @@ HF_TOKEN=xx sky launch axolotl.yaml --env HF_TOKEN
|
|
| 260 |
HF_TOKEN=xx BUCKET=<unique-name> sky spot launch axolotl-spot.yaml --env HF_TOKEN --env BUCKET
|
| 261 |
```
|
| 262 |
|
| 263 |
-
|
| 264 |
### Dataset
|
| 265 |
|
| 266 |
Axolotl supports a variety of dataset formats. Below are some of the formats you can use.
|
| 267 |
Have dataset(s) in one of the following formats (JSONL recommended):
|
| 268 |
|
| 269 |
-
|
| 270 |
-
|
| 271 |
-
{"instruction": "...", "input": "...", "output": "..."}
|
| 272 |
-
```
|
| 273 |
-
- `sharegpt`: conversations where `from` is `human`/`gpt`. (optional: `system` to override default system prompt)
|
| 274 |
-
```json
|
| 275 |
-
{"conversations": [{"from": "...", "value": "..."}]}
|
| 276 |
-
```
|
| 277 |
-
- `llama-2`: the json is the same format as `sharegpt` above, with the following config (see the [config section](#config) for more details)
|
| 278 |
-
```yml
|
| 279 |
-
datasets:
|
| 280 |
-
- path: <your-path>
|
| 281 |
-
type: sharegpt
|
| 282 |
-
conversation: llama-2
|
| 283 |
-
```
|
| 284 |
- `completion`: raw corpus
|
| 285 |
```json
|
| 286 |
{"text": "..."}
|
| 287 |
```
|
| 288 |
|
| 289 |
<details>
|
| 290 |
|
| 291 |
<summary>See other formats</summary>
|
|
@@ -362,14 +362,28 @@ Have dataset(s) in one of the following format (JSONL recommended):
|
|
| 362 |
```json
|
| 363 |
{"scores": "...", "critiques": "...", "instruction": "...", "answer": "...", "revision": "..."}
|
| 364 |
```
|
| 365 |
-
- `pygmalion`: pygmalion
|
| 366 |
-
```json
|
| 367 |
-
{"conversations": [{"role": "...", "value": "..."}]}
|
| 368 |
-
```
|
| 369 |
- `metharme`: instruction, adds additional eos tokens
|
| 370 |
```json
|
| 371 |
{"prompt": "...", "generation": "..."}
|
| 372 |
```
|
| 373 |
- `sharegpt.load_role`: conversations where `role` is used instead of `from`
|
| 374 |
```json
|
| 375 |
{"conversations": [{"role": "...", "value": "..."}]}
|
|
@@ -385,6 +399,8 @@ Have dataset(s) in one of the following format (JSONL recommended):
|
|
| 385 |
|
| 386 |
</details>
|
| 387 |
|
|
|
|
|
|
|
| 388 |
#### How to add custom prompts
|
| 389 |
|
| 390 |
For a dataset that is preprocessed for instruction purposes:
|
|
@@ -406,12 +422,16 @@ datasets:
|
|
| 406 |
format: "[INST] {instruction} [/INST]"
|
| 407 |
no_input_format: "[INST] {instruction} [/INST]"
|
| 408 |
```
|
|
|
|
| 409 |
|
| 410 |
#### How to use your custom pretokenized dataset
|
| 411 |
|
| 412 |
- Do not pass a `type:`
|
| 413 |
- Columns in Dataset must be exactly `input_ids`, `attention_mask`, `labels`
|
| 414 |
|
| 415 |
|
| 416 |
### Config
|
| 417 |
|
|
@@ -425,22 +445,18 @@ See [examples](examples) for quick start. It is recommended to duplicate and mod
|
|
| 425 |
|
| 426 |
- dataset
|
| 427 |
```yaml
|
| 428 |
-
sequence_len: 2048 # max token length for prompt
|
| 429 |
-
|
| 430 |
-
# huggingface repo
|
| 431 |
datasets:
|
|
|
|
| 432 |
- path: vicgalle/alpaca-gpt4
|
| 433 |
-
type: alpaca
|
| 434 |
|
| 435 |
-
|
| 436 |
-
datasets:
|
| 437 |
- path: EleutherAI/pile
|
| 438 |
name: enron_emails
|
| 439 |
type: completion # format from earlier
|
| 440 |
field: text # Optional[str] default: text, field to use for completion data
|
| 441 |
|
| 442 |
-
|
| 443 |
-
datasets:
|
| 444 |
- path: bigcode/commitpackft
|
| 445 |
name:
|
| 446 |
- ruby
|
|
@@ -448,34 +464,29 @@ See [examples](examples) for quick start. It is recommended to duplicate and mod
|
|
| 448 |
- typescript
|
| 449 |
type: ... # unimplemented custom format
|
| 450 |
|
| 451 |
-
|
| 452 |
-
|
| 453 |
-
datasets:
|
| 454 |
- path: ...
|
| 455 |
type: sharegpt
|
| 456 |
-
conversation: chatml
|
| 457 |
|
| 458 |
-
|
| 459 |
-
datasets:
|
| 460 |
- path: data.jsonl # or json
|
| 461 |
ds_type: json # see other options below
|
| 462 |
type: alpaca
|
| 463 |
|
| 464 |
-
|
| 465 |
-
dataset:
|
| 466 |
- path: knowrohit07/know_sql
|
| 467 |
type: context_qa.load_v2
|
| 468 |
train_on_split: validation
|
| 469 |
|
| 470 |
-
|
| 471 |
-
|
| 472 |
-
dataset:
|
| 473 |
- path: s3://path_to_ds # Accepts folder with arrow/parquet or file path like above. Supports s3, gcs.
|
| 474 |
...
|
| 475 |
|
| 476 |
-
|
| 477 |
-
|
| 478 |
-
dataset:
|
| 479 |
- path: https://some.url.com/yourdata.jsonl # The URL should be a direct link to the file you wish to load. URLs must use HTTPS protocol, not HTTP.
|
| 480 |
ds_type: json # this is the default, see other options below.
|
| 481 |
```
|
|
@@ -484,9 +495,11 @@ See [examples](examples) for quick start. It is recommended to duplicate and mod
|
|
| 484 |
```yaml
|
| 485 |
load_in_4bit: true
|
| 486 |
load_in_8bit: true
|
|
|
|
| 487 |
bf16: auto # require >=ampere, auto will detect if your GPU supports this and choose automatically.
|
| 488 |
fp16: # leave empty to use fp16 when bf16 is 'auto'. set to false if you want to fallback to fp32
|
| 489 |
tf32: true # require >=ampere
|
|
|
|
| 490 |
bfloat16: true # require >=ampere, use instead of bf16 when you don't want AMP (automatic mixed precision)
|
| 491 |
float16: true # use instead of fp16 when you don't want AMP
|
| 492 |
```
|
|
@@ -494,7 +507,7 @@ See [examples](examples) for quick start. It is recommended to duplicate and mod
|
|
| 494 |
|
| 495 |
- lora
|
| 496 |
```yaml
|
| 497 |
-
adapter: lora # qlora or leave blank for full finetune
|
| 498 |
lora_r: 8
|
| 499 |
lora_alpha: 16
|
| 500 |
lora_dropout: 0.05
|
|
@@ -503,9 +516,9 @@ See [examples](examples) for quick start. It is recommended to duplicate and mod
|
|
| 503 |
- v_proj
|
| 504 |
```
|
| 505 |
|
| 506 |
-
<details>
|
| 507 |
|
| 508 |
-
<summary>All yaml options (click
|
| 509 |
|
| 510 |
```yaml
|
| 511 |
# This is the huggingface model that contains *.pt, *.safetensors, or *.bin files
|
|
@@ -535,12 +548,13 @@ tokenizer_legacy:
|
|
| 535 |
# This is reported to improve training speed on some models
|
| 536 |
resize_token_embeddings_to_32x:
|
| 537 |
|
|
|
|
| 538 |
# Used to identify which model family the model is derived from
|
| 539 |
is_falcon_derived_model:
|
| 540 |
is_llama_derived_model:
|
|
|
|
| 541 |
# Please note that if you set this to true, `padding_side` will be set to "left" by default
|
| 542 |
is_mistral_derived_model:
|
| 543 |
-
is_qwen_derived_model:
|
| 544 |
|
| 545 |
# optional overrides to the base model configuration
|
| 546 |
model_config_overrides:
|
|
@@ -633,7 +647,7 @@ test_datasets:
|
|
| 633 |
data_files:
|
| 634 |
- /workspace/data/eval.jsonl
|
| 635 |
|
| 636 |
-
# use RL training: dpo, ipo, kto_pair
|
| 637 |
rl:
|
| 638 |
|
| 639 |
# Saves the desired chat template to the tokenizer_config.json for easier inferencing
|
|
@@ -653,7 +667,7 @@ dataset_processes: # defaults to os.cpu_count() if not set
|
|
| 653 |
# Only needed if cached dataset is taking too much storage
|
| 654 |
dataset_keep_in_memory:
|
| 655 |
# push checkpoints to hub
|
| 656 |
-
hub_model_id: # repo path to push finetuned model
|
| 657 |
# how to push checkpoints to hub
|
| 658 |
# https://huggingface.co/docs/transformers/v4.31.0/en/main_classes/trainer#transformers.TrainingArguments.hub_strategy
|
| 659 |
hub_strategy:
|
|
@@ -1100,7 +1114,7 @@ Please use `--sample_packing False` if you have it on and receive the error simi
|
|
| 1100 |
|
| 1101 |
### Merge LORA to base
|
| 1102 |
|
| 1103 |
-
The following command will merge your LORA adapter with your base model.
|
| 1104 |
|
| 1105 |
```bash
|
| 1106 |
python3 -m axolotl.cli.merge_lora your_config.yml --lora_model_dir="./completed-model"
|
|
@@ -1161,7 +1175,7 @@ If you decode a prompt constructed by axolotl, you might see spaces between toke
|
|
| 1161 |
|
| 1162 |
1. Materialize some data using `python -m axolotl.cli.preprocess your_config.yml --debug`, and then decode the first few rows with your model's tokenizer.
|
| 1163 |
2. During inference, right before you pass a tensor of token ids to your model, decode these tokens back into a string.
|
| 1164 |
-
3. Make sure the inference string from #2 looks **exactly** like the data you fine tuned on from #1, including spaces and new lines. If they aren't the same adjust your inference server accordingly.
|
| 1165 |
4. As an additional troubleshooting step, you can compare the token ids from steps 1 and 2 to make sure they are identical.
|
| 1166 |
|
| 1167 |
Having misalignment between your prompts during training and inference can cause models to perform very poorly, so it is worth checking this. See [this blog post](https://hamel.dev/notes/llm/05_tokenizer_gotchas.html) for a concrete example.
|
|
@@ -1208,11 +1222,20 @@ PRs are **greatly welcome**!
|
|
| 1208 |
|
| 1209 |
Please run the commands below to set up your environment:
|
| 1210 |
```bash
|
| 1211 |
pip3 install -r requirements-dev.txt -r requirements-tests.txt
|
| 1212 |
pre-commit install
|
| 1213 |
|
| 1214 |
# test
|
| 1215 |
pytest tests/
|
| 1216 |
```
|
| 1217 |
|
| 1218 |
Thanks to all of our contributors to date. Help drive open source AI progress forward by contributing to Axolotl.
|
|
|
|
| 22 |
- [Introduction](#axolotl)
|
| 23 |
- [Supported Features](#axolotl-supports)
|
| 24 |
- [Quickstart](#quickstart-)
|
| 25 |
+
- [Environment](#environment)
|
| 26 |
- [Docker](#docker)
|
| 27 |
- [Conda/Pip venv](#condapip-venv)
|
| 28 |
- [Cloud GPU](#cloud-gpu) - Latitude.sh, RunPod
|
|
|
|
| 87 |
| phi | ✅ | ✅ | ✅ | ❓ | ❓ | ❓ | ❓ |
|
| 88 |
| RWKV | ✅ | ❓ | ❓ | ❓ | ❓ | ❓ | ❓ |
|
| 89 |
| Qwen | ✅ | ✅ | ✅ | ❓ | ❓ | ❓ | ❓ |
|
| 90 |
+
| Gemma | ✅ | ✅ | ✅ | ❓ | ❓ | ✅ | ❓ |
|
| 91 |
|
| 92 |
+
✅: supported
|
| 93 |
+
❌: not supported
|
| 94 |
+
❓: untested
|
| 95 |
|
| 96 |
## Quickstart ⚡
|
| 97 |
|
| 98 |
Get started with Axolotl in just a few steps! This quickstart guide will walk you through setting up and running a basic fine-tuning task.
|
| 99 |
|
| 100 |
+
**Requirements**: Python >=3.9 and PyTorch >=2.1.1.
|
| 101 |
|
| 102 |
`pip3 install "axolotl[flash-attn,deepspeed] @ git+https://github.com/OpenAccess-AI-Collective/axolotl"`
|
| 103 |
|
| 104 |
### Usage
|
| 105 |
```bash
|
| 106 |
# preprocess datasets - optional but recommended
|
|
|
|
| 122 |
accelerate launch -m axolotl.cli.train https://raw.githubusercontent.com/OpenAccess-AI-Collective/axolotl/main/examples/openllama-3b/lora.yml
|
| 123 |
```
|
| 124 |
|
| 125 |
+
## Advanced Setup
|
| 126 |
|
| 127 |
### Environment
|
| 128 |
|
| 129 |
#### Docker
|
| 130 |
+
|
| 131 |
```bash
|
| 132 |
+
docker run --gpus '"all"' --rm -it winglian/axolotl:main-latest
|
| 133 |
```
|
| 134 |
|
| 135 |
Or run on the current files for development:
|
|
|
|
| 148 |
A more powerful Docker command to run would be this:
|
| 149 |
|
| 150 |
```bash
|
| 151 |
+
docker run --privileged --gpus '"all"' --shm-size 10g --rm -it --name axolotl --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --mount type=bind,src="${PWD}",target=/workspace/axolotl -v ${HOME}/.cache/huggingface:/root/.cache/huggingface winglian/axolotl:main-latest
|
| 152 |
```
|
| 153 |
|
| 154 |
It additionally:
|
|
|
|
| 238 |
|
| 239 |
#### Launching on public clouds via SkyPilot
|
| 240 |
To launch on GPU instances (both on-demand and spot instances) on 7+ clouds (GCP, AWS, Azure, OCI, and more), you can use [SkyPilot](https://skypilot.readthedocs.io/en/latest/index.html):
|
| 241 |
+
|
| 242 |
```bash
|
| 243 |
pip install "skypilot-nightly[gcp,aws,azure,oci,lambda,kubernetes,ibm,scp]" # choose your clouds
|
| 244 |
sky check
|
| 245 |
```
|
| 246 |
+
|
| 247 |
Get the [example YAMLs](https://github.com/skypilot-org/skypilot/tree/master/llm/axolotl) of using Axolotl to finetune `mistralai/Mistral-7B-v0.1`:
|
| 248 |
```
|
| 249 |
git clone https://github.com/skypilot-org/skypilot.git
|
| 250 |
cd skypilot/llm/axolotl
|
| 251 |
```
|
| 252 |
+
|
| 253 |
Use one command to launch:
|
| 254 |
```bash
|
| 255 |
# On-demand
|
|
|
|
| 259 |
HF_TOKEN=xx BUCKET=<unique-name> sky spot launch axolotl-spot.yaml --env HF_TOKEN --env BUCKET
|
| 260 |
```
|
| 261 |
|
|
|
|
| 262 |
### Dataset
|
| 263 |
|
| 264 |
Axolotl supports a variety of dataset formats. Below are some of the formats you can use.
|
| 265 |
Have dataset(s) in one of the following formats (JSONL recommended):
|
| 266 |
|
| 267 |
+
#### Pretraining
|
| 268 |
+
|
| 269 |
- `completion`: raw corpus
|
| 270 |
```json
|
| 271 |
{"text": "..."}
|
| 272 |
```
|
| 273 |
|
| 274 |
+
Note: Axolotl usually loads the entire dataset into memory. This will be challenging for large datasets. Use the following config to enable streaming:
|
| 275 |
+
|
| 276 |
+
```yaml
|
| 277 |
+
pretraining_dataset: # hf path only
|
| 278 |
+
```
|
| 279 |
+
|
| 280 |
+
#### Supervised finetuning
|
| 281 |
+
|
| 282 |
+
##### Instruction
|
| 283 |
+
|
| 284 |
+
- `alpaca`: instruction; input(optional)
|
| 285 |
+
```json
|
| 286 |
+
{"instruction": "...", "input": "...", "output": "..."}
|
| 287 |
+
```
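A minimal sketch of writing a file in the `alpaca` layout shown above, one JSON object per line. The helper name `write_alpaca_jsonl` and the file name `alpaca_sample.jsonl` are hypothetical, and writing an empty string when `input` is absent is an assumption for illustration, not something the README specifies.

```python
import json
from pathlib import Path


def write_alpaca_jsonl(records, path):
    """Write records as JSON Lines using the alpaca keys shown above."""
    path = Path(path)
    with path.open("w", encoding="utf-8") as handle:
        for record in records:
            row = {
                "instruction": record["instruction"],
                # `input` is optional in the alpaca format; the empty-string
                # default here is an assumption made for this sketch.
                "input": record.get("input", ""),
                "output": record["output"],
            }
            handle.write(json.dumps(row, ensure_ascii=False) + "\n")
    return path


if __name__ == "__main__":
    sample = [
        {"instruction": "Translate to French.", "input": "Good morning", "output": "Bonjour"},
        {"instruction": "Name a prime number.", "output": "7"},
    ]
    print(write_alpaca_jsonl(sample, "alpaca_sample.jsonl"))
```

The resulting file could then be referenced from a `datasets:` entry with `ds_type: json` and `type: alpaca`, as in the local-file example in the Config section.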
|
| 288 |
+
|
| 289 |
<details>
|
| 290 |
|
| 291 |
<summary>See other formats</summary>
|
|
|
|
| 362 |
```json
|
| 363 |
{"scores": "...", "critiques": "...", "instruction": "...", "answer": "...", "revision": "..."}
|
| 364 |
```
|
| 365 |
- `metharme`: instruction, adds additional eos tokens
|
| 366 |
```json
|
| 367 |
{"prompt": "...", "generation": "..."}
|
| 368 |
```
|
| 369 |
+
|
| 370 |
+
</details>
|
| 371 |
+
|
| 372 |
+
##### Conversation
|
| 373 |
+
|
| 374 |
+
- `sharegpt`: conversations where `from` is `human`/`gpt`. (optional: first row with role `system` to override default system prompt)
|
| 375 |
+
```json
|
| 376 |
+
{"conversations": [{"from": "...", "value": "..."}]}
|
| 377 |
+
```
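Before training, it can help to sanity-check that each record matches the `sharegpt` shape described above. The following is a rough sketch under the stated rules only (`from` is `human` or `gpt`, with an optional `system` turn in the first row); the function name `check_sharegpt_record` is hypothetical and is not part of Axolotl.

```python
ALLOWED_SPEAKERS = {"human", "gpt"}


def check_sharegpt_record(record):
    """Return a list of problems found in one sharegpt-style record (a dict)."""
    problems = []
    turns = record.get("conversations")
    if not isinstance(turns, list) or not turns:
        return ["`conversations` must be a non-empty list"]
    for index, turn in enumerate(turns):
        if not isinstance(turn, dict):
            problems.append(f"turn {index}: expected an object")
            continue
        speaker = turn.get("from")
        # A `system` turn is only accepted as the first row.
        if speaker == "system" and index == 0:
            pass
        elif speaker not in ALLOWED_SPEAKERS:
            problems.append(f"turn {index}: unexpected `from` value {speaker!r}")
        if not isinstance(turn.get("value"), str):
            problems.append(f"turn {index}: `value` must be a string")
    return problems


if __name__ == "__main__":
    record = {
        "conversations": [
            {"from": "system", "value": "You are concise."},
            {"from": "human", "value": "Hi"},
            {"from": "gpt", "value": "Hello!"},
        ]
    }
    print(check_sharegpt_record(record))
```

Returning a list of problems rather than raising on the first one makes it easier to report every issue in a large file in a single pass.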
|
| 378 |
+
|
| 379 |
+
<details>
|
| 380 |
+
|
| 381 |
+
<summary>See other formats</summary>
|
| 382 |
+
|
| 383 |
+
- `pygmalion`: pygmalion
|
| 384 |
+
```json
|
| 385 |
+
{"conversations": [{"role": "...", "value": "..."}]}
|
| 386 |
+
```
|
| 387 |
- `sharegpt.load_role`: conversations where `role` is used instead of `from`
|
| 388 |
```json
|
| 389 |
{"conversations": [{"role": "...", "value": "..."}]}
|
|
|
|
| 399 |
|
| 400 |
</details>
|
| 401 |
|
| 402 |
+
Note: `type: sharegpt` enables an additional config option, `conversation:`, which converts the data to one of many conversation types. See the dataset section under [all yaml options](#all-yaml-options).
|
| 403 |
+
|
| 404 |
#### How to add custom prompts
|
| 405 |
|
| 406 |
For a dataset that is preprocessed for instruction purposes:
|
|
|
|
| 422 |
format: "[INST] {instruction} [/INST]"
|
| 423 |
no_input_format: "[INST] {instruction} [/INST]"
|
| 424 |
```
|
| 425 |
+
See full config options under [all yaml options](#all-yaml-options).
|
| 426 |
|
| 427 |
#### How to use your custom pretokenized dataset
|
| 428 |
|
| 429 |
- Do not pass a `type:`
|
| 430 |
- Columns in Dataset must be exactly `input_ids`, `attention_mask`, `labels`
|
| 431 |
|
| 432 |
+
```yaml
|
| 433 |
+
- path: ...
|
| 434 |
+
```
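Because the column names must match exactly, a quick check on one row can catch problems early. This is a minimal sketch, not Axolotl's own validation: the helper `check_pretokenized_row` is hypothetical, and the equal-length check is an added assumption based on how tokenized rows are usually laid out.

```python
REQUIRED_COLUMNS = {"input_ids", "attention_mask", "labels"}


def check_pretokenized_row(row):
    """Return a list of problems for one pretokenized row (a dict of lists)."""
    columns = set(row)
    problems = []
    missing = sorted(REQUIRED_COLUMNS - columns)
    extra = sorted(columns - REQUIRED_COLUMNS)
    if missing:
        problems.append(f"missing columns: {missing}")
    if extra:
        problems.append(f"unexpected columns: {extra}")
    # Assumption: the three sequences are aligned token by token,
    # so they should all have the same length.
    lengths = {len(row[name]) for name in REQUIRED_COLUMNS & columns}
    if len(lengths) > 1:
        problems.append("columns differ in length")
    return problems


if __name__ == "__main__":
    row = {"input_ids": [1, 2, 3], "attention_mask": [1, 1, 1], "labels": [1, 2, 3]}
    print(check_pretokenized_row(row))
```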
|
| 435 |
|
| 436 |
### Config
|
| 437 |
|
|
|
|
| 445 |
|
| 446 |
- dataset
|
| 447 |
```yaml
|
| 448 |
datasets:
|
| 449 |
+
# huggingface repo
|
| 450 |
- path: vicgalle/alpaca-gpt4
|
| 451 |
+
type: alpaca
|
| 452 |
|
| 453 |
+
# huggingface repo with specific configuration/subset
|
|
|
|
| 454 |
- path: EleutherAI/pile
|
| 455 |
name: enron_emails
|
| 456 |
type: completion # format from earlier
|
| 457 |
field: text # Optional[str] default: text, field to use for completion data
|
| 458 |
|
| 459 |
+
# huggingface repo with multiple named configurations/subsets
|
|
|
|
| 460 |
- path: bigcode/commitpackft
|
| 461 |
name:
|
| 462 |
- ruby
|
|
|
|
| 464 |
- typescript
|
| 465 |
type: ... # unimplemented custom format
|
| 466 |
|
| 467 |
+
# fastchat conversation
|
| 468 |
+
# See 'conversation' options: https://github.com/lm-sys/FastChat/blob/main/fastchat/conversation.py
|
|
|
|
| 469 |
- path: ...
|
| 470 |
type: sharegpt
|
| 471 |
+
conversation: chatml # default: vicuna_v1.1
|
| 472 |
|
| 473 |
+
# local
|
|
|
|
| 474 |
- path: data.jsonl # or json
|
| 475 |
ds_type: json # see other options below
|
| 476 |
type: alpaca
|
| 477 |
|
| 478 |
+
# dataset with splits, but no train split
|
|
|
|
| 479 |
- path: knowrohit07/know_sql
|
| 480 |
type: context_qa.load_v2
|
| 481 |
train_on_split: validation
|
| 482 |
|
| 483 |
+
# loading from s3 or gcs
|
| 484 |
+
# s3 creds will be loaded from the system default and gcs only supports public access
|
|
|
|
| 485 |
- path: s3://path_to_ds # Accepts folder with arrow/parquet or file path like above. Supports s3, gcs.
|
| 486 |
...
|
| 487 |
|
| 488 |
+
# Loading Data From a Public URL
|
| 489 |
+
# - The file format is `json` (which includes `jsonl`) by default. For different formats, adjust the `ds_type` option accordingly.
|
|
|
|
| 490 |
- path: https://some.url.com/yourdata.jsonl # The URL should be a direct link to the file you wish to load. URLs must use HTTPS protocol, not HTTP.
|
| 491 |
ds_type: json # this is the default, see other options below.
|
| 492 |
```
|
|
|
|
| 495 |
```yaml
|
| 496 |
load_in_4bit: true
|
| 497 |
load_in_8bit: true
|
| 498 |
+
|
| 499 |
bf16: auto # require >=ampere, auto will detect if your GPU supports this and choose automatically.
|
| 500 |
fp16: # leave empty to use fp16 when bf16 is 'auto'. set to false if you want to fallback to fp32
|
| 501 |
tf32: true # require >=ampere
|
| 502 |
+
|
| 503 |
bfloat16: true # require >=ampere, use instead of bf16 when you don't want AMP (automatic mixed precision)
|
| 504 |
float16: true # use instead of fp16 when you don't want AMP
|
| 505 |
```
|
|
|
|
| 507 |
|
| 508 |
- lora
|
| 509 |
```yaml
|
| 510 |
+
adapter: lora # 'qlora' or leave blank for full finetune
|
| 511 |
lora_r: 8
|
| 512 |
lora_alpha: 16
|
| 513 |
lora_dropout: 0.05
|
|
|
|
| 516 |
- v_proj
|
| 517 |
```
|
| 518 |
|
| 519 |
+
<details id="all-yaml-options">
|
| 520 |
|
| 521 |
+
<summary>All yaml options (click to expand)</summary>
|
| 522 |
|
| 523 |
```yaml
|
| 524 |
# This is the huggingface model that contains *.pt, *.safetensors, or *.bin files
|
|
|
|
| 548 |
# This is reported to improve training speed on some models
|
| 549 |
resize_token_embeddings_to_32x:
|
| 550 |
|
| 551 |
+
# (Internal use only)
|
| 552 |
# Used to identify which model family the model is derived from
|
| 553 |
is_falcon_derived_model:
|
| 554 |
is_llama_derived_model:
|
| 555 |
+
is_qwen_derived_model:
|
| 556 |
# Please note that if you set this to true, `padding_side` will be set to "left" by default
|
| 557 |
is_mistral_derived_model:
|
|
|
|
| 558 |
|
| 559 |
# optional overrides to the base model configuration
|
| 560 |
model_config_overrides:
|
|
|
|
| 647 |
data_files:
|
| 648 |
- /workspace/data/eval.jsonl
|
| 649 |
|
| 650 |
+
# use RL training: 'dpo', 'ipo', 'kto_pair'
|
| 651 |
rl:
|
| 652 |
|
| 653 |
# Saves the desired chat template to the tokenizer_config.json for easier inferencing
|
|
|
|
| 667 |
# Only needed if cached dataset is taking too much storage
|
| 668 |
dataset_keep_in_memory:
|
| 669 |
# push checkpoints to hub
|
| 670 |
+
hub_model_id: # private repo path to push finetuned model
|
| 671 |
# how to push checkpoints to hub
|
| 672 |
# https://huggingface.co/docs/transformers/v4.31.0/en/main_classes/trainer#transformers.TrainingArguments.hub_strategy
|
| 673 |
hub_strategy:
|
|
|
|
| 1114 |
|
| 1115 |
### Merge LORA to base
|
| 1116 |
|
| 1117 |
+
The following command will merge your LORA adapter with your base model. You can optionally pass the argument `--lora_model_dir` to specify the directory where your LORA adapter was saved; otherwise, this will be inferred from `output_dir` in your axolotl config file. The merged model is saved in the sub-directory `{lora_model_dir}/merged`.
|
| 1118 |
|
| 1119 |
```bash
|
| 1120 |
python3 -m axolotl.cli.merge_lora your_config.yml --lora_model_dir="./completed-model"
|
|
|
|
| 1175 |
|
| 1176 |
1. Materialize some data using `python -m axolotl.cli.preprocess your_config.yml --debug`, and then decode the first few rows with your model's tokenizer.
|
| 1177 |
2. During inference, right before you pass a tensor of token ids to your model, decode these tokens back into a string.
|
| 1178 |
+
3. Make sure the inference string from #2 looks **exactly** like the data you fine tuned on from #1, including spaces and new lines. If they aren't the same, adjust your inference server accordingly.
|
| 1179 |
4. As an additional troubleshooting step, you can compare the token ids from steps 1 and 2 to make sure they are identical.
|
| 1180 |
|
| 1181 |
Having misalignment between your prompts during training and inference can cause models to perform very poorly, so it is worth checking this. See [this blog post](https://hamel.dev/notes/llm/05_tokenizer_gotchas.html) for a concrete example.
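The comparison in step 3 can be sketched as a small helper that reports the first position where the two decoded strings diverge. The function `first_mismatch` is hypothetical and works on plain strings only; obtaining those strings from your tokenizer and inference server is left to your own setup.

```python
def first_mismatch(training_text, inference_text, context=20):
    """Return None if the strings are identical, else details of the first difference."""
    if training_text == inference_text:
        return None
    limit = min(len(training_text), len(inference_text))
    # Find the first differing character; if one string is a prefix of the
    # other, the mismatch is at the end of the shorter string.
    index = next(
        (i for i in range(limit) if training_text[i] != inference_text[i]),
        limit,
    )
    start = max(0, index - context)
    return {
        "index": index,
        # repr() makes spaces and newlines visible in the output.
        "training": repr(training_text[start:index + context]),
        "inference": repr(inference_text[start:index + context]),
    }


if __name__ == "__main__":
    print(first_mismatch("[INST] hi [/INST]", "[INST]hi [/INST]"))
```

Showing the surrounding characters with `repr()` is a deliberate choice here, since the differences that matter are often a single space or newline.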
|
|
|
|
| 1222 |
|
| 1223 |
Please run the commands below to set up your environment:
|
| 1224 |
```bash
|
| 1225 |
+
git clone https://github.com/OpenAccess-AI-Collective/axolotl
|
| 1226 |
+
cd axolotl
|
| 1227 |
+
|
| 1228 |
+
pip3 install packaging
|
| 1229 |
+
pip3 install -e '.[flash-attn,deepspeed]'
|
| 1230 |
+
|
| 1231 |
pip3 install -r requirements-dev.txt -r requirements-tests.txt
|
| 1232 |
pre-commit install
|
| 1233 |
|
| 1234 |
# test
|
| 1235 |
pytest tests/
|
| 1236 |
+
|
| 1237 |
+
# optional: run against all files
|
| 1238 |
+
pre-commit run --all-files
|
| 1239 |
```
|
| 1240 |
|
| 1241 |
Thanks to all of our contributors to date. Help drive open source AI progress forward by contributing to Axolotl.
|