# Template-free prompt construction with the `input_output` format

<!-- TOC -->

- [Background](#background)
  - [Masking Inputs](#masking-inputs)
  - [You may not want prompt templates](#you-may-not-want-prompt-templates)
  - [The `input_output` format](#the-input_output-format)
- [Usage](#usage)
  - [1. Prepare Data](#1-prepare-data)
  - [2. Use `type: input_output`](#2-use-type-input_output)
  - [3. Check the prompts](#3-check-the-prompts)

<!-- /TOC -->
| <a id="markdown-background" name="background"></a> | |
| ## Background | |
| <a id="markdown-masking-inputs" name="masking-inputs"></a> | |
| ### Masking Inputs | |
| One of the most popular features of | |
| [axolotl](https://github.com/OpenAccess-AI-Collective/axolotl) is | |
| setting the following configuration value: | |
| ```yaml | |
| train_on_inputs: false | |
| ``` | |
If you declare a [dataset format](https://github.com/OpenAccess-AI-Collective/axolotl?tab=readme-ov-file#dataset)
such as `alpaca` or `chatml`, axolotl knows what is an input
(i.e. human) vs. an output (i.e. the assistant) and masks the input
labels so that your model can focus on predicting the outputs only.
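
Conceptually, masking means the labels for input tokens are replaced with `-100` so they are excluded from the loss, while output tokens keep their token ids as labels. Here is a minimal sketch of the idea (the token ids and split point are made up for illustration; this is not axolotl's actual implementation):

```python
# Toy illustration of input masking (not axolotl's real code).
# Tokens belonging to the input get label -100, so the loss ignores them;
# output tokens keep their token ids as labels.
input_ids = [1, 22557, 13, 12014, 736, 2]  # hypothetical prompt + response token ids
n_input = 3                                # hypothetical boundary: first 3 tokens are the prompt

labels = [-100] * n_input + input_ids[n_input:]
print(labels)  # [-100, -100, -100, 12014, 736, 2]
```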
| <a id="markdown-you-may-not-want-prompt-templates" name="you-may-not-want-prompt-templates"></a> | |
| ### You may not want prompt templates | |
| However, there are many situations where you don't want to use one of | |
| these formats or templates (I usually don't!). This is because they can: | |
| - Add unnecessary boilerplate to your prompts. | |
| - Create artifacts like special delimiters `<|im_start|>` that can | |
| quickly become footguns if you don't include them correctly at | |
| inference time. | |
| - Enforce a *chat* interface when you do not want one. Sometimes you | |
| just want to fine-tune a model to a very specific task and do NOT | |
| want multi-turn conversations, roles, etc. | |
| - Limit you to only certain roles that the template allows. | |
| <a id="markdown-the-inputoutput-format" name="the-inputoutput-format"></a> | |
| ### The `input_output` format | |
| You can construct your prompts without a template by using the | |
| `input_output` format, by setting `type: input_output` in your | |
| configuration file like this: | |
**config.yml**

```yaml
train_on_inputs: false # Mask segments of your data
datasets:
  - path: output.jsonl
    type: input_output # use template free prompt construction
```

Unlike `type: completion`, which is also template-free,
`type: input_output` allows you to mask segments of your text. More
details on how this works are described below.

<a id="markdown-usage" name="usage"></a>
## Usage

This is how you can use the `input_output` format:

<a id="markdown-1-prepare-data" name="1-prepare-data"></a>
### 1. Prepare Data
To use the `input_output` format, collect your data in the following
format into a jsonl file (below is the first row from the file
`output.jsonl`, pretty printed):
```bash
$ head -n1 output.jsonl | python -m json.tool
{
    "segments": [
        {
            "label": true,
            "text": "<s>Hello\n"
        },
        {
            "label": true,
            "text": "hi there!. "
        },
        {
            "label": false,
            "text": "goodbye "
        },
        {
            "label": true,
            "text": "farewell</s>"
        }
    ]
}
```
Set `label: false` when you want to mask a segment of text so that the
model isn't trained on it. Some things to keep in mind:
> [!IMPORTANT]
> 1. **EOS, BOS, spaces, newlines etc. are entirely up to you. Axolotl
>    concatenates all the segments as-is.** The tokenizer doesn't add
>    anything additional. Notice how I added spaces, newlines, `<s>`
>    (BOS), and `</s>` (EOS) myself.
> 2. Make sure you check the materialized output to validate that the
>    prompt is getting assembled how you like.
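
If you are assembling this file programmatically, a few lines of Python are enough. Below is a minimal sketch that writes the same toy example shown above; note that, per the note above, BOS/EOS tokens and whitespace are added by hand:

```python
import json

# One JSON object per line; each segment's label controls whether its
# tokens are trained on (true) or masked with -100 (false).
example = {
    "segments": [
        {"label": True, "text": "<s>Hello\n"},    # BOS added manually
        {"label": True, "text": "hi there!. "},
        {"label": False, "text": "goodbye "},     # masked segment
        {"label": True, "text": "farewell</s>"},  # EOS added manually
    ]
}

with open("output.jsonl", "w") as f:
    f.write(json.dumps(example) + "\n")
```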
| <a id="markdown-2-use-type-inputoutput" name="2-use-type-inputoutput"></a> | |
| ### 2. Use `type: input_output` | |
| Let's materialize data with our `output.jsonl` file by setting | |
| `type: input_output` in our axolotl config: | |
```yaml
# training_config.yaml
base_model: mistralai/Mistral-7B-v0.1
data_seed: 49
seed: 49
datasets:
  - path: output.jsonl
    type: input_output
val_set_size: 0.1
sequence_len: 896
sample_packing: false
micro_batch_size: 2
gradient_accumulation_steps: 3
eval_batch_size: 2
num_epochs: 1
learning_rate: 0.0002
train_on_inputs: false
special_tokens:
  bos_token: "<s>"
  eos_token: "</s>"
  unk_token: "<unk>"
```
You can use the following command to materialize your data. The
`--debug` flag will print the tokens along with their labels so you can
verify that the correct items are being ignored:
```bash
$ python -m axolotl.cli.preprocess training_config.yaml --debug
...
[2024-03-05 23:36:46,969] [INFO] [axolotl.check_example_labels:35] [PID:607731] [RANK:0] <s>(1, 1) Hello(22557, 22557)
(13, 13) hi(12014, 12014) there(736, 736) !(28808, 28808) .(28723, 28723) (28705, 28705) good(-100, 1179) bye(-100, 17664) (-100, 28705) fare(19111, 19111) well(5458, 5458) </s>(2, 2)
```
The format is `decoded_token`(`label`, `token_id`). For example,
`<s>(1, 1)` means that the token is `<s>`, the label is `1`, and the
token_id is `1`. When the label is `-100`, that token is ignored during
training.
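
The `-100` value is not arbitrary: it is the default `ignore_index` for cross-entropy loss in PyTorch, which is what Hugging Face models use for their labels. A small sketch (with made-up logits) of how such positions drop out of the loss:

```python
import torch
import torch.nn.functional as F

# -100 is PyTorch's default ignore_index for cross-entropy, so positions
# labeled -100 contribute nothing to the loss or its gradients.
logits = torch.randn(5, 32000)                     # (seq_len, vocab_size), made-up values
labels = torch.tensor([22557, 13, -100, -100, 2])  # two masked positions

loss = F.cross_entropy(logits, labels)             # averaged over the 3 unmasked tokens only
```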
| <a id="markdown-3-check-the-prompts" name="3-check-the-prompts"></a> | |
| ### 3. Check the prompts | |
| Here is another way to check the materialized output: | |
```python
from pathlib import Path

import yaml
from datasets import load_from_disk
from transformers import AutoTokenizer

# axolotl writes the materialized dataset to a subdirectory of last_run_prepared/
directory = [d for d in Path('last_run_prepared').iterdir() if d.is_dir()]
with open('training_config.yaml', 'r') as f:
    cfg = yaml.safe_load(f)
model_id = cfg['base_model']
tok = AutoTokenizer.from_pretrained(model_id)
ds = load_from_disk(str(directory[0]))
```
```python
>>> row = ds[0]
>>> print(tok.decode(row['input_ids']))
<s> Hello
hi there!. goodbye farewell</s>
```
We can check that the right tokens are ignored by comparing the labels
to each token:
```python
import pandas as pd

pd.DataFrame([{'token': tok.decode(i), 'label': l, 'id': i}
              for i, l in zip(row['input_ids'], row['labels'])])
```
|    | token  | label | id    |
|----|--------|-------|-------|
| 0  | \<s\>  | 1     | 1     |
| 1  | Hello  | 22557 | 22557 |
| 2  | \\n    | 13    | 13    |
| 3  | hi     | 12014 | 12014 |
| 4  | there  | 736   | 736   |
| 5  | !      | 28808 | 28808 |
| 6  | .      | 28723 | 28723 |
| 7  |        | 28705 | 28705 |
| 8  | good   | -100  | 1179  |
| 9  | bye    | -100  | 17664 |
| 10 |        | -100  | 28705 |
| 11 | fare   | 19111 | 19111 |
| 12 | well   | 5458  | 5458  |
| 13 | \</s\> | 2     | 2     |
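
As a final check, you can decode only the positions whose label is `-100`; the result should correspond to the single `label: false` segment (`"goodbye "`):

```python
# Gather the token ids at masked positions and decode them; this should
# recover the text of the label:false segment.
masked_ids = [i for i, l in zip(row['input_ids'], row['labels']) if l == -100]
print(repr(tok.decode(masked_ids)))
```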
If we look at the input data, the above table seems correct! (The jsonl
version is repeated below for reference):

```bash
$ head -n1 output.jsonl | python -m json.tool
{
    "segments": [
        {
            "label": true,
            "text": "<s>Hello\n"
        },
        {
            "label": true,
            "text": "hi there!. "
        },
        {
            "label": false,
            "text": "goodbye "
        },
        {
            "label": true,
            "text": "farewell</s>"
        }
    ]
}
```