How to structure the dataset for fine-tuning?
How should I structure the dataset to use Qwen2.5-Coder as the pretrained model and then fine-tune it for my specific use case? What would the JSONL file look like, and what columns would it have, e.g. "input", "output", or something else?
I am using this snippet with DataCollatorForCompletionOnlyLM and SFTTrainer from the Python trl package for supervised fine-tuning of Qwen2.5-Instruct models; it should work with the Coder ones as well.
from trl import DataCollatorForCompletionOnlyLM, SFTConfig, SFTTrainer

def formatting_prompts_func(batch):
    output_texts = []
    for i in range(len(batch['id'])):
        res = batch[res_key][i]  # provided via args to my script; could be anything, basically representing the assistant response
        text = f'''<|im_start|>system
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>
<|im_start|>user
{batch['prompt_messages'][i][0]['content']}<|im_end|>
<|im_start|>assistant
{res}<|im_end|>
'''
        output_texts.append(text)
    return output_texts

# Automatically masks the assistant response in input_ids during evaluation, and all tokens
# except the assistant response in the labels, so training only happens on the assistant completion
collator = DataCollatorForCompletionOnlyLM('<|im_start|>assistant', tokenizer=tokenizer)

trainer = SFTTrainer(
    model_init=model_init,
    args=SFTConfig(**train_args.to_dict(), max_seq_length=7000),
    train_dataset=train_set,
    eval_dataset=test_set,
    processing_class=tokenizer,
    formatting_func=formatting_prompts_func,
    data_collator=collator,
)
As you can see, your dataset file can have any structure, since you can provide a formatting function for preprocessing, which is an easy way to prepare SFT datasets.
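For example, assuming the response column is called "response" (that would be res_key in my snippet; the column names are just an illustration, not a requirement), one line of the JSONL file matching the formatting function above could look like this:

{"id": "0001", "prompt_messages": [{"role": "user", "content": "Write a Python function that reverses a string."}], "response": "def reverse(s):\n    return s[::-1]"}

Each line is one sample; the formatting function turns it into the chat-template string with the system, user and assistant blocks.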
Most trainer implementations / models use "input_ids" and "labels" fields, which are filled during preprocessing of the dataset and contain the token IDs.
So you could also use the tokenizer to create those yourself. Depending on the use case, you will mask certain parts of the sequences to ignore them during loss calculation (for example the prompt tokens in the labels), or mask the assistant response in the input_ids during evaluation. Therefore you have to preprocess the train and eval sets separately.
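A rough sketch of that manual route for the train set (the "prompt" and "response" column names are only illustrative, and -100 is the ignore index used by the loss in transformers):

# Rough sketch: build input_ids/labels yourself and mask the prompt tokens with -100
def tokenize_and_mask(example, max_length=7000):
    prompt_ids = tokenizer(example['prompt'], add_special_tokens=False)['input_ids']
    response_ids = tokenizer(example['response'] + tokenizer.eos_token, add_special_tokens=False)['input_ids']
    input_ids = (prompt_ids + response_ids)[:max_length]
    # -100 marks tokens the loss calculation ignores, here the prompt part
    labels = ([-100] * len(prompt_ids) + response_ids)[:max_length]
    return {'input_ids': input_ids, 'labels': labels, 'attention_mask': [1] * len(input_ids)}

train_set = train_set.map(tokenize_and_mask, remove_columns=train_set.column_names)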
Another important part is max_seq_length: you can tokenize each sample of your dataset (including the assistant response) to dynamically get the maximum length, which will then be used for padding.
If this number is too low, some sequences could get cut off; if it is too big, you waste resources.
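For example, with the formatting function from above you could determine that number like this (assuming the dataset fits in memory; train_set[:] returns the whole set as one batch):

# Tokenize every formatted training sample once to find the longest sequence
formatted = formatting_prompts_func(train_set[:])
lengths = [len(tokenizer(t)['input_ids']) for t in formatted]
print(max(lengths))  # use this value for max_seq_length instead of guessing 7000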
If you don't want to use basic supervised fine-tuning, you might have to dig deeper, since the preprocessing might differ.
Thank you for this detailed overview 😊