Reason behind not using special tokens in the prompt format?

#2 opened by Doctor-Shotgun

Hello, hobbyist model finetuner here. Thanks for sharing your training hyperparameters!

I was just curious whether there was a specific reason for not using dedicated special tokens for the role headers in the prompt format (such as the ones already defined in the Llama 3 tokenizer, i.e. <|start_header_id|>, etc.)?

It appears that the <|system|>, <|user|>, and <|assistant|> headers used in the prompt format are not defined special tokens, so they could in theory be variably tokenized into different combinations of substrings during training/inference.
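For anyone who wants to check this concretely, here's a minimal sketch (assuming the gated meta-llama/Meta-Llama-3-8B tokenizer from transformers; any Llama 3 checkpoint should behave the same) comparing how an unregistered header string and a registered special token get segmented:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

# "<|assistant|>" is not a registered special token, so BPE splits it
# into several sub-tokens (the exact split can vary with surrounding text).
print(tokenizer.tokenize("<|assistant|>"))

# "<|start_header_id|>" IS a defined special token, so it always maps
# to a single, atomic token ID.
print(tokenizer.tokenize("<|start_header_id|>"))
print(len(tokenizer.encode("<|start_header_id|>", add_special_tokens=False)))  # 1
```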

From the paper it seems like some empirical testing was done. Was this also attempted with the tokens above defined as special?

I just found out about this and I'm curious as well.

@Doctor-Shotgun and @sszymczyk -- it's because we hard-set the chat template in open-instruct to be the same for every model. It's not necessarily optimal, but it is a simple approach we've been using for a few years, as the goal of our efforts is to easily translate recipes and code to OLMo.
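For context, hard-setting a template looks roughly like the sketch below: one chat template is assigned to whatever tokenizer is loaded, so every model trains and infers with the same <|system|>/<|user|>/<|assistant|> headers. The Jinja string here only approximates the Tulu-style format; the actual template lives in the open-instruct repo and may differ in details like whitespace and EOS handling.

```python
from transformers import AutoTokenizer

# Approximation of the Tulu-style template; the real one is defined
# in open-instruct and may differ in whitespace and EOS placement.
TULU_STYLE_TEMPLATE = (
    "{% for message in messages %}"
    "{% if message['role'] == 'system' %}<|system|>\n{{ message['content'] }}\n"
    "{% elif message['role'] == 'user' %}<|user|>\n{{ message['content'] }}\n"
    "{% elif message['role'] == 'assistant' %}<|assistant|>\n{{ message['content'] }}{{ eos_token }}\n"
    "{% endif %}{% endfor %}"
)

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
tokenizer.chat_template = TULU_STYLE_TEMPLATE  # same template regardless of base model

messages = [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi there."},
]
print(tokenizer.apply_chat_template(messages, tokenize=False))
```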

natolambert changed discussion status to closed