ChatML Tokenizer for Gemma

This repository provides a fast tokenizer for google/gemma-7b that uses the ChatML format. It was created by replacing the string values of the original tokens with ids 106 (<start_of_turn>) and 107 (<end_of_turn>) with the ChatML tokens <|im_start|> and <|im_end|>.

No new tokens were added in the process, so the original model's embedding matrix does not need to be modified.
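
As a rough illustration of why renaming tokens in place leaves the vocabulary size (and therefore the embedding matrix) unchanged, here is a toy sketch in plain Python. The miniature `vocab` dictionary and the helper `rename_token` are hypothetical stand-ins, not the real Gemma vocabulary or any `transformers` API; only the ids 106 and 107 and the four special-token strings come from the description above.

```python
# Toy illustration: rename token strings while keeping their ids fixed.
# `vocab` is a made-up miniature vocabulary, not the real Gemma one.

def rename_token(vocab: dict[str, int], old: str, new: str) -> dict[str, int]:
    """Return a copy of `vocab` in which `old` is renamed to `new`, keeping its id."""
    if old not in vocab:
        raise KeyError(f"{old!r} is not in the vocabulary")
    if new in vocab:
        raise ValueError(f"{new!r} already exists; renaming would collide")
    return {(new if token == old else token): idx for token, idx in vocab.items()}

vocab = {"<start_of_turn>": 106, "<end_of_turn>": 107, "hello": 1000}

chatml_vocab = rename_token(vocab, "<start_of_turn>", "<|im_start|>")
chatml_vocab = rename_token(chatml_vocab, "<end_of_turn>", "<|im_end|>")

# Same number of entries and same ids: only the string values changed.
print(len(vocab) == len(chatml_vocab))
print(chatml_vocab["<|im_start|>"], chatml_vocab["<|im_end|>"])
```

Because every id still points at the same embedding row, a model that was fine-tuned with this tokenizer sees the ChatML markers wherever the original turn markers used to be.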

Note: This tokenizer is not 100% ChatML compliant, because google/gemma-7b appears to always require the original <bos> token as part of the input. The chat template is therefore <bos> + chatml + <eos>.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("philschmid/gemma-tokenizer-chatml")

messages = [
  {"role": "system", "content": "You are Gemma."},
  {"role": "user", "content": "Hello, how are you?"},
  {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
]

chatml = tokenizer.apply_chat_template(messages, add_generation_prompt=False, tokenize=False)
print(chatml)
# <bos><|im_start|>system
# You are Gemma.<|im_end|>
# <|im_start|>user
# Hello, how are you?<|im_end|>
# <|im_start|>assistant
# I'm doing great. How can I help you today?<|im_end|>\n<eos>
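
The layout of the output above can also be reproduced by hand, which may help when checking that preprocessing code matches the template. The following is a minimal sketch in plain Python, assuming the layout is exactly `<bos>`, then one ChatML turn per message, then `<eos>`; the function `render_chatml` is a hypothetical helper and not the tokenizer's actual template, which may differ in details such as generation prompts.

```python
# Hand-written sketch of the "<bos> + ChatML + <eos>" layout.
# The real template lives in the tokenizer configuration and may differ.

BOS, EOS = "<bos>", "<eos>"
IM_START, IM_END = "<|im_start|>", "<|im_end|>"

def render_chatml(messages: list[dict[str, str]]) -> str:
    """Join messages into ChatML turns, wrapped in <bos> and <eos>."""
    turns = "".join(
        f"{IM_START}{m['role']}\n{m['content']}{IM_END}\n" for m in messages
    )
    return f"{BOS}{turns}{EOS}"

messages = [
    {"role": "system", "content": "You are Gemma."},
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
]

rendered = render_chatml(messages)
print(rendered)
```

Comparing `rendered` with the string returned by `tokenizer.apply_chat_template(...)` is a quick way to spot differences between this assumed layout and the real template.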

Test

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("philschmid/gemma-tokenizer-chatml")
original_tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b-it")

# get special tokens
print(tokenizer.special_tokens_map)
print(original_tokenizer.special_tokens_map)

# check length of vocab
assert len(tokenizer) == len(original_tokenizer), "tokenizers do not have the same vocabulary length"

# tokenize messages 
messages = [
  {"role": "user", "content": "Hello, how are you?"},
  {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
]

chatml = tokenizer.apply_chat_template(messages, add_generation_prompt=False, tokenize=False)
google_format = original_tokenizer.apply_chat_template(messages, add_generation_prompt=False, tokenize=False)

print(f"ChatML: \n{chatml}\n-------------------\nGoogle: \n{google_format}")