
Chat Templates

Chat templates are the foundation of instruction tuning - they provide a consistent format for structuring interactions between language models, users, and external tools. Think of them as the “grammar” that teaches models how to understand conversations, distinguish between different speakers, and respond appropriately.

Base Models vs Instruct Models

First, we need to understand the difference between base and instruct models. This is crucial for effective fine-tuning.

Base Model (SmolLM3-3B-Base): Trained on raw text to predict the next token. If you give it “The weather today is”, it might continue with “sunny and warm” or any plausible continuation.

Instruct Model (SmolLM3-3B): Fine-tuned to follow instructions and engage in conversations. If you ask “What’s the weather like?”, it understands this as a question requiring a response as a new message.
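
To make the distinction concrete, here is a minimal sketch that sends a raw prompt to the base checkpoint mentioned above. The base model simply continues the text, and its exact continuation will vary from run to run; the instruct checkpoint, in contrast, expects messages formatted with a chat template, which is what the rest of this page covers.

from transformers import pipeline

# Base model: plain next-token prediction, no chat behavior
base_pipe = pipeline("text-generation", "HuggingFaceTB/SmolLM3-3B-Base", device_map="auto")

# The base model just continues the prompt as raw text
completion = base_pipe("The weather today is", max_new_tokens=20)
print(completion[0]["generated_text"])  # e.g. "The weather today is sunny and warm, ..."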

The Transformation Process

The journey from base to instruct model involves:

  • Chat template: A structured format for interactions between language models, users, and external tools.
  • Supervised fine-tuning: The technique used to train the model to generate appropriate responses.

SmolLM3 uses the ChatML (Chat Markup Language) format, which has become a standard in the industry due to its clarity and flexibility.

In the next chapter, we will go into preference alignment, a technique for fine-tuning a model to generate responses that humans prefer.

Pipeline Usage: Automated Chat Processing

The easiest way to use an open-source large language model is the pipeline abstraction in 🤗 Transformers. It handles chat templates seamlessly, making it easy to use chat models without manual template management. So much so that you won’t even need to know the chat template format.

from transformers import pipeline

# Initialize the pipeline
pipe = pipeline("text-generation", "HuggingFaceTB/SmolLM3-3B", device_map="auto")

# Define your conversation
messages = [
    {"role": "system", "content": "You are a friendly chatbot who always responds in the style of a pirate"},
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
]

# Generate response - pipeline handles chat templates automatically
response = pipe(messages, max_new_tokens=128, temperature=0.7)
print(response[0]['generated_text'][-1])  # Print the assistant's response

Output:

{
    'role': 'assistant', 
    'content': "Matey, I'm afraid I must inform ye that humans cannot eat helicopters. Helicopters are not food, they are flying machines. Food is meant to be eaten, like a hearty plate o' grog, a savory bowl o' stew, or a delicious loaf o' bread. But helicopters, they be for transportin' and movin' around, not for eatin'. So, I'd say none, me hearties. None at all."
}

In this example, the pipeline automatically:

  • Applies the correct chat template, based on the tokenizer configuration in the model’s Hugging Face Hub repository.
  • Handles tokenization and generation automatically.
  • Returns structured output with role information.
  • Manages generation parameters and stopping criteria.

Advanced Pipeline Usage

We can take fine-grained control of the generation process by defining generation parameters in a dictionary and unpacking them as keyword arguments in the pipeline call.

# Configure generation parameters
generation_config = {
    "max_new_tokens": 200,
    "temperature": 0.8,
    "do_sample": True,
    "top_p": 0.9,
    "repetition_penalty": 1.1
}

# Multi-turn conversation
conversation = [
    {"role": "system", "content": "You are a helpful math tutor."},
    {"role": "user", "content": "Can you help me with calculus?"},
]

# Generate first response
response = pipe(conversation, **generation_config)
conversation = response[0]['generated_text']

# Continue the conversation
conversation.append({"role": "user", "content": "What is a derivative?"})
response = pipe(conversation, **generation_config)

print("Final conversation:")
for message in response[0]['generated_text']:
    print(f"{message['role']}: {message['content']}")

Understanding SmolLM3’s Chat Template

Now that we understand basic inference with a chat model, let’s dive into the chat template format. SmolLM3 uses a common chat template that handles multiple conversation types. Let’s examine how it works:

If you want to explore chat templates hands-on, you can try out the chat template playground.

ChatML Format Structure

SmolLM3 uses the ChatML format with special tokens that clearly delineate different parts of the conversation. For example, the system message is marked with <|im_start|>system and <|im_end|>.

<|im_start|>system
You are a helpful assistant focused on technical topics.<|im_end|>
<|im_start|>user
Hi there!<|im_end|>
<|im_start|>assistant
Nice to meet you!<|im_end|>
<|im_start|>user
Can I ask a question?<|im_end|>
<|im_start|>assistant

Key Components:

  • <|im_start|> and <|im_end|>: Special tokens that mark the beginning and end of each message
  • Roles: system, user, assistant (and tool for function calling)
  • Content: The actual message text between the role declaration and <|im_end|>

Dual-Mode Reasoning Support

SmolLM3 belongs to a new category of models that can reason on demand. This capability is enabled through special formatting and a reasoning-mode flag: when thinking mode is enabled, the model exposes its reasoning process inside dedicated thinking tokens before giving its final answer.

Standard Mode (no_think):

<|im_start|>user
What is 15 × 24?<|im_end|>
<|im_start|>assistant
15 × 24 = 360<|im_end|>

Thinking Mode (think):

<|im_start|>user
What is 15 × 24?<|im_end|>
<|im_start|>assistant
<|thinking|>
I need to multiply 15 by 24. Let me break this down:
15 × 24 = 15 × (20 + 4) = (15 × 20) + (15 × 4) = 300 + 60 = 360
</|thinking|>

15 × 24 = 360<|im_end|>

This dual-mode capability lets SmolLM3 show its reasoning process when needed, making it well suited to both complex tasks that benefit from explicit reasoning and simple tasks that only need a direct answer.

Working with SmolLM3 Chat Templates in Code

The transformers library automatically handles chat template formatting through the tokenizer. This means you only need to structure your messages correctly, and the library takes care of the special token formatting. Here’s how to work with SmolLM3’s chat template:

from transformers import AutoTokenizer

# Load SmolLM3's tokenizer
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM3-3B")

# Structure your conversation as a list of message dictionaries
messages = [
    {"role": "system", "content": "You are a helpful assistant focused on technical topics."},
    {"role": "user", "content": "Can you explain what a chat template is?"},
    {"role": "assistant", "content": "A chat template structures conversations between users and AI models by providing a consistent format that helps the model understand different roles and maintain context."}
]

# Apply the chat template
formatted_chat = tokenizer.apply_chat_template(
    messages,
    tokenize=False,  # Return string instead of tokens
    add_generation_prompt=True  # Add prompt for next assistant response
)

print(formatted_chat)

Output:

<|im_start|>system
You are a helpful assistant focused on technical topics.<|im_end|>
<|im_start|>user
Can you explain what a chat template is?<|im_end|>
<|im_start|>assistant
A chat template structures conversations between users and AI models by providing a consistent format that helps the model understand different roles and maintain context.<|im_end|>
<|im_start|>assistant

Understanding the Message Structure

Each message in the conversation follows a simple dictionary format:

  • role: Identifies who is speaking (system, user, assistant, or tool).
  • content: The actual message content.

Message Types:

  1. System Messages: Set behavior and context for the entire conversation
  2. User Messages: Questions, requests, or statements from the human user
  3. Assistant Messages: Responses from the AI model
  4. Tool Messages: Results from function calls (for advanced use cases)

System Messages: Setting the Context

System messages are crucial for controlling SmolLM3’s behavior. They act as persistent instructions that influence all subsequent interactions. To create a system message, you can use the system role and the content key:

# Professional assistant
system_message = {
    "role": "system",
    "content": "You are a professional customer service agent. Always be polite, clear, and helpful."
}

# Technical expert
system_message = {
    "role": "system",
    "content": "You are a senior software engineer. Provide detailed technical explanations with code examples when appropriate."
}

# Creative assistant
system_message = {
    "role": "system",
    "content": "You are a creative writing assistant. Help users craft engaging stories and provide constructive feedback."
}

System messages have a significant impact on the model’s behavior. As the first message in the conversation, they set the tone for everything that follows. Effective system messages are specific, set boundaries, provide context, and include examples where helpful.

Multi-Turn Conversations

SmolLM3 can maintain context across multiple conversation turns. Each message builds upon the previous context. For example, the following code creates a conversation with a helpful programming tutor:

conversation = [
    {"role": "system", "content": "You are a helpful programming tutor."},
    {"role": "user", "content": "I'm learning Python. Can you explain functions?"},
    {"role": "assistant", "content": "Functions in Python are reusable blocks of code that perform specific tasks. They're defined using the 'def' keyword."},
    {"role": "user", "content": "Can you show me an example?"},
    {"role": "assistant", "content": "Sure! Here's a simple function:\n\n```python\ndef greet(name):\n    return f'Hello, {name}!'\n\nresult = greet('Alice')\nprint(result)  # Output: Hello, Alice!\n```"},
    {"role": "user", "content": "How do I make it return multiple values?"},
]

Generation Prompts: Controlling Model Behavior

One of the most important concepts in chat templates is the generation prompt. This tells the model when it should start generating a response versus continuing existing text.

Understanding add_generation_prompt

The add_generation_prompt parameter controls whether the template adds tokens that indicate the start of a bot response:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM3-3B")

messages = [
    {"role": "user", "content": "Hi there!"},
    {"role": "assistant", "content": "Nice to meet you!"},
    {"role": "user", "content": "Can I ask a question?"}
]

# Without generation prompt - for completed conversations
formatted_without = tokenizer.apply_chat_template(
    messages, 
    tokenize=False, 
    add_generation_prompt=False
)

print("Without generation prompt:")
print(formatted_without)
print("\n" + "="*50 + "\n")

# With generation prompt - for inference
formatted_with = tokenizer.apply_chat_template(
    messages, 
    tokenize=False, 
    add_generation_prompt=True
)

print("With generation prompt:")
print(formatted_with)

Output:

Without generation prompt:
<|im_start|>user
Hi there!<|im_end|>
<|im_start|>assistant
Nice to meet you!<|im_end|>
<|im_start|>user
Can I ask a question?<|im_end|>

==================================================

With generation prompt:
<|im_start|>user
Hi there!<|im_end|>
<|im_start|>assistant
Nice to meet you!<|im_end|>
<|im_start|>user
Can I ask a question?<|im_end|>
<|im_start|>assistant

The generation prompt ensures that when the model generates text, it will write a bot response instead of doing something unexpected like continuing the user’s message.

When to Use Generation Prompts

  • For inference: Use add_generation_prompt=True when you want the model to generate a response.
  • For training: Use add_generation_prompt=False when preparing training data with complete conversations.
  • For evaluation: Use add_generation_prompt=True to test model responses.

Continuing Final Messages: Advanced Response Control

The continue_final_message parameter allows you to make the model continue the last message in a conversation instead of starting a new one. This is particularly useful for “prefilling” responses or ensuring specific output formats.

Basic Example

# Prefill a JSON response
chat = [
    {"role": "user", "content": "Can you format the answer in JSON?"},
    {"role": "assistant", "content": '{"name": "'},
]

# Continue the final message
formatted_chat = tokenizer.apply_chat_template(
    chat, 
    tokenize=False, 
    continue_final_message=True
)

print("Continuing final message:")
print(formatted_chat)
print("\n" + "="*50 + "\n")

# Compare with starting a new message
formatted_new = tokenizer.apply_chat_template(
    chat, 
    tokenize=False,
    add_generation_prompt=True
)

print("Starting new message:")
print(formatted_new)

Output:

Continuing final message:
<|im_start|>user
Can you format the answer in JSON?<|im_end|>
<|im_start|>assistant
{"name": "

==================================================

Starting new message:
<|im_start|>user
Can you format the answer in JSON?<|im_end|>
<|im_start|>assistant
{"name": "<|im_end|>
<|im_start|>assistant

Practical Applications

1. Structured Output Generation:

# Force the model to complete a specific format
messages = [
    {"role": "system", "content": "You are a helpful assistant that always responds in JSON format."},
    {"role": "user", "content": "What's the capital of France?"},
    {"role": "assistant", "content": '{\n  "question": "What\'s the capital of France?",\n  "answer": "'}
]

# The model will continue with just the answer, maintaining JSON structure

2. Code Completion:

# Guide the model to complete code
messages = [
    {"role": "user", "content": "Write a Python function to calculate factorial"},
    {"role": "assistant", "content": "def factorial(n):\n    if n == 0:\n        return 1\n    else:\n        return n * "}
]

# Model will complete the recursive call

3. Step-by-Step Reasoning:

# Guide the model through structured thinking
messages = [
    {"role": "user", "content": "Solve: 2x + 5 = 13"},
    {"role": "assistant", "content": "Let me solve this step by step:\n\nStep 1: "}
]

# Model will continue with the first step

Important Notes

  • You cannot use add_generation_prompt=True and continue_final_message=True together
  • The final message must have the “assistant” role when using continue_final_message=True
  • This feature removes end-of-sequence tokens from the final message
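
To run a prefilled conversation end to end, here is a hedged sketch that tokenizes the chat with continue_final_message=True and lets the model complete the open assistant message with generate. It assumes you can load the 3B checkpoint locally; the generation settings are illustrative.

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM3-3B")
model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM3-3B", device_map="auto")

chat = [
    {"role": "user", "content": "Can you format the answer in JSON?"},
    {"role": "assistant", "content": '{"name": "'},
]

# Tokenize the conversation, leaving the final assistant message open
input_ids = tokenizer.apply_chat_template(
    chat,
    continue_final_message=True,
    return_tensors="pt",
).to(model.device)

# The model continues the prefilled message instead of starting a new one
output_ids = model.generate(input_ids, max_new_tokens=64)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))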

Working with Reasoning Mode

SmolLM3’s dual-mode reasoning can be controlled through special formatting:

Standard vs Thinking Mode

# Standard mode - direct answer
standard_messages = [
    {"role": "user", "content": "What is 15 × 24?"},
    {"role": "assistant", "content": "15 × 24 = 360"}
]

# Thinking mode - show reasoning process
thinking_messages = [
    {"role": "user", "content": "What is 15 × 24?"},
    {"role": "assistant", "content": "<|thinking|>\nI need to multiply 15 by 24. Let me break this down:\n15 × 24 = 15 × (20 + 4) = (15 × 20) + (15 × 4) = 300 + 60 = 360\n</|thinking|>\n\n15 × 24 = 360"}
]

# Apply templates
standard_formatted = tokenizer.apply_chat_template(standard_messages, tokenize=False)
thinking_formatted = tokenizer.apply_chat_template(thinking_messages, tokenize=False)

print("Standard mode:")
print(standard_formatted)
print("\nThinking mode:")
print(thinking_formatted)
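
The reasoning mode can also be switched at inference time. The rendered system header later in this page includes a "Reasoning Mode: /think" line; the sketch below assumes a matching /no_think flag can be placed in the system prompt to disable thinking, so treat the exact flag names as an assumption and check the model card for the supported mechanism.

from transformers import pipeline

pipe = pipeline("text-generation", "HuggingFaceTB/SmolLM3-3B", device_map="auto")

# Thinking disabled (assumed /no_think flag in the system prompt)
direct_messages = [
    {"role": "system", "content": "You are a concise assistant. /no_think"},
    {"role": "user", "content": "What is 15 × 24?"},
]

# Thinking enabled (assumed /think flag in the system prompt)
reasoning_messages = [
    {"role": "system", "content": "You are a careful assistant. /think"},
    {"role": "user", "content": "What is 15 × 24?"},
]

print(pipe(direct_messages, max_new_tokens=64)[0]["generated_text"][-1]["content"])
print(pipe(reasoning_messages, max_new_tokens=256)[0]["generated_text"][-1]["content"])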

Training with Thinking Mode

When preparing datasets with thinking mode, you can control whether to include the reasoning:

def create_thinking_example(question, answer, reasoning=None):
    """Create a training example with optional thinking"""
    if reasoning:
        assistant_content = f"<|thinking|>\n{reasoning}\n</|thinking|>\n\n{answer}"
    else:
        assistant_content = answer
    
    return [
        {"role": "user", "content": question},
        {"role": "assistant", "content": assistant_content}
    ]

# Example usage
math_example = create_thinking_example(
    question="What is the derivative of x²?",
    answer="The derivative of x² is 2x",
    reasoning="Using the power rule: d/dx(x^n) = n·x^(n-1)\nFor x²: n=2, so d/dx(x²) = 2·x^(2-1) = 2x"
)

Tool Usage and Function Calling

Modern chat templates support tool usage and function calling. Here’s how to work with tools in SmolLM3:

Defining Tools

# Define available tools
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city and state, e.g. San Francisco, CA"
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "The temperature unit"
                    }
                },
                "required": ["location"]
            }
        }
    },
    {
        "type": "function", 
        "function": {
            "name": "calculate",
            "description": "Perform mathematical calculations",
            "parameters": {
                "type": "object",
                "properties": {
                    "expression": {
                        "type": "string",
                        "description": "Mathematical expression to evaluate"
                    }
                },
                "required": ["expression"]
            }
        }
    }
]

Chat Templates with Tools

# Conversation with tool usage
messages = [
    {"role": "system", "content": "You are a helpful assistant with access to tools."},
    {"role": "user", "content": "What's the weather like in Paris?"},
    {
        "role": "assistant", 
        "content": "I'll check the weather in Paris for you.",
        "tool_calls": [
            {
                "id": "call_1",
                "type": "function",
                "function": {
                    "name": "get_weather",
                    "arguments": '{"location": "Paris, France", "unit": "celsius"}'
                }
            }
        ]
    },
    {
        "role": "tool",
        "tool_call_id": "call_1", 
        "content": '{"temperature": 22, "condition": "sunny", "humidity": 60}'
    },
    {
        "role": "assistant",
        "content": "The weather in Paris is currently sunny with a temperature of 22°C and 60% humidity. It's a beautiful day!"
    }
]

# Apply chat template with tools
formatted_with_tools = tokenizer.apply_chat_template(
    messages,
    tools=tools,
    tokenize=False,
    add_generation_prompt=False
)

print("Chat template with tools:")
print(formatted_with_tools)

The output of the chat template with tools is:

Chat template with tools:
<|im_start|>system
## Metadata

Knowledge Cutoff Date: June 2025
Today Date: 01 September 2025
Reasoning Mode: /think

## Custom Instructions

You are a helpful assistant with access to tools.

### Tools

You may call one or more functions to assist with the user query.
You are provided with function signatures within <tools></tools> XML tags:

<tools>
{'type': 'function', 'function': {'name': 'get_weather', 'description': 'Get the current weather for a location', 'parameters': {'type': 'object', 'properties': {'location': {'type': 'string', 'description': 'The city and state, e.g. San Francisco, CA'}, 'unit': {'type': 'string', 'enum': ['celsius', 'fahrenheit'], 'description': 'The temperature unit'}}, 'required': ['location']}}}
{'type': 'function', 'function': {'name': 'calculate', 'description': 'Perform mathematical calculations', 'parameters': {'type': 'object', 'properties': {'expression': {'type': 'string', 'description': 'Mathematical expression to evaluate'}}, 'required': ['expression']}}}
</tools>

For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
...
{"temperature": 22, "condition": "sunny", "humidity": 60}<|im_end|>
<|im_start|>assistant
The weather in Paris is currently sunny with a temperature of 22°C and 60% humidity. It's a beautiful day!<|im_end|>

Training with Tool Usage

def format_tool_dataset(examples):
    """Format dataset with tool usage for training"""
    formatted_texts = []
    
    for messages, tools in zip(examples["messages"], examples.get("tools", [None] * len(examples["messages"]))):
        formatted_text = tokenizer.apply_chat_template(
            messages,
            tools=tools,
            tokenize=False,
            add_generation_prompt=False
        )
        formatted_texts.append(formatted_text)
    
    return {"text": formatted_texts}
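
As a usage sketch, assuming a 🤗 Datasets dataset with messages and optional tools columns, the formatting function above can be applied in batched mode. The dataset name below is purely illustrative.

from datasets import load_dataset

# Hypothetical dataset with "messages" and "tools" columns (name is illustrative)
dataset = load_dataset("your-username/tool-usage-dataset", split="train")

# Batched mapping produces a "text" column of fully rendered ChatML strings
formatted_dataset = dataset.map(format_tool_dataset, batched=True)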

Advanced Template Customization

For advanced use cases, you might need to customize or understand chat templates more deeply:

Inspecting a Model’s Chat Template

# View the actual template
print("SmolLM3 Chat Template:")
print(tokenizer.chat_template)

# See what special tokens are used
print("\nSpecial tokens:")
print(f"BOS: {tokenizer.bos_token}")
print(f"EOS: {tokenizer.eos_token}")
print(f"UNK: {tokenizer.unk_token}")
print(f"PAD: {tokenizer.pad_token}")

# Check for custom tokens
special_tokens = tokenizer.special_tokens_map
for name, token in special_tokens.items():
    print(f"{name}: {token}")

Custom Template Creation

# Create a custom template (advanced users only)
custom_template = """
{%- for message in messages %}
    {%- if message['role'] == 'system' %}
        {%- set system_message = message['content'] %}
    {%- endif %}
{%- endfor %}
{%- if system_message is defined %}
<|system|>{{ system_message }}<|end|>
{%- endif %}
{%- for message in messages %}
    {%- if message['role'] == 'user' %}
<|user|>{{ message['content'] }}<|end|>
    {%- elif message['role'] == 'assistant' %}
<|assistant|>{{ message['content'] }}<|end|>
    {%- endif %}
{%- endfor %}
{%- if add_generation_prompt %}
<|assistant|>
{%- endif %}
"""

# Apply custom template (be very careful with this!)
# tokenizer.chat_template = custom_template

Template Debugging

def debug_chat_template(messages, tokenizer):
    """Debug chat template application"""
    
    # Apply template
    formatted = tokenizer.apply_chat_template(
        messages, 
        tokenize=False, 
        add_generation_prompt=True
    )
    
    # Tokenize and decode to see actual tokens
    tokens = tokenizer(formatted, return_tensors="pt")
    
    print("=== TEMPLATE DEBUG ===")
    print(f"Input messages: {len(messages)}")
    print(f"Formatted length: {len(formatted)} chars")
    print(f"Token count: {tokens['input_ids'].shape[1]}")
    print("\nFormatted text:")
    print(repr(formatted))  # Shows escape characters
    print("\nTokens:")
    print(tokens['input_ids'][0].tolist()[:20], "...")  # First 20 tokens
    print("\nDecoded tokens:")
    for i, token_id in enumerate(tokens['input_ids'][0][:20]):
        token = tokenizer.decode([token_id])
        print(f"{i:2d}: {token_id:5d} -> {repr(token)}")

# Example usage
debug_messages = [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi there!"}
]

debug_chat_template(debug_messages, tokenizer)

Key Takeaways

Understanding chat templates is crucial for effective instruction tuning. Here are the essential points to remember:

Core Concepts

  1. Template Consistency: Always use the same template format for training and inference - mismatches can significantly hurt performance
  2. Generation Prompts: Use add_generation_prompt=True for inference, False for training data preparation
  3. Role Structure: Clear role definitions (system, user, assistant, tool) help models understand conversation flow
  4. Context Management: Leverage SmolLM3’s extended context window efficiently by managing conversation history
  5. Special Token Handling: Let templates handle special tokens - avoid adding them manually

Advanced Features

  1. Dual-Mode Reasoning: Use <|thinking|> tags for complex problems requiring step-by-step reasoning
  2. Message Continuation: Use continue_final_message=True for structured output and prefilling responses
  3. Tool Integration: Modern templates support function calling and tool usage for enhanced capabilities
  4. Pipeline Automation: Text generation pipelines handle templates automatically for production use
  5. Multi-Dataset Training: Standardize different dataset formats before combining for training

Training Best Practices

  1. Dataset Preparation: Apply templates with add_generation_prompt=False and add_special_tokens=False for training (see the sketch after this list)
  2. Quality Control: Debug templates thoroughly to ensure proper formatting
  3. Performance Monitoring: Incorrect template usage can significantly impact model performance
  4. Multimodal Support: Templates extend to vision and audio models with appropriate modifications
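
As a minimal sketch of the first point, here is one way to render a completed conversation for training and tokenize it without duplicating special tokens, reusing the tokenizer and multi-turn conversation defined earlier on this page. The surrounding training setup (labels, packing, trainer) is out of scope here.

# Render the finished conversation without a generation prompt
training_text = tokenizer.apply_chat_template(
    conversation,
    tokenize=False,
    add_generation_prompt=False,  # the conversation is already complete
)

# The template already emits special tokens such as <|im_start|>,
# so avoid adding them a second time when tokenizing
encoded = tokenizer(training_text, add_special_tokens=False)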

Common Pitfalls to Avoid

  • Template mismatch: Using a different template than the model was trained on.
  • Double special tokens: Adding special tokens when the template already includes them.
  • Missing system messages: Not providing enough context for consistent model behavior.
  • Inconsistent formatting: Mixing different conversation formats in the same dataset.
  • Wrong generation prompts: Using incorrect add_generation_prompt settings for your use case.
  • Ignoring tool syntax: Not properly formatting tool calls and responses.
  • Context overflow: Not managing long conversations within token limits.

Production Considerations

  • Pipeline usage: Use automated pipelines for consistent template application in production.
  • Error handling: Implement validation for message formats and role sequences.
  • Performance optimization: Cache formatted templates when possible for repeated use.
  • Monitoring: Track template application success rates and formatting consistency.
  • Version control: Maintain template versions alongside model versions for reproducibility.

Beyond Basic Templates: Advanced Topics

This guide covered the fundamentals, but chat templates support many advanced features:

  • Multimodal templates: Handling images, audio, and video in conversations.
  • Document integration: Including external documents and knowledge bases.
  • Custom template creation: Building specialized templates for domain-specific applications.
  • Template optimization: Performance tuning for high-throughput applications.

For these advanced topics, refer to the chat templating documentation in the 🤗 Transformers library.

Next Steps

Now that you have a comprehensive understanding of chat templates, you’re ready to learn about supervised fine-tuning, where we’ll use these templates to train SmolLM3 on custom datasets.

Next: Supervised Fine-Tuning
