Hands-On Exercises: Fine-Tuning SmolLM3

Welcome to the practical section! Here you’ll apply everything you’ve learned about chat templates and supervised fine-tuning using SmolLM3. These exercises progress from basic concepts to advanced techniques, giving you real-world experience with instruction tuning.

Learning Objectives

By completing these exercises, you will:

  • Master SmolLM3’s chat template system
  • Fine-tune SmolLM3 on real datasets using both Python APIs and CLI tools
  • Work with the SmolTalk2 dataset that was used to train the original model
  • Compare base model vs fine-tuned model performance
  • Deploy your models to Hugging Face Hub
  • Understand production workflows for scaling fine-tuning

Exercise 1: Exploring SmolLM3’s Chat Templates

Objective: Understand how SmolLM3 handles different conversation formats and reasoning modes.

SmolLM3 is a hybrid reasoning model: it can either follow instructions directly or generate tokens that ‘reason’ through a complex problem. When post-trained effectively, the model reasons on hard problems and gives direct responses on easy ones.

Environment Setup

  • You need a GPU with at least 8GB VRAM for training. CPU/MPS can run formatting and dataset exploration, but training larger models will likely fail.
  • First run will download several GB of model weights; ensure 15GB+ free disk and a stable connection.
  • If you need access to private repos, authenticate with Hugging Face Hub via login().

Let’s start by setting up our environment.

# Install required packages (run in Colab or your environment)
pip install "transformers>=4.36.0" "trl>=0.7.0" "datasets>=2.14.0" "torch>=2.0.0"
pip install "accelerate>=0.24.0" "peft>=0.7.0" "trackio"

Then, let’s import the necessary libraries and set up the accelerator device. Below we detect whether we’re using an NVIDIA GPU, an Apple Metal (MPS) accelerator, or the CPU. In practice we can’t train these models on a CPU, so we’ll rely on an accelerator.

# Import necessary libraries
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from datasets import load_dataset
import json
from typing import Optional, Dict, Any

if torch.cuda.is_available():
    device = "cuda"
    print(f"Using CUDA GPU: {torch.cuda.get_device_name()}")
    print(f"GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f}GB")
elif hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
    device = "mps"
    print("Using Apple MPS")
else:
    device = "cpu"
    print("Using CPU - you will need to use a GPU to train models")

# Authenticate with Hugging Face (optional, for private models)
from huggingface_hub import login
# login()  # Uncomment if you need to access private models

Make a note of the device you’re using and your available GPU memory. If it is below 8GB, you will not be able to complete some of the exercises.

Output
Using CUDA GPU: NVIDIA A100-SXM4-40GB
GPU memory: 42.5GB

Load SmolLM3 Models

Now let’s load the base and instruct models for comparison.

# Load both base and instruct models for comparison
base_model_name = "HuggingFaceTB/SmolLM3-3B-Base"
instruct_model_name = "HuggingFaceTB/SmolLM3-3B"

# Load tokenizers
base_tokenizer = AutoTokenizer.from_pretrained(base_model_name)
instruct_tokenizer = AutoTokenizer.from_pretrained(instruct_model_name)

# Load models (use smaller precision for memory efficiency)
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    dtype=torch.bfloat16,
    device_map="auto"
)

instruct_model = AutoModelForCausalLM.from_pretrained(
    instruct_model_name,
    dtype=torch.bfloat16,
    device_map="auto"
)

print("Models loaded successfully!")

This will download the models and tokenizers to your local machine from the Hugging Face Hub. This includes the model’s parameter weights, tokenizer, and other model configuration defined by the model authors.

Output

You should see green bars loading the model weights. This may take a few minutes.

tokenizer_config.json: 
 50.4k/? [00:00<00:00, 5.09MB/s]
tokenizer.json: 100%
 17.2M/17.2M [00:02<00:00, 10.7MB/s]
special_tokens_map.json: 100%
 151/151 [00:00<00:00, 21.5kB/s]
tokenizer_config.json: 
 50.4k/? [00:00<00:00, 5.45MB/s]
tokenizer.json: 100%
 17.2M/17.2M [00:00<00:00, 472kB/s]
special_tokens_map.json: 100%
 289/289 [00:00<00:00, 35.0kB/s]
chat_template.jinja: 
 5.60k/? [00:00<00:00, 577kB/s]
config.json: 100%
 943/943 [00:00<00:00, 121kB/s]
model.safetensors.index.json: 
 26.9k/? [00:00<00:00, 2.81MB/s]
Fetching 2 files: 100%
 2/2 [00:32<00:00, 32.11s/it]
model-00001-of-00002.safetensors: 100%
 4.97G/4.97G [00:31<00:00, 247MB/s]
model-00002-of-00002.safetensors: 100%
 1.18G/1.18G [00:17<00:00, 57.2MB/s]
Loading checkpoint shards: 100%
 2/2 [00:01<00:00,  1.18it/s]
generation_config.json: 100%
 126/126 [00:00<00:00, 17.1kB/s]
config.json: 
 1.92k/? [00:00<00:00, 229kB/s]
model.safetensors.index.json: 
 26.9k/? [00:00<00:00, 3.14MB/s]
Fetching 2 files: 100%
 2/2 [00:32<00:00, 32.38s/it]
model-00002-of-00002.safetensors: 100%
 1.18G/1.18G [00:17<00:00, 92.1MB/s]
model-00001-of-00002.safetensors: 100%
 4.97G/4.97G [00:31<00:00, 182MB/s]
Loading checkpoint shards: 100%
 2/2 [00:01<00:00,  1.14it/s]
generation_config.json: 100%
 182/182 [00:00<00:00, 21.0kB/s]
Models loaded successfully!
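Optionally, you can check how much disk space the cached downloads occupy using the cache scanner that ships with huggingface_hub. A short optional sketch:

# Optional: inspect the local Hugging Face cache
from huggingface_hub import scan_cache_dir

cache_info = scan_cache_dir()
print(f"Total cache size: {cache_info.size_on_disk / 1e9:.2f} GB")
for repo in sorted(cache_info.repos, key=lambda r: r.size_on_disk, reverse=True):
    print(f"{repo.repo_id}: {repo.size_on_disk / 1e9:.2f} GB")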

Explore Chat Template Formatting

Now let’s explore the chat template formatting. We will create different types of conversations to test.

# Create different types of conversations to test
conversations = {
    "simple_qa": [
        {"role": "user", "content": "What is machine learning?"},
    ],
    
    "with_system": [
        {"role": "system", "content": "You are a helpful AI assistant specialized in explaining technical concepts clearly."},
        {"role": "user", "content": "What is machine learning?"},
    ],
    
    "multi_turn": [
        {"role": "system", "content": "You are a math tutor."},
        {"role": "user", "content": "What is calculus?"},
        {"role": "assistant", "content": "Calculus is a branch of mathematics that deals with rates of change and accumulation of quantities."},
        {"role": "user", "content": "Can you give me a simple example?"},
    ],
    
    "reasoning_task": [
        {"role": "user", "content": "Solve step by step: If a train travels 120 miles in 2 hours, what is its average speed?"},
    ]
}

for conv_type, messages in conversations.items():
    print(f"--- {conv_type.upper()} ---")
    
    # Format without generation prompt (for completed conversations)
    formatted_complete = instruct_tokenizer.apply_chat_template(
        messages, 
        tokenize=False, 
        add_generation_prompt=False
    )
    
    # Format with generation prompt (for inference)
    formatted_prompt = instruct_tokenizer.apply_chat_template(
        messages, 
        tokenize=False, 
        add_generation_prompt=True
    )
    
    print("Complete conversation format:")
    print(formatted_complete)
    print("\nWith generation prompt:")
    print(formatted_prompt)
    print("\n" + "="*50 + "\n")
Output
--- SIMPLE_QA ---
Complete conversation format:
<|im_start|>system
## Metadata

Knowledge Cutoff Date: June 2025
Today Date: 03 September 2025
Reasoning Mode: /no_think

## Custom Instructions

You are a helpful AI assistant named SmolLM, trained by Hugging Face.

<|im_start|>user
What is machine learning?<|im_end|>


With generation prompt:
<|im_start|>system
## Metadata

Knowledge Cutoff Date: June 2025
Today Date: 03 September 2025
Reasoning Mode: /no_think

## Custom Instructions

You are a helpful AI assistant named SmolLM, trained by Hugging Face.

<|im_start|>user
What is machine learning?<|im_end|>
<|im_start|>assistant
<think>

</think>


==================================================

--- WITH_SYSTEM ---
Complete conversation format:
<|im_start|>system
## Metadata

Knowledge Cutoff Date: June 2025
Today Date: 03 September 2025
Reasoning Mode: /no_think

## Custom Instructions

You are a helpful AI assistant specialized in explaining technical concepts clearly.

<|im_start|>user
What is machine learning?<|im_end|>


With generation prompt:
<|im_start|>system
## Metadata

Knowledge Cutoff Date: June 2025
Today Date: 03 September 2025
Reasoning Mode: /no_think

## Custom Instructions

You are a helpful AI assistant specialized in explaining technical concepts clearly.

<|im_start|>user
What is machine learning?<|im_end|>
<|im_start|>assistant
<think>

</think>


==================================================

--- MULTI_TURN ---
Complete conversation format:
<|im_start|>system
## Metadata

Knowledge Cutoff Date: June 2025
Today Date: 03 September 2025
Reasoning Mode: /no_think

## Custom Instructions

You are a math tutor.

<|im_start|>user
What is calculus?<|im_end|>
<|im_start|>assistant
<think>

</think>
Calculus is a branch of mathematics that deals with rates of change and accumulation of quantities.<|im_end|>
<|im_start|>user
Can you give me a simple example?<|im_end|>


With generation prompt:
<|im_start|>system
## Metadata

Knowledge Cutoff Date: June 2025
Today Date: 03 September 2025
Reasoning Mode: /no_think

## Custom Instructions

You are a math tutor.

<|im_start|>user
What is calculus?<|im_end|>
<|im_start|>assistant
<think>

</think>
Calculus is a branch of mathematics that deals with rates of change and accumulation of quantities.<|im_end|>
<|im_start|>user
Can you give me a simple example?<|im_end|>
<|im_start|>assistant
<think>

</think>


==================================================

--- REASONING_TASK ---
Complete conversation format:
<|im_start|>system
## Metadata

Knowledge Cutoff Date: June 2025
Today Date: 03 September 2025
Reasoning Mode: /think

## Custom Instructions

You are a helpful AI assistant named SmolLM, trained by Hugging Face. Your role as an assistant involves thoroughly exploring questions through a systematic thinking process before providing the final precise and accurate solutions. This requires engaging in a comprehensive cycle of analysis, summarizing, exploration, reassessment, reflection, backtracking, and iteration to develop well-considered thinking process. Please structure your response into two main sections: Thought and Solution using the specified format: <think> Thought section </think> Solution section. In the Thought section, detail your reasoning process in steps. Each step should include detailed considerations such as analysing questions, summarizing relevant findings, brainstorming new ideas, verifying the accuracy of the current steps, refining any errors, and revisiting previous steps. In the Solution section, based on various attempts, explorations, and reflections from the Thought section, systematically present the final solution that you deem correct. The Solution section should be logical, accurate, and concise and detail necessary steps needed to reach the conclusion.

<|im_start|>user
Solve step by step: If a train travels 120 miles in 2 hours, what is its average speed?<|im_end|>


With generation prompt:
<|im_start|>system
## Metadata

Knowledge Cutoff Date: June 2025
Today Date: 03 September 2025
Reasoning Mode: /think

## Custom Instructions

You are a helpful AI assistant named SmolLM, trained by Hugging Face. Your role as an assistant involves thoroughly exploring questions through a systematic thinking process before providing the final precise and accurate solutions. This requires engaging in a comprehensive cycle of analysis, summarizing, exploration, reassessment, reflection, backtracking, and iteration to develop well-considered thinking process. Please structure your response into two main sections: Thought and Solution using the specified format: <think> Thought section </think> Solution section. In the Thought section, detail your reasoning process in steps. Each step should include detailed considerations such as analysing questions, summarizing relevant findings, brainstorming new ideas, verifying the accuracy of the current steps, refining any errors, and revisiting previous steps. In the Solution section, based on various attempts, explorations, and reflections from the Thought section, systematically present the final solution that you deem correct. The Solution section should be logical, accurate, and concise and detail necessary steps needed to reach the conclusion.

<|im_start|>user
Solve step by step: If a train travels 120 miles in 2 hours, what is its average speed?<|im_end|>
<|im_start|>assistant


==================================================

Compare Base vs Instruct Model Responses

In this section, we run the same prompt through the base and instruct variants to observe formatting differences and how the chat template impacts generation quality and style.

# Test the same prompt on both models
test_prompt = "Explain quantum computing in simple terms."

# Prepare the prompt for base model (no chat template)
base_inputs = base_tokenizer(test_prompt, return_tensors="pt").to(device)

# Prepare the prompt for instruct model (with chat template)
instruct_messages = [{"role": "user", "content": test_prompt}]
instruct_formatted = instruct_tokenizer.apply_chat_template(
    instruct_messages, 
    tokenize=False, 
    add_generation_prompt=True
)
instruct_inputs = instruct_tokenizer(instruct_formatted, return_tensors="pt").to(device)

# Generate responses
print("=== Model comparison ===\n")

print("🤖 BASE MODEL RESPONSE:")
with torch.no_grad():
    base_outputs = base_model.generate(
        **base_inputs,
        max_new_tokens=150,
        temperature=0.7,
        do_sample=True,
        pad_token_id=base_tokenizer.eos_token_id
    )
    base_response = base_tokenizer.decode(base_outputs[0], skip_special_tokens=True)
    print(base_response[len(test_prompt):])  # Show only the generated part

print("\n" + "="*50)
print("Instruct model response:")
with torch.no_grad():
    instruct_outputs = instruct_model.generate(
        **instruct_inputs,
        max_new_tokens=150,
        temperature=0.7,
        do_sample=True,
        pad_token_id=instruct_tokenizer.eos_token_id
    )
    instruct_response = instruct_tokenizer.decode(instruct_outputs[0], skip_special_tokens=True)
    # Extract only the assistant's response
    assistant_start = instruct_response.find("<|im_start|>assistant\n") + len("<|im_start|>assistant\n")
    assistant_response = instruct_response[assistant_start:].split("<|im_end|>")[0]
    print(assistant_response)

If we dive into the output below, we can see the differences between the base model and the instruct model. In short, the base model simply continues the string, while the instruct model follows the chat template. For example, the base model continues the prompt with further questions rather than an answer, while the instruct model answers the question directly: "Quantum computing is a type of computing that uses quantum bits".

Output
=== Model comparison ===

🤖 BASE MODEL RESPONSE:
 Why is it thought to be superior to our current technology? How is it superior? What is it's limit?
Quantum computing is based on the fact that in quantum mechanics, a particle can be in multiple states at the same time. This is called superposition. But a single particle can not be in multiple locations at the same time. That is called entanglement. So, how can you have a particle in multiple locations at the same time? Quantum mechanics says that if you measure the location of a particle, it will randomly jump to a particular location. So, if you have 1000 particles, you can have each particle in 1000 different locations at the same time.
This is very useful for solving problems. For example

==================================================
Instruct model response:
nowledge Cutoff Date: June 2025
Today Date: 03 September 2025
Reasoning Mode: /no_think

## Custom Instructions

You are a helpful AI assistant named SmolLM, trained by Hugging Face.

user
Explain quantum computing in simple terms.
assistant
<think>

</think>
Quantum computing is a type of computing that uses quantum bits, or qubits, to perform calculations. In traditional computers, we use bits, which can be either 0 or 1. But in quantum computing, we use qubits that can exist in multiple states at once, like both 0 and 1 simultaneously.

Think of it like flipping a coin. A regular coin can land on either heads or tails, but a quantum coin can land on both heads and tails at the same time. This property is called superposition.

Another unique aspect of quantum computing is entanglement. Imagine two coins that are linked together. If one coin lands on heads, the other coin will always land on tails, no matter how far apart they are

Test Dual-Mode Reasoning

Here we probe SmolLM3’s dual reasoning modes on a few math and proportionality problems, toggling between /no_think and /think via the system prompt, keeping the temperature low for consistency, and extracting the assistant’s part of the chat-formatted output.

# Test SmolLM3's reasoning capabilities in both modes
reasoning_prompts = [
    "What is 15 × 24? Show your work.",
    "A recipe calls for 2 cups of flour for 12 cookies. How much flour is needed for 30 cookies?",
    "If I have $50 and spend $18.75 on lunch and $12.30 on a book, how much money do I have left?"
]

print("=== TESTING REASONING CAPABILITIES ===\n")

# Toggle extended thinking with the /no_think and /think flags in the system prompt
for thinking_prompt in ["/no_think", "/think"]:
    print(f"Thinking prompt: {thinking_prompt}")

    for i, prompt in enumerate(reasoning_prompts, 1):
        print(f"Problem {i}: {prompt}")

        messages = [
            {"role": "system", "content": thinking_prompt},
            {"role": "user", "content": prompt},
        ]
        formatted_prompt = instruct_tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        inputs = instruct_tokenizer(formatted_prompt, return_tensors="pt").to(device)

        with torch.no_grad():
            outputs = instruct_model.generate(
                **inputs,
                max_new_tokens=200,
                temperature=0.3,  # Lower temperature for more consistent reasoning
                do_sample=True,
                pad_token_id=instruct_tokenizer.eos_token_id
            )
            response = instruct_tokenizer.decode(outputs[0], skip_special_tokens=True)
            assistant_start = response.find("<|im_start|>assistant\n") + len("<|im_start|>assistant\n")
            assistant_response = response[assistant_start:].split("<|im_end|>")[0]
            print(f"Answer: {assistant_response}")

        print("\n" + "-"*50 + "\n")

If we dive into the output below, we can see the instruct model’s hybrid reasoning in action. In /no_think mode the <think> block stays empty and the model answers directly. In /think mode the model fills the <think> block with exploratory tokens, working through possible solutions before giving the final answer, which we can then extract with the chat template or with simple string manipulation as done here.

Output

=== TESTING REASONING CAPABILITIES ===

Thinking prompt: /no_think
Problem 1: What is 15 × 24? Show your work.
Answer: nowledge Cutoff Date: June 2025
Today Date: 03 September 2025
Reasoning Mode: /no_think

## Custom Instructions

You are a helpful AI assistant named SmolLM, trained by Hugging Face.

user
What is 15 × 24? Show your work.
assistant
<think>

</think>
To find the product of 15 and 24, we can use the standard multiplication algorithm. Here's how we can do it step by step:

15

× 24


First, we multiply 15 by 4 (the units digit of 24):

15

× 24

60 (15 × 4)


Next, we multiply 15 by 20 (the tens digit of 24, shifted one place to the left):

15

× 24

60 (15 × 4)

300 (15 × 20)


Now, we add the two partial products:

15

× 24

60 (15 × 4)

300 (15 × 20)

360


So, 15 × 24 = 360.

--------------------------------------------------

Problem 2: A recipe calls for 2 cups of flour for 12 cookies. How much flour is needed for 30 cookies?
Answer: nowledge Cutoff Date: June 2025
Today Date: 03 September 2025
Reasoning Mode: /no_think

## Custom Instructions

You are a helpful AI assistant named SmolLM, trained by Hugging Face.

user
A recipe calls for 2 cups of flour for 12 cookies. How much flour is needed for 30 cookies?
assistant
<think>

</think>
To solve this problem, we need to determine the amount of flour needed per cookie and then multiply that by the number of cookies we want to make.

First, let's find out how much flour is needed per cookie. The recipe calls for 2 cups of flour for 12 cookies. To find the amount of flour per cookie, we divide the total amount of flour by the number of cookies:

2 cups / 12 cookies = 1/6 cup per cookie

Now that we know how much flour is needed per cookie, we can multiply that by the number of cookies we want to make (30):

1/6 cup per cookie * 30 cookies = 5 cups

So, to make 30 cookies, you would need 5 cups of flour.

--------------------------------------------------

Problem 3: If I have $50 and spend $18.75 on lunch and $12.30 on a book, how much money do I have left?
Answer: nowledge Cutoff Date: June 2025
Today Date: 03 September 2025
Reasoning Mode: /no_think

## Custom Instructions

You are a helpful AI assistant named SmolLM, trained by Hugging Face.

user
If I have $50 and spend $18.75 on lunch and $12.30 on a book, how much money do I have left?
assistant
<think>

</think>
To find out how much money you have left, you need to subtract the total amount spent from your initial amount.

First, calculate the total amount spent on lunch and the book:
$18.75 (lunch) + $12.30 (book) = $31.05

Now, subtract the total amount spent from your initial amount:
$50 (initial amount) - $31.05 (total spent) = $18.95

So, you have $18.95 left.

--------------------------------------------------

Thinking prompt: /think
Problem 1: What is 15 × 24? Show your work.
Answer: nowledge Cutoff Date: June 2025
Today Date: 03 September 2025
Reasoning Mode: /think

## Custom Instructions

You are a helpful AI assistant named SmolLM, trained by Hugging Face. Your role as an assistant involves thoroughly exploring questions through a systematic thinking process before providing the final precise and accurate solutions. This requires engaging in a comprehensive cycle of analysis, summarizing, exploration, reassessment, reflection, backtracking, and iteration to develop well-considered thinking process. Please structure your response into two main sections: Thought and Solution using the specified format: <think> Thought section </think> Solution section. In the Thought section, detail your reasoning process in steps. Each step should include detailed considerations such as analysing questions, summarizing relevant findings, brainstorming new ideas, verifying the accuracy of the current steps, refining any errors, and revisiting previous steps. In the Solution section, based on various attempts, explorations, and reflections from the Thought section, systematically present the final solution that you deem correct. The Solution section should be logical, accurate, and concise and detail necessary steps needed to reach the conclusion.

user
What is 15 × 24? Show your work.
assistant
<think>
Okay, let's see. I need to calculate 15 multiplied by 24. Hmm, how do I do that? I remember there are a few methods. Maybe the standard multiplication algorithm? Or maybe breaking it down into smaller parts. Let me try both ways to make sure I get the right answer.

First, the standard way. Let me write it out like I'm doing long multiplication. So, 15 times 24. I can think of 24 as 20 + 4. So maybe I can break it down into 15 times 20 plus 15 times 4. That might be easier.

Starting with 15 times 20. Well, 15 times 2 is 30, so adding a zero at the end makes it 300. So 15 × 20 = 300. Got that part.

Now, 15 times 4. Let me calculate that. 15 times 4 is 60. Right? Because 10

--------------------------------------------------

Problem 2: A recipe calls for 2 cups of flour for 12 cookies. How much flour is needed for 30 cookies?
Answer: nowledge Cutoff Date: June 2025
Today Date: 03 September 2025
Reasoning Mode: /think

## Custom Instructions

You are a helpful AI assistant named SmolLM, trained by Hugging Face. Your role as an assistant involves thoroughly exploring questions through a systematic thinking process before providing the final precise and accurate solutions. This requires engaging in a comprehensive cycle of analysis, summarizing, exploration, reassessment, reflection, backtracking, and iteration to develop well-considered thinking process. Please structure your response into two main sections: Thought and Solution using the specified format: <think> Thought section </think> Solution section. In the Thought section, detail your reasoning process in steps. Each step should include detailed considerations such as analysing questions, summarizing relevant findings, brainstorming new ideas, verifying the accuracy of the current steps, refining any errors, and revisiting previous steps. In the Solution section, based on various attempts, explorations, and reflections from the Thought section, systematically present the final solution that you deem correct. The Solution section should be logical, accurate, and concise and detail necessary steps needed to reach the conclusion.

user
A recipe calls for 2 cups of flour for 12 cookies. How much flour is needed for 30 cookies?
assistant
<think>
Okay, so I need to figure out how much flour is needed for 30 cookies if the recipe calls for 2 cups of flour for 12 cookies. Hmm, let's see. I think this is a proportion problem. If 12 cookies require 2 cups, then I need to find out how much flour is needed for 30 cookies. 

First, maybe I should determine how much flour is needed per cookie. If 12 cookies take 2 cups, then per cookie it would be 2 divided by 12. Let me write that down: 2 cups / 12 cookies. That simplifies to 1/6 cup per cookie. Wait, 2 divided by 12 is 1/6? Let me check that again. 2 divided by 12 is indeed 1/6. Yeah, because 12 divided by 6 is 2, so 2 divided by 12 is 1/6. So each cookie needs 

--------------------------------------------------

Problem 3: If I have $50 and spend $18.75 on lunch and $12.30 on a book, how much money do I have left?
Answer: nowledge Cutoff Date: June 2025
Today Date: 03 September 2025
Reasoning Mode: /think

## Custom Instructions

You are a helpful AI assistant named SmolLM, trained by Hugging Face. Your role as an assistant involves thoroughly exploring questions through a systematic thinking process before providing the final precise and accurate solutions. This requires engaging in a comprehensive cycle of analysis, summarizing, exploration, reassessment, reflection, backtracking, and iteration to develop well-considered thinking process. Please structure your response into two main sections: Thought and Solution using the specified format: <think> Thought section </think> Solution section. In the Thought section, detail your reasoning process in steps. Each step should include detailed considerations such as analysing questions, summarizing relevant findings, brainstorming new ideas, verifying the accuracy of the current steps, refining any errors, and revisiting previous steps. In the Solution section, based on various attempts, explorations, and reflections from the Thought section, systematically present the final solution that you deem correct. The Solution section should be logical, accurate, and concise and detail necessary steps needed to reach the conclusion.

user
If I have $50 and spend $18.75 on lunch and $12.30 on a book, how much money do I have left?
assistant
<think>
Okay, let's see. The problem is about calculating how much money is left after spending on lunch and a book. I need to start with the initial amount of $50 and then subtract the amounts spent on lunch and the book.

First, I should add up the total amount spent. The lunch cost $18.75 and the book cost $12.30. To add these two amounts together, I need to make sure they are in the same units, which they already are since both are in dollars. So, adding $18.75 and $12.30. Let me do that step by step.

Starting with the dollars: 18 dollars plus 12 dollars is 30 dollars. Then the cents: 75 cents plus 30 cents is 105 cents. Now, 105 cents is equal to $1.05 because 100 cents make a dollar, so 105 cents is 1 dollar and 5 cents. Therefore, adding the dollars and cents together

--------------------------------------------------
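When you only need the final solution from a /think response, a small helper can strip out the reasoning block. This is a minimal sketch; the extract_final_answer helper is our own, not part of any library:

import re

def extract_final_answer(assistant_text: str) -> str:
    """Remove the <think>...</think> block and return only the final answer."""
    return re.sub(r"<think>.*?</think>", "", assistant_text, flags=re.DOTALL).strip()

# Example with a /think-style response
example = "<think>\n2 cups / 12 cookies = 1/6 cup per cookie ...\n</think>\nYou need 5 cups of flour."
print(extract_final_answer(example))  # -> You need 5 cups of flour.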

Validation

Run the code above and verify that you can see:

  1. Different chat template formats for various conversation types
  2. Clear differences between base model and instruct model responses
  3. SmolLM3’s reasoning capabilities in action

Exercise 2: Dataset Processing for SFT

Objective: Learn to process and prepare datasets for supervised fine-tuning using SmolTalk2 and other datasets.

Explore the SmolTalk2 Dataset

We load the SmolTalk2 SFT split, inspect its structure and a few samples to understand fields (e.g., messages) and available subsets before preparing data for training.

# Load and explore the SmolTalk2 dataset
print("=== EXPLORING SMOLTALK2 DATASET ===\n")

# Load the SFT subset
dataset_dict = load_dataset("HuggingFaceTB/smoltalk2", "SFT")
print(f"Total splits: {len(dataset_dict)}")
print(f"Available splits: {list(dataset_dict.keys())}")
print(f"Number of total rows: {sum([dataset_dict[d].num_rows for d in dataset_dict])}")
print(f"Dataset structure: {dataset_dict}")

If we dive into the output below, we can see the structure of the dataset: it has 25 splits and a total of 3,383,242 rows.

Output
=== EXPLORING SMOLTALK2 DATASET ===

Resolving data files: 100%
 124/124 [00:00<00:00, 9963.48it/s]
Resolving data files: 100%
 113/113 [00:00<00:00, 57.54it/s]
Resolving data files: 100%
 113/113 [00:00<00:00, 114.07it/s]
Loading dataset shards: 100%
 105/105 [00:00<00:00, 2570.62it/s]
Total splits: 25
Available splits: ['LongAlign_64k_Qwen3_32B_yarn_131k_think', 'OpenThoughts3_1.2M_think', 'aya_dataset_Qwen3_32B_think', 'multi_turn_reasoning_if_think', 's1k_1.1_think', 'smolagents_toolcalling_traces_think', 'smoltalk_everyday_convs_reasoning_Qwen3_32B_think', 'smoltalk_multilingual8_Qwen3_32B_think', 'smoltalk_systemchats_Qwen3_32B_think', 'table_gpt_Qwen3_32B_think', 'LongAlign_64k_context_lang_annotated_lang_6_no_think', 'Mixture_of_Thoughts_science_no_think', 'OpenHermes_2.5_no_think', 'OpenThoughts3_1.2M_no_think_no_think', 'hermes_function_calling_v1_no_think', 'smoltalk_multilingual_8languages_lang_5_no_think', 'smoltalk_smollm3_everyday_conversations_no_think', 'smoltalk_smollm3_explore_instruct_rewriting_no_think', 'smoltalk_smollm3_smol_magpie_ultra_no_think', 'smoltalk_smollm3_smol_rewrite_no_think', 'smoltalk_smollm3_smol_summarize_no_think', 'smoltalk_smollm3_systemchats_30k_no_think', 'table_gpt_no_think', 'tulu_3_sft_personas_instruction_following_no_think', 'xlam_traces_no_think']
Number of total rows: 3383242
Dataset structure: DatasetDict({
    LongAlign_64k_Qwen3_32B_yarn_131k_think: Dataset({
        features: ['messages', 'chat_template_kwargs', 'source'],
        num_rows: 7526
    })
    OpenThoughts3_1.2M_think: Dataset({
        features: ['messages', 'chat_template_kwargs', 'source'],
        num_rows: 1133524
    })
    aya_dataset_Qwen3_32B_think: Dataset({
        features: ['messages', 'chat_template_kwargs', 'source'],
        num_rows: 15222
    })
    multi_turn_reasoning_if_think: Dataset({
        features: ['messages', 'chat_template_kwargs', 'source'],
        num_rows: 28217
    })
    s1k_1.1_think: Dataset({
        features: ['messages', 'chat_template_kwargs', 'source'],
        num_rows: 835
    })
    smolagents_toolcalling_traces_think: Dataset({
        features: ['messages', 'chat_template_kwargs', 'source'],
        num_rows: 9079
    })
    smoltalk_everyday_convs_reasoning_Qwen3_32B_think: Dataset({
        features: ['messages', 'chat_template_kwargs', 'source'],
        num_rows: 2057
    })
    smoltalk_multilingual8_Qwen3_32B_think: Dataset({
        features: ['messages', 'chat_template_kwargs', 'source'],
        num_rows: 244736
    })
    smoltalk_systemchats_Qwen3_32B_think: Dataset({
        features: ['messages', 'chat_template_kwargs', 'source'],
        num_rows: 27436
    })
    table_gpt_Qwen3_32B_think: Dataset({
        features: ['messages', 'chat_template_kwargs', 'source'],
        num_rows: 13201
    })
    LongAlign_64k_context_lang_annotated_lang_6_no_think: Dataset({
        features: ['messages', 'chat_template_kwargs', 'source'],
        num_rows: 6249
    })
    Mixture_of_Thoughts_science_no_think: Dataset({
        features: ['messages', 'chat_template_kwargs', 'source'],
        num_rows: 86110
    })
    OpenHermes_2.5_no_think: Dataset({
        features: ['messages', 'chat_template_kwargs', 'source'],
        num_rows: 384900
    })
    OpenThoughts3_1.2M_no_think_no_think: Dataset({
        features: ['messages', 'chat_template_kwargs', 'source'],
        num_rows: 435193
    })
    hermes_function_calling_v1_no_think: Dataset({
        features: ['messages', 'chat_template_kwargs', 'source'],
        num_rows: 8961
    })
    smoltalk_multilingual_8languages_lang_5_no_think: Dataset({
        features: ['messages', 'chat_template_kwargs', 'source'],
        num_rows: 254047
    })
    smoltalk_smollm3_everyday_conversations_no_think: Dataset({
        features: ['messages', 'chat_template_kwargs', 'source'],
        num_rows: 2260
    })
    smoltalk_smollm3_explore_instruct_rewriting_no_think: Dataset({
        features: ['messages', 'chat_template_kwargs', 'source'],
        num_rows: 30391
    })
    smoltalk_smollm3_smol_magpie_ultra_no_think: Dataset({
        features: ['messages', 'chat_template_kwargs', 'source'],
        num_rows: 406843
    })
    smoltalk_smollm3_smol_rewrite_no_think: Dataset({
        features: ['messages', 'chat_template_kwargs', 'source'],
        num_rows: 53262
    })
    smoltalk_smollm3_smol_summarize_no_think: Dataset({
        features: ['messages', 'chat_template_kwargs', 'source'],
        num_rows: 96061
    })
    smoltalk_smollm3_systemchats_30k_no_think: Dataset({
        features: ['messages', 'chat_template_kwargs', 'source'],
        num_rows: 33997
    })
    table_gpt_no_think: Dataset({
        features: ['messages', 'chat_template_kwargs', 'source'],
        num_rows: 13203
    })
    tulu_3_sft_personas_instruction_following_no_think: Dataset({
        features: ['messages', 'chat_template_kwargs', 'source'],
        num_rows: 29970
    })
    xlam_traces_no_think: Dataset({
        features: ['messages', 'chat_template_kwargs', 'source'],
        num_rows: 59962
    })
})
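To get a quick sense of which subsets dominate the mixture, you can sort the splits by size using the dataset_dict we loaded above:

# List the SmolTalk2 subsets sorted by number of rows
for name, split in sorted(dataset_dict.items(), key=lambda kv: kv[1].num_rows, reverse=True):
    print(f"{name}: {split.num_rows:,} rows")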

Process Different Dataset Types

The SmolTalk2 dataset is a collection of open source datasets compiled together for convenience. It covers a mixture of useful post-training use cases, like tool use, long context, and more, and every subset is already in the chat format, which is easy to use for training. However, many other datasets are not shared in a consistent format, so we often need to process them into a unified chat messages layout.

For this exercise, we will standardize multiple dataset formats into a unified chat messages layout. We define lightweight processors for QA and instruction datasets and walk through a concrete example using GSM8K.

# Function to process different dataset formats
def process_qa_dataset(examples, question_col, answer_col):
    """Process Q&A datasets into chat format"""
    processed = []
    
    for question, answer in zip(examples[question_col], examples[answer_col]):
        messages = [
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer}
        ]
        processed.append(messages)
    
    return {"messages": processed}

def process_instruction_dataset(examples):
    """Process instruction-following datasets"""
    processed = []
    
    for instruction, response in zip(examples["instruction"], examples["response"]):
        messages = [
            {"role": "user", "content": instruction},
            {"role": "assistant", "content": response}
        ]
        processed.append(messages)
    
    return {"messages": processed}

# Example: Process GSM8K math dataset
print("=== PROCESSING GSM8K DATASET ===\n")

gsm8k = load_dataset("openai/gsm8k", "main", split="train[:100]")  # Small subset for demo
print(f"Original GSM8K example: {gsm8k[0]}")

# Convert to chat format
def process_gsm8k(examples):
    processed = []
    for question, answer in zip(examples["question"], examples["answer"]):
        messages = [
            {"role": "system", "content": "You are a math tutor. Solve problems step by step."},
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer}
        ]
        processed.append(messages)
    return {"messages": processed}

gsm8k_processed = gsm8k.map(process_gsm8k, batched=True, remove_columns=gsm8k.column_names)
print(f"Processed example: {gsm8k_processed[0]}")

Below we can see the original GSM8K sample and the same sample converted into the unified chat messages format.

Output
=== PROCESSING GSM8K DATASET ===

README.md: 
 7.94k/? [00:00<00:00, 572kB/s]
main/train-00000-of-00001.parquet: 100%
 2.31M/2.31M [00:01<00:00, 42.6kB/s]
main/test-00000-of-00001.parquet: 100%
 419k/419k [00:00<00:00, 813kB/s]
Generating train split: 100%
 7473/7473 [00:00<00:00, 321312.49 examples/s]
Generating test split: 100%
 1319/1319 [00:00<00:00, 97120.71 examples/s]
Original GSM8K example: {'question': 'Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?', 'answer': 'Natalia sold 48/2 = <<48/2=24>>24 clips in May.\nNatalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\n#### 72'}
Map: 100%
 100/100 [00:00<00:00, 4792.50 examples/s]
Processed example: {'messages': [{'content': 'You are a math tutor. Solve problems step by step.', 'role': 'system'}, {'content': 'Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?', 'role': 'user'}, {'content': 'Natalia sold 48/2 = <<48/2=24>>24 clips in May.\nNatalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\n#### 72', 'role': 'assistant'}]}

Apply Chat Templates to Datasets

Once messages are normalized, we apply the model’s chat template to convert each example into plain training text (text column) suitable for language modeling with SFT.

# Function to apply chat templates to processed datasets
def apply_chat_template_to_dataset(dataset, tokenizer):
    """Apply chat template to dataset for training"""
    
    def format_messages(examples):
        formatted_texts = []
        
        for messages in examples["messages"]:
            # Apply chat template
            formatted_text = tokenizer.apply_chat_template(
                messages,
                tokenize=False,
                add_generation_prompt=False  # We want the complete conversation
            )
            formatted_texts.append(formatted_text)
        
        return {"text": formatted_texts}
    
    return dataset.map(format_messages, batched=True)

# Apply to our processed GSM8K dataset
gsm8k_formatted = apply_chat_template_to_dataset(gsm8k_processed, instruct_tokenizer)
print("=== FORMATTED TRAINING DATA ===")
print(gsm8k_formatted[0]["text"])

Exercise 3: Fine-Tuning SmolLM3 with SFTTrainer

Objective: Perform supervised fine-tuning on SmolLM3 using TRL’s SFTTrainer with real datasets.

You will need a GPU with at least 8GB VRAM.

Setup and Model Loading

We load the base model and tokenizer, set padding behavior, and move the model to an appropriate device to prepare for fine-tuning.

# Import required libraries for fine-tuning
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset
import torch
# import wandb  # Optional: uncomment if you use Weights & Biases for experiment tracking

# Initialize Weights & Biases (optional)
# wandb.init(project="smollm3-finetuning")

# Load SmolLM3 base model for fine-tuning
model_name = "HuggingFaceTB/SmolLM3-3B-Base"
new_model_name = "SmolLM3-Custom-SFT"

print(f"Loading {model_name}...")
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Set padding token
tokenizer.padding_side = "right"  # Padding on the right for generation

print(f"Model loaded! Parameters: {model.num_parameters():,}")
Output
Loading HuggingFaceTB/SmolLM3-3B-Base...
Model loaded! Parameters: 3,075,098,624

Dataset Preparation

Here we select a manageable subset for speed, then map each example to a single text string by applying the chat template—this is the field the trainer will read.

# Load and prepare training dataset
print("=== PREPARING DATASET ===\n")

# Option 1: Use SmolTalk2 (recommended for beginners)
dataset = load_dataset("HuggingFaceTB/smoltalk2", "SFT")
train_dataset = dataset["smoltalk_everyday_convs_reasoning_Qwen3_32B_think"].select(range(1000))  # Use subset for faster training

# Option 2: Use your own processed dataset from Exercise 2
# train_dataset = gsm8k_formatted.select(range(500))

print(f"Training examples: {len(train_dataset)}")
print(f"Example: {train_dataset[0]}")

# Prepare the dataset for SFT
def format_chat_template(example):
    """Format the messages using the chat template"""
    if "messages" in example:
        # SmolTalk2 format
        messages = example["messages"]
    else:
        # Custom format - adapt as needed
        messages = [
            {"role": "user", "content": example["instruction"]},
            {"role": "assistant", "content": example["response"]}
        ]
    
    # Apply the chat template (reusing the instruct tokenizer from Exercise 1, which defines SmolLM3's chat format)
    text = instruct_tokenizer.apply_chat_template(
        messages, 
        tokenize=False,
        add_generation_prompt=False
    )
    return {"text": text}

# Apply formatting
formatted_dataset = train_dataset.map(format_chat_template)
formatted_dataset = formatted_dataset.remove_columns(
    [col for col in formatted_dataset.column_names if col != "text"]
)
print(f"Formatted example: {formatted_dataset[0]['text'][:200]}...")
Output
=== PREPARING DATASET ===
Training examples: 1000
Example: {'messages': [{'content': 'Solve the problem step by step.', 'role': 'system'}, {'content': 'Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?', 'role': 'user'}]}
Formatted example: You are a math tutor. Solve problems step by step. Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?...

Training Configuration

We configure key knobs for SFT (batch size, sequence length, learning rate, logging/saving cadence) and enable optional tracking and Hub integration.

# Configure training parameters
training_config = SFTConfig(
    # Model and data
    output_dir=f"./{new_model_name}",
    dataset_text_field="text",
    max_length=2048,
    
    # Training hyperparameters
    per_device_train_batch_size=2,  # Adjust based on your GPU memory
    gradient_accumulation_steps=2,
    learning_rate=5e-5,
    num_train_epochs=1,  # Start with 1 epoch
    max_steps=500,  # Limit steps for demo
    
    # Optimization
    warmup_steps=50,
    weight_decay=0.01,
    optim="adamw_torch",
    
    # Logging and saving
    logging_steps=10,
    save_steps=100,
    eval_steps=100,
    save_total_limit=2,
    
    # Memory optimization
    dataloader_num_workers=0,
    group_by_length=True,  # Group similar length sequences
    
    # Hugging Face Hub integration
    push_to_hub=False,  # Set to True to upload to Hub
    hub_model_id=f"your-username/{new_model_name}",
    
    # Experiment tracking
    report_to=["trackio"],  # Use trackio for experiment tracking
    run_name=f"{new_model_name}-training",
)

print("Training configuration set!")
print(f"Effective batch size: {training_config.per_device_train_batch_size * training_config.gradient_accumulation_steps}")
Output
Training configuration set!
Effective batch size: 4

Optional: Train with LoRA/PEFT (memory-efficient)

If you have limited GPU memory or want faster iterations, use LoRA via PEFT. This trains only small adapter weights while keeping the base model frozen, then you can either keep using adapters or merge them later for deployment.

# LoRA configuration with PEFT
from peft import LoraConfig

peft_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

# Create SFTTrainer with LoRA enabled
from trl import SFTTrainer

lora_trainer = SFTTrainer(
    model=model,
    train_dataset=formatted_dataset,  # dataset with a "text" field or messages + dataset_text_field in config
    args=training_config,
    peft_config=peft_config,  # << enable LoRA
)

print("Starting LoRA training…")
lora_trainer.train()
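After LoRA training you can keep serving the adapters on top of the base model, or merge them into the base weights for standalone deployment. A minimal sketch of merging, assuming the LoRA run above completed (the output directory name here is our choice):

# Merge the LoRA adapters into the base model and save a standalone checkpoint
merged_model = lora_trainer.model.merge_and_unload()
merged_model.save_pretrained(f"./{new_model_name}-lora-merged")
tokenizer.save_pretrained(f"./{new_model_name}-lora-merged")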

Initialize SFTTrainer and Train

We capture a pre-training baseline generation, instantiate the trainer, launch train(), and let the trainer save the resulting checkpoints to the configured output directory.
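Before creating the trainer, let's capture a baseline generation from the not-yet-fine-tuned model so we have something to compare against after training. This is a small sketch: the test prompt is an arbitrary choice, and we reuse the instruct tokenizer's chat template from Exercise 1 for formatting.

# Capture a baseline generation before fine-tuning (for later comparison)
test_prompt = "Explain the difference between supervised and unsupervised learning."
messages = [{"role": "user", "content": test_prompt}]
formatted_prompt = instruct_tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(formatted_prompt, return_tensors="pt").to(model.device)

print("=== BEFORE TRAINING ===")
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )
    # Decode only the newly generated tokens
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    print(tokenizer.decode(new_tokens, skip_special_tokens=True))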


# Initialize the SFT trainer with our model, dataset, and config
trainer = SFTTrainer(
    model=model,
    train_dataset=formatted_dataset,
    args=training_config,
)

And we can train the model.

trainer.train()

Test the Fine-Tuned Model

Finally, we run the same prompt again to qualitatively compare outputs before and after training, and optionally push the model to the Hub for sharing.

# Test the fine-tuned model on the same prompt we used before training
print("=== AFTER TRAINING ===")
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )
    # Decode only the newly generated tokens
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    print(tokenizer.decode(new_tokens, skip_special_tokens=True))

# Optional: Push to Hugging Face Hub
if training_config.push_to_hub:
    trainer.push_to_hub(
        commit_message="Fine-tuned SmolLM3 with custom dataset",
        tags=["smol-course", "sft", "instruction-tuning"]
    )
    print(f"Model pushed to Hub: {training_config.hub_model_id}")

Exercise 4: Production Workflow with TRL CLI

In the previous exercises we dived deep into TRL’s Python API for fine-tuning and explored the data we’re using and generating. In this exercise we’ll use the TRL CLI to fine-tune a model, which is the most common way to run fine-tuning in production.

We can define the whole fine-tuning run as a single TRL CLI command and launch it with trl sft. The CLI and the Python API share the same configuration options.

We preprocessed the smoltalk_everyday_convs_reasoning_Qwen3_32B_think subset of SmolTalk2 so that it is easier to work with when using the TRL CLI.
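If you want to reproduce that preprocessing yourself, a rough sketch looks like the following. The exact script we used is not shown here, and pushing the subset to your own namespace is an assumption for illustration:

from datasets import load_dataset

# Extract the single SmolTalk2 subset and publish it as a standalone dataset
subset = load_dataset(
    "HuggingFaceTB/smoltalk2", "SFT",
    split="smoltalk_everyday_convs_reasoning_Qwen3_32B_think",
)
subset.push_to_hub("your-username/smoltalk2_everyday_convs_think")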

# Fine-tune SmolLM3 using TRL CLI
trl sft \
    --model_name_or_path HuggingFaceTB/SmolLM3-3B-Base \
    --dataset_name HuggingFaceTB/smoltalk2_everyday_convs_think \
    --output_dir ./smollm3-sft-cli \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 2 \
    --learning_rate 5e-5 \
    --num_train_epochs 1 \
    --max_length 2048 \
    --logging_steps 10 \
    --save_steps 500 \
    --warmup_steps 100 \
    --bf16 True \
    --push_to_hub \
    --hub_model_id your-username/smollm3-sft-cli

For convenience and reproducibility, we can also create a configuration file to fine-tune a model. For example, we could create a file called sft_config.yaml and put the following content in it:

# Model and dataset
model_name_or_path: HuggingFaceTB/SmolLM3-3B-Base
dataset_name: HuggingFaceTB/smoltalk2_everyday_convs_think
output_dir: ./smollm3-advanced-sft

# Training hyperparameters
per_device_train_batch_size: 2
gradient_accumulation_steps: 4
learning_rate: 3e-5
num_train_epochs: 2
max_length: 4096

# Optimization
warmup_steps: 200
weight_decay: 0.01
optim: adamw_torch
lr_scheduler_type: cosine

# Memory and performance
bf16: true
dataloader_num_workers: 4
group_by_length: true
remove_unused_columns: false

# Logging and evaluation
logging_steps: 25
eval_steps: 250
save_steps: 500
eval_strategy: steps
load_best_model_at_end: true
metric_for_best_model: eval_loss

# Hub integration
push_to_hub: true
hub_model_id: your-username/smollm3-advanced
hub_strategy: every_save

We could then commit this file to the repository and track it with Git.

# Run training with config file
trl sft --config sft_config.yaml

Troubleshooting

If you get GPU out of memory errors:

  • Reduce per_device_train_batch_size to 1
  • Reduce max_length to 1024 or 512
  • Use torch.cuda.empty_cache() to clear GPU memory (see the snippet after this list)
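For example, you can drop references to models you no longer need and release the cached memory between runs (a small optional sketch):

import gc
import torch

del base_model  # or any model/trainer you no longer need
gc.collect()
torch.cuda.empty_cache()
print(f"Allocated GPU memory: {torch.cuda.memory_allocated() / 1e9:.2f} GB")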

If models fail to load:

  • Check your internet connection
  • Try using device_map="cpu" for CPU loading
  • Try a smaller model such as HuggingFaceTB/SmolLM2-1.7B-Instruct for testing

If training fails:

  • Make sure your dataset is properly formatted
  • Check that all examples have reasonable length (not too long)
  • Monitor the training loss (see the snippet after this list) - it should decrease steadily
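A quick way to sanity-check the loss curve after a run is to read the trainer's log history, for example:

# Inspect the logged training losses from the trainer's state
losses = [entry["loss"] for entry in trainer.state.log_history if "loss" in entry]
print(f"First logged loss: {losses[0]:.4f}, last logged loss: {losses[-1]:.4f}")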

Conclusion

Congratulations! You’ve completed comprehensive hands-on exercises covering:

  • SmolLM3’s chat template system and dual-mode reasoning
  • Dataset processing and preparation techniques
  • Supervised fine-tuning with Python APIs
  • Production workflows using CLI tools

These skills form the foundation for building sophisticated instruction-tuned models. In the next modules, we’ll explore preference alignment, parameter-efficient fine-tuning, and advanced evaluation techniques.

Resources for Further Learning
