Supervised Fine-Tuning with SmolLM3

Supervised Fine-Tuning (SFT) is the cornerstone of instruction tuning - it’s how we transform a base language model into an instruction-following assistant. In this section, you’ll learn to fine-tune SmolLM3 using real-world datasets and production-ready tools.

What is Supervised Fine-Tuning?

SFT is the process of continuing to train a pre-trained model on task-specific datasets with labeled examples. Think of it as specialized education:

  • Pre-training teaches the model general language understanding (like learning to read).
  • Supervised fine-tuning teaches specific skills and behaviors (like learning to do a specific task).

The key insight behind SFT is that we’re not teaching the model new knowledge from scratch. Instead, we’re reshaping how existing knowledge is applied. The pre-trained model already understands language, grammar, and has absorbed vast amounts of factual information. SFT focuses this general capability toward specific application patterns, response styles, and task-specific requirements.

This approach is effective because it leverages the rich representations learned during pre-training while requiring significantly less computational resources than training from scratch. The model learns to recognize instruction patterns, maintain conversation context, follow safety guidelines, and generate responses in desired formats.

Before starting SFT, consider whether using an existing instruction-tuned model with well-crafted prompts would suffice for your use case. SFT involves significant computational resources and engineering effort, so it should only be pursued when prompting existing models proves insufficient. Learn more about this decision process in the Hugging Face LLM Course.

The SmolLM3 SFT Journey

SmolLM3’s instruction-following capabilities come from a sophisticated SFT process:

  1. Base Model (SmolLM3-3B-Base): Trained on 11T tokens of general text
  2. SFT Training: Fine-tuned on curated instruction datasets including SmolTalk2
  3. Preference Alignment: Further refined using techniques like APO (Anchored Preference Optimization)

This multi-stage approach creates a model that’s both knowledgeable and helpful.

Why SFT Works: The Science Behind It

SFT is effective because it leverages the rich representations learned during pre-training while adapting the model’s behavior patterns. During SFT, the model’s parameters are fine-tuned through gradient descent on task-specific examples, causing subtle but important changes in how the model processes and generates text.

Specifically, the process works through several key mechanisms:

Behavioral Adaptation: The model learns to recognize instruction patterns and respond appropriately. This involves updating the attention mechanisms to focus on instruction cues in language and adjusting the output distribution to favor the desired responses. Research has shown that instruction tuning primarily affects the model’s surface-level behavior rather than its underlying knowledge (Wei et al., 2021).

Task Specialization: Rather than learning entirely new concepts, the model learns to apply its existing knowledge in specific contexts. This is why SFT is much more efficient than pre-training - we’re refining existing capabilities rather than building them from scratch. Studies indicate that most of the factual knowledge comes from pre-training, while SFT teaches the model how to format and present this knowledge appropriately (Ouyang et al., 2022).

Safety Alignment: Through exposure to carefully curated examples, the model learns to be more helpful, harmless, and honest. This involves both learning what to say and what not to say in various situations. The effectiveness of this approach has been demonstrated in works like InstructGPT (Ouyang et al., 2022) and Constitutional AI (Bai et al., 2022).

SFT doesn’t teach new facts - it teaches new behaviors. The model already knows about the world from pre-training; SFT teaches it how to be a helpful assistant using that knowledge.

The mathematical foundation involves minimizing the cross-entropy loss between the model’s predictions and the target responses in your training dataset. This process gradually shifts the model’s probability distributions to favor the types of responses demonstrated in your training examples.

When to Use Supervised Fine-Tuning

The key question is: “Does my use case require behavior that differs significantly from general-purpose conversation?” If yes, SFT is likely beneficial.

Decision framework: Use this checklist to determine if SFT is appropriate for your project:

  • Have you tried prompt engineering with existing instruction-tuned models?
  • Do you need consistent output formats that prompting cannot achieve?
  • Is your domain specialized enough that general models struggle?
  • Do you have high-quality training data (at least 1,000 examples)?
  • Do you have the computational resources for training and evaluation?

If you answered “yes” to most of these, SFT is likely worth pursuing.

The SFT Process

Now let’s move on to SFT itself. The process follows a systematic approach that ensures high-quality results:

1. Dataset Preparation and Selection

The quality of your training data is the most critical factor for successful SFT. Unlike pre-training where quantity often matters most, SFT prioritizes quality and relevance. Your dataset should contain input-output pairs that demonstrate exactly the behavior you want your model to learn.

Choose the Right Dataset:

  • SmolTalk2: The dataset used to train SmolLM3, containing high-quality instruction-response pairs.
  • Domain-specific datasets: For specialized applications (medical, legal, technical).
  • Custom datasets: Your own curated examples for specific use cases.

Each training example should consist of:

  1. Input prompt: The user’s instruction or question
  2. Expected response: The ideal assistant response
  3. Context (optional): Any additional information needed

Dataset size guidelines:

  • Minimum: 1,000 high-quality examples for basic fine-tuning.
  • Recommended: 10,000+ examples for robust performance.
  • Quality over quantity: 1,000 well-curated examples often outperform 10,000 mediocre ones.

Remember: Your model will learn to mimic the patterns in your training data, so invest time in data curation.
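
For instance, building on the guidelines above, a small custom dataset in the prompt-completion format (the supported formats are covered in detail later in this section) can be built directly with the datasets library. This is a minimal sketch with made-up examples:

from datasets import Dataset

# Minimal sketch of a custom prompt-completion dataset (made-up examples).
# In practice you would curate at least ~1,000 examples of this shape.
examples = [
    {
        "prompt": [{"role": "user", "content": "Summarize: The meeting is moved to 3pm on Friday."}],
        "completion": [{"role": "assistant", "content": "The meeting has been rescheduled to Friday at 3pm."}],
    },
    {
        "prompt": [{"role": "user", "content": "Translate to French: Good morning."}],
        "completion": [{"role": "assistant", "content": "Bonjour."}],
    },
]

dataset = Dataset.from_list(examples)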

2. Environment Setup and Configuration

To set up an environment for SFT, we will need suitable compute resources. We have four main options:

  1. Local GPU: If you are lucky enough to have access to a GPU with at least 16GB of VRAM, you can train your model locally!
  2. Hugging Face Jobs: If you don’t have a GPU and don’t want to use a cloud provider, you can use Hugging Face Jobs! We’ll go into more detail about this in the next section.
  3. Notebook GPUs: If you like to use a notebook provider like Google Colab, you can use their GPUs!
  4. Cloud GPU: If you want to take control of your compute resources, you can use a cloud provider like AWS, GCP, or Azure.

In terms of hardware requirements, you will need a GPU with at least 16GB of VRAM, for example an Nvidia RTX 4080 or A10G.
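
Before launching a run, it can help to confirm how much GPU memory is actually available. Here is a minimal sketch assuming PyTorch and a CUDA device:

import torch

# Quick sanity check: is a CUDA GPU visible, and how much VRAM does it have?
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, VRAM: {props.total_memory / 1024**3:.1f} GB")
else:
    print("No CUDA GPU detected - consider Hugging Face Jobs, a notebook GPU, or a cloud provider.")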

3. Training Configuration

Choosing the right hyperparameters is crucial for successful SFT. The goal is to find the sweet spot where the model learns effectively without overfitting or becoming unstable. Here’s a detailed breakdown of each parameter and how to choose them:

Key Hyperparameters:

Learning Rate (5e-5 to 1e-4): Controls how much the model weights change with each update

  • Start with 5e-5 for SmolLM3; this is conservative and stable.
  • Too high: The model becomes unstable; loss oscillates or explodes.
  • Too low: The model learns very slowly and may not converge in reasonable time.

Batch Size (4-16): Number of examples processed simultaneously

  • Larger batches: More stable gradients, but require more GPU memory.
  • Smaller batches: Less memory usage, but noisier gradients.
  • Use gradient accumulation to achieve larger effective batch sizes.

Max Sequence Length (2048-4096): Maximum tokens per training example

  • Longer sequences: Can handle more complex conversations.
  • Shorter sequences: Faster training, less memory usage.
  • Match your use case: Use the typical length of your target conversations.

Training Steps (1000-5000): Total number of parameter updates

  • Depends on dataset size: More data usually requires more steps.
  • Monitor validation loss: Stop when it stops improving.
  • Rule of thumb: Three to five epochs through your dataset.
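
For example, with 10,000 examples and an effective batch size of 16, one epoch is 625 optimizer steps, so three to five epochs corresponds to roughly 1,900-3,100 steps.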

Warmup Steps (10% of total): Gradual learning rate increase at start

  • Prevents early instability: Helps the model adapt gradually.
  • Typical range: 100-500 steps for most SFT tasks.

Hyperparameter starting points for SmolLM3:

To bootstrap your training, you can use the following hyperparameters:

Learning Rate:

# Conservative (stable, slower)
learning_rate = 5e-5

# Balanced (recommended)
learning_rate = 1e-4

# Aggressive (faster, less stable)
learning_rate = 2e-4

Batch Size:

We can keep the effective batch size constant while reducing per-device memory by using gradient accumulation; in each configuration below, per_device_train_batch_size × gradient_accumulation_steps = 16.

# Limited GPU Memory
per_device_train_batch_size = 2
gradient_accumulation_steps = 8

# Balanced GPU Memory
per_device_train_batch_size = 4
gradient_accumulation_steps = 4

# More GPU Memory
per_device_train_batch_size = 8
gradient_accumulation_steps = 2

Max Sequence Length:

# Very short sequences
max_length = 512

# Short sequences
max_length = 1024

# Long sequences 
max_length = 2048

# Very long sequences
max_length = 4096
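
Putting these starting points together, here is a sketch of a balanced SFTConfig for SmolLM3 (the values mirror the recommendations above; the output directory is illustrative):

from trl import SFTConfig

config = SFTConfig(
    output_dir="./smollm3-sft",
    learning_rate=5e-5,             # conservative and stable
    per_device_train_batch_size=4,  # effective batch size 16 with accumulation
    gradient_accumulation_steps=4,
    max_length=2048,                # match your typical conversation length
    max_steps=1000,
    warmup_steps=100,               # roughly 10% of total steps
)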

4. Monitoring and Evaluation

Effective monitoring is crucial for successful SFT. Unlike pre-training where you primarily watch loss decrease, SFT requires careful attention to both quantitative metrics and qualitative outputs. The goal is to ensure your model is learning the desired behaviors without overfitting or developing unwanted patterns.

Key Metrics to Monitor:

Training Loss: Should decrease steadily but not too rapidly

  • Healthy pattern: Smooth, gradual decrease.
  • Warning signs: Sudden spikes, oscillations, or plateaus.
  • Typical range: Starts around 2-4, should decrease to 0.5-1.5.

Validation Loss: Most important metric for preventing overfitting

  • Should track training loss: A small gap indicates good generalization.
  • Growing gap: Sign of overfitting; the model may be memorizing training data.
  • Use for early stopping: Stop training when validation loss stops improving (see the sketch after this list).

Sample Outputs: Regular qualitative checks are essential

  • Generate responses: Test the model on held-out prompts during training.
  • Check format consistency: Ensure the model follows desired response patterns.
  • Monitor for degradation: Watch for repetitive or nonsensical outputs.

Resource Usage: Track GPU memory and training speed

  • Memory spikes: May indicate batch size is too large.
  • Slow training: Could suggest inefficient data loading or processing.
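
One way to act on the validation-loss guidance is early stopping. The sketch below assumes you have a held-out evaluation split and uses the transformers EarlyStoppingCallback together with TRL's SFTTrainer:

from transformers import EarlyStoppingCallback
from trl import SFTTrainer, SFTConfig

# Sketch: evaluate periodically and stop when validation loss stops improving.
config = SFTConfig(
    output_dir="./smollm3-sft",
    eval_strategy="steps",
    eval_steps=100,
    save_steps=100,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

trainer = SFTTrainer(
    model=model,                   # your SmolLM3 model
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],  # assumes a held-out split exists
    args=config,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)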

Understanding Loss Patterns in SFT

Training loss typically follows three distinct phases, as illustrated in this example from the Hugging Face LLM Course:

SFT Training Progress

  1. Initial Sharp Drop: Rapid adaptation to new data distribution
  2. Gradual Stabilization: Learning rate slows as model fine-tunes
  3. Convergence: Loss values stabilize, indicating training completion

Healthy Training Pattern: The key indicator of successful training is a small gap between training and validation loss, suggesting the model is learning generalizable patterns rather than memorizing specific examples.

Warning Signs to Watch For

Several patterns in the loss curves can indicate potential issues:

Overfitting Pattern

SFT Overfitting Pattern

If validation loss increases while training loss continues to decrease, your model is overfitting. Consider:

  • Reducing training steps or epochs
  • Increasing dataset size or diversity
  • Adding regularization techniques
  • Using early stopping based on validation loss

Underfitting Pattern

SFT Underfitting Pattern

If loss doesn’t show significant improvement, the model might be:

  • Learning too slowly (try increasing learning rate)
  • Struggling with task complexity (check data quality)
  • Hitting architectural limitations (consider different model size)

Potential Memorization

SFT Memorization Pattern

Extremely low loss values could suggest memorization rather than learning. This is concerning if:

  • Model performs poorly on new, similar examples
  • Outputs lack diversity or creativity
  • Responses are too similar to training examples

Learn more about loss interpretation in the Hugging Face LLM Course.

Experiment Tracking with Trackio: For comprehensive experiment tracking, we recommend Trackio - a lightweight, free experiment tracking library built on Hugging Face infrastructure. Trackio provides:

  • Drop-in replacement: API compatible with wandb.init, wandb.log, and wandb.finish.
  • Local-first design: Dashboard runs locally by default, with optional Hugging Face Spaces hosting.
  • Free hosting: Everything, including hosting on Hugging Face Spaces, is free.
  • Lightweight: Fewer than 3,000 lines of Python code, easily extensible.

We can track any metrics during training, for example:

# Simple Trackio integration
import trackio

# Initialize tracking
trackio.init(project="smollm3-sft")

# Log metrics during training
trackio.log({"train_loss": 0.5, "learning_rate": 5e-5})

# Finish tracking
trackio.finish()

The most convenient way to track your training is to use trackio’s transformers integration. You can specify your Trackio project name and space ID using environment variables:

export TRACKIO_PROJECT_NAME="my-project"
export TRACKIO_SPACE_ID="username/space_id"

Or you can set them in your code:

import os

os.environ["TRACKIO_PROJECT_NAME"] = "my-project"
os.environ["TRACKIO_SPACE_ID"] = "username/space_id"

Then you can use the SFTTrainer class from TRL and let it handle the tracking for you.

from trl import SFTTrainer

# `model`, `dataset`, and `config` are defined as in the surrounding examples;
# with trackio installed and reporting enabled in the config (report_to="trackio"),
# the trainer logs metrics automatically.
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset["train"],
    args=config,
)

Trackio will serve an application with the metrics from training that looks like this:

Logged metrics

While training and evaluating, we record the following metrics:

  • global_step: The total number of optimizer steps taken so far.
  • epoch: The current epoch number, based on dataset iteration.
  • num_tokens: The total number of tokens processed so far.
  • loss: The average cross-entropy loss computed over non-masked tokens in the current logging interval.
  • entropy: The average entropy of the model’s predicted token distribution over non-masked tokens.
  • mean_token_accuracy: The proportion of non-masked tokens for which the model’s top-1 prediction matches the ground truth token.
  • learning_rate: The current learning rate, which may change dynamically if a scheduler is used.
  • grad_norm: The L2 norm of the gradients, computed before gradient clipping.

Expected dataset type and format

SFT supports both language modeling and prompt-completion datasets. The SFTTrainer is compatible with both standard and conversational dataset formats. When provided with a conversational dataset, the trainer will automatically apply the chat template to the dataset.

# Standard language modeling
{"text": "The sky is blue."}

# Conversational language modeling
{"messages": [{"role": "user", "content": "What color is the sky?"},
              {"role": "assistant", "content": "It is blue."}]}

# Standard prompt-completion
{"prompt": "The sky is",
 "completion": " blue."}

# Conversational prompt-completion
{"prompt": [{"role": "user", "content": "What color is the sky?"}],
 "completion": [{"role": "assistant", "content": "It is blue."}]}

If your dataset is not in one of these formats, you can preprocess it to convert it into the expected format. Here is an example with the FreedomIntelligence/medical-o1-reasoning-SFT dataset:

from datasets import load_dataset

dataset = load_dataset("FreedomIntelligence/medical-o1-reasoning-SFT", "en")

def preprocess_function(example):
    return {
        "prompt": [{"role": "user", "content": example["Question"]}],
        "completion": [
            {"role": "assistant", "content": f"<think>{example['Complex_CoT']}</think>{example['Response']}"}
        ],
    }

dataset = dataset.map(preprocess_function, remove_columns=["Question", "Response", "Complex_CoT"])
print(next(iter(dataset["train"])))
{
    "prompt": [
        {
            "content": "Given the symptoms of sudden weakness in the left arm and leg, recent long-distance travel, and the presence of swollen and tender right lower leg, what specific cardiac abnormality is most likely to be found upon further evaluation that could explain these findings?",
            "role": "user",
        }
    ],
    "completion": [
        {
            "content": "<think>Okay, let's see what's going on here. We've got sudden weakness [...] clicks into place!</think>The specific cardiac abnormality most likely to be found in [...] the presence of a PFO facilitating a paradoxical embolism.",
            "role": "assistant",
        }
    ],
}

Chat Templates in Training

We’ll return briefly to chat templates in the context of training. Using chat templates correctly during training is crucial for model performance. Here are the key considerations and best practices:

Preprocessing and tokenization

During training, each example is expected to contain a text field or a (prompt, completion) pair, depending on the dataset format. For more details on the expected formats, see Dataset formats. The SFTTrainer tokenizes each input using the model’s tokenizer. If both prompt and completion are provided separately, they are concatenated before tokenization.
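
To see what the trainer actually feeds the model, you can render a conversation with the chat template yourself. This sketch assumes a tokenizer that ships a chat template (for example the instruct checkpoint HuggingFaceTB/SmolLM3-3B):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM3-3B")

messages = [
    {"role": "user", "content": "What color is the sky?"},
    {"role": "assistant", "content": "It is blue."},
]

# Render the conversation as a single string with the model's special tokens,
# which is what gets tokenized for conversational datasets.
text = tokenizer.apply_chat_template(messages, tokenize=False)
print(text)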

Computing the loss


The loss used in SFT is the token-level cross-entropy loss, defined as:

$$\mathcal{L}_{\text{SFT}}(\theta) = - \sum_{t=1}^{T} \log p_\theta(y_t \mid y_{<t}),$$

where $y_t$ is the target token at timestep $t$, and the model is trained to predict the next token given the previous ones. In practice, padding tokens are masked out during loss computation.
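
As a toy illustration of this loss (not the trainer's exact implementation), a masked token-level cross-entropy can be computed like this, with masked positions labeled -100 so they are ignored:

import torch
import torch.nn.functional as F

# Toy example: logits of shape (batch, seq_len, vocab) and labels of shape (batch, seq_len).
logits = torch.randn(2, 6, 32)
labels = torch.randint(0, 32, (2, 6))
labels[:, :3] = -100  # mask out prompt/padding positions

# Shift so that position t predicts token t+1, as in causal language modeling.
shift_logits = logits[:, :-1, :].contiguous()
shift_labels = labels[:, 1:].contiguous()

loss = F.cross_entropy(
    shift_logits.view(-1, shift_logits.size(-1)),
    shift_labels.view(-1),
    ignore_index=-100,  # masked tokens do not contribute to the loss
)
print(loss.item())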

Supervised Fine-Tuning with TRL (Transformer Reinforcement Learning)

TRL is the go-to toolkit for training language models, built specifically for instruction tuning and alignment. It’s what we’ll use throughout this course.

Why TRL?

  • Production ready: Used by major organizations and research labs.
  • Comprehensive: Supports SFT, DPO, ORPO, PPO, and more advanced techniques.
  • Efficient: Optimized for memory usage and training speed.
  • Flexible: Works with any Hugging Face model.
  • CLI support: Command-line tools for scalable training workflows.

Key Components

  • SFTTrainer: The core class for supervised fine-tuning
  • SFTConfig: Configuration management for training parameters
  • CLI Tools: Command-line interface for production workflows
  • Integration: Seamless integration with Hugging Face Hub, Trackio, Weights & Biases, and more

TRL’s Architecture

TRL is built on top of the Hugging Face ecosystem:

  • Transformers: Model loading and inference.
  • Datasets: Data processing and management.
  • Accelerate: Distributed training and optimization.
  • PEFT: Parameter-efficient fine-tuning (LoRA, QLoRA).

This integrated approach means you get all the benefits of the Hugging Face ecosystem while using state-of-the-art training techniques.

TRL versus other training libraries:

  • TRL: Specialized for LLM training, built for instruction tuning.
  • Transformers Trainer: General purpose, suitable for basic fine-tuning.
  • DeepSpeed: Focuses on large-scale distributed training.
  • Accelerate: Provides low-level distributed training primitives.

TRL provides the best balance of ease-of-use and advanced features for SFT. For more details on training approaches, see the Hugging Face LLM Course.

Hands-On: Your First SmolLM3 Fine-Tune

Ready to put theory into practice? Here’s a preview of what you’ll build in the exercises. You can use either the Python API or the CLI; the Python version is shown below, with a CLI sketch after it.

from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset
import trackio as wandb

# Initialize experiment tracking
wandb.init(project="smollm3-sft", name="my-first-sft-run")

# Load SmolLM3 base model
model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM3-3B-Base")
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM3-3B-Base")

# Load SmolTalk2 dataset
dataset = load_dataset("HuggingFaceTB/smoltalk2", "SFT")

# Configure training with Trackio integration
config = SFTConfig(
    output_dir="./smollm3-finetuned",
    per_device_train_batch_size=4,
    learning_rate=5e-5,
    max_steps=1000,
    report_to="trackio",  # Enable Trackio logging
)

# Train!
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset["train"],
    args=config,
)
trainer.train()
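
A roughly equivalent run from the command line uses the trl sft CLI; the flags mirror the SFTConfig fields. This is a sketch, so check trl sft --help for the options available in your installed version:

trl sft \
    --model_name_or_path HuggingFaceTB/SmolLM3-3B-Base \
    --dataset_name HuggingFaceTB/smoltalk2 \
    --dataset_config SFT \
    --output_dir ./smollm3-finetuned \
    --per_device_train_batch_size 4 \
    --learning_rate 5e-5 \
    --max_steps 1000 \
    --report_to trackio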

Serverless Training Options

While you can train models locally, cloud infrastructure offers significant advantages for SFT training. For users who want to skip the complexity of GPU setup and environment management, Hugging Face Jobs provides a seamless solution.

See Training with Hugging Face Jobs for fully managed cloud infrastructure with high-end GPUs, automatic scaling, and integrated monitoring.

Key Takeaways

  1. SFT is Essential: It’s the bridge between base models and instruction-following assistants
  2. Data Quality Matters: High-quality datasets lead to better fine-tuned models - invest time in curation
  3. Monitor Carefully: Watch both loss curves and actual outputs to catch issues early
  4. TRL Simplifies Everything: From research to production, TRL provides the tools you need
  5. SmolLM3 is Perfect for Learning: Powerful enough to be useful, small enough to be accessible
  6. Multiple Approaches: Both programmatic and CLI workflows for different use cases

🎓 Continue Learning: This introduction covers the fundamentals, but SFT is a deep topic. For more advanced techniques, evaluation methods, and troubleshooting tips, explore the Hugging Face LLM Course which provides comprehensive coverage of modern LLM training techniques.

Next Steps

Now that you understand the theory, choose your training approach:

  • Training with Hugging Face Jobs - Use cloud infrastructure for training
  • Hands-On Exercises - Fine-tune your own SmolLM3 model locally or in the cloud
