Navigating the RLHF Landscape: From Policy Gradients to PPO, GAE, and DPO for LLM Alignment

Community Article · Published February 11, 2025

Welcome to this blog post! If you’re keen on exploring Reinforcement Learning from Human Feedback (RLHF) in large language models (LLMs) and want to understand the process from the ground up—from basic policy gradient methods and the classic REINFORCE algorithm, through the derivation of Proximal Policy Optimization (PPO) using clipping objectives and Generalized Advantage Estimation (GAE) to balance bias and variance, and finally to an in-depth discussion on offline methods like DPO—you’ve come to the right place.

In this post, we start with the fundamentals of policy gradient optimization, gradually build up through the well-established REINFORCE algorithm, and then explore how to implement PPO via clipping objectives. We will also examine how GAE helps strike the ideal balance between bias and variance. Later on, we’ll derive and discuss offline training approaches such as DPO, providing insight into the advantages and challenges of each training path.

1. On-Policy vs. Off-Policy Reinforcement Learning

Today, the mainstream RLHF approaches in LLMs can be broadly categorized into two camps:

  • On-Policy Methods (exemplified by PPO)
  • Off-Policy Methods (exemplified by DPO)

But what exactly distinguishes On-Policy from Off-Policy? Here’s a simple rule of thumb:

  • On-Policy: During training, the model actively generates its own data samples.
  • Off-Policy: The training relies on pre-collected data (or data produced by another policy), eliminating the need for real-time generation.

Generally speaking, On-Policy methods tend to be more computationally demanding and time-consuming—the main bottleneck being the generation step. For generative tasks, the model must output tokens one at a time, a process that is extremely resource-intensive. Despite this slower pace, On-Policy approaches offer a higher theoretical performance ceiling because they continuously explore and update based on the model’s current state. This advantage will become even clearer when we dive deeper into PPO later.

Let’s first explore the On-Policy approach.

The Essence of On-Policy Learning

At its core, On-Policy learning involves having the model generate its own responses and then scoring those responses to guide subsequent parameter updates. In short, the key idea is to have the model “get in the game” itself.

Imagine you are a model learning to play chess. There are two possible training scenarios:

  1. Method One: You actively play chess, with a coach providing real-time feedback on every move. When you capture an opponent’s piece, your coach cheers you on; if you make an impulsive mistake that leads to a counterattack, your coach immediately advises you on how to improve.
  2. Method Two: Instead of playing, you’re given a collection of recordings—some from professional matches and others from subpar games—annotated to show which moves are effective and which are not. You passively learn by mimicking the good moves and avoiding the bad ones.

The fundamental difference between these two methods is whether you’re actually “playing” the game. Method One represents On-Policy training, where the model generates its own actions and learns from them; Method Two embodies Off-Policy training, where learning is solely based on existing data.

Off-Policy methods are generally faster during training because they rely on readily available data without requiring the model to generate new samples on the fly. However, they are highly sensitive to how well the pre-collected data aligns with the model’s current capabilities. If there’s a significant mismatch—whether the data is too challenging or too trivial—the learning process can suffer. On the other hand, On-Policy training sidesteps this issue since the training samples always reflect the model’s current performance level.

Key Components in an On-Policy Framework

In the context of language models, a typical On-Policy algorithm comprises the following components:

  • Actor: The model that generates sentences (analogous to you playing chess).
  • Critic: Acting like a coach, it provides immediate feedback for each generated output and is updated alongside the Actor as the model’s abilities evolve.
  • Reward Model: Serving as the referee, it assigns final scores or preference evaluations. This component usually remains fixed during training.
  • Reference Model: A unique element in PPO for large models, it prevents the Actor from straying too far from the original pre-trained distribution, thereby mitigating issues such as reward hacking.

Given that each of these components can be extremely large (often necessitating the simultaneous loading of multiple 70B-parameter models), On-Policy training typically demands enormous computational resources. This is why PPO is often described as being “computationally expensive.”

Next, we will focus on PPO—the most representative On-Policy method—to see how it practically balances training cost with learning efficiency.

2. PPO (Proximal Policy Optimization)

2.1 Starting with Policy Gradient Optimization

Imagine you’re a novice chess player just beginning to learn the game. Your goal is to improve your chances of winning by continuously refining your chess strategy (denoted as \pi_{\theta}, where \theta represents your strategy parameters). Think of each game as a trajectory \tau that you want to optimize in order to secure a higher reward.

More generally, the objective in reinforcement learning is to optimize a policy so as to maximize the expected return:

\pi^* = \arg \max_\pi J(\pi)

Formally, the return of a policy is defined over all possible trajectories:

J(\pi_{\theta}) = \int_\tau P(\tau \mid \pi) R(\tau) = \mathbb{E}_{\tau \sim \pi} [R(\tau)]

A trajectory is simply a sequence of states and corresponding actions:

\tau = (s_0, a_0, s_1, a_1, \dots)

In our chess analogy, the state s_t represents the current board configuration, while the action a_t indicates where you decide to move next. The next state is governed by a probability distribution—reflecting, for instance, your opponent’s response:

s_{t+1} \sim P(\cdot \mid s_t, a_t)

Thus, the probability of a trajectory τ\tau is given by:

P(\tau \mid \pi) = \rho_0(s_0) \prod_{t=0}^{T-1} P(s_{t+1} \mid s_t, a_t) \pi(a_t \mid s_t)

In reinforcement learning, we frequently discount future rewards—future returns are never as valuable as immediate ones. Therefore, the total return for a trajectory is defined as:

R(\tau) = \sum_{t=0}^\infty \gamma^t r_t

where \gamma \in [0, 1] is the discount factor and r_t is the reward received at time t.
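As a quick worked example, with \gamma = 0.9 and a short game yielding rewards r_0 = 1, r_1 = 0, r_2 = 2, the discounted return is:

R(\tau) = 1 + 0.9 \cdot 0 + 0.9^2 \cdot 2 = 2.62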

In deep learning we typically update parameters by minimizing a loss function using stochastic gradient descent. However, since our goal here is to maximize the return, we use stochastic gradient ascent to update the policy:

\theta_{k+1} = \theta_k + \alpha \nabla_{\theta} J(\pi_{\theta}) \big|_{\theta_k}

Here, \nabla_{\theta} J(\pi_{\theta}) is known as the policy gradient. In other words, after each game you review your moves to assess how each one contributed to the final outcome, and then adjust your strategy accordingly. This overall approach is known as a policy gradient algorithm.

However, just as in chess where you must consider all possible moves and board states, computing the exact gradient requires summing or integrating over all possible trajectories. In practice (unless the game is extremely simple), this is computationally infeasible—even if R(\tau) is differentiable—because the long trajectory lengths make auto-differentiation very memory intensive. Therefore, we need to carefully derive a method for computing the policy gradient.

Derivation of the Policy Gradient

To derive a practical expression for the policy gradient—much like reviewing a game move by move—we start with the gradient of the objective function. Viewing each game as a trajectory \tau, the gradient of the objective is:

\nabla_{\theta}J(\pi_{\theta}) = \nabla_{\theta} \operatorname{E}_{\tau \sim \pi_{\theta}} [R(\tau)]

Step 1: Expand the Expectation

This step is equivalent to considering all possible games by expanding the expectation into an integral over all trajectories:

= \nabla_{\theta} \int_{\tau} P(\tau|\theta) R(\tau) \, d\tau

Step 2: Interchange the Gradient and the Integral

Similar to decomposing the influence of each move, we bring the gradient operator inside the integral:

= \int_{\tau} \nabla_{\theta} P(\tau|\theta) R(\tau) \, d\tau

Step 3: Apply the Log-Derivative Trick

By using a mathematical trick known as the log-derivative (or likelihood ratio) trick—much like breaking down the significance of each move—we have:

= \int_{\tau} P(\tau|\theta) \nabla_{\theta} \log P(\tau|\theta) \cdot R(\tau) \, d\tau
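For completeness, the identity being applied here is just the chain rule for the logarithm:

\nabla_{\theta} P(\tau|\theta) = P(\tau|\theta) \, \nabla_{\theta} \log P(\tau|\theta)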

Step 4: Return to Expectation Form

Finally, we can rewrite the integral back into the expectation form:

= \operatorname{E}_{\tau \sim \pi_{\theta}} \left[ \nabla_{\theta} \log P(\tau|\theta) \cdot R(\tau) \right]

Decomposing \nabla_{\theta} \log P(\tau|\theta)

During a game, every move is determined by your decision at that time. If we represent a game trajectory \tau as:

P(\tau|\theta) = \rho_0(s_0) \prod_{i=0}^{T-1} P(s_{i+1}|s_i, a_i) \pi_{\theta}(a_i|s_i)

then \pi_{\theta}(a_i|s_i) is the probability of selecting a particular move a_i given the board state s_i. Taking the logarithm and then the gradient, we obtain:

\nabla_{\theta} \log P(\tau|\theta) = \sum_{i=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_i|s_i)

(Note: The opponent’s moves P(s_{i+1}|s_i, a_i) are determined by fixed rules and do not depend on \theta, so their gradients vanish.)

2.2 Final Policy Gradient Formula

Substituting the above result back into our expectation, we arrive at the final policy gradient formula:

\nabla_{\theta} J(\pi_{\theta}) = \operatorname{E}_{\tau \sim \pi_{\theta}} \left[ \sum_{i=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_i|s_i) \cdot R(\tau) \right]

In this formula, the decision made at each move (captured by \log \pi_{\theta}) influences the overall outcome of the game, independent of the fixed rules governing the opponent’s responses. In practice, we approximate this expectation using Monte Carlo sampling—akin to improving your chess skills through repeated play. The sample-based policy gradient can be approximated as:

\hat{g} = \frac{1}{|\mathcal{D}|} \sum_{\tau \in \mathcal{D}} \sum_{t=0}^T \nabla_{\theta} \log \pi_\theta (a_t \mid s_t) R(\tau)

If you look closely, you’ll notice that R(\tau) appears directly within the gradient calculation of the policy parameters.

2.3 The REINFORCE Algorithm: Process and Implementation Steps

Let’s now introduce the classic policy gradient method—the REINFORCE algorithm—which is similar to playing games, reviewing your performance, and continuously refining your strategy:

  1. Construct the Policy Network
    Build a neural network to define your chess strategy \pi_{\theta}:

    • Input: The current board state s_t
    • Output: A probability distribution over the next moves P(a_t \mid s_t)
  2. Trajectory Sampling
    Play games using the current policy to sample trajectories \tau and record the reward received at each move (for example, a reward after winning a game).

    • You can either set a fixed number of moves per game (say, 100 moves) or play until the game ends.
  3. Gradient Computation
    Compute the gradient estimate from the collected dataset \mathcal{D} of games, much like assessing the contribution of each move in your review:

\hat{g} = \frac{1}{|\mathcal{D}|} \sum_{\tau \in \mathcal{D}} \sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t|s_t) \, R(\tau)

  4. Parameter Update
    Update your policy parameters using stochastic gradient ascent—just as you would adjust your playing style based on your game reviews:

\theta_{k+1} = \theta_k + \alpha \hat{g}

or equivalently:

\theta_{k+1} = \theta_k + \alpha \nabla_\theta J(\pi_\theta) \big|_{\theta_k}

  5. Iterative Optimization
    Repeat the cycle of “playing – reviewing – adjusting” until your policy converges and you can consistently perform at a high level.

Explanation of Core Formulas

Gradient Estimate Formula

\hat{g} = \frac{1}{|\mathcal{D}|} \sum_{\tau \in \mathcal{D}} \sum_{t=0}^T \underbrace{\nabla_\theta \log \pi_\theta(a_t|s_t)}_{\text{Gradient of decision at each move}} \cdot \underbrace{R(\tau)}_{\text{Total reward for the game}}

  • Here, we use extensive Monte Carlo sampling to approximate the full expectation.
  • The total reward R(\tau) represents the outcome of the entire game, serving as a measure of the cumulative impact of your decisions.

Parameter Update Rule

\theta_{k+1} = \theta_k + \alpha \hat{g}

  • The term \alpha is the learning rate, analogous to how much you adjust your strategy after reviewing a game; the gradient points in the direction that increases your winning probability.

Algorithm Characteristics

  • Key Advantage: This method relies solely on your actual gameplay experience and does not require any prior knowledge of the opponent’s strategy (model-free).
  • Computational Requirements: It necessitates a large number of game samples to mitigate the high variance inherent in the gradient estimates.
  • Possible Improvements: Later approaches (such as Actor-Critic methods) introduce a value function baseline to stabilize policy updates—similar to receiving professional coaching feedback during your game reviews to accelerate improvement.
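To make the procedure above concrete, here is a minimal REINFORCE sketch in PyTorch. It is illustrative only: PolicyNet is a hypothetical small network, and trajectories are assumed to be lists of (states, actions, rewards) collected with the current policy.

import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Hypothetical policy network: maps a state vector to action logits."""
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))

    def forward(self, state):
        return self.net(state)

def reinforce_update(policy, optimizer, trajectories, gamma=0.99):
    """One REINFORCE step: ascend sum_t log pi(a_t|s_t) * R(tau) over the sampled trajectories."""
    loss = 0.0
    for states, actions, rewards in trajectories:
        # Total discounted return of the whole trajectory, R(tau)
        R = sum(gamma**t * r for t, r in enumerate(rewards))
        logits = policy(torch.stack(states))                  # (T, n_actions)
        log_probs = torch.log_softmax(logits, dim=-1)
        chosen = log_probs[torch.arange(len(actions)), torch.tensor(actions)]
        loss = loss - chosen.sum() * R                        # negative sign: gradient ascent via a minimized loss
    loss = loss / len(trajectories)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()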

2.4 Challenges in Policy Gradient Optimization

A central assumption in policy gradient optimization is that we can reliably estimate the policy gradient using our chosen method. However, when the problem scales up—for instance, when each trajectory \tau becomes very long or the policy model is extremely large—you must sample many trajectories to obtain an accurate gradient estimate; otherwise, you face high variance. Although the gradient estimator in policy gradient algorithms is theoretically unbiased (its expected value equals the true gradient), its variance can be extremely high. Recall the gradient estimate:

\hat{g} = \frac{1}{|\mathcal{D}|} \sum_{\tau \in \mathcal{D}} \sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, R(\tau)

where:

  • |\mathcal{D}| denotes the size of the dataset \mathcal{D},
  • \pi_\theta is the current policy (your chess strategy),
  • R(\tau) is the total return of the game (trajectory \tau),
  • a_t, s_t represent the action and state at time t, respectively.

Imagine playing chess and trying to attribute the overall outcome to each move. If you attempt to credit every move with the entire game’s result, the evaluation becomes extremely unstable—that is, it exhibits high variance. Next, we explore methods to reduce this variance in our estimates.

2.5 Reducing Variance: Focusing Only on the Future

Notice in the gradient estimate above that, regardless of the current step t, R(\tau) always includes rewards from the entire trajectory. This isn’t entirely reasonable since a decision should only consider its impact on future outcomes—the past cannot be changed and should not influence the evaluation of the current action.

Returning to our chess example: if every move’s score also factors in earlier moves (good or bad), it obscures the true value of the current decision. In practice, when evaluating the current move, you only need to consider the “future rewards”—the rewards accrued from this move until the game’s end. This concept is known as rewards-to-go.

Mathematically, we adjust our gradient estimate as follows:

\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \left( \sum_{t'=t}^T r(s_{i,t'}, a_{i,t'}) \right)

Here, \sum_{t'=t}^T r(s_{i,t'}, a_{i,t'}) represents the total rewards from the current move to the end of the game. This is analogous to reviewing a game by focusing only on the impact of moves from a certain point forward, ignoring what has already transpired.

By eliminating these redundant past rewards, the variance in our gradient estimates naturally decreases.
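The idea is easy to express in code. Below is a small helper (plain Python, undiscounted for simplicity) that keeps, for each step, only the rewards from that step onward:

def rewards_to_go(rewards):
    """rewards_to_go([1, 0, 2]) -> [3.0, 2.0, 2.0]: each entry sums the rewards from step t to the end."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        rtg[t] = running
    return rtg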

2.6 Reducing Variance: Introducing a Baseline

To further mitigate fluctuations in our evaluation, we can subtract a baseline value from the future rewards at each step. Mathematically, this baseline is often denoted as b (in practice, we commonly use the value function V^\pi(s) as the baseline). The modified gradient estimate becomes:

\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \left( \sum_{t'=t}^T r(s_{i,t'}, a_{i,t'}) - b \right)

Here, b serves as a baseline—typically a function of the state s_t—representing the expected return from the current state. The term \sum_{t'=t}^T r(s_{i,t'}, a_{i,t'}) - b then reflects the advantage, or the amount by which the actual return exceeds this baseline. In practice, we use this advantage instead of the raw reward in our gradient estimates to reduce variance.

In the context of aligning large language models, an extra linear layer is typically added on top of the language model (i.e., the policy \pi_\theta) to estimate the expected return V^\pi(s) for a given state. This functions as a benchmark for each state, helping us gauge the true advantage of a decision. For a more intuitive explanation of why a baseline is necessary, see my previous blog post.
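As a rough sketch of that setup (names and shapes are illustrative, assuming a causal LM that exposes per-token hidden states), the value head is just a linear layer mapping each token's hidden state to a scalar value estimate:

import torch
import torch.nn as nn

class ValueHead(nn.Module):
    """Maps per-token hidden states of shape (batch, seq_len, hidden) to values of shape (batch, seq_len)."""
    def __init__(self, hidden_size):
        super().__init__()
        self.v_head = nn.Linear(hidden_size, 1)

    def forward(self, hidden_states):
        # One V(s_t) estimate per token position
        return self.v_head(hidden_states).squeeze(-1)

# Usage sketch: hidden_states would come from the policy LM's last layer.
values = ValueHead(hidden_size=4096)(torch.randn(2, 8, 4096))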

2.7 Reducing Variance: Introducing Q and V

Earlier we discussed the concept of rewards-to-go, represented by \sum_{t'=t}^T r(s_{i,t'}, a_{i,t'}). In reinforcement learning, this term is known as the Q-function Q^\pi(s, a), which captures the total future return when taking action a in state s. By subtracting the state value V^\pi(s), we obtain the advantage function:

A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)

Using our chess analogy, the Q-function quantifies the potential outcome of the game after making a move in state s, while the state value represents the inherent strength of the board position. Even if the board is favorable, a poor move can reduce your relative advantage. Therefore, it’s important not only to consider the absolute score of a move but also to compare it against the baseline. A positive A^\pi(s, a) indicates that the move significantly improves your position, whereas a negative value suggests it is suboptimal.

Ultimately, we can express the policy gradient as:

\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \, A^\pi(s_{i,t}, a_{i,t})

This formula encapsulates how the relative performance of each move (compared to the average) guides the adjustment of your strategy.

2.8 Explaining the Advantage Function

Simply put, the advantage function A^\pi(s, a) tells you how much more likely a specific move a in a given state s is to improve your chances of winning compared to an average move. If a move yields an expected return significantly above the baseline, its advantage is positive—signaling it as a promising move; if not, it indicates underperformance relative to the norm.

In summary, by focusing solely on future rewards, introducing a baseline, and utilizing the advantage function, we can effectively reduce the variance in our gradient estimates. This approach is akin to reviewing a game by concentrating on the pivotal moves that truly alter the outcome, thereby making the policy updates more stable and targeted.

2.9 Estimating the Advantage Term – Using the Iterative GAE Strategy

There are several ways to estimate the advantage term. For example:

\hat{A}^\pi(s_t, a_t) = \big[ r(s_t, a_t) + \gamma V^\pi(s_{t+1}) \big] - V^\pi(s_t)

\hat{A}^\pi(s_t, a_t) = \big[ r(s_t, a_t) + \gamma r(s_{t+1}, a_{t+1}) + \gamma^2 V^\pi(s_{t+2}) \big] - V^\pi(s_t)

\hat{A}^\pi(s_t, a_t) = \big[ r(s_t, a_t) + \gamma r(s_{t+1}, a_{t+1}) + \gamma^2 r(s_{t+2}, a_{t+2}) + \gamma^3 V^\pi(s_{t+3}) \big] - V^\pi(s_t)

These examples illustrate that by summing over multiple steps, we can balance bias and variance:

  • Truncating early (bootstrapping from the value function after only a few actual rewards) introduces high bias, because the estimate leans mostly on the approximate value function rather than on observed returns.
  • Accumulating many actual rewards reduces that bias but increases variance, because each additional sampled reward injects more randomness into the estimate.

To strike a balance in this bias-variance trade-off, we employ a weighted sum of these terms, known as Generalized Advantage Estimation (GAE):

\delta_t = r_t + \gamma V^\pi(s_{t+1}) - V^\pi(s_t)

\hat{A}_t = \delta_t + \gamma \lambda \hat{A}_{t+1}

  • This is a recursive formula. The advantage estimate at the final time step can be viewed as the first expansion, with each preceding step adding another layer weighted by the decay factor \lambda.
  • By iteratively accumulating these terms across time steps, we balance the high variance of actual rewards with the high bias introduced by relying solely on the value function.

In the context of aligning large language models, this approach guides the policy (i.e., the language model) to increase the probability of selecting the next token that, on average, yields a reward above the baseline given a specific prompt (state). In other words, the model will tend to choose tokens that are more likely to lead to sequences meeting our desired reward criteria—resulting in outputs that are better aligned with our training data distribution.
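A compact implementation of this recursion (a sketch: it assumes per-step rewards and value estimates for a finite trajectory, with one extra bootstrap value appended for the state after the last step):

def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation.
    rewards: list of r_t for t = 0..T-1
    values:  list of V(s_t) for t = 0..T (one extra bootstrap value at the end)
    Returns advantages A_t via the recursion A_t = delta_t + gamma * lam * A_{t+1}."""
    T = len(rewards)
    advantages = [0.0] * T
    next_advantage = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        next_advantage = delta + gamma * lam * next_advantage
        advantages[t] = next_advantage
    return advantages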

2.10 The PPO Loss

In Proximal Policy Optimization (PPO), to prevent the policy from changing too drastically during updates, we design a specialized loss function composed of several key components:

Policy Loss

L_{\text{POLICY}} = \min \Bigg( \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\text{old}}(a_t \mid s_t)} \hat{A}_t, \; \text{clip}\Bigg(\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\text{old}}(a_t \mid s_t)},\, 1 - \epsilon,\, 1 + \epsilon\Bigg) \, \hat{A}_t \Bigg)

This component is akin to not overhauling your strategy all at once when playing chess. Instead, you prefer to fine-tune your moves gradually so that while you improve your position, you avoid taking reckless steps that could disrupt the overall structure.

Value Function Loss

L_{\text{VF}} = \frac{1}{2} \left\lVert V_\theta(s) - \left( \sum_{t=0}^T \gamma^t r_t \;\middle|\; s_0 = s \right) \right\rVert_2^2

This term ensures that for every state, your estimated expected return (comparable to predicting how a game might unfold) is as close as possible to the actual returns received.

Entropy Loss

L_{\text{ENTROPY}} = - \sum_x p(x) \log p(x)

The entropy loss encourages the policy to maintain a level of exploration. Much like a skilled chess player who not only masters known patterns but is also willing to experiment with new ideas to remain adaptable, this term prevents the policy from becoming too deterministic.

Overall PPO Loss

L_{\text{PPO}} = L_{\text{POLICY}} + c_1 L_{\text{VF}} + c_2 L_{\text{ENTROPY}}

By combining these components, we obtain the total PPO loss. This loss function is designed to improve the win rate (or reward) when updating the policy while ensuring that the policy does not deviate too drastically from its original behavior—thus promoting stable and efficient learning.
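A minimal sketch of how these terms are commonly combined in PyTorch (tensor names are illustrative; the sign conventions here treat the clipped surrogate and the entropy as quantities to maximize, so they enter the minimized loss with minus signs):

import torch

def ppo_loss(logp_new, logp_old, advantages, values_pred, returns,
             clip_eps=0.2, c1=0.5, c2=0.01):
    """All inputs are 1-D tensors aligned per sampled action/token."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()       # maximize the clipped surrogate
    value_loss = 0.5 * (values_pred - returns).pow(2).mean()  # fit V_theta to the observed returns
    entropy = -logp_new.mean()                                # Monte Carlo estimate of the policy entropy
    return policy_loss + c1 * value_loss - c2 * entropy       # entropy acts as a bonus, hence the minus sign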

2.11 Advantages of Using PPO

  • Stability: The clipping operation ensures that policy updates are modest—just as you wouldn’t drastically change your playing style mid-game, each move remains consistent and measured.
  • Sample Efficiency: PPO makes effective use of the game data collected, although in large-scale models, a substantial number of samples is still required.
  • Intrinsic Safety: With clipping updates and KL-penalties relative to the reference policy, PPO effectively prevents drastic deviations during training, ensuring that generated outputs remain consistent with the pre-trained style.

Overall, just as an experienced chess player continuously improves by playing, reviewing, and fine-tuning their strategy, PPO achieves robust and efficient reinforcement learning through precise gradient updates coupled with restrictions on policy changes.

3. Understanding and Deriving GAE (Generalized Advantage Estimation)

Imagine participating in a high-stakes chess tournament. Every move not only affects the current board state but can also have long-lasting consequences for the rest of the game. To determine how much advantage a particular move provides, you must consider both the immediate score and its potential impact on future positions. Generalized Advantage Estimation (GAE) is designed to address exactly this problem—it helps evaluate how much better a certain action in a given state is compared to the average, much like reviewing a game to see whether a move significantly increased your chances of winning.

GAE innovatively combines the ideas of multi-step estimation and temporal difference (TD) learning. By using a series of past errors to predict future returns, it strikes an optimal balance between bias and variance. When used in conjunction with PPO, it’s like a chess player who not only has a robust strategic framework (PPO) but also leverages precise position evaluations (GAE) to continually optimize every move, resulting in more stable and effective overall performance.

3.1 Introducing the Concept of Residuals

When playing chess, you can rarely gauge the full value of a move immediately; instead, you estimate its effect based on your predictions of subsequent moves. The difference between this prediction and the actual outcome is known as the temporal difference (TD) residual.

Why Do We Need Residuals?

Imagine you’re at a critical juncture in a game. Based on experience, you believe that moving left might be advantageous, but after making the move, the outcome isn’t as favorable as anticipated. The discrepancy between your initial prediction and the actual result is the residual—it helps you correct your future estimates to make better decisions later on.

Mathematical Expression of the Residual

For policy gradient methods, the policy gradient can be written in terms of the advantage as:

\nabla R_{\theta} = \mathbb{E}_{(a_t, s_t)\sim \pi_\theta} \left [A^{\theta}(a_t, s_t) \nabla \log P_{\theta} (a_t \mid s_t) \right]

Here, A^{\theta}(a_t, s_t) represents the advantage (the future return after baseline correction) of taking action a_t in state s_t, which we denote as A(t). If we simply define:

A(t) = r_t - V(s_t)

then, to more accurately reflect the full impact of the current move on future outcomes, we introduce the TD residual. Specifically, for taking action a_t in state s_t, arriving at state s_{t+1} with an immediate reward r_t, we define the TD residual as:

\delta_t = r_t - \big(V(s_t) - \gamma V(s_{t+1})\big)

  • r_t is like the immediate score you receive after making the move.
  • V(s_t) and V(s_{t+1}) are your predicted values for the current and next board positions—similar to a coach’s evaluation.
  • \gamma is the discount factor, indicating that future rewards are gradually de-emphasized.

Intuitive Explanation of the Residual

Suppose you’re playing chess and your current board state is s_t. You choose a move a_t that leads to a new board state s_{t+1} and you receive an immediate reward r_t. Your goal is to evaluate the overall impact of that move:

  • Your estimate of the current state’s value, V(s_t), represents the expected return if you continue playing according to your current policy.
  • Your estimate for the next state, V(s_{t+1}), reflects the expected return from that point onward.

Ideally, you would have:

V(s_{t}) = r_t + \gamma V(s_{t+1})

That is, the value of the current state should equal the immediate reward plus the discounted future value. In reality, however, V(s_{t}) may deviate from this ideal, and the TD residual \delta_t quantifies that error. It tells you how much the actual outcome deviates from your expectation, much like realizing that a particular move had a better or worse impact than you had predicted, thereby providing a basis for adjusting your strategy.

To obtain a more accurate evaluation—just as you would review not just the immediate move but several subsequent moves—we define A^k(t) as the cumulative advantage from time t over the next k moves:

\begin{align} A^1(t) & = \delta_t \\ A^2(t) & = \delta_t + \gamma \delta_{t+1} \\ A^3(t) & = \delta_t + \gamma \delta_{t+1} + \gamma^2 \delta_{t+2} \\ A^k(t) & = \sum_{i=0}^{k-1} \gamma^{i} \delta_{t+i} \end{align}

When k tends to infinity:

A^k(t) = \sum_{i=0}^{\infty} \gamma^{i} \delta_{t+i}

Based on the definition of the residual, this expands to:

A^k(t) = r_t + \gamma r_{t+1} + \dots + \gamma^{k-1} r_{t+k-1} + \gamma^k V(s_{t+k}) - V(s_t)

This is analogous to considering all the scores from the current move until the end of the game. Clearly, the more moves you include (larger k), the smaller your bias becomes, but the variance may increase due to more randomness; conversely, focusing only on the immediate moves introduces higher bias but reduces variance.

3.2 The Bias-Variance Tradeoff

In practice, rather than summing rewards over a fixed number of steps kk, we employ an exponentially weighted approach—this is the core idea behind GAE.

First, we define the TD residual as:

\delta_t = r_t + \gamma V^\pi(s_{t+1}) - V^\pi(s_t)

Then, we accumulate the advantage using the recursive formula:

\hat{A}_t = \delta_t + \gamma \lambda \hat{A}_{t+1}

Here, \lambda \in [0,1] determines how much weight you give to future moves:

  • When \lambda=1, you consider the full effect of the entire game (minimal bias but maximal variance).
  • When \lambda=0, you rely solely on the immediate information (maximal bias but minimal variance).

Alternatively, we can represent the multi-step advantage as a weighted sum. Define:

A^{\text{GAE1}}(t) = A_t^1 + \lambda A_t^2 + \lambda^2 A_t^3 + \dots

Expanding this, we have:

\begin{align} A^{\text{GAE1}}(t) & = \delta_t + \lambda(\delta_t + \gamma \delta_{t+1}) + \lambda^2 (\delta_t + \gamma \delta_{t+1} + \gamma^2 \delta_{t+2}) + \dots \\ & = \delta_t (1+\lambda+\lambda^2+\dots) + \gamma \delta_{t+1} (\lambda+\lambda^2+\dots) + \gamma^2 \delta_{t+2} (\lambda^2+\lambda^3+\dots) \end{align}

Assuming we sum only the first k terms, using the formula for a geometric series we get:

A^{\text{GAE1}}(t) = \delta_t \frac{1 - \lambda^k}{1 - \lambda} + \gamma \delta_{t+1} \frac{\lambda(1 - \lambda^k)}{1 - \lambda} + \gamma^2 \delta_{t+2} \frac{\lambda^2(1 - \lambda^k)}{1 - \lambda} + \dots

As k \rightarrow \infty:

A^{\text{GAE1}}(t) = \delta_t \frac{1}{1 - \lambda} + \gamma \delta_{t+1} \frac{\lambda}{1 - \lambda} + \gamma^2 \delta_{t+2} \frac{\lambda^2}{1 - \lambda} + \dots

Multiplying both sides by 1-\lambda yields:

(1-\lambda) A^{\text{GAE1}}(t) = \delta_t + \gamma\lambda \delta_{t+1} + \gamma^2\lambda^2 \delta_{t+2} + \dots

If we define A^{\text{GAE}}(t) = (1-\lambda) A^{\text{GAE1}}(t), we finally obtain:

A^{\text{GAE}}(t) = \delta_t + \gamma\lambda \delta_{t+1} + \gamma^2\lambda^2 \delta_{t+2} + \dots = \sum_{k=0}^{\infty} (\lambda \gamma)^k \delta_{t+k}

This formulation clearly shows how adjusting \lambda balances the influence of short-term versus long-term rewards. Just as a chess player neither focuses solely on the next move nor attempts to predict every possible outcome, an appropriate choice of \lambda allows you to weight near-term and far-term effects optimally:

  • When \lambda=1, the advantage estimate includes information from the entire game (minimal bias but high variance):

\begin{align} A^{\text{GAE}}(t) & = \sum_{k=0}^{\infty} \gamma^k \delta_{t+k} \\ & = A^k(t) = r_t + \gamma r_{t+1} + \dots + \gamma^{k-1} r_{t+k-1} + \gamma^k V(s_{t+k}) - V(s_t) \end{align}

  • When \lambda=0, only the immediate information is considered (maximal bias but minimal variance):

A^{\text{GAE}}(t) = \delta_t = r_t - \big(V(s_t) - \gamma V(s_{t+1})\big)

In summary, \lambda is a crucial hyperparameter that balances bias and variance:

  • A larger \lambda means more future observations are taken into account, reducing bias but increasing variance.
  • A smaller \lambda relies more on the current estimate, leading to higher bias but lower variance.

This method is analogous to a chess player who not only considers the immediate effects of a move but also judiciously weighs the implications of the subsequent moves, thereby making the optimal decision.
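A tiny numeric check of the two limiting cases, reusing the compute_gae sketch from Section 2.9 (toy numbers, purely illustrative):

rewards = [1.0, 0.0, 2.0]
values  = [0.5, 0.4, 0.6, 0.0]   # V(s_0..s_2) plus a bootstrap value V(s_3) = 0

a_lam0 = compute_gae(rewards, values, gamma=0.9, lam=0.0)   # each A_t is just the one-step residual delta_t
a_lam1 = compute_gae(rewards, values, gamma=0.9, lam=1.0)   # each A_t is the full discounted return minus V(s_t)
print(a_lam0, a_lam1)
# up to floating-point rounding: [0.86, 0.14, 1.4] and [2.12, 1.4, 1.4]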

4. The Token-per-Token Process for Training LLMs with PPO

Below is a straightforward, token-by-token explanation of what happens when using PPO to train an LLM. This section clarifies the roles of the parameters (\theta_{\text{old}} and \theta) and walks through the process of the first update.

4.1 Aligning LLM with PPO

  • Old Policy Parameters (\theta_{\text{old}}): These are the model parameters used to generate data (for example, token sequences). Before each PPO update, you sample tokens using these parameters and record the probability of generating each token (typically stored as a log-probability).

  • Current Policy Parameters (\theta): These are the parameters being updated during training. Through the PPO algorithm, you adjust these parameters based on the sampled data so that the tokens generated better align with the reward signals. After the update, \theta will differ from \theta_{\text{old}}.

You can think of \theta_{\text{old}} as the "old version" of the model, while \theta is the "new version" that, after one training iteration, should perform better (or conform more closely to the reward signal). After each update, the new model becomes the "old model" for the next round of sampling.

4.2 The "Token-per-Token" Training Process

Assume you already have a pre-trained LLM with its parameters set as \theta_{\text{old}}. The following outlines the process step by step:

(1) Sampling Phase
  1. Providing a Prompt
    You input a prompt into the LLM (using \theta_{\text{old}}) to generate text.

  2. Generating Tokens One by One
    The model generates the next token (action) based on the current context (state). For instance, when generating the t-th token:

    • The current state s_t consists of the prompt along with the preceding t-1 tokens.
    • The model selects token a_t as its action and assigns it a probability \pi_{\theta_{\text{old}}}(a_t \mid s_t) (typically recorded as a log-probability).
  3. Recording Data
    For each token, you log:

    • The state s_t (the context)
    • The action a_t (the generated token)
    • The generation probability (or log-probability) under the old policy
    • Optionally, any reward information (such as a score from a reward model) and the estimated value V(s_t).
    • This collection forms a trajectory—a sequence of tokens along with their associated data.
(2) Computing the Advantage

In PPO, it is necessary to compute an advantage A_t for each token (using GAE, multi-step, or single-step methods).

  • For example, for the t-th token, you might compute:

A_t = \delta_t + \gamma \lambda\, \delta_{t+1} + \gamma^2 \lambda^2\, \delta_{t+2} + \cdots

where \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t) is the one-step temporal difference error.

  • In practice, because the trajectory is finite, the summation is computed only until the end of the sequence.
(3) Update Phase: From \theta_{\text{old}} to \theta

Once you have sampled a batch of token data, you update the model parameters using PPO. The token-per-token process unfolds as follows:

  1. Using the Old Policy as Reference
    You have recorded the log-probability of each token as generated by \theta_{\text{old}}, i.e., \log \pi_{\theta_{\text{old}}}(a_t \mid s_t).

  2. Recomputing Probabilities with the Current Policy
    With the current model parameters \theta (initially, \theta is identical to \theta_{\text{old}}, but it changes through successive gradient updates), you recalculate the log-probability for generating the same token a_t given the same state s_t, denoted by \log \pi_\theta(a_t \mid s_t).

  3. Calculating the Probability Ratio
    For each token, compute the ratio:

r_t(\theta) = \exp\Big( \log \pi_\theta(a_t \mid s_t) - \log \pi_{\theta_{\text{old}}}(a_t \mid s_t) \Big)

This ratio indicates the degree to which the new model’s likelihood of generating that token has changed relative to the old model.

  4. Constructing the PPO Loss (Per-Token Loss)
    The goal of PPO is to limit how much the new policy deviates from the old one. For each token, the loss is computed based on the advantage A_t and the probability ratio r_t(\theta):

L_t(\theta) = -\min\Big( r_t(\theta) A_t,\ \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \, A_t \Big)

In plain terms:

  • If the probability ratio r_t(\theta) falls within the acceptable range (i.e., between 1-\epsilon and 1+\epsilon), the unclipped term r_t(\theta) A_t is used.
  • Otherwise, if the ratio is too high or too low, the clipped value is used to prevent excessively large updates.
  5. Averaging Over All Tokens and Updating \theta
    The losses L_t(\theta) for all tokens are averaged to form the overall batch loss. You then update the model parameters \theta using gradient descent (or another optimizer).

    • At this point, \theta starts to diverge from \theta_{\text{old}}, meaning the new model has been "improved" over the old one.
  6. Updating \theta_{\text{old}}
    After completing a full PPO update (usually over several epochs on the same batch of data), you set \theta_{\text{old}} to the current \theta to prepare for the next round of sampling.

4.3 Pseudocode

Below is a pseudocode-style algorithm that outlines the token-per-token PPO update process:

# Initialization: Set the pre-trained LLM parameters as θ_old, and copy them to θ
θ_old = PretrainedLLM.parameters
θ = copy(θ_old)

# Sampling Phase: Generate a batch of data using θ_old
for each prompt in dataset:
    trajectory = []
    state = prompt
    while not end_of_sequence:
        token, logpi_old = θ_old.generate_token(state)
        # Record the current state, token, and the log-probability under θ_old
        # (rewards are typically filled in afterwards, e.g. by a reward model scoring the full response)
        trajectory.append( (state, token, logpi_old, reward, V(state)) )
        state = state + token  # Update the state (append token)
    store trajectory

# Compute Advantages (e.g., using GAE)
for each trajectory of length T:
    A[T] = 0          # no advantage beyond the final token
    V(state[T]) = 0   # bootstrap value after the final token
    for t from T-1 downto 0:
        δ_t = reward[t] + γ * V(state[t+1]) - V(state[t])
        A[t] = δ_t + γ * λ * A[t+1]  # Recursively compute the advantage

# PPO Update Phase: Multiple epochs on the same batch of sampled data
for each PPO update epoch:
    token_losses = []
    for each token data (state s_t, token a_t, logpi_old, A_t) in batch:
        # 1. Compute the log-probability under the current policy
        logpi_current = θ.log_probability(s_t, a_t)
        # 2. Calculate the probability ratio
        r_t = exp( logpi_current - logpi_old )
        # 3. Compute the unclipped and clipped objectives
        loss_unclipped = r_t * A_t
        loss_clipped = clip(r_t, 1-ε, 1+ε) * A_t
        # 4. The token loss is the negative of the minimum of these two values
        token_losses.append( -min(loss_unclipped, loss_clipped) )
    # 5. Average the loss over all tokens and perform a gradient update
    θ = Update(θ, mean(token_losses))

# After updating, copy θ to θ_old for the next round of sampling
θ_old = copy(θ)

This pseudocode summarizes the token-per-token PPO update process for training an LLM, illustrating how the old and new policy parameters are used and updated iteratively.

5. DPO: Learning Chess by Studying Chess Manuals

Earlier, we discussed how PPO is similar to having a coach on a live chessboard—guiding you in real time as you play and adapt your strategy (i.e., online learning). In contrast, DPO is more like sitting at home and studying a chess manual (i.e., using offline data), where you deduce how to improve your moves based on pre-existing win–loss comparisons. In this section, we will derive the mathematical principles behind DPO (Direct Preference Optimization) and explain its strengths and limitations compared to PPO (or, more generally, other RLHF approaches).

Before we dive in, keep in mind these three key objective functions:

  • r_{\boldsymbol\phi} represents the Reward Model.

  • \pi_{\boldsymbol\theta} is the aligned model (policy) we want to train.

  • \pi_{\text{ref}} is the reference model (which both PPO and DPO rely on to keep the policy from drifting too far).

These three objectives are defined as follows:

  • Reward Model Loss:

\max_{r_{\boldsymbol\phi}} \Bigl\{ \mathbb{E}_{x,\, y_{\text{w}},\, y_{\text{l}} \sim D} \Bigl[\log \sigma\bigl(r_{\boldsymbol\phi}(x, y_{\text{w}}) - r_{\boldsymbol\phi}(x, y_{\text{l}})\bigr)\Bigr] \Bigr\}

  • PPO Loss:

\max_{\pi_{\boldsymbol\theta}} \Biggl\{ \mathbb{E}_{x \sim D,\, y \sim \pi_{\boldsymbol\theta}(y\mid x)} \Bigl[r_{\boldsymbol\phi}(x, y)\Bigr] \;-\; \beta \,\mathbb{D}_{KL}\Bigl[\pi_{\boldsymbol\theta}(y\mid x) \,\Big\|\, \pi_{\text{ref}}(y\mid x)\Bigr] \Biggr\}

  • DPO Loss:

\max_{\pi_{\boldsymbol\theta}} \Bigl\{ \mathbb{E}_{x,\, y_{\text{w}},\, y_{\text{l}} \sim D} \Bigl[ \log \sigma\Bigl( \beta \log \frac{\pi_{\boldsymbol\theta} (y_{\text{w}} \mid x)}{\pi_\text{ref}(y_{\text{w}} \mid x)} \;-\; \beta \log \frac{\pi_{\boldsymbol\theta} (y_{\text{l}} \mid x)}{\pi_\text{ref}(y_{\text{l}} \mid x)} \Bigr) \Bigr] \Bigr\}

Here, the KL divergence is defined as:

\mathbb{D}_{KL}(P\|Q) =\;\sum_i P(i)\,\log\!\Bigl(\frac{P(i)}{Q(i)}\Bigr).

In essence, while PPO adjusts the policy based on live feedback using a KL penalty to tether the new policy to the reference, DPO leverages pre-collected comparative data (the “chess game records”) to directly optimize the policy based on preference comparisons. This approach has its own advantages and challenges, which we will explore in the following sections.
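Before the derivation, here is a minimal sketch of the DPO objective in PyTorch (tensor names are illustrative; the four inputs are per-example sums of log-probabilities over the response tokens under the policy and the frozen reference model):

import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Each argument: tensor of shape (B,) with the summed log-probs of a response given its prompt."""
    chosen_logratio = logp_chosen - ref_logp_chosen        # log pi_theta(y_w|x) - log pi_ref(y_w|x)
    rejected_logratio = logp_rejected - ref_logp_rejected  # log pi_theta(y_l|x) - log pi_ref(y_l|x)
    # Maximizing E[log sigma(beta * margin)] is equivalent to minimizing its negative
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()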

5.1 Directly Solving for the Optimal Aligned Model from the Optimization Objective

Let’s begin with the PPO loss function and perform a series of mathematical transformations. This is analogous to a chess coach (the Reward Model) providing real-time feedback while applying a KL divergence penalty to prevent your policy from straying too far from a reference model.

  1. Substitute the KL Divergence Formula:

\underset{\pi_{\boldsymbol\theta}}{\max}\,\Bigl\{ \mathbb{E}_{x \sim \mathcal{D},\,y \sim \pi_{\boldsymbol\theta}(y\mid x)}\Bigl[r_{\phi}(x,y) \;-\;\beta \log\Bigl(\tfrac{\pi_{\theta}(y\mid x)}{\pi_{\text{ref}}(y\mid x)}\Bigr)\Bigr]\Bigr\}.

  2. Extract the Constant -\tfrac{1}{\beta} and Apply an Identity Transformation:

\underset{\pi_{\theta}}{\min}\,\Bigl\{ \mathbb{E}_{x \sim \mathcal{D},\,y \sim \pi_{\theta}(y\mid x)} \Bigl[\log\tfrac{\pi_{\theta}(y\mid x)}{\pi_{\text{ref}}(y\mid x)}\;-\;\tfrac{1}{\beta}r_{\phi}(x,y)\Bigr] \Bigr\}.

  3. Continue the Transformation:

\underset{\pi_{\theta}}{\min}\,\Bigl\{ \mathbb{E}_{x \sim \mathcal{D},\,y \sim \pi_{\theta}(y\mid x)} \Bigl[\log\tfrac{\pi_{\theta}(y\mid x)}{\pi_{\text{ref}}(y\mid x)}\;-\;\log e^{r_{\phi}(x,y)/\beta}\Bigr] \Bigr\}.

  4. Obtain:

\underset{\pi_{\theta}}{\min}\,\Bigl\{ \mathbb{E}_{x \sim \mathcal{D},\,y \sim \pi_{\theta}(y\mid x)} \Bigl[\log\tfrac{\pi_{\theta}(y\mid x)}{\pi_{\text{ref}}(y\mid x)\,\exp\!\Bigl(r_{\phi}(x,y)/\beta\Bigr)}\Bigr] \Bigr\}.

At this point, we have constructed a new distribution \pi_{r}(y\mid x) defined as:

\pi_{r}(y\mid x) = \frac{\pi_{\text{ref}}(y\mid x) e^{r_{\phi}(x,y) / \beta}}{Z(x)}

where Z(x) is a normalization constant that ensures \pi_r is a proper probability distribution (i.e., the probabilities sum to 1):

Z(x) =\;\sum_{y}\,\pi_{\text{ref}}(y\mid x)\,\exp\!\Bigl(\tfrac{r_{\phi}(x,y)}{\beta}\Bigr).

In this expression, the numerator reweights the reference probability of a particular output y by the exponentiated reward for the pair (x, y), while the denominator sums this same quantity over all possible outputs y for the input x. This normalization confines the values to the interval [0, 1] and makes them sum to one, satisfying the basic requirements for a probability distribution.

Although we do not know the exact form of \pi_{r}(y\mid x), we do have the precise expression for the reference distribution \pi_{\text{ref}}(y\mid x). Leveraging this, one can approximate the distribution \pi_{r}(y\mid x) by feeding the input x into the reference model and either iterating over all possible y or sampling a sufficient number of y values. Note, however, that this approach poses computational challenges in practice, which we will discuss further later on.

  5. Continue with an Equivalent Transformation of the PPO Loss:

\underset{\pi_{\theta}}{\min}\, \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_{\theta}(y\mid x)}\Bigl[\log \tfrac{\pi_{\theta}(y\mid x)} {\tfrac{\pi_{\text{ref}}(y\mid x)\,\exp\!\Bigl(r_{\phi}(x,y)/\beta\Bigr)}{\,Z(x)\,}\,Z(x)}\Bigr].

  6. Simplify:

\underset{\pi_{\theta}}{\min}\, \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_{\theta}(y\mid x)} \Bigl[\log \tfrac{\pi_{\theta}(y\mid x)}{\pi_{r}(y\mid x)} \;-\;\log Z(x)\Bigr].

  7. Ignore the \log Z(x) Term (as It Is Independent of \pi_{\theta}) to Obtain:

\underset{\pi_{\theta}}{\min}\, \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_{\theta}(y\mid x)} \Bigl[\log \tfrac{\pi_{\theta}(y\mid x)}{\pi_{r}(y\mid x)}\Bigr].

  8. Express This in the Form of a KL Divergence:

\underset{\pi_{\theta}}{\min}\,\mathbb{E}_{x \sim \mathcal{D}}\Bigl[ \mathbb{D}_{KL}\Bigl(\pi_{\theta}(y\mid x)\,\Big\|\,\pi_{r}(y\mid x)\Bigr)\Bigr].

At this point, our objective is reduced to focusing solely on the KL divergence. Since the KL divergence is always non-negative and equals zero only when the two distributions are identical, the optimal condition is achieved when \pi_{\theta}(y\mid x) exactly equals \pi_{r}(y\mid x). This provides us with an explicit solution: the optimal probability distribution under PPO is precisely \pi_{r}(y\mid x).

In other words, if the parameters of the reward model r_{\phi} are fixed, then the optimal solution for PPO is

\pi_{r}(y\mid x) =\;\frac{\pi_{\text{ref}}(y\mid x)\,\exp\!\Bigl(\tfrac{r_{\phi}(x,y)}{\beta}\Bigr)}{Z(x)}.

In practice, however, the reward function r(x,y) used in alignment training is not arbitrarily chosen; it is obtained through data-driven training to yield an optimal reward model. That is, we first train an ideal reward model r^{*}_{\phi} (analogous to a coach scoring a game), and then, based on this optimal reward model, we further train an aligned model that “plays chess well.” Consequently, the optimal reward model r^{*}_{\phi} and the aligned model it produces, \pi^{*}_{r}, continue to satisfy the relationship:

\pi^{*}_{r}(y\mid x) =\;\frac{\pi_{\text{ref}}(y\mid x)\,\exp\!\Bigl(\tfrac{r^{*}_{\phi}(x,y)}{\beta}\Bigr)}{Z(x)}.

Summary

  1. First, we defined an overall optimization objective for aligning with human preferences. This objective function assumes that we have a reward function and seeks to find an aligned model that maximizes this objective. It can be understood as, in a game of chess, striving to adopt a strategy (a way of playing) that maximizes your win rate based on the coach’s (reward function’s) evaluation. Mathematically, this objective is written as:

\max_{\pi_{\boldsymbol\theta}} \Biggl\{ \mathbb{E}_{x \sim D,\, y \sim \pi_{\boldsymbol\theta}(y\mid x)} \Bigl[r_{\boldsymbol\phi}(x, y)\Bigr] \;-\; \beta \,\mathbb{D}_{KL}\Bigl[\pi_{\boldsymbol\theta}(y\mid x) \,\Big\|\, \pi_{\text{ref}}(y\mid x)\Bigr] \Biggr\}.

  2. Next, starting from this optimization objective, we derived the explicit solution for the aligned model \pi when the reward function r is fixed. Analogous to finding the optimal move in a chess game from historical records, this explicit solution is given by:

\pi^{*}(y \mid x) = \frac{\pi_{\text{ref}}(y \mid x)\,e^{r_{\phi}(x,y) / \beta}}{Z(x)},

where the normalization partition function Z(x) is defined as

Z(x) = \sum\limits_{y} \pi_{\text{ref}}(y \mid x)\,e^{r_{\phi}(x,y) / \beta}.

  3. Finally, in practical training, we typically do not train the reward model in isolation. Instead, we train the aligned model directly under the guidance of the optimal reward model r^{*}_{\phi}. In other words, just as continuous practice and coach feedback eventually yield a strategy that truly wins games, we rearrange the above formula to express the reward in terms of the policy:

r_{\phi}(x,y) = \beta \log\frac{\pi^{*}(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x).

These three steps illustrate the entire process—from defining the overall alignment objective, to deriving the optimal policy, and finally linking the reward model with the aligned model. The entire procedure is akin to first establishing the criteria for winning a game (the optimization objective), then deducing the best moves from historical records (the explicit policy solution), and ultimately, through continuous practice and feedback (reward model training), arriving at a strategy that both evaluates correctly and performs effectively.

5.2 Skipping the Training of the Reward Model

Although we have formally derived

\pi_{r}(y\mid x) =\;\frac{\pi_{\text{ref}}(y\mid x)\,\exp\!\Bigl(\frac{r_{\phi}(x,y)}{\beta}\Bigr)}{Z(x)},

in practice this explicit solution is difficult to apply directly because:

  1. Estimating Z(x) is challenging: It requires either exhaustively enumerating or adequately sampling all possible responses y for a given prompt x to compute and accumulate \exp\!\bigl(r_\phi(x,y)/\beta\bigr) \cdot \pi_{\text{ref}}(y\mid x), which is extremely costly.

  2. Our original goal was to bypass training the reward model: We intended to learn an aligned model in one step rather than first training a reward function. However, \pi_{r} still depends on the reward function r, which is a step away from our desired outcome of directly “learning to play chess well” without an intermediate scoring phase.

This leads us to consider the inverse: if we have the optimal aligned model \pi^{*}, can we derive its corresponding reward function r_{\phi}^{*}? The answer is affirmative. Starting from

\frac{\pi^{*}(y\mid x)\,Z(x)}{\pi_{\text{ref}}(y\mid x)} =\;\exp\!\Bigl(\frac{r_{\phi}^{*}(x,y)}{\beta}\Bigr),

we can equivalently transform it into:

r_{\phi}^{*}(x,y) =\;\beta\,\log\!\Bigl(\frac{\pi^{*}(y\mid x)}{\pi_{\text{ref}}(y\mid x)}\Bigr)\;+\;\beta\,\log Z(x).

This expression allows us to write the optimal reward r^{*} in terms of \pi^{*}, thereby bridging the gap between training the reward model and the aligned model.

Summary of this section:

Since we can express the optimal reward model r^{*}_{\phi} in terms of the optimal aligned model \pi^{*} (just as repeated gameplay and analysis can help you deduce the best evaluation criteria for chess moves), we can directly substitute \pi^{*} into the reward model’s training objective. In other words, while it may appear that you are training a reward model, you are in fact directly obtaining the optimal aligned model in one fell swoop. This achieves our original goal of “having a model that can both evaluate and play chess well.”

The remaining challenge then shifts to training the reward model itself. Typically, we adopt a preference-ranking approach for data annotation, analogous to having annotated chess game records that indicate which moves are superior. Generally, there are two main methods:

  1. Generating Only Two Responses
    For a given prompt x (or chess position), generate two responses (moves), for example, <prompt x, chosen y1, reject y2>. Human annotations indicate which move is better, and our objective is to have the reward model assign a higher score to the chosen move while scoring the rejected move lower.

  2. Generating K Responses (K > 2)
    For the same prompt x, generate multiple responses (moves), such as <prompt x, y1, ..., yK>. Suppose human annotators provide a preference ranking \tau (e.g., y2 > y3 > y1 > ... > yK). We aim for the reward model to give the highest overall score to the true ranking \tau while scoring any other ordering lower.

In some training frameworks (such as ChatGPT’s implementation), when more than two responses are generated, the system breaks them down into pairwise comparisons so that the objective remains consistent with the two-response case. In more general scenarios, however, the entire set of preference rankings is treated as a whole, with the expectation that the true ranking \tau receives the highest score. The derivation of DPO is based on this holistic view of preference ranking, and in the following sections we will derive the final DPO objective function separately for the cases of K=2 and K>2.

BT Model: Generating Only Two Responses

Imagine that while playing chess your coach shows you only two candidate moves—one labeled “good” (chosen) and the other “bad” (reject). In this case, your goal is for your “scoring system” (the reward model) to evaluate the good move as significantly superior to the bad one. To model this, we can employ the classic Bradley-Terry (BT) model, originally proposed in 1952 to analyze pairwise comparisons, and widely used in sports, market research, and other fields. For a pair of items y1y_1 and y2y_2 (here representing the chosen and rejected responses), the BT model expresses the probability that y1y_1 beats y2y_2 as:

P(y1>y2x)  =  eλy1eλy1+eλy2, P(y_1 > y_2 \mid x) \;=\; \frac{e^{\lambda y_1}}{e^{\lambda y_1} + e^{\lambda y_2}},

where $\lambda_{y_1}$ and $\lambda_{y_2}$ denote their respective strength parameters—analogous to using historical win rates to gauge the strength of chess players. In our context, when $y_1$ and $y_2$ correspond to the chosen and rejected responses, these parameters can be interpreted as the scores given by the reward model.

Our objective is to maximize the probability that the chosen response $y_w$ beats the rejected response $y_l$ across the entire annotated dataset $D = \{x^i, y_w^i, y_l^i\}_{i=1}^{N}$. That is, we want the chosen responses to consistently outperform the rejected ones. Accordingly, the overall optimization objective for the reward function can be formulated as:

$$ \begin{aligned} L(r_{\phi}, D) &= -\,\mathbb{E}_{(x,\,y_w,\,y_l) \sim D}\Bigl[\log P(y_w > y_l \mid x)\Bigr] \\ &= -\,\mathbb{E}_{(x,\,y_w,\,y_l) \sim D}\Bigl[\log \frac{e^{r(x, y_w)}}{e^{r(x, y_w)} + e^{r(x, y_l)}}\Bigr] \\ &= -\,\mathbb{E}_{(x,\,y_w,\,y_l) \sim D}\Bigl[\log \frac{1}{1 + e^{-\bigl(r(x, y_w) - r(x, y_l)\bigr)}}\Bigr] \\ &= -\,\mathbb{E}_{(x,\,y_w,\,y_l) \sim D}\Bigl[\log \sigma\bigl(r(x, y_w) - r(x, y_l)\bigr)\Bigr]. \end{aligned} $$

The final expression above is precisely the reward model optimization objective employed in systems like ChatGPT. Intuitively, this objective encourages the reward model to output scores such that the chosen response clearly outperforms the rejected one.
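As a concrete illustration, here is a minimal PyTorch sketch of this pairwise objective; the function name and tensor shapes are assumptions for illustration rather than code from any particular training framework:

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(chosen_scores: torch.Tensor,
                         rejected_scores: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss: -log sigmoid(r(x, y_w) - r(x, y_l)).

    Both inputs have shape (batch,) and hold the scalar rewards that the
    reward model assigned to the chosen and rejected responses.
    """
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Toy example: the chosen responses already score higher, so the loss is small.
chosen = torch.tensor([2.0, 1.5])
rejected = torch.tensor([0.5, 1.0])
print(pairwise_reward_loss(chosen, rejected))  # roughly 0.34
```

Using `logsigmoid` rather than `log(sigmoid(...))` keeps the computation numerically stable when the score margin is large.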

Assuming we have found the optimal reward model of the form

$$ r^{*}_{\phi}(x,y) = \beta\,\log\frac{\pi^{*}(y\mid x)}{\pi_{\text{ref}}(y\mid x)} + \beta\,\log Z(x), $$

substituting this optimal reward function into our earlier objective yields:

$$ \begin{aligned} L(r^{*}_{\phi}, D) &= -\,\mathbb{E}_{(x,\,y_w,\,y_l) \sim D}\Bigl[\log \sigma\Bigl(\beta\,\log\frac{\pi^{*}(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} + \beta\,\log Z(x) - \beta\,\log\frac{\pi^{*}(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} - \beta\,\log Z(x)\Bigr)\Bigr] \\[6pt] &= -\,\mathbb{E}_{(x,\,y_w,\,y_l) \sim D}\Bigl[\log \sigma\Bigl(\beta\,\log\frac{\pi^{*}(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta\,\log\frac{\pi^{*}(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\Bigr)\Bigr]. \end{aligned} $$

Note that the two $\beta \log Z(x)$ terms cancel, so the intractable partition function drops out entirely.

This result shows that the reward model’s training objective has been transformed into one that depends solely on the aligned model $\pi^{*}$. In effect, we can bypass training a separate reward model and directly use the annotated pairwise preference data to train the aligned model in one fell swoop—just as you would learn the best moves directly from chess records. Replacing the (unknown) optimal policy $\pi^{*}$ with our trainable aligned model $\pi_\theta$, the final objective becomes:

$$ \mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l) \sim D}\Bigl[\log \sigma\Bigl(\beta\,\log\frac{\pi_{\theta}(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta\,\log\frac{\pi_{\theta}(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\Bigr)\Bigr]. $$

This is precisely the DPO loss: minimizing it over $\pi_\theta$ is equivalent to maximizing the pairwise preference likelihood that the reward model would have been trained on.
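To see what this objective looks like in code, here is a minimal PyTorch sketch. It assumes the per-response (token-summed) log-probabilities under $\pi_\theta$ and a frozen $\pi_{\text{ref}}$ have already been computed; all names are illustrative:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio)).

    Each input has shape (batch,) and holds the summed token log-probability
    of a full response under the trainable policy pi_theta or the frozen
    reference pi_ref.
    """
    chosen_logratio = policy_chosen_logps - ref_chosen_logps        # log pi_theta/pi_ref for y_w
    rejected_logratio = policy_rejected_logps - ref_rejected_logps  # log pi_theta/pi_ref for y_l
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# Toy example with made-up sequence log-probabilities.
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
print(loss)
```

Notice that only log-probabilities are needed: there is no sampling step and no explicit reward model, which is exactly what makes DPO an offline method.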

PL Model: Generating K (K > 2) Responses

Now, imagine a chess match where, instead of considering just two candidate moves, you evaluate multiple possible moves simultaneously. For example, at a critical moment your coach might present you with K different moves and ask you to rank them based on their potential outcomes. This scenario corresponds to the RLHF setting in which K responses are generated and ranked by preference.

Unlike the BT model, which handles pairwise comparisons when only two responses are generated, here we use the Plackett-Luce (PL) model, a statistical approach that can rank multiple candidates. Let $\tau$ denote the ground-truth ranking provided by human annotators—i.e., the ideal ranking of moves determined by the coach. We want this true ranking $\tau$ to be more probable than any other possible ordering. To this end, we define the probability of the true ranking $\tau$ under the reward scores as:

$$ P(\tau \mid y_1, y_2, \dots, y_K, x) = \prod_{k=1}^{K}\frac{e^{r(x, y_{\tau_k})}}{\sum_{j=k}^{K} e^{r(x, y_{\tau_j})}}. $$

Here, $\tau_k$ denotes the k-th move in the true ranking $\tau$ (for instance, $\tau_1$ is the most preferred move, $\tau_2$ is the second best, etc.), and $x$ represents the current game state (or prompt). Intuitively, we expect the top-ranked move ($\tau_1$) to score highest among all candidates; the second-ranked move ($\tau_2$) should score highest among the remaining moves; and so on.

Next, we substitute the optimal reward function $r^{*}_{\phi}(x,y)$ into the above equation. First, express $r^{*}_{\phi}(x,y)$ as

$$ r^{*}_{\phi}(x,y) = \beta\,\log\frac{\pi^{*}(y\mid x)}{\pi_{\text{ref}}(y\mid x)} + \beta\,\log Z(x), $$

then the probability becomes

$$ P(\tau \mid y_1, y_2, \dots, y_K, x) = \prod_{k=1}^{K}\frac{e^{r^{*}_{\phi}(x, y_{\tau_k})}}{\sum_{j=k}^{K} e^{r^{*}_{\phi}(x, y_{\tau_j})}}. $$

We then express $r^{*}_{\phi}$ in terms of $\pi^{*}$. Note that $Z(x)$ depends only on $x$, so the factor $e^{\beta \log Z(x)}$ appears in the numerator and in every term of the denominator and cancels out; we therefore omit it in what follows. The expression can be rewritten as

$$ P(\tau \mid y_1, y_2, \dots, y_K, x) = \prod_{k=1}^{K}\frac{\exp\!\Bigl(\beta\,\log\frac{\pi^{*}(y_{\tau_k}\mid x)}{\pi_{\text{ref}}(y_{\tau_k}\mid x)}\Bigr)}{\sum_{j=k}^{K}\exp\!\Bigl(\beta\,\log\frac{\pi^{*}(y_{\tau_j}\mid x)}{\pi_{\text{ref}}(y_{\tau_j}\mid x)}\Bigr)}. $$

Finally, for the entire dataset, we want the average probability of the true ranking $\tau$ to be as high as possible—in other words, our goal is to maximize the likelihood of the true ranking over the dataset. Thus, for the multi-response case, the DPO objective function (replacing $\pi^{*}$ with the trainable $\pi_\theta$) can be written as:

$$ \mathcal{L}_{\text{DPO}}(\pi_\theta, \pi_{\text{ref}}) = -\,\mathbb{E}_{(\tau,\,y_1,\,\dots,\,y_K,\,x) \sim D}\!\left[\log \prod_{k=1}^{K}\frac{\exp\!\Bigl(\beta\,\log\frac{\pi_\theta(y_{\tau_k}\mid x)}{\pi_{\text{ref}}(y_{\tau_k}\mid x)}\Bigr)}{\sum_{j=k}^{K}\exp\!\Bigl(\beta\,\log\frac{\pi_\theta(y_{\tau_j}\mid x)}{\pi_{\text{ref}}(y_{\tau_j}\mid x)}\Bigr)}\right]. $$
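The listwise objective can likewise be sketched in a few lines of PyTorch, assuming the $\beta$-scaled log-ratios $\beta \log\frac{\pi_\theta(y_{\tau_k}\mid x)}{\pi_{\text{ref}}(y_{\tau_k}\mid x)}$ have already been computed and arranged in the order given by $\tau$ (all names here are illustrative):

```python
import torch

def plackett_luce_dpo_loss(logratios: torch.Tensor) -> torch.Tensor:
    """Listwise DPO loss under the Plackett-Luce model.

    logratios: shape (batch, K); entry (i, k) holds the beta-scaled
    log-ratio of the k-th response of example i, with columns already
    sorted by the human ranking tau (column 0 = most preferred).
    """
    batch_size, K = logratios.shape
    loss = torch.zeros(batch_size)
    for k in range(K):
        suffix = logratios[:, k:]  # candidates still "in play" at step k
        # log-probability that the k-th ranked response wins among the suffix
        loss = loss - (suffix[:, 0] - torch.logsumexp(suffix, dim=-1))
    return loss.mean()

# Toy example: one prompt with K = 3 ranked responses.
print(plackett_luce_dpo_loss(torch.tensor([[2.0, 1.0, -0.5]])))
```

For $K = 2$ this reduces to the pairwise DPO loss above, since the final factor of the product is always 1.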


5.3 Limitations of DPO

Having derived the mathematical foundations of DPO, we now turn to its limitations. Much like in chess, where merely studying game records without playing does not guarantee that you will play well, DPO trains the model to evaluate responses through the reward model's objective rather than replicating PPO's online generation loop. This means that during training, both the data and the loss function are exactly those of reward-model training.

This raises a critical question: Does the model have the capacity to improve both its "evaluation" and "generation" abilities simultaneously? In other words, DPO's training process focuses solely on teaching the model how to "score" responses, akin to learning how to assess moves from game records. However, this does not necessarily ensure that the model will make optimal decisions during actual generation—it does not prove that evaluation ability will seamlessly translate into effective generative performance. If this assumption fails, then the DPO training process loses its purpose. Just as one does not expect to become a chess master solely by reading game records, the validity of this assumption also directly impacts the rationale behind other methods like SPIN and self-reward.

Moreover, because the DPO optimization objective depends exclusively on the reward model's scoring, it only cares about whether the score margin relative to the reference model moves in the expected direction, not whether the generated sentences are fluent or appealing. In other words, DPO focuses more on enlarging the margin between chosen and rejected responses than on ensuring that the model generates high-quality outputs. This can lead to an awkward phenomenon during DPO training: the log-probabilities of both the chosen and the rejected responses may decrease simultaneously (as long as the gap between them widens), forcing us to adjust hyperparameters or add additional constraints to stabilize training.

From another perspective, the limitations of DPO can be summarized as follows:

  • Disconnection Between Evaluation and Generation
    DPO’s training process teaches the model only to “evaluate”—acting like a static chess move scoring system—without incorporating the online generation process necessary for actual play. In contrast, PPO learns through online generation and trial-and-error, transforming evaluation ability into generative ability. Without this online exploration, a model trained with DPO might score well on offline data yet perform poorly during real generation.

  • Limitations of Offline Training
    RLHF is fundamentally an online learning method, as it requires continuous correction of the model’s existing knowledge—much like a chess player must practice regularly to hone their skills. DPO, however, is entirely offline; it forces the model to rely solely on what annotators deem “correct” (e.g., the best moves from game records) and follow a predetermined optimal path, leaving little room for exploration. In practice, techniques such as initial supervised fine-tuning (SFT) on preferred responses or augmenting the preference data with diverse outputs are often used to introduce some elements of online learning and exploration.

  • High Data Quality Requirements
    Since DPO training is entirely dependent on offline preference data, its effectiveness is highly sensitive to the quality and coverage of this data. If the training data is not comprehensive or does not match the actual generative distribution, the model may generate responses with the correct relative proportions of positive and negative examples, but the absolute probabilities could be diluted, or even bizarre outputs might emerge that are not present in the training data. For example, in a Q&A scenario, if the positive sample is “Pasta should be mixed with tomato meat sauce” and the negative sample is “Pasta should be mixed with chili oil,” the model after DPO optimization might output “Pasta should be mixed with concrete grade 42.” Such deviations underscore the critical importance of data quality.

In summary, the limitations of DPO stem from its exclusive focus on transferring the reward model’s “scoring” capability to the aligned model, without incorporating actual online generation and exploration. This is analogous to learning chess moves solely by studying game records without playing; consequently, even if you can theoretically evaluate every move accurately, it does not guarantee that you will make the best decisions during live play. Thus, although DPO can directly train an aligned model in one step, its performance often falls short compared to a complete RLHF system (i.e., Reward Model + PPO) if it lacks the complement of online generation and exploration.
