DPO (2023)

Paper Title: Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Link: https://arxiv.org/abs/2305.18290

Summary: Introduces DPO, which fine-tunes a language model directly on preference data without needing an explicit reward model. It simplifies the RLHF pipeline and often matches or outperforms PPO-based RLHF.

Related Works:

  • PPO (2017, General-purpose RL algorithm)
  • RLHF (2017, Original idea of learning from human preferences)
  • InstructGPT (2022, Combined SFT + reward model + PPO. It's an LLM aligned with preference learning via PPO, a direct precursor to DPO.)

DPO is designed to *replace* the RLHF (PPO-based) part of the pipeline, and it does so remarkably well.

| Aspect | PPO (RLHF) | DPO |
| --- | --- | --- |
| Needs reward model? | ✅ Yes | ❌ No |
| Training stability | ⚠️ Fragile | ✅ Stable |
| Easy to implement | ❌ No (RL required) | ✅ Yes (supervised style) |
| Performance | ⚠️ Good if tuned well | ✅ Strong and robust |
| Compute efficiency | ❌ Sample-heavy | ✅ More efficient |

Why PPO is hard to get right for LLMs

  1. Stability issues:
    • PPO is sensitive to hyperparameters like KL penalties and learning rates.
    • If improperly tuned, you can easily get mode collapse or degenerate outputs.
  2. Requires a reward model:
    • You need to train a separate reward model from human preference data.
    • This reward model is imperfect, can introduce bias, and must be carefully frozen during policy optimization.
    • Training it is an extra engineering burden and a source of noise.
  3. Sampling inefficiency:
    • PPO updates require sampling from the LM, evaluating with the reward model, and then optimizing the policy.
    • This makes it slow and computationally expensive.

Why DPO is easier and often better

  1. No explicit reward model:
    • DPO skips training a reward model entirely.
    • It directly fine-tunes the base LM to prefer preferred samples over rejected ones via a binary logistic loss.
  2. Supervised-style training:
    • DPO is formulated as a classification-style objective between two LM outputs (chosen vs. rejected).
    • This makes it simple, stable, and easily scalable.
  3. Better performance:
    • Empirically, DPO can match or outperform PPO in preference-based fine-tuning, especially when scaled.
  4. Minimal changes:
    • You just need pairwise preference data: (prompt, chosen response, rejected response).
    • It’s as plug-and-play as supervised fine-tuning.

Intuition

DPO is a preference-based fine-tuning method for LLMs that avoids reinforcement learning. Instead of training a reward model and doing policy optimization (like PPO in RLHF), DPO directly adjusts the language model to prefer preferred responses over rejected ones, using a simple classification-style loss.

  • ✅ No reward model needed
  • ✅ No sampling in the training loop (unlike PPO)
  • ✅ Just use pairs of completions + preferences
  • ✅ Simple, stable, efficient

Imagine you have this kind of data:

  • A prompt $x$
  • Two model completions: a preferred one $y_p$ and a rejected one $y_r$
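
Concretely, a single training example might be stored like this (field names and strings are purely illustrative):

example = {
    "prompt": "Explain DPO in one sentence.",
    "chosen": "DPO fine-tunes the model directly on preference pairs with a simple classification-style loss.",
    "rejected": "DPO first trains a reward model and then runs PPO against it.",
}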

Core Idea of Bradley-Terry (BT) Preference Model

We assume human preferences follow this softmax-like distribution:

$$P(y_p \succ y_r \mid x) = \frac{e^{r(x, y_p)}}{e^{r(x, y_p)} + e^{r(x, y_r)}}$$

where $r(x, y)$ is a latent reward function. This says: the higher the reward of $y_p$ compared to $y_r$, the more likely a human is to prefer $y_p$.
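
For example, if $r(x, y_p) = 1$ and $r(x, y_r) = 0$, then $P(y_p \succ y_r \mid x) = \frac{e^{1}}{e^{1} + e^{0}} \approx 0.73$: the preferred response wins about 73% of the time.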

The DPO Trick: Reparametrize Reward as Log Probabilities

Instead of learning a reward model $r(x, y)$, DPO reuses the language model's own probabilities. Let:

  • $\pi_\theta(y \mid x)$ be the current model (the policy we're updating).
  • $\pi_{\mathrm{ref}}(y \mid x)$ be the reference model (e.g., the original SFT model).
  • $\beta$ be a temperature-like hyperparameter.

We define:

$$r(x, y) := \beta \cdot \log\left(\frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}\right)$$

This means: the reward measures how much more likely the current model makes $y$ than the reference model does.
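
For example, if the current model assigns $y$ twice the probability the reference model does, the implicit reward is $r(x, y) = \beta \log 2 \approx 0.69\beta$; if the two models agree exactly, the reward is $0$.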

The DPO Loss and its Gradient Interpretation

Plug the above into the Bradley-Terry model and take the log-loss:

$$\mathcal{L}_{\mathrm{DPO}} = -\log \sigma\left(\beta \cdot \left[\log \frac{\pi_\theta(y_p \mid x)}{\pi_{\mathrm{ref}}(y_p \mid x)} - \log \frac{\pi_\theta(y_r \mid x)}{\pi_{\mathrm{ref}}(y_r \mid x)}\right]\right)$$

This is just a binary cross-entropy loss saying: make the preferred response more likely than the rejected one, relative to the reference model.
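
As a minimal PyTorch sketch of exactly this objective (assuming the `logp_*` arguments are summed log-probabilities of each full response under the policy and reference models; a fuller BT/PL version appears in the Code Sketch section below):

import torch
import torch.nn.functional as F

def dpo_loss(logp_theta_p, logp_ref_p, logp_theta_r, logp_ref_r, beta=0.1):
    """All arguments are [batch] tensors of sequence log-probabilities."""
    logits = beta * ((logp_theta_p - logp_ref_p) - (logp_theta_r - logp_ref_r))
    return -F.logsigmoid(logits).mean()  # -log σ(·), averaged over the batch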

The gradient update for DPO is:

$$\nabla_\theta \mathcal{L}_{\mathrm{DPO}} \;\propto\; \text{increase } \log \pi_\theta(y_p \mid x) \text{ and decrease } \log \pi_\theta(y_r \mid x)$$

The more wrong the model is about the preference (i.e., when it prefers the rejected one), the larger the gradient to fix it.
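
Written out more precisely (this is the gradient the DPO paper derives), with the implicit reward $\hat{r}_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}$:

$$\nabla_\theta \mathcal{L}_{\mathrm{DPO}} = -\beta\, \mathbb{E}\Big[\, \sigma\big(\hat{r}_\theta(x, y_r) - \hat{r}_\theta(x, y_p)\big)\,\big(\nabla_\theta \log \pi_\theta(y_p \mid x) - \nabla_\theta \log \pi_\theta(y_r \mid x)\big)\Big]$$

The sigmoid factor acts as a weight: it is close to 1 when the implicit reward ranks the rejected response above the preferred one, and close to 0 when the model already gets the preference right.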

Core Idea of Plackett–Luce (PL) Preference Model

The Bradley-Terry (BT) model used in DPO is a special case of the more general class of Plackett–Luce (PL) models. Both are probabilistic models of preferences, but they differ in how many items they rank and in the structure of the preference data.

Bradley-Terry (BT) Model

  • Handles pairwise preferences.
  • Given a prompt $x$ and two completions $y_1$ and $y_2$, the probability that $y_1 \succ y_2$ is:

$$P(y_1 \succ y_2 \mid x) = \frac{e^{r(x, y_1)}}{e^{r(x, y_1)} + e^{r(x, y_2)}}$$

Plackett-Luce (PL) Model

  • Handles full rankings or more than two items; the formulation works for any $k \ge 2$.
  • Given $k$ completions $y_1, y_2, \ldots, y_k$ that humans rank as $y_1 \succ y_2 \succ \cdots \succ y_k$, the model says:

$$P(y_1 \succ y_2 \succ \cdots \succ y_k \mid x) = \prod_{i=1}^{k} \frac{e^{r(x, y_i)}}{\sum_{j=i}^{k} e^{r(x, y_j)}}$$

  • Reduces to BT when $k = 2$ (worked out below).
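
To see the reduction, take $k = 2$: the $i = 2$ factor in the product is $e^{r(x, y_2)} / e^{r(x, y_2)} = 1$, so

$$P(y_1 \succ y_2 \mid x) = \frac{e^{r(x, y_1)}}{e^{r(x, y_1)} + e^{r(x, y_2)}},$$

which is exactly the BT probability above.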

You can extend DPO to the PL setting with the same reparameterization trick, where:

$$r(x, y) = \beta \cdot \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}$$

and define the PL likelihood over full rankings directly in terms of this $r(x, y)$, i.e., in terms of $\pi_\theta / \pi_{\mathrm{ref}}$.

The key insight still holds: you avoid learning an explicit reward function and instead express everything in terms of your model’s policy.

Code Sketch

A code sketch of how to implement the BT-DPO loss vs. the PL-DPO loss (generated by ChatGPT).

import torch
import torch.nn.functional as F

# Log-probabilities of each candidate completion: [batch, num_candidates]
# `model.log_prob` / `ref_model.log_prob` are assumed helpers that return the summed
# log-probability of each completion under π_θ and π_ref respectively.
log_probs_model = model.log_prob(x, y_candidates)  # log π_θ(y | x)
log_probs_ref = ref_model.log_prob(x, y_candidates)  # log π_ref(y | x)
beta = 0.1  # temperature-like hyperparameter


def bt_dpo_loss(log_probs_model, log_probs_ref, preferred_idx, rejected_idx, beta):
    # Implicit reward difference between preferred and rejected completions, shape [batch]
    reward_diff = beta * (
        log_probs_model[:, preferred_idx] - log_probs_ref[:, preferred_idx]
        - log_probs_model[:, rejected_idx] + log_probs_ref[:, rejected_idx]
    )
    # -log σ(reward_diff): binary cross-entropy with an all-ones target
    loss = F.binary_cross_entropy_with_logits(reward_diff, torch.ones_like(reward_diff))
    return loss


# Usage: each batch element has exactly two completions, yₚ at index 0 and yᵣ at index 1
# bt_loss = bt_dpo_loss(log_probs_model, log_probs_ref, 0, 1, beta=0.1)


def pl_dpo_loss(log_probs_model, log_probs_ref, ranks, beta):
    """
    log_probs_model, log_probs_ref: [batch, k] for k ranked completions
    ranks: [batch, k] indices sorted from best to worst
           (e.g., [2, 0, 1] means y₂ ≻ y₀ ≻ y₁)
    """
    batch_size, k = log_probs_model.shape
    loss = 0.0

    for i in range(k - 1):
        # Current top-ranked completion among the remaining candidates
        idx_i = ranks[:, i]  # shape [batch]
        # Candidates ranked below it
        idx_j = ranks[:, i + 1:]  # shape [batch, k - i - 1]

        r_i = beta * (log_probs_model.gather(1, idx_i.unsqueeze(1)) - log_probs_ref.gather(1, idx_i.unsqueeze(1)))  # [batch, 1]
        r_j = beta * (log_probs_model.gather(1, idx_j) - log_probs_ref.gather(1, idx_j))  # [batch, k - i - 1]

        # Plackett-Luce factor: log-softmax over the remaining items, keep the top-ranked one
        logits = torch.cat([r_i, r_j], dim=1)  # [batch, 1 + (k - i - 1)]
        log_softmax = F.log_softmax(logits, dim=1)[:, 0]

        loss += -log_softmax.mean()

    return loss / (k - 1)  # normalize across comparisons


# Usage: suppose you have 3 completions ranked as y₂ ≻ y₀ ≻ y₁
ranks = torch.tensor([[2, 0, 1]]).repeat(log_probs_model.shape[0], 1)  # [batch, 3]
pl_loss = pl_dpo_loss(log_probs_model, log_probs_ref, ranks, beta=0.1)
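
The sketch above assumes a `log_prob(x, y_candidates)` helper that returns one summed log-probability per candidate completion. A minimal way to compute such per-sequence log-probabilities with a Hugging Face causal LM could look like the following (this helper and its tensor layout are assumptions, not part of the original sketch); wrap the reference-model call in `torch.no_grad()`, since only the policy needs gradients:

import torch

def sequence_log_prob(model, input_ids, attention_mask, completion_mask):
    """
    Summed log-probability of the completion tokens in each sequence.
    input_ids, attention_mask, completion_mask: [batch, seq_len]
    completion_mask is 1 on completion tokens, 0 on prompt and padding tokens.
    """
    logits = model(input_ids=input_ids, attention_mask=attention_mask).logits  # [batch, seq_len, vocab]
    # Logits at position t predict the token at position t + 1, so shift by one
    shifted_logits = logits[:, :-1, :]
    labels = input_ids[:, 1:]
    mask = completion_mask[:, 1:].to(shifted_logits.dtype)
    token_logps = torch.log_softmax(shifted_logits, dim=-1).gather(2, labels.unsqueeze(-1)).squeeze(-1)  # [batch, seq_len - 1]
    return (token_logps * mask).sum(dim=-1)  # [batch]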

Extended Reads / Videos

HelpSteer2 (2024, Nvidia)

HelpSteer2-Preference: Complementing Ratings with Preferences

  • Shows that a regression (rating) objective combined with a Bradley-Terry preference objective can form a better preference model (rough sketch below).
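
As a hedged sketch of what "regression + BT" can look like on a scalar reward head (a generic illustration, not necessarily HelpSteer2-Preference's exact recipe; the function and argument names are made up):

import torch
import torch.nn.functional as F

def regression_plus_bt_loss(scores_chosen, scores_rejected,
                            ratings_chosen, ratings_rejected, lambda_bt=1.0):
    """
    scores_*: scalar reward-model outputs, shape [batch]
    ratings_*: human ratings (e.g., helpfulness scores) for the same responses, shape [batch]
    """
    # Regression term: keep the scalar scores calibrated to the absolute ratings
    reg = F.mse_loss(scores_chosen, ratings_chosen) + F.mse_loss(scores_rejected, ratings_rejected)
    # Bradley-Terry term: rank the chosen response above the rejected one
    bt = -F.logsigmoid(scores_chosen - scores_rejected).mean()
    return reg + lambda_bt * bt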

HPSv3 (Aug 2025)

HPSv3: Towards Wide-Spectrum Human Preference Score

  • Prior deterministic preference models assign equal confidence to all output predictions: during training, the model blindly assigns scores to samples without considering the uncertainty of its predictions.

The uncertainty-aware ranking loss addresses a key problem in human preference annotation: the inherent inconsistencies and potential errors in human judgments, especially in “hard cases” where it’s difficult to assign a definitive preference score.

Here’s a breakdown of the problems it solves:

  • Inconsistent Annotations: Human annotators can be subjective, and their preferences might not always be perfectly consistent. Traditional ranking models that assign a single, deterministic score might struggle when faced with these inconsistencies, leading to biased judgments.
  • Difficulty with Subtle Differences: When images have very subtle differences in quality or alignment with a prompt, human annotators may struggle to pick a clear winner, leading to low confidence in their choices. A deterministic model would still output a single score, potentially overstating its confidence in these ambiguous situations.
  • Ignoring Annotator Uncertainty: Prior models often treat all annotations with equal confidence. However, some annotations might be more reliable than others. The uncertainty-aware ranking explicitly models this variability by predicting a mean ($\mu$) and a variance ($\sigma$) for the preference score, rather than just a single score. This allows the model to “know” when it’s less certain about a prediction, which is crucial when dealing with noisy human labels.
  • Improved Ranking Accuracy: By accounting for the inherent uncertainty in human preferences, the model can leverage the underlying distribution of pairwise data more effectively. This leads to a more robust and reliable evaluation framework that better captures the nuances of human judgments, ultimately improving the overall ranking accuracy.

The paper mentions that uncertainty-aware ranking, unlike traditional RankNet, uses the last two linear layers to predict $\mu$ and $\sigma$, modeling the output score $r$ as a one-dimensional Gaussian distribution $r \sim \mathcal{N}(\mu, \sigma)$. This introduces an uncertainty aspect to the output score, helping to alleviate issues caused by annotator labeling uncertainty or errors. So, yes, it would require an architectural change to have two heads that output $\mu$ and $\sigma$.
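
As a hedged sketch of what such a two-head architecture and an uncertainty-aware pairwise loss might look like (one plausible instantiation of a Gaussian score $r \sim \mathcal{N}(\mu, \sigma)$ compared with a probit-style pairwise term; the exact HPSv3 loss may differ, and all names below are made up):

import torch
import torch.nn as nn

class UncertaintyHead(nn.Module):
    """Two linear heads on top of a shared feature vector: mean and std of the score."""
    def __init__(self, dim):
        super().__init__()
        self.mu_head = nn.Linear(dim, 1)
        self.log_sigma_head = nn.Linear(dim, 1)  # predict log σ so that σ stays positive

    def forward(self, feats):
        mu = self.mu_head(feats).squeeze(-1)
        sigma = self.log_sigma_head(feats).squeeze(-1).exp()
        return mu, sigma

def uncertainty_ranking_loss(mu_w, sigma_w, mu_l, sigma_l):
    """
    Treat each score as N(mu, sigma^2); their difference is N(mu_w - mu_l, sigma_w^2 + sigma_l^2),
    so P(winner beats loser) = Phi((mu_w - mu_l) / sqrt(sigma_w^2 + sigma_l^2)).
    """
    z = (mu_w - mu_l) / torch.sqrt(sigma_w**2 + sigma_l**2 + 1e-8)
    p_win = torch.distributions.Normal(0.0, 1.0).cdf(z)
    return -p_win.clamp_min(1e-8).log().mean()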