DPO (2023)

Paper Title: Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Link: https://arxiv.org/abs/2305.18290

Summary: Introduces DPO, which fine-tunes a language model directly on preference data without needing an explicit reward model. It simplifies the RLHF pipeline and often matches or outperforms PPO-based RLHF.

Related Works:

  • PPO (2017, General-purpose RL algorithm)
  • RLHF (2017, Original idea of learning from human preferences)
  • InstructGPT (2022, Combined SFT + reward model + PPO. It's an LLM aligned with preference learning via PPO, a direct precursor to DPO.)

DPO is designed to *replace* the RLHF (PPO-based) part of the pipeline, and it does so remarkably well.

| Aspect | PPO (RLHF) | DPO |
| --- | --- | --- |
| Needs reward model? | ✅ Yes | ❌ No |
| Training stability | ⚠️ Fragile | ✅ Stable |
| Easy to implement | ❌ No (RL required) | ✅ Yes (supervised style) |
| Performance | ⚠️ Good if tuned well | ✅ Strong and robust |
| Compute efficiency | ❌ Sample-heavy | ✅ More efficient |

Why PPO is hard to get right for LLMs

  1. Stability issues:
    • PPO is sensitive to hyperparameters like KL penalties and learning rates.
    • If improperly tuned, you can easily get mode collapse or degenerate outputs.
  2. Requires a reward model:
    • You need to train a separate reward model from human preference data.
    • This reward model is imperfect, can introduce bias, and must be carefully frozen during policy optimization.
    • Training it is an extra engineering burden and a source of noise.
  3. Sampling inefficiency:
    • PPO updates require sampling from the LM, evaluating with the reward model, and then optimizing the policy.
    • This makes it slow and computationally expensive.

Why DPO is easier and often better

  1. No explicit reward model:
    • DPO skips training a reward model entirely.
    • It directly fine-tunes the base LM to prefer preferred samples over rejected ones via a binary logistic loss.
  2. Supervised-style training:
    • DPO is formulated as a classification-style objective between two LM outputs (chosen vs. rejected).
    • This makes it simple, stable, and easily scalable.
  3. Better performance:
    • Empirically, DPO can match or outperform PPO in preference-based fine-tuning, especially when scaled.
  4. Minimal changes:
    • You just need pairwise preference data: (prompt, chosen response, rejected response).
    • It’s as plug-and-play as supervised fine-tuning.

Intuition

DPO is a preference-based fine-tuning method for LLMs that avoids reinforcement learning. Instead of training a reward model and doing policy optimization (like PPO in RLHF), DPO directly adjusts the language model to prefer preferred responses over rejected ones, using a simple classification-style loss.

  • ✅ No reward model needed
  • ✅ No sampling in the training loop (unlike PPO)
  • ✅ Just use pairs of completions + preferences
  • ✅ Simple, stable, efficient

Imagine you have this kind of data:

  • A prompt $x$
  • Two model completions: a preferred one $y_p$ and a rejected one $y_r$
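
Concretely, a single training example might be stored like this (field names and strings are purely illustrative):

example = {
    "prompt": "Explain DPO in one sentence.",
    "chosen": "DPO fine-tunes the model directly on preference pairs with a simple classification-style loss.",
    "rejected": "DPO first trains a reward model and then runs PPO against it.",
}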

Core Idea of Bradley-Terry (BT) Preference Model

We assume human preferences follow this softmax-like distribution:

$$P(y_p \succ y_r \mid x) = \frac{e^{r(x, y_p)}}{e^{r(x, y_p)} + e^{r(x, y_r)}}$$

where $r(x, y)$ is a latent reward function. This says: the higher the reward of $y_p$ compared to $y_r$, the more likely a human is to prefer $y_p$.
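
For example, if $r(x, y_p) = 1$ and $r(x, y_r) = 0$, then $P(y_p \succ y_r \mid x) = \frac{e^{1}}{e^{1} + e^{0}} \approx 0.73$: the preferred response wins about 73% of the time.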

The DPO Trick: Reparametrize Reward as Log Probabilities

Instead of learning a reward model $r(x, y)$, DPO reuses the language model's own probabilities. Let:

  • $\pi_\theta(y \mid x)$ be the current model (the policy we're updating).
  • $\pi_{\mathrm{ref}}(y \mid x)$ be the reference model (e.g., the original SFT model).
  • $\beta$ be a temperature-like hyperparameter.

We define:

$$r(x, y) := \beta \cdot \log\left(\frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}\right)$$

This means: the reward measures how much more likely the current model makes $y$ than the reference model does.
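
For example, if the current model assigns $y$ twice the probability the reference model does, the implicit reward is $r(x, y) = \beta \log 2 \approx 0.69\beta$; if the two models agree exactly, the reward is $0$.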

The DPO Loss and its Gradient Interpretation

Plug the above into the Bradley-Terry model and take the log-loss:

$$\mathcal{L}_{\mathrm{DPO}} = -\log \sigma\left(\beta \cdot \left[\log \frac{\pi_\theta(y_p \mid x)}{\pi_{\mathrm{ref}}(y_p \mid x)} - \log \frac{\pi_\theta(y_r \mid x)}{\pi_{\mathrm{ref}}(y_r \mid x)}\right]\right)$$

This is just a binary cross-entropy loss saying: make the preferred response more likely than the rejected one, relative to the reference model.
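
As a minimal PyTorch sketch of exactly this objective (assuming the `logp_*` arguments are summed log-probabilities of each full response under the policy and reference models; a fuller BT/PL version appears in the Code Sketch section below):

import torch
import torch.nn.functional as F

def dpo_loss(logp_theta_p, logp_ref_p, logp_theta_r, logp_ref_r, beta=0.1):
    """All arguments are [batch] tensors of sequence log-probabilities."""
    logits = beta * ((logp_theta_p - logp_ref_p) - (logp_theta_r - logp_ref_r))
    return -F.logsigmoid(logits).mean()  # -log σ(·), averaged over the batch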

The gradient update for DPO is:

$$\nabla_\theta \mathcal{L}_{\mathrm{DPO}} \;\propto\; \text{increase } \log \pi_\theta(y_p \mid x) \text{ and decrease } \log \pi_\theta(y_r \mid x)$$

The more wrong the model is about the preference (i.e., when it prefers the rejected one), the larger the gradient to fix it.
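
Written out more precisely (this is the gradient the DPO paper derives), with the implicit reward $\hat{r}_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}$:

$$\nabla_\theta \mathcal{L}_{\mathrm{DPO}} = -\beta\, \mathbb{E}\Big[\, \sigma\big(\hat{r}_\theta(x, y_r) - \hat{r}_\theta(x, y_p)\big)\,\big(\nabla_\theta \log \pi_\theta(y_p \mid x) - \nabla_\theta \log \pi_\theta(y_r \mid x)\big)\Big]$$

The sigmoid factor acts as a weight: it is close to 1 when the implicit reward ranks the rejected response above the preferred one, and close to 0 when the model already gets the preference right.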

Core Idea of Plackett–Luce (PL) Preference Model

The Bradley-Terry (BT) model used in DPO is a special case of the more general class of Plackett–Luce (PL) models. Both are probabilistic models of preferences, but they differ in how many items they rank and in the structure of the preference data.

Bradley-Terry (BT) Model

  • Handles pairwise preferences.
  • Given a prompt $x$ and two completions $y_1$ and $y_2$, the probability that $y_1 \succ y_2$ is:

$$P(y_1 \succ y_2 \mid x) = \frac{e^{r(x, y_1)}}{e^{r(x, y_1)} + e^{r(x, y_2)}}$$

Plackett-Luce (PL) Model

  • Handles full rankings or more than two items; the formulation works for any $k \ge 2$.
  • Given $k$ completions $y_1, y_2, \ldots, y_k$ that humans rank as $y_1 \succ y_2 \succ \cdots \succ y_k$, the model says:

$$P(y_1 \succ y_2 \succ \cdots \succ y_k \mid x) = \prod_{i=1}^{k} \frac{e^{r(x, y_i)}}{\sum_{j=i}^{k} e^{r(x, y_j)}}$$

  • Reduces to BT when $k = 2$ (worked out below).
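
To see the reduction, take $k = 2$: the $i = 2$ factor in the product is $e^{r(x, y_2)} / e^{r(x, y_2)} = 1$, so

$$P(y_1 \succ y_2 \mid x) = \frac{e^{r(x, y_1)}}{e^{r(x, y_1)} + e^{r(x, y_2)}},$$

which is exactly the BT probability above.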

You can extend DPO to the PL setting with the same reparameterization trick, where:

$$r(x, y) = \beta \cdot \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}$$

and define the PL likelihood over full rankings directly in terms of this $r(x, y)$, i.e., in terms of $\pi_\theta / \pi_{\mathrm{ref}}$.

The key insight still holds: you avoid learning an explicit reward function and instead express everything in terms of your model’s policy.

Code Sketch

A code sketch of how to implement the BT-DPO loss vs. the PL-DPO loss (generated by ChatGPT).

import torch
import torch.nn.functional as F

# Log-probabilities of each candidate completion: [batch, num_candidates]
# `model.log_prob` / `ref_model.log_prob` are assumed helpers that return the summed
# log-probability of each completion under π_θ and π_ref respectively.
log_probs_model = model.log_prob(x, y_candidates)  # log π_θ(y | x)
log_probs_ref = ref_model.log_prob(x, y_candidates)  # log π_ref(y | x)
beta = 0.1  # temperature-like hyperparameter


def bt_dpo_loss(log_probs_model, log_probs_ref, preferred_idx, rejected_idx, beta):
    # Implicit reward difference between preferred and rejected completions, shape [batch]
    reward_diff = beta * (
        log_probs_model[:, preferred_idx] - log_probs_ref[:, preferred_idx]
        - log_probs_model[:, rejected_idx] + log_probs_ref[:, rejected_idx]
    )
    # -log σ(reward_diff): binary cross-entropy with an all-ones target
    loss = F.binary_cross_entropy_with_logits(reward_diff, torch.ones_like(reward_diff))
    return loss


# Usage: each batch element has exactly two completions, yₚ at index 0 and yᵣ at index 1
# bt_loss = bt_dpo_loss(log_probs_model, log_probs_ref, 0, 1, beta=0.1)


def pl_dpo_loss(log_probs_model, log_probs_ref, ranks, beta):
    """
    log_probs_model, log_probs_ref: [batch, k] for k ranked completions
    ranks: [batch, k] indices sorted from best to worst
           (e.g., [2, 0, 1] means y₂ ≻ y₀ ≻ y₁)
    """
    batch_size, k = log_probs_model.shape
    loss = 0.0

    for i in range(k - 1):
        # Current top-ranked completion among the remaining candidates
        idx_i = ranks[:, i]  # shape [batch]
        # Candidates ranked below it
        idx_j = ranks[:, i + 1:]  # shape [batch, k - i - 1]

        r_i = beta * (log_probs_model.gather(1, idx_i.unsqueeze(1)) - log_probs_ref.gather(1, idx_i.unsqueeze(1)))  # [batch, 1]
        r_j = beta * (log_probs_model.gather(1, idx_j) - log_probs_ref.gather(1, idx_j))  # [batch, k - i - 1]

        # Plackett-Luce factor: log-softmax over the remaining items, keep the top-ranked one
        logits = torch.cat([r_i, r_j], dim=1)  # [batch, 1 + (k - i - 1)]
        log_softmax = F.log_softmax(logits, dim=1)[:, 0]

        loss += -log_softmax.mean()

    return loss / (k - 1)  # normalize across comparisons


# Usage: suppose you have 3 completions ranked as y₂ ≻ y₀ ≻ y₁
ranks = torch.tensor([[2, 0, 1]]).repeat(log_probs_model.shape[0], 1)  # [batch, 3]
pl_loss = pl_dpo_loss(log_probs_model, log_probs_ref, ranks, beta=0.1)
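
The sketch above assumes a `log_prob(x, y_candidates)` helper that returns one summed log-probability per candidate completion. A minimal way to compute such per-sequence log-probabilities with a Hugging Face causal LM could look like the following (this helper and its tensor layout are assumptions, not part of the original sketch); wrap the reference-model call in `torch.no_grad()`, since only the policy needs gradients:

import torch

def sequence_log_prob(model, input_ids, attention_mask, completion_mask):
    """
    Summed log-probability of the completion tokens in each sequence.
    input_ids, attention_mask, completion_mask: [batch, seq_len]
    completion_mask is 1 on completion tokens, 0 on prompt and padding tokens.
    """
    logits = model(input_ids=input_ids, attention_mask=attention_mask).logits  # [batch, seq_len, vocab]
    # Logits at position t predict the token at position t + 1, so shift by one
    shifted_logits = logits[:, :-1, :]
    labels = input_ids[:, 1:]
    mask = completion_mask[:, 1:].to(shifted_logits.dtype)
    token_logps = torch.log_softmax(shifted_logits, dim=-1).gather(2, labels.unsqueeze(-1)).squeeze(-1)  # [batch, seq_len - 1]
    return (token_logps * mask).sum(dim=-1)  # [batch]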

Extended Reads / Videos

HelpSteer2 (2024, Nvidia)

HelpSteer2-Preference: Complementing Ratings with Preferences

  • Shows that a regression (rating) objective combined with a Bradley-Terry preference objective can form a better preference model (rough sketch below).
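
As a hedged sketch of what "regression + BT" can look like on a scalar reward head (a generic illustration, not necessarily HelpSteer2-Preference's exact recipe; the function and argument names are made up):

import torch
import torch.nn.functional as F

def regression_plus_bt_loss(scores_chosen, scores_rejected,
                            ratings_chosen, ratings_rejected, lambda_bt=1.0):
    """
    scores_*: scalar reward-model outputs, shape [batch]
    ratings_*: human ratings (e.g., helpfulness scores) for the same responses, shape [batch]
    """
    # Regression term: keep the scalar scores calibrated to the absolute ratings
    reg = F.mse_loss(scores_chosen, ratings_chosen) + F.mse_loss(scores_rejected, ratings_rejected)
    # Bradley-Terry term: rank the chosen response above the rejected one
    bt = -F.logsigmoid(scores_chosen - scores_rejected).mean()
    return reg + lambda_bt * bt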

HPSv3 (Aug 2025)

HPSv3: Towards Wide-Spectrum Human Preference Score

  • Prior deterministic preference models assign equal confidence to all output predictions: during training, the model blindly assigns scores to samples without considering the uncertainty of its predictions.

The uncertainty-aware ranking loss addresses a key problem in human preference annotation: the inherent inconsistencies and potential errors in human judgments, especially in “hard cases” where it’s difficult to assign a definitive preference score.

Here’s a breakdown of the problems it solves:

  • Inconsistent Annotations: Human annotators can be subjective, and their preferences might not always be perfectly consistent. Traditional ranking models that assign a single, deterministic score might struggle when faced with these inconsistencies, leading to biased judgments.
  • Difficulty with Subtle Differences: When images have very subtle differences in quality or alignment with a prompt, human annotators may struggle to pick a clear winner, leading to low confidence in their choices. A deterministic model would still output a single score, potentially overstating its confidence in these ambiguous situations.
  • Ignoring Annotator Uncertainty: Prior models often treat all annotations with equal confidence. However, some annotations might be more reliable than others. The uncertainty-aware ranking explicitly models this variability by predicting a mean ($\mu$) and a variance ($\sigma$) for the preference score, rather than just a single score. This allows the model to “know” when it’s less certain about a prediction, which is crucial when dealing with noisy human labels.
  • Improved Ranking Accuracy: By accounting for the inherent uncertainty in human preferences, the model can leverage the underlying distribution of pairwise data more effectively. This leads to a more robust and reliable evaluation framework that better captures the nuances of human judgments, ultimately improving the overall ranking accuracy.

The paper mentions that uncertainty-aware ranking, unlike traditional RankNet, uses the last two linear layers to predict $\mu$ and $\sigma$, modeling the output score $r$ as a one-dimensional Gaussian distribution $r \sim \mathcal{N}(\mu, \sigma)$. This introduces an uncertainty aspect to the output score, helping to alleviate issues caused by annotator labeling uncertainty or errors. So, yes, it would require an architectural change to have two heads that output $\mu$ and $\sigma$.
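
As a hedged sketch of what such a two-head architecture and an uncertainty-aware pairwise loss might look like (one plausible instantiation of a Gaussian score $r \sim \mathcal{N}(\mu, \sigma)$ compared with a probit-style pairwise term; the exact HPSv3 loss may differ, and all names below are made up):

import torch
import torch.nn as nn

class UncertaintyHead(nn.Module):
    """Two linear heads on top of a shared feature vector: mean and std of the score."""
    def __init__(self, dim):
        super().__init__()
        self.mu_head = nn.Linear(dim, 1)
        self.log_sigma_head = nn.Linear(dim, 1)  # predict log σ so that σ stays positive

    def forward(self, feats):
        mu = self.mu_head(feats).squeeze(-1)
        sigma = self.log_sigma_head(feats).squeeze(-1).exp()
        return mu, sigma

def uncertainty_ranking_loss(mu_w, sigma_w, mu_l, sigma_l):
    """
    Treat each score as N(mu, sigma^2); their difference is N(mu_w - mu_l, sigma_w^2 + sigma_l^2),
    so P(winner beats loser) = Phi((mu_w - mu_l) / sqrt(sigma_w^2 + sigma_l^2)).
    """
    z = (mu_w - mu_l) / torch.sqrt(sigma_w**2 + sigma_l**2 + 1e-8)
    p_win = torch.distributions.Normal(0.0, 1.0).cdf(z)
    return -p_win.clamp_min(1e-8).log().mean()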