PPO vs DPO in RLHF
PPO (Traditional RLHF) #
Core Concept #
“Let me generate new text, score it with the reward model, and update my policy based on those scores.”
Pipeline #
Step 1: Train Reward Model (Ch7)
Preference Data → Reward Model
Step 2: RL Optimization (Ch11)
Policy generates text → RM scores it
→ PPO updates policy
(Repeat with new generations)
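Step 1 typically fits the reward model with a pairwise Bradley-Terry loss over the preference data. A minimal PyTorch sketch, assuming a `reward_model` callable that returns a scalar score tensor for each (prompt, response) pair; the names here are illustrative, not tied to a specific library:

import torch.nn.functional as F

def reward_model_loss(reward_model, prompts, chosen, rejected):
    """Bradley-Terry pairwise loss: push r(chosen) above r(rejected)."""
    r_chosen = reward_model(prompts, chosen)      # scalar score per pair
    r_rejected = reward_model(prompts, rejected)
    # -log sigmoid(r_c - r_r) is minimized when chosen reliably outscores rejected
    return -F.logsigmoid(r_chosen - r_rejected).mean()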
Chapter Flow #
Ch6 (Preference Data)
↓
Ch7 (Reward Model Training)
↓
Ch11 (PPO RL Optimization)
↓
Final Model
Example Code #
# PPO Training Loop
for batch in prompts:
    # Generate NEW responses with the current policy
    responses = policy.generate(batch)

    # Score them with the reward model
    rewards = reward_model(responses)

    # Compute advantages from rewards and the value head's estimates
    advantages = compute_advantages(rewards, values)

    # Update the policy with the clipped PPO objective
    policy.update_with_ppo(advantages)
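The `compute_advantages` step is usually generalized advantage estimation (GAE). A minimal sketch for a single trajectory, assuming `rewards` and `values` are plain per-step lists; `gamma` and `lam` are the standard discount and GAE parameters:

def compute_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation over one trajectory."""
    advantages, gae = [], 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]  # TD error at step t
        gae = delta + gamma * lam * gae                      # discounted sum of TD errors
        advantages.append(gae)
    return list(reversed(advantages))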
The Math #
J(θ) = E[ min(ratio · A, clip(ratio, 1-ε, 1+ε) · A) ] - β · D_KL(π_θ || π_ref)
Where:
- ratio = π_θ(a|s) / π_old(a|s)
- A = advantage estimate
- clip keeps the ratio within [1-ε, 1+ε] to prevent overly large policy updates
- D_KL = KL divergence from the frozen reference policy
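In code, the clipped surrogate plus KL penalty looks roughly like the sketch below. It is PyTorch-style and assumes per-token log-probabilities have already been gathered; the tensor names and the `beta`/`eps` defaults are illustrative:

import torch

def ppo_loss(logp_new, logp_old, logp_ref, advantages, eps=0.2, beta=0.05):
    """Clipped PPO surrogate with a KL penalty toward the frozen reference."""
    ratio = torch.exp(logp_new - logp_old)                 # pi_theta / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    surrogate = torch.min(unclipped, clipped).mean()       # J(theta), to be maximized
    kl = (logp_new - logp_ref).mean()                      # rough estimate of D_KL(pi || pi_ref)
    return -(surrogate - beta * kl)                        # negated so an optimizer can minimize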
Pros #
- ✅ Online learning: Discover new responses
- ✅ Best performance: Highest quality
- ✅ Adaptive: Uses current policy
- ✅ Proven: ChatGPT, Claude, etc.
Cons #
- ❌ Complex: RM + value + PPO
- ❌ Expensive: Multiple models
- ❌ High memory: RM + policy + ref
- ❌ Hard to tune: Many hyperparameters
Best For #
- 🎯 Maximum performance
- 🎯 Large compute budgets
- 🎯 Frontier models
- 🎯 Continuous improvement
Examples #
- ChatGPT (OpenAI)
- Claude (Anthropic)
- InstructGPT (OpenAI)
- DeepSeek-R1 (via GRPO, a PPO variant)
DPO (Direct Alignment) #
Core Concept #
“Let me directly learn from the preference data without generating anything new.”
Pipeline #
Single Step: Direct Optimization
Preference Data → DPO Loss
→ Update Policy
(No RM, no generation,
just gradient updates)
Chapter Flow #
Ch6 (Preference Data)
↓
Ch12 (DPO Direct Optimization)
↓
Final Model
(Skip Ch7 and Ch11!)
Example Code #
# DPO Training Loop
for (prompt, chosen, rejected) in preference_data:
    # No generation! Just compute the loss on the stored pair
    loss = -log(σ(
        β * log(P(chosen) / P_ref(chosen))
        - β * log(P(rejected) / P_ref(rejected))
    ))

    # Simple gradient descent on the policy
    policy.update(loss)
The Math #
L(θ) = -E[ log σ( β·log(π_θ(y_c|x)/π_ref(y_c|x)) - β·log(π_θ(y_r|x)/π_ref(y_r|x)) ) ]
Where:
- σ = sigmoid function
- β = controls the strength of the implicit KL penalty
- π_ref = frozen reference policy
- y_c, y_r = chosen and rejected responses
Secret: the policy defines an implicit reward
r(x, y) = β · log(π_θ(y|x) / π_ref(y|x))
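A minimal PyTorch sketch of this loss, assuming the summed log-probabilities of each full response under the policy and the frozen reference have already been computed; the argument names are illustrative:

import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss from sequence log-probs under the policy and the frozen reference."""
    # Implicit rewards: r(x, y) = beta * log(pi(y|x) / pi_ref(y|x))
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    # -log sigmoid(margin): raise the chosen response's implicit reward above the rejected one's
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()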
Pros #
- ✅ Simple: One loss function
- ✅ Memory efficient: No separate RM
- ✅ Cheaper: No RM inference
- ✅ Easy to implement: Standard training
- ✅ Accessible: For academic labs
Cons #
- ❌ Offline only: Can’t generate new data
- ❌ Data dependent: Limited by dataset
- ❌ Lower ceiling: May not match PPO
- ❌ Preference displacement: likelihood of both chosen and rejected responses can drop during training
Best For #
- 🎯 Research & experimentation
- 🎯 Limited compute
- 🎯 Academic settings
- 🎯 Quick iterations
- 🎯 Good fixed dataset
Examples #
- Zephyr-7B (HuggingFace)
- Llama 3 Instruct (Meta)
- Tülu 2 & 3 (AI2)
- Many open-source models
Side-by-Side Comparison #
| Aspect | PPO (Proximal Policy Optimization) | DPO (Direct Preference Optimization) |
|---|---|---|
| Approach | Two-stage: Train RM → Use RL | One-stage: Direct optimization from preferences |
| Reward Model | ✅ Needs separate RM (Ch7) | ❌ No separate RM needed (implicit) |
| Training Data | Generates new responses during training | Uses fixed preference dataset |
| Algorithm Type | Online RL (policy-gradient) | Offline optimization (supervised-style gradient descent) |
| Complexity | More complex: RM + value function + PPO | Simpler: Just one loss function |
| Memory | High: RM + value model + policy + reference model | Lower: Just policy + reference model |
| Data Usage | On-policy: Generates fresh data each step | Off-policy: Uses pre-collected data |
| Cost | More expensive: Forward passes through RM | Cheaper: No RM inference needed |
| When Popular | 2017-2023 (ChatGPT, InstructGPT) | 2023-present (Llama 3, Zephyr, Tülu) |