PPO vs DPO in RLHF

PPO (Traditional RLHF) #

Core Concept #

“Let me generate new text, score it with the reward model, and update my policy based on those scores.”

Pipeline #

Step 1: Train Reward Model (Ch7)
   Preference Data → Reward Model
   
Step 2: RL Optimization (Ch11)
   Policy generates text → RM scores it 
   → PPO updates policy
   (Repeat with new generations)
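
Step 1 above fits the reward model to the preference pairs. Below is a minimal PyTorch sketch of the standard Bradley-Terry pairwise loss; `reward_model` is a hypothetical scorer that maps (prompt, response) to a scalar.

import torch.nn.functional as F

def reward_model_loss(score_chosen, score_rejected):
    """Bradley-Terry pairwise loss for reward model training (Step 1).

    score_chosen / score_rejected: scalar scores the reward model assigns
    to the preferred and rejected responses for the same prompt.
    The loss pushes score_chosen above score_rejected.
    """
    return -F.logsigmoid(score_chosen - score_rejected).mean()

# Hypothetical usage:
# score_chosen   = reward_model(prompt, chosen)
# score_rejected = reward_model(prompt, rejected)
# loss = reward_model_loss(score_chosen, score_rejected)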

Chapter Flow #

Ch6 (Preference Data)
    ↓
Ch7 (Reward Model Training)
    ↓
Ch11 (PPO RL Optimization)
    ↓
Final Model

Example Code #

# PPO Training Loop (conceptual sketch)
for batch in prompts:
    # Generate NEW responses with the current policy
    responses = policy.generate(batch)
    
    # Score them with the reward model
    rewards = reward_model(responses)
    
    # Estimate values with the critic, then compute
    # advantages (e.g. GAE, sketched below)
    values = value_model(responses)
    advantages = compute_advantages(rewards, values)
    
    # Clipped PPO update of the policy
    policy.update_with_ppo(advantages)
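
The `compute_advantages` step above is usually Generalized Advantage Estimation (GAE). A minimal NumPy sketch, assuming `values` carries one extra bootstrap entry beyond the last step; `gamma` and `lam` are hypothetical settings.

import numpy as np

def compute_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation (GAE) for a single trajectory.

    rewards: per-step rewards (in RLHF, often zero everywhere except the
             final step, which receives the reward-model score, usually
             mixed with a per-token KL penalty)
    values:  critic estimates V(s_t), with one extra bootstrap value at
             the end, so len(values) == len(rewards) + 1
    """
    advantages = np.zeros(len(rewards))
    gae = 0.0
    for t in reversed(range(len(rewards))):
        # TD error: r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages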

The Math #

J(θ) = E[min(ratio * A, 
             clip(ratio, 1-ε, 1+ε) * A)] 
       - β * D_KL(π || π_ref)

Where:
- ratio = π_θ(a|s) / π_old(a|s)
- A = advantage estimate
- clip prevents overly large policy updates
- D_KL = KL divergence from the frozen reference policy
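
A minimal PyTorch sketch of this objective, written as a loss to minimize. The tensors `logp_new`, `logp_old`, `advantages`, and the KL estimate `kl_to_ref` are assumed inputs; `eps=0.2` and `beta=0.1` are hypothetical values.

import torch

def ppo_clipped_loss(logp_new, logp_old, advantages, kl_to_ref,
                     eps=0.2, beta=0.1):
    """Clipped PPO objective from above, negated so optimizers can minimize.

    logp_new / logp_old: log-probs of the sampled tokens under the current
    and old policy; advantages: A; kl_to_ref: per-sample estimate of
    KL(π || π_ref).
    """
    ratio = torch.exp(logp_new - logp_old)            # π_θ / π_old
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    surrogate = torch.min(ratio * advantages, clipped * advantages)
    # Negate J(θ) because optimizers minimize; add the KL penalty
    return -surrogate.mean() + beta * kl_to_ref.mean()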

Pros #

  • Online learning: Discover new responses
  • Best performance: Highest quality
  • Adaptive: Uses current policy
  • Proven: ChatGPT, Claude, etc.

Cons #

  • Complex: RM + value + PPO
  • Expensive: Multiple models
  • High memory: RM + policy + ref
  • Hard to tune: Many hyperparameters

Best For #

  • 🎯 Maximum performance
  • 🎯 Large compute budgets
  • 🎯 Frontier models
  • 🎯 Continuous improvement

Examples #

  • ChatGPT (OpenAI)
  • Claude (Anthropic)
  • InstructGPT (OpenAI)
  • DeepSeek-R1 (trained with GRPO, a PPO-style online RL method)

DPO (Direct Alignment) #

Core Concept #

“Let me directly learn from the preference data without generating anything new.”

Pipeline #

Single Step: Direct Optimization
   Preference Data → DPO Loss 
   → Update Policy
   (No RM, no generation, 
    just gradient updates)

Chapter Flow #

Ch6 (Preference Data)
    ↓
Ch12 (DPO Direct Optimization)
    ↓
Final Model

(Skip Ch7 and Ch11!)

Example Code #

# DPO Training Loop (conceptual sketch)
for prompt, chosen, rejected in preference_data:
    
    # No generation! Just compare log-probs under the
    # current policy and the frozen reference model
    logp_c = policy.log_prob(prompt, chosen)
    logp_r = policy.log_prob(prompt, rejected)
    ref_c  = reference.log_prob(prompt, chosen)
    ref_r  = reference.log_prob(prompt, rejected)
    
    # DPO loss: -log sigmoid of the reward margin
    loss = -log_sigmoid(
        beta * (logp_c - ref_c)
        - beta * (logp_r - ref_r)
    )
    
    # Simple gradient descent on the policy
    policy.update(loss)

The Math #

L = -E[log σ(
      β * log(π(y_c|x)/π_ref(y_c|x)) 
      - β * log(π(y_r|x)/π_ref(y_r|x))
    )]

Where:
- σ = sigmoid function
- β = controls the strength of the KL penalty
- y_c, y_r = chosen and rejected responses
- π_ref = frozen reference policy

The secret: DPO optimizes an implicit reward

r(x,y) = β * log(π(y|x) / π_ref(y|x))
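
A minimal sketch of that implicit reward, assuming you already have the summed log-probability of each response under the policy and the frozen reference; `beta=0.1` is a hypothetical setting.

def implicit_reward(logp, ref_logp, beta=0.1):
    """DPO's implicit reward: r(x, y) = beta * log(pi(y|x) / pi_ref(y|x)).

    logp / ref_logp: summed log-probability of a response under the policy
    and the frozen reference model.
    """
    return beta * (logp - ref_logp)

# A common training diagnostic is "reward accuracy": the fraction of pairs
# whose chosen response gets a higher implicit reward than the rejected one.
# margin = (implicit_reward(logp_chosen, ref_chosen)
#           - implicit_reward(logp_rejected, ref_rejected))
# reward_accuracy = (margin > 0).float().mean()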

Pros #

  • Simple: One loss function
  • Memory efficient: No separate RM
  • Cheaper: No RM inference
  • Easy to implement: Standard training
  • Accessible: For academic labs

Cons #

  • Offline only: Can’t generate new data
  • Data dependent: Limited by dataset
  • Lower ceiling: May not match PPO
  • Likelihood displacement: log-probs of both chosen and rejected responses can drop during training

Best For #

  • 🎯 Research & experimentation
  • 🎯 Limited compute
  • 🎯 Academic settings
  • 🎯 Quick iterations
  • 🎯 Good fixed dataset

Examples #

  • Zephyr-7B (HuggingFace)
  • Llama 3 Instruct (Meta)
  • Tülu 2 & 3 (AI2)
  • Many open-source models

Quick Comparison: PPO vs DPO #

| Aspect | PPO (Proximal Policy Optimization) | DPO (Direct Preference Optimization) |
|---|---|---|
| Approach | Two-stage: train RM → use RL | One-stage: direct optimization from preferences |
| Reward Model | Needs a separate RM (Ch7) | No separate RM needed (implicit) |
| Training Data | Generates new responses during training | Uses a fixed preference dataset |
| Algorithm Type | Online RL (policy gradient) | Offline optimization (simple gradient descent) |
| Complexity | More complex: RM + value function + PPO | Simpler: just one loss function |
| Memory | High: RM + policy + reference model | Lower: just policy + reference model |
| Data Usage | On-policy: generates fresh data each step | Off-policy: uses pre-collected data |
| Cost | More expensive: forward passes through the RM | Cheaper: no RM inference needed |
| When Popular | 2017–2023 (ChatGPT, InstructGPT) | 2023–present (Llama 3, Zephyr, Tülu) |