The Complete InstructGPT Recipe (Ch 4.2.1)
Two Separate, Sequentially Prepared Datasets for RLHF #
| Attribute | Dataset 1: SFT Data (collected first) | Dataset 2: Preference Data (collected second) |
|---|---|---|
| Size | ~10K examples (InstructGPT); ~1M in modern pipelines | ~100K preference pairs (InstructGPT) |
| Format | (prompt, good_response) - SINGLE examples | (prompt, chosen, rejected) - PAIRWISE comparisons |
| Source | Human-written OR synthetic from strong models | Human labelers comparing SFT model outputs |
| Purpose | Teach the model HOW TO RESPOND in chat format | Teach what GOOD vs. BAD responses look like |
| When Collected | BEFORE preference data collection | AFTER the SFT model exists (the SFT model generates the responses to compare) |
| Used to Train | SFT Model (Ch9) | Reward Model (Ch7) |
| Example | `{"prompt": "What is machine learning?", "response": "Machine learning is a branch of AI that..."}` | `{"prompt": "What is machine learning?", "chosen": "Machine learning is a branch of AI that enables...", "rejected": "ML is when computers learn stuff."}` |
When to Prepare the First and Second Dataset #
┌──────────┐
│ Ch9 SFT  │ ← FIRST: Prepare SFT data
└────┬─────┘
     │
     ↓
[Train SFT Model]
     │
     ├──────────────────┐
     ↓                  ↓
┌──────────┐     Generate responses
│ Use in   │     for humans to compare
│ Ch7 RM   │            │
└────┬─────┘            ↓
     │           ┌───────────────┐
     │           │ Ch6 Pref Data │ ← SECOND: Collect preferences
     │           └───────┬───────┘
     │                   │
     ↓                   ↓
  Ch7 RM          Use RM in Ch11
     │                   │
     └─────────┬─────────┘
               ↓
         Ch11 RL (PPO)
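
The "Generate responses for humans to compare" arm of the timeline can be sketched as follows. This is a hedged illustration, not the InstructGPT tooling: it assumes a Hugging Face causal-LM checkpoint at the illustrative path `./sft-model` and samples two candidate completions per prompt, so a labeler can later mark one as chosen and the other as rejected (producing the Ch6 preference pairs).

```python
import json
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "./sft-model"  # illustrative path to the Step-1 SFT checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(MODEL_PATH)

prompts = ["What is machine learning?"]  # in practice, thousands of prompts

comparison_tasks = []
for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt")
    # Sample two different completions for the same prompt; temperature > 0
    # gives the labeler genuinely distinct candidates to rank.
    outputs = model.generate(
        **inputs,
        do_sample=True,
        temperature=0.9,
        max_new_tokens=256,
        num_return_sequences=2,
        pad_token_id=tokenizer.eos_token_id,
    )
    candidates = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    # Store an unlabeled pair; a human later decides which is chosen/rejected.
    comparison_tasks.append({"prompt": prompt, "candidates": candidates})

with open("comparison_tasks.jsonl", "w") as f:  # illustrative file name
    for task in comparison_tasks:
        f.write(json.dumps(task) + "\n")
```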
The Complete InstructGPT Recipe (Ch 4.2.1) #
┌──────────────────────────────────────────────────┐
│              Pretrained Base Model               │
└────────────────────────┬─────────────────────────┘
                         │
                         ↓
┌──────────────────────────────────────────────────┐
│ STEP 1: Ch9 SFT                                  │
│ • Prepare single good examples                   │
│ • Train model on (prompt, response) pairs        │
│ • Output: SFT Model (can generate responses)     │
└────────────────────────┬─────────────────────────┘
                         │
                         ├───────────────────┐
                         ↓                   ↓
┌──────────────────────────────┐  ┌─────────────────────┐
│ STEP 2a: Ch6 Data Collection │  │ STEP 2b: Ch7 RM     │
│ • Use SFT to generate        │  │ • Start from SFT    │
│ • Humans compare outputs     │─→│ • Train on Ch6 data │
│ • Create preference pairs    │  │ • Output: RM        │
└──────────────────────────────┘  └──────────┬──────────┘
                                             │
                         ┌───────────────────┘
                         ↓
┌──────────────────────────────────────────────────┐
│ STEP 3: Ch11 RL Optimization                     │
│ • Policy: SFT Model (from Step 1)                │
│ • Scorer: Reward Model (from Step 2)             │
│ • Optimize policy using RM feedback              │
│ • Output: Final RLHF-trained Model               │
└──────────────────────────────────────────────────┘
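
The core of STEP 2b is the pairwise loss used in InstructGPT-style reward modeling: the reward model should score the chosen response above the rejected one, i.e. minimize -log sigmoid(r(x, y_chosen) - r(x, y_rejected)). Below is a minimal PyTorch sketch of just that loss; the score tensors are illustrative stand-ins for the outputs of a real reward model on one batch of Ch6 preference pairs.

```python
import torch
import torch.nn.functional as F

def pairwise_rm_loss(chosen_scores: torch.Tensor,
                     rejected_scores: torch.Tensor) -> torch.Tensor:
    """Pairwise reward-model loss: -log sigmoid(r_chosen - r_rejected).

    Each tensor holds one scalar score per (prompt, response) pair in the batch.
    Minimizing this pushes the reward model to score chosen responses higher
    than rejected ones, which is exactly what the Ch6 preference pairs encode.
    """
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Illustrative stand-in for a real reward model's scores on one batch:
chosen_scores = torch.tensor([1.2, 0.4, 2.1])     # r(x, y_chosen)
rejected_scores = torch.tensor([0.3, 0.9, -0.5])  # r(x, y_rejected)

loss = pairwise_rm_loss(chosen_scores, rejected_scores)
print(f"pairwise RM loss: {loss.item():.4f}")
```

In STEP 3, the score this reward model assigns (combined with a KL penalty that keeps the policy close to the SFT model) becomes the scalar reward that PPO maximizes.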