The Complete InstructGPT Recipe (Ch 4.2.1)
Two Separate, Sequentially Prepared Datasets for RLHF #
| Attribute | Dataset 1: SFT Data (collected first) | Dataset 2: Preference Data (collected second) |
|---|---|---|
| Size | ~10K examples (InstructGPT); ~1M in modern pipelines | ~100K preference pairs (InstructGPT) |
| Format | (prompt, good_response) - SINGLE examples | (prompt, chosen, rejected) - PAIRWISE comparisons |
| Source | Human-written OR synthetic from strong models | Human labelers comparing SFT model outputs |
| Purpose | Teach the model HOW TO RESPOND in chat format | Teach what GOOD vs. BAD responses look like |
| When Collected | BEFORE preference data collection | AFTER the SFT model exists (the SFT model generates the responses to compare) |
| Used to Train | SFT Model (Ch9) | Reward Model (Ch7) |
| Example | `{"prompt": "What is machine learning?", "response": "Machine learning is a branch of AI that..."}` | `{"prompt": "What is machine learning?", "chosen": "Machine learning is a branch of AI that enables...", "rejected": "ML is when computers learn stuff."}` |
When to Prepare the First and Second Dataset #
┌──────────┐
│ Ch9 SFT  │ ← FIRST: Prepare SFT data
└────┬─────┘
     │
     ↓
[Train SFT Model]
     │
     ├──────────────────┐
     ↓                  ↓
┌──────────┐     Generate responses
│ Use in   │     for humans to compare
│ Ch7 RM   │            │
└────┬─────┘            ↓
     │           ┌───────────────┐
     │           │ Ch6 Pref Data │ ← SECOND: Collect preferences
     │           └───────┬───────┘
     │                   │
     ↓                   ↓
  Ch7 RM          Use RM in Ch11
     │                   │
     └─────────┬─────────┘
               ↓
         Ch11 RL (PPO)
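
The "Generate responses for humans to compare" arm of the timeline can be sketched as follows. This is a hedged illustration, not the InstructGPT tooling: it assumes a Hugging Face causal-LM checkpoint at the illustrative path `./sft-model` and samples two candidate completions per prompt, so a labeler can later mark one as chosen and the other as rejected (producing the Ch6 preference pairs).

```python
import json
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "./sft-model"  # illustrative path to the Step-1 SFT checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(MODEL_PATH)

prompts = ["What is machine learning?"]  # in practice, thousands of prompts

comparison_tasks = []
for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt")
    # Sample two different completions for the same prompt; temperature > 0
    # gives the labeler genuinely distinct candidates to rank.
    outputs = model.generate(
        **inputs,
        do_sample=True,
        temperature=0.9,
        max_new_tokens=256,
        num_return_sequences=2,
        pad_token_id=tokenizer.eos_token_id,
    )
    candidates = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    # Store an unlabeled pair; a human later decides which is chosen/rejected.
    comparison_tasks.append({"prompt": prompt, "candidates": candidates})

with open("comparison_tasks.jsonl", "w") as f:  # illustrative file name
    for task in comparison_tasks:
        f.write(json.dumps(task) + "\n")
```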
The Complete InstructGPT Recipe (Ch 4.2.1) #
┌──────────────────────────────────────────────────┐
│              Pretrained Base Model               │
└────────────────────────┬─────────────────────────┘
                         │
                         ↓
┌──────────────────────────────────────────────────┐
│ STEP 1: Ch9 SFT                                  │
│ • Prepare single good examples                   │
│ • Train model on (prompt, response) pairs        │
│ • Output: SFT Model (can generate responses)     │
└────────────────────────┬─────────────────────────┘
                         │
                         ├───────────────────┐
                         ↓                   ↓
┌──────────────────────────────┐  ┌─────────────────────┐
│ STEP 2a: Ch6 Data Collection │  │ STEP 2b: Ch7 RM     │
│ • Use SFT to generate        │  │ • Start from SFT    │
│ • Humans compare outputs     │─→│ • Train on Ch6 data │
│ • Create preference pairs    │  │ • Output: RM        │
└──────────────────────────────┘  └──────────┬──────────┘
                                             │
                         ┌───────────────────┘
                         ↓
┌──────────────────────────────────────────────────┐
│ STEP 3: Ch11 RL Optimization                     │
│ • Policy: SFT Model (from Step 1)                │
│ • Scorer: Reward Model (from Step 2)             │
│ • Optimize policy using RM feedback              │
│ • Output: Final RLHF-trained Model               │
└──────────────────────────────────────────────────┘
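
The core of STEP 2b is the pairwise loss used in InstructGPT-style reward modeling: the reward model should score the chosen response above the rejected one, i.e. minimize -log sigmoid(r(x, y_chosen) - r(x, y_rejected)). Below is a minimal PyTorch sketch of just that loss; the score tensors are illustrative stand-ins for the outputs of a real reward model on one batch of Ch6 preference pairs.

```python
import torch
import torch.nn.functional as F

def pairwise_rm_loss(chosen_scores: torch.Tensor,
                     rejected_scores: torch.Tensor) -> torch.Tensor:
    """Pairwise reward-model loss: -log sigmoid(r_chosen - r_rejected).

    Each tensor holds one scalar score per (prompt, response) pair in the batch.
    Minimizing this pushes the reward model to score chosen responses higher
    than rejected ones, which is exactly what the Ch6 preference pairs encode.
    """
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Illustrative stand-in for a real reward model's scores on one batch:
chosen_scores = torch.tensor([1.2, 0.4, 2.1])     # r(x, y_chosen)
rejected_scores = torch.tensor([0.3, 0.9, -0.5])  # r(x, y_rejected)

loss = pairwise_rm_loss(chosen_scores, rejected_scores)
print(f"pairwise RM loss: {loss.item():.4f}")
```

In STEP 3, the score this reward model assigns (combined with a KL penalty that keeps the policy close to the SFT model) becomes the scalar reward that PPO maximizes.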