Data Preparation in RLHF -- Ch6 (Preference Data) vs Ch9 (SFT Data)

Data Preparation Comparison: Ch6 vs Ch9 #

| Aspect | Ch6: Preference Data | Ch9: SFT/IFT Data |
| --- | --- | --- |
| Data structure | Pairwise comparisons: `(prompt, chosen_response, rejected_response)` | Single examples: `(prompt, good_response)` |
| Purpose | Learn to **judge** which response is better | Learn to **generate** good responses |
| Data format example | `{"prompt": "What is 2+2?", "chosen": "The answer is 4", "rejected": "5"}` | `{"prompt": "What is 2+2?", "response": "The answer is 4"}` |
| Collection methods | Side-by-side comparison UIs, Likert scales (5-point, 8-point), thumbs up/down, ChatBotArena | Human-written high-quality examples, curated Q&A pairs, single demonstrations |
| Data source | Human labelers comparing responses, or structured/synthetic pairs (correct vs. incorrect math, with vs. without a constraint) | Human-written completions, curated high-quality examples, or synthetic data from stronger models |
| Signal type | Comparative/relative: which response is better? | Absolute: this is a good response |
| Typical dataset size | ~100K preference pairs (InstructGPT) | ~10K to ~1M examples (InstructGPT: 10K; modern datasets: ~1M) |
| Multi-turn handling | Preference is collected on the final turn only; the conversation continues with the “chosen” answer; previous turns are masked from the loss | Each turn becomes a separate training example (an N-turn conversation unrolls into N examples); prompts and previous turns are masked from the loss |
| Used to train | Ch7: reward model | Ch9: SFT model (initial policy) |
| Training objective | L = -log σ(r(chosen) - r(rejected)) | L = -log P(response \| prompt) |
| Next-stage usage | Ch11: RL training (the same or similar prompts can be reused; the RM provides scores) | Ch7: base model for the RM; Ch11: starting policy for RL (the policy generates responses) |
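
To make the two training objectives in the table concrete, here is a minimal PyTorch sketch (the function names and tensor shapes are illustrative assumptions, not code from the chapters): the reward model is trained on the score margin between a chosen and a rejected response, while SFT is ordinary next-token cross-entropy with the prompt masked out of the loss.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Ch6-style pairwise objective: L = -log sigmoid(r(chosen) - r(rejected)).
    r_chosen / r_rejected are scalar scores per preference pair, shape (batch,)."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

def sft_loss(logits: torch.Tensor, target_ids: torch.Tensor, loss_mask: torch.Tensor) -> torch.Tensor:
    """Ch9-style objective: L = -log P(response | prompt), i.e. next-token cross-entropy
    with prompt (and previous-turn) tokens masked out of the loss.
    logits: (batch, seq, vocab); target_ids, loss_mask: (batch, seq)."""
    logprobs = F.log_softmax(logits, dim=-1)
    token_logprobs = logprobs.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)
    mask = loss_mask.float()
    return -(token_logprobs * mask).sum() / mask.sum()
```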

Key Insight: The DATA TYPES are FUNDAMENTALLY DIFFERENT #

  • Ch9 SFT Data Says: “Here’s a good response. Learn to generate this.”
{
  "prompt": "Write a poem about goldfish",
  "response": "Golden swimmer, circling slow..."
}
  • Ch6 Preference Data Says: “Between these two responses, A is better than B. Learn to prefer A.”
{
  "prompt": "Write a poem about goldfish",
  "chosen": "Golden swimmer, circling slow... (follows constraint)",
  "rejected": "In circles bright, the goldfish glides... (violates constraint)"
}
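
A minimal sketch of how these two record types turn into training inputs (the `tokenizer` and helper names here are assumptions for illustration, not from the book): an SFT record becomes a single sequence whose prompt tokens are masked from the loss, while a preference record becomes a chosen/rejected pair of sequences that the reward model scores.

```python
def build_sft_example(tokenizer, record):
    """Ch9 record -> one sequence; only response tokens contribute to the loss."""
    prompt_ids = tokenizer.encode(record["prompt"])
    response_ids = tokenizer.encode(record["response"])
    return {
        "input_ids": prompt_ids + response_ids,
        "loss_mask": [0] * len(prompt_ids) + [1] * len(response_ids),
    }

def build_preference_example(tokenizer, record):
    """Ch6 record -> two sequences sharing a prompt; the reward model scores each
    and is trained on the margin between them."""
    prompt_ids = tokenizer.encode(record["prompt"])
    return {
        "chosen_ids": prompt_ids + tokenizer.encode(record["chosen"]),
        "rejected_ids": prompt_ids + tokenizer.encode(record["rejected"]),
    }
```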

The Complete RLHF Pipeline #

┌─────────────────────┐
│   Pretrained Model  │
└──────────┬──────────┘
           │
           ├──────────────────────────────┐
           ↓                              ↓
    ┌─────────────┐              ┌───────────────┐
    │  Ch9: SFT   │              │  Ch6: Collect │
    │             │              │  Preference   │
    │ Data:       │              │  Data         │
    │ (prompt,    │              │               │
    │  response)  │              │ Data:         │
    │             │              │ (prompt,      │
    │ Single good │              │  chosen,      │
    │ examples    │              │  rejected)    │
    └──────┬──────┘              │               │
           │                     │ Comparisons   │
           │                     └───────┬───────┘
           │                             │
           ↓                             ↓
    ┌─────────────┐              ┌───────────────┐
    │  SFT Model  │──base model─→│   Ch7: RM     │
    │  (Policy    │              │   Training    │
    │  starting   │              │               │
    │  point)     │              │ Uses Ch6 data │
    └──────┬──────┘              └───────┬───────┘
           │                             │
           │                             │
           └──────────┬──────────────────┘
                      ↓
              ┌───────────────┐
              │  Ch11: RL     │
              │  Optimization │
              │               │
              │ Policy: Ch9   │
              │ Scorer: Ch7   │
              │ Prompts: Ch6  │
              │ (or similar)  │
              └───────────────┘
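
The same flow, written as a minimal wiring sketch in Python (the three stage functions are passed in as callables; their names and signatures are illustrative assumptions, not the book's API):

```python
from typing import Any, Callable, Sequence

def rlhf_pipeline(
    pretrained: Any,
    sft_data: Sequence[dict],         # Ch9: {"prompt": ..., "response": ...}
    preference_data: Sequence[dict],  # Ch6: {"prompt": ..., "chosen": ..., "rejected": ...}
    rl_prompts: Sequence[str],        # Ch11: prompts reused from Ch6 (or similar)
    supervised_finetune: Callable,    # Ch9 training stage
    train_reward_model: Callable,     # Ch7 training stage
    rl_optimize: Callable,            # Ch11 training stage
) -> Any:
    """Wires the diagram above: SFT produces the starting policy, the RM produces the
    scorer, and RL optimizes the policy against the RM on reused prompts."""
    sft_model = supervised_finetune(pretrained, sft_data)
    reward_model = train_reward_model(sft_model, preference_data)
    return rl_optimize(policy=sft_model, scorer=reward_model, prompts=rl_prompts)
```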

Summary #

The fundamental difference between Ch6 and Ch9 data lies in their learning objectives:

  • Ch6 (Preference Data): Teaches models to discriminate between better and worse responses through pairwise comparisons, ultimately training a Reward Model
  • Ch9 (SFT Data): Teaches models to generate appropriate responses through demonstration, creating the initial policy for RL optimization

Both are essential but serve distinct roles in the RLHF pipeline, with Ch9 establishing generation capabilities and Ch6 enabling quality assessment.