Data Preparation in RLHF -- Ch6 (Preference Data) vs Ch9 (SFT Data)

Data Preparation Comparison: Ch6 vs Ch9 #

| Aspect | Ch6: Preference Data | Ch9: SFT/IFT Data |
| --- | --- | --- |
| Data structure | Pairwise comparisons: `(prompt, chosen_response, rejected_response)` | Single examples: `(prompt, good_response)` |
| Purpose | Learn to **judge** which response is better | Learn to **generate** good responses |
| Data format example | `{"prompt": "What is 2+2?", "chosen": "The answer is 4", "rejected": "5"}` | `{"prompt": "What is 2+2?", "response": "The answer is 4"}` |
| Collection methods | Side-by-side comparison UIs, Likert scales (5-point, 8-point), thumbs up/down, ChatBotArena | Human-written high-quality examples, curated Q&A pairs, single demonstrations |
| Data source | Human labelers comparing responses, or structured/synthetic pairs (correct vs. incorrect math, with vs. without a constraint) | Human-written completions, curated high-quality examples, or synthetic data from stronger models |
| Signal type | Comparative/relative: which response is better? | Absolute: this is a good response |
| Typical dataset size | ~100K preference pairs (InstructGPT) | ~10K to ~1M examples (InstructGPT: 10K; modern datasets: ~1M) |
| Multi-turn handling | Preference is collected on the final turn only; the conversation continues with the “chosen” answer; previous turns are masked from the loss | Each turn becomes a separate training example (an N-turn conversation unrolls into N examples); prompts and previous turns are masked from the loss |
| Used to train | Ch7: reward model | Ch9: SFT model (initial policy) |
| Training objective | L = -log σ(r(chosen) - r(rejected)) | L = -log P(response \| prompt) |
| Next-stage usage | Ch11: RL training (the same or similar prompts can be reused; the RM provides scores) | Ch7: base model for the RM; Ch11: starting policy for RL (the policy generates responses) |
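
To make the two training objectives in the table concrete, here is a minimal PyTorch sketch (the function names and tensor shapes are illustrative assumptions, not code from the chapters): the reward model is trained on the score margin between a chosen and a rejected response, while SFT is ordinary next-token cross-entropy with the prompt masked out of the loss.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Ch6-style pairwise objective: L = -log sigmoid(r(chosen) - r(rejected)).
    r_chosen / r_rejected are scalar scores per preference pair, shape (batch,)."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

def sft_loss(logits: torch.Tensor, target_ids: torch.Tensor, loss_mask: torch.Tensor) -> torch.Tensor:
    """Ch9-style objective: L = -log P(response | prompt), i.e. next-token cross-entropy
    with prompt (and previous-turn) tokens masked out of the loss.
    logits: (batch, seq, vocab); target_ids, loss_mask: (batch, seq)."""
    logprobs = F.log_softmax(logits, dim=-1)
    token_logprobs = logprobs.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)
    mask = loss_mask.float()
    return -(token_logprobs * mask).sum() / mask.sum()
```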

Key Insight: The DATA TYPES are FUNDAMENTALLY DIFFERENT #

  • Ch9 SFT Data Says: “Here’s a good response. Learn to generate this.”
{
  "prompt": "Write a poem about goldfish",
  "response": "Golden swimmer, circling slow..."
}
  • Ch6 Preference Data Says: “Between these two responses, A is better than B. Learn to prefer A.”
{
  "prompt": "Write a poem about goldfish",
  "chosen": "Golden swimmer, circling slow... (follows constraint)",
  "rejected": "In circles bright, the goldfish glides... (violates constraint)"
}
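
A minimal sketch of how these two record types turn into training inputs (the `tokenizer` and helper names here are assumptions for illustration, not from the book): an SFT record becomes a single sequence whose prompt tokens are masked from the loss, while a preference record becomes a chosen/rejected pair of sequences that the reward model scores.

```python
def build_sft_example(tokenizer, record):
    """Ch9 record -> one sequence; only response tokens contribute to the loss."""
    prompt_ids = tokenizer.encode(record["prompt"])
    response_ids = tokenizer.encode(record["response"])
    return {
        "input_ids": prompt_ids + response_ids,
        "loss_mask": [0] * len(prompt_ids) + [1] * len(response_ids),
    }

def build_preference_example(tokenizer, record):
    """Ch6 record -> two sequences sharing a prompt; the reward model scores each
    and is trained on the margin between them."""
    prompt_ids = tokenizer.encode(record["prompt"])
    return {
        "chosen_ids": prompt_ids + tokenizer.encode(record["chosen"]),
        "rejected_ids": prompt_ids + tokenizer.encode(record["rejected"]),
    }
```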

The Complete RLHF Pipeline #

┌─────────────────────┐
│   Pretrained Model  │
└──────────┬──────────┘
           │
           ├──────────────────────────────┐
           ↓                              ↓
    ┌─────────────┐              ┌───────────────┐
    │  Ch9: SFT   │              │  Ch6: Collect │
    │             │              │  Preference   │
    │ Data:       │              │  Data         │
    │ (prompt,    │              │               │
    │  response)  │              │ Data:         │
    │             │              │ (prompt,      │
    │ Single good │              │  chosen,      │
    │ examples    │              │  rejected)    │
    └──────┬──────┘              │               │
           │                     │ Comparisons   │
           │                     └───────┬───────┘
           │                             │
           ↓                             ↓
    ┌─────────────┐              ┌───────────────┐
    │  SFT Model  │──base model─→│   Ch7: RM     │
    │  (Policy    │              │   Training    │
    │  starting   │              │               │
    │  point)     │              │ Uses Ch6 data │
    └──────┬──────┘              └───────┬───────┘
           │                             │
           │                             │
           └──────────┬──────────────────┘
                      ↓
              ┌───────────────┐
              │  Ch11: RL     │
              │  Optimization │
              │               │
              │ Policy: Ch9   │
              │ Scorer: Ch7   │
              │ Prompts: Ch6  │
              │ (or similar)  │
              └───────────────┘
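
The same flow, written as a minimal wiring sketch in Python (the three stage functions are passed in as callables; their names and signatures are illustrative assumptions, not the book's API):

```python
from typing import Any, Callable, Sequence

def rlhf_pipeline(
    pretrained: Any,
    sft_data: Sequence[dict],         # Ch9: {"prompt": ..., "response": ...}
    preference_data: Sequence[dict],  # Ch6: {"prompt": ..., "chosen": ..., "rejected": ...}
    rl_prompts: Sequence[str],        # Ch11: prompts reused from Ch6 (or similar)
    supervised_finetune: Callable,    # Ch9 training stage
    train_reward_model: Callable,     # Ch7 training stage
    rl_optimize: Callable,            # Ch11 training stage
) -> Any:
    """Wires the diagram above: SFT produces the starting policy, the RM produces the
    scorer, and RL optimizes the policy against the RM on reused prompts."""
    sft_model = supervised_finetune(pretrained, sft_data)
    reward_model = train_reward_model(sft_model, preference_data)
    return rl_optimize(policy=sft_model, scorer=reward_model, prompts=rl_prompts)
```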

Summary #

The fundamental difference between Ch6 and Ch9 data lies in their learning objectives:

  • Ch6 (Preference Data): Teaches models to discriminate between better and worse responses through pairwise comparisons, ultimately training a Reward Model
  • Ch9 (SFT Data): Teaches models to generate appropriate responses through demonstration, creating the initial policy for RL optimization

Both are essential but serve distinct roles in the RLHF pipeline, with Ch9 establishing generation capabilities and Ch6 enabling quality assessment.