# Data Prep in RLHF: Ch6 (Preference Data) vs. Ch9 (SFT Data)
## Data Preparation Comparison: Ch6 vs. Ch9
| Aspect | Ch6: Preference Data | Ch9: SFT/IFT Data |
|---|---|---|
| Data Structure | **Pairwise** comparisons: (prompt, chosen_response, rejected_response) | **Single** examples: (prompt, good_response) |
| Purpose | Learn to **judge** which response is better | Learn to **generate** good responses |
| Data Format Example | `{"prompt": "What is 2+2?", "chosen": "The answer is 4", "rejected": "5"}` | `{"prompt": "What is 2+2?", "response": "The answer is 4"}` |
| Collection Method | Side-by-side comparison UIs; Likert scales (5-point, 8-point); thumbs up/down; ChatBotArena | Human-written high-quality examples; curated Q&A pairs; single demonstrations |
| Data Source | Human labelers comparing responses, or structured/synthetic pairs (correct vs. incorrect math, with vs. without a constraint) | Human-written completions, curated high-quality examples, or synthetic data from stronger models |
| Signal Type | Comparative/relative: which response is better? | Absolute: this is a good response |
| Typical Dataset Size | ~100K preference pairs (InstructGPT) | ~10K-1M examples (InstructGPT: ~10K; modern datasets: ~1M) |
| Multi-turn Handling | Preference on the final turn only; continue with the “chosen” answer; mask previous turns from the loss | Each turn becomes a separate training example (an N-turn conversation unrolls into N examples); mask prompts/previous turns from the loss |
| Used to Train | Ch7: Reward Model | Ch9: SFT Model (initial policy) |
| Training Objective | L = -log σ(r(chosen) - r(rejected)) | L = -log P(response \| prompt) |
| Next Stage Usage | Ch11: RL training (the same or similar prompts can be reused; the RM provides scores) | Ch7: base model for the RM; Ch11: starting policy for RL (the policy generates the responses) |
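The two loss rows above differ in kind, not just in notation: the preference loss only sees the *difference* between two scores, while the SFT loss directly maximizes the likelihood of the demonstrated response. A minimal PyTorch sketch (assumed, not from the book; the tensors are dummy stand-ins for real model outputs):

```python
import torch
import torch.nn.functional as F

# --- Ch6/Ch7: pairwise reward-model loss, L = -log σ(r(chosen) - r(rejected)) ---
# r_chosen / r_rejected: scalar rewards the RM assigns to each response in a pair
# (dummy values here; a real RM would produce them from the token sequences).
r_chosen = torch.tensor([1.8, 0.3])
r_rejected = torch.tensor([0.5, 0.9])
rm_loss = -F.logsigmoid(r_chosen - r_rejected).mean()

# --- Ch9: SFT loss, L = -log P(response | prompt) ---
# logits: per-token vocabulary scores from the policy (dummy random values);
# labels: target token ids, with prompt positions set to -100 so only the
# response tokens contribute to the cross-entropy.
vocab_size = 8
logits = torch.randn(1, 5, vocab_size)          # (batch, seq_len, vocab)
labels = torch.tensor([[-100, -100, 3, 6, 1]])  # first two positions = prompt
sft_loss = F.cross_entropy(
    logits.view(-1, vocab_size), labels.view(-1), ignore_index=-100
)

print(f"RM pairwise loss: {rm_loss.item():.3f} | SFT NLL: {sft_loss.item():.3f}")
```

Note that the pairwise loss never says how good either response is in absolute terms; it only pushes the margin r(chosen) - r(rejected) to be positive, which matches the relative signal type in the table.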
## Key Insight: The Data Types Are Fundamentally Different
- **Ch9 SFT data says:** “Here's a good response. Learn to generate this.”

  ```json
  {
    "prompt": "Write a poem about goldfish",
    "response": "Golden swimmer, circling slow..."
  }
  ```
- **Ch6 preference data says:** “Between these two responses, A is better than B. Learn to prefer A.”

  ```json
  {
    "prompt": "Write a poem about goldfish",
    "chosen": "Golden swimmer, circling slow... (follows constraint)",
    "rejected": "In circles bright, the goldfish glides... (violates constraint)"
  }
  ```
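To make the structural difference concrete, here is a minimal sketch of how each record type becomes training input, assuming a toy word-level tokenizer (`toy_tokenize` and the two builder functions are hypothetical illustrations, not book or library APIs):

```python
# Toy, self-contained illustration; a real pipeline would use a chat template
# and a subword tokenizer instead of this word-level stand-in.
VOCAB = {}

def toy_tokenize(text):
    # Assign each new whitespace-separated word the next unused integer id.
    return [VOCAB.setdefault(word, len(VOCAB)) for word in text.split()]

def build_sft_example(record):
    # Ch9: one sequence; labels mask the prompt (-100) so the loss is computed
    # only on the response tokens.
    prompt_ids = toy_tokenize(record["prompt"])
    response_ids = toy_tokenize(record["response"])
    return {
        "input_ids": prompt_ids + response_ids,
        "labels": [-100] * len(prompt_ids) + response_ids,
    }

def build_preference_example(record):
    # Ch6: two sequences sharing the same prompt; the reward model scores each
    # and is trained to rank the chosen sequence above the rejected one.
    prompt_ids = toy_tokenize(record["prompt"])
    return {
        "chosen_ids": prompt_ids + toy_tokenize(record["chosen"]),
        "rejected_ids": prompt_ids + toy_tokenize(record["rejected"]),
    }

sft_record = {"prompt": "Write a poem about goldfish",
              "response": "Golden swimmer, circling slow..."}
pref_record = {"prompt": "Write a poem about goldfish",
               "chosen": "Golden swimmer, circling slow...",
               "rejected": "In circles bright, the goldfish glides..."}

print(build_sft_example(sft_record))
print(build_preference_example(pref_record))
```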
## The Complete RLHF Pipeline
```
               ┌──────────────────┐
               │ Pretrained Model │
               └────────┬─────────┘
                        │
         ┌──────────────┴──────────────┐
         ↓                             ↓
┌──────────────────┐          ┌──────────────────┐
│ Ch9: SFT         │          │ Ch6: Collect     │
│                  │          │ Preference Data  │
│ Data:            │          │                  │
│ (prompt,         │          │ Data:            │
│  response)       │          │ (prompt, chosen, │
│                  │          │  rejected)       │
│ Single good      │          │                  │
│ examples         │          │ Comparisons      │
└────────┬─────────┘          └────────┬─────────┘
         │                             │
         ↓                             ↓
┌──────────────────┐          ┌──────────────────┐
│ SFT Model        │── base ─→│ Ch7: RM Training │
│ (policy starting │  model   │                  │
│  point)          │          │ Uses Ch6 data    │
└────────┬─────────┘          └────────┬─────────┘
         │                             │
         └──────────────┬──────────────┘
                        ↓
               ┌──────────────────┐
               │ Ch11: RL         │
               │ Optimization     │
               │                  │
               │ Policy:  Ch9     │
               │ Scorer:  Ch7     │
               │ Prompts: Ch6     │
               │ (or similar)     │
               └──────────────────┘
```
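Read top to bottom, the diagram is a dependency chain. The sketch below encodes that ordering; every function is a placeholder stub (assumed names, not a real library API) that returns a small dict standing in for the trained artifact:

```python
def train_sft(pretrained, sft_data):
    # Ch9: supervised fine-tuning on (prompt, response) pairs -> initial policy.
    return {"stage": "sft", "base": pretrained, "n_examples": len(sft_data)}

def train_reward_model(sft_model, preference_data):
    # Ch7: reward model trained on (prompt, chosen, rejected) comparisons,
    # commonly initialized from the SFT model (the "base model" arrow above).
    return {"stage": "rm", "init_from": sft_model["stage"],
            "n_pairs": len(preference_data)}

def rl_optimize(policy, reward_model, prompts):
    # Ch11: the policy generates responses to the prompts (often the same or
    # similar prompts as in Ch6) and the reward model scores them.
    return {"stage": "rl", "policy": policy["stage"],
            "scorer": reward_model["stage"], "n_prompts": len(prompts)}

sft_data = [{"prompt": "What is 2+2?", "response": "The answer is 4"}]
preference_data = [{"prompt": "What is 2+2?", "chosen": "The answer is 4",
                    "rejected": "5"}]
rl_prompts = ["What is 2+2?", "Write a poem about goldfish"]

policy = train_sft("pretrained model", sft_data)            # Ch9 uses SFT data
reward_model = train_reward_model(policy, preference_data)  # Ch7 uses Ch6 data
final_policy = rl_optimize(policy, reward_model, rl_prompts)  # Ch11 combines them
print(final_policy)
```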
## Summary
The fundamental difference between Ch6 and Ch9 data lies in their learning objectives:
- **Ch6 (Preference Data):** Teaches models to discriminate between better and worse responses through pairwise comparisons, ultimately training a reward model
- **Ch9 (SFT Data):** Teaches models to generate appropriate responses through demonstration, creating the initial policy for RL optimization
Both are essential but serve distinct roles in the RLHF pipeline, with Ch9 establishing generation capabilities and Ch6 enabling quality assessment.