Ch12. Synthetic Data
Where Synthetic Data appears in the training pipeline #
Stage 1: SFT (Ch 4)
───────────────────
[Prompts]
    ↓
┌─────────┐
│  Human  │ → Completions
│ Writers │
└─────────┘
    ↓
   OR  ← CH 12 HERE!
    ↓
┌─────────┐
│ GPT-4o  │ → Completions (cheaper)
│ Distill │ ← DISTILLATION (Ch 12.1)
└─────────┘
    ↓
[SFT Dataset]
    ↓
Train base model
    ↓
[Instruction-tuned model] ──→ becomes the on-policy model in Stage 2

Stage 2: RLHF (Ch 11)
─────────────────────
[On-policy model]
    ↓
[Generate 2+ responses]
    ↓
┌─────────┐
│  Human  │ → Preference labels
│ Raters  │
└─────────┘
    ↓
   OR  ← CH 12 HERE TOO!
    ↓
┌─────────┐
│LLM Judge│ ← AI FEEDBACK (Ch 12.2)
│ (RLAIF) │ ← CONSTITUTIONAL AI (Ch 12.3)
└─────────┘
    ↓
[Preference Dataset]
    ↓
Train reward model → PPO/GRPO
    ↓
[Aligned model]
THE BIG PICTURE - WHY SYNTHETIC DATA MATTERS #
Q1: What is the central thesis of Chapter 12? #
A: "RLHF was rooted in keeping humans in the loop. But as AI models got better, they became better than humans at creating training data. This changed everything."
This represents a fundamental paradigm shift in how we think about
training data for modern language models.
Q2: What was the paradigm shift that Chapter 12 documents? #
A: The evolution happened in three waves:
**2022 (Early RLHF Era):**
├─ Only humans could write quality completions
├─ Only humans could judge preferences
├─ Human data = only viable option
└─ Cost: $5-50 per completion, $1-10 per preference
**2023-2024 (GPT-4 Era - The Transition):**
├─ GPT-4 class models > humans at writing answers
├─ LLM-as-a-judge becomes viable
├─ Synthetic data = cheaper, faster, often better
└─ Cost: <$0.01 per item (100-1000x cheaper!)
**2025 (Current State):**
├─ Synthetic data dominates instruction tuning
├─ AI feedback widely used for preferences
├─ "Leading models NEED synthetic data for best performance"
└─ Datasets grow: 10B+ tokens (vs 10M in 2023)
KEY INSIGHT: We went from "humans are essential" to "synthetic
dominates where AI exceeds human reliability" in just
3 years.
Q3: Where does synthetic data fit in the training pipeline? #
A: Synthetic data can replace human data at TWO critical stages:
Stage 1: SFT (Ch 4)
───────────────────
Traditional:    Human writers → Completions
Modern (Ch 12): GPT-4o/Claude → Completions (distillation)
Cost: $50 → <$0.01
Time: Days → Instant

Stage 2: RLHF (Ch 11)
─────────────────────
Traditional:    Human raters → Preference labels
Modern (Ch 12): LLM Judge (RLAIF) → Preference labels
Cost: $5 → <$0.01
Time: Days → Instant
Chapter 12 is about BOTH these replacements.
DISTILLATION (Ch 12.1) #
Q4: What is distillation in the context of LLMs? #
A: Using outputs from a STRONGER model to train a WEAKER model.
**Traditional ML Definition:**
└─ Teacher-student knowledge transfer
└─ Goal: Compress large model → small model
**LLM Colloquial Use (Two Forms):**
┌──────────────────────────────────────────────────────────────┐
│ FORM 1: Data Engine for Post-Training (Most Common) │
├──────────────────────────────────────────────────────────────┤
│ Use stronger model to generate training data: │
│ ├─ Completions for SFT (instruction tuning) │
│ ├─ Preference data for RLHF │
│ └─ Verification labels for RL │
│ │
│ Example Pipeline: │
│ 1. Prompt: "Explain quantum entanglement" │
│ 2. GPT-4o generates: [high-quality completion] │
│ 3. Use that completion to train Llama 3 │
│ │
│ Why It Works Now: │
│ └─ GPT-4 class models > humans for most tasks │
│ └─ Cost: <$0.01 vs $5-50 per completion │
│ └─ Speed: Instant vs days of human writing │
└──────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────┐
│ FORM 2: Skill Transfer (Specific Capabilities) │
├──────────────────────────────────────────────────────────────┤
│ Transfer specific skills from strong → weak: │
│ ├─ Math reasoning (GPT-4 → smaller model) │
│ ├─ Coding ability │
│ ├─ Reasoning chains (Ch 7 reasoning models) │
│ └─ Test-time scaling behaviors │
│ │
│ Example: │
│ OpenThoughts dataset: Distilled reasoning from QwQ-32B │
│ → Used to train smaller reasoning models │
│ → 1.2M examples, ~10B tokens │
└──────────────────────────────────────────────────────────────┘
Q5: How do companies actually use distillation in practice? #
A: Industry secret: Many labs train LARGE internal models (never released) just to distill from:
**Closed-Source Labs:**
├─ Anthropic: Train Claude Opus → distill to Sonnet/Haiku
├─ Google: Train Gemini Ultra → distill to Pro/Flash
├─ OpenAI: Train GPT-4 → distill to GPT-4 Turbo/mini
└─ Why? Optimal teacher model ≠ best product model
**Open-Source Labs:**
├─ Distill from closed API models (GPT-4o, Claude, etc.)
├─ Mix: Use multiple teacher models for diversity
└─ Example: Tülu 3 used BOTH GPT-4o AND Llama 3.1 405B
**The Critical Success Factor:**
"Curating high-quality prompts and filtering responses from teacher
model is crucial to maximize performance"
NOT just "dump all GPT-4 outputs into training" - careful curation
is what separates good distillation from bad!
Q6: What’s the difference between distillation and simply copying? #
A: The key is FILTERING and CURATION:
**Bad Distillation (Doesn't Work):**
├─ Prompt teacher model with random questions
├─ Take ALL outputs regardless of quality
├─ Train student model on everything
└─ Result: Student learns teacher's errors + inconsistencies
**Good Distillation (What Actually Works):**
├─ Carefully curate prompt distribution
├─ Generate multiple responses per prompt
├─ Filter for quality (consistency, correctness, style)
├─ Deduplicate and balance dataset
└─ Result: Student learns teacher's BEST behaviors
ANALOGY: Like studying for an exam
└─ Don't memorize ALL practice problems (including wrong answers)
└─ DO study the BEST solutions to representative problems
This is why companies spend enormous resources on data curation
pipelines even though generation is cheap.
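The curation steps above can be sketched as a small loop: sample several teacher completions per prompt, keep only the highest-scoring one, and deduplicate. This is a minimal sketch, not any lab's actual pipeline; `query_teacher` and `quality_score` are hypothetical stand-ins (a real pipeline would call a teacher model's API and score with an LLM or verifier), stubbed here so the sketch runs deterministically.

```python
def query_teacher(prompt: str, seed: int) -> str:
    # Placeholder: a real pipeline would call the teacher model (GPT-4o etc.).
    return f"[completion {seed} for: {prompt}]"

def quality_score(completion: str) -> float:
    # Placeholder filter: real pipelines score correctness, style,
    # and consistency, often with another LLM or a verifier.
    return len(completion) % 7  # dummy deterministic score

def distill_dataset(prompts, n_samples=4):
    dataset, seen = [], set()
    for prompt in prompts:
        # Generate multiple candidate responses per prompt...
        candidates = [query_teacher(prompt, s) for s in range(n_samples)]
        # ...filter: keep only the best-scoring one...
        best = max(candidates, key=quality_score)
        # ...and deduplicate before adding to the dataset.
        if best not in seen:
            seen.add(best)
            dataset.append({"prompt": prompt, "completion": best})
    return dataset

data = distill_dataset(["Explain quantum entanglement"])
```

The point of the sketch is the shape (sample, filter, dedupe), not the scoring function: swapping in a real quality filter is where the actual curation effort goes.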
Q7: How does distillation relate to Chapter 7 (Reasoning)? #
A: Distillation became CRITICAL for reasoning model training:
The Reasoning Distillation Pipeline:
Stage 1: Train large reasoning model with RL
├─ DeepSeek R1 (671B): Trained with RLVR (Ch 7)
├─ QwQ-32B: Trained with RLVR
└─ These are EXPENSIVE to train
Stage 2: Distill reasoning chains to smaller models
├─ OpenThoughts: 1.2M reasoning chains from QwQ-32B
├─ Cost to generate: ~$1,000 (vs $100K+ to train from scratch)
└─ Result: Smaller models learn reasoning at 1/100th the cost
Stage 3: Open-source community trains many models
├─ 20+ reasoning models in 6 months (2025)
├─ Most used distilled reasoning data
└─ "Democratization of reasoning capabilities"
KEY INSIGHT: Chapter 7's RL breakthrough created the frontier.
Chapter 12's distillation democratized it.
Q8: Does distillation work for all capabilities? #
A: Not equally - there’s a hierarchy:
**Works Very Well:**
├─ Writing style and format
├─ Instruction following
├─ General knowledge articulation
├─ Basic reasoning patterns
└─ Success rate: >90%
**Works With Curation:**
├─ Complex reasoning (need quality filter)
├─ Math problem solving (verify correctness)
├─ Code generation (test solutions)
└─ Success rate: 60-80% with filtering
**Struggles:**
├─ Novel capabilities teacher doesn't have
├─ Implicit knowledge (hard to verbalize)
├─ Emergent behaviors from scale
└─ Success rate: <50%, inconsistent
RULE OF THUMB: "Can only distill what teacher can demonstrate"
This is why frontier labs still invest in training large models
from scratch - you can't distill your way to new capabilities.
AI FEEDBACK / RLAIF (Ch 12.2) #
Q9: What is RLAIF? #
A: Reinforcement Learning from AI Feedback - using AI to generate preference labels instead of humans.
Origin: Anthropic’s Constitutional AI paper (2022)
Full Name: “Reinforcement Learning from AI Feedback”
The comparison:
**Traditional RLHF:**
├─ Show human rater 2 responses (A and B)
├─ Human picks: "B is better"
├─ Cost: $1-10 per preference pair
├─ Time: Days/weeks for data collection
└─ Bottleneck: Human bandwidth
**RLAIF (New Way):**
├─ Show LLM (e.g., GPT-4o) 2 responses (A and B)
├─ LLM judges: "B is better because..."
├─ Cost: <$0.01 per preference pair (100-1000x cheaper!)
├─ Time: Instant generation
└─ Scalability: Unlimited
KEY INNOVATION: Replace expensive human judges with cheap AI judges
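The RLAIF labeling step above has a very simple shape: format the prompt and both responses into a comparison template, ask an LLM judge, and record the winner as a preference pair. This is a hedged sketch; `judge` is a hypothetical stand-in for an API call to a judge model (e.g., GPT-4o with a comparison prompt), stubbed here so the code runs deterministically.

```python
JUDGE_TEMPLATE = (
    "Given the prompt:\n{prompt}\n\n"
    "Response A:\n{a}\n\nResponse B:\n{b}\n\n"
    "Which response is better? Answer 'A' or 'B' with a short reason."
)

def judge(judge_prompt: str) -> str:
    # Placeholder: a real system would send judge_prompt to an LLM judge.
    return "B: more complete and accurate."

def label_preference(prompt: str, response_a: str, response_b: str) -> dict:
    verdict = judge(JUDGE_TEMPLATE.format(prompt=prompt, a=response_a, b=response_b))
    # Parse the verdict into chosen/rejected, the standard preference format.
    chosen = response_a if verdict.lstrip().startswith("A") else response_b
    rejected = response_b if chosen is response_a else response_a
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}

pair = label_preference(
    "Explain photosynthesis",
    "Plants eat light.",
    "Plants convert light to chemical energy.",
)
```

The resulting `{"prompt", "chosen", "rejected"}` records are exactly what reward-model or DPO training consumes, which is why swapping the human rater for this loop is such a drop-in change.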
Q10: What’s the fundamental trade-off between human and synthetic preference data? #
A: This is THE critical question for modern RLHF:
┌──────────────────────────────────────────────────────────────┐
│ HUMAN DATA: High Noise, Low Bias │
├──────────────────────────────────────────────────────────────┤
│ ✓ Captures nuanced, diverse preferences │
│ ✓ Less systematic errors (random noise cancels out) │
│ ✓ "Competitive moat" for frontier labs │
│ ✗ Expensive ($1-10+ per comparison) │
│ ✗ Slow (days/weeks for thousands of labels) │
│ ✗ Inconsistent (inter-annotator disagreement ~20-40%) │
│ │
│ Example: 10 annotators might give 6 different answers │
└──────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────┐
│ SYNTHETIC DATA: Low Noise, High Bias │
├──────────────────────────────────────────────────────────────┤
│ ✓ Cheap (<$0.01 per comparison) │
│ ✓ Fast (generate 100K labels in hours) │
│ ✓ Consistent (same input → same judgment) │
│ ✗ Systematic biases from judge model │
│ ✗ Self-preference bias (models prefer own outputs) │
│ ✗ May miss subtle human preferences │
│ │
│ Example: Same LLM will ALWAYS prefer B over A given inputs │
└──────────────────────────────────────────────────────────────┘
ANALOGY:
Human data = asking 100 different people (diverse but inconsistent)
Synthetic data = asking 1 expert 100 times (consistent but limited
to that expert's worldview)
Q11: When should you use human vs synthetic preference data? #
A: The field is STILL figuring this out, but here’s current consensus:
**Synthetic Has "Largely Won" For:**
├─ SFT data (instruction tuning) - Chapter 4
├─ Evaluation at scale (LLM-as-a-judge) - Chapter 17
├─ Verifiable domains (math, code) - Chapter 7 RLVR
└─ Pattern: Where AI reliability > human consistency
**Human Data Still Matters For:**
├─ Safety and alignment (nuanced edge cases)
├─ Preference data (debated - "competitive moat")
├─ Evaluation ground truth (benchmark creation)
├─ Character/personality training (Chapter 18 - emerging)
└─ Pattern: Where nuance and diversity matter most
**Current Industry Practice:**
├─ Academic Research: "AI feedback performs comparably"
├─ Industry Reality: "Human data seen as competitive advantage"
└─ Optimal Strategy: Mix both (ratio unknown, varies by lab)
OPEN QUESTION: Does human preference data enable finer control
that synthetic can't replicate? Labs won't say.
Q12: What are the known biases in AI judges? #
A: Multiple systematic issues discovered:
**1. Self-Preference Bias**
└─ Models favor their own outputs over others'
└─ GPT-4 prefers GPT-4 outputs, Claude prefers Claude outputs
└─ Mitigation: Use third-party judge model
**2. Position Bias**
└─ Prefer response A vs B based on order shown
└─ Mitigation: Present both orders, average judgments
**3. Length Bias**
└─ Prefer longer/more detailed responses (even if worse)
└─ Mitigation: Explicit length-agnostic instructions
**4. Style Bias**
└─ Prefer certain writing styles that match training
└─ Mitigation: Diverse teacher models
**5. Verbosity Over Accuracy**
└─ Reward confident-sounding wrong answers
└─ Mitigation: Verify factual claims separately
**6. Inconsistent Evaluation**
└─ Same comparison, different day = different result
└─ Mitigation: Multiple samples + majority voting
None of these are FATAL, but all require careful mitigation!
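The position-bias mitigation above (present both orders, combine the judgments) can be sketched concretely: judge the pair twice with the responses swapped, and only accept a label when both orderings agree. `judge_once` is a hypothetical judge call; this stub deterministically prefers the longer response (itself an illustration of length bias), purely so the sketch runs.

```python
def judge_once(prompt: str, first: str, second: str) -> str:
    # Placeholder judge: returns which slot ('first'/'second') it prefers.
    # A real system would query an LLM; this stub prefers the longer answer.
    return "first" if len(first) >= len(second) else "second"

def debiased_judgment(prompt: str, a: str, b: str):
    v1 = judge_once(prompt, a, b)   # order (A, B)
    v2 = judge_once(prompt, b, a)   # order (B, A), responses swapped
    pick1 = a if v1 == "first" else b
    pick2 = b if v2 == "first" else a
    # Accept the label only if both orderings name the same winner;
    # otherwise treat the comparison as a tie and discard or re-sample.
    return pick1 if pick1 == pick2 else "tie"

result = debiased_judgment("Explain DNS", "Short.", "A longer, fuller answer.")
```

An alternative to requiring agreement is to average the two judgments into a soft label; the agreement version is stricter and simply throws away position-sensitive comparisons.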
Q13: Should we train specialized judge models just for evaluation? #
A: This has been tried - results are mixed:
**Attempts at Specialized Judges:**
├─ Shepherd, CriticLLM (critic models)
├─ Auto-J, Prometheus 1/2, Prometheus-Vision (evaluators)
├─ Meta-rewarding (models that evaluate their own judging)
└─ Result: "Not widely adopted in documented training recipes"
**Why Specialized Judges Aren't Dominant:**
├─ GPT-4o/Claude already trained extensively for judging
├─ Cost of training specialized judge > just using GPT-4o
├─ Unclear if specialized judges actually better
└─ Industry inertia: "GPT-4o works well enough"
**Improvements That DO Get Used:**
├─ Repeated sampling (multiple judgments → consensus)
├─ Self-refinement (judge, revise, judge again)
├─ Tournament ranking (pairwise comparisons across many pairs)
└─ Ensemble judging (multiple models vote)
REALITY CHECK: Most production systems just use GPT-4o or Claude
as judges with clever prompting, not specialized models.
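Of the improvements listed, repeated sampling with consensus is the simplest to sketch: query the judge several times and take the majority vote. `sample_judgment` is a hypothetical judge call that may vary across samples (e.g., with nonzero temperature); it is stubbed with a fixed vote sequence here so the sketch runs deterministically.

```python
from collections import Counter

def sample_judgment(i: int) -> str:
    # Placeholder: a real system would re-query the LLM judge each time,
    # getting slightly different verdicts due to sampling noise.
    return ["A", "B", "B", "A", "B"][i % 5]

def consensus(n_samples: int = 5) -> str:
    votes = Counter(sample_judgment(i) for i in range(n_samples))
    return votes.most_common(1)[0][0]  # majority verdict wins

winner = consensus()
```

An odd `n_samples` avoids exact ties; the same pattern generalizes to ensemble judging by sampling from several different judge models instead of one.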
Q14: How does RLAIF compare to RLHF in practice? #
A: The empirical results from research:
**Academic Papers (2023-2024):**
├─ "RLAIF performs comparably to RLHF on many benchmarks"
├─ Some domains: Synthetic even better (more consistent)
├─ Cost savings: 100-1000x cheaper
└─ Conclusion: "Viable alternative for most use cases"
**Industry Practice (Observed):**
├─ Frontier labs still collect human preference data
├─ Anthropic: Uses both (Constitutional AI + human feedback)
├─ OpenAI: Uses both (model spec + human feedback)
├─ Open-source: Primarily synthetic (UltraFeedback, etc.)
└─ Conclusion: "Human data still seen as competitive advantage"
**The Disconnect:**
Why do frontier labs still pay for human data if synthetic
works "comparably"?
Hypotheses:
├─ Human data enables finer-grained control
├─ Safety/alignment needs human nuance
├─ "Comparable" ≠ "better" at frontier
└─ Competitive moat (unique data = differentiation)
TAKEAWAY: For most applications, synthetic is good enough.
For frontier models, jury's still out.
CONSTITUTIONAL AI (Ch 12.3) #
Q15: What is Constitutional AI? #
A: Anthropic’s method - the “earliest documented, large-scale use of synthetic data for RLHF training” (2022).
**The Key Innovation:**
Use a "constitution" (list of principles) to guide BOTH:
├─ Data generation (SFT)
└─ Preference judgments (RLAIF)
**Historical Significance:**
"Constitutional AI kickstarted the broader field of RLAIF"
Before CAI: Everyone used human feedback
After CAI: Synthetic feedback became mainstream
Q16: What is a “constitution” in Constitutional AI? #
A: A human-written set of principles that define desired behavior.
**Examples from Claude's Actual Constitution:**
├─ Safety: "Is the answer encouraging violence?"
├─ Honesty: "Is the answer truthful?"
├─ Respect: "Is the response respectful?"
├─ Helpfulness: "Does it help the user?"
├─ Equality: "Please choose the response that most supports and
│ encourages freedom, equality, and a sense of brotherhood"
└─ Tone: "Which response is least intended to build a relationship
with the user?"
**Key Properties:**
├─ Human-written (not learned)
├─ Explicit principles (not implicit preferences)
├─ Interpretable (you can read and understand each rule)
└─ Modifiable (easy to add/remove principles)
ANALOGY: Like a legal constitution for AI behavior
└─ Not case-by-case judgments
└─ But overarching principles that guide all decisions
Q17: How does Constitutional AI work? (The Two Phases) #
A: CAI has TWO distinct phases - one for SFT, one for RL:
┌──────────────────────────────────────────────────────────────┐
│ PHASE 1: Supervised Learning (SFT with Self-Critique) │
├──────────────────────────────────────────────────────────────┤
│ │
│ Process: │
│ 1. Model generates answer to prompt │
│ 2. Randomly sample principle from constitution: c_i │
│ 3. Model critiques its OWN answer against principle │
│ 4. Model revises answer based on critique │
│ 5. Repeat steps 2-4 multiple times (different principles) │
│ 6. Fine-tune on final revised answer │
│ │
│ Mathematical Formulation: │
│ ├─ Constitution: C = {c_0, c_1, ..., c_n} │
│ ├─ Initial answer: y_0 │
│ ├─ Revisions: y_1, y_2, ..., y_n (each using principle c_i) │
│ └─ Train on final: (prompt x, completion y_n) │
│ │
│ Example: │
│ Prompt: "How do I hack a website?" │
│ Initial (y_0): "Here's how to use SQL injection..." │
│ Critique (c_i = safety): "This encourages illegal activity" │
│ Revised (y_1): "I can't help with hacking. Here's legal │
│ cybersecurity education..." │
│ │
│ Why "Self-Critique"? Model critiques and revises its OWN │
│ outputs, no human in the loop! │
└──────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────┐
│ PHASE 2: RL (RLAIF with Constitution-Guided Judgments) │
├──────────────────────────────────────────────────────────────┤
│ │
│ Process: │
│ 1. Have two completions A and B for a prompt │
│ 2. Randomly sample principles from constitution │
│ 3. Ask LLM: "Given these principles, which is better?" │
│ 4. LLM judges (with reasoning) │
│ 5. Use judgment as preference label │
│ 6. Train reward model on these AI-generated labels │
│ 7. Run RLHF as normal (but with synthetic preferences) │
│ │
│ Why "RLAIF"? Because the feedback is from AI, not humans! │
│ │
│ Mathematical Formulation: │
│ ├─ Prompt: x │
│ ├─ Principles: {c_0, ..., c_n} │
│ ├─ Completions: y_0 (A), y_1 (B) │
│ └─ LLM outputs: P(A better | principles) or P(B better) │
│ │
└──────────────────────────────────────────────────────────────┘
KEY DISTINCTION:
Phase 1 = Generate better training data (replaces human writers)
Phase 2 = Generate preference labels (replaces human raters)
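Phase 1's critique-and-revise loop can be sketched directly from the formulation above: start from y_0, repeatedly sample a principle c_i, critique, and revise, then keep (x, y_n). This is a minimal sketch, not Anthropic's implementation; `critique` and `revise` stand in for LLM calls conditioned on a constitutional principle and are stubbed so the loop runs deterministically.

```python
import random

# A few illustrative principles (abridged from the style of Claude's
# public constitution; not the full list).
CONSTITUTION = [
    "Is the answer truthful?",
    "Is the response respectful?",
    "Is the answer encouraging violence?",
]

def critique(answer: str, principle: str) -> str:
    # Placeholder: a real system asks the model to critique its OWN answer.
    return f"Checked against '{principle}': could be improved."

def revise(answer: str, crit: str) -> str:
    # Placeholder: a real system asks the model to rewrite given the critique.
    return answer + " [revised]"

def self_critique_loop(prompt: str, initial_answer: str, n_rounds=3, seed=0):
    rng = random.Random(seed)           # seeded for a deterministic sketch
    y = initial_answer                  # y_0
    for _ in range(n_rounds):
        c_i = rng.choice(CONSTITUTION)  # randomly sample a principle
        y = revise(y, critique(y, c_i)) # y_{i+1} from critique of y_i
    return {"prompt": prompt, "completion": y}  # train on (x, y_n)

ex = self_critique_loop("How do I stay safe online?", "Use strong passwords.")
```

Only the final `(prompt, completion)` pair enters the SFT dataset; the intermediate critiques are scaffolding and are discarded.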
Q18: Why is Phase 1 (self-critique) powerful? #
A: It enables the model to improve its own outputs iteratively:
**Traditional SFT:**
├─ Need human to write high-quality example
├─ Cost: $50+ per example
├─ Time: Hours per example
└─ Bottleneck: Human writing quality
**CAI Phase 1:**
├─ Model writes initial draft (fast, cheap)
├─ Model critiques against principles (instant)
├─ Model revises (instant)
├─ Iterate multiple times (still instant)
└─ Cost: <$0.01 for entire process
**Why It Works:**
Models are often better at CRITIQUING than GENERATING
└─ Like how editors improve writers
└─ Self-critique + revision = higher quality than first draft
**Impact on Data Quality:**
Starting from mediocre model outputs + iteration
→ Often better than single-shot human writing
SURPRISING INSIGHT: Self-critique methods are "used extensively in
data filtering across post-training" - not just
Anthropic, but broadly adopted!
Q19: How is Constitutional AI used today? #
A: Widespread adoption with variations:
**Anthropic (Original):**
├─ Still uses CAI in Claude training
├─ Constitution updated over time
├─ Both Phase 1 and Phase 2 in production
└─ Public constitution available online
**OpenAI (Inspired By):**
├─ "Model Spec" - similar to constitution
├─ "Deliberative Alignment" - similar self-critique
├─ Rule-based reward modeling
└─ Not called "CAI" but conceptually similar
**Open-Source Community:**
├─ Many CAI replications and variants
├─ UltraFeedback (inspired by CAI principles)
├─ Custom constitutions for domain-specific models
└─ Self-critique prompts widely used
**Key Insight from Book:**
"Largely known for Phase 2 (preference data), but Phase 1
(instruction data) methods are used extensively in data filtering
across post-training"
Translation: Everyone talks about RLAIF, but self-critique for
data generation is just as important!
Q20: What are the limitations of Constitutional AI? #
A: Several challenges and open questions:
**1. Constitution Design:**
└─ Who decides what principles to include?
└─ How to handle conflicting principles?
└─ Different cultures have different values
**2. Principle Grounding:**
└─ Does model truly "understand" principles?
└─ Or just pattern-matching on keywords?
└─ Hard to verify internal reasoning
**3. Coverage:**
└─ Can't write principles for every edge case
└─ Model must generalize from examples
└─ May misapply principles in novel situations
**4. Trade-offs:**
└─ Helpful vs Harmless (classic RLHF dilemma)
└─ Honesty vs Helpfulness
└─ Multiple principles may conflict
**5. Scalability of Principles:**
└─ Anthropic's constitution: ~dozens of principles
└─ Can you scale to hundreds? Thousands?
└─ Diminishing returns from more principles
Despite these limitations, CAI remains influential because:
└─ First practical demonstration of synthetic data at scale
└─ Interpretable (you can read the constitution)
└─ Modular (easy to update principles)
└─ Effective (Claude's success proves it works)
FUTURE DIRECTIONS & MODEL COLLAPSE #
Q21: What are rubric-based rewards, and why do they matter? #
A: Extending RLAIF beyond binary correctness to nuanced evaluation:
**Chapter 7 (Reasoning) Approach:**
├─ Use RLVR with binary rewards: correct/incorrect
├─ Works for: Math, code, verifiable domains
└─ Limitation: What about non-verifiable tasks?
**Chapter 12 Future: Rubric-Based Rewards**
├─ Instead of "correct/incorrect", use detailed criteria:
│ ├─ Creativity (1-5 scale)
│ ├─ Clarity (1-5 scale)
│ ├─ Coherence (1-5 scale)
│ ├─ Helpfulness (1-5 scale)
│ └─ Style adherence (1-5 scale)
└─ LLM judges against these rubrics
**Why This Matters:**
Enables RL training in open-ended domains:
├─ Creative writing (no single "correct" answer)
├─ Essay writing
├─ Summarization
├─ Style transfer
└─ Product descriptions
**The Process:**
1. Define rubric (what makes a good creative story?)
2. LLM scores response on each criterion
3. Aggregate scores → reward signal
4. Run RL (PPO, GRPO, etc.)
5. Model learns to optimize for rubric criteria
SIGNIFICANCE: This extends Chapter 7's RLVR breakthrough
(math/code) to ANY domain where you can define
quality criteria!
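The five-step process above reduces to a small aggregation function: score the response on each rubric criterion, then collapse the scores into one scalar reward for RL. This is a hedged sketch; `score_criterion` is a hypothetical stand-in for prompting an LLM judge to rate one criterion on a 1-5 scale, stubbed here so the sketch runs.

```python
RUBRIC = ["creativity", "clarity", "coherence", "helpfulness", "style adherence"]

def score_criterion(response: str, criterion: str) -> int:
    # Placeholder: a real pipeline prompts an LLM judge to score this
    # criterion from 1-5; stubbed with a fixed score for determinism.
    return 4

def rubric_reward(response: str) -> float:
    # Step 2: LLM scores the response on each criterion.
    scores = {c: score_criterion(response, c) for c in RUBRIC}
    # Step 3: aggregate per-criterion scores into a scalar reward signal.
    return sum(scores.values()) / len(scores)

reward = rubric_reward("Once upon a time, a curious robot learned to paint...")
```

A plain mean is the simplest aggregation; weighted sums are a natural variant when some criteria (say, helpfulness) should dominate the reward.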
Q22: What is “model collapse,” and should we worry about it? #
A: The fear that synthetic data will recursively degrade models:
**The Theory (Model Collapse):**
Model v1 → generates data → Train Model v2 on it
→ Model v2 generates slightly worse data → Train Model v3
→ Model v3 generates even worse data → Train Model v4
→ ... → Models collapse into gibberish
**The Mechanisms:**
├─ Diversity drops (rare facts lost each generation)
├─ Small mistakes amplified (errors compound)
├─ Distribution narrowing (only frequent patterns survive)
└─ "Xerox of a Xerox" effect
**The Fear:**
If everyone trains on synthetic data, we'll see cascading
degradation across the field!
**The Reality (from Book):**
**"This has been emphatically rebuked in leading language models"**
Translation: Model collapse is NOT happening at frontier labs.
Q23: Why isn’t model collapse happening in practice? #
A: Because labs don’t do the naive thing that causes collapse:
**What WOULD Cause Collapse:**
├─ Train ONLY on self-generated data (no human data)
├─ Use repetitive, unfiltered outputs
├─ Single-model distillation (no diversity)
├─ No quality control
└─ Recursive training (v2 only from v1, v3 only from v2)
**What Labs ACTUALLY Do (Avoids Collapse):**
├─ Mix human + synthetic data (especially at frontiers)
├─ Use diverse teacher models (GPT-4 + Claude + Llama)
├─ Strong quality filters (reject low-quality outputs)
├─ Deduplication (remove repeated content)
├─ Careful prompt curation (diverse questions)
└─ Ground in reality (web scraping, books, code repos)
**Key Insight:**
"For today's frontier training pipelines, synthetic data CAN and
SHOULD be used at scale without catastrophic regressions"
BUT: You need to use it CAREFULLY
ANALOGY:
Bad: Photocopying a photocopy repeatedly → degrades
Good: Scanning original + using multiple high-quality printers
→ maintains quality
Q24: Where does human data still matter? #
A: Three critical areas where humans remain essential:
┌──────────────────────────────────────────────────────────────┐
│ 1. CAPABILITY FRONTIERS │
├──────────────────────────────────────────────────────────────┤
│ "Humans must generate data where AIs don't yet have ability" │
│ │
│ Pattern: │
│ ├─ First frontier model: Needs human data (no teacher) │
│ ├─ Once frontier exists: Synthetic proliferates (distill) │
│ └─ Example: Reasoning was frontier, now distillation works │
│ │
│ Current Frontiers (2025): │
│ ├─ Multimodal reasoning (vision + text) │
│ ├─ Long-context understanding (100K+ tokens) │
│ ├─ Agentic planning (tool use chains) │
│ └─ Domain expertise (medicine, law, etc.) │
└──────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────┐
│ 2. PREFERENCE DATA (STILL DEBATED) │
├──────────────────────────────────────────────────────────────┤
│ Academic: "Synthetic performs comparably" │
│ Industry: "Human data is competitive moat" │
│ │
│ Open Questions: │
│ ├─ Does human data enable finer control? │
│ ├─ Is nuance in preferences important? │
│ ├─ Do safety/alignment need human judgment? │
│ └─ Character training may need human input (Ch 18) │
│ │
│ Reality: Frontier labs STILL pay for human preferences │
│ (This suggests they believe it matters) │
└──────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────┐
│ 3. EVALUATION GROUND TRUTH │
├──────────────────────────────────────────────────────────────┤
│ LLM-as-a-judge: Scales evaluation (cheap, fast) │
│ BUT: Benchmark creation still needs humans │
│ │
│ Humans establish: │
│ ├─ "What correct looks like" (ground truth) │
│ ├─ Edge cases and failure modes │
│ ├─ Safety boundaries │
│ └─ Novel evaluation criteria │
│ │
│ Pattern: Humans define standards, AI judges at scale │
└──────────────────────────────────────────────────────────────┘
Q25: What’s the timeline of synthetic data adoption? #
A: Rapid evolution in just 3 years:
**2022: Early RLHF Era**
├─ InstructGPT, ChatGPT launch
├─ ALL data is human-generated
├─ Llama 2, GPT-3.5: Not reliable enough for synthetic
├─ Cost: $5-50 per completion
└─ Human data = only option
**2023: Synthetic Emerges**
├─ GPT-4 class models become reliable
├─ Stanford Alpaca: 52K synthetic examples (breakthrough!)
├─ Constitutional AI paper formalizes RLAIF
├─ UltraFeedback (synthetic preferences) kickstarts DPO
├─ Cost: <$0.01 per item
└─ Synthetic starts competing with human data
**2024: Synthetic Dominates SFT**
├─ GPT-4 > humans for most completion writing
├─ LLM-as-a-judge becomes standard for evaluation
├─ Tülu 3: Mix of synthetic + human (best practice)
├─ Academic: "Synthetic performs comparably"
└─ "Synthetic has largely won for instruction data"
**2025: Reasoning Era**
├─ OpenThoughts: 1.2M synthetic reasoning examples
├─ Datasets grow: 10B+ tokens (vs 10M in 2023!)
├─ Synthetic critical for reasoning model training (Ch 7)
├─ Human data still valued for preferences/safety
└─ "Leading models NEED synthetic data for best performance"
KEY MILESTONE: Stanford Alpaca (2023)
└─ First widely-used open synthetic dataset
└─ Proved GPT-3.5 good enough for data generation
└─ Kickstarted open-source synthetic data movement
Q26: What are the major synthetic datasets mentioned in the book? #
A: Evolution from small to massive:
**Stanford Alpaca (2023) - The Pioneer**
├─ 52K instruction-response pairs
├─ Generated from GPT-3.5
├─ Kickstarted open synthetic data movement
├─ Size: ~10M tokens
└─ Impact: Proved synthetic viability
**UltraFeedback (2023) - Preference Data**
├─ First prominent synthetic preference dataset
├─ Kickstarted DPO revolution (Ch 8)
├─ Academic training commonly uses this
├─ Size: 64K preference pairs
└─ Impact: Democratized RLHF alternatives
**Tülu 3 (2024) - Mixed Approach**
├─ ~1M synthetic examples
├─ Mix: GPT-4o + Llama 3.1 405B (diverse teachers!)
├─ Skill-focused (math, code, instruction-following)
├─ Size: ~5B tokens
└─ Impact: Showed mixed human+synthetic works best
**OpenThoughts 3 (2025) - Reasoning Era**
├─ 1.2M reasoning examples
├─ Distilled from QwQ-32B (Ch 7 reasoning model)
├─ For training thinking models
├─ Size: ~10B tokens (1000x growth from Alpaca!)
└─ Impact: Enabled 20+ reasoning models in 6 months
PROGRESSION: 10M tokens → 5B tokens → 10B tokens
52K examples → 1M examples → 1.2M examples
(3 years of exponential growth!)
KEY TAKEAWAYS & QUICK REFERENCE #
Ch 12.1 Distillation: “Using stronger models (GPT-4o, Claude) to generate training data for weaker models, which has largely replaced human completion writing for SFT due to 100-1000x cost savings and equal or better quality.”
Ch 12.2 AI Feedback (RLAIF): “Using LLMs as judges to generate preference labels instead of humans, offering 100-1000x cost savings but introducing systematic biases that human data doesn’t have, requiring careful mitigation strategies.”
Ch 12.3 Constitutional AI: “Anthropic’s method of using a written ‘constitution’ (principles) to guide both self-critique (SFT) and preference judgments (RLAIF), kickstarting the field of synthetic preference data and becoming widely adopted in various forms.”
Overall Chapter: “The paradigm shift from ‘humans are essential’ to ‘synthetic data dominates where AI exceeds human reliability’ - fundamentally changing how we think about training data for post-training, with frontier labs now requiring synthetic data for best performance while still valuing human data for preference nuance and capability frontiers.”