Ch7. REASONING & INFERENCE-TIME SCALING
THE BIG PICTURE - WHY IS THIS A BREAKTHROUGH? #
Q1: What actually happened in 2025 that makes Chapter 7 revolutionary? #
A: Two seismic events changed everything:
- OpenAI o1 (Sep 2024) - Showed reasoning models work at scale
- DeepSeek R1 (Jan 2025) - Fully documented the recipe openly
Timeline:
- Before: “RL doesn’t work” was conventional wisdom (famous 2018 blog post)
- After: 20+ reasoning models released in 6 months (Jan-Jul 2025)
This is like going from “flight is impossible” to “commercial airlines everywhere” in 6 months. The paradigm shifted THAT fast.
Q2: Why couldn’t we do this before 2024? #
A: Three critical barriers were overcome:
BARRIER 1: RL Stability #
- Problem: RL training was “fickle” - crashed, failed randomly
- Solution: Open-source tools matured (TRL, veRL, OpenRLHF)
- Result: “Technical barriers to entry at an all-time low”
BARRIER 2: Model Capability Threshold #
- Problem: Smaller/weaker models couldn’t learn reasoning via RL
- Solution: Base models from ~2024 onwards were “capable enough”
- Result: RL training could finally elicit reasoning behaviors
BARRIER 3: Verifiable Rewards #
- Problem: How do we know if reasoning is correct?
- Solution: Math/coding have binary correctness (RLVR)
- Result: Train without expensive preference data!
Q3: What’s the difference between RLHF and RLVR? #
A: Critical distinction that defines the new era:
RLHF (Old Way) #
- Reward: Human preference labels (expensive, subjective)
- Use case: Chat, safety, style, “vibes”
- Cost: $1-10 per preference pair
- Training: 1-2 epochs, careful not to overfit
- Example: ChatGPT’s politeness, Claude’s helpfulness
RLVR (New Way) - “Reinforcement Learning from Verifiable Rewards” #
- Reward: Binary correctness (right/wrong answer)
- Use case: Math, coding, reasoning, STEM
- Cost: Nearly free (automated checking)
- Training: Hundreds/thousands of epochs until convergence
- Example: DeepSeek R1’s math ability, o1’s coding
KEY INSIGHT: RLVR doesn’t need humans in the loop for correctness! Just check if answer == ground_truth
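A minimal sketch of such a check (Python; names are illustrative, and real graders do much heavier normalization: LaTeX parsing, numeric tolerance, unit handling):

```python
def verifiable_reward(model_answer: str, ground_truth: str) -> float:
    """Binary RLVR reward: 1.0 if the extracted answer matches, else 0.0.

    This sketch only strips whitespace and case; production verifiers
    normalize far more aggressively before comparing.
    """
    def normalize(s: str) -> str:
        return s.strip().lower()
    return 1.0 if normalize(model_answer) == normalize(ground_truth) else 0.0
```

Because the check is fully automated, it can be run millions of times during RL training at essentially zero marginal cost, which is what makes hundreds of epochs affordable.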
CORE CONCEPTS - WHAT IS INFERENCE-TIME SCALING? #
Q4: What is “inference-time scaling”? #
A: Using MORE COMPUTE at inference to get BETTER answers.
Traditional LLM: #
├─ User asks: "Solve this math problem"
├─ Model generates: [answer in 100 tokens]
└─ Done. Fixed cost.
Reasoning Model (Inference-time Scaling): #
├─ User asks: "Solve this math problem"
├─ Model thinks: [2000 tokens of <think>...</think> reasoning]
├─ Model generates: [final answer]
└─ Uses 20x more tokens = 20x more compute = better accuracy!
ANALOGY:
- Traditional = “quick guess”
- Reasoning = “show your work” (like you did in school)
Q5: What’s the difference between training-time and inference-time scaling? #
A: Where you spend the compute:
TRAINING-TIME SCALING (Traditional) #
- Spend compute: During training (one-time cost)
- Method: Bigger models, more training data, longer training
- Result: Smarter model for ALL users
- Example: GPT-3 (175B) → GPT-4 (rumored ~1.7T params) = bigger = smarter
- Tradeoff: Expensive upfront, cheap at inference
INFERENCE-TIME SCALING (New Era) #
- Spend compute: During EACH query (per-user cost)
- Method: Model “thinks” longer (more tokens) for hard Qs
- Result: Same model, but “tries harder” on demand
- Example: o1 generates 2K-30K thinking tokens per hard Q
- Tradeoff: Cheaper training, expensive inference (but better!)
BREAKTHROUGH: You can now trade inference cost for accuracy! (Like buying “thinking time” on demand)
Q6: How does RL training create inference-time scaling? #
A: RL teaches the model to “think out loud” for hard problems:
The Process: #
Step 1: Model encounters hard math problem
Step 2: Model generates reasoning chain:
<think>
Let me break this down...
First, I'll try approach A...
Wait, that doesn't work...
Let me try approach B...
Ah! That works because...
Therefore the answer is...
</think>
Step 3: Final answer: [correct]
Step 4: RL reward: +1 (correct!) → Reinforce this “thinking behavior”
After thousands of epochs: #
- Model learns: “Hard problems = think longer = higher accuracy”
- Correlation emerges: More tokens → Better performance
- This is NOT length bias (old RLHF problem)
- This is PRODUCTIVE reasoning
THE CANONICAL RECIPE - DEEPSEEK R1 #
Q7: What is DeepSeek R1’s training recipe? #
A: 4-stage process that became the blueprint for 2025:
STAGE 1: “Cold-Start” (100K+ samples) #
- What: Sample from earlier RL checkpoint (R1-Zero)
- Filter: Keep only high-quality reasoning chains
- Goal: Teach model the PROCESS of reasoning
- Why “cold-start”? The RL run is bootstrapped from minimal supervised data, unlike traditional SFT pipelines that need millions of examples
STAGE 2: Large-Scale RL (The Core) #
- What: Run RLVR “until convergence”
- Data: Reasoning problems (math, coding, STEM)
- Epochs: HUNDREDS (not 1-2 like traditional fine-tuning!)
- Reward: Binary correctness (right/wrong)
- Note: This is where the magic happens - model learns to reason!
STAGE 3: Rejection Sampling (Transition to General) #
- Mix: 3/4 reasoning problems + 1/4 general queries
- Goal: Don’t lose general chat ability
- Method: Sample multiple answers, keep best ones
STAGE 4: Mixed RL (Polish) #
- RLVR: For reasoning domains (verifiable)
- RLHF: For general domains (human preferences)
- Goal: Final model that’s both smart AND pleasant
Q8: What’s revolutionary about this recipe vs previous RLHF? #
A: Complete inversion of traditional priorities:
Traditional RLHF (InstructGPT, ChatGPT): #
├─ Stage 1: SFT (1M examples, 1-2 epochs) ← MAIN TRAINING
├─ Stage 2: RLHF (100K pref pairs) ← "Cherry on top"
└─ Focus: Behavior, style, safety
DeepSeek R1 (Reasoning Era): #
├─ Stage 1: Cold-start (100K, minimal SFT)
├─ Stage 2: RLVR (until convergence) ← MAIN TRAINING
├─ Stage 3-4: Polish with SFT+RLHF
└─ Focus: Capability, performance, correctness
PARADIGM SHIFT: RL went from “polish” to “core capability builder”
THE EXPLOSION - 20+ MODELS IN 6 MONTHS #
Q9: Who released reasoning models in 2025? #
A: Everyone. Literally everyone:
- Jan 2025: DeepSeek R1, Kimi 1.5
- Mar 2025: OpenReasoner-Zero (first fully open!)
- Apr 2025: Seed-Thinking 1.5, Phi-4 Reasoning
- May 2025: Llama-Nemotron, INTELLECT-2, Xiaomi MiMo, Qwen 3
- Jun 2025: Hunyuan-TurboS, Skywork OR-1, OpenThoughts, Magistral
- Jul 2025+: Kimi K2, GLM-4.5, Nemotron Nano 2, MiniMax-M1…
20+ models in ~6 months!
Organizations: ByteDance, Microsoft, Meta, Alibaba, Mistral, Moonshot AI, OpenBMB, Zhipu AI, Nvidia, MiniMax…
KEY INSIGHT: This isn’t one lab’s secret sauce. This is a FIELD-WIDE revolution.
Q10: What are the common training techniques across these models? #
A: 10 techniques repeatedly used (with examples):
1. OFFLINE DIFFICULTY FILTERING #
- What: Pre-filter training data by difficulty
- Why: Model can only learn from 20-80% solvable problems
- Who: Seed-Thinking, OpenReasoner-Zero, Phi-4, Qwen 3
- Logic:
- Too easy (100% solve) = no gradient
- Too hard (0% solve) = no gradient
- Just right (20-80%) = optimal learning!
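The filtering step can be sketched in a few lines, assuming solve rates were pre-computed by sampling a reference model on each problem (the 20-80% band is the figure cited above; exact thresholds vary by recipe):

```python
def filter_by_difficulty(problems, solve_rates, low=0.2, high=0.8):
    """Keep problems whose pre-computed solve rate falls in [low, high].

    solve_rate == 1.0 -> every rollout is correct, advantages are all zero, no gradient.
    solve_rate == 0.0 -> every rollout is wrong, same problem.
    The middle band is where correct and incorrect rollouts coexist,
    so the policy gradient actually has signal.
    """
    return [p for p, r in zip(problems, solve_rates) if low <= r <= high]
```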
2. PER-BATCH ONLINE FILTERING #
- What: During training, filter problems dynamically
- Why: Model capability changes as it learns
- Who: Kimi 1.5, Magistral, Llama-Nemotron
- Example:
- Week 1: Model solves 30% → include these
- Week 4: Model solves 80% → filter out (too easy)
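A sketch of the online variant, assuming each problem is rolled out several times within the current batch so its solve rate can be measured against the *current* policy (structure and names are illustrative):

```python
def online_filter_batch(batch_rewards, low=0.2, high=0.8):
    """batch_rewards: {problem_id: list of 0/1 rewards from this step's rollouts}.

    Returns the problem ids to keep for this gradient update. Unlike offline
    filtering, the solve rate here reflects the current policy, so the filter
    automatically tightens as the model improves during training.
    """
    keep = []
    for pid, rewards in batch_rewards.items():
        rate = sum(rewards) / len(rewards)
        if low <= rate <= high:
            keep.append(pid)
    return keep
```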
3. REMOVE KL PENALTY #
- What: Turn off KL divergence regularization
- Why: Let model explore reasoning space freely
- Who: RAGEN, Magistral, OpenReasoner-Zero
- Insight:
- Traditional RLHF: KL penalty keeps model close to SFT init
- Reasoning: Remove KL → model can learn NEW behaviors
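Mechanically, the change amounts to zeroing one coefficient in the loss. A scalar sketch (real losses are per-token tensors; names are illustrative):

```python
def policy_loss(pg_loss: float, kl_divergence: float, kl_coef: float = 0.0) -> float:
    """Classic RLHF adds kl_coef * KL(policy || reference) to keep the policy
    close to its SFT init. Setting kl_coef = 0.0, as some reasoning recipes do,
    removes that anchor and lets the policy drift into genuinely new behaviors.
    """
    return pg_loss + kl_coef * kl_divergence
```

A side benefit noted in practice: with the KL term gone, the frozen reference model no longer needs to be kept in memory during training.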
4. FORMAT REWARDS #
- What: Reward the model for using <think>...</think> tags
- Why: Ensure consistent, parseable reasoning format
- Who: DeepSeek R1, OpenReasoner-Zero, Magistral
- Impact:
- Without: Model might reason in inconsistent ways
- With: Guaranteed structured output for downstream systems
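One plausible implementation is a regex check; the exact pattern and bonus value below are assumptions, since each recipe defines its own format criteria:

```python
import re

# Completion must open with a <think>...</think> block, then contain a final answer.
THINK_PATTERN = re.compile(r"^<think>.*?</think>\s*\S", re.DOTALL)

def format_reward(completion: str) -> float:
    """Small additive bonus (value illustrative) when the completion follows
    the expected think-then-answer structure, 0 otherwise."""
    return 0.1 if THINK_PATTERN.match(completion) else 0.0
```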
5. LANGUAGE CONSISTENCY REWARDS #
- What: Penalize switching languages mid-reasoning
- Why: Better UX, easier to debug
- Who: DeepSeek R1, Magistral (multilingual models)
- Example:
- Bad: “Let me solve… 让我们计算… donc la réponse est…”
- Good: Stick to one language throughout reasoning
6. LENGTH PENALTIES #
- What: Penalize overthinking (too many reasoning tokens)
- Why: Combat diminishing returns on long chains
- Who: Kimi 1.5, INTELLECT-2
- Problem: Model might generate 50K tokens to solve 2+2
- Solution: Progressive length limits or small penalties
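A hypothetical penalty schedule, for illustration only (the budget, cap, and linear shape are assumptions, not any specific model's published numbers):

```python
def length_penalty(num_tokens: int, budget: int = 8192, max_penalty: float = 0.5) -> float:
    """No penalty within the token budget, then a penalty growing linearly
    with the overshoot, clipped at max_penalty once length reaches 2x budget.
    Subtracted from the correctness reward to discourage overthinking."""
    if num_tokens <= budget:
        return 0.0
    overshoot = min((num_tokens - budget) / budget, 1.0)
    return max_penalty * overshoot
```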
7. LOSS NORMALIZATION (Batch vs Group) #
- What: Normalize advantages at batch level (not group)
- Why: Avoid bias towards low-variance problems
- Who: Magistral, MiMo
- Comparison:
- GRPO original: Normalize per group of responses
- Alternative: Normalize across entire batch
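A pure-Python sketch of the two modes (one plausible reading of the distinction, not any paper's exact formula; real implementations operate on tensors):

```python
import statistics

def normalize_advantages(rewards_per_group, mode="batch"):
    """rewards_per_group: list of lists, one inner list per prompt (a GRPO group).

    'group': center and scale each group by its own mean/std (original GRPO).
    'batch': pool statistics over the whole batch, which avoids up-weighting
    low-variance groups (nearly-solved or nearly-impossible prompts).
    """
    eps = 1e-6  # guard against zero std
    if mode == "group":
        out = []
        for group in rewards_per_group:
            mu, sigma = statistics.mean(group), statistics.pstdev(group)
            out.append([(r - mu) / (sigma + eps) for r in group])
        return out
    flat = [r for g in rewards_per_group for r in g]
    mu, sigma = statistics.mean(flat), statistics.pstdev(flat)
    return [[(r - mu) / (sigma + eps) for r in group] for group in rewards_per_group]
```

Note how the all-correct group below gets zero advantage under group normalization but a positive one under batch normalization; that difference is the bias being debated.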
8. PARALLEL TEST-TIME COMPUTE #
- What: Generate N answers, pick best via majority/scorer
- Why: Boosts accuracy without retraining
- Who: DeepSeek R1, Phi-4, Claude 4, DeepSeek-GRM
- Methods:
- Method 1: Majority voting (most common answer)
- Method 2: Scorer model (trained to pick best)
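Method 1 (majority voting, also called self-consistency) is essentially a one-liner with `collections.Counter`:

```python
from collections import Counter

def majority_vote(answers):
    """Pick the most frequent final answer among N sampled completions.
    Ties fall to insertion order here, which is arbitrary; real systems
    may break ties with a scorer model instead."""
    return Counter(answers).most_common(1)[0][0]
```

This only helps when the sampled answers disagree in the right way: independent errors scatter, while correct reasoning paths tend to converge on the same final answer.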
9. TEXT-ONLY REASONING BOOSTS MULTIMODAL #
- What: Train reasoning on text, improves vision+text too!
- Why: Reasoning transfers across modalities
- Who: Magistral, MiMo-VL
- Surprising finding: Don’t need vision data for vision boost!
10. TOGGLEABLE REASONING (System Prompt) #
- What: User controls reasoning depth via system prompt
- Why: Fast answers for easy Q’s, deep for hard Q’s
- Who: Llama-Nemotron, Qwen 3
- Example:
- System: “Use minimal thinking”
- Model: [short chain] → fast, cheap
- System: “Think deeply”
- Model: [long chain] → accurate, expensive
FUTURE DIRECTIONS #
Q11: Where is this field going next? #
A: Three frontier areas (from the book):
1. BEYOND MATH/CODE #
- Current: RLVR works great for verifiable domains
- Next: How to apply to non-verifiable domains?
- Challenge: Need new reward signal types
2. DISTILLATION #
- Current: Train big reasoning model → distill to small
- Next: Can small models learn reasoning directly?
- Trend: OpenThoughts dataset (distilled reasoning chains)
3. MULTIMODAL REASONING #
- Current: Text-only reasoning boosts vision (surprising!)
- Next: End-to-end multimodal reasoning training
- Example: MiMo-VL, Xiaomi’s multimodal reasoning
Q12: What’s the one thing experts still debate? #
A: Whether RL is necessary or if distillation is enough:
CAMP 1: “RL is essential” #
- Evidence: DeepSeek R1 cold-start shows RL creates novel behaviors
- Claim: Can’t just imitate, must explore via RL
CAMP 2: “Distillation is enough” #
- Evidence: OpenThoughts (distilled from QwQ) works well
- Claim: Cheaper, faster, good enough for most uses
REALITY: Probably both have roles #
- Frontier: Use RL to discover new reasoning patterns
- Production: Distill to smaller models for deployment
QUICK REFERENCE SUMMARY #
Key Timeline #
- 2018: “RL doesn’t work” conventional wisdom
- Sep 2024: OpenAI o1 proves reasoning at scale
- Jan 2025: DeepSeek R1 releases full recipe
- Jan-Jul 2025: 20+ reasoning models released
Core Concepts #
- RLHF: Human preferences, chat/safety, expensive, 1-2 epochs
- RLVR: Binary correctness, math/code, free, hundreds of epochs
- Inference-time scaling: Trade compute for accuracy at test time
- Reasoning chains: Model “thinks out loud” in <think>...</think> tags
DeepSeek R1 Recipe (4 Stages) #
- Cold-start (100K samples)
- Large-scale RL (core training)
- Rejection sampling (generalization)
- Mixed RL (polish)
Top 10 Training Techniques #
- Offline difficulty filtering (20-80% sweet spot)
- Per-batch online filtering (dynamic adjustment)
- Remove KL penalty (exploration freedom)
- Format rewards (structured output)
- Language consistency (better UX)
- Length penalties (combat overthinking)
- Loss normalization (batch vs group)
- Parallel test-time compute (majority voting)
- Text-only boosts multimodal (transfer learning)
- Toggleable reasoning (user control)