Ch7. Reasoning & Inference-time Scaling




THE BIG PICTURE - WHY IS THIS BREAKTHROUGH? #

Q1: What actually happened in 2024–2025 that makes Chapter 7 revolutionary? #

A: Two seismic events changed everything:

  1. OpenAI o1 (Sep 2024) - Showed reasoning models work at scale
  2. DeepSeek R1 (Jan 2025) - Fully documented the recipe openly

Timeline:

  • Before: “RL doesn’t work” (per a famous 2018 blog post)
  • After: 20+ reasoning models released in 6 months (Jan-Jul 2025)

This is like going from “flight is impossible” to “commercial airlines everywhere” in 6 months. The paradigm shifted THAT fast.

Q2: Why couldn’t we do this before 2024? #

A: Three critical barriers were overcome:

BARRIER 1: RL Stability #

  • Problem: RL training was “fickle” - crashed, failed randomly
  • Solution: Open-source tools matured (TRL, veRL, OpenRLHF)
  • Result: “Technical barriers to entry at an all-time low”

BARRIER 2: Model Capability Threshold #

  • Problem: Smaller/weaker models couldn’t learn reasoning via RL
  • Solution: Base models from ~2024 onwards were “capable enough”
  • Result: RL training could finally elicit reasoning behaviors

BARRIER 3: Verifiable Rewards #

  • Problem: How do we know if reasoning is correct?
  • Solution: Math/coding have binary correctness (RLVR)
  • Result: Train without expensive preference data!

Q3: What’s the difference between RLHF and RLVR? #

A: Critical distinction that defines the new era:

RLHF (Old Way) #

  • Reward: Human preference labels (expensive, subjective)
  • Use case: Chat, safety, style, “vibes”
  • Cost: $1-10 per preference pair
  • Training: 1-2 epochs, careful not to overfit
  • Example: ChatGPT’s politeness, Claude’s helpfulness

RLVR (New Way) - “Reinforcement Learning from Verifiable Rewards” #

  • Reward: Binary correctness (right/wrong answer)
  • Use case: Math, coding, reasoning, STEM
  • Cost: Nearly free (automated checking)
  • Training: Hundreds/thousands of epochs until convergence
  • Example: DeepSeek R1’s math ability, o1’s coding

KEY INSIGHT: RLVR doesn’t need humans in the loop for correctness! Just check if answer == ground_truth
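
The check above can be sketched in a few lines. This is a toy string-matching verifier with a hypothetical name (`rlvr_reward`); real RLVR pipelines typically use symbolic math checkers or unit tests rather than string comparison:

```python
def rlvr_reward(answer: str, ground_truth: str) -> float:
    """Binary verifiable reward: 1.0 if the answer matches, else 0.0.

    Toy normalization only -- production verifiers parse the math
    symbolically or run unit tests instead of comparing strings.
    """
    def norm(s: str) -> str:
        return s.strip().lower().replace(" ", "").rstrip(".")
    return 1.0 if norm(answer) == norm(ground_truth) else 0.0
```

No human in the loop: the reward is computed automatically for every sampled answer.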



CORE CONCEPTS - WHAT IS INFERENCE-TIME SCALING? #

Q4: What is “inference-time scaling”? #

A: Using MORE COMPUTE at inference to get BETTER answers.

Traditional LLM: #

├─ User asks: "Solve this math problem"
├─ Model generates: [answer in 100 tokens]
└─ Done. Fixed cost.

Reasoning Model (Inference-time Scaling): #

├─ User asks: "Solve this math problem"
├─ Model thinks: [2000 tokens of <think>...</think> reasoning]
├─ Model generates: [final answer]
└─ Uses 20x more tokens = 20x more compute = better accuracy!

ANALOGY:

  • Traditional = “quick guess”
  • Reasoning = “show your work” (like you did in school)
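
A minimal sketch of how a downstream system might split such an output, assuming the `<think>...</think>` convention (`split_reasoning` is a hypothetical helper, not any API's actual function):

```python
import re

def split_reasoning(output: str):
    """Separate the <think>...</think> chain from the final answer."""
    m = re.search(r"<think>(.*?)</think>", output, re.DOTALL)
    if not m:
        return "", output.strip()  # model emitted no thinking block
    return m.group(1).strip(), output[m.end():].strip()

thinking, answer = split_reasoning(
    "<think>Try approach A... no, B works: 3*4+2=14</think>\nThe answer is 14."
)
```

The thinking tokens can then be hidden, summarized, or billed separately from the final answer.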

Q5: What’s the difference between training-time and inference-time scaling? #

A: Where you spend the compute:

TRAINING-TIME SCALING (Traditional) #

  • Spend compute: During training (one-time cost)
  • Method: Bigger models, more training data, longer training
  • Result: Smarter model for ALL users
  • Example: GPT-3 (175B) → GPT-4 (1.7T) = bigger = smarter
  • Tradeoff: Expensive upfront, cheap at inference

INFERENCE-TIME SCALING (New Era) #

  • Spend compute: During EACH query (per-user cost)
  • Method: Model “thinks” longer (more tokens) for hard Qs
  • Result: Same model, but “tries harder” on demand
  • Example: o1 generates 2K-30K thinking tokens per hard Q
  • Tradeoff: Cheaper training, expensive inference (but better!)

BREAKTHROUGH: You can now trade inference cost for accuracy! (Like buying “thinking time” on demand)

Q6: How does RL training create inference-time scaling? #

A: RL teaches the model to “think out loud” for hard problems:

The Process: #

Step 1: Model encounters hard math problem

Step 2: Model generates reasoning chain:

<think>
Let me break this down...
First, I'll try approach A...
Wait, that doesn't work...
Let me try approach B...
Ah! That works because...
Therefore the answer is...
</think>

Step 3: Final answer: [correct]

Step 4: RL reward: +1 (correct!) → Reinforce this “thinking behavior”

After thousands of epochs: #

  • Model learns: “Hard problems = think longer = higher accuracy”
  • Correlation emerges: More tokens → Better performance
  • This is NOT length bias (old RLHF problem)
  • This is PRODUCTIVE reasoning


THE CANONICAL RECIPE - DEEPSEEK R1 #

Q7: What is DeepSeek R1’s training recipe? #

A: 4-stage process that became the blueprint for 2025:

STAGE 1: “Cold-Start” (100K+ samples) #

  • What: Sample from earlier RL checkpoint (R1-Zero)
  • Filter: Keep only high-quality reasoning chains
  • Goal: Teach model the PROCESS of reasoning
  • Why “cold-start”? The RL run is bootstrapped from minimal supervised data, unlike traditional SFT pipelines that rely on millions of examples

STAGE 2: Large-Scale RL (The Core) #

  • What: Run RLVR “until convergence”
  • Data: Reasoning problems (math, coding, STEM)
  • Epochs: HUNDREDS (not 1-2 like traditional fine-tuning!)
  • Reward: Binary correctness (right/wrong)
  • Note: This is where the magic happens - model learns to reason!

STAGE 3: Rejection Sampling (Transition to General) #

  • Mix: 3/4 reasoning problems + 1/4 general queries
  • Goal: Don’t lose general chat ability
  • Method: Sample multiple answers, keep best ones

STAGE 4: Mixed RL (Polish) #

  • RLVR: For reasoning domains (verifiable)
  • RLHF: For general domains (human preferences)
  • Goal: Final model that’s both smart AND pleasant

Q8: What’s revolutionary about this recipe vs previous RLHF? #

A: Complete inversion of traditional priorities:

Traditional RLHF (InstructGPT, ChatGPT): #

├─ Stage 1: SFT (1M examples, 1-2 epochs) ← MAIN TRAINING
├─ Stage 2: RLHF (100K pref pairs) ← "Cherry on top"
└─ Focus: Behavior, style, safety

DeepSeek R1 (Reasoning Era): #

├─ Stage 1: Cold-start (100K, minimal SFT)
├─ Stage 2: RLVR (until convergence) ← MAIN TRAINING
├─ Stage 3-4: Polish with SFT+RLHF
└─ Focus: Capability, performance, correctness

PARADIGM SHIFT: RL went from “polish” to “core capability builder”



THE EXPLOSION - 20+ MODELS IN 6 MONTHS #

Q9: Who released reasoning models in 2025? #

A: Everyone. Literally everyone:

  • Jan 2025: DeepSeek R1, Kimi 1.5
  • Mar 2025: OpenReasoner-Zero (first fully open!)
  • Apr 2025: Seed-Thinking 1.5, Phi-4 Reasoning
  • May 2025: Llama-Nemotron, INTELLECT-2, Xiaomi MiMo, Qwen 3
  • Jun 2025: Hunyuan-TurboS, Skywork OR-1, OpenThoughts, Magistral
  • Jul 2025+: Kimi K2, GLM-4.5, Nemotron Nano 2, MiniMax-M1…

20+ models in ~6 months!

Organizations: ByteDance, Microsoft, Meta, Alibaba, Mistral, Moonshot AI, OpenBMB, Zhipu AI, Nvidia, MiniMax…

KEY INSIGHT: This isn’t one lab’s secret sauce. This is a FIELD-WIDE revolution.

Q10: What are the common training techniques across these models? #

A: 10 techniques repeatedly used (with examples):

1. OFFLINE DIFFICULTY FILTERING #

  • What: Pre-filter training data by difficulty
  • Why: Model can only learn from 20-80% solvable problems
  • Who: Seed-Thinking, OpenReasoner-Zero, Phi-4, Qwen 3
  • Logic:
    • Too easy (100% solve) = no gradient
    • Too hard (0% solve) = no gradient
    • Just right (20-80%) = optimal learning!
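
The filtering logic can be sketched as follows, assuming a hypothetical `model_solve(problem) -> bool` sampler and the illustrative 20-80% thresholds from above:

```python
import random

def estimate_solve_rate(problem, model_solve, k=16):
    """Monte Carlo pass rate: fraction of k sampled attempts that succeed."""
    return sum(model_solve(problem) for _ in range(k)) / k

def difficulty_filter(problems, model_solve, lo=0.2, hi=0.8, k=16):
    """Keep only problems whose pass rate lands in the (lo, hi) learning zone."""
    return [p for p in problems
            if lo <= estimate_solve_rate(p, model_solve, k) <= hi]

# Toy model: solves "easy" always, "mid" half the time, "hard" never.
random.seed(0)
rates = {"easy": 1.0, "mid": 0.5, "hard": 0.0}
fake_solve = lambda p: random.random() < rates[p]
kept = difficulty_filter(["easy", "mid", "hard"], fake_solve, k=200)  # -> ["mid"]
```

Only "mid" survives: the always-solved and never-solved problems produce no useful gradient.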

2. PER-BATCH ONLINE FILTERING #

  • What: During training, filter problems dynamically
  • Why: Model capability changes as it learns
  • Who: Kimi 1.5, Magistral, Llama-Nemotron
  • Example:
    • Week 1: Model solves 30% → include these
    • Week 4: Model solves 80% → filter out (too easy)

3. REMOVE KL PENALTY #

  • What: Turn off KL divergence regularization
  • Why: Let model explore reasoning space freely
  • Who: RAGEN, Magistral, OpenReasoner-Zero
  • Insight:
    • Traditional RLHF: KL penalty keeps model close to SFT init
    • Reasoning: Remove KL → model can learn NEW behaviors

4. FORMAT REWARDS #

  • What: Reward model for using <think>...</think> tags
  • Why: Ensure consistent, parseable reasoning format
  • Who: DeepSeek R1, OpenReasoner-Zero, Magistral
  • Impact:
    • Without: Model might reason in inconsistent ways
    • With: Guaranteed structured output for downstream systems
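
A hedged sketch of such a format reward (the 0.2 bonus and the exact well-formedness rules here are illustrative, not any lab's actual implementation):

```python
import re

# One think block at the start, then a non-empty final answer.
THINK_BLOCK = re.compile(r"^\s*<think>.*?</think>\s*\S", re.DOTALL)

def format_reward(output: str) -> float:
    """Small bonus when the output contains exactly one well-formed
    <think>...</think> block followed by a non-empty answer."""
    well_formed = (output.count("<think>") == 1
                   and output.count("</think>") == 1
                   and THINK_BLOCK.match(output) is not None)
    return 0.2 if well_formed else 0.0
```

This gets added to the correctness reward, so the model is nudged toward parseable output even before it starts solving problems reliably.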

5. LANGUAGE CONSISTENCY REWARDS #

  • What: Penalize switching languages mid-reasoning
  • Why: Better UX, easier to debug
  • Who: DeepSeek R1, Magistral (multilingual models)
  • Example:
    • Bad: “Let me solve… 让我们计算 (‘let us calculate’)… donc la réponse est (‘so the answer is’)…”
    • Good: Stick to one language throughout reasoning
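
A naive sketch of the idea (real systems use proper language identification over the whole chain; the Latin-vs-CJK script check and -0.1 weight here are illustrative assumptions):

```python
def language_consistency_reward(reasoning: str) -> float:
    """Small penalty when a reasoning chain mixes Latin and CJK scripts.

    Crude heuristic for illustration only -- production models run a
    real language-ID classifier over the chain.
    """
    has_latin = any("a" <= ch.lower() <= "z" for ch in reasoning)
    has_cjk = any("\u4e00" <= ch <= "\u9fff" for ch in reasoning)
    return -0.1 if (has_latin and has_cjk) else 0.0
```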

6. LENGTH PENALTIES #

  • What: Penalize overthinking (too many reasoning tokens)
  • Why: Combat diminishing returns on long chains
  • Who: Kimi 1.5, INTELLECT-2
  • Problem: Model might generate 50K tokens to solve 2+2
  • Solution: Progressive length limits or small penalties
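
One way to sketch a soft length penalty (the budget, cap, and linear shape are illustrative assumptions, not a specific paper's schedule):

```python
def length_penalty(num_think_tokens: int,
                   budget: int = 8192,
                   max_penalty: float = 0.5) -> float:
    """Linear overlength penalty: thinking is free up to `budget` tokens,
    then the penalty grows with the overage, capped at `max_penalty`."""
    overage = max(0, num_think_tokens - budget)
    return -min(max_penalty, max_penalty * overage / budget)
```

Shrinking `budget` over the course of training (“progressive length limits”) tightens the same pressure gradually.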

7. LOSS NORMALIZATION (Batch vs Group) #

  • What: Normalize advantages at batch level (not group)
  • Why: Avoid bias towards low-variance problems
  • Who: Magistral, MiMo
  • Comparison:
    • GRPO original: Normalize per group of responses
    • Alternative: Normalize across entire batch
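
A simplified sketch contrasting the two choices on raw rewards (a group std of zero falls back to 1.0; exact normalization details vary across papers, and this batch-level variant is one plausible reading):

```python
from statistics import mean, pstdev

def group_advantages(rewards_per_group, batch_level=False):
    """GRPO-style advantages: subtract each group's mean reward, then divide
    by either the group's own std (original) or one shared batch std."""
    all_rewards = [r for group in rewards_per_group for r in group]
    batch_std = pstdev(all_rewards) or 1.0
    out = []
    for group in rewards_per_group:
        mu = mean(group)
        std = batch_std if batch_level else (pstdev(group) or 1.0)
        out.append([(r - mu) / std for r in group])
    return out
```

In group mode, a low-variance group gets its tiny reward gap blown up to the same scale as everyone else's; batch mode keeps one consistent scale across problems.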

8. PARALLEL TEST-TIME COMPUTE #

  • What: Generate N answers, pick best via majority/scorer
  • Why: Boosts accuracy without retraining
  • Who: DeepSeek R1, Phi-4, Claude 4, DeepSeek-GRM
  • Methods:
    • Method 1: Majority voting (most common answer)
    • Method 2: Scorer model (trained to pick best)
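
The majority-voting method is tiny to sketch (assuming final answers have already been extracted from the N sampled solutions):

```python
from collections import Counter

def majority_vote(final_answers):
    """Parallel test-time compute: sample N full solutions, extract each
    final answer, and return the most common one (self-consistency)."""
    return Counter(final_answers).most_common(1)[0][0]
```

For example, `majority_vote(["42", "41", "42", "42", "17"])` picks `"42"`; the scorer-model variant replaces the frequency count with a learned ranking.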

9. TEXT-ONLY REASONING BOOSTS MULTIMODAL #

  • What: Train reasoning on text, improves vision+text too!
  • Why: Reasoning transfers across modalities
  • Who: Magistral, MiMo-VL
  • Surprising finding: Don’t need vision data for vision boost!

10. TOGGLEABLE REASONING (System Prompt) #

  • What: User controls reasoning depth via system prompt
  • Why: Fast answers for easy questions, deep reasoning for hard ones
  • Who: Llama-Nemotron, Qwen 3
  • Example:
    • System: “Use minimal thinking”
      • Model: [short chain] → fast, cheap
    • System: “Think deeply”
      • Model: [long chain] → accurate, expensive


FUTURE DIRECTIONS #

Q11: Where is this field going next? #

A: Three frontier areas (from the book):

1. BEYOND MATH/CODE #

  • Current: RLVR works great for verifiable domains
  • Next: How to apply to non-verifiable domains?
  • Challenge: Need new reward signal types

2. DISTILLATION #

  • Current: Train big reasoning model → distill to small
  • Next: Can small models learn reasoning directly?
  • Trend: OpenThoughts dataset (distilled reasoning chains)

3. MULTIMODAL REASONING #

  • Current: Text-only reasoning boosts vision (surprising!)
  • Next: End-to-end multimodal reasoning training
  • Example: MiMo-VL, Xiaomi’s multimodal reasoning

Q12: What’s the one thing experts still debate? #

A: Whether RL is necessary or if distillation is enough:

CAMP 1: “RL is essential” #

  • Evidence: DeepSeek R1 cold-start shows RL creates novel behaviors
  • Claim: Can’t just imitate, must explore via RL

CAMP 2: “Distillation is enough” #

  • Evidence: OpenThoughts (distilled from QwQ) works well
  • Claim: Cheaper, faster, good enough for most uses

REALITY: Probably both have roles #

  • Frontier: Use RL to discover new reasoning patterns
  • Production: Distill to smaller models for deployment


QUICK REFERENCE SUMMARY #

Key Timeline #

  • 2018: “RL doesn’t work” conventional wisdom
  • Sep 2024: OpenAI o1 proves reasoning at scale
  • Jan 2025: DeepSeek R1 releases full recipe
  • Jan-Jul 2025: 20+ reasoning models released

Core Concepts #

  • RLHF: Human preferences, chat/safety, expensive, 1-2 epochs
  • RLVR: Binary correctness, math/code, free, hundreds of epochs
  • Inference-time scaling: Trade compute for accuracy at test time
  • Reasoning chains: Model “thinks out loud” in <think> tags

DeepSeek R1 Recipe (4 Stages) #

  1. Cold-start (100K samples)
  2. Large-scale RL (core training)
  3. Rejection sampling (generalization)
  4. Mixed RL (polish)

Top 10 Training Techniques #

  1. Offline difficulty filtering (20-80% sweet spot)
  2. Per-batch online filtering (dynamic adjustment)
  3. Remove KL penalty (exploration freedom)
  4. Format rewards (structured output)
  5. Language consistency (better UX)
  6. Length penalties (combat overthinking)
  7. Loss normalization (batch vs group)
  8. Parallel test-time compute (majority voting)
  9. Text-only boosts multimodal (transfer learning)
  10. Toggleable reasoning (user control)