Ch7. Reasoning & Inference-time Scaling




THE BIG PICTURE - WHY IS THIS BREAKTHROUGH? #

Q1: What actually happened in 2024–2025 that makes Chapter 7 revolutionary? #

A: Two seismic events changed everything:

  1. OpenAI o1 (Sep 2024) - Showed reasoning models work at scale
  2. DeepSeek R1 (Jan 2025) - Fully documented the recipe openly

Timeline:

  • Before: “RL doesn’t work” (per a famous 2018 blog post)
  • After: 20+ reasoning models released in 6 months (Jan-Jul 2025)

This is like going from “flight is impossible” to “commercial airlines everywhere” in 6 months. The paradigm shifted THAT fast.

Q2: Why couldn’t we do this before 2024? #

A: Three critical barriers were overcome:

BARRIER 1: RL Stability #

  • Problem: RL training was “fickle” - crashed, failed randomly
  • Solution: Open-source tools matured (TRL, veRL, OpenRLHF)
  • Result: “Technical barriers to entry at an all-time low”

BARRIER 2: Model Capability Threshold #

  • Problem: Smaller/weaker models couldn’t learn reasoning via RL
  • Solution: Base models from ~2024 onwards were “capable enough”
  • Result: RL training could finally elicit reasoning behaviors

BARRIER 3: Verifiable Rewards #

  • Problem: How do we know if reasoning is correct?
  • Solution: Math/coding have binary correctness (RLVR)
  • Result: Train without expensive preference data!

Q3: What’s the difference between RLHF and RLVR? #

A: Critical distinction that defines the new era:

RLHF (Old Way) #

  • Reward: Human preference labels (expensive, subjective)
  • Use case: Chat, safety, style, “vibes”
  • Cost: $1-10 per preference pair
  • Training: 1-2 epochs, careful not to overfit
  • Example: ChatGPT’s politeness, Claude’s helpfulness

RLVR (New Way) - “Reinforcement Learning from Verifiable Rewards” #

  • Reward: Binary correctness (right/wrong answer)
  • Use case: Math, coding, reasoning, STEM
  • Cost: Nearly free (automated checking)
  • Training: Hundreds/thousands of epochs until convergence
  • Example: DeepSeek R1’s math ability, o1’s coding

KEY INSIGHT: RLVR doesn’t need humans in the loop for correctness! Just check if answer == ground_truth
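
The check above can be sketched in a few lines. This is a toy string-matching verifier with a hypothetical name (`rlvr_reward`); real RLVR pipelines typically use symbolic math checkers or unit tests rather than string comparison:

```python
def rlvr_reward(answer: str, ground_truth: str) -> float:
    """Binary verifiable reward: 1.0 if the answer matches, else 0.0.

    Toy normalization only -- production verifiers parse the math
    symbolically or run unit tests instead of comparing strings.
    """
    def norm(s: str) -> str:
        return s.strip().lower().replace(" ", "").rstrip(".")
    return 1.0 if norm(answer) == norm(ground_truth) else 0.0
```

No human in the loop: the reward is computed automatically for every sampled answer.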



CORE CONCEPTS - WHAT IS INFERENCE-TIME SCALING? #

Q4: What is “inference-time scaling”? #

A: Using MORE COMPUTE at inference to get BETTER answers.

Traditional LLM: #

├─ User asks: "Solve this math problem"
├─ Model generates: [answer in 100 tokens]
└─ Done. Fixed cost.

Reasoning Model (Inference-time Scaling): #

├─ User asks: "Solve this math problem"
├─ Model thinks: [2000 tokens of <think>...</think> reasoning]
├─ Model generates: [final answer]
└─ Uses 20x more tokens = 20x more compute = better accuracy!

ANALOGY:

  • Traditional = “quick guess”
  • Reasoning = “show your work” (like you did in school)
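
A minimal sketch of how a downstream system might split such an output, assuming the `<think>...</think>` convention (`split_reasoning` is a hypothetical helper, not any API's actual function):

```python
import re

def split_reasoning(output: str):
    """Separate the <think>...</think> chain from the final answer."""
    m = re.search(r"<think>(.*?)</think>", output, re.DOTALL)
    if not m:
        return "", output.strip()  # model emitted no thinking block
    return m.group(1).strip(), output[m.end():].strip()

thinking, answer = split_reasoning(
    "<think>Try approach A... no, B works: 3*4+2=14</think>\nThe answer is 14."
)
```

The thinking tokens can then be hidden, summarized, or billed separately from the final answer.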

Q5: What’s the difference between training-time and inference-time scaling? #

A: Where you spend the compute:

TRAINING-TIME SCALING (Traditional) #

  • Spend compute: During training (one-time cost)
  • Method: Bigger models, more training data, longer training
  • Result: Smarter model for ALL users
  • Example: GPT-3 (175B) → GPT-4 (1.7T) = bigger = smarter
  • Tradeoff: Expensive upfront, cheap at inference

INFERENCE-TIME SCALING (New Era) #

  • Spend compute: During EACH query (per-user cost)
  • Method: Model “thinks” longer (more tokens) for hard Qs
  • Result: Same model, but “tries harder” on demand
  • Example: o1 generates 2K-30K thinking tokens per hard Q
  • Tradeoff: Cheaper training, expensive inference (but better!)

BREAKTHROUGH: You can now trade inference cost for accuracy! (Like buying “thinking time” on demand)

Q6: How does RL training create inference-time scaling? #

A: RL teaches the model to “think out loud” for hard problems:

The Process: #

Step 1: Model encounters hard math problem

Step 2: Model generates reasoning chain:

<think>
Let me break this down...
First, I'll try approach A...
Wait, that doesn't work...
Let me try approach B...
Ah! That works because...
Therefore the answer is...
</think>

Step 3: Final answer: [correct]

Step 4: RL reward: +1 (correct!) → Reinforce this “thinking behavior”

After thousands of epochs: #

  • Model learns: “Hard problems = think longer = higher accuracy”
  • Correlation emerges: More tokens → Better performance
  • This is NOT length bias (old RLHF problem)
  • This is PRODUCTIVE reasoning


THE CANONICAL RECIPE - DEEPSEEK R1 #

Q7: What is DeepSeek R1’s training recipe? #

A: 4-stage process that became the blueprint for 2025:

STAGE 1: “Cold-Start” (100K+ samples) #

  • What: Sample from earlier RL checkpoint (R1-Zero)
  • Filter: Keep only high-quality reasoning chains
  • Goal: Teach model the PROCESS of reasoning
  • Why “cold-start”? The RL run is bootstrapped from minimal supervised data, unlike traditional SFT pipelines that rely on millions of examples

STAGE 2: Large-Scale RL (The Core) #

  • What: Run RLVR “until convergence”
  • Data: Reasoning problems (math, coding, STEM)
  • Epochs: HUNDREDS (not 1-2 like traditional fine-tuning!)
  • Reward: Binary correctness (right/wrong)
  • Note: This is where the magic happens - model learns to reason!

STAGE 3: Rejection Sampling (Transition to General) #

  • Mix: 3/4 reasoning problems + 1/4 general queries
  • Goal: Don’t lose general chat ability
  • Method: Sample multiple answers, keep best ones

STAGE 4: Mixed RL (Polish) #

  • RLVR: For reasoning domains (verifiable)
  • RLHF: For general domains (human preferences)
  • Goal: Final model that’s both smart AND pleasant

Q8: What’s revolutionary about this recipe vs previous RLHF? #

A: Complete inversion of traditional priorities:

Traditional RLHF (InstructGPT, ChatGPT): #

├─ Stage 1: SFT (1M examples, 1-2 epochs) ← MAIN TRAINING
├─ Stage 2: RLHF (100K pref pairs) ← "Cherry on top"
└─ Focus: Behavior, style, safety

DeepSeek R1 (Reasoning Era): #

├─ Stage 1: Cold-start (100K, minimal SFT)
├─ Stage 2: RLVR (until convergence) ← MAIN TRAINING
├─ Stage 3-4: Polish with SFT+RLHF
└─ Focus: Capability, performance, correctness

PARADIGM SHIFT: RL went from “polish” to “core capability builder”



THE EXPLOSION - 20+ MODELS IN 6 MONTHS #

Q9: Who released reasoning models in 2025? #

A: Everyone. Literally everyone:

  • Jan 2025: DeepSeek R1, Kimi 1.5
  • Mar 2025: OpenReasoner-Zero (first fully open!)
  • Apr 2025: Seed-Thinking 1.5, Phi-4 Reasoning
  • May 2025: Llama-Nemotron, INTELLECT-2, Xiaomi MiMo, Qwen 3
  • Jun 2025: Hunyuan-TurboS, Skywork OR-1, OpenThoughts, Magistral
  • Jul 2025+: Kimi K2, GLM-4.5, Nemotron Nano 2, MiniMax-M1…

20+ models in ~6 months!

Organizations: ByteDance, Microsoft, Meta, Alibaba, Mistral, Moonshot AI, OpenBMB, Zhipu AI, Nvidia, MiniMax…

KEY INSIGHT: This isn’t one lab’s secret sauce. This is a FIELD-WIDE revolution.

Q10: What are the common training techniques across these models? #

A: 10 techniques repeatedly used (with examples):

1. OFFLINE DIFFICULTY FILTERING #

  • What: Pre-filter training data by difficulty
  • Why: Model can only learn from 20-80% solvable problems
  • Who: Seed-Thinking, OpenReasoner-Zero, Phi-4, Qwen 3
  • Logic:
    • Too easy (100% solve) = no gradient
    • Too hard (0% solve) = no gradient
    • Just right (20-80%) = optimal learning!
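
The filtering logic can be sketched as follows, assuming a hypothetical `model_solve(problem) -> bool` sampler and the illustrative 20-80% thresholds from above:

```python
import random

def estimate_solve_rate(problem, model_solve, k=16):
    """Monte Carlo pass rate: fraction of k sampled attempts that succeed."""
    return sum(model_solve(problem) for _ in range(k)) / k

def difficulty_filter(problems, model_solve, lo=0.2, hi=0.8, k=16):
    """Keep only problems whose pass rate lands in the (lo, hi) learning zone."""
    return [p for p in problems
            if lo <= estimate_solve_rate(p, model_solve, k) <= hi]

# Toy model: solves "easy" always, "mid" half the time, "hard" never.
random.seed(0)
rates = {"easy": 1.0, "mid": 0.5, "hard": 0.0}
fake_solve = lambda p: random.random() < rates[p]
kept = difficulty_filter(["easy", "mid", "hard"], fake_solve, k=200)  # -> ["mid"]
```

Only "mid" survives: the always-solved and never-solved problems produce no useful gradient.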

2. PER-BATCH ONLINE FILTERING #

  • What: During training, filter problems dynamically
  • Why: Model capability changes as it learns
  • Who: Kimi 1.5, Magistral, Llama-Nemotron
  • Example:
    • Week 1: Model solves 30% → include these
    • Week 4: Model solves 80% → filter out (too easy)

3. REMOVE KL PENALTY #

  • What: Turn off KL divergence regularization
  • Why: Let model explore reasoning space freely
  • Who: RAGEN, Magistral, OpenReasoner-Zero
  • Insight:
    • Traditional RLHF: KL penalty keeps model close to SFT init
    • Reasoning: Remove KL → model can learn NEW behaviors

4. FORMAT REWARDS #

  • What: Reward model for using <think>...</think> tags
  • Why: Ensure consistent, parseable reasoning format
  • Who: DeepSeek R1, OpenReasoner-Zero, Magistral
  • Impact:
    • Without: Model might reason in inconsistent ways
    • With: Guaranteed structured output for downstream systems
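
A hedged sketch of such a format reward (the 0.2 bonus and the exact well-formedness rules here are illustrative, not any lab's actual implementation):

```python
import re

# One think block at the start, then a non-empty final answer.
THINK_BLOCK = re.compile(r"^\s*<think>.*?</think>\s*\S", re.DOTALL)

def format_reward(output: str) -> float:
    """Small bonus when the output contains exactly one well-formed
    <think>...</think> block followed by a non-empty answer."""
    well_formed = (output.count("<think>") == 1
                   and output.count("</think>") == 1
                   and THINK_BLOCK.match(output) is not None)
    return 0.2 if well_formed else 0.0
```

This gets added to the correctness reward, so the model is nudged toward parseable output even before it starts solving problems reliably.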

5. LANGUAGE CONSISTENCY REWARDS #

  • What: Penalize switching languages mid-reasoning
  • Why: Better UX, easier to debug
  • Who: DeepSeek R1, Magistral (multilingual models)
  • Example:
    • Bad: “Let me solve… 让我们计算 (‘let us calculate’)… donc la réponse est (‘so the answer is’)…”
    • Good: Stick to one language throughout reasoning
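
A naive sketch of the idea (real systems use proper language identification over the whole chain; the Latin-vs-CJK script check and -0.1 weight here are illustrative assumptions):

```python
def language_consistency_reward(reasoning: str) -> float:
    """Small penalty when a reasoning chain mixes Latin and CJK scripts.

    Crude heuristic for illustration only -- production models run a
    real language-ID classifier over the chain.
    """
    has_latin = any("a" <= ch.lower() <= "z" for ch in reasoning)
    has_cjk = any("\u4e00" <= ch <= "\u9fff" for ch in reasoning)
    return -0.1 if (has_latin and has_cjk) else 0.0
```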

6. LENGTH PENALTIES #

  • What: Penalize overthinking (too many reasoning tokens)
  • Why: Combat diminishing returns on long chains
  • Who: Kimi 1.5, INTELLECT-2
  • Problem: Model might generate 50K tokens to solve 2+2
  • Solution: Progressive length limits or small penalties
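
One way to sketch a soft length penalty (the budget, cap, and linear shape are illustrative assumptions, not a specific paper's schedule):

```python
def length_penalty(num_think_tokens: int,
                   budget: int = 8192,
                   max_penalty: float = 0.5) -> float:
    """Linear overlength penalty: thinking is free up to `budget` tokens,
    then the penalty grows with the overage, capped at `max_penalty`."""
    overage = max(0, num_think_tokens - budget)
    return -min(max_penalty, max_penalty * overage / budget)
```

Shrinking `budget` over the course of training (“progressive length limits”) tightens the same pressure gradually.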

7. LOSS NORMALIZATION (Batch vs Group) #

  • What: Normalize advantages at batch level (not group)
  • Why: Avoid bias towards low-variance problems
  • Who: Magistral, MiMo
  • Comparison:
    • GRPO original: Normalize per group of responses
    • Alternative: Normalize across entire batch
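
A simplified sketch contrasting the two choices on raw rewards (a group std of zero falls back to 1.0; exact normalization details vary across papers, and this batch-level variant is one plausible reading):

```python
from statistics import mean, pstdev

def group_advantages(rewards_per_group, batch_level=False):
    """GRPO-style advantages: subtract each group's mean reward, then divide
    by either the group's own std (original) or one shared batch std."""
    all_rewards = [r for group in rewards_per_group for r in group]
    batch_std = pstdev(all_rewards) or 1.0
    out = []
    for group in rewards_per_group:
        mu = mean(group)
        std = batch_std if batch_level else (pstdev(group) or 1.0)
        out.append([(r - mu) / std for r in group])
    return out
```

In group mode, a low-variance group gets its tiny reward gap blown up to the same scale as everyone else's; batch mode keeps one consistent scale across problems.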

8. PARALLEL TEST-TIME COMPUTE #

  • What: Generate N answers, pick best via majority/scorer
  • Why: Boosts accuracy without retraining
  • Who: DeepSeek R1, Phi-4, Claude 4, DeepSeek-GRM
  • Methods:
    • Method 1: Majority voting (most common answer)
    • Method 2: Scorer model (trained to pick best)
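
The majority-voting method is tiny to sketch (assuming final answers have already been extracted from the N sampled solutions):

```python
from collections import Counter

def majority_vote(final_answers):
    """Parallel test-time compute: sample N full solutions, extract each
    final answer, and return the most common one (self-consistency)."""
    return Counter(final_answers).most_common(1)[0][0]
```

For example, `majority_vote(["42", "41", "42", "42", "17"])` picks `"42"`; the scorer-model variant replaces the frequency count with a learned ranking.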

9. TEXT-ONLY REASONING BOOSTS MULTIMODAL #

  • What: Train reasoning on text, improves vision+text too!
  • Why: Reasoning transfers across modalities
  • Who: Magistral, MiMo-VL
  • Surprising finding: Don’t need vision data for vision boost!

10. TOGGLEABLE REASONING (System Prompt) #

  • What: User controls reasoning depth via system prompt
  • Why: Fast answers for easy questions, deep reasoning for hard ones
  • Who: Llama-Nemotron, Qwen 3
  • Example:
    • System: “Use minimal thinking”
      • Model: [short chain] → fast, cheap
    • System: “Think deeply”
      • Model: [long chain] → accurate, expensive


FUTURE DIRECTIONS #

Q11: Where is this field going next? #

A: Three frontier areas (from the book):

1. BEYOND MATH/CODE #

  • Current: RLVR works great for verifiable domains
  • Next: How to apply to non-verifiable domains?
  • Challenge: Need new reward signal types

2. DISTILLATION #

  • Current: Train big reasoning model → distill to small
  • Next: Can small models learn reasoning directly?
  • Trend: OpenThoughts dataset (distilled reasoning chains)

3. MULTIMODAL REASONING #

  • Current: Text-only reasoning boosts vision (surprising!)
  • Next: End-to-end multimodal reasoning training
  • Example: MiMo-VL, Xiaomi’s multimodal reasoning

Q12: What’s the one thing experts still debate? #

A: Whether RL is necessary or if distillation is enough:

CAMP 1: “RL is essential” #

  • Evidence: DeepSeek R1 cold-start shows RL creates novel behaviors
  • Claim: Can’t just imitate, must explore via RL

CAMP 2: “Distillation is enough” #

  • Evidence: OpenThoughts (distilled from QwQ) works well
  • Claim: Cheaper, faster, good enough for most uses

REALITY: Probably both have roles #

  • Frontier: Use RL to discover new reasoning patterns
  • Production: Distill to smaller models for deployment


QUICK REFERENCE SUMMARY #

Key Timeline #

  • 2018: “RL doesn’t work” conventional wisdom
  • Sep 2024: OpenAI o1 proves reasoning at scale
  • Jan 2025: DeepSeek R1 releases full recipe
  • Jan-Jul 2025: 20+ reasoning models released

Core Concepts #

  • RLHF: Human preferences, chat/safety, expensive, 1-2 epochs
  • RLVR: Binary correctness, math/code, free, hundreds of epochs
  • Inference-time scaling: Trade compute for accuracy at test time
  • Reasoning chains: Model “thinks out loud” in <think> tags

DeepSeek R1 Recipe (4 Stages) #

  1. Cold-start (100K samples)
  2. Large-scale RL (core training)
  3. Rejection sampling (generalization)
  4. Mixed RL (polish)

Top 10 Training Techniques #

  1. Offline difficulty filtering (20-80% sweet spot)
  2. Per-batch online filtering (dynamic adjustment)
  3. Remove KL penalty (exploration freedom)
  4. Format rewards (structured output)
  5. Language consistency (better UX)
  6. Length penalties (combat overthinking)
  7. Loss normalization (batch vs group)
  8. Parallel test-time compute (majority voting)
  9. Text-only boosts multimodal (transfer learning)
  10. Toggleable reasoning (user control)