Ch9. Instruction Fine-Tuning (IFT/SFT)

Q1: What are chat templates and why do they matter? #

A: Chat templates are the formatting conventions that turn a structured conversation into the single token sequence a language model actually processes. They use special tokens (such as <|im_start|> and <|im_end|>) to mark the boundaries between the different parts of the conversation. They matter because the model must see exactly the same format at fine-tuning time and at inference time; a format mismatch degrades response quality.

Example:

<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
What is 2+2?<|im_end|>
<|im_start|>assistant
The answer is 4<|im_end|>
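
In practice these tokens are rarely typed by hand; the tokenizer renders them from a list of role/content messages. A minimal sketch using the Hugging Face `apply_chat_template` API (the checkpoint name is just an example of a model with a ChatML-style template; the exact rendered string depends on the model):

```python
from transformers import AutoTokenizer

# Example checkpoint; any instruct model that ships a chat template works.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "The answer is 4"},
]

# Renders the conversation with the model's special boundary tokens.
text = tokenizer.apply_chat_template(messages, tokenize=False)
print(text)
```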

Q2: What are the three message roles and how do they differ? #

A:

  • System: Sets persistent context/instructions for the entire conversation. Applied once at the beginning. Think of it as “background instructions” that influence all responses.
  • User: Messages from the person using the AI
  • Assistant: Individual responses from the AI model

Key point: System provides context that affects all assistant responses, but each assistant message is a separate turn in the conversation.


Q3: What is prompt masking in IFT/SFT? #

A: Prompt masking means the model sees all tokens (prompt + response) but the loss is only calculated on the assistant’s response tokens. The prompt tokens are excluded from loss computation.

Why? We want the model to learn to generate good responses, not to predict user queries.

Example:

User: "What is 2+2?" ← SEEN but NO LOSS applied
Assistant: "The answer is 4" ← SEEN and LOSS applied

Q4: How is “masking” different in IFT/SFT vs BERT/GPT pretraining? #

| Training Type | Masked Tokens Visibility | Loss Applied To | Purpose |
|---|---|---|---|
| BERT | ❌ Hidden / not seen | ✅ Masked tokens | Predict missing words |
| GPT pretraining | ❌ Hidden (future tokens) | ✅ All tokens | Predict next token |
| IFT/SFT | ✅ Fully visible | ✅ Only assistant responses | Generate good responses |

Critical Difference:

  • BERT/GPT: Masked tokens NOT SEEN during training, INCLUDED in loss
  • IFT/SFT: Masked tokens FULLY SEEN during training, EXCLUDED from loss

Q5: Why use multi-turn conversations instead of just single-turn? #

A: Multi-turn data teaches the model:

  1. Context tracking - understand references to previous turns
  2. Conversation coherence - maintain consistency across dialogue
  3. Real conversation skills - handle follow-ups and clarifications

Example why this matters:

Turn 1
User: "I have a dog"
Assistant: "What breed?"

Turn 2
User: "He's very playful" ← needs to understand "he" = the dog

Training on single-turn data only would leave models poor at maintaining this kind of conversational context.


Q6: What are the supervised learning pairs in IFT/SFT? #

A: The pairs are: [Full conversation context] → [Next assistant response]

Single-turn:

Input: [System + User1] → Target: [Assistant1]

Multi-turn (unrolled into multiple training examples):

Example 1: [System + User1] → Target: [Assistant1]
Example 2: [System + User1 + Assistant1 + User2] → Target: [Assistant2]
Example 3: [System + User1 + Assistant1 + User2 + Assistant2 + User3] → Target: [Assistant3]

Key insight: One N-turn conversation creates N training examples, each predicting a different assistant response.
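
A minimal sketch of this unrolling in plain Python, assuming messages are stored as role/content dictionaries as in the chat-template example above:

```python
def unroll(conversation: list[dict]) -> list[tuple[list[dict], dict]]:
    """Turn one conversation into one training pair per assistant turn.

    Each pair is (all messages before the assistant turn, that assistant turn).
    """
    pairs = []
    for i, message in enumerate(conversation):
        if message["role"] == "assistant":
            pairs.append((conversation[:i], message))
    return pairs


conversation = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "I have a dog"},
    {"role": "assistant", "content": "What breed?"},
    {"role": "user", "content": "He's very playful"},
    {"role": "assistant", "content": "Sounds like a fun companion!"},
]

for context, target in unroll(conversation):
    print(len(context), "context messages ->", repr(target["content"]))
# 2 context messages -> 'What breed?'
# 4 context messages -> 'Sounds like a fun companion!'
```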


Q7: What exactly gets “masked” in multi-turn training? #

A:

For Turn 2 example:

  • Input sequence (what model sees): [System + User1 + Assistant1 + User2 + Assistant2]
  • Masked from loss: [System + User1 + Assistant1 + User2]
  • Loss applied to: [Assistant2] only

The model sees everything for context, but only learns to generate the current assistant response.
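
A minimal sketch of building the Turn 2 example, reusing the -100 convention and a stand-in tokenizer as in the earlier sketch (real code would render the chat template rather than the `role: content` strings used here):

```python
IGNORE_INDEX = -100


def build_example(context: list[dict], target: dict, tokenize) -> dict:
    """Concatenate every message, but label only the target assistant turn."""
    input_ids: list[int] = []
    labels: list[int] = []
    for message in context:
        ids = tokenize(f"{message['role']}: {message['content']}")
        input_ids += ids
        labels += [IGNORE_INDEX] * len(ids)  # visible for context, no loss
    target_ids = tokenize(f"assistant: {target['content']}")
    input_ids += target_ids
    labels += target_ids  # loss is applied here only
    return {"input_ids": input_ids, "labels": labels}
```

Fed the Turn 2 context [System + User1 + Assistant1 + User2] and Assistant2 as the target, this yields exactly the split described above; it pairs naturally with the unroll helper from Q6.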


Q8: What are the key implementation differences from pretraining? #

| Aspect | Pretraining | IFT/SFT |
|---|---|---|
| Batch size | Large (1024-2048) | Smaller (256) |
| Loss applied to | All tokens | Only assistant responses |
| Training data | Raw text | Structured conversations |

Key differences:

  1. Smaller batch sizes - fewer GPUs needed
  2. Prompt masking - loss only on responses
  3. Multi-turn masking - only final assistant turn per example
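
On the loss side, the masking in points 2 and 3 falls out almost for free: PyTorch's cross-entropy skips positions labeled -100. A minimal sketch with dummy tensors (a real causal-LM trainer also shifts labels so each position predicts the next token; libraries such as transformers do that shift internally when you pass labels):

```python
import torch
import torch.nn.functional as F

batch, seq_len, vocab = 2, 8, 50_000
logits = torch.randn(batch, seq_len, vocab)         # stand-in model output
labels = torch.randint(0, vocab, (batch, seq_len))  # stand-in token labels
labels[:, :5] = -100                                # mask the prompt portion

# Prompt positions (label -100) contribute nothing to the loss.
loss = F.cross_entropy(
    logits.view(-1, vocab),
    labels.view(-1),
    ignore_index=-100,
)
print(loss.item())
```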

Q9: What are the best practices for instruction tuning? #

A:

  • Quality over quantity - High-quality completions are crucial (model learns from responses)
  • Scale - roughly 1M prompts is sufficient for excellent results, with diminishing returns beyond that
  • Data distribution matters - Use prompts similar to target use cases
  • Optimize the whole pipeline - models can recover from some noise, so focus on the complete pipeline rather than perfecting every individual sample

Content Summary #

Core Concept: Instruction fine-tuning transforms pretrained language models into conversational assistants by teaching them to generate appropriate responses to user queries.

Key Mechanisms:

  1. Chat templates structure conversations with role-based formatting
  2. Prompt masking ensures models learn response generation, not query prediction
  3. Multi-turn training develops conversational coherence and context tracking
  4. Supervised learning pairs full context with target responses

Critical Insight: The “masking” terminology is overloaded:

  • In IFT/SFT: “masked” = excluded from loss (but still visible to model)
  • In BERT/GPT: “masked” = hidden from model (and included in loss for prediction)

Most Important Takeaway: In IFT/SFT, the model sees the entire conversation history for context, but only learns to predict the assistant’s responses. This creates models that can follow instructions while maintaining conversational context.