Day 1 - Foundational LLMs & Text Generation

Foundations of LLMs #

1. Why LLMs Matter #

Traditional NLP systems were narrow, but Large Language Models (LLMs) offer general-purpose capabilities like translation, Q&A, and summarization—all without explicit task-specific programming.

→ How do LLMs work under the hood?

2. What Powers LLMs: The Transformer #

The Transformer is the core architecture behind LLMs. Unlike RNNs, which process data sequentially, Transformers handle all input positions in parallel using self-attention, allowing them to model long-range dependencies efficiently and to scale training to far larger datasets and models.

→ But to understand how Transformers process input, we need to examine how input data is prepared.

3. Input Preparation & Embedding #

Before data enters the Transformer, it is tokenized, embedded into high-dimensional vectors, and enhanced with positional encodings to preserve word order. These embeddings become the input that feeds into the attention mechanism.
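
To make this concrete, here is a minimal NumPy sketch of the idea, using a toy three-word vocabulary and sinusoidal positional encodings (real systems use learned subword tokenizers such as BPE and trained embedding tables, so the names and sizes here are purely illustrative):

```python
import numpy as np

# Toy illustration (not a real tokenizer): map words to ids, look up
# embeddings, and add sinusoidal positional encodings.
vocab = {"the": 0, "cat": 1, "sat": 2}          # hypothetical vocabulary
d_model = 8                                      # embedding dimension
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), d_model))  # learned in practice

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal encodings as in 'Attention Is All You Need'."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    enc = np.zeros((seq_len, d_model))
    enc[:, 0::2] = np.sin(angle[:, 0::2])
    enc[:, 1::2] = np.cos(angle[:, 1::2])
    return enc

token_ids = [vocab[w] for w in "the cat sat".split()]   # "tokenization"
x = embedding_table[token_ids] + positional_encoding(len(token_ids), d_model)
print(x.shape)  # (3, 8): one position-aware vector per token
```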

→ So, once we have these embeddings—how does the model understand relationships within the input?

4. Self-Attention and Multi-Head Attention #

The self-attention mechanism calculates how each word relates to every other word. Multi-head attention expands on this by letting the model attend to different relationships in parallel (e.g., syntax, co-reference). This enables rich, contextual understanding.
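
A minimal NumPy sketch of scaled dot-product self-attention and a two-head variant (dimensions and random weights are illustrative; real models learn all of these matrices and add masking during decoding):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    """Scaled dot-product self-attention for a single head."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])  # how much each token attends to every other token
    return softmax(scores) @ v               # weighted mix of value vectors

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 3, 8, 2
d_head = d_model // n_heads
x = rng.normal(size=(seq_len, d_model))

# Multi-head attention: run several independent heads and concatenate them.
heads = []
for _ in range(n_heads):
    Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
    heads.append(self_attention(x, Wq, Wk, Wv))
out = np.concatenate(heads, axis=-1) @ rng.normal(size=(d_model, d_model))  # output projection
print(out.shape)  # (3, 8)
```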

→ To manage this complexity across layers, the architecture needs stabilization techniques.

5. Layer Normalization and Residual Connections #

To avoid training instability and gradient issues, Transformers use residual connections and layer normalization, ensuring smooth learning across deep layers.

→ After stabilizing, each layer further transforms the data with an extra module…

6. Feedforward Layers #

Each token’s representation is independently refined using position-wise feedforward networks that add depth and non-linearity—enhancing the model’s ability to capture abstract patterns.
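
A sketch tying sections 5 and 6 together: each sub-layer's output is added back to its input (residual connection) and normalized, and the position-wise feed-forward network is applied to every token independently. This follows the post-norm layout of the original Transformer; the identity "attention" below is only a stand-in to keep the example short:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token's vector to zero mean and unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise FFN: applied to every token independently, with a ReLU non-linearity."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

def transformer_sublayers(x, attn_fn, ffn_params):
    # Residual connection + layer normalization around each sub-layer.
    x = layer_norm(x + attn_fn(x))
    x = layer_norm(x + feed_forward(x, *ffn_params))
    return x

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32
ffn_params = (rng.normal(size=(d_model, d_ff)), np.zeros(d_ff),
              rng.normal(size=(d_ff, d_model)), np.zeros(d_model))
x = rng.normal(size=(3, d_model))
y = transformer_sublayers(x, attn_fn=lambda t: t, ffn_params=ffn_params)  # identity "attention" for brevity
print(y.shape)  # (3, 8)
```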

→ With these components, we now have building blocks for the full Transformer structure.

7. Encoder-Decoder Architecture #

In the original Transformer, the encoder turns input text into a contextual representation, and the decoder autoregressively generates output using that context. However, modern LLMs like GPT simplify this by using decoder-only models for direct generation.

→ As LLMs scale, new architectures emerge to improve efficiency and specialization.

8. Mixture of Experts (MoE) #

MoE architectures use specialized sub-models (experts) activated selectively via a gating mechanism. This allows LLMs to scale massively while using only a portion of the model per input—enabling high performance with lower cost.

→ But performance isn’t just about architecture—reasoning capabilities are equally vital.

9. Building Reasoning into LLMs #

Reasoning is enabled via multiple strategies:

  • Chain-of-Thought prompting: Guide the model to generate intermediate reasoning steps (a prompt sketch follows this list).
  • Tree-of-Thoughts: Explore reasoning paths via branching.
  • Least-to-Most: Build up from simpler subproblems.
  • Fine-tuning on reasoning datasets and RLHF further optimize for correctness and coherence.
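
For illustration, a few-shot chain-of-thought prompt can look like the string below (the worked example is made up for demonstration; the string can be sent to any LLM text-generation API):

```python
# A hypothetical few-shot chain-of-thought prompt: the worked example shows the
# model the intermediate steps we want it to imitate before giving an answer.
cot_prompt = """Q: A library had 42 books and bought 3 boxes of 8 books each. How many books does it have now?
A: The library bought 3 * 8 = 24 new books. 42 + 24 = 66. The answer is 66.

Q: A train travels 60 km per hour for 2.5 hours. How far does it go?
A:"""
print(cot_prompt)  # pass this text to the model; it should continue with step-by-step reasoning
```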

→ To train these reasoning patterns, we need carefully prepared data and efficient training pipelines.

10. Training the Transformer #

Training involves:

  • Data preparation: Clean, tokenize, and build vocabulary.
  • Loss calculation: Compare outputs to targets using cross-entropy.
  • Backpropagation: Update weights with optimizers (e.g., Adam); a minimal training-step sketch follows this list.
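
A minimal PyTorch sketch of one such step on a toy next-token objective (the two-layer "model", vocabulary size, and random batch are placeholders, not a real Transformer):

```python
import torch
import torch.nn as nn

vocab_size, d_model, seq_len, batch = 100, 32, 16, 4     # toy sizes
model = nn.Sequential(                                    # stand-in for a real Transformer
    nn.Embedding(vocab_size, d_model),
    nn.Linear(d_model, vocab_size),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

tokens = torch.randint(0, vocab_size, (batch, seq_len))   # pretend tokenized text
inputs, targets = tokens[:, :-1], tokens[:, 1:]           # predict the next token

logits = model(inputs)                                    # (batch, seq_len-1, vocab_size)
loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))  # cross-entropy vs targets
loss.backward()                                           # backpropagation
optimizer.step()                                          # Adam weight update
optimizer.zero_grad()
print(float(loss))
```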

→ Depending on the architecture, training objectives differ.

11. Model-Specific Training Strategies #

  • Decoder-only (e.g., GPT): Predict the next token from the prior sequence (input/target construction is sketched after this list).
  • Encoder-only (e.g., BERT): Mask tokens and reconstruct.
  • Encoder-decoder (e.g., T5): Learn input-to-output mapping for tasks like translation or summarization.
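
For instance, the inputs and targets for the first two objectives can be constructed like this (toy token ids; the -100 "ignore this position" value follows a common convention, e.g. PyTorch's default ignore index):

```python
import numpy as np

tokens = np.array([11, 42, 7, 99, 3])        # a toy tokenized sentence
MASK_ID = 0                                   # hypothetical [MASK] token id

# Decoder-only (causal LM): inputs are the sequence, targets are the same
# sequence shifted left by one position.
causal_inputs, causal_targets = tokens[:-1], tokens[1:]

# Encoder-only (masked LM): randomly mask ~15% of tokens and ask the model
# to reconstruct the originals at the masked positions only.
rng = np.random.default_rng(0)
mask = rng.random(len(tokens)) < 0.15
masked_inputs = np.where(mask, MASK_ID, tokens)
mlm_targets = np.where(mask, tokens, -100)    # loss is ignored where target is -100

print(causal_inputs, causal_targets)
print(masked_inputs, mlm_targets)
```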

Training quality is influenced by context length—longer context allows better modeling of dependencies, but at higher compute cost.

12. The Evolution Begins: From Attention to Transformers #

It started with the 2017 “Attention Is All You Need” paper—laying the groundwork for all Transformer-based models. This sparked a sequence of breakthroughs in model architecture and training methods.

13. GPT-1: Unsupervised Pre-training Breakthrough #

GPT-1 was a decoder-only model trained on BooksCorpus with a pioneering strategy: pre-train on unlabeled text, then fine-tune on supervised tasks. It showed that generative pre-training transfers well to downstream tasks and outperforms training task-specific models from scratch, inspiring the unified Transformer approach.

14. BERT: Deep Understanding through Masking #

BERT introduced the encoder-only model that trains on masked tokens and sentence relationships. Unlike GPT, it’s not a generator but an understander, excelling in classification, NLU, and inference tasks.

15. GPT-2: Scaling Up Leads to Zero-Shot Learning #

By expanding to 1.5B parameters and using a diverse dataset (WebText), GPT-2 revealed that larger models can generalize better, even to unseen tasks—zero-shot prompting emerged as a surprising capability.

16. GPT-3 to GPT-4: Generalist Reasoners with Instruction-Tuning #

GPT-3 scaled to 175B parameters and could handle many tasks through few-shot prompting alone, without task-specific fine-tuning. Later versions (GPT-3.5, GPT-4) added coding ability, longer context windows, multimodal inputs, and improved instruction following via InstructGPT-style tuning and RLHF.

17. LaMDA: Dialogue-Focused Language Modeling #

Google’s LaMDA was purpose-built for open-ended conversations, emphasizing turn-based flow and topic diversity—unlike GPT, which handled more general tasks.

18. Gopher: Bigger is Smarter (Sometimes) #

DeepMind’s Gopher used high-quality MassiveText data. It scaled to 280B parameters and showed that size improves knowledge-intensive tasks, but not always reasoning—hinting at the importance of data quality and task balance.

19. GLaM: Efficient Scaling with Mixture-of-Experts #

GLaM pioneered sparse activation, using only parts of a trillion-parameter network per input. It demonstrated that MoE architectures can outperform dense ones using far less compute.

20. Chinchilla: The Scaling Laws Revolution #

Chinchilla showed that the earlier scaling laws (Kaplan et al.) were compute-suboptimal. DeepMind demonstrated that the data-to-parameter ratio matters: a smaller model trained on more data can outperform much larger ones.

21. PaLM and PaLM 2: Distributed and Smarter #

PaLM (540B) used Google’s TPU Pathways for efficient large-scale training. PaLM 2 reduced parameters but improved performance via architectural tweaks, showcasing that smarter design beats brute force.

22. Gemini Family: Multimodal, Efficient, and Scalable #

Gemini models support text, images, audio, and video inputs. Key innovations include:

  • Mixture-of-experts backbone
  • Context windows of up to 10M tokens demonstrated in research (Gemini 1.5 Pro)
  • Versions for cloud (Pro), mobile (Nano), and ultra-scale inference (Ultra)
  • Gemini 2.0 Flash enables fast, explainable reasoning for science/math tasks.

23. Gemma: Open-Sourced and Lightweight #

Built on Gemini tech, Gemma models are optimized for accessibility. Lightweight variants (from 2B up to 27B parameters) balance performance and efficiency, with Gemma 3 offering 128K-token context windows and support for over 140 languages.

24. LLaMA Series: Meta’s Open Challenger #

Meta’s LLaMA models evolved with increased context length and safety. LLaMA 2 introduced chat-optimized variants; LLaMA 3.2 added multilingual and visual capabilities with quantization for on-device use.

25. Mixtral: Sparse Experts and Open Access #

Mistral AI’s Mixtral 8x7B uses sparse MoE with only 13B active params per token, excelling in code and long-context tasks. Instruction-tuned variants rival closed-source models.

26. OpenAI O1: Internal Chain-of-Thought #

OpenAI’s “o1” models use deliberate internal CoT reasoning to excel in programming, science, and Olympiad-level tasks, aiming for thoughtful, high-accuracy outputs.

27. DeepSeek: RL Without Labels #

DeepSeek showed that reasoning can be trained largely through reinforcement learning rather than labeled data: DeepSeek-R1-Zero used pure RL, while DeepSeek-R1 combines their GRPO method with rejection sampling and multi-stage fine-tuning, reaching performance comparable to “o1”.

28. The Open Frontier #

Multiple open models are pushing the boundaries:

  • Qwen 1.5 (Alibaba): up to 72B params, strong multilingual support.
  • Yi (01.AI): 3.1T token dataset, 200k context length, vision support.
  • Grok 3 (xAI): 1M context tokens, trained with RL for strategic reasoning.

29. Comparing the Giants #

Transformer models have scaled in size, context, and capability. From 117M to 1T+ parameters, from 512-token limits to 10M-token contexts. Key insights:

  • Bigger is not always better—efficiency, data quality, and training methods matter more.
  • Reasoning and instruction-following are now central.
  • Multimodality and retrieval-augmented generation are shaping next-gen LLMs.

Fine-Tuning and Using LLMs #

30. From Pretraining to Specialization: Why Fine-Tune? #

LLMs are pretrained on broad data to learn general language patterns. But for real-world use, we often need them to follow specific instructions, engage in safe dialogues, or behave reliably. This is where fine-tuning comes in.

31. Supervised Fine-Tuning (SFT): The First Specialization Step #

SFT improves LLM behavior using high-quality labeled datasets. Typical goals:

  • Better instruction-following
  • Multi-turn dialogue (chat)
  • Safer, less toxic outputs

Example formats: Q&A, summarization, translations—each with clear input-output training pairs.

32. Reinforcement Learning from Human Feedback (RLHF) #

SFT gives positive examples. But what about discouraging bad outputs? RLHF introduces a reward model trained on human preferences, which then helps guide the LLM via reinforcement learning to:

  • Prefer helpful, safe, and fair responses
  • Avoid toxic or misleading completions

Advanced variants include RLAIF (AI feedback) and DPO (direct preference optimization) to reduce reliance on human labels.

33. Parameter Efficient Fine-Tuning (PEFT): Adapting Without Full Retraining #

Full fine-tuning is costly. PEFT methods train small, targeted modules instead:

  • Adapters: Mini-modules injected into LLM layers, trained separately
  • LoRA: Low-rank matrices update original weights efficiently
  • QLoRA: Quantized LoRA for even lower memory
  • Soft Prompting: Trainable vectors (not full prompts) condition the frozen model

PEFT enables plug-and-play modules across tasks, saving memory and time.
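
As a sketch of the idea behind LoRA: the pretrained weight matrix stays frozen, and only two small low-rank matrices are trained on top of it (the sizes, rank, and scaling below are illustrative):

```python
import numpy as np

d_in, d_out, rank = 1024, 1024, 8
rng = np.random.default_rng(0)

W = rng.normal(size=(d_out, d_in))            # frozen pretrained weight
A = rng.normal(size=(rank, d_in)) * 0.01      # trainable low-rank factor
B = np.zeros((d_out, rank))                   # initialized to zero so the update starts at 0
alpha = 16                                    # scaling hyperparameter

def lora_forward(x):
    # Original path plus the low-rank correction (alpha/rank scaling as in the LoRA paper).
    return W @ x + (alpha / rank) * (B @ (A @ x))

x = rng.normal(size=d_in)
print(lora_forward(x).shape)                              # (1024,)
print(A.size + B.size, "trainable vs", W.size, "frozen")  # tiny fraction of the weights
```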

34. Fine-Tuning in Practice (Code Example) #

Google Cloud’s Vertex AI supports SFT of Gemini models using JSONL datasets and APIs. A few lines of code initialize the model, start fine-tuning, and query the tuned endpoint, all on cloud infrastructure, as sketched below.
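
A minimal sketch of what that flow can look like with the Vertex AI Python SDK. Module paths, model names, project IDs, and bucket paths are assumptions that vary by SDK version and project setup, so treat the specifics as illustrative:

```python
# Sketch of supervised fine-tuning on Vertex AI (illustrative values throughout).
import time
import vertexai
from vertexai.generative_models import GenerativeModel
from vertexai.tuning import sft

vertexai.init(project="my-project", location="us-central1")   # hypothetical project

# Each JSONL row holds an input/output training pair for supervised tuning.
job = sft.train(
    source_model="gemini-1.5-flash-002",             # base model to specialize (example name)
    train_dataset="gs://my-bucket/sft_train.jsonl",  # hypothetical bucket path
)

while not job.has_ended:   # poll until the tuning job finishes
    time.sleep(60)
    job.refresh()

tuned = GenerativeModel(job.tuned_model_endpoint_name)   # use the new endpoint
print(tuned.generate_content("Summarize: ...").text)
```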

35. Using LLMs Effectively: Prompt Engineering #

LLMs respond differently based on how you ask:

  • Zero-shot: Just the instruction
  • Few-shot: Add 2–5 examples
  • Chain-of-thought: Show step-by-step reasoning

Effective prompting is key to controlling tone, factuality, or creativity.

36. Sampling Techniques: Controlling Output Style #

After generating probabilities, sampling chooses the next token:

  • Greedy: Always highest prob (safe but repetitive)
  • Random/Temperature: More creativity
  • Top-K / Top-P: Add diversity while maintaining focus
  • Best-of-N: Generate multiple candidates, choose best

Choose based on your goal: safety, creativity, or logic.
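
A NumPy sketch of how temperature, top-K, and top-P reshape the next-token distribution before a token is drawn (the logits are made up):

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=None):
    rng = rng or np.random.default_rng()
    logits = np.array(logits, dtype=float) / temperature   # <1 sharpens, >1 flattens
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    if top_k is not None:                                   # keep only the k most likely tokens
        cutoff = np.sort(probs)[-top_k]
        probs = np.where(probs >= cutoff, probs, 0.0)
        probs /= probs.sum()

    if top_p is not None:                                   # keep the smallest set covering mass >= p
        order = np.argsort(probs)[::-1]
        cumulative = np.cumsum(probs[order])
        n_keep = int(np.searchsorted(cumulative, top_p)) + 1
        mask = np.zeros_like(probs)
        mask[order[:n_keep]] = 1.0
        probs = probs * mask

    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

logits = [2.0, 1.0, 0.5, -1.0]                              # toy vocabulary of 4 tokens
print(sample_next_token(logits, temperature=0.7, top_k=3, top_p=0.9))
```

Greedy decoding corresponds to always taking `np.argmax(logits)`; Best-of-N simply calls a sampler like this several times and keeps the candidate that scores best under some criterion.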

37. Task-Based Evaluation: Beyond Accuracy #

As LLMs become foundational platforms, reliable evaluation is critical:

  • Custom datasets: Reflect real production use
  • System-level context: Include RAG and workflows, not just model
  • Multi-dimensional “good”: Not just matching ground truth but business outcomes

38. Evaluation Methods #

  1. Traditional metrics: Fast but rigid
  2. Human evaluation: Gold standard, but costly
  3. LLM-powered autoraters: Scalable evaluations with rubrics, rationales, and subtasks

Meta-evaluation calibrates autoraters to human preferences—essential for trust.

Conclusion #

This section links training, fine-tuning, and usage of LLMs in a production-ready loop:

  • Train generally → fine-tune specifically
  • Prompt smartly → sample selectively
  • Evaluate robustly

Together, these techniques ensure LLMs are accurate, safe, helpful, and aligned with real-world needs.

Accelerating Inference in LLMs #

39. Scaling vs Efficiency: Why Speed Matters Now #

LLMs have grown 1000x in parameter count. While quality has improved, cost and latency of inference have also skyrocketed. Developers now face an essential tradeoff: balancing performance with resource efficiency for real-world deployments.

40. The Big Tradeoffs #

a. Quality vs Latency/Cost #

  • Sacrifice a bit of quality for big speed gains (e.g., smaller models, quantization).
  • Works well for simpler tasks where top-tier quality isn’t needed.

b. Latency vs Cost (Throughput) #

  • Trade speed for bulk efficiency (or vice versa).
  • Useful in scenarios like chatbots (low latency) vs offline processing (high throughput).

41. Output-Approximating Methods #

These techniques may slightly affect output quality, but yield major gains in performance.

🔹 Quantization #

  • Reduce weight/activation precision (e.g., 32-bit → 8-bit); a minimal sketch follows this list.
  • Saves memory and accelerates math operations.
  • Some quality loss, but often negligible with tuning.
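
A minimal sketch of symmetric per-tensor int8 weight quantization (real schemes are usually per-channel or per-group and may quantize activations as well):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization: store int8 values plus one float scale."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256)).astype(np.float32)        # a toy float32 weight matrix
q, scale = quantize_int8(w)

print(w.nbytes, "bytes ->", q.nbytes, "bytes")             # 4x less memory
print("max abs error:", np.abs(w - dequantize(q, scale)).max())
```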

🔹 Distillation #

  • Use a smaller student model trained to mimic a larger teacher model.
  • Techniques:
    • Data distillation: Generate synthetic data with teacher.
    • Knowledge distillation: Train the student’s output distribution to match the teacher’s (a loss sketch follows this list).
    • On-policy distillation: Reinforcement learning feedback per token.
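
A sketch of a knowledge-distillation loss, assuming we already have teacher and student logits for the same batch (temperature softening as in standard knowledge distillation; sizes are toy values):

```python
import numpy as np

def softmax(z, temperature=1.0):
    z = z / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between temperature-softened teacher and student distributions."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return np.sum(p_teacher * (np.log(p_teacher) - np.log(p_student)), axis=-1).mean()

rng = np.random.default_rng(0)
teacher_logits = rng.normal(size=(4, 10))   # batch of 4, vocabulary of 10
student_logits = rng.normal(size=(4, 10))
print(distillation_loss(student_logits, teacher_logits))  # the student is trained to minimize this
```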

42. Output-Preserving Methods #

These do not degrade quality and should be prioritized.

🔹 Flash Attention #

  • Optimizes memory movement during attention.
  • 2–4x latency improvement with exactly the same output.

🔹 Prefix Caching #

  • Cache attention computations (KV Cache) for unchanged inputs.
  • Ideal for chat histories or uploaded documents across multiple queries.

🔹 Speculative Decoding #

  • A small “drafter” model predicts tokens ahead.
  • Main model verifies in parallel.
  • Huge speed-up with no quality loss, if the drafter is well aligned with the main model (a simplified sketch follows this list).
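
A sketch of the idea using a simplified greedy-verification variant. Here `draft_next` and `target_next` are hypothetical stand-ins for the small and large models; production systems verify the drafted tokens probabilistically and score all draft positions in a single large-model pass rather than one at a time:

```python
def speculative_decode(prompt, draft_next, target_next, k=4, max_new=32):
    """Greedy speculative-decoding sketch.

    draft_next(tokens)  -> next token id from the small, fast drafter
    target_next(tokens) -> next token id from the large model (in practice the
                           large model verifies all k draft positions in ONE pass)
    """
    tokens = list(prompt)
    while len(tokens) < len(prompt) + max_new:
        # 1) The drafter speculates k tokens cheaply.
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # 2) The target model verifies them: keep the longest agreeing prefix,
        #    then append the target's own token at the first disagreement.
        accepted = []
        for t in draft:
            expected = target_next(tokens + accepted)
            if t == expected:
                accepted.append(t)          # draft token accepted "for free"
            else:
                accepted.append(expected)   # fall back to the target's choice
                break
        tokens.extend(accepted)
    return tokens

# Toy usage with stand-in "models" that always emit token 7.
print(speculative_decode([1, 2, 3], lambda t: 7, lambda t: 7, k=4, max_new=8))
```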

43. Batching and Parallelization #

Beyond ML-specific tricks, use general system-level methods:

  • Batching: Handle multiple decode requests at once.
  • Parallelization: Distribute heavy compute ops across TPUs/GPUs.

Decode is memory-bound and can benefit from parallel batching as long as memory limits aren’t exceeded.

Summary #

Inference optimization is about smarter engineering, not just faster chips. You can:

  • Trade off quality when it’s safe.
  • Preserve output via caching and algorithmic improvements.
  • Use hybrid setups like speculative decoding + batching.
  • Choose methods based on your task: low-latency chat, high-volume pipelines, or edge deployment.

Speed and cost matter—especially at scale.

Applications and Outlook #

44. LLMs in Action: Real-World Applications #

After mastering training, inference, and prompting, the final step is applying LLMs to real tasks. These models have transformed how we interact with information across modalities—text, code, images, audio, and video.

45. Core Text-Based Applications #

🔹 Code and Mathematics #

LLMs support:

  • Code generation, completion, debugging, refactoring
  • Test case and documentation generation
  • Language translation between programming languages

Tools like AlphaCode 2, FunSearch, and AlphaGeometry push competitive coding and theorem solving to new heights.

🔹 Machine Translation #

LLMs understand idioms and context:

  • Chat translations in apps
  • Culturally-aware e-commerce descriptions
  • Voice translations in travel apps

🔹 Text Summarization #

Use cases:

  • Summarizing news with tone
  • Creating abstracts for scientific research
  • Thread summaries in chat apps

🔹 Question-Answering #

LLMs reason through queries with:

  • Personalization (e.g. in customer support)
  • Depth (e.g. in academic platforms)
  • RAG-enhanced factuality and improved prompts

🔹 Chatbots #

Unlike rule-based bots, LLMs handle:

  • Fashion + support on retail sites
  • Sentiment-aware entertainment moderation

🔹 Content Generation #

  • Ads, marketing, blogs, scriptwriting
  • Use creativity-vs-correctness sampling tuning

🔹 Natural Language Inference #

  • Legal analysis, diagnosis, sentiment detection
  • LLMs bridge subtle context to derive conclusions

🔹 Text Classification #

  • Spam detection, news topic tagging
  • Feedback triage, model scoring as “autoraters”

🔹 Text Analysis #

  • Market trends from social media
  • Thematic and character analysis in literature

46. Multimodal Applications #

Beyond text, multimodal LLMs analyze and generate across data types:

  • Creative: Narrate stories from images or video
  • Educational: Personalized visual+audio content
  • Business: Chatbots using both image+text inputs
  • Medical: Scans + notes = richer diagnostics
  • Research: Drug discovery using cross-data fusion

Multimodal systems build on unimodal strengths, scaling to more sensory and intelligent interactions.

Summary #

  • Transformer is the backbone of modern LLMs.
  • Model performance depends on size and training data diversity.
  • Fine-tuning strategies like SFT, RLHF, and safety tuning personalize models for real-world needs.
  • Inference optimization is critical—use PEFT, Flash Attention, prefix caching, and speculative decoding.
  • Prompt engineering and sampling tuning matter for precision or creativity.
  • Applications are exploding—text, code, chat, multimodal interfaces.

LLMs are not just tools—they’re platforms. They’re reshaping how we search, chat, learn, create, and discover.