Day 5 – MLOps for Generative AI #
1. Introduction #
The rise of foundation models and generative AI (gen AI) has brought a paradigm shift in how we build and deploy AI systems. From selecting architectures to managing prompts and grounding outputs in real data, traditional MLOps needs adaptation.
So how do we evolve MLOps for this new generative world?
2. What Are DevOps and MLOps? #
- DevOps: Automation + collaboration for software delivery (CI/CD, testing, reliability)
- MLOps: Adds ML-specific needs:
  - Data validation
  - Model evaluation
  - Monitoring
  - Experiment tracking
These core principles set the stage, but gen AI has unique needs.
3. Lifecycle of a Gen AI System #
The gen AI lifecycle introduces five major moments:
- Discover – Find suitable foundation models from a rapidly growing model zoo.
- Develop & Experiment – Iterate on prompts, use few-shot examples, and chains.
- Train/Tune – Use parameter-efficient fine-tuning.
- Deploy – Includes chains, prompt templates, databases, retrieval systems.
- Monitor & Govern – Ensure safety, fairness, drift detection, and lineage.
Each stage requires new tooling and processes compared to traditional ML.
4. Continuous Improvement in Gen AI #
- Gen AI focuses on adapting pre-trained models via:
  - Prompt tweaks
  - Model swaps
  - Multi-model chaining
- Fine-tuning and human feedback loops are still used when needed.
But not all orgs handle base model training—many just adapt existing FMs.
5. Discover Phase: Choosing the Right FM #
Why it’s hard:
- Explosion of open-source and proprietary FMs
- Variation in architecture, performance, licensing
Model selection is now a critical MLOps task.
6. Model Discovery Criteria #
Choosing a foundation model now involves nuanced trade-offs:
- Quality: Benchmarks, output inspection
- Latency & Throughput: Real-time chat ≠ batch summarization
- Maintenance: Hosted vs self-managed models
- Cost: Compute, serving, data storage
- Compliance: Licensing, regulation
Vertex Model Garden supports structured exploration of these options.
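To make these trade-offs explicit, a shortlist of candidates can be scored against weighted criteria. The sketch below is a minimal, hypothetical illustration: the candidate names, scores, and weights are placeholders that would come from benchmarks, load tests, and licensing review in practice.

```python
# Hypothetical illustration: rank candidate FMs with weighted criteria scores (0-10).
CANDIDATES = {
    "hosted-large-model":  {"quality": 9, "latency": 5, "maintenance": 9, "cost": 4, "compliance": 8},
    "open-weights-medium": {"quality": 7, "latency": 8, "maintenance": 5, "cost": 8, "compliance": 9},
}
WEIGHTS = {"quality": 0.35, "latency": 0.2, "maintenance": 0.15, "cost": 0.15, "compliance": 0.15}

def score(criteria: dict) -> float:
    """Weighted sum of criteria scores; weights reflect one team's priorities."""
    return sum(WEIGHTS[k] * v for k, v in criteria.items())

for name, criteria in sorted(CANDIDATES.items(), key=lambda kv: score(kv[1]), reverse=True):
    print(f"{name}: {score(criteria):.2f}")
```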
7. Develop & Experiment #
Building gen AI systems is iterative: prompt tweaks → model swap → eval → repeat.
This loop mirrors traditional ML but centers around prompts, not raw data.
8. Foundation Model Paradigm #
- Unlike predictive models, foundation models are multi-purpose.
- They show emergent behavior based on prompt structure.
- Prompts define task type (translation, generation, reasoning).
Small changes in wording can completely shift model output.
9. Prompted Model Component #
The key unit of experimentation in gen AI is: Prompt + Model → Prompted Model Component
This redefines MLOps: you now track prompt templates as first-class artifacts.
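A minimal sketch of what such an artifact might look like; the class, field names, and example values are hypothetical, but they capture the idea that a prompt template and a pinned model version are tracked and compared together.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass(frozen=True)
class PromptedModelComponent:
    """A prompt template pinned to a specific model version, tracked as one artifact."""
    name: str
    prompt_template: str          # e.g. "Summarize the ticket:\n{ticket_text}"
    model_id: str                 # any versioned model identifier
    parameters: dict = field(default_factory=lambda: {"temperature": 0.2})
    created_at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

# Experiments then compare components, not just models:
v1 = PromptedModelComponent("ticket-summarizer", "Summarize the ticket:\n{ticket_text}", "model-a-v1")
v2 = PromptedModelComponent("ticket-summarizer", "Summarize the ticket in 3 bullets:\n{ticket_text}", "model-a-v1")
```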
10. Prompt = Code + Data #
Prompts often include:
- Code-like structures (templates, control flow, guardrails)
- Data-like elements (examples, contexts, user input)
MLOps must version prompts, track results, and match to model versions.
11. Chains & Augmentation #
When prompts alone aren’t enough:
- Chains: Link multiple prompted models + APIs
- RAG: Retrieve relevant info before generation
- Agents: LLMs choose tools dynamically (ReAct)
MLOps must manage chains end-to-end, not just components.
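The skeleton of a RAG chain is simple even though the operational surface is large. The sketch below is framework-agnostic; `search_index` and `llm_generate` are placeholders for whatever retrieval backend and model client the system actually uses.

```python
# Minimal RAG chain sketch: retrieve relevant context, then generate a grounded answer.
def retrieve(query: str, search_index, k: int = 3) -> list[str]:
    """Return the k most relevant text chunks for the query (backend is a placeholder)."""
    return search_index.search(query, top_k=k)

def answer(query: str, search_index, llm_generate) -> str:
    context = "\n\n".join(retrieve(query, search_index))
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )
    return llm_generate(prompt)
```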
12. Chain MLOps Needs #
- Evaluation: Run full chains to measure behavior
- Versioning: Chains need config + history
- Monitoring: Track outputs + intermediate steps
- Introspection: Debug chain inputs/outputs
Vertex AI + LangChain integration supports these needs.
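Introspection usually comes down to recording every step's inputs, outputs, and latency per request. This is a generic wrapper sketch, not a Vertex AI or LangChain API; the step function and logging destination are placeholders.

```python
import json, time, uuid

def traced_step(trace: list, name: str, fn, *args, **kwargs):
    """Run one chain step and record its inputs, output, and latency for introspection."""
    start = time.time()
    output = fn(*args, **kwargs)
    trace.append({
        "step": name,
        "inputs": {"args": [repr(a) for a in args], "kwargs": {k: repr(v) for k, v in kwargs.items()}},
        "output": repr(output),
        "latency_s": round(time.time() - start, 3),
    })
    return output

# Usage: build a trace per request, then ship it to your logging/monitoring backend.
trace: list = []
run_id = str(uuid.uuid4())
docs = traced_step(trace, "retrieve", lambda q: ["doc-1", "doc-2"], "refund policy")
print(json.dumps({"run_id": run_id, "trace": trace}, indent=2))
```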
13. Tuning & Training #
Some tasks require fine-tuning:
- SFT: Teach model to produce specific outputs
- RLHF: Use human feedback to improve alignment
Tune as needed—especially if prompt engineering hits limits.
14. Continuous Tuning #
Static tasks = low frequency. Dynamic tasks (chatbots) = frequent RLHF.
- Balance GPU/TPU cost with improvement needs
- Consider quantization to lower costs
Vertex AI provides tuning infra + registry + pipelines + governance.
15. Data in Gen AI #
Unlike predictive ML, gen AI uses:
- Prompts & examples
- Grounding sources (APIs, vectors)
- Human preference data
- Task-specific tuning sets
- Synthetic + curated data
Each has different MLOps needs: validation, versioning, lineage.
16. Synthetic Data Use Cases #
- Generation: Fill in training gaps
- Correction: Flag label errors
- Augmentation: Introduce diversity
Use large FMs to generate training or eval data when needed.
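A hedged sketch of FM-driven synthetic data generation: `llm_generate` stands in for any model call, the topics are invented, and generated examples would still need review and filtering before entering training or evaluation sets.

```python
import json

SEED_TOPICS = ["password reset", "billing dispute", "shipping delay"]

def synth_examples(topic: str, llm_generate, n: int = 5) -> list[dict]:
    """Ask a large FM to draft question/answer pairs for a topic (output needs human review)."""
    prompt = (
        f"Write {n} realistic customer-support questions about '{topic}' "
        "and a concise ideal answer for each. Return a JSON list of "
        '{"question": ..., "answer": ...} objects.'
    )
    return json.loads(llm_generate(prompt))

# dataset = [ex for t in SEED_TOPICS for ex in synth_examples(t, llm_generate)]
```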
17. Evaluation in Gen AI #
Evaluation is hard:
- Complex, open-ended outputs
- Metrics (BLEU, ROUGE) often miss the mark
- Auto-evals (e.g. AutoSxS) use FMs as judges
Align automated metrics with human judgment early on.
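For reference-based metrics, the open-source `rouge-score` package (`pip install rouge-score`) is one concrete option; the example below shows why such scores are only one signal, since paraphrases with low n-gram overlap can still be good answers.

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
reference = "The outage was caused by an expired TLS certificate."
prediction = "An expired TLS certificate caused the outage."

scores = scorer.score(reference, prediction)  # score(target, prediction)
print(scores["rougeL"].fmeasure)  # high overlap here, but paraphrases can score low
```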
18. Evaluation Best Practices #
- Stabilize metrics, approaches, datasets early
- Include adversarial prompts in test set
- Use synthetic ground truth if needed
Evaluation = cornerstone of experimentation in gen AI MLOps

19. Deployment in Gen AI Systems #
Gen AI apps involve multiple components:
- LLMs
- Chains
- Prompts
- Adapters
- External APIs
Two main deployment types:
- Full Gen AI Systems (custom apps)
- Foundation Model Deployments (standalone models)
20. Version Control #
Key assets to version:
- Prompt templates
- Chain definitions
- Datasets (e.g. RAG sources)
- Adapter models
Git, BigQuery, AlloyDB, and Vertex Feature Store help manage assets.
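One lightweight way to version prompt templates, sketched below under the assumption that the template text itself is the artifact: content-hashing means any edit produces a new, traceable version. The storage backend (Git, BigQuery, a registry table) is left open.

```python
import hashlib, json
from datetime import datetime, timezone

def register_prompt(name: str, template: str, model_id: str, registry: dict) -> str:
    """Register a prompt template under a content-hash version tied to a model version."""
    version = hashlib.sha256(template.encode("utf-8")).hexdigest()[:12]
    registry[f"{name}@{version}"] = {
        "template": template,
        "model_id": model_id,
        "registered_at": datetime.now(timezone.utc).isoformat(),
    }
    return version

registry: dict = {}
v = register_prompt("ticket-summarizer", "Summarize the ticket:\n{ticket_text}", "model-a-v1", registry)
print(json.dumps(registry[f"ticket-summarizer@{v}"], indent=2))
```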
21. Continuous Integration (CI) #
CI ensures reliability through:
- Unit + integration tests
- Automated pipelines
Challenges:
- Test generation is hard due to open-ended outputs
- Reproducibility is limited due to LLM randomness
Solutions draw from earlier evaluation methods.
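In practice, CI tests for gen AI often assert properties of the output (format, length, required fields) rather than exact strings, which tolerates nondeterministic generations. A pytest-style sketch, where `summarize` is a placeholder for the real chain entry point:

```python
import json

def summarize(ticket_text: str) -> str:  # placeholder for the real prompted component / chain
    return json.dumps({"summary": "Customer reports a billing error.", "sentiment": "negative"})

def test_summary_is_valid_json_with_required_fields():
    out = json.loads(summarize("I was charged twice for my subscription."))
    assert set(out) >= {"summary", "sentiment"}

def test_summary_is_short_and_nonempty():
    out = json.loads(summarize("I was charged twice for my subscription."))
    assert 0 < len(out["summary"].split()) <= 50
```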
22. Continuous Delivery (CD) #
CD moves tested systems into staging/production.
Two flavors:
- Batch delivery: Schedule-driven, test pipeline throughput
- Online delivery: API-based, test latency, infra, scalability
Chains are the new “deployment unit”—not just models.
23. Foundation Model Deployment #
Heavy resource demands → need:
- GPU/TPU allocation
- Scalable data stores
- Optimization (distillation, quantization, pruning)
24. Infrastructure Validation #
Check:
- Hardware compatibility
- Serving configuration
- GPU/TPU availability
Tools: TFX infra validation, manual provisioning checks
25. Compression & Optimization #
Strategies:
- Quantization: 32-bit → 8-bit
- Pruning: Remove unneeded weights
- Distillation: Train small model from a larger “teacher”
Step-by-step distillation can reduce size and improve performance.
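As a small, general illustration of quantization (not an LLM-serving recipe), PyTorch's post-training dynamic quantization stores and executes Linear layers in int8 at inference time:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 128))

# Convert Linear layers to int8 weights with dynamic activation quantization.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 512)
print(quantized(x).shape)  # same interface, smaller weights
```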
26. Deployment Checklist #
Steps to productionize:
- Version control
- Optimize model
- Containerize
- Define hardware and endpoints
- Allocate resources
- Secure access
- Monitor, log, and alert
- Real-time infra: Cloud Functions + Cloud Run
27. Logging & Monitoring #
Track both:
- App-level inputs/outputs
- Component-level details (chain steps, prompts, models)
Needed for tracing bugs, debugging drift, and transparency.
28. Drift & Skew Detection #
Compare:
- Evaluation-time data vs. Production input
- Topics, vocab, token count, embeddings
Techniques:
- Maximum mean discrepancy (MMD), least-squares density difference, learned-kernel methods
Signals shift in user behavior or data domains.
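A minimal (biased) MMD estimate over embedding batches, using a plain RBF kernel in NumPy; the embeddings here are random stand-ins, but the same comparison over real prompt embeddings gives a rough drift signal between evaluation-time and production inputs.

```python
import numpy as np

def rbf_kernel(a: np.ndarray, b: np.ndarray, gamma: float = 0.5) -> np.ndarray:
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def mmd(x: np.ndarray, y: np.ndarray) -> float:
    """Biased MMD^2 estimate: small when x and y come from similar distributions."""
    return rbf_kernel(x, x).mean() + rbf_kernel(y, y).mean() - 2 * rbf_kernel(x, y).mean()

rng = np.random.default_rng(0)
eval_emb = rng.normal(size=(200, 16))          # stand-in for eval-time prompt embeddings
prod_emb = rng.normal(size=(200, 16)) + 0.8    # shifted distribution
print(mmd(eval_emb[:100], eval_emb[100:]))     # small: same distribution
print(mmd(eval_emb, prod_emb))                 # larger: distribution shift
```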
29. Continuous Evaluation #
- Capture live outputs
- Evaluate vs. ground truth or human feedback
- Track metric degradation
- Alert on failures or decay
Production = where real testing happens.
30. Governance #
Governs:
- Chains + components
- Prompts
- Data
- Models
- Evaluation metrics and lineage
Full lifecycle governance = essential for compliance and maintainability.
31. Role of an AI Platform #
Vertex AI acts as an end-to-end platform for developing and operationalizing Gen AI. It supports:
- Data prep
- Training/tuning
- Deployment
- Evaluation
- CI/CD
- Monitoring
- Governance
It enables reuse, scalability, and full-stack observability for Gen AI teams.
32. Model Discovery: Vertex Model Garden #
Model Garden includes:
- 150+ models: Google, OSS, third-party (e.g., Gemini, Claude, Llama 3, T5, Imagen)
- Modalities: Language, Vision, Multimodal, Speech, Video
- Tasks: Generation, classification, moderation, detection, etc.
Each model has a card with use cases and tuning options.
33. Prototyping: Vertex AI Studio #
Vertex AI Studio offers:
- Playground for trying models (Gemini, Codey, Imagen)
- UI + SDKs (Python, NodeJS, Java)
- Prompt testing + management
- One-click deploy
- Built-in notebooks (Colab Enterprise, Workbench)
Low barrier for users from business analysts to ML engineers.
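Prototypes built in the Studio UI can be reproduced with the Python SDK. A hedged sketch using `google-cloud-aiplatform`: the project, location, and model name are placeholders, and available model versions change over time, so check Model Garden for current identifiers.

```python
import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project="your-project-id", location="us-central1")

model = GenerativeModel("gemini-1.5-flash")  # placeholder model version
response = model.generate_content("Summarize retrieval-augmented generation in two sentences.")
print(response.text)
```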
34. Training: Full LLM Training on Vertex AI #
- TPU and GPU infrastructure for fast, large-scale training
- Vertex AI supports training from scratch and adapting open-weight models
35. Tuning: Five Key Methods #
- Prompt engineering – no retraining
- SFT (Supervised Fine-Tuning) – train on labeled examples
- RLHF – learn from human preferences
- Distillation – compress knowledge from large to small models
- Step-by-step distillation – Google-developed approach that needs less training data
Each method balances cost, performance, and latency.
36. Orchestration: Vertex Pipelines #
- Define pipelines with Kubeflow SDK
- Automate tuning, evaluation, and deployment
- Managed pipelines for Vertex foundation models
Enables production-readiness and repeatability.
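A sketch of a KFP v2 pipeline definition of the kind Vertex AI Pipelines can run; the component bodies and names are placeholders, and a real pipeline would call tuning and evaluation jobs instead of returning constants.

```python
from kfp import dsl, compiler

@dsl.component
def evaluate_model(model_name: str) -> float:
    # Placeholder: run the evaluation harness and return an aggregate score.
    return 0.87

@dsl.component
def gate_deployment(score: float, threshold: float) -> bool:
    return score >= threshold

@dsl.pipeline(name="tune-eval-deploy")
def tune_eval_deploy(model_name: str = "my-tuned-model", threshold: float = 0.8):
    eval_task = evaluate_model(model_name=model_name)
    gate_deployment(score=eval_task.output, threshold=threshold)

compiler.Compiler().compile(tune_eval_deploy, "tune_eval_deploy.json")
```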
37. Chain & Augmentation: Grounding + Function Calling #
Vertex AI supports:
- RAG systems – real-time document retrieval
- Agent-based chains – dynamic tool use via ReAct
- Function calling – LLM picks which API to use, returns JSON
- Grounding – anchors and verifies model output against search results or private corpora
- Agent Builder – build search/chat agents grounded on any source
Simplifies chaining, reasoning, and integrating internal data.
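A hedged sketch of function calling with the Vertex AI SDK: declare a tool schema, let the model decide whether to call it, then execute the returned call yourself. The function name, fields, and model version here are illustrative; consult the current SDK docs for exact behavior.

```python
from vertexai.generative_models import FunctionDeclaration, GenerativeModel, Tool

get_order_status = FunctionDeclaration(
    name="get_order_status",
    description="Look up the shipping status of an order",
    parameters={
        "type": "object",
        "properties": {"order_id": {"type": "string"}},
        "required": ["order_id"],
    },
)

model = GenerativeModel("gemini-1.5-pro", tools=[Tool(function_declarations=[get_order_status])])
response = model.generate_content("Where is order 8812?")

# The model returns a structured function call (name + args) for the app to execute.
call = response.candidates[0].content.parts[0].function_call
print(call.name, dict(call.args))
```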
38. Vector Search #
Vertex AI Vector Search enables:
- High-scale, low-latency ANN search
- Billions of embeddings using ScaNN
- Use with text, images, hybrid metadata search
- Works with custom embeddings (e.g., textembedding-gecko)
Choose this when you need control over chunking, retrieval, or models.
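The snippet below is not Vertex AI Vector Search itself, just a brute-force cosine-similarity lookup that shows what the managed ANN service does at much larger scale: map a query embedding to its nearest document embeddings.

```python
import numpy as np

rng = np.random.default_rng(1)
doc_embeddings = rng.normal(size=(10_000, 256))
doc_embeddings /= np.linalg.norm(doc_embeddings, axis=1, keepdims=True)

def top_k(query_emb: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k documents most similar to the query embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    scores = doc_embeddings @ q          # cosine similarity on normalized vectors
    return np.argsort(-scores)[:k]

print(top_k(rng.normal(size=256)))
```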
39. Evaluate: Vertex AI Experiments & TensorBoard #
Experimentation is essential for iterating and improving Gen AI models. Tools include:
- Vertex AI Experiments: Track model runs, hyperparams, training environments
- Vertex AI TensorBoard: Visualize loss, accuracy, embeddings, model graphs
Supports reproducibility, debugging, and collaboration.
40. Evaluation Techniques #
- Ground-truth metrics: automatic metrics computed against reference datasets
- LLM-based eval: Auto Side-by-Side (Auto SxS) with model judges
- Rapid Evaluation API: Fast SDK-based eval for prototyping
Evaluation is deeply integrated into the development lifecycle.
41. Predict: Vertex Endpoints #
- Deploy models to Vertex Endpoints for online prediction
- Features:
  - Autoscaling
  - Access control
  - Monitoring
- Works with open-source and Google models
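A hedged sketch of deploying to an endpoint with the `google-cloud-aiplatform` SDK; the project, container image, artifact URI, and machine type are placeholders to adapt to the actual model and serving setup.

```python
from google.cloud import aiplatform

aiplatform.init(project="your-project-id", location="us-central1")

model = aiplatform.Model.upload(
    display_name="ticket-summarizer",
    serving_container_image_uri="us-docker.pkg.dev/your-project/serving/summarizer:latest",
    artifact_uri="gs://your-bucket/models/summarizer/",
)
endpoint = model.deploy(machine_type="n1-standard-4", min_replica_count=1)
print(endpoint.resource_name)
```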
42. Safety, Bias, and Moderation #
Built-in responsible AI features:
- Citation checkers: Track and quote data sources
- Safety scores: Detect harmful content and flag sensitive topics
- Watermarking: Identify AI-generated content (via SynthID)
- Bias detection: Ensure fairness and appropriateness
- Moderation: Filter unsafe responses
These ensure ethical and trustworthy AI deployments.
43. Governance Tools #
- Vertex Feature Store:
  - Track embedding + feature lineage
  - Drift monitoring
  - Feature reuse + formulas
- Model Registry:
  - Lifecycle tracking (versioning, evaluation, deployment)
  - One-click deployment
  - Access to evaluation, monitoring, and aliasing
- Dataplex:
  - Cross-product lineage (e.g., Vertex + BigQuery)
  - Golden datasets/models
  - Access governance + IAM integration
These unify observability, reproducibility, and compliance across Gen AI assets.
44. Conclusion #
MLOps principles—reliability, scalability, repeatability—fully extend into Gen AI.
- Gen AI adds prompt chaining, grounding, function calling, etc.
- Vertex AI unifies the full lifecycle across models, pipelines, and governance
- It supports both predictive and Gen AI use cases
MLOps isn’t replaced—it’s expanded for the age of foundation models.