Day 5 – MLOps for Generative AI #
1. Introduction #
The rise of foundation models and generative AI (gen AI) has brought a paradigm shift in how we build and deploy AI systems. From selecting architectures to managing prompts and grounding outputs in real data, traditional MLOps needs adaptation.
So how do we evolve MLOps for this new generative world?
2. What Are DevOps and MLOps? #
- DevOps: Automation + collaboration for software delivery (CI/CD, testing, reliability)
- MLOps: Adds ML-specific needs:
  - Data validation
  - Model evaluation
  - Monitoring
  - Experiment tracking
These core principles set the stage, but gen AI has unique needs.
3. Lifecycle of a Gen AI System #
The gen AI lifecycle introduces five major moments:
- Discover – Find suitable foundation models from a rapidly growing model zoo.
- Develop & Experiment – Iterate on prompts, use few-shot examples, and chains.
- Train/Tune – Use parameter-efficient fine-tuning.
- Deploy – Includes chains, prompt templates, databases, retrieval systems.
- Monitor & Govern – Ensure safety, fairness, drift detection, and lineage.
Each stage requires new tooling and processes compared to traditional ML.
4. Continuous Improvement in Gen AI #
- Gen AI focuses on adapting pre-trained models via:
  - Prompt tweaks
  - Model swaps
  - Multi-model chaining
- Fine-tuning and human feedback loops are still used when needed.
But not all orgs handle base model training—many just adapt existing FMs.
5. Discover Phase: Choosing the Right FM #
Why it’s hard:
- Explosion of open-source and proprietary FMs
- Variation in architecture, performance, licensing
Model selection is now a critical MLOps task.
6. Model Discovery Criteria #
Choosing a foundation model now involves nuanced trade-offs:
- Quality: Benchmarks, output inspection
- Latency & Throughput: Real-time chat ≠ batch summarization
- Maintenance: Hosted vs self-managed models
- Cost: Compute, serving, data storage
- Compliance: Licensing, regulation
Vertex Model Garden supports structured exploration of these options.
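To make these trade-offs explicit, a shortlist of candidates can be scored against weighted criteria. The sketch below is a minimal, hypothetical illustration: the candidate names, scores, and weights are placeholders that would come from benchmarks, load tests, and licensing review in practice.

```python
# Hypothetical illustration: rank candidate FMs with weighted criteria scores (0-10).
CANDIDATES = {
    "hosted-large-model":  {"quality": 9, "latency": 5, "maintenance": 9, "cost": 4, "compliance": 8},
    "open-weights-medium": {"quality": 7, "latency": 8, "maintenance": 5, "cost": 8, "compliance": 9},
}
WEIGHTS = {"quality": 0.35, "latency": 0.2, "maintenance": 0.15, "cost": 0.15, "compliance": 0.15}

def score(criteria: dict) -> float:
    """Weighted sum of criteria scores; weights reflect one team's priorities."""
    return sum(WEIGHTS[k] * v for k, v in criteria.items())

for name, criteria in sorted(CANDIDATES.items(), key=lambda kv: score(kv[1]), reverse=True):
    print(f"{name}: {score(criteria):.2f}")
```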
7. Develop & Experiment #
Building gen AI systems is iterative: prompt tweaks → model swap → eval → repeat.
This loop mirrors traditional ML but centers around prompts, not raw data.
8. Foundation Model Paradigm #
- Unlike predictive models, foundation models are multi-purpose.
- They show emergent behavior based on prompt structure.
- Prompts define task type (translation, generation, reasoning).
Small changes in wording can completely shift model output.
9. Prompted Model Component #
The key unit of experimentation in gen AI is: Prompt + Model → Prompted Model Component
This redefines MLOps: you now track prompt templates as first-class artifacts.
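A minimal sketch of what such an artifact might look like; the class, field names, and example values are hypothetical, but they capture the idea that a prompt template and a pinned model version are tracked and compared together.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass(frozen=True)
class PromptedModelComponent:
    """A prompt template pinned to a specific model version, tracked as one artifact."""
    name: str
    prompt_template: str          # e.g. "Summarize the ticket:\n{ticket_text}"
    model_id: str                 # any versioned model identifier
    parameters: dict = field(default_factory=lambda: {"temperature": 0.2})
    created_at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

# Experiments then compare components, not just models:
v1 = PromptedModelComponent("ticket-summarizer", "Summarize the ticket:\n{ticket_text}", "model-a-v1")
v2 = PromptedModelComponent("ticket-summarizer", "Summarize the ticket in 3 bullets:\n{ticket_text}", "model-a-v1")
```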
10. Prompt = Code + Data #
Prompts often include:
- Code-like structures (templates, control flow, guardrails)
- Data-like elements (examples, contexts, user input)
MLOps must version prompts, track results, and match to model versions.
11. Chains & Augmentation #
When prompts alone aren’t enough:
- Chains: Link multiple prompted models + APIs
- RAG: Retrieve relevant info before generation
- Agents: LLMs choose tools dynamically (ReAct)
MLOps must manage chains end-to-end, not just components.
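The skeleton of a RAG chain is simple even though the operational surface is large. The sketch below is framework-agnostic; `search_index` and `llm_generate` are placeholders for whatever retrieval backend and model client the system actually uses.

```python
# Minimal RAG chain sketch: retrieve relevant context, then generate a grounded answer.
def retrieve(query: str, search_index, k: int = 3) -> list[str]:
    """Return the k most relevant text chunks for the query (backend is a placeholder)."""
    return search_index.search(query, top_k=k)

def answer(query: str, search_index, llm_generate) -> str:
    context = "\n\n".join(retrieve(query, search_index))
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )
    return llm_generate(prompt)
```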
12. Chain MLOps Needs #
- Evaluation: Run full chains to measure behavior
- Versioning: Chains need config + history
- Monitoring: Track outputs + intermediate steps
- Introspection: Debug chain inputs/outputs
Vertex AI + LangChain integration supports these needs.
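Introspection usually comes down to recording every step's inputs, outputs, and latency per request. This is a generic wrapper sketch, not a Vertex AI or LangChain API; the step function and logging destination are placeholders.

```python
import json, time, uuid

def traced_step(trace: list, name: str, fn, *args, **kwargs):
    """Run one chain step and record its inputs, output, and latency for introspection."""
    start = time.time()
    output = fn(*args, **kwargs)
    trace.append({
        "step": name,
        "inputs": {"args": [repr(a) for a in args], "kwargs": {k: repr(v) for k, v in kwargs.items()}},
        "output": repr(output),
        "latency_s": round(time.time() - start, 3),
    })
    return output

# Usage: build a trace per request, then ship it to your logging/monitoring backend.
trace: list = []
run_id = str(uuid.uuid4())
docs = traced_step(trace, "retrieve", lambda q: ["doc-1", "doc-2"], "refund policy")
print(json.dumps({"run_id": run_id, "trace": trace}, indent=2))
```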
13. Tuning & Training #
Some tasks require fine-tuning:
- SFT: Teach model to produce specific outputs
- RLHF: Use human feedback to improve alignment
Tune as needed—especially if prompt engineering hits limits.
14. Continuous Tuning #
Static tasks = low frequency. Dynamic tasks (chatbots) = frequent RLHF.
- Balance GPU/TPU cost with improvement needs
- Consider quantization to lower costs
Vertex AI provides tuning infra + registry + pipelines + governance.
15. Data in Gen AI #
Unlike predictive ML, gen AI uses:
- Prompts & examples
- Grounding sources (APIs, vectors)
- Human preference data
- Task-specific tuning sets
- Synthetic + curated data
Each has different MLOps needs: validation, versioning, lineage.
16. Synthetic Data Use Cases #
- Generation: Fill in training gaps
- Correction: Flag label errors
- Augmentation: Introduce diversity
Use large FMs to generate training or eval data when needed.
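A hedged sketch of FM-driven synthetic data generation: `llm_generate` stands in for any model call, the topics are invented, and generated examples would still need review and filtering before entering training or evaluation sets.

```python
import json

SEED_TOPICS = ["password reset", "billing dispute", "shipping delay"]

def synth_examples(topic: str, llm_generate, n: int = 5) -> list[dict]:
    """Ask a large FM to draft question/answer pairs for a topic (output needs human review)."""
    prompt = (
        f"Write {n} realistic customer-support questions about '{topic}' "
        "and a concise ideal answer for each. Return a JSON list of "
        '{"question": ..., "answer": ...} objects.'
    )
    return json.loads(llm_generate(prompt))

# dataset = [ex for t in SEED_TOPICS for ex in synth_examples(t, llm_generate)]
```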
17. Evaluation in Gen AI #
Evaluation is hard:
- Complex, open-ended outputs
- Metrics (BLEU, ROUGE) often miss the mark
- Auto-evals (e.g. AutoSxS) use FMs as judges
Align automated metrics with human judgment early on.
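For reference-based metrics, the open-source `rouge-score` package (`pip install rouge-score`) is one concrete option; the example below shows why such scores are only one signal, since paraphrases with low n-gram overlap can still be good answers.

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
reference = "The outage was caused by an expired TLS certificate."
prediction = "An expired TLS certificate caused the outage."

scores = scorer.score(reference, prediction)  # score(target, prediction)
print(scores["rougeL"].fmeasure)  # high overlap here, but paraphrases can score low
```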
18. Evaluation Best Practices #
- Stabilize metrics, approaches, datasets early
- Include adversarial prompts in test set
- Use synthetic ground truth if needed
Evaluation = cornerstone of experimentation in gen AI MLOps

19. Deployment in Gen AI Systems #
Gen AI apps involve multiple components:
- LLMs
- Chains
- Prompts
- Adapters
- External APIs
Two main deployment types:
- Full Gen AI Systems (custom apps)
- Foundation Model Deployments (standalone models)
20. Version Control #
Key assets to version:
- Prompt templates
- Chain definitions
- Datasets (e.g. RAG sources)
- Adapter models
Git, BigQuery, AlloyDB, and Vertex Feature Store help manage assets.
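One lightweight way to version prompt templates, sketched below under the assumption that the template text itself is the artifact: content-hashing means any edit produces a new, traceable version. The storage backend (Git, BigQuery, a registry table) is left open.

```python
import hashlib, json
from datetime import datetime, timezone

def register_prompt(name: str, template: str, model_id: str, registry: dict) -> str:
    """Register a prompt template under a content-hash version tied to a model version."""
    version = hashlib.sha256(template.encode("utf-8")).hexdigest()[:12]
    registry[f"{name}@{version}"] = {
        "template": template,
        "model_id": model_id,
        "registered_at": datetime.now(timezone.utc).isoformat(),
    }
    return version

registry: dict = {}
v = register_prompt("ticket-summarizer", "Summarize the ticket:\n{ticket_text}", "model-a-v1", registry)
print(json.dumps(registry[f"ticket-summarizer@{v}"], indent=2))
```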
21. Continuous Integration (CI) #
CI ensures reliability through:
- Unit + integration tests
- Automated pipelines
Challenges:
- Test generation is hard due to open-ended outputs
- Reproducibility is limited due to LLM randomness
Solutions draw from earlier evaluation methods.
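In practice, CI tests for gen AI often assert properties of the output (format, length, required fields) rather than exact strings, which tolerates nondeterministic generations. A pytest-style sketch, where `summarize` is a placeholder for the real chain entry point:

```python
import json

def summarize(ticket_text: str) -> str:  # placeholder for the real prompted component / chain
    return json.dumps({"summary": "Customer reports a billing error.", "sentiment": "negative"})

def test_summary_is_valid_json_with_required_fields():
    out = json.loads(summarize("I was charged twice for my subscription."))
    assert set(out) >= {"summary", "sentiment"}

def test_summary_is_short_and_nonempty():
    out = json.loads(summarize("I was charged twice for my subscription."))
    assert 0 < len(out["summary"].split()) <= 50
```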
22. Continuous Delivery (CD) #
CD moves tested systems into staging/production.
Two flavors:
- Batch delivery: Schedule-driven, test pipeline throughput
- Online delivery: API-based, test latency, infra, scalability
Chains are the new “deployment unit”—not just models.
23. Foundation Model Deployment #
Heavy resource demands → need:
- GPU/TPU allocation
- Scalable data stores
- Optimization (distillation, quantization, pruning)
24. Infrastructure Validation #
Check:
- Hardware compatibility
- Serving configuration
- GPU/TPU availability
Tools: TFX infra validation, manual provisioning checks
25. Compression & Optimization #
Strategies:
- Quantization: 32-bit → 8-bit
- Pruning: Remove unneeded weights
- Distillation: Train small model from a larger “teacher”
Step-by-step distillation can reduce size and improve performance.
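As a small, general illustration of quantization (not an LLM-serving recipe), PyTorch's post-training dynamic quantization stores and executes Linear layers in int8 at inference time:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 128))

# Convert Linear layers to int8 weights with dynamic activation quantization.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 512)
print(quantized(x).shape)  # same interface, smaller weights
```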
26. Deployment Checklist #
Steps to productionize:
- Version control
- Optimize model
- Containerize
- Define hardware and endpoints
- Allocate resources
- Secure access
- Monitor, log, and alert
- Real-time infra: Cloud Functions + Cloud Run
27. Logging & Monitoring #
Track both:
- App-level inputs/outputs
- Component-level details (chain steps, prompts, models)
Needed for tracing bugs, debugging drift, and transparency.
28. Drift & Skew Detection #
Compare:
- Evaluation-time data vs. Production input
- Topics, vocab, token count, embeddings
Techniques:
- Maximum mean discrepancy (MMD), least-squares density difference, learned-kernel methods
Signals shift in user behavior or data domains.
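A minimal (biased) MMD estimate over embedding batches, using a plain RBF kernel in NumPy; the embeddings here are random stand-ins, but the same comparison over real prompt embeddings gives a rough drift signal between evaluation-time and production inputs.

```python
import numpy as np

def rbf_kernel(a: np.ndarray, b: np.ndarray, gamma: float = 0.5) -> np.ndarray:
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def mmd(x: np.ndarray, y: np.ndarray) -> float:
    """Biased MMD^2 estimate: small when x and y come from similar distributions."""
    return rbf_kernel(x, x).mean() + rbf_kernel(y, y).mean() - 2 * rbf_kernel(x, y).mean()

rng = np.random.default_rng(0)
eval_emb = rng.normal(size=(200, 16))          # stand-in for eval-time prompt embeddings
prod_emb = rng.normal(size=(200, 16)) + 0.8    # shifted distribution
print(mmd(eval_emb[:100], eval_emb[100:]))     # small: same distribution
print(mmd(eval_emb, prod_emb))                 # larger: distribution shift
```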
29. Continuous Evaluation #
- Capture live outputs
- Evaluate vs. ground truth or human feedback
- Track metric degradation
- Alert on failures or decay
Production = where real testing happens.
30. Governance #
Governs:
- Chains + components
- Prompts
- Data
- Models
- Evaluation metrics and lineage
Full lifecycle governance = essential for compliance and maintainability.
31. Role of an AI Platform #
Vertex AI acts as an end-to-end platform for developing and operationalizing Gen AI. It supports:
- Data prep
- Training/tuning
- Deployment
- Evaluation
- CI/CD
- Monitoring
- Governance
It enables reuse, scalability, and full-stack observability for Gen AI teams.
32. Model Discovery: Vertex Model Garden #
Model Garden includes:
- 150+ models: Google, OSS, third-party (e.g., Gemini, Claude, Llama 3, T5, Imagen)
- Modalities: Language, Vision, Multimodal, Speech, Video
- Tasks: Generation, classification, moderation, detection, etc.
Each model has a card with use cases and tuning options.
33. Prototyping: Vertex AI Studio #
Vertex AI Studio offers:
- Playground for trying models (Gemini, Codey, Imagen)
- UI + SDKs (Python, NodeJS, Java)
- Prompt testing + management
- One-click deploy
- Built-in notebooks (Colab Enterprise, Workbench)
Low barrier for users from business analysts to ML engineers.
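Prototypes built in the Studio UI can be reproduced with the Python SDK. A hedged sketch using `google-cloud-aiplatform`: the project, location, and model name are placeholders, and available model versions change over time, so check Model Garden for current identifiers.

```python
import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project="your-project-id", location="us-central1")

model = GenerativeModel("gemini-1.5-flash")  # placeholder model version
response = model.generate_content("Summarize retrieval-augmented generation in two sentences.")
print(response.text)
```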
34. Training: Full LLM Training on Vertex AI #
- TPU and GPU infrastructure for fast, large-scale training
- Vertex AI supports training from scratch and adapting open-weight models
35. Tuning: Five Key Methods #
- Prompt engineering – no retraining
- SFT (Supervised Fine-Tuning) – train on labeled examples
- RLHF – learn from human preferences
- Distillation – compress knowledge from large to small models
- Step-by-step distillation – Google-developed approach that needs less training data
Each method balances cost, performance, and latency.
36. Orchestration: Vertex Pipelines #
- Define pipelines with Kubeflow SDK
- Automate tuning, evaluation, and deployment
- Managed pipelines for Vertex foundation models
Enables production-readiness and repeatability.
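A sketch of a KFP v2 pipeline definition of the kind Vertex AI Pipelines can run; the component bodies and names are placeholders, and a real pipeline would call tuning and evaluation jobs instead of returning constants.

```python
from kfp import dsl, compiler

@dsl.component
def evaluate_model(model_name: str) -> float:
    # Placeholder: run the evaluation harness and return an aggregate score.
    return 0.87

@dsl.component
def gate_deployment(score: float, threshold: float) -> bool:
    return score >= threshold

@dsl.pipeline(name="tune-eval-deploy")
def tune_eval_deploy(model_name: str = "my-tuned-model", threshold: float = 0.8):
    eval_task = evaluate_model(model_name=model_name)
    gate_deployment(score=eval_task.output, threshold=threshold)

compiler.Compiler().compile(tune_eval_deploy, "tune_eval_deploy.json")
```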
37. Chain & Augmentation: Grounding + Function Calling #
Vertex AI supports:
- RAG systems – real-time document retrieval
- Agent-based chains – dynamic tool use via ReAct
- Function calling – LLM picks which API to use, returns JSON
- Grounding – anchors and verifies model output against search results or private corpora
- Agent Builder – build search/chat agents grounded on any source
Simplifies chaining, reasoning, and integrating internal data.
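A hedged sketch of function calling with the Vertex AI SDK: declare a tool schema, let the model decide whether to call it, then execute the returned call yourself. The function name, fields, and model version here are illustrative; consult the current SDK docs for exact behavior.

```python
from vertexai.generative_models import FunctionDeclaration, GenerativeModel, Tool

get_order_status = FunctionDeclaration(
    name="get_order_status",
    description="Look up the shipping status of an order",
    parameters={
        "type": "object",
        "properties": {"order_id": {"type": "string"}},
        "required": ["order_id"],
    },
)

model = GenerativeModel("gemini-1.5-pro", tools=[Tool(function_declarations=[get_order_status])])
response = model.generate_content("Where is order 8812?")

# The model returns a structured function call (name + args) for the app to execute.
call = response.candidates[0].content.parts[0].function_call
print(call.name, dict(call.args))
```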
38. Vector Search #
Vertex AI Vector Search enables:
- High-scale, low-latency ANN search
- Billions of embeddings using ScaNN
- Use with text, images, hybrid metadata search
- Works with custom embeddings (e.g., textembedding-gecko)
Choose this when you need control over chunking, retrieval, or models.
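The snippet below is not Vertex AI Vector Search itself, just a brute-force cosine-similarity lookup that shows what the managed ANN service does at much larger scale: map a query embedding to its nearest document embeddings.

```python
import numpy as np

rng = np.random.default_rng(1)
doc_embeddings = rng.normal(size=(10_000, 256))
doc_embeddings /= np.linalg.norm(doc_embeddings, axis=1, keepdims=True)

def top_k(query_emb: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k documents most similar to the query embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    scores = doc_embeddings @ q          # cosine similarity on normalized vectors
    return np.argsort(-scores)[:k]

print(top_k(rng.normal(size=256)))
```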
39. Evaluate: Vertex AI Experiments & TensorBoard #
Experimentation is essential for iterating and improving Gen AI models. Tools include:
- Vertex AI Experiments: Track model runs, hyperparams, training environments
- Vertex AI TensorBoard: Visualize loss, accuracy, embeddings, model graphs
Supports reproducibility, debugging, and collaboration.
40. Evaluation Techniques #
- Ground-truth metrics: automatic metrics computed against reference datasets
- LLM-based eval: Auto Side-by-Side (Auto SxS) with model judges
- Rapid Evaluation API: Fast SDK-based eval for prototyping
Evaluation is deeply integrated into the development lifecycle.
41. Predict: Vertex Endpoints #
- Deploy models to Vertex Endpoints for online prediction
- Features:
  - Autoscaling
  - Access control
  - Monitoring
- Works with open-source and Google models
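A hedged sketch of deploying to an endpoint with the `google-cloud-aiplatform` SDK; the project, container image, artifact URI, and machine type are placeholders to adapt to the actual model and serving setup.

```python
from google.cloud import aiplatform

aiplatform.init(project="your-project-id", location="us-central1")

model = aiplatform.Model.upload(
    display_name="ticket-summarizer",
    serving_container_image_uri="us-docker.pkg.dev/your-project/serving/summarizer:latest",
    artifact_uri="gs://your-bucket/models/summarizer/",
)
endpoint = model.deploy(machine_type="n1-standard-4", min_replica_count=1)
print(endpoint.resource_name)
```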
42. Safety, Bias, and Moderation #
Built-in responsible AI features:
- Citation checkers: Track and quote data sources
- Safety scores: Detect harmful content and flag sensitive topics
- Watermarking: Identify AI-generated content (via SynthID)
- Bias detection: Ensure fairness and appropriateness
- Moderation: Filter unsafe responses
These ensure ethical and trustworthy AI deployments.
43. Governance Tools #
- Vertex Feature Store:
  - Track embedding + feature lineage
  - Drift monitoring
  - Feature reuse + formulas
- Model Registry:
  - Lifecycle tracking (versioning, evaluation, deployment)
  - One-click deployment
  - Access to evaluation, monitoring, and aliasing
- Dataplex:
  - Cross-product lineage (e.g., Vertex + BigQuery)
  - Golden datasets/models
  - Access governance + IAM integration
These unify observability, reproducibility, and compliance across Gen AI assets.
44. Conclusion #
MLOps principles—reliability, scalability, repeatability—fully extend into Gen AI.
- Gen AI adds prompt chaining, grounding, function calling, etc.
- Vertex AI unifies the full lifecycle across models, pipelines, and governance
- It supports both predictive and Gen AI use cases
MLOps isn’t replaced—it’s expanded for the age of foundation models.