Day 5 – MLOps for Generative AI #


1. Introduction #

The rise of foundation models and generative AI (gen AI) has brought a paradigm shift in how we build and deploy AI systems. From selecting architectures to managing prompts and grounding outputs in real data, traditional MLOps needs adaptation.

So how do we evolve MLOps for this new generative world?

2. What Are DevOps and MLOps? #

  • DevOps: Automation + collaboration for software delivery (CI/CD, testing, reliability)
  • MLOps: Adds ML-specific needs:
    • Data validation
    • Model evaluation
    • Monitoring
    • Experiment tracking

These core principles set the stage, but gen AI has unique needs.

3. Lifecycle of a Gen AI System #

The gen AI lifecycle introduces five major stages:

  1. Discover – Find suitable foundation models from a rapidly growing model zoo.
  2. Develop & Experiment – Iterate on prompts, use few-shot examples, and chains.
  3. Train/Tune – Use parameter-efficient fine-tuning.
  4. Deploy – Includes chains, prompt templates, databases, retrieval systems.
  5. Monitor & Govern – Ensure safety, fairness, drift detection, and lineage.

Each stage requires new tooling and processes compared to traditional ML.

4. Continuous Improvement in Gen AI #

  • Gen AI focuses on adapting pre-trained models via:
    • Prompt tweaks
    • Model swaps
    • Multi-model chaining
  • Still uses fine-tuning and human feedback loops when needed.

Not all organizations train base models themselves; many simply adapt existing FMs.

5. Discover Phase: Choosing the Right FM #

Why it’s hard:

  • Explosion of open-source and proprietary FMs
  • Variation in architecture, performance, licensing

Model selection is now a critical MLOps task.

6. Model Discovery Criteria #

Choosing a foundation model now involves nuanced trade-offs:

  • Quality: Benchmarks, output inspection
  • Latency & Throughput: Real-time chat ≠ batch summarization
  • Maintenance: Hosted vs self-managed models
  • Cost: Compute, serving, data storage
  • Compliance: Licensing, regulation

Vertex Model Garden supports structured exploration of these options.

7. Develop & Experiment #

Building gen AI systems is iterative: prompt tweaks → model swap → eval → repeat.

This loop mirrors traditional ML but centers around prompts, not raw data.

8. Foundation Model Paradigm #

  • Unlike predictive models, foundation models are multi-purpose.
  • They show emergent behavior based on prompt structure.
  • Prompts define task type (translation, generation, reasoning).

Small changes in wording can completely shift model output.

9. Prompted Model Component #

The key unit of experimentation in gen AI is: Prompt + Model → Prompted Model Component

This redefines MLOps: you now track prompt templates as first-class artifacts.

10. Prompt = Code + Data #

Prompts often include:

  • Code-like structures (templates, control flow, guardrails)
  • Data-like elements (examples, contexts, user input)

MLOps must version prompts, track results, and match to model versions.
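
As a minimal sketch of treating a prompt template as a versioned, first-class artifact, the snippet below bundles the template text, its version, and the model it was validated against, plus a content hash for traceability. The `PromptTemplate` class and its fields are illustrative, not part of any particular SDK.

```python
from dataclasses import dataclass, field
import hashlib
import json


@dataclass(frozen=True)
class PromptTemplate:
    """Illustrative prompt artifact: template text plus the metadata MLOps needs to track."""
    name: str
    version: str
    model_id: str            # the model version this prompt was validated against
    template: str            # code-like structure with placeholders
    few_shot_examples: list = field(default_factory=list)  # data-like elements

    def render(self, **variables) -> str:
        examples = "\n".join(self.few_shot_examples)
        return self.template.format(examples=examples, **variables)

    def fingerprint(self) -> str:
        """Content hash so prompt changes can be tied to evaluation results."""
        payload = json.dumps(self.__dict__, sort_keys=True, default=str)
        return hashlib.sha256(payload.encode()).hexdigest()[:12]


summarize_v2 = PromptTemplate(
    name="support-ticket-summary",
    version="2.0.0",
    model_id="example-llm-001",  # hypothetical model identifier
    template="Summarize the ticket below in one sentence.\n{examples}\nTicket: {ticket}",
    few_shot_examples=["Ticket: Login fails -> Summary: User cannot log in."],
)

print(summarize_v2.fingerprint(), summarize_v2.render(ticket="Payment page times out."))
```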

11. Chains & Augmentation #

When prompts alone aren’t enough:

  • Chains: Link multiple prompted models + APIs
  • RAG: Retrieve relevant info before generation
  • Agents: LLMs choose tools dynamically (ReAct)

MLOps must manage chains end-to-end, not just components.
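
A minimal sketch of a two-step chain (retrieve, then generate) built from prompted components; `retrieve_documents` and `call_llm` are hypothetical stand-ins for whatever retrieval backend and model the chain actually uses. Returning the intermediate steps is what makes chain-level evaluation and introspection possible.

```python
def retrieve_documents(query: str) -> list[str]:
    # Hypothetical retrieval step: in a real chain this would hit a vector store or API.
    return ["Doc A: refund policy is 30 days.", "Doc B: refunds require a receipt."]


def call_llm(prompt: str) -> str:
    # Hypothetical model call: replace with the actual LLM client in use.
    return f"(model output for: {prompt[:40]}...)"


def rag_chain(question: str) -> dict:
    """Run the full chain and return intermediate steps so they can be logged and evaluated."""
    docs = retrieve_documents(question)
    context = "\n".join(docs)
    answer_prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    answer = call_llm(answer_prompt)
    # Keeping intermediates supports end-to-end monitoring and debugging of the chain.
    return {"question": question, "retrieved_docs": docs, "prompt": answer_prompt, "answer": answer}


print(rag_chain("How long do customers have to request a refund?")["answer"])
```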

12. Chain MLOps Needs #

  • Evaluation: Run full chains to measure behavior
  • Versioning: Chains need config + history
  • Monitoring: Track outputs + intermediate steps
  • Introspection: Debug chain inputs/outputs

Vertex AI + LangChain integration supports these needs.

13. Tuning & Training #

Some tasks require fine-tuning:

  • SFT: Teach model to produce specific outputs
  • RLHF: Use human feedback to improve alignment

Tune as needed—especially if prompt engineering hits limits.

14. Continuous Tuning #

Static tasks = low tuning frequency. Dynamic tasks (e.g., chatbots) = frequent RLHF updates.

  • Balance GPU/TPU cost with improvement needs
  • Consider quantization to lower costs

Vertex AI provides tuning infra + registry + pipelines + governance.

15. Data in Gen AI #

Unlike predictive ML, gen AI uses:

  • Prompts & examples
  • Grounding sources (APIs, vectors)
  • Human preference data
  • Task-specific tuning sets
  • Synthetic + curated data

Each has different MLOps needs: validation, versioning, lineage.

16. Synthetic Data Use Cases #

  • Generation: Fill in training gaps
  • Correction: Flag label errors
  • Augmentation: Introduce diversity

Use large FMs to generate training or eval data when needed.
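
A minimal sketch of using a large FM to generate synthetic labeled examples and validating them before adding them to a dataset; `call_generator_model` is a hypothetical helper, and the JSON output format is illustrative.

```python
import json


def call_generator_model(prompt: str) -> str:
    # Hypothetical call to a large foundation model used as a data generator.
    return json.dumps({"question": "What is the refund window?", "answer": "30 days."})


def generate_synthetic_examples(topic: str, n: int) -> list[dict]:
    """Ask a large FM for labeled examples to fill gaps in the training or eval set."""
    examples = []
    for _ in range(n):
        prompt = (
            f"Write one question a customer might ask about '{topic}', "
            "and a correct answer. Reply as JSON with keys 'question' and 'answer'."
        )
        raw = call_generator_model(prompt)
        try:
            examples.append(json.loads(raw))  # validate structure before keeping the example
        except json.JSONDecodeError:
            continue                          # discard malformed generations
    return examples


print(generate_synthetic_examples("refund policy", n=3))
```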

17. Evaluation in Gen AI #

Evaluation is hard:

  • Complex, open-ended outputs
  • Metrics (BLEU, ROUGE) often miss the mark
  • Auto-evals (e.g. AutoSxS) use FMs as judges

Align automated metrics with human judgment early on.
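
A minimal sketch of the autorater pattern behind side-by-side evaluation: a judge model compares two candidate answers per prompt and the win rate is aggregated. `call_judge_model` is a hypothetical helper, not the AutoSxS API.

```python
def call_judge_model(prompt: str) -> str:
    # Hypothetical judge-model call; in practice this is a strong FM prompted with a rubric.
    return "A"


def side_by_side_eval(prompts, answers_a, answers_b) -> float:
    """Return the fraction of prompts where the judge prefers system A over system B."""
    wins_a = 0
    for prompt, a, b in zip(prompts, answers_a, answers_b):
        verdict = call_judge_model(
            f"Question: {prompt}\nAnswer A: {a}\nAnswer B: {b}\n"
            "Which answer is better? Reply with exactly 'A' or 'B'."
        )
        wins_a += verdict.strip().upper() == "A"
    return wins_a / len(prompts)


win_rate = side_by_side_eval(
    prompts=["What is the refund window?"],
    answers_a=["30 days."],
    answers_b=["Contact support."],
)
print(f"A win rate: {win_rate:.0%}")
```

Spot-checking the judge's verdicts against human ratings early on is what keeps this kind of automated metric trustworthy.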

18. Evaluation Best Practices #

  • Stabilize metrics, approaches, datasets early
  • Include adversarial prompts in test set
  • Use synthetic ground truth if needed

Evaluation = cornerstone of experimentation in gen AI MLOps

19. Deployment in Gen AI Systems #

Gen AI apps involve multiple components:

  • LLMs
  • Chains
  • Prompts
  • Adapters
  • External APIs

Two main deployment types:

  1. Full Gen AI Systems (custom apps)
  2. Foundation Model Deployments (standalone models)

20. Version Control #

Key assets to version:

  • Prompt templates
  • Chain definitions
  • Datasets (e.g. RAG sources)
  • Adapter models

Git, BigQuery, AlloyDB, and Vertex Feature Store help manage assets.

21. Continuous Integration (CI) #

CI ensures reliability through:

  • Unit + integration tests
  • Automated pipelines

Challenges:

  • Test generation is hard due to open-ended outputs
  • Reproducibility is limited due to LLM randomness

Solutions draw from earlier evaluation methods.
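
A minimal sketch of what CI tests for a prompted chain can look like. Because outputs are open-ended and non-deterministic, the assertions check properties (keywords, refusal behavior, presence of intermediates) rather than exact strings. `run_chain` is a hypothetical entry point, stubbed here so the example is self-contained; in a real pipeline it would be imported from the application.

```python
# test_support_chain.py -- run with `pytest`
def run_chain(question: str) -> dict:
    # Stub standing in for the real chain under test.
    if "refund" in question.lower():
        return {"answer": "Customers have 30 days to request a refund.",
                "retrieved_docs": ["Doc A: refund policy is 30 days."]}
    return {"answer": "Sorry, I can only answer questions about orders and refunds.",
            "retrieved_docs": []}


def test_answer_mentions_refund_window():
    result = run_chain("How long do customers have to request a refund?")
    # Property-based checks instead of exact matches, to tolerate LLM randomness.
    assert "30" in result["answer"] and "day" in result["answer"].lower()


def test_refuses_out_of_scope_questions():
    result = run_chain("Write me a poem about the weather.")
    assert result["answer"].lower().startswith("sorry")


def test_intermediate_steps_are_logged():
    result = run_chain("How long do customers have to request a refund?")
    # Chain-level CI also checks that intermediates exist for introspection and debugging.
    assert result["retrieved_docs"], "expected at least one retrieved document"
```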

22. Continuous Delivery (CD) #

CD moves tested systems into staging/production.

Two flavors:

  • Batch delivery: Schedule-driven, test pipeline throughput
  • Online delivery: API-based, test latency, infra, scalability

Chains are the new “deployment unit”—not just models.

23. Foundation Model Deployment #

Heavy resource demands → need:

  • GPU/TPU allocation
  • Scalable data stores
  • Optimization (distillation, quantization, pruning)

24. Infrastructure Validation #

Check:

  • Hardware compatibility
  • Serving configuration
  • GPU/TPU availability

Tools: TFX infra validation, manual provisioning checks

25. Compression & Optimization #

Strategies:

  • Quantization: 32-bit → 8-bit
  • Pruning: Remove unneeded weights
  • Distillation: Train small model from a larger “teacher”

Step-by-step distillation can reduce size and improve performance.
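
A minimal illustration of the 32-bit → 8-bit idea behind post-training quantization, using plain NumPy on a single weight tensor; real toolchains (per-channel scales, quantization-aware training) are more involved.

```python
import numpy as np

# A float32 weight tensor standing in for one layer of a model.
weights = np.random.randn(4, 4).astype(np.float32)

# Symmetric linear quantization: map the float range onto signed 8-bit integers.
scale = np.abs(weights).max() / 127.0
q_weights = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)

# Dequantize to see how much precision the 4x smaller representation loses.
deq_weights = q_weights.astype(np.float32) * scale
max_error = np.abs(weights - deq_weights).max()

print(f"storage: {weights.nbytes} bytes -> {q_weights.nbytes} bytes")
print(f"max absolute rounding error: {max_error:.5f}")
```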

26. Deployment Checklist #

Steps to productionize:

  • Version control
  • Optimize model
  • Containerize
  • Define hardware and endpoints
  • Allocate resources
  • Secure access
  • Monitor, log, and alert
  • Real-time infra: Cloud Functions + Cloud Run

27. Logging & Monitoring #

Track both:

  • App-level inputs/outputs
  • Component-level details (chain steps, prompts, models)

Needed for tracing bugs, debugging drift, and transparency.

28. Drift & Skew Detection #

Compare:

  • Evaluation-time data vs. Production input
  • Topics, vocab, token count, embeddings

Techniques:

  • Maximum mean discrepancy (MMD), least-squares density difference, learned-kernel MMD

Signals shift in user behavior or data domains.
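
A minimal sketch of one such technique: an RBF-kernel MMD estimate between evaluation-time and production embeddings. The kernel bandwidth and alert threshold are illustrative, and the embeddings here are simulated.

```python
import numpy as np


def rbf_kernel(x: np.ndarray, y: np.ndarray, bandwidth: float = 1.0) -> np.ndarray:
    sq_dists = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2 * bandwidth ** 2))


def mmd(eval_emb: np.ndarray, prod_emb: np.ndarray) -> float:
    """Biased MMD^2 estimate: large values suggest the production distribution has shifted."""
    return (
        rbf_kernel(eval_emb, eval_emb).mean()
        + rbf_kernel(prod_emb, prod_emb).mean()
        - 2 * rbf_kernel(eval_emb, prod_emb).mean()
    )


rng = np.random.default_rng(0)
eval_embeddings = rng.normal(0.0, 1.0, size=(200, 16))
prod_embeddings = rng.normal(0.5, 1.0, size=(200, 16))  # simulated shift in topics/vocab

score = mmd(eval_embeddings, prod_embeddings)
print(f"MMD^2 = {score:.4f}", "-> possible drift" if score > 0.05 else "-> looks stable")
```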

29. Continuous Evaluation #

  • Capture live outputs
  • Evaluate vs. ground truth or human feedback
  • Track metric degradation
  • Alert on failures or decay

Production = where real testing happens.
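
A minimal sketch of tracking metric degradation on live traffic: score each production response against a reference or human-approved answer, keep a rolling window, and alert when quality decays. The scoring function and threshold are illustrative.

```python
from collections import deque


def score_response(response: str, reference: str) -> float:
    # Illustrative quality score: token overlap with a reference or human-approved answer.
    resp, ref = set(response.lower().split()), set(reference.lower().split())
    return len(resp & ref) / max(len(ref), 1)


window = deque(maxlen=100)   # rolling window of recent production scores
ALERT_THRESHOLD = 0.6        # illustrative quality floor


def record_production_sample(response: str, reference: str) -> None:
    window.append(score_response(response, reference))
    rolling_mean = sum(window) / len(window)
    if len(window) == window.maxlen and rolling_mean < ALERT_THRESHOLD:
        # In production this would page on-call or open an incident instead of printing.
        print(f"ALERT: rolling quality {rolling_mean:.2f} below {ALERT_THRESHOLD}")


record_production_sample(
    "Refunds are available for 30 days.",
    "Customers have 30 days to request a refund.",
)
```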

30. Governance #

Governs:

  • Chains + components
  • Prompts
  • Data
  • Models
  • Evaluation metrics and lineage

Full lifecycle governance = essential for compliance and maintainability.

31. Role of an AI Platform #

Vertex AI acts as an end-to-end platform for developing and operationalizing Gen AI. It supports:

  • Data prep
  • Training/tuning
  • Deployment
  • Evaluation
  • CI/CD
  • Monitoring
  • Governance

It enables reuse, scalability, and full-stack observability for Gen AI teams.

32. Model Discovery: Vertex Model Garden #

Model Garden includes:

  • 150+ models: Google, OSS, third-party (e.g., Gemini, Claude, Llama 3, T5, Imagen)
  • Modalities: Language, Vision, Multimodal, Speech, Video
  • Tasks: Generation, classification, moderation, detection, etc.

Each model has a card with use cases and tuning options.

33. Prototyping: Vertex AI Studio #

Vertex AI Studio offers:

  • Playground for trying models (Gemini, Codey, Imagen)
  • UI + SDKs (Python, NodeJS, Java)
  • Prompt testing + management
  • One-click deploy
  • Built-in notebooks (Colab Enterprise, Workbench)

Low barrier for users from business analysts to ML engineers.
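
A minimal prototyping sketch with the Vertex AI Python SDK, assuming the `vertexai.generative_models` interface; the project, region, and model name are placeholders, and exact classes and arguments can vary across SDK versions.

```python
import vertexai
from vertexai.generative_models import GenerativeModel

# Placeholders: use your own project and a supported region.
vertexai.init(project="my-gcp-project", location="us-central1")

# Model name is an assumption; pick one available in Model Garden / Vertex AI Studio.
model = GenerativeModel("gemini-1.5-flash")

response = model.generate_content(
    "Summarize in one sentence why prompt templates should be version-controlled."
)
print(response.text)
```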

34. Training: Full LLM Training on Vertex AI #

  • TPU and GPU infrastructure for fast, large-scale training
  • Vertex AI supports training from scratch and adapting open-weight models

35. Tuning: Five Key Methods #

  1. Prompt engineering – no retraining
  2. SFT (Supervised Fine-Tuning) – train on labeled examples
  3. RLHF – learn from human preferences
  4. Distillation – compress knowledge from large to small models
  5. Step-by-step distillation – Google-developed approach that requires less training data

Each method balances cost, performance, and latency.

36. Orchestration: Vertex Pipelines #

  • Define pipelines with Kubeflow SDK
  • Automate tuning, evaluation, and deployment
  • Managed pipelines for Vertex foundation models

Enables production-readiness and repeatability.
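
A minimal sketch of defining a pipeline with the Kubeflow Pipelines (KFP) SDK, which Vertex AI Pipelines can execute; the component bodies are placeholders, and real steps would launch tuning, evaluation, and deployment jobs.

```python
from kfp import compiler, dsl


@dsl.component
def tune_model(train_data_uri: str) -> str:
    # Placeholder: a real component would launch a tuning job and return the model resource name.
    return f"tuned-model-from:{train_data_uri}"


@dsl.component
def evaluate_model(model_ref: str) -> float:
    # Placeholder: a real component would run the evaluation suite against a fixed dataset.
    return 0.9


@dsl.pipeline(name="genai-tune-and-evaluate")
def tune_and_evaluate(train_data_uri: str = "gs://my-bucket/train.jsonl"):
    tuned = tune_model(train_data_uri=train_data_uri)
    evaluate_model(model_ref=tuned.output)


# Compile to a spec that Vertex AI Pipelines can run.
compiler.Compiler().compile(tune_and_evaluate, "tune_and_evaluate.json")
```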

37. Chain & Augmentation: Grounding + Function Calling #

Vertex AI supports:

  • RAG systems – real-time document retrieval
  • Agent-based chains – dynamic tool use via ReAct
  • Function calling – LLM picks which API to use, returns JSON
  • Grounding – verifies and anchors model output against search results or private corpora
  • Agent Builder – build search/chat agents grounded on any source

Simplifies chaining, reasoning, and integrating internal data.
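
A minimal sketch of the function-calling control loop, with `call_model` and `get_order_status` as hypothetical stand-ins: the model picks a tool and returns structured JSON, the application executes it, and the result is fed back so the final answer is grounded in real data.

```python
import json


def get_order_status(order_id: str) -> dict:
    # Hypothetical internal API the model is allowed to call.
    return {"order_id": order_id, "status": "shipped"}


TOOLS = {"get_order_status": get_order_status}


def call_model(prompt: str) -> str:
    # Hypothetical LLM call that returns either a JSON tool call or a final answer.
    if "TOOL_RESULT" not in prompt:
        return json.dumps({"tool": "get_order_status", "args": {"order_id": "A-123"}})
    return "Your order A-123 has shipped."


def answer(question: str) -> str:
    first = call_model(question)
    try:
        tool_call = json.loads(first)   # the model chose a tool
    except json.JSONDecodeError:
        return first                    # no tool needed, answer directly
    result = TOOLS[tool_call["tool"]](**tool_call["args"])
    # Feed the tool output back so the final response is grounded in the API result.
    return call_model(f"{question}\nTOOL_RESULT: {json.dumps(result)}")


print(answer("Where is my order A-123?"))
```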

38. Retrieval: Vertex AI Vector Search #

Vertex AI Vector Search enables:

  • High-scale, low-latency ANN search
  • Billions of embeddings using ScaNN
  • Use with text, images, hybrid metadata search
  • Works with custom embeddings (e.g., textembedding-gecko)

Choose this when you need control over chunking, retrieval, or models.
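
A minimal sketch of embedding-based retrieval over a small corpus; the `embed` function is a hypothetical stand-in for a real embedding model (e.g., textembedding-gecko) and returns random vectors for illustration only, while brute-force cosine similarity stands in for the ScaNN-based ANN index that Vector Search provides at scale.

```python
import numpy as np

rng = np.random.default_rng(0)

CORPUS = [
    "Refunds are available within 30 days of purchase.",
    "Shipping takes 3-5 business days.",
    "Support is available 24/7 via chat.",
]


def embed(texts: list[str]) -> np.ndarray:
    # Stand-in for a real embedding model: random vectors for illustration;
    # real embeddings place semantically similar texts close together.
    return rng.normal(size=(len(texts), 8)).astype(np.float32)


def top_k(query: str, corpus_embeddings: np.ndarray, k: int = 2) -> list[str]:
    """Brute-force cosine similarity; Vector Search replaces this with ANN at scale."""
    q = embed([query])[0]
    sims = corpus_embeddings @ q / (
        np.linalg.norm(corpus_embeddings, axis=1) * np.linalg.norm(q) + 1e-9
    )
    best = np.argsort(-sims)[:k]
    return [CORPUS[i] for i in best]


corpus_embeddings = embed(CORPUS)
print(top_k("How long do refunds take?", corpus_embeddings))
```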

39. Evaluate: Vertex AI Experiments & TensorBoard #

Experimentation is essential for iterating and improving Gen AI models. Tools include:

  • Vertex AI Experiments: Track model runs, hyperparams, training environments
  • Vertex AI TensorBoard: Visualize loss, accuracy, embeddings, model graphs

Supports reproducibility, debugging, and collaboration.

40. Evaluation Techniques #

  • Ground-truth metrics: automatic metrics computed against reference datasets
  • LLM-based eval: Automatic Side-by-Side (AutoSxS) with model judges
  • Rapid Evaluation API: Fast SDK-based eval for prototyping

Evaluation is deeply integrated into the development lifecycle.

41. Predict: Vertex Endpoints #

  • Deploy models to Vertex Endpoints for online prediction
  • Features:
    • Autoscaling
    • Access control
    • Monitoring
  • Works with open-source and Google models

42. Safety, Bias, and Moderation #

Built-in responsible AI features:

  • Citation checkers: Track and quote data sources
  • Safety scores: Detect harmful content and flag sensitive topics
  • Watermarking: Identify AI-generated content (via SynthID)
  • Bias detection: Ensure fairness and appropriateness
  • Moderation: Filter unsafe responses

These ensure ethical and trustworthy AI deployments.

43. Governance Tools #

  • Vertex Feature Store:

    • Track embedding + feature lineage
    • Drift monitoring
    • Feature reuse + formulas
  • Model Registry:

    • Lifecycle tracking (versioning, evaluation, deployment)
    • One-click deployment
    • Access to evaluation, monitoring, and aliasing
  • Dataplex:

    • Cross-product lineage (e.g., Vertex + BigQuery)
    • Golden datasets/models
    • Access governance + IAM integration

These unify observability, reproducibility, and compliance across Gen AI assets.

44. Conclusion #

MLOps principles—reliability, scalability, repeatability—fully extend into Gen AI.

  • Gen AI adds prompt chaining, grounding, function calling, etc.
  • Vertex AI unifies the full lifecycle across models, pipelines, and governance
  • It supports both predictive and Gen AI use cases

MLOps isn’t replaced—it’s expanded for the age of foundation models.