Eval Infra: Verifiable (STEM) vs Non-Verifiable vs Hybrid
Why These 5 Layers? Separation of Concerns #
Each layer serves a distinct purpose in the AI evaluation lifecycle, from research to production deployment.
| Layer | Purpose | Key Question | Stakeholder | Output |
|---|---|---|---|---|
| L1: Benchmark Design | Define what “good” means | What are we measuring? | Research Scientists | Test suite + evaluation protocol |
| L2: Evaluation Execution | Actually measure performance | How do we score it? | ML Engineers | Raw scores/labels per example |
| L3: Scalability | Handle volume & iteration speed | Can we do this 1000x? | MLOps/Infrastructure | Evaluation pipeline infrastructure |
| L4: Metrics & Reliability | Trust the measurements | Is this signal real? | Data Scientists, Leadership | Aggregate metrics + confidence intervals |
| L5: Production Monitoring | Maintain quality in the wild | Is it still working? | SREs, Product Managers | Live dashboards + alerting systems |
How Layers Map to Natural Workflow #
RESEARCH PHASE (Offline Development)
│
├─ L1: Design Benchmarks
│ └─ "What constitutes correct/good performance?"
│
├─ L2: Run Evaluations
│ └─ "Generate responses and score them"
│
├─ L3: Scale Infrastructure
│ └─ "Need to iterate fast → evaluate 10K examples/day"
│
└─ L4: Analyze Results
└─ "Aggregate metrics, validate reliability"
DEPLOYMENT PHASE (Online Production)
│
└─ L5: Monitor Production
└─ "Continuous validation, catch regressions"
└─ Feed failures back to L1 (closed loop)
Different Stakeholders Own Each Layer #
L1: Research Scientists
   → Design evaluation protocols
   → Define what "good" means for the domain

L2: ML Engineers
   → Implement evaluation scripts
   → Run model inference + scoring

L3: MLOps / Infrastructure Engineers
   → Build scalable eval pipelines
   → Manage compute resources

L4: Data Scientists / Research Leadership
   → Statistical analysis of eval results
   → Validate metric reliability
   → Make deployment decisions

L5: SREs / Product Managers
   → Monitor production performance
   → Alert on regressions
   → Coordinate incident response
The Spectrum of Verifiability #
Fully Verifiable ←――――――――――――――――――――――――→ Non-Verifiable
       │                      │                      │
   Code/Math          Technical Writing       Creative/Social
   (external           (hybrid: facts +       (judgment only)
    oracle)             style/clarity)
Examples by Category #
| Fully Verifiable | Hybrid (Partially Verifiable) | Non-Verifiable |
|---|---|---|
| • Code execution | • Technical writing (facts ✓, clarity ✗) | • Creative writing |
| • Math computation | • Translation (accuracy ✓, style ✗) | • Persuasive marketing |
| • Logic proofs | • Legal analysis (precedent ✓, judgment ✗) | • Empathetic therapy |
| • Data extraction | • Medical diagnosis (tests ✓, manner ✗) | • Humor generation |
| • Fact checking | • Recipe generation (chemistry ✓, taste ✗) | • Art critique |
| • Physics simulation | • Code review (bugs ✓, readability ✗) | • Storytelling |
LAYER 1: BENCHMARK DESIGN #
| Verifiable (Code/Math/Logic) | Hybrid (Technical/Professional) | Non-Verifiable (Creative/Social) |
|---|---|---|
| Structure: Problem + Test Suite → Automated Oracle | Structure: Task + Mixed Criteria → Automated + Human | Structure: Prompt + Rubric → Human/AI Judgment |
| Key Benchmarks: • HumanEval (164 code problems) • MATH (12K competition problems) • GPQA (448 PhD-level questions) | Key Benchmarks: • Technical writing quality • Translation (BLEU + human eval) • Medical Q&A (facts + empathy) | Key Benchmarks: • MT-Bench (80 conversations) • AlpacaEval (805 prompts) • Chatbot Arena (live voting) |
| Properties: ✅ Tests are deterministic ✅ Can generate infinite variants ✅ No human disagreement | Properties: ⚠️ Some aspects objective ⚠️ Some aspects subjective ⚠️ Requires dual evaluation | Properties: ❌ Ratings are subjective ❌ Context-dependent quality ❌ Humans disagree frequently |
| Creation Time: 1-2 weeks | Creation Time: 2-3 weeks | Creation Time: 3-4 weeks |
| Example: `assert reverse([1,2,3]) == [3,2,1]` | Example: Accuracy: does the translation preserve meaning? (✓) Fluency: does it sound natural? (human) | Example: "Is this story engaging?" → 5-point Likert scale (human) |
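To make the structural contrast concrete, here is a minimal sketch of a verifiable benchmark item: a prompt paired with an executable test suite that acts as the oracle. The `VerifiableItem` class and the `reverse` task are illustrative, not drawn from HumanEval or any other benchmark.

```python
# Minimal sketch (illustrative names): a verifiable item is a prompt plus an
# executable test suite, and the oracle is simply "all tests pass".
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class VerifiableItem:
    prompt: str                                        # task shown to the model
    tests: list[Callable[[Callable], bool]] = field(default_factory=list)

    def check(self, candidate: Callable) -> bool:
        # External, deterministic oracle: every test must pass.
        return all(test(candidate) for test in self.tests)

item = VerifiableItem(
    prompt="Write a function reverse(xs) that returns the list reversed.",
    tests=[
        lambda f: f([1, 2, 3]) == [3, 2, 1],
        lambda f: f([]) == [],
    ],
)

print(item.check(lambda xs: xs[::-1]))  # True — the candidate passes the oracle
```

A non-verifiable item, by contrast, would replace `tests` with a rubric and a human (or AI-judge) rating protocol.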
LAYER 2: EVALUATION EXECUTION #
| Verifiable | Hybrid | Non-Verifiable |
|---|---|---|
| Pipeline: 1. Generate solutions (10 min) 2. Execute & verify (5 min) 3. Compute metrics (<1 min) | Pipeline: 1. Generate outputs (10 min) 2. Automated checks (5 min) 3. Human evaluation (4-20 hours) 4. Combine scores (30 min) | Pipeline: 1. Generate responses (5 min) 2. Human/AI rating (8-40 hours) 3. Aggregate & validate (1 hour) |
| Verification: `result = execute_code(solution)`<br>`label = PASS if tests_pass else FAIL` | Verification: `facts_correct = verify_facts(output)` (automated)<br>`clarity = human_rate_clarity(output)` (human)<br>`score = 0.5*facts_correct + 0.5*clarity` | Verification: `ratings = get_human_ratings(n=3)`<br>`score = mean(ratings)`<br>then validate inter-rater agreement |
| Throughput: 10K-100K evals/hour | Throughput: 500-5K evals/hour | Throughput: 100-1K evals/hour |
| Cost per eval: $0.001-0.01 | Cost per eval: $0.05-1.00 | Cost per eval: $0.10-5.00 |
| Human time: 0 hours | Human time: 4-20 hours (partial) | Human time: 24-120 hours (full) |
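The zero-human-time column is easiest to see in code. The sketch below, assuming a Python toolchain and only standard-library calls, runs a candidate solution together with its test assertions in a subprocess and maps the exit code to PASS/FAIL; production harnesses add real sandboxing (containers, resource limits, dependency isolation).

```python
# Minimal sketch of the verifiable execution step: write solution + tests to a
# temp file, run it in a subprocess with a timeout, and read the exit code.
import subprocess, sys, tempfile

def run_with_tests(solution_code: str, test_code: str, timeout_s: float = 5.0) -> str:
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(solution_code + "\n" + test_code + "\n")
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path],
                              capture_output=True, timeout=timeout_s)
        return "PASS" if proc.returncode == 0 else "FAIL"
    except subprocess.TimeoutExpired:
        return "TIMEOUT"

solution = "def reverse(xs):\n    return xs[::-1]"
tests = "assert reverse([1, 2, 3]) == [3, 2, 1]\nassert reverse([]) == []"
print(run_with_tests(solution, tests))  # PASS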
LAYER 3: SCALABILITY #
| Verifiable | Hybrid | Non-Verifiable |
|---|---|---|
| Bottleneck: Compute (GPU time) | Bottleneck: Human time for subjective parts | Bottleneck: Human bandwidth |
| Scaling Examples: • 164 problems → $2, 15 min • 10K problems → $122, 2 hours • 100K problems → $1,220, 8 hours | Scaling Examples: • 100 docs → $50, 4 hours • 1K docs → $500, 20 hours • 10K docs → $5K, 200 hours (mix of automated + human) | Scaling Examples: • 80 prompts → $400, 8-40 hours • 10K prompts → $37,500, 2,500 hours • (or ~$8K hybrid with AI judges) |
| Human scaling: 0 hours regardless of scale | Human scaling: Sub-linear (automate what’s possible) | Human scaling: Linear or super-linear |
| Constraint: Money (buy more GPUs) | Constraint: Time + money (human for quality checks) | Constraint: Time (recruit, train raters) |
| Automation: 99.9% | Automation: 40-70% (depends on domain) | Automation: 5-30% (AI judges need validation) |
| Scalability Strategy: • Parallelize across GPUs • Generate synthetic test cases • Cost scales linearly | Scalability Strategy: • Automate objective criteria • Sample human evaluation (10-20%) • Use AI judges for subjective parts (with validation) | Scalability Strategy: • Train reward models on human labels • Use LLM-as-judge (must validate) • Spot-check 5-10% with humans |
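One way to operationalize the hybrid strategy in the table above is to run automated checks on every output and route a random 10-20% sample to human raters. A minimal sketch, with cost figures as illustrative placeholders rather than measured numbers:

```python
# Minimal sketch of hybrid scaling: automated checks on everything, human
# review on a random sample. Per-eval costs below are assumed placeholders.
import random

AUTO_COST_PER_EVAL = 0.01    # assumed automated cost, $
HUMAN_COST_PER_EVAL = 1.00   # assumed human-review cost, $

def plan_hybrid_eval(outputs: list[str], human_fraction: float = 0.15, seed: int = 0):
    rng = random.Random(seed)
    k = max(1, int(len(outputs) * human_fraction))
    human_sample = rng.sample(range(len(outputs)), k)   # indices routed to raters
    cost = len(outputs) * AUTO_COST_PER_EVAL + k * HUMAN_COST_PER_EVAL
    return human_sample, cost

sample, cost = plan_hybrid_eval(["output"] * 1000)
print(len(sample), f"${cost:.2f}")  # 150 human reviews, $160.00 total under these assumptions
```

The point of the sketch is the sub-linear human scaling: doubling the eval set doubles the cheap automated pass but only the sampled fraction of human hours.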
LAYER 4: METRICS & RELIABILITY #
| Verifiable | Hybrid | Non-Verifiable |
|---|---|---|
| Metrics: • pass@k (% solved in k tries) • Compile rate • Exact-match accuracy • Error tolerance | Metrics: • Factual accuracy (automated) • Readability score (formula) • Clarity rating (human) • Combined weighted score | Metrics: • Likert scale (1-5 ratings) • Win rate vs. baseline • Elo ratings (head-to-head) • Thumbs up/down ratio |
| Properties: ✅ Objective & reproducible ✅ Labs can compare directly ✅ No gaming (oracle is external) ✅ Leaderboards are meaningful | Properties: ⚠️ Partially objective ⚠️ Requires careful weighting ⚠️ Some gaming risk on subjective parts ⚠️ Must report both automated and human metrics | Properties: ❌ Subjective & noisy ❌ Different protocols → incomparable ❌ Gaming risk (optimize for the judge) ❌ Leaderboards have selection bias |
| Inter-evaluator agreement: 100% | Inter-evaluator agreement: facts 95-100%, quality 70-85% | Inter-evaluator agreement: 60-80% |
| Example Metrics: • HumanEval pass@1: 67.8% • MATH accuracy: 82.3% • Error rate: 5.2% | Example Metrics: • Translation BLEU: 45.2 (auto) • Fluency: 4.1/5 (human) • Medical accuracy: 94% (auto), empathy: 3.8/5 (human) | Example Metrics: • MT-Bench: 7.9/10 (GPT-4 judge) • Human preference: 78% win rate • Elo rating: 1,245 |
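The pass@k entry above has a standard closed form (popularized by the HumanEval paper): given n generated samples per problem of which c pass, estimate the probability that at least one of k randomly chosen samples passes. A minimal sketch using only the standard library:

```python
# Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k).
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0              # fewer failures than k draws → at least one pass guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples per problem, 50 correct → pass@1 = 0.25, pass@10 ≈ 0.95
print(round(pass_at_k(200, 50, 1), 3), round(pass_at_k(200, 50, 10), 3))
```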
LAYER 5: PRODUCTION MONITORING #
| Verifiable | Hybrid | Non-Verifiable |
|---|---|---|
| Real-time signals: • Does code compile? ✓/✗ • Do tests pass? ✓/✗ • Did the user accept? ✓/✗ • Execution time OK? ✓/✗ | Real-time signals: • Facts verified? ✓/✗ • Format correct? ✓/✗ • User-satisfaction proxy (usage time) • Error rate | Proxy signals: • Thumbs up/down ratio • Session length • Regeneration rate • Response length |
| Monitoring: • Every request → automated check • Dashboard updates: real-time • Regression alerts: instant | Monitoring: • Automated checks: real-time • Human spot-checks: weekly (10% sample) • Combined quality-score trending | Monitoring: • Sample 100 conversations/week • 3 humans rate each • Compare to last month |
| Action: • Compile rate <90% → auto rollback • pass@1 drops >5% → alert engineer | Action: • Fact accuracy <95% → auto rollback • Quality score drops >0.3 → investigate • Run deeper human eval if needed | Action: • Quality drops >0.3 → investigate • A/B test for 7 days • Human eval needed to decide |
| Human role: Only when alerts fire | Human role: Weekly spot-checks (10%) | Human role: Continuous (weekly audits) |
| Dashboard Example: Compile rate: 94.2% 🟢 pass@1: 67.8% 🟢 Latency: 1.2s 🟡 | Dashboard Example: Fact check: 96.1% 🟢 User rating: 4.2/5 🟢 Clarity (sampled): 3.9/5 🟡 | Dashboard Example: Human rating: 4.2/5 🔴 Thumbs up: 78% 🟢 Session time: 8.2 min 🟢 |
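A minimal sketch of the auto-rollback rule in the verifiable column: keep a rolling window of pass/fail results and trigger when the rate drops below a threshold. The window size and 90% threshold are illustrative assumptions, not recommended values.

```python
# Rolling-window regression check: record per-request compile results and
# signal a rollback once the windowed rate falls below the threshold.
from collections import deque

class CompileRateMonitor:
    def __init__(self, window: int = 1000, rollback_below: float = 0.90):
        self.results = deque(maxlen=window)
        self.rollback_below = rollback_below

    def record(self, compiled: bool) -> str:
        self.results.append(compiled)
        rate = sum(self.results) / len(self.results)
        if len(self.results) == self.results.maxlen and rate < self.rollback_below:
            return f"ROLLBACK (compile rate {rate:.1%})"
        return f"OK (compile rate {rate:.1%})"

monitor = CompileRateMonitor(window=10)
for compiled in [True] * 8 + [False] * 2:
    status = monitor.record(compiled)
print(status)  # ROLLBACK (compile rate 80.0%)
```

The hybrid and non-verifiable columns swap the automated signal for sampled human ratings, so the same threshold logic runs weekly on much smaller, noisier windows.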
EACH LAYER HAS A DIFFERENT BOTTLENECK #
| Layer | Verifiable Bottleneck | Hybrid Bottleneck | Non-Verifiable Bottleneck |
|---|---|---|---|
| L1: Design | Writing test suite | Defining which parts are verifiable | Getting human agreement on rubric |
| L2: Execute | GPU inference time | Human time for quality checks | Human annotation time |
| L3: Scale | Compute budget | Hiring raters for quality | Hiring/training many raters |
| L4: Metrics | Statistical analysis | Balancing auto vs human metrics | Inter-rater reliability |
| L5: Monitor | Infrastructure cost | Continuous spot-checking | Continuous human auditing |
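The L4 bottleneck for non-verifiable evaluation, inter-rater reliability, is usually quantified with chance-corrected agreement statistics. A minimal sketch of Cohen's kappa for two raters, with illustrative labels rather than real annotation data:

```python
# Cohen's kappa for two raters: (observed agreement - chance agreement) / (1 - chance agreement).
from collections import Counter

def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n            # observed agreement
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = set(freq_a) | set(freq_b)
    p_e = sum((freq_a[l] / n) * (freq_b[l] / n) for l in labels)       # chance agreement
    return (p_o - p_e) / (1 - p_e)

a = ["good", "good", "bad", "good", "bad", "good"]
b = ["good", "bad", "bad", "good", "bad", "good"]
print(round(cohens_kappa(a, b), 2))  # 0.67 — well below the ~1.0 typical of verifiable oracles
```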
REAL-WORLD EXAMPLES BY CATEGORY #
- Verifiable:
  - ✅ Code Generation (GitHub Copilot, Cursor) → Unit tests verify correctness → Compile rate is objective
  - ✅ Math Problem Solving (Khan Academy AI) → Symbolic solver verifies answers → Can generate infinite practice problems
  - ✅ Data Extraction (GPT-4 with function calling) → Schema validation is deterministic → JSON parsing either works or fails
- Hybrid:
  - ⚠️ Medical Diagnosis Assistant → Facts: test results, drug interactions (verifiable) → Quality: bedside manner, explanation clarity (human eval)
  - ⚠️ Legal Document Analysis → Facts: case precedents, statutes (verifiable) → Quality: argument strength, writing quality (human eval)
  - ⚠️ Translation Systems → Accuracy: BLEU score, term consistency (automated) → Fluency: natural phrasing, cultural adaptation (human eval)
- Non-Verifiable:
  - ❌ Creative Writing (Claude, ChatGPT creative mode) → "Is this story engaging?" → Subjective; no automated test possible
  - ❌ Therapy Chatbots (Woebot, Replika) → "Is this empathetic?" → Cultural/personal; requires human evaluation
  - ❌ Marketing Copy Generation → "Is this persuasive?" → Audience-dependent; A/B testing required (slow, expensive)
The Three-Way Split: #
Verifiable = Evaluation bottlenecked by compute budget
- Fast iteration, predictable costs, objective metrics
- Future: Unlimited synthetic data generation
Hybrid = Evaluation bottlenecked by the subjective slice, managed with smart automation + targeted human input
- Medium iteration speed, mixed costs, dual metrics
- Future: Better AI judges for subjective aspects
Non-Verifiable = Evaluation bottlenecked by human labor availability
- Slow iteration, uncertain costs, noisy metrics
- Future: Constitutional AI, better preference learning
Why This Matters: #
This three-way framework explains why:
- ✅ Coding assistants improve faster than creative writing tools
- ✅ Math tutors are more reliable than therapy chatbots
- ✅ Technical Q&A is easier to align than open-ended conversation
- ⚠️ Medical AI needs dual evaluation (facts + empathy)
- ⚠️ Translation quality requires both automated + human metrics
The future of AI alignment depends on:
- For verifiable domains: More efficient compute
- For hybrid domains: Better decomposition of verifiable vs subjective aspects
- For non-verifiable domains: Creating reliable “oracles” (AI judges as good as compilers)