Eval Infra: Verifiable vs Non-Verifiable vs Hybrid

Why These 5 Layers? Separation of Concerns #

Each layer serves a distinct purpose in the AI evaluation lifecycle, from research to production deployment.

| Layer | Purpose | Key Question | Stakeholder | Output |
|---|---|---|---|---|
| L1: Benchmark Design | Define what “good” means | What are we measuring? | Research Scientists | Test suite + evaluation protocol |
| L2: Evaluation Execution | Actually measure performance | How do we score it? | ML Engineers | Raw scores/labels per example |
| L3: Scalability | Handle volume & iteration speed | Can we do this 1000x? | MLOps/Infrastructure | Evaluation pipeline infrastructure |
| L4: Metrics & Reliability | Trust the measurements | Is this signal real? | Data Scientists, Leadership | Aggregate metrics + confidence intervals |
| L5: Production Monitoring | Maintain quality in the wild | Is it still working? | SREs, Product Managers | Live dashboards + alerting systems |

How Layers Map to Natural Workflow #

RESEARCH PHASE (Offline Development)
│
├─ L1: Design Benchmarks
│   └─ "What constitutes correct/good performance?"
│
├─ L2: Run Evaluations  
│   └─ "Generate responses and score them"
│
├─ L3: Scale Infrastructure
│   └─ "Need to iterate fast → evaluate 10K examples/day"
│
└─ L4: Analyze Results
    └─ "Aggregate metrics, validate reliability"

DEPLOYMENT PHASE (Online Production)
│
└─ L5: Monitor Production
    └─ "Continuous validation, catch regressions"
    └─ Feed failures back to L1 (closed loop)

Different Stakeholders Own Each Layer #

┌─────────────────────────────────────────────────────┐
│ L1: Research Scientists                             │
│     → Design evaluation protocols                   │
│     → Define what "good" means for the domain       │
└─────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────┐
│ L2: ML Engineers                                    │
│     → Implement evaluation scripts                  │
│     → Run model inference + scoring                 │
└─────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────┐
│ L3: MLOps / Infrastructure Engineers                │
│     → Build scalable eval pipelines                 │
│     → Manage compute resources                      │
└─────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────┐
│ L4: Data Scientists / Research Leadership           │
│     → Statistical analysis of eval results          │
│     → Validate metric reliability                   │
│     → Make deployment decisions                     │
└─────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────┐
│ L5: SREs / Product Managers                         │
│     → Monitor production performance                │
│     → Alert on regressions                          │
│     → Coordinate incident response                  │
└─────────────────────────────────────────────────────┘

The Spectrum of Verifiability #

Fully Verifiable ←――――――――――――――――――――――――→ Non-Verifiable
      │                    │                      │
   Code/Math          Technical Writing      Creative/Social
   (external          (hybrid: facts +       (judgment only)
    oracle)            style/clarity)

Examples by Category #

| Fully Verifiable | Hybrid (Partially Verifiable) | Non-Verifiable |
|---|---|---|
| Code execution | Technical writing (facts ✓, clarity ✗) | Creative writing |
| Math computation | Translation (accuracy ✓, style ✗) | Persuasive marketing |
| Logic proofs | Legal analysis (precedent ✓, judgment ✗) | Empathetic therapy |
| Data extraction | Medical diagnosis (tests ✓, manner ✗) | Humor generation |
| Fact checking | Recipe generation (chemistry ✓, taste ✗) | Art critique |
| Physics simulation | Code review (bugs ✓, readability ✗) | Storytelling |

LAYER 1: BENCHMARK DESIGN #

| | Verifiable (Code/Math/Logic) | Hybrid (Technical/Professional) | Non-Verifiable (Creative/Social) |
|---|---|---|---|
| Structure | Problem + Test Suite → Automated Oracle | Task + Mixed Criteria → Automated + Human | Prompt + Rubric → Human/AI Judgment |
| Key Benchmarks | • HumanEval (164 code problems)<br>• MATH (12K competition problems)<br>• GPQA (448 PhD questions) | • Technical Writing Quality<br>• Translation (BLEU + human eval)<br>• Medical Q&A (facts + empathy) | • MT-Bench (80 conversations)<br>• AlpacaEval (805 prompts)<br>• Chatbot Arena (live voting) |
| Properties | ✅ Tests are deterministic<br>✅ Can generate infinite variants<br>✅ No human disagreement | ⚠️ Some aspects objective<br>⚠️ Some aspects subjective<br>⚠️ Requires dual evaluation | ❌ Ratings are subjective<br>❌ Context-dependent quality<br>❌ Humans disagree frequently |
| Creation Time | 1-2 weeks | 2-3 weeks | 3-4 weeks |
| Example | `assert reverse([1,2,3]) == [3,2,1]` | Accuracy: Does translation preserve meaning? (✓)<br>Fluency: Does it sound natural? (human) | “Is this story engaging?”<br>→ 5-point Likert scale (human) |
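
To make the verifiable column concrete: a benchmark item can be nothing more than a prompt plus an executable test suite, and the oracle is whatever runs the tests. Below is a minimal sketch; the item, the candidate solution, and the bare `exec` harness are illustrative (real harnesses such as HumanEval's run candidates in an isolated sandbox with timeouts).

```python
# Minimal sketch of a verifiable benchmark item: a prompt plus executable tests.
# Illustrative only; production harnesses sandbox the execution.

benchmark_item = {
    "prompt": "Write a function reverse(xs) that returns the list reversed.",
    "tests": [
        "assert reverse([1, 2, 3]) == [3, 2, 1]",
        "assert reverse([]) == []",
        "assert reverse(['a']) == ['a']",
    ],
}

def run_oracle(candidate: str, tests: list) -> str:
    """Define the candidate, run every assert; any exception counts as FAIL."""
    namespace = {}
    try:
        exec(candidate, namespace)        # defines reverse()
        for test in tests:
            exec(test, namespace)         # deterministic, no human in the loop
        return "PASS"
    except Exception:
        return "FAIL"

print(run_oracle("def reverse(xs): return xs[::-1]", benchmark_item["tests"]))  # PASS
```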

LAYER 2: EVALUATION EXECUTION #

| | Verifiable | Hybrid | Non-Verifiable |
|---|---|---|---|
| Pipeline | 1. Generate solutions (10 min)<br>2. Execute & verify (5 min)<br>3. Compute metrics (<1 min) | 1. Generate outputs (10 min)<br>2. Automated checks (5 min)<br>3. Human evaluation (4-20 hours)<br>4. Combine scores (30 min) | 1. Generate responses (5 min)<br>2. Human/AI rating (8-40 hours)<br>3. Aggregate & validate (1 hour) |
| Verification | `result = execute_code(solution)`<br>`label = "PASS" if tests_pass else "FAIL"` | `facts_correct = verify_facts(output)` (automated)<br>`clarity = human_rate_clarity(output)` (human)<br>`score = 0.5*facts_correct + 0.5*clarity` | `ratings = get_human_ratings(n=3)`<br>`score = mean(ratings)`<br>then validate inter-rater agreement |
| Throughput | 10K-100K evals/hour | 500-5K evals/hour | 100-1K evals/hour |
| Cost per eval | $0.001-0.01 | $0.05-1.00 | $0.10-5.00 |
| Human time | 0 hours | 4-20 hours (partial) | 24-120 hours (full) |
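
The hybrid verification cell above reduces to a weighted blend of an automated check and averaged human ratings. Here is a minimal sketch of that combination; `fact_checker` and the 1-5 clarity ratings are stand-ins for whatever checker and annotation tool you actually use.

```python
from dataclasses import dataclass

@dataclass
class HybridScore:
    facts: float      # automated component, 0.0-1.0
    clarity: float    # human component, rescaled from 1-5 to 0.0-1.0
    combined: float

def score_output(output, fact_checker, clarity_ratings,
                 w_facts=0.5, w_clarity=0.5):
    """Blend an automated fact check with averaged human clarity ratings."""
    facts = fact_checker(output)                              # e.g. fraction of claims verified
    mean_rating = sum(clarity_ratings) / len(clarity_ratings) # 1-5 Likert
    clarity = (mean_rating - 1) / 4                           # map 1-5 -> 0-1
    return HybridScore(facts, clarity, w_facts * facts + w_clarity * clarity)

# Toy usage: a stub checker standing in for a real fact verifier, three annotators.
print(score_output("example output", lambda text: 0.9, clarity_ratings=[4, 5, 4]))
```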

LAYER 3: SCALABILITY #

| | Verifiable | Hybrid | Non-Verifiable |
|---|---|---|---|
| Bottleneck | Compute (GPU time) | Human time for subjective parts | Human bandwidth |
| Scaling Examples | • 164 problems → $2, 15 min<br>• 10K problems → $122, 2 hours<br>• 100K problems → $1,220, 8 hours | • 100 docs → $50, 4 hours<br>• 1K docs → $500, 20 hours<br>• 10K docs → $5K, 200 hours<br>(mix of automated + human) | • 80 prompts → $400, 8-40 hours<br>• 10K prompts → $37,500, 2,500 hours<br>• (or ~$8K hybrid with AI judges) |
| Human scaling | 0 hours regardless of scale | Sub-linear (automate what’s possible) | Linear or super-linear |
| Constraint | Money (buy more GPUs) | Time + money (human for quality checks) | Time (recruit, train raters) |
| Automation | 99.9% | 40-70% (depends on domain) | 5-30% (AI judges need validation) |
| Scalability Strategy | • Parallelize across GPUs<br>• Generate synthetic test cases<br>• Cost scales linearly | • Automate objective criteria<br>• Sample human evaluation (10-20%)<br>• Use AI judges for subjective parts (with validation) | • Train reward models on human labels<br>• Use LLM-as-judge (must validate)<br>• Spot-check 5-10% with humans |
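
The scaling examples above come down to two numbers per category: dollars per eval and human minutes per eval. A rough budgeting helper using the per-eval figures implied by the 10K rows (ballpark numbers from this article, not measured costs):

```python
# Back-of-the-envelope scaling estimate; figures are illustrative midpoints.

PROFILES = {
    "verifiable":     {"cost_per_eval": 0.012, "human_min_per_eval": 0.0},
    "hybrid":         {"cost_per_eval": 0.50,  "human_min_per_eval": 1.2},   # partial human sampling
    "non_verifiable": {"cost_per_eval": 3.75,  "human_min_per_eval": 15.0},  # e.g. 3 raters x ~5 min
}

def estimate(category, n_examples):
    p = PROFILES[category]
    return {
        "dollars": round(n_examples * p["cost_per_eval"], 2),
        "human_hours": round(n_examples * p["human_min_per_eval"] / 60, 1),
    }

for category in PROFILES:
    print(category, estimate(category, 10_000))
```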

LAYER 4: METRICS & RELIABILITY #

| | Verifiable | Hybrid | Non-Verifiable |
|---|---|---|---|
| Metrics | • pass@k (% solved in k tries)<br>• Compile rate<br>• Exact match accuracy<br>• Error tolerance | • Factual accuracy (automated)<br>• Readability score (formula)<br>• Clarity rating (human)<br>• Combined weighted score | • Likert scale (1-5 ratings)<br>• Win rate vs baseline<br>• Elo ratings (head-to-head)<br>• Thumbs up/down ratio |
| Properties | ✅ Objective & reproducible<br>✅ Labs can compare directly<br>✅ No gaming (oracle is external)<br>✅ Leaderboards meaningful | ⚠️ Partially objective<br>⚠️ Requires careful weighting<br>⚠️ Some gaming risk on subjective parts<br>⚠️ Need to report both auto + human metrics | ❌ Subjective & noisy<br>❌ Different protocols → incomparable<br>❌ Gaming risk (optimize for judge)<br>❌ Leaderboards have selection bias |
| Inter-evaluator agreement | 100% | Facts: 95-100%<br>Quality: 70-85% | 60-80% |
| Example Metrics | • HumanEval pass@1: 67.8%<br>• MATH accuracy: 82.3%<br>• Error rate: 5.2% | • Translation BLEU: 45.2 (auto)<br>• Fluency: 4.1/5 (human)<br>• Medical accuracy: 94% (auto), Empathy: 3.8/5 (human) | • MT-Bench: 7.9/10 (GPT-4 judge)<br>• Human preference: 78% win rate<br>• Elo rating: 1,245 |
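
For the verifiable column, pass@k is normally computed with the unbiased estimator from the HumanEval paper rather than by literally resampling k attempts per problem. A small sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n samples drawn per problem, c of them passed the tests."""
    if n - c < k:          # every size-k draw must contain a correct sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# 20 samples for one problem, 5 of them correct:
print(round(pass_at_k(20, 5, 1), 3))    # 0.25
print(round(pass_at_k(20, 5, 10), 4))
```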

LAYER 5: PRODUCTION MONITORING #

| | Verifiable | Hybrid | Non-Verifiable |
|---|---|---|---|
| Signals | Real-time:<br>• Does code compile? ✓/✗<br>• Tests pass? ✓/✗<br>• User accepted? ✓/✗<br>• Execution time OK? ✓/✗ | Real-time:<br>• Facts verified? ✓/✗<br>• Format correct? ✓/✗<br>• User satisfaction proxy (usage time)<br>• Error rate | Proxy:<br>• Thumbs up/down ratio<br>• Session length<br>• Regeneration rate<br>• Response length |
| Monitoring | • Every request → automated check<br>• Dashboard updates: real-time<br>• Regression alerts: instant | • Automated checks: real-time<br>• Human spot-checks: weekly (10% sample)<br>• Combined quality score trending | • Sample 100 conversations/week<br>• 3 humans rate each<br>• Compare to last month |
| Action | • Compile rate <90% → auto rollback<br>• Pass@1 drops >5% → alert engineer | • Fact accuracy <95% → auto rollback<br>• Quality score drops >0.3 → investigate<br>• Run deeper human eval if needed | • Quality drops >0.3 → investigate<br>• A/B test for 7 days<br>• Need human eval to decide |
| Human role | Only when alerts fire | Weekly spot-checks (10%) | Continuous (weekly audits) |
| Dashboard Example | Compile Rate: 94.2% 🟢<br>Pass@1: 67.8% 🟢<br>Latency: 1.2s 🟡 | Fact Check: 96.1% 🟢<br>User Rating: 4.2/5 🟢<br>Clarity (sampled): 3.9/5 🟡 | Human Rating: 4.2/5 🔴<br>Thumbs Up: 78% 🟢<br>Session Time: 8.2min 🟢 |
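
The “Action” rows above are just threshold rules over dashboard metrics. A sketch of the verifiable-column logic, with thresholds taken from the table; the rollback and paging hooks are placeholders for whatever your deployment stack provides.

```python
# Verifiable-column alerting as pure threshold rules; hooks are placeholders.

def check_verifiable(metrics, baseline):
    actions = []
    if metrics["compile_rate"] < 0.90:
        actions.append("auto_rollback")
    if baseline["pass_at_1"] - metrics["pass_at_1"] > 0.05:
        actions.append("alert_engineer")          # >5-point pass@1 drop
    return actions

print(check_verifiable(
    metrics={"compile_rate": 0.942, "pass_at_1": 0.60},
    baseline={"pass_at_1": 0.678},
))  # ['alert_engineer']
```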

EACH LAYER HAS DIFFERENT BOTTLENECK #

| Layer | Verifiable Bottleneck | Hybrid Bottleneck | Non-Verifiable Bottleneck |
|---|---|---|---|
| L1: Design | Writing test suite | Defining which parts are verifiable | Getting human agreement on rubric |
| L2: Execute | GPU inference time | Human time for quality checks | Human annotation time |
| L3: Scale | Compute budget | Hiring raters for quality | Hiring/training many raters |
| L4: Metrics | Statistical analysis | Balancing auto vs human metrics | Inter-rater reliability |
| L5: Monitor | Infrastructure cost | Continuous spot-checking | Continuous human auditing |

REAL-WORLD EXAMPLES BY CATEGORY #

  • Verifiable:

        ✅ Code Generation (GitHub Copilot, Cursor)
           → Unit tests verify correctness
           → Compile rate is objective
    
        ✅ Math Problem Solving (Khan Academy AI)
           → Symbolic solver verifies answers
           → Can generate infinite practice problems
    
        ✅ Data Extraction (GPT-4 with function calling)
           → Schema validation is deterministic
           → JSON parsing either works or fails (see the sketch after this list)
    
  • Hybrid:

        ⚠️ Medical Diagnosis Assistant
           → Facts: Test results, drug interactions (verifiable)
           → Quality: Bedside manner, explanation clarity (human eval)
    
        ⚠️ Legal Document Analysis
           → Facts: Case precedents, statutes (verifiable)
           → Quality: Argument strength, writing quality (human eval)
    
        ⚠️ Translation Systems
           → Accuracy: BLEU score, term consistency (automated)
           → Fluency: Natural phrasing, cultural adaptation (human eval)
    
  • Non-Verifiable:

        ❌ Creative Writing (Claude, ChatGPT creative mode)
           → "Is this story engaging?" → Subjective
           → No automated test possible
    
        ❌ Therapy Chatbots (Woebot, Replika)
           → "Is this empathetic?" → Cultural/personal
           → Requires human evaluation
    
        ❌ Marketing Copy Generation
           → "Is this persuasive?" → Audience-dependent
           → A/B testing required (slow, expensive)
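
The data-extraction example in the Verifiable group is the cleanest illustration of an external oracle in production: the output either parses and matches the expected schema or it does not. A minimal sketch using only the standard library; the schema and sample outputs are made up.

```python
import json

# Deterministic extraction oracle: parse, then check expected keys and types.
EXPECTED = {"name": str, "date": str, "amount": float}

def verify_extraction(model_output):
    try:
        record = json.loads(model_output)
    except json.JSONDecodeError:
        return False
    return all(isinstance(record.get(key), typ) for key, typ in EXPECTED.items())

print(verify_extraction('{"name": "Acme", "date": "2024-03-01", "amount": 19.99}'))  # True
print(verify_extraction('{"name": "Acme"}'))                                          # False
```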
    

The Three-Way Split: #

Verifiable = Evaluation bottlenecked by compute budget

  • Fast iteration, predictable costs, objective metrics
  • Future: Unlimited synthetic data generation

Hybrid = Evaluation bottlenecked by the subjective slice that still needs targeted human input

  • Medium iteration speed, mixed costs, dual metrics
  • Future: Better AI judges for subjective aspects

Non-Verifiable = Evaluation bottlenecked by human labor availability

  • Slow iteration, uncertain costs, noisy metrics
  • Future: Constitutional AI, better preference learning

Why This Matters: #

This three-way framework explains why:

  • ✅ Coding assistants improve faster than creative writing tools
  • ✅ Math tutors are more reliable than therapy chatbots
  • ✅ Technical Q&A is easier to align than open-ended conversation
  • ⚠️ Medical AI needs dual evaluation (facts + empathy)
  • ⚠️ Translation quality requires both automated + human metrics

The future of AI alignment depends on:

  1. For verifiable domains: More efficient compute
  2. For hybrid domains: Better decomposition of verifiable vs subjective aspects
  3. For non-verifiable domains: Creating reliable “oracles” (AI judges as good as compilers)