Eval Infra: Verifiable (STEM) vs Non-Verifiable vs Hybrid
Why These 5 Layers? Separation of Concerns #
Each layer serves a distinct purpose in the AI evaluation lifecycle, from research to production deployment.
| Layer | Purpose | Key Question | Stakeholder | Output |
|---|---|---|---|---|
| L1: Benchmark Design | Define what “good” means | What are we measuring? | Research Scientists | Test suite + evaluation protocol |
| L2: Evaluation Execution | Actually measure performance | How do we score it? | ML Engineers | Raw scores/labels per example |
| L3: Scalability | Handle volume & iteration speed | Can we do this 1000x? | MLOps/Infrastructure | Evaluation pipeline infrastructure |
| L4: Metrics & Reliability | Trust the measurements | Is this signal real? | Data Scientists, Leadership | Aggregate metrics + confidence intervals |
| L5: Production Monitoring | Maintain quality in the wild | Is it still working? | SREs, Product Managers | Live dashboards + alerting systems |
How Layers Map to Natural Workflow #
RESEARCH PHASE (Offline Development)
│
├─ L1: Design Benchmarks
│ └─ "What constitutes correct/good performance?"
│
├─ L2: Run Evaluations
│ └─ "Generate responses and score them"
│
├─ L3: Scale Infrastructure
│ └─ "Need to iterate fast → evaluate 10K examples/day"
│
└─ L4: Analyze Results
└─ "Aggregate metrics, validate reliability"
DEPLOYMENT PHASE (Online Production)
│
└─ L5: Monitor Production
└─ "Continuous validation, catch regressions"
└─ Feed failures back to L1 (closed loop)
Different Stakeholders Own Each Layer #
L1: Research Scientists
   → Design evaluation protocols
   → Define what "good" means for the domain

L2: ML Engineers
   → Implement evaluation scripts
   → Run model inference + scoring

L3: MLOps / Infrastructure Engineers
   → Build scalable eval pipelines
   → Manage compute resources

L4: Data Scientists / Research Leadership
   → Statistical analysis of eval results
   → Validate metric reliability
   → Make deployment decisions

L5: SREs / Product Managers
   → Monitor production performance
   → Alert on regressions
   → Coordinate incident response
The Spectrum of Verifiability #
Fully Verifiable ←――――――――――――――――――――――――→ Non-Verifiable
       │                      │                      │
   Code/Math          Technical Writing       Creative/Social
   (external           (hybrid: facts +       (judgment only)
    oracle)             style/clarity)
Examples by Category #
| Fully Verifiable | Hybrid (Partially Verifiable) | Non-Verifiable |
|---|---|---|
| • Code execution | • Technical writing (facts ✓, clarity ✗) | • Creative writing |
| • Math computation | • Translation (accuracy ✓, style ✗) | • Persuasive marketing |
| • Logic proofs | • Legal analysis (precedent ✓, judgment ✗) | • Empathetic therapy |
| • Data extraction | • Medical diagnosis (tests ✓, manner ✗) | • Humor generation |
| • Fact checking | • Recipe generation (chemistry ✓, taste ✗) | • Art critique |
| • Physics simulation | • Code review (bugs ✓, readability ✗) | • Storytelling |
LAYER 1: BENCHMARK DESIGN #
| Verifiable (Code/Math/Logic) | Hybrid (Technical/Professional) | Non-Verifiable (Creative/Social) |
|---|---|---|
| Structure: Problem + Test Suite → Automated Oracle | Structure: Task + Mixed Criteria → Automated + Human | Structure: Prompt + Rubric → Human/AI Judgment |
| Key Benchmarks: • HumanEval (164 code problems) • MATH (12K competition problems) • GPQA (448 PhD-level questions) | Key Benchmarks: • Technical writing quality • Translation (BLEU + human eval) • Medical Q&A (facts + empathy) | Key Benchmarks: • MT-Bench (80 conversations) • AlpacaEval (805 prompts) • Chatbot Arena (live voting) |
| Properties: ✅ Tests are deterministic ✅ Can generate infinite variants ✅ No human disagreement | Properties: ⚠️ Some aspects objective ⚠️ Some aspects subjective ⚠️ Requires dual evaluation | Properties: ❌ Ratings are subjective ❌ Context-dependent quality ❌ Humans disagree frequently |
| Creation Time: 1-2 weeks | Creation Time: 2-3 weeks | Creation Time: 3-4 weeks |
| Example: `assert reverse([1,2,3]) == [3,2,1]` | Example: Accuracy: does the translation preserve meaning? (✓) Fluency: does it sound natural? (human) | Example: "Is this story engaging?" → 5-point Likert scale (human) |
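To make the structural contrast concrete, here is a minimal sketch of a verifiable benchmark item: a prompt paired with an executable test suite that acts as the oracle. The `VerifiableItem` class and the `reverse` task are illustrative, not drawn from HumanEval or any other benchmark.

```python
# Minimal sketch (illustrative names): a verifiable item is a prompt plus an
# executable test suite, and the oracle is simply "all tests pass".
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class VerifiableItem:
    prompt: str                                        # task shown to the model
    tests: list[Callable[[Callable], bool]] = field(default_factory=list)

    def check(self, candidate: Callable) -> bool:
        # External, deterministic oracle: every test must pass.
        return all(test(candidate) for test in self.tests)

item = VerifiableItem(
    prompt="Write a function reverse(xs) that returns the list reversed.",
    tests=[
        lambda f: f([1, 2, 3]) == [3, 2, 1],
        lambda f: f([]) == [],
    ],
)

print(item.check(lambda xs: xs[::-1]))  # True — the candidate passes the oracle
```

A non-verifiable item, by contrast, would replace `tests` with a rubric and a human (or AI-judge) rating protocol.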
LAYER 2: EVALUATION EXECUTION #
| Verifiable | Hybrid | Non-Verifiable |
|---|---|---|
| Pipeline: 1. Generate solutions (10 min) 2. Execute & verify (5 min) 3. Compute metrics (<1 min) | Pipeline: 1. Generate outputs (10 min) 2. Automated checks (5 min) 3. Human evaluation (4-20 hours) 4. Combine scores (30 min) | Pipeline: 1. Generate responses (5 min) 2. Human/AI rating (8-40 hours) 3. Aggregate & validate (1 hour) |
| Verification: `result = execute_code(solution)`<br>`label = PASS if tests_pass else FAIL` | Verification: `facts_correct = verify_facts(output)` (automated)<br>`clarity = human_rate_clarity(output)` (human)<br>`score = 0.5*facts_correct + 0.5*clarity` | Verification: `ratings = get_human_ratings(n=3)`<br>`score = mean(ratings)`<br>then validate inter-rater agreement |
| Throughput: 10K-100K evals/hour | Throughput: 500-5K evals/hour | Throughput: 100-1K evals/hour |
| Cost per eval: $0.001-0.01 | Cost per eval: $0.05-1.00 | Cost per eval: $0.10-5.00 |
| Human time: 0 hours | Human time: 4-20 hours (partial) | Human time: 24-120 hours (full) |
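The zero-human-time column is easiest to see in code. The sketch below, assuming a Python toolchain and only standard-library calls, runs a candidate solution together with its test assertions in a subprocess and maps the exit code to PASS/FAIL; production harnesses add real sandboxing (containers, resource limits, dependency isolation).

```python
# Minimal sketch of the verifiable execution step: write solution + tests to a
# temp file, run it in a subprocess with a timeout, and read the exit code.
import subprocess, sys, tempfile

def run_with_tests(solution_code: str, test_code: str, timeout_s: float = 5.0) -> str:
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(solution_code + "\n" + test_code + "\n")
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path],
                              capture_output=True, timeout=timeout_s)
        return "PASS" if proc.returncode == 0 else "FAIL"
    except subprocess.TimeoutExpired:
        return "TIMEOUT"

solution = "def reverse(xs):\n    return xs[::-1]"
tests = "assert reverse([1, 2, 3]) == [3, 2, 1]\nassert reverse([]) == []"
print(run_with_tests(solution, tests))  # PASS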
LAYER 3: SCALABILITY #
| Verifiable | Hybrid | Non-Verifiable |
|---|---|---|
| Bottleneck: Compute (GPU time) | Bottleneck: Human time for subjective parts | Bottleneck: Human bandwidth |
| Scaling Examples: • 164 problems → $2, 15 min • 10K problems → $122, 2 hours • 100K problems → $1,220, 8 hours | Scaling Examples: • 100 docs → $50, 4 hours • 1K docs → $500, 20 hours • 10K docs → $5K, 200 hours (mix of automated + human) | Scaling Examples: • 80 prompts → $400, 8-40 hours • 10K prompts → $37,500, 2,500 hours • (or ~$8K hybrid with AI judges) |
| Human scaling: 0 hours regardless of scale | Human scaling: Sub-linear (automate what’s possible) | Human scaling: Linear or super-linear |
| Constraint: Money (buy more GPUs) | Constraint: Time + money (human for quality checks) | Constraint: Time (recruit, train raters) |
| Automation: 99.9% | Automation: 40-70% (depends on domain) | Automation: 5-30% (AI judges need validation) |
| Scalability Strategy: • Parallelize across GPUs • Generate synthetic test cases • Cost scales linearly | Scalability Strategy: • Automate objective criteria • Sample human evaluation (10-20%) • Use AI judges for subjective parts (with validation) | Scalability Strategy: • Train reward models on human labels • Use LLM-as-judge (must validate) • Spot-check 5-10% with humans |
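One way to operationalize the hybrid strategy in the table above is to run automated checks on every output and route a random 10-20% sample to human raters. A minimal sketch, with cost figures as illustrative placeholders rather than measured numbers:

```python
# Minimal sketch of hybrid scaling: automated checks on everything, human
# review on a random sample. Per-eval costs below are assumed placeholders.
import random

AUTO_COST_PER_EVAL = 0.01    # assumed automated cost, $
HUMAN_COST_PER_EVAL = 1.00   # assumed human-review cost, $

def plan_hybrid_eval(outputs: list[str], human_fraction: float = 0.15, seed: int = 0):
    rng = random.Random(seed)
    k = max(1, int(len(outputs) * human_fraction))
    human_sample = rng.sample(range(len(outputs)), k)   # indices routed to raters
    cost = len(outputs) * AUTO_COST_PER_EVAL + k * HUMAN_COST_PER_EVAL
    return human_sample, cost

sample, cost = plan_hybrid_eval(["output"] * 1000)
print(len(sample), f"${cost:.2f}")  # 150 human reviews, $160.00 total under these assumptions
```

The point of the sketch is the sub-linear human scaling: doubling the eval set doubles the cheap automated pass but only the sampled fraction of human hours.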
LAYER 4: METRICS & RELIABILITY #
| Verifiable | Hybrid | Non-Verifiable |
|---|---|---|
| Metrics: • pass@k (% solved in k tries) • Compile rate • Exact-match accuracy • Error tolerance | Metrics: • Factual accuracy (automated) • Readability score (formula) • Clarity rating (human) • Combined weighted score | Metrics: • Likert scale (1-5 ratings) • Win rate vs. baseline • Elo ratings (head-to-head) • Thumbs up/down ratio |
| Properties: ✅ Objective & reproducible ✅ Labs can compare directly ✅ No gaming (oracle is external) ✅ Leaderboards are meaningful | Properties: ⚠️ Partially objective ⚠️ Requires careful weighting ⚠️ Some gaming risk on subjective parts ⚠️ Must report both automated and human metrics | Properties: ❌ Subjective & noisy ❌ Different protocols → incomparable ❌ Gaming risk (optimize for the judge) ❌ Leaderboards have selection bias |
| Inter-evaluator agreement: 100% | Inter-evaluator agreement: facts 95-100%, quality 70-85% | Inter-evaluator agreement: 60-80% |
| Example Metrics: • HumanEval pass@1: 67.8% • MATH accuracy: 82.3% • Error rate: 5.2% | Example Metrics: • Translation BLEU: 45.2 (auto) • Fluency: 4.1/5 (human) • Medical accuracy: 94% (auto), empathy: 3.8/5 (human) | Example Metrics: • MT-Bench: 7.9/10 (GPT-4 judge) • Human preference: 78% win rate • Elo rating: 1,245 |
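The pass@k entry above has a standard closed form (popularized by the HumanEval paper): given n generated samples per problem of which c pass, estimate the probability that at least one of k randomly chosen samples passes. A minimal sketch using only the standard library:

```python
# Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k).
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0              # fewer failures than k draws → at least one pass guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples per problem, 50 correct → pass@1 = 0.25, pass@10 ≈ 0.95
print(round(pass_at_k(200, 50, 1), 3), round(pass_at_k(200, 50, 10), 3))
```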
LAYER 5: PRODUCTION MONITORING #
| Verifiable | Hybrid | Non-Verifiable |
|---|---|---|
| Real-time signals: • Does code compile? ✓/✗ • Do tests pass? ✓/✗ • Did the user accept? ✓/✗ • Execution time OK? ✓/✗ | Real-time signals: • Facts verified? ✓/✗ • Format correct? ✓/✗ • User-satisfaction proxy (usage time) • Error rate | Proxy signals: • Thumbs up/down ratio • Session length • Regeneration rate • Response length |
| Monitoring: • Every request → automated check • Dashboard updates: real-time • Regression alerts: instant | Monitoring: • Automated checks: real-time • Human spot-checks: weekly (10% sample) • Combined quality-score trending | Monitoring: • Sample 100 conversations/week • 3 humans rate each • Compare to last month |
| Action: • Compile rate <90% → auto rollback • pass@1 drops >5% → alert engineer | Action: • Fact accuracy <95% → auto rollback • Quality score drops >0.3 → investigate • Run deeper human eval if needed | Action: • Quality drops >0.3 → investigate • A/B test for 7 days • Human eval needed to decide |
| Human role: Only when alerts fire | Human role: Weekly spot-checks (10%) | Human role: Continuous (weekly audits) |
| Dashboard Example: Compile rate: 94.2% 🟢 pass@1: 67.8% 🟢 Latency: 1.2s 🟡 | Dashboard Example: Fact check: 96.1% 🟢 User rating: 4.2/5 🟢 Clarity (sampled): 3.9/5 🟡 | Dashboard Example: Human rating: 4.2/5 🔴 Thumbs up: 78% 🟢 Session time: 8.2 min 🟢 |
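A minimal sketch of the auto-rollback rule in the verifiable column: keep a rolling window of pass/fail results and trigger when the rate drops below a threshold. The window size and 90% threshold are illustrative assumptions, not recommended values.

```python
# Rolling-window regression check: record per-request compile results and
# signal a rollback once the windowed rate falls below the threshold.
from collections import deque

class CompileRateMonitor:
    def __init__(self, window: int = 1000, rollback_below: float = 0.90):
        self.results = deque(maxlen=window)
        self.rollback_below = rollback_below

    def record(self, compiled: bool) -> str:
        self.results.append(compiled)
        rate = sum(self.results) / len(self.results)
        if len(self.results) == self.results.maxlen and rate < self.rollback_below:
            return f"ROLLBACK (compile rate {rate:.1%})"
        return f"OK (compile rate {rate:.1%})"

monitor = CompileRateMonitor(window=10)
for compiled in [True] * 8 + [False] * 2:
    status = monitor.record(compiled)
print(status)  # ROLLBACK (compile rate 80.0%)
```

The hybrid and non-verifiable columns swap the automated signal for sampled human ratings, so the same threshold logic runs weekly on much smaller, noisier windows.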
EACH LAYER HAS A DIFFERENT BOTTLENECK #
| Layer | Verifiable Bottleneck | Hybrid Bottleneck | Non-Verifiable Bottleneck |
|---|---|---|---|
| L1: Design | Writing test suite | Defining which parts are verifiable | Getting human agreement on rubric |
| L2: Execute | GPU inference time | Human time for quality checks | Human annotation time |
| L3: Scale | Compute budget | Hiring raters for quality | Hiring/training many raters |
| L4: Metrics | Statistical analysis | Balancing auto vs human metrics | Inter-rater reliability |
| L5: Monitor | Infrastructure cost | Continuous spot-checking | Continuous human auditing |
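The L4 bottleneck for non-verifiable evaluation, inter-rater reliability, is usually quantified with chance-corrected agreement statistics. A minimal sketch of Cohen's kappa for two raters, with illustrative labels rather than real annotation data:

```python
# Cohen's kappa for two raters: (observed agreement - chance agreement) / (1 - chance agreement).
from collections import Counter

def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n            # observed agreement
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = set(freq_a) | set(freq_b)
    p_e = sum((freq_a[l] / n) * (freq_b[l] / n) for l in labels)       # chance agreement
    return (p_o - p_e) / (1 - p_e)

a = ["good", "good", "bad", "good", "bad", "good"]
b = ["good", "bad", "bad", "good", "bad", "good"]
print(round(cohens_kappa(a, b), 2))  # 0.67 — well below the ~1.0 typical of verifiable oracles
```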
REAL-WORLD EXAMPLES BY CATEGORY #
- Verifiable:
  - ✅ Code Generation (GitHub Copilot, Cursor) → Unit tests verify correctness → Compile rate is objective
  - ✅ Math Problem Solving (Khan Academy AI) → Symbolic solver verifies answers → Can generate infinite practice problems
  - ✅ Data Extraction (GPT-4 with function calling) → Schema validation is deterministic → JSON parsing either works or fails
- Hybrid:
  - ⚠️ Medical Diagnosis Assistant → Facts: test results, drug interactions (verifiable) → Quality: bedside manner, explanation clarity (human eval)
  - ⚠️ Legal Document Analysis → Facts: case precedents, statutes (verifiable) → Quality: argument strength, writing quality (human eval)
  - ⚠️ Translation Systems → Accuracy: BLEU score, term consistency (automated) → Fluency: natural phrasing, cultural adaptation (human eval)
- Non-Verifiable:
  - ❌ Creative Writing (Claude, ChatGPT creative mode) → "Is this story engaging?" → Subjective; no automated test possible
  - ❌ Therapy Chatbots (Woebot, Replika) → "Is this empathetic?" → Cultural/personal; requires human evaluation
  - ❌ Marketing Copy Generation → "Is this persuasive?" → Audience-dependent; A/B testing required (slow, expensive)
The Three-Way Split: #
Verifiable = Evaluation bottlenecked by compute budget
- Fast iteration, predictable costs, objective metrics
- Future: Unlimited synthetic data generation
Hybrid = Evaluation bottlenecked by the subjective slice, managed with smart automation + targeted human input
- Medium iteration speed, mixed costs, dual metrics
- Future: Better AI judges for subjective aspects
Non-Verifiable = Evaluation bottlenecked by human labor availability
- Slow iteration, uncertain costs, noisy metrics
- Future: Constitutional AI, better preference learning
Why This Matters: #
This three-way framework explains why:
- ✅ Coding assistants improve faster than creative writing tools
- ✅ Math tutors are more reliable than therapy chatbots
- ✅ Technical Q&A is easier to align than open-ended conversation
- ⚠️ Medical AI needs dual evaluation (facts + empathy)
- ⚠️ Translation quality requires both automated + human metrics
The future of AI alignment depends on:
- For verifiable domains: More efficient compute
- For hybrid domains: Better decomposition of verifiable vs subjective aspects
- For non-verifiable domains: Creating reliable “oracles” (AI judges as good as compilers)