Section B · Critical

Evals

How you know an AI system actually works. The topic that separates serious AI engineers from prompt hobbyists.

Why evals matter (especially in compliance)

You can't ship an AI system that produces compliance artifacts without a way to answer:

"Is this output good enough?"
"Is the new prompt better than the old one?"
"Did the model regress when we upgraded?"
"When the regulator asks why we trusted this AI, what's our evidence?"

The line

No evals = no production AI in regulated domains. Evals are how you answer all four questions above.

The mental model: tests for non-deterministic systems

Unit tests assert exact equality. LLM outputs vary. Evals are the bridge:

Unit test: assertEqual(output, "Hello") — fails on any deviation.
Eval: "On 200 representative inputs, the output passes a quality bar X% of the time, judged by [rubric/judge/humans]."

Evals trade exactness for statistical confidence. You move from "did this pass?" to "is this version meaningfully better than the last?"

The four building blocks of an eval

Every eval has four components. If you can name and design these for any task, you're 80% of the way there.

1. Dataset

A curated set of inputs (and often expected outputs) that represents your real workload.

Size: 50-200 to start. More for high-stakes tasks.
Coverage: include edge cases, adversarial inputs, common failures, real customer data (sanitized).
Versioned: dataset changes get tracked. "We added 30 sanctions edge cases on 2026-04-12."
Source: from production traces, from synthetic generation, from SME (subject matter expert) curation. Best evals mix all three.

In compliance: ground-truth labeled examples are gold. "Here are 100 alerts; SMEs labeled 60 as 'should dismiss', 40 as 'should escalate'."

2. Task / system under test

The thing being evaluated:

A single prompt + model
A full agent loop (multi-turn, tool-calling)
An end-to-end workflow (n8n + agents + post-processing)

You eval at multiple levels: component (prompt), unit (one tool), integration (the full agent on a task), end-to-end (the workflow).

3. Grader / judge

Grader type	When to use	Pros	Cons
Exact match / regex	Deterministic tasks, classification	Fast, cheap, reproducible	Brittle on free-form text
Numeric comparison	Math, counts	Reliable	Limited applicability
Code-based check	Format validation, JSON schema, business rules	Trustworthy	Limited to checkable properties
LLM-as-judge	Free-form quality	Scalable	Bias, jailbreak risk, drift
Human review	High-stakes, subjective, ground truth	Most reliable	Slow, expensive
Hybrid	Production systems	Practical balance	More to maintain

Compliance reality

High-stakes outputs (SAR drafts, risk recommendations) demand human review at least on a sample. LLM-as-judge is fine for "did the agent follow the format" or "is the summary on-topic."

4. Metric

The number(s) that come out. Common ones:

Accuracy / Pass rate: % of cases passing the grader.
Precision / Recall / F1: classification tasks (e.g. alert triage).
BLEU, ROUGE: text similarity (mostly outdated for modern LLMs).
Win rate: head-to-head ("model A vs B — which is better, judged by [judge]?").
Cost per task, latency p50/p95: not quality metrics, but you track them alongside.
Tool-use correctness: did the agent call the right tool with the right args?
Trajectory match: did the agent's full sequence of steps match what an expert would do?
Citation/grounding rate: % of claims backed by retrieved sources (for RAG).

For compliance specifically:

Recall on must-catch cases (you cannot miss a sanctions hit; precision is secondary)
False positive rate (every false positive costs a human's time)
Calibration (when the model says "high confidence," is it actually right?)

LLM-as-judge — the technique you must understand

Modern eval pipelines lean heavily on "LLM-as-judge": you use a strong model (often Claude Opus or GPT-4-class) to grade outputs from your system. Cheap, scalable, surprisingly good.

judge prompt pattern

You are a compliance reviewer. Given the alert, the draft narrative, and a rubric,
score the narrative on:
1. Factuality (does it contradict the alert data?)
2. Completeness (does it cover required SAR sections?)
3. Tone (is it appropriate for regulator review?)

Return JSON: { factuality: 0-5, completeness: 0-5, tone: 0-5, reasoning: "..." }

Pitfalls to acknowledge

Position bias: in head-to-head comparison, judges favor whichever response comes first. Mitigate by randomizing order and running both directions.
Length bias: judges often prefer longer responses. Calibrate.
Style match bias: judges prefer outputs in their own style.
Self-preference: a model judging its own output rates higher than it should. Use a different model as judge.
Drift: when the judge model is upgraded, your historical eval scores aren't comparable. Pin judge versions.
Jailbreak/injection risk: malicious inputs may try to trick the judge. Sandbox the judge call.

"How do you trust an LLM-as-judge?"

"I anchor it. I take 50-100 outputs, have humans grade them, and check that the LLM judge agrees with humans at >90%. If not, I refine the rubric. Then I track judge-human agreement on a rolling sample to detect drift."

Eval frameworks worth knowing by name

Anthropic eval cookbook / prompt eval tools — first-party guidance and patterns.
OpenAI Evals — open source, schema for grader/dataset/runner. Influential.
Inspect AI (UK AISI) — rigorous Python framework, used in safety research.
Promptfoo — developer-friendly, YAML-driven, great for CI.
DeepEval — Pytest-style, lots of metrics.
Ragas — RAG-specific (faithfulness, answer relevance, context precision).
LangSmith — LangChain's hosted eval/observability platform.
Braintrust — popular eval/observability platform with strong UI.
Weights & Biases Weave — eval + tracing.
Vellum, Humanloop, Helicone, Phoenix (Arize) — adjacent platforms.

If asked which you'd use: "For a compliance org, I'd start with Promptfoo or DeepEval for fast iteration in dev, plus a hosted observability/eval tool like Braintrust or Phoenix for production tracing and continuous evals. For RAG-heavy use cases, Ragas-style metrics."

Evals are continuous, not one-shot

Three eval moments matter:

Pre-deploy / dev evals: you run these as you build prompts and agents. Like unit tests for AI. CI-integrated.
Pre-release regression: before shipping a new prompt or model upgrade, you confirm no regression on the canonical set.
Production / online evals: real traffic samples are graded continuously. Flags drift, degradation, novel failures.

Compliance bonus

Production evals double as your audit story. "We sample 5% of agent outputs for SME review and 100% for LLM-judge scoring; mean factuality has been ≥4.2/5 over the past 90 days. Here's the dashboard."

Eval-driven development (EDD) — the workflow

This is the workflow modern AI engineers practice. Be ready to describe it.

Start with the failure cases, not the prompt. Collect 20-30 real or realistic inputs.
Define the grader before writing the prompt. Force yourself to define "good."
Write the simplest prompt that might work. Run the eval.
Inspect failures by reading actual outputs. Cluster the failure modes.
Edit prompt / add tools / add retrieval to address top failure cluster.
Re-run eval. Compare scores. Did the change help? Did it regress anything?
Repeat. Don't move on until you understand each remaining failure.
Hold out a test set the prompt has never seen. Score on it before deploying.

Say this verbatim

"I'd build the eval set first, then the prompt." It signals maturity.

Agent evals — harder, more important

Evaluating an agent (multi-step, tool-using) is harder than evaluating a single prompt because:

The output isn't just text — it's a trajectory (the sequence of tool calls).
One mistake mid-trajectory cascades.
Ground truth is harder to define ("there are several valid ways to do this").

Common agent eval techniques:

Final-answer eval: did the agent reach the right end state, ignoring path.
Trajectory eval: did the agent's tool sequence match an expert trajectory.
Step-wise eval: at each step, was the chosen tool reasonable given the state.
Cost / step count: did the agent spend a reasonable number of tool calls.
Permission / safety: did the agent stay within authorized actions.
Recovery: when the agent encountered an error, did it recover or spiral.

For compliance: trajectory and step-wise matter most. "The agent dismissed this alert" is an end state, but a regulator wants to see why — every step that led there.

Common eval anti-patterns to call out

Vibe-checking: "I tried 5 examples and it looked good." Not an eval.
Train on test: tweaking the prompt against the same examples you're scoring against. Hold out a real test set.
Single-point estimates: reporting one accuracy number without confidence intervals.
One-grader gospel: trusting one LLM judge with no human anchoring.
Stale datasets: data that doesn't reflect current production traffic.
Eval avoidance: "we'll add evals later" — by which point you can't tell if you've regressed.

What to say if asked "have you written evals?"

If you haven't formally — be honest:

The honest answer

"I haven't built a formal eval harness in production yet. The closest I've done is [whatever — manual quality reviews, small-scale testing on side projects, structured review of outputs]. I understand the pattern: dataset, grader, metric, run continuously, anchor against humans. If I were starting a compliance eval program from scratch, I'd [walk through the EDD loop above]."

That's a winning answer. Honest + technically grounded + actionable.

Cheat sheet — vocabulary

Eval set / golden set / canonical set — the labeled dataset
Grader / judge / scorer — the function that scores an output
Rubric — the criteria the grader uses
LLM-as-judge — using an LLM to grade
Pairwise / head-to-head — comparing two outputs side by side
Pass rate / accuracy — % of cases that meet the bar
Regression test — re-run on canonical set to catch quality regressions
Online eval / shadow eval — eval running on real production traffic
Trajectory eval — evaluating the full agent path, not just the final answer
Calibration — does model confidence match actual correctness
Trace — the recorded sequence of LLM calls, tool calls, and outputs