Eval harnesses are the new test suite

For ten years I wrote unit tests. Then I started shipping AI features and realized: unit tests can't catch a model getting subtly worse overnight. Here's what replaced them.

For ten years I wrote unit tests. Some of them were good, some were boilerplate, but the discipline was real — every change ran through a test suite, and a red light meant I had introduced a regression. The system was deterministic. The tests caught regressions because the regressions were deterministic too.

Then I started shipping features built on language models. The first time a deployed feature got quietly worse with no code change on my side, I realized the test suite I had inherited from a decade of software engineering would not catch what would actually go wrong.

Unit tests tell you when your code does something different. AI features need a test discipline that tells you when your model is doing something worse — and worse is much harder to detect than different.

Why unit tests fail for AI

Traditional software has a deterministic input-output contract. Same input, same output. A regression is a code change that breaks an assertion. You bisect commits, find the cause, fix it.

An AI feature has a probabilistic contract. Same input, slightly different output every time, all of which may be acceptable. The whole shape of "what's correct" is statistical, not exact.

This means three things break at once when you try to apply unit-test discipline to AI:

You cannot assert exact match. Two runs of the same prompt with the same model can differ in punctuation, in ordering, in word choice. None of those differences indicate a regression. Asserting on output text directly produces a flaky test suite that no one trusts.
You cannot bisect commits. A "regression" in an AI feature might come from your code change, but it might also come from the model provider silently updating the underlying model weights, from a change in tokenizer, or from rate limiting that re-routed your traffic to a different inference cluster. The cause is often outside your repo entirely.
The tests can pass while the output gets worse. If your assertions check for response length, JSON validity, and the presence of specific phrases, those can all pass while the actual quality of the answer to the user's question quietly degrades. Pass/fail is the wrong vocabulary.

So the mental model has to shift: from did my code change? to is the system as a whole behaving worse? That second question is what an eval harness answers.

What evals actually are

Strip the jargon. An eval is three things:

A set of inputs — your "golden set," hand-curated test cases representing the spectrum of real production inputs.
A way to score the output — one or more functions that take an output and produce a quality score. Not a boolean. A continuous signal.
A baseline and a threshold — what the score looks like when the system is healthy, and how much it can drop before you care.

That's the whole concept. Everything else is implementation.
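
As a minimal sketch, the three parts can be written down as plain data and functions. Every name here (GoldenCase, EvalSpec, run_eval, token_overlap) is hypothetical, the model is any function from prompt to text rather than a real endpoint, and the word-overlap scorer is a toy stand-in for a real scoring function.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable


@dataclass(frozen=True)
class GoldenCase:
    case_id: str
    prompt: str
    expected: str  # reference answer the scorer compares against


@dataclass(frozen=True)
class EvalSpec:
    cases: list[GoldenCase]                      # 1. the inputs
    scorer: Callable[[GoldenCase, str], float]   # 2. output -> score in [0, 1]
    baseline: float                              # 3a. mean score when healthy
    max_drop: float                              # 3b. tolerated absolute drop


def run_eval(spec: EvalSpec, model: Callable[[str], str]) -> tuple[float, bool]:
    """Return (mean score, healthy?) for one pass over the golden set."""
    scores = [spec.scorer(case, model(case.prompt)) for case in spec.cases]
    avg = mean(scores)
    return avg, avg >= spec.baseline - spec.max_drop


def token_overlap(case: GoldenCase, output: str) -> float:
    """Toy scorer: fraction of expected words that appear in the output."""
    expected = set(case.expected.lower().split())
    got = set(output.lower().split())
    return len(expected & got) / len(expected) if expected else 1.0
```

The score stays continuous all the way through. The only boolean is the final comparison against the baseline, which is the threshold doing its job.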

The differences from unit tests are three:

Scores are continuous (0-1) rather than boolean (pass/fail). A model that's scoring 0.92 on correctness is doing slightly worse than one scoring 0.95. The number is the signal, not the assertion.
Thresholds are statistical, not exact. The eval doesn't say "this output is right." It says "the median score across 50 cases is at least 0.85" or "p95 latency is under 1.5 seconds." Quality is measured in distributions, not in single instances.
Evals run nightly against production, not on every commit. The most important thing they catch is silent drift — model providers updating their backends, prompt engineering breaking under load, your golden set turning out to be unrepresentative of new user behavior. None of that fires on a commit hook.
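
A distribution-level gate of that kind can be sketched with the standard library. The function names are hypothetical, the 0.85 and 1.5-second limits are the illustrative figures from the text above, and the percentile uses the "inclusive" method of statistics.quantiles, which is one of several reasonable percentile definitions.

```python
from statistics import median, quantiles


def p95(values: list[float]) -> float:
    """95th percentile; quantiles(n=100) returns 99 cut points, index 94 is p95."""
    return quantiles(values, n=100, method="inclusive")[94]


def distribution_gate(scores: list[float], latencies_s: list[float],
                      min_median: float = 0.85,
                      max_p95_s: float = 1.5) -> dict[str, bool]:
    """Check the run as a whole, not any single output."""
    return {
        "median_score_ok": median(scores) >= min_median,
        "p95_latency_ok": p95(latencies_s) < max_p95_s,
    }
```

Note what this gate tolerates: a run where 10 of 50 cases score badly can still pass on the median. That is intended for noisy outputs, but it is also why a second criterion (a lower percentile, or a count of hard failures) is often worth adding.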

The anatomy of an eval harness

Six components. None of them are magic. All of them require a small amount of care to build well.

EVAL HARNESS · COMPONENTS

Golden set. 20-100 hand-curated input cases per AI feature. Covers normal cases, edge cases, known failure modes, and adversarial cases (prompt injection attempts, data leakage probes). Versioned. Updated as new failure modes are discovered.
Scoring functions. Some are deterministic and free — JSON validity, response length, format compliance, latency, token cost. Some require an LLM judge — "does this answer address the question," "is this output free of PII," "does the tone match the brand rubric."
Run loop. A scheduled job that runs the golden set through the live production model, captures every output, scores them, and stores the scores with timestamps.
Baseline window. A rolling baseline — usually the last 7-14 days of healthy scores — that today's run gets compared against. The baseline is not a single number; it's a distribution.
Regression detection. A statistical test that flags when today's distribution has shifted meaningfully against baseline. Common rule of thumb: any criterion dropping more than 5% against its rolling baseline is worth a human look.
Alert + dashboard. When regression is detected, an actual person (you, in the operator model) gets paged. Beyond that, a dashboard shows the trend over time, so degradation that builds slowly across weeks is visible too.

The architecture is the boring part. The judgment is in the golden set and the scoring functions — those are where the practice lives.
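
To make the baseline window and the regression check concrete, here is a deliberately simplified sketch. It assumes one nightly mean score per criterion and compares today's value with the mean of the window, which is cruder than a real test on the full distribution. The class name and the choice to keep flagged nights out of the window are my assumptions for illustration.

```python
from collections import deque
from statistics import mean


class RollingBaseline:
    """Keeps the last `window_days` healthy nightly scores for one criterion."""

    def __init__(self, window_days: int = 7, max_relative_drop: float = 0.05):
        self.history: deque[float] = deque(maxlen=window_days)
        self.max_relative_drop = max_relative_drop

    def check(self, today: float) -> bool:
        """Return True if today's score is a regression against the window.

        A flagged night is not added to the window, so one bad run
        cannot drag the baseline down and hide the next one.
        """
        if not self.history:  # first run: nothing to compare against yet
            self.history.append(today)
            return False
        baseline = mean(self.history)
        regressed = today < baseline * (1 - self.max_relative_drop)
        if not regressed:
            self.history.append(today)
        return regressed
```

With seven nights at 0.90 in the window, a night at 0.84 is flagged (below 0.855) and a night at 0.87 is not.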

How Sextant runs in this practice

Sextant, my eval agent, runs at 02:00 GMT+7 every night. The cycle:

Loads the golden set for each AI feature shipped to clients in the last 90 days.
Runs each input through the live production endpoint (not a staging mock — the actual model the client traffic hits).
Scores each output across the criteria configured for that feature: correctness via LLM judge, format compliance, latency, cost per request, optional brand-style score.
Compares each criterion to the rolling 7-day baseline.
If any criterion drops more than 5% against its baseline, sends an alert to my morning queue with the failing cases attached.
Logs every score to a time-series store so I can look at trends across weeks and months.
Goes back to sleep until tomorrow.
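
The shape of that cycle, for one feature, fits in a single function. This is an illustrative sketch and not Sextant's actual code: run_cycle, the dict layout of the cases, and the "three worst cases" rule are all assumptions, and call_endpoint stands in for whatever client reaches the production model. Storing the records and delivering the alerts are left to the caller.

```python
import time
from datetime import datetime, timezone
from statistics import mean


def run_cycle(golden_set, call_endpoint, scorers, baselines, max_drop=0.05):
    """One nightly pass: run every case, score it, compare to baseline."""
    records = []
    for case in golden_set:
        start = time.perf_counter()
        output = call_endpoint(case["input"])
        latency_s = time.perf_counter() - start
        records.append({
            "case_id": case["id"],
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "latency_s": latency_s,
            "output": output,
            "scores": {name: fn(case, output) for name, fn in scorers.items()},
        })

    alerts = []
    for name in scorers:
        today = mean(r["scores"][name] for r in records)
        if today < baselines[name] * (1 - max_drop):
            # Attach the lowest-scoring cases so the morning review starts there.
            worst = sorted(records, key=lambda r: r["scores"][name])[:3]
            alerts.append({
                "criterion": name,
                "today": today,
                "baseline": baselines[name],
                "failing_cases": [r["case_id"] for r in worst],
            })
    return records, alerts
```

Every record carries a timestamp so it can go straight into a time-series store, and the alert carries case ids rather than just a number, which is what makes the review a ten-minute job instead of an investigation.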

Each run takes roughly 15 minutes per feature and costs roughly 30 cents to a few dollars in API fees. Across the typical client engagement load, the total bill for nightly evals is in the low tens of dollars per month per client. The cost of not running them is the cost of finding out about a regression from the client's support inbox instead of from your own pager.

Categories of evals that matter

Different AI features need different eval suites. The categories I run by default:

Correctness. Is the answer right? Hardest to score. Usually requires an LLM judge with a clear rubric and a small set of paired (input, ideal-output) examples for calibration. Worth the effort — this is the criterion users feel most.
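
A rough sketch of a rubric-driven judge and its calibration check follows. The rubric wording, the 0 to 10 scale, and the function names are assumptions for illustration. ask_judge is a stand-in for whatever call reaches your judge model; no real provider API is shown.

```python
import math
from typing import Callable

RUBRIC = (
    "Score the ANSWER against the IDEAL answer from 0 to 10.\n"
    "10 = same facts and conclusion; 0 = wrong or off-topic.\n"
    "Reply with the number only."
)


def judge_correctness(ask_judge: Callable[[str], str], question: str,
                      ideal: str, answer: str) -> float:
    """Return a 0-1 correctness score parsed from the judge's reply."""
    prompt = f"{RUBRIC}\n\nQUESTION: {question}\nIDEAL: {ideal}\nANSWER: {answer}"
    try:
        raw = float(ask_judge(prompt).strip())
    except ValueError:
        raw = float("nan")
    if not math.isfinite(raw):
        return 0.0  # unparseable verdicts count as failures, never as passes
    return min(max(raw, 0.0), 10.0) / 10.0


def calibration_error(ask_judge: Callable[[str], str],
                      labelled: list[tuple[str, str, str, float]]) -> float:
    """Mean absolute gap between judge scores and human scores (both 0-1).

    `labelled` holds (question, ideal, answer, human_score) tuples.
    """
    gaps = [abs(judge_correctness(ask_judge, q, ideal, ans) - human)
            for q, ideal, ans, human in labelled]
    return sum(gaps) / len(gaps)
```

Running calibration_error over the paired examples before trusting the judge tells you how far its scores sit from a human's. If that gap is larger than the drop you want to detect, the judge needs a better rubric before it needs more cases.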

Safety. Does the output contain anything that shouldn't be there? PII leakage, profanity, refusals where it should have answered, answers where it should have refused, evidence of successful prompt injection. Often programmatic — regex for PII patterns, classifier for tone, comparison against a refusal-pattern catalog.
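
The programmatic side of a safety eval might look like the sketch below. The two regexes and the refusal phrases are illustrative placeholders only; real PII detection needs locale-specific patterns and a properly maintained refusal catalog.

```python
import re

# Illustrative patterns, not a complete PII detector.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "card_number": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}
# Hypothetical refusal-pattern catalog, matched case-insensitively.
REFUSAL_MARKERS = ("i can't help with", "i cannot assist", "i'm unable to")


def find_pii(output: str) -> list[str]:
    """Names of the PII patterns that match anywhere in the output."""
    return [name for name, pattern in PII_PATTERNS.items() if pattern.search(output)]


def is_refusal(output: str) -> bool:
    lowered = output.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def safety_score(output: str, should_refuse: bool) -> float:
    """1.0 only when no PII leaked and the refuse/answer decision was right."""
    if find_pii(output):
        return 0.0
    return 1.0 if is_refusal(output) == should_refuse else 0.0
```

Each golden case carries its own should_refuse flag, so the same scorer covers both directions of failure: refusing a legitimate request and answering one that should have been refused.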

Latency. p50, p95, p99 response time. The threshold depends on the UX. A chat agent has different latency requirements than a batch document classifier. Set thresholds based on what your specific users will tolerate.

Cost. Tokens in and tokens out per request, multiplied by the model's price. Surprisingly easy to regress — a prompt engineering change can quietly double the cost-per-request without any obvious signal in the output. Catch it before the bill arrives.
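
The arithmetic is short enough to show in full. The prices below are placeholders and not any provider's real rates, and the token counts are invented to illustrate the effect of adding few-shot examples to a prompt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    usd_per_million_input: float
    usd_per_million_output: float


def cost_per_request(tokens_in: int, tokens_out: int, pricing: Pricing) -> float:
    """Tokens in and out, each multiplied by its per-token price."""
    return (tokens_in * pricing.usd_per_million_input
            + tokens_out * pricing.usd_per_million_output) / 1_000_000


# Placeholder prices for illustration only.
EXAMPLE = Pricing(usd_per_million_input=3.0, usd_per_million_output=15.0)

before = cost_per_request(1_200, 300, EXAMPLE)  # original prompt
after = cost_per_request(3_900, 300, EXAMPLE)   # same output, longer prompt
```

With these numbers the cost per request doubles while the output is unchanged, which is exactly the kind of regression that only a cost criterion will surface.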

Format. Is the output the structured shape your downstream code expects? JSON validity. Required fields present. Date formats. Enum values inside their valid set. A format regression breaks integrations silently and, without a format eval, can be the failure mode that takes longest to detect.
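
A format eval is the cheapest one to write. The sketch below assumes a hypothetical ticket-classification output with three required fields; the schema, field names, and enum values are invented for illustration.

```python
import json
from datetime import date

# Hypothetical schema for a ticket-classification feature.
REQUIRED_FIELDS = {"ticket_id", "category", "due_date"}
VALID_CATEGORIES = {"billing", "fraud", "technical"}


def format_problems(raw_output: str) -> list[str]:
    """Return a list of format violations; an empty list means compliant."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return ["invalid JSON"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]

    problems = [f"missing field: {name}"
                for name in sorted(REQUIRED_FIELDS - set(data))]
    if "category" in data:
        category = data["category"]
        if not isinstance(category, str) or category not in VALID_CATEGORIES:
            problems.append(f"category outside enum: {category!r}")
    if "due_date" in data:
        try:
            date.fromisoformat(str(data["due_date"]))
        except ValueError:
            problems.append("due_date is not an ISO 8601 date")
    return problems


def format_score(raw_output: str) -> float:
    """1.0 when compliant, 0.0 otherwise."""
    return 0.0 if format_problems(raw_output) else 1.0
```

Returning the list of problems, and not only the score, means the alert can say which field broke.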

Style. Does the output match the brand voice? Custom for each client. Subjective. LLM judge with a rubric document. Optional but valuable for client-facing AI features.

Most features need three of these (correctness, latency, format). Client-facing features want all six.

A representative example

The kind of incident an eval harness catches looks like this:

Sextant flags a 12% drop in correctness on a fraud-detection support agent overnight. No code change in our repo. Investigation reveals the model provider has pushed a silent update to the underlying model — common in the production AI world. The new version is, on average, slightly less aggressive on a specific class of edge cases that matter for this fraud use case. The change is too small to show up in casual testing. It is large enough that, deployed at production volume, false negatives tick upward.

Sequence of events:

02:14 — Sextant detects the regression, sends alert.
09:30 — I review the failing cases over coffee. Pattern is clear.
11:00 — Pin the previous model version on the production endpoint as a temporary fix. Latency unchanged, correctness restored.
Day 2 — Update prompt engineering and few-shot examples to compensate for the new model's bias. Test against golden set.
Day 3 — Switch back to the new model with the updated prompts. Sextant confirms correctness back at baseline. Done.

The client never noticed. They would have, two to four weeks later, when the elevated false-negative rate showed up in their fraud KPI and someone in their team raised it as an incident. By then there would have been actual fraud losses to count, customer trust to rebuild, and a postmortem conversation to have. None of that happened.

If you ship AI features without evals, you are flying without instruments. The plane is in the air. The plane might be doing fine. You will not know until it isn't.

Building your own eval harness

For engineers reading this who want to start: do not over-build. The minimum viable eval harness is small.

Step 1: Pick one feature. Don't try to evaluate everything in your AI surface area at once. Pick the AI feature with the highest cost-of-failure — the one whose silent regression would most embarrass you. Start there.

Step 2: Build a golden set of 20-50 cases. Hand-curated. Real production inputs where possible. Cover the failure modes you've already seen. Save them in a CSV or JSON file. Version them.
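
One way to keep a golden set in a versioned JSON file is sketched here. The file layout (a "version" string plus a "cases" list with "id" and "input" keys) and the function names are assumptions, not a standard. The content hash is there so each night's scores can be tied to the exact set that produced them.

```python
import hashlib
import json
from pathlib import Path


def save_golden_set(path: Path, version: str, cases: list[dict]) -> str:
    """Write the golden set as JSON and return a short content hash."""
    payload = {"version": version, "cases": cases}
    text = json.dumps(payload, indent=2, sort_keys=True, ensure_ascii=False)
    path.write_text(text, encoding="utf-8")
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]


def load_golden_set(path: Path) -> tuple[str, list[dict]]:
    """Load the file and sanity-check it: unique ids, non-empty inputs."""
    payload = json.loads(path.read_text(encoding="utf-8"))
    cases = payload["cases"]
    ids = [case["id"] for case in cases]
    if len(ids) != len(set(ids)):
        raise ValueError("duplicate case ids in golden set")
    if any(not case.get("input") for case in cases):
        raise ValueError("every case needs a non-empty input")
    return payload["version"], cases
```

Committing the file to the same repo as the feature gives you the version history for free. When a score moves, the first thing to rule out is that the golden set itself changed.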

Step 3: Write 3-5 scoring functions. Start with the cheap ones — latency, format compliance, response length, token cost. These are deterministic, fast, free. Add an LLM judge for correctness only if you genuinely need it. The LLM judge is the most expensive part.

Step 4: Run the harness against a known-good model. Capture baseline scores. This is your "healthy" reference distribution.

Step 5: Set thresholds. Anything more than a 5% drop on any criterion fires an alert. Tune the threshold based on your actual variance — if your golden set is small, your noise floor is higher and your threshold should be too.
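
One possible way to tune the threshold to your variance is to run the golden set several times against the known-good model and measure the spread. The rule below (take the larger of 5% and two standard deviations relative to the mean) is my assumption for illustration, not an established standard, and alert_threshold is a hypothetical name.

```python
from statistics import mean, stdev


def alert_threshold(baseline_runs: list[float], floor: float = 0.05,
                    sigmas: float = 2.0) -> float:
    """Relative drop that should fire an alert for one criterion.

    Returns the larger of the rule-of-thumb floor and `sigmas` standard
    deviations of the baseline runs, expressed relative to their mean.
    Needs at least two baseline runs.
    """
    noise = sigmas * stdev(baseline_runs) / mean(baseline_runs)
    return max(floor, noise)
```

For a stable criterion with baseline means of 0.90, 0.91, 0.90, 0.89, 0.90 the 5% floor applies. For a noisy one at 0.95, 0.80, 0.90, 0.75, 0.85 the threshold rises to roughly 19%, which is a signal in itself that the golden set is too small or the scorer too erratic to detect a 5% drop.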

Step 6: Run nightly. Schedule it. Send alerts to a place you'll see them. Do not bury them in a log. The goal is for the eval harness to be boring — most nights it does nothing visible. The night it does something, you very much want to know.

Total build time for the first eval harness: roughly a day of focused work for an engineer who has shipped AI features before. Total build time for the second harness on the same project: an afternoon, because the infrastructure is reusable. Total cost to run: usually less than $30 per month per AI feature, including LLM judge costs.

The cost is small. The thing it protects against is hard to recover from.

Why this matters for buyers

If you are evaluating AI consultants — me or anyone else — here is the test question:

What is your eval discipline for the AI features you ship?

Most providers can't answer specifically. They've shipped AI features. They have no clear answer about whether those features are still working as intended. The honest answers I've heard from other consultants in this space:

"We deploy and monitor for client complaints." — Translation: the client is the eval. Wait until something is wrong, then react.
"We use the model provider's eval suite." — Provider evals measure general model capability. They know nothing about your specific use case, your specific golden set, your specific thresholds.
"We test on a few prompts manually before each deploy." — Manual testing on a handful of prompts catches obvious breakage. It misses everything subtle.
"We have an eval harness running nightly with a hand-curated golden set per feature, with alerts going to an actual on-call rotation." — This is the answer that means the AI features will keep working.

An eval discipline is the AI equivalent of having a CI/CD pipeline. Most consulting practices that ship AI today don't have one yet. A few do. The few are the only ones whose AI features can be relied on for serious work.

The closing argument

Unit tests tell you when your code does something different. Eval harnesses tell you when your model is doing something worse. Both matter. The first has been a baseline expectation for two decades. The second is becoming one — slowly in some markets, quickly in others, but inevitably.

If you ship AI features without evals, you are flying without instruments. You may be in clear air. You may be in turbulence. You will not know which until something has gone wrong long enough that the cost of finding out is the cost of fixing what already broke.

If you ship AI features with evals — even a minimal harness, even one feature, even a 30-case golden set — you are flying with instruments. The plane is doing what the plane should be doing, and you'll be the first to know if it isn't. That is a different category of practice from the one most AI consulting is currently in.

It will not be a different category for long. It is becoming the baseline. The question is whether your provider builds the discipline before the market expects it, or after. (See what I learned from one million accounts — the same engineering ethic, applied at fintech scale.)

— Mikkel, Bangkok