Hernán Pérez Rodal · Engineering · 6 min read
LLM evaluation in regulated domains: beyond accuracy
When a wrong answer from your LLM affects an FDA audit, accuracy isn't enough. We share how we evaluate LLMs and agents at Darwin: golden sets, LLM-as-judge, regression detection and numerical guardrails.

TL;DR — “Accuracy” is an insufficient metric when your LLM’s output influences regulatory decisions. At Darwin we combine golden sets + LLM-as-judge + regression detection + numeric guardrails. This post covers how we built it and what we learned in production with thousands of cases per day.
The problem: accuracy is not enough
When you evaluate a traditional classifier, accuracy + confusion matrix tell you quite a bit. When you evaluate an LLM in a regulated domain:
- Accuracy looks high on generic benchmarks but fails on domain edge cases
- Critical cases weigh differently — an error on a “routine” case vs. a “high-risk” case is not the same thing
- Outputs are free text — there’s no single correct answer
- Models change without warning — OpenAI/Anthropic update models, your day-1 eval doesn’t work on day 90
- Numerical hallucinations are silent killers — the model makes up a number and presents it with total confidence
At Darwin, when the system says “this lot has 3 FSMA 204 compliance gaps”, that answer goes to an audit. Being wrong there is legal liability. We need evaluation that captures that.
The 4 levels of evaluation we use
Level 1: Manually curated golden sets
A set of 300-500 representative cases with ground truth answers curated by human experts (internal compliance officers + external advisors).
Covers:
- Routine cases — the majority, which the model should handle easily
- Known edge cases — ambiguous situations, conflicting rules, incomplete lots
- Adversarial cases — attempts to make the model fail (trick questions, contradictory data)
Metrics at this level:
- Exact match — for structured answers (JSON, enum values)
- Semantic similarity — for text-based answers (BERTScore, embeddings)
- Fact extraction recall — did it extract every relevant fact?
Frequency: we run this set in CI before every deploy. The deploy fails if any metric drops below its threshold.
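To make this concrete, here's a minimal sketch of the CI gate as a pytest test. The file name, the run_pipeline() entry point, the gap_status field and the 0.95 threshold are illustrative stand-ins, not our exact configuration:

import yaml

from darwin_eval import run_pipeline  # hypothetical entry point to the system under test


def test_golden_set_exact_match():
    with open("golden_sets/compliance_cases.yaml") as f:
        cases = yaml.safe_load(f)

    hits = 0
    for case in cases:
        output = run_pipeline(case["input"])
        # Exact match only makes sense for structured fields (field name illustrative)
        if output["gap_status"] == case["expected"]["gap_status"]:
            hits += 1

    exact_match = hits / len(cases)
    # Deploy gate: CI fails the build when the score drops below the threshold
    assert exact_match >= 0.95, f"Golden-set exact match dropped to {exact_match:.2%}"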
Level 2: LLM-as-judge for text outputs
For long answers (reports, gap analyses, explanations) there is no single “correct” answer. We use another LLM as a judge, with a specific prompt that evaluates:
- Factuality — are the claims verifiable against the evidence shown?
- Completeness — does it cover all the aspects it should?
- Correct citation — does it cite the right regulations/sources?
- Tone/style — is it professional, precise, not alarmist?
The key: the judge runs on a different model than the generator (model A generates, model B judges). This avoids the “model judging itself” bias.
Pitfalls: LLM-as-judge has its own biases. We calibrate it against samples that human experts have also scored, and tune the judge prompt iteratively.
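As an illustration, here's a stripped-down version of the judge pattern, assuming the OpenAI SDK for the judge call. The prompt wording, the score dimensions and the model name are simplified stand-ins for the real judge prompt:

import json

from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are evaluating a compliance report against the evidence provided.
Score each dimension from 1 to 5 and return JSON with the keys
factuality, completeness, citation and tone, plus a one-line rationale per dimension.

Evidence:
{evidence}

Report to evaluate:
{report}
"""


def judge_report(report: str, evidence: str) -> dict:
    # Deliberately a different model than the one that generated the report (model choice illustrative)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(evidence=evidence, report=report)}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)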
Level 3: Regression detection in production
Between golden-set evals, the production model processes thousands of cases per day. We can’t curate all of them — but we can detect regressions.
Techniques:
- Score distributions over time — if average confidence, per-dimension eval scores, or output patterns change suddenly, we alert
- Shadow mode for deploys — the new model runs in parallel, we compare its outputs to the current one on real cases
- Sampling + human review — 1-2% of outputs are sampled weekly and a human compliance officer reviews them
- Feedback from operators — every time an operator corrects a system answer, it’s a data point
When something changes systematically (e.g., “completeness” score drops 10% in a week), we investigate before it causes an incident.
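For a flavor of the distribution check behind those alerts, here's a minimal sketch. The thresholds and the scipy-based test are illustrative; the point is comparing this week's scores against a trailing baseline instead of staring at a single number:

import numpy as np
from scipy import stats


def detect_score_drift(baseline: np.ndarray, current: np.ndarray,
                       max_mean_drop: float = 0.10, alpha: float = 0.01) -> bool:
    """Return True if this week's eval scores look like a regression vs. the baseline window."""
    mean_drop = (baseline.mean() - current.mean()) / baseline.mean()
    # The KS test also catches shape changes (e.g. a growing low-score tail)
    # even when the mean barely moves
    _, p_value = stats.ks_2samp(baseline, current)
    return mean_drop > max_mean_drop or p_value < alpha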
Level 4: Numerical guardrails
This one is the most critical. LLMs make up numbers when they don’t have them explicitly in context. I covered this partially in the RAG post, but it’s worth going deeper.
Every time our system returns an output containing a number (lot counts, dates, percentages), it goes through a guardrail validator:
def validate_numerical_claims(response: str, source_data: dict) -> ValidationResult:
    """Extract numerical claims from LLM response and verify against source data."""
    claims = extract_numbers_from_text(response)
    for claim in claims:
        # Is this number present in our verified data?
        if not verify_claim_in_source(claim, source_data):
            return ValidationResult(
                valid=False,
                reason=f"Claim '{claim}' cannot be verified against source data",
            )
    return ValidationResult(valid=True)

If a number from the LLM can’t be verified against the structured source, we fail fast: we return an error instead of a potentially incorrect answer. Better to say “I don’t know” than to lie with confidence.
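For illustration, here's one way the two helpers above could look, assuming source_data is the structured record the answer was generated from. A real matcher also has to handle dates, units and locale-formatted numbers; this sketch only covers plain numeric literals:

import re

NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)?")


def extract_numbers_from_text(response: str) -> list[float]:
    """Pull every numeric literal out of the LLM response."""
    return [float(m.replace(",", ".")) for m in NUMBER_RE.findall(response)]


def _flatten_numbers(data) -> set[float]:
    """Collect every number reachable anywhere in the nested source record."""
    if isinstance(data, bool):
        return set()
    if isinstance(data, (int, float)):
        return {float(data)}
    if isinstance(data, dict):
        return {n for v in data.values() for n in _flatten_numbers(v)}
    if isinstance(data, list):
        return {n for v in data for n in _flatten_numbers(v)}
    return set()


def verify_claim_in_source(claim: float, source_data: dict, tol: float = 1e-9) -> bool:
    """A claim is verified only if the same value exists somewhere in the source data."""
    return any(abs(claim - n) <= tol for n in _flatten_numbers(source_data))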
The eval stack
- Pytest + custom fixtures — golden sets as YAML files, asserts in tests
- LangSmith + OpenTelemetry — tracing of every invocation with input, output, metadata
- PromptLayer — prompt versioning + history
- Custom dashboards — Grafana with score distributions, drift detection
- Postgres + dbt — persistence of eval runs, historical comparison
What didn’t work
V0: a single global “accuracy” metric — it reduced technical decisions to “got better / got worse” with no nuance. We moved to multi-dimensional dashboards (factuality, completeness, citation, etc.) with explicit trade-offs.
Static golden set with no updates — the set got stale as regulations got new versions. Now we have a monthly curation/review process.
Eval only in staging before deploy — some regressions only surfaced under real production distribution. We added level 3 (regression detection in prod) to compensate.
Trusting LLM-as-judge 100% without human calibration — the judge had biases we didn’t see. Now we calibrate against human-rated samples monthly.
What did work
Small but well-curated golden sets — 300 well-thought-out cases > 3000 random cases
Trend dashboards, not just snapshot — detecting drift matters more than the absolute score
Numerical guardrails as defensive coding — zero numerical hallucinations reaching the end user since we implemented it
Short feedback loops — every operator correction comes back to the eval system as a potential golden case
Tracing on every LLM call — when something goes wrong in production, we go from trace to root cause in minutes
Lessons learned
- Accuracy is the floor, not the ceiling of what your eval suite needs
- Multi-dimensional scoring > single score
- Regression detection in prod is as important as pre-deploy eval
- Numerical guardrails are non-negotiable in domains where numbers matter
- LLM-as-judge is useful but needs calibration against human experts periodically
- Operator feedback is the most valuable eval set in the long run
What’s next?
We’re exploring automated adversarial testing — programmatic generation of edge cases that stress the model. Combined with input fuzz testing (partial, contradictory, weird-format data), it amplifies our eval coverage without requiring manual curation.
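To give a flavor of what that fuzzing looks like, here's a rough sketch of the kind of mutations we're experimenting with; the field names are hypothetical:

import copy
import random


def fuzz_case(case: dict) -> dict:
    """Produce a stressed variant of a golden case: missing, contradictory or oddly formatted data."""
    mutated = copy.deepcopy(case)
    mutation = random.choice(["drop_field", "contradict", "reformat_date"])
    if mutation == "drop_field" and mutated["input"]:
        mutated["input"].pop(random.choice(list(mutated["input"])))
    elif mutation == "contradict":
        # e.g. a shipped quantity that exceeds what was ever received (field names hypothetical)
        mutated["input"]["quantity_shipped"] = mutated["input"].get("quantity_received", 0) + 100
    elif mutation == "reformat_date":
        mutated["input"]["ship_date"] = "03/04/2025"  # ambiguous day/month ordering
    # No ground truth for these: we check for graceful degradation, not exact answers
    mutated["expected"] = None
    return mutated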
If you’re building with LLMs in production and your eval suite is “run an eval.py with 50 examples”, you’re probably underestimating the problem. What’s critical isn’t the number in the report; it’s that you catch regressions before they reach the user.
Are you building your eval suite for LLMs in production? Let’s talk — we can share templates, dashboards and learnings.




