Hernán Pérez Rodal · Engineering · 6 min read
LLM evaluation in regulated domains: beyond accuracy
When a wrong answer from your LLM affects an FDA audit, accuracy isn't enough. We share how we evaluate LLMs and agents at Darwin: golden sets, LLM-as-judge, regression detection and numerical guardrails.

TL;DR — “Accuracy” is an insufficient metric when your LLM’s output influences regulatory decisions. At Darwin we combine golden sets + LLM-as-judge + regression detection + numeric guardrails. This post covers how we built it and what we learned in production with thousands of cases per day.
The problem: accuracy is not enough
When you evaluate a traditional classifier, accuracy + confusion matrix tell you quite a bit. When you evaluate an LLM in a regulated domain:
- Accuracy looks high on generic benchmarks but fails on domain edge cases
- Critical cases weigh differently — an error on a “routine” case vs. a “high-risk” case is not the same thing
- Outputs are free text — there’s no single correct answer
- Models change without warning — OpenAI/Anthropic update models, your day-1 eval doesn’t work on day 90
- Numerical hallucinations are silent killers — the model makes up a number and presents it with total confidence
At Darwin, when the system says “this lot has 3 FSMA 204 compliance gaps”, that answer goes to an audit. Being wrong there is legal liability. We need evaluation that captures that.
The 4 levels of evaluation we use
Level 1: Manually curated golden sets
A set of 300-500 representative cases with ground truth answers curated by human experts (internal compliance officers + external advisors).
Covers:
- Routine cases — the majority, which the model should handle easily
- Known edge cases — ambiguous situations, conflicting rules, incomplete lots
- Adversarial cases — attempts to make the model fail (trick questions, contradictory data)
Metrics at this level:
- Exact match — for structured answers (JSON, enum values)
- Semantic similarity — for text-based answers (BERTScore, embeddings)
- Fact extraction recall — did it extract every relevant fact?
Frequency: we run this set in CI before every deploy. The deploy fails if any metric drops below its threshold.
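To make this concrete, here's a minimal sketch of the CI gate as a pytest test. The file name, the run_pipeline() entry point, the gap_status field and the 0.95 threshold are illustrative stand-ins, not our exact configuration:

import yaml

from darwin_eval import run_pipeline  # hypothetical entry point to the system under test


def test_golden_set_exact_match():
    with open("golden_sets/compliance_cases.yaml") as f:
        cases = yaml.safe_load(f)

    hits = 0
    for case in cases:
        output = run_pipeline(case["input"])
        # Exact match only makes sense for structured fields (field name illustrative)
        if output["gap_status"] == case["expected"]["gap_status"]:
            hits += 1

    exact_match = hits / len(cases)
    # Deploy gate: CI fails the build when the score drops below the threshold
    assert exact_match >= 0.95, f"Golden-set exact match dropped to {exact_match:.2%}"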
Level 2: LLM-as-judge for text outputs
For long answers (reports, gap analyses, explanations) there is no single “correct” answer. We use another LLM as a judge, with a specific prompt that evaluates:
- Factuality — are the claims verifiable against the evidence shown?
- Completeness — does it cover all the aspects it should?
- Correct citation — does it cite the right regulations/sources?
- Tone/style — is it professional, precise, not alarmist?
The key: the judge runs on a different model than the generator (model A generates, model B judges). This avoids the “model judging itself” bias.
Pitfalls: LLM-as-judge has its own biases. We calibrate it against samples that human experts have also scored, and tune the judge prompt iteratively.
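As an illustration, here's a stripped-down version of the judge pattern, assuming the OpenAI SDK for the judge call. The prompt wording, the score dimensions and the model name are simplified stand-ins for the real judge prompt:

import json

from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are evaluating a compliance report against the evidence provided.
Score each dimension from 1 to 5 and return JSON with the keys
factuality, completeness, citation and tone, plus a one-line rationale per dimension.

Evidence:
{evidence}

Report to evaluate:
{report}
"""


def judge_report(report: str, evidence: str) -> dict:
    # Deliberately a different model than the one that generated the report (model choice illustrative)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(evidence=evidence, report=report)}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)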
Level 3: Regression detection in production
Between golden-set evals, the production model processes thousands of cases per day. We can’t curate all of them — but we can detect regressions.
Techniques:
- Score distributions over time — if average confidence, per-dimension eval scores, or output patterns change suddenly, we alert
- Shadow mode for deploys — the new model runs in parallel, we compare its outputs to the current one on real cases
- Sampling + human review — 1-2% of outputs are sampled weekly and a human compliance officer reviews them
- Feedback from operators — every time an operator corrects a system answer, it’s a data point
When something changes systematically (e.g., “completeness” score drops 10% in a week), we investigate before it causes an incident.
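For a flavor of the distribution check behind those alerts, here's a minimal sketch. The thresholds and the scipy-based test are illustrative; the point is comparing this week's scores against a trailing baseline instead of staring at a single number:

import numpy as np
from scipy import stats


def detect_score_drift(baseline: np.ndarray, current: np.ndarray,
                       max_mean_drop: float = 0.10, alpha: float = 0.01) -> bool:
    """Return True if this week's eval scores look like a regression vs. the baseline window."""
    mean_drop = (baseline.mean() - current.mean()) / baseline.mean()
    # The KS test also catches shape changes (e.g. a growing low-score tail)
    # even when the mean barely moves
    _, p_value = stats.ks_2samp(baseline, current)
    return mean_drop > max_mean_drop or p_value < alpha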
Level 4: Numerical guardrails
This one is the most critical. LLMs make up numbers when they don’t have them explicitly in context. I covered this partially in the RAG post, but it’s worth going deeper.
Every time our system returns an output containing a number (lot counts, dates, percentages), it goes through a guardrail validator:
def validate_numerical_claims(response: str, source_data: dict) -> ValidationResult:
    """Extract numerical claims from LLM response and verify against source data."""
    claims = extract_numbers_from_text(response)
    for claim in claims:
        # Is this number present in our verified data?
        if not verify_claim_in_source(claim, source_data):
            return ValidationResult(
                valid=False,
                reason=f"Claim '{claim}' cannot be verified against source data",
            )
    return ValidationResult(valid=True)

If a number from the LLM can’t be verified against the structured source, we fail fast: we return an error instead of a potentially incorrect answer. Better to say “I don’t know” than to lie with confidence.
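For illustration, here's one way the two helpers above could look, assuming source_data is the structured record the answer was generated from. A real matcher also has to handle dates, units and locale-formatted numbers; this sketch only covers plain numeric literals:

import re

NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)?")


def extract_numbers_from_text(response: str) -> list[float]:
    """Pull every numeric literal out of the LLM response."""
    return [float(m.replace(",", ".")) for m in NUMBER_RE.findall(response)]


def _flatten_numbers(data) -> set[float]:
    """Collect every number reachable anywhere in the nested source record."""
    if isinstance(data, bool):
        return set()
    if isinstance(data, (int, float)):
        return {float(data)}
    if isinstance(data, dict):
        return {n for v in data.values() for n in _flatten_numbers(v)}
    if isinstance(data, list):
        return {n for v in data for n in _flatten_numbers(v)}
    return set()


def verify_claim_in_source(claim: float, source_data: dict, tol: float = 1e-9) -> bool:
    """A claim is verified only if the same value exists somewhere in the source data."""
    return any(abs(claim - n) <= tol for n in _flatten_numbers(source_data))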
The eval stack
- Pytest + custom fixtures — golden sets as YAML files, asserts in tests
- LangSmith + OpenTelemetry — tracing of every invocation with input, output, metadata
- PromptLayer — prompt versioning + history
- Custom dashboards — Grafana with score distributions, drift detection
- Postgres + dbt — persistence of eval runs, historical comparison
What didn’t work
V0: a single global “accuracy” metric — it reduced technical decisions to “got better / got worse” with no nuance. We moved to multi-dimensional dashboards (factuality, completeness, citation, etc.) with explicit trade-offs.
Static golden set with no updates — the set got stale as regulations got new versions. Now we have a monthly curation/review process.
Eval only in staging before deploy — some regressions only surfaced under real production distribution. We added level 3 (regression detection in prod) to compensate.
Trusting LLM-as-judge 100% without human calibration — the judge had biases we didn’t see. Now we calibrate against human-rated samples monthly.
What did work
Small but well-curated golden sets — 300 well-thought-out cases > 3000 random cases
Trend dashboards, not just snapshot — detecting drift matters more than the absolute score
Numerical guardrails as defensive coding — zero numerical hallucinations reaching the end user since we implemented it
Short feedback loops — every operator correction comes back to the eval system as a potential golden case
Tracing on every LLM call — when something goes wrong in production, we go from trace to root cause in minutes
Lessons learned
- Accuracy is the floor, not the ceiling of what your eval suite needs
- Multi-dimensional scoring > single score
- Regression detection in prod is as important as pre-deploy eval
- Numerical guardrails are non-negotiable in domains where numbers matter
- LLM-as-judge is useful but needs calibration against human experts periodically
- Operator feedback is the most valuable eval set in the long run
What’s next?
We’re exploring automated adversarial testing — programmatic generation of edge cases that stress the model. Combined with input fuzz testing (partial, contradictory, weird-format data), it amplifies our eval coverage without requiring manual curation.
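To give a flavor of what that fuzzing looks like, here's a rough sketch of the kind of mutations we're experimenting with; the field names are hypothetical:

import copy
import random


def fuzz_case(case: dict) -> dict:
    """Produce a stressed variant of a golden case: missing, contradictory or oddly formatted data."""
    mutated = copy.deepcopy(case)
    mutation = random.choice(["drop_field", "contradict", "reformat_date"])
    if mutation == "drop_field" and mutated["input"]:
        mutated["input"].pop(random.choice(list(mutated["input"])))
    elif mutation == "contradict":
        # e.g. a shipped quantity that exceeds what was ever received (field names hypothetical)
        mutated["input"]["quantity_shipped"] = mutated["input"].get("quantity_received", 0) + 100
    elif mutation == "reformat_date":
        mutated["input"]["ship_date"] = "03/04/2025"  # ambiguous day/month ordering
    # No ground truth for these: we check for graceful degradation, not exact answers
    mutated["expected"] = None
    return mutated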
If you’re building with LLMs in production and your eval suite is “run an eval.py with 50 examples”, you’re probably underestimating the problem. What’s critical isn’t the number in the report; it’s that you catch regressions before they reach the user.
Are you building your eval suite for LLMs in production? Let’s talk — we can share templates, dashboards and learnings.




