Posts by tag 'testing', Darwin Evolution

10 Nov 2025 · Hernán Pérez Rodal · Engineering

LLM evaluation in regulated domains: beyond accuracy

When a wrong answer from your LLM affects an FDA audit, accuracy isn't enough. We share how we evaluate LLMs and agents at Darwin: golden sets, LLM-as-judge, regression detection, and numeric guardrails.
