Evaluating AI Outputs: Make Quality Measurable with Evals

“It worked fine in the demo” is the most expensive sentence in any AI project. A passed demo says nothing about how the system behaves over the next 10,000 real requests — and that’s exactly where the errors arise that cost money, trust, or, in legally relevant cases, liability.

Evals (AI evaluations) are structured tests that make the quality of probabilistic AI outputs measurable: through defined failure modes, reference data, and metrics rather than gut feeling. Evaluating an LLM simply means setting up this measurement systematically and repeatably. Evals deliver two things in one artifact: the engineering metric you use to release a feature, and the evidence of quality that a Werkvertrag (German contract for work, sec. 631 BGB), the GDPR, and the EU AI Act call for.

I write this from both perspectives — as a developer who builds such pipelines, and as a business lawyer who knows when a measurement counts in court or at acceptance. This article shows concretely how to build the measurement and why the same technical effort pays off twice.

Why You Can’t Judge AI Quality by Gut Feeling

Deterministic vs. probabilistic

Classic software is deterministic: same input, same output, same test. A generative AI model is probabilistic — the same prompt can produce a different, equally plausible answer on the second run. A demo that passed therefore proves nothing about the next 10,000 real requests.

The three reasons classic software tests fail

No binary right/wrong. Quality in text is a spectrum — an answer can be correct yet inappropriate in tone. An assertEquals doesn’t help here.
Infinite input space. Users phrase things in countless ways. You can’t test every case in advance, only a representative sample.
Non-reproducible errors. A misbehavior occurs sporadically and often can’t be reliably reproduced — you can’t simply pin it down as a failing test and “fix” it.
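
The sampling point can be made concrete. The following is a minimal sketch, assuming you have scored a representative sample as pass/fail: it puts a Wilson score interval around the observed pass rate, so a result is read as a plausible range and not as a single number. The function name and the figures (92 passes out of 100) are illustrative assumptions.

```python
import math


def wilson_interval(passes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (z=1.96 is roughly 95% confidence)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passes / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the edges.
    return max(0.0, center - half), min(1.0, center + half)


# Illustrative numbers: 92 of 100 sampled requests passed the rubric.
low, high = wilson_interval(passes=92, total=100)
print(f"observed 92%, plausible range {low:.1%} to {high:.1%}")
```

With only 100 samples the range is several percentage points wide, which is why a single passed demo carries so little weight.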

The underrated point: no measurement, no proof

This is the blind spot of many AI projects. If you don’t measure quality, you can’t prove it either. And provability isn’t a nice-to-have — under several legal frameworks it’s an obligation; more on that below. Eval scores are therefore both an engineering metric and evidence.

What Evals Are: The Three Evaluation Methods

In practice, an eval usually combines all three of the following methods. None is complete on its own.

Method	How it works	Strength	Weakness	When to use
Human review	Experts assess outputs against a rubric	Highest differentiation (tone, empathy, domain logic)	Expensive, slow, not scalable	Ground truth, calibration, sensitive cases
Rule-based tests	Deterministic checks (format, forbidden terms, escalation)	Fast, unambiguous, cheap	Narrow coverage, no quality judgment	Hard must-pass criteria
LLM-as-a-judge	An AI model scores outputs against a spec	Scalable, cheap, ~80–85% agreement with humans	Requires bias control/calibration	High-volume, nuanced evaluation

Human review

Humans deliver the finest judgment — and the reference against which everything else is calibrated. The price: high cost and poor scalability. What matters is inter-annotator agreement: if two reviewers disagree, the AI isn’t the problem — your evaluation rubric is unclear.
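
Inter-annotator agreement can be quantified with Cohen's kappa, which corrects raw agreement for the agreement two reviewers would reach by chance. A minimal sketch, assuming two reviewers labeled the same outputs; the labels and reviewer data are made up for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Chance-corrected agreement between two reviewers on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected if both reviewers labeled independently at their own rates.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both reviewers used one identical label throughout
    return (observed - expected) / (1 - expected)


# Illustrative data: two reviewers, six outputs, one disagreement.
reviewer_1 = ["pass", "pass", "fail", "pass", "fail", "pass"]
reviewer_2 = ["pass", "fail", "fail", "pass", "fail", "pass"]
kappa = cohens_kappa(reviewer_1, reviewer_2)
print(f"kappa = {kappa:.2f}")
```

A low kappa on a pilot batch is a signal to sharpen the rubric before labeling the full reference set.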

Rule-based tests

Some criteria are binary and deterministically measurable: in a crisis request, is the case reliably handed off to a human? Does the output format stay valid JSON? Such tests are fast and unambiguous, but they cover only a narrow, clearly definable slice.
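
Both examples can be written as plain deterministic checks. A minimal sketch, assuming the system answers with a JSON object and signals a handoff through a field named "route"; that field name and the value "human_handoff" are hypothetical and need to be adapted to your own output schema.

```python
import json

ESCALATION_ROUTE = "human_handoff"  # hypothetical routing value


def is_valid_json_object(raw: str) -> bool:
    """True if the raw model output parses as a JSON object."""
    try:
        return isinstance(json.loads(raw), dict)
    except json.JSONDecodeError:
        return False


def escalates_to_human(raw: str) -> bool:
    """True if the output is valid JSON and routes the case to a human."""
    if not is_valid_json_object(raw):
        return False
    return json.loads(raw).get("route") == ESCALATION_ROUTE


crisis_output = '{"route": "human_handoff", "message": "Connecting you to a colleague."}'
print(is_valid_json_object(crisis_output), escalates_to_human(crisis_output))
```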

LLM-as-a-judge — can you trust an AI to evaluate?

In the “AI evaluates AI” approach, one model judges another’s outputs against clear criteria. This scales cheaply and, by current measurements, reaches roughly 80–85% agreement with human evaluators — comparable to how much two humans agree (Galileo, Confident AI).

Trust, however, is only warranted with caution. Documented biases are real: self-preference bias (models favor their own outputs — the measured effect is substantial and, depending on the model and setup, runs into double digits), position bias, and verbosity bias (arXiv 2410.21819, Future AGI). The consequence: calibrate the judge against a human-labeled reference set and measure its agreement yourself — otherwise you aren’t evaluating the evaluation.
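
One common way to limit position bias in pairwise comparisons is to ask the judge twice with the answers swapped and to accept only verdicts that survive the swap. A minimal sketch under that assumption: the judge is passed in as a callable with a hypothetical signature, and the two stand-in judges below replace a real model call.

```python
from typing import Callable

# Hypothetical judge signature: (question, answer_shown_first, answer_shown_second)
# returns "first" or "second".
Judge = Callable[[str, str, str], str]


def position_robust_preference(judge: Judge, question: str, a: str, b: str) -> str:
    """Return 'a', 'b', or 'inconsistent' after judging both presentation orders."""
    first_run = judge(question, a, b)   # a is shown first
    second_run = judge(question, b, a)  # b is shown first
    if first_run == "first" and second_run == "second":
        return "a"
    if first_run == "second" and second_run == "first":
        return "b"
    return "inconsistent"  # verdict flipped with position: treat as a tie or escalate


def always_first(question: str, x: str, y: str) -> str:
    """Stand-in for a fully position-biased judge."""
    return "first"


def prefers_refund_mention(question: str, x: str, y: str) -> str:
    """Stand-in for a judge that decides on content, not position."""
    return "first" if "refund" in x else "second"


print(position_robust_preference(always_first, "q", "answer A", "answer B"))
print(position_robust_preference(prefers_refund_mention, "q", "refund within 14 days", "no idea"))
```

The share of "inconsistent" verdicts is itself a useful number: it tells you how strongly position influences your particular judge.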

This also resolves the apparent chicken-and-egg question: if the judge needs humans as its yardstick anyway, why use a judge at all? The pragmatic answer: humans label a small reference set once (the ~50–100 critical cases), and the judge is calibrated against it; after that, the judge takes over the ongoing evaluation of large volumes. The expensive human work is incurred once for calibration, not per output, and that is precisely where the scaling gain lies. If agreement drifts, you recalibrate with a fresh sample.
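
The calibration check itself is simple arithmetic. A minimal sketch, assuming human and judge verdicts exist for the same reference cases; the threshold of 0.80 is an assumption loosely based on the agreement range cited above, and the labels are illustrative.

```python
def judge_agreement(human_labels: list[str], judge_labels: list[str]) -> tuple[float, list[int]]:
    """Return (share of matching verdicts, indices where judge and human differ)."""
    if len(human_labels) != len(judge_labels) or not human_labels:
        raise ValueError("need two equally long, non-empty label lists")
    disagreements = [
        i for i, (h, j) in enumerate(zip(human_labels, judge_labels)) if h != j
    ]
    rate = 1 - len(disagreements) / len(human_labels)
    return rate, disagreements


MIN_AGREEMENT = 0.80  # assumed threshold, to be set per project

human = ["pass", "fail", "pass", "pass", "fail", "pass", "pass", "fail", "pass", "pass"]
judge = ["pass", "fail", "pass", "fail", "fail", "pass", "pass", "fail", "pass", "pass"]

rate, to_review = judge_agreement(human, judge)
needs_recalibration = rate < MIN_AGREEMENT
print(f"agreement {rate:.0%}, cases to review: {to_review}, recalibrate: {needs_recalibration}")
```

The returned indices point at the cases worth reading: each disagreement is either a judge error or an ambiguity in the rubric.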

How to Build an Eval: Failure Modes → Golden Dataset → Metrics → Continuous Evaluation

A solid eval is not an ad-hoc script but a pipeline with four stages and a feedback loop. Every production error flows back into the golden dataset — so the measurement gets harder over time, not weaker.

Figure: The eval pipeline as a closed loop. Define failure modes, curate reference cases, measure the right metrics, monitor in production, and feed every real error back into the dataset.

Step 1 — Define failure modes

Understand the failures first, then measure. Collect the concrete ways your system can be wrong: hallucination (fabricated facts), wrong tone, compliance violation, bias, format break, impermissible legal/medical advice. Each failure mode later becomes a measurable criterion. Practical entry point: have the system answer 50–100 real requests, read the outputs, and cluster the errors — this inductive error analysis (“open coding”) reveals more than any checklist devised in advance.
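
After the manual read-through, the clustering result can be tallied in a few lines. A minimal sketch, assuming each reviewed request received one free-form label during open coding; the request IDs and labels are invented for illustration.

```python
from collections import Counter

# Illustrative annotations from a manual review: (request id, label).
annotations = [
    ("req-001", "hallucination"),
    ("req-002", "ok"),
    ("req-003", "wrong_tone"),
    ("req-004", "hallucination"),
    ("req-005", "format_break"),
    ("req-006", "ok"),
    ("req-007", "hallucination"),
    ("req-008", "wrong_tone"),
]

# Count only the failures; the most frequent modes become the first eval criteria.
counts = Counter(label for _, label in annotations if label != "ok")
for mode, n in counts.most_common():
    print(f"{mode}: {n} of {len(annotations)} reviewed requests")
```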

Step 2 — Build a golden dataset

A golden dataset is a curated set of reference examples with a target answer (ground truth): question + ideal answer + required context. Realistically, you start with 50–100 carefully selected cases that cover your real and your critical scenarios — quality beats quantity (Kinde). This set grows with every real-world error that surfaces in production.
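
One possible shape for such a reference case, stored as one JSON object per line (JSONL) so the set can be versioned and diffed. This is a sketch; the field names, including "critical" and "source", are assumptions and not a fixed standard.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class GoldenCase:
    case_id: str
    question: str
    ideal_answer: str
    context: list[str] = field(default_factory=list)  # passages the answer must rely on
    critical: bool = False                             # a failure here blocks release
    source: str = "curated"                            # e.g. "production_incident"


def to_jsonl(cases: list[GoldenCase]) -> str:
    """Serialize cases as one JSON object per line."""
    return "\n".join(json.dumps(asdict(c), ensure_ascii=False) for c in cases)


def from_jsonl(text: str) -> list[GoldenCase]:
    """Parse JSONL text back into GoldenCase objects, skipping blank lines."""
    return [GoldenCase(**json.loads(line)) for line in text.splitlines() if line.strip()]


cases = [
    GoldenCase(
        case_id="gc-001",
        question="How long is the warranty period?",
        ideal_answer="24 months from delivery.",
        context=["The warranty period is 24 months and starts on delivery."],
        critical=True,
    )
]
restored = from_jsonl(to_jsonl(cases))
print(restored[0].case_id, restored[0].critical)
```

The "source" field makes it visible later which cases came from curation and which were added after a production error.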

Step 3 — Choose metrics

The right metric depends on the task type. For classic tasks, established measures are available; for RAG systems (answers from your own documents), the RAGAS framework provides specialized metrics.

Metric	Measures what	Task type	Example tool
F1 / Precision / Recall	Hit quality in classification	Categorization, extraction	scikit-learn
ROUGE / ROUGE-L	Word overlap with reference	Summarization	Hugging Face
sacreBLEU	Agreement with a reference translation	Translation	sacreBLEU
BERTScore	Semantic closeness (not just word match)	Generation, general	BERTScore
Faithfulness	Fidelity of the answer to the provided context = hallucination measure	RAG	RAGAS
Answer Relevancy	How well the answer addresses the question	RAG	RAGAS
Context Precision / Recall	Quality of the retrieved documents	RAG	RAGAS
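
For the classification row, the arithmetic behind precision, recall, and F1 fits in a few lines. This is a plain-Python sketch for a binary label, meant to show what the numbers mean; in practice a library such as scikit-learn covers this. The example labels are illustrative.

```python
def precision_recall_f1(expected: list[bool], predicted: list[bool]) -> tuple[float, float, float]:
    """Precision, recall, and F1 for a binary label (True = positive class)."""
    if len(expected) != len(predicted):
        raise ValueError("expected and predicted must have the same length")
    tp = sum(e and p for e, p in zip(expected, predicted))        # correctly flagged
    fp = sum((not e) and p for e, p in zip(expected, predicted))  # flagged by mistake
    fn = sum(e and (not p) for e, p in zip(expected, predicted))  # missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Illustrative: did the system classify each ticket as "complaint"?
expected = [True, True, True, False, False]
predicted = [True, True, False, True, False]
precision, recall, f1 = precision_recall_f1(expected, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```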

For RAG, a useful distinction applies: context precision/recall measure how well retrieval fetches the right documents; faithfulness and answer relevancy measure how well the model generates the answer from them (RAGAS docs, Confident AI). The comparison “RAG vs. fine-tuning for businesses” explains when a RAG setup is the right path in the first place and when fine-tuning is the better choice.

How do I measure whether my RAG system hallucinates?

Through the faithfulness metric. It checks whether every statement in the answer is supported by the retrieved context. A low faithfulness score means the model is inventing content that isn’t in your sources — the technical measure of hallucination in the RAG context.
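
Conceptually, the score is the share of statements in the answer that the context supports. The sketch below illustrates only that ratio and is not the RAGAS API: evaluation frameworks typically use an LLM to split the answer into claims and to verify each one, whereas the word-overlap verifier here is a deliberately naive stand-in, and the claims are split by hand.

```python
from typing import Callable


def faithfulness(claims: list[str], context: str,
                 is_supported: Callable[[str, str], bool]) -> float:
    """Share of claims that the verifier finds supported by the context."""
    if not claims:
        raise ValueError("need at least one claim")
    supported = sum(is_supported(claim, context) for claim in claims)
    return supported / len(claims)


def naive_is_supported(claim: str, context: str) -> bool:
    """Stand-in verifier: every word of the claim must occur in the context."""
    context_words = set(context.lower().split())
    return all(word in context_words for word in claim.lower().split())


context = "the warranty period is 24 months and starts on delivery"
claims = [
    "the warranty period is 24 months",      # supported by the context
    "the warranty covers accidental damage",  # not in the context
]
score = faithfulness(claims, context, naive_is_supported)
print(f"faithfulness = {score:.2f}")
```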

Step 4 — Offline eval before go-live, then continuous evaluation

Before going live, you measure offline against the golden dataset (acceptance gate). After that, continuous evaluation runs in production: sampling real requests, monitoring for drift (quality drops when data, user behavior, or the underlying model changes). On every prompt or model change, you rerun the eval as a regression test. Quality is not a one-time event but a state you have to keep maintaining.
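
The regression test can be reduced to a comparison against the last accepted run. A minimal sketch, assuming every metric is scaled so that higher is better; the tolerance of 0.02 and the scores are illustrative assumptions.

```python
def regression_failures(baseline: dict[str, float], current: dict[str, float],
                        tolerance: float = 0.02) -> list[str]:
    """List every metric that dropped more than `tolerance` below the baseline."""
    failures = []
    for metric, base in baseline.items():
        now = current.get(metric)
        if now is None:
            failures.append(f"{metric}: missing in current run")
        elif now < base - tolerance:
            failures.append(f"{metric}: {base:.2f} -> {now:.2f}")
    return failures


# Illustrative scores: last accepted run vs. the run after a prompt change.
baseline = {"faithfulness": 0.93, "answer_relevancy": 0.88}
current = {"faithfulness": 0.86, "answer_relevancy": 0.88}

failures = regression_failures(baseline, current)
print(failures if failures else "no regression")
```

Wired into CI, a non-empty list blocks the prompt or model change until the drop is explained or fixed.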

Tools — described factually, not promoted

Widely used tools in the field include DeepEval, RAGAS, Langfuse, and, for rule-based guardrails, NeMo Guardrails. Which one fits depends on your stack, task, and data-protection requirements. This is neutral guidance, not a paid recommendation. What matters is the methodology, not the choice of tool.

Evals as Compliance Evidence: What Law and Regulation Require

Note: This section explains general legal frameworks and is not legal advice for an individual case. Whether and which obligations apply to your specific system depends on the individual case.

EU AI Act — measured accuracy and robustness

For high-risk AI systems, the EU AI Act requires, in Article 15, an appropriate level of accuracy and robustness across the entire lifecycle — including the obligation to declare the relevant accuracy metrics in the instructions for use. Article 9 requires a risk management system, Article 17 a quality management system (artificialintelligenceact.eu Art. 15, Art. 9). “Declare accuracy metrics” means, plainly: no eval, no technical proof.

As of March 2026, deadlines are in flux: the “Digital Omnibus” package is currently under negotiation and would postpone several high-risk deadlines. Without that postponement, the high-risk rules apply as of August 2, 2026 (Gibson Dunn, White & Case). Check the dates against the current status rather than treating them as settled. And, honestly, for small and mid-sized businesses: most applications are not high-risk but fall under “limited risk” with pure transparency obligations (AI labeling). The duties should not be overstated. Eval discipline still pays off, because GDPR accuracy and Werkvertrag acceptance (below) apply regardless of the risk tier. The overview “EU AI Act & GDPR: what businesses must do now” lays out which obligations the EU AI Act and the GDPR actually impose on companies.

If your AI processes personal data, Article 5(1)(d) GDPR applies: data must be accurate and kept up to date. If a model produces a false statement about a person, that is an accuracy problem with legal consequences. Measurable accuracy — and logging which answer was generated when — is the evidence that you take this obligation seriously.
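
Logging which answer was generated when could look like the following. It is a sketch under assumptions: the record stores SHA-256 hashes instead of full texts, and each record carries the hash of its predecessor so that later changes become detectable. The field names are hypothetical, and whether you store full texts or only hashes is a separate data-protection decision.

```python
import hashlib
import json
from datetime import datetime, timezone


def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def audit_record(request_id: str, prompt: str, answer: str,
                 model_version: str, previous_hash: str = "") -> dict:
    """Build one log record; `previous_hash` chains it to the record before it."""
    record = {
        "request_id": request_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model_version": model_version,
        "prompt_sha256": sha256_hex(prompt),
        "answer_sha256": sha256_hex(answer),
        "previous_hash": previous_hash,
    }
    # Hash the record content itself so tampering with any field is detectable.
    record["record_hash"] = sha256_hex(json.dumps(record, sort_keys=True))
    return record


first = audit_record("req-001", "Who is the contact person?", "Ms. Example", "model-v1")
second = audit_record("req-002", "What is the deadline?", "March 31", "model-v1",
                      previous_hash=first["record_hash"])
print(second["previous_hash"] == first["record_hash"])
```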

Werkvertrag — eval thresholds as acceptance criteria

Anyone who commissions or procures AI software is subject to acceptance and warranty rules (secs. 633 ff. BGB, the German Civil Code provisions on defects in a contract for work). “Felt good” is not a valid acceptance criterion. Defined eval thresholds (e.g., “faithfulness ≥ 0.9 on the golden dataset, zero critical failure modes”) make acceptance objectively verifiable and protect both sides from disputes.
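
Thresholds like these can be written down as a check that both parties run against the same golden dataset. A minimal sketch using the example values from the paragraph above; the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AcceptanceCriteria:
    min_faithfulness: float = 0.9   # example threshold from the contract wording
    max_critical_failures: int = 0  # zero critical failure modes


def unmet_criteria(faithfulness_score: float, critical_failures: int,
                   criteria: AcceptanceCriteria = AcceptanceCriteria()) -> list[str]:
    """Return the criteria that are not met; an empty list means all are met."""
    unmet = []
    if faithfulness_score < criteria.min_faithfulness:
        unmet.append(
            f"faithfulness {faithfulness_score:.2f} < {criteria.min_faithfulness:.2f}"
        )
    if critical_failures > criteria.max_critical_failures:
        unmet.append(
            f"{critical_failures} critical failure(s), allowed: {criteria.max_critical_failures}"
        )
    return unmet


print(unmet_criteria(faithfulness_score=0.93, critical_failures=0))
print(unmet_criteria(faithfulness_score=0.85, critical_failures=1))
```

Returning the list of unmet criteria, and not just a yes or no, gives the acceptance protocol something concrete to record.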

The seam: measurement and proof are one artifact

This is exactly where technology meets law. The same pipeline that measures F1 and faithfulness produces, together with an audit trail and logging, the audit-proof evidence for the EU AI Act, GDPR accuracy, and Werkvertrag acceptance. Anyone who builds AI themselves and understands the duty of proof builds the measurement from the outset so that it counts twice. Anyone who first has it measured and then adds a human control layer for AI outputs closes the gap fully: measure first, then safeguard.

FAQ

What are evals (AI evaluations) and why do you need them?

Evals are structured tests that make the quality of AI outputs measurable — through defined failure modes, reference data, and metrics. You need them because generative AI is probabilistic: a successful demo doesn’t prove that the system delivers reliably in real-world operation.

How can I measure quality when there’s no clear “right”?

By breaking the assessment into three building blocks: first define the failure modes, then curate a golden dataset with target answers, then set up appropriate metrics (e.g., faithfulness for RAG). This translates a quality spectrum into comparable, reproducible numbers.

What is a golden dataset and how large does it need to be?

A golden dataset is a curated set of reference examples with an ideal answer and context. Realistically, you start with 50–100 carefully chosen cases that cover your common and your critical scenarios. It grows with every error found in production.

Can you trust an AI to evaluate other AI?

Conditionally. LLM-as-a-judge reaches roughly 80–85% agreement with human evaluators but has documented biases (self-preference, position, verbosity bias). It only becomes trustworthy once you calibrate it against a human-labeled reference and measure its agreement yourself.

Do I have to prove my AI’s quality legally?

That depends on the system. For high-risk AI, the EU AI Act requires measured accuracy and quality management (Art. 9/15/17); for personal data, the GDPR accuracy duty applies (Art. 5); for commissioned software, acceptance/warranty (secs. 633 ff. BGB). Without an eval pipeline, the technical proof is missing. This is general information, not legal advice.

As of March 2026. Author: Leon Lotz, business lawyer + developer (MusketierSoftware).


Sources — as of 15.03.2026