
Regression Evaluation

How the tool compares two runs, what counts as a regression, and how to avoid the common ways teams accidentally lie to themselves with comparison data.


1. What "regression" means here

A regression is a measurable drop in quality between two evaluation runs, on the same dataset, scored by the same rubric version, where exactly one variable changed.

The keywords: measurable, same dataset, same rubric, one variable. If any of those is missing, the comparison is not a regression check; it is a vibe.

2. What the tool requires for a valid comparison

The tool refuses to compute a regression report unless:

  • Run A and Run B use the same dataset (same case ids).
  • Run A and Run B use the same rubric version.
  • Each case ran on the same retrieved context (or both ran without context).
  • Metadata explicitly tags the changed variable (model, prompt id, retrieval strategy, etc.).

This is enforced at the API level. There is no "compare anyway" button.
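
The four preconditions above can be sketched as a single check. This is a minimal illustration, not the tool's actual API: RunMeta, its field names, and validation_errors are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunMeta:
    # Hypothetical schema; the real run metadata may look different.
    case_ids: frozenset
    rubric_version: str
    context_hashes: dict  # case id -> hash of retrieved context, or None
    changed_variable: Optional[str] = None  # tagged on the candidate run


def validation_errors(a: RunMeta, b: RunMeta) -> list:
    """Return the reasons a comparison is invalid; an empty list means valid."""
    errors = []
    if a.case_ids != b.case_ids:
        errors.append("dataset mismatch: case ids differ")
    if a.rubric_version != b.rubric_version:
        errors.append("rubric version mismatch")
    if a.context_hashes != b.context_hashes:
        errors.append("retrieved context differs for at least one case")
    if not b.changed_variable:
        errors.append("changed variable is not tagged")
    return errors
```

Returning every failed precondition, rather than stopping at the first, lets a caller see all the reasons a comparison was refused in one pass.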

3. Variables a team might change

Common things changed between A and B:

  • Model swap.
  • Prompt version swap.
  • Retrieval pipeline change (new chunker, new embedding model, new k).
  • System prompt or persona change.
  • Temperature / parameters.
  • Tool definitions.
  • Safety filter changes.

The tool stores the variable name in the comparison record. A reader of the report knows which knob moved.

4. What the tool reports

For a comparison of Run A → Run B:

  • Aggregate deltas: per-dimension mean delta, overall mean delta, median delta.
  • Distribution shift: histogram per dimension for A and B overlaid.
  • Regressed cases: cases where A passed and B failed.
  • Recovered cases: cases where A failed and B passed.
  • Stable cases: cases where the status did not change.
  • Score-delta-only cases: cases where pass/fail did not change but the score moved by more than a configurable threshold.
  • Net effect: regressed count minus recovered count, with a confidence band.
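
The case buckets above can be sketched as a small classifier. The function names and the default threshold of 5 points are assumptions for illustration. The sketch also treats "score-delta-only" and "stable" as mutually exclusive, which the list above does not specify.

```python
from collections import Counter


def classify_case(a_pass, b_pass, a_score, b_score, score_threshold=5.0):
    """Assign one case to a bucket; score_threshold is an illustrative default."""
    if a_pass and not b_pass:
        return "regressed"
    if not a_pass and b_pass:
        return "recovered"
    # Pass/fail status is unchanged from here on.
    if abs(b_score - a_score) > score_threshold:
        return "score_delta_only"
    return "stable"


def net_effect(labels):
    """Regressed count minus recovered count (confidence band not modeled)."""
    counts = Counter(labels)
    return counts["regressed"] - counts["recovered"]
```

A positive net effect means more cases broke than were fixed; the confidence band mentioned above is left out of this sketch.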

5. When is a change "a regression"?

The tool flags Run B as a regression if any of these are true:

  • Any dimension's mean dropped by more than 5 points on a 100-point scale (configurable per project).
  • More than 2% of cases regressed (configurable).
  • A new safety finding appeared in any case that did not previously have one.
  • A dimension that was passing-on-average is now failing-on-average.

A regression flag is a prompt to investigate, not a verdict. The team reads the regressed cases and decides.
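
A minimal sketch of the four flag rules, assuming per-dimension means are available as dictionaries. The function name, parameter names, and the pass_mark of 70 are hypothetical; the text does not say where "passing-on-average" is drawn.

```python
def regression_reasons(dim_means_a, dim_means_b, regressed_count, total_cases,
                       new_safety_findings, pass_mark=70.0,
                       max_mean_drop=5.0, max_regressed_fraction=0.02):
    """Return every rule that fired; an empty list means no regression flag."""
    reasons = []
    for dim, mean_a in dim_means_a.items():
        mean_b = dim_means_b[dim]
        if mean_a - mean_b > max_mean_drop:
            reasons.append(f"{dim}: mean dropped by more than {max_mean_drop} points")
        # pass_mark is an assumed threshold for passing-on-average.
        if mean_a >= pass_mark > mean_b:
            reasons.append(f"{dim}: was passing on average, now failing")
    if total_cases and regressed_count / total_cases > max_regressed_fraction:
        reasons.append("too many cases regressed")
    if new_safety_findings > 0:
        reasons.append("new safety finding")
    return reasons
```

Any single reason is enough to flag the run, matching the "any of these" wording above; the list of reasons is what a reviewer would start from.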

6. What "the overall went up" does not prove

A frequent trap: B's overall score is higher than A's, so B is shipped. Then a specific capability silently degraded.

Examples of this trap the tool surfaces:

  • Trade-off masking. Tone fit went up 8 points; accuracy went down 4 points. Overall is +1. Team ships. Users notice.
  • Sample bias. B handled the easy cases better but failed the hard ones. Mean is up; tail is worse.
  • Variance illusion. LLM-judge variance produced a +2 mean delta within noise. Team ships a non-change.

The regression report always shows per-dimension deltas, not just overall. The overall delta alone is treated as suspect until the per-dimension story confirms it.
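
The sample-bias trap can be made concrete with a small sketch that reports the mean alongside the mean of the worst-scoring cases. The scores below are made up for illustration, and mean_and_tail is a hypothetical helper, not part of the tool.

```python
from statistics import mean


def mean_and_tail(scores, tail_fraction=0.1):
    """Return (overall mean, mean of the lowest-scoring tail_fraction of cases)."""
    ordered = sorted(scores)
    tail_size = max(1, int(len(ordered) * tail_fraction))
    return mean(ordered), mean(ordered[:tail_size])


# Illustrative scores only: B improves most cases but collapses on one hard case.
run_a = [60, 62, 64, 66, 68, 70, 72, 74, 76, 78]
run_b = [40, 66, 68, 70, 72, 74, 76, 78, 80, 82]

mean_a, tail_a = mean_and_tail(run_a)
mean_b, tail_b = mean_and_tail(run_b)
# The mean rises (69 to 70.6) while the worst case falls (60 to 40).
```

Reading only the mean, B looks better; reading the tail, B has a new failure that A did not have.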

7. The same-dataset rule

A team that changes the dataset between A and B is no longer measuring a regression. They are reporting two unrelated runs.

The tool enforces this. Adding cases to the dataset creates a new dataset version. A comparison across dataset versions is allowed but labeled non-comparable; the report banner says so.

The right pattern: keep the dataset stable for the duration of a model/prompt iteration. Grow the dataset on its own schedule. Re-run A on the new dataset to re-establish a baseline before evaluating B.

8. The same-rubric rule

Editing the rubric between A and B changes the unit of measure. A score of 78 under rubric v1.1 is not comparable to a score of 78 under rubric v1.0.

The tool refuses cross-rubric comparisons. The right pattern: when changing a rubric, lock A under the new rubric (re-run A) before evaluating B. This costs one extra run; it is worth it.

9. LLM-judge noise

If a dimension is scored by an LLM judge, repeated evaluations of the same case give different scores. The tool reports an estimated standard deviation per dimension from a calibration set, and treats deltas smaller than 2σ as not significant.

A team that ships on a 1-point overall improvement when 2σ is 3 points is shipping noise.

The tool helps with two patterns:

  • Judge averaging. Run the LLM judge N times per case (default 3), take the mean. Cuts variance.
  • Deterministic fallback. Where possible, replace an LLM-judge dimension with a deterministic check. Eliminates noise at the cost of subjectivity capture.
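
A minimal sketch of the 2σ rule and of judge averaging. Both function names are hypothetical, and judge stands in for whatever callable returns one LLM-judge score for a case.

```python
from statistics import mean
from typing import Callable


def significance(delta: float, sigma: float, k: float = 2.0) -> str:
    """Label a delta with the k-sigma rule (k=2 in the text above)."""
    return "significant" if abs(delta) >= k * sigma else "within noise"


def averaged_score(judge: Callable[[str], float], case_input: str, n: int = 3) -> float:
    """Call the judge n times on the same case and return the mean score."""
    return mean([judge(case_input) for _ in range(n)])
```

With σ = 1.5 (so 2σ = 3 points), significance(1.0, 1.5) labels the 1-point improvement from the example above as within noise.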

10. Patterns to investigate when a regression is flagged

The regression report shows which cases regressed. The team's job is to look at them. Common patterns:

  • All regressions are in one category. The new model is worse at, say, technical questions: a specific, nameable failure mode.
  • Regressions cluster at long inputs. The new model may have a smaller usable context window.
  • Regressions cluster at the end of outputs. The new model truncates or rambles.
  • Regressions appear only as a specific tone-fit failure. The new model is more formal or less formal than the previous one; the rubric may need revisiting.
  • Regressions co-occur with safety findings. New model has weaker guardrails on a specific category.

A team that does not look at the failing cases will draw the wrong conclusion.

11. Multi-version comparison (A → B → C)

The tool supports comparing more than two runs. Limitations:

  • Same dataset, same rubric remain required.
  • Reports become harder to read above three runs. Default UI tops out at three.
  • For longer time series, the recommended view is the trend chart: per-dimension mean over time across N runs. The tool offers this view in V1.
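
The data behind such a trend chart can be sketched as one series of means per dimension. The input shape (an ordered list of run labels with per-dimension case scores) and the function name are assumptions for illustration.

```python
from statistics import mean


def trend(runs):
    """runs: ordered list of (run_label, {dimension: [case scores]}).

    Returns {dimension: [(run_label, mean score), ...]} in run order.
    """
    series = {}
    for label, dims in runs:
        for dim, scores in dims.items():
            series.setdefault(dim, []).append((label, mean(scores)))
    return series
```

Each dimension gets its own line on the chart, so a slow decline in one dimension stays visible even when the others hold steady.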

12. Comparison vs. monitoring

The tool's regression comparison is offline. It is not a live monitor. The expected workflow:

  • Before shipping a change → run the dataset → compare against the last shipped run.
  • After shipping → log production traces.
  • Periodically → sample production traces into the dataset (V2) → re-baseline.

Continuous live monitoring belongs in an observability stack. The tool is the gate, not the watchtower.

13. What the report looks like

A regression report includes:

  • Header: A version, B version, dataset, rubric, what changed.
  • Verdict: regression flagged / no regression.
  • Aggregate table: per-dimension means for A, B, delta, significance.
  • Top regressed cases (default 10): input, A output, B output, A scores, B scores, evidence.
  • Top recovered cases (default 5).
  • Safety delta.
  • Recommendation: ship / do not ship / human review required.

The recommendation is advisory. The decision is human.

13a. Variance-aware comparison

Comparing two runs without accounting for LLM-judge noise is the most common way to ship a non-change. The tool surfaces two variance controls:

  • Per-dimension σ from a calibration set. Each LLM-judge dimension has an estimated standard deviation, computed by running the judge multiple times on a held-out calibration set. Deltas below 2σ are flagged "within noise" in the comparison report.
  • Judge averaging (N samples). Per the variance-reduction recipe in the LLM-as-Judge survey, averaging N independent judge calls reduces the standard error of the mean score by a factor of ~√N. A regression-critical run typically uses N=3; an exploratory run uses N=1 and is labeled "high variance".

Practical implication: a 1-point overall improvement on a 100-point scale, against a per-dimension σ of ~3, is not an improvement. The tool's verdict in that case is "no significant change".
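
The √N effect of judge averaging can be worked through with the standard-error formula σ/√N, which assumes the N judge calls are independent. Whether the tool narrows its 2σ band when N > 1 is not stated here; standard_error is a hypothetical helper for the arithmetic only.

```python
import math


def standard_error(sigma: float, n: int) -> float:
    """Standard error of the mean of n independent judge scores."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return sigma / math.sqrt(n)


# With sigma = 3: one judge call leaves the full 3 points of noise,
# while averaging three calls brings it down to roughly 1.73 points.
single = standard_error(3.0, 1)
averaged = standard_error(3.0, 3)
```

Even at N=3 the remaining noise is larger than a 1-point delta, which is why the verdict above stays "no significant change".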

13b. Variable-isolation discipline

The same-dataset / same-rubric rule is one half of variable isolation; the other half is changing only one thing at a time. Common violations:

  • Swapping the model and editing the prompt in the same iteration. The regression delta now has two causes.
  • Changing the retrieval pipeline (chunker + embedding + k) at the same time as the generator prompt. The retrieval failure surface and the answer failure surface are now tangled.
  • Adding cases to the dataset for the new run. Now it's not even a comparison — it's two different experiments.

The tool stores the changed_variable field per comparison record and refuses to compute a comparison when it cannot identify a single named change. The intended workflow is: lock baseline → change one variable → run → compare → repeat.
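
One way to identify a single named change is to diff the two run configurations. This is a sketch under the assumption that each run's configuration is a flat dictionary; the key names and both functions are hypothetical, and only changed_variable comes from the text above.

```python
def changed_variables(config_a: dict, config_b: dict) -> list:
    """Return the sorted config keys whose values differ between two runs."""
    keys = set(config_a) | set(config_b)
    return sorted(k for k in keys if config_a.get(k) != config_b.get(k))


def single_changed_variable(config_a: dict, config_b: dict) -> str:
    """Return the one changed key, or raise if zero or several keys changed."""
    changed = changed_variables(config_a, config_b)
    if len(changed) != 1:
        raise ValueError(f"expected exactly one changed variable, found {changed}")
    return changed[0]
```

How configuration keys are grouped is a design choice: if chunker, embedding model, and k are stored under one retrieval key, a retrieval pipeline change counts as one variable; stored separately, changing all three would be rejected.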

14. Common mistakes the report exists to prevent

  • Comparing different datasets.
  • Comparing different rubrics.
  • Trusting a small overall delta.
  • Ignoring per-dimension trade-offs.
  • Skipping the failed-case list because the aggregate looked fine.
  • Re-running until the numbers look favorable. (The tool stores every run; a team that does this leaves an audit trail.)
  • Conflating "improvement on the eval set" with "improvement in production". Eval is a hypothesis about production.

Source-backed concepts

  • Same dataset, same grader. OpenAI Evals and LangSmith both treat the dataset and the grader as part of the experiment's identity. Changing either creates a new experiment, not a comparable run. The tool's refusal to compute a regression report across mismatched datasets or rubrics is the operational form of that rule.
  • Per-dimension deltas, not a single overall delta. Stanford HELM's stance — a single score hides axis-specific failures — applies just as strongly to regression. A +1 overall delta that hides a −5 accuracy delta is a regression masquerading as an improvement. The tool's regression report makes per-dimension deltas the first table, not an appendix.
  • LLM-judge noise is real and must be bounded. The MT-Bench judge paper and the LLM-as-Judge survey both document inter-run variance: scoring the same case twice produces different numbers. The tool's 2σ rule (deltas smaller than 2× the calibration-set standard deviation are not significant) and the optional judge-averaging (N samples) are responses to that finding.
  • Deterministic checks reduce variance. IFEval shows that verifiable instruction-following constraints can be checked deterministically with zero variance. The tool's recommendation to replace LLM-judge dimensions with deterministic checks where possible (and to do so for regression-critical dimensions) follows directly.
  • Online vs offline. LangSmith distinguishes offline evaluation (frozen dataset, batch run) from online observation (production traces). The tool is offline by design; live monitoring belongs in an observability stack. This article makes the boundary explicit.

Applied in this tool

  • The Compare view (/compare/[runA]/[runB]) enforces same-dataset / same-rubric at the API level. There is no "compare anyway" button.
  • The per-dimension delta table in the comparison report is the HELM stance made operational: every dimension has a delta, a 2σ band, and a "significant?" flag.
  • The "judge-averaging" toggle on a run (N samples per LLM-judge call) is the variance-reduction recipe from the LLM-as-Judge survey.
  • The advisory recommendation (ship / do not ship / human review required) is derived from thresholds + safety gate; it is not invented by an LLM editorial layer.

Sources used

  • OpenAI Evals — same dataset, same grader for comparison.
  • LangSmith Evaluation Concepts — offline vs online evaluation; pinned configuration.
  • Stanford HELM — per-dimension deltas, no global score.
  • MT-Bench / Chatbot Arena judge paper — LLM-judge variance per run.
  • LLM-as-Judge survey — calibration drift and variance bounds.
  • IFEval — deterministic checks reduce variance.

Related topics

  • Evaluation Principles — principles 9 ("comparable only against itself"), 12 (reproducibility), and 16 (calibration drift) ground this workflow.
  • Scoring Rubrics — why rubric versions are immutable and how cross-rubric comparison is reported.
  • LLM-as-Judge — variance per judge, judge-averaging, and the 2σ rule.
  • Evaluation Reports — how the regression report is structured (verdict, per-dimension deltas, regressed cases).
  • Human Review — how regression-flagged cases enter the review queue.

Sources used

  • Framework: OpenAI Evals. Maps to the tool's three scoring methods and to the immutable-run + pinned-config rules.
  • Official docs: LangSmith Evaluation Concepts. Grounds the Project → Rubric → Case → Run → Review → Report workflow.
  • Framework: Stanford HELM. Justifies the refusal to display an overall score without the per-dimension breakdown.
  • Paper: MT-Bench / Chatbot Arena: Judging LLM-as-a-Judge. Backs the explicit LLM-judge failure-mode list and the cross-family judge recommendation.
  • Paper: LLM-as-a-Judge Survey. Justifies the calibration loop, the drift alert, and human review priority on uncertainty.
  • Benchmark: IFEval. Justifies the deterministic scoring method for verifiable requirements.

Detail in wiki/sources/source-cards.md.
