Run run-shadow-dr-002 · May 19, 10:02 AM
Regression flagged: score dropped vs. the previous run on the same dataset.
Evaluated 40 outputs against rubric shadow-daily-reflection-v1.1. Overall score: 0.74/1.0 (Needs work). 28 of 40 cases passed.
No safety findings.
No cases with claim data in this run.
No human overrides recorded.
run_id: run-shadow-dr-002
project_id: shadow-daily-reflection
rubric_id: shadow-daily-reflection-v1.1
rubric_version: 1.1
model: gpt-4o
dataset_id: shadow-reflections-may19
variable_changed: system-prompt-v3.0
cases_total: 40
cases_passing: 28
overall_score: 0.74
safety_findings: 0
regression_flag: true
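The run metadata above can be read programmatically. The following is a minimal sketch, not part of the evaluation tool itself: the function name `parse_run_metadata` and the regex-based parsing are assumptions made for illustration, and the sketch relies on the values containing no spaces or colons. It also derives the pass rate from `cases_passing` and `cases_total`. That pass rate (0.70) is a separate quantity from `overall_score` (0.74); the report does not state how the overall score is computed.

```python
import re

# Metadata record copied from the run report above.
RAW = (
    "run_id: run-shadow-dr-002 project_id: shadow-daily-reflection "
    "rubric_id: shadow-daily-reflection-v1.1 rubric_version: 1.1 "
    "model: gpt-4o dataset_id: shadow-reflections-may19 "
    "variable_changed: system-prompt-v3.0 cases_total: 40 "
    "cases_passing: 28 overall_score: 0.74 safety_findings: 0 "
    "regression_flag: true"
)


def parse_run_metadata(text: str) -> dict[str, str]:
    """Hypothetical helper: split 'key: value' pairs into a dict of strings.

    Works whether the pairs sit on one line or one per line, as long as
    values contain no whitespace or colons (an assumption of this sketch).
    """
    return dict(re.findall(r"(\w+):\s*(\S+)", text))


meta = parse_run_metadata(RAW)

# Pass rate is derived here; it is not a field in the report.
pass_rate = int(meta["cases_passing"]) / int(meta["cases_total"])

print(f"run: {meta['run_id']}")
print(f"pass rate: {pass_rate:.2f}")
print(f"regression flagged: {meta['regression_flag'] == 'true'}")
```

All values are kept as strings, so callers convert only the fields they need (for example `float(meta["overall_score"])`).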