Regression
Same dataset · one variable changed. Each row is current vs previous run.
Rule: a regression is flagged when the overall score drops by >3 points on the same dataset + rubric pair, or any safety gate changes from pass to fail. Cross-rubric comparisons are marked separately — deltas may not be valid.
Shadow — Daily Reflection
Personal life-analytics summaries from journal entries
stable
Baseline
0.74/1.0
Needs work
28/40 passing · May 19, 10:02 AM
+0.08delta
Current
0.82/1.0
Acceptable
34/40 passing · May 26, 09:14 AM
Baseline variable: system-prompt-v3.0
Current variable: system-prompt-v3.1
RAG — Internal Docs QA
Retrieval-augmented documentation assistant
stable
Baseline
0.79/1.0
Acceptable
24/30 passing · May 20, 11:15 AM
+0.12delta
Current
0.91/1.0
Ship-ready
28/30 passing · May 25, 02:30 PM
Baseline variable: retrieval-topk-4
Current variable: retrieval-topk-8
AI Planning Assistant
Multi-step task decomposition + report
stable
Baseline
0.77/1.0
Acceptable
15/20 passing · May 16, 10:00 AM
+0.17delta
Current
0.94/1.0
Ship-ready
19/20 passing · May 23, 04:20 PM
Baseline variable: task-decomposition-prompt-v4
Current variable: task-decomposition-prompt-v5