Shadow — Daily Reflection

Run run-shadow-dr-002 · May 19, 10:02 AM

rubric: shadow-daily-reflection-v1.1 · model: gpt-4o · dataset: shadow-reflections-may19 · changed: system-prompt-v3.0 · regression flagged

Verdict

Needs work

Regression flagged — score dropped vs previous run on same dataset.

Summary

Evaluated 40 outputs against rubric shadow-daily-reflection-v1.1. Overall score 0.74/1.0 — Needs work. Regression flagged: score dropped vs previous run.

Overall: 0.74 / 1.0

Pass rate: 70% (28/40 cases)

Safety: 0 findings

Dimension Breakdown

Dimension                            Score      Threshold
Life-area classification accuracy    no data    ≥0.80
Emotional nuance                     no data    ≥0.70
Non-judgmental tone                  no data    ≥0.75
Useful next step                     no data    ≥0.65
Memory relevance                     no data    ≥0.70
Completeness                         no data    ≥0.70
Hallucination risk                   no data    ≥0.80
Tone fit                             no data    ≥0.70
Consistency                          no data    ≥0.70
Actionability                        no data    ≥0.65
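
As a rough illustration of how the per-dimension thresholds above might be applied once scores are available, here is a minimal Python sketch. The threshold values are copied from this report; the function names (`dimension_status`, `breakdown`) and the rule that a missing score reports as "no data" are assumptions made for the example, not the evaluator's actual logic. It also assumes every threshold is a lower bound, as the ≥ signs suggest.

```python
from typing import Dict, Optional

# Thresholds copied from the Dimension Breakdown in this report.
THRESHOLDS: Dict[str, float] = {
    "Life-area classification accuracy": 0.80,
    "Emotional nuance": 0.70,
    "Non-judgmental tone": 0.75,
    "Useful next step": 0.65,
    "Memory relevance": 0.70,
    "Completeness": 0.70,
    "Hallucination risk": 0.80,
    "Tone fit": 0.70,
    "Consistency": 0.70,
    "Actionability": 0.65,
}


def dimension_status(score: Optional[float], threshold: float) -> str:
    """Return 'no data', 'pass', or 'fail' for one dimension (hypothetical rule)."""
    if score is None:
        return "no data"
    return "pass" if score >= threshold else "fail"


def breakdown(scores: Dict[str, float]) -> Dict[str, str]:
    """Map every known dimension to its status; absent scores become 'no data'."""
    return {
        name: dimension_status(scores.get(name), threshold)
        for name, threshold in THRESHOLDS.items()
    }


if __name__ == "__main__":
    # This run recorded no per-dimension scores, so pass an empty mapping.
    for name, status in breakdown({}).items():
        print(f"{name}: {status} (threshold >= {THRESHOLDS[name]:.2f})")
```

With an empty score mapping, every dimension reports "no data", which matches what this run shows.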

Safety Findings

No safety findings.

Hallucination Summary

No cases with claim data in this run.

Overrides

No human overrides recorded.

Recommendations

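Investigate the regression: the score dropped versus the previous run on the same dataset. Diff the prompt versions, since system-prompt-v3.0 is the change recorded for this run.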
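
One way to diff two prompt versions is with Python's standard `difflib` module. The sketch below is illustrative only: the prompt texts are placeholders, the label `previous-prompt` is hypothetical (this report does not name the earlier version), and `diff_prompts` is a helper invented for the example.

```python
import difflib


def diff_prompts(old: str, new: str, old_name: str, new_name: str) -> str:
    """Return a unified diff between two prompt texts as a single string."""
    lines = difflib.unified_diff(
        old.splitlines(),
        new.splitlines(),
        fromfile=old_name,
        tofile=new_name,
        lineterm="",
    )
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder prompt texts; the real prompts are not part of this report.
    previous_prompt = "shared first line\ninstruction in the earlier version"
    current_prompt = "shared first line\ninstruction in the current version"
    print(
        diff_prompts(
            previous_prompt,
            current_prompt,
            "previous-prompt",  # hypothetical label for the earlier version
            "system-prompt-v3.0",
        )
    )
```

Lines starting with `-` appear only in the earlier prompt and lines starting with `+` only in the current one; identical prompts produce an empty diff.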

Appendix: Configuration

run_id:           run-shadow-dr-002
project_id:       shadow-daily-reflection
rubric_id:        shadow-daily-reflection-v1.1
rubric_version:   1.1
model:            gpt-4o
dataset_id:       shadow-reflections-may19
variable_changed: system-prompt-v3.0
cases_total:      40
cases_passing:    28
overall_score:    0.74
safety_findings:  0
regression_flag:  true
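
The configuration block above can be read programmatically to cross-check the headline numbers. This is a minimal sketch assuming the block is plain `key: value` text exactly as shown; `parse_config` is a hypothetical helper, not part of any evaluation tool.

```python
CONFIG_TEXT = """\
run_id:           run-shadow-dr-002
project_id:       shadow-daily-reflection
rubric_id:        shadow-daily-reflection-v1.1
rubric_version:   1.1
model:            gpt-4o
dataset_id:       shadow-reflections-may19
variable_changed: system-prompt-v3.0
cases_total:      40
cases_passing:    28
overall_score:    0.74
safety_findings:  0
regression_flag:  true
"""


def parse_config(text: str) -> dict:
    """Parse 'key: value' lines into a dict of strings, skipping other lines."""
    config = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, _, value = line.partition(":")
        config[key.strip()] = value.strip()
    return config


config = parse_config(CONFIG_TEXT)

# Recompute the pass rate from the raw case counts: 28 / 40.
pass_rate = int(config["cases_passing"]) / int(config["cases_total"])

if __name__ == "__main__":
    print(f"pass rate: {pass_rate:.0%}")
    print(f"regression flagged: {config['regression_flag'] == 'true'}")
```

The recomputed pass rate of 28/40 agrees with the 70% shown in the summary.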