Eval Runs
Stored evaluation runs. Each row is immutable; re-evaluation creates a new run.
| When | Project | Rubric | Model | Variable | Score | Pass | Verdict | Flags |
|---|---|---|---|---|---|---|---|---|
| May 26, 09:14 AM | Shadow — Daily Reflection | shadow-daily-reflection-v1.1 | gpt-4o | system-prompt-v3.1 | 0.82 | 85% | Acceptable | |
| May 19, 10:02 AM | Shadow — Daily Reflection | shadow-daily-reflection-v1.1 | gpt-4o | system-prompt-v3.0 | 0.74 | 70% | Needs work | regression |
| May 25, 02:30 PM | RAG — Internal Docs QA | rag-qa-v2.0 | claude-sonnet-4-6 | retrieval-topk-8 | 0.91 | 93% | Ship-ready | |
| May 20, 11:15 AM | RAG — Internal Docs QA | rag-qa-v2.0 | claude-sonnet-4-6 | retrieval-topk-4 | 0.79 | 80% | Acceptable | |
| May 24, 08:45 AM | Area Mosa — Booking Assistant | booking-assistant-v1.3 | gpt-4o-mini | tone-friendly-v2 | 0.88 | 92% | Ship-ready | |
| May 26, 01:00 PM | Customer Support Reply | support-reply-v1.0 | claude-sonnet-4-6 | context-window-reduced | 0.64 | 64% | Needs work | 2 safety |
| May 23, 04:20 PM | AI Planning Assistant | planner-v0.4 | claude-opus-4-6 | task-decomposition-prompt-v5 | 0.94 | 95% | Ship-ready | |
| May 16, 10:00 AM | AI Planning Assistant | planner-v0.4 | claude-opus-4-6 | task-decomposition-prompt-v4 | 0.77 | 75% | Acceptable | |
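
The rule that each row is immutable and re-evaluation creates a new run could be modeled roughly as follows. This is a minimal sketch, not the actual storage layer: the `EvalRun` type, its field names, and the `re_evaluate` helper are hypothetical and simply mirror the table columns. The sample values are taken from the two AI Planning Assistant rows.

```python
from dataclasses import dataclass, replace, FrozenInstanceError


# Hypothetical record type: fields mirror the table columns, not a confirmed schema.
@dataclass(frozen=True)
class EvalRun:
    when: str
    project: str
    rubric: str
    model: str
    variable: str
    score: float
    pass_rate: float  # fraction, e.g. 0.75 for the 75% shown in the Pass column
    verdict: str
    flags: tuple[str, ...] = ()


def re_evaluate(previous: EvalRun, *, when: str, variable: str, score: float,
                pass_rate: float, verdict: str,
                flags: tuple[str, ...] = ()) -> EvalRun:
    """Build a new run from an earlier one; the earlier run is left untouched."""
    return replace(previous, when=when, variable=variable, score=score,
                   pass_rate=pass_rate, verdict=verdict, flags=flags)


old_run = EvalRun("May 16, 10:00 AM", "AI Planning Assistant", "planner-v0.4",
                  "claude-opus-4-6", "task-decomposition-prompt-v4",
                  0.77, 0.75, "Acceptable")

new_run = re_evaluate(old_run, when="May 23, 04:20 PM",
                      variable="task-decomposition-prompt-v5",
                      score=0.94, pass_rate=0.95, verdict="Ship-ready")

# Editing a stored run in place is rejected by the frozen dataclass.
try:
    old_run.score = 0.94
    mutated = True
except FrozenInstanceError:
    mutated = False
```

Both runs remain available afterwards, so the v4 and v5 results can still be compared side by side, as in the table.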