Projects
Each project pairs a rubric, a dataset, and a model under test.
Shadow — Daily Reflection
shadow-team
Personal life-analytics summaries from journal entries
Model
gpt-4o
Rubric
shadow-daily-reflection-v1.1
Cases
124
Last run · 0.82 · 34/40 passing
Acceptable

RAG — Internal Docs QA
platform-team
Retrieval-augmented documentation assistant
Model
claude-sonnet-4-6
Rubric
rag-qa-v2.0
Cases
240
Last run · 0.91 · 28/30 passing
Ship-ready

Area Mosa — Booking Assistant
area-mosa
WhatsApp booking bot for hair salon
Model
gpt-4o-mini
Rubric
booking-assistant-v1.3
Cases
86
Last run · 0.88 · 23/25 passing
Ship-ready

Customer Support Reply
cx-team
First-response generator with KB retrieval
Model
claude-sonnet-4-6
Rubric
support-reply-v1.0
Cases
312
Last run · 0.64 · 32/50 passing
Needs work

AI Planning Assistant
platform-team
Multi-step task decomposition + report
Model
claude-opus-4-6
Rubric
planner-v0.4
Cases
48
Last run · 0.94 · 19/20 passing
Ship-ready