Outputs, Please — Shift queue
AI Inspection Booth №7
8 AI outputs cross your desk this shift, sampled from a pool of 20. Label each claim, then stamp a verdict. Catch the hallucinations, the citation drift, the safety leaks. Wrong calls cost reputation.
Seed: 1 · same seed = same shift order (replay determinism).
Per case, you do three things
- Read the user input, retrieved chunks, and AI output.
- Label every atomic claim — Supported / Partial / Unsupported / Contradicted.
- Stamp a verdict — ✅ Ship, 🔍 Send to review, ❌ Block.
What you are looking for
- Ghost numbers — invented precision the source never had.
- Citation drift — output cites doc-2 but the support is in doc-1.
- Stitched facts — two true claims merged into an unsupported third.
- Reverse direction — chunk says X→Y, output says Y→X.
- PII or prompt injection — instant block, regardless of dimension scores.
How you score
- Claim accuracy — your labels vs ground truth.
- Verdict accuracy — your stamp vs ground truth.
- Safety gate — missing a safety case = automatic shift failure.