Outputs, Please — Shift queue

AI Inspection Booth №7

Eight AI outputs cross your desk this shift, sampled from a pool of 20. Label each claim, then stamp a verdict. Catch the hallucinations, the citation drift, the safety leaks. Wrong calls cost reputation.

Seed: 1 · same seed = same shift order (replay determinism).
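
A minimal sketch of how a seed could pin down the shift order, assuming the shift is drawn with a seeded pseudo-random sampler. The helper name `draw_shift` and the integer case ids 0 to 19 are illustrative assumptions, not the game's actual implementation.

```python
import random

POOL_SIZE = 20   # cases in the pool (from the text above)
SHIFT_SIZE = 8   # cases per shift (from the text above)

def draw_shift(seed: int, pool_size: int = POOL_SIZE, shift_size: int = SHIFT_SIZE) -> list[int]:
    """Return an ordered list of case ids for one shift (hypothetical helper)."""
    rng = random.Random(seed)  # private generator, so global random state is untouched
    return rng.sample(range(pool_size), shift_size)

# Replaying with the same seed yields the same cases in the same order.
assert draw_shift(1) == draw_shift(1)
```

Using a dedicated `random.Random(seed)` instance rather than the module-level functions keeps the draw reproducible even if other code consumes random numbers in between.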

Per case, you do three things

Read the user input, retrieved chunks, and AI output.
Label every atomic claim — Supported / Partial / Unsupported / Contradicted.
Stamp a verdict — ✅ Ship, 🔍 Send to review, ❌ Block.
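
One way to picture what you produce per case, as a hedged sketch: the four claim labels and three verdicts come from the steps above, while the class names (`ClaimLabel`, `Verdict`, `CaseReview`) and the `is_complete` check are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional

class ClaimLabel(Enum):
    SUPPORTED = "Supported"
    PARTIAL = "Partial"
    UNSUPPORTED = "Unsupported"
    CONTRADICTED = "Contradicted"

class Verdict(Enum):
    SHIP = "Ship"
    REVIEW = "Send to review"
    BLOCK = "Block"

@dataclass
class CaseReview:
    """One inspector's work on one case (hypothetical structure)."""
    case_id: int
    claim_labels: list[ClaimLabel] = field(default_factory=list)
    verdict: Optional[Verdict] = None

    def is_complete(self, n_claims: int) -> bool:
        # A case is done once every atomic claim is labeled and a verdict is stamped.
        return len(self.claim_labels) == n_claims and self.verdict is not None
```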

What you are looking for

Ghost numbers — invented precision the source never had.
Citation drift — output cites doc-2 but the support is in doc-1.
Stitched facts — two true claims merged into an unsupported third.
Reverse direction — chunk says X→Y, output says Y→X.
PII or prompt injection — instant block, regardless of dimension scores.
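
Of the patterns above, citation drift is the most mechanical to check. A hedged sketch, assuming you have already decided which retrieved docs actually support a claim; the function `has_citation_drift` and the `doc-N` ids are illustrative, not part of the game.

```python
def has_citation_drift(cited_doc: str, supporting_docs: set[str]) -> bool:
    """True when support exists, but not in the doc the output cites."""
    return bool(supporting_docs) and cited_doc not in supporting_docs

# The example from the list: output cites doc-2, support is in doc-1.
assert has_citation_drift("doc-2", {"doc-1"})
# Correct citation: no drift.
assert not has_citation_drift("doc-1", {"doc-1"})
# No support anywhere is an unsupported claim, not drift.
assert not has_citation_drift("doc-2", set())
```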

How you score

Claim accuracy — your labels vs ground truth.
Verdict accuracy — your stamp vs ground truth.
Safety gate — missing a safety case = automatic shift failure.
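
The three scoring rules can be sketched as follows. This assumes each accuracy is the fraction of exact matches against ground truth and that "missing" a safety case means not stamping Block on it; the names `GradedCase` and `score_shift` are hypothetical, and the game's real weighting may differ.

```python
from dataclasses import dataclass

@dataclass
class GradedCase:
    player_labels: list[str]
    true_labels: list[str]
    player_verdict: str
    true_verdict: str
    is_safety_case: bool = False  # PII or prompt injection present

def score_shift(cases: list[GradedCase]) -> dict:
    total_claims = sum(len(c.true_labels) for c in cases)
    # Unlabeled claims count as wrong: the total is taken from ground truth.
    correct_claims = sum(
        p == t for c in cases for p, t in zip(c.player_labels, c.true_labels)
    )
    correct_verdicts = sum(c.player_verdict == c.true_verdict for c in cases)
    # Assumption: a safety case is "missed" unless the player stamped Block.
    missed_safety = any(
        c.is_safety_case and c.player_verdict != "Block" for c in cases
    )
    return {
        "claim_accuracy": correct_claims / total_claims if total_claims else 0.0,
        "verdict_accuracy": correct_verdicts / len(cases) if cases else 0.0,
        "shift_failed": missed_safety,
    }
```

Keeping the safety gate as a separate boolean, rather than folding it into the accuracy numbers, mirrors the rule that one missed safety case fails the shift no matter how well the rest went.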