Outputs, Please — Shift queue

AI Inspection Booth №7

8 AI outputs cross your desk this shift, sampled from a pool of 20. Label each claim, then stamp a verdict. Catch the hallucinations, the citation drift, the safety leaks. Wrong calls cost reputation.

Seed: 1 · same seed = same shift order (replay determinism).
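As a rough illustration of replay determinism, the shift order can be derived entirely from the seed. The sketch below is an assumption about how such a draw might work, not the game's actual code: `build_shift`, `POOL_SIZE`, and `SHIFT_SIZE` are hypothetical names, and only the numbers 8 and 20 come from the text above.

```python
import random

POOL_SIZE = 20   # cases in the pool (from the text above)
SHIFT_SIZE = 8   # cases per shift (from the text above)


def build_shift(seed: int, pool_size: int = POOL_SIZE,
                shift_size: int = SHIFT_SIZE) -> list[int]:
    """Return the case indices for one shift, in play order.

    Hypothetical helper: a private Random instance keeps the draw
    independent of any other randomness in the program, so the same
    seed always yields the same cases in the same order.
    """
    rng = random.Random(seed)
    # sample() draws without replacement, so no case repeats in a shift.
    return rng.sample(range(pool_size), shift_size)


if __name__ == "__main__":
    first = build_shift(1)
    replay = build_shift(1)
    print(first == replay)  # same seed, same shift order
```

Using a dedicated generator rather than the module-level `random` functions is what makes the replay reproducible even if other code also consumes random numbers.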

Per case, you do three things

  1. Read the user input, retrieved chunks, and AI output.
  2. Label every atomic claim — Supported / Partial / Unsupported / Contradicted.
  3. Stamp a verdict — ✅ Ship, 🔍 Send to review, ❌ Block.

What you are looking for

  • Ghost numbers — invented precision the source never had.
  • Citation drift — output cites doc-2 but the support is in doc-1.
  • Stitched facts — two true claims merged into an unsupported third.
  • Reverse direction — chunk says X→Y, output says Y→X.
  • PII or prompt injection — instant block, regardless of dimension scores.

How you score

  • Claim accuracy — your labels vs ground truth.
  • Verdict accuracy — your stamp vs ground truth.
  • Safety gate — missing a safety case = automatic shift failure.
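The three scoring rules can be sketched as a small function. This is a minimal illustration under stated assumptions, not the game's real scoring: `CaseResult` and `score_shift` are hypothetical names, every claim and every verdict is weighted equally, and a safety case counts as "missed" whenever it is not stamped Block.

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    """One inspected case (hypothetical structure for illustration)."""
    player_labels: list[str]   # e.g. "Supported", "Partial", ...
    true_labels: list[str]     # ground-truth label per atomic claim
    player_verdict: str        # "Ship", "Review", or "Block"
    true_verdict: str
    is_safety_case: bool = False  # PII or prompt injection present


def score_shift(cases: list[CaseResult]) -> dict:
    """Compute claim accuracy, verdict accuracy, and the safety gate."""
    total_claims = sum(len(c.true_labels) for c in cases)
    correct_claims = sum(
        p == t
        for c in cases
        for p, t in zip(c.player_labels, c.true_labels)
    )
    correct_verdicts = sum(c.player_verdict == c.true_verdict for c in cases)
    # Assumption: a safety case is missed if it was not blocked.
    safety_missed = any(
        c.is_safety_case and c.player_verdict != "Block" for c in cases
    )
    return {
        "claim_accuracy": correct_claims / total_claims if total_claims else 0.0,
        "verdict_accuracy": correct_verdicts / len(cases) if cases else 0.0,
        "shift_failed": safety_missed,
    }
```

Note that the safety gate is evaluated separately from the two accuracy figures: a shift with otherwise high accuracy still fails if a single safety case slips through.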