Getting Started · 10 min read

Start Here: AI Evaluation in 10 Minutes

Core workflow, key terms, how to read an eval result, and a first demo path through the tool.

01

What this tool does

Does

  • Evaluate AI outputs against rubric dimensions
  • Extract and label claims for hallucination risk
  • Measure groundedness against retrieved context
  • Track regressions across model or prompt changes
  • Flag safety findings and route to human review
  • Generate reproducible evaluation reports

Does not

  • Train or fine-tune AI models
  • Generate AI outputs (evaluation only, not inference)
  • Guarantee factual accuracy outside retrieved context
  • Replace domain expert review for high-stakes decisions
  • Produce a single universal quality score across projects
  • Version or deploy prompts to production
02

Core workflow

The workflow runs Project → Rubric → Case → Run → Review → Report. Each step links to its corresponding view in the app.

03

Key terms

Project: A named evaluation context: one AI behavior, one default model, one active rubric.
Rubric: A versioned scoring template with dimensions, weights (sum = 1), and thresholds.
Case: One input/output pair evaluated against a rubric. Holds claims, scores, and safety findings.
Run: A complete evaluation of a dataset through a rubric. Produces an overall score and verdict.
Dimension: A single scoring axis inside a rubric, e.g., accuracy, groundedness, tone fit.
Safety gate: A rubric dimension that, if it fails, blocks the run from receiving a passing verdict, regardless of overall score.
Hallucination risk: Claims in AI output labeled as unsupported or contradicted by retrieved context or known facts.
Groundedness: The degree to which AI output faithfully reflects retrieved source chunks, not general truth.
Human override: A reviewer-submitted score replacement on a specific dimension, with a required reason.
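As a rough illustration of how the Rubric, Dimension, and Safety gate terms fit together, here is a minimal sketch in Python. The class names, field names, and the 0-100 score scale are assumptions made for this example; they are not the tool's actual schema.

```python
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class Dimension:
    """One scoring axis (hypothetical shape, not the tool's schema)."""
    name: str
    weight: float              # share of the overall score
    threshold: float           # minimum passing score, assumed 0-100 scale
    safety_gate: bool = False  # a failing gate blocks a passing verdict


@dataclass(frozen=True)
class Rubric:
    """A versioned scoring template made of weighted dimensions."""
    name: str
    version: str
    dimensions: tuple[Dimension, ...]

    def validate(self) -> None:
        # Weights must be normalized so that they sum to 1.
        total = sum(d.weight for d in self.dimensions)
        if not math.isclose(total, 1.0, abs_tol=1e-9):
            raise ValueError(f"weights sum to {total}, expected 1")


rubric = Rubric(
    name="support-answers",   # illustrative name
    version="v3",             # illustrative version label
    dimensions=(
        Dimension("accuracy", 0.5, 70),
        Dimension("groundedness", 0.25, 70),
        Dimension("safety", 0.25, 90, safety_gate=True),
    ),
)
rubric.validate()  # raises ValueError if the weights are not normalized
```

Making the dataclasses frozen mirrors the idea that a rubric version is pinned: changing a weight or threshold should produce a new version rather than mutate an existing one.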
04

How to read an evaluation result

1. Safety status: Check first. Any open safety finding blocks ship. Red = do not ship.

2. Failed dimensions: Which scoring dimensions fell below threshold? These drive the verdict.

3. Hallucination claims: Count unsupported and contradicted claims. Review evidence text for each.

4. Groundedness evidence: Check which retrieved chunks were used, ignored, or misrepresented.

5. Review queue: Cases with open findings or low confidence are queued for human review.

6. Overall score: Weighted aggregate. Ship-ready ≥ 85, Acceptable 70–84, Needs work 55–69, Blocked < 55 or safety gate failed.
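The score bands and the safety gate rule can be sketched as a small function. This is an illustrative sketch only: the function names are hypothetical, scores are assumed to be on a 0-100 scale, and fractional scores between two bands (for example 84.5) are assumed to fall into the lower band.

```python
def overall_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted aggregate of per-dimension scores (assumed 0-100 scale)."""
    return sum(scores[name] * weight for name, weight in weights.items())


def verdict(score: float, safety_gate_passed: bool) -> str:
    """Map an overall score to a verdict band.

    A failed safety gate blocks the run regardless of the score.
    """
    if not safety_gate_passed or score < 55:
        return "Blocked"
    if score >= 85:
        return "Ship-ready"
    if score >= 70:
        return "Acceptable"
    return "Needs work"


# Illustrative numbers, not real results.
weights = {"accuracy": 0.5, "groundedness": 0.25, "safety": 0.25}
scores = {"accuracy": 80.0, "groundedness": 90.0, "safety": 100.0}

total = overall_score(scores, weights)  # 40 + 22.5 + 25 = 87.5
print(verdict(total, safety_gate_passed=True))   # Ship-ready
print(verdict(total, safety_gate_passed=False))  # Blocked
```

The second call shows why safety status is read first: the same 87.5 is blocked once the gate fails.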

05

First demo path

Follow these steps to complete a full evaluation walkthrough in the app.

06

Good evaluation vs bad evaluation

| Aspect | Bad | Good |
|---|---|---|
| Dataset | Different inputs each run | Same held-out dataset every run |
| Rubric | Changed between runs | Pinned version (only one variable changed) |
| Verdict | Based on vibe / single metric | Weighted multi-dimension with safety gate |
| Claims | No source attribution | Every claim traced to evidence or flagged |
| Regression | Not tracked | Compared against prior run on same rubric |
| Review | Ad hoc, no record | Structured queue, override stored with reason |
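The Dataset, Rubric, and Regression rows boil down to one rule: only compare runs that share a dataset and a rubric version. A minimal sketch of that check follows; the run dictionaries, their field names, and the 1-point tolerance are assumptions for illustration, not the tool's export format.

```python
def compare_runs(baseline: dict, candidate: dict, tolerance: float = 1.0) -> dict[str, float]:
    """Return per-dimension score drops larger than `tolerance` points.

    Refuses to compare runs that differ in dataset or rubric version,
    since the delta would then mix more than one changed variable.
    """
    for key in ("rubric_version", "dataset_id"):
        if baseline[key] != candidate[key]:
            raise ValueError(f"runs are not comparable: {key} differs")

    regressions = {}
    for name, old in baseline["scores"].items():
        delta = candidate["scores"][name] - old
        if delta < -tolerance:
            regressions[name] = delta
    return regressions


# Illustrative runs with hypothetical identifiers and scores.
baseline = {
    "rubric_version": "v3",
    "dataset_id": "heldout-01",
    "scores": {"accuracy": 82.0, "groundedness": 88.0},
}
candidate = {
    "rubric_version": "v3",
    "dataset_id": "heldout-01",
    "scores": {"accuracy": 84.0, "groundedness": 80.0},
}

print(compare_runs(baseline, candidate))  # {'groundedness': -8.0}
```

Raising an error on mismatched runs, instead of returning a delta anyway, keeps an invalid comparison from being read as a regression result.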
07

Before launch checklist

  • Project has a named owner and an active rubric with normalized weights
  • Rubric has at least one safety gate dimension
  • Held-out evaluation dataset is fixed — not the training or fine-tuning set
  • Judge model is a different family from the model under evaluation
  • All open safety findings are resolved or accepted with a documented reason
  • Regression check passed: run compared against last stable baseline
  • At least one human reviewer has spot-checked safety-flagged cases
  • Report generated and stored for the run
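The checklist above could be encoded as a pre-launch check that returns the items still blocking a launch. This is a sketch under assumptions: `LaunchState` and all of its fields are hypothetical names invented for this example, and in practice each value would have to come from the project's own records.

```python
from dataclasses import dataclass


@dataclass
class LaunchState:
    """Hypothetical snapshot of the facts the checklist asks about."""
    owner: str
    rubric_weights: list[float]
    safety_gate_count: int
    dataset_is_held_out: bool
    judge_family: str
    model_family: str
    unresolved_safety_findings: int  # neither resolved nor accepted with a reason
    regression_check_passed: bool
    human_spot_check_done: bool
    report_stored: bool


def launch_blockers(state: LaunchState) -> list[str]:
    """Return one message per checklist item that is not yet satisfied."""
    checks = [
        (bool(state.owner.strip()), "project has no named owner"),
        (abs(sum(state.rubric_weights) - 1.0) < 1e-9, "rubric weights are not normalized"),
        (state.safety_gate_count >= 1, "rubric has no safety gate dimension"),
        (state.dataset_is_held_out, "evaluation dataset is not a fixed held-out set"),
        (state.judge_family != state.model_family,
         "judge model shares a family with the model under evaluation"),
        (state.unresolved_safety_findings == 0, "open safety findings remain"),
        (state.regression_check_passed, "regression check against baseline not passed"),
        (state.human_spot_check_done, "no human spot-check of safety-flagged cases"),
        (state.report_stored, "report not generated and stored"),
    ]
    return [message for ok, message in checks if not ok]


state = LaunchState(
    owner="eval-team",
    rubric_weights=[0.5, 0.25, 0.25],
    safety_gate_count=1,
    dataset_is_held_out=True,
    judge_family="family-a",
    model_family="family-a",  # same family as the judge, so this item fails
    unresolved_safety_findings=0,
    regression_check_passed=True,
    human_spot_check_done=True,
    report_stored=True,
)
print(launch_blockers(state))
```

An empty list means every item is satisfied; here the only blocker reported is the shared judge and model family.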
08

Sources behind this guide

The workflow above (Project → Rubric → Case → Run → Review → Report) is the standard evaluation loop used by these primary sources. Each one is paraphrased in wiki/sources/source-cards.md.
