Getting Started · 10 min read

Start Here: AI Evaluation in 10 Minutes

Core workflow, key terms, how to read an eval result, and a first demo path through the tool.

What this tool does

Does

+ Evaluate AI outputs against rubric dimensions
+ Extract and label claims for hallucination risk
+ Measure groundedness against retrieved context
+ Track regressions across model or prompt changes
+ Flag safety findings and route to human review
+ Generate reproducible evaluation reports

Does not

− Train or fine-tune AI models
− Generate AI outputs (evaluation only, not inference)
− Guarantee factual accuracy outside retrieved context
− Replace domain expert review for high-stakes decisions
− Produce a single universal quality score across projects
− Version or deploy prompts to production

Core workflow

Project

Define what AI behavior you are evaluating, which model, and which rubric to use.

Rubric

Set scoring dimensions, weights, thresholds, and safety gates for this project.

Cases

Each case is one input → output pair with extracted claims, scores, and safety findings.

Run

Execute evaluation on a dataset. Produces scores, verdicts, and regression flags.

Review

Human reviewers inspect flagged cases, verify safety findings, and override if needed.

Report

Reproducible 13-section report with verdict, dimension breakdown, and recommendations.

Each step links to its corresponding view in the app.

Key terms

Project: A named evaluation context with one AI behavior, one default model, and one active rubric.

Rubric: A versioned scoring template with dimensions, weights (sum = 1), and thresholds.

Case: One input/output pair evaluated against a rubric. Holds claims, scores, and safety findings.

Run: A complete evaluation of a dataset through a rubric. Produces an overall score and verdict.

Dimension: A single scoring axis inside a rubric, e.g., accuracy, groundedness, tone fit.

Safety gate: A rubric dimension that, if it fails, blocks the run from receiving a passing verdict, regardless of overall score.

Hallucination risk: Claims in AI output labeled as unsupported or contradicted by retrieved context or known facts.

Groundedness: The degree to which AI output faithfully reflects retrieved source chunks, not general truth.

Human override: A reviewer-submitted score replacement on a specific dimension, with a required reason.
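The rubric terms above can be made concrete with a small data model. This is a minimal sketch, not the tool's actual schema: the class names, field names, the 0-100 score scale for thresholds, and the example rubric are all assumptions for illustration. It checks the two rubric rules stated in this guide: weights sum to 1, and at least one dimension is a safety gate.

```python
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class Dimension:
    name: str
    weight: float              # share of the overall score
    threshold: float           # minimum passing score (0-100 scale is an assumption)
    safety_gate: bool = False  # True if failing this dimension blocks the run


@dataclass(frozen=True)
class Rubric:
    name: str
    version: int
    dimensions: tuple[Dimension, ...]

    def validate(self) -> list[str]:
        """Return a list of problems; an empty list means the rubric is usable."""
        problems = []
        total = sum(d.weight for d in self.dimensions)
        if not math.isclose(total, 1.0, abs_tol=1e-9):
            problems.append(f"weights sum to {total:.3f}, expected 1")
        if not any(d.safety_gate for d in self.dimensions):
            problems.append("no safety gate dimension")
        return problems


# Hypothetical rubric; names, weights, and thresholds are made up.
rubric = Rubric(
    name="daily-reflection",
    version=3,
    dimensions=(
        Dimension("accuracy", 0.4, 70),
        Dimension("groundedness", 0.3, 70),
        Dimension("tone_fit", 0.1, 60),
        Dimension("safety", 0.2, 90, safety_gate=True),
    ),
)
print(rubric.validate())  # [] because weights sum to 1 and a safety gate exists
```

Keeping the rubric immutable (`frozen=True`) mirrors the idea of a versioned template: a change means a new version, not an in-place edit.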

How to read an evaluation result

Safety status

Check first. Any open safety finding blocks ship. Red = do not ship.

Failed dimensions

Which scoring dimensions fell below threshold? These drive the verdict.

Hallucination claims

Count unsupported and contradicted claims. Review evidence text for each.

Groundedness evidence

Check which retrieved chunks were used, ignored, or misrepresented.

Review queue

Cases with open findings or low confidence are queued for human review.

Overall score

Weighted aggregate. Ship-ready ≥ 85, Acceptable 70–84, Needs work 55–69, Blocked < 55 or safety gate failed.
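The score bands and the safety-gate rule can be sketched as a short function. This is an illustrative sketch under stated assumptions, not the tool's implementation: the function names and example scores are hypothetical, scores are assumed to be on a 0-100 scale, and fractional scores are assumed to fall into the band below the next cutoff (so 84.5 counts as Acceptable).

```python
def overall_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted aggregate of per-dimension scores; weights are assumed to sum to 1."""
    return sum(scores[name] * weight for name, weight in weights.items())


def verdict(score: float, safety_gate_failed: bool) -> str:
    """Map an overall score to the verdict bands listed in this guide."""
    # A failed safety gate blocks the run regardless of the overall score.
    if safety_gate_failed or score < 55:
        return "Blocked"
    if score >= 85:
        return "Ship-ready"
    if score >= 70:
        return "Acceptable"
    return "Needs work"


# Hypothetical per-dimension scores and weights.
weights = {"accuracy": 0.4, "groundedness": 0.3, "tone_fit": 0.1, "safety": 0.2}
scores = {"accuracy": 90, "groundedness": 80, "tone_fit": 70, "safety": 95}

score = overall_score(scores, weights)  # 36 + 24 + 7 + 19, about 86
print(verdict(score, safety_gate_failed=False))  # Ship-ready
print(verdict(score, safety_gate_failed=True))   # Blocked
```

The second call shows why safety status is checked first: the same score yields a different verdict once a gate fails.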

First demo path

Follow these steps to complete a full evaluation walkthrough in the app.

1. Open Projects: Select Shadow — Daily Reflection
2. Open latest Eval Run: Check verdict and overall score
3. Inspect dimension breakdown: Which dimensions are below threshold?
4. Open a failed case: Read AI output, scores, and rationale
5. Check claim labels: Identify unsupported and contradicted claims
6. Open Human Review: See which cases are queued and why
7. Open Reports: View or generate the full evaluation report

Good evaluation vs bad evaluation

| Aspect | Bad | Good |
|---|---|---|
| Dataset | Different inputs each run | Same held-out dataset every run |
| Rubric | Changed between runs | Pinned version; only one variable changed |
| Verdict | Based on vibe / single metric | Weighted multi-dimension with safety gate |
| Claims | No source attribution | Every claim traced to evidence or flagged |
| Regression | Not tracked | Compared against prior run on same rubric |
| Review | Ad hoc, no record | Structured queue, override stored with reason |
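The "Dataset", "Rubric", and "Regression" rows describe a comparison that only makes sense when both runs share a dataset and a rubric version. A minimal sketch of that check follows. The `RunSummary` type, its fields, the run IDs, and the 2-point tolerance are assumptions for illustration; the tool's real regression logic may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunSummary:
    run_id: str
    rubric_version: int
    dataset_id: str
    dimension_scores: dict[str, float]


def find_regressions(
    baseline: RunSummary, candidate: RunSummary, tolerance: float = 2.0
) -> dict[str, float]:
    """Return dimensions whose score dropped by more than `tolerance` points."""
    # Refuse to compare runs that differ in more than the variable under test.
    if baseline.rubric_version != candidate.rubric_version:
        raise ValueError("rubric version changed between runs; scores are not comparable")
    if baseline.dataset_id != candidate.dataset_id:
        raise ValueError("dataset changed between runs; scores are not comparable")

    regressions = {}
    for name, old_score in baseline.dimension_scores.items():
        delta = candidate.dimension_scores[name] - old_score
        if delta < -tolerance:
            regressions[name] = delta
    return regressions


# Hypothetical runs on the same rubric version and dataset.
baseline = RunSummary("run-1", 3, "holdout-v1", {"accuracy": 88.0, "groundedness": 82.0})
candidate = RunSummary("run-2", 3, "holdout-v1", {"accuracy": 89.0, "groundedness": 76.0})
print(find_regressions(baseline, candidate))  # {'groundedness': -6.0}
```

Raising an error on a rubric or dataset mismatch, instead of silently comparing, is one way to enforce the "only one variable changed" rule.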

Before launch checklist

Project has a named owner and an active rubric with normalized weights
Rubric has at least one safety gate dimension
Held-out evaluation dataset is fixed — not the training or fine-tuning set
Judge model is a different family from the model under evaluation
All open safety findings are resolved or accepted with a documented reason
Regression check passed: run compared against last stable baseline
At least one human reviewer has spot-checked safety-flagged cases
Report generated and stored for the run
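The checklist above could be expressed as an automated pre-launch check. The sketch below is hypothetical: `LaunchState` and all of its fields are invented for illustration and do not correspond to a documented API of the tool. It returns the checklist items that are not yet satisfied.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class LaunchState:
    owner: str
    weights: dict[str, float]
    safety_gate_dimensions: tuple[str, ...]
    dataset_is_held_out: bool
    judge_family: str
    evaluated_family: str
    unresolved_safety_findings: int  # excludes findings accepted with a documented reason
    regression_check_passed: bool
    safety_cases_spot_checked: bool
    report_stored: bool


def unmet_checks(state: LaunchState) -> list[str]:
    """Return the names of checklist items that are not satisfied."""
    checks = {
        "named owner": bool(state.owner),
        "normalized weights": abs(sum(state.weights.values()) - 1.0) < 1e-9,
        "at least one safety gate": len(state.safety_gate_dimensions) > 0,
        "fixed held-out dataset": state.dataset_is_held_out,
        "judge model from a different family": state.judge_family != state.evaluated_family,
        "safety findings resolved or accepted": state.unresolved_safety_findings == 0,
        "regression check passed": state.regression_check_passed,
        "human spot-check of safety cases": state.safety_cases_spot_checked,
        "report stored": state.report_stored,
    }
    return [name for name, ok in checks.items() if not ok]


# Hypothetical project state in which every item passes.
ready = LaunchState(
    owner="eval-team",
    weights={"accuracy": 0.5, "safety": 0.5},
    safety_gate_dimensions=("safety",),
    dataset_is_held_out=True,
    judge_family="family-a",
    evaluated_family="family-b",
    unresolved_safety_findings=0,
    regression_check_passed=True,
    safety_cases_spot_checked=True,
    report_stored=True,
)
print(unmet_checks(ready))  # []

# Same state, but the judge shares a family with the evaluated model.
same_family = replace(ready, judge_family="family-b")
print(unmet_checks(same_family))  # ['judge model from a different family']
```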

Sources behind this guide

The workflow above (Project → Rubric → Case → Run → Review → Report) is the standard evaluation loop used by these primary sources. Each one is paraphrased in wiki/sources/source-cards.md.

OpenAI Evals (framework): Maps to the tool's three scoring methods and to the immutable-run and pinned-config rules.
LangSmith Evaluation Concepts (official docs): Grounds the Project → Rubric → Case → Run → Review → Report workflow.
Anthropic Evaluation Documentation (official docs): Justifies the rule that a rubric must exist and be activated before scoring.
