Start Here: AI Evaluation in 10 Minutes
Core workflow, key terms, how to read an eval result, and a first demo path through the tool.
What this tool does
Does
- Evaluate AI outputs against rubric dimensions
- Extract and label claims for hallucination risk
- Measure groundedness against retrieved context
- Track regressions across model or prompt changes
- Flag safety findings and route to human review
- Generate reproducible evaluation reports
Does not
- Train or fine-tune AI models
- Generate AI outputs (evaluation only, not inference)
- Guarantee factual accuracy outside retrieved context
- Replace domain expert review for high-stakes decisions
- Produce a single universal quality score across projects
- Version or deploy prompts to production
Core workflow
Project
Define what AI behavior you are evaluating, which model, and which rubric to use.
Rubric
Set scoring dimensions, weights, thresholds, and safety gates for this project.
Cases
Each case is one input → output pair with extracted claims, scores, and safety findings.
Run
Execute evaluation on a dataset. Produces scores, verdicts, and regression flags.
Review
Human reviewers inspect flagged cases, verify safety findings, and override if needed.
Report
Reproducible 13-section report with verdict, dimension breakdown, and recommendations.
Each step links to its corresponding view in the app.
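The Rubric and Cases steps can be pictured with a small sketch. This is an illustration under stated assumptions, not the tool's actual data model: the `Dimension` class, the `score_case` function, the 0-100 score scale, and the example dimension names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dimension:
    """One rubric dimension (hypothetical shape, not the tool's schema)."""
    name: str
    weight: float              # relative weight; normalized before scoring
    threshold: float           # minimum passing score, assuming a 0-100 scale
    safety_gate: bool = False  # if True, failing this dimension blocks the case


def normalize_weights(dimensions):
    """Return a name -> weight mapping whose values sum to 1.0."""
    total = sum(d.weight for d in dimensions)
    if total <= 0:
        raise ValueError("rubric weights must sum to a positive number")
    return {d.name: d.weight / total for d in dimensions}


def score_case(dimensions, scores):
    """Score one case: weighted aggregate plus the dimensions below threshold."""
    weights = normalize_weights(dimensions)
    overall = sum(weights[d.name] * scores[d.name] for d in dimensions)
    failed = [d.name for d in dimensions if scores[d.name] < d.threshold]
    gate_failed = any(
        d.safety_gate and scores[d.name] < d.threshold for d in dimensions
    )
    return {
        "overall": round(overall, 2),
        "failed": failed,
        "safety_gate_failed": gate_failed,
    }


# Example rubric and per-dimension scores for a single case (made-up values).
rubric = [
    Dimension("accuracy", weight=2, threshold=70),
    Dimension("groundedness", weight=2, threshold=70),
    Dimension("safety", weight=1, threshold=90, safety_gate=True),
]
result = score_case(rubric, {"accuracy": 80, "groundedness": 60, "safety": 95})
print(result)
```

In this example the case scores 75.0 overall, but `groundedness` still appears in `failed` because it fell below its own threshold. That is why the aggregate alone is not enough to read a result.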
Key terms
How to read an evaluation result
Safety status
Check this first. Any open safety finding blocks shipping. Red = do not ship.
Failed dimensions
Which scoring dimensions fell below threshold? These drive the verdict.
Hallucination claims
Count unsupported and contradicted claims. Review evidence text for each.
Groundedness evidence
Check which retrieved chunks were used, ignored, or misrepresented.
Review queue
Cases with open findings or low confidence are queued for human review.
Overall score
Weighted aggregate of the dimension scores. Ship-ready: ≥ 85. Acceptable: 70–84. Needs work: 55–69. Blocked: < 55, or any safety gate failed.
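The score bands above translate into a short decision function. The sketch below is a minimal illustration, not the tool's implementation: the function name `verdict` is hypothetical, and treating fractional scores between two bands (for example 84.5) as belonging to the lower band is an assumption, since the guide only lists whole-number ranges.

```python
def verdict(overall_score, safety_gate_failed):
    """Map an overall score and safety gate status to a verdict label.

    The safety gate is checked first, mirroring the advice to read
    safety status before anything else.
    """
    if safety_gate_failed or overall_score < 55:
        return "Blocked"
    if overall_score >= 85:
        return "Ship-ready"
    if overall_score >= 70:
        return "Acceptable"
    return "Needs work"


# A high score does not rescue a run whose safety gate failed.
print(verdict(92, safety_gate_failed=False))
print(verdict(92, safety_gate_failed=True))
```

Checking the gate before the score keeps a strong aggregate from masking a blocking safety finding.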
First demo path
Follow these steps to complete a full evaluation walkthrough in the app.
Good evaluation vs bad evaluation
Before launch checklist
- Project has a named owner and an active rubric with normalized weights
- Rubric has at least one safety gate dimension
- Held-out evaluation dataset is fixed — not the training or fine-tuning set
- Judge model is a different family from the model under evaluation
- All open safety findings are resolved or accepted with a documented reason
- Regression check passed: run compared against last stable baseline
- At least one human reviewer has spot-checked safety-flagged cases
- Report generated and stored for the run
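The regression item in the checklist can be approximated as a per-dimension comparison against the last stable baseline. This is a rough sketch under assumptions: the function `find_regressions`, the 2-point `tolerance`, and the example scores are all hypothetical, and the tool's own regression logic may differ.

```python
def find_regressions(baseline, candidate, tolerance=2.0):
    """Return dimensions whose score dropped by more than `tolerance` points.

    `baseline` and `candidate` map dimension names to scores on the
    same scale. Improvements and small drops are ignored.
    """
    regressions = {}
    for name, base_score in baseline.items():
        if name not in candidate:
            raise KeyError(f"candidate run is missing dimension {name!r}")
        drop = base_score - candidate[name]
        if drop > tolerance:
            regressions[name] = round(drop, 2)
    return regressions


# Made-up scores for a baseline run and a candidate run.
baseline_run = {"accuracy": 82.0, "groundedness": 78.0, "safety": 96.0}
candidate_run = {"accuracy": 83.5, "groundedness": 71.0, "safety": 95.0}

regressed = find_regressions(baseline_run, candidate_run)
print(regressed)
```

An empty result would correspond to "regression check passed". Here `groundedness` dropped 7 points, so the check would fail even though `accuracy` improved.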
Sources behind this guide
The workflow above (Project → Rubric → Case → Run → Review → Report) is the standard evaluation loop used by these primary sources. Each one is paraphrased in wiki/sources/source-cards.md.
- OpenAI Evals (framework): maps to the tool's three scoring methods and to the immutable-run and pinned-config rules.
- LangSmith Evaluation Concepts (official docs): grounds the Project → Rubric → Case → Run → Review → Report workflow.
- Anthropic Evaluation Documentation (official docs): justifies the rule that a rubric must exist and be activated before scoring.