AI Eval · quality + grounding lab

Wiki

Practical evaluation knowledge base — the opinion layer of the tool.

New to AI evaluation?

Start with the 10-minute guide

Understand projects, rubrics, cases, runs, safety gates, and reports before diving into individual articles.

Start guide · Run demo evaluation · Open example report

Interactive · 10 cases · ~8 min

Outputs, Please — practice mode

AI Inspection Booth №7. Label claims, catch ghost numbers, citation drift, PII leaks, prompt injection. Each case maps to one wiki article.

Start shift →

Learning Paths

Quiz-passed articles light up green.

Product Managers

Launch readiness, reports, and regressions

Evaluation Reports (8 min) · Regression Evaluation (8 min) · Evaluation Principles (12 min)

Reports · Projects · Regression

28 min total

AI Engineers

Rubrics, judge behavior, groundedness, claim evidence

Scoring Rubrics (14 min) · LLM-as-Judge (11 min) · Groundedness (9 min)

Rubrics · Eval Runs

34 min total

Reviewers

Human review, safety findings, overrides

Human Review (10 min) · Hallucination Risk (7 min) · Evaluation Reports (8 min)

Human Review · Eval Runs

25 min total

Trust & Safety

Safety gates, false confirmations, PII, unresolved blockers

Human Review (10 min) · Evaluation Principles (12 min) · Scoring Rubrics (14 min)

Safety Log · Human Review

36 min total

Getting Started

Start here before reading anything else.

Start Here: AI Evaluation in 10 Minutes

Core workflow, key terms, how to read an eval result, and a first demo path through the tool.

Evaluation Reports

How to read and generate a 13-section evaluation report — from verdict to appendix.

Product Managers · Reviewers

Core Concepts

The foundational ideas behind every evaluation.

Evaluation Principles

15 core principles and 7 anti-principles that govern how evaluations are designed and interpreted.

Everyone · AI Engineers

Scoring Rubrics

10 reference dimensions, 5 starter rubrics, weight normalization rules, and versioning strategy.

AI Engineers · Product Managers
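
The article's own normalization rules are not reproduced on this page. As a rough illustration only, one common convention is to scale dimension weights so they sum to 1 before computing a weighted score. The function names, dimension names, and 1-5 score scale below are hypothetical, not the tool's API.

```python
from typing import Dict


def normalize_weights(weights: Dict[str, float]) -> Dict[str, float]:
    """Scale non-negative weights so they sum to 1 (assumed convention)."""
    if any(w < 0 for w in weights.values()):
        raise ValueError("weights must be non-negative")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return {name: w / total for name, w in weights.items()}


def weighted_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Combine per-dimension scores using the normalized weights."""
    normalized = normalize_weights(weights)
    return sum(scores[name] * normalized[name] for name in normalized)


# Hypothetical rubric with three dimensions and raw (unnormalized) weights.
raw_weights = {"accuracy": 3, "groundedness": 2, "tone": 1}
normalized = normalize_weights(raw_weights)
overall = weighted_score({"accuracy": 4, "groundedness": 5, "tone": 3}, raw_weights)
print(normalized, overall)
```

Normalizing at scoring time means rubric authors can enter weights on any scale, and adding or removing a dimension does not require rebalancing the rest by hand.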

Hallucination Risk

How claims are labeled as supported, partially supported, unsupported, or contradicted.

AI Engineers · Reviewers

Groundedness

Faithfulness to retrieved context — how to score source utilization and detect citation drift.

AI Engineers · Reviewers

Workflows

Step-by-step operational procedures.

Regression Evaluation

How to detect measurable drops between runs on the same dataset, rubric, and retrieved context.

AI Engineers · Product Managers

Regression · 8 min
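
As a minimal sketch of the idea behind regression evaluation, the snippet below compares per-case scores from two runs over the same cases and flags drops at or beyond a threshold. The function name `find_regressions`, the 0-1 score scale, and the `min_drop` default are assumptions for illustration, not the tool's actual behavior.

```python
from typing import Dict, List, Tuple


def find_regressions(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    min_drop: float = 0.05,
) -> List[Tuple[str, float, float]]:
    """Return (case_id, baseline_score, candidate_score) for each case
    whose score dropped by at least min_drop between the two runs."""
    # Runs are only comparable when they cover exactly the same cases.
    if baseline.keys() != candidate.keys():
        raise ValueError("runs must cover the same cases to be comparable")
    regressions = []
    for case_id in sorted(baseline):
        drop = baseline[case_id] - candidate[case_id]
        if drop >= min_drop:
            regressions.append((case_id, baseline[case_id], candidate[case_id]))
    return regressions


# Hypothetical per-case scores from two runs on the same dataset and rubric.
baseline_run = {"case-1": 0.90, "case-2": 0.80, "case-3": 0.70}
candidate_run = {"case-1": 0.90, "case-2": 0.60, "case-3": 0.72}
regressed = find_regressions(baseline_run, candidate_run)
print(regressed)
```

The key check is the one that rejects mismatched case sets: a score difference only counts as a regression when the dataset, rubric, and retrieved context are held fixed between runs.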

Human Review

Queue ordering, override mechanics, two-reviewer policy, and calibration loop protocol.

Reviewers · Trust & Safety

Human Review · 10 min

Evaluators: Methods That Actually Run

How the four scoring methods work in this tool — LLM judge, claim pipeline (groundedness), deterministic checks, and human review — and how to configure and test them.

AI Engineers · QA / AI Reviewers

Evaluators · 8 min

Advanced

Deep dives for engineers optimizing evaluation pipelines.

LLM-as-Judge

8 failure modes, judge selection rules, calibration loop, and when not to use LLM judges.

10 articles · 17 primary sources · Read time 97 min total · Source files in projects/ai-evaluation-tool/wiki/ · Source cards in wiki/sources/source-cards.md