Evaluators

Configure the scoring engine, then run the claim pipeline and deterministic checks on any agent output and watch exactly what gets flagged.

Global configuration

Defaults used by full evaluation runs. The per-run model can still be overridden on /runs/new.

LLM judge model · Claim pipeline model

Claim confidence threshold · 0.80

PII detection · False-confirmation detection

Local evaluator mode · pattern-based, no LLM call · providers: OpenAI — · Supabase —
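As a rough illustration of how these defaults and the per-run override might fit together, here is a minimal sketch. Every name in it (EvaluatorConfig, config_for_run, the field names, "example-model") is hypothetical and simply mirrors the labels on this page; it also assumes the override applies to the LLM judge model and that both detection toggles default to on, neither of which the page states.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class EvaluatorConfig:
    # Hypothetical field names that mirror the labels on this page.
    llm_judge_model: Optional[str] = None       # unset, as in local evaluator mode
    claim_pipeline_model: Optional[str] = None  # unset, as in local evaluator mode
    claim_confidence_threshold: float = 0.80
    pii_detection: bool = True                  # assumed default
    false_confirmation_detection: bool = True   # assumed default


def config_for_run(defaults: EvaluatorConfig,
                   model_override: Optional[str] = None) -> EvaluatorConfig:
    """Return the config for one run; only the model may differ from the defaults."""
    if model_override is None:
        return defaults
    # Assumption: the per-run override targets the LLM judge model.
    return replace(defaults, llm_judge_model=model_override)


defaults = EvaluatorConfig()
run_config = config_for_run(defaults, model_override="example-model")
```

Keeping the defaults immutable and deriving a per-run copy means an override on one run cannot leak into later runs.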

Load a demo example into both evaluators

Pattern-based claim extraction, then verification against the tool trace and evidence sources.

Agent output · Evidence / context (one chunk per line) · Tool trace (toggle booleans or paste JSON)

Evidence sources allowed

No run yet

Load a demo or paste output, then extract claims.
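The claim pipeline described above can be pictured with a small sketch: split the agent output into sentence-level claims, score each claim against the evidence chunks and the tool trace, and flag anything below the confidence threshold. This is not the tool's actual implementation. The token-overlap score, the sentence splitting rule, and all names (Claim, extract_claims, token_overlap, verify) are assumptions chosen to keep the example pattern-based with no LLM call.

```python
import re
from dataclasses import dataclass
from typing import Dict, List

THRESHOLD = 0.80  # matches the claim confidence threshold shown on this page


@dataclass
class Claim:
    text: str
    confidence: float
    supported: bool


def extract_claims(agent_output: str) -> List[str]:
    # Assumption: one claim per sentence, split after ., ! or ?
    parts = re.split(r"(?<=[.!?])\s+", agent_output.strip())
    return [p for p in parts if p]


def token_overlap(claim: str, source: str) -> float:
    # Fraction of the claim's tokens that also appear in the source.
    claim_tokens = set(re.findall(r"[a-z0-9]+", claim.lower()))
    if not claim_tokens:
        return 0.0
    source_tokens = set(re.findall(r"[a-z0-9]+", source.lower()))
    return len(claim_tokens & source_tokens) / len(claim_tokens)


def verify(agent_output: str, evidence_chunks: List[str],
           tool_trace: Dict[str, bool]) -> List[Claim]:
    # Tool-trace entries that are True count as evidence, e.g. "refund issued".
    sources = list(evidence_chunks)
    sources += [name.replace("_", " ") for name, ok in tool_trace.items() if ok]
    results = []
    for text in extract_claims(agent_output):
        confidence = max((token_overlap(text, s) for s in sources), default=0.0)
        results.append(Claim(text, confidence, confidence >= THRESHOLD))
    return results


results = verify(
    "Order 1042 was delivered on Monday. Your refund was issued.",
    evidence_chunks=["Order 1042 was delivered on Monday"],
    tool_trace={"refund_issued": False},
)
flagged = [c.text for c in results if not c.supported]
```

In this example the delivery claim is fully covered by the evidence chunk, while the refund claim has no supporting source and a false tool-trace entry, so it ends up in flagged.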

Code-based rules, no LLM. Each check is auditable.

Agent output · Expected behavior · Expected language · Tool trace

Enabled checks

No run yet

Load a demo or paste output, then run checks.
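To make "code-based rules, no LLM" concrete, here is a minimal sketch of two such checks: PII detection and false-confirmation detection. The regular expressions, the result shape, and the function names are all hypothetical, and the patterns are deliberately narrow. Each check returns a name, a pass/fail flag, and a detail string, which is one way to keep every result auditable.

```python
import re
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CheckResult:
    name: str
    passed: bool
    detail: str


# Illustrative patterns only; real PII detection needs broader coverage.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")
CONFIRMATION = re.compile(
    r"\b(has been|was|is now)\s+(cancell?ed|refunded|booked|sent)\b", re.I
)


def check_pii(output: str) -> CheckResult:
    hits = EMAIL.findall(output) + PHONE.findall(output)
    detail = f"matches: {hits}" if hits else "no PII patterns matched"
    return CheckResult("pii_detection", not hits, detail)


def check_false_confirmation(output: str, tool_trace: Dict[str, bool]) -> CheckResult:
    # Fails when the output confirms an action but no tool call succeeded.
    confirms = CONFIRMATION.search(output) is not None
    any_success = any(tool_trace.values())
    failed = confirms and not any_success
    detail = ("confirmation with no successful tool call" if failed
              else "no unsupported confirmation")
    return CheckResult("false_confirmation", not failed, detail)


def run_checks(output: str, tool_trace: Dict[str, bool]) -> List[CheckResult]:
    return [check_pii(output), check_false_confirmation(output, tool_trace)]


bad = run_checks(
    "Your order has been cancelled. Contact jane@example.com for help.",
    {"cancel_order": False},
)
clean = run_checks("I could not cancel the order.", {"cancel_order": False})
```

Here bad fails both checks (an email address is present, and the cancellation is confirmed although the cancel_order call did not succeed), while clean passes both.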