RAG — Internal Docs QA

Run run-rag-qa-004 · May 25, 02:30 PM

rubric: rag-qa-v2.0 · model: claude-sonnet-4-6 · dataset: rag-golden-set-v2 · changed: retrieval-topk-8

Verdict

Ship-ready

Summary

Evaluated 30 outputs against rubric rag-qa-v2.0. Overall score: 0.91/1.0 (Ship-ready). No safety findings; no regression.

Overall

0.91 / 1.0

Pass rate

93% (28/30 cases)

Safety

0 findings

Dimension Breakdown

Dimension               Pass rate   Score   Threshold
Groundedness            100%        0.95    ≥ 0.80
Hallucination risk      100%        0.95    ≥ 0.85
Citation correctness    100%        0.90    ≥ 0.80
Context relevance       100%        0.95    ≥ 0.70
Accuracy                100%        0.96    ≥ 0.75
Completeness            100%        0.90    ≥ 0.70
Actionability           100%        0.80    ≥ 0.65
Tone fit                100%        0.92    ≥ 0.60
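The threshold comparison behind the breakdown above can be sketched in a few lines of Python. The scores and thresholds are copied from this report; the dictionary layout and the function names are illustrative assumptions, not the evaluation tool's actual interface.

```python
# Scores and thresholds copied from the Dimension Breakdown in this report.
# The data layout and helper names are hypothetical, for illustration only.
DIMENSIONS = {
    "Groundedness": (0.95, 0.80),
    "Hallucination risk": (0.95, 0.85),
    "Citation correctness": (0.90, 0.80),
    "Context relevance": (0.95, 0.70),
    "Accuracy": (0.96, 0.75),
    "Completeness": (0.90, 0.70),
    "Actionability": (0.80, 0.65),
    "Tone fit": (0.92, 0.60),
}


def failing_dimensions(dimensions):
    """Return the names of dimensions whose score is below its threshold."""
    return [
        name
        for name, (score, threshold) in dimensions.items()
        if score < threshold
    ]


def margins(dimensions):
    """Return how far each score sits above its threshold, rounded to 2 places."""
    return {
        name: round(score - threshold, 2)
        for name, (score, threshold) in dimensions.items()
    }


failed = failing_dimensions(DIMENSIONS)
headroom = margins(DIMENSIONS)
print("failing:", failed)
print("smallest margin:", min(headroom.values()))
```

Listing the margins alongside the pass/fail result makes it easier to see which dimensions have the least headroom if a later run regresses.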

Safety Findings

No safety findings.

Hallucination Summary

Claim categories: Supported · Partial · Unsupported · Contradicted

7 total claims across 2 cases

Exemplar Passing Cases

case-rag-004-02 · 0.93/1.0

How does the memory graph differ from the chunk index?

case-rag-004-01 · 0.92/1.0

What chunking strategy does the RAG Memory Playground use by default?

Overrides

No human overrides recorded.

Recommendations

All thresholds passed. Ready for promotion decision per release policy.

Appendix: Configuration

run_id:           run-rag-qa-004
project_id:       rag-docs-qa
rubric_id:        rag-qa-v2.0
rubric_version:   2.0
model:            claude-sonnet-4-6
dataset_id:       rag-golden-set-v2
variable_changed: retrieval-topk-8
cases_total:      30
cases_passing:    28
overall_score:    0.91
safety_findings:  0
regression_flag:  false
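
As a rough sketch, the configuration block above can be read as `key: value` pairs and used to re-derive the reported pass rate. The parser below is an assumption about how one might consume this text, not a documented export format or API of the evaluation tool.

```python
# The appendix text, copied verbatim from this report.
CONFIG_TEXT = """\
run_id:           run-rag-qa-004
project_id:       rag-docs-qa
rubric_id:        rag-qa-v2.0
rubric_version:   2.0
model:            claude-sonnet-4-6
dataset_id:       rag-golden-set-v2
variable_changed: retrieval-topk-8
cases_total:      30
cases_passing:    28
overall_score:    0.91
safety_findings:  0
regression_flag:  false
"""


def parse_config(text):
    """Parse 'key: value' lines into a dict of strings (hypothetical helper)."""
    config = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip blank or malformed lines
        key, _, value = line.partition(":")
        config[key.strip()] = value.strip()
    return config


config = parse_config(CONFIG_TEXT)
pass_rate = int(config["cases_passing"]) / int(config["cases_total"])
print(f"pass rate: {pass_rate:.0%}")  # 28/30 rounds to 93%
```

All values are kept as strings on purpose; converting only the fields needed for a calculation avoids guessing at types for fields such as `rubric_version`.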