case-rag-004-01

RAG — Internal Docs QA · May 25, 02:30 PM

Case score

0.92

/1.0

Input

What chunking strategy does the RAG Memory Playground use by default?

Expected behavior

Correctly state that the default chunker is sentence-window with a 512-token window and 64-token overlap. Cite the architecture doc. No hallucination of configuration values.

AI output (heat map)

The RAG Memory Playground defaults to sentence-window chunking with a 512-token window and a 64-token overlap. The sentence boundary detection runs before the window is applied, so chunks respect sentence completeness where possible. This is defined in the architecture docs under 'Pipeline defaults.'

supportedpartialunsupportedcontradicted

Retrieved context

[1]Architecture doc §3.2: 'Default chunker: sentence-window. Window: 512 tokens. Overlap: 64 tokens. Sentence boundary detection applied pre-window.'
[2]GUIDE.md §1: 'The pipeline is configurable. Defaults optimized for mixed prose+code documentation.'

Claims (4)

supported

defaults to sentence-window chunking with a 512-token window and a 64-token overlap

Architecture doc §3.2 directly states these values.

0.99

supported

sentence boundary detection runs before the window is applied

Architecture doc confirms pre-window application.

0.98

partially supported

chunks respect sentence completeness where possible

Implied by sentence-window approach but 'where possible' hedges correctly.

0.81

supported

defined in the architecture docs under 'Pipeline defaults'

Section reference is accurate.

0.95

Dimension scores

Groundedness· Claim Pipeline0.97 ≥0.80

All 4 claims traceable to retrieved context.

Hallucination risk· Claim Pipeline0.96 ≥0.85

No fabricated values. Configuration numbers exactly match source.

Citation correctness· Deterministic0.92 ≥0.80

Section reference accurate. Could be more precise (§3.2 vs 'Pipeline defaults').

Context relevance· LLM Judge0.94 ≥0.70

Both retrieved chunks directly relevant to query.

Accuracy· LLM Judge0.98 ≥0.75

Answer is technically correct.

Completeness· LLM Judge0.88 ≥0.70

Core question answered. Could mention configuring the chunker.

Actionability· LLM Judge0.79 ≥0.65

Factual answer; no action expected for this question type.

Tone fit· LLM Judge0.91 ≥0.60

Technical, precise, not over-explained.