Run run-rag-qa-004
case-rag-004-01
RAG — Internal Docs QA · May 25, 02:30 PM
Case score
0.92
/1.0
Input
What chunking strategy does the RAG Memory Playground use by default?
Expected behavior
Correctly state that the default chunker is sentence-window with a 512-token window and 64-token overlap. Cite the architecture doc. No hallucination of configuration values.
AI output (heat map)
The RAG Memory Playground defaults to sentence-window chunking with a 512-token window and a 64-token overlap. The sentence boundary detection runs before the window is applied, so chunks respect sentence completeness where possible. This is defined in the architecture docs under 'Pipeline defaults.'
supportedpartialunsupportedcontradicted
Retrieved context
- [1]Architecture doc §3.2: 'Default chunker: sentence-window. Window: 512 tokens. Overlap: 64 tokens. Sentence boundary detection applied pre-window.'
- [2]GUIDE.md §1: 'The pipeline is configurable. Defaults optimized for mixed prose+code documentation.'
Claims (4)
supported0.99
defaults to sentence-window chunking with a 512-token window and a 64-token overlap
Architecture doc §3.2 directly states these values.
supported0.98
sentence boundary detection runs before the window is applied
Architecture doc confirms pre-window application.
partially supported0.81
chunks respect sentence completeness where possible
Implied by sentence-window approach but 'where possible' hedges correctly.
supported0.95
defined in the architecture docs under 'Pipeline defaults'
Section reference is accurate.
Dimension scores
Groundedness· Claim Pipeline0.97 ≥0.80
All 4 claims traceable to retrieved context.
Hallucination risk· Claim Pipeline0.96 ≥0.85
No fabricated values. Configuration numbers exactly match source.
Citation correctness· Deterministic0.92 ≥0.80
Section reference accurate. Could be more precise (§3.2 vs 'Pipeline defaults').
Context relevance· LLM Judge0.94 ≥0.70
Both retrieved chunks directly relevant to query.
Accuracy· LLM Judge0.98 ≥0.75
Answer is technically correct.
Completeness· LLM Judge0.88 ≥0.70
Core question answered. Could mention configuring the chunker.
Actionability· LLM Judge0.79 ≥0.65
Factual answer; no action expected for this question type.
Tone fit· LLM Judge0.91 ≥0.60
Technical, precise, not over-explained.