How to Test RAG and Knowledge-Base Search
Short answer
RAG failures split across retrieval (wrong or missing chunks, ACL leaks) and generation (hallucination when context is empty). Test retrieval offline with golden query sets and recall@k evals. Test integration end to end with a fixture corpus per run, AIMock for generation when needed, citation UI asserts, and probes of retrieval_ids rather than exact answer prose.
Who this is for
Teams shipping document Q&A, support bots with knowledge bases, internal wikis with semantic search, or copilots that cite PDFs, tickets, and Confluence pages.
Typical stacks: vector DB (Pinecone, pgvector, Weaviate), embedding pipelines, chunking jobs, rerankers, and chat UI with citation chips.
Why testing RAG matters
| Failure mode | User impact | Test signal |
|---|---|---|
| Stale index | Wrong policy answers | Probe index version after reindex job |
| ACL leak | User B sees User A doc | Cross-tenant retrieval probe |
| Empty retrieval | Confident hallucination | Assert "no sources" + eval refusal |
| Wrong citation | Trust erosion | Citation id matches probe chunk |
| Chunking bug | Answer missing key paragraph | Golden doc_id in eval set |
| Reranker regression | Right doc ranked low | recall@k eval |
Fluent LLM text is the weakest thing to assert on: a model can answer plausibly with no retrieval at all.
Complexity map
| Scenario | Edge case | Why tests break | Approach |
|---|---|---|---|
| Empty retrieval | No chunks above threshold | Hallucination passes | Assert empty sources UI + eval |
| ACL | User A doc in User B query | Security incident | Seed docs per tenant; probe deny |
| Citation | Wrong chunk linked | Misleading footnote | Eval expected doc_id + E2E chip assert |
| Reindex lag | Query before embed complete | Flake | Poll index_ready probe |
| Multi-hop | Needs two docs | Single-chunk miss | Golden set with multi-doc questions |
| Metadata filter | department=legal | Returns HR docs | Probe filter in retrieval API |
| Long context | Truncation drops cite | Missing citation | Eval with max token boundary |
| HyDE / query rewrite | Rewrite drift | Wrong retrieval | Eval rewrite + retrieval jointly |
| PDF tables | OCR garbage chunks | Wrong numbers | Fixture PDF with known cell values |
| Language mismatch | Query EN, doc DE | Empty hit | Locale-specific golden queries |
Two-layer testing model
Query → Embed → Retrieve chunks → (optional rerank) → LLM answer + citations
                       ↑                                        ↑
             Eval layer (recall@k)            E2E layer (UI + probe retrieval_ids)
Retrieval evals run cheaply on index changes. E2E verifies UI wiring, auth on search API, and citation rendering.
Fixture corpus per test run
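Avoid a shared staging KB that other tests mutate. Seed a corpus per run, then poll until it is indexed:
// POST /api/test/seed-rag-corpus
// Body: { runId, documents: [{ id, text, acl: { tenantId } }] }
// Response: { corpusId, indexReady: boolean }
// runId and documents are assumed to be defined earlier in the test.
const seed = await request.post('/api/test/seed-rag-corpus', { data: { runId, documents } });
const { corpusId } = await seed.json();
// Poll the readiness probe instead of sleeping for a fixed time.
await expect.poll(async () => {
  const res = await request.get(`/api/test/probe-index-ready?corpusId=${corpusId}`);
  return (await res.json()).ready;
}, { timeout: 30_000 }).toBe(true);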
Documents should be small and deterministic—three to ten chunks covering ACL and citation cases.
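As a rough sketch, a per-run seed payload could be built as follows. The document shape mirrors the seed route's request body; the `buildSeedPayload` helper, the tenant names, and the document text are illustrative assumptions rather than part of any real API.

```typescript
// Shape of the request body for the seed route described above.
interface SeedDocument {
  id: string;
  text: string;
  acl: { tenantId: string };
}

interface SeedPayload {
  runId: string;
  documents: SeedDocument[];
}

// Hypothetical helper: builds a tiny, deterministic corpus for one test run.
// It covers one citation case and one private document per tenant for ACL checks.
function buildSeedPayload(runId: string): SeedPayload {
  const doc = (id: string, tenantId: string, text: string): SeedDocument => ({
    id,
    text,
    acl: { tenantId },
  });
  return {
    runId,
    documents: [
      doc("doc-refund-001", "tenant-a", "Refunds are accepted within 30 days of purchase."),
      doc("doc-private-tenantA", "tenant-a", "Tenant A internal travel policy."),
      doc("doc-private-tenantB", "tenant-b", "Tenant B internal travel policy."),
    ],
  };
}
```

Keeping the text fixed means a retrieval change, not a content change, is the only thing that can move the results between runs.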
Golden query set (offline evals)
query_id,query,expected_doc_ids,min_recall_at_5,must_refuse_if_empty
refund-policy,What is the refund window?,doc-refund-001,1,true
billing-acl,Show me Acme Corp invoice,doc-acme-billing-ONLY,1,false
Run on every PR touching chunking, embeddings, or retrieval params. Track recall@k, MRR, and empty-hit rate.
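The three tracked metrics can be computed with a few small functions. This is a minimal sketch: the `EvalCase` shape and function names are assumptions, and treating a case with no expected documents as recall 1 is a convention chosen here, not a standard.

```typescript
interface EvalCase {
  queryId: string;
  expectedDocIds: string[];
  retrievedDocIds: string[]; // ranked, best match first
}

// Fraction of expected docs that appear in the top k results.
// Convention (assumption): a case with no expected docs scores 1.
function recallAtK(c: EvalCase, k: number): number {
  if (c.expectedDocIds.length === 0) return 1;
  const topK = new Set(c.retrievedDocIds.slice(0, k));
  const hits = c.expectedDocIds.filter((id) => topK.has(id)).length;
  return hits / c.expectedDocIds.length;
}

// 1 / rank of the first expected doc, or 0 if none was retrieved.
function reciprocalRank(c: EvalCase): number {
  const expected = new Set(c.expectedDocIds);
  const idx = c.retrievedDocIds.findIndex((id) => expected.has(id));
  return idx === -1 ? 0 : 1 / (idx + 1);
}

// Averages the per-query metrics over the whole golden set.
function summarize(cases: EvalCase[], k: number) {
  const n = cases.length;
  const mean = (xs: number[]) => (n === 0 ? 0 : xs.reduce((a, b) => a + b, 0) / n);
  return {
    recallAtK: mean(cases.map((c) => recallAtK(c, k))),
    mrr: mean(cases.map((c) => reciprocalRank(c))),
    emptyHitRate: mean(cases.map((c) => (c.retrievedDocIds.length === 0 ? 1 : 0))),
  };
}
```

A CI job could compare each query's recall against its `min_recall_at_5` value and fail the PR when any query falls below it.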
LLM-as-judge for answer quality (optional)
Once retrieval is stable, use a judge model to check whether the answer uses only the retrieved context:
- Calibrate judge against human labels on 50–100 examples
- Freeze judge prompt version in repo
- Fail CI if groundedness score drops on golden set
Do not use the judge as the sole gate for ACL; probe retrieval directly.
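The calibration and CI gate steps might look like the sketch below. It assumes the judge produces a boolean grounded/not-grounded verdict per example and that a baseline score is stored in the repo; the function names and the default tolerance are hypothetical.

```typescript
interface LabeledExample {
  judgeGrounded: boolean; // verdict from the judge model
  humanGrounded: boolean; // label from a human reviewer
}

// Share of examples where the judge agrees with the human label.
// A low value suggests the judge prompt needs work before it gates CI.
function judgeAgreement(examples: LabeledExample[]): number {
  if (examples.length === 0) return 0;
  const agree = examples.filter((e) => e.judgeGrounded === e.humanGrounded).length;
  return agree / examples.length;
}

// Gate: fail when the current groundedness score falls more than
// `tolerance` below the stored baseline. The default tolerance is arbitrary.
function groundednessGate(
  baseline: number,
  current: number,
  tolerance = 0.02,
): { pass: boolean; drop: number } {
  const drop = baseline - current;
  return { pass: drop <= tolerance, drop };
}
```

A small tolerance absorbs judge noise between runs; how large it should be depends on how stable the judge is on the golden set.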
E2E: citation and probe asserts
await page.getByRole('textbox').fill('What is our refund policy?');
await page.keyboard.press('Enter');
await page.locator('[data-stream-complete="true"]').waitFor();
await expect(page.getByTestId('citation-chip').first())
  .toHaveAttribute('data-doc-id', 'doc-refund-001');
await expect.poll(async () => {
  const res = await request.get(`/api/test/probe-retrieval?runId=${runId}`);
  return (await res.json()).docIds;
}).toContain('doc-refund-001');
Use AIMock for generation text when testing citation UI wiring without model variance:
// AIMock returns answer referencing doc-refund-001 with fixed summary text
await ai.verify('Answer mentions 30-day refund window or shows citation chip');
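Beyond checking a single chip, a test can confirm that every cited document was actually retrieved. The helper below is a minimal sketch; its name and inputs are assumptions, not part of AIMock or any probe API.

```typescript
// Returns cited doc ids that the retrieval probe did not report.
// An empty result means every citation is backed by a retrieved chunk.
function findUnbackedCitations(citedDocIds: string[], retrievedDocIds: string[]): string[] {
  const retrieved = new Set(retrievedDocIds);
  return citedDocIds.filter(
    // Keep the first occurrence of each id, and only if it was not retrieved.
    (id, i) => citedDocIds.indexOf(id) === i && !retrieved.has(id),
  );
}
```

In an E2E test, `citedDocIds` could come from the `data-doc-id` attributes of the citation chips and `retrievedDocIds` from the probe response, with the assertion being that the returned list is empty.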
Empty retrieval path
Critical negative scenario—often untested:
await page.getByRole('textbox').fill('What is the CEO home address?');
await page.keyboard.press('Enter');
await expect(page.getByTestId('no-sources-banner')).toBeVisible();
await ai.verify('Assistant declines or states insufficient documentation—not a fabricated address');
await expect.poll(() => probeRetrievalCount(runId)).toBe(0);
Golden eval must mark must_refuse_if_empty=true for similar queries.
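On the eval side, the refusal flag can be read from the golden CSV and checked per query. This sketch assumes a simple CSV with no quoted fields, a `;` separator for multiple expected doc ids, and an `answerRefused` value supplied by some other step (for example a judge); all three are assumptions.

```typescript
interface GoldenRow {
  queryId: string;
  query: string;
  expectedDocIds: string[];
  minRecallAt5: number;
  mustRefuseIfEmpty: boolean;
}

// Parses the golden CSV layout. Assumes no quoted fields or embedded commas,
// and ";" between multiple expected doc ids (an assumption, not a given format).
function parseGoldenCsv(csv: string): GoldenRow[] {
  const lines = csv.trim().split(/\r?\n/).slice(1); // drop the header row
  return lines
    .filter((line) => line.trim() !== "")
    .map((line) => {
      const [queryId, query, docIds, minRecall, mustRefuse] = line
        .split(",")
        .map((s) => s.trim());
      return {
        queryId,
        query,
        expectedDocIds: docIds === "" ? [] : docIds.split(";"),
        minRecallAt5: Number(minRecall),
        mustRefuseIfEmpty: mustRefuse === "true",
      };
    });
}

// A case violates the rule when retrieval is empty, refusal is required,
// and the answer did not refuse.
function violatesRefusalRule(
  row: GoldenRow,
  retrievedDocIds: string[],
  answerRefused: boolean,
): boolean {
  return row.mustRefuseIfEmpty && retrievedDocIds.length === 0 && !answerRefused;
}
```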
Access control E2E
| Arrange | Act | Assert |
|---|---|---|
| Doc owned by tenant A only | User B asks question targeting that doc | Probe empty retrieval; no citation chip |
| Same doc, owned by tenant A | User A asks the same question | Probe hit; citation visible |
Never rely on the assistant saying "I cannot access" without also probing the retrieval API.
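The probe-side check can be expressed as a small pure function over the retrieved ids. This is a sketch under assumptions: the `CorpusDoc` shape and `findAclLeaks` name are hypothetical, and the seeded corpus is taken as the source of truth for ownership.

```typescript
interface CorpusDoc {
  id: string;
  tenantId: string;
}

// Returns retrieved doc ids that the caller's tenant does not own.
// Ids missing from the seeded corpus are reported too, since nothing
// outside the fixture should appear in an isolated run.
function findAclLeaks(
  retrievedDocIds: string[],
  corpus: CorpusDoc[],
  callerTenantId: string,
): string[] {
  const owner = new Map<string, string>();
  for (const d of corpus) owner.set(d.id, d.tenantId);
  return retrievedDocIds.filter((id) => owner.get(id) !== callerTenantId);
}
```

For User B the test would assert that the probe returns no ids at all; running `findAclLeaks` on User A's probe result guards the opposite direction, where an authorized query pulls in another tenant's chunk.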
Reindex and embedding pipeline
When uploads trigger async indexing:
- Upload fixture doc via UI or seed route
- Poll index_ready probe, not a fixed sleep
- Query and assert retrieval
Also test the failure path: the indexing error surfaces in the UI and the probe shows index_status=failed.
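Outside a test runner's built-in polling, the wait-then-fail logic can be sketched as below. The `pending` and `ready` status values, the function names, and the injected `fetchStatus` and `sleep` callbacks are assumptions; only `failed` comes from the probe described here.

```typescript
type IndexStatus = "pending" | "ready" | "failed";
type PollDecision = "done" | "retry" | "fail";

// Pure decision step: stop on ready, fail fast on failed,
// and fail once the time budget is used up.
function decidePoll(status: IndexStatus, elapsedMs: number, timeoutMs: number): PollDecision {
  if (status === "ready") return "done";
  if (status === "failed") return "fail";
  return elapsedMs >= timeoutMs ? "fail" : "retry";
}

// Polls a status source until the index is ready.
// `fetchStatus` and `sleep` are injected so the loop can be tested without a network or real timers.
async function waitForIndexReady(
  fetchStatus: () => Promise<IndexStatus>,
  sleep: (ms: number) => Promise<void>,
  timeoutMs = 30_000,
  intervalMs = 500,
): Promise<void> {
  let elapsedMs = 0;
  for (;;) {
    const status = await fetchStatus();
    const decision = decidePoll(status, elapsedMs, timeoutMs);
    if (decision === "done") return;
    if (decision === "fail") {
      throw new Error(`index not ready: status=${status}, waited ${elapsedMs}ms`);
    }
    await sleep(intervalMs);
    elapsedMs += intervalMs;
  }
}
```

Failing immediately on `failed` keeps the failure-path test fast, since it does not have to wait out the full timeout.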
Anti-patterns
| Anti-pattern | Why it fails | Better approach |
|---|---|---|
| Assert exact answer text | Model rephrases | Citation id + eval groundedness |
| Shared prod-like KB in CI | Cross-test pollution | Per-run corpusId |
| Skip empty retrieval | Hallucination ships | Explicit no-sources scenario |
| Only UI search box test | API ACL untested | Probe retrieval with auth headers |
| No eval on embed model change | Silent recall drop | Golden set gates embed PRs |
| Snapshot citation HTML | Layout churn | data-doc-id attributes |
Example scenario
Situation: User asks a question answerable only from a tenant-private policy doc.
Expected outcome: Correct citation shown for authorized user; other tenant gets no sources—not a guess.
Why UI-only automation breaks: A citation chip can render while the probe shows a wrong doc_id or a cross-tenant chunk.
- Arrange: Seed corpus with doc-private-tenantA; users A and B in separate sessions.
- Act: Each user asks the same policy question in chat UI.
- Assert: User A probe contains doc-private-tenantA; User B probe empty and no-sources UI.
TestChimp workflow: Track query_category × retrieval_hit in TrueCoverage; expand when zero-hit queries spike in prod.
Same Arrange/Act/Assert pattern as expired-coupon checkout.
Evals vs E2E: when each layer helps
| Layer | Best for | Limitations |
|---|---|---|
| Offline evals (golden queries, recall@k, LLM-as-judge groundedness) | Embedding/chunking/reranker regression, citation accuracy at scale | Misses UI citation rendering, auth on search routes, index_ready timing |
| E2E SmartTests (fixture corpus + AIMock + probe retrieval_ids) | ACL integration, upload→index→query journey, citation chips | Cannot economically cover thousands of queries |
| Hybrid | Evals gate index pipeline; E2E gate ACL and empty-retrieval UX | Map eval failures to new E2E when integration suspected |
Run recall evals on every index config change. Run E2E for tenant isolation, empty retrieval UX, and one happy path per major query_category. TestChimp does not ship eval tooling—wire your eval CI separately; use AIMock in SmartTests for stable generation asserts.
Connect scenarios to your QA workflow
Capture business rules in markdown test plans and enforce them with seed routes and probe asserts. Link SmartTests with // @Scenario: for requirement traceability. Use /testchimp test on PRs, and /testchimp explore on SmartTest paths for non-functional gaps (ExploreChimp).
Related scenarios
- Conversational UI — chat patterns
- LLM output validation — structured citations JSON
- Search and filters — non-LLM search UI
- File uploads — document ingest
Frequently asked questions
How do I test RAG retrieval quality in CI?
Maintain a golden query set with expected doc IDs—run recall@k evals on embedding, chunking, or reranker changes. E2E verifies UI citation chips match probe retrieval_ids for representative queries.
How do I test empty retrieval without hallucinations?
Use queries with no matching chunks in fixture corpus. Assert no-sources UI, probe empty retrieval_ids, and offline eval must_refuse_if_empty flags. AIMock can stabilize generation text while testing wiring.
How do I test document ACL in RAG?
Seed docs with tenant-scoped ACL. Run the same query as two users, then probe for hits for the authorized user and empty retrieval for the denied one. Never trust assistant prose alone for security.
Should I assert the full generated answer?
No for CI E2E—assert citation doc_ids and probe retrieval. Use offline evals or LLM-as-judge for answer quality on golden sets.
How do I avoid flake on async indexing?
Poll index_ready probe after upload or seed reindex job—never fixed sleep. Fail with indexer logs on timeout.
When are offline evals enough without RAG E2E?
Pure retrieval parameter tuning with stable UI—evals may suffice. ACL, upload pipeline, and citation UI require E2E with probes.
Which query categories dominate prod?
Compare prod vs test-run across query_category × retrieval_hit in TrueCoverage. If zero-hit or acl_denied slices rise without matching scenarios, run /testchimp evolve as you expand the corpus or ACL rules.
Apply these patterns in your repo
Run `/testchimp init` to connect TestChimp to your repo, then `/testchimp test` on PRs to turn these patterns into maintained SmartTests. Use `/testchimp evolve` when you want to expand coverage as your app grows.