How to Test RAG and Knowledge-Base Search

Short answer

RAG failures split across retrieval (wrong or missing chunks, ACL leaks) and generation (hallucination when context is empty). Test retrieval offline with golden query sets and recall@k evals. Test integration end to end with a fixture corpus per run, AIMock for generation when needed, and assertions on the citation UI and on probed retrieval_ids rather than on exact answer prose.

Who this is for

Teams shipping document Q&A, support bots with knowledge bases, internal wikis with semantic search, or copilots that cite PDFs, tickets, and Confluence pages.

Typical stacks: vector DB (Pinecone, pgvector, Weaviate), embedding pipelines, chunking jobs, rerankers, and chat UI with citation chips.

Why testing RAG matters

Failure mode	User impact	Test signal
Stale index	Wrong policy answers	Probe index version after reindex job
ACL leak	User B sees User A doc	Cross-tenant retrieval probe
Empty retrieval	Confident hallucination	Assert "no sources" + eval refusal
Wrong citation	Trust erosion	Citation id matches probe chunk
Chunking bug	Answer missing key paragraph	Golden doc_id in eval set
Reranker regression	Right doc ranked low	recall@k eval

Fluent LLM text is the weakest thing to assert on: a model can answer plausibly even when retrieval returned nothing.

Complexity map

Scenario	Edge case	Why tests break	Approach
Empty retrieval	No chunks above threshold	Hallucination passes	Assert empty sources UI + eval
ACL	User A doc in User B query	Security incident	Seed docs per tenant; probe deny
Citation	Wrong chunk linked	Misleading footnote	Eval expected doc_id + E2E chip assert
Reindex lag	Query before embed complete	Flake	Poll index_ready probe
Multi-hop	Needs two docs	Single-chunk miss	Golden set with multi-doc questions
Metadata filter	`department=legal`	Returns HR docs	Probe filter in retrieval API
Long context	Truncation drops cite	Missing citation	Eval with max token boundary
HyDE / query rewrite	Rewrite drift	Wrong retrieval	Eval rewrite + retrieval jointly
PDF tables	OCR garbage chunks	Wrong numbers	Fixture PDF with known cell values
Language mismatch	Query EN, doc DE	Empty hit	Locale-specific golden queries

Two-layer testing model

Query → Embed → Retrieve chunks → (optional rerank) → LLM answer + citations
         ↑                              ↑
    Eval layer (recall@k)         E2E layer (UI + probe retrieval_ids)

Retrieval evals run cheaply on index changes. E2E verifies UI wiring, auth on search API, and citation rendering.

Fixture corpus per test run

Avoid a shared staging KB that other tests mutate. Seed a corpus per run instead:

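// POST /api/test/seed-rag-corpus
// Body: { runId, documents: [{ id, text, acl: { tenantId } }] }
// Response: { corpusId, indexReady: boolean }
// Assumption: runId and documents are defined earlier in the test.
const seedRes = await request.post('/api/test/seed-rag-corpus', {
  data: { runId, documents },
});
const { corpusId } = await seedRes.json();

// Poll the readiness probe instead of sleeping for a fixed interval.
await expect.poll(async () => {
  const res = await request.get(`/api/test/probe-index-ready?corpusId=${corpusId}`);
  return (await res.json()).ready;
}, { timeout: 30_000 }).toBe(true);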

Documents should be small and deterministic—three to ten chunks covering ACL and citation cases.
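
As a rough illustration of what "small and deterministic" can look like, here is a minimal sketch of a seed body with three documents. The `buildSeedBody` helper, the tenant ids, the document text, and `doc-private-tenantB` are hypothetical; only the body shape comes from the seed-route comment above.

```typescript
// Hypothetical fixture types mirroring the seed-route body shape.
interface FixtureDoc {
  id: string;
  text: string;
  acl: { tenantId: string };
}

interface SeedBody {
  runId: string;
  documents: FixtureDoc[];
}

// Builds a fixed corpus: one doc for the citation case and one private doc
// per tenant for the ACL case. Text and tenant ids are illustrative only.
function buildSeedBody(runId: string): SeedBody {
  return {
    runId,
    documents: [
      {
        id: "doc-refund-001",
        text: "Refunds are accepted within 30 days of purchase.",
        acl: { tenantId: "tenant-a" },
      },
      {
        id: "doc-private-tenantA",
        text: "Tenant A internal travel policy.",
        acl: { tenantId: "tenant-a" },
      },
      {
        id: "doc-private-tenantB",
        text: "Tenant B internal travel policy.",
        acl: { tenantId: "tenant-b" },
      },
    ],
  };
}
```

Keeping the ids fixed across runs lets citation and probe assertions compare against literal values, while the per-run `runId` keeps corpora isolated.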

Golden query set (offline evals)

query_id,query,expected_doc_ids,min_recall_at_5,must_refuse_if_empty
refund-policy,What is the refund window?,doc-refund-001,1,true
billing-acl,Show me Acme Corp invoice,doc-acme-billing-ONLY,1,false

Run on every PR touching chunking, embeddings, or retrieval params. Track recall@k, MRR, and empty-hit rate.
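
The three metrics named above can be computed with a few lines of code. The sketch below assumes each golden row has already been parsed into a `GoldenRow` object and that your retriever returns a ranked list of doc ids per query; the type names and the `summarize` function are illustrative, not part of any eval tool.

```typescript
// Hypothetical row shape, mirroring the CSV columns above.
interface GoldenRow {
  queryId: string;
  expectedDocIds: string[];
  minRecallAt5: number;
}

// Fraction of expected docs that appear in the top-k retrieved ids.
function recallAtK(expected: string[], retrieved: string[], k: number): number {
  if (expected.length === 0) return 1; // nothing to find, so nothing was missed
  const topK = new Set(retrieved.slice(0, k));
  return expected.filter((id) => topK.has(id)).length / expected.length;
}

// 1 / rank of the first expected doc, or 0 when none was retrieved.
function reciprocalRank(expected: string[], retrieved: string[]): number {
  const rank = retrieved.findIndex((id) => expected.includes(id)) + 1;
  return rank === 0 ? 0 : 1 / rank;
}

interface EvalSummary {
  meanRecallAt5: number;
  mrr: number;
  emptyHitRate: number;
  failedQueryIds: string[]; // rows that fell below their min_recall_at_5
}

// `retrievedByQuery` maps query_id to the ranked doc ids from your retriever.
function summarize(
  rows: GoldenRow[],
  retrievedByQuery: Record<string, string[]>,
): EvalSummary {
  let recallSum = 0;
  let rrSum = 0;
  let emptyHits = 0;
  const failedQueryIds: string[] = [];
  for (const row of rows) {
    const retrieved = retrievedByQuery[row.queryId] ?? [];
    const recall = recallAtK(row.expectedDocIds, retrieved, 5);
    recallSum += recall;
    rrSum += reciprocalRank(row.expectedDocIds, retrieved);
    if (retrieved.length === 0) emptyHits += 1;
    if (recall < row.minRecallAt5) failedQueryIds.push(row.queryId);
  }
  const total = Math.max(rows.length, 1); // avoid dividing by zero on an empty set
  return {
    meanRecallAt5: recallSum / total,
    mrr: rrSum / total,
    emptyHitRate: emptyHits / total,
    failedQueryIds,
  };
}
```

A CI step could fail the build whenever `failedQueryIds` is non-empty and log the three aggregate numbers so trends stay visible across PRs.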

LLM-as-judge for answer quality (optional)

Once retrieval is verified, judge whether the answer uses only the retrieved context:

Calibrate judge against human labels on 50–100 examples
Freeze judge prompt version in repo
Fail CI if groundedness score drops on golden set

Do not use judge as sole gate for ACL—probe retrieval directly.
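
The calibration and CI-gate steps can be sketched as two small checks. This assumes the judge emits a boolean verdict for calibration and a numeric groundedness score per golden example; the type and function names and the tolerance value are hypothetical and should be tuned to your own data.

```typescript
// One calibration example: a human label next to the judge's verdict.
interface LabeledExample {
  humanGrounded: boolean;
  judgeGrounded: boolean;
}

// Share of examples where the judge agrees with the human label.
function judgeAgreement(examples: LabeledExample[]): number {
  if (examples.length === 0) return 0;
  const agreed = examples.filter((e) => e.humanGrounded === e.judgeGrounded).length;
  return agreed / examples.length;
}

// Arithmetic mean, defined as 0 for an empty list.
function mean(values: number[]): number {
  if (values.length === 0) return 0;
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// True when mean groundedness fell more than `tolerance` below the baseline.
function groundednessRegressed(
  baselineScores: number[],
  currentScores: number[],
  tolerance: number,
): boolean {
  return mean(baselineScores) - mean(currentScores) > tolerance;
}
```

A small tolerance absorbs run-to-run judge noise. Storing the baseline scores next to the frozen judge prompt keeps the two in step.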

E2E: citation and probe Assert

await page.getByRole('textbox').fill('What is our refund policy?');
await page.keyboard.press('Enter');
await page.locator('[data-stream-complete="true"]').waitFor();

await expect(page.getByTestId('citation-chip').first())
  .toHaveAttribute('data-doc-id', 'doc-refund-001');

await expect.poll(async () => {
  const res = await request.get(`/api/test/probe-retrieval?runId=${runId}`);
  return (await res.json()).docIds;
}).toContain('doc-refund-001');
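
Beyond checking a single id, it can help to assert that every cited doc was actually retrieved. The helper below is a minimal sketch: it assumes you have already collected the `data-doc-id` values from the citation chips and the `docIds` array from the probe, and the function name is hypothetical.

```typescript
// Returns cited doc ids that the retrieval probe never returned.
// An empty result means every citation is backed by a retrieved doc.
function citationsMissingFromRetrieval(
  citedDocIds: string[],
  retrievedDocIds: string[],
): string[] {
  const retrieved = new Set(retrievedDocIds);
  return citedDocIds.filter((id) => !retrieved.has(id));
}
```

In a test, asserting that the returned array is empty catches a chip that points at a doc the retriever never surfaced, which a single-id check would miss.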

Use AIMock for generation text when testing citation UI wiring without model variance:

// AIMock returns answer referencing doc-refund-001 with fixed summary text
await ai.verify('Answer mentions 30-day refund window or shows citation chip');

Empty retrieval path

Critical negative scenario—often untested:

await page.getByRole('textbox').fill('What is the CEO home address?');
await page.keyboard.press('Enter');
await expect(page.getByTestId('no-sources-banner')).toBeVisible();
await ai.verify('Assistant declines or states insufficient documentation—not a fabricated address');
await expect.poll(() => probeRetrievalCount(runId)).toBe(0);

The golden eval set must mark must_refuse_if_empty=true for queries like this one.
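
On the offline side, the must_refuse_if_empty flag can be enforced with a simple check. This sketch assumes some upstream step (a judge or a refusal classifier, not shown) has already decided whether each answer was a refusal; the `RefusalCase` shape and the function name are hypothetical.

```typescript
// One evaluated query: the golden flag, what retrieval returned,
// and whether the answer was classified as a refusal.
interface RefusalCase {
  queryId: string;
  mustRefuseIfEmpty: boolean;
  retrievedDocIds: string[];
  answerRefused: boolean;
}

// Query ids where retrieval was empty, a refusal was required,
// and the model answered anyway.
function refusalViolations(cases: RefusalCase[]): string[] {
  return cases
    .filter((c) => c.mustRefuseIfEmpty && c.retrievedDocIds.length === 0 && !c.answerRefused)
    .map((c) => c.queryId);
}
```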

Access control E2E

Arrange	Act	Assert
Doc owned by tenant A only	User B asks question targeting that doc	Probe empty retrieval; no citation chip
User A same query	—	Probe hit; citation visible

Never rely on the assistant saying "I cannot access" without also probing the retrieval API.
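
One way to make the probe assertion explicit is to check the tenant on every retrieved chunk. The sketch below assumes the probe response includes a `tenantId` per chunk, which may not match your API; `RetrievedChunk` and `findAclLeaks` are hypothetical names.

```typescript
// Hypothetical probe payload: each retrieved chunk carries its owning tenant.
interface RetrievedChunk {
  docId: string;
  tenantId: string;
}

// Chunks that belong to a tenant other than the one making the request.
// Any non-empty result is an ACL leak, whatever the assistant said.
function findAclLeaks(
  requestTenantId: string,
  chunks: RetrievedChunk[],
): RetrievedChunk[] {
  return chunks.filter((chunk) => chunk.tenantId !== requestTenantId);
}
```

For the table above, User B's probe should produce no chunks at all, and User A's probe should produce chunks with zero leaks.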

Reindex and embedding pipeline

When uploads trigger async indexing:

Upload fixture doc via UI or seed route
Poll index_ready probe—not fixed sleep
Query and assert retrieval

Test failure path: indexing error surfaces in UI; probe shows index_status=failed.
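
Outside Playwright's `expect.poll`, the same polling idea can be written as a plain helper that also stops early on failure. This is a minimal sketch: the `pending` and `ready` status values and the `waitForIndex` name are assumptions, and `probe` stands in for whatever call reads index_status in your app.

```typescript
// Assumed status values; only "failed" is named in the text above.
type IndexStatus = "pending" | "ready" | "failed";

// Polls `probe` until the index leaves the pending state or the timeout hits.
// Returns "ready" or "failed" so the caller can assert on either path.
async function waitForIndex(
  probe: () => Promise<IndexStatus>,
  options: { timeoutMs: number; intervalMs: number },
): Promise<IndexStatus> {
  const deadline = Date.now() + options.timeoutMs;
  while (true) {
    const current = await probe();
    if (current !== "pending") return current; // ready or failed: stop polling
    if (Date.now() >= deadline) {
      throw new Error(`index still pending after ${options.timeoutMs} ms`);
    }
    await new Promise<void>((resolve) => setTimeout(resolve, options.intervalMs));
  }
}
```

Returning `failed` instead of throwing lets the failure-path test assert on it directly, while a timeout still surfaces as an error.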

Anti-patterns

Anti-pattern	Why it fails	Better approach
Assert exact answer text	Model rephrases	Citation id + eval groundedness
Shared prod-like KB in CI	Cross-test pollution	Per-run corpusId
Skip empty retrieval	Hallucination ships	Explicit no-sources scenario
Only UI search box test	API ACL untested	Probe retrieval with auth headers
No eval on embed model change	Silent recall drop	Golden set gates embed PRs
Snapshot citation HTML	Layout churn	data-doc-id attributes

Example scenario

Situation: User asks a question answerable only from a tenant-private policy doc.

Expected outcome: Correct citation shown for authorized user; other tenant gets no sources—not a guess.

Why UI-only automation breaks: Citation chip appears but probe shows wrong doc_id or cross-tenant chunk.

Arrange: Seed corpus with doc-private-tenantA; users A and B in separate sessions.
Act: Each user asks the same policy question in chat UI.
Assert: User A probe contains doc-private-tenantA; User B probe empty and no-sources UI.

TestChimp workflow: Track query_category × retrieval_hit in TrueCoverage; expand scenario coverage when zero-hit queries spike in prod.

Same Arrange/Act/Assert pattern as expired-coupon checkout.

Evals vs E2E: when each layer helps

Layer	Best for	Limitations
Offline evals (golden queries, recall@k, LLM-as-judge groundedness)	Embedding/chunking/reranker regression, citation accuracy at scale	Misses UI citation rendering, auth on search routes, index_ready timing
E2E SmartTests (fixture corpus + AIMock + probe retrieval_ids)	ACL integration, upload→index→query journey, citation chips	Cannot economically cover thousands of queries
Hybrid	Evals gate index pipeline; E2E gate ACL and empty-retrieval UX	Map eval failures to new E2E when integration suspected

Run recall evals on every index config change. Run E2E for tenant isolation, empty retrieval UX, and one happy path per major query_category. TestChimp does not ship eval tooling—wire your eval CI separately; use AIMock in SmartTests for stable generation asserts.

Connect scenarios to your QA workflow

Capture business rules in markdown test plans and enforce them with seed routes and probe Assert. Link SmartTests with // @Scenario: for requirement traceability. Run /testchimp test on PRs, and run /testchimp explore on SmartTest paths to find non-functional gaps (ExploreChimp).

Conversational UI — chat patterns
LLM output validation — structured citations JSON
Search and filters — non-LLM search UI
File uploads — document ingest

Frequently asked questions

How do I test RAG retrieval quality in CI?

Maintain a golden query set with expected doc IDs—run recall@k evals on embedding, chunking, or reranker changes. E2E verifies UI citation chips match probe retrieval_ids for representative queries.

How do I test empty retrieval without hallucinations?

Use queries that match no chunks in the fixture corpus. Assert the no-sources UI, assert that the probe returns empty retrieval_ids, and flag the query with must_refuse_if_empty in the offline eval. AIMock can stabilize generation text while testing wiring.

How do I test document ACL in RAG?

Seed docs with tenant-scoped ACL. Run the same query as two users, then probe that the authorized user gets hits and the denied user gets empty retrieval. Never trust assistant prose alone for security.

Should I assert the full generated answer?

No for CI E2E—assert citation doc_ids and probe retrieval. Use offline evals or LLM-as-judge for answer quality on golden sets.

How do I avoid flake on async indexing?

Poll index_ready probe after upload or seed reindex job—never fixed sleep. Fail with indexer logs on timeout.

When are offline evals enough without RAG E2E?

Pure retrieval parameter tuning with stable UI—evals may suffice. ACL, upload pipeline, and citation UI require E2E with probes.

Which query categories dominate prod?

Compare prod vs test-run across query_category × retrieval_hit in TrueCoverage. When zero-hit or acl_denied slices rise in prod without matching scenarios, run /testchimp evolve, particularly when you are expanding the corpus or ACL rules.

Apply these patterns in your repo

Run `/testchimp init` to connect TestChimp to your repo, then `/testchimp test` on PRs to turn these patterns into maintained SmartTests. Use `/testchimp evolve` when you want to expand coverage as your app grows.

