
How to Test RAG and Knowledge-Base Search

Short answer

RAG failures split between retrieval (wrong or missing chunks, ACL leaks) and generation (hallucination when context is empty). Test retrieval offline with golden query sets and recall@k evals. Test integration end to end with a fixture corpus per run, AIMock-stubbed generation when needed, citation UI assertions, and probes on retrieval_ids rather than exact answer prose.

Part of Testing Guides by AI and conversational UX.

Who this is for

Teams shipping document Q&A, support bots with knowledge bases, internal wikis with semantic search, or copilots that cite PDFs, tickets, and Confluence pages.

Typical stacks: vector DB (Pinecone, pgvector, Weaviate), embedding pipelines, chunking jobs, rerankers, and chat UI with citation chips.

Why testing RAG matters

| Failure mode | User impact | Test signal |
| --- | --- | --- |
| Stale index | Wrong policy answers | Probe index version after reindex job |
| ACL leak | User B sees User A doc | Cross-tenant retrieval probe |
| Empty retrieval | Confident hallucination | Assert "no sources" + eval refusal |
| Wrong citation | Trust erosion | Citation id matches probe chunk |
| Chunking bug | Answer missing key paragraph | Golden doc_id in eval set |
| Reranker regression | Right doc ranked low | recall@k eval |

Fluent LLM text is the weakest thing to assert on: a model can produce a plausible answer even when retrieval returned nothing.

Complexity map

| Scenario | Edge case | Why tests break | Approach |
| --- | --- | --- | --- |
| Empty retrieval | No chunks above threshold | Hallucination passes | Assert empty sources UI + eval |
| ACL | User A doc in User B query | Security incident | Seed docs per tenant; probe deny |
| Citation | Wrong chunk linked | Misleading footnote | Eval expected doc_id + E2E chip assert |
| Reindex lag | Query before embed complete | Flake | Poll index_ready probe |
| Multi-hop | Needs two docs | Single-chunk miss | Golden set with multi-doc questions |
| Metadata filter | department=legal | Returns HR docs | Probe filter in retrieval API |
| Long context | Truncation drops cite | Missing citation | Eval with max token boundary |
| HyDE / query rewrite | Rewrite drift | Wrong retrieval | Eval rewrite + retrieval jointly |
| PDF tables | OCR garbage chunks | Wrong numbers | Fixture PDF with known cell values |
| Language mismatch | Query EN, doc DE | Empty hit | Locale-specific golden queries |

Two-layer testing model

Query → Embed → Retrieve chunks → (optional rerank) → LLM answer + citations
                      ↑                                        ↑
            Eval layer (recall@k)            E2E layer (UI + probe retrieval_ids)

Retrieval evals run cheaply on index changes. E2E verifies UI wiring, auth on search API, and citation rendering.

Fixture corpus per test run

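Avoid a shared staging KB that other tests mutate. Seed a dedicated corpus for each run through a test-only route, then wait for the index to report ready: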

// POST /api/test/seed-rag-corpus
// Body: { runId, documents: [{ id, text, acl: { tenantId } }] }
// Response: { corpusId, indexReady: boolean }

// `request` is Playwright's API request fixture; `corpusId` comes from the seed response.
// Poll the readiness probe rather than sleeping for a fixed interval.
await expect.poll(async () => {
  const res = await request.get(`/api/test/probe-index-ready?corpusId=${corpusId}`);
  return (await res.json()).ready;
}, { timeout: 30_000 }).toBe(true);

Documents should be small and deterministic—three to ten chunks covering ACL and citation cases.
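A minimal sketch of such a corpus, assuming the document shape from the seed route comment above (`id`, `text`, `acl.tenantId`). The helper name `buildSeedBody`, the tenant ids, the `doc-handbook-tenantB` document, and the document text are hypothetical; only the refund and tenant-private doc ids appear elsewhere in this guide.

```typescript
// Document shape taken from the seed route body described above.
interface FixtureDoc {
  id: string;
  text: string;
  acl: { tenantId: string };
}

interface SeedBody {
  runId: string;
  documents: FixtureDoc[];
}

// Hypothetical helper: a tiny, deterministic corpus covering one citation case
// and one cross-tenant ACL case. Ids stay fixed so assertions can hard-code them.
function buildSeedBody(runId: string): SeedBody {
  return {
    runId,
    documents: [
      {
        id: 'doc-refund-001',
        text: 'Refunds are accepted within 30 days of purchase.',
        acl: { tenantId: 'tenant-a' },
      },
      {
        id: 'doc-private-tenantA',
        text: 'Tenant A internal policy: travel must be approved by a manager.',
        acl: { tenantId: 'tenant-a' },
      },
      {
        id: 'doc-handbook-tenantB',
        text: 'Tenant B handbook: office hours are 9 to 5.',
        acl: { tenantId: 'tenant-b' },
      },
    ],
  };
}
```

The returned object can be posted as the body of the seed route. Keeping the text short means each document most likely maps to a single chunk, which makes retrieval assertions easier to reason about.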

Golden query set (offline evals)

query_id,query,expected_doc_ids,min_recall_at_5,must_refuse_if_empty
refund-policy,What is the refund window?,doc-refund-001,1,true
billing-acl,Show me Acme Corp invoice,doc-acme-billing-ONLY,1,false

Run this set on every PR that touches chunking, embeddings, or retrieval parameters. Track recall@k, MRR (mean reciprocal rank), and the empty-hit rate over time.
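The three metrics can be computed with a few lines of code. The sketch below assumes each golden query has already been run against the retriever and that results are available as ranked doc ids; the type and function names are illustrative, not part of any specific eval tool.

```typescript
interface GoldenResult {
  queryId: string;
  expectedDocIds: string[];  // from the golden set
  retrievedDocIds: string[]; // ranked, best first
}

// Fraction of expected docs that appear in the top k retrieved docs.
// Assumes at least one expected doc; summarize() filters out the rest.
function recallAtK(r: GoldenResult, k: number): number {
  const topK = new Set(r.retrievedDocIds.slice(0, k));
  const hits = r.expectedDocIds.filter((id) => topK.has(id)).length;
  return hits / r.expectedDocIds.length;
}

// 1 / rank of the first expected doc, or 0 when none was retrieved.
function reciprocalRank(r: GoldenResult): number {
  const idx = r.retrievedDocIds.findIndex((id) => r.expectedDocIds.includes(id));
  return idx === -1 ? 0 : 1 / (idx + 1);
}

interface RetrievalSummary {
  meanRecallAtK: number;
  mrr: number;
  emptyHitRate: number;
}

function summarize(results: GoldenResult[], k: number): RetrievalSummary {
  // Recall and MRR are only defined for queries that expect at least one doc.
  const scored = results.filter((r) => r.expectedDocIds.length > 0);
  const mean = (xs: number[]): number =>
    xs.length === 0 ? 0 : xs.reduce((a, b) => a + b, 0) / xs.length;
  return {
    meanRecallAtK: mean(scored.map((r) => recallAtK(r, k))),
    mrr: mean(scored.map((r) => reciprocalRank(r))),
    // Share of all queries for which the retriever returned nothing.
    emptyHitRate: mean(results.map((r) => (r.retrievedDocIds.length === 0 ? 1 : 0))),
  };
}
```

A CI job could compare `recallAtK(result, 5)` against each row's `min_recall_at_5` and fail the build on any query that falls below its threshold.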

LLM-as-judge for answer quality (optional)

Once retrieval has been verified, use a judge model to check whether the answer relies only on the retrieved context:

  • Calibrate judge against human labels on 50–100 examples
  • Freeze judge prompt version in repo
  • Fail CI if groundedness score drops on golden set

Do not use the judge as the sole gate for ACL; probe retrieval directly.
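Calibration and the CI gate both reduce to simple arithmetic once judge verdicts are collected. The sketch below assumes the judge emits a boolean verdict for calibration and a groundedness score in [0, 1] for gating; the function names and the default tolerance of 0.02 are illustrative assumptions, not recommended values.

```typescript
interface LabeledExample {
  judge: boolean; // judge verdict: answer is grounded in retrieved context
  human: boolean; // human label for the same example
}

// Fraction of examples where the judge agrees with the human label.
function judgeAgreement(examples: LabeledExample[]): number {
  if (examples.length === 0) throw new Error('no labeled examples');
  const agree = examples.filter((e) => e.judge === e.human).length;
  return agree / examples.length;
}

interface GateResult {
  mean: number;
  passed: boolean;
}

// CI gate: fails when mean groundedness falls more than `tolerance`
// below the stored baseline for the golden set.
function groundednessGate(scores: number[], baseline: number, tolerance = 0.02): GateResult {
  if (scores.length === 0) throw new Error('no scores');
  const mean = scores.reduce((a, b) => a + b, 0) / scores.length;
  return { mean, passed: mean >= baseline - tolerance };
}
```

A small tolerance absorbs run-to-run judge noise. If agreement with human labels is low, fixing the judge prompt is likely a better first step than tightening the gate.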

E2E: citation and probe assertions

await page.getByRole('textbox').fill('What is our refund policy?');
await page.keyboard.press('Enter');
await page.locator('[data-stream-complete="true"]').waitFor();

await expect(page.getByTestId('citation-chip').first())
  .toHaveAttribute('data-doc-id', 'doc-refund-001');

await expect.poll(async () => {
  const res = await request.get(`/api/test/probe-retrieval?runId=${runId}`);
  return (await res.json()).docIds;
}).toContain('doc-refund-001');

Use AIMock for generation text when testing citation UI wiring without model variance:

// AIMock returns answer referencing doc-refund-001 with fixed summary text
await ai.verify('Answer mentions 30-day refund window or shows citation chip');
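The chip and probe assertions above each check a single expected id. A stricter variant checks that every doc id rendered in a citation chip was actually returned by retrieval. The helper below is a hypothetical sketch of that comparison; it works on plain arrays, so it does not depend on any particular UI or probe API.

```typescript
// Returns the cited doc ids that retrieval never returned, without duplicates.
// An empty result means every citation is backed by a retrieved doc.
function findUngroundedCitations(citedDocIds: string[], retrievedDocIds: string[]): string[] {
  const retrieved = new Set(retrievedDocIds);
  const ungrounded: string[] = [];
  for (const id of citedDocIds) {
    if (!retrieved.has(id) && ungrounded.indexOf(id) === -1) {
      ungrounded.push(id);
    }
  }
  return ungrounded;
}
```

In a Playwright test, the cited ids could be collected from the `data-doc-id` attribute of each citation chip (for example with `locator.evaluateAll`) and the retrieved ids from the probe response, then the result asserted to be an empty array.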

Empty retrieval path

Critical negative scenario—often untested:

await page.getByRole('textbox').fill('What is the CEO home address?');
await page.keyboard.press('Enter');
await expect(page.getByTestId('no-sources-banner')).toBeVisible();
await ai.verify('Assistant declines or states insufficient documentation—not a fabricated address');
await expect.poll(() => probeRetrievalCount(runId)).toBe(0);

The golden eval set must mark must_refuse_if_empty=true for similar queries.
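On the eval side, the flag can drive a simple check: any flagged query that retrieved nothing and still produced a non-refusal counts as a violation. The sketch below assumes a `refused` boolean is available per query (for example from a judge or a refusal classifier); that field and the type names are assumptions for illustration.

```typescript
interface GoldenRow {
  queryId: string;
  mustRefuseIfEmpty: boolean; // must_refuse_if_empty column of the golden set
}

interface RunOutcome {
  queryId: string;
  retrievedDocIds: string[];
  refused: boolean; // assumed signal: the assistant declined to answer
}

// Returns ids of flagged queries that retrieved nothing but were answered anyway.
function refusalViolations(rows: GoldenRow[], outcomes: RunOutcome[]): string[] {
  const byId = new Map<string, RunOutcome>();
  for (const o of outcomes) byId.set(o.queryId, o);

  const violations: string[] = [];
  for (const row of rows) {
    const outcome = byId.get(row.queryId);
    // Skip rows that were not run or that are not flagged for refusal.
    if (!outcome || !row.mustRefuseIfEmpty) continue;
    if (outcome.retrievedDocIds.length === 0 && !outcome.refused) {
      violations.push(row.queryId);
    }
  }
  return violations;
}
```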

Access control E2E

| Arrange | Act | Assert |
| --- | --- | --- |
| Doc owned by tenant A only | User B asks question targeting that doc | Probe empty retrieval; no citation chip |
| Same as above | User A same query | Probe hit; citation visible |

Never rely on the assistant saying "I cannot access" without also probing the retrieval API.
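The two rows of the table can be folded into one check over the probe results for each user. This is a hedged sketch: `checkTenantIsolation` and its report shape are hypothetical, and it assumes the probe returns the list of retrieved doc ids per session.

```typescript
interface IsolationReport {
  passed: boolean;
  problems: string[];
}

// Compares probe results for the owning tenant and another tenant.
// The owner must retrieve the private doc; the other tenant must not.
function checkTenantIsolation(
  privateDocId: string,
  ownerDocIds: string[],
  otherTenantDocIds: string[],
): IsolationReport {
  const problems: string[] = [];
  if (ownerDocIds.indexOf(privateDocId) === -1) {
    problems.push(`owner did not retrieve ${privateDocId}`);
  }
  if (otherTenantDocIds.indexOf(privateDocId) !== -1) {
    problems.push(`other tenant retrieved ${privateDocId}`);
  }
  return { passed: problems.length === 0, problems };
}
```

Checking both directions matters: a test that only asserts the denial would also pass if retrieval were broken for everyone.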

Reindex and embedding pipeline

When uploads trigger async indexing:

  1. Upload fixture doc via UI or seed route
  2. Poll index_ready probe—not fixed sleep
  3. Query and assert retrieval

Test the failure path as well: an indexing error should surface in the UI, and the probe should report index_status=failed.
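The decision made on each poll iteration can be kept as a small pure function, which makes the happy path, the failure path, and the timeout easy to test separately. The sketch below is an assumption about how the probe reports status: `failed` comes from the text above, while `pending` and `ready` are hypothetical status names.

```typescript
type IndexStatus = 'pending' | 'ready' | 'failed';
type PollDecision = 'done' | 'retry' | 'failed' | 'timeout';

// Decides what a polling loop should do after one probe response.
function decidePoll(status: IndexStatus, elapsedMs: number, timeoutMs: number): PollDecision {
  if (status === 'ready') return 'done';
  // A failed indexing job will not recover by waiting, so stop immediately.
  if (status === 'failed') return 'failed';
  return elapsedMs >= timeoutMs ? 'timeout' : 'retry';
}
```

Separating `failed` from `timeout` lets the test report a clearer error: the first points at the indexer, the second at a slow or stuck pipeline.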

Anti-patterns

| Anti-pattern | Why it fails | Better approach |
| --- | --- | --- |
| Assert exact answer text | Model rephrases | Citation id + eval groundedness |
| Shared prod-like KB in CI | Cross-test pollution | Per-run corpusId |
| Skip empty retrieval | Hallucination ships | Explicit no-sources scenario |
| Only UI search box test | API ACL untested | Probe retrieval with auth headers |
| No eval on embed model change | Silent recall drop | Golden set gates embed PRs |
| Snapshot citation HTML | Layout churn | data-doc-id attributes |

Example scenario

Situation: User asks a question answerable only from a tenant-private policy doc.

Expected outcome: Correct citation shown for authorized user; other tenant gets no sources—not a guess.

Why UI-only automation breaks: A citation chip can render while the probe shows the wrong doc_id or a cross-tenant chunk.

  1. Arrange: Seed corpus with doc-private-tenantA; users A and B in separate sessions.
  2. Act: Each user asks the same policy question in chat UI.
  3. Assert: User A probe contains doc-private-tenantA; User B probe empty and no-sources UI.

TestChimp workflow: Track query_category × retrieval_hit in TrueCoverage; expand when zero-hit queries spike in prod.

Same Arrange/Act/Assert pattern as expired-coupon checkout.

Evals vs E2E: when each layer helps

| Layer | Best for | Limitations |
| --- | --- | --- |
| Offline evals (golden queries, recall@k, LLM-as-judge groundedness) | Embedding/chunking/reranker regression, citation accuracy at scale | Misses UI citation rendering, auth on search routes, index_ready timing |
| E2E SmartTests (fixture corpus + AIMock + probe retrieval_ids) | ACL integration, upload→index→query journey, citation chips | Cannot economically cover thousands of queries |
| Hybrid | Evals gate index pipeline; E2E gate ACL and empty-retrieval UX | Map eval failures to new E2E when integration suspected |

Run recall evals on every index config change. Run E2E for tenant isolation, empty retrieval UX, and one happy path per major query_category. TestChimp does not ship eval tooling—wire your eval CI separately; use AIMock in SmartTests for stable generation asserts.

Connect scenarios to your QA workflow

Capture business rules in markdown test plans and enforce them with seed routes and probe Assert. Link SmartTests with // @Scenario: for requirement traceability. Use /testchimp test on PRs; /testchimp explore on SmartTest paths for non-functional gaps (ExploreChimp).

Frequently asked questions

How do I test RAG retrieval quality in CI?

Maintain a golden query set with expected doc IDs—run recall@k evals on embedding, chunking, or reranker changes. E2E verifies UI citation chips match probe retrieval_ids for representative queries.

How do I test empty retrieval without hallucinations?

Use queries with no matching chunks in fixture corpus. Assert no-sources UI, probe empty retrieval_ids, and offline eval must_refuse_if_empty flags. AIMock can stabilize generation text while testing wiring.

How do I test document ACL in RAG?

Seed docs with tenant-scoped ACL. Run the same query as two users: the probe should show hits for the authorized user and empty retrieval for the denied one. Never trust assistant prose alone for security.

Should I assert the full generated answer?

No for CI E2E—assert citation doc_ids and probe retrieval. Use offline evals or LLM-as-judge for answer quality on golden sets.

How do I avoid flake on async indexing?

Poll the index_ready probe after an upload or a seeded reindex job instead of using a fixed sleep. On timeout, fail the test and attach the indexer logs.

When are offline evals enough without RAG E2E?

Pure retrieval parameter tuning with stable UI—evals may suffice. ACL, upload pipeline, and citation UI require E2E with probes.

Which query categories dominate prod?

Compare prod vs test-run across query_category × retrieval_hit in TrueCoverage. When zero-hit or acl_denied slices rise in prod without matching scenarios, run /testchimp evolve as you expand the corpus or ACL rules.

Apply these patterns in your repo

Run `/testchimp init` to connect TestChimp to your repo, then `/testchimp test` on PRs to turn these patterns into maintained SmartTests. Use `/testchimp evolve` when you want to expand coverage as your app grows.
