
How to Test Chatbots and Conversational UIs

Short answer

Conversational UIs combine non-deterministic model output, multi-turn state, tool side effects, and streaming render timing, so asserting exact assistant prose in Playwright is a maintenance trap. Layer three things instead: offline evals (golden sets, LLM-as-judge) for prompt regression; AIMock with ai.act/ai.verify for deterministic E2E integration; and probe Assert on orders, tickets, and permissions rather than toast copy.

Part of Testing Guides by AI and conversational UX.

Who this is for

Teams shipping chatbots, copilots, and support assistants on web (React, Next.js, Vite) where the assistant can read account data, invoke tools, or trigger refunds—not static FAQ widgets with canned responses.

Typical stacks: OpenAI/Anthropic/Gemini APIs behind a chat UI, LangChain/LlamaIndex orchestration, custom tool routers, RAG-backed answers, or embedded Intercom-style widgets with LLM backends.

Why testing conversational UI matters

Chat failures look like "the bot was rude" but often hide serious bugs:

  • Revenue loss — assistant confirms a refund or discount that never executed; user churns after "done" with no backend change.
  • Security incidents — cross-tenant context bleed in multi-turn threads; tool calls run with wrong user's OAuth token.
  • Support load — infinite clarification loops, empty responses after safety filter, streaming UI stuck on spinner.
  • Compliance exposure — medical/financial advice without required disclaimers; PII echoed from wrong retrieval chunk.

The UI can show a polished message while tools, DB, and auth boundaries disagree. E2E must assert authoritative state via probes, not assistant wording alone.

Complexity map

| Scenario | Edge case | Why tests break | Approach |
| --- | --- | --- | --- |
| Non-deterministic text | Benign rephrase | toHaveText('exact') flakes | AIMock golden or semantic eval; assert cards/buttons |
| Multi-turn context | Thread id lost on refresh | Wrong answer turn 3 | Seed thread in Arrange; probe thread store |
| Tool call mid-chat | Async webhook lag | Assert before side effect | expect.poll on probe after tool |
| Refusal / safety | Policy block | Empty bubble passes | Offline eval + E2E negative prompt |
| Stream abort | Partial message | UI stuck loading | Wait abort/complete signal (streaming guide) |
| Empty RAG retrieval | Hallucination | False pass on fluent text | Assert "no sources" UI + probe empty hits |
| Auth gate | Anonymous vs logged-in | Tool 403 untested | Seed session; probe permission on tool |
| Suggested replies | Chip click vs free text | Different code paths | Cover both interaction modes |
| Markdown / code blocks | XSS or broken render | Snapshot noise | Assert role/structure, not HTML snapshot |
| Rate limit / 429 | Model API throttle | Random CI failure | AIMock in CI; real model in nightly eval job |
| i18n | Locale changes copy | String assert fails | Probe intent + structured UI state |
| Concurrent sends | Double submit | Duplicate tickets | Disable send during stream; probe idempotency |

Architecture: what you are actually testing

User message → UI → API route → LLM (+ tools/RAG) → stream/batch response → UI render

Side effects (refund, ticket, email)

Split responsibilities:

| Layer | Validates |
| --- | --- |
| Offline evals | Answer quality, refusal, citation accuracy on prompt change |
| AIMock E2E | UI wiring, tool invocation, auth, probe side effects |
| Probe Assert | DB/API truth after conversation action |

Arrange: threads and users without chatting through signup

Most specs cover post-login assistant behavior, not onboarding. Pattern:

// POST /api/test/seed-chat-thread
// Body: { runId, userId, turns?: [{ role, content }] }
// Response: { threadId, accessToken? }

const { threadId } = await request.post('/api/test/seed-chat-thread', {
  data: { runId, userId: seededUser.uid, turns: [] },
}).then(r => r.json());

await page.goto(`/chat?thread=${threadId}`);

Enable AIMock in test env so model calls return fixture JSON keyed by runId and turn index—real HTTP to your backend, stubbed upstream LLM.

AIMock and hybrid SmartTests

AIMock stubs model responses while exercising real routes, tool dispatch, and UI components. Pair with ai.act (semantic UI steps) and ai.verify (semantic UI checks) where selectors are volatile:

// Hybrid SmartTest: deterministic model, semantic chat UI
await ai.act('Open the support chat and ask to cancel order 12345');
await ai.verify('Assistant confirms refund initiated or shows policy denial with clear next steps');
await expect.poll(async () => {
  const res = await request.get('/api/test/probe-order/12345');
  return (await res.json()).status;
}, { timeout: 15_000 }).toMatch(/refund_pending|policy_denied/);

Reserve ai.act/ai.verify for brittle chat chrome (message bubbles, chip labels). Keep Arrange on seed routes and Assert on probes—never open-ended "ask the bot anything" in CI.

AIMock fixture shape

{
  "turnIndex": 1,
  "assistantMessage": {
    "text": "I can help cancel order 12345. Confirm?",
    "toolCalls": [{ "name": "lookup_order", "args": { "orderId": "12345" } }]
  },
  "streamChunks": ["I can help", " cancel order 12345", ". Confirm?"]
}

Map fixtures by (threadId, userMessageHash) or explicit test id in X-Test-Fixture header.
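One way to implement the lookup described above is a small in-memory store. The sketch below is an assumption about how such a resolver could work, not AIMock's actual API: `FixtureStore`, `FixtureRequest`, and `hashMessage` are hypothetical names, and the FNV-1a style hash is only a dependency-free stand-in for whatever hash your backend uses.

```typescript
// Hypothetical fixture resolver: illustrative names, not AIMock's actual API.
interface AssistantFixture {
  turnIndex: number;
  assistantMessage: {
    text: string;
    toolCalls: { name: string; args: Record<string, unknown> }[];
  };
  streamChunks: string[];
}

interface FixtureRequest {
  threadId: string;
  userMessage: string;
  fixtureHeader?: string; // value of X-Test-Fixture, if the test sent one
}

// FNV-1a style 32-bit hash over UTF-16 code units (stand-in for a real hash).
function hashMessage(text: string): string {
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h.toString(16);
}

class FixtureStore {
  private readonly byTestId = new Map<string, AssistantFixture>();
  private readonly byMessage = new Map<string, AssistantFixture>();

  registerByTestId(testId: string, fixture: AssistantFixture): void {
    this.byTestId.set(testId, fixture);
  }

  registerByMessage(threadId: string, userMessage: string, fixture: AssistantFixture): void {
    this.byMessage.set(`${threadId}:${hashMessage(userMessage)}`, fixture);
  }

  // A known explicit test id wins; otherwise fall back to (threadId, userMessageHash).
  resolve(req: FixtureRequest): AssistantFixture {
    const byHeader =
      req.fixtureHeader === undefined ? undefined : this.byTestId.get(req.fixtureHeader);
    const hit =
      byHeader ?? this.byMessage.get(`${req.threadId}:${hashMessage(req.userMessage)}`);
    if (hit === undefined) {
      throw new Error(`No fixture for thread ${req.threadId}`);
    }
    return hit;
  }
}
```

Throwing on a miss is a deliberate choice in this sketch: an unmatched prompt fails the test loudly rather than falling through to a live model.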

Multi-turn conversations

Script turns explicitly—do not rely on model creativity in CI:

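import { expect, type Page } from '@playwright/test';

// Send one scripted user turn and wait for a NEW completed assistant message.
// Counting first avoids matching the previous turn's already-complete bubble.
// Assumes only assistant messages carry data-stream-complete.
async function sendTurn(page: Page, text: string) {
  const done = page.locator('[data-stream-complete="true"]');
  const before = await done.count();
  const input = page.getByRole('textbox', { name: /message/i });
  await input.fill(text);
  await input.press('Enter');
  await expect(done).toHaveCount(before + 1, { timeout: 30_000 });
}

await sendTurn(page, 'I need help with my order');
await ai.verify('Assistant asks for order id or shows order list');
await sendTurn(page, 'Order 98765');
await expect.poll(() => probeThreadMentions('98765')).toBe(true);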

With AIMock, return deterministic follow-ups per turnIndex so ai.verify stays stable.
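The `probeThreadMentions` helper used above is not defined in this guide. A minimal sketch, assuming a hypothetical `/api/test/probe-thread/:threadId` route that returns the stored turns; the route, `ThreadProbe`, and `makeProbeThreadMentions` are illustrative names, and the JSON fetcher is injected so the check stays testable without a browser:

```typescript
// Hypothetical probe payload; the route and field names are assumptions.
interface ThreadTurn {
  role: 'user' | 'assistant' | 'tool';
  content: string;
}

interface ThreadProbe {
  threadId: string;
  turns: ThreadTurn[];
}

// Pure check against the server-side thread store, not the rendered DOM.
function threadMentions(probe: ThreadProbe, needle: string): boolean {
  return probe.turns.some((turn) => turn.content.includes(needle));
}

// Binds a thread id and a JSON fetcher and returns a function
// with the same shape as the probeThreadMentions call above.
function makeProbeThreadMentions(
  threadId: string,
  getJson: (url: string) => Promise<unknown>,
): (needle: string) => Promise<boolean> {
  return async (needle: string) => {
    const probe = (await getJson(`/api/test/probe-thread/${threadId}`)) as ThreadProbe;
    return threadMentions(probe, needle);
  };
}
```

In a spec, the fetcher could be a thin wrapper such as `url => request.get(url).then(r => r.json())`, so the probe reads what the backend stored rather than what the UI rendered.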

Structured UI vs prose asserts

Prefer asserting machine-readable UI state:

| Assert on | Avoid |
| --- | --- |
| Action buttons (Refund, Talk to human) | Full paragraph text |
| Citation chips with data-doc-id | Wording of summary |
| Tool result cards (JSON preview) | Random synonym in reply |
| aria-busy during stream | First token timing |

await expect(page.getByTestId('assistant-message').last())
  .toHaveAttribute('data-stream-complete', 'true');
await expect(page.getByRole('button', { name: /confirm refund/i })).toBeVisible();

Tool calls and side effects

When the assistant invokes tools (refund, ticket create, calendar hold):

  1. Act: user message triggers tool in UI or backend
  2. Wait: stream complete + tool status indicator if shown
  3. Assert: probe order/ticket row, not assistant claim

await sendTurn(page, 'Create support ticket for billing issue');
await expect.poll(async () => {
  const res = await request.get(`/api/test/probe-tickets?runId=${runId}`);
  return (await res.json()).count;
}).toBe(1);

Cover failure paths: tool timeout, 403, validation error—assistant should surface error state; probe must show no partial write.

Safety, refusal, and negative prompts

Offline golden sets should include jailbreak and policy-violation prompts with expected refusal behavior. One E2E per critical boundary:

  • Medical/legal/financial disclaimers present when required
  • No raw PII from another user's retrieval
  • Refusal does not leak system prompt

Use AIMock returning { "refusal": true, "category": "policy" } and assert UI shows safe fallback, not empty bubble.
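The empty-bubble and system-prompt-leak checks above can be expressed as one test-side helper run against the rendered bubble text. This is a sketch under assumptions: `checkRefusalBubble` is a hypothetical name, and the verbatim-window leak check is deliberately naive (it will not catch paraphrased leaks, and it skips the leak check when the system prompt is shorter than the window).

```typescript
interface RefusalCheck {
  ok: boolean;
  problems: string[];
}

// Inspects the text of a rendered refusal bubble.
// minLeakLength: how many consecutive system-prompt characters count as a leak.
function checkRefusalBubble(
  bubbleText: string,
  systemPrompt: string,
  minLeakLength: number = 24,
): RefusalCheck {
  const problems: string[] = [];
  const text = bubbleText.trim();

  // An empty or whitespace-only bubble must not pass as a valid refusal.
  if (text.length === 0) {
    problems.push('empty bubble');
  }

  // Slide a fixed-size window over the system prompt and look for verbatim reuse.
  for (let i = 0; i + minLeakLength <= systemPrompt.length; i++) {
    if (text.includes(systemPrompt.slice(i, i + minLeakLength))) {
      problems.push('system prompt leak');
      break;
    }
  }

  return { ok: problems.length === 0, problems };
}
```

A spec could read the last assistant bubble's text, pass it through this helper, and assert `ok` alongside a structural check that the fallback action (for example a "Talk to human" button) is visible.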

Streaming and loading states

Chat UIs often stream tokens—see streaming AI responses. Minimum bar:

  • Loading indicator visible before first token
  • Send button disabled while streaming
  • Final message marked complete before structural asserts
  • Abort/stop clears in-flight state
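The minimum bar above maps onto a small state machine. The sketch below is one possible model, not a prescribed implementation: the phase and event names are assumptions, and `uiFlags` shows the attributes a test could assert on (spinner, disabled send button, aria-busy, data-stream-complete) instead of token timing.

```typescript
type ChatPhase = 'idle' | 'loading' | 'streaming' | 'complete' | 'aborted';
type ChatEvent = 'send' | 'first_token' | 'done' | 'abort';

function isInFlight(phase: ChatPhase): boolean {
  return phase === 'loading' || phase === 'streaming';
}

// Pure transition function: given the current phase and an event, return the next phase.
function nextPhase(phase: ChatPhase, event: ChatEvent): ChatPhase {
  switch (event) {
    case 'send':
      // Ignore sends while a response is in flight (guards against double submit).
      return isInFlight(phase) ? phase : 'loading';
    case 'first_token':
      return phase === 'loading' ? 'streaming' : phase;
    case 'done':
      return isInFlight(phase) ? 'complete' : phase;
    case 'abort':
      // Stop clears in-flight state so the user can send again.
      return isInFlight(phase) ? 'aborted' : phase;
  }
}

// Derived flags that the UI would render and a test could assert on.
function uiFlags(phase: ChatPhase) {
  return {
    showSpinner: phase === 'loading',
    sendDisabled: isInFlight(phase),
    ariaBusy: isInFlight(phase),
    streamComplete: phase === 'complete',
  };
}
```

Keeping the transitions pure makes each bullet unit-testable; the E2E spec then only needs to confirm that the rendered attributes follow these flags.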

CI checklist

  1. AIMock enabled for default PR job; real model evals on prompt file changes
  2. Unique runId per worker; isolated threads and orders
  3. Seed routes disabled in production (NODE_ENV=production)
  4. Probe endpoints return authoritative status—not cached client state
  5. No waitForTimeout after send—wait on data-stream-complete or network
  6. Link SmartTests to markdown scenarios via // @Scenario:
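Items 2 and 3 of the checklist can be sketched as two small helpers. `ENABLE_TEST_ROUTES` is a hypothetical flag and the runId format is arbitrary; only `NODE_ENV` comes from the checklist above.

```typescript
// Decide whether seed/probe routes may be registered.
// Fails closed: production never exposes them, even if the flag is set by mistake.
function seedRoutesEnabled(env: Record<string, string | undefined>): boolean {
  if (env['NODE_ENV'] === 'production') {
    return false;
  }
  // ENABLE_TEST_ROUTES is an assumed opt-in flag, not a standard variable.
  return env['ENABLE_TEST_ROUTES'] === 'true';
}

// Build a run id that is unique per worker so threads and orders never collide.
// The worker index keeps parallel workers apart; the random suffix keeps reruns apart.
function makeRunId(workerIndex: number, now: number = Date.now()): string {
  const suffix = Math.random().toString(36).slice(2, 8);
  return `run-${now.toString(36)}-w${workerIndex}-${suffix}`;
}
```

In a Playwright fixture, `makeRunId(testInfo.workerIndex)` would give each worker its own id, and the server could call `seedRoutesEnabled(process.env)` before registering any `/api/test/*` route.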

Anti-patterns

| Anti-pattern | Why it fails | Better approach |
| --- | --- | --- |
| Snapshot full assistant prose | Breaks on benign rephrase | Structure + probe |
| Real LLM every CI run | Cost, flake, rate limits | AIMock + offline evals |
| No tool side-effect assert | UI lies "Done" | Probe refund/ticket created |
| Single happy-path only | Miss refusal/regression | Golden eval set for safety intents |
| Exact string on i18n | Locale changes copy | Intent probe + button roles |
| Shared chat thread in CI | Cross-test pollution | Per-run threadId |
| Assert on first stream token | Partial content | Wait stream-complete |
| page.waitForTimeout(5000) | Still racing | expect.poll on probe |

Example scenario

Situation: User asks chatbot to cancel order #12345.

Expected outcome: Refund initiated or policy denial with clear UI state—no silent failure.

Why UI-only automation breaks: Assistant says "Done" but probe shows order still active.

  1. Arrange: AIMock returns deterministic refund confirmation; seed order 12345 with status active.
  2. Act: Send cancel request in chat UI; confirm if prompted.
  3. Assert: Probe order status refunded OR policy_denied flag set; UI shows matching action button state.

TestChimp workflow: Link refund and ticket scenarios in markdown plans; ExploreChimp on high-traffic chat paths; /testchimp evolve when new intents appear in production.

Same Arrange/Act/Assert pattern as expired-coupon checkout.

Evals vs E2E: when each layer helps

| Layer | Best for | Limitations |
| --- | --- | --- |
| Offline evals (golden Q&A sets, LLM-as-judge, RAG recall/precision) | Prompt/model regression, citation accuracy, safety refusals, cost-efficient CI on hundreds of intents | Does not catch UI integration, auth gates, streaming timing, or tool side effects |
| E2E SmartTests (incl. ai.act / AIMock / ai.verify) | Full user journey, tool execution, permissions, probe Assert on DB state | Slower; non-deterministic without AIMock/fixtures |
| Hybrid (industry standard) | Evals gate prompt/retrieval changes; E2E gates release integration | Requires discipline to link eval failures to new scenarios |

Offline evals: maintain CSV/JSON golden sets (question, expected_intent, must_cite_doc_ids, must_refuse). Run on every PR touching prompts, system instructions, or retrieval config. Use LLM-as-judge only after calibrating against human labels—freeze judge prompt version in repo.
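A golden case with those four fields can be scored deterministically before any LLM-as-judge step. The sketch below assumes the system under test exposes its intent, cited document ids, and a refusal flag as structured output; `ModelOutput`, `scoreCase`, and `passRate` are hypothetical names, not part of any eval tool.

```typescript
// One row of the golden set, using the field names from the paragraph above.
interface GoldenCase {
  question: string;
  expected_intent: string;
  must_cite_doc_ids: string[];
  must_refuse: boolean;
}

// Assumed structured output captured from the system under test.
interface ModelOutput {
  intent: string;
  cited_doc_ids: string[];
  refused: boolean;
}

// Returns a list of failure reasons; an empty list means the case passed.
function scoreCase(golden: GoldenCase, out: ModelOutput): string[] {
  const failures: string[] = [];
  if (golden.must_refuse !== out.refused) {
    failures.push(golden.must_refuse ? 'expected refusal' : 'unexpected refusal');
  }
  // Intent and citations only matter when an answer is expected.
  if (!golden.must_refuse) {
    if (out.intent !== golden.expected_intent) {
      failures.push(`intent ${out.intent} != ${golden.expected_intent}`);
    }
    const cited = new Set(out.cited_doc_ids);
    const missing = golden.must_cite_doc_ids.filter((id) => !cited.has(id));
    if (missing.length > 0) {
      failures.push(`missing citations: ${missing.join(', ')}`);
    }
  }
  return failures;
}

// Fraction of cases with no failures; an empty run scores 0 so it cannot pass a gate.
function passRate(results: string[][]): number {
  if (results.length === 0) return 0;
  return results.filter((r) => r.length === 0).length / results.length;
}
```

Checks like these cover the mechanical fields; a calibrated judge prompt would still be needed for answer quality and tone.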

E2E with AIMock: one happy path per critical tool (refund, ticket, booking) plus auth-negative cases. Use ai.act/ai.verify for volatile bubble layout; probes for truth.

When evals alone suffice: copy/tone tweaks on stable intents with no new tools or permissions.

When E2E is mandatory: any flow where assistant text can lie about backend state, or where OAuth/session gates tool calls.

TestChimp does not ship eval tooling—this layering is how mature AI product teams combine both. When new intents appear in production or plan gaps surface after deploy, /testchimp evolve expands scenarios evals did not anticipate.

Connect scenarios to your QA workflow

Capture business rules in markdown test plans and enforce them with seed routes and probe Assert. Link SmartTests with // @Scenario: for requirement traceability. Use /testchimp test on PRs; /testchimp explore on SmartTest paths for non-functional gaps (ExploreChimp).

Frequently asked questions

How do I test a chatbot when the response text changes every time?

Use AIMock or stub the LLM with golden responses for E2E integration tests; run offline evals on prompt/version changes for semantic quality. Assert on structured UI state (buttons, citations, stream-complete flags) and probe side effects—not exact prose.

Should I use snapshot testing for chat UI?

Avoid full-message snapshots—they break on benign copy changes. Snapshot structured cards, action buttons, or JSON tool outputs instead. Use probes for business outcomes.

When are offline LLM evals enough without E2E?

Evals suffice for prompt regression on stable intents with no new tools or permissions. E2E is required when auth, tool side effects, streaming timing, or cross-tenant isolation matter. Most production teams use both.

How do I test multi-turn conversation context?

Seed thread id via Arrange, send scripted turns with AIMock returning deterministic follow-ups per turn index, then probe final state. Do not rely on live model creativity in CI.

Where should ai.act and ai.verify be used in chat tests?

Use them for brittle chat chrome—message layout, chip labels, semantic confirmations. Keep Arrange on seed routes and Assert on probes. Never replace probe Assert with ai.verify alone for refunds or tickets.

How do I test assistant tool calls without hitting production APIs?

Stub tool endpoints in test env with seed routes; AIMock plans which tool to call. E2E verifies dispatch and probe side effects; offline evals verify tool selection on prompt changes.

New conversation intents appear weekly—how do we keep tests current?

Document intents and tools in markdown test plans; link SmartTests with // @Scenario:. Run /testchimp evolve after deploy when plan gaps or production behaviour surface new paths. Add AIMock fixtures for critical tools; expand golden eval sets for quality regression.

We mock all API routes in chat tests—is that enough?

Not for integration confidence: mocking every route hides tool and auth bugs. See [over-mocking gotcha](/guides/gotchas/over-mocking-e2e-misses-integration-bugs); use AIMock for the LLM only.

Apply these patterns in your repo

Run `/testchimp init` to connect TestChimp to your repo, then `/testchimp test` on PRs to turn these patterns into maintained SmartTests. Use `/testchimp evolve` when you want to expand coverage as your app grows.
