How to Test Chatbots and Conversational UIs
Short answer
Conversational UIs combine non-deterministic model output, multi-turn state, tool side effects, and streaming render timing, so asserting exact assistant prose in Playwright is a maintenance trap. Layer three things: offline evals (golden sets, LLM-as-judge) for prompt regression, AIMock with ai.act/ai.verify for deterministic E2E integration, and probe Asserts on orders, tickets, and permissions rather than on toast copy.
Who this is for
Teams shipping chatbots, copilots, and support assistants on web (React, Next.js, Vite) where the assistant can read account data, invoke tools, or trigger refunds—not static FAQ widgets with canned responses.
Typical stacks: OpenAI/Anthropic/Gemini APIs behind a chat UI, LangChain/LlamaIndex orchestration, custom tool routers, RAG-backed answers, or embedded Intercom-style widgets with LLM backends.
Why testing conversational UI matters
Chat failures look like "the bot was rude" but often hide serious bugs:
- Revenue loss — assistant confirms a refund or discount that never executed; user churns after "done" with no backend change.
- Security incidents — cross-tenant context bleed in multi-turn threads; tool calls run with wrong user's OAuth token.
- Support load — infinite clarification loops, empty responses after safety filter, streaming UI stuck on spinner.
- Compliance exposure — medical/financial advice without required disclaimers; PII echoed from wrong retrieval chunk.
The UI can show a polished message while tools, DB, and auth boundaries disagree. E2E must assert authoritative state via probes, not assistant wording alone.
Complexity map
| Scenario | Edge case | Why tests break | Approach |
|---|---|---|---|
| Non-deterministic text | Benign rephrase | toHaveText('exact') flakes | AIMock golden or semantic eval; assert cards/buttons |
| Multi-turn context | Thread id lost on refresh | Wrong answer turn 3 | Seed thread in Arrange; probe thread store |
| Tool call mid-chat | Async webhook lag | Assert before side effect | expect.poll on probe after tool |
| Refusal / safety | Policy block | Empty bubble passes | Offline eval + E2E negative prompt |
| Stream abort | Partial message | UI stuck loading | Wait abort/complete signal (streaming guide) |
| Empty RAG retrieval | Hallucination | False pass on fluent text | Assert "no sources" UI + probe empty hits |
| Auth gate | Anonymous vs logged-in | Tool 403 untested | Seed session; probe permission on tool |
| Suggested replies | Chip click vs free text | Different code paths | Cover both interaction modes |
| Markdown / code blocks | XSS or broken render | Snapshot noise | Assert role/structure, not HTML snapshot |
| Rate limit / 429 | Model API throttle | Random CI failure | AIMock in CI; real model in nightly eval job |
| i18n | Locale changes copy | String assert fails | Probe intent + structured UI state |
| Concurrent sends | Double submit | Duplicate tickets | Disable send during stream; probe idempotency |
Architecture: what you are actually testing
User message → UI → API route → LLM (+ tools/RAG) → stream/batch response → UI render
↓
Side effects (refund, ticket, email)
Split responsibilities:
| Layer | Validates |
|---|---|
| Offline evals | Answer quality, refusal, citation accuracy on prompt change |
| AIMock E2E | UI wiring, tool invocation, auth, probe side effects |
| Probe Assert | DB/API truth after conversation action |
Arrange: threads and users without chatting through signup
Most specs cover post-login assistant behavior, not onboarding. Pattern:
// POST /api/test/seed-chat-thread
// Body: { runId, userId, turns?: [{ role, content }] }
// Response: { threadId, accessToken? }
const { threadId } = await request.post('/api/test/seed-chat-thread', {
data: { runId, userId: seededUser.uid, turns: [] },
}).then(r => r.json());
await page.goto(`/chat?thread=${threadId}`);
Enable AIMock in test env so model calls return fixture JSON keyed by runId and turn index—real HTTP to your backend, stubbed upstream LLM.
AIMock and hybrid SmartTests
AIMock stubs model responses while exercising real routes, tool dispatch, and UI components. Pair with ai.act (semantic UI steps) and ai.verify (semantic UI checks) where selectors are volatile:
// Hybrid SmartTest — deterministic model, semantic chat UI
await ai.act('Open the support chat and ask to cancel order 12345');
await ai.verify('Assistant confirms refund initiated or shows policy denial with clear next steps');
await expect.poll(async () => {
const res = await request.get('/api/test/probe-order/12345');
return (await res.json()).status;
}, { timeout: 15_000 }).toMatch(/refund_pending|policy_denied/);
Reserve ai.act/ai.verify for brittle chat chrome (message bubbles, chip labels). Keep Arrange on seed routes and Assert on probes—never open-ended "ask the bot anything" in CI.
AIMock fixture shape
{
"turnIndex": 1,
"assistantMessage": {
"text": "I can help cancel order 12345. Confirm?",
"toolCalls": [{ "name": "lookup_order", "args": { "orderId": "12345" } }]
},
"streamChunks": ["I can help", " cancel order 12345", ". Confirm?"]
}
Map fixtures by (threadId, userMessageHash) or explicit test id in X-Test-Fixture header.
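A minimal sketch of the (threadId, userMessageHash) mapping follows. The names `ChatFixture`, `hashUserMessage`, `fixtureKey`, and `FixtureStore` are hypothetical, and the djb2 hash is an arbitrary choice for illustration; none of this is AIMock's actual API.

```typescript
// Hypothetical fixture store: a sketch, not AIMock's real API.
type ChatFixture = {
  turnIndex: number;
  assistantMessage: {
    text: string;
    toolCalls: { name: string; args: Record<string, string> }[];
  };
  streamChunks: string[];
};

// Normalize case and whitespace so cosmetic edits to the user message
// keep the same key, then hash with djb2 (non-cryptographic, assumed here).
function hashUserMessage(text: string): string {
  const normalized = text.trim().toLowerCase().replace(/\s+/g, ' ');
  let hash = 5381;
  for (let i = 0; i < normalized.length; i++) {
    hash = ((hash << 5) + hash + normalized.charCodeAt(i)) | 0;
  }
  return (hash >>> 0).toString(16);
}

function fixtureKey(threadId: string, userMessage: string): string {
  return threadId + ':' + hashUserMessage(userMessage);
}

class FixtureStore {
  private byKey: Record<string, ChatFixture | undefined> = {};

  register(threadId: string, userMessage: string, fixture: ChatFixture): void {
    this.byKey[fixtureKey(threadId, userMessage)] = fixture;
  }

  // Throw on a miss so an unmapped prompt fails loudly instead of
  // falling through to a live model call.
  lookup(threadId: string, userMessage: string): ChatFixture {
    const key = fixtureKey(threadId, userMessage);
    const fixture = this.byKey[key];
    if (!fixture) throw new Error('No fixture registered for ' + key);
    return fixture;
  }
}
```

Throwing on a missing key is a deliberate choice in this sketch: a prompt without a fixture then shows up as a test failure rather than as an unplanned request to the real model.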
Multi-turn conversations
Script turns explicitly—do not rely on model creativity in CI:
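import type { Page } from '@playwright/test';
async function sendTurn(page: Page, text: string) {
  // Count completed messages before sending so the wait cannot match an earlier turn.
  const completed = page.locator('[data-stream-complete="true"]');
  const before = await completed.count();
  await page.getByRole('textbox', { name: /message/i }).fill(text);
  await page.keyboard.press('Enter');
  await completed.nth(before).waitFor({ timeout: 30_000 });
}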
await sendTurn(page, 'I need help with my order');
await ai.verify('Assistant asks for order id or shows order list');
await sendTurn(page, 'Order 98765');
await expect.poll(() => probeThreadMentions('98765')).toBe(true);
With AIMock, return deterministic follow-ups per turnIndex so ai.verify stays stable.
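To illustrate deterministic follow-ups per turn index, here is a small sketch of a scripted responder. `ScriptedReply` and `createScriptedResponder` are hypothetical names used only for this example; how AIMock implements turn indexing is not described in this guide.

```typescript
// Hypothetical scripted responder: a sketch of per-turn determinism, not AIMock's API.
type ScriptedReply = { text: string; toolCalls: string[] };

function createScriptedResponder(replies: ScriptedReply[]) {
  let turnIndex = 0;
  return {
    // Return the canned reply for the current turn, then advance.
    // Running past the script throws so a test never gets an improvised answer.
    reply(): ScriptedReply {
      if (turnIndex >= replies.length) {
        throw new Error('No scripted reply for turn ' + turnIndex);
      }
      const current = replies[turnIndex];
      turnIndex += 1;
      return current;
    },
    turnsServed(): number {
      return turnIndex;
    },
  };
}

// Usage: two scripted turns matching the order-help conversation above.
const orderHelpScript = createScriptedResponder([
  { text: 'Sure. Which order id?', toolCalls: [] },
  { text: 'Found order 98765.', toolCalls: ['lookup_order'] },
]);
```

Because each turn has exactly one canned reply, a semantic check on turn 1 sees the same text on every run, and an unexpected third user message fails immediately.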
Structured UI vs prose asserts
Prefer asserting machine-readable UI state:
| Assert on | Avoid |
|---|---|
| Action buttons (Refund, Talk to human) | Full paragraph text |
| Citation chips with data-doc-id | Wording of summary |
| Tool result cards (JSON preview) | Random synonym in reply |
| aria-busy during stream | First token timing |
await expect(page.getByTestId('assistant-message').last())
.toHaveAttribute('data-stream-complete', 'true');
await expect(page.getByRole('button', { name: /confirm refund/i })).toBeVisible();
Tool calls and side effects
When the assistant invokes tools (refund, ticket create, calendar hold):
- Act — user message triggers tool in UI or backend
- Wait — stream complete + tool status indicator if shown
- Assert — probe order/ticket row, not assistant claim
await sendTurn(page, 'Create support ticket for billing issue');
await expect.poll(async () => {
const res = await request.get(`/api/test/probe-tickets?runId=${runId}`);
return (await res.json()).count;
}).toBe(1);
Cover failure paths: tool timeout, 403, validation error—assistant should surface error state; probe must show no partial write.
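The "no partial write" expectation can be sketched with an in-memory stand-in for the ticket table. Everything below (`TicketStore`, `createTicketTool`, the input flags) is a hypothetical model for illustration, not a real backend: the point is that every failure path returns before the store is touched, which is what the probe should confirm.

```typescript
// Hypothetical in-memory ticket store and tool wrapper, for illustration only.
type Ticket = { id: string; runId: string; subject: string };
type ToolOutcome =
  | { ok: true; ticket: Ticket }
  | { ok: false; error: 'forbidden' | 'validation' | 'timeout' };

class TicketStore {
  private rows: Ticket[] = [];

  insert(ticket: Ticket): void {
    this.rows.push(ticket);
  }

  // Stand-in for the probe endpoint: counts rows for one test run.
  countForRun(runId: string): number {
    return this.rows.filter((row) => row.runId === runId).length;
  }
}

function createTicketTool(
  store: TicketStore,
  input: { runId: string; subject: string; userCanCreate: boolean; upstreamTimedOut: boolean },
): ToolOutcome {
  // Run every check before touching the store, so a failure leaves no row behind.
  if (!input.userCanCreate) return { ok: false, error: 'forbidden' };
  if (input.subject.trim() === '') return { ok: false, error: 'validation' };
  if (input.upstreamTimedOut) return { ok: false, error: 'timeout' };

  const ticket: Ticket = {
    id: input.runId + '-' + (store.countForRun(input.runId) + 1),
    runId: input.runId,
    subject: input.subject,
  };
  store.insert(ticket);
  return { ok: true, ticket };
}
```

In an E2E spec the equivalent check is a probe count of 0 after each failure case and 1 after the happy path.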
Safety, refusal, and negative prompts
Offline golden sets should include jailbreak and policy-violation prompts with expected refusal behavior. One E2E per critical boundary:
- Medical/legal/financial disclaimers present when required
- No raw PII from another user's retrieval
- Refusal does not leak system prompt
Use AIMock returning { "refusal": true, "category": "policy" } and assert UI shows safe fallback, not empty bubble.
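A sketch of the mapping the UI needs for that fixture is shown below. `MockModelResponse`, `BubbleState`, `toBubbleState`, and the fallback copy are all assumptions made for this example; only the `{ refusal, category }` shape comes from the fixture above.

```typescript
// Hypothetical response-to-bubble mapping; only { refusal, category } comes from the guide.
type MockModelResponse =
  | { refusal: true; category: string }
  | { refusal: false; text: string };

type BubbleState =
  | { kind: 'message'; text: string }
  | { kind: 'fallback'; text: string; category: string };

// Placeholder copy; real wording would come from the product.
const FALLBACK_COPY = 'I cannot help with that request. You can ask to talk to a human.';

function toBubbleState(response: MockModelResponse): BubbleState {
  if (response.refusal === true) {
    return { kind: 'fallback', text: FALLBACK_COPY, category: response.category };
  }
  // Treat an empty or whitespace-only completion like a refusal so the
  // UI never renders a blank bubble.
  if (response.text.trim() === '') {
    return { kind: 'fallback', text: FALLBACK_COPY, category: 'empty' };
  }
  return { kind: 'message', text: response.text };
}
```

With a mapping like this, the E2E assertion can target the fallback element's role or test id rather than the refusal wording.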
Streaming and loading states
Chat UIs often stream tokens—see streaming AI responses. Minimum bar:
- Loading indicator visible before first token
- Send button disabled while streaming
- Final message marked complete before structural asserts
- Abort/stop clears in-flight state
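The four checks above can be read as a small state machine. The sketch below is one hypothetical way to model it (`ChatStreamPhase`, `ChatStreamEvent`, `nextPhase`, and `viewFor` are illustrative names, not part of any library); a real UI may track more states.

```typescript
// Hypothetical stream lifecycle model for a chat composer.
type ChatStreamPhase = 'idle' | 'loading' | 'streaming' | 'complete' | 'aborted';
type ChatStreamEvent = 'send' | 'first_token' | 'done' | 'abort';

function nextPhase(phase: ChatStreamPhase, event: ChatStreamEvent): ChatStreamPhase {
  const inFlight = phase === 'loading' || phase === 'streaming';
  // A send while a reply is in flight is ignored (double-submit guard).
  if (event === 'send') return inFlight ? phase : 'loading';
  if (event === 'first_token') return phase === 'loading' ? 'streaming' : phase;
  if (event === 'done') return inFlight ? 'complete' : phase;
  // Remaining event is 'abort': only an in-flight reply can be aborted.
  return inFlight ? 'aborted' : phase;
}

// Derive the UI flags the checklist asserts on.
function viewFor(phase: ChatStreamPhase) {
  const inFlight = phase === 'loading' || phase === 'streaming';
  return {
    showLoading: phase === 'loading', // indicator before first token
    sendDisabled: inFlight, // send button disabled while streaming
    streamComplete: phase === 'complete', // safe point for structural asserts
  };
}
```

Deriving `sendDisabled` and `streamComplete` from a single phase value keeps the flags from drifting apart, which is the failure mode behind a spinner that never clears.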
CI checklist
- AIMock enabled for default PR job; real model evals on prompt file changes
- Unique `runId` per worker; isolated threads and orders
- Seed routes disabled in production (`NODE_ENV=production`)
- Probe endpoints return authoritative status, not cached client state
- No `waitForTimeout` after send; wait on `data-stream-complete` or network
- Link SmartTests to markdown scenarios via `// @Scenario:`
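Two of the checklist items lend themselves to a short sketch: a per-worker run id and a guard that keeps seed routes off in production. The function names and the `ENABLE_TEST_ROUTES` variable are assumptions for this example; only the `NODE_ENV=production` rule comes from the checklist.

```typescript
// Hypothetical helpers for CI isolation; ENABLE_TEST_ROUTES is an assumed opt-in flag.
function makeRunId(workerIndex: number, startedAtMs: number): string {
  // Worker index keeps parallel workers apart; the timestamp keeps reruns apart.
  return 'run-w' + workerIndex + '-' + startedAtMs.toString(36);
}

function seedRoutesEnabled(env: { NODE_ENV?: string; ENABLE_TEST_ROUTES?: string }): boolean {
  // Production always wins, even if the opt-in flag is set by mistake.
  if (env.NODE_ENV === 'production') return false;
  return env.ENABLE_TEST_ROUTES === 'true';
}
```

In a Playwright suite, `testInfo.workerIndex` could supply the first argument to `makeRunId`, and the server would call `seedRoutesEnabled(process.env)` before registering any `/api/test/*` route.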
Anti-patterns
| Anti-pattern | Why it fails | Better approach |
|---|---|---|
| Snapshot full assistant prose | Breaks on benign rephrase | Structure + probe |
| Real LLM every CI run | Cost, flake, rate limits | AIMock + offline evals |
| No tool side-effect assert | UI lies "Done" | Probe refund/ticket created |
| Single happy-path only | Miss refusal/regression | Golden eval set for safety intents |
| Exact string on i18n | Locale changes copy | Intent probe + button roles |
| Shared chat thread in CI | Cross-test pollution | Per-run threadId |
| Assert on first stream token | Partial content | Wait stream-complete |
| page.waitForTimeout(5000) | Still racing | expect.poll on probe |
Example scenario
Situation: User asks chatbot to cancel order #12345.
Expected outcome: Refund initiated or policy denial with clear UI state—no silent failure.
Why UI-only automation breaks: Assistant says "Done" but probe shows order still active.
- Arrange: AIMock returns deterministic refund confirmation; seed order 12345 with status active.
- Act: Send cancel request in chat UI; confirm if prompted.
- Assert: Probe order status refunded OR policy_denied flag set; UI shows matching action button state.
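The Assert step amounts to cross-checking the probe against the UI. The sketch below assumes a probe shape with a `status` field and a `policyDenied` flag, and a UI state with two visibility flags; these shapes and the name `findCancelMismatches` are hypothetical.

```typescript
// Hypothetical probe and UI shapes for the cancel-order scenario.
type OrderProbe = { status: 'active' | 'refund_pending' | 'refunded'; policyDenied: boolean };
type CancelUiState = { refundConfirmationVisible: boolean; denialNoticeVisible: boolean };

// Returns a list of disagreements; an empty list means probe and UI agree.
function findCancelMismatches(probe: OrderProbe, ui: CancelUiState): string[] {
  const problems: string[] = [];
  const refundRecorded = probe.status !== 'active';

  if (!refundRecorded && !probe.policyDenied) {
    problems.push('silent failure: order still active and no policy denial recorded');
  }
  if (ui.refundConfirmationVisible && !refundRecorded) {
    problems.push('UI confirms a refund the backend never recorded');
  }
  if (refundRecorded && !ui.refundConfirmationVisible) {
    problems.push('backend recorded a refund but UI shows no confirmation');
  }
  if (ui.denialNoticeVisible !== (probe.policyDenied && !refundRecorded)) {
    problems.push('denial notice does not match the policy_denied flag');
  }
  return problems;
}
```

The "Done but still active" case from this scenario produces two entries: the silent failure and the unbacked UI confirmation.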
TestChimp workflow: Link refund and ticket scenarios in markdown plans; ExploreChimp on high-traffic chat paths; /testchimp evolve when new intents appear in production.
Same Arrange/Act/Assert pattern as expired-coupon checkout.
Evals vs E2E: when each layer helps
| Layer | Best for | Limitations |
|---|---|---|
| Offline evals (golden Q&A sets, LLM-as-judge, RAG recall/precision) | Prompt/model regression, citation accuracy, safety refusals, cost-efficient CI on hundreds of intents | Does not catch UI integration, auth gates, streaming timing, or tool side effects |
| E2E SmartTests (incl. ai.act / AIMock / ai.verify) | Full user journey, tool execution, permissions, probe Assert on DB state | Slower; non-deterministic without AIMock/fixtures |
| Hybrid (industry standard) | Evals gate prompt/retrieval changes; E2E gates release integration | Requires discipline to link eval failures to new scenarios |
Offline evals: maintain CSV/JSON golden sets (question, expected_intent, must_cite_doc_ids, must_refuse). Run on every PR touching prompts, system instructions, or retrieval config. Use LLM-as-judge only after calibrating against human labels—freeze judge prompt version in repo.
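As a rough illustration of grading against those golden-set columns, the sketch below checks one model outcome per row without an LLM judge. The `ModelOutcome` shape and the function names are assumptions; only the four golden-set field names come from the text above.

```typescript
// Golden-set row using the field names above; ModelOutcome is a hypothetical shape.
type GoldenCase = {
  question: string;
  expected_intent: string;
  must_cite_doc_ids: string[];
  must_refuse: boolean;
};
type ModelOutcome = { intent: string; citedDocIds: string[]; refused: boolean };

// Returns the list of failed checks for one row; empty means pass.
function gradeGoldenCase(golden: GoldenCase, outcome: ModelOutcome): string[] {
  const failures: string[] = [];
  if (golden.must_refuse) {
    // For refusal rows only the refusal matters; intent and citations are skipped.
    if (!outcome.refused) failures.push('expected refusal');
    return failures;
  }
  if (outcome.refused) failures.push('unexpected refusal');
  if (outcome.intent !== golden.expected_intent) {
    failures.push('intent: expected ' + golden.expected_intent + ', got ' + outcome.intent);
  }
  for (const docId of golden.must_cite_doc_ids) {
    if (outcome.citedDocIds.indexOf(docId) === -1) failures.push('missing citation: ' + docId);
  }
  return failures;
}

// Fraction of rows with no failures; 0 for an empty set.
function passRate(results: string[][]): number {
  if (results.length === 0) return 0;
  const passed = results.filter((failures) => failures.length === 0).length;
  return passed / results.length;
}
```

Deterministic checks like these cover intent, citation, and refusal columns; an LLM judge would only be needed for qualities they cannot express, such as tone.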
E2E with AIMock: one happy path per critical tool (refund, ticket, booking) plus auth-negative cases. Use ai.act/ai.verify for volatile bubble layout; probes for truth.
When evals alone suffice: copy/tone tweaks on stable intents with no new tools or permissions.
When E2E is mandatory: any flow where assistant text can lie about backend state, or where OAuth/session gates tool calls.
TestChimp does not ship eval tooling—this layering is how mature AI product teams combine both. When new intents appear in production or plan gaps surface after deploy, /testchimp evolve expands scenarios evals did not anticipate.
Connect scenarios to your QA workflow
Capture business rules in markdown test plans and enforce them with seed routes and probe Assert. Link SmartTests with // @Scenario: for requirement traceability. Use /testchimp test on PRs; /testchimp explore on SmartTest paths for non-functional gaps (ExploreChimp).
Related scenarios
- AI-powered web apps — hybrid patterns across AI surfaces
- AI agent workflows — multi-tool orchestration
- LLM output validation — JSON mode and schema asserts
- RAG search — citations and empty retrieval
- Streaming responses — partial render and abort
- Stripe webhooks — async side effects after tool calls
External references
- OpenAI evals guide — golden sets and grading patterns (vendor-neutral concepts apply)
- Anthropic evaluation docs — task-based eval design
- Playwright locators — roles and test ids for chat UI
- SmartTests intro — ai.act, ai.verify, AIMock in Git
Frequently asked questions
How do I test a chatbot when the response text changes every time?
Use AIMock or stub the LLM with golden responses for E2E integration tests; run offline evals on prompt/version changes for semantic quality. Assert on structured UI state (buttons, citations, stream-complete flags) and probe side effects—not exact prose.
Should I use snapshot testing for chat UI?
Avoid full-message snapshots—they break on benign copy changes. Snapshot structured cards, action buttons, or JSON tool outputs instead. Use probes for business outcomes.
When are offline LLM evals enough without E2E?
Evals suffice for prompt regression on stable intents with no new tools or permissions. E2E is required when auth, tool side effects, streaming timing, or cross-tenant isolation matter. Most production teams use both.
How do I test multi-turn conversation context?
Seed thread id via Arrange, send scripted turns with AIMock returning deterministic follow-ups per turn index, then probe final state. Do not rely on live model creativity in CI.
Where should ai.act and ai.verify be used in chat tests?
Use them for brittle chat chrome—message layout, chip labels, semantic confirmations. Keep Arrange on seed routes and Assert on probes. Never replace probe Assert with ai.verify alone for refunds or tickets.
How do I test assistant tool calls without hitting production APIs?
Stub tool endpoints in test env with seed routes; the AIMock fixture specifies which tool the assistant calls. E2E verifies dispatch and probe side effects; offline evals verify tool selection on prompt changes.
New conversation intents appear weekly—how do we keep tests current?
Document intents and tools in markdown test plans; link SmartTests with // @Scenario:. Run /testchimp evolve after deploy when plan gaps or production behaviour surface new paths. Add AIMock fixtures for critical tools; expand golden eval sets for quality regression.
We mock all API routes in chat tests—is that enough?
Not if you want integration confidence: mocking every route hides tool and auth bugs. See [over-mocking gotcha](/guides/gotchas/over-mocking-e2e-misses-integration-bugs); use AIMock for the LLM only.
Apply these patterns in your repo
Run `/testchimp init` to connect TestChimp to your repo, then `/testchimp test` on PRs to turn these patterns into maintained SmartTests. Use `/testchimp evolve` when you want to expand coverage as your app grows.