How to Test Chatbots and Conversational UIs

Short answer

Conversational UIs combine non-deterministic model output, multi-turn state, tool side effects, and streaming render timing, so asserting exact assistant prose in Playwright is a maintenance trap. Layer three things instead: offline evals (golden sets, LLM-as-judge) for prompt regression; AIMock with ai.act/ai.verify for deterministic E2E integration; and probe Asserts on orders, tickets, and permissions rather than toast copy.


Who this is for

Teams shipping chatbots, copilots, and support assistants on web (React, Next.js, Vite) where the assistant can read account data, invoke tools, or trigger refunds—not static FAQ widgets with canned responses.

Typical stacks: OpenAI/Anthropic/Gemini APIs behind a chat UI, LangChain/LlamaIndex orchestration, custom tool routers, RAG-backed answers, or embedded Intercom-style widgets with LLM backends.

Why testing conversational UI matters

Chat failures look like "the bot was rude" but often hide serious bugs:

Revenue loss — assistant confirms a refund or discount that never executed; user churns after "done" with no backend change.
Security incidents — cross-tenant context bleed in multi-turn threads; tool calls run with wrong user's OAuth token.
Support load — infinite clarification loops, empty responses after safety filter, streaming UI stuck on spinner.
Compliance exposure — medical/financial advice without required disclaimers; PII echoed from wrong retrieval chunk.

The UI can show a polished message while tools, DB, and auth boundaries disagree. E2E must assert authoritative state via probes, not assistant wording alone.

Complexity map

Scenario	Edge case	Why tests break	Approach
Non-deterministic text	Benign rephrase	`toHaveText('exact')` flakes	AIMock golden or semantic eval; assert cards/buttons
Multi-turn context	Thread id lost on refresh	Wrong answer on turn 3	Seed thread in Arrange; probe thread store
Tool call mid-chat	Async webhook lag	Assert before side effect	`expect.poll` on probe after tool
Refusal / safety	Policy block	Empty bubble passes	Offline eval + E2E negative prompt
Stream abort	Partial message	UI stuck loading	Wait for abort/complete signal (streaming guide)
Empty RAG retrieval	Hallucination	False pass on fluent text	Assert "no sources" UI + probe empty hits
Auth gate	Anonymous vs logged-in	Tool 403 untested	Seed session; probe permission on tool
Suggested replies	Chip click vs free text	Different code paths	Cover both interaction modes
Markdown / code blocks	XSS or broken render	Snapshot noise	Assert role/structure, not HTML snapshot
Rate limit / 429	Model API throttle	Random CI failure	AIMock in CI; real model in nightly eval job
i18n	Locale changes copy	String assert fails	Probe intent + structured UI state
Concurrent sends	Double submit	Duplicate tickets	Disable send during stream; probe idempotency

Architecture: what you are actually testing

User message → UI → API route → LLM (+ tools/RAG) → stream/batch response → UI render
                      ↓
                 Side effects (refund, ticket, email)

Split responsibilities:

Layer	Validates
Offline evals	Answer quality, refusal, citation accuracy on prompt change
AIMock E2E	UI wiring, tool invocation, auth, probe side effects
Probe Assert	DB/API truth after conversation action

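Arrange: threads and users without chatting through signup

Most specs cover post-login assistant behavior, not onboarding, so seed the user and thread through a test-only route instead of chatting through signup. Pattern: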

// POST /api/test/seed-chat-thread
// Body: { runId, userId, turns?: [{ role, content }] }
// Response: { threadId, accessToken? }

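const seedRes = await request.post('/api/test/seed-chat-thread', {
  data: { runId, userId: seededUser.uid, turns: [] },
});
expect(seedRes.ok()).toBeTruthy(); // fail fast if the seed route is disabled or rejects the payload
const { threadId } = await seedRes.json();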

await page.goto(`/chat?thread=${threadId}`);

Enable AIMock in the test environment so model calls return fixture JSON keyed by runId and turn index. The browser still makes real HTTP calls to your backend; only the upstream LLM is stubbed.
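
To make "keyed by runId and turn index" concrete, here is a minimal sketch of such a lookup. It is not AIMock's actual implementation: the `TurnFixture` shape, the `FixtureStore` map, and `lookupFixture` are hypothetical names chosen for illustration.

```typescript
// Hypothetical fixture store: names and shapes are assumptions for illustration.
interface TurnFixture {
  turnIndex: number;
  text: string;
}

type FixtureStore = Map<string, TurnFixture[]>; // runId -> scripted turns

// Returns the scripted reply for a run and turn, or throws so a missing
// fixture fails loudly instead of silently falling through to a live model.
function lookupFixture(store: FixtureStore, runId: string, turnIndex: number): TurnFixture {
  const turns = store.get(runId);
  if (!turns) {
    throw new Error(`No fixtures registered for runId "${runId}"`);
  }
  const match = turns.find((t) => t.turnIndex === turnIndex);
  if (!match) {
    throw new Error(`No fixture for runId "${runId}" at turn ${turnIndex}`);
  }
  return match;
}

// Example registration for one test run.
const store: FixtureStore = new Map([
  ['run-a', [
    { turnIndex: 1, text: 'Which order do you need help with?' },
    { turnIndex: 2, text: 'I can help cancel order 12345. Confirm?' },
  ]],
]);
```

Throwing on a miss is a deliberate choice in this sketch: an unregistered turn then shows up as a test failure rather than an accidental call to a real model.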

AIMock and hybrid SmartTests

AIMock stubs model responses while exercising real routes, tool dispatch, and UI components. Pair with ai.act (semantic UI steps) and ai.verify (semantic UI checks) where selectors are volatile:

// Hybrid SmartTest — deterministic model, semantic chat UI
await ai.act('Open the support chat and ask to cancel order 12345');
await ai.verify('Assistant confirms refund initiated or shows policy denial with clear next steps');
await expect.poll(async () => {
  const res = await request.get('/api/test/probe-order/12345');
  return (await res.json()).status;
}, { timeout: 15_000 }).toMatch(/refund_pending|policy_denied/);

Reserve ai.act/ai.verify for brittle chat chrome (message bubbles, chip labels). Keep Arrange on seed routes and Assert on probes, and never run open-ended "ask the bot anything" prompts in CI.

AIMock fixture shape

{
  "turnIndex": 1,
  "assistantMessage": {
    "text": "I can help cancel order 12345. Confirm?",
    "toolCalls": [{ "name": "lookup_order", "args": { "orderId": "12345" } }]
  },
  "streamChunks": ["I can help", " cancel order 12345", ". Confirm?"]
}

Map fixtures by (threadId, userMessageHash) or by an explicit test id sent in an X-Test-Fixture header.
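
One way the two mapping strategies could be combined is sketched below, under stated assumptions: an explicit header wins, otherwise a key is derived from the thread id and a hash of the normalized user message. The `FixtureRequest` shape, the key format, and the FNV-1a-style hash are illustrative choices, not something AIMock is known to prescribe.

```typescript
// Sketch only: the header name follows the text above; everything else is assumed.
interface FixtureRequest {
  headers: Record<string, string | undefined>; // header names lower-cased
  threadId: string;
  userMessage: string;
}

// 32-bit FNV-1a-style hash over UTF-16 code units, hex encoded.
// A dependency-free stand-in for whatever digest a real setup would use.
function fnv1a(input: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16).padStart(8, '0');
}

// An explicit test id wins; otherwise derive a key from thread and message.
function fixtureKey(req: FixtureRequest): string {
  const explicit = req.headers['x-test-fixture'];
  if (explicit) {
    return `id:${explicit}`;
  }
  // Normalize so whitespace or casing differences do not miss the fixture.
  const normalized = req.userMessage.trim().toLowerCase();
  return `thread:${req.threadId}:${fnv1a(normalized)}`;
}
```

Hash-based keys need no test-only header but break whenever the scripted message changes; the explicit id is more stable at the cost of plumbing the header through the client.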

Multi-turn conversations

Script turns explicitly—do not rely on model creativity in CI:

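import { expect, type Page } from '@playwright/test';
async function sendTurn(page: Page, text: string) {
  // Earlier replies already carry the flag, so wait for one more completed
  // message instead of .last() (assumes one assistant reply per turn).
  const completed = page.locator('[data-stream-complete="true"]');
  const before = await completed.count();
  await page.getByRole('textbox', { name: /message/i }).fill(text);
  await page.keyboard.press('Enter');
  await expect(completed).toHaveCount(before + 1, { timeout: 30_000 });
}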

await sendTurn(page, 'I need help with my order');
await ai.verify('Assistant asks for order id or shows order list');
await sendTurn(page, 'Order 98765');
await expect.poll(() => probeThreadMentions('98765')).toBe(true);

With AIMock, return deterministic follow-ups per turnIndex so ai.verify stays stable.

Structured UI vs prose asserts

Prefer asserting machine-readable UI state:

Assert on	Avoid
Action buttons (`Refund`, `Talk to human`)	Full paragraph text
Citation chips with `data-doc-id`	Wording of summary
Tool result cards (JSON preview)	Random synonym in reply
`aria-busy` during stream	First token timing

await expect(page.getByTestId('assistant-message').last())
  .toHaveAttribute('data-stream-complete', 'true');
await expect(page.getByRole('button', { name: /confirm refund/i })).toBeVisible();

Tool calls and side effects

When the assistant invokes tools (refund, ticket create, calendar hold):

Act — user message triggers tool in UI or backend
Wait — stream complete + tool status indicator if shown
Assert — probe order/ticket row, not assistant claim

await sendTurn(page, 'Create support ticket for billing issue');
await expect.poll(async () => {
  const res = await request.get(`/api/test/probe-tickets?runId=${runId}`);
  return (await res.json()).count;
}).toBe(1);

Cover failure paths too: tool timeout, 403, and validation error. In each case the assistant should surface an error state, and the probe must show no partial write.
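
A "no partial write" check can be expressed as a comparison of two probe reads, one taken before the failing tool call and one after. The sketch below assumes a hypothetical probe that returns simple counters; the `ProbeSnapshot` fields and `changedCounters` helper are invented for illustration.

```typescript
// Hypothetical probe snapshot: field names are assumptions, not a real API.
interface ProbeSnapshot {
  tickets: number;
  refunds: number;
  emailsQueued: number;
}

// Returns the names of counters that changed between two probe reads.
// An empty array means the failed tool call left no partial write behind.
function changedCounters(before: ProbeSnapshot, after: ProbeSnapshot): string[] {
  const keys = Object.keys(before) as (keyof ProbeSnapshot)[];
  return keys.filter((key) => before[key] !== after[key]);
}
```

In a spec, the first snapshot would be read in Arrange, the failure triggered in Act, and the second read polled in Assert with the expectation that `changedCounters` returns an empty array. Reporting the changed names makes a failure such as a queued email without a ticket easy to spot.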

Safety, refusal, and negative prompts

Offline golden sets should include jailbreak and policy-violation prompts with expected refusal behavior. One E2E per critical boundary:

Medical/legal/financial disclaimers present when required
No raw PII from another user's retrieval
Refusal does not leak system prompt

Have AIMock return { "refusal": true, "category": "policy" } and assert that the UI shows a safe fallback rather than an empty bubble.
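
The "safe fallback, not empty bubble" rule is easier to test when the app maps replies to an explicit render state. The following is a sketch of that idea only: the `refusal` and `category` fields mirror the fixture above, while `ModelReply.text`, the `Bubble` type, and `toBubble` are hypothetical.

```typescript
// Assumed reply shape: the refusal fields mirror the fixture above,
// the `text` field and the bubble types are hypothetical.
interface ModelReply {
  text?: string;
  refusal?: boolean;
  category?: string;
}

type Bubble =
  | { kind: 'message'; text: string }
  | { kind: 'fallback'; reason: string };

// Never produces an empty message bubble: refusals and blank replies
// both map to an explicit fallback state the test can assert on.
function toBubble(reply: ModelReply): Bubble {
  if (reply.refusal) {
    return { kind: 'fallback', reason: reply.category ?? 'unspecified' };
  }
  const text = (reply.text ?? '').trim();
  if (text.length === 0) {
    return { kind: 'fallback', reason: 'empty_response' };
  }
  return { kind: 'message', text };
}
```

With a state like this, the E2E assertion can target a fallback element (for example by test id) instead of checking that some text happens to be present.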

Streaming and loading states

Chat UIs often stream tokens; see the streaming AI responses guide for depth. Minimum bar:

Loading indicator visible before first token
Send button disabled while streaming
Final message marked complete before structural asserts
Abort/stop clears in-flight state
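
The four checks above amount to a small state machine. As a rough sketch, assuming a composer that tracks one in-flight request at a time, the transitions could look like this; the state and event names are invented for illustration and do not come from any particular chat library.

```typescript
// Hypothetical state model for one chat composer; names are assumptions.
type StreamState = 'idle' | 'loading' | 'streaming' | 'complete' | 'aborted';
type StreamEvent = 'send' | 'first_token' | 'done' | 'abort';

// Pure transition function. Events that make no sense in the current
// state are ignored, which is what blocks a double submit mid-stream.
function nextState(state: StreamState, event: StreamEvent): StreamState {
  const inFlight = state === 'loading' || state === 'streaming';
  switch (event) {
    case 'send':
      return inFlight ? state : 'loading';
    case 'first_token':
      return state === 'loading' ? 'streaming' : state;
    case 'done':
      // Also covers a zero-token reply, so the spinner cannot stick.
      return inFlight ? 'complete' : state;
    case 'abort':
      return inFlight ? 'aborted' : state;
  }
}

// Derived flag a UI could expose as `disabled` on send and as `aria-busy`.
function sendDisabled(state: StreamState): boolean {
  return state === 'loading' || state === 'streaming';
}
```

Each bullet then maps to one assertion: `loading` before the first token, `sendDisabled` true while in flight, `complete` before structural asserts, and `aborted` re-enabling send.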

CI checklist

AIMock enabled for default PR job; real model evals on prompt file changes
Unique runId per worker; isolated threads and orders
Seed routes disabled in production (NODE_ENV=production)
Probe endpoints return authoritative status—not cached client state
No waitForTimeout after send—wait on data-stream-complete or network
Link SmartTests to markdown scenarios via // @Scenario:
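
Two of the checklist items lend themselves to small helpers. The sketch below is one possible shape, not a prescribed one: `makeRunId` assumes the worker index is taken from Playwright's `testInfo.workerIndex`, and the `ENABLE_TEST_ROUTES` variable and the id format are hypothetical.

```typescript
// Sketch: builds a run id that differs per worker and per CI run.
// `workerIndex` would come from Playwright's `testInfo.workerIndex`;
// the id format itself is an assumption.
function makeRunId(ciRunId: string, workerIndex: number, testTitle: string): string {
  const slug = testTitle
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '-')
    .replace(/^-|-$/g, '');
  return `${ciRunId}-w${workerIndex}-${slug}`;
}

// Seed and probe routes stay off unless explicitly enabled, and never
// turn on when NODE_ENV is "production". ENABLE_TEST_ROUTES is hypothetical.
function testRoutesEnabled(env: Record<string, string | undefined>): boolean {
  if (env['NODE_ENV'] === 'production') {
    return false;
  }
  return env['ENABLE_TEST_ROUTES'] === 'true';
}
```

Requiring an explicit opt-in on top of the production check means a misconfigured staging environment with an unset NODE_ENV still does not expose seed routes by accident.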

Anti-patterns

Anti-pattern	Why it fails	Better approach
Snapshot full assistant prose	Breaks on benign rephrase	Structure + probe
Real LLM every CI run	Cost, flake, rate limits	AIMock + offline evals
No tool side-effect assert	UI lies "Done"	Probe refund/ticket created
Single happy-path only	Miss refusal/regression	Golden eval set for safety intents
Exact string on i18n	Locale changes copy	Intent probe + button roles
Shared chat thread in CI	Cross-test pollution	Per-run threadId
Assert on first stream token	Partial content	Wait stream-complete
`page.waitForTimeout(5000)`	Still racing	`expect.poll` on probe

Example scenario

Situation: User asks chatbot to cancel order #12345.

Expected outcome: Refund initiated or policy denial with clear UI state—no silent failure.

Why UI-only automation breaks: Assistant says "Done" but probe shows order still active.

Arrange: AIMock returns deterministic refund confirmation; seed order 12345 with status active.
Act: Send cancel request in chat UI; confirm if prompted.
Assert: Probe order status refunded OR policy_denied flag set; UI shows matching action button state.

TestChimp workflow: Link refund and ticket scenarios in markdown plans; ExploreChimp on high-traffic chat paths; /testchimp evolve when new intents appear in production.

Same Arrange/Act/Assert pattern as expired-coupon checkout.

Evals vs E2E: when each layer helps

Layer	Best for	Limitations
Offline evals (golden Q&A sets, LLM-as-judge, RAG recall/precision)	Prompt/model regression, citation accuracy, safety refusals, cost-efficient CI on hundreds of intents	Does not catch UI integration, auth gates, streaming timing, or tool side effects
E2E SmartTests (incl. `ai.act` / AIMock / `ai.verify`)	Full user journey, tool execution, permissions, probe Assert on DB state	Slower; non-deterministic without AIMock/fixtures
Hybrid (industry standard)	Evals gate prompt/retrieval changes; E2E gates release integration	Requires discipline to link eval failures to new scenarios

Offline evals: maintain CSV/JSON golden sets (question, expected_intent, must_cite_doc_ids, must_refuse). Run on every PR touching prompts, system instructions, or retrieval config. Use LLM-as-judge only after calibrating against human labels—freeze judge prompt version in repo.
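
As an illustration of how those golden-set columns could drive a deterministic check before any LLM-as-judge step, here is a minimal grading sketch. The `GoldenCase` fields follow the column names above; the `ObservedResult` shape and the grading rules are assumptions.

```typescript
// Field names follow the golden-set columns listed above; the observed
// result shape and the grading rules are assumptions for illustration.
interface GoldenCase {
  question: string;
  expected_intent: string;
  must_cite_doc_ids: string[];
  must_refuse: boolean;
}

interface ObservedResult {
  intent: string;
  citedDocIds: string[];
  refused: boolean;
}

// Returns a list of failure reasons; an empty list means the case passed.
function gradeCase(golden: GoldenCase, observed: ObservedResult): string[] {
  const failures: string[] = [];
  if (golden.must_refuse !== observed.refused) {
    failures.push(golden.must_refuse ? 'expected refusal' : 'unexpected refusal');
  }
  if (golden.must_refuse) {
    return failures; // intent and citations are not graded for refusals
  }
  if (golden.expected_intent !== observed.intent) {
    failures.push(`intent: expected ${golden.expected_intent}, got ${observed.intent}`);
  }
  const cited = new Set(observed.citedDocIds);
  for (const docId of golden.must_cite_doc_ids) {
    if (!cited.has(docId)) {
      failures.push(`missing citation: ${docId}`);
    }
  }
  return failures;
}
```

Structural checks like these are cheap and reproducible, so they can run on every case, leaving the calibrated judge for tone and answer quality.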

E2E with AIMock: one happy path per critical tool (refund, ticket, booking) plus auth-negative cases. Use ai.act/ai.verify for volatile bubble layout; probes for truth.

When evals alone suffice: copy/tone tweaks on stable intents with no new tools or permissions.

When E2E is mandatory: any flow where assistant text can lie about backend state, or where OAuth/session gates tool calls.

TestChimp does not ship eval tooling—this layering is how mature AI product teams combine both. When new intents appear in production or plan gaps surface after deploy, /testchimp evolve expands scenarios evals did not anticipate.

Connect scenarios to your QA workflow

Capture business rules in markdown test plans and enforce them with seed routes and probe Assert. Link SmartTests with // @Scenario: for requirement traceability. Use /testchimp test on PRs; /testchimp explore on SmartTest paths for non-functional gaps (ExploreChimp).

AI-powered web apps — hybrid patterns across AI surfaces
AI agent workflows — multi-tool orchestration
LLM output validation — JSON mode and schema asserts
RAG search — citations and empty retrieval
Streaming responses — partial render and abort
Stripe webhooks — async side effects after tool calls

External references

OpenAI evals guide — golden sets and grading patterns (vendor-neutral concepts apply)
Anthropic evaluation docs — task-based eval design
Playwright locators — roles and test ids for chat UI
SmartTests intro — ai.act, ai.verify, AIMock in Git

Frequently asked questions

How do I test a chatbot when the response text changes every time?

Use AIMock or stub the LLM with golden responses for E2E integration tests; run offline evals on prompt/version changes for semantic quality. Assert on structured UI state (buttons, citations, stream-complete flags) and probe side effects—not exact prose.

Should I use snapshot testing for chat UI?

Avoid full-message snapshots—they break on benign copy changes. Snapshot structured cards, action buttons, or JSON tool outputs instead. Use probes for business outcomes.

When are offline LLM evals enough without E2E?

Evals suffice for prompt regression on stable intents with no new tools or permissions. E2E is required when auth, tool side effects, streaming timing, or cross-tenant isolation matter. Most production teams use both.

How do I test multi-turn conversation context?

Seed thread id via Arrange, send scripted turns with AIMock returning deterministic follow-ups per turn index, then probe final state. Do not rely on live model creativity in CI.

Where should ai.act and ai.verify be used in chat tests?

Use them for brittle chat chrome—message layout, chip labels, semantic confirmations. Keep Arrange on seed routes and Assert on probes. Never replace probe Assert with ai.verify alone for refunds or tickets.

How do I test assistant tool calls without hitting production APIs?

Stub tool endpoints in test env with seed routes; AIMock plans which tool to call. E2E verifies dispatch and probe side effects; offline evals verify tool selection on prompt changes.

New conversation intents appear weekly—how do we keep tests current?

Document intents and tools in markdown test plans; link SmartTests with // @Scenario:. Run /testchimp evolve after deploy when plan gaps or production behaviour surface new paths. Add AIMock fixtures for critical tools; expand golden eval sets for quality regression.

We mock all API routes in chat tests—is that enough?

Not if you want integration confidence: mocking every route hides tool and auth bugs. See [over-mocking gotcha](/guides/gotchas/over-mocking-e2e-misses-integration-bugs); use AIMock for the LLM only.

Apply these patterns in your repo

Run `/testchimp init` to connect TestChimp to your repo, then `/testchimp test` on PRs to turn these patterns into maintained SmartTests. Use `/testchimp evolve` when you want to expand coverage as your app grows.
