How to Test AI-Powered Web Applications

Short answer

AI products introduce non-deterministic UI: streaming text, dynamic suggestions, and behavior that shifts with model updates. Use deterministic Playwright for the app shell and navigation, AIMock for stable model responses in CI, ai.act / ai.verify only where selectors or copy churn, offline evals for prompt quality, and probes for outcomes that must not drift with model wording.

Who this is for

Startups shipping chat, copilot, generative UI, or AI-assisted workflows on the web, especially when copy changes daily but business rules (billing, permissions, tool calls) must hold. This guide is not aimed at static marketing sites whose only AI surface is a chat-widget FAQ.

Typical stacks: OpenAI/Anthropic/Gemini behind API routes, tool routers, RAG pipelines, or embedded copilots in SaaS dashboards.

Why testing AI web apps matters

AI failures hide behind fluent text:

Revenue loss — copilot confirms refund or upgrade; probe shows no change; usage credits consumed without delivery.
Security incidents — tool calls run with wrong user's token; cross-tenant context in multi-turn threads.
Support load — infinite clarification loops; streaming stuck on spinner; empty responses after safety filter.
CI trap — asserting exact assistant prose breaks on benign rephrase though behavior is correct.

The UI can show "Done!" while tools, DB, and billing disagree. E2E must assert authoritative state via probes—not model phrasing alone.

Complexity map

Scenario	Edge case	Why tests break	Approach
Non-deterministic text	Benign rephrase	`toHaveText('exact')` flakes	AIMock + ai.verify category
Streaming response	Partial tokens	Assert too early	Wait for complete signal (streaming guide)
Tool invocation	Async side effect	Assert before DB write	`expect.poll` probe
Token / usage limit	Over quota	Model returns prose	Probe `status=rejected`
Refusal / safety	Policy block	Empty bubble passes	Offline eval + E2E negative
RAG empty hits	Hallucination	Fluent wrong answer	Probe empty retrieval + UI state
Auth gate	Anonymous vs logged-in	Tool 403 untested	Seed session; probe permission
Suggested replies	Chip vs free text	Different code paths	Cover both modes
Model API 429	Rate limit in CI	Random fail	AIMock in PR; real model nightly
Prompt deploy	Wording shift	String asserts fail	Eval job on golden set
Billing per request	Partial charge	Revenue leak	Probe usage increment
Concurrent sends	Double submit	Duplicate tool calls	Disable during stream; probe idempotency
Canvas / generative UI	No stable selectors	Coordinate flake	ai.act + probe (canvas guide)
Multi-turn context	Thread lost on refresh	Wrong answer turn 3	Seed thread in Arrange
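
The "Concurrent sends" row can be checked from a probe that lists the tool calls recorded for a thread. The following is a minimal sketch, assuming a hypothetical probe payload in which each recorded call carries an idempotency key. The `ToolCallRecord` shape and the `findDuplicateToolCalls` name are illustrative and not part of TestChimp or Playwright.

```typescript
// Hypothetical shape of one tool call as returned by a test-only probe route.
interface ToolCallRecord {
  tool: string;           // e.g. "refund"
  idempotencyKey: string; // assumed: the client sends one key per user submit
  createdAt: number;      // epoch millis
}

// Returns the "tool:key" pairs that were executed more than once.
// An empty result means a double submit did not produce duplicate side effects.
function findDuplicateToolCalls(calls: ToolCallRecord[]): string[] {
  const counts: Record<string, number> = {};
  for (const call of calls) {
    const key = `${call.tool}:${call.idempotencyKey}`;
    counts[key] = (counts[key] ?? 0) + 1;
  }
  return Object.keys(counts)
    .filter((key) => (counts[key] ?? 0) > 1)
    .sort();
}
```

A spec could send the same prompt twice in quick succession, read the probe, and expect this function to return an empty array.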

Architecture: three layers

User → UI → API → LLM (+ tools/RAG) → stream/batch → UI render
                      ↓
                 Side effects (DB, billing, email)

Layer	Validates	When
Offline evals	Answer quality, refusal, citations	Prompt/system change PRs
AIMock E2E	UI wiring, tools, auth, probes	Every PR in CI
Probe Assert	DB/API truth after action	Always for side effects

Do not run expensive LLM-as-judge on every Playwright spec—reserve evals for prompt regression; use AIMock for integration wiring.

Evals vs E2E: split responsibilities

Concern	Offline evals	E2E SmartTests
Answer quality / tone	Golden questions + judge or rubric	Not exact prose asserts
Refusal on policy prompts	Batch eval dataset	One E2E negative + probe
Tool called correctly	Can mock in eval harness	AIMock + probe side effect
Streaming UX / disabled states	Limited	E2E with complete signal
Auth + billing boundaries	Partial	E2E with probes required
Regression on prompt v2	Primary home	Smoke E2E only

Rule: E2E proves the product wired correctly (tools fire, permissions hold, UI states transition). Evals prove the model behaves acceptably on representative inputs.

Run evals in a separate CI job or nightly; gate prompt PRs on eval thresholds. Keep Playwright fast with AIMock.
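
Gating a prompt PR on eval thresholds can be as simple as computing a pass rate per category and comparing it with a minimum. This is a sketch under assumptions: the `EvalResult` fields, the category names, and the threshold values are hypothetical, and the judging step that produces `passed` is out of scope here.

```typescript
// One judged golden-set example. `category` and `passed` are assumed fields
// produced by whatever rubric or judge the eval job uses.
interface EvalResult {
  category: string; // e.g. "answer_quality", "refusal"
  passed: boolean;
}

// Minimum pass rate per category, e.g. { refusal: 0.95 }.
type Thresholds = Record<string, number>;

// Returns human-readable failures; an empty array means the gate passes.
function evalGateFailures(results: EvalResult[], thresholds: Thresholds): string[] {
  const failures: string[] = [];
  for (const category of Object.keys(thresholds).sort()) {
    const min = thresholds[category] ?? 0;
    const inCategory = results.filter((r) => r.category === category);
    if (inCategory.length === 0) {
      // A category with a threshold but no data should fail, not pass silently.
      failures.push(`${category}: no eval results`);
      continue;
    }
    const rate = inCategory.filter((r) => r.passed).length / inCategory.length;
    if (rate < min) {
      failures.push(`${category}: pass rate ${rate.toFixed(2)} below ${min}`);
    }
  }
  return failures;
}
```

The eval job would exit non-zero when the returned array is non-empty, which keeps this gate separate from the Playwright job.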

AIMock and hybrid SmartTests

AIMock stubs upstream model responses while exercising real routes, tool dispatch, and UI:

// env: AIMOCK=1 maps prompt patterns to fixtures/ai/refund-confirm.json

// Act: drive the chat UI semantically, since its selectors churn.
await ai.act('Open copilot and ask to cancel order 12345');
// Verify the outcome category, not the exact wording.
await ai.verify('Assistant confirms refund initiated or shows policy denial with clear next steps');

// Assert: poll the probe until the async tool side effect reaches the database.
await expect.poll(async () => {
  const res = await request.get('/api/test/probe-order/12345');
  return (await res.json()).status;
}, { timeout: 15_000 }).toMatch(/refund_pending|policy_denied/);

ai.act — semantic UI when selectors churn (chat input, dynamic panels)
ai.verify — semantic UI outcome category—not exact tokens
Probe — authoritative order status, credits, flags

See conversational UI guide for thread seeds and multi-turn patterns.
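
The idea of mapping prompt patterns to fixtures, mentioned in the code comment above, can be illustrated without any mock library. The sketch below is not AIMock's implementation or configuration format: `FixtureRule`, `resolveFixture`, and the `summary.json` path are hypothetical names used only to show first-match-wins resolution.

```typescript
// Hypothetical rule: if the prompt matches `pattern`, replay `fixture`.
interface FixtureRule {
  pattern: RegExp;
  fixture: string; // path to a recorded response
}

// First matching rule wins, so order rules from most to least specific.
// Returns undefined when nothing matches so the caller can fail loudly
// instead of silently falling through to a real model call.
function resolveFixture(prompt: string, rules: FixtureRule[]): string | undefined {
  for (const rule of rules) {
    if (rule.pattern.test(prompt)) {
      return rule.fixture;
    }
  }
  return undefined;
}

// Example rule set; the second fixture path is invented for illustration.
const exampleRules: FixtureRule[] = [
  { pattern: /cancel order \d+/i, fixture: "fixtures/ai/refund-confirm.json" },
  { pattern: /summari[sz]e/i, fixture: "fixtures/ai/summary.json" },
];
```

Treating an unmatched prompt as a test failure is a deliberate choice: it keeps PR runs deterministic and surfaces new prompts that still need a fixture.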

Arrange: threads, documents, limits

// POST /api/test/seed-ai-context
// Body: { runId, documentTokens, userCredits, threadId? }

const { threadId } = await request.post('/api/test/seed-ai-context', {
  data: { runId, documentTokens: 128_000, userCredits: 10 },
}).then(r => r.json());

await page.goto(`/copilot?thread=${threadId}`);

Seed boundary conditions (at token limit, zero credits) without chatting through onboarding in every spec.
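
Boundary seeds can be generated from one helper so every spec starts exactly at the edge it tests. This is a sketch that reuses the body fields from the seed route comment above. The `tokenLimit` parameter, the seed names, and the default credit values are assumptions for illustration.

```typescript
// Body accepted by the seed route shown above (fields taken from its comment).
interface SeedAiContext {
  runId: string;
  documentTokens: number;
  userCredits: number;
  threadId?: string;
}

interface NamedSeed {
  name: string;
  seed: SeedAiContext;
}

// Builds one seed per boundary condition.
// `tokenLimit` is an assumed parameter: the product's per-request token cap.
function boundarySeeds(runId: string, tokenLimit: number): NamedSeed[] {
  return [
    { name: "at_token_limit", seed: { runId, documentTokens: tokenLimit, userCredits: 10 } },
    { name: "over_token_limit", seed: { runId, documentTokens: tokenLimit + 1, userCredits: 10 } },
    { name: "zero_credits", seed: { runId, documentTokens: 1_000, userCredits: 0 } },
  ];
}
```

Each `seed` can be posted to `/api/test/seed-ai-context` from a parameterized spec, one test per `name`.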

Probes for AI side effects

Assert what must not drift:

Side effect	Probe
Refund / cancel	Order status
Ticket created	`ticket_id`
Usage billing	`credits_consumed`
Tool denial	`policy_denied` flag
Saved artifact	Storage row id

Never assert assistant markdown HTML snapshots in CI—they change with model and CSS.

Requirement slices to cover

copilot_request_outcome — success, limit_exceeded, policy_denied, error
tool_name — refund, search, summarize, etc.
interaction_mode — chip, free_text, voice (if applicable)

When prod shows high limit_exceeded volume but tests only cover the happy-path summarize flow, add negative specs for the missing outcomes.
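
Finding those gaps amounts to a set difference between the slices seen in production and the slices exercised by tests. The sketch below is an illustration only: it does not describe how TestChimp computes coverage, and the `Slice` and `ProdSliceCount` shapes are hypothetical.

```typescript
// A requirement slice: one combination of the dimensions listed above.
interface Slice {
  outcome: string;  // copilot_request_outcome
  toolName: string; // tool_name
}

interface ProdSliceCount extends Slice {
  count: number; // events seen in production for this slice
}

function sliceKey(s: Slice): string {
  return `${s.toolName}/${s.outcome}`;
}

// Returns slices seen in prod but absent from tests, highest volume first.
function uncoveredSlices(prod: ProdSliceCount[], tested: Slice[]): ProdSliceCount[] {
  const testedKeys = tested.map(sliceKey);
  return prod
    .filter((p) => testedKeys.indexOf(sliceKey(p)) === -1) // filter copies, so sort below does not mutate `prod`
    .sort((a, b) => b.count - a.count);
}
```

Sorting by production volume puts the most frequently hit untested outcome at the top of the list.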

CI checklist

AIMock enabled for default PR Playwright job
Probes on every tool side-effect scenario
No exact assistant string asserts—semantic or probe only
Streaming specs wait for complete/abort signal
Offline eval job on prompt-changing PRs (separate from E2E)
Auth negative: anonymous cannot invoke paid tools
// @Scenario: links for AI invariants in markdown plans
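
The "no exact assistant string asserts" item can be partly automated with a source scan. The sketch below is a rough heuristic, not a real lint rule: it flags any `toHaveText` call whose argument is a string literal, and `findExactTextAsserts` is a hypothetical helper name. A production check would need to scope itself to locators that target assistant output.

```typescript
interface Finding {
  line: number; // 1-based line number in the spec source
  text: string; // trimmed offending line
}

// Heuristic: toHaveText called with a string literal (single, double, or backtick quote).
// Regex arguments such as toHaveText(/refund/i) are not flagged.
const EXACT_TEXT_ASSERT = /\.toHaveText\(\s*['"`]/;

function findExactTextAsserts(source: string): Finding[] {
  const findings: Finding[] = [];
  source.split("\n").forEach((text, index) => {
    if (EXACT_TEXT_ASSERT.test(text)) {
      findings.push({ line: index + 1, text: text.trim() });
    }
  });
  return findings;
}
```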

Anti-patterns

Anti-pattern	Why it fails	Better approach
Exact text assert on LLM output	Benign rephrase breaks CI	ai.verify category + probe
Full agentic test file	Slow, opaque	Deterministic Arrange/Assert
Real model every PR	Cost + flake	AIMock CI; eval nightly
No probe on "Done"	False success	Poll order/status
Eval-only, no E2E	Wiring bugs ship	AIMock integration specs
ai.act for login/navigation	Unnecessary variance	Playwright locators
Skip limit/refusal paths	Compliance risk	Negative seed + probe

Example scenario

Situation: User asks copilot to summarize a document over token limit.

Expected outcome: Request rejected; no partial charge; user sees limit guidance.

Why UI-only automation breaks: The test matches an exact error string, so a model rewording breaks CI even though behavior is correct.

Arrange: Seed document at limit boundary via fixture; record the billing probe baseline.
Act: Submit prompt in copilot UI.
Assert: Probe shows `status=rejected`, no usage increment; `ai.verify` optional for error category.

TestChimp workflow: Track `copilot_request` with `outcome=limit_exceeded` in prod vs tests.

Same Arrange/Act/Assert pattern as expired-coupon checkout.
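
The Assert step of this scenario reduces to comparing two probe reads: the baseline taken in Arrange and the state after the prompt was submitted. A minimal sketch follows, assuming a hypothetical `RequestProbe` shape whose field names mirror the probes named in this guide.

```typescript
// Hypothetical probe snapshot for the seeded user and request.
interface RequestProbe {
  status: string;          // e.g. "rejected", "completed"
  creditsConsumed: number; // cumulative credits consumed by the seeded user
}

// Compares the baseline taken in Arrange with the probe read in Assert.
// Returns the violated invariants; an empty array means the scenario passed.
function limitRejectionViolations(before: RequestProbe, after: RequestProbe): string[] {
  const violations: string[] = [];
  if (after.status !== "rejected") {
    violations.push(`expected status=rejected, got ${after.status}`);
  }
  const delta = after.creditsConsumed - before.creditsConsumed;
  if (delta !== 0) {
    violations.push(`expected no usage increment, got ${delta}`);
  }
  return violations;
}
```

Returning a list of violations instead of a boolean means a failing run reports every broken invariant at once, for example a rejected request that still charged credits.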

Connect scenarios to your QA workflow

Capture business rules in markdown test plans and enforce them with seed routes and probe Assert. Link SmartTests with // @Scenario: for requirement traceability. Use /testchimp test on PRs; /testchimp explore on SmartTest paths for non-functional gaps (ExploreChimp).

Conversational UI — AIMock, threads, refunds
AI streaming — partial token timing
LLM output validation — structured output
AI agent workflows — multi-step tools
Canvas interactions — generative UI
Pure agentic vs SmartTests — hybrid strategy
What is AI in QA — evals in QA strategy

Frequently asked questions

Should every step use ai.act for AI product UI?

No—default to Playwright for stable flows; add hybrid AI only where copy or layout churns. Probes assert side effects (credits, flags) independent of model wording.

What is AIMock vs mocking the whole app?

AIMock stubs upstream LLM responses while your API routes, tool router, and UI run for real—CI gets deterministic text without sacrificing integration coverage.

Evals vs E2E—which catches prompt regressions?

Evals on golden datasets catch answer quality and refusal rates. E2E with AIMock catches wiring bugs (tools not called, auth wrong). Use both on different schedules.

How do I test streaming chat without flake?

Wait on stream-complete signals (data attribute, SSE end, AIMock final chunk)—see streaming guide. Assert probes only after completion.
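
As one illustration of an "SSE end" signal, the sketch below folds raw server-sent event lines into a state object with an explicit `complete` flag. It assumes each event is a single `data:` line and that the server ends the stream with a `data: [DONE]` sentinel, a convention some LLM APIs use. Your stream may signal completion differently, and `foldSseLines` is a hypothetical helper name.

```typescript
interface StreamState {
  text: string;      // concatenated payloads received so far
  complete: boolean; // true once the terminal sentinel has been seen
}

// Folds raw SSE lines into a stream state.
// Assumption: events are single "data: <payload>" lines and the stream
// ends with "data: [DONE]".
function foldSseLines(lines: string[]): StreamState {
  const state: StreamState = { text: "", complete: false };
  for (const line of lines) {
    if (state.complete || line.indexOf("data: ") !== 0) {
      continue; // skip comments, keep-alives, and anything after the end marker
    }
    const payload = line.slice("data: ".length);
    if (payload === "[DONE]") {
      state.complete = true;
    } else {
      state.text += payload;
    }
  }
  return state;
}
```

A test would read probes only once `complete` is true, which avoids asserting against a half-rendered response.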

Can I assert exact assistant markdown?

Avoid in CI—use ai.verify for semantic category and probes for side effects. Exact text belongs in offline eval rubrics, not Playwright.

Agent-written tests failed overnight—how to recover?

SmartTests live in Git with scenario links, so the next /testchimp test run can patch the deterministic steps. Reserve ai.act for volatile panels; probes stay stable across prompt edits.

How does TrueCoverage guide AI test expansion?

Compare the tool_name and outcome values seen in prod with those covered by tests. When limit_exceeded or policy_denied dominate prod but tests only cover success, run /testchimp evolve.

Apply these patterns in your repo

Run `/testchimp init` to connect TestChimp to your repo, then `/testchimp test` on PRs to turn these patterns into maintained SmartTests. Use `/testchimp evolve` when you want to expand coverage as your app grows.
