How to Test AI-Powered Web Applications
Short answer
AI products introduce non-deterministic UI: streaming text, dynamic suggestions, and output that shifts with model updates. Use deterministic Playwright for the shell and navigation, AIMock for stable model responses in CI, ai.act / ai.verify only where copy or layout churns, offline evals for prompt quality, and probes for outcomes that must not drift with model wording.
Who this is for
Startups shipping chat, copilot, generative UI, or AI-assisted workflows on the web, especially when copy changes daily but business rules (billing, permissions, tool calls) must hold. This guide is not aimed at static marketing sites with a chat-widget FAQ.
Typical stacks: OpenAI/Anthropic/Gemini behind API routes, tool routers, RAG pipelines, or embedded copilots in SaaS dashboards.
Why testing AI web apps matters
AI failures hide behind fluent text:
- Revenue loss — copilot confirms refund or upgrade; probe shows no change; usage credits consumed without delivery.
- Security incidents — tool calls run with wrong user's token; cross-tenant context in multi-turn threads.
- Support load — infinite clarification loops; streaming stuck on spinner; empty responses after safety filter.
- CI trap — asserting exact assistant prose breaks on benign rephrase though behavior is correct.
The UI can show "Done!" while tools, DB, and billing disagree. E2E must assert authoritative state via probes—not model phrasing alone.
Complexity map
| Scenario | Edge case | Why tests break | Approach |
|---|---|---|---|
| Non-deterministic text | Benign rephrase | toHaveText('exact') flakes | AIMock + ai.verify category |
| Streaming response | Partial tokens | Assert too early | Wait complete signal (streaming guide) |
| Tool invocation | Async side effect | Assert before DB write | expect.poll probe |
| Token / usage limit | Over quota | Model returns prose | Probe status=rejected |
| Refusal / safety | Policy block | Empty bubble passes | Offline eval + E2E negative |
| RAG empty hits | Hallucination | Fluent wrong answer | Probe empty retrieval + UI state |
| Auth gate | Anonymous vs logged-in | Tool 403 untested | Seed session; probe permission |
| Suggested replies | Chip vs free text | Different code paths | Cover both modes |
| Model API 429 | Rate limit in CI | Random fail | AIMock in PR; real model nightly |
| Prompt deploy | Wording shift | String asserts fail | Eval job on golden set |
| Billing per request | Partial charge | Revenue leak | Probe usage increment |
| Concurrent sends | Double submit | Duplicate tool calls | Disable during stream; probe idempotency |
| Canvas / generative UI | No stable selectors | Coordinate flake | ai.act + probe (canvas guide) |
| Multi-turn context | Thread lost on refresh | Wrong answer turn 3 | Seed thread in Arrange |
Architecture: three layers
User → UI → API → LLM (+ tools/RAG) → stream/batch → UI render
↓
Side effects (DB, billing, email)
| Layer | Validates | When |
|---|---|---|
| Offline evals | Answer quality, refusal, citations | Prompt/system change PRs |
| AIMock E2E | UI wiring, tools, auth, probes | Every PR in CI |
| Probe Assert | DB/API truth after action | Always for side effects |
Do not run expensive LLM-as-judge on every Playwright spec—reserve evals for prompt regression; use AIMock for integration wiring.
Evals vs E2E: split responsibilities
| Concern | Offline evals | E2E SmartTests |
|---|---|---|
| Answer quality / tone | Golden questions + judge or rubric | Not exact prose asserts |
| Refusal on policy prompts | Batch eval dataset | One E2E negative + probe |
| Tool called correctly | Can mock in eval harness | AIMock + probe side effect |
| Streaming UX / disabled states | Limited | E2E with complete signal |
| Auth + billing boundaries | Partial | E2E with probes required |
| Regression on prompt v2 | Primary home | Smoke E2E only |
Rule: E2E proves the product wired correctly (tools fire, permissions hold, UI states transition). Evals prove the model behaves acceptably on representative inputs.
Run evals in a separate CI job or nightly; gate prompt PRs on eval thresholds. Keep Playwright fast with AIMock.
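Gating a prompt PR on eval thresholds can be sketched as a small pure function. This is a minimal illustration, not a TestChimp or eval-framework API: the `EvalCase` shape, the category names, and the threshold values are all assumptions.

```typescript
// Hypothetical shape of one graded golden-set case (not a real framework type).
interface EvalCase {
  id: string;
  category: string; // e.g. "quality", "refusal", "citation"
  passed: boolean;
}

interface GateResult {
  ok: boolean;
  passRates: Record<string, number>;
  failing: string[]; // categories whose pass rate is below the threshold
}

// Compute per-category pass rates and compare them with minimum thresholds.
// A category that has a threshold but no cases gets a rate of 0, so an
// empty dataset cannot pass the gate silently.
function evalGate(cases: EvalCase[], thresholds: Record<string, number>): GateResult {
  const totals: Record<string, { pass: number; all: number }> = {};
  for (const c of cases) {
    const t = totals[c.category] || { pass: 0, all: 0 };
    t.all += 1;
    if (c.passed) t.pass += 1;
    totals[c.category] = t;
  }
  const passRates: Record<string, number> = {};
  const failing: string[] = [];
  for (const category of Object.keys(thresholds)) {
    const t = totals[category];
    const rate = t ? t.pass / t.all : 0;
    passRates[category] = rate;
    if (rate < thresholds[category]) failing.push(category);
  }
  return { ok: failing.length === 0, passRates, failing };
}
```

The eval job would exit non-zero when `ok` is false, while the Playwright job stays independent of it and keeps running against AIMock.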
AIMock and hybrid SmartTests
AIMock stubs upstream model responses while exercising real routes, tool dispatch, and UI:
// env: AIMOCK=1 maps prompt patterns to fixtures/ai/refund-confirm.json
await ai.act('Open copilot and ask to cancel order 12345');
await ai.verify('Assistant confirms refund initiated or shows policy denial with clear next steps');
await expect.poll(async () => {
  const res = await request.get('/api/test/probe-order/12345');
  return (await res.json()).status;
}, { timeout: 15_000 }).toMatch(/refund_pending|policy_denied/);
- `ai.act`: semantic UI actions when selectors churn (chat input, dynamic panels)
- `ai.verify`: semantic UI outcome category, not exact tokens
- Probe: authoritative order status, credits, flags
See conversational UI guide for thread seeds and multi-turn patterns.
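The snippet's first comment mentions mapping prompt patterns to fixture files. One way such a lookup could work is sketched below. The `FixtureRule` shape and `resolveFixture` helper are hypothetical and do not describe AIMock's actual configuration format; only `fixtures/ai/refund-confirm.json` comes from the example above.

```typescript
// Hypothetical rule shape: NOT AIMock's real configuration format.
interface FixtureRule {
  pattern: RegExp; // matched against the outgoing user prompt
  fixture: string; // path to a recorded model response
}

const fixtureRules: FixtureRule[] = [
  { pattern: /cancel order \d+/i, fixture: "fixtures/ai/refund-confirm.json" },
  // Illustrative second rule; the file name is an assumption.
  { pattern: /summari[sz]e/i, fixture: "fixtures/ai/summary.json" },
];

// Return the first matching fixture. Throwing on a miss makes an unmapped
// prompt fail loudly instead of falling through to a real model call in CI.
function resolveFixture(prompt: string, table: FixtureRule[]): string {
  for (const rule of table) {
    if (rule.pattern.test(prompt)) return rule.fixture;
  }
  throw new Error(`No AI fixture mapped for prompt: ${prompt}`);
}
```

Failing on unmapped prompts is a deliberate choice in this sketch: a silent fallback to the live model would reintroduce the cost and flake that the stub is meant to remove.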
Arrange: threads, documents, limits
// POST /api/test/seed-ai-context
// Body: { runId, documentTokens, userCredits, threadId? }
const { threadId } = await request.post('/api/test/seed-ai-context', {
  data: { runId, documentTokens: 128_000, userCredits: 10 },
}).then(r => r.json());
await page.goto(`/copilot?thread=${threadId}`);
Seed boundary conditions (at token limit, zero credits) without chatting through onboarding in every spec.
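Named presets keep boundary seeds readable across specs. The sketch below reuses the body shape from the seed route comment above; the preset names, the `seedFor` helper, and the 128,000-token limit are illustrative assumptions rather than part of any real API.

```typescript
// Body shape taken from the seed route comment; everything else is assumed.
interface SeedBody {
  runId: string;
  documentTokens: number;
  userCredits: number;
  threadId?: string;
}

type Boundary = "at_token_limit" | "over_token_limit" | "zero_credits";

// Assumed limit for illustration; use the real limit of your model or plan.
const TOKEN_LIMIT = 128_000;

// Build the seed payload for one named boundary condition.
function seedFor(boundary: Boundary, runId: string): SeedBody {
  switch (boundary) {
    case "at_token_limit":
      return { runId, documentTokens: TOKEN_LIMIT, userCredits: 10 };
    case "over_token_limit":
      return { runId, documentTokens: TOKEN_LIMIT + 1, userCredits: 10 };
    case "zero_credits":
      return { runId, documentTokens: 1_000, userCredits: 0 };
  }
}
```

A spec would pass the result as the `data` of the seed request, so each negative path starts from a known state in one call.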
Probes for AI side effects
Assert what must not drift:
| Side effect | Probe |
|---|---|
| Refund / cancel | Order status |
| Ticket created | ticket_id |
| Usage billing | credits_consumed |
| Tool denial | policy_denied flag |
| Saved artifact | Storage row id |
Never assert assistant markdown HTML snapshots in CI—they change with model and CSS.
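Comparing probe snapshots taken before and after the action keeps the assertion independent of assistant wording. The following is a minimal sketch: the `ProbeSnapshot` fields echo the table above, but the exact field names and the `checkSideEffects` helper are assumptions about what a test-only probe endpoint might return.

```typescript
// Hypothetical probe payload; adjust the fields to your own probe route.
interface ProbeSnapshot {
  orderStatus: string;
  creditsConsumed: number;
  ticketId: string | null;
}

interface ExpectedEffect {
  orderStatus: string[]; // any of these statuses is acceptable
  creditsDelta: number; // exact change in consumed credits
  ticketCreated: boolean;
}

// Return a list of mismatches; an empty list means the side effects match.
function checkSideEffects(
  beforeSnap: ProbeSnapshot,
  afterSnap: ProbeSnapshot,
  expected: ExpectedEffect
): string[] {
  const problems: string[] = [];
  if (expected.orderStatus.indexOf(afterSnap.orderStatus) === -1) {
    problems.push(`order status is ${afterSnap.orderStatus}`);
  }
  const delta = afterSnap.creditsConsumed - beforeSnap.creditsConsumed;
  if (delta !== expected.creditsDelta) {
    problems.push(`credits changed by ${delta}, expected ${expected.creditsDelta}`);
  }
  const created = beforeSnap.ticketId === null && afterSnap.ticketId !== null;
  if (created !== expected.ticketCreated) {
    problems.push(`ticket created: ${created}`);
  }
  return problems;
}
```

Returning every mismatch rather than stopping at the first makes a false "Done" easier to diagnose: the failure message shows which side effects never happened.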
Requirement slices to cover
- `copilot_request_outcome`: success, limit_exceeded, policy_denied, error
- `tool_name`: refund, search, summarize, etc.
- `interaction_mode`: chip, free_text, voice (if applicable)
When production shows high limit_exceeded volume but tests cover only the happy-path summarize flow, add negative specs for the missing outcomes.
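Finding the gaps amounts to comparing production slice counts with the slices that specs exercise. The sketch below assumes production counts are available from an analytics export; the `CoverageSlice` type and `untestedSlices` helper are hypothetical and not a TestChimp API.

```typescript
// Hypothetical slice built from the three dimensions listed above.
interface CoverageSlice {
  outcome: string;
  toolName: string;
  interactionMode: string;
}

interface ProdSlice {
  slice: CoverageSlice;
  count: number; // production events observed for this slice
}

function sliceKey(s: CoverageSlice): string {
  return [s.outcome, s.toolName, s.interactionMode].join("|");
}

// Return production slices that no spec covers, highest volume first.
function untestedSlices(prod: ProdSlice[], tested: CoverageSlice[]): ProdSlice[] {
  const covered: Record<string, boolean> = {};
  for (const s of tested) covered[sliceKey(s)] = true;
  const gaps: ProdSlice[] = [];
  for (const p of prod) {
    if (!covered[sliceKey(p.slice)]) gaps.push(p);
  }
  gaps.sort((a, b) => b.count - a.count);
  return gaps;
}
```

Sorting by volume puts the most common untested outcome first, which is usually the next negative spec worth writing.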
CI checklist
- AIMock enabled for default PR Playwright job
- Probes on every tool side-effect scenario
- No exact assistant string asserts—semantic or probe only
- Streaming specs wait complete/abort signal
- Offline eval job on prompt-changing PRs (separate from E2E)
- Auth negative: anonymous cannot invoke paid tools
- `// @Scenario:` links for AI invariants in markdown plans
Anti-patterns
| Anti-pattern | Why it fails | Better approach |
|---|---|---|
| Exact text assert on LLM output | Benign rephrase breaks CI | ai.verify category + probe |
| Full agentic test file | Slow, opaque | Deterministic Arrange/Assert |
| Real model every PR | Cost + flake | AIMock CI; eval nightly |
| No probe on "Done" | False success | Poll order/status |
| Eval-only, no E2E | Wiring bugs ship | AIMock integration specs |
| ai.act for login/navigation | Unnecessary variance | Playwright locators |
| Skip limit/refusal paths | Compliance risk | Negative seed + probe |
Example scenario
Situation: User asks copilot to summarize a document over token limit.
Expected outcome: Request rejected; no partial charge; user sees limit guidance.
Why UI-only automation breaks: Test matches the exact error string; a model rewording breaks CI even though behavior is correct.
- Arrange: Seed document at limit boundary via fixture; capture the billing probe baseline.
- Act: Submit prompt in copilot UI.
- Assert: Probe shows `status=rejected`, no usage increment; `ai.verify` optional for error category.
TestChimp workflow: Track `copilot_request` with `outcome=limit_exceeded` in prod vs tests.
Same Arrange/Act/Assert pattern as expired-coupon checkout.
Connect scenarios to your QA workflow
Capture business rules in markdown test plans and enforce them with seed routes and probe Assert. Link SmartTests with // @Scenario: for requirement traceability. Use /testchimp test on PRs; /testchimp explore on SmartTest paths for non-functional gaps (ExploreChimp).
Related scenarios
- Conversational UI — AIMock, threads, refunds
- AI streaming — partial token timing
- LLM output validation — structured output
- AI agent workflows — multi-step tools
- Canvas interactions — generative UI
- Pure agentic vs SmartTests — hybrid strategy
- What is AI in QA — evals in QA strategy
Frequently asked questions
Should every step use ai.act for AI product UI?
No—default to Playwright for stable flows; add hybrid AI only where copy or layout churns. Probes assert side effects (credits, flags) independent of model wording.
What is AIMock vs mocking the whole app?
AIMock stubs upstream LLM responses while your API routes, tool router, and UI run for real—CI gets deterministic text without sacrificing integration coverage.
Evals vs E2E—which catches prompt regressions?
Evals on golden datasets catch answer quality and refusal rates. E2E with AIMock catches wiring bugs (tools not called, auth wrong). Use both on different schedules.
How do I test streaming chat without flake?
Wait on stream-complete signals (data attribute, SSE end, AIMock final chunk)—see streaming guide. Assert probes only after completion.
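As an illustration of an "SSE end" signal, the sketch below scans a buffered server-sent-events body for a terminal sentinel. It assumes an OpenAI-style `data: [DONE]` final event and a hypothetical JSON payload with a `delta` field; other providers and your own API routes may signal completion differently.

```typescript
// Assumed wire format: each event is "data: <json>" with a "delta" string,
// and the stream ends with "data: [DONE]". Both are assumptions.
interface StreamState {
  text: string; // concatenated deltas seen so far
  complete: boolean; // true only once the sentinel has arrived
}

function parseSse(buffer: string): StreamState {
  let text = "";
  let complete = false;
  for (const line of buffer.split("\n")) {
    const trimmed = line.trim();
    if (trimmed.indexOf("data:") !== 0) continue; // skip blanks and comments
    const payload = trimmed.slice(5).trim();
    if (payload === "[DONE]") {
      complete = true;
      break;
    }
    const parsed = JSON.parse(payload) as { delta?: string };
    text += parsed.delta || "";
  }
  return { text, complete };
}
```

A spec would poll until `complete` is true, or until the UI exposes an equivalent attribute, and only then read probes; asserting while `complete` is still false is the "assert too early" failure from the complexity map.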
Can I assert exact assistant markdown?
Avoid in CI—use ai.verify for semantic category and probes for side effects. Exact text belongs in offline eval rubrics, not Playwright.
Agent-written tests failed overnight—how to recover?
Keep SmartTests in Git with scenario links; the next /testchimp test run patches the deterministic steps. Reserve ai.act for volatile panels; probes stay stable across prompt edits.
How does TrueCoverage guide AI test expansion?
Compare the tool_name and outcome distributions in production against what tests cover. When limit_exceeded or policy_denied dominate production but tests only cover success, run /testchimp evolve.
Apply these patterns in your repo
Run `/testchimp init` to connect TestChimp to your repo, then `/testchimp test` on PRs to turn these patterns into maintained SmartTests. Use `/testchimp evolve` when you want to expand coverage as your app grows.