
How to Test AI-Powered Web Applications

Short answer

AI products introduce non-deterministic UI—streaming text, dynamic suggestions, model updates. Use deterministic Playwright for shell and navigation, AIMock for stable model responses in CI, ai.act / ai.verify only where needed, offline evals for prompt quality, and probes for outcomes that must not drift with model wording.

Part of Testing Guides by industry.

Who this is for

Startups shipping chat, copilot, generative UI, or AI-assisted workflows on the web, especially when copy changes daily but business rules (billing, permissions, tool calls) must hold. This guide is not aimed at static marketing sites whose only AI feature is a chat-widget FAQ.

Typical stacks: OpenAI/Anthropic/Gemini behind API routes, tool routers, RAG pipelines, or embedded copilots in SaaS dashboards.

Why testing AI web apps matters

AI failures hide behind fluent text:

  • Revenue loss — copilot confirms refund or upgrade; probe shows no change; usage credits consumed without delivery.
  • Security incidents — tool calls run with wrong user's token; cross-tenant context in multi-turn threads.
  • Support load — infinite clarification loops; streaming stuck on spinner; empty responses after safety filter.
  • CI trap — asserting exact assistant prose breaks on benign rephrase though behavior is correct.

The UI can show "Done!" while tools, DB, and billing disagree. E2E must assert authoritative state via probes—not model phrasing alone.

Complexity map

| Scenario | Edge case | Why tests break | Approach |
| --- | --- | --- | --- |
| Non-deterministic text | Benign rephrase | toHaveText('exact') flakes | AIMock + ai.verify category |
| Streaming response | Partial tokens | Assert too early | Wait complete signal (streaming guide) |
| Tool invocation | Async side effect | Assert before DB write | expect.poll probe |
| Token / usage limit | Over quota | Model returns prose | Probe status=rejected |
| Refusal / safety | Policy block | Empty bubble passes | Offline eval + E2E negative |
| RAG empty hits | Hallucination | Fluent wrong answer | Probe empty retrieval + UI state |
| Auth gate | Anonymous vs logged-in | Tool 403 untested | Seed session; probe permission |
| Suggested replies | Chip vs free text | Different code paths | Cover both modes |
| Model API 429 | Rate limit in CI | Random fail | AIMock in PR; real model nightly |
| Prompt deploy | Wording shift | String asserts fail | Eval job on golden set |
| Billing per request | Partial charge | Revenue leak | Probe usage increment |
| Concurrent sends | Double submit | Duplicate tool calls | Disable during stream; probe idempotency |
| Canvas / generative UI | No stable selectors | Coordinate flake | ai.act + probe (canvas guide) |
| Multi-turn context | Thread lost on refresh | Wrong answer turn 3 | Seed thread in Arrange |

Architecture: three layers

User → UI → API → LLM (+ tools/RAG) → stream/batch → UI render

Side effects (DB, billing, email)

| Layer | Validates | When |
| --- | --- | --- |
| Offline evals | Answer quality, refusal, citations | Prompt/system change PRs |
| AIMock E2E | UI wiring, tools, auth, probes | Every PR in CI |
| Probe Assert | DB/API truth after action | Always for side effects |

Do not run expensive LLM-as-judge on every Playwright spec—reserve evals for prompt regression; use AIMock for integration wiring.
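One way to express this split in CI is a small selector that decides which jobs to run from the files a PR touches. The sketch below is illustrative only: the `prompts/` path convention, the `system-prompt` file name, and the job names are assumptions, not part of any tool mentioned in this guide.

```typescript
// Hypothetical job names; map them to whatever your CI system calls its jobs.
type CiJob = "aimock-e2e" | "offline-evals";

// Assumption: prompt and system-message files live under prompts/ or are
// named system-prompt.md / system-prompt.txt.
const PROMPT_PATH_PATTERNS: RegExp[] = [/^prompts\//, /system-prompt\.(md|txt)$/];

function selectJobs(changedFiles: string[]): CiJob[] {
  // AIMock E2E runs on every PR because it is cheap and deterministic.
  const jobs: CiJob[] = ["aimock-e2e"];

  const touchesPrompt = changedFiles.some((file) =>
    PROMPT_PATH_PATTERNS.some((pattern) => pattern.test(file)),
  );

  // Offline evals (possibly LLM-as-judge) run only when a prompt changed.
  if (touchesPrompt) jobs.push("offline-evals");
  return jobs;
}
```

A PR that only touches UI components would get `["aimock-e2e"]`; one that edits a file under `prompts/` would also get `"offline-evals"`.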

Evals vs E2E: split responsibilities

| Concern | Offline evals | E2E SmartTests |
| --- | --- | --- |
| Answer quality / tone | Golden questions + judge or rubric | Not exact prose asserts |
| Refusal on policy prompts | Batch eval dataset | One E2E negative + probe |
| Tool called correctly | Can mock in eval harness | AIMock + probe side effect |
| Streaming UX / disabled states | Limited | E2E with complete signal |
| Auth + billing boundaries | Partial | E2E with probes required |
| Regression on prompt v2 | Primary home | Smoke E2E only |

Rule: E2E proves the product wired correctly (tools fire, permissions hold, UI states transition). Evals prove the model behaves acceptably on representative inputs.

Run evals in a separate CI job or nightly; gate prompt PRs on eval thresholds. Keep Playwright fast with AIMock.
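Gating a prompt PR on eval thresholds can be as simple as computing a pass rate per category and comparing it with a minimum. The following is a minimal sketch under stated assumptions: the `EvalCaseResult` shape, the two categories, and the threshold values are all hypothetical and would come from your own eval harness.

```typescript
// Hypothetical result shape for one golden-set case.
interface EvalCaseResult {
  id: string;
  category: "quality" | "refusal";
  passed: boolean;
}

interface GateDecision {
  ok: boolean;
  passRates: Record<string, number>;
  failures: string[];
}

// Assumed thresholds: minimum pass rate per category.
const MIN_PASS_RATE: Record<EvalCaseResult["category"], number> = {
  quality: 0.9,
  refusal: 1.0,
};

function gatePromptChange(results: EvalCaseResult[]): GateDecision {
  const passRates: Record<string, number> = {};
  const failures: string[] = [];
  const categories = Object.keys(MIN_PASS_RATE) as EvalCaseResult["category"][];

  for (const category of categories) {
    const cases = results.filter((r) => r.category === category);
    if (cases.length === 0) {
      // An empty category is treated as a failure so coverage cannot silently vanish.
      failures.push(`${category}: no cases in golden set`);
      continue;
    }
    const rate = cases.filter((r) => r.passed).length / cases.length;
    passRates[category] = rate;
    if (rate < MIN_PASS_RATE[category]) {
      failures.push(`${category}: pass rate ${rate.toFixed(2)} below ${MIN_PASS_RATE[category]}`);
    }
  }
  return { ok: failures.length === 0, passRates, failures };
}
```

The eval job would fail the PR when `ok` is false and print `failures`; the Playwright job stays untouched.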

AIMock and hybrid SmartTests

AIMock stubs upstream model responses while exercising real routes, tool dispatch, and UI:

// env: AIMOCK=1 maps prompt patterns to fixtures/ai/refund-confirm.json

await ai.act('Open copilot and ask to cancel order 12345');
await ai.verify('Assistant confirms refund initiated or shows policy denial with clear next steps');

await expect.poll(async () => {
  const res = await request.get('/api/test/probe-order/12345');
  return (await res.json()).status;
}, { timeout: 15_000 }).toMatch(/refund_pending|policy_denied/);

  • ai.act: semantic UI when selectors churn (chat input, dynamic panels)
  • ai.verify: semantic UI outcome category, not exact tokens
  • Probe: authoritative order status, credits, flags

See conversational UI guide for thread seeds and multi-turn patterns.
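To make the idea of "prompt patterns mapped to fixtures" concrete, here is a minimal sketch of such a lookup. It is not AIMock's actual configuration format or implementation; `FixtureRule`, `resolveFixture`, and the second fixture path are hypothetical names used only for illustration.

```typescript
// Illustrative only: not AIMock's real configuration format.
interface FixtureRule {
  pattern: RegExp; // matched against the outgoing user prompt
  fixturePath: string; // canned model response to replay
}

const FIXTURE_RULES: FixtureRule[] = [
  { pattern: /cancel order \d+/i, fixturePath: "fixtures/ai/refund-confirm.json" },
  // Hypothetical second fixture for a limit-exceeded path.
  { pattern: /summari[sz]e/i, fixturePath: "fixtures/ai/summarize-limit.json" },
];

function resolveFixture(prompt: string, rules: FixtureRule[] = FIXTURE_RULES): string {
  const match = rules.find((rule) => rule.pattern.test(prompt));
  if (!match) {
    // Fail loudly so an unmapped prompt never falls through to a real model in CI.
    throw new Error(`No AI fixture mapped for prompt: ${prompt}`);
  }
  return match.fixturePath;
}
```

Throwing on an unmapped prompt is a deliberate choice in this sketch: a silent fallback to the live model would reintroduce the cost and flake that mocking is meant to remove.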

Arrange: threads, documents, limits

// POST /api/test/seed-ai-context
// Body: { runId, documentTokens, userCredits, threadId? }

const { threadId } = await request.post('/api/test/seed-ai-context', {
  data: { runId, documentTokens: 128_000, userCredits: 10 },
}).then(r => r.json());

await page.goto(`/copilot?thread=${threadId}`);

Seed boundary conditions (at token limit, zero credits) without chatting through onboarding in every spec.
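Boundary seeds are easier to reuse when each condition has a name. The sketch below builds request bodies for the seed route documented above; the `Boundary` names, the default values, and the assumption that 128,000 tokens is the limit are illustrative, not taken from a real API.

```typescript
// Mirrors the documented body: { runId, documentTokens, userCredits, threadId? }
interface SeedAiContext {
  runId: string;
  documentTokens: number;
  userCredits: number;
  threadId?: string;
}

// Assumption: the token limit equals the 128_000 used in the seed example.
const TOKEN_LIMIT = 128_000;

type Boundary = "at_limit" | "over_limit" | "zero_credits";

function seedFor(boundary: Boundary, runId: string): SeedAiContext {
  // Hypothetical defaults for a healthy account with a small document.
  const base: SeedAiContext = { runId, documentTokens: 1_000, userCredits: 10 };
  switch (boundary) {
    case "at_limit":
      return { ...base, documentTokens: TOKEN_LIMIT };
    case "over_limit":
      return { ...base, documentTokens: TOKEN_LIMIT + 1 };
    case "zero_credits":
      return { ...base, userCredits: 0 };
  }
}
```

In a spec, the result would be passed as the `data` option of the `request.post('/api/test/seed-ai-context', ...)` call, so each test states its boundary in one word.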

Probes for AI side effects

Assert what must not drift:

| Side effect | Probe |
| --- | --- |
| Refund / cancel | Order status |
| Ticket created | ticket_id |
| Usage billing | credits_consumed |
| Tool denial | policy_denied flag |
| Saved artifact | Storage row id |

Never assert assistant markdown HTML snapshots in CI—they change with model and CSS.
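A probe assertion usually compares a snapshot taken before the action with one taken after it. The helper below is a sketch of that comparison; the `ProbeSnapshot` shape and field names are hypothetical and would mirror whatever your probe endpoints return.

```typescript
// Hypothetical probe payload; adapt field names to your probe routes.
interface ProbeSnapshot {
  orderStatus: string;
  creditsConsumed: number;
  policyDenied: boolean;
}

interface ExpectedEffect {
  allowedOrderStatuses?: string[]; // any of these final statuses is acceptable
  creditsDelta?: number; // exact change in consumed credits
  policyDenied?: boolean;
}

// Returns human-readable violations; an empty array means the side effects match.
function checkSideEffects(
  before: ProbeSnapshot,
  after: ProbeSnapshot,
  expected: ExpectedEffect,
): string[] {
  const violations: string[] = [];

  if (
    expected.allowedOrderStatuses &&
    expected.allowedOrderStatuses.indexOf(after.orderStatus) === -1
  ) {
    violations.push(`unexpected order status: ${after.orderStatus}`);
  }
  if (expected.creditsDelta !== undefined) {
    const delta = after.creditsConsumed - before.creditsConsumed;
    if (delta !== expected.creditsDelta) {
      violations.push(`credits changed by ${delta}, expected ${expected.creditsDelta}`);
    }
  }
  if (expected.policyDenied !== undefined && after.policyDenied !== expected.policyDenied) {
    violations.push(`policyDenied was ${after.policyDenied}`);
  }
  return violations;
}
```

Because the check never looks at assistant text, it stays valid when the prompt or the model changes.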

Requirement slices to cover

  • copilot_request_outcome — success, limit_exceeded, policy_denied, error
  • tool_name — refund, search, summarize, etc.
  • interaction_mode — chip, free_text, voice (if applicable)

When prod shows high limit_exceeded volume but tests cover only the happy-path summarize flow, evolve the negative specs.
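The prod-versus-test comparison can be sketched as a diff over slice counts. This is a simplified illustration, not how any particular coverage product computes it: the `tool_name|outcome` key format and the 5% share threshold are assumptions.

```typescript
// Key format is an assumption: "<tool_name>|<copilot_request_outcome>".
type SliceCounts = Record<string, number>;

// Returns slices that carry at least minProdShare of prod traffic
// but have no test executions, ordered by prod volume (largest first).
function findCoverageGaps(
  prod: SliceCounts,
  tested: SliceCounts,
  minProdShare = 0.05,
): string[] {
  const keys = Object.keys(prod);
  const total = keys.reduce((sum, key) => sum + (prod[key] ?? 0), 0);
  if (total === 0) return [];

  return keys
    .filter(
      (key) => (prod[key] ?? 0) / total >= minProdShare && (tested[key] ?? 0) === 0,
    )
    .sort((a, b) => (prod[b] ?? 0) - (prod[a] ?? 0));
}
```

Slices returned by this function are candidates for new negative specs with seeded boundary conditions.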

CI checklist

  1. AIMock enabled for default PR Playwright job
  2. Probes on every tool side-effect scenario
  3. No exact assistant string asserts—semantic or probe only
  4. Streaming specs wait for a complete/abort signal
  5. Offline eval job on prompt-changing PRs (separate from E2E)
  6. Auth negative: anonymous cannot invoke paid tools
  7. // @Scenario: links for AI invariants in markdown plans
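Checklist item 3 can be partly automated with a rough source scan. The heuristic below is a sketch, not a real lint rule: it assumes assistant output is located through a selector or variable whose name contains "assistant", and it only inspects single lines.

```typescript
// Heuristic: flags lines that target something named "assistant" and call
// toHaveText with a string literal. Regex arguments are allowed through.
function findExactProseAsserts(specSource: string): number[] {
  const flagged: number[] = [];
  const lines = specSource.split("\n");

  for (let i = 0; i < lines.length; i++) {
    const line = lines[i] ?? "";
    const targetsAssistant = /assistant/i.test(line);
    const exactString = /\.toHaveText\(\s*['"`]/.test(line);
    if (targetsAssistant && exactString) flagged.push(i + 1); // 1-based line numbers
  }
  return flagged;
}
```

A CI step could run this over spec files and fail when the returned list is non-empty. Multi-line assertions and aliased locators would slip past it, so treat it as a tripwire rather than a guarantee.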

Anti-patterns

| Anti-pattern | Why it fails | Better approach |
| --- | --- | --- |
| Exact text assert on LLM output | Benign rephrase breaks CI | ai.verify category + probe |
| Full agentic test file | Slow, opaque | Deterministic Arrange/Assert |
| Real model every PR | Cost + flake | AIMock CI; eval nightly |
| No probe on "Done" | False success | Poll order/status |
| Eval-only, no E2E | Wiring bugs ship | AIMock integration specs |
| ai.act for login/navigation | Unnecessary variance | Playwright locators |
| Skip limit/refusal paths | Compliance risk | Negative seed + probe |

Example scenario

Situation: User asks copilot to summarize a document over token limit.

Expected outcome: Request rejected; no partial charge; user sees limit guidance.

Why UI-only automation breaks: The test matches the exact error string, so a model reword breaks CI even though behavior is correct.

  1. Arrange: Seed document at limit boundary via fixture; stub billing probe baseline.
  2. Act: Submit prompt in copilot UI.
  3. Assert: Probe shows `status=rejected`, no usage increment; `ai.verify` optional for error category.
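The three steps above can be sketched as one function over an injected harness, which keeps the flow readable and lets it run against a fake in unit tests. Everything here is hypothetical: `CopilotHarness` and its methods stand in for your seed route, probe route, and copilot UI interaction.

```typescript
// Hypothetical harness: in a real spec these methods would wrap Playwright's
// `request` fixture (seed and probe routes) and the copilot UI (submit).
interface UsageProbe {
  status: string;
  creditsConsumed: number;
}

interface CopilotHarness {
  seedDocument(runId: string, documentTokens: number): Promise<string>; // resolves to a threadId
  probeUsage(runId: string): Promise<UsageProbe>;
  submitPrompt(threadId: string, prompt: string): Promise<void>;
}

async function runLimitExceededScenario(
  harness: CopilotHarness,
  runId: string,
  documentTokens: number,
): Promise<string[]> {
  // Arrange: seed the oversized document and record the billing baseline.
  const threadId = await harness.seedDocument(runId, documentTokens);
  const baseline = await harness.probeUsage(runId);

  // Act: submit the prompt through the copilot.
  await harness.submitPrompt(threadId, "Summarize this document");

  // Assert: rely on probe state, never on the wording of the error message.
  const after = await harness.probeUsage(runId);
  const violations: string[] = [];
  if (after.status !== "rejected") {
    violations.push(`expected status=rejected, got ${after.status}`);
  }
  if (after.creditsConsumed !== baseline.creditsConsumed) {
    violations.push("usage incremented on a rejected request");
  }
  return violations;
}
```

In a Playwright spec the final probe read would normally sit inside `expect.poll`, since the rejection may be recorded asynchronously.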

TestChimp workflow: Track `copilot_request` with `outcome=limit_exceeded` in prod vs tests.

Same Arrange/Act/Assert pattern as expired-coupon checkout.

Connect scenarios to your QA workflow

Capture business rules in markdown test plans and enforce them with seed routes and probe Assert. Link SmartTests with // @Scenario: for requirement traceability. Use /testchimp test on PRs; /testchimp explore on SmartTest paths for non-functional gaps (ExploreChimp).


Frequently asked questions

Should every step use ai.act for AI product UI?

No—default to Playwright for stable flows; add hybrid AI only where copy or layout churns. Probes assert side effects (credits, flags) independent of model wording.

What is AIMock vs mocking the whole app?

AIMock stubs upstream LLM responses while your API routes, tool router, and UI run for real—CI gets deterministic text without sacrificing integration coverage.

Evals vs E2E—which catches prompt regressions?

Evals on golden datasets catch answer quality and refusal rates. E2E with AIMock catches wiring bugs (tools not called, auth wrong). Use both on different schedules.

How do I test streaming chat without flake?

Wait on stream-complete signals (data attribute, SSE end, AIMock final chunk)—see streaming guide. Assert probes only after completion.
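As an illustration of an "SSE end" signal, the sketch below accumulates server-sent event chunks and flips a `done` flag when it sees a `data: [DONE]` sentinel. It assumes your backend forwards that OpenAI-style sentinel and uses `\n` line endings; `SseAccumulator` is a hypothetical helper, not part of Playwright or AIMock.

```typescript
class SseAccumulator {
  private buffer = "";
  text = "";
  done = false;

  // Feed raw chunks as they arrive; chunks may split an event anywhere.
  push(chunk: string): void {
    this.buffer += chunk;

    // Events are separated by a blank line; keep the trailing partial event buffered.
    const events = this.buffer.split("\n\n");
    this.buffer = events.pop() ?? "";

    for (const event of events) {
      for (const line of event.split("\n")) {
        if (line.indexOf("data:") !== 0) continue;
        let payload = line.slice(5);
        // Per the SSE format, a single leading space after the colon is dropped.
        if (payload.charAt(0) === " ") payload = payload.slice(1);
        if (payload === "[DONE]") {
          this.done = true;
          return;
        }
        this.text += payload;
      }
    }
  }
}
```

A test helper built on this would wait until `done` is true before reading probes, which avoids asserting against a half-rendered message.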

Can I assert exact assistant markdown?

Avoid in CI—use ai.verify for semantic category and probes for side effects. Exact text belongs in offline eval rubrics, not Playwright.

Agent-written tests failed overnight—how to recover?

Keep SmartTests in Git with scenario links; the next /testchimp test run patches the deterministic steps. Reserve ai.act for volatile panels; probes stay stable across prompt edits.

How does TrueCoverage guide AI test expansion?

Compare tool_name and outcome prod vs test. When limit_exceeded or policy_denied dominate prod but tests only cover success, run /testchimp evolve.

Apply these patterns in your repo

Run `/testchimp init` to connect TestChimp to your repo, then `/testchimp test` on PRs to turn these patterns into maintained SmartTests. Use `/testchimp evolve` when you want to expand coverage as your app grows.
