Why E2E Tests Flake and How Startups Fix Them

Short answer

Flakes are usually world-state problems, not "Playwright is random." Startups fix them by separating Arrange (fixtures and API seeds), Act (a short UI path), and Assert (probes), and by adding parallel-safe data, frozen clocks, and surgical ai.act only where selectors churn. Rerunning CI or deleting coverage does not fix them.

Who this is for

Startups whose CI is red for no product reason—especially after parallelization, new promos, shared staging users, or agent-generated UI refactors. Teams on Playwright, Cypress, or SmartTests who need a repeatable fix loop, not generic "add more waits" advice.

Why flake matters for startups

Flaky CI erodes velocity:

False alarms — engineers ignore red builds; real regressions slip through.
Merge queue pain — retry roulette burns minutes per PR on a 5-person team.
Coverage erosion — @flaky skips and deleted specs leave prod paths untested.
Wrong fixes — waitForTimeout(5000) masks Arrange bugs until staging breaks on promo day.

Most flakes trace to shared mutable state and UI-only assertions on server outcomes—not browser nondeterminism.

Complexity map

| Scenario | Edge case | Why tests break | Approach |
| --- | --- | --- | --- |
| Shared coupon | One code, N workers | Worker 3: "already used" | Per-run seed coupon |
| Shared user account | Parallel login | Session overwrite | Seed unique email per run |
| Leftover orders/carts | Prior spec data | Wrong totals | Teardown or fresh seed |
| Clock / timezone | Promo ends at midnight | Date-bound flake | Freeze time in Arrange |
| Feature flags | Branch ≠ staging flags | Wrong UI path | Posture fixture per env |
| Async backend | Webhook lag | Assert before DB write | `expect.poll` probe |
| Toast-only assert | Copy change | Pass/fail unrelated | Probe authoritative state |
| iframe payment | Stripe frame load | Intermittent not found | frameLocator + poll (Stripe guide) |
| Selector drift | Vibe-coded regen | Element not found | Hybrid ai.act on volatile region |
| Missing isolation | `storageState` reuse | Wrong role in spec | Per-spec auth seed |
| Environment drift | Preview URL config | API 404 | Multi-env posture (multi-environment) |
| Test order dependency | Spec B needs Spec A data | Order-sensitive pass | Independent Arrange per spec |
| Global rate limits | Staging API throttle | 429 in CI | Mock or per-run quotas |
| File upload race | Scan async | Assert before ready | Poll scan_status probe |
| ExploreChimp vs CI | Manual path not in suite | Prod-only UX break | Convert session to SmartTest |

World-state discipline

Treat each spec as needing a known universe:

Arrange (fixtures) → Act (UI) → Assert (probes) → Teardown (optional)

Pattern: seed route + runId

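import { test, expect } from '@playwright/test';

// test.info() is only valid while a test or hook is running,
// so build runId inside the hook from the testInfo argument.
let runId: string;

test.beforeEach(async ({ request }, testInfo) => {
  runId = `${testInfo.parallelIndex}-${Date.now()}`;
  await request.post('/api/test/reset-world', { data: { runId } });
});

test('checkout with run coupon', async ({ page, request }) => {
  // Arrange: seed a cart and a coupon unique to this run.
  const { cartId } = await request.post('/api/test/seed-checkout', {
    data: { runId, coupon: { code: `PROMO-${runId}` } },
  }).then(r => r.json());

  await page.goto(`/checkout?cart=${cartId}`);
  // Act ...
  // Assert: read the authoritative order state from the probe route.
  const order = await request.get(`/api/test/probe-order?runId=${runId}`).then(r => r.json());
  expect(order.paymentStatus).toBe('paid');
});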

/testchimp init scaffolds seed, probe, and teardown routes, so agents can repair the harness and the test together in the failing PR (see fixtures in agent authoring).
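To make the seed and probe routes less abstract, here is a minimal, framework-free sketch of the logic that could sit behind `/api/test/reset-world`, `/api/test/seed-checkout`, and `/api/test/probe-order`. The in-memory `Map`, the `World` shape, and the function names are all assumptions for illustration; a real harness would write to the application's own datastore and may look quite different.

```typescript
// Hypothetical in-memory stand-in for test-only seed/probe routes.
// Every record is keyed by runId, so parallel workers never share state.
type Order = { paymentStatus: 'pending' | 'paid'; discountCents: number };
type World = { cartId: string; couponCode: string; orders: Order[] };

const worlds = new Map<string, World>();

// Would back POST /api/test/reset-world: drop everything owned by this run.
function resetWorld(runId: string): void {
  worlds.delete(runId);
}

// Would back POST /api/test/seed-checkout: create a cart and coupon unique to the run.
function seedCheckout(runId: string, couponCode: string): { cartId: string } {
  const world: World = { cartId: `cart-${runId}`, couponCode, orders: [] };
  worlds.set(runId, world);
  return { cartId: world.cartId };
}

// Stand-in for the app completing a checkout (in reality the UI flow triggers this).
function recordPaidOrder(runId: string, discountCents: number): void {
  const world = worlds.get(runId);
  if (!world) throw new Error(`no seeded world for run ${runId}`);
  world.orders.push({ paymentStatus: 'paid', discountCents });
}

// Would back GET /api/test/probe-order: return authoritative state for this run only.
function probeOrder(runId: string): {
  orderCount: number;
  paymentStatus: string | null;
  discountCents: number;
} {
  const orders = worlds.get(runId)?.orders ?? [];
  const last = orders[orders.length - 1];
  return {
    orderCount: orders.length,
    paymentStatus: last ? last.paymentStatus : null,
    discountCents: last ? last.discountCents : 0,
  };
}
```

Because every lookup goes through `runId`, one worker cannot see or consume another worker's coupon or orders, which is the isolation the seed route pattern relies on.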

Classify before fixing

| Signal | Likely class | First move |
| --- | --- | --- |
| Passes locally, fails CI worker N | Shared data | Unique seed per worker |
| Fails ~midnight UTC | Clock | Freeze `Date` in Arrange |
| Fails after UI PR, not logic PR | Selector | Deterministic locator; ai.act if churn |
| Intermittent 5–30s timeout | Async backend | Probe poll, not longer sleep |
| Always fails same worker | Resource lock | Isolate account/coupon |

Do not increase timeout until you classify—longer timeouts hide Arrange bugs.
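The triage table can be turned into a first-pass helper that tags a failure before anyone touches a timeout. The `FailureSignal` fields and the order in which rules are checked are assumptions made for illustration, not a Playwright or TestChimp API; treat the result as a starting hypothesis rather than a diagnosis.

```typescript
// Hypothetical triage helper mirroring the table: signal -> likely class -> first move.
type FailureSignal = {
  passesLocally: boolean;
  failingWorkers: number[];      // parallelIndex of each observed failure
  failsNearMidnightUtc: boolean;
  startedAfterUiOnlyPr: boolean;
  intermittentTimeout: boolean;  // sporadic timeouts in the 5-30s range
};

type Triage = { flakeClass: string; firstMove: string };

function classifyFlake(s: FailureSignal): Triage {
  if (s.failsNearMidnightUtc) {
    return { flakeClass: 'Clock', firstMove: 'Freeze Date in Arrange' };
  }
  if (s.startedAfterUiOnlyPr) {
    return { flakeClass: 'Selector', firstMove: 'Deterministic locator; ai.act if churn' };
  }
  if (s.intermittentTimeout) {
    return { flakeClass: 'Async backend', firstMove: 'Probe poll, not longer sleep' };
  }
  // Repeated failures that always land on one worker suggest a locked resource.
  const distinctWorkers = new Set(s.failingWorkers);
  if (s.failingWorkers.length > 1 && distinctWorkers.size === 1) {
    return { flakeClass: 'Resource lock', firstMove: 'Isolate account/coupon' };
  }
  if (s.passesLocally && s.failingWorkers.length > 0) {
    return { flakeClass: 'Shared data', firstMove: 'Unique seed per worker' };
  }
  return { flakeClass: 'Unclassified', firstMove: 'Reproduce with --workers=4 before changing timeouts' };
}
```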

Hybrid ai.act: surgical, not default

Record-replay and fully agentic suites both fail startups differently—see pure scripts vs SmartTests and pure agentic vs SmartTests.

| Layer | Use |
| --- | --- |
| Arrange | Always deterministic API/fixtures |
| Act | Playwright locators; `ai.act` only on volatile widgets (promos, maps, canvas) |
| Assert | Probes for money, orders, permissions; `ai.verify` optional for semantic UI |

// Good: deterministic truth + semantic UI where needed
await ai.act('Apply the welcome promo in the checkout sidebar');
await expect.poll(async () => {
  const o = await request.get(`/api/test/probe-order?runId=${runId}`).then(r => r.json());
  return o.discountCents;
}).toBeGreaterThan(0);

Bad: entire checkout as one ai.act chain with no probe—non-reproducible and slow.

Fix playbook

Reproduce in parallel — npx playwright test --workers=4 locally
Move setup to API — shorten UI path to the behavior under test
Replace toast asserts — probe order, balance, stage, flags
Link scenario — // @Scenario: for traceability when fixing
Gate merge — green parallel CI before merge; no @flaky without ticket + TrueCoverage review
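For the first step of the playbook, a config along these lines makes local runs behave like parallel CI, so shared-data flakes reproduce on a laptop. It is a sketch using standard Playwright Test options; the fallback URL is an assumption, and `BASE_URL` is assumed to be the environment variable your project already uses.

```typescript
// playwright.config.ts
import { defineConfig } from '@playwright/test';

export default defineConfig({
  // Run tests within each file in parallel too, so cross-test sharing surfaces.
  fullyParallel: true,
  // Match the worker count used in the repro command.
  workers: 4,
  // No retries while diagnosing: a retry that passes hides the flake.
  retries: 0,
  use: {
    // Assumed env var name and fallback; point this at the environment under test.
    baseURL: process.env.BASE_URL ?? 'http://localhost:3000',
    // Keep a trace for each failing test to inspect what the worker saw.
    trace: 'retain-on-failure',
  },
});
```

Keeping `retries` at 0 during diagnosis matters because a green retry reports success for exactly the runs you are trying to study.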

Requirement slices to cover

Spend flake-fixing effort on specs that cover high-traffic paths, not on arbitrary specs:

Compare prod event volume vs test-run coverage before deleting a "flaky" test
Prioritize /testchimp evolve for slices that flake and drive revenue

When a flaky checkout test covers the payment_method=apple_pay prod slice, fix its Arrange step; do not delete it without a replacement.

CI checklist

Unique runId per worker in all seeds
No shared staging coupons, users, or carts
expect.poll for webhooks and async jobs
Feature-flag posture matches preview URL
Global teardown for long-lived staging (optional in ephemeral CI)
Parallel job required on payment/auth PRs
Document frozen-clock specs when promos are date-bound
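For the frozen-clock item, one option is Playwright's Clock API (`page.clock`, available in recent Playwright versions), which can pin `Date.now()` and `new Date()` in the page to a fixed instant. The sketch below declares only the one method it needs, so it can be exercised without Playwright installed; the helper name `freezePageClock` and the example date are made up for illustration. It pins browser time only: if the server decides promo expiry, the seed route would also need a way to accept a fixed time.

```typescript
// Minimal structural type: Playwright's `page` satisfies it via page.clock.setFixedTime().
interface HasClock {
  clock: { setFixedTime(time: number | string | Date): Promise<void> };
}

// Hypothetical helper: call in Arrange, before page.goto(), so the app never sees real time.
async function freezePageClock(page: HasClock, isoInstant: string): Promise<Date> {
  const instant = new Date(isoInstant);
  if (Number.isNaN(instant.getTime())) {
    throw new Error(`invalid ISO instant: ${isoInstant}`);
  }
  await page.clock.setFixedTime(instant);
  return instant;
}

// Usage inside a Playwright test (illustrative date, one minute before a midnight UTC cutoff):
//   await freezePageClock(page, '2030-01-31T23:59:00Z');
//   await page.goto('/checkout');
```

Returning the parsed `Date` makes it easy to note the frozen instant in the spec, which helps with documenting date-bound tests.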

Anti-patterns

| Anti-pattern | Why it fails | Better approach |
| --- | --- | --- |
| `waitForTimeout` | Masks race | Poll probe |
| Delete flaky test | Coverage hole | Fix Arrange/Assert |
| Rerun until green | Hides flake | Parallel reproduce |
| Full ai.act suite | Slow, opaque | Probes + surgical AI |
| Shared `test@` user | Session collision | Per-run seed |
| UI-only 20-step setup | Breaks on CSS | API Arrange |
| `@flaky` without owner | Permanent skip | Ticket + TrueCoverage check |

Example scenario

Situation: Checkout test passes locally but fails in CI worker 3.

Expected outcome: Order created exactly once with expected total.

Why UI-only automation breaks: Workers reuse the same coupon code; worker 3 hits 'already used' intermittently.

Arrange: Per-run seed creates unique coupon and empty cart via API.
Act: Complete checkout in UI.
Assert: Probe confirms single order row and payment status—ignore toast timing.

TestChimp workflow: Align `checkout_attempted` events between prod and test to spot untested failure paths.

Same Arrange/Act/Assert pattern as expired-coupon checkout.

Connect scenarios to your QA workflow

Capture business rules in markdown test plans and enforce them with seed routes and probe-based assertions. Link SmartTests with // @Scenario: for requirement traceability. Use /testchimp test on PRs, and /testchimp explore on SmartTest paths to find non-functional gaps (ExploreChimp).

Common E2E gotchas — atomic fixes for selector drift, world-state, waits, UI-only asserts
Checkout flows — coupons, payment probes
Cart & promos — stacking, race conditions
Fintech web apps — ledger probes, idempotency
Firebase auth — parallel user seeds
Record-replay vs TestChimp — missing Arrange/Assert
E2E unreliability — why scripts flake at scale

Frequently asked questions

Should we delete flaky tests to unblock CI?

Fix Arrange/Assert first—shared users, missing probes, UI-only assertions. TrueCoverage confirms whether a flaky test still covers a high-traffic path before you cut coverage.

Is Playwright inherently flaky?

Usually no—the app under test shares mutable staging data or asserts before async work completes. Per-run seeds and probe polls fix most startup flakes.

When should we use ai.act to fix flake?

When selectors churn on volatile UI regions but business outcomes are probe-stable. Never replace payment, auth, or ledger asserts with AI verify.

waitForTimeout vs expect.poll?

Prefer expect.poll on probe fields with a bounded timeout. Fixed sleeps hide races and slow CI.
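The difference is easier to see in code. Playwright's `expect.poll` already implements bounded polling, so the dependency-free sketch below only illustrates the idea; `pollUntil` and its option names are hypothetical, not a Playwright API.

```typescript
// Hypothetical bounded poll: re-read a probe until the predicate holds or time runs out.
async function pollUntil<T>(
  read: () => Promise<T>,
  isDone: (value: T) => boolean,
  opts: { timeoutMs: number; intervalMs: number },
): Promise<T> {
  const deadline = Date.now() + opts.timeoutMs;
  for (;;) {
    const value = await read();
    // Return as soon as the state is ready instead of sleeping a fixed amount.
    if (isDone(value)) return value;
    // Fail loudly at the bound, reporting the last observed state.
    if (Date.now() + opts.intervalMs > deadline) {
      throw new Error(
        `probe not ready after ${opts.timeoutMs}ms; last value: ${JSON.stringify(value)}`,
      );
    }
    await new Promise<void>(resolve => setTimeout(resolve, opts.intervalMs));
  }
}
```

A fixed sleep always costs its full duration and still fails when the backend is slower than the guess. The poll returns on the first ready read and fails with the last observed value at the bound. In a Playwright spec the equivalent is `expect.poll(readProbe, { timeout: 10_000 })` followed by a matcher, where `readProbe` stands for your own probe call.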

How do feature flags cause flake?

Preview branch may disable a step main tests expect. Use posture fixtures that set flags consistently for the target URL.
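A posture fixture can be as small as a lookup from the target URL to the flag set the suite expects. The flag names, hostnames, and the `postureFor` helper below are invented for illustration; how the flags are actually applied (seed route, cookie, header) depends on your flag system.

```typescript
// Hypothetical flag posture per environment, resolved from the URL under test.
type Flags = Record<string, boolean>;

// Flags every spec assumes unless an environment says otherwise.
const basePosture: Flags = { newCheckout: true, promoBanner: false };

// Per-host overrides; any host not listed falls back to basePosture.
const postureOverrides: Array<{ hostSuffix: string; flags: Flags }> = [
  { hostSuffix: 'staging.example.com', flags: { promoBanner: true } },
  { hostSuffix: 'preview.example.com', flags: { newCheckout: true } },
];

function postureFor(baseUrl: string): Flags {
  const host = new URL(baseUrl).hostname;
  const match = postureOverrides.find(o => host.endsWith(o.hostSuffix));
  return { ...basePosture, ...(match ? match.flags : {}) };
}
```

The resolved flags would be sent to a seed route during Arrange, so every spec starts from the same posture regardless of what the preview branch defaults to.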

Our eng team maintains tests with no QA—what loop works?

Markdown scenarios in Git, /testchimp test on each PR to repair SmartTests, // @Scenario: traceability, /testchimp evolve after deploy via TrueCoverage.

Flake started after enabling parallel workers—now what?

Audit shared coupons, accounts, carts, and storageState. Add runId to every seed route and reproduce with --workers=4 locally before merge.

Tests pass on laptop but fail on preview URL—flake?

Often environment drift, not timing—compare BASE_URL, flags, and API keys. See [staging vs local gotcha](/guides/gotchas/e2e-environment-drift-staging-vs-local).

Where is the index of atomic fixes?

The [Common E2E gotchas](/guides/gotchas/intro) hub lists symptom-first pages for selectors, auth pollution, teardown leaks, and more—this guide is the holistic playbook.

Apply these patterns in your repo

Run `/testchimp init` to connect TestChimp to your repo, then `/testchimp test` on PRs to turn these patterns into maintained SmartTests. Use `/testchimp evolve` when you want to expand coverage as your app grows.

