Why E2E Tests Flake and How Startups Fix Them

Short answer

Flakes are usually world-state problems, not "Playwright is random." Startups fix them by separating Arrange (fixtures/API seeds), Act (short UI path), and Assert (probes)—plus parallel-safe data, frozen clocks, and surgical ai.act only where selectors churn—not by rerunning CI or deleting coverage.

Part of Testing Guides by industry.

Who this is for

Startups whose CI is red for no product reason—especially after parallelization, new promos, shared staging users, or agent-generated UI refactors. Teams on Playwright, Cypress, or SmartTests who need a repeatable fix loop, not generic "add more waits" advice.

Why flake matters for startups

Flaky CI erodes velocity:

  • False alarms: engineers ignore red builds; real regressions slip through.
  • Merge queue pain: retry roulette burns minutes per PR on a 5-person team.
  • Coverage erosion: `@flaky` skips and deleted specs leave prod paths untested.
  • Wrong fixes: `waitForTimeout(5000)` masks Arrange bugs until staging breaks on promo day.

Most flakes trace to shared mutable state and UI-only assertions on server outcomes—not browser nondeterminism.

Complexity map

| Scenario | Edge case | Why tests break | Approach |
| --- | --- | --- | --- |
| Shared coupon | One code, N workers | Worker 3: "already used" | Per-run seed coupon |
| Shared user account | Parallel login | Session overwrite | Seed unique email per run |
| Leftover orders/carts | Prior spec data | Wrong totals | Teardown or fresh seed |
| Clock / timezone | Promo ends at midnight | Date-bound flake | Freeze time in Arrange |
| Feature flags | Branch ≠ staging flags | Wrong UI path | Posture fixture per env |
| Async backend | Webhook lag | Assert before DB write | expect.poll probe |
| Toast-only assert | Copy change | Pass/fail unrelated | Probe authoritative state |
| iframe payment | Stripe frame load | Intermittent not found | frameLocator + poll (Stripe guide) |
| Selector drift | Vibe-coded regen | Element not found | Hybrid ai.act on volatile region |
| Missing isolation | storageState reuse | Wrong role in spec | Per-spec auth seed |
| Environment drift | Preview URL config | API 404 | Multi-env posture (multi-environment) |
| Test order dependency | Spec B needs Spec A data | Order-sensitive pass | Independent Arrange per spec |
| Global rate limits | Staging API throttle | 429 in CI | Mock or per-run quotas |
| File upload race | Scan async | Assert before ready | Poll scan_status probe |
| ExploreChimp vs CI | Manual path not in suite | Prod-only UX break | Convert session to SmartTest |

World-state discipline

Treat each spec as needing a known universe:

Arrange (fixtures) → Act (UI) → Assert (probes) → Teardown (optional)

Pattern: seed route + runId

import { test, expect } from '@playwright/test';

// test.info() is only available while a test is running, so runId is set per test.
let runId: string;

test.beforeEach(async ({ request }, testInfo) => {
  runId = `${testInfo.parallelIndex}-${Date.now()}`;
  await request.post('/api/test/reset-world', { data: { runId } });
});

test('checkout with run coupon', async ({ page, request }) => {
  // Arrange: seed a cart and a coupon unique to this run
  const { cartId } = await request.post('/api/test/seed-checkout', {
    data: { runId, coupon: { code: `PROMO-${runId}` } },
  }).then(r => r.json());

  await page.goto(`/checkout?cart=${cartId}`);
  // Act ...

  // Assert: probe authoritative state, not the toast
  const order = await request.get(`/api/test/probe-order?runId=${runId}`).then(r => r.json());
  expect(order.paymentStatus).toBe('paid');
});

/testchimp init scaffolds seed, probe, and teardown routes—agents repair harness and test together in the failing PR (fixtures in agent authoring).
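
To make the seed/probe contract concrete, here is a minimal in-memory sketch of what the test-only routes might do on the server side. The `TestWorld` class, its method names, and the data shapes are assumptions for illustration, not what `/testchimp init` generates. The only point is that every record is keyed by `runId`, so parallel workers never read or consume each other's coupons or orders.

```typescript
// Hypothetical in-memory model of the test-only routes; names and shapes are assumptions.
type Coupon = { code: string; used: boolean };
type Order = { paymentStatus: 'pending' | 'paid'; discountCents: number };
type World = { coupons: Map<string, Coupon>; orders: Order[]; nextCart: number };

class TestWorld {
  private worlds = new Map<string, World>();

  // POST /api/test/reset-world: replace everything owned by this run with an empty world.
  reset(runId: string): void {
    this.worlds.set(runId, { coupons: new Map(), orders: [], nextCart: 1 });
  }

  // POST /api/test/seed-checkout: create a cart and a coupon scoped to the run.
  seedCheckout(runId: string, couponCode: string): { cartId: string } {
    const world = this.worldFor(runId);
    world.coupons.set(couponCode, { code: couponCode, used: false });
    return { cartId: `${runId}-cart-${world.nextCart++}` };
  }

  // Stand-in for the product's checkout: redeems the coupon once and records an order.
  checkout(runId: string, couponCode: string, discountCents: number): void {
    const world = this.worldFor(runId);
    const coupon = world.coupons.get(couponCode);
    if (!coupon || coupon.used) throw new Error('coupon already used or unknown');
    coupon.used = true;
    world.orders.push({ paymentStatus: 'paid', discountCents });
  }

  // GET /api/test/probe-order: authoritative state for this run only.
  probeOrders(runId: string): Order[] {
    return [...this.worldFor(runId).orders];
  }

  private worldFor(runId: string): World {
    const world = this.worlds.get(runId);
    if (!world) throw new Error(`unknown runId: ${runId}`);
    return world;
  }
}
```

Because the coupon lives inside one run's world, a second worker redeeming its own `PROMO-<runId>` code cannot trigger "already used" for the first.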

Classify before fixing

| Signal | Likely class | First move |
| --- | --- | --- |
| Passes locally, fails CI worker N | Shared data | Unique seed per worker |
| Fails ~midnight UTC | Clock | Freeze Date in Arrange |
| Fails after UI PR, not logic PR | Selector | Deterministic locator; ai.act if churn |
| Intermittent 5–30s timeout | Async backend | Probe poll, not longer sleep |
| Always fails same worker | Resource lock | Isolate account/coupon |

Do not increase timeout until you classify—longer timeouts hide Arrange bugs.
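
The Clock class is easiest to see with the date check pulled into a pure function. The sketch below assumes a hypothetical `isPromoActive` helper and a promo that ends at midnight UTC. It shows why a test that reads the real clock changes result around midnight, and why passing a fixed instant during Arrange makes it deterministic. In a browser test the same idea is usually applied by pinning the page clock (recent Playwright versions expose a `page.clock` API for this) and, where the server also checks dates, pinning server time through a seed route.

```typescript
// Hypothetical promo check; the clock is injected so tests can freeze it.
type Clock = () => Date;

function isPromoActive(endsAtIso: string, now: Clock = () => new Date()): boolean {
  return now().getTime() < new Date(endsAtIso).getTime();
}

// Arrange: a frozen clock always returns the same instant.
function frozenClock(iso: string): Clock {
  const fixed = new Date(iso).getTime();
  return () => new Date(fixed);
}

const promoEnds = '2030-06-01T00:00:00Z';
const beforeMidnight = frozenClock('2030-05-31T23:59:59Z');
const afterMidnight = frozenClock('2030-06-01T00:00:01Z');

console.log(isPromoActive(promoEnds, beforeMidnight)); // true
console.log(isPromoActive(promoEnds, afterMidnight)); // false
```

With the real clock, the same spec would pass at 23:59 and fail two minutes later; with a frozen clock each case is its own deterministic test.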

Hybrid ai.act: surgical, not default

Record-replay and fully agentic suites both fail startups differently—see pure scripts vs SmartTests and pure agentic vs SmartTests.

| Layer | Use |
| --- | --- |
| Arrange | Always deterministic API/fixtures |
| Act | Playwright locators; ai.act only on volatile widgets (promos, maps, canvas) |
| Assert | Probes for money, orders, permissions; ai.verify optional for semantic UI |

// Good: deterministic truth + semantic UI where needed
await ai.act('Apply the welcome promo in the checkout sidebar');
await expect.poll(async () => {
  const o = await request.get(`/api/test/probe-order?runId=${runId}`).then(r => r.json());
  return o.discountCents;
}).toBeGreaterThan(0);

Bad: entire checkout as one ai.act chain with no probe—non-reproducible and slow.

Fix playbook

  1. Reproduce in parallel: run `npx playwright test --workers=4` locally
  2. Move setup to API: shorten the UI path to the behavior under test
  3. Replace toast asserts: probe order, balance, stage, flags
  4. Link scenario: add `// @Scenario:` for traceability when fixing
  5. Gate merge: green parallel CI before merge; no `@flaky` without ticket + TrueCoverage review
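
Step 3 relies on polling a probe with a bounded timeout instead of sleeping. Playwright's `expect.poll` already provides this; the sketch below is a framework-free approximation so the mechanics are visible. `pollProbe`, its option names, and the fake probe are illustrative assumptions, not Playwright's implementation.

```typescript
// Framework-free sketch of a bounded poll over a probe.
async function pollProbe<T>(
  probe: () => Promise<T>,
  isDone: (value: T) => boolean,
  opts: { timeoutMs: number; intervalMs: number },
): Promise<T> {
  const deadline = Date.now() + opts.timeoutMs;
  let last = await probe();
  while (!isDone(last)) {
    if (Date.now() >= deadline) {
      throw new Error(
        `probe not satisfied within ${opts.timeoutMs}ms; last value: ${JSON.stringify(last)}`,
      );
    }
    // Wait one interval, then read the authoritative state again.
    await new Promise<void>((resolve) => setTimeout(resolve, opts.intervalMs));
    last = await probe();
  }
  return last;
}

// Fake probe standing in for GET /api/test/probe-order: "paid" appears on the third read.
function makeFakeProbe(): () => Promise<{ paymentStatus: string }> {
  let reads = 0;
  return async () => ({ paymentStatus: ++reads >= 3 ? 'paid' : 'pending' });
}

pollProbe(makeFakeProbe(), (o) => o.paymentStatus === 'paid', { timeoutMs: 2000, intervalMs: 10 })
  .then((order) => console.log(order.paymentStatus)); // logs "paid"
```

Unlike a fixed sleep, the poll returns as soon as the backend has written the order, and on timeout it reports the last observed value, which is usually the fastest clue to the real cause.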

Requirement slices to cover

Use flake fixes to protect coverage of high-traffic paths, not arbitrary specs:

  • Compare prod event volume vs test-run coverage before deleting a "flaky" test
  • Prioritize /testchimp evolve for slices that flake and drive revenue

When a flaky checkout test covers the `payment_method=apple_pay` prod slice, fix Arrange; do not delete the test without a replacement.

CI checklist

  1. Unique runId per worker in all seeds
  2. No shared staging coupons, users, or carts
  3. expect.poll for webhooks and async jobs
  4. Feature-flag posture matches preview URL
  5. Global teardown for long-lived staging (optional in ephemeral CI)
  6. Parallel job required on payment/auth PRs
  7. Document frozen-clock specs when promos are date-bound
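
Item 4 can be enforced with a small posture table that maps each target URL to the flags the suite expects, applied during Arrange. The environment names, host patterns, and flag names below are invented for illustration; substitute whatever your flag provider and preview hosts actually use.

```typescript
// Hypothetical flag posture per environment; every name here is an assumption.
type Flags = Record<string, boolean>;

const BASELINE: Flags = { newCheckout: true, promoBanner: false };

const POSTURES: { name: string; matches: (host: string) => boolean; overrides: Flags }[] = [
  { name: 'local', matches: (h) => h === 'localhost', overrides: {} },
  { name: 'preview', matches: (h) => h.endsWith('.preview.example.com'), overrides: { promoBanner: true } },
  { name: 'staging', matches: (h) => h === 'staging.example.com', overrides: {} },
];

// Resolve the flags a spec should set before Act, failing loudly on unknown targets.
function postureFor(baseUrl: string): { name: string; flags: Flags } {
  const host = new URL(baseUrl).hostname;
  const posture = POSTURES.find((p) => p.matches(host));
  if (!posture) throw new Error(`no flag posture defined for ${host}`);
  return { name: posture.name, flags: { ...BASELINE, ...posture.overrides } };
}

console.log(postureFor('https://pr-42.preview.example.com').flags);
```

Throwing on an unknown host is deliberate: a new preview domain should fail at Arrange with a clear message rather than surface later as a "wrong UI path" flake.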

Anti-patterns

| Anti-pattern | Why it fails | Better approach |
| --- | --- | --- |
| waitForTimeout | Masks race | Poll probe |
| Delete flaky test | Coverage hole | Fix Arrange/Assert |
| Rerun until green | Hides flake | Parallel reproduce |
| Full ai.act suite | Slow, opaque | Probes + surgical AI |
| Shared test@ user | Session collision | Per-run seed |
| UI-only 20-step setup | Breaks on CSS | API Arrange |
| @flaky without owner | Permanent skip | Ticket + TrueCoverage check |

Example scenario

Situation: Checkout test passes locally but fails in CI worker 3.

Expected outcome: Order created exactly once with expected total.

Why UI-only automation breaks: Workers reuse the same coupon code; worker 3 hits 'already used' intermittently.

  1. Arrange: Per-run seed creates unique coupon and empty cart via API.
  2. Act: Complete checkout in UI.
  3. Assert: Probe confirms single order row and payment status—ignore toast timing.
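
The Assert step can be written as a small check over the probe response. The `OrderRow` shape and the `assertSingleOrder` helper are assumptions about what a probe route might return, shown only to make "exactly once with expected total" concrete.

```typescript
// Hypothetical probe row; field names are assumptions.
type OrderRow = { id: string; totalCents: number; paymentStatus: string };

// Throws with a specific message so CI output points at the violated rule.
function assertSingleOrder(rows: OrderRow[], expectedTotalCents: number): OrderRow {
  if (rows.length !== 1) {
    throw new Error(`expected exactly 1 order, found ${rows.length}`);
  }
  const order = rows[0];
  if (order.paymentStatus !== 'paid') {
    throw new Error(`expected paymentStatus "paid", got "${order.paymentStatus}"`);
  }
  if (order.totalCents !== expectedTotalCents) {
    throw new Error(`expected total ${expectedTotalCents}, got ${order.totalCents}`);
  }
  return order;
}
```

Checking the row count first catches double-submit bugs that a toast assertion would never see, since the UI shows the same success message either way.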

TestChimp workflow: Align `checkout_attempted` events between prod and test to spot untested failure paths.

Same Arrange/Act/Assert pattern as expired-coupon checkout.

Connect scenarios to your QA workflow

Capture business rules in markdown test plans and enforce them with seed routes and probe Assert. Link SmartTests with // @Scenario: for requirement traceability. Use /testchimp test on PRs; /testchimp explore on SmartTest paths for non-functional gaps (ExploreChimp).

Frequently asked questions

Should we delete flaky tests to unblock CI?

Fix Arrange/Assert first—shared users, missing probes, UI-only assertions. TrueCoverage confirms whether a flaky test still covers a high-traffic path before you cut coverage.

Is Playwright inherently flaky?

Usually no—the app under test shares mutable staging data or asserts before async work completes. Per-run seeds and probe polls fix most startup flakes.

When should we use ai.act to fix flake?

When selectors churn on volatile UI regions but business outcomes are probe-stable. Never replace payment, auth, or ledger asserts with AI verify.

waitForTimeout vs expect.poll?

Prefer expect.poll on probe fields with a bounded timeout. Fixed sleeps hide races and slow CI.

How do feature flags cause flake?

Preview branch may disable a step main tests expect. Use posture fixtures that set flags consistently for the target URL.

Our eng team maintains tests with no QA—what loop works?

Markdown scenarios in Git, /testchimp test on each PR to repair SmartTests, // @Scenario: traceability, /testchimp evolve after deploy via TrueCoverage.

Flake started after enabling parallel workers—now what?

Audit shared coupons, accounts, carts, and storageState. Add runId to every seed route and reproduce with --workers=4 locally before merge.

Tests pass on laptop but fail on preview URL—flake?

Often environment drift, not timing—compare BASE_URL, flags, and API keys. See [staging vs local gotcha](/guides/gotchas/e2e-environment-drift-staging-vs-local).
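
One way to separate drift from timing is to diff the configuration the two runs actually used. The `EnvSnapshot` fields and the `diffEnv` helper below are hypothetical. API keys are compared by their last four characters only, a crude fingerprint chosen so that secrets never reach CI logs.

```typescript
// Hypothetical snapshot of the settings a run resolved; field names are assumptions.
type EnvSnapshot = { baseUrl: string; flags: Record<string, boolean>; apiKey: string };

// Returns one human-readable line per difference; an empty array means no drift found.
function diffEnv(local: EnvSnapshot, preview: EnvSnapshot): string[] {
  const diffs: string[] = [];
  if (local.baseUrl !== preview.baseUrl) {
    diffs.push(`baseUrl: ${local.baseUrl} vs ${preview.baseUrl}`);
  }
  const names = Array.from(
    new Set(Object.keys(local.flags).concat(Object.keys(preview.flags))),
  ).sort();
  for (const name of names) {
    if (local.flags[name] !== preview.flags[name]) {
      diffs.push(`flag ${name}: ${local.flags[name]} vs ${preview.flags[name]}`);
    }
  }
  const tail = (key: string) => key.slice(-4);
  if (tail(local.apiKey) !== tail(preview.apiKey)) {
    diffs.push(`apiKey: ****${tail(local.apiKey)} vs ****${tail(preview.apiKey)}`);
  }
  return diffs;
}
```

If the diff is empty and the test still fails only on the preview URL, timing or shared data becomes the more likely class.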

Where is the index of atomic fixes?

The [Common E2E gotchas](/guides/gotchas/intro) hub lists symptom-first pages for selectors, auth pollution, teardown leaks, and more—this guide is the holistic playbook.

Apply these patterns in your repo

Run `/testchimp init` to connect TestChimp to your repo, then `/testchimp test` on PRs to turn these patterns into maintained SmartTests. Use `/testchimp evolve` when you want to expand coverage as your app grows.
