Why E2E Tests Flake and How Startups Fix Them
Short answer
Flakes are usually world-state problems, not "Playwright is random." Startups fix them by separating Arrange (fixtures/API seeds), Act (short UI path), and Assert (probes)—plus parallel-safe data, frozen clocks, and surgical ai.act only where selectors churn—not by rerunning CI or deleting coverage.
Who this is for
Startups whose CI is red for no product reason—especially after parallelization, new promos, shared staging users, or agent-generated UI refactors. Teams on Playwright, Cypress, or SmartTests who need a repeatable fix loop, not generic "add more waits" advice.
Why flake matters for startups
Flaky CI erodes velocity:
- False alarms: engineers ignore red builds; real regressions slip through.
- Merge queue pain: retry roulette burns minutes per PR on a 5-person team.
- Coverage erosion: `@flaky` skips and deleted specs leave prod paths untested.
- Wrong fixes: `waitForTimeout(5000)` masks Arrange bugs until staging breaks on promo day.
Most flakes trace to shared mutable state and UI-only assertions on server outcomes—not browser nondeterminism.
Complexity map
| Scenario | Edge case | Why tests break | Approach |
|---|---|---|---|
| Shared coupon | One code, N workers | Worker 3: "already used" | Per-run seed coupon |
| Shared user account | Parallel login | Session overwrite | Seed unique email per run |
| Leftover orders/carts | Prior spec data | Wrong totals | Teardown or fresh seed |
| Clock / timezone | Promo ends at midnight | Date-bound flake | Freeze time in Arrange |
| Feature flags | Branch ≠ staging flags | Wrong UI path | Posture fixture per env |
| Async backend | Webhook lag | Assert before DB write | expect.poll probe |
| Toast-only assert | Copy change | Pass/fail unrelated | Probe authoritative state |
| iframe payment | Stripe frame load | Intermittent not found | frameLocator + poll (Stripe guide) |
| Selector drift | Vibe-coded regen | Element not found | Hybrid ai.act on volatile region |
| Missing isolation | storageState reuse | Wrong role in spec | Per-spec auth seed |
| Environment drift | Preview URL config | API 404 | Multi-env posture (multi-environment) |
| Test order dependency | Spec B needs Spec A data | Order-sensitive pass | Independent Arrange per spec |
| Global rate limits | Staging API throttle | 429 in CI | Mock or per-run quotas |
| File upload race | Scan async | Assert before ready | Poll scan_status probe |
| ExploreChimp vs CI | Manual path not in suite | Prod-only UX break | Convert session to SmartTest |
World-state discipline
Treat each spec as needing a known universe:
Arrange (fixtures) → Act (UI) → Assert (probes) → Teardown (optional)
Pattern: seed route + runId
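```typescript
import { test, expect } from '@playwright/test';

// test.info() is only valid inside a test or hook, so capture the timestamp
// at load time and combine it with the worker's parallelIndex on demand.
const startedAt = Date.now();
const currentRunId = () => `${test.info().parallelIndex}-${startedAt}`;

test.beforeEach(async ({ request }) => {
  await request.post('/api/test/reset-world', { data: { runId: currentRunId() } });
});

test('checkout with run coupon', async ({ page, request }) => {
  const runId = currentRunId();
  // Arrange: seed a cart and a coupon that belong to this run only.
  const { cartId } = await request.post('/api/test/seed-checkout', {
    data: { runId, coupon: { code: `PROMO-${runId}` } },
  }).then(r => r.json());
  await page.goto(`/checkout?cart=${cartId}`);
  // Act ...
  // Assert: read the authoritative order state from the probe route.
  const order = await request.get(`/api/test/probe-order?runId=${runId}`).then(r => r.json());
  expect(order.paymentStatus).toBe('paid');
});
```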
/testchimp init scaffolds seed, probe, and teardown routes—agents repair harness and test together in the failing PR (fixtures in agent authoring).
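The per-run naming in the pattern above can be centralized so that every seed call derives its coupon and user from the same runId. The following is a minimal sketch under stated assumptions: `buildRunIdentity`, the field names, and the `example.test` email domain are hypothetical and not part of Playwright or TestChimp.

```typescript
// Hypothetical helper: derive every per-run identifier from one runId so
// parallel workers never share a coupon or a user.
interface RunIdentity {
  runId: string;
  couponCode: string;
  userEmail: string;
}

function buildRunIdentity(parallelIndex: number, startedAt: number): RunIdentity {
  const runId = `${parallelIndex}-${startedAt}`;
  return {
    runId,
    couponCode: `PROMO-${runId}`,
    // example.test is a placeholder domain (assumption).
    userEmail: `e2e+${runId}@example.test`,
  };
}

// Two workers that start at the same instant still get distinct data.
const workerA = buildRunIdentity(0, 1700000000000);
const workerB = buildRunIdentity(3, 1700000000000);
console.log(workerA.couponCode); // PROMO-0-1700000000000
console.log(workerB.userEmail);  // e2e+3-1700000000000@example.test
```

Inside a spec, `test.info().parallelIndex` could supply the first argument, with the timestamp captured once per worker.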
Classify before fixing
| Signal | Likely class | First move |
|---|---|---|
| Passes locally, fails CI worker N | Shared data | Unique seed per worker |
| Fails ~midnight UTC | Clock | Freeze Date in Arrange |
| Fails after UI PR, not logic PR | Selector | Deterministic locator; ai.act if churn |
| Intermittent 5–30s timeout | Async backend | Probe poll, not longer sleep |
| Always fails same worker | Resource lock | Isolate account/coupon |
Do not increase timeout until you classify—longer timeouts hide Arrange bugs.
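The triage table above can be encoded as a small lookup so that the first move is chosen the same way every time. This is only a sketch: `classifyFlake`, the signal names, and the check order are assumptions made for illustration, not part of any test runner.

```typescript
// Hypothetical triage helper that mirrors the classification table.
type FlakeClass =
  | 'shared-data' | 'clock' | 'selector'
  | 'async-backend' | 'resource-lock' | 'unclassified';

interface FailureSignals {
  passesLocallyFailsOnCiWorker?: boolean;
  failsNearMidnightUtc?: boolean;
  startedAfterUiOnlyPr?: boolean;
  intermittentTimeout?: boolean; // roughly 5-30s
  alwaysFailsSameWorker?: boolean;
}

interface Triage {
  flakeClass: FlakeClass;
  firstMove: string;
}

function classifyFlake(s: FailureSignals): Triage {
  // Most specific signals are checked first; the order is a judgment call.
  if (s.alwaysFailsSameWorker) return { flakeClass: 'resource-lock', firstMove: 'Isolate account/coupon' };
  if (s.failsNearMidnightUtc) return { flakeClass: 'clock', firstMove: 'Freeze Date in Arrange' };
  if (s.startedAfterUiOnlyPr) return { flakeClass: 'selector', firstMove: 'Deterministic locator; ai.act if churn' };
  if (s.intermittentTimeout) return { flakeClass: 'async-backend', firstMove: 'Probe poll, not longer sleep' };
  if (s.passesLocallyFailsOnCiWorker) return { flakeClass: 'shared-data', firstMove: 'Unique seed per worker' };
  // No signal matched: keep timeouts unchanged until there is more evidence.
  return { flakeClass: 'unclassified', firstMove: 'Collect more failing runs before touching timeouts' };
}

console.log(classifyFlake({ intermittentTimeout: true }).firstMove);
// Probe poll, not longer sleep
```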
Hybrid ai.act: surgical, not default
Record-replay and fully agentic suites both fail startups differently—see pure scripts vs SmartTests and pure agentic vs SmartTests.
| Layer | Use |
|---|---|
| Arrange | Always deterministic API/fixtures |
| Act | Playwright locators; ai.act only on volatile widgets (promos, maps, canvas) |
| Assert | Probes for money, orders, permissions; ai.verify optional for semantic UI |
```typescript
// Good: deterministic truth + semantic UI where needed
await ai.act('Apply the welcome promo in the checkout sidebar');
await expect.poll(async () => {
  const o = await request.get(`/api/test/probe-order?runId=${runId}`).then(r => r.json());
  return o.discountCents;
}).toBeGreaterThan(0);
```
Bad: entire checkout as one ai.act chain with no probe—non-reproducible and slow.
Fix playbook
- Reproduce in parallel: `npx playwright test --workers=4` locally
- Move setup to API: shorten UI path to the behavior under test
- Replace toast asserts: probe order, balance, stage, flags
- Link scenario: `// @Scenario:` for traceability when fixing
- Gate merge: green parallel CI before merge; no `@flaky` without ticket + TrueCoverage review
Requirement slices to cover
Use flake fixes to keep high-traffic paths, not arbitrary specs:
- Compare prod event volume vs test-run coverage before deleting a "flaky" test
- Prioritize `/testchimp evolve` for slices that flake and drive revenue
When a flaky checkout test covers the `payment_method=apple_pay` prod slice, fix Arrange; do not delete the test without a replacement.
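The comparison of prod volume against test coverage can be pictured as a simple set difference. The sketch below is not TrueCoverage's API; `findUncoveredSlices`, the data shape, and all numbers are hypothetical and only illustrate what deleting a flaky spec would leave uncovered.

```typescript
// Hypothetical shape: event counts keyed by slice label.
type SliceCounts = Record<string, number>;

interface SliceGap {
  slice: string;
  prodEvents: number;
  testEvents: number;
}

// Returns slices seen in prod but never exercised by tests, busiest first.
function findUncoveredSlices(prod: SliceCounts, tests: SliceCounts): SliceGap[] {
  return Object.keys(prod)
    .map(slice => ({ slice, prodEvents: prod[slice], testEvents: tests[slice] ?? 0 }))
    .filter(gap => gap.prodEvents > 0 && gap.testEvents === 0)
    .sort((a, b) => b.prodEvents - a.prodEvents);
}

// Illustrative numbers only.
const prodCounts: SliceCounts = {
  'payment_method=card': 9200,
  'payment_method=apple_pay': 3100,
  'payment_method=paypal': 450,
};
// Coverage that would remain if the flaky apple_pay spec were deleted.
const testCountsWithoutFlakySpec: SliceCounts = { 'payment_method=card': 40 };

const gaps = findUncoveredSlices(prodCounts, testCountsWithoutFlakySpec);
console.log(gaps.map(g => g.slice));
```

In this made-up data the apple_pay slice ranks first among the gaps, which is the signal to repair the spec instead of removing it.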
CI checklist
- Unique `runId` per worker in all seeds
- No shared staging coupons, users, or carts
- `expect.poll` for webhooks and async jobs
- Feature-flag posture matches preview URL
- Global teardown for long-lived staging (optional in ephemeral CI)
- Parallel job required on payment/auth PRs
- Document frozen-clock specs when promos are date-bound
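The date-bound promo item is easier to reason about when the logic under test takes the current time as an input. A minimal sketch follows, assuming a hypothetical `Promo` shape and `isPromoActive` function. On the browser side, Playwright's `page.clock.setFixedTime()` can pin the page's clock in the same spirit.

```typescript
interface Promo {
  code: string;
  endsAt: Date;
}

// The clock is a parameter, so a test can pin "now" instead of racing the wall clock.
function isPromoActive(promo: Promo, now: Date = new Date()): boolean {
  return now.getTime() < promo.endsAt.getTime();
}

const promo: Promo = { code: 'WELCOME', endsAt: new Date('2025-01-01T00:00:00Z') };

// Frozen instants on either side of the midnight UTC boundary.
const oneMinuteBefore = new Date('2024-12-31T23:59:00Z');
const oneMinuteAfter = new Date('2025-01-01T00:01:00Z');

console.log(isPromoActive(promo, oneMinuteBefore)); // true
console.log(isPromoActive(promo, oneMinuteAfter));  // false
```

With the instant fixed in Arrange, the spec gives the same result whether CI runs at noon or at 23:59 UTC.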
Anti-patterns
| Anti-pattern | Why it fails | Better approach |
|---|---|---|
| `waitForTimeout` | Masks race | Poll probe |
| Delete flaky test | Coverage hole | Fix Arrange/Assert |
| Rerun until green | Hides flake | Parallel reproduce |
| Full ai.act suite | Slow, opaque | Probes + surgical AI |
| Shared `test@` user | Session collision | Per-run seed |
| UI-only 20-step setup | Breaks on CSS | API Arrange |
| `@flaky` without owner | Permanent skip | Ticket + TrueCoverage check |
Example scenario
Situation: Checkout test passes locally but fails in CI worker 3.
Expected outcome: Order created exactly once with expected total.
Why UI-only automation breaks: Workers reuse the same coupon code; worker 3 hits 'already used' intermittently.
- Arrange: Per-run seed creates unique coupon and empty cart via API.
- Act: Complete checkout in UI.
- Assert: Probe confirms single order row and payment status—ignore toast timing.
TestChimp workflow: Align `checkout_attempted` events between prod and test to spot untested failure paths.
Same Arrange/Act/Assert pattern as expired-coupon checkout.
Connect scenarios to your QA workflow
Capture business rules in markdown test plans and enforce them with seed routes and probe Assert. Link SmartTests with // @Scenario: for requirement traceability. Use /testchimp test on PRs; /testchimp explore on SmartTest paths for non-functional gaps (ExploreChimp).
Related scenarios
- Common E2E gotchas — atomic fixes for selector drift, world-state, waits, UI-only asserts
- Checkout flows — coupons, payment probes
- Cart & promos — stacking, race conditions
- Fintech web apps — ledger probes, idempotency
- Firebase auth — parallel user seeds
- Record-replay vs TestChimp — missing Arrange/Assert
- E2E unreliability — why scripts flake at scale
Frequently asked questions
Should we delete flaky tests to unblock CI?
Fix Arrange/Assert first—shared users, missing probes, UI-only assertions. TrueCoverage confirms whether a flaky test still covers a high-traffic path before you cut coverage.
Is Playwright inherently flaky?
Usually no—the app under test shares mutable staging data or asserts before async work completes. Per-run seeds and probe polls fix most startup flakes.
When should we use ai.act to fix flake?
When selectors churn on volatile UI regions but business outcomes are probe-stable. Never replace payment, auth, or ledger asserts with AI verify.
waitForTimeout vs expect.poll?
Prefer expect.poll on probe fields with a bounded timeout. Fixed sleeps hide races and slow CI.
How do feature flags cause flake?
Preview branch may disable a step main tests expect. Use posture fixtures that set flags consistently for the target URL.
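One way to picture a posture fixture is a single explicit flag map per environment, resolved before the spec navigates anywhere. This is a sketch under assumptions: the flag names, environment names, and `postureFor` are hypothetical.

```typescript
type Flags = Record<string, boolean>;
type EnvName = 'local' | 'preview' | 'staging';

// Hypothetical flag names; the point is one explicit posture per environment.
const baseFlags: Flags = { newCheckout: true, promoBanner: false };

const overrides: Record<EnvName, Flags> = {
  local: {},
  preview: { promoBanner: true },
  staging: {},
};

// Later spreads win, so environment overrides replace the base values.
function postureFor(env: EnvName): Flags {
  return { ...baseFlags, ...overrides[env] };
}

console.log(postureFor('preview')); // { newCheckout: true, promoBanner: true }
```

A `beforeEach` hook could then send the resolved posture to a test-only endpoint, if the app exposes one, so the spec never depends on whatever flags the target URL happens to have.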
Our eng team maintains tests with no QA—what loop works?
Markdown scenarios in Git, /testchimp test on each PR to repair SmartTests, // @Scenario: traceability, /testchimp evolve after deploy via TrueCoverage.
Flake started after enabling parallel workers—now what?
Audit shared coupons, accounts, carts, and storageState. Add runId to every seed route and reproduce with --workers=4 locally before merge.
Tests pass on laptop but fail on preview URL—flake?
Often environment drift, not timing—compare BASE_URL, flags, and API keys. See [staging vs local gotcha](/guides/gotchas/e2e-environment-drift-staging-vs-local).
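A quick way to make that comparison concrete is to diff the two configurations key by key. The sketch below assumes a hypothetical `diffEnv` helper and made-up values. It compares whether a key is set, not the secret itself, so nothing sensitive is printed.

```typescript
type EnvConfig = Record<string, string | undefined>;

// Returns the sorted keys whose values differ or are missing on one side.
function diffEnv(a: EnvConfig, b: EnvConfig): string[] {
  const keys = Object.keys(a).concat(Object.keys(b));
  return keys
    .filter((key, i) => keys.indexOf(key) === i) // de-duplicate
    .filter(key => a[key] !== b[key])
    .sort();
}

// Illustrative values only.
const local: EnvConfig = {
  BASE_URL: 'http://localhost:3000',
  FLAG_NEW_CHECKOUT: 'on',
  API_KEY_SET: 'yes',
};
const preview: EnvConfig = {
  BASE_URL: 'https://preview.example.test',
  FLAG_NEW_CHECKOUT: 'off',
  API_KEY_SET: 'yes',
};

console.log(diffEnv(local, preview)); // [ 'BASE_URL', 'FLAG_NEW_CHECKOUT' ]
```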
Where is the index of atomic fixes?
The [Common E2E gotchas](/guides/gotchas/intro) hub lists symptom-first pages for selectors, auth pollution, teardown leaks, and more—this guide is the holistic playbook.
Apply these patterns in your repo
Run `/testchimp init` to connect TestChimp to your repo, then `/testchimp test` on PRs to turn these patterns into maintained SmartTests. Use `/testchimp evolve` when you want to expand coverage as your app grows.