
Your E2E tests are unreliable? Here's why

· 6 min read
Nuwan Samarasekera
Founder & CEO, TestChimp

End-to-end tests are a necessary evil: they are the last line of defense proving that something actually works in a real browser—but they break often enough that the suite becomes a burden instead of a trustworthy signal.

There are three main sources of variance that make E2E tests unreliable. Understanding them is the first step toward a suite you can actually rely on.

1. World-state variance

This is what happens when your tests run in a different world-state than the one they were written against. A common cause is a shared environment where manual testing and automated runs both happen. The world changes between runs; the next run fails for reasons that have nothing to do with the code under test.

This kind of variance does more than flake tests. It also slows feedback: if those environments only get updates after PRs merge to main, you get weaker root-cause isolation and more expensive triage when something breaks.

2. System variance

These are variances built into the stack itself: network latency, transient failures, UI paint timing, and so on. Mature frameworks like Playwright address a lot of this with built-in waiting, auto-waiting locators, and expect polling—so a big slice of “flakiness” is really tooling and patterns, not fate.
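To make the "tooling, not fate" point concrete, here is a minimal TypeScript sketch of the retry idea behind auto-retrying assertions (illustrative only; Playwright's real implementation polls on a timeout with backoff, while here attempts stand in for elapsed time):

```typescript
// Poll a check function until it yields a value or the attempt budget runs
// out. This is the core pattern behind "wait until the condition holds"
// rather than "assert once and fail on timing".
function retryUntil<T>(
  check: () => T | null, // returns a value once the condition holds, null otherwise
  maxAttempts: number,
): T {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const result = check();
    if (result !== null) return result;
  }
  throw new Error(`condition not met after ${maxAttempts} attempts`);
}
```

A locator that appears on the third poll succeeds instead of flaking—the same reason auto-waiting actions beat hand-rolled sleeps.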

3. Product variance

Even in a steady state, with no product change, modern web apps are not as simple as calculators. Behavior can be inherently non-deterministic, and that is only more true now that AI often sits in the user journey (for example, a splash or offer that appears only sometimes). Much of that variance may be irrelevant to what a given test is trying to prove.

When tests are authored with a fragile, UI-selector-heavy approach, those product-level variances show up as broken steps. The test is coupled to incidental UI, not to intent.


Solve for these three kinds of variance, and you get a suite that is finally trustworthy.

World-state variance → controlled environments

Use ephemeral environments loaded into known, predefined world-states, so each run matches the state the test was authored against as closely as possible.

System variance → solid automation primitives

This is largely where mature frameworks shine. With Playwright, you get strong primitives for timing and stability—so you are not reinventing waits on every test.

Product variance → intent where the UI is messy

This is where agentic steps in tests can help: natural-language instructions executed by an AI, instead of brittle coupling to selectors—only on the messy, flaky parts of the flow.

There is no free lunch: natural-language steps tend to be slower, costlier, and harder to debug than plain script. The goal is to use them surgically, not everywhere.

SmartTests: intent-based Playwright scripts

That tradeoff is what TestChimp SmartTests are designed around: intent-based Playwright scripts.

They are still scripts for the most part, with an extra capability for the parts of the app that fight selector-based automation—the places where you need flexibility.

Instead of:

await page.locator('.anticon.anticon-plus-square.ant-tree-switcher-line-icon > svg').nth(1).click();

Now you can write:

await ai.act('Expand the tree displayed in the left pane');

Only where you need it—the brittle, shifting UI—not for every line of the test.

Because the tests remain Playwright-based, system variance is handled by the same patterns and tooling you already trust. Run them in ephemeral environments with controlled world-states, and you have a test suite with all three variances accounted for: a suite you can trust.

And yes, we are cooking up something on the ephemeral-environment side too. Stay tuned...

References

The themes above—non-deterministic tests, UI-level flakiness, and mitigations—are well documented in both industry practice and research. These sources are a good starting point if you want to go deeper.

  1. Martin Fowler, Eradicating Non-Determinism in Tests — A widely cited overview of why tests become non-deterministic (including async and shared-state issues) and how to structure tests to get repeatable results.
    https://martinfowler.com/articles/nonDeterminism.html

  2. Google Testing Blog (George Pirocanac), Test Flakiness - One of the main challenges of automated testing (Dec 2020) — Describes categories of flakiness and why inconsistent automated tests slow development; follow-up posts in the same series expand on causes and responses.
    https://testing.googleblog.com/2020/12/test-flakiness-one-of-main-challenges.html

  3. Google Testing Blog (John Micco), Flaky Tests at Google and How We Mitigate Them (May 2016) — Early, concrete account of flaky tests at scale, including mitigation strategies and discussion of where UI tests skew flaky.
    https://testing.googleblog.com/2016/05/flaky-tests-at-google-and-how-we.html

  4. Wing Lam, Stefan Winter, Anjiang Wei, Tao Xie, Darko Marinov, Jonathan Bell, A Large-Scale Longitudinal Study of Flaky Tests, Proc. ACM Program. Lang. 4, OOPSLA, Article 202 (2020). Peer-reviewed study of when tests become flaky and how changes in code, tests, and dependencies contribute.
    https://doi.org/10.1145/3428270
    Conference entry: https://2020.splashcon.org/details/splash-2020-oopsla/78/A-Large-Scale-Longitudinal-Study-of-Flaky-Tests

  5. Microsoft Playwright, Auto-waiting (Actionability) — Official documentation for pre-action checks (visible, stable, receiving events, enabled, etc.) that reduce timing-driven failures.
    https://playwright.dev/docs/actionability

  6. Microsoft Playwright, Assertions — Describes auto-retrying assertions (expect) that wait until conditions hold, complementary to actionability for stable checks.
    https://playwright.dev/docs/test-assertions

  7. Heroku / 12factor, Dev/prod parity (The Twelve-Factor App) — Classic framing for keeping development, staging, and production sufficiently aligned so “works in my environment” mismatches show up earlier; relevant when reasoning about world-state and shared environments.
    https://12factor.net/dev-prod-parity

  8. Google Research (Diego Cavalcanti), De-Flake Your Tests: Automatically Locating Root Causes of Flaky Tests in Code at Google, ICSME 2020 — Empirical work on locating flaky-test root causes in code at Google scale; reports high accuracy for the proposed technique in their evaluation.
    https://research.google/pubs/de-flake-your-tests-automatically-locating-root-causes-of-flaky-tests-in-code-at-google/

Simplified View: No-Code Editor - Full Code Power

· 3 min read
Nuwan Samarasekera
Founder & CEO, TestChimp

TestChimp tests have always been plain Playwright under the hood — with extra capabilities like plain-English steps and lightweight scenario linking via code comments. That gives you fixtures, hooks, page objects, and the test organization you expect from a serious engineering setup.

Most QA teams are a mix of technical and non-technical teammates. Code-only authoring keeps contribution narrow. A separate no-code tool often means a second suite that drifts from the “real” tests and never gets the same CI treatment.

We added Simplified View in the web IDE so you do not have to choose.

SmartTests Simplified View in the web IDE

What Simplified View Is

Simplified View is a no-code surface for creating and editing SmartTests that still compiles to fully functional Playwright scripts. Everyone works on the same test; people just choose how they interact with the tests.

Your teammates can:

  • Add plain-English steps that are run agentically.
  • Use structured building blocks for common actions — less boilerplate and fewer syntax slips.
  • Drop in free-form code when you need it — custom waits, tricky selectors, helpers: full Playwright, no lock-in.

You pick the level of code per step and per person, not one rule for the whole team.

Why this matters

Non-technical members can contribute directly to test automation — not only by filing tickets for engineers to translate later. They build and edit steps in Simplified View; the result is still Playwright your automation folks can refine, reuse, and run in the same pipelines as everything else.

That lifts throughput for the whole team: more people can ship checks in parallel, fewer scenarios sit in a queue waiting for a coder, and engineers spend time on structure and hard cases instead of retyping flows from docs. Underneath, it stays real Playwright — deterministic runs, familiar debugging, ExploreChimp, CI, and Git workflows you already rely on.

Getting Started

Open a SmartTest in the web IDE and switch to Simplified View to author or edit steps. When you need the full script, switch to code view; both views stay aligned with the same underlying test.

For more on creating and editing SmartTests, see Creating Smart Tests.

Further Reading

If you’re interested in how no-code and low-code approaches impact QA team velocity and collaboration in general, these resources provide useful perspectives:

Prioritize Test Cases based on Real User Behaviour - TrueCoverage

· 3 min read
Nuwan Samarasekera
Founder & CEO, TestChimp

Most testing strategies are built in isolation from how users actually use a product.

Teams typically decide what to test based on:

  • feature specifications
  • developer intuition
  • past bugs

But production tells a different story.

Some features get heavy usage.
Some interactions are part of critical journeys.
Some screens are where users consistently drop off.

If your testing strategy doesn’t account for this, QA effort is being optimized in the dark.

This is the problem TrueCoverage solves.

TrueCoverage UI


Start With Real User Behaviour

TrueCoverage analyzes event data from production alongside events generated during test runs.

Instead of only measuring which tests executed, it looks at how users actually move through your product.

From this data, TestChimp derives four signals — what we call the 4Ds — to guide smarter QA planning.


The 4Ds of Product Behaviour

Event core stats

Demand

How frequently an event or interaction occurs.

High-demand interactions represent the most commonly used parts of your product.

Ensuring these features are covered in regression tests delivers the highest ROI by protecting the core capabilities of your application.


Depth

Where an interaction occurs in the user journey.

Depth distinguishes top-of-funnel interactions from deeper product workflows.

Early interactions often influence:

  • onboarding success
  • activation rates
  • user satisfaction

Testing depth helps ensure your QA strategy protects the critical entry points of your product.


Duration

How much time users spend interacting with a feature.

High duration often indicates either complex workflows or user friction.

Both require deeper testing. These areas benefit from:

  • scenario-based tests across different paths
  • validation of edge cases and error conditions
  • robustness testing for complex flows

Duration highlights where more thorough testing is needed beyond the happy path.


Drop-off

Where users exit a journey.

Drop-off points are some of the highest-value areas for testing.

If many users abandon the product at a particular step, that interaction deserves attention.

Testing around drop-off points helps uncover:

  • hidden bugs
  • validation issues
  • confusing UX
  • performance bottlenecks

Turning Behaviour Into QA Strategy

The 4Ds transform production behaviour into actionable testing insights.

For example:

  • High demand events → prioritize regression coverage
  • Top-of-funnel interactions → ensure reliability and stability
  • High drop-off points → investigate bugs or UX issues
  • Long duration flows → add scenario tests covering variations

Instead of guessing where to invest testing effort, teams can align QA with real product usage.
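As a rough illustration of how two of these signals can be computed from raw event sequences, here is a small TypeScript sketch (our own simplification for this post, not TestChimp's actual pipeline):

```typescript
type Journey = string[]; // ordered event names within one user session

// Demand: how often each event occurs across all journeys.
function demand(journeys: Journey[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const j of journeys)
    for (const e of j) counts.set(e, (counts.get(e) ?? 0) + 1);
  return counts;
}

// Drop-off: how many journeys end at each event.
function dropOff(journeys: Journey[]): Map<string, number> {
  const exits = new Map<string, number>();
  for (const j of journeys) {
    const last = j[j.length - 1];
    if (last) exits.set(last, (exits.get(last) ?? 0) + 1);
  }
  return exits;
}
```

An event with high demand but also high drop-off (say, "cart" in a checkout funnel) is exactly the kind of step that deserves regression coverage first.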


TrueCoverage

TrueCoverage compares:

  • event sequences from production
  • event sequences from test runs

This reveals the gap between:

  • what users actually do
  • what your tests actually cover

When that gap becomes visible, improving coverage becomes far more targeted.

Not more tests, but better tests aligned with real user behaviour.


Because ultimately, quality isn’t defined by how many tests you run.

It’s defined by how well your tests protect the journeys your users depend on.

In testing theory, this approach is known as "signals-based testing", a term coined by Wayne Roseberry of Microsoft and cited in leading books on testing, such as Taking Testing Seriously.

Shift-Left with Git Branch-Aware Testing

· 4 min read
Nuwan Samarasekera
Founder & CEO, TestChimp

The traditional QA bottleneck is a well-known pain point for modern development teams. For years, the industry has pushed to "shift-left" – to move testing earlier in the development lifecycle. However, a major technical hurdle has always remained: the environment gap.

When QA happens on a global "staging" environment or only after code merges to the main branch, the feedback loop is too slow. Bugs found post-merge cause expensive context-switching for developers and delay releases.

Today, we’re bridging that gap. We’ve added full branch awareness to the TestChimp platform, enabling true shift-left testing at the PR level.

Shift-Left Git Testing

Why Branch-Aware Testing?

Branch-aware testing means your QA process mirrors your Git workflow. Instead of testing "the app," you test the "feature-in-progress."

1. Test Authoring at the Feature Level

You can now switch between repository feature branches directly within TestChimp. File versions are maintained per branch, allowing QAs to sync with branch-specific remote content.

Most importantly, QAs can author tests and raise Pull Requests from TestChimp that merge directly into the feature branch. This ensures that by the time a developer is ready to merge their code, the corresponding tests are already part of the PR.

[!TIP] Security & Outsourcing: Our new GitHub App-based approach means you don't need to give external QA resources full repository access. They can work exclusively on the tests and plans folders (with PRs raised via TestChimp platform), maintaining a tight security posture.

2. Branch-Specific Test Execution

Gone are the days of manually pointing tests at different URLs. In your project settings, you can now configure a template string for branch-specific deployment URLs (e.g., Vercel or Netlify preview URLs).

When you run tests on a branch, TestChimp resolves the correct URL and injects it as a BASE_URL environment variable. Your scripts simply consume process.env.BASE_URL, ensuring they always target the correct preview deployment.
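For example, a template resolver might work like this (the `{branch}` placeholder syntax and the slug rules here are assumptions for illustration; check your project settings for the actual template format):

```typescript
// Resolve a branch-specific preview URL from a template string.
// Branch names are slugified the way preview providers commonly do:
// lowercase, non-alphanumerics collapsed to dashes.
function resolvePreviewUrl(template: string, branch: string): string {
  const slug = branch
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-|-$/g, "");
  return template.replace("{branch}", slug);
}
```

Your test scripts then stay branch-agnostic: they just read `process.env.BASE_URL`.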

Branch Management UI

3. Exploratory Testing & Smart Bug Diffing

Exploratory testing is no longer a "post-release" activity. All exploratory runs can now be executed against the branch-specific deployment.

Our agents are now smart enough to report only new bugs found on the feature branch compared to the default branch. This allows you to instantly see what UX, performance, accessibility, or internationalization issues were introduced by a specific PR – before they ever touch production.

4. QA Intelligence: Sliced by Branch

In the Atlas page, you can now filter results by branch to see exactly how a specific screen or flow was affected by a PR. This level of granularity allows teams to answer the questions that actually matter during code review:

  • "What user stories are breaking in this PR?"
  • "Are unrelated scenarios being affected by these changes?"

Seamless CI Integration

If you already have a CI pipeline that generates preview URLs, TestChimp fits right in. Simply pass that preview URL as the BASE_URL environment variable in your CI action, and your tests will execute against the live branch deployment with zero extra configuration.

Strategic Planning, Tactical Execution

While test authoring and execution are now branch-aware, we’ve intentionally kept Test Planning artifacts product-scoped.

Strategy should be stable. Planning artifacts continue to sync with the repo's default branch, ensuring your high-level test coverage goals remain consistent even as individual features are developed and tested in parallel branches.

The Future is Shift-Left

By moving QA participation closer to the development phase, you’re not just catching bugs – you’re preventing them from ever reaching the main branch. Branch-aware testing turns QA from a gatekeeper into a core part of the feature development engine.

UX Bug Traceability: Translating Bugs to Business Impact

· 2 min read
Nuwan Samarasekera
Founder & CEO, TestChimp
  • Which bugs are slowing down your checkout flow?
  • Are there localization issues in onboarding causing drop-off?

Most QA setups fall short of answering such questions—the ones that matter to the business.

The Gap Between QA and Business Impact

Traditional QA workflows surface issues --- broken buttons, layout glitches, validation errors, performance slowdowns. But when leadership asks:

  • How is this affecting conversion?
  • Is onboarding friction hurting activation?
  • Which workflow is most at risk?

The connection between bugs and business outcomes is often missed entirely, so QA ends up perceived as a cost center rather than what it really is: revenue protection.

That's the gap we're addressing today by adding UX bug traceability to our QA intelligence layer.

Guided Exploration, Not Random Crawling

Our secret sauce: Our exploratory agents don't wander randomly.

They use your existing automation tests as guides. Because tests are already linked to scenarios (via structured code comments), that same traceability naturally extends to exploratory runs and their findings.

This means every discovered issue can be traced back to:

  • The scenario
  • The user story
  • The business objective behind it

Exploration becomes business-contextual --- not isolated.

UX Bug Traceability Linkage

Structured for Insight at Any Level

In TestChimp, user stories are organized in nested folders. That structure becomes powerful when paired with traceable exploratory results.

Insights roll up automatically:

  • By area of the application
  • By workflow
  • By product surface
  • By team ownership

You can zoom in to a single scenario or zoom out to understand impact across an entire product area.

UX Bug Traceability Screenshot

Beyond a Floating Bug List

Instead of maintaining a detached list of issues, you gain visibility into:

  • Which flows are degrading user experience
  • Which exact bugs are causing latency in key user journeys that matter to your revenue
  • Where UX friction is tied to user retention

It's no longer just about "bugs found."

It's about translating them into an understanding of what is hurting your revenue --- and prioritizing accordingly.

Test Planning as Code: Your Test Artifacts, Version-Controlled and Agent-Ready

· 5 min read
Nuwan Samarasekera
Founder & CEO, TestChimp

We used to live in forms.

Historically, dropdowns and text fields were the default way we planned and managed work. But in the agentic era, the winning UX isn’t a fancy form. It’s plain, boring text.

We already see it everywhere. We use skills.md for upskilling agents. claude.md for context. Spec-based development in Cursor. But look at your test planning tools. Jira, Linear—they were built in a pre-AI era. They’re database-centric, form-heavy, and fundamentally hostile to agentic workflows.

Research shows test planning activities are still applied inconsistently, and that gap can lead to negative delivery and cost outcomes in software projects (planning activities in software testing process).

Traditionally, a test plan is a structured way to define test objectives, scope, risks, resources, and schedules—so teams can communicate what execution should accomplish (ISTQB on test plan purpose and content).

And modern testing standards treat planning as a continuous process across the lifecycle, not one-and-done documentation (ISO/IEC/IEEE 29119).

Shouldn’t test planning be as modern as coding?

Recent work suggests test artifacts can be managed more like software assets with explicit lifecycle concerns (test artifacts and lifecycle in software evolution).

And established testing guidance emphasizes requirements-based prioritization when deciding what to execute next (ISTQB test case prioritization).

That’s why we’ve reimagined test planning for the agentic era. We call it Test Planning as Code.


Plans as strongly typed markdown

In TestChimp, your plans live as strongly typed markdown files—user stories and test scenarios as .md files with YAML frontmatter, organized in folders and version-controlled alongside your codebase.

Test Scenario as Markdown

There are some pretty significant advantages to maintaining stories and scenarios as simple .md files.


First, they sync to your code repository. That means your coding and testing agents can read them and work on them directly. No proprietary API, no “export for AI”—just the same files your team already uses.

Second, you can organize them in a nested folder structure however you want. By area. By journey. By team. That structure gives agents broader context. They see related stories and linked scenarios, not just isolated tickets floating in a database.

This is what actually gets stored in your repo. No proprietary formats. No lock-in. Just plain markdown.
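To make the file shape concrete, here is a hypothetical example of such a scenario file (the frontmatter keys shown are illustrative assumptions for this post, not TestChimp's documented schema):

```markdown
---
type: test-scenario
title: Login - Invalid Credentials Error
story: user-auth/login
priority: P1
status: ready-for-testing
---

When a user submits a wrong password, an inline error message is shown and
the account is not locked before the configured retry limit.
```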


“I don’t want to manage status in a text editor”

Humans still need workflows. We need priorities, due dates, and assignees.

TestChimp layers those workflows on top of the files. You get the familiarity of a structured UI for human workflows—rich forms, status dropdowns, filters—without losing the benefits of file-first planning. The source of truth stays in the files; the platform makes them easier to work with.

User Story Form

And because TestChimp indexes everything, the AI can actually work with your plan. It can help write or refine a user story more accurately. It can suggest relevant test scenarios. It can even detail them out, grounded in your actual requirements.

Linked Scenarios


Linking tests is trivial

Once you have scenarios, linking tests is trivial. Just add a comment in your test code:

// @Scenario: Login - Invalid Credentials Error

No spreadsheets. No manual mapping. No juggling multiple tools.
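For illustration, extracting such links from test source is essentially a one-regex job (a sketch of the idea, not TestChimp's implementation):

```typescript
// Scan test source for "// @Scenario: <title>" comments and return the titles.
function extractScenarios(source: string): string[] {
  const matches = source.matchAll(/\/\/\s*@Scenario:\s*(.+)/g);
  return [...matches].map((m) => m[1].trim());
}
```

Because the link lives in the code itself, it survives refactors, renames, and repo moves alongside the test.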

Export to Git keeps your test plan in the repo—stories and scenarios live under a path you choose, with full history and pull-request workflows. Your agents and your CI see the same files.

Export to Git


Coverage at any granularity

As tests run, coverage insights update automatically. And because your stories are organized in folders, you can see coverage at any granularity—per story, per area, per component.

If you’re working in a team where different groups own different parts of the system, you already know how useful this is.

Requirement Traceability

You can finally answer the question: Which scenarios are due next week, ready for testing, but still missing coverage? No spreadsheet. No manual roll-up. Just select the folder, apply the filters, and look at the Insights tab.
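As an illustration of why folder structure makes roll-ups cheap, here is a simplified TypeScript sketch of per-folder coverage aggregation (illustrative only; TestChimp computes this for you in the Insights tab):

```typescript
type Scenario = { path: string; covered: boolean }; // e.g. "checkout/payment/card-declined"

// Tally covered/total scenarios for every ancestor folder of each scenario.
function coverageByFolder(
  scenarios: Scenario[],
): Map<string, { covered: number; total: number }> {
  const acc = new Map<string, { covered: number; total: number }>();
  for (const s of scenarios) {
    const folders = s.path.split("/").slice(0, -1); // drop the scenario leaf
    let folder = "";
    for (const part of folders) {
      folder = folder ? `${folder}/${part}` : part;
      const entry = acc.get(folder) ?? { covered: 0, total: 0 };
      entry.total += 1;
      if (s.covered) entry.covered += 1;
      acc.set(folder, entry);
    }
  }
  return acc;
}
```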


Wrapping up

Test Planning as Code is a different take on where test artifacts should live and who should be able to use them. Files in the repo instead of rows in a database; workflows layered on for humans, and that same file structure giving agents the context they need. If that approach resonates—or you’re just curious how it works in practice—we’ve documented the full workflow in the Test Planning section: authoring user stories, authoring test scenarios, export to Git, and requirement traceability.

Requirement Traceability, Without the Spreadsheet Circus

· 2 min read
Nuwan Samarasekera
Founder & CEO, TestChimp

Q: How do you currently get requirement traceability?
Which user stories and scenarios are covered by tests, and what’s failing?

A:
For most teams, it looks something like this:

User stories live in Jira.
Test cases live somewhere else.
The mapping between them lives in an Excel sheet that someone manually maintains.

That spreadsheet is periodically uploaded to a test management tool like PractiTest. Then test execution results are pushed via an API to get a view of coverage and failures.

It works—until it doesn’t.


The problem with today’s approach

This is how requirement traceability is typically achieved today: a hodgepodge of tools stitched together with process and hope.

  • Multiple sources of truth
  • Manually maintained spreadsheets that inevitably go stale
  • Fragile workflows that break as teams and test suites scale

No single system actually owns the full picture. Instead, teams spend time keeping artifacts in sync rather than improving product quality.


A simpler model with TestChimp

In TestChimp, requirement traceability isn’t an afterthought. It’s built in.

You already author detailed user stories and break them down into meaningful test scenarios directly in the platform - with AI assistance that understands your product through your existing test scripts and documentation.

Linking tests to those scenarios is intentionally simple. In your test script, add a comment:

// @Scenario: <scenario title>

That’s it! TestChimp takes care of the rest:

  • Automatically links tests to scenarios
  • Tracks execution results across runs
  • Aggregates outcomes at scenario, story, and suite level

Requirement Traceability

You get clear, real-time dashboards that let you answer business-relevant questions:

  • Which user stories are missing test coverage?
  • Which scenarios are currently failing?
  • Which tests are flaky or unstable?

All without juggling multiple tools or maintaining brittle Excel sheets.

One system, end to end

Instead of retrofitting traceability after the fact, TestChimp treats it as a first-class concept - connecting requirements, scenarios, and executions in one place.

  • No spreadsheets.
  • No manual syncing.
  • Just a single system that understands what you’re building, how it’s tested, and where the gaps are.

Special Purpose Testing Agents

· 3 min read
Nuwan Samarasekera
Founder & CEO, TestChimp

If you’re already familiar with ExploreChimp, you know it’s like having a driver navigate your web app for you. Guided by SmartTests, ExploreChimp scans the DOM, screenshots, network calls, and browser metrics to spot bugs that traditional automation scripts can't see.

While ExploreChimp gives you broad coverage, some problems only show up when you deliberately push specific edges.

That’s why we’re launching our "Troop of Special Purpose Testing Agents".

They’re all guided by the same SmartTests, but each agent is purpose-built to tackle a specific class of bugs.

Here is the starting lineup that we are launching today:


Form Validation Tester: Meet “Deadpool”

Writing negative test cases for forms is soul-crushing work. You have to think of every wrong input a user could throw at your app, then write tests to catch it.

Form Validation Agent In Action

Our Form Validation Tester, affectionately nicknamed Deadpool, does all the heavy lifting for you. You only need to define the “Happy Path” - the correct way to fill a form. From there, Deadpool goes rogue:

  • Past and future dates
  • Negative numbers
  • Random strings
  • Whitespace-only inputs
  • Invalid data formats, like numbers in text fields

It pushes your forms to the limits to ensure your validation logic holds up - all without writing a single line of negative test code.
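As a rough sketch of the kind of derivation Deadpool automates, here is how negative variants might be generated from a single happy-path value (an illustrative simplification; the agent's real input space is far larger):

```typescript
type FieldType = "text" | "number" | "date";

// Given one happy-path value, derive a few classic negative variants
// that a form's validation logic should reject or handle gracefully.
function negativeVariants(type: FieldType, happy: string): string[] {
  const variants = ["", "   ", happy.repeat(200)]; // empty, whitespace-only, oversized
  if (type === "number") variants.push("-1", "abc", "1e309");
  if (type === "date") variants.push("1900-01-01", "2999-12-31", "not-a-date");
  if (type === "text") variants.push("12345", "<script>alert(1)</script>");
  return variants;
}
```

Each variant is fed back through the happy-path steps you already defined, so one positive flow yields a whole family of negative checks.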


Theme Tester: Spot Invisible Problems

Themes are more than just aesthetic. Switching between dark mode, high contrast, or custom color palettes can break visual harmony that users expect to "just work".

Theme Agent In Action

The Theme Tester loops through all the themes your app supports, hunting for:

  • Contrast issues
  • Text visibility problems
  • Ugly color combinations

It can toggle themes via cookies, local storage, or by interacting with your app’s UI - whatever works best for your setup.


Localization Tester: No More "Lost in Translation"

Supporting multiple locales introduces a whole new set of bugs. Dates, currencies, text overflow, RTL layouts, and even cultural appropriateness can break your user experience.

Localization Agent Configuration

Our Localization Tester handles it all:

  • Detects broken translations or dangling template strings
  • Checks date and currency formatting across locales
  • Verifies layout integrity in RTL languages
  • Flags potential cultural missteps

With this agent on your team, you can support global audiences with confidence.
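One of these checks is easy to picture in code. Here is a simplified sketch of dangling-template-string detection (the placeholder syntaxes shown are assumptions; real i18n libraries vary):

```typescript
// Flag rendered text that still contains an unresolved i18n placeholder,
// e.g. "Welcome, {{user.name}}!" or "Hello, %{first_name}".
function hasDanglingPlaceholder(rendered: string): boolean {
  return /\{\{[\w.]+\}\}|%\{[\w.]+\}/.test(rendered);
}
```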


Screen Discovery Agent: Building Your App’s GPS

No test scripts yet? No problem.

The Screen Discovery Agent methodically crawls your app, visiting key screens. It automatically generates your initial SmartTest suite, so the rest of the troop can get to work.

Expanded Behaviour Map

[!TIP] You can easily add more user journeys with our Chrome Extension.


More Agents Coming Soon

This is just the beginning. We’re already working on more powerful agents, including:

  • RBAC Tester – to verify role-based authorizations work as intended
  • Network Resilience Checker – to see how your app behaves when connectivity turns flaky or the backend breaks

And we’re always looking for more ideas. If you’ve got a pain point in testing, we want to hear about it!


Ready to Try the Troop?

Stop stressing over the worst parts of testing. Let our agents handle the tedious tasks so you can focus on what really matters: building amazing experiences.

From Bug Report to Pull Request: The TestChimp x OpenHands Integration

· 3 min read
Nuwan Samarasekera
Founder & CEO, TestChimp

Let’s be honest. Finding a bug is only the start.

Then comes the context switching – reproducing the issue, digging through logs, writing a fix that doesn’t break something else…

As of today, that workflow is outdated.

Fix Bug Cover

Today, we are launching our OpenHands Integration. This isn’t just a “Chat with AI” wrapper. It is a fully automated pipeline that takes a bug found in TestChimp and turns it into a ready-to-merge Pull Request in your GitHub repository. Here is how it works, why it actually fixes things (instead of hallucinating), and how to set it up.

The "Context Gap" (Why AI usually fails at debugging)

Most AI coding agents are smart, but blind. You tell them “The cart button is broken,” and they hallucinate a fix because they can’t see the state of the application.

We solved the Context Gap. When TestChimp captures a bug (whether manually or via our automated agents), we record the entire runtime reality of that failure. When you click “Fix” via the OpenHands integration, we feed the cloud agent the complete necessary context, including:

  • Visual Bounding Boxes: We show the agent exactly where the bug is physically located on the screen.

  • API Payloads: The agent sees the actual network requests and response bodies that triggered the error.

  • Console Logs: JavaScript errors, warnings, and stack traces captured at the exact moment of failure.

  • DOM Context: The full element selectors and structure information.

  • Screen-State: Specifics on which screen and state the app was in.
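In TypeScript terms, that context package might look roughly like this (field names here are illustrative assumptions, not TestChimp's actual wire format):

```typescript
// Hypothetical shape of the bug-context payload sent to the agent.
interface BugContext {
  boundingBox: { x: number; y: number; width: number; height: number };
  apiCalls: { url: string; status: number; responseBody: string }[];
  consoleLogs: string[];
  domSelector: string;
  screenState: string;
}

const example: BugContext = {
  boundingBox: { x: 120, y: 340, width: 200, height: 48 },
  apiCalls: [{ url: "/api/cart", status: 500, responseBody: '{"error":"oops"}' }],
  consoleLogs: ["TypeError: cart is undefined"],
  domSelector: "button#add-to-cart",
  screenState: "checkout/cart",
};
```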

The OpenHands agent doesn’t guess anymore. It fetches these artifacts on-demand, analyzes your codebase, and writes a precise fix.

The Workflow: One Click, Real Code

We built this for speed. Here is what the new flow looks like:

1. Spot the Bug (or Batch Them)

You can select a single bug or use the checkboxes to select multiple bugs at once. If you have five related UI glitches, select them all. The agent is smart enough to identify common issues, group them, and address them together.

2. Click “Fix”

Hit the tool icon next to the bug. TestChimp validates your config and sends the context package to the OpenHands cloud instance.

Fix Bug Screenshot

3. Watch it Work Live

This is the cool part. We pop a success modal with a direct link to the OpenHands Conversation. You can click that link and watch the agent “think” in real-time. You see it analyzing the screenshots, reading the API logs, and reasoning through the code changes.

4. Review the PR

Once the agent is done, it automatically raises a Pull Request in your connected GitHub repository. You review the code, run your CI, and merge.

Technical Setup (How to turn it on)

This feature is available now for TestChimp Teams subscribers.

Prerequisites: You need an OpenHands account (cloud or self-hosted), and your GitHub repository must be connected to both OpenHands and TestChimp. Note: the repo connected in OpenHands must match the repo configured in TestChimp.

Configuration Steps:

  • Go to Project Settings -> Integrations -> OpenHands.
  • Enter your OpenHands API Key.
  • Select your Installation Type (Cloud or Self-hosted).
  • Click Save Configuration.

Why this matters

We are moving from “Bug Tracking” to “Bug Killing.” By giving an autonomous agent access to on-demand artifacts like bounding boxes and DOM states, we are removing the manual labor from regression testing. Stop fixing bugs manually. Let the chimp handle it.

SmartTests Now Support The Full Playwright Ecosystem

· 4 min read
Nuwan Samarasekera
Founder & CEO, TestChimp

We’re excited to announce that SmartTests now fully support the core Playwright testing patterns and constructs you know and love. This means you can write maintainable, well-structured test suites that leverage Playwright’s powerful features while still getting all the AI-powered adaptability that makes TestChimp SmartTests special.

What Are SmartTests?

For those new to SmartTests, you can think of a SmartTest as a Playwright script with a couple of twists:

Intent Comments:

SmartTest Steps include intent comments that describe what you’re trying to accomplish. When a test runs, it executes as a standard Playwright script for speed and determinism. But when a step fails, our AI agent steps in to fix the issue on the fly and raises a PR with the changes – giving you the best of both worlds: fast script execution and intelligent adaptability.

Screen-state annotations:

Markers that specify which screen and state the UI is in at a given step of the script. These annotations are authored and used by ExploreChimp to tag bugs to the correct screen-state in the SiteMap.

What's New: Full Playwright Compatibility

SmartTests now support all the essential Playwright patterns that help you build professional, maintainable test suites:

1. Hooks for Setup and Teardown

SmartTests now support all four Playwright hooks at both file and suite levels:

  • beforeAll – Run once before all tests in a suite
  • afterAll – Run once after all tests in a suite
  • beforeEach – Run before each test
  • afterEach – Run after each test

This means you can set up test data, initialize page objects, authenticate users, and clean up resources exactly as you would in standard Playwright tests.
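The ordering matches standard Playwright semantics: beforeAll/afterAll bracket the whole suite, while beforeEach/afterEach wrap each individual test. A plain-JavaScript simulation of that ordering (for illustration only; this is not Playwright's implementation):

```javascript
// Simulates Playwright's hook ordering for a suite of two tests.
const log = [];
const suite = {
  beforeAll:  () => log.push('beforeAll'),
  afterAll:   () => log.push('afterAll'),
  beforeEach: () => log.push('beforeEach'),
  afterEach:  () => log.push('afterEach'),
  tests: [() => log.push('test 1'), () => log.push('test 2')],
};

suite.beforeAll();                 // runs once, before any test
for (const t of suite.tests) {
  suite.beforeEach();              // runs before every test
  t();
  suite.afterEach();               // runs after every test
}
suite.afterAll();                  // runs once, after all tests

// log is now: beforeAll, beforeEach, test 1, afterEach,
//             beforeEach, test 2, afterEach, afterAll
```

That ordering is why per-test state (a fresh page, seeded data) belongs in beforeEach, while expensive one-time setup (authentication, servers) belongs in beforeAll.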

2. Page Object Models (POMs)

SmartTests fully support the Page Object Model pattern, allowing you to encapsulate page interactions in reusable classes. This keeps your tests clean, maintainable, and aligned with best practices.

Example:

import { test, Page } from '@playwright/test';

class SignInPage {
  constructor(private page: Page) {}

  async navigate() {
    await this.page.goto('/signin');
  }

  async login(email: string, password: string) {
    await this.page.fill('#email', email);
    await this.page.fill('#password', password);
    await this.page.click('#sign-in-button');
  }
}

test('user can sign in', async ({ page }) => {
  const signInPage = new SignInPage(page);
  await signInPage.navigate();
  await signInPage.login('user@example.com', 'password123');
});

3. Fixtures for File Uploads

SmartTests support Playwright fixtures, making it easy to handle file uploads and other test artifacts. Upload your fixture files (like test data, images, or documents) under the fixtures folder in the SmartTests tab, and they will be available during test execution.

4. Playwright Configuration

Your SmartTests folder contains a playwright.config.js file that configures the Playwright execution environment. This is essential for:

  • Browser Authentication: Set up HTTP basic auth for staging environments
  • Custom Headers: Add authorization tokens, API keys, or custom headers
  • Base URLs: Configure default URLs for your test environment
  • Viewport Settings: Set default browser viewport sizes
  • And more: All standard Playwright configuration options

Example playwright.config.js:

const { defineConfig } = require('@playwright/test');

module.exports = defineConfig({
  use: {
    baseURL: 'https://staging.example.com',
    httpCredentials: {
      username: 'staging-user',
      password: 'staging-password'
    },
    extraHTTPHeaders: {
      'Authorization': 'Bearer your-token',
      'X-Environment': 'staging'
    }
  }
});

5. Test Suites with Multiple Tests

SmartTests support organizing multiple tests in a single file using Playwright’s test.describe() blocks. You can create nested suites, group related tests together, and apply suite-level hooks – just like in standard Playwright.
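Conceptually, nested describe blocks act as namespaces that group tests and scope hooks. A plain-JavaScript sketch of the grouping behavior (a simulation for illustration, not Playwright's implementation):

```javascript
// Simulates how nested describe blocks group tests: each describe
// prefixes the names of the tests and suites registered inside it.
const registered = [];
function describe(name, body, prefix = '') {
  const fullPrefix = prefix + name + ' > ';
  body({
    test: (testName) => registered.push(fullPrefix + testName),
    describe: (childName, childBody) => describe(childName, childBody, fullPrefix),
  });
}

describe('Checkout', (s) => {
  s.test('shows the order total');
  s.describe('Payment', (p) => {
    p.test('rejects an expired card');
  });
});

// registered: ['Checkout > shows the order total',
//              'Checkout > Payment > rejects an expired card']
```

Suite-level hooks then attach to a describe block and apply only to the tests registered inside it.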

Why This Matters

These additions mean SmartTests are now fully compatible with Playwright’s ecosystem. You can:

✅ Write maintainable tests using industry-standard patterns like POMs and hooks

✅ Organize your test suite with proper grouping and structure

✅ Handle complex setups with configuration files and fixtures

✅ Reuse existing Playwright knowledge without learning new patterns

✅ Still get AI-powered fixes when tests fail – the best of both worlds!

Getting Started

If you’re already using SmartTests, you can start using these features immediately. Just structure your tests using standard Playwright patterns, and SmartTests will handle the rest.

For new users, SmartTests work just like Playwright tests – with the added benefit of AI-powered failure recovery and stepwise execution that enables guided exploration.

What's Next?

SmartTests continue to evolve, and we’re committed to maintaining full compatibility with Playwright’s ecosystem while adding intelligent features that make testing easier and more reliable. Stay tuned for more updates!

Got questions or feedback? We’d love to hear from you! Drop us a line at contact@testchimp.io.