
TestChimp Partners with Bunnyshell

· 4 min read
Nuwan Samarasekera
Founder & CEO, TestChimp

As AI coding agents become more prevalent, they are changing more than just how code gets written.

They're changing when software should be tested.

Today, we're excited to announce our partnership with Bunnyshell to bring PR-scoped ephemeral environments directly into the AI-powered QA workflows executed by TestChimp.

This partnership solves a problem that is becoming increasingly common as organizations adopt AI-assisted development at scale.

Bunnyshell Partnership Announcement

The Hidden Challenge of AI-Driven Development

AI dramatically increases PR volume.

Not only are there more pull requests being created, but those pull requests often contain substantially more changes than their human-authored counterparts.

Historically, many teams followed a workflow similar to this:

  1. Developers create PRs
  2. PRs are merged into the main branch
  3. A release is deployed to a shared staging environment at end of sprint
  4. QA validates the release

This workflow worked reasonably well when development velocity was constrained by human output.

However, as AI agents begin generating code continuously, several problems emerge:

  • More PRs are merged between testing cycles
  • Individual PRs contain more changes
  • Regressions become harder to isolate
  • Root-cause analysis becomes increasingly expensive

By the time QA identifies a problem in staging, the issue may have originated from one of dozens of recently merged pull requests.

  • Finding the offending change becomes a detective exercise.
  • Reverting safely becomes difficult.
  • Confidence in releases decreases.

The Solution: E2E tests in each PR

What if every PR was E2E tested before it reaches the main branch?

Ideally, every PR should arrive with:

  • New end-to-end tests
  • Updates to existing affected tests
  • Validation that those tests pass
  • Evidence that the feature behaves as intended

This significantly reduces the amount of uncertainty that accumulates in shared environments.

The challenge, of course, is environment availability. To test a PR, you need an environment that actually contains the PR's changes. Not just a frontend (like what Firebase / Vercel provide), but a full-stack isolated environment.

For small applications, developers can often spin everything up locally. For larger systems, that quickly becomes impractical.

This is exactly where Bunnyshell shines - ephemeral environments, spun up at lightning speed - deployed on the cloud.

How Bunnyshell Solves the Environment Problem

Bunnyshell allows teams to define their application infrastructure using a simple YAML specification.

Think of it as a blueprint describing everything required to run your application:

  • Frontend
  • Backend services
  • Databases
  • Networking
  • Environment variables
  • Dependencies between services

Once this blueprint exists, Bunnyshell can automatically provision isolated environments on demand and deploy them to your K8s cluster. Don't worry: the TestChimp SKILL transitively loads the Bunnyshell skill and authors the YAML file for your infrastructure.

Instead of testing changes in a shared staging environment, every pull request receives its own dedicated clean environment for agents to work on.

  • No shared environment.
  • No interference from other testing work (manual testing / other test suites running etc.).
  • No waiting for deployment windows.

When you run the "/testchimp test" workflow, TestChimp can now provision an ephemeral environment from your Bunnyshell config, scoped to the current PR, load the necessary test data through already-defined fixtures, and execute testing in that environment.

Result: You can now merge your agent authored PR with confidence.

This partnership brings together two complementary capabilities that are crucial to the QA shift-left paradigm:

Bunnyshell provides isolated, production-like environments for every pull request.

TestChimp provides AI-powered exploration, validation, and automated test creation.

Together, they enable a workflow where every PR can be validated in isolation before it reaches main.

The icing on the cake: TestChimp users will get 15% off their Bunnyshell bills!

Simply use code: TESTCHIMP15 when signing up.

Manual Testing with Traceability

· 5 min read
Nuwan Samarasekera
Founder & CEO, TestChimp

Tip (updated workflow): Since this post, the Manual tab also supports open-ended sessions (an objective instead of a scenario), Add bug on steps (deferred until the session finishes), and Atlas-backed screen/state selection. See the current guide: Manual test session capture.

Manual testing is still where teams catch the “this feels wrong” stuff:

  • confusing UX
  • copy bugs
  • edge-case flows
  • integration weirdness that doesn’t show up in clean scripted runs

But there’s a persistent problem: manual testing evidence doesn’t stay connected to what was planned.

Test plans live in one place. Notes and screenshots live in another. Pass/fail outcomes live in Slack threads or Jira comments. And the mapping back to the scenario is usually… memory.

Today we’re announcing a new workflow in TestChimp: Manual testing with traceability.


The idea

If your team already does test planning (user stories → scenarios), then manual execution should be:

  • tied to a specific scenario
  • captured as step-by-step evidence
  • recorded with environment + release context
  • marked passed / failed
  • queryable later as execution history (not a dead document)

That’s what this feature does.


Manual Test Session Capture (in the Chrome extension)

In the TestChimp Chrome extension, there’s now a Manual tab.

It lets a tester record a manual session while they execute a planned scenario, with traceability stored directly on the record.

Manual test session capture

What gets captured:

  • Steps: each interaction is captured as a step
  • Screenshots: uploaded automatically
  • Notes: add notes to the latest step, optionally highlighting a UI element/area
  • Outcome: mark the session as passed or failed
  • Context: environment + release (and optional git branch context)
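
The captured fields above can be pictured as one record per session. The following is a minimal sketch of that idea, not TestChimp's actual schema: every type, field, and function name here is an assumption made for illustration.

```typescript
// Hypothetical shape of a scenario-linked manual test record.
// Field names are illustrative assumptions, not TestChimp's real schema.
type Outcome = "passed" | "failed";

interface ManualStep {
  description: string;     // the interaction that was captured
  screenshotUrl?: string;  // uploaded automatically during capture
  notes: string[];         // tester notes attached to this step
}

interface ManualTestRecord {
  scenarioId: string;      // required link back to the planned scenario
  environment: string;
  release: string;
  gitBranch?: string;      // optional branch context
  steps: ManualStep[];
  outcome?: Outcome;       // set when the session ends
}

// Attach a note to the most recent step, mirroring "add notes to the latest step".
function addNoteToLatestStep(record: ManualTestRecord, note: string): void {
  const latest = record.steps[record.steps.length - 1];
  if (!latest) {
    throw new Error("No steps captured yet");
  }
  latest.notes.push(note);
}

// End the session with an explicit outcome, returning a new record.
function finishSession(record: ManualTestRecord, outcome: Outcome): ManualTestRecord {
  return { ...record, outcome };
}
```

Keeping the scenario link as a required field is what makes the record queryable later by scenario, release, or branch.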

How it works (quick walkthrough)

  1. Open the Chrome extension and switch to the Manual tab
  2. Click Create Manual Test Record
  3. Select the test scenario you’re executing (required)
  4. (Optional) pick the git branch context
  5. Click Start Capture and run the test as usual
  6. Add notes when needed (with optional element/area attachments)
  7. Click End capture, then mark passed or failed
  8. Open View execution to see the full record in TestChimp

If you want the full documentation, see Manual Test Session Capture.


Why this matters (beyond “we captured a GIF”)

Manual testing isn’t going away. But it needs to stop being unstructured.

With scenario-linked manual records, you can answer real questions without archaeology:

  • Which scenarios were manually executed for this release?
  • Where are we relying on manual validation because automation doesn’t exist yet?
  • What’s the evidence behind a “pass” when something regresses later?
  • What’s failing on a specific branch or environment?

It’s manual testing… but operationalized.


Unified coverage: manual + automated, in one view

The bigger win is what happens after you capture manual execution.

Because manual sessions are linked to the same scenarios as your SmartTests, TestChimp can provide unified requirement coverage insights across:

  • Automated runs (SmartTests in CI or in the Web IDE)
  • Manual runs (scenario-linked manual sessions with evidence + pass/fail)

So instead of two separate worlds (a test management tool for manual, and CI dashboards for automation), you get one requirement-centric view:

  • scenario coverage status
  • recent execution history (pass/fail)
  • evidence trail for manual validation
  • clear gaps where scenarios have no automated coverage yet
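
As a rough sketch of what a requirement-centric view could compute, the snippet below folds manual and automated executions into one per-scenario summary. The types and function names are hypothetical and are not TestChimp's API.

```typescript
// Hypothetical types: names are illustrative, not TestChimp's API.
type RunSource = "manual" | "automated";

interface Execution {
  scenarioId: string;
  source: RunSource;
  passed: boolean;
}

interface ScenarioCoverage {
  scenarioId: string;
  hasAutomated: boolean;  // at least one automated run exists
  hasManual: boolean;     // at least one manual session exists
  lastPassed?: boolean;   // outcome of the most recent execution, if any
}

// Fold one execution log (oldest first) into a per-scenario view.
function summarizeCoverage(scenarioIds: string[], executions: Execution[]): ScenarioCoverage[] {
  const byId = new Map<string, ScenarioCoverage>();
  for (const id of scenarioIds) {
    byId.set(id, { scenarioId: id, hasAutomated: false, hasManual: false });
  }
  for (const run of executions) {
    const entry = byId.get(run.scenarioId);
    if (!entry) continue; // ignore runs for scenarios that are not in the plan
    if (run.source === "automated") entry.hasAutomated = true;
    else entry.hasManual = true;
    entry.lastPassed = run.passed; // later runs overwrite earlier ones
  }
  return Array.from(byId.values());
}

// Scenarios validated only by hand: visible, but not counted as automation.
function manualOnly(coverage: ScenarioCoverage[]): string[] {
  return coverage.filter((c) => c.hasManual && !c.hasAutomated).map((c) => c.scenarioId);
}
```

Tracking the two sources as separate flags is the point: a manual pass shows up as evidence without being mistaken for automated coverage.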

This is the foundation for keeping your suite honest: manual validation is visible and it doesn’t get conflated with “we have automation”.


How coding agents consume this to prioritize test authoring

Once coverage and history are unified at the scenario layer, agents can treat it as an ordered backlog—especially in workflows like /testchimp test (PR-level) and /testchimp evolve (portfolio-level).

In practice, the agent pulls:

  • Requirement coverage (what scenarios are covered vs missing tests)
  • Execution history (what’s failing or flaky right now)
  • (Optionally) TrueCoverage signals (what real users do most, where they drop off)

Then it prioritizes authoring work where it has the highest leverage:

  • uncovered high-priority scenarios first
  • gaps in the exact folder/feature area the team owns
  • high-traffic paths with low coverage (when TrueCoverage is enabled)
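
One way to picture that prioritization is as a scoring function over scenario signals. This is a sketch under stated assumptions: the field names and the weights are invented for illustration and do not describe how TestChimp or any agent actually ranks work.

```typescript
// Hypothetical inputs; field names are assumptions for illustration only.
interface ScenarioSignal {
  id: string;
  priority: "high" | "medium" | "low";
  covered: boolean;       // has at least one linked automated test
  inOwnedArea: boolean;   // lives in the folder/feature area the team owns
  trafficShare?: number;  // 0..1 share of real usage, present only with usage data
}

// Arbitrary illustrative weights: priority dominates, then ownership and traffic.
const PRIORITY_WEIGHT = { high: 3, medium: 2, low: 1 } as const;

// Higher score means "author a test for this sooner".
function authoringScore(s: ScenarioSignal): number {
  if (s.covered) return 0; // already covered: nothing to author
  const ownership = s.inOwnedArea ? 1 : 0;
  const traffic = s.trafficShare ?? 0;
  return PRIORITY_WEIGHT[s.priority] * 10 + ownership * 5 + traffic * 5;
}

// Return uncovered scenarios, most valuable first. The input array is not mutated.
function authoringBacklog(signals: ScenarioSignal[]): ScenarioSignal[] {
  return signals
    .filter((s) => !s.covered)
    .sort((a, b) => authoringScore(b) - authoringScore(a));
}
```

When usage data is absent, `trafficShare` is simply treated as zero, so the ordering falls back to priority and ownership.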

The end result is a tighter loop: manual + automated executions feed the same insights, and those insights drive agents toward the most important missing tests—rather than “write more tests” as a generic goal.


Manual vs Script Gen (important distinction)

This workflow is intentionally not about generating automation.

  • Use Manual when you want an auditable record of a tester executing a scenario (pass/fail, notes, screenshots).
  • Use Script Gen when you want to capture steps and generate a Playwright SmartTest.

Try it

  • Install the Chrome extension
  • Plan scenarios in Test Planning
  • Run your next manual regression session through the Manual tab

If you have feedback on what would make manual execution records more useful (branch/release filtering, richer notes, better rollups), we’re actively iterating.

TestChimp now supports native mobile testing

· 4 min read
Nuwan Samarasekera
Founder & CEO, TestChimp

TL;DR: TestChimp now supports native mobile app testing on both iOS and Android. This brings the same seamless workflow we unlock for your web testing - just say "/testchimp test".

TestChimp native mobile testing support


What shipped

Mobile is not a separate product bolted on the side. It is the same plan → repo → agent → CI loop you use for web SmartTests, extended to native apps via Mobilewright—a Playwright-style API and toolchain for iOS and Android.

Create a TestChimp project with project type iOS or Android, connect Git for your plans and tests folders, install the TestChimp skill on Claude or Cursor, and after each PR say /testchimp test. The platform keeps doing what you expect: wiring RUM, reading scenarios, closing coverage gaps, and surfacing analytics—now on screens that live inside your app, not only in the browser.

For setup details and parity tables, see Mobile testing (iOS and Android).


Five value props for Claude-based test authoring—four are live on mobile

TestChimp’s agentic QA model rests on five pillars. On native mobile, four are fully supported today:

| Value prop | What it gives you | Mobile status |
| --- | --- | --- |
| Requirement traceability | Plans ↔ tests feedback loop; scenarios stay linked to coverage | Supported |
| TrueCoverage | Real user behaviour ↔ tests feedback loop; production informs what to automate | Supported |
| QA workflow execution | Seed/probe endpoints, fixtures for reusable world-states, test authoring, scenario linking | Supported |
| ExploreChimp | Analytics on screenshots, logs, and network from exploratory runs | Supported |
| Smart Steps | Intent-based steps in test scripts (ai.act, ai.verify, …) | Not yet |

Smart Steps remain web-only for now. Native mobile tests use standard Mobilewright APIs for UI interaction—the same deterministic, async execution model you know from Playwright, without the intent-comment layer on top.

Everything else—the closed loops between requirements, production behaviour, fixtures, and tests—carries over.


The same seamless workflow as web

You do not need a new playbook. The habit stays the same:

  1. Install the TestChimp skill on Claude or Cursor.
  2. After each PR, run /testchimp test (or your team’s equivalent in the agent host).

TestChimp then orchestrates the work you would otherwise stitch together manually:

  • RUM libraries — Wire up testchimp-rum-ios and testchimp-rum-android so production and test runs speak the same event vocabulary.
  • Instrumentation — Understand real user behaviour: segments, interaction flows, and scenarios—not just “the app launched.”
  • Plans and stories — Read markdown scenarios, pull requirement traceability insights, and see what is still untested.
  • Test authoring — Author Mobilewright tests to cover gaps, with traceability annotations where your plan expects them.
  • Spot analytics — Run ExploreChimp-style analysis on new screens: visuals, logs, network.

You still get continuous transparency of QA posture in one platform—requirements, coverage, failures, and exploration—whether the surface is a browser tab or a native view controller.


Familiar tests, less flakiness

Mobile tests are authored in a Playwright-familiar style via Mobilewright: auto-waits, async execution, and fixtures that behave like the ecosystem you already trust on web. That consistency matters when agents (and humans) move between repos that ship both web and mobile.

Fair credit where it is due: the reliability characteristics of that execution model come from Mobilewright—and we are grateful they exist. Mobilewright moved our timeline for serious native support forward by at least a year. If you need cloud-hosted real devices in CI, Mobile Use integrates with the same stack.


What to do next

If you are already on TestChimp for web, create an iOS or Android project, point Git at your plans and tests folders, and run /testchimp test on your next mobile PR. Smart Steps will follow; the feedback loops you care about for shipping quality are already there.

Why Test Plans in Code if Jira can expose an MCP?

· 2 min read
Nuwan Samarasekera
Founder & CEO, TestChimp

Why Store Test Plans in Code if Jira can expose an MCP?

If Jira can expose an MCP to fetch a list of stories, and another call to fetch or update each story — is there an advantage in maintaining them in code instead (what we enable with TestChimp)?

This question comes up often when teams try to retrofit agents into existing workflows. And there are legitimate reasons for doing that — switching costs are real. But if you’re in a greenfield-ish project, the upside of a code-first approach can be significant.

The difference is akin to comparing someone who has read the entire library to someone who has a library card.

Why Jira MCP isn't a substitute

Technically, the person with the library card also has access to everything. But access and understanding are not the same thing.

Apply the same idea to your codebase. Theoretically, you could store your code in some remote SaaS as individual files and expose three MCP tools:

  • list_files
  • read_file
  • upsert_file

Your agent would technically have “full access” to the codebase.

But that would be massively inefficient. Having the code available as colocated local files gives agents advantages that cannot be replicated through API calls:

  • Local indexing optimized for agentic retrieval
  • Structural understanding through folder organization
  • Faster whole-code operations like grep and find
  • Reading surrounding context naturally
  • Faster iteration during multi-step reasoning (Chain of thought)
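
To make the cost difference concrete, here is a toy model that counts round trips. `RemoteStore` and its methods are hypothetical stand-ins for record-at-a-time tools like the ones listed above; this is not a real MCP client.

```typescript
// A toy model: each remote tool invocation counts as one round trip.
// RemoteStore and its methods are hypothetical stand-ins for MCP-style tools.
class RemoteStore {
  calls = 0;
  private readonly files: Map<string, string>;

  constructor(files: Map<string, string>) {
    this.files = files;
  }

  listFiles(): string[] {
    this.calls += 1;
    return Array.from(this.files.keys());
  }

  readFile(path: string): string {
    this.calls += 1;
    return this.files.get(path) ?? "";
  }
}

// Searching through the remote tools: one list call plus one read per file.
function searchRemote(store: RemoteStore, needle: string): string[] {
  return store.listFiles().filter((path) => store.readFile(path).includes(needle));
}

// Searching colocated content: a single in-process scan, no round trips.
function searchLocal(files: Map<string, string>, needle: string): string[] {
  const hits: string[] = [];
  files.forEach((content, path) => {
    if (content.includes(needle)) hits.push(path);
  });
  return hits;
}
```

Both searches return the same paths, but the remote version needs N + 1 calls for N files, and an agent pays that cost again on every step of a multi-step task.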

The agent doesn’t just access the code - it starts to understand the shape of it.

Now imagine extending those same advantages beyond code. What if your knowledge base, user stories, and test scenarios lived in a form the agent could access natively?

Now your agent has business context about your product (similar to how it has code context). Not through a tool called one record at a time, but as something it can index, understand in aggregate, capture structural relationships from, and navigate naturally. It can find related stories. Connect scenarios. Understand patterns. Build context over time.

The difference isn’t access.

It’s whether the agent has a library card - or whether it has actually read the library.

To build or to buy - that is thy question

· 2 min read
Nuwan Samarasekera
Founder & CEO, TestChimp

"To build or to buy - that is thy question". In the era of LLMs, many teams seem stuck in a strange middle ground: doing neither.

Build vs Buy Illustration

When you can, theoretically, build it, purchasing suddenly feels icky. In the pre-LLM era, teams often bought things because building them was hard, expensive, or outside their expertise. Now that math feels different:

“You can build ANYTHING.”

However, many teams misread that as:

“You can build EVERYTHING.”

Those are two very different statements. Say there are 4 products you could spend time building: A, B, C and D. You can build any of them. The catch: if you choose to build A, that takes away focus from B, C and D. Try building all 4, and you end up with sub-optimal versions of each.

So what SHOULD you build? Your business. Your product. The thing that translates directly into revenue.

You can technically build a CRM, a Slack clone, and everything in between. But that comes at the cost of focusing on your own product.

Secondly, teams often heavily discount TCO (Total Cost of Ownership), which is very different from build cost:

  • Cost of upkeep - fixing bugs, maintaining infra, adding features, monitoring, testing
  • Opportunity cost - time spent maintaining non-core systems is time not spent improving your actual product
  • Loss of potential capabilities - your internal CRM probably won’t be as feature-rich as HubSpot. Their team wakes up every day thinking about making CRM better. You don’t. Your competition that chose to buy - they get to leverage all of those present and future capabilities while you are stuck living with your barebones version.

Yes - you CAN build ANYTHING. The new game is choosing which ones you build vs buy - carefully doing the math on the ROI based on TCO.

#BuildOrBuy #SaaS #BuildInPublic #StartupLife #AgenticAI

SKILLs are becoming SaaS’s best distribution hack (here’s why)

· 3 min read
Nuwan Samarasekera
Founder & CEO, TestChimp

For years, the hardest part of selling a complex technical product was not the demo—it was the learning curve. Buyers had to internalize workflows, edge cases, and “the right way” to use each feature before they could reliably get value.

That is changing fast. Agent Skills—portable folders of instructions, checklists, and resources that teach an AI agent how to work with your product—are starting to look like one of the most attractive distribution mechanisms for technical SaaS. Instead of hoping every customer reads the docs in the right order, you ship a repeatable operating procedure the agent can follow on demand.

A skill turns every “new user” into a “power user”

A well-designed Agent Skill effectively turns every user into a power user: one that knows which workflows to follow, how to use the product correctly, and how to extract maximum value from every feature.

That compresses time-to-value—the path to the “aha moment”—because the agent is not improvising from vague prompts; it is executing your intended playbook.

What we are seeing at TestChimp

We have been seeing this firsthand since launching the TestChimp Agent Skill.

For teams, the workflow is intentionally simple:

  1. Author a few user stories (or import from Jira).
  2. Install the TestChimp skill on your coding agent.
  3. After each PR, simply say /testchimp test.

The skill teaches Claude how to coordinate with TestChimp to:

  • instrument the app for TrueCoverage,
  • fetch and interpret coverage gaps,
  • write tests that address the gaps and link them to scenarios correctly,
  • run targeted exploratory testing to catch UX issues,
  • and use AI-native test steps in tests where they help.

The upgrade loop: your perfect user ships with your product

The best part is what happens when you ship new features.

With a properly designed, self-updating TestChimp Agent Skill, your "user" continuously learns your latest workflows, capabilities, and best practices—and applies them the way you intended. Your agent-side “instruction manual” can move as fast as your product, without requiring every human user to re-read release notes and learn every new capability you ship.

If you are building technical SaaS in the agent era, the product surface area is no longer only your UI and APIs. It is also the skill: the packaged expertise that turns your users into power users.



Boiling the lake - QA style

· 3 min read
Nuwan Samarasekera
Founder & CEO, TestChimp

Boil the lake - credits: https://garryslist.org/posts/boil-the-ocean

Garry Tan recently introduced a simple but powerful idea: The old adage “don’t boil the ocean” is bad advice in the AI agent era. Well - at the very least, “lakes” are now very much “boilable”.

The core insight is: AI compresses certain work by orders of magnitude. That doesn’t just make things faster - it fundamentally changes what’s feasible.

Most people ask the wrong question:

“What existing human workflows can we speed up with AI?”

That’s incremental thinking. The real leverage comes from asking:

“What powerful workflows did we avoid entirely because they were too expensive to do with humans?”

Those are your “lakes”. And with AI, many of them go from infeasible → trivial.


The QA lake

In QA - making “test authoring faster” is akin to the former. The bigger ROI lies in the granular workflows that get unlocked now that agents can take autonomy in your test automation.

The Big Idea:

Could agents execute a workflow where they continuously monitor “planned reality” (user stories / scenarios) and “production reality” (real user behaviour patterns) to improve the “tested reality” (test suite + test infra), in a continuous feedback loop? All of it done in the background, looping you in for approval of the plans they make.

Feedback Loop enabled by TestChimp

This is exactly the future we were building TestChimp for: one where agents participate in each phase of QA and use real-world insights and plan artifacts to self-direct their work strategically.


Claude + TestChimp

Today, we are adding the final piece of the puzzle: A SKILL that you can install on Claude / Cursor that enables just that.

  • In TestChimp, test plans are already maintained as Markdown files in the repo, directly accessible to agents.
  • Requirements are linked to tests via in-code comments - that Agents can author.
  • Test executions are auto-tracked by our Playwright plugin
  • Event ingests are tracked across prod and test - to generate TrueCoverage insights.

The Skill “upskills” Claude to read those insights via our CLI / MCP, to plan and execute the entire QA workflow:

  • Understand coverage gaps, prioritize (using signals exposed by TestChimp) and plan
  • Author fixtures that emulate real-world situations observed
  • Update test infrastructure (seed / probe endpoints) as needed
  • Author tests - (provisioning PR-local envs to test in and validating tests work)
  • Update instrumentations to learn about real user behaviour (for future cycles - covering new user journeys introduced)

QA workflow orchestrated by TestChimp - Overview


The best part: All of this is condensed to just 2 commands - enabling a frictionless DevX:

  • /testchimp test -> (Run after each PR) Updates plans, authors seeds / fixtures, authors tests, validates them in PR-scoped isolated environments, and instruments code for TrueCoverage.

  • /testchimp evolve -> (Run periodically / on deploy) Audits test coverage against requirements and real-user insights to “evolve” your QA infra & test suite: it covers critical under-tested areas, takes corrective actions, and runs targeted exploratory runs.


Claude can write tests. With the right feedback loop, it can fully manage an effective, self-evolving QA posture that de-risks your product continuously. This is what TestChimp enables, by making each phase of QA agent-native, informed by requirements and real user behaviour insights, in a tight feedback loop.

Fixtures - the 'unsung hero' in agentic test automation

· 4 min read
Nuwan Samarasekera
Founder & CEO, TestChimp

In E2E tests, Page Object Models (POMs) were the “popular kids”. Everyone knew them, everyone praised them. Yet not many knew of (or extensively used) "fixtures".

Fixtures have many uses, but a prominent one is piping pre-created entities that represent specific situations (a user with a valid subscription, a premium-tier org, etc.) into tests.

OK - before we go into why it matters, let's back up a bit.

Arranging the world-state for the test

Every functional test boils down to 3 steps (the 3A's):

Arrange -> Act -> Assert

In plain terms:

Given a situation (e.g. a user with an expired credit card),
When a set of actions are done (attempting checkout),
Expect a defined outcome (error message, no order created).
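
The three phases can be shown end to end without a browser. The sketch below uses a tiny in-memory checkout; every name in it is hypothetical and exists only to make the Arrange → Act → Assert structure visible.

```typescript
// A tiny in-memory checkout so the three phases can run without a browser.
// All names here are hypothetical and exist only for illustration.
interface User {
  id: string;
  cardExpired: boolean;
}

interface CheckoutResult {
  error?: string;
  orderId?: string;
}

const orders: string[] = [];

function checkout(user: User): CheckoutResult {
  if (user.cardExpired) {
    return { error: "Payment method expired" };
  }
  const orderId = `order-${orders.length + 1}`;
  orders.push(orderId);
  return { orderId };
}

function testExpiredCardCheckout(): void {
  // Arrange: construct the situation directly.
  const user: User = { id: "u1", cardExpired: true };
  const ordersBefore = orders.length;

  // Act: only the action under test.
  const result = checkout(user);

  // Assert: the visible outcome and the underlying state.
  if (result.error !== "Payment method expired") {
    throw new Error("expected an expired-card error");
  }
  if (orders.length !== ordersBefore) {
    throw new Error("no order should have been created");
  }
}

testExpiredCardCheckout();
```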

Here’s where things went sideways for a long time.

Phase change with CC test authoring

When humans were authoring tests - especially using web-based SaaS / No-code tools - they were constrained to the UI layer, due to a couple of reasons:

  1. Tools operated outside of the system
  2. QA lacked coding skills / were not allowed to work with system code due to organizational frictions

So everything had to be set up through the UI (or live system APIs), which made POMs the “sexy abstraction”: they made UI-driven setup bearable.

But that setup was never the ideal. It was the workaround.

Arriving at the situation is not the test. It is incidental complexity introduced by tooling and human limitations.

The Shape Shift in Test Automation with Claude

When Claude is authoring, it is not bound by that restriction. It has the full context of your codebase and can operate across layers. It can author seed / probe endpoints, generate data, and construct precise system states directly.

This is where fixtures shine.

Fixtures expose these pre-built states as reusable, composable building blocks:

  • “User with expired card”
  • “Account with failed payment retries”
  • “Cart with out-of-stock item”

More importantly, fixtures provision those entities with full data-isolation per test run (so that parallel workers running tests, retries etc. don’t interfere with each other). This removes many anti-patterns common in pure UI-layer test authoring - such as depending on order of tests (one to create the entities, one to update, another to delete - each depending on prior).
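
A minimal sketch of per-run isolation and composition follows. It assumes a run context carrying a test ID and retry count (Playwright exposes similar values on `testInfo`); the fixture names and entity shapes are invented for illustration.

```typescript
// Hypothetical run context; in Playwright, similar values are available on testInfo.
interface RunContext {
  testId: string;
  retry: number;
}

type CardStatus = "valid" | "expired";

interface SeededUser {
  email: string;
  cardStatus: CardStatus;
}

interface SeededCart {
  ownerEmail: string;
  items: { sku: string; inStock: boolean }[];
}

// Derive a namespace that differs per test and per retry attempt.
function isolationKey(ctx: RunContext): string {
  return `${ctx.testId}-${ctx.retry}`;
}

// Base building block: a user whose identity is unique to this run.
function userFixture(ctx: RunContext, cardStatus: CardStatus): SeededUser {
  return { email: `user-${isolationKey(ctx)}@example.com`, cardStatus };
}

// A named situation exposed as a reusable building block.
function userWithExpiredCard(ctx: RunContext): SeededUser {
  return userFixture(ctx, "expired");
}

// Fixtures compose: this one builds on the base user fixture.
function cartWithOutOfStockItem(ctx: RunContext): SeededCart {
  const owner = userFixture(ctx, "valid");
  return {
    ownerEmail: owner.email,
    items: [{ sku: `sku-${isolationKey(ctx)}`, inStock: false }],
  };
}
```

Because every entity is namespaced by the run, no test needs another test to have created, updated, or deleted anything first.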

Shape Shifting of Test Automation Work with CC

Now your tests change shape:

  • Arrange → mostly handled via reusable, API-backed fixtures
  • Act → only the actions that actually matter
  • Assert → UI checks plus direct state validation via probe endpoints

The result: faster tests, more reliable tests, and far less noise.

TrueCoverage - Write fixtures that mirror real-world

Here’s where things get even more interesting:

What if Claude could learn what situations occur in the real world? Then, it can author fixtures that emulate them - prioritized by impact - resulting in coverage that actually de-risks your product against real user behaviour.

Production informed feedback loop for fixtures + tests

This is exactly what TestChimp’s TrueCoverage unlocks: a feedback loop where agents can continuously learn from production insights and generate fixtures that mirror real-world situations.

  • Not guessed. Not happy-path-heavy assumptions.
  • Actual situations your users experience.

That’s when your test suite stops being synthetic - and starts becoming representative of “what your users experience”.

POMs helped us survive UI-driven testing.

Fixtures unlock systemic scenario coverage in the agentic automation era.


The Real Reason Claude Beats Every UI Testing Tool

· 5 min read
Nuwan Samarasekera
Founder & CEO, TestChimp

Web-based test authoring hits a structural ceiling against Claude / Cursor-class tooling—not because those agents are “smarter at clicking,” but because test automation is not UI steps.

It cuts across system infra, test infra, and the test layer. Treat it as UI-only and you get slow, flaky suites.

Phase change with CC test authoring


A Simple Example: Checkout with an Expired Card

Scenario: checkout with an expired card.

What most UI-driven tests look like

Via UI you often:

  • create a new user
  • sign up
  • verify email
  • add a card
  • manipulate expiry (if even possible)
  • add items to cart
  • navigate to checkout

Then:

  • click “checkout”
  • assert error message

Long, brittle, and mostly setup, not the behavior you care about.


What a Well-Structured Test Looks Like

Arrange → Act → Assert, applied properly.

Arrange (system + test infra)

Build state directly instead of simulating it through the product UI.

POST /test/seed/user
{
  "plan": "premium",
  "paymentMethod": {
    "status": "expired"
  }
}

This is a seed endpoint — a test-specific API that creates the exact state you need.


Fixture (test infra abstraction)

Wrap seeds in fixtures so tests stay readable:

const user = await createUserFixture({
  paymentStatus: "expired"
}, testInfo);

Fixtures hide setup details and scope isolation for parallel runs and retries.
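
For illustration, here is one way such a fixture could wrap the seed endpoint. This is a sketch, not TestChimp's implementation: the `SeedClient` type, the injected third parameter, and the response shape are assumptions added so the example runs without a server.

```typescript
// Hypothetical seed client: posts JSON to a test-only route and returns the created entity.
type SeedClient = (path: string, body: unknown) => Promise<{ id: string }>;

// Minimal stand-in for the parts of a test runner's run info used here.
interface RunInfo {
  testId: string;
  retry: number;
}

type PaymentStatus = "valid" | "expired";

interface UserFixture {
  id: string;
  email: string;
  paymentStatus: PaymentStatus;
}

// Wraps the seed call so tests only state the situation they need.
async function createUserFixture(
  options: { paymentStatus: PaymentStatus },
  info: RunInfo,
  seed: SeedClient,
): Promise<UserFixture> {
  // A per-test, per-retry email keeps parallel workers and retries apart.
  const email = `user-${info.testId}-${info.retry}@example.com`;
  const created = await seed("/test/seed/user", {
    email,
    plan: "premium",
    paymentMethod: { status: options.paymentStatus },
  });
  return { id: created.id, email, paymentStatus: options.paymentStatus };
}
```

Injecting the client keeps the fixture independent of any particular HTTP library; in a real suite it would likely be bound once to the test environment's base URL.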


Act (test layer)

await page.goto("/checkout");
await checkout(user);

Only the UI you need for the behavior under test.

Assert (UI + system validation)

UI:

await expect(errorBanner).toContainText("Payment method expired");

Probe the system too:

GET /test/probe/order-status?userId=...

Validate:

  • no order was created
  • payment was not processed

UI can lie; backend state usually doesn’t.
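
A probe check along those lines might look like the sketch below. The response shape (`orderCount`, `paymentProcessed`) and the `ProbeClient` type are assumptions; the source only names the route.

```typescript
// Hypothetical probe response for GET /test/probe/order-status; the shape is an assumption.
interface OrderStatusProbe {
  orderCount: number;
  paymentProcessed: boolean;
}

// Hypothetical client that performs the GET and parses the response.
type ProbeClient = (path: string) => Promise<OrderStatusProbe>;

// Fails loudly if the backend recorded side effects the scenario forbids.
async function assertNoOrderSideEffects(probe: ProbeClient, userId: string): Promise<void> {
  const status = await probe(`/test/probe/order-status?userId=${encodeURIComponent(userId)}`);
  if (status.orderCount !== 0) {
    throw new Error(`expected no orders, found ${status.orderCount}`);
  }
  if (status.paymentProcessed) {
    throw new Error("payment was processed for an expired card");
  }
}
```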


Seed & Probe: The System infra for testing

“API testing” is the wrong mental bucket. You want two test-shaped capabilities:

Seed endpoints

  • construct state directly
  • bypass irrelevant flows
  • deterministic

Probe endpoints

  • verify backend state
  • confirm side effects
  • act as your test oracle

Without both: slow (UI-heavy setup) or shallow (UI-only asserts).


Why API Support Alone Doesn’t Solve This

Record-and-play “API steps” in Mabl/Katalon-style tools still hit production-shaped APIs: multi-step flows, side effects, and no way to create impossible-but-needed states (e.g. a coupon that expired yesterday). Chaining those calls simulates state; it does not give you deterministic seed/probe primitives.


The Real Limitation of No-Code Platforms

Platforms like Mabl or Katalon operate outside your system.

They cannot:

  • introduce seed endpoints
  • define probe endpoints
  • evolve system-level test primitives
  • share abstractions with backend code

So they are constrained to:

“Whatever the system already exposes”

Which forces:

  • UI-driven setup
  • or fragile API chains

The model stays step flows, not state definitions—test-layer only, while serious automation cuts through system + test infra + tests.

Shape Shifting of Test Automation Work with CC


Fixture Design: Where Reliability Comes From

Fixtures are what make suites parallel, retry-safe, and deterministic.

A bad fixture:

user@example.com

This breaks when:

  • tests run in parallel
  • retries reuse polluted state

A good fixture uses runtime context:

const uniqueId = `${testInfo.testId}-${testInfo.retry}`;
const email = `user-${uniqueId}@example.com`;

That pattern is the difference between stable and flaky at scale.


Why This Matters More Now

No-code tools optimized for “what can be done from the outside?” because QA rarely owned system changes. Agents in the repo do own them—adding seed/probe routes, fixtures, and tests is now cheap in engineering time, not a special project.


A Better Mental Model

Ask what state, how to build it fastest, how to prove it in the system—not “how do I click through the app to get there.”

seed → fixture → minimal UI → probe


Safety Considerations

Seed and probe routes must be test-only: right environment, authenticated, disabled or guarded in production—by design, not bolted on later.
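
One way to make that guard explicit is a single deny-by-default check that every seed and probe route calls first. This is a sketch under assumptions: the environment names, the token scheme, and the canServeTestRoute function are all hypothetical, and a real deployment would add whatever its own auth layer requires.

```typescript
interface GuardInput {
  environment: string;     // e.g. the value of a deployment environment variable
  providedToken?: string;  // token sent by the test runner
  expectedToken?: string;  // secret configured only in test environments
}

// Assumed allow-list; production is deliberately absent.
const ALLOWED_ENVIRONMENTS: string[] = ["test", "staging", "preview"];

function canServeTestRoute(input: GuardInput): boolean {
  // Deny by default: unknown or production environments never expose test routes.
  if (ALLOWED_ENVIRONMENTS.indexOf(input.environment) === -1) return false;
  // No configured secret means the routes stay off, even in allowed environments.
  if (!input.expectedToken) return false;
  // The caller must present the matching token.
  return input.providedToken === input.expectedToken;
}
```

Because the allow-list names the environments where test routes may exist, a misconfigured production deploy fails closed instead of open.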


Caveat: Claude-authored scripts are still selector-bound

Agents excel at emitting Playwright—locators, waits, structure—but that still freezes intent → selector at author time. Shipping UI brings selector drift, variance (themes, experiments, i18n, hydration), and layout noise—the same flake class, just produced faster.

Products like Spur and Momentic often move the resolution of intent against the live UI to execution time (where “smart” stability lives), but frequently do so inside proprietary authoring environments, which sit awkwardly next to git-native tests.

Split the work: Claude keeps seed → fixture → stable UI → probe explicit; reserve execution-time resolution for the messy spans via optional intelligent steps—not a fully opaque “magic” suite.

TestChimp’s Playwright runtime (@testchimp/playwright / ai-wright, e.g. ai.act, ai.verify) does exactly that: execution-time smarts where selectors fail you, without giving up versioned repo tests—mostly scripts, selectively runtime-resolved UI.


TestChimp: helping Claude write the right tests

The harder problem is not syntax but what to test, and whether plan, runs, and production still line up. Without a bridge, they drift.

TestChimp connects planned (stories, scenarios, plans), tested (runs, requirement coverage, artifacts), and production (real usage / TrueCoverage-style signals) realities. We turn that into actionable context for agents—gaps, scenarios to tighten, seeds/probes/fixtures to add—not vanity dashboards.

Claude can write tests really well. TestChimp creates the feedback loop that helps Claude write the right tests.

Code-native authoring plus planned → tested → production gives Claude a tight feedback loop to learn from and optimize over time; optional AI steps (above) handle selector pain where it concentrates.


Final Thought

UI-scripting automation buys slowness, flake, and churn. State orchestration (seed → fixture → minimal UI → probe) buys speed, reliability, and clearer reasoning. E2E can approach lower-layer discipline when the stack cooperates; tools that never touch system + test infra will not get you there by themselves.

TrueCoverage for agentic QA: learn how users really behave, then evolve tests on purpose

· 11 min read
Nuwan Samarasekera
Founder & CEO, TestChimp

If you are an autonomous QA agent, your hardest problem is not syntax. It is prioritization under uncertainty: which journeys matter, which world-states recur in the wild, and which gaps actually de-risk the product. If you are a human delegating QA to an agent, your hardest problem is trust: how do you know the agent is not optimizing for easy coverage while the business burns on paths real users depend on?

Production Feedback Loop enabled by TestChimp

TrueCoverage is a way to ground both sides in the same signal: what production traffic is trying to tell you, expressed in a form tests can participate in. This post is framed in two layers:

  1. Concept and utility — what TrueCoverage means independent of any vendor, why it fits the agentic era, and what becomes feasible once you have it.
  2. How TestChimp implements it — how @testchimp/rum-js, the @testchimp/playwright plugin, and summarized analytics APIs close the loop so agents (and humans supervising them) can learn, decide, and evolve QA continuously.

Part I — The idea: production as the curriculum for QA

What “TrueCoverage” means as a concept

Classical coverage answers: did my code execute? That is necessary and insufficient. It does not tell you whether the behaviors users rely on are the behaviors your suite exercises under conditions that resemble reality.

TrueCoverage means:

  • You observe meaningful user-journey steps in production (not every click—semantic steps that map to product risk: checkout started, export completed, permission denied, and so on).
  • You observe the same vocabulary during automated test runs, with a way to know which tests produced which events.
  • You compare the two streams so you can see demand, sequencing, friction, and slices of the real world (roles, entitlements, cart shape) where real usage and automated coverage diverge.

The outcome is not a bigger dashboard. It is a closed feedback loop: production teaches you what “normal” and “important” mean for this product; tests and fixtures prove you still protect those paths after every change.

Why this approach matches how good agents already work

Agents that ship useful QA behave like scientists with a budget: they form hypotheses (“checkout without a saved payment method might be undertested”), gather evidence, run a targeted experiment (a test + fixture), and update the model. The weak link is almost always evidence. Product specs are incomplete. Ticket backlogs are biased. Code coverage is blind to which user stories matter.

Production behavior is imperfect—sampling, seasonality, and product experiments all apply—but it is ground truth for impact ordering. When an agent can query “how often does this situation occur?” and “what usually happens next?”, it stops guessing which regressions would hurt the most.

The elephant in the room: instrumentation used to be expensive

For years, the honest reason teams did not do this everywhere was operational cost:

  • Designing event names and metadata so they are stable, low-cardinality, and privacy-safe is skilled work.
  • Plumbing init, helpers, env-specific keys, and batching behavior across a large frontend is tedious.
  • Maintaining that layer across refactors—without breaking analytics or leaking identifiers—is ongoing tax.
  • Interpreting raw event lakes often required a data partner, not a QA engineer.

So the idea of aligning tests with real journeys was always sensible; the implementation and upkeep were the barrier. Teams defaulted to intuition, bug history, and line coverage because those scaled with human attention spans.

Why that burden collapses in the agentic era

Agentic coding changes the economics:

  • Boilerplate (init wrappers, typed emit helpers, progress trackers, event documentation) is exactly the sort of work models do quickly and consistently.
  • Refactor propagation—rename a flow, split a route, move state—becomes a task you can assign: “keep emitCheckoutProgress aligned with the new module boundaries.”
  • Governance at scale—dot-scoped metadata keys, cardinality rules, “no raw IDs in metadata”—can be enforced as repeatable policies in code review and in agent instructions, not as tribal memory.

What becomes feasible once agents can “see” real usage

Below are some capabilities that are unlocked when an agent can pull summarized production-test deltas on demand.

1. Fixtures that mimic real-world situations—not demo data

Suppose checkout emits a semantic event checkout_attempted with bounded metadata such as user.has_fop (form of payment on file: true / false). Production aggregates might show that a large share of attempts happen with user.has_fop=false, while your automated runs almost always hit true because the seed user is “too perfect.”

An agent can:

  • Treat that skew as a coverage gap on a risk-bearing slice, not a vanity metric.
  • Author or extend a Playwright fixture (or API seed flow) that creates a user without FOP, then add a test that asserts the expected behavior (validation, alternate payment path, error copy, telemetry).
  • Document the event slice in repo-local knowledge (plans/events/*.event.md style) so the next agent does not reinvent the schema.

The point is not “more metadata.” The point is metadata that matches how the product branches in reality, so fixture work is evidence-backed.
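
The skew described above is simple to detect once value counts for a metadata key are available for both scopes. The sketch below assumes a plain counts-per-value shape and a 20-point gap threshold; both are illustrative assumptions rather than the platform's actual response format, and the numbers are made up.

```typescript
// Counts of events per metadata value, e.g. for the key user.has_fop.
type ValueCounts = Record<string, number>;

interface SliceGap {
  value: string;
  productionShare: number;
  automationShare: number;
}

// Fraction of all counted events that carry the given value.
function shareOf(counts: ValueCounts, value: string): number {
  const total = Object.keys(counts).reduce((sum, key) => sum + counts[key], 0);
  return total === 0 ? 0 : (counts[value] || 0) / total;
}

// Returns values that are common in production but rare in automated runs,
// largest gap first. minGap is an assumed threshold, tune it per product.
function findSliceGaps(
  production: ValueCounts,
  automation: ValueCounts,
  minGap: number = 0.2
): SliceGap[] {
  return Object.keys(production)
    .map((value) => ({
      value,
      productionShare: shareOf(production, value),
      automationShare: shareOf(automation, value),
    }))
    .filter((gap) => gap.productionShare - gap.automationShare >= minGap)
    .sort(
      (a, b) =>
        b.productionShare - b.automationShare - (a.productionShare - a.automationShare)
    );
}

// Illustrative numbers only, not real data.
const productionHasFop: ValueCounts = { "true": 600, "false": 400 };
const automationHasFop: ValueCounts = { "true": 98, "false": 2 };
const fopGaps = findSliceGaps(productionHasFop, automationHasFop);
```

With these sample counts the only slice flagged is user.has_fop=false, which is exactly the fixture the seed user was too perfect to exercise.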

2. Journey prioritization from sequences, not screenshots

Agents excel at graph-like reasoning when you give them a graph. TrueCoverage-style child event trees and transition summaries answer questions humans ask in war rooms—“after someone opens the importer, what do they actually do next?”—without watching session replays for hours.

Example: production might show that after import_started, the most common next step is mapping_confirmed, but a non-trivial fraction goes to import_cancelled within seconds. If tests always march the happy path to mapping_confirmed, you may be blind to early abandonment bugs (performance, confusing copy, default file type issues).

An agent can prioritize a short journey test for the high-drop branch, or an instrumentation pass if the “cancel” events are too coarse to explain why.
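
A transition summary of this kind boils down to counting what follows a given event. The sketch below computes next-event shares from ordered session event lists; the input shape and the nextEventShares name are assumptions for illustration, and the sessions are invented.

```typescript
// Ordered event titles observed in one session.
type SessionEvents = string[];

// For every occurrence of `from`, count the event that immediately follows it
// and return each follower's share of all observed transitions.
function nextEventShares(sessions: SessionEvents[], from: string): Record<string, number> {
  const counts: Record<string, number> = {};
  let total = 0;
  for (const events of sessions) {
    for (let i = 0; i < events.length - 1; i++) {
      if (events[i] !== from) continue;
      const next = events[i + 1];
      counts[next] = (counts[next] || 0) + 1;
      total++;
    }
  }
  const shares: Record<string, number> = {};
  for (const title of Object.keys(counts)) {
    shares[title] = counts[title] / total;
  }
  return shares;
}

// Invented sessions, for illustration only.
const importSessions: SessionEvents[] = [
  ["import_started", "mapping_confirmed", "import_completed"],
  ["import_started", "mapping_confirmed"],
  ["import_started", "import_cancelled"],
  ["import_started", "mapping_confirmed", "import_completed"],
];
const afterImportStarted = nextEventShares(importSessions, "import_started");
```

In this sample a quarter of sessions cancel right after import_started, a branch that a happy-path-only suite would never touch.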

3. Using Demand, Duration, Drop-off, and Depth as a shared prioritization language

TrueCoverage analytics align well with a compact strategy: the 4Ds (how TrueCoverage metrics work)—Demand (how often something shows up), Duration (dwell and pacing), Drop-off (abandonment and terminal sessions), Depth (where a step sits in the funnel). Depth is especially important for prioritization because top-of-funnel steps guard everything downstream: if sign-up, workspace creation, or the first checkout screen is flaky, slow, or wrong, users and sessions never reach the deeper flows your suite might obsess over—so automation that skips straight to “step seven” can look green while production is bleeding at the door.

Together the 4Ds steer agents away from covering easy code and toward protecting painful journeys.

Concrete prioritization examples:

  • High demand + absent in test-tagged traffic → add or extend regression coverage soon.
  • Early funnel (shallow depth) + high demand or high drop-off → harden entry paths first: stronger tests, fixtures, and instrumentation for the gate events; defer deep-journey expansion until those steps are reliably exercised—otherwise you optimize coverage for journeys most real sessions never complete.
  • High drop-off + shallow tests → add negative paths, resilience, and performance-aware checks.
  • High duration → broaden scenarios (large payloads, slow networks) rather than a single happy-path click-through.

This is the difference between an agent that writes “a test” and an agent that writes the test the business would have asked for if it had perfect memory of last month’s traffic.
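
The prioritization rules above can be written down as a small rule function. This is a sketch: the EventSummary shape, the thresholds, and the action strings are all assumptions chosen for illustration, not values the platform prescribes.

```typescript
interface EventSummary {
  title: string;
  demand: number;      // share of production sessions containing the event, 0..1
  dropOff: number;     // share of sessions that end at this event, 0..1
  depth: number;       // position in the funnel, 1 = entry step
  durationMs: number;  // typical dwell time on the step
  coverage: "none" | "happy-path" | "thorough";
}

// Assumed thresholds; every product will want its own.
const HIGH_DEMAND = 0.3;
const HIGH_DROP_OFF = 0.2;
const EARLY_FUNNEL_DEPTH = 2;
const HIGH_DURATION_MS = 30000;

function recommend(summary: EventSummary): string[] {
  const actions: string[] = [];
  const highDemand = summary.demand >= HIGH_DEMAND;
  const highDropOff = summary.dropOff >= HIGH_DROP_OFF;

  // High demand + absent in test-tagged traffic.
  if (highDemand && summary.coverage === "none") actions.push("add regression coverage");
  // Early funnel + high demand or high drop-off.
  if (summary.depth <= EARLY_FUNNEL_DEPTH && (highDemand || highDropOff)) {
    actions.push("harden entry path first");
  }
  // High drop-off + shallow tests.
  if (highDropOff && summary.coverage !== "thorough") {
    actions.push("add negative-path and resilience checks");
  }
  // High duration.
  if (summary.durationMs >= HIGH_DURATION_MS) actions.push("broaden scenarios");
  return actions;
}

// Invented summary, for illustration only.
const checkoutActions = recommend({
  title: "checkout_started",
  demand: 0.45,
  dropOff: 0.25,
  depth: 1,
  durationMs: 8000,
  coverage: "none",
});
```

Keeping the rules in code rather than in someone's head means an agent and a reviewer can argue about a threshold instead of about a vague sense of priority.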

4. Continuous “evolve QA” instead of annual suite audits

When digestible analytics are API-accessible, QA improvement becomes a loop aligned with shipping:

Analyze aggregated production vs automated scopes → Plan instrumentation/tests/fixtures → Execute in the repo → Verify in CI → repeat on the next meaningful traffic shift.

Humans stay in control of goals and risk appetite; agents handle volume, consistency, and follow-through.


Part II — How TestChimp turns the concept into an agent-ready system

The conceptual loop needs three mechanical pieces: emit in the app, tag during automation, compare in a platform. TestChimp wires all three and exposes the result as summaries agents can consume without becoming data engineers.

TrueCoverage powered agentic QA loop in TestChimp

1. @testchimp/rum-js: production speaks the same language as tests

The application under test integrates @testchimp/rum-js (see the library README for init, emit, flush, configuration, and event constraints). Typical practice:

  • Call testchimp.init() once at bootstrap with projectId, apiKey, and an environment tag (for example production vs staging).
  • Prefer a single helper (for example emitProductEvent) wrapping testchimp.emit({ title, metadata }) so event names and metadata stay consistent.
  • Control volume through config (caps per session, repeats per title, batching intervals, kill switches)—agents can tune this deliberately instead of flooding pipelines.

Agent-relevant discipline: keep titles semantic (subscription_renewed) rather than noisy (blue_button_clicked). Keep metadata low-cardinality and non-identifying—think user.role, org.plan_tier, cart.is_empty—not raw IDs or free text. That is how the platform can return per-value coverage without privacy explosions. Dot-scoped keys like user.has_fop help agents map analytics slices directly to fixture dimensions.
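
A shared helper is also a natural place to enforce that discipline. The sketch below wraps an injected emit function rather than importing the library, so only the { title, metadata } call shape mentioned above is assumed; the naming patterns, the length cap, and createEmitProductEvent itself are hypothetical conventions.

```typescript
type MetadataValue = string | number | boolean;

interface ProductEvent {
  title: string;
  metadata: Record<string, MetadataValue>;
}

// Stand-in for the real emit call, injected so the helper stays testable.
type Emit = (event: ProductEvent) => void;

const TITLE_PATTERN = /^[a-z]+(_[a-z]+)*$/;    // e.g. subscription_renewed
const KEY_PATTERN = /^[a-z_]+(\.[a-z_]+)+$/;   // dot-scoped, e.g. user.has_fop
const MAX_STRING_LENGTH = 32;                  // crude guard against free text (assumption)

function createEmitProductEvent(emit: Emit) {
  return function emitProductEvent(
    title: string,
    metadata: Record<string, MetadataValue> = {}
  ): void {
    if (!TITLE_PATTERN.test(title)) {
      throw new Error(`Event title is not snake_case: ${title}`);
    }
    for (const key of Object.keys(metadata)) {
      if (!KEY_PATTERN.test(key)) {
        throw new Error(`Metadata key is not dot-scoped: ${key}`);
      }
      const value = metadata[key];
      if (typeof value === "string" && value.length > MAX_STRING_LENGTH) {
        throw new Error(`Metadata value for ${key} looks like free text`);
      }
    }
    emit({ title, metadata });
  };
}

// Usage with a recording emitter in place of the real library call.
const recorded: ProductEvent[] = [];
const emitProductEvent = createEmitProductEvent((e) => { recorded.push(e); });
emitProductEvent("checkout_attempted", { "user.has_fop": false });
```

In application code the injected function would forward to the library's emit; in unit tests a recorder like the one above keeps the conventions checkable without any network traffic.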

Product overview: TrueCoverage intro.

2. Playwright reporter: the same events, tagged with test identity

Automated runs are only comparable to production if tests emit the same event titles (or a deliberate, documented mapping) and the platform can tell automation apart from anonymous traffic. TestChimp’s Playwright integration—@testchimp/playwright—tags RUM events with test identity during runs so coverage comparisons can answer: “Did this suite actually exercise checkout_attempted in the last seven days of CI?”

That is what makes “coverage” mean behavioral coverage of real journeys, not merely “we ran N tests.”

3. Execution scopes: compare apples to apples, on purpose

Agents should treat scopes as first-class inputs (see TrueCoverage workflow docs in your agent instructions). In practice:

  • A base scope anchored on the environment that best reflects real users (often production) drives funnel-relative metrics: frequency, transitions, terminal behavior, session counts.
  • A comparison scope (often QA or staging) answers what automation (or a specific branch/release) is doing in the same vocabulary.
  • automationEmitsOnly on comparison or child-tree scopes is how you ensure “covered” means test-tagged emits, not a manual tester clicking around on the same environment.

Getting this wrong is how teams accidentally overstate coverage. Getting it right is how agents earn trust from humans who offload QA.

4. Data APIs and MCP tools: digested signal for decisioning

TestChimp exposes TrueCoverage through APIs mirrored in MCP tools (for example list-rum-environments, get-truecoverage-events, get-truecoverage-event-details, get-truecoverage-child-event-tree, transition and time-series endpoints, metadata key discovery). The agent workflow is intentionally summary-first:

  1. List environments → pick stable tags for base vs comparison.
  2. Fetch event summaries → find high-impact titles with weak or absent coverage status vs the comparison scope.
  3. Drill into details → metadata distributions, example sessions, next-event trees.
  4. Translate gaps → SmartTests, fixtures, seed endpoints, or new emits if the product changed but telemetry did not.

Because payloads are aggregated—frequencies, histograms, coverage badges—agents can make informed plans without pulling raw session oceans into the context window.

5. Closing the loop: from insight to repo changes to measurable improvement

The full agentic story looks like this in practice:

  1. Instrument with rum-js using repo conventions (shared helper, env-based keys, documented events).
  2. Run Playwright suites with the reporter so emits carry test identity.
  3. Query TrueCoverage summaries after meaningful windows (release, marketing push, seasonal spike).
  4. Decide using 4Ds + transitions + metadata slices what to add or fix.
  5. Implement tests and fixtures; optionally evolve instrumentation when new risk-bearing branches appear in the UI.
  6. Re-query to confirm the comparison scope now hits the slices that production demanded.

That is autonomous QA with a compass: still bounded by human policy (privacy, release priorities, acceptable sampling), but no longer guessing whether the suite matches reality.


Who this is for

  • Agents: treat TrueCoverage as your grounding layer—the bridge from “what humans wrote in tickets” to “what the world actually does.” Use it to prioritize fixtures, journeys, and instrumentation before you spend budget on breadth nobody needs.
  • Humans offloading QA: ask for scopes, metadata discipline, and before/after dashboards—simple checks that the agent is optimizing production-aligned risk, not arbitrary line counts.

Accountable product and compliance choices still sit with people; TrueCoverage cheapens the cost of being well-informed—for agents reasoning over code and humans steering risk—which, in the agentic era, is the difference between automation that merely runs and automation that continuously earns the right to ship.


Further reading