4 posts tagged with "AI"

AI-assisted testing and experimentation

How to Find Duplicate Tests in a Playwright Suite (Semantic Graph for Agentic QA)

July 4, 2026 · 10 min read

Founder & CEO, TestChimp

TL;DR: When coding agents can write dozens of Playwright tests in a single session, the bottleneck shifts from authoring to governance: are the new tests distinct and useful, or just near-duplicates of what you already have? Semantic Graph is a free, open-source CLI that scans your suite, embeds each test semantically, clusters related tests, and renders an interactive graph so you—and your agent—can spot redundancy before it compounds.

Semantic Graph visualization — folder tree, 2D similarity graph, and cluster list view

The new problem: agents author tests en masse

For most of the last decade, the hard part of E2E testing was throughput: humans could not write and maintain enough tests to keep up with product velocity.

That constraint is collapsing. With Claude Code, Cursor, and agent skills like the TestChimp skill, a single prompt can produce a folder of well-formed Playwright specs in minutes. Coverage gaps that used to take a sprint to close can shrink to an afternoon.

The bottleneck has moved.

Era	Primary constraint	What "good" looked like
Manual QA	Authoring speed	Enough tests to cover the happy path
Human + low-code tools	UI-layer setup friction	Stable POMs, fewer flakes
Agentic QA	Suite quality at scale	Distinct, high-signal tests—not copies

When an agent is rewarded for adding tests—closing coverage gaps, responding to PR feedback, or filling in scenarios from a test plan—it has no innate sense of "this already exists, slightly reworded." Left unchecked, suites balloon with:

Duplicate tests that assert the same behaviour under different titles
Near-duplicates that differ only in fixture data or selector phrasing
Clustered redundancy where five tests all exercise the same checkout edge case
Invisible overlap across folders, because no human (and no agent) holds the entire suite in working memory

This is the QA equivalent of boiling the lake in the wrong direction: lots of heat, little new coverage. Worse, duplicate tests inflate CI time, confuse failure triage, and give a false sense of depth—your line count grows while your behavioural breadth stalls.

The question is no longer "Can we write more tests?" It is:

"Are we writing useful, distinct tests—or just duplicative ones?"

That question needs a semantic answer, not a filename diff.

What is Semantic Graph?

Semantic Graph is an open-source tool from TestChimp that maps your Playwright test suite by meaning, not syntax.

It is published as @testchimp/semantic-graph on npm and lives in the TestChimp/semantic-graph repository. Run one command against your tests directory; the CLI:

Scans *.spec.ts, *.test.ts, and related Playwright files
Parses each test's suite path, title, intent comments, scenario annotations, and body
Embeds the canonical test text with an embedding model (OpenAI or Voyage AI)
Clusters tests by semantic similarity using DBSCAN
Lays out a 2D graph with UMAP so similar tests appear close together
Names clusters with a lightweight LLM pass (e.g. "auth", "checkout", "api-contracts")
Serves a local interactive UI at http://localhost:3859

No database. No TestChimp account required. Embeddings are computed in memory each run—ideal for local audits, pre-merge reviews, or giving an agent a structural view of the suite before it authors more tests.

How it works (the pipeline)

Understanding the pipeline helps you interpret the graph—and tune how agents use it.

1. Parse tests into embedding-ready text

The core library (@testchimp/semantic-graph-core) includes a vendored Playwright-aware parser. For each test it builds canonical text:

Suite: checkout > guest flow
Test: rejects expired coupon at payment step
Body:
Scenario: Guest checkout with invalid coupon
// intent: verify error copy and no charge created
await page.goto('/checkout');
...

Parsing captures intent comments and scenario annotations—the same metadata agents should be authoring anyway when following requirement traceability conventions. Two tests with different selectors but the same intent will land close together in embedding space.
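To make the canonical-text step concrete, here is a minimal sketch of how parsed fields might be assembled into the layout shown above. The `ParsedTest` shape and the `toCanonicalText` helper are illustrative assumptions, not the actual types or functions exported by @testchimp/semantic-graph-core.

```typescript
// Hypothetical shape for a parsed test; the real parser's types may differ.
interface ParsedTest {
  suitePath: string[]; // e.g. ["checkout", "guest flow"]
  title: string;
  scenario?: string;   // scenario annotation, if present
  intents: string[];   // text of "// intent: ..." comments
  body: string;        // raw test body source
}

// Build the text that gets embedded, mirroring the layout shown above.
function toCanonicalText(t: ParsedTest): string {
  const lines: string[] = [
    `Suite: ${t.suitePath.join(" > ")}`,
    `Test: ${t.title}`,
    "Body:",
  ];
  if (t.scenario) lines.push(`Scenario: ${t.scenario}`);
  for (const intent of t.intents) lines.push(`// intent: ${intent}`);
  lines.push(t.body.trim());
  return lines.join("\n");
}

const example = toCanonicalText({
  suitePath: ["checkout", "guest flow"],
  title: "rejects expired coupon at payment step",
  scenario: "Guest checkout with invalid coupon",
  intents: ["verify error copy and no charge created"],
  body: "await page.goto('/checkout');",
});
console.log(example);
```

Putting suite path, title, and intent ahead of the body means the parts that describe behaviour carry weight in the embedding even when selectors differ.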

2. Embed with cosine similarity

Each test's text is sent to an embedding API in batches (default model: text-embedding-3-small for OpenAI, voyage-4 for Voyage). The tool computes cosine similarity between vectors and applies configurable thresholds:

Signal	Default threshold	Meaning
Graph edge	≥ 0.75	Tests are semantically related
Similar	≥ 0.80	Worth reviewing together
Potential duplicate	≥ 0.92	Strong dedup candidate

These thresholds mirror how humans judge redundancy: not byte-identical, but "would a failure in one make the other pointless?"
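As a rough illustration of the scoring step, the sketch below computes cosine similarity between two vectors and maps the score onto the default thresholds from the table. The `classify` function and the `Signal` names are assumptions made for this example; the tool's internal naming may differ.

```typescript
type Signal = "potential-duplicate" | "similar" | "edge" | "unrelated";

// Cosine similarity: dot(a, b) / (|a| * |b|).
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("vectors must have equal length");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // treat zero vectors as unrelated
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Default thresholds taken from the table above.
const THRESHOLDS = { edge: 0.75, similar: 0.8, duplicate: 0.92 };

// Return the strongest signal the score qualifies for.
function classify(score: number): Signal {
  if (score >= THRESHOLDS.duplicate) return "potential-duplicate";
  if (score >= THRESHOLDS.similar) return "similar";
  if (score >= THRESHOLDS.edge) return "edge";
  return "unrelated";
}
```

Checking the highest threshold first keeps the bands mutually exclusive: a pair scoring 0.95 is reported once, as a potential duplicate, rather than three times.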

3. Cluster with DBSCAN

Similar embeddings are grouped with DBSCAN density clustering—no need to pick k clusters upfront. Each cluster gets an LLM-generated label (e.g. "settings-page", "admin-tasks") so the legend is readable at a glance.
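For readers unfamiliar with DBSCAN, here is a compact textbook-style sketch over cosine distance. It is not the tool's implementation: the `eps` and `minPts` values are made-up parameters for the toy data, and the brute-force neighbour search is only suitable for small suites.

```typescript
// Cosine distance = 1 - cosine similarity.
function cosineDistance(a: number[], b: number[]): number {
  let dot = 0;
  let na = 0;
  let nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return na === 0 || nb === 0 ? 1 : 1 - dot / Math.sqrt(na * nb);
}

const NOISE = -1;
const UNVISITED = -2;

// Returns one label per point: a cluster id (0, 1, ...) or NOISE.
function dbscan(points: number[][], eps: number, minPts: number): number[] {
  const labels = new Array<number>(points.length).fill(UNVISITED);
  // Brute-force neighbour lookup; a point counts as its own neighbour.
  const neighbours = (i: number): number[] =>
    points.map((_, j) => j).filter((j) => cosineDistance(points[i], points[j]) <= eps);

  let cluster = 0;
  for (let i = 0; i < points.length; i++) {
    if (labels[i] !== UNVISITED) continue;
    const seeds = neighbours(i);
    if (seeds.length < minPts) {
      labels[i] = NOISE; // may later be claimed as a border point
      continue;
    }
    labels[i] = cluster;
    const queue = seeds.filter((j) => j !== i);
    while (queue.length > 0) {
      const j = queue.shift()!;
      if (labels[j] === NOISE) labels[j] = cluster; // border point joins the cluster
      if (labels[j] !== UNVISITED) continue;
      labels[j] = cluster;
      const reachable = neighbours(j);
      if (reachable.length >= minPts) queue.push(...reachable); // j is a core point
    }
    cluster++;
  }
  return labels;
}

// Toy 2D "embeddings": two tight groups plus one outlier.
const toyPoints = [[1, 0], [0.99, 0.1], [0.98, 0.15], [0, 1], [0.1, 0.99], [-1, -1]];
const toyLabels = dbscan(toyPoints, 0.05, 2);
console.log(toyLabels);
```

Tests that end up labelled as noise are the ones with no close neighbours, which in this context is usually good news: they cover something nothing else does.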

4. Visualize with UMAP + D3

A seeded UMAP projection maps high-dimensional embeddings to 2D coordinates. The bundled UI (built with D3.js) renders:

Graph view — nodes as tests, edges as similarity links; click a node to see nearest neighbours and duplicate flags
Clusters view — grouped list with colour-coded legend
Folder tree — scope the graph to a directory or single file

Zoom into tests/checkout/ before a refactor. Scan the whole suite before a release. Hand the URL to an agent and ask it to propose merges.

Why this matters for agentic QA workflows

Semantic Graph is not a replacement for TrueCoverage (production-informed prioritization) or requirement traceability. It solves an orthogonal problem: intra-suite redundancy.

Here is where it fits in a modern agent loop:

Before the agent writes

Run Semantic Graph and attach the cluster summary to the agent's context. Instructions become concrete:

"We already have four tests in the checkout cluster covering coupon validation. Do not add another unless you are testing a different failure mode."

This is cheaper and more reliable than asking the agent to grep test titles.
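One way to turn cluster output into agent context is a short plain-text summary. The sketch below assumes a hypothetical `ClusterSummary` shape; the actual fields returned by Semantic Graph may be named differently, so treat this as a template rather than a documented format.

```typescript
// Hypothetical cluster shape for illustration only.
interface ClusterSummary {
  label: string;        // LLM-generated cluster name, e.g. "checkout"
  testTitles: string[]; // titles of the tests in the cluster
}

// Render a compact summary that can be pasted into an agent prompt.
function toAgentContext(clusters: ClusterSummary[], maxTitles = 3): string {
  const lines = ["Existing test clusters (avoid adding near-duplicates):"];
  for (const c of clusters) {
    lines.push(`- ${c.label}: ${c.testTitles.length} tests`);
    for (const title of c.testTitles.slice(0, maxTitles)) {
      lines.push(`    * ${title}`);
    }
    const hidden = c.testTitles.length - maxTitles;
    if (hidden > 0) lines.push(`    * ... and ${hidden} more`);
  }
  return lines.join("\n");
}

const context = toAgentContext([
  {
    label: "checkout",
    testTitles: [
      "rejects expired coupon at payment step",
      "rejects malformed coupon code",
      "shows error for already-used coupon",
      "guest user sees error for expired promo code",
    ],
  },
]);
console.log(context);
```

Capping the titles per cluster keeps the summary small enough to include in every authoring prompt without crowding out the task itself.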

After the agent writes

Re-run the graph on the PR branch. New nodes that snap onto existing clusters—or spike duplicate scores above 0.92—are review flags. Pair with CI the same way you gate on lint or coverage deltas.
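A review gate along these lines could be sketched as follows. It assumes you already have embedding vectors for both the existing tests and the tests added in the PR; the `EmbeddedTest` and `Flag` shapes and the `flagNearDuplicates` helper are hypothetical names, not part of the published CLI.

```typescript
interface EmbeddedTest {
  id: string;       // e.g. file path plus test title
  vector: number[]; // embedding for the test's canonical text
}

interface Flag {
  newTest: string;
  existingTest: string;
  score: number;
}

function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let na = 0;
  let nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return na === 0 || nb === 0 ? 0 : dot / Math.sqrt(na * nb);
}

// For each added test, report its closest existing test if the score
// reaches the potential-duplicate threshold (0.92 by default).
function flagNearDuplicates(
  added: EmbeddedTest[],
  existing: EmbeddedTest[],
  threshold = 0.92,
): Flag[] {
  const flags: Flag[] = [];
  for (const n of added) {
    let best: Flag | null = null;
    for (const e of existing) {
      const score = cosine(n.vector, e.vector);
      if (score >= threshold && (best === null || score > best.score)) {
        best = { newTest: n.id, existingTest: e.id, score };
      }
    }
    if (best) flags.push(best);
  }
  return flags;
}

// Toy vectors standing in for real embeddings.
const existingTests: EmbeddedTest[] = [
  { id: "a", vector: [1, 0, 0] },
  { id: "b", vector: [0, 1, 0] },
];
const addedTests: EmbeddedTest[] = [
  { id: "x", vector: [0.98, 0.05, 0] },
  { id: "y", vector: [0.6, 0.6, 0.5] },
];
const reviewFlags = flagNearDuplicates(addedTests, existingTests);
console.log(reviewFlags);
```

In a CI script, a non-empty result could fail the job or post a review comment, depending on how strict you want the gate to be.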

During suite health reviews

Quarterly "suite diet" sessions used to mean spreadsheets and gut feel. Now: filter to clusters with high internal similarity, merge or delete, and measure CI time recovered.

Complement to production signals

TrueCoverage tells you what behaviours users need tested. Semantic Graph tells you whether your existing tests are saying the same thing twice. Both are necessary for a suite that is broad and lean.

What you see in the UI

The demo above shows the full workflow:

Left panel — folder tree mirroring your repo layout; click a folder or file to scope the view
Graph mode — force-directed layout; proximate nodes are semantically alike
Clusters mode — tests bucketed with named themes
Popover — click any test to see top similar neighbours, similarity scores, and potential duplicate badges

The UI ships inside the npm package—no separate install. It is the same "freebie" static app published as @testchimp/semantic-graph-viz in the monorepo for anyone who wants to embed or fork it.

Try it yourself

Prerequisites

Node.js 18+
An API key for embeddings (and cluster naming):
- OpenAI — one key covers embeddings + LLM, or
- Anthropic + Voyage — Claude for cluster labels, Voyage for embeddings (Anthropic does not ship an embedding API)

Quick start (OpenAI)

export PROVIDER=openai
export API_KEY=sk-...

npx @testchimp/semantic-graph visualize --tests-dir ./tests

Open the printed URL (default port 3859). Add --verbose for embedding progress and diagnostics.

Claude + Voyage

export PROVIDER=anthropic
export API_KEY=sk-ant-...
export VOYAGE_API_KEY=pa-...

npx @testchimp/semantic-graph visualize --tests-dir ./tests

All options

Flag	Description
`--tests-dir <path>`	Root folder to scan (required)
`--port <n>`	Listen port (default `3859`)
`--verbose` / `-v`	Diagnostics to stderr

See the README for environment variables, monorepo build instructions, and npm publish details.

Continuous governance with TestChimp

Semantic Graph is deliberately local and standalone—a flashlight you can shine on any Playwright repo, TestChimp customer or not.

For continuous duplicate detection, requirement traceability, release confidence, and keeping suites healthy as agents keep authoring, see TestChimp—the git-native QA governance platform built for agentic teams. Install the TestChimp Agent Skill and run /testchimp test after each PR to orchestrate coverage, exploration, and plan alignment in one loop.

FAQ

What test file types are supported?

The scanner picks up *.spec.ts, *.spec.js, *.test.ts, *.test.js, and .mjs / .cjs variants under your chosen root—standard Playwright test layouts.
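If you want to predict which files will be picked up, a filename check like the one below approximates the rule described above. The regular expression is an assumption inferred from that description, not the scanner's actual pattern.

```typescript
// Assumed pattern: <name>.spec.<ext> or <name>.test.<ext>, ext in ts/js/mjs/cjs.
const TEST_FILE_PATTERN = /\.(spec|test)\.(ts|js|mjs|cjs)$/;

function isTestFile(path: string): boolean {
  return TEST_FILE_PATTERN.test(path);
}

console.log(isTestFile("tests/checkout/coupon.spec.ts"));
console.log(isTestFile("tests/helpers/fixtures.ts"));
```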

Does it require a TestChimp account?

No. Semantic Graph runs entirely locally. You only need embedding (and optionally LLM) API keys.

How is this different from code coverage?

Code coverage measures which lines executed. Semantic Graph measures whether test intentions overlap. A suite can have high line coverage and still be full of redundant scenarios.

How is this different from duplicate detection by test name?

Titles lie. Agents especially love paraphrasing: "should reject invalid coupon" vs "guest user sees error for expired promo code." Embeddings capture the full body and intent, not the string on line one.

Can I use it in CI?

Today the primary interface is the local visualize command and JSON APIs (/api/graph, /api/similar). For CI gates, parse the API responses or run before review and archive the graph output. Continuous server-side governance is on the TestChimp platform roadmap.

What embedding models are supported?

Defaults: text-embedding-3-small (OpenAI) and voyage-4 (Voyage). Override with EMBEDDING_MODEL. LLM cluster naming defaults to gpt-5-nano or claude-3-5-haiku-latest.

Is the source code open?

Yes. MIT-licensed monorepo: github.com/TestChimp/semantic-graph. Packages: @testchimp/semantic-graph-core, @testchimp/semantic-graph, @testchimp/semantic-graph-viz.

Summary

Agentic QA solved test authoring at scale. The next discipline is test distinctness at scale—ensuring every new spec adds behavioural breadth, not noise.

Semantic Graph gives you a semantic map of your Playwright suite: embeddings for meaning, DBSCAN for clusters, UMAP for intuition, and a local UI for humans and agents alike. Run it before you merge agent-authored tests. Run it when CI gets slow. Run it when you suspect the lake is boiling but not reducing risk.

Get started: github.com/TestChimp/semantic-graph · npx @testchimp/semantic-graph visualize

References and further reading

TestChimp Semantic Graph repository — source, README, and issue tracker
@testchimp/semantic-graph on npm — CLI package
Playwright Test documentation — supported project layouts
OpenAI Embeddings guide — text-embedding-3-small and related models
Voyage AI documentation — embeddings when using Claude as the LLM provider
UMAP: Uniform Manifold Approximation and Projection — dimensionality reduction for the 2D layout
DBSCAN clustering — density-based cluster assignment
Fixtures in agentic test automation — complementary TestChimp blog on Arrange-layer quality
TrueCoverage for agentic QA — production-informed test prioritization
TestChimp Agent Skills — orchestrate QA workflows in Claude and Cursor

SKILLs are becoming SaaS’s best distribution hack (here’s why)

May 11, 2026 · 3 min read

Nuwan Samarasekera

Founder & CEO, TestChimp

For years, the hardest part of selling a complex technical product was not the demo—it was the learning curve. Buyers had to internalize workflows, edge cases, and “the right way” to use each feature before they could reliably get value.

That is changing fast. Agent Skills—portable folders of instructions, checklists, and resources that teach an AI agent how to work with your product—are starting to look like one of the most attractive distribution mechanisms for technical SaaS. Instead of hoping every customer reads the docs in the right order, you ship a repeatable operating procedure the agent can follow on demand.

A skill turns every “new user” into a “power user”

A well-designed Agent Skill effectively turns every user into a power user: one that knows which workflows to follow, how to use the product correctly, and how to extract maximum value from every feature.

That compresses time-to-value—the path to the “aha moment”—because the agent is not improvising from vague prompts; it is executing your intended playbook.

What we are seeing at TestChimp

We have been seeing this firsthand since launching the TestChimp Agent Skill.

For teams, the workflow is intentionally simple:

Author a few user stories (or import from Jira).
Install the TestChimp skill on your coding agent.
After each PR, simply say /testchimp test.

The skill teaches Claude how to coordinate with TestChimp to:

instrument the app for TrueCoverage,
fetch and interpret coverage gaps,
write tests that address the gaps and link them to scenarios correctly,
run targeted exploratory testing to catch UX issues,
and use AI-native test steps in tests where they help.

The upgrade loop: your perfect user ships with your product

The best part is what happens when you ship new features.

With a properly designed, self-updating TestChimp Agent Skill, your "user" continuously learns your latest workflows, capabilities, and best practices—and applies them the way you intended. Your agent-side “instruction manual” can move as fast as your product, without requiring every human user to re-read release notes and learn every new capability you ship.

If you are building technical SaaS in the agent era, the product surface area is no longer only your UI and APIs. It is also the skill: the packaged expertise that turns your users into power users.

Boiling the lake - QA style

April 28, 2026 · 3 min read

Nuwan Samarasekera

Founder & CEO, TestChimp

Boil the lake - credits: https://garryslist.org/posts/boil-the-ocean

Garry Tan recently introduced a simple but powerful idea: The old adage “don’t boil the ocean” is bad advice in the AI agent era. Well - at the very least, “lakes” are now very much “boilable”.

The core insight is: AI compresses certain work by orders of magnitude. That doesn’t just make things faster - it fundamentally changes what’s feasible.

Most people ask the wrong question:

“What existing human workflows can we speed up with AI?”

That’s incremental thinking. The real leverage comes from asking:

“What powerful workflows did we avoid entirely because they were too expensive to do with humans?”

Those are your “lakes”. And with AI, many of them go from infeasible → trivial.

The QA lake

In QA, making test authoring faster is the former kind of question. The bigger ROI lies in the granular workflows that become possible now that agents can act autonomously in your test automation.

The Big Idea:

Could agents execute a workflow in which they continuously monitor “planned reality” (user stories / scenarios) and “production reality” (real user behaviour patterns) to improve the “tested reality” (test suite + test infra), in a continuous feedback loop? All of it would run in the background, looping you in only to approve the plans it makes.

Feedback Loop enabled by TestChimp

This is exactly the future we were building TestChimp for: one where agents participate in each phase of QA and draw on real-world insights and plan artifacts to direct their own work strategically.

Claude + TestChimp

Today, we are adding the final piece of the puzzle: A SKILL that you can install on Claude / Cursor that enables just that.

In TestChimp, test plans are already maintained as Markdown files in the repo, directly accessible to agents.
Requirements are linked to tests via in-code comments that agents can author.
Test executions are auto-tracked by our Playwright plugin.
Events are ingested from both prod and test runs to generate TrueCoverage insights.

The Skill “upskills” Claude to read those insights via our CLI / MCP, to plan and execute the entire QA workflow:

Understand coverage gaps, prioritize (using signals exposed by TestChimp) and plan
Author fixtures that emulate real-world situations observed
Update test infrastructure (seed / probe endpoints) as needed
Author tests - (provisioning PR-local envs to test in and validating tests work)
Update instrumentations to learn about real user behaviour (for future cycles - covering new user journeys introduced)

QA workflow orchestrated by TestChimp - Overview

The best part: All of this is condensed to just 2 commands - enabling a frictionless DevX:

/testchimp test -> (Run after each PR) Updates plans, authors seeds / fixtures, authors tests, validates them in PR-scoped isolated environments, and instruments code for TrueCoverage.
/testchimp evolve -> (Run periodically / on deploy) Audits test coverage against requirements and real-user insights to “evolve” your QA infra and test suite: covers critical under-tested areas, takes corrective actions, and runs targeted exploratory runs.

Claude can write tests. With the right feedback loop, it can fully manage an effective, self-evolving QA posture that de-risks your product continuously. This is what TestChimp enables, by making each phase of QA agent-native, informed by requirements and real user behaviour insights, in a tight feedback loop.

ai-wright: AI Steps in Playwright Scripts

November 10, 2025 · 3 min read

Nuwan Samarasekera

Founder & CEO, TestChimp

Bring AI-native actions and verifications into your Playwright tests – open source, vision-enabled, and BYOL.

The Problem

Most “AI testing” frameworks make you throw away what already works.

They replace your entire test suite with “agentic” systems — where an LLM drives every click, assertion, and navigation step.

Sounds cool… until you hit:

Slow, flaky, or non-deterministic runs
Proprietary test formats
Complete vendor lock-in

For most teams, that’s a non-starter.

What if you could keep your existing Playwright scripts, and just inject AI where it’s actually needed – the ambiguous, messy, or dynamic parts of your app?

The Idea

ai-wright brings AI steps to Playwright.

You still write regular Playwright tests – deterministic, fast, inspectable – but when you hit a fuzzy point, you can drop in a step like:

await ai.act('Click on a top rated campaign', { page, test });

await ai.verify('The campaign description should not contain offensive words', { page, test });

That’s it. AI only handles that step.

Everything else stays Playwright-native.

Why It’s Different

Vision-Enabled

Existing libraries (like ZeroStep and auto-playwright) use sanitized HTML – which misses what’s actually on screen.

This causes many issues:

HTML ≠ UI reality – static DOM can’t reveal if elements are disabled, visible, obscured, or off-screen – resulting in LLMs attempting interaction with non-interactive elements.
Loss of semantics – sanitized HTML strips ARIA roles, computed text, layout cues, and shadow DOM content, which are critical for accurate reasoning.
Unbounded prompt size – large DOMs can often get too verbose, requiring truncation (resulting in loss of context).
Fragile selectors – HTML-based approaches force LLMs to guess selectors; ai-wright uses precise SoM IDs bound to live DOM nodes, enabling accurate one-shot execution.

ai-wright is vision-enabled: it blends SoM (Set-of-Marks) annotated screenshots + structured DOM context for grounded, visual reasoning.

The result: AI that operates just like a normal user would – based on what it sees on the screen.

Better Reasoning

Instead of one-shot “guess the next click”, ai-wright uses a multi-step reasoning loop.

It plans ahead, performs coarse-grained objective handling (e.g., “fill out login form,” not just “click button”), and adapts to UI state changes – minimizing retries and random flailing.

It can identify blockers (such as modals) and execute pre-steps before acting on the objective.

BYOL (Bring Your Own License)

ai-wright is LLM-agnostic, unlike existing solutions, which either require proprietary licenses or support only specific providers.

You can use your own OpenAI, Claude, Gemini key, or your self-hosted model – avoiding vendor lock-in.

You can choose to use your TestChimp license as well – which will proxy the LLM calls, removing separate token costs for you.

Fully Open Source

Unlike agentic SaaS offerings, which are closed-source, proprietary solutions, ai-wright is fully open source, giving you complete transparency and community support.

ai-wright lets you inject AI where it matters — the tricky, ambiguous, or dynamic parts of your app — without giving up the speed, determinism, and maintainability of Playwright.

With vision-enabled reasoning, resilient multi-step planning, LLM flexibility, and a fully open source foundation, ai-wright bridges the best of both worlds: reliable, scriptable tests and AI-powered intelligence where you need it most – without any vendor lock-in.

AI where it helps, plain Playwright everywhere else.
