How to Test AI Agent Workflows and Tool Calling

Short answer

AI agents plan multi-step workflows, invoke tools with side effects, retry on failure, and sometimes pause for human approval—testing only the final assistant message misses the dangerous parts. Combine offline evals on tool-selection trajectories, AIMock for deterministic planning responses in E2E, stubbed tool APIs with probe Assert, and hybrid ai.act/ai.verify on approval UI—not live third-party calls in every PR.

Part of Testing Guides by AI and conversational UX.

Who this is for

Teams shipping agentic products: copilots that book meetings, file tickets, run refunds, query CRMs, or orchestrate internal APIs through LLM tool routers (OpenAI function calling, Anthropic tools, LangGraph, custom planners).

Not for: single-shot chat with no tools, or batch offline pipelines with no user-facing workflow UI.

Why testing agent workflows matters

Agent bugs compound across steps:

  • Revenue loss — double refund when retry lacks idempotency; wrong SKU updated in ERP.
  • Security incidents — agent calls admin tool with viewer session; cross-tenant CRM read via wrong accountId in tool args.
  • Support load — infinite replan loops; human approval stuck pending while UI shows "complete".
  • Compliance exposure — agent sends email or exports data without audit log; PII written to wrong ticket.

The planner can produce eloquent summaries while tool execution fails silently. E2E must assert tool dispatch, idempotent side effects, and approval gates via probes.

Complexity map

| Scenario | Edge case | Why tests break | Approach |
|---|---|---|---|
| Wrong tool selected | Similar tool names | Bad side effect | Offline golden trajectories + one E2E per critical tool |
| Tool timeout | Upstream 504 | Agent retries forever | Stub slow/fail tool; assert retry cap |
| Partial failure | Step 2 of 5 fails | Orphan records | Probe compensation/rollback |
| Idempotency | Same tool call twice | Double charge | Probe single row; idempotency key |
| Human approval | Pending state | Auto-executes | Assert blocked until approve click |
| Parallel tools | Race on DB | Flaky order | Serialize in prod; test both orders |
| Auth scope | Tool needs OAuth | 403 mid-workflow | Seed token; probe deny before Act |
| Long-running job | Async completion | Assert too early | Poll job status probe |
| Plan visibility | Hidden chain-of-thought | Untested steps | Assert step UI or audit log probe |
| Model swap | Planner behavior shift | Wrong tool regression | Eval gate on planner prompt |
| Rate limits | Tool API 429 | Random CI fail | Stub tools in CI |
| Cancellation | User aborts mid-plan | Orphan tool calls | Probe no partial writes |

Agent architecture under test

User goal → Planner (LLM) → Tool calls → External APIs / DB

Human approval gate (optional)

Workflow state machine (running / pending / done / failed)

Test at three boundaries:

  1. Planning — offline evals: given user goal + context, expect tool sequence
  2. Integration — AIMock planner + real tool router + stubbed HTTP tools
  3. Truth — probe Assert on workflow row, audit log, side-effect tables

Offline evals: golden trajectories

Maintain fixtures independent of Playwright:

{
  "id": "refund-damaged-item",
  "context": { "orderId": "12345", "reason": "damaged" },
  "expectedTools": ["lookup_order", "check_refund_policy", "initiate_refund"],
  "forbiddenTools": ["delete_account"]
}

Grade with:

  • Exact match on tool names and ordered sequence (strict workflows)
  • Set match when order flexible
  • LLM-as-judge on whether trajectory satisfies policy—calibrate against human labels

Run evals on PRs touching planner prompts, tool schemas, or permission middleware.
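
The first two grading modes can be sketched as a small pure function. This is a minimal illustration, not a TestChimp API: the `TrajectoryFixture` shape mirrors the JSON fixture above, and `gradeTrajectory`, `MatchMode`, and `GradeResult` are hypothetical names introduced here.

```typescript
// Hypothetical fixture shape, mirroring the JSON example in this section.
interface TrajectoryFixture {
  id: string;
  expectedTools: string[];
  forbiddenTools: string[];
}

type MatchMode = "ordered" | "set";

interface GradeResult {
  pass: boolean;
  reasons: string[];
}

// Grade the tool names a planner emitted against one golden fixture.
function gradeTrajectory(
  fixture: TrajectoryFixture,
  actualTools: string[],
  mode: MatchMode = "ordered",
): GradeResult {
  const reasons: string[] = [];

  // A forbidden tool fails the run regardless of match mode.
  for (const tool of actualTools) {
    if (fixture.forbiddenTools.includes(tool)) {
      reasons.push(`forbidden tool called: ${tool}`);
    }
  }

  if (mode === "ordered") {
    // Strict workflows: same names, same order, same length.
    const sameOrder =
      actualTools.length === fixture.expectedTools.length &&
      fixture.expectedTools.every((t, i) => t === actualTools[i]);
    if (!sameOrder) {
      reasons.push(
        `expected [${fixture.expectedTools.join(", ")}] but got [${actualTools.join(", ")}]`,
      );
    }
  } else {
    // Flexible workflows: compare membership only, ignoring order and repeats.
    const missing = fixture.expectedTools.filter((t) => !actualTools.includes(t));
    const extra = actualTools.filter((t) => !fixture.expectedTools.includes(t));
    if (missing.length > 0) reasons.push(`missing tools: ${missing.join(", ")}`);
    if (extra.length > 0) reasons.push(`unexpected tools: ${extra.join(", ")}`);
  }

  return { pass: reasons.length === 0, reasons };
}
```

Returning the list of reasons rather than a bare boolean makes a failed eval readable in CI output without re-running the planner.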

AIMock for planner responses in E2E

Stub planning LLM calls while exercising real orchestration code:

// AIMock returns fixed tool plan for runId + scenario id
{
  "toolCalls": [
    { "name": "lookup_order", "arguments": { "orderId": "12345" } },
    { "name": "initiate_refund", "arguments": { "orderId": "12345", "amount": 4999 } }
  ]
}

Your backend still validates args, enforces auth, and hits stubbed tool HTTP endpoints that return deterministic payloads.

Stub tool endpoints (Arrange)

// POST /api/test/stub-tool/lookup_order
// Returns fixture order for runId

// POST /api/test/stub-tool/initiate_refund
// Writes refund row idempotently on idempotency-key header

Playwright Arrange registers scenario + runId; tool router points to stub base URL in test env.
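
As a rough sketch of what an idempotent refund stub might do behind those routes, the class below keeps rows in memory and scopes them by runId. It is an assumption-laden illustration: `StubRefundTool`, `RefundRow`, and the key format are hypothetical, and a real stub would sit behind the HTTP endpoints and read the idempotency-key header.

```typescript
interface RefundRow {
  runId: string;
  orderId: string;
  amount: number;
  idempotencyKey: string;
}

// In-memory stand-in for the side-effect table a stub tool would write to.
class StubRefundTool {
  private rows = new Map<string, RefundRow>();

  // Simulates the body of POST /api/test/stub-tool/initiate_refund.
  initiateRefund(
    runId: string,
    idempotencyKey: string,
    orderId: string,
    amount: number,
  ): RefundRow {
    const key = `${runId}:${idempotencyKey}`;
    const existing = this.rows.get(key);
    if (existing) {
      // Replay of the same key: return the stored row and write nothing new.
      return existing;
    }
    const row: RefundRow = { runId, orderId, amount, idempotencyKey };
    this.rows.set(key, row);
    return row;
  }

  // Simulates a probe: how many refund rows exist for this run and order?
  probeRefundCount(runId: string, orderId: string): number {
    let count = 0;
    this.rows.forEach((row) => {
      if (row.runId === runId && row.orderId === orderId) count += 1;
    });
    return count;
  }
}
```

Scoping the storage key by runId keeps parallel workers from seeing each other's rows, which is what lets a probe assert an exact count.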

E2E pattern: happy path with probe Assert

await ai.act('Ask the agent to refund order 12345 for damaged item');
await ai.verify('Agent shows plan summary or approval prompt before executing refund');
await page.getByRole('button', { name: /approve/i }).click();
await expect.poll(async () => {
  const res = await request.get('/api/test/probe-refund/12345');
  return (await res.json()).status;
}, { timeout: 20_000 }).toBe('completed');
await ai.verify('Agent confirms refund initiated');

Never stop at ai.verify alone when money or PII moves—probe is authoritative.

Human-in-the-loop approval

Cover:

| Case | Assert |
|---|---|
| Approval required | Tool not called until approve; probe no refund row |
| Reject | Plan cancelled; probe unchanged |
| Timeout | Workflow fails gracefully; user can retry |
| Re-approval after edit | New plan id; old partial writes rolled back |

await expect.poll(() => probeWorkflowStatus(runId)).toBe('pending_approval');
await page.getByRole('button', { name: /reject/i }).click();
await expect.poll(() => probeRefundCount('12345')).toBe(0);
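
The behaviour these specs pin down can be modelled as a small state machine. The sketch below is illustrative only: `Workflow`, `PlannedCall`, and the status names other than `pending_approval` are hypothetical and not taken from any particular orchestrator.

```typescript
type WorkflowStatus = "running" | "pending_approval" | "completed" | "cancelled";

interface PlannedCall {
  name: string;
  requiresApproval: boolean;
}

class Workflow {
  status: WorkflowStatus = "running";
  readonly executed: string[] = [];
  private readonly plan: PlannedCall[];
  private next = 0;

  constructor(plan: PlannedCall[]) {
    this.plan = plan;
  }

  // Run tool calls until the plan ends or a gated call is reached.
  advance(): void {
    if (this.status !== "running") return;
    while (this.next < this.plan.length) {
      const call = this.plan[this.next];
      if (call.requiresApproval) {
        // Stop before the side effect; nothing is executed until approve().
        this.status = "pending_approval";
        return;
      }
      this.executed.push(call.name);
      this.next += 1;
    }
    this.status = "completed";
  }

  // Execute the gated call, then continue with the rest of the plan.
  approve(): void {
    if (this.status !== "pending_approval") return;
    this.executed.push(this.plan[this.next].name);
    this.next += 1;
    this.status = "running";
    this.advance();
  }

  // Cancel without executing the gated call.
  reject(): void {
    if (this.status === "pending_approval") this.status = "cancelled";
  }
}
```

The property worth testing is that `executed` never contains the gated tool while the status is `pending_approval` or `cancelled`, which is what the reject probe checks end to end.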

Failure, retry, and idempotency

| Scenario | Arrange | Assert |
|---|---|---|
| Tool 500 first call | Stub returns 500 then 200 | Probe single side effect after retry |
| Duplicate delivery | Replay same tool idempotency key | Probe one row |
| Planner retry loop | Stub always 500 | Workflow fails with user-visible error; probe no partial |
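
A retry cap and a scripted stub for the first and third rows might look like the sketch below. It is deliberately synchronous to stay short; real tool calls would be async and would usually add backoff. `callWithRetryCap` and `scriptedStub` are hypothetical helpers, not part of AIMock or Playwright.

```typescript
interface ToolResponse {
  status: number;
}

type ToolCall = (idempotencyKey: string) => ToolResponse;

interface RetryOutcome {
  ok: boolean;
  attempts: number;
  lastStatus: number;
}

// Retry 5xx responses up to maxAttempts, reusing one idempotency key throughout.
function callWithRetryCap(
  tool: ToolCall,
  idempotencyKey: string,
  maxAttempts = 3,
): RetryOutcome {
  let lastStatus = 0;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const res = tool(idempotencyKey);
    lastStatus = res.status;
    if (res.status < 500) {
      // Success or a client error: neither is worth retrying.
      return { ok: res.status < 400, attempts: attempt, lastStatus };
    }
  }
  // Cap reached: surface the failure instead of looping.
  return { ok: false, attempts: maxAttempts, lastStatus };
}

// Stub that replays a scripted list of status codes, then repeats the last one.
// It records a side effect only when it answers 200.
function scriptedStub(statuses: number[]): { call: ToolCall; sideEffects: Set<string> } {
  const sideEffects = new Set<string>();
  let i = 0;
  const call: ToolCall = (key) => {
    const status = statuses[Math.min(i, statuses.length - 1)];
    i += 1;
    if (status === 200) sideEffects.add(key);
    return { status };
  };
  return { call, sideEffects };
}
```

With `scriptedStub([500, 200])` the call succeeds on the second attempt and leaves one side effect; with `scriptedStub([500])` it stops at the cap and leaves none.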

See also webhooks and async processing for event-driven tool completions.

Tool auth and tenancy

Seed users with different roles and tenants. Negative E2E:

  • Viewer cannot invoke admin_export even if planner suggests it—probe 403 on tool route
  • Tool args cannot reference another tenant's accountId—probe isolation
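
Both negative cases assume the tool route checks the session rather than trusting the planner. The sketch below shows one hypothetical form of that check; `authorizeToolCall`, the role table, and the account-to-tenant map are illustrative assumptions, not a real middleware API.

```typescript
type Role = "viewer" | "admin";

interface Session {
  userId: string;
  role: Role;
  tenantId: string;
}

interface ToolCallRequest {
  name: string;
  arguments: { accountId?: string };
}

// Hypothetical policy table: minimum role per tool.
const requiredRole: Record<string, Role> = {
  lookup_order: "viewer",
  admin_export: "admin",
};

// Hypothetical ownership lookup: accountId -> tenantId.
const accountTenant: Record<string, string> = {
  "acct-1": "tenant-a",
  "acct-2": "tenant-b",
};

// Returns the HTTP status the tool route would answer with.
function authorizeToolCall(session: Session, call: ToolCallRequest): 200 | 403 {
  const needed: Role | undefined = requiredRole[call.name];
  if (needed === undefined) return 403; // unknown tools are denied by default
  if (needed === "admin" && session.role !== "admin") return 403;

  // Tenancy check: any accountId in the args must belong to the caller's tenant.
  const accountId = call.arguments.accountId;
  if (accountId !== undefined && accountTenant[accountId] !== session.tenantId) {
    return 403;
  }
  return 200;
}
```

The check runs on the tool arguments the planner produced, so a mocked plan that names another tenant's accountId is a cheap way to exercise the deny path.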

Observability and audit

Assert audit log probe for regulated workflows:

await expect.poll(async () => {
  const res = await request.get(`/api/test/probe-audit?workflowId=${wfId}`);
  return (await res.json()).events.map(e => e.type);
}).toContain('tool.initiate_refund');

CI checklist

  1. All external tools stubbed in default PR job
  2. AIMock planner fixtures versioned beside SmartTests
  3. One E2E per business-critical tool (not every CRUD tool)
  4. Eval job gates planner prompt changes
  5. Unique runId per worker; idempotency keys include runId
  6. Human approval specs use explicit button roles, not coordinates
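
Item 5 can be sketched with two small helpers. The names and key format here are hypothetical; in a Playwright suite the worker index and test title could come from `testInfo.workerIndex` and `testInfo.title`, and the nonce from any random source.

```typescript
// Build a run id that is unique per CI worker and per test.
function makeRunId(workerIndex: number, testTitle: string, nonce: string): string {
  const slug = testTitle
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-") // collapse anything non-alphanumeric into one dash
    .replace(/^-|-$/g, ""); // trim a leading or trailing dash
  return `run-w${workerIndex}-${slug}-${nonce}`;
}

// Derive a stable idempotency key for one tool call within a run.
// The same run, tool, and step always produce the same key, so a retry
// replays the key instead of minting a new one.
function makeIdempotencyKey(runId: string, toolName: string, stepIndex: number): string {
  return `${runId}:${toolName}:${stepIndex}`;
}
```

Because the key embeds the runId, two workers running the same scenario never collide, while a retry inside one run still deduplicates.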

Anti-patterns

| Anti-pattern | Why it fails | Better approach |
|---|---|---|
| Live Salesforce/Stripe in CI | Flake, cost, PII | Stub tool HTTP + probe |
| Assert final chat message only | Miss double execution | Probe side effects |
| No idempotency test | Double charge in prod | Replay tool call fixture |
| Skip approval gate | Auto-executes dangerous ops | Negative + approve paths |
| 500-tool eval in one E2E | Unmaintainable | Eval trajectories + sparse E2E |
| Shared workflow id | Parallel pollution | runId-scoped workflows |

Example scenario

Situation: User asks agent to schedule a meeting with legal and send the agenda email.

Expected outcome: Calendar hold created and email queued exactly once after user approves the plan.

Why UI-only automation breaks: Agent says "Scheduled!" but probe shows no calendar event and duplicate emails on retry.

  1. Arrange: AIMock planner returns calendar_create + email_send tools; stub both; seed user OAuth token for test.
  2. Act: Submit request; approve plan in UI.
  3. Assert: Probe one calendar event and one email job; workflow status completed.

TestChimp workflow: Track agent_tool × workflow_step in TrueCoverage; expand when new calendar tool ships without E2E.

This follows the same Arrange/Act/Assert pattern as the expired-coupon checkout scenario.

Evals vs E2E: when each layer helps

| Layer | Best for | Limitations |
|---|---|---|
| Offline evals (golden trajectories, tool-arg validation, LLM-as-judge on plans) | Planner regression, forbidden-tool policies, sequence correctness on prompt changes | Cannot verify HTTP auth wiring, approval UI, idempotent DB writes, or stub vs prod tool schema drift |
| E2E SmartTests (AIMock planner + ai.act/ai.verify + probe Assert) | Tool router integration, human approval, retries, session-scoped auth | Expensive to cover every tool combination |
| Hybrid | Evals on every planner change; E2E on release-critical workflows | Requires mapping eval failures to new scenarios |

Use golden trajectories when adding tools or changing system instructions—catch wrong-tool selection before merge. Use E2E when a tool moves money, PII, or crosses tenant boundaries. LLM-as-judge helps grade ambiguous plans but must be calibrated; do not use as sole gate until correlated with human review.
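
One way to check that calibration is to score the judge against a human-labelled sample before trusting it as a gate. The sketch below computes raw agreement and Cohen's kappa for pass/fail labels; `judgeAgreement` and `LabeledPlan` are hypothetical names, and any threshold you gate on is your own choice.

```typescript
interface LabeledPlan {
  human: boolean; // human reviewer said the plan satisfies policy
  judge: boolean; // LLM judge said the plan satisfies policy
}

// Raw agreement and Cohen's kappa between human and judge pass/fail labels.
function judgeAgreement(samples: LabeledPlan[]): { agreement: number; kappa: number } {
  const n = samples.length;
  if (n === 0) throw new Error("need at least one labeled sample");

  let agree = 0;
  let humanPass = 0;
  let judgePass = 0;
  for (const s of samples) {
    if (s.human === s.judge) agree += 1;
    if (s.human) humanPass += 1;
    if (s.judge) judgePass += 1;
  }

  const observed = agree / n;
  const pH = humanPass / n;
  const pJ = judgePass / n;
  // Agreement expected by chance if the two raters labelled independently.
  const expected = pH * pJ + (1 - pH) * (1 - pJ);
  // Kappa is undefined when both raters used a single identical label for
  // every sample; this sketch reports 1 there because agreement is total.
  const kappa = expected === 1 ? 1 : (observed - expected) / (1 - expected);
  return { agreement: observed, kappa };
}
```

Kappa corrects for the agreement two raters would reach by chance, which matters when most plans pass and raw agreement looks high by default.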

TestChimp does not ship eval tooling—combine your eval pipeline with AIMock SmartTests. When TrueCoverage shows new agent_tool usage in prod without test scenarios, /testchimp evolve closes the gap.

Connect scenarios to your QA workflow

Capture business rules in markdown test plans and enforce them with seed routes and probe Assert. Link SmartTests with // @Scenario: for requirement traceability. Use /testchimp test on PRs; /testchimp explore on SmartTest paths for non-functional gaps (ExploreChimp).

Frequently asked questions

How do I test agent tool calls without calling real APIs?

Stub tool endpoints in test env with seed routes returning fixed payloads; AIMock supplies deterministic planner output. Probe DB for side effects the tool should create. Evals verify tool selection offline; E2E verifies integration and auth.

What if the agent picks the wrong tool?

Offline evals with golden trajectories catch selection regression on prompt changes. E2E tests one happy path per critical tool with AIMock planning responses and forbidden-tool negative cases.

How do I test human approval gates?

Assert workflow stays pending_approval via probe until user clicks approve; reject path must leave side effects unchanged. Never rely on assistant prose claiming approval happened.

How do I test retries and idempotency?

Stub tool to fail once then succeed, or POST duplicate idempotency keys. Probe exactly one side-effect row. Assert workflow reaches completed or failed—not stuck looping.

Should every tool have an E2E test?

No—cover business-critical and high-risk tools in E2E; use offline eval trajectories for breadth. TrueCoverage highlights prod tools missing from either layer.

Where do ai.act and ai.verify fit in agent tests?

Use for approval dialogs, plan summaries, and volatile status copy. Keep tool truth on probes and audit logs. ai.verify alone is insufficient for refunds or exports.

How do we cover every agent tool in prod?

Compare prod vs test-run across agent_tool × workflow_step in TrueCoverage. When new tools spike in prod without scenarios, run /testchimp evolve, then add AIMock planner fixtures and stub endpoints for each critical tool.

Apply these patterns in your repo

Run `/testchimp init` to connect TestChimp to your repo, then `/testchimp test` on PRs to turn these patterns into maintained SmartTests. Use `/testchimp evolve` when you want to expand coverage as your app grows.