How to Test AI Agent Workflows and Tool Calling

Short answer

AI agents plan multi-step workflows, invoke tools with side effects, retry on failure, and sometimes pause for human approval, so testing only the final assistant message misses the dangerous parts. Combine offline evals on tool-selection trajectories, AIMock for deterministic planner responses in E2E, stubbed tool APIs verified with probe Assert, and hybrid ai.act/ai.verify steps on the approval UI. Do not make live third-party calls in every PR.

Part of Testing Guides by AI and conversational UX.

Who this is for

Teams shipping agentic products: copilots that book meetings, file tickets, run refunds, query CRMs, or orchestrate internal APIs through LLM tool routers (OpenAI function calling, Anthropic tools, LangGraph, custom planners).

Not for: single-shot chat with no tools, or batch offline pipelines with no user-facing workflow UI.

Why testing agent workflows matters

Agent bugs compound across steps:

Revenue loss — double refund when retry lacks idempotency; wrong SKU updated in ERP.
Security incidents — agent calls admin tool with viewer session; cross-tenant CRM read via wrong accountId in tool args.
Support load — infinite replan loops; human approval stuck pending while UI shows "complete".
Compliance exposure — agent sends email or exports data without audit log; PII written to wrong ticket.

The planner can produce eloquent summaries while tool execution fails silently. E2E must assert tool dispatch, idempotent side effects, and approval gates via probes.

Complexity map

| Scenario | Edge case | Why tests break | Approach |
| --- | --- | --- | --- |
| Wrong tool selected | Similar tool names | Bad side effect | Offline golden trajectories + one E2E per critical tool |
| Tool timeout | Upstream 504 | Agent retries forever | Stub slow/fail tool; assert retry cap |
| Partial failure | Step 2 of 5 fails | Orphan records | Probe compensation/rollback |
| Idempotency | Same tool call twice | Double charge | Probe single row; idempotency key |
| Human approval | Pending state | Auto-executes | Assert blocked until approve click |
| Parallel tools | Race on DB | Flaky order | Serialize in prod; test both orders |
| Auth scope | Tool needs OAuth | 403 mid-workflow | Seed token; probe deny before Act |
| Long-running job | Async completion | Assert too early | Poll job status probe |
| Plan visibility | Hidden chain-of-thought | Untested steps | Assert step UI or audit log probe |
| Model swap | Planner behavior shift | Wrong tool regression | Eval gate on planner prompt |
| Rate limits | Tool API 429 | Random CI fail | Stub tools in CI |
| Cancellation | User aborts mid-plan | Orphan tool calls | Probe no partial writes |

Agent architecture under test

User goal → Planner (LLM) → Tool calls → External APIs / DB
                ↓
         Human approval gate (optional)
                ↓
         Workflow state machine (running / pending / done / failed)

Test at three boundaries:

Planning — offline evals: given user goal + context, expect tool sequence
Integration — AIMock planner + real tool router + stubbed HTTP tools
Truth — probe Assert on workflow row, audit log, side-effect tables

Offline evals: golden trajectories

Maintain fixtures independent of Playwright:

{
  "id": "refund-damaged-item",
  "context": { "orderId": "12345", "reason": "damaged" },
  "expectedTools": ["lookup_order", "check_refund_policy", "initiate_refund"],
  "forbiddenTools": ["delete_account"]
}

Grade with:

Exact match on tool names and their order (strict workflows)
Set match when the order is flexible
LLM-as-judge on whether the trajectory satisfies policy, calibrated against human labels

Run evals on PRs touching planner prompts, tool schemas, or permission middleware.
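
The first two grading modes can be sketched as a small pure function. This is a minimal illustration, not a TestChimp API: `TrajectoryFixture`, `gradeTrajectory`, and the `"exact"`/`"set"` mode names are hypothetical, and the fixture shape simply mirrors the JSON example above. Forbidden tools fail the run in either mode.

```typescript
// Hypothetical fixture shape, mirroring the JSON example above.
interface TrajectoryFixture {
  id: string;
  expectedTools: string[];
  forbiddenTools: string[];
}

type MatchMode = "exact" | "set";

interface GradeResult {
  pass: boolean;
  reasons: string[];
}

// Grade the tool names a planner emitted against one golden trajectory.
function gradeTrajectory(
  fixture: TrajectoryFixture,
  actualTools: string[],
  mode: MatchMode = "exact",
): GradeResult {
  const reasons: string[] = [];

  // A forbidden tool fails the run regardless of match mode.
  for (const tool of actualTools) {
    if (fixture.forbiddenTools.includes(tool)) {
      reasons.push(`forbidden tool called: ${tool}`);
    }
  }

  if (mode === "exact") {
    // Strict workflows: same names, same order, same length.
    const same =
      actualTools.length === fixture.expectedTools.length &&
      actualTools.every((tool, i) => tool === fixture.expectedTools[i]);
    if (!same) reasons.push("tool sequence differs from expected order");
  } else {
    // Flexible order: compare as sets (duplicate calls are ignored here).
    const expected = new Set(fixture.expectedTools);
    const actual = new Set(actualTools);
    const same =
      expected.size === actual.size &&
      fixture.expectedTools.every((tool) => actual.has(tool));
    if (!same) reasons.push("tool set differs from expected set");
  }

  return { pass: reasons.length === 0, reasons };
}
```

Returning `reasons` instead of a bare boolean keeps eval failures readable in PR output. Because set mode ignores duplicates, a double `initiate_refund` call still needs the idempotency checks described later.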

AIMock for planner responses in E2E

Stub planning LLM calls while exercising real orchestration code:

// AIMock returns fixed tool plan for runId + scenario id
{
  "toolCalls": [
    { "name": "lookup_order", "arguments": { "orderId": "12345" } },
    { "name": "initiate_refund", "arguments": { "orderId": "12345", "amount": 4999 } }
  ]
}

Your backend still validates args, enforces auth, and hits stubbed tool HTTP endpoints that return deterministic payloads.
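
Because the planner is mocked, argument validation is one of the real code paths the E2E run still exercises. A minimal sketch of that check is shown here, assuming a hand-rolled schema table. `toolSchemas` and `validateToolCall` are hypothetical names, and treating `amount` as a positive integer in minor units is an assumption drawn from the fixture above. A production router would more likely use its existing schema library.

```typescript
// Shape of one planned call, matching the mocked plan above.
interface ToolCall {
  name: string;
  arguments: Record<string, unknown>;
}

type ArgCheck = (value: unknown) => boolean;

const isNonEmptyString: ArgCheck = (v) => typeof v === "string" && v.length > 0;
// Assumption: amounts are positive integers in minor units (e.g. 4999).
const isPositiveInteger: ArgCheck = (v) =>
  typeof v === "number" && Number.isInteger(v) && v > 0;

// Hypothetical per-tool schemas; a Map avoids prototype-key surprises.
const toolSchemas = new Map<string, Record<string, ArgCheck>>([
  ["lookup_order", { orderId: isNonEmptyString }],
  ["initiate_refund", { orderId: isNonEmptyString, amount: isPositiveInteger }],
]);

// Returns a list of problems; an empty list means the call may be dispatched.
function validateToolCall(call: ToolCall): string[] {
  const schema = toolSchemas.get(call.name);
  if (!schema) return [`unknown tool: ${call.name}`];

  const errors: string[] = [];
  for (const [field, check] of Object.entries(schema)) {
    if (!check(call.arguments[field])) {
      errors.push(`invalid or missing argument: ${field}`);
    }
  }
  // Reject extra arguments so a planner cannot smuggle in unvetted fields.
  for (const field of Object.keys(call.arguments)) {
    if (!Object.prototype.hasOwnProperty.call(schema, field)) {
      errors.push(`unexpected argument: ${field}`);
    }
  }
  return errors;
}
```

A useful negative fixture is a mocked plan with a malformed `amount` or an extra `accountId`. The workflow should fail before any stub endpoint is hit.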

Stub tool endpoints (Arrange)

// POST /api/test/stub-tool/lookup_order
// Returns fixture order for runId

// POST /api/test/stub-tool/initiate_refund
// Writes refund row idempotently on idempotency-key header

In the Arrange step, Playwright registers the scenario and runId; in the test environment, the tool router points at the stub base URL.
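
The idempotent write behind the `initiate_refund` stub could look like the following in-memory sketch. `RefundStub`, `RefundRow`, and the method names are hypothetical; a real stub would sit behind the test-only HTTP route and read the key from the idempotency-key header.

```typescript
// One refund side effect, scoped to a test run.
interface RefundRow {
  orderId: string;
  amount: number;
  runId: string;
}

// In-memory stand-in for the stub's storage (hypothetical).
class RefundStub {
  private rowsByKey = new Map<string, RefundRow>();

  // Replaying the same key returns the stored row and writes nothing new.
  initiateRefund(
    idempotencyKey: string,
    row: RefundRow,
  ): { created: boolean; row: RefundRow } {
    const existing = this.rowsByKey.get(idempotencyKey);
    if (existing) return { created: false, row: existing };
    this.rowsByKey.set(idempotencyKey, row);
    return { created: true, row };
  }

  // Probe helper: how many refund rows exist for this order in this run.
  countRefunds(runId: string, orderId: string): number {
    let count = 0;
    this.rowsByKey.forEach((row) => {
      if (row.runId === runId && row.orderId === orderId) count += 1;
    });
    return count;
  }
}
```

Scoping both the write and the probe by runId keeps parallel workers from seeing each other's rows.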

E2E pattern: happy path with probe Assert

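// Act: drive the chat and approval UI with hybrid AI steps plus an explicit role-based click.
await ai.act('Ask the agent to refund order 12345 for damaged item');
await ai.verify('Agent shows plan summary or approval prompt before executing refund');
await page.getByRole('button', { name: /approve/i }).click();

// Assert: poll the test-only probe for the refund row.
// 'pending' is assumed to be the status the stubbed refund row reports once initiated; adjust to your probe's contract.
await expect.poll(async () => {
  const res = await request.get('/api/test/probe-refund/12345');
  return (await res.json()).status;
}, { timeout: 20_000 }).toBe('pending');
await ai.verify('Agent confirms refund initiated');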

Never stop at ai.verify when money or PII moves: the probe is authoritative.

Human-in-the-loop approval

Cover:

| Case | Assert |
| --- | --- |
| Approval required | Tool not called until approve; probe no refund row |
| Reject | Plan cancelled; probe unchanged |
| Timeout | Workflow fails gracefully; user can retry |
| Re-approval after edit | New plan id; old partial writes rolled back |

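// probeWorkflowStatus and probeRefundCount are assumed helpers that wrap test-only probe routes.
await expect.poll(() => probeWorkflowStatus(runId)).toBe('pending_approval');
await page.getByRole('button', { name: /reject/i }).click();
// Polling for 0 passes on the first read, so confirm the workflow has settled before trusting it.
await expect.poll(() => probeRefundCount('12345')).toBe(0);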
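
The invariant these assertions protect can be written down as a tiny state machine: no tool runs while the workflow is pending approval, and a reject leaves nothing executed. This is a sketch under assumptions. `ApprovalGatedWorkflow` is a hypothetical class, the status names other than `pending_approval` are illustrative, and pushing to `executedTools` stands in for dispatching to the tool router.

```typescript
// Status names are assumptions; align them with your workflow table.
type WorkflowStatus = "pending_approval" | "running" | "completed" | "cancelled";

class ApprovalGatedWorkflow {
  status: WorkflowStatus = "pending_approval";
  readonly executedTools: string[] = [];
  private readonly plannedTools: string[];

  constructor(plannedTools: string[]) {
    this.plannedTools = plannedTools;
  }

  // Tools run only after an explicit approval.
  approve(): void {
    this.requirePending("approve");
    this.status = "running";
    for (const tool of this.plannedTools) {
      this.executedTools.push(tool); // stand-in for dispatching to the tool router
    }
    this.status = "completed";
  }

  // Rejecting cancels the plan without executing anything.
  reject(): void {
    this.requirePending("reject");
    this.status = "cancelled";
  }

  // Guard against double approval or approve-after-reject.
  private requirePending(action: string): void {
    if (this.status !== "pending_approval") {
      throw new Error(`cannot ${action} a workflow in status ${this.status}`);
    }
  }
}
```

The guard in `requirePending` is what a "double click on approve" E2E case would exercise: the second approval must fail rather than dispatch the plan again.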

Failure, retry, and idempotency

| Scenario | Arrange | Assert |
| --- | --- | --- |
| Tool 500 first call | Stub returns 500 then 200 | Probe single side effect after retry |
| Duplicate delivery | Replay same tool idempotency key | Probe one row |
| Planner retry loop | Stub always 500 | Workflow fails with user-visible error; probe no partial |
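
The "500 then 200" and "always 500" rows can be modeled with a scripted stub and a capped retry loop. The sketch below is illustrative only: `scriptedStub` and `callWithRetryCap` are hypothetical helpers, retrying only on 5xx is an assumption, and backoff delays are left out to keep the example fast.

```typescript
interface ToolResponse {
  status: number;
}

type ToolFn = () => Promise<ToolResponse>;

// Stub that replays a scripted list of statuses, repeating the last one forever.
function scriptedStub(statuses: number[]): { call: ToolFn; calls: () => number } {
  let count = 0;
  return {
    call: async () => {
      // Fall back to 500 if the script is empty.
      const status = statuses[Math.min(count, statuses.length - 1)] ?? 500;
      count += 1;
      return { status };
    },
    calls: () => count,
  };
}

// Retry on 5xx only, and never more than maxAttempts times in total.
async function callWithRetryCap(
  tool: ToolFn,
  maxAttempts: number,
): Promise<{ ok: boolean; attempts: number }> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const res = await tool();
    if (res.status < 500) {
      // 2xx/3xx succeed; 4xx fails immediately without retrying.
      return { ok: res.status < 400, attempts: attempt };
    }
  }
  return { ok: false, attempts: maxAttempts };
}
```

Asserting on the stub's call count is what catches an unbounded retry loop. The workflow status alone can look identical after 3 attempts or 300.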

See also webhooks and async processing for event-driven tool completions.

Tool auth and tenancy

Seed users with different roles and tenants. Negative E2E:

Viewer cannot invoke admin_export even if planner suggests it—probe 403 on tool route
Tool args cannot reference another tenant's accountId—probe isolation
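
Both negative cases reduce to a check the tool route runs before dispatch, whatever the planner suggested. A minimal sketch, assuming a session that carries role and tenant: `authorizeToolCall`, `adminOnlyTools`, and the `accountTenant` lookup with its sample ids are hypothetical, and a real service would resolve account ownership from its own data store.

```typescript
type Role = "viewer" | "admin";

interface Session {
  userId: string;
  role: Role;
  tenantId: string;
}

interface AccountToolCall {
  name: string;
  arguments: { accountId: string };
}

// Hypothetical policy data seeded for the test run.
const adminOnlyTools = new Set<string>(["admin_export"]);
const accountTenant = new Map<string, string>([
  ["acct-a1", "tenant-a"],
  ["acct-b1", "tenant-b"],
]);

// Decide from the session, never from what the planner proposed.
function authorizeToolCall(
  session: Session,
  call: AccountToolCall,
): { status: 200 | 403; reason?: string } {
  if (adminOnlyTools.has(call.name) && session.role !== "admin") {
    return { status: 403, reason: "role" };
  }
  // Unknown accounts resolve to undefined and are denied by default.
  if (accountTenant.get(call.arguments.accountId) !== session.tenantId) {
    return { status: 403, reason: "tenant" };
  }
  return { status: 200 };
}
```

In E2E, the mocked planner deliberately proposes the forbidden call, and the probe confirms the 403 and the absence of any export or cross-tenant read.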

Observability and audit

For regulated workflows, assert on the audit log through a probe:

await expect.poll(async () => {
  const res = await request.get(`/api/test/probe-audit?workflowId=${wfId}`);
  return (await res.json()).events.map(e => e.type);
}).toContain('tool.initiate_refund');

CI checklist

All external tools stubbed in default PR job
AIMock planner fixtures versioned beside SmartTests
One E2E per business-critical tool (not every CRUD tool)
Eval job gates planner prompt changes
Unique runId per worker; idempotency keys include runId
Human approval specs use explicit button roles, not coordinates
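
One way to satisfy the runId item is to derive idempotency keys from the runId, the tool name, and a stable serialization of the arguments. The helpers below are a sketch with hypothetical names (`makeRunId`, `stableStringify`, `idempotencyKey`). The worker index could come from Playwright's `testInfo.workerIndex`, and the nonce from any per-test unique value.

```typescript
// Build a run identifier that differs per worker and per test.
function makeRunId(workerIndex: number, nonce: string): string {
  return `run-w${workerIndex}-${nonce}`;
}

// Serialize with sorted keys so argument order does not change the key.
function stableStringify(value: unknown): string {
  if (value === undefined) return "null";
  if (value === null || typeof value !== "object") return JSON.stringify(value);
  if (Array.isArray(value)) {
    return `[${value.map((item) => stableStringify(item)).join(",")}]`;
  }
  const obj = value as Record<string, unknown>;
  const body = Object.keys(obj)
    .sort()
    .map((key) => `${JSON.stringify(key)}:${stableStringify(obj[key])}`)
    .join(",");
  return `{${body}}`;
}

// Same run + same tool + same args always yields the same key.
function idempotencyKey(
  runId: string,
  toolName: string,
  args: Record<string, unknown>,
): string {
  return `${runId}:${toolName}:${stableStringify(args)}`;
}
```

Including the runId means a replay inside one test collapses to a single row, while two parallel workers refunding the same fixture order never collide.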

Anti-patterns

| Anti-pattern | Why it fails | Better approach |
| --- | --- | --- |
| Live Salesforce/Stripe in CI | Flake, cost, PII | Stub tool HTTP + probe |
| Assert final chat message only | Miss double execution | Probe side effects |
| No idempotency test | Double charge in prod | Replay tool call fixture |
| Skip approval gate | Auto-executes dangerous ops | Negative + approve paths |
| 500-tool eval in one E2E | Unmaintainable | Eval trajectories + sparse E2E |
| Shared workflow id | Parallel pollution | runId-scoped workflows |

Example scenario

Situation: User asks agent to schedule a meeting with legal and send the agenda email.

Expected outcome: Calendar hold created and email queued exactly once after user approves the plan.

Why UI-only automation breaks: Agent says "Scheduled!" but probe shows no calendar event and duplicate emails on retry.

Arrange: AIMock planner returns calendar_create + email_send tools; stub both; seed user OAuth token for test.
Act: Submit request; approve plan in UI.
Assert: Probe one calendar event and one email job; workflow status completed.

TestChimp workflow: Track agent_tool × workflow_step in TrueCoverage; expand when new calendar tool ships without E2E.

Same Arrange/Act/Assert pattern as expired-coupon checkout.

Evals vs E2E: when each layer helps

| Layer | Best for | Limitations |
| --- | --- | --- |
| Offline evals (golden trajectories, tool-arg validation, LLM-as-judge on plans) | Planner regression, forbidden-tool policies, sequence correctness on prompt changes | Cannot verify HTTP auth wiring, approval UI, idempotent DB writes, or stub vs prod tool schema drift |
| E2E SmartTests (AIMock planner + `ai.act`/`ai.verify` + probe Assert) | Tool router integration, human approval, retries, session-scoped auth | Expensive to cover every tool combination |
| Hybrid | Evals on every planner change; E2E on release-critical workflows | Requires mapping eval failures to new scenarios |

Use golden trajectories when adding tools or changing system instructions—catch wrong-tool selection before merge. Use E2E when a tool moves money, PII, or crosses tenant boundaries. LLM-as-judge helps grade ambiguous plans but must be calibrated; do not use as sole gate until correlated with human review.
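
Calibrating a judge against human review can start with two numbers over a labeled sample: raw agreement and Cohen's kappa, which discounts agreement expected by chance. Kappa is a standard statistic chosen here for illustration, not something the guide prescribes, and `agreementStats` is a hypothetical helper over binary pass/fail labels.

```typescript
type Label = "pass" | "fail";

// Compare judge verdicts with human verdicts on the same trajectories.
function agreementStats(
  judge: Label[],
  human: Label[],
): { agreement: number; kappa: number } {
  if (judge.length === 0 || judge.length !== human.length) {
    throw new Error("label arrays must be non-empty and of equal length");
  }
  const n = judge.length;
  let agree = 0;
  let judgePass = 0;
  let humanPass = 0;
  for (let i = 0; i < n; i++) {
    if (judge[i] === human[i]) agree += 1;
    if (judge[i] === "pass") judgePass += 1;
    if (human[i] === "pass") humanPass += 1;
  }

  // Observed agreement.
  const po = agree / n;
  // Agreement expected by chance from each rater's pass/fail rates.
  const pe =
    (judgePass / n) * (humanPass / n) +
    ((n - judgePass) / n) * ((n - humanPass) / n);
  // Kappa is undefined when both raters used one identical label; report NaN.
  const kappa = pe === 1 ? Number.NaN : (po - pe) / (1 - pe);
  return { agreement: po, kappa };
}
```

For example, judge labels pass, pass, fail, fail against human labels pass, fail, fail, fail agree on 3 of 4 items (0.75), with chance agreement 0.5 and kappa 0.5. What threshold is good enough to gate merges is a team decision this sketch does not settle.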

TestChimp does not ship eval tooling—combine your eval pipeline with AIMock SmartTests. When TrueCoverage shows new agent_tool usage in prod without test scenarios, /testchimp evolve closes the gap.

Connect scenarios to your QA workflow

Capture business rules in markdown test plans and enforce them with seed routes and probe Assert. Link SmartTests with // @Scenario: for requirement traceability. Use /testchimp test on PRs; /testchimp explore on SmartTest paths for non-functional gaps (ExploreChimp).

Conversational UI — chat surface patterns
RAG search — retrieval tools in agents
Webhooks async — async tool completion
LLM output validation — structured tool args
RBAC permissions — tool authorization

External references

OpenAI function calling
Anthropic tool use
Playwright test fixtures — isolated runId per worker
SmartTests intro — AIMock and hybrid steps

Frequently asked questions

How do I test agent tool calls without calling real APIs?

Stub tool endpoints in test env with seed routes returning fixed payloads; AIMock supplies deterministic planner output. Probe DB for side effects the tool should create. Evals verify tool selection offline; E2E verifies integration and auth.

What if the agent picks the wrong tool?

Offline evals with golden trajectories catch selection regression on prompt changes. E2E tests one happy path per critical tool with AIMock planning responses and forbidden-tool negative cases.

How do I test human approval gates?

Assert workflow stays pending_approval via probe until user clicks approve; reject path must leave side effects unchanged. Never rely on assistant prose claiming approval happened.

How do I test retries and idempotency?

Stub tool to fail once then succeed, or POST duplicate idempotency keys. Probe exactly one side-effect row. Assert workflow reaches completed or failed—not stuck looping.

Should every tool have an E2E test?

No—cover business-critical and high-risk tools in E2E; use offline eval trajectories for breadth. TrueCoverage highlights prod tools missing from either layer.

Where do ai.act and ai.verify fit in agent tests?

Use for approval dialogs, plan summaries, and volatile status copy. Keep tool truth on probes and audit logs. ai.verify alone is insufficient for refunds or exports.

How do we cover every agent tool in prod?

Compare prod and test-run usage across agent_tool × workflow_step in TrueCoverage. When new tools show up in prod without matching scenarios, run /testchimp evolve, then add AIMock planner fixtures and stub endpoints for each critical tool.

Apply these patterns in your repo

Run `/testchimp init` to connect TestChimp to your repo, then `/testchimp test` on PRs to turn these patterns into maintained SmartTests. Use `/testchimp evolve` when you want to expand coverage as your app grows.

Start free on TestChimp · Book a demo
