How to Test AI Agent Workflows and Tool Calling
Short answer
AI agents plan multi-step workflows, invoke tools with side effects, retry on failure, and sometimes pause for human approval. Testing only the final assistant message misses the dangerous parts. Combine four layers: offline evals on tool-selection trajectories, AIMock for deterministic planner responses in E2E, stubbed tool APIs checked with probe Assert, and hybrid ai.act/ai.verify on the approval UI. Do not make live third-party calls in every PR.
Who this is for
Teams shipping agentic products: copilots that book meetings, file tickets, run refunds, query CRMs, or orchestrate internal APIs through LLM tool routers (OpenAI function calling, Anthropic tools, LangGraph, custom planners).
Not for: single-shot chat with no tools, or batch offline pipelines with no user-facing workflow UI.
Why testing agent workflows matters
Agent bugs compound across steps:
- Revenue loss — double refund when retry lacks idempotency; wrong SKU updated in ERP.
- Security incidents — agent calls admin tool with viewer session; cross-tenant CRM read via wrong `accountId` in tool args.
- Support load — infinite replan loops; human approval stuck pending while UI shows "complete".
- Compliance exposure — agent sends email or exports data without audit log; PII written to wrong ticket.
The planner can produce eloquent summaries while tool execution fails silently. E2E must assert tool dispatch, idempotent side effects, and approval gates via probes.
Complexity map
| Scenario | Edge case | Why tests break | Approach |
|---|---|---|---|
| Wrong tool selected | Similar tool names | Bad side effect | Offline golden trajectories + one E2E per critical tool |
| Tool timeout | Upstream 504 | Agent retries forever | Stub slow/fail tool; assert retry cap |
| Partial failure | Step 2 of 5 fails | Orphan records | Probe compensation/rollback |
| Idempotency | Same tool call twice | Double charge | Probe single row; idempotency key |
| Human approval | Pending state | Auto-executes | Assert blocked until approve click |
| Parallel tools | Race on DB | Flaky order | Serialize in prod; test both orders |
| Auth scope | Tool needs OAuth | 403 mid-workflow | Seed token; probe deny before Act |
| Long-running job | Async completion | Assert too early | Poll job status probe |
| Plan visibility | Hidden chain-of-thought | Untested steps | Assert step UI or audit log probe |
| Model swap | Planner behavior shift | Wrong tool regression | Eval gate on planner prompt |
| Rate limits | Tool API 429 | Random CI fail | Stub tools in CI |
| Cancellation | User aborts mid-plan | Orphan tool calls | Probe no partial writes |
Agent architecture under test
User goal → Planner (LLM) → Tool calls → External APIs / DB
↓
Human approval gate (optional)
↓
Workflow state machine (running / pending / done / failed)
Test at three boundaries:
- Planning — offline evals: given user goal + context, expect tool sequence
- Integration — AIMock planner + real tool router + stubbed HTTP tools
- Truth — probe Assert on workflow row, audit log, side-effect tables
Offline evals: golden trajectories
Maintain fixtures independent of Playwright:
{
"id": "refund-damaged-item",
"context": { "orderId": "12345", "reason": "damaged" },
"expectedTools": ["lookup_order", "check_refund_policy", "initiate_refund"],
"forbiddenTools": ["delete_account"]
}
Grade with:
- Exact match on tool names and ordered sequence (strict workflows)
- Set match when order flexible
- LLM-as-judge on whether trajectory satisfies policy—calibrate against human labels
Run evals on PRs touching planner prompts, tool schemas, or permission middleware.
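The exact-match and set-match grading modes above can be sketched as a small pure function. This is a minimal sketch, assuming the fixture shape shown earlier; `gradeTrajectory`, `TrajectoryFixture`, and `MatchMode` are hypothetical names, not part of any eval framework.

```typescript
// Hypothetical fixture shape, mirroring the JSON fixture in this section.
interface TrajectoryFixture {
  id: string;
  expectedTools: string[];
  forbiddenTools: string[];
}

type MatchMode = "exact" | "set";

interface GradeResult {
  pass: boolean;
  reasons: string[];
}

// Grade the tool names a planner emitted against one golden trajectory.
function gradeTrajectory(
  fixture: TrajectoryFixture,
  actualTools: string[],
  mode: MatchMode = "exact",
): GradeResult {
  const reasons: string[] = [];

  // A forbidden tool fails the run in either mode.
  for (const tool of actualTools) {
    if (fixture.forbiddenTools.includes(tool)) {
      reasons.push(`forbidden tool called: ${tool}`);
    }
  }

  if (mode === "exact") {
    // Strict workflows: same tools, same order, same length.
    const sameLength = actualTools.length === fixture.expectedTools.length;
    const sameOrder = fixture.expectedTools.every((t, i) => actualTools[i] === t);
    if (!sameLength || !sameOrder) {
      reasons.push(
        `expected ${fixture.expectedTools.join(" > ")}, got ${actualTools.join(" > ")}`,
      );
    }
  } else {
    // Flexible order: compare membership only.
    const missing = fixture.expectedTools.filter((t) => !actualTools.includes(t));
    const extra = actualTools.filter((t) => !fixture.expectedTools.includes(t));
    if (missing.length > 0) reasons.push(`missing tools: ${missing.join(", ")}`);
    if (extra.length > 0) reasons.push(`unexpected tools: ${extra.join(", ")}`);
  }

  return { pass: reasons.length === 0, reasons };
}
```

Returning reasons rather than a bare boolean makes an eval failure on a PR readable without rerunning the planner.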
AIMock for planner responses in E2E
Stub planning LLM calls while exercising real orchestration code:
// AIMock returns fixed tool plan for runId + scenario id
{
"toolCalls": [
{ "name": "lookup_order", "arguments": { "orderId": "12345" } },
{ "name": "initiate_refund", "arguments": { "orderId": "12345", "amount": 4999 } }
]
}
Your backend still validates args, enforces auth, and hits stubbed tool HTTP endpoints that return deterministic payloads.
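Because the stubbed plan still passes through backend validation, it helps to picture what that check might look like. The following is a sketch only: `toolSchemas`, `validateToolCall`, and `validatePlan` are hypothetical names for illustration, and none of this is AIMock's or any SDK's API.

```typescript
// Hypothetical types and registry for illustration only.
interface ToolCall {
  name: string;
  arguments: Record<string, unknown>;
}

type ArgType = "string" | "number";

// Required arguments per tool, taken from the example plan in this section.
const toolSchemas = new Map<string, Record<string, ArgType>>([
  ["lookup_order", { orderId: "string" }],
  ["initiate_refund", { orderId: "string", amount: "number" }],
]);

// Returns a list of problems; an empty list means the call may be dispatched.
function validateToolCall(call: ToolCall): string[] {
  const schema = toolSchemas.get(call.name);
  if (schema === undefined) return [`unknown tool: ${call.name}`];

  const errors: string[] = [];
  // Every required argument must be present with the expected primitive type.
  for (const [arg, expectedType] of Object.entries(schema)) {
    if (typeof call.arguments[arg] !== expectedType) {
      errors.push(`${call.name}.${arg} must be a ${expectedType}`);
    }
  }
  // Reject arguments the schema does not declare.
  for (const arg of Object.keys(call.arguments)) {
    if (!Object.prototype.hasOwnProperty.call(schema, arg)) {
      errors.push(`${call.name}.${arg} is not an allowed argument`);
    }
  }
  return errors;
}

// Validate a whole stubbed plan before any tool is dispatched.
function validatePlan(toolCalls: ToolCall[]): string[] {
  return toolCalls.flatMap((call) => validateToolCall(call));
}
```

A deliberately malformed planner fixture (for example, `amount` as a string) is a cheap negative E2E case that exercises this path.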
Stub tool endpoints (Arrange)
// POST /api/test/stub-tool/lookup_order
// Returns fixture order for runId
// POST /api/test/stub-tool/initiate_refund
// Writes refund row idempotently on idempotency-key header
Playwright Arrange registers scenario + runId; tool router points to stub base URL in test env.
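The idempotent write behind the `initiate_refund` stub could look roughly like the in-memory sketch below. It assumes the stub keys rows by the idempotency-key header value; `RefundStub` and its methods are hypothetical, and a real stub would sit behind the HTTP route and a test database.

```typescript
// Hypothetical row shape for the stubbed refund table.
interface RefundRow {
  runId: string;
  orderId: string;
  amount: number;
}

class RefundStub {
  // One row per idempotency key; a replay returns the stored row.
  private rowsByKey = new Map<string, RefundRow>();

  initiateRefund(idempotencyKey: string, row: RefundRow): { created: boolean; row: RefundRow } {
    const existing = this.rowsByKey.get(idempotencyKey);
    if (existing !== undefined) {
      // Duplicate delivery: do not write again.
      return { created: false, row: existing };
    }
    this.rowsByKey.set(idempotencyKey, row);
    return { created: true, row };
  }

  // What a probe route would report for one run and order.
  countRefunds(runId: string, orderId: string): number {
    return Array.from(this.rowsByKey.values()).filter(
      (row) => row.runId === runId && row.orderId === orderId,
    ).length;
  }
}
```

Scoping the count by `runId` keeps parallel workers from seeing each other's rows.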
E2E pattern: happy path with probe Assert
await ai.act('Ask the agent to refund order 12345 for damaged item');
await ai.verify('Agent shows plan summary or approval prompt before executing refund');
await page.getByRole('button', { name: /approve/i }).click();
await expect.poll(async () => {
const res = await request.get('/api/test/probe-refund/12345');
return (await res.json()).status;
}, { timeout: 20_000 }).toBe('pending');
await ai.verify('Agent confirms refund initiated');
Never stop at ai.verify alone when money or PII moves—probe is authoritative.
Human-in-the-loop approval
Cover:
| Case | Assert |
|---|---|
| Approval required | Tool not called until approve; probe no refund row |
| Reject | Plan cancelled; probe unchanged |
| Timeout | Workflow fails gracefully; user can retry |
| Re-approval after edit | New plan id; old partial writes rolled back |
await expect.poll(() => probeWorkflowStatus(runId)).toBe('pending_approval');
await page.getByRole('button', { name: /reject/i }).click();
await expect.poll(() => probeRefundCount('12345')).toBe(0);
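The behaviour those cases assert can be modelled as a small state machine in which tool dispatch is only reachable from the approve transition. This is an illustrative sketch; `ApprovalGatedWorkflow` and the `cancelled` status name are assumptions, not the product's actual workflow model.

```typescript
type WorkflowStatus = "pending_approval" | "completed" | "cancelled";

class ApprovalGatedWorkflow {
  status: WorkflowStatus = "pending_approval";
  dispatched: string[] = [];
  private plannedTools: string[];
  private dispatch: (tool: string) => void;

  constructor(plannedTools: string[], dispatch: (tool: string) => void) {
    this.plannedTools = plannedTools;
    this.dispatch = dispatch;
  }

  // Tools run only here, so nothing executes before an explicit approval.
  approve(): void {
    this.requirePending("approve");
    for (const tool of this.plannedTools) {
      this.dispatch(tool);
      this.dispatched.push(tool);
    }
    this.status = "completed";
  }

  // Rejecting cancels the plan without touching any tool.
  reject(): void {
    this.requirePending("reject");
    this.status = "cancelled";
  }

  private requirePending(action: string): void {
    if (this.status !== "pending_approval") {
      throw new Error(`cannot ${action} a workflow in status ${this.status}`);
    }
  }
}
```

The guard in `requirePending` also covers the double-click case: a second approve on a completed workflow throws instead of dispatching twice.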
Failure, retry, and idempotency
| Scenario | Arrange | Assert |
|---|---|---|
| Tool 500 first call | Stub returns 500 then 200 | Probe single side effect after retry |
| Duplicate delivery | Replay same tool idempotency key | Probe one row |
| Planner retry loop | Stub always 500 | Workflow fails with user-visible error; probe no partial writes |
See also webhooks and async processing for event-driven tool completions.
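The "500 then 200" and "always 500" rows can be rehearsed without a browser. The sketch below is synchronous for brevity (real tool calls would be async), and `makeFlakyTool` and `callWithRetryCap` are hypothetical helpers, not part of any library.

```typescript
interface ToolResponse {
  status: number;
}

// Stub that returns 500 for the first `failures` calls, then 200.
// `sideEffects` counts how many times the stubbed write actually happened.
function makeFlakyTool(failures: number) {
  const state = { calls: 0, sideEffects: 0 };
  const call = (): ToolResponse => {
    state.calls += 1;
    if (state.calls <= failures) return { status: 500 };
    state.sideEffects += 1;
    return { status: 200 };
  };
  return { call, state };
}

// Retry on 5xx only, and never more than `maxAttempts` calls in total.
function callWithRetryCap(
  tool: () => ToolResponse,
  maxAttempts: number,
): { response: ToolResponse; attempts: number } {
  if (maxAttempts < 1) throw new Error("maxAttempts must be at least 1");
  let response = tool();
  let attempts = 1;
  while (response.status >= 500 && attempts < maxAttempts) {
    response = tool();
    attempts += 1;
  }
  return { response, attempts };
}
```

With a cap in place, the always-failing stub ends the workflow in a failed state after a bounded number of calls, which is what the retry-loop row asserts.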
Tool auth and tenancy
Seed users with different roles and tenants. Negative E2E:
- Viewer cannot invoke `admin_export` even if planner suggests it—probe 403 on tool route
- Tool args cannot reference another tenant's `accountId`—probe isolation
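Both negative cases come down to a check the tool router runs before dispatch, regardless of what the planner proposed. A minimal sketch follows, assuming a role on the session and an account-to-tenant lookup; `authorizeToolCall`, `adminOnlyTools`, and the `accountTenants` map are hypothetical names for illustration.

```typescript
type Role = "viewer" | "admin";

interface Session {
  role: Role;
  tenantId: string;
}

interface ToolCall {
  name: string;
  arguments: { accountId?: string };
}

interface Decision {
  status: 200 | 403;
  reason?: string;
}

// Hypothetical list of tools that require the admin role.
const adminOnlyTools = new Set<string>(["admin_export"]);

// Decide whether a planner-suggested call may run for this session.
// `accountTenants` maps accountId to the tenant that owns it (assumed lookup).
function authorizeToolCall(
  session: Session,
  call: ToolCall,
  accountTenants: Map<string, string>,
): Decision {
  if (adminOnlyTools.has(call.name) && session.role !== "admin") {
    return { status: 403, reason: `role ${session.role} cannot call ${call.name}` };
  }
  const accountId = call.arguments.accountId;
  // Unknown accounts are denied too, since the lookup returns undefined.
  if (accountId !== undefined && accountTenants.get(accountId) !== session.tenantId) {
    return { status: 403, reason: `account ${accountId} is outside tenant ${session.tenantId}` };
  }
  return { status: 200 };
}
```

In an E2E run, the planner fixture would deliberately suggest the forbidden call so that the probe observes the 403 rather than a successful dispatch.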
Observability and audit
Assert audit log probe for regulated workflows:
await expect.poll(async () => {
const res = await request.get(`/api/test/probe-audit?workflowId=${wfId}`);
return (await res.json()).events.map(e => e.type);
}).toContain('tool.initiate_refund');
CI checklist
- All external tools stubbed in default PR job
- AIMock planner fixtures versioned beside SmartTests
- One E2E per business-critical tool (not every CRUD tool)
- Eval job gates planner prompt changes
- Unique runId per worker; idempotency keys include runId
- Human approval specs use explicit button roles, not coordinates
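For the runId item in the checklist, one possible way to derive a worker-scoped runId and an idempotency key that includes it is sketched below. `makeRunId` and `idempotencyKey` are hypothetical helpers; in a Playwright test the worker index would typically come from `testInfo.workerIndex`, and the nonce from whatever per-run value the suite already generates.

```typescript
// Build a runId that is unique per worker and readable in probe output.
function makeRunId(workerIndex: number, testTitle: string, nonce: string): string {
  const slug = testTitle
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-") // collapse anything non-alphanumeric
    .replace(/^-+|-+$/g, ""); // trim leading and trailing dashes
  return `w${workerIndex}-${slug}-${nonce}`;
}

// Derive a key that is stable for the same call within one run,
// and different across runs so parallel workers never collide.
function idempotencyKey(
  runId: string,
  toolName: string,
  args: Record<string, unknown>,
): string {
  // Sort argument names so key order in the planner output does not matter.
  const sortedArgs = Object.keys(args)
    .sort()
    .map((key) => [key, args[key]]);
  return `${runId}:${toolName}:${JSON.stringify(sortedArgs)}`;
}
```

Sorting only the top-level argument names is a simplification; nested argument objects would need a deeper canonical form.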
Anti-patterns
| Anti-pattern | Why it fails | Better approach |
|---|---|---|
| Live Salesforce/Stripe in CI | Flake, cost, PII | Stub tool HTTP + probe |
| Assert final chat message only | Miss double execution | Probe side effects |
| No idempotency test | Double charge in prod | Replay tool call fixture |
| Skip approval gate | Auto-executes dangerous ops | Negative + approve paths |
| 500-tool eval in one E2E | Unmaintainable | Eval trajectories + sparse E2E |
| Shared workflow id | Parallel pollution | runId-scoped workflows |
Example scenario
Situation: User asks agent to schedule a meeting with legal and send the agenda email.
Expected outcome: Calendar hold created and email queued exactly once after user approves the plan.
Why UI-only automation breaks: Agent says "Scheduled!" but probe shows no calendar event and duplicate emails on retry.
- Arrange: AIMock planner returns calendar_create + email_send tools; stub both; seed user OAuth token for test.
- Act: Submit request; approve plan in UI.
- Assert: Probe one calendar event and one email job; workflow status completed.
TestChimp workflow: Track agent_tool × workflow_step in TrueCoverage; expand when new calendar tool ships without E2E.
Same Arrange/Act/Assert pattern as expired-coupon checkout.
Evals vs E2E: when each layer helps
| Layer | Best for | Limitations |
|---|---|---|
| Offline evals (golden trajectories, tool-arg validation, LLM-as-judge on plans) | Planner regression, forbidden-tool policies, sequence correctness on prompt changes | Cannot verify HTTP auth wiring, approval UI, idempotent DB writes, or stub vs prod tool schema drift |
| E2E SmartTests (AIMock planner + ai.act/ai.verify + probe Assert) | Tool router integration, human approval, retries, session-scoped auth | Expensive to cover every tool combination |
| Hybrid | Evals on every planner change; E2E on release-critical workflows | Requires mapping eval failures to new scenarios |
Use golden trajectories when adding tools or changing system instructions—catch wrong-tool selection before merge. Use E2E when a tool moves money, PII, or crosses tenant boundaries. LLM-as-judge helps grade ambiguous plans but must be calibrated; do not use as sole gate until correlated with human review.
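One way to put a number on "calibrated" is to compare judge verdicts with human labels on the same trajectories, using raw agreement and Cohen's kappa. This is a sketch assuming binary pass/fail labels; `judgeAgreement` is a hypothetical helper, and any threshold for trusting the judge is a team decision this guide does not prescribe.

```typescript
interface Agreement {
  agreement: number; // fraction of items where judge and human match
  kappa: number; // Cohen's kappa; NaN when chance agreement is 1
}

// Compare LLM-judge verdicts with human labels (true = pass, false = fail).
function judgeAgreement(human: boolean[], judge: boolean[]): Agreement {
  if (human.length !== judge.length || human.length === 0) {
    throw new Error("label arrays must be non-empty and of equal length");
  }
  const n = human.length;
  const matches = human.filter((label, i) => label === judge[i]).length;
  const observed = matches / n;

  // Chance agreement from each rater's marginal pass rate.
  const humanPassRate = human.filter((label) => label).length / n;
  const judgePassRate = judge.filter((label) => label).length / n;
  const expected =
    humanPassRate * judgePassRate + (1 - humanPassRate) * (1 - judgePassRate);

  // Kappa is undefined when both raters always give the same single label.
  const kappa = expected === 1 ? NaN : (observed - expected) / (1 - expected);
  return { agreement: observed, kappa };
}
```

Kappa is worth reporting alongside raw agreement because a judge that passes nearly everything can agree with humans often by chance alone.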
TestChimp does not ship eval tooling—combine your eval pipeline with AIMock SmartTests. When TrueCoverage shows new agent_tool usage in prod without test scenarios, /testchimp evolve closes the gap.
Connect scenarios to your QA workflow
Capture business rules in markdown test plans and enforce them with seed routes and probe Assert. Link SmartTests with // @Scenario: for requirement traceability. Use /testchimp test on PRs; /testchimp explore on SmartTest paths for non-functional gaps (ExploreChimp).
Related scenarios
- Conversational UI — chat surface patterns
- RAG search — retrieval tools in agents
- Webhooks async — async tool completion
- LLM output validation — structured tool args
- RBAC permissions — tool authorization
External references
- OpenAI function calling
- Anthropic tool use
- Playwright test fixtures — isolated runId per worker
- SmartTests intro — AIMock and hybrid steps
Frequently asked questions
How do I test agent tool calls without calling real APIs?
Stub tool endpoints in test env with seed routes returning fixed payloads; AIMock supplies deterministic planner output. Probe DB for side effects the tool should create. Evals verify tool selection offline; E2E verifies integration and auth.
What if the agent picks the wrong tool?
Offline evals with golden trajectories catch selection regression on prompt changes. E2E tests one happy path per critical tool with AIMock planning responses and forbidden-tool negative cases.
How do I test human approval gates?
Assert workflow stays pending_approval via probe until user clicks approve; reject path must leave side effects unchanged. Never rely on assistant prose claiming approval happened.
How do I test retries and idempotency?
Stub tool to fail once then succeed, or POST duplicate idempotency keys. Probe exactly one side-effect row. Assert workflow reaches completed or failed—not stuck looping.
Should every tool have an E2E test?
No—cover business-critical and high-risk tools in E2E; use offline eval trajectories for breadth. TrueCoverage highlights prod tools missing from either layer.
Where do ai.act and ai.verify fit in agent tests?
Use for approval dialogs, plan summaries, and volatile status copy. Keep tool truth on probes and audit logs. ai.verify alone is insufficient for refunds or exports.
How do we cover every agent tool in prod?
Compare prod vs test-run across agent_tool × workflow_step in TrueCoverage. When new tools spike in prod without scenarios, run /testchimp evolve, then add AIMock planner fixtures and stub endpoints per critical tool.
Apply these patterns in your repo
Run `/testchimp init` to connect TestChimp to your repo, then `/testchimp test` on PRs to turn these patterns into maintained SmartTests. Use `/testchimp evolve` when you want to expand coverage as your app grows.