How to Test LLM Output Quality in E2E
Short answer
LLM outputs feed parsers, dashboards, and downstream tools. Exact string matching breaks on benign rephrasing, while skipping validation silently ships invalid JSON. Gate prompt changes with offline golden sets and schema evals. Gate integration with AIMock fixtures that match output_schema and with E2E probe Asserts on parsed records, not snapshots of assistant prose.
Who this is for
Teams shipping structured LLM outputs: JSON mode, classification labels, extraction forms, summary cards, tool argument payloads, or compliance checklists generated from documents.
Why testing LLM output matters
- Downstream crashes: invalid JSON breaks `JSON.parse` in client or webhook handlers
- Silent wrong data: extracted invoice total off by an order of magnitude
- Brand/compliance drift: tone or disclaimer changes without review
- Schema evolution: new field omitted; old clients fail partially
The UI may render a clean card while the stored record is wrong, so probe the persisted parse result rather than trusting the rendered view.
Complexity map
| Scenario | Edge case | Why tests break | Approach |
|---|---|---|---|
| JSON mode | Trailing prose / invalid JSON | Parser throws | Schema eval + try/catch probe |
| Optional fields | Model omits nullable | Undefined access | JSON Schema required eval |
| Enum labels | Synonym instead of enum | Switch miss | Constrained enum eval |
| Numeric extraction | $1,234.56 vs 1234.56 | Math wrong | Normalize in probe assert |
| Multi-label | Partial tag set | Incomplete routing | Set comparison eval |
| Long output | Truncated mid-object | Broken JSON | Token limit eval cases |
| Prompt version | Quality drop | Silent regression | Golden set per prompt_version |
| i18n output | Locale field wrong | Wrong template | Eval per locale fixture |
| Repair loop | Auto-retry on bad JSON | Infinite loop | Cap retries; probe failure state |
| PII in output | Model echoes input | Compliance | Regex eval + redaction probe |
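A few of the approaches in the table can be sketched as small, dependency-free helpers. The function names below (`toCents`, `sameLabelSet`, `findPii`) are hypothetical, the amount parser assumes US-style formatting (comma thousands separator, dot decimal), and the two PII patterns are illustrative rather than a complete detector.

```typescript
// Hypothetical helpers for eval and probe asserts; not part of any library.

/** Parse "$1,234.56" or "1234.56" into integer cents (assumes US-style formatting). */
function toCents(raw: string): number {
  // Drop currency symbols, thousands separators, and whitespace.
  const cleaned = raw.replace(/[^0-9.\-]/g, "");
  if (!/^-?\d+(\.\d{1,2})?$/.test(cleaned)) {
    throw new Error(`Unparseable amount: ${raw}`);
  }
  const negative = cleaned[0] === "-";
  const [whole = "0", frac = ""] = cleaned.replace("-", "").split(".");
  // "5" means 50 cents, "56" means 56 cents, "" means 0 cents.
  const fracCents = frac.length === 1 ? Number(frac) * 10 : Number(frac);
  const cents = Number(whole) * 100 + fracCents;
  return negative ? -cents : cents;
}

/** Order-insensitive comparison for multi-label outputs. */
function sameLabelSet(actual: string[], expected: string[]): boolean {
  const actualSet = new Set(actual);
  const expectedSet = new Set(expected);
  if (actualSet.size !== expectedSet.size) return false;
  for (const label of actual) {
    if (!expectedSet.has(label)) return false;
  }
  return true;
}

// Illustrative patterns only: an email address and a US SSN shape.
const PII_PATTERNS: Array<{ name: string; pattern: RegExp }> = [
  { name: "email", pattern: /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/ },
  { name: "us_ssn", pattern: /\b\d{3}-\d{2}-\d{4}\b/ },
];

/** Return the names of the PII patterns found in a model output. */
function findPii(output: string): string[] {
  return PII_PATTERNS.filter((p) => p.pattern.test(output)).map((p) => p.name);
}

// Both spellings of the same amount normalize to 123456 cents.
const sameAmount = toCents("$1,234.56") === toCents("1234.56");
```

Normalizing to integer cents before comparing avoids both formatting noise and floating-point drift in the assert itself.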
Validation strategies
| Strategy | When to use |
|---|---|
| JSON Schema | Structured extraction, API contracts |
| Golden set exact match | Small finite label sets |
| Semantic similarity eval | Summaries where wording varies |
| LLM-as-judge | Subjective quality—after human calibration |
| AIMock in E2E | UI integration with fixed payload |
| Probe Assert | DB row matches parsed fields |
Offline golden sets
```json
{
  "id": "extract-invoice-001",
  "input": "fixture/invoices/acme-001.txt",
  "prompt_version": "extract-v3",
  "expected_schema": "schemas/invoice.json",
  "expected_fields": {
    "vendor": "Acme Corp",
    "total_cents": 499900,
    "currency": "USD"
  }
}
```
Run on PRs touching prompts, model id, or temperature. Track pass rate by prompt_version.
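Tracking pass rate by prompt_version can be as simple as grouping eval results before reporting. The sketch below assumes a hypothetical `GoldenResult` record produced by your own eval runner; the shape and the function name are illustrative.

```typescript
// Hypothetical result record emitted by an eval runner, one per golden case.
interface GoldenResult {
  id: string;
  promptVersion: string;
  passed: boolean;
}

/** Group golden-set results by prompt version and compute a pass rate for each. */
function passRateByVersion(results: GoldenResult[]): Record<string, number> {
  const totals: Record<string, { passed: number; total: number }> = {};
  for (const r of results) {
    const bucket = totals[r.promptVersion] ?? { passed: 0, total: 0 };
    bucket.total += 1;
    if (r.passed) bucket.passed += 1;
    totals[r.promptVersion] = bucket;
  }
  const rates: Record<string, number> = {};
  for (const version of Object.keys(totals)) {
    const t = totals[version];
    if (t) rates[version] = t.passed / t.total;
  }
  return rates;
}

const exampleRates = passRateByVersion([
  { id: "extract-invoice-001", promptVersion: "extract-v3", passed: true },
  { id: "extract-invoice-002", promptVersion: "extract-v3", passed: false },
  { id: "extract-invoice-001", promptVersion: "extract-v2", passed: true },
]);
// exampleRates is { "extract-v3": 0.5, "extract-v2": 1 }
```

Keeping the rate keyed by version makes it easy to compare a candidate prompt against the one currently deployed.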
JSON Schema validation in eval CI
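```ts
import Ajv from 'ajv';
import schema from './schemas/invoice.json';

// Compile the schema once and reuse the validator across eval cases.
const ajv = new Ajv();
const validate = ajv.compile(schema);

// callModel and fixture come from your eval harness (not shown);
// output is expected to be the already-parsed JSON object.
const output = await callModel(fixture);
if (!validate(output)) throw new Error(JSON.stringify(validate.errors));
```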
Pair with field-level asserts on business-critical numbers—not only schema validity.
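A schema-valid output can still carry a wrong number, which is why the field-level check matters. As a minimal sketch, assuming the golden case's `expected_fields` holds only flat scalar values, a hypothetical `diffExpectedFields` helper could report every mismatch:

```typescript
// Hypothetical helper: compare flat expected fields against a parsed model output.
type FieldValue = string | number | boolean | null;

interface FieldMismatch {
  field: string;
  expected: unknown;
  actual: unknown;
}

function diffExpectedFields(
  expected: Record<string, FieldValue>,
  output: Record<string, unknown>,
): FieldMismatch[] {
  const mismatches: FieldMismatch[] = [];
  for (const field of Object.keys(expected)) {
    const want = expected[field];
    const got = output[field];
    // Strict equality: "499900" (string) does not match 499900 (number).
    if (got !== want) mismatches.push({ field, expected: want, actual: got });
  }
  return mismatches;
}

// Schema-valid (total_cents is an integer) but off by an order of magnitude.
const invoiceMismatches = diffExpectedFields(
  { vendor: "Acme Corp", total_cents: 499900, currency: "USD" },
  { vendor: "Acme Corp", total_cents: 49990, currency: "USD" },
);
// invoiceMismatches has one entry, for "total_cents"
```

Returning the full list rather than throwing on the first mismatch tends to make eval reports easier to triage.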
LLM-as-judge (use carefully)
Pattern:
- Human-label 50–100 examples
- Tune judge rubric until correlation > target threshold
- Freeze judge system prompt in repo (`judge-v2.txt`)
- Run judge on golden set in CI
Use for summarization quality, not as sole gate for numeric extraction—use exact field asserts there.
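The calibration step can be checked numerically once human labels and judge scores exist for the same examples. The sketch below uses Pearson correlation as one possible agreement measure (rank correlation or Cohen's kappa are also common for ordinal ratings). The `ScoredPair` shape and function names are hypothetical, and the threshold is whatever target your team has set.

```typescript
// Hypothetical pairing of a human rating and a judge rating for one example.
interface ScoredPair {
  human: number;
  judge: number;
}

/** Pearson correlation between human and judge scores. */
function pearson(pairs: ScoredPair[]): number {
  const n = pairs.length;
  if (n < 2) throw new Error("Need at least two scored pairs");
  let sumHuman = 0;
  let sumJudge = 0;
  for (const p of pairs) {
    sumHuman += p.human;
    sumJudge += p.judge;
  }
  const meanHuman = sumHuman / n;
  const meanJudge = sumJudge / n;
  let cov = 0;
  let varHuman = 0;
  let varJudge = 0;
  for (const p of pairs) {
    const dh = p.human - meanHuman;
    const dj = p.judge - meanJudge;
    cov += dh * dj;
    varHuman += dh * dh;
    varJudge += dj * dj;
  }
  if (varHuman === 0 || varJudge === 0) {
    throw new Error("Correlation is undefined when one side has zero variance");
  }
  return cov / Math.sqrt(varHuman * varJudge);
}

/** True when the judge tracks human labels at or above the chosen threshold. */
function judgeIsCalibrated(pairs: ScoredPair[], threshold: number): boolean {
  return pearson(pairs) >= threshold;
}
```

A judge that always returns the same score has zero variance, so the sketch raises an error instead of reporting a misleading number.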
AIMock for deterministic E2E
When UI displays parsed LLM output:
AIMock returns fixed JSON matching output_schema:

```json
{
  "vendor": "Acme Corp",
  "total_cents": 499900,
  "line_items": [{ "sku": "WIDGET", "qty": 2 }]
}
```

```ts
// `request` and `runId` are supplied by the surrounding test (not shown).
await ai.act('Upload invoice PDF and click Extract');
await page.locator('[data-extraction-complete="true"]').waitFor();

// UI check: the rendered field shows the mocked vendor.
await expect(page.getByTestId('field-vendor')).toHaveText('Acme Corp');

// Probe check: poll until the persisted record carries the mocked total.
await expect.poll(async () => {
  const res = await request.get(`/api/test/probe-extraction?runId=${runId}`);
  return (await res.json()).total_cents;
}).toBe(499900);
```
Semantic UI checks with ai.verify
For narrative summaries where DOM structure is stable but wording varies:
```ts
await ai.verify('Summary mentions refund deadline and 30-day window');
```
Pair with probe on structured fields—never ai.verify alone for money or compliance flags.
Prompt version tracking
Tag prod and test events with prompt_version. When a deploy bumps the version without an eval run, TrueCoverage highlights the gap. Block the merge if the golden set pass rate drops below your threshold for that version.
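On the CI side, the merge rule can be expressed as a small check over the versions a change deploys. This is a hypothetical sketch of such a gate, not a description of how TrueCoverage works. The `GateInput` shape, the function name, and the example threshold are all assumptions.

```typescript
// Hypothetical CI gate input: versions being deployed and eval pass rates per version.
interface GateInput {
  deployedVersions: string[];
  passRates: Record<string, number>; // keyed by prompt_version, values in [0, 1]
  threshold: number;
}

/** Return the reasons to block a merge; an empty list means the gate passes. */
function mergeBlockers(input: GateInput): string[] {
  const reasons: string[] = [];
  for (const version of input.deployedVersions) {
    const rate = input.passRates[version];
    if (rate === undefined) {
      // The version was bumped but no golden-set run exists for it.
      reasons.push(`${version}: no eval run recorded`);
    } else if (rate < input.threshold) {
      reasons.push(`${version}: pass rate ${rate} is below threshold ${input.threshold}`);
    }
  }
  return reasons;
}

const gateExample = mergeBlockers({
  deployedVersions: ["extract-v3", "extract-v4"],
  passRates: { "extract-v3": 0.9 },
  threshold: 0.95, // illustrative value only
});
// gateExample has two entries: v3 is below threshold, v4 has no eval run
```

Treating a missing eval run as a blocker, rather than as a pass, is what keeps a version bump from slipping through unevaluated.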
Repair and fallback paths
Test when model returns invalid JSON:
- UI shows retry/error state
- Probe shows no partial DB write
- Repair prompt limited to N attempts
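The retry cap is the part most worth unit testing, since an uncapped repair loop can spin forever on a model that keeps returning bad JSON. A minimal sketch follows, with the model call stubbed as a synchronous `generate` callback for illustration (a real call would be async). `parseWithRepair` and its result shape are hypothetical names.

```typescript
// Hypothetical result type: either a parsed value or a terminal failure.
type ParseResult =
  | { ok: true; value: unknown; attempts: number }
  | { ok: false; error: string; attempts: number };

/**
 * Call `generate` up to `maxAttempts` times, passing the previous parse error
 * so a repair prompt can reference it. Stops at the first valid JSON.
 */
function parseWithRepair(
  generate: (attempt: number, lastError?: string) => string,
  maxAttempts: number,
): ParseResult {
  if (maxAttempts < 1) throw new Error("maxAttempts must be at least 1");
  let lastError: string | undefined;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const raw = generate(attempt, lastError);
    try {
      return { ok: true, value: JSON.parse(raw), attempts: attempt };
    } catch (err) {
      lastError = err instanceof Error ? err.message : String(err);
    }
  }
  // Cap reached: surface a failure state instead of looping or saving partial data.
  return { ok: false, error: lastError ?? "unknown parse error", attempts: maxAttempts };
}

// A stub that always returns truncated JSON exercises the failure path.
let stubCalls = 0;
const alwaysTruncated = parseWithRepair(() => {
  stubCalls += 1;
  return '{"vendor": "Acme Corp"';
}, 3);
// alwaysTruncated.ok is false and stubCalls is 3
```

The caller can map the `ok: false` branch to the UI retry/error state and skip the database write entirely, which is what the probe then confirms.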
Anti-patterns
| Anti-pattern | Why it fails | Better approach |
|---|---|---|
| `toHaveText` full summary | Rephrase flakes | Schema + ai.verify |
| No eval on prompt PR | Silent quality drop | Golden set CI gate |
| Judge-only gate for numbers | Hallucinated totals pass | Field exact match |
| Snapshot entire JSON in E2E | Key order noise | Probe normalized fields |
| Skip invalid JSON path | Prod parser crash | Negative eval + E2E |
| Real LLM every E2E | Cost + variance | AIMock + eval split |
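Key-order noise in particular is avoidable: two JSON payloads with the same content can serialize differently, so comparing raw strings or snapshots flags differences that do not exist. One way to compare on content is to canonicalize before comparing, as in this sketch (the `canonicalize` helper is hypothetical and assumes JSON-compatible values only):

```typescript
/**
 * Serialize a JSON-compatible value with object keys sorted, so two payloads
 * with the same content produce the same string regardless of key order.
 * Array order is preserved because it is usually meaningful.
 */
function canonicalize(value: unknown): string {
  if (Array.isArray(value)) {
    return "[" + value.map((item) => canonicalize(item)).join(",") + "]";
  }
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const keys = Object.keys(obj).sort();
    return "{" + keys.map((k) => JSON.stringify(k) + ":" + canonicalize(obj[k])).join(",") + "}";
  }
  return JSON.stringify(value);
}

const fromModel = { total_cents: 499900, vendor: "Acme Corp", line_items: [{ qty: 2, sku: "WIDGET" }] };
const fromFixture = { vendor: "Acme Corp", line_items: [{ sku: "WIDGET", qty: 2 }], total_cents: 499900 };
const sameContent = canonicalize(fromModel) === canonicalize(fromFixture);
// sameContent is true even though the key order differs
```

For E2E specifically, asserting on a handful of normalized fields from the probe is usually still preferable to comparing the whole object.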
Example scenario
Situation: User extracts structured fields from an uploaded contract.
Expected outcome: Parsed JSON validates against schema and persisted record matches seeded contract terms.
Why UI-only automation breaks: UI preview looks correct but probe shows wrong total_cents or missing signatory field.
- Arrange: AIMock returns schema-valid extraction JSON; seed contract fixture with known totals.
- Act: Upload contract and trigger extraction.
- Assert: JSON Schema pass; probe row matches expected_fields; required compliance flag present.
TestChimp workflow: Track prompt_version × output_schema in TrueCoverage when new extraction templates ship.
Same Arrange/Act/Assert pattern as expired-coupon checkout.
Evals vs E2E: when each layer helps
| Layer | Best for | Limitations |
|---|---|---|
| Offline evals (golden sets, JSON Schema, LLM-as-judge) | Prompt/model regression, numeric and enum accuracy at scale | Misses upload UI, auth, file parsing pipeline, and DB persistence bugs |
| E2E SmartTests (AIMock + schema UI + probe Assert) | End-to-end extract→review→save journey | Too slow for hundreds of document variants |
| Hybrid | Evals on every prompt change; E2E on critical document types | Link eval failures to new SmartTests when integration breaks |
Invest in golden sets when prompts stabilize and volume is high. Use E2E when files cross auth boundaries or touch paid features. TestChimp does not ship eval tooling—combine your eval pipeline with AIMock SmartTests.
Connect scenarios to your QA workflow
Capture business rules in markdown test plans and enforce them with seed routes and probe Assert. Link SmartTests with // @Scenario: for requirement traceability. Use /testchimp test on PRs; /testchimp explore on SmartTest paths for non-functional gaps (ExploreChimp).
Related scenarios
- Conversational UI — unstructured chat
- RAG search — cited answers
- AI web apps — hybrid testing strategy
- File uploads — document ingest
Frequently asked questions
How do I validate LLM JSON output in CI?
Run offline evals with JSON Schema validation and field-level asserts on critical numbers. E2E uses AIMock matching output_schema plus probe on persisted records—not raw model calls every PR.
LLM-as-judge vs human labels for evals?
Calibrate judge prompts against a human-labeled subset; freeze judge version in CI. Use for summarization triage, not sole gate for numeric extraction—use exact field asserts there.
Should E2E tests call the real model?
Default no—use AIMock for integration and probes for truth. Reserve real model evals for prompt-change jobs or nightly pipelines.
How do I test invalid JSON from the model?
Use an AIMock fixture or an eval case that returns malformed JSON. Assert that the UI shows an error or retry state and that the probe shows no partial save.
Exact match or semantic match for summaries?
Semantic evals or LLM-as-judge offline; ai.verify in E2E for high-level checks. Structured fields always use schema or exact probes.
How do prompt changes connect to E2E scenarios?
Tag events with prompt_version. When eval fails or TrueCoverage shows new output_schema in prod, add AIMock fixture + SmartTest via /testchimp evolve linked to markdown scenarios.
What if output wording changes but meaning is correct?
That is why exact string E2E fails—use schema probes for facts and semantic evals for narrative. Avoid snapshotting full model prose in Playwright.
Apply these patterns in your repo
Run `/testchimp init` to connect TestChimp to your repo, then `/testchimp test` on PRs to turn these patterns into maintained SmartTests. Use `/testchimp evolve` when you want to expand coverage as your app grows.