How to Test LLM Output Quality in E2E

Short answer

LLM outputs feed parsers, dashboards, and downstream tools. Exact string matching breaks on benign rephrasing, while skipping validation lets invalid JSON ship silently. Gate prompt changes with offline golden sets and schema evals. Gate integration with AIMock fixtures that match output_schema and E2E probe Asserts on parsed records, not snapshots of assistant prose.


Who this is for

Teams shipping structured LLM outputs: JSON mode, classification labels, extraction forms, summary cards, tool argument payloads, or compliance checklists generated from documents.

Why testing LLM output matters

Downstream crashes — invalid JSON breaks JSON.parse in client or webhook handlers
Silent wrong data — extracted invoice total off by an order of magnitude
Brand/compliance drift — tone or disclaimer changes without review
Schema evolution — new field omitted; old clients fail partially

The UI may render a polished card while the stored record is wrong, so probe the persisted parse result.

Complexity map

Scenario	Edge case	Why tests break	Approach
JSON mode	Trailing prose / invalid JSON	Parser throws	Schema eval + try/catch probe
Optional fields	Model omits nullable	Undefined access	JSON Schema `required` eval
Enum labels	Synonym instead of enum	Switch miss	Constrained enum eval
Numeric extraction	`$1,234.56` vs `1234.56`	Math wrong	Normalize in probe assert
Multi-label	Partial tag set	Incomplete routing	Set comparison eval
Long output	Truncated mid-object	Broken JSON	Token limit eval cases
Prompt version	Quality drop	Silent regression	Golden set per prompt_version
i18n output	Locale field wrong	Wrong template	Eval per locale fixture
Repair loop	Auto-retry on bad JSON	Infinite loop	Cap retries; probe failure state
PII in output	Model echoes input	Compliance	Regex eval + redaction probe
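
The numeric extraction row is easy to get subtly wrong. Below is a minimal sketch of probe-side normalization, assuming US-style formatting (comma as thousands separator, dot as decimal point). `toCents` is a hypothetical helper for illustration, not part of any TestChimp or Playwright API.

```typescript
// Hypothetical helper: normalize a model-emitted money string to integer cents.
// Assumes US-style formatting; other locales need their own rules.
function toCents(raw: string): number | null {
  // Strip currency symbols, whitespace, and thousands separators.
  const cleaned = raw.replace(/[^0-9.\-]/g, '');
  // Accept an optional leading sign, digits, and at most two decimal places.
  const match = /^(-?)(\d+)(?:\.(\d{1,2}))?$/.exec(cleaned);
  if (!match) return null;
  const [, sign, whole, frac = ''] = match;
  const cents = Number(whole) * 100 + Number(frac.padEnd(2, '0'));
  return sign === '-' ? -cents : cents;
}
```

Normalize both the probed value and the expected value before comparing, so `$1,234.56` and `1234.56` assert as the same amount and unparseable strings surface as `null` instead of `NaN`.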

Validation strategies

Strategy	When to use
JSON Schema	Structured extraction, API contracts
Golden set exact match	Small finite label sets
Semantic similarity eval	Summaries where wording varies
LLM-as-judge	Subjective quality—after human calibration
AIMock in E2E	UI integration with fixed payload
Probe Assert	DB row matches parsed fields

Offline golden sets

{
  "id": "extract-invoice-001",
  "input": "fixture/invoices/acme-001.txt",
  "prompt_version": "extract-v3",
  "expected_schema": "schemas/invoice.json",
  "expected_fields": {
    "vendor": "Acme Corp",
    "total_cents": 499900,
    "currency": "USD"
  }
}

Run the golden set on every PR that touches prompts, the model ID, or temperature. Track pass rate by prompt_version.
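
Tracking pass rate by prompt_version can be sketched as a small aggregation over per-case results. The `CaseResult` shape and `passRateByVersion` function are assumptions for illustration; your eval runner will likely emit something richer.

```typescript
// Hypothetical record emitted by an eval runner after checking one golden case.
interface CaseResult {
  id: string;
  prompt_version: string;
  passed: boolean;
}

// Group results by prompt_version and compute the fraction of cases that passed.
function passRateByVersion(results: CaseResult[]): Record<string, number> {
  const tally: Record<string, { passed: number; total: number }> = {};
  for (const r of results) {
    const t = tally[r.prompt_version] ?? { passed: 0, total: 0 };
    t.total += 1;
    if (r.passed) t.passed += 1;
    tally[r.prompt_version] = t;
  }
  const rates: Record<string, number> = {};
  for (const version of Object.keys(tally)) {
    const t = tally[version];
    if (t) rates[version] = t.passed / t.total;
  }
  return rates;
}
```

Keeping the rate keyed by version makes it straightforward to compare a candidate prompt against the one currently deployed.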

JSON Schema validation in eval CI

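import Ajv from 'ajv';
import schema from './schemas/invoice.json';

const ajv = new Ajv();
const validate = ajv.compile(schema);

// callModel and fixture come from your own eval harness (not shown here).
const raw = await callModel(fixture);
// If the model returns raw text, parse it first so trailing prose or truncation fails loudly.
const output = typeof raw === 'string' ? JSON.parse(raw) : raw;
if (!validate(output)) {
  // validate.errors lists each failing keyword and instance path.
  throw new Error(JSON.stringify(validate.errors, null, 2));
}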

Pair with field-level asserts on business-critical numbers—not only schema validity.
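
A schema-valid payload can still carry a total that is off by an order of magnitude. A minimal sketch of the field-level check, assuming flat `expected_fields` like the golden-set case earlier; `diffFields` is a hypothetical helper name.

```typescript
type FieldValue = string | number | boolean | null;

// Compare business-critical fields exactly; returns one message per mismatch.
// Only keys listed in `expected` are checked, so extra model fields are ignored.
function diffFields(
  expected: Record<string, FieldValue>,
  actual: Record<string, unknown>,
): string[] {
  const mismatches: string[] = [];
  for (const key of Object.keys(expected)) {
    if (actual[key] !== expected[key]) {
      mismatches.push(
        `${key}: expected ${JSON.stringify(expected[key])}, got ${JSON.stringify(actual[key])}`,
      );
    }
  }
  return mismatches;
}
```

Fail the eval case when the returned array is non-empty, and log the messages so the PR author sees which field regressed.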

LLM-as-judge (use carefully)

Pattern:

Human-label 50–100 examples
Tune the judge rubric until its scores correlate with the human labels above a threshold you set in advance
Freeze judge system prompt in repo (judge-v2.txt)
Run judge on golden set in CI

Use the judge for summarization quality, not as the sole gate for numeric extraction, where exact field asserts belong.
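
The calibration step needs some number to compare against the threshold. The source says only "correlation"; the sketch below swaps in Cohen's kappa, one common agreement statistic for categorical labels, as an assumption. Any agreement measure your team trusts would fit the same slot.

```typescript
// Cohen's kappa between human labels and judge labels for the same examples.
// kappa = (observed agreement - chance agreement) / (1 - chance agreement)
function cohensKappa(human: string[], judge: string[]): number {
  if (human.length !== judge.length || human.length === 0) {
    throw new Error('label arrays must be non-empty and the same length');
  }
  const n = human.length;
  const humanCounts: Record<string, number> = {};
  const judgeCounts: Record<string, number> = {};
  let agreements = 0;
  for (let i = 0; i < n; i++) {
    const h = human[i] as string;
    const j = judge[i] as string;
    humanCounts[h] = (humanCounts[h] ?? 0) + 1;
    judgeCounts[j] = (judgeCounts[j] ?? 0) + 1;
    if (h === j) agreements += 1;
  }
  const observed = agreements / n;
  // Chance agreement: sum over labels of P(human = label) * P(judge = label).
  let expected = 0;
  for (const label of Object.keys(humanCounts)) {
    expected += ((humanCounts[label] ?? 0) / n) * ((judgeCounts[label] ?? 0) / n);
  }
  // Both raters used one identical label throughout: agreement is trivially perfect.
  if (expected === 1) return 1;
  return (observed - expected) / (1 - expected);
}
```

Kappa corrects for agreement that would occur by chance, which matters when most examples share one label (for instance, 90% "pass").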

AIMock for deterministic E2E

When UI displays parsed LLM output:

// AIMock returns fixed JSON matching output_schema
{
  "vendor": "Acme Corp",
  "total_cents": 499900,
  "line_items": [{ "sku": "WIDGET", "qty": 2 }]
}

await ai.act('Upload invoice PDF and click Extract');
await page.locator('[data-extraction-complete="true"]').waitFor();
await expect(page.getByTestId('field-vendor')).toHaveText('Acme Corp');
await expect.poll(async () => {
  const res = await request.get(`/api/test/probe-extraction?runId=${runId}`);
  return (await res.json()).total_cents;
}).toBe(499900);

Semantic UI checks with ai.verify

For narrative summaries where DOM structure is stable but wording varies:

await ai.verify('Summary mentions refund deadline and 30-day window');

Pair it with a probe on structured fields. Never rely on ai.verify alone for money or compliance flags.

Prompt version tracking

Tag prod and test events with prompt_version. When a deploy bumps the version without an eval run, TrueCoverage highlights the gap. Block the merge if the golden set pass rate for that version drops below your threshold.
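
The merge gate itself can be a few lines in CI. This is a sketch under assumptions: `passRates` is a map from prompt_version to the fraction of golden cases passed, produced by your own eval pipeline, and `gatePromptVersion` is a hypothetical name. It does not call TrueCoverage.

```typescript
interface GateDecision {
  allow: boolean;
  reason: string;
}

// Decide whether a prompt_version may merge, given pass rates from the eval run.
function gatePromptVersion(
  passRates: Record<string, number>,
  version: string,
  minPassRate: number,
): GateDecision {
  const rate = passRates[version];
  if (rate === undefined) {
    // Version bumped without an eval run: treat the gap as blocking.
    return { allow: false, reason: `no eval run recorded for ${version}` };
  }
  if (rate < minPassRate) {
    return { allow: false, reason: `${version} pass rate ${rate} is below ${minPassRate}` };
  }
  return { allow: true, reason: `${version} pass rate ${rate} meets ${minPassRate}` };
}
```

Treating a missing eval run as a failure, not as a pass, keeps an untested prompt version from reaching production by default.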

Repair and fallback paths

Test the path where the model returns invalid JSON:

UI shows retry/error state
Probe shows no partial DB write
Repair prompt limited to N attempts
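
A capped repair loop might look like the sketch below. `callModel` is a hypothetical function you supply: it receives the previous parse error (or `null` on the first call) and returns the raw model text. The loop exits after `maxAttempts` whether or not parsing succeeded.

```typescript
type ParseResult<T> =
  | { ok: true; value: T; attempts: number }
  | { ok: false; error: string; attempts: number };

// Call the model, try to parse its output as JSON, and re-prompt on failure.
// Never loops more than `maxAttempts` times.
async function parseWithRepair<T>(
  callModel: (previousError: string | null) => Promise<string>,
  maxAttempts: number,
): Promise<ParseResult<T>> {
  let lastError: string | null = null;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const raw = await callModel(lastError);
    try {
      return { ok: true, value: JSON.parse(raw) as T, attempts: attempt };
    } catch (err) {
      // Keep the parser message so the next prompt can include it.
      lastError = err instanceof Error ? err.message : String(err);
    }
  }
  return { ok: false, error: lastError ?? 'no attempts made', attempts: maxAttempts };
}
```

Persist to the database only on the `ok: true` branch. The `ok: false` branch is the failure state the UI and the probe should assert on.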

Anti-patterns

Anti-pattern	Why it fails	Better approach
`toHaveText` full summary	Rephrase flakes	Schema + ai.verify
No eval on prompt PR	Silent quality drop	Golden set CI gate
Judge-only gate for numbers	Hallucinated totals pass	Field exact match
Snapshot entire JSON in E2E	Key order noise	Probe normalized fields
Skip invalid JSON path	Prod parser crash	Negative eval + E2E
Real LLM every E2E	Cost + variance	AIMock + eval split

Example scenario

Situation: User extracts structured fields from an uploaded contract.

Expected outcome: Parsed JSON validates against schema and persisted record matches seeded contract terms.

Why UI-only automation breaks: UI preview looks correct but probe shows wrong total_cents or missing signatory field.

Arrange: AIMock returns schema-valid extraction JSON; seed contract fixture with known totals.
Act: Upload contract and trigger extraction.
Assert: JSON Schema pass; probe row matches expected_fields; required compliance flag present.

TestChimp workflow: Track prompt_version × output_schema in TrueCoverage when new extraction templates ship.

Same Arrange/Act/Assert pattern as expired-coupon checkout.

Evals vs E2E: when each layer helps

Layer	Best for	Limitations
Offline evals (golden sets, JSON Schema, LLM-as-judge)	Prompt/model regression, numeric and enum accuracy at scale	Misses upload UI, auth, file parsing pipeline, and DB persistence bugs
E2E SmartTests (AIMock + schema UI + probe Assert)	End-to-end extract→review→save journey	Too slow for hundreds of document variants
Hybrid	Evals on every prompt change; E2E on critical document types	Link eval failures to new SmartTests when integration breaks

Invest in golden sets when prompts stabilize and volume is high. Use E2E when files cross auth boundaries or touch paid features. TestChimp does not ship eval tooling—combine your eval pipeline with AIMock SmartTests.

Connect scenarios to your QA workflow

Capture business rules in markdown test plans and enforce them with seed routes and probe Assert. Link SmartTests with // @Scenario: for requirement traceability. Use /testchimp test on PRs; /testchimp explore on SmartTest paths for non-functional gaps (ExploreChimp).

Conversational UI — unstructured chat
RAG search — cited answers
AI web apps — hybrid testing strategy
File uploads — document ingest

Frequently asked questions

How do I validate LLM JSON output in CI?

Run offline evals with JSON Schema validation and field-level asserts on critical numbers. In E2E, use AIMock fixtures that match output_schema plus a probe on persisted records, not raw model calls on every PR.

LLM-as-judge vs human labels for evals?

Calibrate judge prompts against a human-labeled subset; freeze judge version in CI. Use for summarization triage, not sole gate for numeric extraction—use exact field asserts there.

Should E2E tests call the real model?

Default no—use AIMock for integration and probes for truth. Reserve real model evals for prompt-change jobs or nightly pipelines.

How do I test invalid JSON from the model?

Configure an AIMock fixture or eval case that returns malformed JSON. Assert that the UI shows an error or retry state and that the probe shows no partial save.

Exact match or semantic match for summaries?

Semantic evals or LLM-as-judge offline; ai.verify in E2E for high-level checks. Structured fields always use schema or exact probes.

How do prompt changes connect to E2E scenarios?

Tag events with prompt_version. When eval fails or TrueCoverage shows new output_schema in prod, add AIMock fixture + SmartTest via /testchimp evolve linked to markdown scenarios.

What if output wording changes but meaning is correct?

That is why exact string E2E fails—use schema probes for facts and semantic evals for narrative. Avoid snapshotting full model prose in Playwright.

Apply these patterns in your repo

Run `/testchimp init` to connect TestChimp to your repo, then `/testchimp test` on PRs to turn these patterns into maintained SmartTests. Use `/testchimp evolve` when you want to expand coverage as your app grows.

