How to Test LLM Output Quality in E2E

Short answer

LLM outputs feed parsers, dashboards, and downstream tools. Exact string matching breaks on benign rephrasing, while skipping validation lets invalid JSON ship silently. Gate prompt changes with offline golden sets and schema evals. Gate integration with AIMock fixtures that match output_schema and with E2E probe Asserts on parsed records, not snapshots of assistant prose.

Part of Testing Guides by AI and conversational UX.

Who this is for

Teams shipping structured LLM outputs: JSON mode, classification labels, extraction forms, summary cards, tool argument payloads, or compliance checklists generated from documents.

Why testing LLM output matters

  • Downstream crashes — invalid JSON breaks JSON.parse in client or webhook handlers
  • Silent wrong data — extracted invoice total off by an order of magnitude
  • Brand/compliance drift — tone or disclaimer changes without review
  • Schema evolution — new field omitted; old clients fail partially

The UI may render a polished card while the stored record is wrong, so probe the persisted parse result.

Complexity map

| Scenario | Edge case | Why tests break | Approach |
| --- | --- | --- | --- |
| JSON mode | Trailing prose / invalid JSON | Parser throws | Schema eval + try/catch probe |
| Optional fields | Model omits nullable | Undefined access | JSON Schema required eval |
| Enum labels | Synonym instead of enum | Switch miss | Constrained enum eval |
| Numeric extraction | $1,234.56 vs 1234.56 | Math wrong | Normalize in probe assert |
| Multi-label | Partial tag set | Incomplete routing | Set comparison eval |
| Long output | Truncated mid-object | Broken JSON | Token limit eval cases |
| Prompt version | Quality drop | Silent regression | Golden set per prompt_version |
| i18n output | Locale field wrong | Wrong template | Eval per locale fixture |
| Repair loop | Auto-retry on bad JSON | Infinite loop | Cap retries; probe failure state |
| PII in output | Model echoes input | Compliance | Regex eval + redaction probe |
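The "Normalize in probe assert" row can be sketched as a small helper that converts a formatted amount to integer cents before comparing. This is a minimal sketch: `toCents` is a hypothetical name, and it assumes US formatting ("," for thousands, "." for decimals). Anything it cannot parse unambiguously is rejected rather than guessed.

```typescript
// Hypothetical helper: convert a US-formatted currency string to integer cents.
// Assumption: "," is the thousands separator and "." the decimal separator.
function toCents(raw: string): number {
  // Drop currency symbols, thousands separators, and whitespace.
  const cleaned = raw.replace(/[^0-9.\-]/g, "");
  // Accept an optional sign, digits, and at most two decimal places.
  if (!/^-?\d+(\.\d{1,2})?$/.test(cleaned)) {
    throw new Error(`Unparseable amount: ${raw}`);
  }
  const negative = cleaned.startsWith("-");
  const [whole, frac = ""] = cleaned.replace("-", "").split(".");
  // Integer arithmetic only, so no floating-point rounding on the cents.
  const cents = Number(whole) * 100 + Number(frac.padEnd(2, "0"));
  return negative ? -cents : cents;
}
```

With this in place, a probe assert can compare `toCents("$1,234.56")` and `toCents("1234.56")` as the same integer instead of comparing display strings.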

Validation strategies

| Strategy | When to use |
| --- | --- |
| JSON Schema | Structured extraction, API contracts |
| Golden set exact match | Small finite label sets |
| Semantic similarity eval | Summaries where wording varies |
| LLM-as-judge | Subjective quality, after human calibration |
| AIMock in E2E | UI integration with fixed payload |
| Probe Assert | DB row matches parsed fields |
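For label outputs, exact match usually means the same set of labels regardless of order. A minimal sketch of such a set comparison follows. The names `LabelDiff` and `compareLabelSets` are illustrative and not part of any library.

```typescript
// Result of comparing expected labels with the labels the model produced.
interface LabelDiff {
  missing: string[];    // expected but absent from the model output
  unexpected: string[]; // present in the model output but not expected
  pass: boolean;
}

// Order-insensitive, duplicate-insensitive comparison of two label lists.
function compareLabelSets(expected: string[], actual: string[]): LabelDiff {
  const exp = new Set(expected);
  const act = new Set(actual);
  const missing = Array.from(exp).filter((label) => !act.has(label));
  const unexpected = Array.from(act).filter((label) => !exp.has(label));
  return {
    missing,
    unexpected,
    pass: missing.length === 0 && unexpected.length === 0,
  };
}
```

Reporting `missing` and `unexpected` separately makes a partial tag set easy to spot in eval output, which a bare true/false result would hide.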

Offline golden sets

```json
{
  "id": "extract-invoice-001",
  "input": "fixture/invoices/acme-001.txt",
  "prompt_version": "extract-v3",
  "expected_schema": "schemas/invoice.json",
  "expected_fields": {
    "vendor": "Acme Corp",
    "total_cents": 499900,
    "currency": "USD"
  }
}
```

Run on PRs touching prompts, model id, or temperature. Track pass rate by prompt_version.
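Tracking pass rate by prompt_version can be as simple as grouping eval results and flagging versions under a threshold. The sketch below assumes each eval case yields a record with a `passed` flag. `EvalResult`, `passRateByVersion`, and `versionsBelowThreshold` are hypothetical names, and the threshold is yours to choose.

```typescript
// One record per golden-set case, produced by your eval runner (assumed shape).
interface EvalResult {
  id: string;
  prompt_version: string;
  passed: boolean;
}

// Group results by prompt_version and compute the fraction that passed.
function passRateByVersion(results: EvalResult[]): Map<string, number> {
  const totals = new Map<string, { pass: number; total: number }>();
  for (const r of results) {
    const t = totals.get(r.prompt_version) ?? { pass: 0, total: 0 };
    t.total += 1;
    if (r.passed) t.pass += 1;
    totals.set(r.prompt_version, t);
  }
  const rates = new Map<string, number>();
  totals.forEach((t, version) => rates.set(version, t.pass / t.total));
  return rates;
}

// List the versions whose pass rate falls below the chosen threshold.
function versionsBelowThreshold(rates: Map<string, number>, threshold: number): string[] {
  const failing: string[] = [];
  rates.forEach((rate, version) => {
    if (rate < threshold) failing.push(version);
  });
  return failing;
}
```

A CI job could fail the PR whenever `versionsBelowThreshold` returns a non-empty list for the versions touched by that change.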

JSON Schema validation in eval CI

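```typescript
import Ajv from 'ajv';
import schema from './schemas/invoice.json';

// Compile the schema once and reuse the validator across eval cases.
const ajv = new Ajv();
const validate = ajv.compile(schema);

// callModel and fixture come from your eval harness; output is assumed to be parsed JSON.
const output = await callModel(fixture);
if (!validate(output)) {
  throw new Error(JSON.stringify(validate.errors));
}
```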

Pair with field-level asserts on business-critical numbers—not only schema validity.
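A field-level assert can sit next to the schema check and compare the output against the golden case's expected_fields. This is a minimal sketch: `diffFields` is a hypothetical helper, and it assumes expected fields are flat primitives, as in the golden set example.

```typescript
type FieldValue = string | number | boolean | null;

// Return one human-readable message per field that differs from the golden value.
// Only keys listed in `expected` are checked; extra keys in `actual` are ignored.
function diffFields(
  expected: Record<string, FieldValue>,
  actual: Record<string, unknown>,
): string[] {
  const mismatches: string[] = [];
  for (const key of Object.keys(expected)) {
    if (actual[key] !== expected[key]) {
      mismatches.push(
        `${key}: expected ${JSON.stringify(expected[key])}, got ${JSON.stringify(actual[key])}`,
      );
    }
  }
  return mismatches;
}
```

An output with `total_cents: 49990` passes an integer-typed schema but fails this check against an expected `499900`, which is the order-of-magnitude error that schema validity alone would miss.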

LLM-as-judge (use carefully)

Pattern:

  1. Human-label 50–100 examples
  2. Tune judge rubric until correlation > target threshold
  3. Freeze judge system prompt in repo (judge-v2.txt)
  4. Run judge on golden set in CI

Use for summarization quality, not as sole gate for numeric extraction—use exact field asserts there.
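Step 2 needs some measure of agreement between judge scores and human labels. One common choice is the Pearson correlation, sketched below. The source does not say which statistic to use, so treat this as an assumption. A rank correlation or an agreement rate may fit your rubric better. `LabeledExample` and `pearson` are illustrative names.

```typescript
// One calibration example: a human score and the judge's score on the same output.
interface LabeledExample {
  human: number;
  judge: number;
}

// Pearson correlation between human and judge scores, in the range [-1, 1].
function pearson(examples: LabeledExample[]): number {
  const n = examples.length;
  if (n < 2) throw new Error("Need at least two examples");
  const meanHuman = examples.reduce((sum, e) => sum + e.human, 0) / n;
  const meanJudge = examples.reduce((sum, e) => sum + e.judge, 0) / n;
  let cov = 0;
  let varHuman = 0;
  let varJudge = 0;
  for (const e of examples) {
    const dh = e.human - meanHuman;
    const dj = e.judge - meanJudge;
    cov += dh * dj;
    varHuman += dh * dh;
    varJudge += dj * dj;
  }
  // Correlation is undefined when either side gives every example the same score.
  if (varHuman === 0 || varJudge === 0) throw new Error("Zero variance in scores");
  return cov / Math.sqrt(varHuman * varJudge);
}
```

Recompute this whenever the judge rubric changes, and freeze the judge prompt only once the value clears your target on the human-labeled set.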

AIMock for deterministic E2E

When UI displays parsed LLM output:

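AIMock returns fixed JSON matching output_schema:

```json
{
  "vendor": "Acme Corp",
  "total_cents": 499900,
  "line_items": [{ "sku": "WIDGET", "qty": 2 }]
}
```

The test then drives the UI and probes the persisted record:

```typescript
await ai.act('Upload invoice PDF and click Extract');
// Wait for the app to signal that extraction finished.
await page.locator('[data-extraction-complete="true"]').waitFor();
// UI check: the rendered field shows the mocked vendor.
await expect(page.getByTestId('field-vendor')).toHaveText('Acme Corp');
// Probe check: poll the test endpoint until the persisted total matches.
await expect.poll(async () => {
  const res = await request.get(`/api/test/probe-extraction?runId=${runId}`);
  return (await res.json()).total_cents;
}).toBe(499900);
```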

Semantic UI checks with ai.verify

For narrative summaries where DOM structure is stable but wording varies:

```typescript
await ai.verify('Summary mentions refund deadline and 30-day window');
```

Pair with probe on structured fields—never ai.verify alone for money or compliance flags.

Prompt version tracking

Tag prod and test events with prompt_version. When a deploy bumps the version without an eval run, TrueCoverage highlights the gap. Block the merge if the golden set pass rate drops below your threshold for that version.

Repair and fallback paths

Test when model returns invalid JSON:

  • UI shows retry/error state
  • Probe shows no partial DB write
  • Repair prompt limited to N attempts
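The retry cap in the last bullet can be sketched as a bounded loop that returns an explicit failure instead of retrying forever. `ModelCall`, `ParseResult`, and `parseWithRepair` are hypothetical names. The model call is injected, so the same function works with a real client or a stub that returns malformed JSON.

```typescript
// Injected model call. The attempt number lets the caller switch to a repair prompt.
type ModelCall = (prompt: string, attempt: number) => Promise<string>;

type ParseResult =
  | { ok: true; value: unknown; attempts: number }
  | { ok: false; error: string; attempts: number };

// Try up to maxAttempts times to get parseable JSON, then give up with a failure result.
async function parseWithRepair(
  callModel: ModelCall,
  prompt: string,
  maxAttempts: number,
): Promise<ParseResult> {
  let lastError = "no attempts made";
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const raw = await callModel(prompt, attempt);
    try {
      return { ok: true, value: JSON.parse(raw), attempts: attempt };
    } catch (err) {
      // Remember why parsing failed so the caller can log or display it.
      lastError = err instanceof Error ? err.message : String(err);
    }
  }
  return { ok: false, error: lastError, attempts: maxAttempts };
}
```

Because failure is a returned value rather than a thrown exception, the caller can render the retry/error state and skip the DB write in one place, which is what the probe then verifies.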

Anti-patterns

| Anti-pattern | Why it fails | Better approach |
| --- | --- | --- |
| toHaveText full summary | Rephrase flakes | Schema + ai.verify |
| No eval on prompt PR | Silent quality drop | Golden set CI gate |
| Judge-only gate for numbers | Hallucinated totals pass | Field exact match |
| Snapshot entire JSON in E2E | Key order noise | Probe normalized fields |
| Skip invalid JSON path | Prod parser crash | Negative eval + E2E |
| Real LLM every E2E | Cost + variance | AIMock + eval split |
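Where a whole-object comparison is still useful, key order noise can be removed by canonicalizing both sides first. The sketch below sorts object keys recursively and leaves array order untouched. `canonicalize` and `sameJson` are hypothetical helpers, and this assumes plain JSON values with no Dates, Maps, or cycles.

```typescript
// Recursively rebuild a JSON value with object keys in sorted order.
function canonicalize(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(canonicalize);
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const sorted: Record<string, unknown> = {};
    for (const key of Object.keys(obj).sort()) {
      sorted[key] = canonicalize(obj[key]);
    }
    return sorted;
  }
  return value;
}

// Compare two JSON values while ignoring object key order. Array order still matters.
function sameJson(a: unknown, b: unknown): boolean {
  return JSON.stringify(canonicalize(a)) === JSON.stringify(canonicalize(b));
}
```

Probing a handful of normalized fields is still the tighter check. This helper only removes one source of snapshot flakiness.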

Example scenario

Situation: User extracts structured fields from an uploaded contract.

Expected outcome: Parsed JSON validates against schema and persisted record matches seeded contract terms.

Why UI-only automation breaks: UI preview looks correct but probe shows wrong total_cents or missing signatory field.

  1. Arrange: AIMock returns schema-valid extraction JSON; seed contract fixture with known totals.
  2. Act: Upload contract and trigger extraction.
  3. Assert: JSON Schema pass; probe row matches expected_fields; required compliance flag present.

TestChimp workflow: Track prompt_version × output_schema in TrueCoverage when new extraction templates ship.

Same Arrange/Act/Assert pattern as expired-coupon checkout.

Evals vs E2E: when each layer helps

| Layer | Best for | Limitations |
| --- | --- | --- |
| Offline evals (golden sets, JSON Schema, LLM-as-judge) | Prompt/model regression, numeric and enum accuracy at scale | Misses upload UI, auth, file parsing pipeline, and DB persistence bugs |
| E2E SmartTests (AIMock + schema UI + probe Assert) | End-to-end extract→review→save journey | Too slow for hundreds of document variants |
| Hybrid | Evals on every prompt change; E2E on critical document types | Link eval failures to new SmartTests when integration breaks |

Invest in golden sets when prompts stabilize and volume is high. Use E2E when files cross auth boundaries or touch paid features. TestChimp does not ship eval tooling—combine your eval pipeline with AIMock SmartTests.

Connect scenarios to your QA workflow

Capture business rules in markdown test plans and enforce them with seed routes and probe Assert. Link SmartTests with // @Scenario: for requirement traceability. Use /testchimp test on PRs; /testchimp explore on SmartTest paths for non-functional gaps (ExploreChimp).

Frequently asked questions

How do I validate LLM JSON output in CI?

Run offline evals with JSON Schema validation and field-level asserts on critical numbers. E2E uses AIMock matching output_schema plus probe on persisted records—not raw model calls every PR.

LLM-as-judge vs human labels for evals?

Calibrate judge prompts against a human-labeled subset; freeze judge version in CI. Use for summarization triage, not sole gate for numeric extraction—use exact field asserts there.

Should E2E tests call the real model?

Default no—use AIMock for integration and probes for truth. Reserve real model evals for prompt-change jobs or nightly pipelines.

How do I test invalid JSON from the model?

Use an AIMock fixture or an eval case that returns malformed JSON. Assert that the UI shows an error or retry state and that the probe shows no partial save.

Exact match or semantic match for summaries?

Semantic evals or LLM-as-judge offline; ai.verify in E2E for high-level checks. Structured fields always use schema or exact probes.

How do prompt changes connect to E2E scenarios?

Tag events with prompt_version. When eval fails or TrueCoverage shows new output_schema in prod, add AIMock fixture + SmartTest via /testchimp evolve linked to markdown scenarios.

What if output wording changes but meaning is correct?

That is why exact string E2E fails—use schema probes for facts and semantic evals for narrative. Avoid snapshotting full model prose in Playwright.

Apply these patterns in your repo

Run `/testchimp init` to connect TestChimp to your repo, then `/testchimp test` on PRs to turn these patterns into maintained SmartTests. Use `/testchimp evolve` when you want to expand coverage as your app grows.
