
Structured Outputs Guarantee Syntax, Not Sanity

GPT-4o's JSON mode gives you valid JSON. It doesn't give you correct data. Here's why that distinction is destroying pipelines.

the confidence score was 1400

We were building a document processing pipeline for a client, extracting structured data from supplier invoices, and everything looked great in staging. Pydantic models, typed fields, JSON mode enabled on GPT-4o. Clean. Then we hit production and found a confidence score of 1400 sitting in the database. Not 0.14. Not a typo. Fourteen hundred, because the model decided that was a perfectly reasonable float to return for a field we'd described as 'confidence between 0 and 1'.

The JSON parsed. Pydantic accepted it because confidence: float doesn't mean 'float between zero and one', it means 'float'. The record sailed through every layer of the stack and landed in a downstream report where it made a bar chart look like a skyscraper.
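
To make that concrete, here's roughly what the failing path looked like. NaiveResult is an illustrative stand-in for the model we had in production:

```python
from pydantic import BaseModel

class NaiveResult(BaseModel):
    confidence: float  # 'float' means any float; 1400.0 qualifies

# Valid JSON, matching field name, correct type. Pydantic is satisfied.
record = NaiveResult.model_validate({'confidence': 1400.0})
print(record.confidence)  # 1400.0, straight into the database
```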

This is the trap. Structured output modes in GPT-4o and Claude 3.7 Sonnet give you syntactic correctness. The JSON is valid. The field names match your schema. The types are at least approximately right. But semantic correctness, whether the values actually make sense in the real world, is completely outside what these features were designed to guarantee, and a lot of teams are shipping pipelines that treat JSON mode as the finish line rather than the starting line.

what structured outputs actually promise

OpenAI's structured outputs feature, which has been generally available for a while now and is the default recommendation for any extraction work against GPT-4o, enforces that the response conforms to a JSON Schema at the token generation level. It's constrained decoding. The model literally cannot produce a token sequence that violates the schema's structure. Anthropic's tool use with input schemas on Claude 3.7 Sonnet gives you a similar guarantee.
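
For reference, the call shape on the OpenAI side looks roughly like this in the Python SDK. The parse helper currently lives under beta and the details may drift, so treat this as a sketch:

```python
from openai import OpenAI
from pydantic import BaseModel

class Invoice(BaseModel):
    vendor_name: str
    total_amount: float

client = OpenAI()
completion = client.beta.chat.completions.parse(
    model='gpt-4o',
    messages=[{'role': 'user', 'content': 'Extract the invoice fields: ...'}],
    response_format=Invoice,  # schema enforced via constrained decoding
)
invoice = completion.choices[0].message.parsed  # an Invoice instance, or None on refusal
```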

That's genuinely useful. Before this existed, you'd get JSON with trailing commas, or the model would start a code block and forget to close it, or it'd return null where you expected an object and your entire parsing layer would explode. Constrained decoding solved a real problem.

The problem is the marketing landed wrong, or maybe teams just heard what they wanted to hear. 'Structured outputs' sounds like it means the output is structured correctly, in the sense of being correct. It means the output is structured, full stop. The model can still return an empty string for a required name field. It can still return "1970-01-01" for a document date field because that's what Unix epoch zero looks like when a model hallucinates a timestamp. It can return a list with one element when you needed exactly three, because your schema said array and not minItems: 3. All valid JSON. All garbage data.
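
The minItems case is worth spelling out, because the fix is one line. In Pydantic v2, min_length and max_length on a list field map to minItems and maxItems in the generated JSON Schema, and Pydantic enforces them at parse time regardless of what the provider does on the wire. A sketch, with an illustrative model:

```python
from pydantic import BaseModel, Field

class LabeledDocument(BaseModel):
    # 'array' alone accepts one element; these constraints demand exactly three
    top_labels: list[str] = Field(min_length=3, max_length=3)

LabeledDocument.model_validate({'top_labels': ['a']})  # raises ValidationError
```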

pydantic is not enough and here's why

The default response to this, and I've seen it in code reviews at probably a dozen different shops over the past year, is to define a Pydantic v2 model and call it done. The model definition lives in your codebase, you pass the schema to the LLM, you parse the response with model_validate, and you ship it. Four lines of code. Very clean.

Pydantic v2 is excellent software. The validation it performs (type coercion, required fields, basic constraints when you use Field(ge=0, le=1)) is real, and it catches some things. The issue is that most teams aren't writing their Pydantic models defensively. They're writing them to describe the shape they expect, not to aggressively reject anything outside a sane range.

A field like invoice_total: float will happily accept -50000.00. A field like extracted_date: str will accept any string that doesn't cause a parse error, including dates from 1970, dates in the future, dates formatted as "January 32nd, 2024". You can add validators, and you should, but the point is that the schema you send to the LLM and the Pydantic model you use to parse the response are usually identical, which means you have one layer of validation doing the work of two, and it's the syntactic layer, not the semantic one.
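
One useful consequence of tightening the Pydantic model is that the constraints flow into the schema you send, since most teams generate it with model_json_schema(). A sketch:

```python
from pydantic import BaseModel, Field

class Result(BaseModel):
    confidence: float = Field(ge=0.0, le=1.0)

print(Result.model_json_schema()['properties']['confidence'])
# includes 'minimum': 0.0 and 'maximum': 1.0 alongside 'type': 'number'
```

One caveat: provider strict modes support only a subset of JSON Schema keywords, so numeric range constraints may be ignored or rejected on the wire. The parse-time check in your own code is the one you can rely on.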

The mental model shift that fixes this is treating LLM output exactly the way you'd treat a response from a third-party API you don't control. You wouldn't trust that some random payment processor was going to return a positive amount. You'd validate it. You'd check ranges, cross-reference fields against each other, reject nonsensical combinations. The LLM is a third-party API. A very smart, very fast, wildly non-deterministic one.

a concrete setup that actually catches this

The approach we've settled on at steezr for document processing work combines two things: Pydantic v2 validators that are written adversarially, and property-based tests using Hypothesis that hammer the parsing layer with generated inputs before anything touches production.

The Pydantic side looks something like this for an invoice extraction result:

```python
from datetime import date

from pydantic import BaseModel, Field, field_validator

class ExtractionResult(BaseModel):
    confidence: float = Field(ge=0.0, le=1.0)
    invoice_date: str
    total_amount: float = Field(ge=0.0, lt=1_000_000.0)
    vendor_name: str = Field(min_length=1)

    @field_validator('invoice_date')
    @classmethod
    def date_must_be_sane(cls, v: str) -> str:
        parsed = date.fromisoformat(v)  # raises ValueError on non-ISO strings
        if parsed.year < 2000 or parsed > date.today():
            raise ValueError(f'invoice_date {v!r} is outside plausible range')
        return v

    @field_validator('total_amount')
    @classmethod
    def amount_cannot_be_zero(cls, v: float) -> float:
        if v == 0.0:
            raise ValueError('total_amount of exactly zero is almost certainly an extraction failure')
        return v

    @field_validator('vendor_name')
    @classmethod
    def name_cannot_be_blank(cls, v: str) -> str:
        # min_length=1 still lets whitespace-only strings through; reject those too
        if not v.strip():
            raise ValueError('vendor_name must not be blank')
        return v
```

Every field has an opinion. Confidence can't be above 1.0. Dates can't be before 2000 or in the future. Total amount can't be zero, negative, or suspiciously large. Vendor name can't be empty or whitespace.

Then there's Hypothesis, which, if you're not using it for this kind of work, you really should be. The idea is to generate arbitrary JSON-shaped data and verify that your validation layer rejects the things it should reject and accepts the things it should accept:

```python
import pytest
from hypothesis import given, strategies as st
from pydantic import ValidationError

@given(st.floats(allow_nan=False))
def test_confidence_rejects_out_of_range(value):
    if value < 0.0 or value > 1.0:
        with pytest.raises(ValidationError):
            ExtractionResult(
                confidence=value,
                invoice_date='2024-06-15',
                total_amount=100.0,
                vendor_name='ACME Corp',
            )

@given(st.text())
def test_vendor_name_rejects_empty(value):
    if len(value.strip()) == 0:
        with pytest.raises(ValidationError):
            ExtractionResult(
                confidence=0.9,
                invoice_date='2024-06-15',
                total_amount=100.0,
                vendor_name=value,
            )
```

This approach finds edge cases that manual testing misses. Hypothesis will find the float that's technically positive but rounds to zero. It'll find the 1970 date string that fromisoformat happily accepts because somewhere upstream a Unix epoch zero got rendered as an ISO date. It'll find all of it, and it'll find it before you do.
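
You can aim Hypothesis directly at the date logic too. A sketch that sweeps the calendar and checks both sides of the sanity window, reusing the imports from the tests above:

```python
from datetime import date

from hypothesis import given, strategies as st

@given(st.dates(min_value=date(1900, 1, 1), max_value=date(2100, 12, 31)))
def test_invoice_date_sanity_window(d):
    fields = dict(confidence=0.9, total_amount=100.0,
                  vendor_name='ACME Corp', invoice_date=d.isoformat())
    if d.year < 2000 or d > date.today():
        with pytest.raises(ValidationError):
            ExtractionResult(**fields)
    else:
        ExtractionResult(**fields)  # in range: must be accepted
```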

cross-field validation is where it gets interesting

Single-field validation is the easy part. The harder and more interesting cases are cross-field semantic checks, things that can only be wrong when you look at two fields together.

On an invoice extraction pipeline, we caught a case where the model was returning an items list with a total that didn't match the sum of line items, not because it was rounding, but because it was hallucinating the total independently of the items. Both fields were valid individually. Together they implied a math error in a financial document. Our client was loading these into an accounting system.

You can catch this with a model-level validator in Pydantic v2:

```python
from pydantic import BaseModel, Field, model_validator

class LineItem(BaseModel):  # minimal enclosing models, sketched for completeness
    description: str
    amount: float

class InvoiceExtractionResult(BaseModel):
    line_items: list[LineItem]
    total_amount: float = Field(gt=0.0)

    @model_validator(mode='after')
    def totals_must_reconcile(self) -> 'InvoiceExtractionResult':
        if self.line_items:
            computed = sum(item.amount for item in self.line_items)
            tolerance = 0.02  # allow two cents of rounding drift
            if abs(computed - self.total_amount) > tolerance:
                raise ValueError(
                    f'total_amount {self.total_amount} does not match '
                    f'sum of line_items {computed:.2f}'
                )
        return self
```

This is the kind of validation that LLMs simply cannot enforce on themselves at token generation time, because it requires reasoning about the relationship between fields after the fact. Constrained decoding operates token by token. It can enforce that total_amount is a float. It cannot enforce that total_amount equals the sum of a list it generated forty tokens earlier.

The pattern generalizes. Date ranges where end comes before start. Addresses where the postcode doesn't match the city. Tax amounts that are inconsistent with the stated tax rate. Every domain has these, and every domain's LLM extraction pipeline is probably silently corrupting records on the cases where the model gets confused.
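
The date-range case, as one more example, is a three-line validator on whatever model holds the pair. ContractPeriod here is hypothetical:

```python
from datetime import date

from pydantic import BaseModel, model_validator

class ContractPeriod(BaseModel):
    start_date: date
    end_date: date

    @model_validator(mode='after')
    def end_after_start(self) -> 'ContractPeriod':
        if self.end_date < self.start_date:
            raise ValueError(
                f'end_date {self.end_date} precedes start_date {self.start_date}'
            )
        return self
```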

what to do when validation fails

The last piece people underestimate is failure handling. Once you have real validation with opinions, you'll get real validation failures, and you need a decision about what to do with them.

The naive approach is to raise an exception and let it propagate. For a batch document processing pipeline, that's often wrong, because one bad extraction shouldn't block five hundred good ones. We usually implement a dead-letter queue pattern: records that fail semantic validation get routed to a separate table with the raw LLM response, the validation error, and a flag for human review. The pipeline continues. Nothing is silently dropped. Nothing is silently accepted.
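
In code, the pattern is small. A sketch, where accepted_store and dead_letter_store stand in for whatever tables or queues you actually use:

```python
from pydantic import ValidationError

def process_batch(raw_responses: list[str]) -> None:
    for raw in raw_responses:
        try:
            record = InvoiceExtractionResult.model_validate_json(raw)
        except ValidationError as err:
            # Keep the raw response and the error; flag for human review.
            dead_letter_store.append({
                'raw_response': raw,
                'validation_error': str(err),
                'needs_review': True,
            })
        else:
            accepted_store.append(record)
```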

For cases where the validation failure is recoverable, it's worth trying a second extraction pass with a different prompt, or with the validation error included in the prompt as explicit feedback. Something like "your previous extraction returned a total_amount of -50.00 which is invalid for an invoice, please re-examine the document". This works surprisingly well for the borderline cases, because the model genuinely does better when you tell it what it got wrong, though you want a retry budget and you don't want to loop forever.
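
A retry loop with a budget might look like this, where call_llm is a placeholder for however you invoke the model:

```python
from pydantic import ValidationError

MAX_RETRIES = 2  # retry budget: never loop forever

def extract_with_feedback(document_text: str) -> ExtractionResult | None:
    feedback = ''
    for _ in range(MAX_RETRIES + 1):
        raw = call_llm(document_text, feedback)  # placeholder LLM call
        try:
            return ExtractionResult.model_validate_json(raw)
        except ValidationError as err:
            # Feed the concrete failure back into the next attempt.
            feedback = (
                f'Your previous extraction failed validation: {err}. '
                'Please re-examine the document.'
            )
    return None  # budget exhausted: route to the dead-letter queue
```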

The telemetry matters too. Log every validation failure with enough context to see patterns. If you're seeing confidence > 1.0 failures frequently, that's a signal your prompt isn't constraining the model well enough on that field. If you're seeing date range failures cluster around a particular document type, that's a signal to add few-shot examples. The validation layer becomes a feedback mechanism for prompt engineering, which is a secondary benefit that teams rarely anticipate but end up relying on heavily.
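
Pydantic v2 makes this straightforward to log in a structured way, since ValidationError.errors() exposes the failing field path and error type per failure. A sketch, with doc_type assumed to be available from your pipeline context:

```python
import logging

from pydantic import ValidationError

logger = logging.getLogger('extraction.validation')

def log_validation_failure(err: ValidationError, doc_type: str) -> None:
    # One log line per failing field, so failures can be grouped later.
    for e in err.errors():
        logger.warning(
            'validation_failure doc_type=%s field=%s error_type=%s msg=%s',
            doc_type, '.'.join(str(p) for p in e['loc']), e['type'], e['msg'],
        )
```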

the bigger point

JSON mode being available doesn't mean your data is correct. It means your data is parseable. Those are different things, and the gap between them is where silent data corruption lives.

The discipline that prevents this isn't complex. Treat every LLM response as untrusted external input. Write your Pydantic models like an adversary is going to fill in the fields. Use Hypothesis to find the edge cases your intuition misses. Validate fields against each other, not just individually. Build a dead-letter path for failures instead of swallowing them or crashing on them.

We've built enough of these pipelines at steezr (invoice extraction, document classification, form processing, contract analysis) that the pattern is well worn at this point. The teams that skip the semantic validation layer always come back having found garbage in their database. The ones that build it spend a few extra days up front and then don't think about it again.

The LLM is a powerful extraction tool that is also wrong in creative and unpredictable ways. Design your system around that fact.
