the false sense of safety
There was a specific moment, sometime in mid-2024, when a huge chunk of the engineering world collectively exhaled and said 'okay, the JSON problem is solved.' OpenAI shipped Structured Outputs with strict mode in the Chat Completions and Assistants APIs, Anthropic followed with its own schema-forcing approach via tool use, and suddenly you could get a response that would parse without throwing a fit. No more fragile regex to strip markdown fences from around a code block that was supposed to be raw JSON. No more json.JSONDecodeError on the seventh retry because the model decided to include a helpful preamble. The shape of the data was finally guaranteed, and teams moved on.
The problem is that shape and correctness are completely different things, and conflating them is one of the more expensive mistakes you can make when building a pipeline that puts LLM output anywhere near a database or a downstream service. A JSON object that passes your Pydantic or Zod schema is exactly as trustworthy as whatever the model decided to put in the fields, which is to say: not very trustworthy at all, and in ways that are much harder to catch than a parse error.
what schema conformance actually gives you
Strict structured outputs, whether you're using OpenAI's response_format: { type: 'json_schema', json_schema: { strict: true, ... } } or Anthropic's tool-use trick to force a particular shape, give you one thing: the model will produce a string that survives JSON.parse() and has the keys and types you declared. That's it. The value of a string field can be anything. The value of an integer field will be an integer, but it might be the completely wrong integer. An enum field will contain one of your declared enum values, but the model will pick whichever enum value seems most plausible to it in the moment, which under distribution shift is not necessarily the one a human expert would pick.
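To make the guarantee concrete, here's roughly what a strict-mode call looks like with the OpenAI Python SDK. The model name, field names, and invoice text are placeholders, a sketch of the mechanism rather than a production call:

```python
from openai import OpenAI

client = OpenAI()
invoice_text = "…raw OCR text of an invoice…"  # placeholder input

# Strict mode guarantees this returns parseable JSON with exactly these
# keys and types. It guarantees nothing about whether the values are right.
resp = client.chat.completions.create(
    model="gpt-4o-2024-08-06",  # any structured-outputs-capable model
    messages=[
        {"role": "system", "content": "Extract the vendor and total from the invoice."},
        {"role": "user", "content": invoice_text},
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "invoice_extraction",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "vendor_name": {"type": "string"},
                    "total_cents": {"type": "integer"},
                },
                "required": ["vendor_name", "total_cents"],
                "additionalProperties": False,
            },
        },
    },
)
```

vendor_name can be any string the model cares to produce; total_cents will be an integer, possibly the wrong one.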
Think about a document processing pipeline, something like what we've built for clients at steezr, where incoming invoices or contracts get parsed into structured records. You define a field payment_terms as an enum: ['net_30', 'net_60', 'net_90', 'immediate']. The model will always return one of those four values. What it won't always do is return the right one, especially when the source document is ambiguous, scanned badly, or written in a way the model hasn't seen much of. You get a perfectly valid enum value representing a completely incorrect classification, and it flows silently into your accounting system.
Pydantic doesn't catch that. Zod doesn't catch that. Your schema validator has no idea what the document said. It only knows the output matches the shape you described, which it does.
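To see just how little the validator knows, here's a minimal Pydantic version of that schema. The record below validates cleanly even if the source document plainly said net 90:

```python
from enum import Enum
from pydantic import BaseModel

class PaymentTerms(str, Enum):
    NET_30 = "net_30"
    NET_60 = "net_60"
    NET_90 = "net_90"
    IMMEDIATE = "immediate"

class InvoiceRecord(BaseModel):
    payment_terms: PaymentTerms

# Suppose the document says "payment due within 90 days" and the model
# returns net_30. Validation passes: the shape is right, the value is wrong.
record = InvoiceRecord.model_validate({"payment_terms": "net_30"})
```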
semantic drift in the wild
Semantic drift is the category of failure that keeps me up at night more than anything else in LLM pipeline work. It's when the model's interpretation of a field's meaning gradually diverges from what you intended, and it tends to happen in ways that look totally reasonable on a case-by-case basis but create systematic bias at scale.
We had a situation with an internal system processing customer support tickets, where one of the output fields was sentiment, typed as an enum of ['positive', 'neutral', 'negative']. The model was being asked to classify the sentiment of the resolution, not the initial complaint. Over a few weeks of looking at samples, we noticed it was consistently rating resolutions as neutral when a human reviewer would have called them negative, specifically in cases where the agent's response was polite but the customer's issue was never actually fixed. The model had learned, reasonably, that formal and polite language correlates with neutral or positive sentiment, and it was applying that heuristic in a context where it was wrong.
Every single one of those records had a valid sentiment field. The enum value was always present, always one of the three options. The schema was passing. The data was wrong in a way that took weeks to surface because it required looking at aggregate distributions, not individual records.
This is why monitoring LLM outputs means tracking value distributions over time, not just parse success rates. If your category field suddenly shifts from 60% billing to 85% billing after a model version update or a prompt change, that's a signal worth investigating. The individual records still look fine.
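A sketch of what that distribution tracking can look like, using nothing but the standard library. The 15-point threshold is an arbitrary placeholder; a real monitor would run a proper statistical test over fixed time windows:

```python
from collections import Counter

def shifted_categories(baseline: list[str], recent: list[str],
                       threshold: float = 0.15) -> dict[str, float]:
    """Return categories whose share of records moved more than
    `threshold` between the two windows."""
    def shares(values: list[str]) -> dict[str, float]:
        counts = Counter(values)
        total = sum(counts.values()) or 1
        return {cat: n / total for cat, n in counts.items()}

    before, after = shares(baseline), shares(recent)
    return {
        cat: after.get(cat, 0.0) - before.get(cat, 0.0)
        for cat in set(before) | set(after)
        if abs(after.get(cat, 0.0) - before.get(cat, 0.0)) > threshold
    }

# The billing example above would surface as {"billing": 0.25}:
# every individual record valid, the aggregate screaming.
```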
hallucinated enumerations and silent invention
Here's a subtler failure mode that strict mode partially addresses but doesn't fully eliminate: hallucinated values that fit the type but not the domain.
If your field is an open string, not an enum, you're entirely at the model's mercy. We've seen pipelines that ask a model to extract a product SKU from a document and store it as a string. The schema is satisfied when a string is present. The model, faced with an invoice where the SKU is partially obscured or formatted unusually, will frequently generate a plausible-looking SKU rather than returning null or flagging uncertainty, because that's what generative models do. They generate. A hallucinated SKU that's nine characters and matches your ^[A-Z]{2}[0-9]{7}$ regex is indistinguishable from a real one at the schema layer.
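The only defense at this layer is a check the model can't generate its way past: a lookup against ground truth. A sketch, where known_skus stands in for a real product catalog query:

```python
import re

SKU_PATTERN = re.compile(r"^[A-Z]{2}[0-9]{7}$")
known_skus = {"AB1234567", "CD7654321"}  # stand-in for a product catalog

def check_sku(sku: str) -> str:
    if not SKU_PATTERN.match(sku):
        return "malformed"
    # A hallucinated SKU passes the regex just as easily as a real one.
    # Only the catalog lookup can tell them apart.
    return "ok" if sku in known_skus else "not_in_catalog"
```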
Strict mode with an enum does force the model to pick from your list, which prevents fully invented values, but it introduces a different problem: the model now has to map every input to one of your declared options even when none of them are a good fit. There's no 'I don't know' or 'none of the above' unless you explicitly include it, and a lot of schemas don't. The model will make a choice. It'll be a syntactically valid choice. It might be completely wrong.
The pattern we've landed on for anything where 'none of these' is a real possibility is to include an explicit unknown or no_match enum value and then treat records with that value as requiring human review rather than silently proceeding. It sounds obvious in retrospect, but an embarrassing number of pipelines just don't do this.
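A sketch of that pattern, with hypothetical category names; the point is the explicit escape hatch and the routing decision attached to it:

```python
from enum import Enum
from pydantic import BaseModel

class TicketCategory(str, Enum):
    BILLING = "billing"
    SHIPPING = "shipping"
    RETURNS = "returns"
    NO_MATCH = "no_match"  # the escape hatch most schemas forget

class ClassifiedTicket(BaseModel):
    category: TicketCategory

def route(ticket: ClassifiedTicket) -> str:
    # no_match means a human looks at it; it never proceeds silently
    if ticket.category is TicketCategory.NO_MATCH:
        return "human_review_queue"
    return "auto_process"
```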
silent truncation and the length problem
Context window limits interact with structured outputs in a way that most people only discover under load. When a model is constrained to produce a particular JSON structure and it's running out of output tokens, the behavior isn't consistent. Sometimes the model will truncate string fields mid-sentence, producing a valid string that just... stops. Sometimes it'll omit optional fields entirely. Sometimes, if you're using strict mode, the generation will fail outright because it can't complete the required structure within the token budget.
The truncation case is the nasty one because it's a valid string, the field is present, the schema passes, and you've got half a summary, half an extracted clause, half a reason code. We've seen this happen in document extraction pipelines when the source material is long and the model is also being asked to produce several nested output fields. The first few fields come out fine, and by the time it's filling the summary field near the end of the output object, it's out of budget and just stops.
Dealing with this requires instrumentation at the output level, not the schema level. You want to check whether string fields that should have meaningful length are suspiciously short. You want to log finish_reason on every single call and alert on anything that isn't stop. A finish_reason of length with a valid JSON response that passed schema validation is a genuinely weird edge case that a lot of teams' monitoring setups don't surface clearly, because they're only watching for errors, not for this particular flavor of silent degradation.
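A sketch of that instrumentation, assuming the OpenAI chat completions response shape; MIN_SUMMARY_CHARS is an invented, domain-specific floor you'd tune yourself:

```python
import json

MIN_SUMMARY_CHARS = 80  # assumed floor for a field that should hold a real summary

def audit_completion(resp) -> dict:
    """Checks that run after schema validation, not instead of it."""
    choice = resp.choices[0]
    payload = json.loads(choice.message.content)

    issues = []
    # finish_reason == "length" with parseable JSON is exactly the
    # silent-truncation case: valid shape, half the content.
    if choice.finish_reason != "stop":
        issues.append(f"finish_reason={choice.finish_reason}")
    if len(payload.get("summary", "")) < MIN_SUMMARY_CHARS:
        issues.append("summary suspiciously short")
    return {"payload": payload, "issues": issues}
```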
what actually needs to happen
Schema validation is the floor, not the ceiling. Passing a Pydantic model over your LLM output is the bare minimum, roughly equivalent to checking that a function returns without throwing an exception. It's necessary and completely insufficient.
What production pipelines actually need is a second validation layer that understands the domain, not just the shape. For enum fields with high business impact, that means tracking value distributions and alerting on statistical drift. For string fields that should contain specific formats (extracted numbers, dates, identifiers), that means regex or rule-based checks on the content, not just the type. For fields where the model might be uncertain, that means designing the schema to express uncertainty explicitly (a confidence field, an unknown enum option, a nullable type) rather than forcing the model to commit to an answer it doesn't have.
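For the last of those, a schema that gives the model somewhere to put its uncertainty might look like the sketch below. Note that a model-reported confidence is itself generated text, so treat it as a routing signal, not a calibrated probability; the 0.7 cutoff is an assumption to tune against your own review data:

```python
from typing import Optional
from pydantic import BaseModel, Field

class ExtractedClause(BaseModel):
    # Nullable field plus an explicit confidence, so "I don't have this"
    # is a legal answer instead of an invitation to invent one.
    clause_text: Optional[str] = None
    confidence: float = Field(ge=0.0, le=1.0)

def needs_review(record: ExtractedClause, floor: float = 0.7) -> bool:
    return record.clause_text is None or record.confidence < floor
```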
For pipelines we build at steezr, particularly anything touching financial data or document extraction, we'll typically run a second-pass validation stage that's entirely rule-based and domain-specific, after the LLM stage has run and after the schema has been checked. That stage knows things like 'a valid invoice amount for this client is never going to be negative' or 'this category only applies to these sub-categories, so if you see this combination, flag it.' The LLM doesn't know those things implicitly, and no schema you describe to it will make it know them.
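A stripped-down sketch of what such a stage looks like; the rules here are invented examples, since the real ones are exactly the domain knowledge no schema can carry:

```python
from decimal import Decimal

# Hypothetical domain rules; the real ones come from the business, not the model
ALLOWED_PAIRS = {
    "hardware": {"laptop", "monitor", "peripheral"},
    "software": {"license", "subscription"},
}

def second_pass(record: dict) -> list[str]:
    violations = []
    if Decimal(str(record["amount"])) < 0:
        violations.append("negative invoice amount")
    if record["subcategory"] not in ALLOWED_PAIRS.get(record["category"], set()):
        violations.append(
            f"invalid pair: {record['category']}/{record['subcategory']}"
        )
    return violations  # anything non-empty goes to a review queue, not the database
```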
The other piece is human review sampling. Even in high-throughput pipelines, randomly sampling some percentage of outputs for human spot-check is worth the cost. Not because you expect to find errors in every sample, but because systematic errors tend to cluster in ways that random sampling eventually surfaces. You're looking for the weird pattern that shows up in five out of fifty samples and makes you go 'wait, why are all of these the same wrong answer?'
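The mechanics are almost embarrassingly simple; the 2% rate below is an arbitrary starting point to tune against your throughput and risk tolerance:

```python
import random

REVIEW_RATE = 0.02  # assumed spot-check rate

def route_record(record: dict) -> str:
    # Random, not rule-based, sampling: systematic errors cluster,
    # and a random slice eventually catches the cluster.
    if random.random() < REVIEW_RATE:
        return "human_review_sample"
    return "downstream"
```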
Strict JSON mode is a genuinely useful feature. It eliminated a whole class of brittle output parsing code and made integrations more reliable. But the teams that shipped it made it very easy to conflate 'reliable to parse' with 'reliable to trust,' and those aren't the same thing, not even close.
