Treat hallucinations like system failures

stop calling them random

Hallucinations get described as spooky model behavior, like the system woke up in a bad mood and decided your invoice total is 17,430 EUR instead of 1,743.00 EUR. That framing is lazy, and it leads teams straight into bad architecture, because if you assume the failure is mysterious you end up adding bigger prompts, more examples, more retries, and other expensive nonsense instead of measuring the places where the model is plainly telling you it has weak footing.

Production failures usually have a shape. We see this in document pipelines, CRM assistants, and AI sales workflows we build at Steezr, where the dangerous outputs tend to cluster around the same patterns: unsupported entity linking, fabricated citations, overconfident normalization of messy source text, and action requests that quietly smuggle in assumptions the retriever never found. None of that is random. The model emits low-confidence token regions, starts hedging then snaps back into confident prose, or references chunks that never existed in the context window. You can detect those signals cheaply if you stop treating the response as a sacred blob of text and start treating generation as telemetry.

The practical thesis is simple: log token-level confidence where the API supports it, store provenance for every retrieved chunk, run a small verifier or re-ranker against the final claim set, and keep deterministic paths for tasks that never needed an LLM in the first place. If the job is extracting an invoice date from a fixed PDF template, use pdfplumber, pymupdf, regexes, and schema validation. If the job is answering a fuzzy question over ten policy documents, then yes, use the model, just don't let it act without passing through a cheap checkpoint.

Most teams skip the checkpoint. Then they act surprised when the assistant invents a renewal clause.

token streams are telemetry

If you're already streaming model output and only using it to make the UI feel alive, you're leaving useful signal on the floor. The generation stream can tell you where the model started guessing, where a JSON object drifted off schema, and where a citation was likely stitched together from nearby context instead of copied from evidence. Token-level log probabilities are the most obvious feature, assuming your provider exposes them for the model you're running, and yes, support varies, so build your pipeline around optional enrichment instead of a hard dependency.

A practical pattern with the OpenAI Python SDK is to stream the response for latency, then request token details on the same completion path where available, or run a cheap secondary scoring pass on the emitted answer if the endpoint doesn't expose logprobs in the mode you need. The exact API surface moves, which is another reason to isolate this behind your own adapter. The logic stays stable. Aggregate per-token logprob, compute rolling windows, and flag spans where the average drops below a threshold while the text contains high-risk structures such as money amounts, dates, legal clauses, database identifiers, or citations.

A stripped-down sketch looks like this:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

question = "What is the renewal term in the attached contract?"  # placeholder input

# Chat Completions returns per-token logprobs when logprobs=True. Other
# endpoints and providers expose this differently, so keep the call behind
# your own adapter.
resp = client.chat.completions.create(
    model="gpt-4.1-mini",
    messages=[
        {"role": "system", "content": "Answer using provided evidence only."},
        {"role": "user", "content": question},
    ],
    logprobs=True,
)

choice = resp.choices[0]
text = choice.message.content
# logprobs can be missing when the model or mode does not return them.
token_items = choice.logprobs.content if choice.logprobs and choice.logprobs.content else []
tokens = [{"token": t.token, "logprob": t.logprob} for t in token_items]


def suspicious_spans(tokens, threshold=-2.5, window=6):
    """Return (start, end, avg_logprob) for each sliding window below threshold."""
    out = []
    for i in range(len(tokens) - window + 1):
        avg = sum(t["logprob"] for t in tokens[i:i + window]) / window
        if avg < threshold:
            out.append((i, i + window, avg))
    return out
```
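The windows above are scored on log probability alone. A minimal sketch of the other half of the rule, keeping only low-confidence windows that overlap a high-risk structure, might look like the following. The `RISKY` patterns, the `risky_spans` helper, and the sample tokens are illustrative assumptions, and the sketch assumes the token strings concatenate back into the answer text.

```python
import re

# Illustrative patterns for high-risk structures; extend these for your domain.
RISKY = re.compile(
    r"\d{4}-\d{2}-\d{2}"             # ISO dates
    r"|\d[\d.,]*\s?(?:EUR|USD|CZK)"  # amounts such as 1,743.00 EUR
)


def risky_spans(tokens, low_conf_windows):
    """Keep low-confidence token windows that overlap a risky match in the text."""
    # Character offset at which each token starts, plus the final end offset.
    offsets, pos = [], 0
    for t in tokens:
        offsets.append(pos)
        pos += len(t["token"])
    offsets.append(pos)
    text = "".join(t["token"] for t in tokens)
    matches = [m.span() for m in RISKY.finditer(text)]

    flagged = []
    for start, end, avg in low_conf_windows:
        c_start, c_end = offsets[start], offsets[end]
        if any(m_start < c_end and c_start < m_end for m_start, m_end in matches):
            flagged.append((start, end, avg, text[c_start:c_end]))
    return flagged


# Hypothetical token trace for the answer "Total is 17,430 EUR".
tokens = [
    {"token": "Total", "logprob": -2.8},
    {"token": " is", "logprob": -3.2},
    {"token": " 17", "logprob": -3.9},
    {"token": ",", "logprob": -3.1},
    {"token": "430", "logprob": -4.2},
    {"token": " EUR", "logprob": -0.3},
]
# Windows as (start_token, end_token, avg_logprob), e.g. from suspicious_spans.
windows = [(0, 2, -3.0), (2, 5, -3.73)]
flagged = risky_spans(tokens, windows)
```

Here only the second window survives, because it overlaps the money amount. The first is just as uncertain but covers filler words, which is usually not worth an alert.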

Thresholds need calibration against your own data. -2.5 may be noisy for one model and too lenient for another. Build a labeled set of good and bad answers, then plot false positives against incident cost. The point isn't academic certainty. The point is to catch the ugly cases before they become support tickets or silent database corruption.
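One way to run that calibration is a plain threshold sweep over the labeled set. This is a sketch under assumptions: each answer is reduced to its lowest rolling-window average, the `sweep_thresholds` name is made up, and the six labeled examples are invented for illustration. Weighting the output by incident cost is left to you.

```python
def sweep_thresholds(examples, thresholds):
    """For each threshold, report recall on bad answers and the false positive rate.

    `examples` is a list of (min_window_avg_logprob, is_bad) pairs, one per
    labeled answer. An answer is flagged when its score falls below the threshold.
    """
    bad_total = sum(1 for _, is_bad in examples if is_bad)
    good_total = len(examples) - bad_total
    rows = []
    for th in thresholds:
        flagged = [is_bad for score, is_bad in examples if score < th]
        caught = sum(1 for is_bad in flagged if is_bad)
        false_pos = len(flagged) - caught
        rows.append({
            "threshold": th,
            "recall": caught / bad_total if bad_total else 0.0,
            "false_positive_rate": false_pos / good_total if good_total else 0.0,
        })
    return rows


# Hypothetical labeled set: (lowest window average, answer was wrong?)
labeled = [
    (-0.4, False), (-1.1, False), (-2.7, False),
    (-2.9, True), (-3.6, True), (-1.9, True),
]
rows = sweep_thresholds(labeled, thresholds=[-3.0, -2.5, -1.5])
for row in rows:
    print(row)
```

On this toy data, loosening the threshold from -3.0 to -1.5 catches every bad answer at the price of flagging one good answer in three. Whether that trade is acceptable depends on what a missed incident costs you.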

One more thing: store the raw token trace with the final answer, prompt version, model version, and retrieval ids. If you can't replay the bad generation later, you don't have observability, you have vibes.
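A replayable trace does not need much machinery. The record below is a sketch: the `GenerationTrace` class, its field names, and the sample values are assumptions, and JSON lines are just one convenient storage format.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class GenerationTrace:
    request_id: str
    prompt_version: str
    model_version: str
    retrieval_ids: list
    answer: str
    tokens: list = field(default_factory=list)  # [{"token": ..., "logprob": ...}]


def to_log_line(trace: GenerationTrace) -> str:
    """Serialize one trace as a single JSON line for append-only storage."""
    return json.dumps(asdict(trace), sort_keys=True)


trace = GenerationTrace(
    request_id="req_001",
    prompt_version="v12",
    model_version="gpt-4.1-mini",
    retrieval_ids=["doc_184:p12:c03"],
    answer="The contract renews automatically for 12 months.",
    tokens=[{"token": "The", "logprob": -0.05}],
)
line = to_log_line(trace)
restored = GenerationTrace(**json.loads(line))  # round-trips for later replay
```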

provenance or it didn't happen

Retrieval-augmented generation falls apart the moment evidence provenance gets flattened into a single context string. I've seen teams concatenate ten chunks with \n\n---\n\n, pass that into the prompt, then ask the model for citations. That's theater. Once you erase chunk identity, page numbers, section ids, and retrieval scores, you can't verify anything downstream except by re-running the whole pipeline and hoping the retriever returns the same order.

Keep each evidence unit as a first-class object. That means chunk_id, source document id, page or row locator, retrieval score, embedding model version, checksum of the underlying text, and ideally a stable span offset if the source is a document you control. Your answer generator should produce structured claims tied to evidence ids, not free-form prose with decorative citations. Force the model to emit something like:

```json
{
  "claims": [
    {
      "text": "The contract renews automatically for 12 months.",
      "evidence_ids": ["doc_184:p12:c03"],
      "risk": "high"
    }
  ]
}
```

Then verify that every evidence_id exists, that the cited chunk actually contains lexical overlap or semantic support for the claim, and that the chunk came from the current retrieval set, not a stale cache entry from yesterday's index. This is painfully unglamorous work. It's also the difference between a demo and a production system.
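Those three checks fit in a few lines. The sketch below uses word overlap as a crude stand-in for "lexical overlap or semantic support"; the `check_claim` helper, the 0.5 cutoff, and the sample chunks are assumptions for illustration, not tuned values.

```python
import re


def lexical_overlap(claim: str, chunk: str) -> float:
    """Fraction of the claim's word tokens that also appear in the chunk."""
    claim_words = set(re.findall(r"\w+", claim.lower()))
    chunk_words = set(re.findall(r"\w+", chunk.lower()))
    return len(claim_words & chunk_words) / len(claim_words) if claim_words else 0.0


def check_claim(claim, chunk_store, current_retrieval_ids, min_overlap=0.5):
    """Return a list of problems; an empty list means the citations pass."""
    problems = []
    if not claim["evidence_ids"]:
        problems.append("claim cites no evidence")
    for ev_id in claim["evidence_ids"]:
        if ev_id not in chunk_store:
            problems.append(f"{ev_id}: unknown evidence id")
        elif ev_id not in current_retrieval_ids:
            problems.append(f"{ev_id}: not in current retrieval set")
        elif lexical_overlap(claim["text"], chunk_store[ev_id]) < min_overlap:
            problems.append(f"{ev_id}: weak lexical support")
    return problems


chunk_store = {
    "doc_184:p12:c03": "This contract renews automatically for successive 12 month terms.",
    "doc_090:p02:c01": "Payment is due within 30 days.",
}
claim = {
    "text": "The contract renews automatically for 12 months.",
    "evidence_ids": ["doc_184:p12:c03"],
    "risk": "high",
}

ok = check_claim(claim, chunk_store, {"doc_184:p12:c03"})
stale = check_claim(claim, chunk_store, set())
fabricated = check_claim({**claim, "evidence_ids": ["doc_999:p01:c01"]}, chunk_store, set())
```

A real system would likely replace the overlap score with a re-ranker or verifier, but even this version catches citations to chunks that do not exist.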

For document workflows, I like a two-table approach in PostgreSQL. One table stores canonical chunks with source metadata and text hash. Another stores retrieval events for each request, including rank, score, query rewrite, and prompt version. If a user asks why the assistant claimed a vendor had net 30 terms, you can answer with a precise audit trail instead of staring at LangSmith traces for forty minutes.
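As a rough illustration of the two tables, here is a runnable sketch that uses Python's built-in `sqlite3` as a stand-in for PostgreSQL. The table and column names are assumptions, and a real PostgreSQL schema would use its own types and indexes.

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for PostgreSQL in this sketch
conn.executescript("""
CREATE TABLE chunks (
    chunk_id    TEXT PRIMARY KEY,
    document_id TEXT NOT NULL,
    locator     TEXT NOT NULL,      -- page, row, or section
    text        TEXT NOT NULL,
    text_sha256 TEXT NOT NULL
);
CREATE TABLE retrieval_events (
    request_id     TEXT NOT NULL,
    chunk_id       TEXT NOT NULL REFERENCES chunks(chunk_id),
    rank           INTEGER NOT NULL,
    score          REAL NOT NULL,
    query_rewrite  TEXT,
    prompt_version TEXT NOT NULL,
    PRIMARY KEY (request_id, chunk_id)
);
""")

text = "Payment terms: net 30 days from invoice date."
conn.execute(
    "INSERT INTO chunks VALUES (?, ?, ?, ?, ?)",
    ("vendor_7:p3:c1", "vendor_7", "p3", text, hashlib.sha256(text.encode()).hexdigest()),
)
conn.execute(
    "INSERT INTO retrieval_events VALUES (?, ?, ?, ?, ?, ?)",
    ("req_001", "vendor_7:p3:c1", 1, 0.91, "vendor payment terms", "v12"),
)

# Audit trail: which evidence was retrieved for this request, in rank order?
audit = conn.execute(
    """
    SELECT e.rank, e.score, e.prompt_version, c.document_id, c.locator, c.text
    FROM retrieval_events AS e
    JOIN chunks AS c USING (chunk_id)
    WHERE e.request_id = ?
    ORDER BY e.rank
    """,
    ("req_001",),
).fetchall()
```

The join answers "why did it say net 30" with a document, a locator, the exact text, and the prompt version that was live at the time.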

This provenance layer also unlocks deterministic checks. If the model says an invoice total came from invoice_882:p1:c2 and that chunk doesn't contain a currency pattern matched by r"\b(?:EUR|USD|CZK)\s?\d+[\d.,]*\b", reject it. Cheap, boring, effective.
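That check is small enough to show in full. The regex is the one quoted above; the `total_claim_is_grounded` helper and the sample chunks are hypothetical.

```python
import re

CURRENCY = re.compile(r"\b(?:EUR|USD|CZK)\s?\d+[\d.,]*\b")


def total_claim_is_grounded(evidence_id, chunk_store):
    """Reject a claimed invoice total unless the cited chunk contains an amount."""
    chunk = chunk_store.get(evidence_id)
    if chunk is None:
        return False  # unknown evidence id: fail closed
    return CURRENCY.search(chunk) is not None


chunks = {
    "invoice_882:p1:c2": "Total due: EUR 1,743.00",
    "invoice_882:p1:c1": "Bill to: Acme s.r.o., Prague",
}
grounded = total_claim_is_grounded("invoice_882:p1:c2", chunks)
ungrounded = total_claim_is_grounded("invoice_882:p1:c1", chunks)
missing = total_claim_is_grounded("invoice_882:p9:c9", chunks)
```

Note that this pattern expects the currency code before the amount, so a chunk that writes "1,743.00 EUR" would be rejected. Add a second alternative if your documents use both orders.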

add a cheap judge

The main model should not be the final authority on whether its own answer is supported. Self-grading works just well enough to trick people in benchmarks and just badly enough to burn you in production. Use a second step. A contrastive re-ranker, a small verifier model, or both.

Contrastive re-ranking is good for narrowing evidence before generation or before approval of a claimed citation set. Cross-encoders like bge-reranker-large or ms-marco-MiniLM-L-6-v2 are still useful, especially if you're running on your own hardware and care about predictable cost. For each claim, score candidate chunks against the claim text, then compare the best supporting chunk with the cited chunk. If the cited chunk isn't near the top, mark the claim as weak. That's one pass, cheap enough to run per answer, and far more reliable than trusting the generator's citation formatting.
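The comparison itself is independent of which scorer you plug in. In the sketch below, `overlap_score` is a toy word-overlap function that exists only so the example runs without model downloads; in practice you would pass a cross-encoder's scoring function instead. The `citation_is_weak` name, the `top_k` parameter, and the sample chunks are assumptions.

```python
import re
from typing import Callable, Dict


def overlap_score(claim: str, chunk: str) -> float:
    """Toy scorer used only so the sketch runs; swap in a cross-encoder."""
    a = set(re.findall(r"\w+", claim.lower()))
    b = set(re.findall(r"\w+", chunk.lower()))
    return len(a & b) / len(a) if a else 0.0


def citation_is_weak(
    claim_text: str,
    cited_id: str,
    candidates: Dict[str, str],
    score: Callable[[str, str], float] = overlap_score,
    top_k: int = 1,
) -> bool:
    """True when the cited chunk is not among the top_k best-supporting chunks."""
    ranked = sorted(
        candidates,
        key=lambda cid: score(claim_text, candidates[cid]),
        reverse=True,
    )
    return cited_id not in ranked[:top_k]


candidates = {
    "c1": "The agreement renews automatically for 12 months unless terminated.",
    "c2": "Invoices are payable within 30 days.",
    "c3": "Either party may terminate with 60 days notice.",
}
claim = "The contract renews automatically for 12 months."
good_citation = citation_is_weak(claim, "c1", candidates)
bad_citation = citation_is_weak(claim, "c3", candidates)
```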

A lightweight verifier model comes next. This can be another LLM call, kept narrow and structured, or a fine-tuned classifier if your domain is stable. Ask it one question only: does claim X follow from evidence Y? Constrain the answer to labels like supported, contradicted, insufficient. Keep temperature at 0, require JSON, and fail closed on parse errors. A verifier prompt should be brutally specific:

```json
{
  "task": "Assess support for a claim from evidence.",
  "labels": ["supported", "contradicted", "insufficient"],
  "claim": "The SLA guarantees 99.95% uptime.",
  "evidence": "Section 4.2: Service availability target is 99.9% per calendar month.",
  "rules": [
    "Do not use outside knowledge.",
    "Numeric mismatches are contradictions.",
    "If evidence is incomplete, return insufficient."
  ]
}
```
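Failing closed on the verifier's reply can be as simple as the sketch below. It assumes the verifier answers with a JSON object carrying a `label` key, which is a made-up response shape, and it treats anything unparseable or off-list as unsupported.

```python
import json

ALLOWED_LABELS = {"supported", "contradicted", "insufficient"}


def parse_verdict(raw: str) -> str:
    """Return the verifier label, or 'insufficient' when the output is unusable."""
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return "insufficient"  # fail closed on parse errors
    label = data.get("label") if isinstance(data, dict) else None
    if isinstance(label, str) and label in ALLOWED_LABELS:
        return label
    return "insufficient"  # unknown or missing label: also fail closed


clean = parse_verdict('{"label": "contradicted"}')
chatty = parse_verdict("Sure! The claim is supported.")
off_list = parse_verdict('{"label": "maybe"}')
wrong_shape = parse_verdict('["supported"]')
```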

If you want a practical thresholding rule, start with this: block any action if any high-risk claim is labeled contradicted or if more than 20% of claims are insufficient. Tune later. You don't need elegance on day one, you need a gate that stops obvious garbage.
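Written down, that starting rule is a handful of lines. The `should_block` name and the claim dictionary shape are assumptions, and blocking when there are no claims at all is an extra choice made here, not part of the rule as stated.

```python
def should_block(claims, max_insufficient_ratio=0.20):
    """Apply the starting gate to a list of {'risk': ..., 'label': ...} claims."""
    if not claims:
        return True  # assumption: nothing verified means nothing to act on
    if any(c["risk"] == "high" and c["label"] == "contradicted" for c in claims):
        return True
    insufficient = sum(1 for c in claims if c["label"] == "insufficient")
    return insufficient / len(claims) > max_insufficient_ratio


supported = {"risk": "low", "label": "supported"}
unsure = {"risk": "low", "label": "insufficient"}
wrong = {"risk": "high", "label": "contradicted"}

one_in_five = should_block([supported] * 4 + [unsure])      # exactly 20%: allowed
two_in_five = should_block([supported] * 3 + [unsure] * 2)  # 40%: blocked
contradicted = should_block([supported] * 9 + [wrong])      # high-risk miss: blocked
```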

Teams worry this extra pass will hurt latency. Usually it adds a few hundred milliseconds, which is cheaper than sending a bad refund email, updating the wrong CRM field, or telling a customer their contract expires next week when it actually auto-renewed for a year.

deterministic paths win often

A depressing amount of LLM workflow design is just people refusing to admit they need a parser. If the data already lives in PostgreSQL, don't ask a model to remember the sales rep's quota attainment; write the SQL. If a customer portal needs policy numbers extracted from uploads, define the schema and use deterministic extraction first, then send only ambiguous fields to the model. You'll get lower cost, fewer incidents, and cleaner debugging.

This split architecture works well. Put an orchestrator in front, classify the task, and route to one of three paths: deterministic query, deterministic parser plus validator, or retrieval plus generation plus verification. We do this a lot with Django backends and Next.js frontends, because most user requests inside internal tools are repetitive enough to classify reliably. A question like "show unpaid invoices from February for Acme" should become a parameterized query against a known view. A request like "summarize dispute risk in these six vendor contracts" belongs on the LLM path.
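A first version of that router can be embarrassingly simple. The keyword rules below are hypothetical and only meant to show the three-way split; a production classifier might be a small model or a richer rule set.

```python
import re
from enum import Enum


class Path(Enum):
    QUERY = "deterministic_query"
    PARSER = "deterministic_parser_plus_validator"
    LLM = "retrieval_generation_verification"


# Hypothetical keyword rules, checked in order.
QUERY_RE = re.compile(r"^\s*(show|list|count|how many)\b", re.IGNORECASE)
PARSER_RE = re.compile(r"^\s*(extract|parse)\b", re.IGNORECASE)


def route(request: str) -> Path:
    """Pick a path for the request; anything unrecognized goes to the LLM path."""
    if QUERY_RE.search(request):
        return Path.QUERY
    if PARSER_RE.search(request):
        return Path.PARSER
    return Path.LLM


invoices = route("show unpaid invoices from February for Acme")
contracts = route("summarize dispute risk in these six vendor contracts")
uploads = route("Extract policy numbers from this upload")
```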

Guardrails matter at the action boundary too. If the model proposes a SQL query, execute it only against a restricted read-only role, parse the AST with sqlglot or sqlparse, and reject writes outright. If it emits JSON for an automation step, validate with Pydantic v2 and surface the exact failure. Real errors help. This kind of message is useful:

```text
1 validation error for LeadUpdate
next_contact_date
  Input should be a valid date or datetime, input_value='next Thursday-ish', input_type=str
```

This kind isn't:

```text
Something went wrong processing your request.
```
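The same principle applies to the read-only boundary for model-proposed SQL: let the database refuse the write, and pass its exact error along. The sketch below uses SQLite's `PRAGMA query_only` as a stand-in for a restricted PostgreSQL role, so treat it as an illustration of the idea and not as the setup described above. The `run_model_sql` helper and the `leads` table are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE leads (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO leads (name) VALUES ('Acme')")
conn.commit()

# From here on the connection rejects writes, much like a read-only role would.
conn.execute("PRAGMA query_only = ON")


def run_model_sql(sql: str):
    """Execute model-proposed SQL and return (rows, error) with the exact failure."""
    try:
        return conn.execute(sql).fetchall(), None
    except sqlite3.Error as exc:
        return None, f"{type(exc).__name__}: {exc}"


rows, read_error = run_model_sql("SELECT name FROM leads")
_, write_error = run_model_sql("DELETE FROM leads")
remaining = conn.execute("SELECT COUNT(*) FROM leads").fetchone()[0]
```

The write attempt comes back as a specific database error that you can log and show, and the row is still there afterwards.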

Deterministic fallbacks also make incident response sane. When a model path starts failing after a provider update, you can degrade gracefully to search results, raw evidence snippets, or a parser-only mode instead of taking the whole feature offline. Your users will tolerate a blunter answer. They won't tolerate fabricated ones.

wire it like production

The pipeline I'd ship today for a production assistant or document workflow is straightforward, and yes, straightforward beats clever almost every time.

Start with request classification. Decide whether the task is pure retrieval, extraction, transformation, or an action request. Run deterministic handlers first where possible. For LLM-eligible requests, retrieve evidence with stable chunk ids, then re-rank the top set with a cross-encoder. Generate a structured answer first, not prose, with claim objects tied to evidence ids. Collect token-level signals during generation where supported, then run anomaly checks over spans containing risky entities. Feed each claim and cited evidence into a verifier. If verification passes, render the user-facing answer or execute the bounded action. If it fails, degrade to quoted evidence, ask a clarification question, or switch to a deterministic path.
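The LLM path of that pipeline can be wired as one function with pluggable stages. This is a sketch, not a reference implementation: `answer_request` and the stub retriever, generator, and verifier are all hypothetical, and the toy verifier only compares percentages so the example runs without any model.

```python
import re


def answer_request(question, retrieve, generate, verify):
    """Run retrieve -> generate -> verify and degrade to quoted evidence on failure."""
    evidence = retrieve(question)           # {chunk_id: text}
    claims = generate(question, evidence)   # [{"text", "evidence_ids", "risk"}]
    degraded = {"mode": "quoted_evidence", "evidence": evidence}
    if not claims:
        return degraded
    for claim in claims:
        ids = claim["evidence_ids"]
        if not ids or any(e not in evidence for e in ids):
            return degraded  # missing or fabricated citation
        cited_text = " ".join(evidence[e] for e in ids)
        if verify(claim["text"], cited_text) != "supported":
            return degraded
    return {"mode": "answer", "claims": claims}


# Stubs standing in for the retriever, generator, and verifier.
def retrieve(question):
    return {"doc_1:p4:c2": "Service availability target is 99.9% per calendar month."}


def make_generator(claim_text):
    def generate(question, evidence):
        return [{"text": claim_text, "evidence_ids": ["doc_1:p4:c2"], "risk": "high"}]
    return generate


def verify(claim_text, evidence_text):
    """Toy check: every percentage in the claim must appear in the evidence."""
    pct = r"\d+(?:\.\d+)?%"
    claimed = set(re.findall(pct, claim_text))
    if not claimed:
        return "insufficient"
    return "supported" if claimed <= set(re.findall(pct, evidence_text)) else "contradicted"


question = "What is the uptime target?"
good = answer_request(question, retrieve, make_generator("The availability target is 99.9% per month."), verify)
bad = answer_request(question, retrieve, make_generator("The SLA guarantees 99.95% uptime."), verify)
```

The fabricated 99.95% figure never reaches the user; the request falls back to showing the retrieved evidence instead.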

Logging needs the same discipline as payments code. Store request_id, user id, model name, provider response id, prompt template version, retrieved chunk ids, re-ranker scores, verifier labels, token anomaly metrics, and final disposition. Keep all of it queryable. A simple PostgreSQL schema is enough unless you're already deep into an observability stack. You don't need ten vendors to answer a question like: which prompt version started fabricating cancellation dates after we changed chunk size from 600 to 1200 tokens?

One operational detail people miss: version every prompt and every schema. If the generator emits claims[].evidence_ids[] in v12 and claims[].citations[] in v13, your verifier and dashboards need to know which one they are reading. We learned this the annoying way on a document processing pipeline where a small schema tweak silently bypassed a citation checker for two days. Nothing exploded, because the deterministic action gate blocked updates with missing evidence, which is exactly the kind of boring safeguard you want in place before launch.
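One way to keep a checker from being bypassed like that is to make it read the citation field by schema version and refuse anything it does not recognize. The sketch below is hypothetical: the `evidence_ids_for` helper and the version table are assumptions built from the v12 and v13 example above.

```python
FIELD_BY_VERSION = {"v12": "evidence_ids", "v13": "citations"}


def evidence_ids_for(claim: dict, schema_version: str) -> list:
    """Read a claim's citations for a known schema version, or raise loudly."""
    field = FIELD_BY_VERSION.get(schema_version)
    if field is None:
        raise ValueError(f"unknown schema version: {schema_version}")
    ids = claim.get(field)
    if not ids:
        # An empty or missing field is an error, never a silent pass.
        raise ValueError(f"{schema_version} claim has no '{field}'")
    return list(ids)


v12_ids = evidence_ids_for({"evidence_ids": ["doc_184:p12:c03"]}, "v12")
v13_ids = evidence_ids_for({"citations": ["doc_184:p12:c03"]}, "v13")

try:
    evidence_ids_for({"citations": ["doc_184:p12:c03"]}, "v12")
    mismatch_error = None
except ValueError as exc:
    mismatch_error = str(exc)
```

A v13 payload read with v12 expectations now fails with a message naming the missing field, instead of looking like a claim with nothing to check.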

Treat hallucinations as instrumentable failures, the same way you'd treat timeouts, deadlocks, or cache stampedes. Measure them, gate them, and route around them. The teams that do this ship faster, mostly because they stop having the same argument every sprint about whether the model can be trusted.
