the failure mode is predictable
Most production RAG systems fail long before the model says something embarrassing in front of a customer. They fail the moment somebody asks a simple question like, "which source document justified this answer, which version of that document was active at the time, and why did retrieval rank chunk 184 above chunk 203?" If your stack can't answer that in a few seconds, with IDs, timestamps, and deterministic retrieval logs, you don't have a production system, you have a demo with embeddings.
Teams keep making the same mistake. They start with a polished vector database, wire up a generic chunker, dump PDFs into it, call text-embedding-3-large or text-embedding-3-small, then act surprised when the assistant invents contract clauses, merges two CRM accounts into one person, or cites a policy revision that was superseded three months ago. None of this is mysterious. Retrieval quality is usually mediocre because chunking is naïve, metadata is shallow, and ranking is treated like a single cosine search should somehow understand recency, document type, tenant boundaries, and business criticality.
The real problem is that people treat vectors like magic. They're just coordinates. Useful coordinates, sure, but still only one signal. In document-heavy systems, especially ERP, CRM, support portals, or internal copilots, the hard part isn't generating an answer, it's preserving traceability under ugly real-world conditions: duplicate documents, OCR noise, tenant isolation, partial updates, permission filters, legal retention rules, and product managers who will definitely ask for "show me exactly where this came from" after the first bad response.
We've built enough AI-assisted document workflows at Steezr to stop romanticizing fancy retrieval stacks. PostgreSQL 16 with pgvector 0.7.x, plain relational metadata, generated tsvectors, and a retrieval pipeline you can explain to an auditor beats a black-box vector setup almost every time. Maybe not in benchmark theater. Definitely in production.
start with postgres
A pgvector-first setup survives audit because PostgreSQL already knows how to do the boring parts that actually matter: transactions, constraints, filtering, row-level security, versioning, JSONB metadata, full-text search, and decent observability. You don't need one system for metadata, another for vectors, then a queue of brittle sync jobs pretending eventual consistency is fine. It isn't fine when retrieval must respect document ACLs or legal retention windows.
A schema like this gets you most of the way there:
create extension if not exists vector;
create table documents (
id uuid primary key,
tenant_id uuid not null,
source_type text not null,
source_ref text not null,
title text,
checksum text not null,
version int not null,
created_at timestamptz not null default now(),
superseded_at timestamptz,
metadata jsonb not null default '{}'
);
create table document_chunks (
id uuid primary key,
document_id uuid not null references documents(id) on delete cascade,
tenant_id uuid not null,
chunk_index int not null,
content text not null,
token_count int not null,
page_number int,
section_path text,
char_start int,
char_end int,
embedding vector(1536),
embedding_model text not null,
tsv tsvector generated always as (to_tsvector('english', content)) stored,
metadata jsonb not null default '{}'
);
create index on document_chunks using hnsw (embedding vector_cosine_ops);
create index on document_chunks using gin (tsv);
create index on document_chunks (tenant_id, document_id, chunk_index);

That schema does two things most RAG stacks skip. First, it stores chunk-level provenance: page number, section path, character offsets, source reference, version. Second, it forces embedding reproducibility into the data model. The embedding_model column matters. If you silently re-embed half your corpus with a different model, your retrieval quality drifts, nobody notices, and your evaluation numbers become fiction.
Store checksums for source files, version documents explicitly, and never mutate chunk content in place without producing a new version. If legal or compliance asks why an answer changed between April 3 and April 29, you need to reconstruct the exact retrieval set used at that time. PostgreSQL makes that straightforward. Most "AI-native" stores make it annoying.
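The re-ingestion decision can be a tiny pure function. A minimal sketch, assuming a `DocVersion` shape and `next_version` helper of our own invention (not from any library): byte-identical files keep their version, anything else gets a new one, and chunk content is never mutated in place.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional

@dataclass
class DocVersion:
    checksum: str
    version: int

def sha256_checksum(raw: bytes) -> str:
    """Checksum of the raw source file, stored on the documents row."""
    return hashlib.sha256(raw).hexdigest()

def next_version(raw: bytes, current: Optional[DocVersion]) -> Optional[DocVersion]:
    """Decide whether re-ingestion should create a new document version.

    Returns None when the file is byte-identical to the current version,
    so existing chunks and embeddings stay untouched.
    """
    checksum = sha256_checksum(raw)
    if current is not None and current.checksum == checksum:
        return None  # unchanged: keep the existing version and its chunks
    version = 1 if current is None else current.version + 1
    return DocVersion(checksum=checksum, version=version)
```

On a new version, insert fresh documents and document_chunks rows in one transaction and set superseded_at on the old row; reconstruction for a given date is then a plain query.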
retrieval needs two signals
Cosine similarity alone is weak retrieval for messy business data. Dense embeddings are good at semantic closeness, terrible at exact identifiers, brittle around acronyms, and frequently confused by templates where every contract or ticket looks structurally similar. BM25 or PostgreSQL full-text search catches the literal terms that embeddings miss: invoice IDs, product SKUs, customer names, clause numbers, error codes like psycopg.errors.UndefinedTable, all the little pieces that matter in production.
Use hybrid retrieval, then rerank deterministically. "Deterministic" here means you can run the same query against the same snapshot and get the same candidate set. No hidden LLM reranker in the critical path unless you can tolerate variance and cost. In most business assistants, you can't.
A very workable pattern is this:
1. Apply hard filters first: tenant, ACL, current document version, document type, date window.
2. Pull top 50 by BM25 or ts_rank_cd.
3. Pull top 50 by vector cosine.
4. Union the candidate set.
5. Score with a simple weighted formula you can tune and log.
For example:
select
c.id,
c.document_id,
0.45 * ts_rank_cd(c.tsv, websearch_to_tsquery('english', $1)) +
0.55 * (1 - (c.embedding <=> $2::vector)) as final_score
from document_chunks c
join documents d on d.id = c.document_id
where c.tenant_id = $3
and d.superseded_at is null
and (c.metadata->>'doc_type') = any($4)
order by final_score desc
limit 12;

You'll tune the weights based on your corpus. Legal contracts usually need more lexical weight. CRM notes often benefit from more dense similarity because people write in shorthand and fragments. Keep the formula boring. Boring wins audits.
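When you run the two top-50 queries separately instead of one combined scan, steps 4 and 5 live in app code. A minimal sketch, assuming a `hybrid_rerank` helper of our own naming, fed by dicts of chunk_id to score from the lexical and vector queries; ties break on chunk_id so the same snapshot always yields the same ordering.

```python
def hybrid_rerank(lexical, vector, w_lex=0.45, w_vec=0.55, limit=12):
    """Deterministically union and score hybrid candidates.

    lexical / vector: dicts mapping chunk_id -> score from the two
    separate top-N queries (ts_rank_cd and cosine similarity).
    A chunk missing from one signal scores 0.0 on that signal.
    """
    candidates = set(lexical) | set(vector)  # step 4: union
    scored = [
        (w_lex * lexical.get(cid, 0.0) + w_vec * vector.get(cid, 0.0), cid)
        for cid in candidates
    ]
    scored.sort(key=lambda t: (-t[0], t[1]))  # score desc, then id: deterministic
    return scored[:limit]
```

Because the weights live in one place, logging them per query is trivial, which is exactly what you'll want when tuning.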
Chunking also needs discipline. Fixed 1,000 token chunks with 200 overlap is lazy and often bad. Split on headings, table boundaries, speaker turns, or semantic sections, then cap chunk size. A policy PDF wants different chunking from a CRM timeline or a Zendesk export. If your ingestion treats them all the same, retrieval quality will flatten out at mediocre and stay there.
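A sketch of heading-aware chunking with a size cap, assuming markdown-style headings and whitespace-word counting as a stand-in for real tokenization (a production pipeline would split per document type and count with the embedding model's tokenizer):

```python
def chunk_by_headings(text, max_tokens=400):
    """Split on markdown-style headings, then cap each section's size.

    Sections stay intact when they fit; oversized sections are split
    into max_tokens-word windows. Word count approximates tokens here.
    """
    sections, current = [], []
    for line in text.splitlines():
        if line.startswith("#") and current:
            sections.append("\n".join(current))  # flush previous section
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))

    chunks = []
    for section in sections:
        words = section.split()
        for i in range(0, len(words), max_tokens):
            chunks.append(" ".join(words[i : i + max_tokens]))
    return chunks
```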
ground every answer
RAG systems die in production because the generator gets too much freedom. If the prompt says "answer based on the context" and the context contains partially relevant chunks, the model will fill gaps with plausible nonsense. GPT-4.1, Claude 3.7 Sonnet, Gemini 2.0 Flash, all of them do this. The model isn't broken, your constraints are weak.
Each answer needs explicit source binding. We usually require every factual claim to map back to one or more chunk IDs, and we store the final answer with citations like doc_8f3/page_14/chunk_03. If the model produces a sentence with no citation coverage, that sentence gets flagged. In stricter workflows, especially finance or healthcare-adjacent internal tools, we reject the whole answer and route it for review.
A practical prompt contract looks like this:
{
"answer": "...",
"citations": [
{"claim": "Payment terms are net 30.", "chunk_ids": ["...", "..."]}
],
"confidence": 0.82,
"needs_human_review": false
}

Then validate it in code. Don't trust the model's self-reported confidence; use it as one input. Compute your own checks. Citation density, unsupported claims, chunk relevance spread, contradiction signals across retrieved chunks, and answer length relative to evidence all correlate surprisingly well with hallucination risk. If the assistant answers a narrow question with a 500-word synthesis supported by two tiny snippets, that's suspicious.
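The structural part of that validation fits in a few lines. A minimal sketch, assuming a `validate_answer` helper of our own naming that checks the parsed JSON contract against the chunk IDs actually retrieved for this query:

```python
def validate_answer(payload, retrieved_chunk_ids):
    """Check a structured answer against the retrieval set.

    Flags claims with no citations and citations pointing at chunks
    that were never retrieved. Returns (ok, problems).
    """
    problems = []
    citations = payload.get("citations", [])
    if not citations:
        problems.append("no citations at all")
    for c in citations:
        if not c.get("chunk_ids"):
            problems.append(f"uncited claim: {c.get('claim', '?')}")
        for cid in c.get("chunk_ids", []):
            if cid not in retrieved_chunk_ids:
                problems.append(f"citation outside retrieval set: {cid}")
    return (not problems, problems)
```

Any failure here is mechanical, not a judgment call, so it can hard-reject before a human ever sees the answer.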
We've had good results with a simple post-generation verifier: re-run each claim against the retrieved chunks using a smaller model, or even a rule-based checker for exact fields, and mark unsupported claims. This catches obvious nonsense cheaply. Human review should be part of the design, not an embarrassing fallback added after incident number three. Queue the original question, retrieved chunks, answer draft, and citation map in one review screen. HTMX is perfectly fine for building that UI, no need to overengineer it.
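For the rule-based side of that verifier, one cheap heuristic we can sketch (the `claim_supported` helper and its regex are illustrative, not a library API): a claim with numbers or identifier-like tokens only counts as supported if those tokens appear verbatim in the cited chunks.

```python
import re

def claim_supported(claim, chunk_texts):
    """Rule-based support check for exact-field claims.

    Supported means every number-bearing token in the claim appears
    verbatim in at least one cited chunk. Deliberately crude: it
    catches fabricated figures cheaply and leaves fuzzier claims to
    the model-based verifier.
    """
    tokens = re.findall(r"\b\d[\w.%-]*\b", claim)  # amounts, SKUs, clause numbers
    if not tokens:
        return True  # nothing exact to verify with rules
    corpus = " ".join(chunk_texts)
    return all(tok in corpus for tok in tokens)
```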
evaluation is the whole job
Most teams evaluate RAG by asking five internal questions, seeing decent answers, then shipping. That's how you end up debugging hallucinations with a customer success manager on a Friday evening. Retrieval systems need offline evaluation, online monitoring, and business SLAs tied to specific failure classes.
Start with a test set built from real queries, not synthetic prompt candy. Sample from support tickets, CRM searches, internal knowledge requests, procurement docs, wherever the assistant will actually operate. Label the relevant chunks, or at least the relevant documents, then measure recall@k and precision@k before you even think about prompt tuning. If recall@10 is bad, generation quality doesn't matter because the model never saw the right evidence.
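Both metrics are a few lines each; computing them yourself removes any excuse to skip them. A minimal sketch over a ranked list of retrieved chunk IDs and a labeled relevant set:

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of labeled-relevant chunks that appear in the top k."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top k that are labeled relevant."""
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / k
```

Run them per labeled query and report the mean; a single aggregate over pooled chunks hides the queries where retrieval found nothing.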
Set explicit thresholds. For example, an assistant answering policy questions for internal operations might require recall@8 above 0.92 on current-version documents, citation coverage above 0.95, and unsupported-claim rate below 0.02. A sales copilot summarizing CRM notes can tolerate more fuzziness, though you'll still want hard guards around account ownership, renewal dates, and quoted pricing.
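Thresholds only bite if something enforces them. A sketch of a release gate under assumed names (`passes_slas`, and thresholds expressed as floors for recall and coverage, ceilings for unsupported-claim rate):

```python
def passes_slas(metrics, thresholds):
    """Return the list of violated SLAs; empty list means ship.

    thresholds maps metric name -> (direction, limit), where direction
    is "min" for floors (recall, citation coverage) and "max" for
    ceilings (unsupported-claim rate).
    """
    violations = []
    for name, (direction, limit) in thresholds.items():
        value = metrics.get(name)
        if value is None:
            violations.append(f"{name}: missing")
        elif direction == "min" and value < limit:
            violations.append(f"{name}: {value} < {limit}")
        elif direction == "max" and value > limit:
            violations.append(f"{name}: {value} > {limit}")
    return violations
```

Wire it into CI so a chunking or weight change that drops recall fails the build instead of quietly degrading production.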
Log every retrieval event. Query text, normalized query, embedding model, candidate chunk IDs, lexical score, vector score, final score, chosen citations, review outcome, user correction. Without that trail, debugging is guesswork. With it, you can answer concrete questions like, "did retrieval fail because BM25 missed the SKU, because vector search preferred a semantically similar policy, or because the prompt let the model generalize past the evidence?"
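One JSON record per retrieval is enough. A sketch of the shape, with a `retrieval_event` helper name of our own choosing; the point is that every field needed to replay the ranking decision is captured at query time:

```python
import json
import uuid
from datetime import datetime, timezone

def retrieval_event(query, normalized_query, embedding_model,
                    candidates, chosen_citations):
    """Serialize one retrieval decision for the audit log.

    candidates: list of dicts, each with chunk_id, lexical_score,
    vector_score, and final_score as logged by the reranker.
    """
    return json.dumps({
        "event_id": str(uuid.uuid4()),
        "ts": datetime.now(timezone.utc).isoformat(),
        "query": query,
        "normalized_query": normalized_query,
        "embedding_model": embedding_model,
        "candidates": candidates,
        "chosen_citations": chosen_citations,
    })
```

Append review outcomes and user corrections to the same event ID later, and the "why did retrieval prefer that chunk" question becomes a lookup, not an investigation.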
One ugly truth: evaluation datasets rot fast. Documents change, customers rename products, support teams invent new shorthand. Re-run evaluation whenever you change chunking, embedding model, rerank weights, parser versions, or access-control logic. If the corpus changes materially, your benchmark changed too. Pretending otherwise is how teams convince themselves quality is stable while users quietly stop trusting the assistant.
the recipe we keep shipping
The production-safe recipe is boring on purpose. PostgreSQL 16.4, pgvector 0.7.4, Django or FastAPI on the app side, Celery or plain Postgres-backed jobs for ingestion, object storage for raw files, and a retrieval path you can print on one page. Raw document goes in, parser extracts structured text, chunker emits section-aware chunks, embeddings are generated with a pinned model version, chunks are inserted transactionally with provenance fields, hybrid retrieval pulls candidates, deterministic rerank orders them, generation returns structured JSON with citations, verifier checks support, then answers either go back to the user or into a review queue.
Nothing exotic there. That's the point.
For teams building document assistants, CRM copilots, or internal knowledge tools, this setup is easier to operate than a stack made of five specialist services glued together by webhooks. It also maps cleanly onto the way senior engineers already think about systems, one source of truth, explicit schemas, strong filters, observable pipelines, reversible migrations. If you already run PostgreSQL, adding pgvector is a smaller operational jump than introducing another stateful service with its own backup, security, and consistency model.
The biggest win is social, not technical. Product, legal, support, and engineering can all reason about the system. You can show the exact chunk, the exact document version, the exact score, and the exact validation outcome. Auditors like that. Customers like that. Engineers like sleeping through the night.
RAG doesn't need more magic. It needs source discipline, ranking discipline, and evaluation discipline. Put vectors inside a system that respects those constraints and the whole thing gets dramatically less fragile.
