silent failure is the default
Most teams shipping RAG features still test them like a demo: they click around in a staging UI, ask three friendly questions, see a plausible answer with a citation badge, then merge. A week later somebody bumps gpt-4.1-mini to a newer snapshot, rotates embeddings from text-embedding-3-small to text-embedding-3-large, rebuilds pgvector indexes with different chunk boundaries, and the whole thing degrades without throwing a single exception. No 500s, no crash loops, no red line on the Grafana dashboard, just worse answers.
That kind of failure is nasty because every layer keeps returning valid-looking data. PostgreSQL 16 with pgvector 0.7.4 will happily return nearest neighbors, your reranker will happily sort them, the LLM will happily produce fluent nonsense, and your product team will happily assume the system is fine until support tickets start piling up. We see this in document processing and customer portal work all the time at Steezr, especially once a feature moves from a narrow pilot to a corpus with messy PDFs, duplicated policies, stale CRM exports, and users who ask questions the prompt writer never imagined.
You need regression tests that treat the pipeline as software, not theater. Deterministic snapshots for stable cases, generated counterexamples for ugly edge cases, and contract tests around retrieval and attribution so the system can fail loudly before production does. If your only quality control is eyeballing traces in LangSmith or OpenAI logs, you don't have a test strategy, you have a ritual.
freeze what can be frozen
Start with the pieces you can actually make deterministic. Set temperature=0, top_p=1, fix the model version if the provider allows it, pin your system prompt, pin chunking logic, pin retrieval parameters, and store the exact inputs that produced a known-good output. This sounds obvious, yet teams skip half of it and then complain that snapshots are flaky. Of course they're flaky, you changed three variables and expected the test to tell you which one mattered.
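One cheap way to enforce that discipline is to collect every pinned setting in a single immutable object, so a test can't quietly drift one knob at a time. A minimal sketch, with illustrative field names and values rather than a required schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PinnedRunConfig:
    """Everything that must be held constant for a reproducible run.
    Field names and defaults here are illustrative, not a required schema."""
    model_name: str = "gpt-4.1-mini-2025-02-14"
    temperature: float = 0.0
    top_p: float = 1.0
    prompt_version: str = "answer_v12"
    chunker_version: str = "chunker_v3"
    top_k: int = 6
    document_set_version: str = "corpus_2026_03_12_1"

# frozen=True makes any mid-run mutation raise FrozenInstanceError,
# which is exactly the discipline snapshot tests need.
config = PinnedRunConfig()
```

Pass the whole object into the pipeline under test instead of scattering literals across fixtures; when a snapshot fails, the config tells you what was actually held constant.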
For a Python stack, a snapshot test can be brutally simple. We usually keep golden cases in Postgres because the data already lives there and versioning rows is easier than people think. A table like this works fine:
```sql
create table rag_golden_cases (
  id uuid primary key,
  case_name text not null unique,
  question text not null,
  document_set_version text not null,
  prompt_version text not null,
  model_name text not null,
  expected_answer text not null,
  expected_citations jsonb not null,
  created_at timestamptz not null default now()
);
```

Then in pytest, hit the pipeline with frozen settings and compare normalized output, not raw string equality unless the task is tiny and tightly bounded. Normalize whitespace, strip volatile timestamps, maybe lowercase citation labels if your renderer changes formatting. Keep the comparison strict enough to catch drift and loose enough to avoid nonsense failures from harmless punctuation. For example:
```python
def test_policy_refund_snapshot(rag_app, golden_case):
    result = rag_app.answer(
        question=golden_case.question,
        temperature=0,
        top_k=6,
        rerank=True,
        prompt_version=golden_case.prompt_version,
        document_set_version=golden_case.document_set_version,
        model=golden_case.model_name,
    )
    assert normalize(result.answer) == normalize(golden_case.expected_answer)
    assert result.citations == golden_case.expected_citations
```

This catches prompt drift immediately. It also forces discipline around versioning, which most RAG codebases badly need. If a developer can't tell you which prompt template and which document set produced a passing test, the pipeline is already too slippery to operate.
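The normalize helper in a test like that can stay small. A sketch that strips volatile timestamps and collapses whitespace, where the timestamp pattern is an assumption about your output format and should match whatever your renderer actually emits:

```python
import re

# ISO-8601-ish timestamps are the usual volatile noise in rendered answers;
# adjust the pattern to what your pipeline actually produces.
_TIMESTAMP = re.compile(
    r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}(:\d{2})?(Z|[+-]\d{2}:\d{2})?"
)

def normalize(text: str) -> str:
    text = _TIMESTAMP.sub("<timestamp>", text)  # strip volatile timestamps
    text = re.sub(r"\s+", " ", text).strip()    # collapse whitespace
    return text.lower()                         # tolerate casing drift
```

Keep it boring on purpose; every clever normalization rule is one more way for a real regression to slip through as "harmless formatting".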
synthetic counterexamples beat happy paths
Snapshot tests only cover known-good examples. They won't save you from the weird inputs that users invent five minutes after launch. You need generated cases that actively try to break retrieval, stuffing edge cases into places where the pipeline tends to lie.
Hypothesis is perfect for this, even if the property-based crowd sometimes oversells it. Use it to generate adversarial prompts around ambiguity, near-duplicate entities, negation, date boundaries, OCR garbage, and citation bait. If your corpus contains "ACME Standard Plan" and "ACME Standard Plus Plan", generate prompts that force disambiguation. If your source docs mix 2023-01-02 and 02/01/2023, generate locale confusion. If some PDFs contain headers like Page 1 of 42 CONFIDENTIAL, inject questions that tempt the retriever toward junk chunks.
A rough example:
```python
from hypothesis import given, strategies as st

company = st.sampled_from(["ACME", "Acmé", "Acme Ltd"])
plan = st.sampled_from(["Standard", "Standard Plus", "Enterprise"])
year = st.integers(min_value=2021, max_value=2026)
negation = st.sampled_from(["can", "cannot", "is allowed to", "is not allowed to"])

@given(company=company, plan=plan, year=year, negation=negation)
def test_refund_policy_disambiguation(company, plan, year, negation, rag_app):
    q = (
        f"Under the {company} {plan} contract in {year}, "
        f"who {negation} request a refund after renewal? Cite the source."
    )
    result = rag_app.answer(question=q, temperature=0)
    assert no_hallucinated_citations(result)
    assert answer_mentions_scope(result.answer, plan, year)
```

The point isn't beautiful generators. The point is volume and nastiness. You want dozens or hundreds of cheap shots aimed at the exact weak points in your retrieval chain. We do this with document pipelines that extract entities from invoices and contracts too, because generated junk uncovers assumptions faster than polite fixture data ever will. One generated prompt that produces a citation to the wrong tenant's handbook is worth more than twenty green tests on clean marketing copy.
Also, keep every failing generated case. Promote it into the golden suite. That's how the test corpus becomes smarter over time instead of staying stuck at the level of the first sprint demo.
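Promotion can be as simple as building an insert against the rag_golden_cases table defined earlier. A sketch, where the shape of `result` and the helper name are assumptions about your pipeline; execute the returned statement with any DB-API connection such as psycopg:

```python
import json
import uuid

def promote_to_golden(question, result, prompt_version,
                      document_set_version, model_name):
    """Build an insert promoting a failing generated case into the
    rag_golden_cases table. Returns (sql, params) for cur.execute(*...)."""
    return (
        "insert into rag_golden_cases "
        "(id, case_name, question, document_set_version, prompt_version, "
        " model_name, expected_answer, expected_citations) "
        "values (%s, %s, %s, %s, %s, %s, %s, %s) "
        "on conflict (case_name) do nothing",
        (
            str(uuid.uuid4()),
            # uuid5 gives a stable case_name per question, so re-promoting
            # the same generated prompt is a no-op thanks to the conflict clause
            f"generated-{uuid.uuid5(uuid.NAMESPACE_URL, question)}",
            question,
            document_set_version,
            prompt_version,
            model_name,
            result["answer"],
            json.dumps(result["citations"]),
        ),
    )
```

Store the reviewed, corrected answer as `expected_answer`, not the broken one the failure produced; the golden suite encodes what the system should have said.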
contract test the retrieval layer
A RAG pipeline has a hard boundary where retrieval hands evidence to generation. Put contracts there. If you only test the final answer text, you'll miss the most common class of failures: retrieval quietly returning irrelevant chunks that the model then smooths over with confident prose.
A retrieval contract should check at least four things. First, the expected source document appears in the top K. Second, chunk ranking stays above a threshold, either by cosine similarity, reranker score, or both. Third, the retrieved chunk actually contains the facts used in the answer. Fourth, citations map back to stable document IDs and offsets, not display names that break the moment somebody renames employee_handbook_final_v3.pdf.
A concrete example: suppose your retriever returns rows like (document_id, chunk_id, score, content, start_offset, end_offset). Then the test can assert a proper interface instead of vague relevance vibes:
```python
def test_retrieval_contract_refund_policy(retriever, corpus_version):
    hits = retriever.search(
        query="Can a customer request a refund within 30 days of renewal?",
        corpus_version=corpus_version,
        top_k=8,
    )
    assert len(hits) >= 5
    assert hits[0].score >= 0.78
    assert any(h.document_id == "policy-refunds-2024" for h in hits[:3])
    assert any("30 days of renewal" in h.content.lower() for h in hits[:5])
    assert all(h.start_offset < h.end_offset for h in hits)
```

Then test attribution after generation:
```python
def test_attribution_contract(rag_app):
    result = rag_app.answer(
        question="Can a customer request a refund within 30 days of renewal?",
        temperature=0,
    )
    for citation in result.citations:
        chunk = load_chunk(citation.document_id, citation.chunk_id)
        assert cited_span_exists(chunk.content, citation.quote)
```

That last assertion matters more than people admit. Plenty of systems emit citation objects that look legitimate and still don't support the actual answer. I've seen traces where the answer says "refunds are unavailable after renewal" while the cited chunk says the opposite, and the UI still renders a neat little source card as if that solved anything.
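A workable cited_span_exists, assuming citations carry a short verbatim quote, can normalize both sides before substring matching so PDF line breaks don't cause false negatives:

```python
import re

def _squash(text: str) -> str:
    # Collapse whitespace and casing so extraction line breaks and
    # inconsistent spacing don't break the substring check.
    return re.sub(r"\s+", " ", text).strip().lower()

def cited_span_exists(chunk_content: str, quote: str) -> bool:
    # An empty quote never counts as support for the answer.
    if not quote or not quote.strip():
        return False
    return _squash(quote) in _squash(chunk_content)
```

Exact substring matching is deliberately strict; if your pipeline paraphrases quotes, you'd need fuzzy matching here, and that's a design smell worth fixing upstream instead.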
version your corpus like code
Most RAG quality problems get blamed on prompts because prompts are visible. Corpus drift is usually the real culprit. Somebody reparses a PDF with a different OCR setting, changes chunk size from 800 tokens to 400, strips tables during preprocessing, or re-embeds half the collection after a migration, and now retrieval behavior has moved enough that old assumptions are dead.
Treat the document set as a versioned artifact. Store source file hash, extraction pipeline version, chunker version, embedding model, and index settings. If you're on Postgres, this can live next to your content tables without much drama:
```sql
create table document_versions (
  version text primary key,
  extractor_version text not null,
  chunker_version text not null,
  embedding_model text not null,
  source_manifest jsonb not null,
  created_at timestamptz not null default now()
);
```

Then tie every golden test to document_set_version. If a test fails after a corpus rebuild, good, now you know exactly why. You can diff manifests, inspect changed chunks, and decide whether the new behavior is an improvement or a regression. Without this, teams end up with spooky-action debugging, one engineer swears the prompt regressed, another swears OpenAI changed something upstream, and nobody notices that the chunker stopped preserving section headings two deploys ago.
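Diffing manifests is a few lines once they're structured. A sketch that assumes source_manifest maps source file path to content hash, which is one reasonable shape for that jsonb column:

```python
def diff_manifests(old: dict, new: dict) -> dict:
    """Compare two source_manifest payloads (path -> content hash)
    and report which source files were added, removed, or changed."""
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    changed = sorted(p for p in set(old) & set(new) if old[p] != new[p])
    return {"added": added, "removed": removed, "changed": changed}
```

Run this between the corpus version a failing golden test was pinned to and the rebuilt one, and the suspect documents usually fall out immediately.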
We learned this the hard way on document-heavy internal tools where extraction quality mattered as much as the final answer. A single change to table parsing can knock out retrieval for pricing rules, SLA exceptions, or invoice line items, while every API request still returns 200. Corpus versioning turns that mess into something you can reason about. It also gives you a clean promotion path, build corpus 2026-03-12.1, run the suite, inspect diffs, then mark it production-ready.
make failures readable
A red test that says assert False is useless. LLM regression tests need failure output that points to the broken layer in under a minute, otherwise the suite becomes decorative and people stop trusting it.
Print the query, prompt version, corpus version, retrieved chunks with scores, final answer, and citation validation status. If a snapshot changed, show a diff, not just mismatched strings. If retrieval missed the expected document, include the top ten hits and their scores. If attribution failed, print the cited quote and the actual chunk text window around the supposed span. You want the failure report to answer the first debugging questions automatically, because nobody wants to replay the whole thing manually inside a notebook after every CI run.
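Showing a diff instead of two mismatched blobs takes only the standard library. A sketch whose output you can feed straight into a pytest failure message:

```python
import difflib

def snapshot_diff(expected: str, actual: str) -> str:
    # Unified diff over lines, so a one-word drift in a long answer
    # shows up as two short marked lines instead of two walls of text.
    return "\n".join(difflib.unified_diff(
        expected.splitlines(),
        actual.splitlines(),
        fromfile="expected",
        tofile="actual",
        lineterm="",
    ))
```

For example, `pytest.fail(snapshot_diff(golden.expected_answer, result.answer))` makes the drift visible in the CI log without opening a notebook.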
In practice this means writing a little test harness around your RAG stack instead of pretending pytest alone is enough. Ours usually emits a JSON artifact per failed case, something like:
```json
{
  "case": "refund-policy-renewal",
  "model": "gpt-4.1-mini-2025-02-14",
  "prompt_version": "answer_v12",
  "document_set_version": "corpus_2026_03_12_1",
  "retrieval": [{"doc": "policy-refunds-2023", "score": 0.81}],
  "answer": "Refunds are not available after renewal.",
  "citations_valid": false
}
```

That artifact can go into CI logs, S3, or a small internal review page. Doesn't matter much. What matters is that the engineer on call can see, immediately, whether the failure came from retrieval, generation, or attribution. Once you have that, model upgrades stop feeling like superstition. They become ordinary engineering work, run the suite, inspect the breakage, decide if the delta is acceptable, ship or revert.
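Writing those artifacts doesn't need framework magic. A sketch of the writer itself, which you can call from the harness when an assertion fails or wire into a pytest reporting hook; the function name and directory are assumptions:

```python
import json
import pathlib

def write_failure_artifact(case_name: str, context: dict,
                           out_dir: str = "rag_failures") -> pathlib.Path:
    """Persist one JSON artifact per failed case, keyed by case name,
    so CI can upload the whole directory for later review."""
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    path = out / f"{case_name}.json"
    path.write_text(json.dumps({"case": case_name, **context}, indent=2))
    return path
```

The context dict carries whatever the harness collected: prompt version, corpus version, retrieval hits with scores, the answer, and the attribution verdict.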
That's the standard. Anything softer leaves too much money riding on vibes.
