silent failures are the default
Teams keep treating prompts like sticky notes taped to the side of an application, then they act surprised when a model update, a changed system prompt, or one innocent-looking refactor turns a good feature into a support ticket generator. That approach is amateur hour. If a prompt decides whether your customer gets the right invoice classification, whether your AI sales assistant qualifies a lead correctly, or whether your document pipeline extracts the right VAT ID, that prompt is production logic, and production logic belongs in the same discipline bucket as application code.
The nasty part is that LLM regressions usually don't explode loudly. You don't get a clean stack trace. You get a softer, more dangerous failure mode: support saying "it feels worse this week", conversion dropping 7%, or your ops team noticing that fields once extracted correctly now come back as null or, even worse, confidently wrong. We saw this pattern while building AI-heavy internal tooling and customer-facing workflows at Steezr, mostly around document processing and sales automation, where one prompt tweak improved performance on the sample inputs everyone cared about, then quietly degraded extraction on ugly real-world PDFs with weird OCR artifacts and mixed languages.
Classical software taught us this years ago. Anything that changes behavior needs versioning, tests, staged rollout, and a kill switch. Prompts are no exception. If your prompt lives in a database row edited through an admin panel with no review, no diff, and no traceability, you've built a regression machine. Stick it in Git. Review it in pull requests. Tag it with a semantic version if the behavior matters externally. Keep the model version next to it. Keep sampling parameters next to it. Store the expected schema next to it. Then your team can answer basic questions under pressure: which prompt was running, with which model, under which constraints, and when exactly did the behavior change.
This doesn't need to be academic. A prompt repository can be a plain directory in a Next.js or Django monorepo, something like ai/prompts/lead_qualifier/v3.2.1.yaml, checked in with the rest of the app. Boring is good. Boring survives incidents.
make prompts diffable
The first practical step is choosing a file format that preserves intent and plays nicely with code review. JSON works, YAML is easier on the eyes, and raw .txt files are fine until you need metadata, at which point people start inventing comment conventions and the whole thing gets messy. We usually prefer YAML with explicit fields because it lets you capture the behavior contract in one place.
A file like this is enough to start:
id: lead_qualifier
version: 3.2.1
model: gpt-4.1-mini-2025-04-14
temperature: 0
max_output_tokens: 300
owner: growth-team
input_schema:
  type: object
  required: [company_name, website, notes]
output_schema:
  type: object
  required: [score, reason, industry, employee_band]
system: |
  You classify inbound B2B leads for a SaaS sales team.
  Return valid JSON only.
  If data is missing, use null and explain uncertainty briefly.
user_template: |
  Company: {{ company_name }}
  Website: {{ website }}
  Notes: {{ notes }}
stop: []

That single file gives you a stable review surface. A PR can show that someone changed temperature from 0 to 0.7, removed the JSON-only instruction, or swapped gpt-4.1-mini-2025-04-14 for a newer snapshot. Those are behavior changes, and they deserve the same scrutiny as changing a SQL query in a billing path.
You also want lint rules. Real ones. Not style-policing nonsense, but actual rules that catch recurring failure modes. We enforce checks like these: the prompt must specify an explicit model, temperature must be pinned to 0 unless there's a written reason, output prompts must define a schema, no unresolved template variables, no contradictory instructions like "be concise" followed by "provide detailed rationale", and no hidden dependency on application state that isn't represented in the input. A simple Python linter using ruamel.yaml or pydantic is enough. Fail CI if a prompt violates the contract.
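A minimal sketch of such a linter, assuming the prompt file has already been parsed into a dict and uses the field names from the hypothetical lead_qualifier example above:

```python
import re

# Each rule inspects a parsed prompt dict and returns a violation message
# or None. Field names (model, temperature, output_schema, user_template,
# temperature_reason) follow the example file and are assumptions.
RULES = []

def rule(fn):
    RULES.append(fn)
    return fn

@rule
def model_is_pinned(prompt):
    # Bare aliases like "gpt-4.1-mini" drift; require a dated snapshot suffix.
    if not re.search(r"\d{4}-\d{2}-\d{2}$", str(prompt.get("model", ""))):
        return "model is not pinned to a dated snapshot"

@rule
def temperature_is_zero(prompt):
    if prompt.get("temperature") != 0 and not prompt.get("temperature_reason"):
        return "temperature is not 0 and no written reason is given"

@rule
def output_schema_defined(prompt):
    if "output_schema" not in prompt:
        return "missing output_schema"

@rule
def no_unresolved_template_variables(prompt):
    declared = set(prompt.get("input_schema", {}).get("required", []))
    used = set(re.findall(r"{{\s*(\w+)\s*}}", prompt.get("user_template", "")))
    undeclared = used - declared
    if undeclared:
        return f"template uses undeclared variables: {sorted(undeclared)}"

def lint(prompt: dict) -> list:
    """Return all violations; an empty list means the prompt passes."""
    return [msg for check in RULES if (msg := check(prompt))]
```

Wire this into a scripts/lint_prompts.py entry point that loads every YAML file under ai/prompts/ and exits nonzero on any violation.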
One more thing matters more than people expect: keep system prompts and user templates separate. Once they get concatenated all over the codebase, nobody can tell whether a regression came from business logic, formatting changes, or a model swap. Separation makes diffs readable, and readable diffs prevent stupid incidents.
tests that actually help
Most teams hear "test prompts" and immediately either overcomplicate it with LLM-as-judge pipelines everywhere or give up because outputs are probabilistic. Both reactions miss the obvious middle ground. You can get a lot of value from deterministic tests if you design for determinism. Pin the model version. Set temperature: 0. Constrain output to JSON. Validate structure first, content second.
A useful test suite has a few layers. First, schema tests. Every fixture should produce parseable JSON that conforms to the expected shape, no markdown fences, no prose before the object, no missing required fields. Second, invariant tests. If the input says country is Germany, the model shouldn't output currency: USD unless your prompt explicitly derives a different concept. Third, golden tests on a curated dataset, where you compare exact or near-exact outputs for high-value cases. Exact matching works surprisingly well once the response is structured tightly enough.
A pytest example looks like this:
import json
import pytest
from jsonschema import validate
from myapp.ai import render_prompt, call_model
from myapp.prompts import load_prompt
PROMPT = load_prompt("lead_qualifier", version="3.2.1")
SCHEMA = PROMPT["output_schema"]
@pytest.mark.parametrize("fixture", [
    "tests/fixtures/leads/saas_us.json",
    "tests/fixtures/leads/manufacturing_de.json",
    "tests/fixtures/leads/agency_empty_notes.json",
])
def test_lead_qualifier_schema(fixture):
    payload = json.load(open(fixture))
    messages = render_prompt(PROMPT, payload)
    raw = call_model(messages, model=PROMPT["model"], temperature=0)
    data = json.loads(raw)
    validate(data, SCHEMA)
    assert data["score"] in ["low", "medium", "high"]

Then add regression fixtures that encode bugs you've already paid for. If the model once mislabeled "family-owned industrial supplier" as "consumer ecommerce", freeze that case forever. Production incidents should become tests, every single time. Engineers already know this rule for application code; they just forget it the moment the word "AI" enters the room.
There is one caveat: provider-side model aliases can quietly drift. gpt-4.1-mini today may not behave exactly like it did last month, depending on the vendor and the endpoint semantics. If the provider offers dated snapshots, use them. If they don't, record raw responses from your fixture suite daily and alert on distribution shifts, because your deterministic test just became less deterministic.
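A drift check like that can be very simple. One sketch, assuming you log the parsed outputs of a fixed fixture suite each day: compare the distribution of one categorical output field (here the hypothetical "score") between runs using total variation distance, and alert above a threshold you tune on historical noise.

```python
from collections import Counter

def field_distribution(outputs, field):
    """Empirical distribution of one categorical field across outputs."""
    counts = Counter(str(o.get(field)) for o in outputs)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def drift(yesterday, today, field):
    """Total variation distance: 0 = identical distributions, 1 = disjoint."""
    a = field_distribution(yesterday, field)
    b = field_distribution(today, field)
    keys = set(a) | set(b)
    return 0.5 * sum(abs(a.get(k, 0) - b.get(k, 0)) for k in keys)

def check_drift(yesterday, today, field="score", threshold=0.15):
    """True means the output distribution moved enough to page someone."""
    return drift(yesterday, today, field) > threshold
```

The threshold of 0.15 is illustrative; set it from a few weeks of stable baseline runs rather than guessing.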
canary against shadow data
Unit tests catch the dumb regressions. Canarying catches the expensive ones. A prompt can pass every curated fixture you wrote and still fail on the ugly long-tail inputs that only show up in production: malformed OCR, users pasting email threads into a textarea, product names that collide with common nouns, all the weirdness nobody includes in a neat test file.
The fix is straightforward: keep a shadow dataset built from real traffic. Strip PII, freeze the inputs, attach expected labels where you have them, and score both the current prompt and the candidate prompt against the same corpus before rollout. This doesn't require a giant ML platform. A nightly Django management command or GitHub Actions job is enough if your dataset lives in S3 or PostgreSQL.
We usually care about three metrics on a canary run: valid output rate, task-specific correctness, and disagreement rate versus the current production prompt. That last one sounds crude, but it's extremely useful. If a candidate prompt suddenly disagrees on 38% of cases where the current version has been stable and support tickets were low, you stop and inspect. Maybe it found a real improvement. Maybe it got weird. You do not ship first and investigate later.
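Disagreement rate is a few lines of code. A sketch, assuming both prompts were run over the same frozen shadow corpus and you compare a handful of output fields (the names here match the lead_qualifier example and are assumptions):

```python
def disagreement_rate(current, candidate, keys):
    """Fraction of shadow cases where the candidate differs from the
    current production prompt on any of the compared fields."""
    assert len(current) == len(candidate), "run both prompts on the same corpus"
    disagreements = sum(
        1
        for cur, cand in zip(current, candidate)
        if any(cur.get(k) != cand.get(k) for k in keys)
    )
    return disagreements / len(current)

def canary_gate(current, candidate, keys=("score", "industry"), max_disagreement=0.15):
    """Raise when the candidate disagrees too often; returns the rate otherwise."""
    rate = disagreement_rate(current, candidate, keys)
    if rate > max_disagreement:
        raise RuntimeError(
            f"candidate disagrees on {rate:.0%} of shadow cases, inspect before rollout"
        )
    return rate
```

The 15% ceiling is a placeholder; calibrate it against how stable the current version has been.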
A simple rollout table helps:
create table prompt_releases (
    id bigserial primary key,
    prompt_id text not null,
    version text not null,
    model text not null,
    traffic_percent integer not null check (traffic_percent between 0 and 100),
    status text not null check (status in ('shadow', 'canary', 'full', 'rolled_back')),
    created_at timestamptz not null default now()
);

Then your app chooses prompt versions by config, not by hardcoded constants scattered through handlers. In a Next.js route or Django service layer, route 5% of eligible requests to the candidate version, log outputs side by side if the feature allows for shadow execution, and compare them offline. If your p95 latency jumps from 1.8s to 4.9s because the new prompt is twice as long and triggers more tool calls, that matters too. Regressions aren't only about accuracy.
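One clean way to do that percentage split is deterministic hash bucketing on a stable request or account id. A sketch: the same id always lands in the same bucket, so a user doesn't flip between prompt versions mid-session and side-by-side logs stay comparable.

```python
import hashlib

def use_candidate(stable_id: str, traffic_percent: int) -> bool:
    """Deterministically assign a stable id to one of 100 buckets and
    route it to the candidate prompt if its bucket falls inside the
    configured canary percentage."""
    digest = hashlib.sha256(stable_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:2], "big") % 100
    return bucket < traffic_percent
```

In practice you'd read traffic_percent from the prompt_releases row whose status is 'canary' and call this once per eligible request; raising the percentage requires only a database update.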
Canarying is where engineering leads separate themselves from prompt hobbyists. Hobbyists ship vibes. Leads ship evidence.
ci that fails fast
If the only place a prompt gets exercised is a product manager clicking around a staging environment, your process is broken. Prompt changes need a pipeline. A real one. GitHub Actions, Buildkite, GitLab CI, pick your poison; the shape is the same: lint, render validation, unit tests against fixtures, shadow evaluation on a sampled dataset, then deployment gates.
A lean GitHub Actions job can look like this:
name: prompt-ci
on:
  pull_request:
    paths:
      - 'ai/prompts/**'
      - 'tests/ai/**'
      - '.github/workflows/prompt-ci.yml'
jobs:
  test-prompts:
    runs-on: ubuntu-22.04
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - run: pip install -r requirements.txt
      - run: python scripts/lint_prompts.py
      - run: pytest tests/ai -q
      - run: python scripts/eval_shadow.py --base main --candidate HEAD --threshold-file ai/thresholds.yaml
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

The threshold file matters because quality gates should be explicit, not trapped in a reviewer's intuition. Something like valid_output_rate >= 0.995, accuracy_delta >= -0.01, latency_p95_ms <= 2500, cost_per_1k_requests_delta_usd <= 3.00. If the candidate fails, CI goes red. That sounds obvious, yet plenty of teams still merge prompt tweaks because the wording "feels clearer". I don't care how it feels. Show the numbers.
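The gate logic itself is trivial, which is the point. A sketch of what a hypothetical eval_shadow.py might do after loading the thresholds, using the metric names from the examples above:

```python
# Explicit quality gates: each metric has a direction ("min" means the
# candidate must stay at or above the bound, "max" at or below).
# Values mirror the illustrative thresholds in the text.
THRESHOLDS = {
    "valid_output_rate": ("min", 0.995),
    "accuracy_delta": ("min", -0.01),
    "latency_p95_ms": ("max", 2500),
    "cost_per_1k_requests_delta_usd": ("max", 3.00),
}

def gate(metrics: dict) -> list:
    """Return human-readable violations; any entry means CI goes red."""
    failures = []
    for name, (kind, bound) in THRESHOLDS.items():
        value = metrics[name]
        ok = value >= bound if kind == "min" else value <= bound
        if not ok:
            failures.append(f"{name}={value} violates {kind} bound {bound}")
    return failures
```

The script prints the violations and exits nonzero if the list is non-empty, so the PR can't merge on vibes.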
You'll hit flaky tests eventually, usually because the provider changed behavior, a network hiccup caused retries to produce different outputs, or a supposedly structured response wrapped itself in triple backticks and broke your parser. Good. Expose that fragility early. Add retries only where the business behavior tolerates retries. Capture raw request and response payloads in CI artifacts for failures. Save the prompt, model, parameters, input fixture, and raw output. Nobody can debug AssertionError: expected high got medium without the full context.
One hard rule helps a lot: prompt-only PRs shouldn't ship without review from at least one engineer who owns the downstream metric. Prompt text can move revenue just as effectively as code.
rollback has to be boring
Every production LLM feature needs a rollback path that works at 2 a.m. under stress, with half the team asleep and the on-call person reading logs through one eye. If reverting a prompt means rebuilding the app, waiting for a container rollout, and hoping nobody changed the provider config in the meantime, your incident process is garbage.
The simplest design is a prompt registry table plus config-driven resolution. The application asks for lead_qualifier@active, your service resolves that alias to a concrete prompt version and model snapshot, caches it briefly, and logs the resolved pair on every request. Rolling back becomes a database update or feature flag flip, not a redeploy.
Something like this in Django is enough:
from django.core.cache import cache
from myapp.models import PromptAlias

def resolve_prompt(prompt_id: str, alias: str = "active"):
    key = f"prompt:{prompt_id}:{alias}"
    cached = cache.get(key)
    if cached:
        return cached
    row = PromptAlias.objects.select_related("prompt_version").get(
        prompt_id=prompt_id,
        alias=alias,
    )
    resolved = {
        "version": row.prompt_version.version,
        "model": row.prompt_version.model,
        "system": row.prompt_version.system,
        "user_template": row.prompt_version.user_template,
    }
    cache.set(key, resolved, 30)
    return resolved

Then wire a kill switch into the feature path. If output validation fails above some threshold, if disagreement spikes, if user feedback drops, or if the provider starts timing out, your app should fall back to the previous prompt version or even disable the AI branch and route to a safer rules-based path. This is especially relevant in ERP and CRM workflows, where a bad extraction can poison downstream data quietly for hours.
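One simple shape for that kill switch: track validation results over a sliding window and trip to the fallback path when the failure rate crosses a threshold. A sketch with illustrative window size and threshold; in production the tripped state would live somewhere shared, like the cache or a feature flag, not in process memory.

```python
from collections import deque

class KillSwitch:
    """Trips when too many recent outputs fail validation, so the app
    can fall back to the previous prompt or a rules-based path."""

    def __init__(self, window: int = 200, max_failure_rate: float = 0.05):
        self.results = deque(maxlen=window)
        self.max_failure_rate = max_failure_rate
        self.tripped = False

    def record(self, output_valid: bool) -> None:
        self.results.append(output_valid)
        # Only evaluate once the window is full, to avoid tripping on
        # the first unlucky request after a deploy.
        if len(self.results) == self.results.maxlen:
            failure_rate = 1 - sum(self.results) / len(self.results)
            if failure_rate > self.max_failure_rate:
                self.tripped = True

    def use_ai_branch(self) -> bool:
        return not self.tripped
```

Call record() after every schema validation, check use_ai_branch() before invoking the model, and page someone the moment it flips.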
I've seen teams obsess over fancy evaluation dashboards and then realize they can't answer the most basic incident question: can we revert now? Build that first. A boring rollback beats a beautiful postmortem.
