9 min read · Johnny Unar

Your AI Feature Ships Fast and Rots Faster

Shipping a GPT-4o integration takes a weekend. Maintaining it takes a policy. Here's the one-page lifecycle template we actually use.

the rot is quiet

Nobody tells you when your AI feature starts degrading. There's no exception thrown, no 5xx spike in Datadog, no Slack alert firing at 2am. The model just quietly returns slightly worse outputs on the new snapshot, your evals were never set up in the first place, and six months after launch you're getting support tickets about responses that "feel off" and you have no baseline to diff against.

We've seen this pattern multiple times, both in our own work at steezr and in systems we've been brought in to audit. A team ships a document summarization feature on gpt-4o-2024-08-06, it works beautifully, everyone celebrates, and then OpenAI rotates the snapshot behind the gpt-4o alias and the feature slowly drifts. Not catastrophically. Just enough that the legal team notices the summaries are omitting liability clauses they used to catch.

The reason this happens is structural. Shipping an AI feature in 2026 is genuinely easy, embarrassingly so. You call an API, write a prompt, wrap it in a Next.js route handler, done. But the operational discipline that keeps that feature reliable over a 12-month horizon (pinning model versions, maintaining a regression baseline, owning a deprecation runbook, having an actual fallback path) has a cost, and early-stage teams defer it because there's always a more urgent ticket.

The cost of deferral compounds. By month six you're in a situation where nobody remembers why the system prompt has that weird caveat in paragraph three, the model version is hardcoded in four different places with three different values, and the engineer who built the feature is working on something else entirely.

how OpenAI and Anthropic actually version things

Before writing any policy, you need to understand what you're actually pinning to, because OpenAI and Anthropic handle versioning very differently and conflating them is a real source of bugs.

OpenAI has two layers. There's the model alias (gpt-4o, gpt-4o-mini, o3) which points to a snapshot that can rotate without notice, and then there's the dated snapshot (gpt-4o-2024-11-20, gpt-4o-mini-2024-07-18) which is stable until it's deprecated, at which point OpenAI gives you roughly six months of warning. If your code says model: "gpt-4o" you're on the alias, and your prompts are running against whatever snapshot they've decided is current this week. Sometimes that's fine. Often it isn't. The fix is one line: pin to the dated snapshot. Right now in mid-2026 the latest stable pinnable snapshot for GPT-4o is gpt-4o-2025-03-26. Use that string literally in your config, not the alias.

Anthropic's scheme is slightly different. Claude model identifiers include a version suffix like claude-opus-4-5 or claude-sonnet-4-5, and Anthropic's deprecation policy gives you a minimum of six months notice before a model is retired, with the API throwing a deprecation warning header on responses from models nearing end-of-life. That header is anthropic-beta: deprecated and it shows up in the response metadata, not as an error, which means it's trivially easy to miss unless you're explicitly logging response headers.
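If you want that warning surfaced instead of silently dropped, a small header check in your client wrapper does it. A sketch, with the header name and value taken from the claim above; treat both as assumptions and verify against Anthropic's current API docs before relying on them:

```python
# Hedged sketch: log a warning whenever a response carries a deprecation
# signal in its headers. The header name and value below are assumptions
# based on the text above, not verified constants.
import logging

logger = logging.getLogger("model_lifecycle")

DEPRECATION_HEADER = "anthropic-beta"  # assumed header name
DEPRECATION_VALUE = "deprecated"       # assumed header value

def check_deprecation_headers(headers: dict, model_id: str) -> bool:
    """Return True (and log a warning) if the headers flag a deprecated model."""
    value = headers.get(DEPRECATION_HEADER, "")
    if DEPRECATION_VALUE in value:
        logger.warning("Model %s is flagged as deprecated; plan a migration.", model_id)
        return True
    return False
```

Wire this into whatever wrapper already sits around your API client, so every response gets checked without anyone remembering to look.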

The practical upshot is that your AI config should never contain a bare alias as the production model identifier. Ever. Aliases are fine for quick prototyping, genuinely useful for that, but the moment a feature ships to users, the model identifier in your production config should be a pinned, dated string that you consciously update.
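One cheap way to enforce this is a guard in the config module that rejects anything without a date suffix. A sketch, assuming OpenAI's YYYY-MM-DD snapshot convention; Anthropic's version-suffix scheme would need a different pattern:

```python
# Minimal sketch: fail fast at startup if a production model identifier is a
# bare alias rather than a dated snapshot. The date-suffix pattern matches
# OpenAI's snapshot naming; adapt it for providers with other schemes.
import re

DATED_SNAPSHOT = re.compile(r".+-\d{4}-\d{2}-\d{2}$")

def assert_pinned(model_id: str) -> str:
    if not DATED_SNAPSHOT.match(model_id):
        raise ValueError(
            f"Bare alias {model_id!r} in production config; pin a dated snapshot."
        )
    return model_id

PRODUCTION_MODEL = assert_pinned("gpt-4o-2024-11-20")  # passes
# assert_pinned("gpt-4o")  # raises ValueError
```

Because the guard runs at import time, an unpinned alias fails in CI before it ever reaches production.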

the one-page policy template

Every AI feature your team ships should have a policy document. Not a fifty-page MLOps spec, just a single markdown file in the repo that answers five questions. We call this a Model Lifecycle Document and here's the template we've landed on after enough painful post-mortems.

**Model Pin.** What exact model identifier is in production right now, what date was it last updated, and who approved the update. One line each. If it says gpt-4o without a date suffix, the policy is already violated.

**Regression Baseline.** A set of 15-30 input/output pairs that represent the expected behavior of this feature, stored in the repo as a JSON file, with a script that runs the current pinned model against those inputs and scores the outputs. The scoring doesn't need to be fancy. For structured outputs, assert on schema and key field values. For free-text outputs, use a cosine similarity threshold against the reference or run a cheap judge model call. The baseline file lives at tests/ai/baselines/{feature_name}.json and the script runs in CI on every PR that touches the prompt or model config.

**Deprecation Runbook.** When the pinned model hits end-of-life, what happens. Concretely: who gets paged, what's the SLA for updating the pin, which engineer is the rotation owner, and what's the test gate before the new version goes to production. This doesn't need to be elaborate. A three-paragraph Notion doc is fine. It just needs to exist and be linked from the policy file.

**Fallback Path.** If the primary model is unavailable (rate limit, outage, deprecation with no warning), what does the feature do? Return a cached response? Degrade to a simpler model? Fail closed with a user-facing message? The answer doesn't matter as much as the fact that there is one, and that it's been tested, not just theorized.

**Prompt Ownership.** Who is the named owner of the system prompt for this feature. Not the team, a person. When the model pin updates, this person is responsible for validating that the prompt still performs correctly against the new snapshot.

That's the whole policy. Five sections, fits in a single screen of markdown.
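Rendered out, the file looks something like this. Every name and value here is illustrative, not a real feature:

```markdown
# Model Lifecycle: invoice-summarizer

## Model Pin
- Identifier: `gpt-4o-2024-11-20`
- Last updated: 2026-05-02
- Approved by: @jane

## Regression Baseline
- File: `tests/ai/baselines/invoice-summarizer.json`
- Runner: `pytest tests/ai/test_invoice_summarizer.py` (CI-blocking)

## Deprecation Runbook
- Rotation owner: @jane (backup: @sam)
- Alert channel: #ai-oncall
- SLA to update the pin: 5 business days
- Runbook link: <doc>

## Fallback Path
- Cached last-good response (24h TTL); fail closed with a
  user-facing message on cache miss.

## Prompt Ownership
- Owner: @sam
```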

setting up the regression baseline without a data team

The baseline is the part teams skip because it sounds like an ML problem, and they don't have ML engineers. It's not an ML problem. It's a testing problem, and you already know how to write tests.

Start by capturing 20 real production inputs from your first week of traffic. Not synthetic examples you made up while writing the feature, actual user inputs that exercised the feature in ways you didn't fully anticipate. Store those inputs in your tests/ai/baselines/ directory alongside the responses the model gave at launch, when the feature was working the way you intended.

Then write a pytest script (or a Go test, or a Jest test, whatever your stack is) that takes each stored input, runs it through the current model with the current pinned identifier and current system prompt, and compares the output to the stored baseline. For structured outputs this comparison is straightforward. If your feature extracts invoice line items as JSON and the baseline says {"total": 142.50, "line_items": [{...}]}, assert that the schema is intact and the numeric fields are within some tolerance. For free-text outputs, you have two options that don't require a data team: semantic similarity via text-embedding-3-small (cheap, fast, good enough for regression detection) or a judge prompt that asks gpt-4o-mini whether the new response covers the same key points as the reference.
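The comparison logic is small enough to sketch inline. The numeric-tolerance check covers the structured case; the cosine function is plain math over embedding vectors you would fetch from text-embedding-3-small elsewhere:

```python
# Sketch of the two comparison modes described above: relative-tolerance
# matching for structured outputs, cosine similarity for free text (given
# embedding vectors obtained from an embedding API elsewhere).
import math

def structured_matches(baseline: dict, candidate: dict, tol: float = 0.01) -> bool:
    """Same keys; numeric fields within a relative tolerance; nested dicts recurse."""
    if baseline.keys() != candidate.keys():
        return False
    for key, expected in baseline.items():
        actual = candidate[key]
        if isinstance(expected, (int, float)) and isinstance(actual, (int, float)):
            if not math.isclose(expected, actual, rel_tol=tol):
                return False
        elif isinstance(expected, dict) and isinstance(actual, dict):
            if not structured_matches(expected, actual, tol):
                return False
        elif expected != actual:
            return False
    return True

def cosine_similarity(a: list, b: list) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```

In the test itself, compare `cosine_similarity(embed(reference), embed(candidate))` against your threshold and fail the run when it drops below it.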

The judge-model approach does introduce a dependency on a second API call, but gpt-4o-mini is inexpensive enough that running 25 judge calls per CI run costs fractions of a cent. We've found this pattern works well for document processing features where the outputs are long-form and exact string matching is useless.
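The judge call itself is just a prompt. A sketch of one way to phrase it; the rubric wording here is ours, not a standard:

```python
# Builds the messages payload for a judge call (e.g. to gpt-4o-mini, made
# elsewhere). The PASS/FAIL rubric wording is an illustrative choice.
def build_judge_messages(reference: str, candidate: str) -> list:
    rubric = (
        "You compare two answers. Reply PASS if the candidate covers the "
        "same key points as the reference, FAIL otherwise. Reply with one word."
    )
    return [
        {"role": "system", "content": rubric},
        {"role": "user", "content": f"Reference:\n{reference}\n\nCandidate:\n{candidate}"},
    ]
```

Keeping the prompt construction in a pure function means the rubric itself is diffable and testable, separate from the API call.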

The critical thing is to run this in CI on every PR that touches OPENAI_MODEL_ID, ANTHROPIC_MODEL_ID, or any file in your prompts/ directory. You want the regression check to be automatic and blocking, not a manual step someone might skip when they're in a hurry to ship a hotfix.
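If you're on GitHub Actions, the path filter looks something like this. The paths assume the layout described above; adjust them to wherever your model identifiers actually live:

```yaml
# Hedged example, assuming GitHub Actions. Make this job a required status
# check so the regression gate blocks merges.
name: ai-regression-gate
on:
  pull_request:
    paths:
      - "prompts/**"
      - "tests/ai/baselines/**"
      - "config/models.env"  # wherever OPENAI_MODEL_ID / ANTHROPIC_MODEL_ID live
jobs:
  regression:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -r requirements.txt
      - run: pytest tests/ai/ --maxfail=1
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```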

the deprecation runbook in practice

OpenAI's current deprecation process gives you a minimum of 30 days' notice via email and documentation updates, with a typical runway of around six months for major model versions. The email goes to your organization's billing contact, which is often the CEO's personal Gmail account and definitely not in anyone's Slack. Build the assumption that you will not receive timely deprecation emails into your process from day one.

The practical fix is a cron job. Set up a weekly script that calls GET https://api.openai.com/v1/models/{model_id} for each pinned model identifier in your production config, parses the response for deprecation metadata, and fires a Slack notification if the deprecated field is true or if deprecation_date is within 90 days. OpenAI includes created and deprecation timestamps in the model object. Anthropic's equivalent is GET https://api.anthropic.com/v1/models/{model_id} which returns a deprecation_date field when one is set. This cron job is maybe 40 lines of Go or Python and it's one of the highest-ROI pieces of infrastructure you can build the week after you ship an AI feature.
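The decision logic is worth keeping as a pure function you can test without network access. A sketch, with the deprecation field names assumed as described above; verify them against each provider's current model API before relying on this:

```python
# Sketch of the weekly check's core decision. The "deprecated" and
# "deprecation_date" field names are assumptions based on the text above,
# not verified API contracts.
import datetime

ALERT_WINDOW_DAYS = 90

def should_alert(model_obj: dict, today: datetime.date) -> bool:
    """Alert if the model is already deprecated, or deprecates within 90 days."""
    if model_obj.get("deprecated"):
        return True
    raw = model_obj.get("deprecation_date")
    if raw:
        deprecation = datetime.date.fromisoformat(raw)
        return (deprecation - today) <= datetime.timedelta(days=ALERT_WINDOW_DAYS)
    return False

# The cron job proper loops over your pinned identifiers, GETs each model
# object from its provider, and posts to Slack when should_alert(...) is True.
```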

When the runbook actually fires, the process should be: the rotation owner (named in the policy) picks the next pinned snapshot, runs the regression baseline against it locally, checks whether the delta in outputs is acceptable, updates the model identifier in config, pushes a PR that includes an updated policy document with the new date and approver, and merges after CI passes the regression gate. The whole thing should take less than two hours for a stable feature with a well-maintained baseline.

For features where the model update causes meaningful regression, you have to decide whether to update the prompt, accept the regression, or escalate to a larger model. That decision belongs in the runbook. "If regression score drops below 0.85 cosine similarity, escalate to tech lead" is a perfectly reasonable policy line.

the fallback path people forget to test

Every team says they have a fallback. Almost nobody has actually tested it under realistic conditions.

The most common failure mode we see in AI features is a model identifier that's been deprecated but the fallback logic was never wired up correctly, so the feature throws an unhandled 404 from the OpenAI API, which bubbles up as a 500 to your users, which nobody notices for a while because the error rate is low enough to stay under the alert threshold. This happened to a customer portal we inherited from another agency: they'd pinned to gpt-4-0314 years earlier, that model was deprecated, the feature silently failed for a percentage of requests based on cache hit rate, and nobody noticed for two weeks because the errors weren't tagged with the right metadata in their logging pipeline.

Test your fallback by temporarily setting your model identifier to a nonexistent string in a staging environment and running through the user flow manually. Not in a unit test, manually, the way a user would experience it. You'll often discover that the "graceful degradation" you wired up three months ago now returns a UI state that was never designed because a component changed.

For features where a synchronous AI response is in the critical path of the user experience, a cached fallback is worth the complexity. Store the last successful response for each input hash (or a representative subset of common inputs) in Redis with a TTL of 24 hours, and serve that when the primary model call fails. It's not perfect, it returns a slightly stale response, but stale is almost always better than broken for summarization or classification features where the input doesn't change minute to minute.
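The logic is a few lines. A sketch with a plain dict standing in for Redis so the shape is clear; `call_model` is a placeholder for your real API call, and production would add the 24-hour TTL:

```python
# Sketch of the cached-fallback pattern. A dict stands in for Redis here;
# in production use Redis SETEX with a 24h TTL. call_model is a placeholder.
import hashlib

cache = {}

def input_hash(text: str) -> str:
    return hashlib.sha256(text.encode()).hexdigest()

def respond_with_fallback(text: str, call_model) -> str:
    key = input_hash(text)
    try:
        result = call_model(text)
        cache[key] = result   # refresh the last-good response
        return result
    except Exception:
        if key in cache:
            return cache[key]  # stale beats broken
        return "Document analysis is temporarily unavailable."
```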

For features where freshness matters more than availability, fail closed explicitly with a user-facing message rather than silently returning bad data. "Document analysis is temporarily unavailable" is a better user experience than a subtly wrong summary that the user acts on.

shipping the policy alongside the feature

The hardest part of this isn't the tooling, it's the norm. Getting a team to write a Model Lifecycle Document for every AI feature requires making it a non-negotiable part of the definition of done, the same way you wouldn't merge a schema change without a migration.

At steezr, when we build AI features for clients, the lifecycle document ships in the same PR as the feature itself, with the baseline JSON file, the regression test script, and the cron job config all included. It adds maybe half a day to the initial build. The alternative is spending that half day six months later in a fire drill because a model deprecated without warning and nobody owns the recovery process.

The template I described above is genuinely a one-page document. Keep the policy file at docs/ai/{feature_name}-lifecycle.md. Link it from your engineering wiki. Review it quarterly alongside your dependency updates. When you update the model pin, update the date in the policy. When someone new joins the team, the policy file tells them everything they need to know about why a particular model version is pinned and who to talk to when it needs to change.

The teams that get burned by AI feature rot aren't the ones that didn't care about quality, they're the ones that shipped something great, got busy with the next thing, and didn't leave any scaffolding for the person who'd need to maintain it eight months later. The scaffolding is cheap. The cleanup isn't.
