silent failures first
Encoder swaps fail in the most annoying way possible: nothing crashes, dashboards stay green, p95 might even improve, and your search quality still falls off a cliff, because embedding spaces aren't interchangeable and every ANN structure you built around the old vectors suddenly starts serving the wrong neighborhood. Teams miss this because they treat an embedding model like a patch release. It isn't. Swapping text-embedding-3-large for a newer encoder, or moving from a local bge variant to gte, means your cosine distances, nearest-neighbor relationships, score distributions, and reranker inputs all shift at once.
I've seen this happen with pgvector on PostgreSQL, with Qdrant, with Pinecone, with Weaviate, and it's the same pattern every time: somebody re-embeds a corpus, points production queries at the new index, sees no infra alarms, and then support tickets start reading like "search stopped understanding invoice numbers" or "results are vaguely related now". That's the dangerous part. Retrieval regressions show up as product weirdness, not pager noise.
The playbook is simple and strict. Version every embedding, never overwrite vectors in place, shadow real traffic before you cut over, validate retrieval against a judged set with a cross-encoder, and keep rollback independent from reindexing, because if rollback means regenerating 200 million embeddings at 2 a.m., you don't have rollback. At steezr we've used this pattern for document processing systems and customer portals where semantic search is wired into business workflows, and once you set it up properly the next model swap becomes operational work instead of ritual sacrifice.
version the vectors
Start in the database, because this is where people make the first bad decision: they update an embedding column in place and destroy the only clean comparison point they had. Keep the source record stable, keep semantic identity stable, and version the embeddings separately. Your semantic key should survive chunking tweaks, document renames, and model churn. I usually use a deterministic key based on the canonical source path plus chunk ordinal plus a content hash of normalized text, something like sha256(doc_id || ':' || chunk_no || ':' || normalized_text).
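That key derivation is a few lines. A minimal sketch, with a placeholder normalize step, your pipeline should reuse whatever normalization it already applies at chunking time, not this stand-in:

```python
import hashlib
import unicodedata


def normalize(text: str) -> str:
    # Placeholder normalization: NFC, lowercase, collapse whitespace.
    # Swap in the exact normalization your chunking pipeline already uses.
    return " ".join(unicodedata.normalize("NFC", text).lower().split())


def semantic_key(doc_id: str, chunk_no: int, text: str) -> str:
    # sha256(doc_id || ':' || chunk_no || ':' || normalized_text)
    payload = f"{doc_id}:{chunk_no}:{normalize(text)}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()
```

The point is determinism: the same (document, ordinal, content) always maps to the same key, so re-chunking identical content is a no-op and any real content change produces a new identity.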
A PostgreSQL schema with pgvector works fine for this. Current pgvector releases support HNSW and IVFFlat well enough for most production loads, and Postgres 17 plus pgvector has been perfectly reasonable for mid-sized corpora if you know your memory limits.
```sql
create extension if not exists vector;

create table document_chunks (
    id bigserial primary key,
    semantic_key text not null unique,
    document_id uuid not null,
    chunk_no int not null,
    content text not null,
    content_sha256 bytea not null,
    created_at timestamptz not null default now()
);

create table embedding_versions (
    id bigserial primary key,
    version_key text not null unique,
    encoder_name text not null,
    encoder_revision text not null,
    dimensions int not null,
    distance_metric text not null check (distance_metric in ('cosine', 'l2', 'ip')),
    indexed_at timestamptz,
    created_at timestamptz not null default now(),
    active boolean not null default false
);

create table chunk_embeddings (
    chunk_id bigint not null references document_chunks(id) on delete cascade,
    embedding_version_id bigint not null references embedding_versions(id) on delete cascade,
    embedding vector(1536) not null,
    created_at timestamptz not null default now(),
    primary key (chunk_id, embedding_version_id)
);

create index concurrently idx_chunk_embeddings_v1_hnsw
on chunk_embeddings using hnsw (embedding vector_cosine_ops)
with (m = 16, ef_construction = 200);
```
That last index needs one correction in real life: create one index per embedding version, either by partitioning chunk_embeddings by embedding_version_id or by storing each version in its own physical table. pgvector indexes don't understand your logical intent. If you stuff five model versions into one giant table and hope the planner will magically avoid scanning the wrong graph, you're building latency and recall bugs into the system.
Hosted vector DBs need the same discipline. Separate collections or namespaces per encoder version, immutable metadata for semantic keys, and a control plane record that says which version receives shadow traffic, which version is candidate, which one is active. Don't let the app infer any of this.
dual write, shadow read
Once your storage can hold multiple embedding versions, the rollout gets boring, which is exactly what you want. New and existing content should be dual-indexed for a while, old encoder plus candidate encoder, and query traffic should be mirrored to the candidate path without letting candidate results affect users yet. People often skip shadowing because they already have offline eval. Offline eval is necessary but not sufficient, because real query distributions are always uglier than the curated set: users paste stack traces, half a contract clause, serial numbers with whitespace damage, terrible OCR, mixed-language fragments.
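The dual-write half of that is worth making explicit in code so nobody ingests a chunk into only one index during the overlap window. A sketch, where embed_for and index_chunk are hypothetical adapters for your encoder client and vector store:

```python
def on_chunk_ingested(chunk, version_keys, embed_for, index_chunk):
    """Write the chunk's vector into every live embedding version.

    version_keys is e.g. ["embed-v2026-02", "embed-v2026-03"] during the
    overlap window. embed_for and index_chunk are adapters you supply for
    your encoder client and vector store; the loop is the whole pattern.
    """
    for version_key in version_keys:
        vector = embed_for(version_key, chunk["content"])
        index_chunk(version_key, chunk["semantic_key"], vector)
```

Driving version_keys from the embedding_versions control table, rather than hardcoding it, means adding a candidate encoder is a row insert, not a deploy.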
A tiny shadow proxy does the job. Put it in front of the retrieval service, send the primary result path back to the caller, fire the candidate query in parallel, log both top-k lists, scores, latency, and overlap. In Go this is maybe 150 lines if you don't overengineer it.
```go
package main

import (
	"context"
	"encoding/json"
	"net/http"
)

type SearchRequest struct {
	Query string `json:"query"`
	K     int    `json:"k"`
}

type Hit struct {
	SemanticKey string  `json:"semantic_key"`
	Score       float64 `json:"score"`
}

type SearchResponse struct {
	Hits   []Hit  `json:"hits"`
	Model  string `json:"model"`
	TookMs int64  `json:"took_ms"`
}

func handler(w http.ResponseWriter, r *http.Request) {
	var req SearchRequest
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
		http.Error(w, "bad request", http.StatusBadRequest)
		return
	}

	// The shadow call must outlive the request: once the handler returns,
	// r.Context() is canceled, which would abort an in-flight shadow query.
	shadowCtx := context.WithoutCancel(r.Context()) // Go 1.21+
	primaryCh := make(chan SearchResponse, 1)
	shadowCh := make(chan SearchResponse, 1)

	go func() { primaryCh <- callRetriever(r.Context(), "embed-v2026-02", req) }()
	go func() { shadowCh <- callRetriever(shadowCtx, "embed-v2026-03", req) }()

	primary := <-primaryCh
	w.Header().Set("Content-Type", "application/json")
	_ = json.NewEncoder(w).Encode(primary)

	// Compare off the request path so shadow latency never reaches the caller.
	go func() {
		shadow := <-shadowCh
		logComparison(req, primary, shadow)
	}()
}
```
The logging matters more than the proxy. Store top-10 semantic keys, reciprocal rank deltas, Jaccard overlap, latency deltas, and query fingerprints. You want to answer very specific questions, does the candidate preserve navigational queries, does it improve long natural-language queries, does it collapse on exact code identifiers, does OCR garbage poison the nearest neighbors. A decent shadow record gives you that.
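The overlap and rank-delta numbers from those logs are cheap to compute. A sketch of the two workhorses, operating on ranked lists of semantic keys:

```python
def jaccard_at_k(primary, shadow, k=10):
    # Set overlap of the two top-k result lists; 1.0 means identical sets.
    a, b = set(primary[:k]), set(shadow[:k])
    return len(a & b) / len(a | b) if (a or b) else 1.0


def reciprocal_rank(ranked, relevant_key):
    # 1/position of the first hit on a known-relevant key, 0.0 if absent.
    # The delta between primary and shadow RR is what goes in the log record.
    for pos, key in enumerate(ranked, start=1):
        if key == relevant_key:
            return 1.0 / pos
    return 0.0
```

Aggregating these per query fingerprint is what turns "the candidate looks different" into "the candidate drops exact-identifier queries by two ranks".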
One hard rule: don't compare raw similarity scores across models. They're meaningless across spaces. Compare ranked outputs, judged relevance, and downstream task success. Teams waste days trying to normalize cosine scores from incompatible encoders. Don't.
judge with a reranker
If you only compare old top-k to new top-k, you'll reproduce the old system's biases and call that validation. Use a cross-encoder or reranker as a judge, ideally one that has already correlated with human relevance on your corpus. For text retrieval, a modern reranker from Cohere, Jina, mixedbread, Voyage, or a self-hosted cross-encoder from Sentence Transformers can score query-document pairs far better than embedding similarity alone. The reranker doesn't need to serve production traffic in the critical path if latency is tight, it just needs to score eval sets and shadow samples.
Your SLOs should be explicit. Precision@10, Recall@50 against a judged set, MRR for navigational queries, plus a failure budget for category-specific regressions. If invoice lookup drops 8 percent and FAQ search gains 2 percent, you still reject the rollout if invoice lookup is revenue-critical. Generic averages hide the pain.
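The metrics themselves are a few lines each once you have ranked semantic keys and a judged relevance set per query. A sketch:

```python
def precision_at_k(ranked, relevant, k):
    # Fraction of the top-k results judged relevant.
    return sum(1 for key in ranked[:k] if key in relevant) / k


def recall_at_k(ranked, relevant, k):
    # Fraction of all judged-relevant chunks surfaced in the top k.
    return sum(1 for key in ranked[:k] if key in relevant) / len(relevant) if relevant else 1.0


def mean_reciprocal_rank(runs):
    # runs: list of (ranked_keys, relevant_key_set) pairs, one per query.
    total = 0.0
    for ranked, relevant in runs:
        for pos, key in enumerate(ranked, start=1):
            if key in relevant:
                total += 1.0 / pos
                break
    return total / len(runs)
```

Compute these per segment, not just globally, so the invoice_queries bucket can fail the gate while the overall average still looks fine.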
A practical table for CI looks like this:
```yaml
eval:
  min_precision_at_10: 0.78
  min_recall_at_50: 0.92
  min_mrr: 0.81
  max_latency_delta_ms_p95: 25
  max_regression_rate: 0.03
  critical_segments:
    invoice_queries:
      min_precision_at_10: 0.90
    sku_lookup:
      min_mrr: 0.95
```
Then wire a pass/fail script around it. Query both indexes, collect candidate documents by semantic key, rerank (query, chunk) pairs, and compute metrics. This can run in CI on a fixed dataset and nightly on fresh shadow samples. If the candidate misses the threshold, fail the build. Hard stop.
```python
if metrics["precision_at_10"] < cfg["min_precision_at_10"]:
    raise SystemExit(f"FAIL precision@10={metrics['precision_at_10']:.3f}")

if metrics["invoice_queries"]["precision_at_10"] < 0.90:
    raise SystemExit("FAIL invoice_queries precision regression")
```
You'll want human review too, especially for the weird edge buckets, short identifier queries, negation-heavy legal text, multilingual support tickets. A cross-encoder catches a lot, it won't catch every business-specific nuance. Keep 100 to 300 hand-judged examples alive and painful. They pay for themselves every release.
migration without identity loss
The migration job should preserve semantic keys even if your chunk IDs or storage layout change, otherwise every comparison and every cache becomes nonsense. People tie embeddings to database row IDs, then later re-chunk a corpus, import a backup, or shard a table, and suddenly they can't tell whether the candidate failed retrieval or they just changed identity under the system. Stable semantic keys fix that.
A basic backfill in SQL plus an application worker is enough. First create the target embedding version record, then upsert vectors keyed by (chunk_id, embedding_version_id). If content changes, generate a new chunk record with a new semantic key only when the normalized content actually changed.
```sql
insert into embedding_versions (
    version_key, encoder_name, encoder_revision, dimensions, distance_metric
) values (
    'embed-v2026-03', 'text-embedding-3-large-next', '2026-03-01', 1536, 'cosine'
)
on conflict (version_key) do nothing;

with target as (
    select id from embedding_versions where version_key = 'embed-v2026-03'
)
select dc.id, dc.semantic_key, dc.content
from document_chunks dc
where not exists (
    select 1
    from chunk_embeddings ce, target t
    where ce.chunk_id = dc.id and ce.embedding_version_id = t.id
)
order by dc.id
limit 1000;
```
Your worker reads that batch, computes embeddings, then writes with COPY or batched inserts. Track rate, failures, and drift. A common failure during bulk load is dimension mismatch after a model config change, and pgvector is refreshingly blunt about it: ERROR: expected 1536 dimensions, not 1024. Good. You want hard failures.
Hosted vector DBs need the same key discipline. Upsert payload should always include semantic_key, source metadata, and chunk checksum. If the vendor only exposes opaque IDs, use your semantic key as the ID. Don't let their autogenerated UUID become your identity layer.
During migration, freeze any retrieval-side assumptions that depend on score shape. If your app has thresholds like score > 0.82 means answer directly, delete that logic or version it per encoder. Absolute thresholds almost always rot across model swaps.
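One way to version that logic instead of deleting it, with made-up threshold values and the version keys from the examples above:

```python
# Hypothetical per-encoder thresholds; absolute scores never transfer
# between embedding spaces, so each encoder gets its own calibration.
ANSWER_THRESHOLDS = {
    "embed-v2026-02": 0.82,
    "embed-v2026-03": None,  # not calibrated yet: fall through to the reranker
}


def can_answer_directly(version_key, score):
    threshold = ANSWER_THRESHOLDS.get(version_key)
    return threshold is not None and score >= threshold
```

An unknown or uncalibrated version defaults to the conservative path, which is exactly the behavior you want mid-migration.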
cutover and rollback
The final cutover should be one config flip, not a chain of side effects. Active encoder version changes in a control table or feature flag, query embedding switches to the candidate model, retrieval points to the candidate index, reranking stays compatible with both paths for one release window, and the old index remains queryable until you've survived enough real traffic to trust the new one.
In PostgreSQL that control plane can be dead simple:
```sql
update embedding_versions
set active = case when version_key = 'embed-v2026-03' then true else false end;
```
The service should read active version on a short cache, 30 seconds is fine, or subscribe to config updates if you've already got that machinery. Keep the old query path alive behind a flag named something boring and obvious, RETRIEVAL_ROLLBACK_VERSION=embed-v2026-02, because the middle of an incident is not the time for poetry.
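The short cache is small enough to write inline. A sketch, assuming fetch_active wraps the control-table query; ActiveVersionCache is a hypothetical name, not a library class:

```python
import time


class ActiveVersionCache:
    """Reads the active embedding version through a short TTL cache so
    cutover is a single control-table update, not a deploy."""

    def __init__(self, fetch_active, ttl_seconds=30.0):
        # fetch_active wraps something like:
        #   select version_key from embedding_versions where active
        self.fetch_active = fetch_active
        self.ttl = ttl_seconds
        self._value = None
        self._expires = 0.0

    def get(self):
        now = time.monotonic()
        if self._value is None or now >= self._expires:
            self._value = self.fetch_active()
            self._expires = now + self.ttl
        return self._value
```

A worst-case 30-second propagation delay is fine for cutover and rollback alike; what matters is that every replica converges on the control table's answer without a restart.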
Rollback-safe reranking matters. If production ranking is ANN top-100 -> cross-encoder rerank -> top-10, the reranker must accept candidates from either encoder version without any code changes or score assumptions. That sounds obvious, but people still bake encoder-specific heuristics into the post-filter stage and then wonder why rollback produces a different class of failures. Keep reranking stateless with respect to embedding version whenever possible.
Watch a few concrete metrics during canary, zero-result rate, clickthrough or task completion if you have it, overlap against old top-k, segment-specific precision from shadow judgments, p95 latency, and volume of low-confidence fallbacks. Canary size should be large enough to include ugly traffic. One percent of a tiny tenant set tells you nothing.
This whole process feels heavier than swapping a model name in config, and that's because it is. Search quality is product behavior, not infrastructure plumbing. Treat encoder upgrades like schema migrations with a blast radius, because that's what they are. Once you do, model rollouts stop being anxious guesswork and start looking like normal engineering.
