Johnny Unar

Zero-Downtime Django Postgres Migrations Need Multi-Step Plans

Safe schema changes in a Django monolith come from staged rollouts, backfills, and boring operational discipline. Squashed migrations won't save you.

squashing is theater

Squashing migrations gets sold as hygiene, and sure, reducing a 400-file migration chain to 12 files makes a repo feel less embarrassing, your CI logs get shorter, and new developers stop asking why python -m django migrate spends half its time replaying archaeology from 2021. None of that has much to do with production safety. The dangerous part of a migration isn't how many files exist in app/migrations, it's the lock you take, the table rewrite you trigger, the index build you forgot to mark CONCURRENTLY, or the app release you coupled to a schema flip that can only succeed if every pod deploys in perfect order.

We've seen this repeatedly on large Django monoliths, including internal tools and ERP-style systems we build at Steezr for SMBs that start simple and then quietly accumulate ten years of business rules in one Postgres database. Teams obsess over cleaning migration history, then ship ALTER TABLE ... ADD COLUMN foo TEXT NOT NULL DEFAULT '' against a 120 million row table at noon, and wonder why API p95 jumps from 140ms to 12s while Postgres queues every writer behind an ACCESS EXCLUSIVE lock. That's the part that matters.

Production migrations are deployment plans. They have phases, observability, abort paths, cleanup work, and sometimes ugly temporary code that exists for two weeks and then gets deleted. Treat them like a binary toggle and you'll eventually earn a bad night.

Django 6.x doesn't remove that burden, it just gives you decent primitives. You still need to know which ORM operations emit lock-heavy SQL, when to set atomic = False, when to split one conceptual change into four releases, and when raw SQL is the honest tool. The migration graph is bookkeeping. The rollout plan is engineering.

expand first

The pattern we use is expand, backfill, switch, contract. Every time. A column rename, enum reshaping, moving data into a new table, replacing a nullable foreign key, adding a uniqueness rule, all of it fits this model if you stop insisting that one migration file should express the whole truth.

Say you want to replace customer.email with customer.primary_email, make it non-null, and enforce uniqueness case-insensitively. The bad version drops and recreates things in one shot. The safe version starts by adding the new column as nullable, no default, then creating the supporting index concurrently. In Django 6.x that usually means one schema migration for the field, then a non-atomic migration for the index:

```python
from django.db import migrations

class Migration(migrations.Migration):
    dependencies = [("customers", "0184_add_primary_email")]
    atomic = False

    operations = [
        migrations.RunSQL(
            sql=(
                "CREATE UNIQUE INDEX CONCURRENTLY IF NOT EXISTS "
                "customers_customer_primary_email_uniq "
                "ON customers_customer (lower(primary_email)) "
                "WHERE primary_email IS NOT NULL;"
            ),
            reverse_sql=(
                "DROP INDEX CONCURRENTLY IF EXISTS "
                "customers_customer_primary_email_uniq;"
            ),
        ),
    ]
```

That atomic = False matters because Postgres rejects CREATE INDEX CONCURRENTLY inside a transaction block with ERROR: CREATE INDEX CONCURRENTLY cannot run inside a transaction block. Django won't save you from that. You need to know it.

Expanding also means avoiding rewrites. Adding a nullable column is cheap. Adding a column with a volatile default can be expensive. Changing TEXT to JSONB on a big table can rewrite the whole relation. Some ALTER TYPE operations are nearly free, some are a trap. Read the actual SQL Django will emit (manage.py sqlmigrate prints it), check the PostgreSQL docs for the lock level each statement takes, and test against production-like row counts, because a migration that finishes in 0.8s on staging can sit in a lock queue forever in prod once thirty app servers start contending for the same table.
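As rough orientation, this is the cheat sheet we keep in our heads for the common cases. It's a sketch from our notes, not a substitute for the docs, and the exact behavior depends on your Postgres version:

```python
# Lock level and rewrite cost of common schema changes on Postgres
# (approximate; verify against the docs for your exact version).
ALTER_COST = {
    "add column, nullable, no default": ("brief ACCESS EXCLUSIVE", "no rewrite"),
    "add column with constant default": ("brief ACCESS EXCLUSIVE", "no rewrite on PG 11+"),
    "add column with volatile default": ("ACCESS EXCLUSIVE", "full table rewrite"),
    "alter type text -> jsonb":         ("ACCESS EXCLUSIVE", "full table rewrite"),
    "alter type varchar(n) -> text":    ("brief ACCESS EXCLUSIVE", "no rewrite"),
    "create index":                     ("SHARE, blocks writes", "one scan"),
    "create index concurrently":        ("SHARE UPDATE EXCLUSIVE", "two scans, writes continue"),
}
```

The pattern to notice: everything in the "full table rewrite" rows holds its ACCESS EXCLUSIVE lock for the duration of the rewrite, which is the whole problem.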

backfill like an operator

Backfills don't belong in request paths, and they usually don't belong in a single Django migration either. A RunPython that loops over 80 million rows with ORM .save() calls is how you end up staring at statement timeout errors while autovacuum falls behind and replicas lag by minutes. Use a background worker, chunk the work, checkpoint progress, and make the job idempotent because you'll restart it.

We usually write a small management command or a one-off worker task, then deploy it separately from the schema expand step. Example, chunk by primary key, update in SQL, sleep a little between batches so you don't turn your own database into a benchmark:

```python
import time

from django.core.management.base import BaseCommand
from django.db import connection

BATCH = 10_000
SLEEP = 0.05

class Command(BaseCommand):
    def handle(self, *args, **opts):
        last_id = 0
        while True:
            with connection.cursor() as cur:
                cur.execute(
                    """
                    WITH batch AS (
                        SELECT id
                        FROM customers_customer
                        WHERE id > %s
                          AND primary_email IS NULL
                        ORDER BY id
                        LIMIT %s
                    )
                    UPDATE customers_customer c
                    SET primary_email = lower(c.email)
                    FROM batch
                    WHERE c.id = batch.id
                    RETURNING c.id
                    """,
                    [last_id, BATCH],
                )
                rows = cur.fetchall()
            if not rows:
                break
            # RETURNING does not guarantee order, so checkpoint on the max id.
            last_id = max(r[0] for r in rows)
            time.sleep(SLEEP)
```

That script is boring, which is exactly what you want at 2am.

For tables receiving writes during the backfill, you need a bridge so old code and new code don't diverge. Sometimes app-level dual writes are enough. For high-value invariants, triggers are safer because they don't depend on every code path being updated correctly. A temporary trigger that keeps primary_email synced from email during rollout is cheap insurance:

```sql
CREATE OR REPLACE FUNCTION sync_primary_email() RETURNS trigger AS $$
BEGIN
    IF NEW.primary_email IS NULL AND NEW.email IS NOT NULL THEN
        NEW.primary_email := lower(NEW.email);
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER customers_sync_primary_email
BEFORE INSERT OR UPDATE OF email, primary_email
ON customers_customer
FOR EACH ROW EXECUTE FUNCTION sync_primary_email();
```

Temporary code has a job: keep data coherent while multiple app versions coexist.

switch with overlap

The switch phase is where teams get overconfident. They see the new column populated, merge one PR that flips reads and writes, and assume the rest is cleanup. Production doesn't work that neatly. You need overlap, because some pods are still on the old release, some async jobs are still deserializing old fields, one cron script no one remembered imports a model constant from 2023, and replicas might briefly lag right when you think everything is aligned.

We prefer a release sequence that tolerates mixed versions. Release one writes old and new. Release two reads new, still writes both. Release three enforces constraints once metrics show the backfill is complete and no old reads remain. If the app can survive a one-version skew, deployment order stops being a source of adrenaline.
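That write sequence is easy to sketch in code. The phase constant and helper below are ours, not a Django API; in practice the phase advances one release at a time so any two adjacent versions still agree on which columns get written:

```python
# Phase-aware writes (sketch). Advance WRITE_PHASE one release at a time:
# "legacy" -> "dual" -> "new". Adjacent phases always overlap on at least
# one column, so mixed-version pods never diverge.
WRITE_PHASE = "dual"

def email_columns(value: str) -> dict:
    """Columns to persist for a given email value, by rollout phase."""
    normalized = value.lower()
    if WRITE_PHASE == "legacy":
        return {"email": value}
    if WRITE_PHASE == "dual":
        return {"email": value, "primary_email": normalized}
    return {"primary_email": normalized}
```

The point of funneling writes through one helper is that the cutover becomes a one-line diff per release instead of a scavenger hunt through every code path.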

This is also the point where Postgres lock behavior matters again. ALTER TABLE ... SET NOT NULL can scan the whole table to verify existing rows. On huge tables we add a CHECK (primary_email IS NOT NULL) NOT VALID constraint, validate it online, then set NOT NULL in a later step; on PostgreSQL 12+ that last step sees the validated check and skips the scan entirely. Same idea for foreign keys: add them NOT VALID, validate later:

```sql
ALTER TABLE orders_order
    ADD CONSTRAINT orders_order_customer_id_fk
    FOREIGN KEY (customer_id)
    REFERENCES customers_customer (id)
    NOT VALID;

ALTER TABLE orders_order
    VALIDATE CONSTRAINT orders_order_customer_id_fk;
```

Validation holds lighter locks than creating a fully validated constraint upfront, which is the difference between an uneventful deploy and a support channel full of timeout screenshots.
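For the NOT NULL itself, the staged version reads as four separate statements, one per deploy step. A sketch, with a constraint name of our choosing, the kind of thing we'd feed to migrations.RunSQL one step at a time:

```python
# Staged NOT NULL enforcement for primary_email (sketch). Each statement
# ships in its own deploy; none holds a heavy lock for long.
NOT_NULL_STEPS = [
    # 1. Cheap: existing rows are not checked yet.
    "ALTER TABLE customers_customer "
    "ADD CONSTRAINT customers_primary_email_not_null "
    "CHECK (primary_email IS NOT NULL) NOT VALID;",
    # 2. Scans the table, but under a lighter lock that lets writes continue.
    "ALTER TABLE customers_customer "
    "VALIDATE CONSTRAINT customers_primary_email_not_null;",
    # 3. On Postgres 12+, the validated check lets this skip the table scan.
    "ALTER TABLE customers_customer "
    "ALTER COLUMN primary_email SET NOT NULL;",
    # 4. The check constraint is now redundant.
    "ALTER TABLE customers_customer "
    "DROP CONSTRAINT customers_primary_email_not_null;",
]
```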

We also add feature flags around read paths for risky switches. Plain Django settings, LaunchDarkly, Flipper clone, doesn't matter. What matters is having a one-minute rollback path that doesn't involve trying to reverse DDL under pressure. Reversing migrations in prod is often fiction. Forward fixes are what actually happen.

tools that save nights

A few tools and habits pay for themselves fast. pg_locks and pg_stat_activity should be muscle memory, not trivia. If a migration stalls, run something like this immediately:

```sql
SELECT blocked.pid    AS blocked_pid,
       blocked.query  AS blocked_query,
       blocking.pid   AS blocking_pid,
       blocking.query AS blocking_query
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked
  ON blocked.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks
  ON blocking_locks.locktype = blocked_locks.locktype
 AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database
 AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
 AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
 AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
 AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid
 AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
 AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid
 AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid
 AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid
 AND blocking_locks.pid != blocked_locks.pid
JOIN pg_catalog.pg_stat_activity blocking
  ON blocking.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted;
```

Set lock_timeout defensively in risky migrations. Failing fast beats silently queueing behind a long transaction from some admin report. We often wrap raw SQL with SET lock_timeout = '2s'; SET statement_timeout = '15min'; inside the migration operation so the deploy fails loudly instead of turning into a brownout.
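A helper makes that hard to forget. This is a sketch of our own, not a Django API; its output goes straight into migrations.RunSQL:

```python
def guarded(ddl: str, lock_timeout: str = "2s",
            statement_timeout: str = "15min") -> str:
    """Prefix risky DDL with session timeouts so a blocked migration
    fails fast instead of queueing every writer behind it (sketch)."""
    return (
        f"SET lock_timeout = '{lock_timeout}'; "
        f"SET statement_timeout = '{statement_timeout}'; "
        f"{ddl}"
    )
```

Usage looks like migrations.RunSQL(guarded("ALTER TABLE ... ADD CONSTRAINT ...")), and a failure shows up as a loud deploy error you can retry off-peak, not a brownout.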

pg_repack is another lifesaver for bloat cleanup and certain table reorganizations, especially after large churn or column drops that didn't actually return space to the OS. It's not magic and you still need enough disk headroom, but it's far safer than pretending VACUUM FULL is acceptable on a hot table. Keep it in the toolbox.

Small maintenance scripts matter too. One script that checks backfill completion percentages. One that verifies trigger presence. One that compares old and new columns for drift on a random sample of 100k rows. One that dumps active migrations and app version hashes before a release. These take an afternoon to write, then save a weekend later.
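The drift check in particular is trivial to sketch. Names and the row shape are hypothetical; in practice the pairs stream from a read replica:

```python
import random

def email_drift(rows, sample_size=100_000, seed=None):
    """Count rows where primary_email != lower(email) on a random sample.
    `rows` is a sequence of (email, primary_email) pairs (sketch)."""
    rng = random.Random(seed)
    pool = list(rows)
    sample = rng.sample(pool, min(sample_size, len(pool)))
    mismatched = sum(
        1 for email, primary in sample
        if (email or "").lower() != (primary or "")
    )
    return mismatched, len(sample)
```

Run it on a schedule during the overlap window; a nonzero count means some code path or trigger isn't holding the invariant, and you want to know that before the contract step, not after.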

contract last, much later

Dropping the old column, removing the trigger, deleting dual-write code, and squashing the migration history can wait. Waiting is healthy. Give the new path a full deploy cycle, then another one, then clean up after you've seen backups succeed, replicas stay current, analytics jobs stop touching the old field, and nobody has opened a bug that somehow still references it.

This delayed cleanup is where squashing migrations becomes actively misleading. People squash because they want the codebase to tell a clean story. Production systems don't have clean stories, they have residue from careful rollouts, and that residue is evidence that the team respects uptime. I would rather open a migration directory and see 0191_add_primary_email_nullable, 0192_primary_email_index_concurrently, 0193_sync_trigger, 0194_noop_placeholder_for_backfill, 0195_enforce_primary_email_constraint, than one majestic 0191_customer_email_refactor that pretends the change happened atomically.

We still squash occasionally, mostly for startup time in fresh environments and to keep apps with very old histories manageable, though only after the operational risk is gone and only after checking that the squashed state mirrors reality exactly. Squashing is repo maintenance. It isn't a migration strategy.

The steady, low-drama way to evolve a Django 6.x monolith is boring on purpose. Expand first. Backfill outside the request path. Use triggers or dual writes during overlap. Validate constraints separately. Switch reads with a rollback path. Contract later. Repeat. That's how one Postgres database can survive years of product churn without the team developing a collective fear of migrate.
