squashing is theater
Squashing migrations gets sold as hygiene, and sure, reducing a 400-file migration chain to 12 files makes a repo feel less embarrassing: CI logs get shorter, and new developers stop asking why python -m django migrate spends half its time replaying archaeology from 2021. None of that has much to do with production safety. The dangerous part of a migration isn't how many files exist in app/migrations; it's the lock you take, the table rewrite you trigger, the index build you forgot to mark CONCURRENTLY, or the app release you coupled to a schema flip that can only succeed if every pod deploys in perfect order.
We've seen this repeatedly on large Django monoliths, including internal tools and ERP-style systems we build at Steezr for SMBs that start simple and then quietly accumulate ten years of business rules in one Postgres database. Teams obsess over cleaning migration history, then ship ALTER TABLE ... ADD COLUMN foo TEXT NOT NULL DEFAULT '' against a 120 million row table at noon, and wonder why API p95 jumps from 140ms to 12s while Postgres queues every writer behind an ACCESS EXCLUSIVE lock. That's the part that matters.
Production migrations are deployment plans. They have phases, observability, abort paths, cleanup work, and sometimes ugly temporary code that exists for two weeks and then gets deleted. Treat them like a binary toggle and you'll eventually earn a bad night.
Django 6.x doesn't remove that burden, it just gives you decent primitives. You still need to know which ORM operations emit lock-heavy SQL, when to set atomic = False, when to split one conceptual change into four releases, and when raw SQL is the honest tool. The migration graph is bookkeeping. The rollout plan is engineering.
expand first
The pattern we use is expand, backfill, switch, contract. Every time. A column rename, enum reshaping, moving data into a new table, replacing a nullable foreign key, adding a uniqueness rule, all of it fits this model if you stop insisting that one migration file should express the whole truth.
Say you want to replace customer.email with customer.primary_email, make it non-null, and enforce uniqueness case-insensitively. The bad version drops and recreates things in one shot. The safe version starts by adding the new column as nullable, no default, then creating the supporting index concurrently. In Django 6.x that usually means one schema migration for the field, then a non-atomic migration for the index:
```python
from django.db import migrations


class Migration(migrations.Migration):
    dependencies = [("customers", "0184_add_primary_email")]
    atomic = False

    operations = [
        migrations.RunSQL(
            sql=(
                "CREATE UNIQUE INDEX CONCURRENTLY IF NOT EXISTS "
                "customers_customer_primary_email_uniq "
                "ON customers_customer (lower(primary_email)) "
                "WHERE primary_email IS NOT NULL;"
            ),
            reverse_sql=(
                "DROP INDEX CONCURRENTLY IF EXISTS "
                "customers_customer_primary_email_uniq;"
            ),
        ),
    ]
```
That atomic = False matters because Postgres rejects CREATE INDEX CONCURRENTLY inside a transaction block with ERROR: CREATE INDEX CONCURRENTLY cannot run inside a transaction block. Django won't save you from that. You need to know it.
Expanding also means avoiding rewrites. Adding a nullable column is cheap. Adding a column with a volatile default can be expensive. Changing TEXT to JSONB on a big table can rewrite the whole relation. Some ALTER TYPE operations are nearly free, some are a trap. Read the actual SQL plan, check PostgreSQL docs for lock level, and test against production-like row counts, because a migration that finishes in 0.8s on staging can sit on a lock queue forever in prod once thirty app servers start contending for the same table.
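One cheap way to settle the "does this ALTER rewrite?" question on a production-like copy: Postgres assigns a relation a new on-disk file (its relfilenode) whenever it rewrites it, so compare the value before and after the ALTER. A minimal sketch; the SQL uses the real `pg_relation_filenode()` function, and the comparison helper is pure Python:

```python
# Sketch: detect whether an ALTER rewrote a table. Run this query against
# the table before and after the ALTER on a production-like copy; the
# function name is real Postgres, the helper below is illustrative.
REWRITE_CHECK_SQL = "SELECT pg_relation_filenode(%s);"


def table_was_rewritten(filenode_before: int, filenode_after: int) -> bool:
    # Postgres assigns a new on-disk file when it rewrites a relation,
    # so a changed relfilenode means every row was copied.
    return filenode_before != filenode_after
```

If the number changed, the ALTER paid the full-copy cost you were trying to avoid.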
backfill like an operator
Backfills don't belong in request paths, and they usually don't belong in a single Django migration either. A RunPython that loops over 80 million rows with ORM .save() calls is how you end up staring at statement timeout errors while autovacuum falls behind and replicas lag by minutes. Use a background worker, chunk the work, checkpoint progress, and make the job idempotent because you'll restart it.
We usually write a small management command or a one-off worker task, then deploy it separately from the schema expand step. Example, chunk by primary key, update in SQL, sleep a little between batches so you don't turn your own database into a benchmark:
```python
from django.core.management.base import BaseCommand
from django.db import connection
import time

BATCH = 10_000
SLEEP = 0.05


class Command(BaseCommand):
    def handle(self, *args, **opts):
        last_id = 0
        while True:
            with connection.cursor() as cur:
                cur.execute(
                    """
                    WITH batch AS (
                        SELECT id
                        FROM customers_customer
                        WHERE id > %s
                          AND primary_email IS NULL
                        ORDER BY id
                        LIMIT %s
                    )
                    UPDATE customers_customer c
                    SET primary_email = lower(c.email)
                    FROM batch
                    WHERE c.id = batch.id
                    RETURNING c.id
                    """,
                    [last_id, BATCH],
                )
                rows = cur.fetchall()
            if not rows:
                break
            # RETURNING rows are not guaranteed to come back in order,
            # so advance the checkpoint with max(), not the last row.
            last_id = max(r[0] for r in rows)
            time.sleep(SLEEP)
```
That script is boring, which is exactly what you want at 2am.
For tables receiving writes during the backfill, you need a bridge so old code and new code don't diverge. Sometimes app-level dual writes are enough. For high-value invariants, triggers are safer because they don't depend on every code path being updated correctly. A temporary trigger that keeps primary_email synced from email during rollout is cheap insurance:
```sql
CREATE OR REPLACE FUNCTION sync_primary_email() RETURNS trigger AS $$
BEGIN
    IF NEW.primary_email IS NULL AND NEW.email IS NOT NULL THEN
        NEW.primary_email := lower(NEW.email);
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER customers_sync_primary_email
BEFORE INSERT OR UPDATE OF email, primary_email
ON customers_customer
FOR EACH ROW EXECUTE FUNCTION sync_primary_email();
```
Temporary code has a job. Keep data coherent while multiple app versions coexist.
switch with overlap
The switch phase is where teams get overconfident. They see the new column populated, merge one PR that flips reads and writes, and assume the rest is cleanup. Production doesn't work that neatly. You need overlap, because some pods are still on the old release, some async jobs are still deserializing old fields, one cron script no one remembered imports a model constant from 2023, and replicas might briefly lag right when you think everything is aligned.
We prefer a release sequence that tolerates mixed versions. Release one writes old and new. Release two reads new, still writes both. Release three enforces constraints once metrics show the backfill is complete and no old reads remain. If the app can survive one version skew, deployment order stops being a source of adrenaline.
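The write-both phase can live anywhere writes happen; a sketch of the normalisation it has to perform, assuming the customer fields from earlier (the function and its dict shape are illustrative, not a Django API):

```python
# Sketch of the "write both" phase: old code sends only "email", new code
# may send only "primary_email". Normalise so both columns always land
# with the same value while mixed app versions coexist.
def dual_write_fields(data: dict) -> dict:
    out = dict(data)
    if out.get("primary_email") is None and out.get("email"):
        out["primary_email"] = out["email"].lower()
    if out.get("email") is None and out.get("primary_email"):
        out["email"] = out["primary_email"]
    return out
```

In a real codebase this logic would sit in the model's save path or a serializer, mirroring what the trigger enforces at the database level.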
This is also the point where Postgres lock behavior matters again. ALTER TABLE ... SET NOT NULL can scan the table to verify existing rows. Recent PostgreSQL versions handle this far better than they used to; since version 12, Postgres can prove NOT NULL from an existing validated CHECK constraint and skip the scan. So on huge tables we add a CHECK (primary_email IS NOT NULL) NOT VALID, validate it online, then set NOT NULL in a later step if needed. Same idea for foreign keys: add them NOT VALID, validate later:
```sql
ALTER TABLE orders_order
    ADD CONSTRAINT orders_order_customer_id_fk
    FOREIGN KEY (customer_id)
    REFERENCES customers_customer(id)
    NOT VALID;

ALTER TABLE orders_order
    VALIDATE CONSTRAINT orders_order_customer_id_fk;
```
Validation holds lighter locks than creating a fully validated constraint upfront, which is the difference between an uneventful deploy and a support channel full of timeout screenshots.
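The NOT NULL path described above follows the same three-step shape every time, so we tend to generate the statements rather than hand-write them. A small illustrative helper (the constraint naming scheme is our convention, not anything Django-defined):

```python
# Sketch: the three-step NOT NULL rollout. Add a NOT VALID check, validate
# it online, then SET NOT NULL, which on PG 12+ can use the validated
# constraint instead of scanning the table.
def not_null_statements(table: str, column: str) -> list[str]:
    con = f"{table}_{column}_not_null_chk"
    return [
        f"ALTER TABLE {table} ADD CONSTRAINT {con} "
        f"CHECK ({column} IS NOT NULL) NOT VALID;",
        f"ALTER TABLE {table} VALIDATE CONSTRAINT {con};",
        f"ALTER TABLE {table} ALTER COLUMN {column} SET NOT NULL;",
    ]
```

Each statement goes in its own migration (the first two can share one, the SET NOT NULL waits for the backfill to finish), never all three in one transaction.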
We also add feature flags around read paths for risky switches. Plain Django settings, LaunchDarkly, Flipper clone, doesn't matter. What matters is having a one-minute rollback path that doesn't involve trying to reverse DDL under pressure. Reversing migrations in prod is often fiction. Forward fixes are what actually happen.
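The flag-gated read path is only a few lines; what matters is that the old path stays intact until contract. A minimal sketch with an in-process flag store standing in for whatever flag system you actually use:

```python
# Sketch: a flag-gated read with a one-minute rollback. Flipping the flag
# back to False reverts reads without touching DDL. The dict stands in for
# a Django setting or a flag-service lookup.
FLAGS = {"read_new_email": True}


def customer_email(row: dict) -> str:
    if FLAGS["read_new_email"] and row.get("primary_email"):
        return row["primary_email"]
    return row["email"]  # old path, kept intact until contract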
tools that save nights
A few tools and habits pay for themselves fast. pg_locks and pg_stat_activity should be muscle memory, not trivia. If a migration stalls, run something like this immediately:
```sql
SELECT blocked.pid    AS blocked_pid,
       blocked.query  AS blocked_query,
       blocking.pid   AS blocking_pid,
       blocking.query AS blocking_query
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked
  ON blocked.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks
  ON blocking_locks.locktype = blocked_locks.locktype
 AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database
 AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
 AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
 AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
 AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid
 AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
 AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid
 AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid
 AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid
 AND blocking_locks.pid != blocked_locks.pid
JOIN pg_catalog.pg_stat_activity blocking
  ON blocking.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted;
```
Set lock_timeout defensively in risky migrations. Failing fast beats silently queueing behind a long transaction from some admin report. We often wrap raw SQL with SET lock_timeout = '2s'; SET statement_timeout = '15min'; inside the migration operation so the deploy fails loudly instead of turning into a brownout.
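A tiny helper along those lines, prepending the session timeouts to whatever DDL the migration runs (illustrative, not a Django API; the result is what we'd hand to migrations.RunSQL):

```python
# Sketch: wrap risky DDL with session timeouts so a blocked migration
# fails fast and loudly instead of queueing behind a long transaction.
def with_timeouts(ddl: str, lock: str = "2s", stmt: str = "15min") -> str:
    return (
        f"SET lock_timeout = '{lock}'; "
        f"SET statement_timeout = '{stmt}'; "
        f"{ddl}"
    )
```

On a deploy failure the fix is usually just retrying once the blocking transaction finishes, which is a far better failure mode than a brownout.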
pg_repack is another lifesaver for bloat cleanup and certain table reorganizations, especially after large churn or column drops that didn't actually return space to the OS. It's not magic, and you still need enough disk headroom, but it's far safer than pretending VACUUM FULL is acceptable on a hot table. Keep it in the toolbox.
Small maintenance scripts matter too. One script that checks backfill completion percentages. One that verifies trigger presence. One that compares old and new columns for drift on a random sample of 100k rows. One that dumps active migrations and app version hashes before a release. These take an afternoon to write, then save a weekend later.
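The drift check is the one we reach for most. The sampling query is a plain SELECT (TABLESAMPLE works well); the comparison itself is pure and therefore trivially testable. A sketch, assuming the email columns from earlier:

```python
# Sketch of the drift check: given sampled (id, email, primary_email)
# rows, return the ids where the new column disagrees with the rule
# primary_email = lower(email). Non-empty output means a write path or
# the trigger is missing something.
def email_drift(rows):
    return [
        rid
        for rid, email, primary in rows
        if email is not None and (primary or "").lower() != email.lower()
    ]
```

We run it against a random 100k-row sample on a schedule during the overlap window and alert if it ever returns anything.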
contract last, much later
Dropping the old column, removing the trigger, deleting dual-write code, and squashing the migration history can wait. Waiting is healthy. Give the new path a full deploy cycle, then another one, then clean up after you've seen backups succeed, replicas stay current, analytics jobs stop touching the old field, and nobody has opened a bug that somehow still references it.
This delayed cleanup is where squashing migrations becomes actively misleading. People squash because they want the codebase to tell a clean story. Production systems don't have clean stories, they have residue from careful rollouts, and that residue is evidence that the team respects uptime. I would rather open a migration directory and see 0191_add_primary_email_nullable, 0192_primary_email_index_concurrently, 0193_sync_trigger, 0194_noop_placeholder_for_backfill, 0195_enforce_primary_email_constraint, than one majestic 0191_customer_email_refactor that pretends the change happened atomically.
We still squash occasionally, mostly for startup time in fresh environments and to keep apps with very old histories manageable, though only after the operational risk is gone and only after checking that the squashed state mirrors reality exactly. Squashing is repo maintenance. It isn't a migration strategy.
The steady, low-drama way to evolve a Django 6.x monolith is boring on purpose. Expand first. Backfill outside the request path. Use triggers or dual writes during overlap. Validate constraints separately. Switch reads with a rollback path. Contract later. Repeat. That's how one Postgres database can survive years of product churn without the team developing a collective fear of migrate.
