Migration patterns: how to change a running system without stopping it
Every non-trivial system eventually needs to be changed in ways its designers did not anticipate. Move off a database. Break up a service. Replace a vendor. Re-platform. Rewrite a subsystem. These are not greenfield projects — the system is already running, paying salaries, holding data the business cannot afford to lose, and serving users who have not agreed to a maintenance window.
The naive approach — the rewrite — fails so reliably that it has become a running joke among senior engineers. Two years of parallel development against a moving target, followed by a cutover nobody is confident in, followed by six months of discovering what the old system did that nobody had written down. Rewrites sometimes succeed; they usually don’t. And “sometimes” is not a good expected value when the bad outcome is “the business grinds to a halt for a quarter.”
The alternative is a discipline of migration — a set of patterns that turn big, scary changes into sequences of small, reversible ones. Each pattern has a name, a shape, and a specific kind of change it fits. The patterns compose. A major migration is usually several of them, chained, executed over months or years.
This post is a map. The claim under it: migration is a program of work, not a project. Projects end. Programs evolve. Systems that change without outages are systems whose teams have internalized that distinction.
The strangler fig
Martin Fowler’s naming, borrowed from the Australian strangler fig trees that grow around a host tree, gradually replacing it until only the new tree remains. Applied to software: leave the old system running. Build the new one around it. Route traffic piece by piece from old to new. When nothing routes to the old, remove it.
The mechanics:
- Put a layer in front of the old system. Usually a proxy, an API gateway, a routing component. Every request goes through it. For now, it routes everything to the old system.
- Build a new implementation of one capability. Not the whole system — one slice. The capability whose pace of change is mismatched with the monolith, or whose ownership is changing, or whose cost of continued life in the old system is highest.
- Route traffic for that capability to the new service. Start with a small percentage or a specific user segment. Watch for errors. Increase.
- Once the new capability handles all its traffic correctly, remove the corresponding code from the old system. The strangler has grown around the branch.
- Repeat. One capability at a time, until the old system is either hollowed out or cold enough to retire.
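The routing layer at the heart of these steps can be sketched in a few lines. This is a hypothetical, in-process illustration, not a real proxy: `route_request`, `MIGRATED`, and the handler names are all invented for the example, and in production this logic would live in a gateway or reverse proxy.

```python
def old_system(path):
    # The monolith still handles everything not yet migrated.
    return f"old:{path}"

def new_billing_service(path):
    # The first capability carved out of the monolith.
    return f"new:{path}"

# Capabilities move into this table one at a time as they are migrated.
# When the table covers everything, the old system gets no traffic.
MIGRATED = {
    "/billing": new_billing_service,
}

def route_request(path):
    """Route by longest matching capability prefix; default to the old system."""
    for prefix in sorted(MIGRATED, key=len, reverse=True):
        if path.startswith(prefix):
            return MIGRATED[prefix](path)
    return old_system(path)
```

The important property is the default: anything not explicitly migrated falls through to the old system, so forgetting a route is safe rather than an outage.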
The strangler fig’s strengths are cumulative. After the first capability is migrated, the team has the patterns in hand — the routing layer exists, the deployment pipeline works, the observability is wired up. The second capability is easier. The tenth is routine. The whole migration proceeds at roughly constant velocity, rather than the J-curve of a rewrite (slow for years, then a terrifying cutover).
The strangler fig’s weakness is slow progress on systems where the capabilities are tangled. If every capability touches every table, carving a seam is itself a project. In those cases, the first carving is the hardest part of the migration — sometimes months of work before the first capability ships — and is where teams are tempted to give up on the pattern and start a rewrite.
Parallel run (dark launch)
Before cutting traffic from the old system to the new, run both at the same time and compare the results. The old system still serves production traffic. The new system receives the same input and produces output, which is compared to the old system’s output. Differences are logged for investigation; they do not affect the user.
This pattern is how you find the things the old system does that nobody wrote down. A new payment calculation gets rolled out in parallel run. It matches the old one 99% of the time. The 1% is distributed across a dozen edge cases, each of which is a hidden business rule the old code encodes and the new code does not. Fix the rules, rerun, converge.
Parallel run is extremely good at finding these cases, because it runs them at production volume on production data. A test environment cannot replicate this — the edge cases are specifically the ones your test data never hits. The cost is doubled compute for the duration of the run, and some care to ensure the new system’s side effects are either idempotent, written to a separate side branch, or explicitly disabled during the parallel phase.
Two flavors:
- Online parallel run: the new system runs on the live request path. Extra latency must be tolerable. Differences are observed in real time.
- Offline parallel run: production requests are recorded and replayed against the new system later. No production latency impact. Differences are found after the fact, which is fine for everything except real-time rules.
The GitHub “scientist” library is a canonical implementation of online parallel run — call both the control and the candidate, return the control’s result, log the differences. The pattern is mature enough that it is rarely the wrong idea. The question is whether the system’s shape allows it.
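The core of an online parallel run fits in one function. This is a minimal sketch in the spirit of scientist, not its actual API: `experiment` and its parameters are invented here. Note the two safety properties — the control’s result is always returned, and a crashing candidate never affects the caller.

```python
import logging
import random

def experiment(control, candidate, *args, sample_rate=1.0):
    """Run both implementations; serve the control's result, log differences."""
    result = control(*args)
    if random.random() < sample_rate:  # sample to limit the extra compute
        try:
            cand = candidate(*args)
            if cand != result:
                logging.warning("mismatch: control=%r candidate=%r args=%r",
                                result, cand, args)
        except Exception:
            # The candidate failing is data, not an outage.
            logging.exception("candidate raised; control result unaffected")
    return result
```

Each logged mismatch is one of those hidden business rules, surfaced at production volume instead of discovered after cutover.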
Dual writes and the coordination problem
A common pattern during migration: write the data to both the old and the new store. Reads can then come from either, and you can switch reads independently from writes. In theory, the old and new stay in sync automatically.
In practice, dual writes are subtly broken in the same way naive event publishing is broken — as discussed in the event-driven post. Writing to two systems, not in a shared transaction, means one can succeed and the other fail. Your old store has the new state; your new store does not. Or vice versa. Over time, silent divergence accumulates.
The fixes, in order of increasing robustness:
- Write to one system, then the other, and log failures for manual reconciliation. Tolerable at low volumes if you audit regularly. Not tolerable at scale.
- Outbox pattern. Write to the primary store (with an outbox entry) in one transaction. An async process reads the outbox and writes to the secondary store. If the secondary fails, the outbox retries. The primary is authoritative; the secondary is a durable asynchronous projection.
- Change Data Capture. Read the primary’s transaction log and apply changes to the secondary. The primary never knows the secondary exists. This is the cleanest for most migrations — tools like Debezium exist specifically to make it routine.
The outbox and CDC approaches both produce eventually consistent dual systems. For the duration of the migration, the secondary may lag the primary by some milliseconds or seconds. Most migrations can tolerate this; a few cannot, and those need additional care (read-your-writes routing during the transition, for instance).
“Write to both in application code, naively” is the one option that does not work reliably at scale. Every team that tries it eventually discovers this. Skip ahead.
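The outbox variant can be made concrete with a small sketch. SQLite stands in for the primary store and a dict for the secondary; `update_balance` and `drain_outbox` are hypothetical names, and a real relay would be a separate long-running process rather than a function call.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER);
CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,
                     payload TEXT, processed INTEGER DEFAULT 0);
""")

secondary = {}  # stand-in for the new store

def update_balance(account_id, balance):
    # One local transaction covers the business write AND the outbox entry,
    # so the secondary store can never silently miss a change.
    with conn:
        conn.execute("INSERT OR REPLACE INTO accounts VALUES (?, ?)",
                     (account_id, balance))
        conn.execute("INSERT INTO outbox (payload) VALUES (?)",
                     (json.dumps({"id": account_id, "balance": balance}),))

def drain_outbox():
    # An async relay runs this in a loop; retries are safe because applying
    # the same payload twice is idempotent (last write wins).
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE processed = 0").fetchall()
    for row_id, payload in rows:
        event = json.loads(payload)
        secondary[event["id"]] = event["balance"]  # write to the new store
        conn.execute("UPDATE outbox SET processed = 1 WHERE id = ?", (row_id,))
    conn.commit()
```

If the process dies between the transaction and the drain, the outbox row survives and the next drain delivers it — which is exactly the guarantee the naive dual write lacks.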
Expand-contract, applied to migration
The expand-contract pattern — covered in the schema evolution post — applies to migrations broadly, not just to schemas. The shape is always the same:
- Expand. Add the new thing alongside the old. Both coexist.
- Migrate. Move readers and writers over, one at a time, in whatever order is safe.
- Contract. Remove the old thing when nothing uses it.
Applied at various levels:
- Database columns: add new, dual-write, migrate readers, drop old.
- Database tables: add new schema, dual-write, migrate readers, drop old.
- Storage technology (Postgres to a different database): both run, replication or dual-writes keep them synced, readers migrate per feature, writes switch last, old store is decommissioned.
- Service boundaries (monolith to microservice): new service stands up, reads/writes routed gradually, old code removed when nothing reaches it.
- APIs (v1 to v2): both versions respond, consumers migrate, v1 is deprecated and eventually removed.
- Infrastructure (on-prem to cloud, region to region): parallel deployments, traffic shifted, old region drained.
The pattern scales from minutes (a single-column rename in a small table) to years (a corporate-scale datacenter migration). The mechanics are the same. What changes is how many iterations of “migrate readers and writers one at a time” are needed.
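At the smallest scale — a column rename — the three phases are literal SQL statements. A sketch using SQLite, with the table and column names invented for illustration; in production each phase is a separate deploy, with the dual-write and reader migration happening between them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, fullname TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'Ada Lovelace')")

# Expand: the new column coexists with the old one. Nothing breaks.
conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")

# Migrate: backfill existing rows; meanwhile application code dual-writes
# both columns and readers move to the new column one consumer at a time.
conn.execute("UPDATE users SET display_name = fullname "
             "WHERE display_name IS NULL")

# Contract: once nothing reads or writes the old column, drop it.
# (DROP COLUMN needs SQLite 3.35+; older versions recreate the table.)
conn.execute("ALTER TABLE users DROP COLUMN fullname")
```

The same three comments apply unchanged whether the “column” is a table, a database, a service, or a datacenter — only the duration of the middle phase grows.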
Feature flags as a migration tool
Feature flags are commonly framed as a product tool — roll out a feature to 10% of users, measure, expand. They are equally useful as a migration tool. A flag that routes a request either to the old code path or the new code path is the atomic unit of a controlled migration.
The mechanics for migration flags:
- Percent rollout. Route 1% of traffic to the new path. Watch metrics. Increase. This is the cautious default.
- User targeting. Route internal users first, beta users second, production users last. Early users get the fastest bug reports.
- Kill switch. A flag that defaults to “new path” but can be flipped back to “old path” in seconds if something breaks. The ability to roll back without a deploy is what makes aggressive migration safe. The old code must remain until the flag is removed.
- Sticky routing. Once a user is on the new path, keep them there. Otherwise, the same user flipping between paths on each request can produce inconsistent state.
Flags are machinery. The discipline is removing them. A codebase strewn with flags for migrations that finished six months ago is carrying technical debt that is invisible in the moment and painful to clean up later. The working practice: every migration flag has a planned removal date, and the team tracks the list.
LaunchDarkly, Unleash, Flagsmith, and home-rolled equivalents all work. The choice of tool matters less than the operational discipline around it.
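Percent rollout, stickiness, and the kill switch compose naturally when the bucket is a deterministic hash. A minimal sketch with invented names (`MigrationFlag`, `bucket`) — not any particular vendor’s API:

```python
import hashlib

def bucket(user_id, flag_name):
    """Deterministic 0-99 bucket per (user, flag): sticky across requests."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).digest()
    return int.from_bytes(digest[:4], "big") % 100

class MigrationFlag:
    def __init__(self, name, percent=0, kill_switch=False):
        self.name = name
        self.percent = percent          # percent of users on the new path
        self.kill_switch = kill_switch  # flip to force everyone back, no deploy

    def use_new_path(self, user_id):
        if self.kill_switch:
            return False
        return bucket(user_id, self.name) < self.percent
```

Because the bucket is a pure function of user and flag, raising `percent` from 10 to 20 keeps the original 10% on the new path and adds a fresh 10% — users never flap between paths as the rollout widens.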
The data backfill problem
Almost every real migration includes a backfill: the moment when you have the new structure, and you need to populate it from the old structure, for every existing record. A few hundred rows can be backfilled in a single job. Billions of rows require more care.
The constraints:
- The production database cannot be blocked for the duration. Running `UPDATE ... SELECT ...` across a billion-row table is a multi-hour table lock in most databases. Not tolerable.
- The data is changing while you backfill. Rows written during the backfill are either caught by the dual-write path (if set up) or missed.
- Errors must be recoverable. A backfill that fails at row 800M of 1B cannot restart from row 0.
The working pattern is a chunked, idempotent backfill:
- Chunk the source: partition the work by id range, by time range, or by some other key. Process one chunk at a time.
- Make the per-row operation idempotent: “if the new store already has this row, skip or overwrite.” A retry of a chunk must not duplicate writes.
- Throttle: limit the rate at which you write, so the backfill does not starve production queries. Many teams run backfills only during off-peak hours.
- Record progress: a “last successfully backfilled id” marker, so a failed run resumes from the right place.
- Verify: after the backfill completes, a separate job samples the new store and confirms it matches expected values.
Most teams underestimate how long this takes. A billion-row backfill, throttled to 10k rows per second to protect production, is a 28-hour job if nothing goes wrong — and something usually does. Budget the time, and get the backfill running early in the migration so that the rest of the migration doesn’t block waiting for it.
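The chunked, idempotent, resumable shape can be sketched abstractly. Everything here is hypothetical scaffolding — `backfill`, the checkpoint dict, the callback — and a real job would also throttle between chunks and persist the checkpoint to durable storage, not a dict. The sketch assumes the source rows are ordered by id.

```python
def backfill(source_rows, write_row, checkpoint, chunk_size=1000):
    """Chunked, idempotent, resumable backfill over id-ordered rows."""
    last_id = checkpoint.get("last_id", 0)
    while True:
        # Chunk: only rows past the checkpoint, a bounded batch at a time.
        chunk = [r for r in source_rows if r["id"] > last_id][:chunk_size]
        if not chunk:
            break
        for row in chunk:
            # write_row must be idempotent: overwrite, never append,
            # so a retried chunk cannot duplicate data in the new store.
            write_row(row)
        last_id = chunk[-1]["id"]
        # Record progress after each chunk; a crash resumes from here,
        # not from row zero. (A real job would also sleep here to throttle.)
        checkpoint["last_id"] = last_id
```

The verification step stays separate on purpose: a job that checks its own writes with the same code path tends to confirm its own bugs.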
Reverse migrations
The pattern nobody likes to think about, because it feels like giving up: rolling back to the old system after starting the move to the new.
Reverse migrations happen. Sometimes because the new system isn’t ready. Sometimes because an assumption was wrong. Sometimes because a vendor went sideways. A migration without a reverse path is not a safe migration; it is a gamble.
The concrete implication: for as long as the old system is authoritative (or has been authoritative recently), the migration must be reversible. Dual writes to both systems. Reads routed by flag. No destruction of the old system’s data. No one-way door.
Once the new system has been authoritative for long enough that reversing would lose data (or is infeasible for some other reason), the migration is one-way. This is the point of no return and should be crossed deliberately, with both engineering and business signoff. Before that point, the team can retreat. After it, they can only fix forward.
Teams that skip the “reversible” phase and go straight to one-way migrations sometimes get lucky. When they do not, the recovery is catastrophic. The premium on reversibility is worth paying.
Data migration between heterogeneous stores
Migrations within a store type — Postgres to Postgres, Kafka to Kafka — are mechanical. Migrations between heterogeneous stores — Postgres to DynamoDB, MongoDB to Postgres, Oracle to anything — are a different class of problem, because the data shapes are different in ways that require semantic translation.
A relational schema going to a document store has to make choices about aggregation (do we denormalize, and if so, how?). A document schema going to a relational store has to make choices about normalization and handling variable fields. A system moving from SQL to a key-value store has to make choices about access patterns, since the key-value store will not support the relational queries the SQL store did.
These choices are the migration, not details of the migration. The mechanics — dual writes, backfills, cutovers — are the same. The hard part is the modeling decisions before any code is written. Teams that underestimate this spend the first quarter of the migration building infrastructure that works perfectly against the wrong target schema.
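One such modeling decision in miniature: moving a relational parent-child shape into a document store means joining the rows into one aggregate up front, because the target will not join at query time. The data and `to_document` are invented for illustration.

```python
# Relational shape: two tables, joined at query time.
orders = [{"id": 1, "customer": "acme"}]
order_items = [
    {"order_id": 1, "sku": "WIDGET", "qty": 3},
    {"order_id": 1, "sku": "GADGET", "qty": 1},
]

def to_document(order):
    """Denormalize: embed the child rows inside the parent aggregate."""
    return {
        **order,
        "items": [
            {"sku": i["sku"], "qty": i["qty"]}
            for i in order_items
            if i["order_id"] == order["id"]
        ],
    }
```

Whether the items belong inside the order document, or in their own documents keyed for a different access pattern, is exactly the kind of decision that has to be settled before the dual writes and backfills are built.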
The strangler is a program
The mistake most teams make is treating a migration as a project — a thing with a start, a middle, an end, a success metric, and a team that dissolves after. That framing fits standard software-delivery practice; it does not fit reality.
A real migration of any scope is a program. It spans multiple teams over multiple quarters. It has reversible checkpoints. It leaves visible scar tissue (compatibility code, routing layers, dual writes) that only gets cleaned up in the last phase. It interleaves with feature work, because the business keeps moving while the migration runs. It survives changes of personnel, because no migration of that scale stays with the same people from start to finish.
Programs need continuity. The practical forms: a running migration playbook that the next team can read, a backlog of the remaining capabilities to move, metrics that show progress (traffic on new path / traffic on old path), a list of outstanding feature flags with removal dates, and someone — named, senior — whose job is to make the migration finish. Without that named person, migrations stall in the 80%-done state indefinitely, because the last 20% is never the most interesting work.
The rule
Do not rewrite. Strangle. Run in parallel before cutting over. Dual-write with mechanical safety (outbox, CDC). Expand-contract at every level. Use feature flags to cut traffic safely and keep the reverse path open. Backfill in chunks, idempotently, with measured progress. Treat the migration as a program, not a project, with a named owner and visible progress.
The systems that change without outages do so because their teams have made “change without outages” a working discipline. The systems that can’t usually can’t because nobody named the problem as one worth the discipline. Naming it is most of the fix.
Migration is how running systems stay alive as the business changes around them. It is not glamorous. It is often most of the work. Treat it that way.