Database migrations are one of those things that work perfectly in development, seem fine in staging, and then somehow cause your production deployment to fail spectacularly at 2 AM on a Friday.
I've been through enough migration disasters to know that the problem isn't usually the SQL itself—it's everything around it. The deployment strategy, the rollback plan, the performance implications, and the dozen edge cases nobody thought about until they're staring at a locked table and angry users.
Let's talk about what actually goes wrong and how to handle migrations like a production system deserves.
The Lock That Ate Production
Most migration horror stories start the same way: a seemingly innocent schema change that locks a table for way longer than expected. Here's a classic example:
-- This looks harmless
ALTER TABLE users ADD COLUMN email_verified BOOLEAN DEFAULT FALSE;
On a table with 10 million rows, this might take 20 minutes and lock the entire table. Your users can't log in, your API returns 500s, and your monitoring is screaming.
PostgreSQL is worth calling out here: before version 11, a constant DEFAULT like this forced a rewrite of every row, and even on newer versions a volatile default (now(), random()) still does. MySQL had comparable problems before 8.0 added instant ADD COLUMN. The "fix" is usually to break the change down:
-- Step 1: Add nullable column
ALTER TABLE users ADD COLUMN email_verified BOOLEAN;
-- Step 2: Update in batches (separate deployment)
UPDATE users SET email_verified = FALSE WHERE id BETWEEN 1 AND 10000;
-- Continue in batches...
-- Step 3: Set the default and NOT NULL constraint (another deployment)
ALTER TABLE users ALTER COLUMN email_verified SET DEFAULT FALSE;
ALTER TABLE users ALTER COLUMN email_verified SET NOT NULL;
But this creates a new problem: your application needs to handle the intermediate state where the column might be null. Most ORMs don't handle this gracefully.
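If the ORM won't help, you can normalize at read time instead. A small sketch that treats the not-yet-backfilled rows as unverified during the transition:
-- NULL means "not backfilled yet"; read it as FALSE
SELECT id, COALESCE(email_verified, FALSE) AS email_verified
FROM users
WHERE id = 42;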
The Dependency Chain Nobody Sees
Migrations aren't just about schema changes—they're about the order of operations across your entire deployment pipeline. Consider this sequence:
1. Deploy application code that expects the new column
2. Run the migration that adds the column
3. The application starts using the new column
Seems logical, right? But what happens if step 2 fails? Your application is now deployed with code that depends on a column that doesn't exist.
The safer approach is backwards-compatible migrations:
-- Migration 1: Add column (nullable)
ALTER TABLE users ADD COLUMN email_verified BOOLEAN;
-- Deploy application code that can handle null values
-- Migration 2: Backfill data (batch this on large tables, as above)
UPDATE users SET email_verified = FALSE WHERE email_verified IS NULL;
-- Migration 3: Make column NOT NULL
ALTER TABLE users ALTER COLUMN email_verified SET NOT NULL;
This requires discipline from the development team. Every schema change needs to be planned across multiple deployments.
The Index Performance Gotcha
Adding indexes seems safe—they don't change data, just improve performance. But on large tables, index creation can be surprisingly problematic:
-- This can lock your table for hours
CREATE INDEX idx_users_email ON users(email);
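PostgreSQL offers a non-blocking alternative that lets reads and writes continue while the index builds. A sketch against the same table:
-- Builds without holding an exclusive lock for the duration;
-- note it cannot run inside a transaction block
CREATE INDEX CONCURRENTLY idx_users_email ON users(email);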
This is better, but it's not foolproof: a concurrent build can fail midway through, leaving behind an invalid index that you need to clean up:
-- Check for invalid indexes left behind by a failed concurrent build
SELECT c.relname AS index_name
FROM pg_class c
JOIN pg_index i ON i.indexrelid = c.oid
WHERE NOT i.indisvalid;
-- Drop invalid indexes
DROP INDEX CONCURRENTLY idx_users_email;
The real challenge is that index creation performance depends on your data distribution, concurrent load, and available memory. What works in staging might not work in production.
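One of those variables you can actually control is memory: PostgreSQL's maintenance_work_mem governs how much sort memory an index build gets. A hedged sketch; the right value depends on your hardware:
-- Raise sort memory for this session only; size to available RAM
SET maintenance_work_mem = '1GB';
CREATE INDEX CONCURRENTLY idx_users_email ON users(email);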
Column Renames: The Deployment Minefield
Renaming columns is where things get really interesting. The naive approach breaks everything:
-- This breaks all existing application code
ALTER TABLE users RENAME COLUMN username TO email;
The safe approach requires multiple deployments:
Deployment 1: Add new column, keep old one
ALTER TABLE users ADD COLUMN email VARCHAR(255);
-- Backfill, batching on large tables as described earlier
UPDATE users SET email = username;
Deployment 2: Update application to read from both columns, write to both
// Application code needs to handle both
const user = {
  email: row.email || row.username,
  // ... other fields
};

// Writes go to both columns (the same value, hence $1 twice)
await db.query(
  'UPDATE users SET email = $1, username = $1 WHERE id = $2',
  [email, userId]
);
Deployment 3: Remove old column references from application
Deployment 4: Drop old column
ALTER TABLE users DROP COLUMN username;
This is tedious, but it's the only way to ensure zero-downtime deployments.
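An alternative to dual writes in application code is to let the database keep the two columns in sync during the transition. A sketch using a trigger (PostgreSQL 11+ syntax; inserts would need similar handling):
-- Mirror writes between the old and new column
CREATE OR REPLACE FUNCTION sync_username_email() RETURNS trigger AS $$
BEGIN
    IF NEW.email IS DISTINCT FROM OLD.email THEN
        NEW.username := NEW.email;
    ELSIF NEW.username IS DISTINCT FROM OLD.username THEN
        NEW.email := NEW.username;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER users_sync_columns
BEFORE UPDATE ON users
FOR EACH ROW EXECUTE FUNCTION sync_username_email();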
Data Migrations: The Hidden Complexity
Schema changes are one thing, but data migrations are where the real complexity lives. Moving data between tables, transforming formats, or cleaning up inconsistent data—these operations can be unpredictable.
-- This might work on 100K rows, but what about 10M?
UPDATE orders
SET status = 'completed'
WHERE created_at < '2023-01-01'
AND status = 'pending'
AND payment_confirmed = true;
Large data migrations need to be batched:
-- Process in chunks
DO $$
DECLARE
    batch_size INTEGER := 1000;
    affected_rows INTEGER := 0;
BEGIN
    LOOP
        UPDATE orders
        SET status = 'completed'
        WHERE id IN (
            SELECT id FROM orders
            WHERE created_at < '2023-01-01'
              AND status = 'pending'
              AND payment_confirmed = true
            LIMIT batch_size
        );

        GET DIAGNOSTICS affected_rows = ROW_COUNT;

        IF affected_rows = 0 THEN
            EXIT;
        END IF;

        -- Add a small delay to avoid overwhelming the database
        PERFORM pg_sleep(0.1);
    END LOOP;
END $$;
This is more complex, but it's much safer for production systems. One caveat: a DO block still runs as a single transaction, so row locks accumulate until the whole loop finishes.
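If you need each batch to commit and release its locks independently, PostgreSQL 11+ procedures allow transaction control inside the loop. A minimal sketch of the same backfill (the procedure name here is arbitrary):
CREATE PROCEDURE backfill_completed_orders()
LANGUAGE plpgsql
AS $$
DECLARE
    affected_rows INTEGER;
BEGIN
    LOOP
        UPDATE orders
        SET status = 'completed'
        WHERE id IN (
            SELECT id FROM orders
            WHERE created_at < '2023-01-01'
              AND status = 'pending'
              AND payment_confirmed = true
            LIMIT 1000
        );
        GET DIAGNOSTICS affected_rows = ROW_COUNT;
        EXIT WHEN affected_rows = 0;
        COMMIT;  -- ends this batch's transaction, releasing its locks
    END LOOP;
END;
$$;

-- Must be called outside an explicit transaction block
CALL backfill_completed_orders();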
The Rollback Reality Check
Everyone plans for successful migrations. Few people plan for rollbacks. But migrations fail, and when they do, you need a way back.
The hard truth: not all migrations can be rolled back. Adding a column is easy to reverse, but dropping a column destroys data, and changing data types can be irreversible if you lose precision.
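For example, narrowing a numeric column silently rounds away precision. A hypothetical sketch against an orders.total column:
-- Assuming a hypothetical NUMERIC(10,2) column:
-- 19.99 becomes 20, and the cents cannot be recovered
ALTER TABLE orders ALTER COLUMN total TYPE INTEGER;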
For risky migrations, consider the backup approach:
-- Before dropping a column, back up the data
CREATE TABLE users_backup_username AS
SELECT id, username FROM users;
-- Now you can safely drop the column
ALTER TABLE users DROP COLUMN username;
-- Rollback process (if needed)
-- ALTER TABLE users ADD COLUMN username VARCHAR(255);
-- UPDATE users SET username = b.username
-- FROM users_backup_username b WHERE users.id = b.id;
Monitoring Migration Performance
One thing that's helped me is adding instrumentation to migrations:
-- Log migration start
INSERT INTO migration_log (migration_name, status, started_at)
VALUES ('add_email_verified_column', 'started', NOW());
-- Your migration code here
ALTER TABLE users ADD COLUMN email_verified BOOLEAN;
-- Log migration completion
UPDATE migration_log
SET status = 'completed', completed_at = NOW()
WHERE migration_name = 'add_email_verified_column';
This gives you visibility into which migrations are slow and helps with debugging when things go wrong.
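The migration_log table itself is a one-time setup. A minimal schema that supports the logging above might look like this:
CREATE TABLE IF NOT EXISTS migration_log (
    id             SERIAL PRIMARY KEY,
    migration_name TEXT NOT NULL,
    status         TEXT NOT NULL,
    started_at     TIMESTAMPTZ,
    completed_at   TIMESTAMPTZ
);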
The Coordination Problem
In distributed systems, migrations become a coordination problem. Which service runs the migration? How do you ensure other services don't start using the new schema before it's ready?
One approach is to use feature flags:
// Application code
if (featureFlags.hasEmailVerification) {
  return user.email_verified;
}
return false; // Safe default
This lets you deploy the migration and application code independently, then enable the feature when everything is ready.
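Feature flags answer "when do we turn it on"; for "which instance runs the migration", a PostgreSQL advisory lock is a lightweight guard. A sketch, with an arbitrary lock key:
-- Returns true for exactly one session; others should skip the migration
SELECT pg_try_advisory_lock(42);
-- ... run the migration only if the call returned true ...
SELECT pg_advisory_unlock(42);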
What Actually Works
After dealing with enough migration disasters, here's what I've learned works:
- Always test migrations on production-sized data. Your staging environment with 1,000 rows tells you nothing about performance on 10 million rows (see the sketch after this list).
- Plan for rollbacks from the beginning. If you can't safely roll back a migration, you probably shouldn't run it.
- Use backwards-compatible changes. Every schema change should be planned across multiple deployments.
- Monitor migration performance. Log timing and lock duration so you can spot problems early.
- Have a communication plan. When migrations go wrong, you need to know who to call and what systems might be affected.
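On the first point, you don't need a production dump to get production-sized numbers; generate_series can synthesize the volume locally. A sketch against a hypothetical users table with a created_at column:
-- Ten million synthetic rows for rehearsing the migration
INSERT INTO users (username, created_at)
SELECT 'user_' || n, NOW() - (n || ' seconds')::interval
FROM generate_series(1, 10000000) AS n;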
The most important lesson: migrations are not just database operations—they're deployment operations that affect your entire system. Treat them with the same care you'd give to any other critical infrastructure change.
Migrations will always be risky, but with the right approach, they don't have to be disasters waiting to happen.
Good infrastructure practices apply to every layer of your system—including the database. The same principles that make code deployments safe can make migrations safer too.
Building fast, secure tools is all about making informed trade-offs. That's the philosophy behind ToolShelf—giving developers the utilities they need without the performance penalties they don't.
Stay safe & happy coding,
— ToolShelf Team