Database Migrations: The Silent Killer of Production Deployments

Database migrations are one of those things that work perfectly in development, seem fine in staging, and then somehow cause your production deployment to fail spectacularly at 2 AM on a Friday.

I've been through enough migration disasters to know that the problem isn't usually the SQL itself—it's everything around it. The deployment strategy, the rollback plan, the performance implications, and the dozen edge cases nobody thought about until they're staring at a locked table and angry users.

Let's talk about what actually goes wrong and how to handle migrations like a production system deserves.

The Lock That Ate Production

Most migration horror stories start the same way: a seemingly innocent schema change that locks a table for way longer than expected. Here's a classic example:

-- This looks harmless
ALTER TABLE users ADD COLUMN email_verified BOOLEAN DEFAULT FALSE;

On a table with 10 million rows, this might take 20 minutes and lock the entire table. Your users can't log in, your API returns 500s, and your monitoring is screaming.

How painful this is depends heavily on your database version. In PostgreSQL releases before 11, adding a column with a DEFAULT forced a rewrite of every row; newer versions store a constant default as metadata, so the ALTER itself is fast. MySQL historically had the same problem, though recent versions can often add columns instantly. If you're on an older version, or your change isn't eligible for the fast path, the usual fix is to break it down:

-- Step 1: Add nullable column
ALTER TABLE users ADD COLUMN email_verified BOOLEAN;

-- Step 2: Update in batches (separate deployment)
UPDATE users SET email_verified = FALSE WHERE id BETWEEN 1 AND 10000;
-- Continue in batches...

-- Step 3: Set NOT NULL constraint (another deployment)
ALTER TABLE users ALTER COLUMN email_verified SET NOT NULL;

But this creates a new problem: your application needs to handle the intermediate state where the column might be null. Most ORMs don't handle this gracefully.
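
One way to paper over that intermediate state is to treat NULL as FALSE at the query layer instead of trusting the ORM. A minimal sketch, assuming you can drop down to raw SQL for the read:

-- Treat NULL as "not verified" while the backfill is still in flight
SELECT id, COALESCE(email_verified, FALSE) AS email_verified
FROM users
WHERE id = $1;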

The Dependency Chain Nobody Sees

Migrations aren't just about schema changes—they're about the order of operations across your entire deployment pipeline. Consider this sequence:

  1. Deploy application code that expects new column
  2. Run migration that adds the column
  3. Application starts using the new column

Seems logical, right? But what happens if step 2 fails? Your application is now deployed with code that depends on a column that doesn't exist.

The safer approach is backwards-compatible migrations:

-- Migration 1: Add column (nullable)
ALTER TABLE users ADD COLUMN email_verified BOOLEAN;

-- Deploy application code that can handle null values
-- Migration 2: Backfill data
UPDATE users SET email_verified = FALSE WHERE email_verified IS NULL;

-- Migration 3: Make column NOT NULL
ALTER TABLE users ALTER COLUMN email_verified SET NOT NULL;

This requires discipline from the development team. Every schema change needs to be planned across multiple deployments.

The Index Performance Gotcha

Adding indexes seems safe—they don't change data, just improve performance. But on large tables, index creation can be surprisingly problematic:

-- This can lock your table for hours
CREATE INDEX idx_users_email ON users(email);
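
PostgreSQL can build the index without holding an exclusive lock for the whole build. A minimal example; note that CREATE INDEX CONCURRENTLY can't run inside a transaction block, so most migration frameworks need the migration marked as non-transactional:

CREATE INDEX CONCURRENTLY idx_users_email ON users(email);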

CONCURRENTLY avoids the long exclusive lock, but it's not foolproof. It can fail midway through and leave behind an invalid index that you need to find and clean up:

-- Find indexes left invalid by a failed concurrent build
SELECT indexrelid::regclass AS index_name
FROM pg_index
WHERE NOT indisvalid;

-- Drop invalid indexes
DROP INDEX CONCURRENTLY idx_users_email;

The real challenge is that index creation performance depends on your data distribution, concurrent load, and available memory. What works in staging might not work in production.
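
If you're on PostgreSQL 12 or later, you can at least watch a long index build from another session instead of guessing. This only reads a built-in progress view:

-- Progress of any in-flight CREATE INDEX / REINDEX
SELECT pid, phase, blocks_done, blocks_total
FROM pg_stat_progress_create_index;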

Column Renames: The Deployment Minefield

Renaming columns is where things get really interesting. The naive approach breaks everything:

-- This breaks all existing application code
ALTER TABLE users RENAME COLUMN username TO email;

The safe approach requires multiple deployments:

Deployment 1: Add new column, keep old one

ALTER TABLE users ADD COLUMN email VARCHAR(255);
UPDATE users SET email = username;

Deployment 2: Update application to read from both columns, write to both

// Application code needs to handle both
const user = {
  email: row.email || row.username,
  // ... other fields
};

// Writes go to both columns
await db.query('UPDATE users SET email = $1, username = $1 WHERE id = $2', 
               [email, userId]);
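
If dual-writing from every code path feels fragile, a database trigger can keep the two columns in sync instead; this is an alternative to the application-level writes above, not a replacement for the rest of the sequence. A hedged PostgreSQL sketch (the function and trigger names are made up for illustration):

-- Mirror writes between username and email while both columns exist
CREATE OR REPLACE FUNCTION sync_username_email() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        NEW.email := COALESCE(NEW.email, NEW.username);
        NEW.username := COALESCE(NEW.username, NEW.email);
    ELSIF NEW.username IS DISTINCT FROM OLD.username THEN
        NEW.email := NEW.username;   -- old code path wrote username
    ELSIF NEW.email IS DISTINCT FROM OLD.email THEN
        NEW.username := NEW.email;   -- new code path wrote email
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER users_sync_username_email
BEFORE INSERT OR UPDATE ON users
FOR EACH ROW EXECUTE FUNCTION sync_username_email();  -- EXECUTE PROCEDURE on PostgreSQL 10 and older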

Deployment 3: Remove old column references from application

Deployment 4: Drop old column

ALTER TABLE users DROP COLUMN username;

This is tedious, but it's the only way to ensure zero-downtime deployments.

Data Migrations: The Hidden Complexity

Schema changes are one thing, but data migrations are where the real complexity lives. Moving data between tables, transforming formats, or cleaning up inconsistent data—these operations can be unpredictable.

-- This might work on 100K rows, but what about 10M?
UPDATE orders 
SET status = 'completed' 
WHERE created_at < '2023-01-01' 
  AND status = 'pending' 
  AND payment_confirmed = true;

Large data migrations need to be batched:

-- Process in chunks
DO $$
DECLARE
    batch_size INTEGER := 1000;
    affected_rows INTEGER := 0;
BEGIN
    LOOP
        UPDATE orders 
        SET status = 'completed' 
        WHERE id IN (
            SELECT id FROM orders 
            WHERE created_at < '2023-01-01' 
              AND status = 'pending' 
              AND payment_confirmed = true
            LIMIT batch_size
        );
        
        GET DIAGNOSTICS affected_rows = ROW_COUNT;
        
        IF affected_rows = 0 THEN
            EXIT;
        END IF;
        
        -- Add a small delay to avoid overwhelming the database
        PERFORM pg_sleep(0.1);
    END LOOP;
END $$;

This is more complex, but it's much safer for production systems. One caveat: a DO block still runs as a single transaction, so the chunking above keeps each UPDATE small but doesn't release locks or commit progress between batches.
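
On PostgreSQL 11 or later, a procedure can commit between batches, which releases each chunk's locks before the next one starts. A sketch under those assumptions (the procedure name is invented; CALL it outside an explicit transaction block):

CREATE PROCEDURE backfill_completed_orders()
LANGUAGE plpgsql
AS $$
DECLARE
    affected_rows INTEGER;
BEGIN
    LOOP
        UPDATE orders
        SET status = 'completed'
        WHERE id IN (
            SELECT id FROM orders
            WHERE created_at < '2023-01-01'
              AND status = 'pending'
              AND payment_confirmed = true
            LIMIT 1000
        );

        GET DIAGNOSTICS affected_rows = ROW_COUNT;
        EXIT WHEN affected_rows = 0;

        COMMIT;  -- each batch becomes its own transaction
    END LOOP;
END;
$$;

CALL backfill_completed_orders();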

The Rollback Reality Check

Everyone plans for successful migrations. Few people plan for rollbacks. But migrations fail, and when they do, you need a way back.

The hard truth: not every migration can be rolled back. Adding a column is easy to reverse, but dropping a column destroys data, and changing data types can be irreversible if you lose precision.

For risky migrations, consider the backup approach:

-- Before dropping a column, back up the data
CREATE TABLE users_backup_username AS 
SELECT id, username FROM users;

-- Now you can safely drop the column
ALTER TABLE users DROP COLUMN username;

-- Rollback process (if needed)
-- ALTER TABLE users ADD COLUMN username VARCHAR(255);
-- UPDATE users SET username = b.username 
-- FROM users_backup_username b WHERE users.id = b.id;
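
A small habit that helps here: sanity-check the backup before you actually run the DROP. Even a row-count comparison catches the obvious failure modes:

-- Run this before dropping username; both counts should match
SELECT
  (SELECT COUNT(*) FROM users) AS users_rows,
  (SELECT COUNT(*) FROM users_backup_username) AS backup_rows;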

Monitoring Migration Performance

One thing that's helped me is adding instrumentation to migrations:

-- Log migration start
INSERT INTO migration_log (migration_name, status, started_at) 
VALUES ('add_email_verified_column', 'started', NOW());

-- Your migration code here
ALTER TABLE users ADD COLUMN email_verified BOOLEAN;

-- Log migration completion
UPDATE migration_log 
SET status = 'completed', completed_at = NOW() 
WHERE migration_name = 'add_email_verified_column';

This gives you visibility into which migrations are slow and helps with debugging when things go wrong.
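
The migration_log table is nothing standard, just something you create once per database; a minimal sketch might look like:

CREATE TABLE IF NOT EXISTS migration_log (
    id             SERIAL PRIMARY KEY,
    migration_name TEXT NOT NULL,
    status         TEXT NOT NULL,
    started_at     TIMESTAMPTZ,
    completed_at   TIMESTAMPTZ
);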

The Coordination Problem

In distributed systems, migrations become a coordination problem. Which service runs the migration? How do you ensure other services don't start using the new schema before it's ready?

One approach is to use feature flags:

// Application code
if (featureFlags.hasEmailVerification) {
  return user.email_verified;
}
return false; // Safe default

This lets you deploy the migration and application code independently, then enable the feature when everything is ready.

What Actually Works

After dealing with enough migration disasters, here's what I've learned works:

  • Always test migrations on production-sized data. Your staging environment with 1,000 rows tells you nothing about performance on 10 million rows (there's a sketch of faking that volume after this list).
  • Plan for rollbacks from the beginning. If you can't safely roll back a migration, you probably shouldn't run it.
  • Use backwards-compatible changes. Every schema change should be planned across multiple deployments.
  • Monitor migration performance. Log timing and lock duration so you can spot problems early.
  • Have a communication plan. When migrations go wrong, you need to know who to call and what systems might be affected.
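
On the first point: you don't need a copy of production to rehearse, because generate_series can fake the volume in a scratch database. A sketch using the running users example (adjust the columns to whatever your schema actually requires):

-- Fill a scratch users table with 10 million synthetic rows
INSERT INTO users (username)
SELECT 'user_' || g
FROM generate_series(1, 10000000) AS g;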

The most important lesson: migrations are not just database operations—they're deployment operations that affect your entire system. Treat them with the same care you'd give to any other critical infrastructure change.

Migrations will always be risky, but with the right approach, they don't have to be disasters waiting to happen.

Good infrastructure practices apply to every layer of your system—including the database. The same principles that make code deployments safe can make migrations safer too.


Building fast, secure tools is all about making informed trade-offs. That's the philosophy behind ToolShelf—giving developers the utilities they need without the performance penalties they don't.

Stay safe & happy coding,
— ToolShelf Team