MySQL Internals
Indexing
B-Tree Structure
InnoDB stores data in a B+ tree. Understanding the node layout, page splits, and pointer-following is key to explaining why some queries are fast and others aren't.
- B+ tree leaf nodes hold row data (clustered) or PK pointers (secondary)
- Page size defaults to 16 KB; controls how many keys fit per node
- Balanced height (typically 3-4 levels for millions of rows) means O(log n) lookups
- Inserts into a full page cause a page split, impacting write performance
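As a back-of-envelope illustration of why the tree stays shallow, here is a rough height estimate in Python. The per-entry byte sizes are illustrative assumptions, not InnoDB's exact on-page layout:

```python
import math

def btree_height(total_rows, page_size=16 * 1024,
                 internal_entry=12, leaf_row=160):
    """Rough B+ tree height estimate.

    internal_entry and leaf_row are illustrative byte sizes per index
    entry / per row, not InnoDB's exact page format.
    """
    fanout = page_size // internal_entry            # children per internal page
    rows_per_leaf = page_size // leaf_row           # rows per leaf page
    nodes = math.ceil(total_rows / rows_per_leaf)   # number of leaf pages
    height = 1
    while nodes > 1:                                # add levels up to one root
        nodes = math.ceil(nodes / fanout)
        height += 1
    return height
```

With these assumptions, 10 million rows need only a 3-level tree (fanout ≈ 1365, ~102 rows per leaf), which is why point lookups cost just a few page reads.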
Clustered vs Secondary Index
InnoDB has exactly one clustered index (the PK). Every secondary index stores a copy of the PK, so lookups via secondary indexes require a double lookup unless the index covers the query.
- Clustered index = data is physically ordered by PK
- Secondary index leaf stores PK value, not a row pointer
- Double lookup: secondary index → PK value → clustered index → row
- Wide PKs bloat every secondary index
Composite Index — Column Order Matters
A composite index on (A, B, C) can serve queries filtering on A, (A,B), or (A,B,C) efficiently, but not on B alone or C alone. The leftmost prefix rule is one of the most frequently tested MySQL concepts.
- Leftmost prefix rule: index is useful only from the leftmost column
- Range condition on a column stops index use for subsequent columns
- ORDER BY can also leverage the index if column order matches
- Use EXPLAIN's key_len column to verify how much of the composite index is actually used
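The leftmost prefix rule can be sketched with a sorted list of key tuples standing in for the composite index (a simplification of real B+ tree pages, but the seek-vs-scan contrast is the same):

```python
from bisect import bisect_left

# Hypothetical composite index on (a, b), modeled as a sorted list of keys.
index = sorted((a, b) for a in range(100) for b in range(100))

def seek_prefix(idx, a):
    """Range-seek on the leftmost column — the access path the index supports."""
    lo = bisect_left(idx, (a,))        # (a,) sorts before every (a, b)
    hi = bisect_left(idx, (a + 1,))
    return idx[lo:hi]                  # O(log n) seek + one contiguous read

def filter_b(idx, b):
    """Filter on the second column alone — degenerates to a full index scan."""
    return [k for k in idx if k[1] == b]   # O(n): matching keys are scattered
```

Both functions return 100 rows here, but the first touches one contiguous slice while the second walks all 10,000 keys — exactly the difference EXPLAIN shows between `ref`/`range` and `index`/`ALL`.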
Covering Index
When an index contains all the columns a query needs, MySQL reads data entirely from the index without touching the clustered index. EXPLAIN shows "Using index" in the Extra column.
- Eliminates the secondary-to-clustered lookup ("bookmark lookup")
- Significantly reduces I/O for read-heavy queries
- Trade-off: wider indexes consume more storage and slow writes
Index Selectivity
Selectivity measures how many distinct values an index column has relative to total rows. High selectivity (e.g., email) makes an index effective; low selectivity (e.g., boolean) often doesn't justify an index.
- Selectivity = COUNT(DISTINCT col) / COUNT(*)
- MySQL optimizer may skip a low-selectivity index in favour of a full scan
- Cardinality stats in SHOW INDEX help assess selectivity
- Prefix indexes can help on long strings, but reduce selectivity
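The selectivity formula is easy to check directly. This sketch uses sqlite3 as a stand-in (the SELECT runs unchanged on MySQL; the table and column names are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (email TEXT, is_active INTEGER)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(f"user{i}@example.com", i % 2) for i in range(1000)],
)

def selectivity(col):
    # Selectivity = COUNT(DISTINCT col) / COUNT(*)
    distinct, total = conn.execute(
        f"SELECT COUNT(DISTINCT {col}), COUNT(*) FROM users"
    ).fetchone()
    return distinct / total
```

Here email scores 1.0 (every value unique — ideal index candidate) while the boolean is_active scores 0.002, which is why the optimizer would likely prefer a full scan over an is_active index.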
EXPLAIN Output Reading
The EXPLAIN statement shows the query execution plan: which indexes MySQL chose, the join order, estimated rows, and access type. Mastering this output is essential for diagnosing slow queries.
- Key columns: type (ALL, ref, range, const), key, rows, Extra
- type=ALL means full table scan — usually bad
- Look for "Using filesort" and "Using temporary" as red flags
- EXPLAIN ANALYZE (MySQL 8.0.18+) shows actual execution times
- FORMAT=JSON gives cost estimates and optimizer decisions
DB Transactions
ACID Guarantees
Atomicity, Consistency, Isolation, Durability form the reliability contract of relational databases. Interview discussions often test whether you understand the trade-offs each property implies.
- Atomicity: undo log ensures all-or-nothing via rollback
- Consistency: application + DB constraints enforce valid state transitions
- Isolation: MVCC + locking prevent concurrent transaction interference
- Durability: redo log (WAL) ensures committed data survives crash
MVCC in InnoDB
Multi-Version Concurrency Control lets readers see a consistent snapshot without blocking writers. InnoDB implements this using hidden row versions, undo log chains, and read views.
- Each row has hidden DB_TRX_ID and DB_ROLL_PTR columns
- Readers construct a read view at the first read of the transaction (REPEATABLE READ) or at each statement (READ COMMITTED)
- Undo log chains let old versions be reconstructed on the fly
- Purge thread cleans up versions no longer visible to any transaction
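The visibility rule a read view applies to each row version can be modeled in a few lines. Field names here are illustrative, not InnoDB's actual struct members:

```python
def visible(row_trx_id, view):
    """Simplified InnoDB-style visibility test for one row version.

    view fields (illustrative names):
      creator   - trx id of the reading transaction
      active    - trx ids that were active when the view was built
      up_limit  - smallest active trx id at view creation
      low_limit - next trx id to be assigned at view creation
    """
    if row_trx_id == view["creator"]:
        return True                              # our own change
    if row_trx_id < view["up_limit"]:
        return True                              # committed before the view
    if row_trx_id >= view["low_limit"]:
        return False                             # started after the view
    return row_trx_id not in view["active"]      # committed in the window

view = {"creator": 105, "active": {100, 103}, "up_limit": 100, "low_limit": 106}
```

If a version is not visible, InnoDB follows the DB_ROLL_PTR into the undo chain and re-tests each older version until one passes.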
Isolation Levels — Read Phenomena
The four standard isolation levels (READ UNCOMMITTED, READ COMMITTED, REPEATABLE READ, SERIALIZABLE) trade off consistency against concurrency. Understand which read phenomena each level permits.
- Dirty read: seeing uncommitted data (only RU)
- Non-repeatable read: row changes between two reads in same txn (RU, RC)
- Phantom read: new rows appear in a repeated range query (RU, RC)
- InnoDB default is REPEATABLE READ, which avoids phantoms via MVCC snapshots for plain reads and next-key locks for locking reads (SELECT ... FOR UPDATE)
- SERIALIZABLE adds shared locks on all reads — safe but slow
Gap Locks vs Record Locks vs Next-Key Locks
InnoDB uses three lock types to prevent phantoms and ensure isolation. Understanding the difference is critical for debugging deadlocks and designing concurrent writes.
- Record lock: locks a single index record
- Gap lock: locks the gap between two index records (prevents inserts)
- Next-key lock = record lock + gap lock on the gap before the record
- Gap locks are only used at REPEATABLE READ and above
- Can cause unexpected deadlocks when concurrent transactions lock overlapping gaps
Deadlock in DB Layer
A deadlock occurs when two transactions each hold a lock the other needs. InnoDB automatically detects deadlocks and rolls back one transaction. Knowing how to prevent and debug them is essential.
- InnoDB runs a wait-for graph detector; the victim is the transaction that modified the fewest rows (smallest undo log)
- SHOW ENGINE INNODB STATUS shows the last detected deadlock
- Prevention: always lock rows in the same order across transactions
- Keep transactions short and avoid holding locks during external calls
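The lock-ordering rule can be sketched with in-process locks standing in for row locks (a toy model, not real InnoDB locking):

```python
import threading

# Toy model: two "rows" with one lock each standing in for InnoDB row locks.
accounts = {1: 100, 2: 100}
locks = {1: threading.Lock(), 2: threading.Lock()}

def transfer(src, dst, amount):
    # Always lock the lower account id first: a 1→2 transfer and a 2→1
    # transfer then acquire locks in the same order and cannot deadlock.
    first, second = sorted((src, dst))
    with locks[first], locks[second]:
        accounts[src] -= amount
        accounts[dst] += amount

t1 = threading.Thread(target=transfer, args=(1, 2, 30))
t2 = threading.Thread(target=transfer, args=(2, 1, 10))
t1.start(); t2.start()
t1.join(); t2.join()
```

Without the `sorted()`, the two threads could each grab their source-account lock and wait forever for the other — the classic cycle an InnoDB wait-for graph would detect and break by rolling one back.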
Query Optimization
Query Execution Plan
MySQL's optimizer transforms SQL into a physical execution plan: choosing join order, index access paths, and temp-table strategies. Understanding this pipeline helps you write queries the optimizer can handle well.
- Parser → Resolver → Optimizer → Executor pipeline
- Cost-based optimizer uses table statistics and index cardinality
- ANALYZE TABLE refreshes statistics for better plans
- Subquery vs JOIN: optimizer may convert correlated subqueries to joins
Index Hints
USE INDEX, FORCE INDEX, and IGNORE INDEX let you override the optimizer's index choice. Use sparingly — the optimizer is usually right, but sometimes stale statistics lead it astray.
- USE INDEX(idx): suggest an index, optimizer may still ignore
- FORCE INDEX(idx): only consider this index or full scan
- IGNORE INDEX(idx): exclude a specific index from consideration
- Hints are a band-aid; prefer fixing statistics or restructuring queries
Slow Query Log
MySQL can log queries that exceed a time threshold. This is the primary tool for identifying performance bottlenecks in production without code changes.
- Enable with slow_query_log=ON and long_query_time (seconds)
- log_queries_not_using_indexes captures unindexed queries
- mysqldumpslow or pt-query-digest aggregate and rank slow queries
- Performance Schema provides more granular profiling data
Partitioning Strategies
Table partitioning splits a large table into smaller physical pieces while keeping it logically unified. It can dramatically speed up range scans and bulk deletes on very large tables.
- RANGE partitioning: ideal for time-series data (partition by month/year)
- LIST partitioning: discrete categories (status codes, regions)
- HASH partitioning: even distribution when no natural range exists
- Partition pruning: optimizer skips irrelevant partitions when WHERE matches the partition key
- Unique indexes must include the partition key
Sharding vs Partitioning
Partitioning is a single-server optimization; sharding distributes data across multiple database servers. Knowing when to use each (and the operational cost of sharding) is a common system design question.
- Partitioning: same server, transparent to application, limited by single-node resources
- Sharding: multiple servers, application-aware routing, near-linear horizontal scale
- Shard key selection is critical: poor keys cause hot spots
- Cross-shard joins and transactions are expensive — design to avoid them
- Resharding is painful; plan the key carefully upfront
MongoDB
MongoDB Core
Document Model — Embedding vs Referencing
MongoDB stores data as BSON documents. The key design decision is whether to embed related data inside one document or use references across collections — each has performance and consistency trade-offs.
- Embed when data is read together and the subdocument won't grow unboundedly
- Reference when the related data is large, changes independently, or is shared
- Embedded reads are atomic and fast (single document fetch)
- References require application-level joins ($lookup) which are slower
- 16 MB document size limit constrains embedding
Index Types (Compound, Text, Geo, TTL)
MongoDB supports several specialized index types beyond the default single-field B-tree. Choosing the right one directly impacts query performance and enables features like full-text search and auto-expiry.
- Compound index: multi-field, follows leftmost prefix rule like MySQL
- Text index: full-text search with stemming and scoring
- 2dsphere / 2d index: geospatial queries ($near, $geoWithin)
- TTL index: automatically deletes documents after a specified time
- Wildcard index: indexes all fields in a document (useful for dynamic schemas)
Aggregation Pipeline Stages
The aggregation pipeline processes documents through a sequence of stages ($match, $group, $project, etc.). It is MongoDB's equivalent of SQL GROUP BY + subqueries and is heavily used in interviews.
- $match early to reduce pipeline volume (uses indexes)
- $group: accumulate values (sum, avg, push to array)
- $lookup: left outer join to another collection
- $unwind: flatten arrays for per-element processing
- $facet: run multiple pipelines in parallel on the same input
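A toy interpreter for $match and $group (equality match and $sum only) makes the stage-by-stage data flow concrete. This mimics the pipeline semantics in plain Python; it is not how MongoDB executes it:

```python
def aggregate(docs, pipeline):
    """Toy interpreter: equality $match and $group with $sum only."""
    for stage in pipeline:
        if "$match" in stage:
            cond = stage["$match"]
            docs = [d for d in docs
                    if all(d.get(k) == v for k, v in cond.items())]
        elif "$group" in stage:
            spec = stage["$group"]
            groups = {}
            for d in docs:
                key = d[spec["_id"].lstrip("$")]
                acc = groups.setdefault(key, {"_id": key})
                for field, expr in spec.items():
                    if field != "_id":
                        src = expr["$sum"].lstrip("$")
                        acc[field] = acc.get(field, 0) + d[src]
            docs = list(groups.values())
    return docs

orders = [
    {"status": "paid", "region": "eu", "amount": 10},
    {"status": "paid", "region": "us", "amount": 25},
    {"status": "open", "region": "eu", "amount": 99},
    {"status": "paid", "region": "eu", "amount": 5},
]
pipeline = [
    {"$match": {"status": "paid"}},    # filter first to shrink the stream
    {"$group": {"_id": "$region", "total": {"$sum": "$amount"}}},
]
```

Note how placing $match first means $group only sees three documents instead of four — the same reason MongoDB can use an index only when $match is at the front of the pipeline.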
Write Concern & Read Preference
Write concern controls how many replica nodes must acknowledge a write before it's confirmed. Read preference controls which node serves reads. Together they define the consistency-availability trade-off.
- w:1: acknowledged by the primary only (the server default before MongoDB 5.0; "majority" is the default since 5.0)
- w:"majority": acknowledged by a majority of voting members — durable across failover
- Read preference: primary (strong consistency), secondary (eventual, reduced primary load)
- primaryPreferred: reads from primary, falls back to secondary if unavailable
Replica Sets
A replica set is a group of MongoDB nodes that maintain the same data. One primary handles writes; secondaries replicate via the oplog. This is the foundation of MongoDB high availability.
- Minimum 3 members (or 2 + arbiter) for automatic failover
- Oplog: capped collection on primary that secondaries tail
- Election uses Raft-like consensus; priority and votes are configurable
- Secondaries can serve reads (read preference), reducing primary load
Transactions in MongoDB 4+
Starting with version 4.0, MongoDB supports multi-document ACID transactions across replica sets (and sharded clusters from 4.2). They work like SQL transactions but have performance caveats.
- Use sessions and startTransaction / commitTransaction APIs
- Snapshot isolation: reads see a consistent point-in-time view
- 60-second default time limit; long transactions risk oplog pressure
- Design for single-document atomicity first; use multi-doc transactions only when necessary
Schema Design
Design for Access Patterns
In MongoDB, schema design is driven by how you query the data, not by normalization rules. Map out your read/write patterns first, then structure documents to serve the most common queries efficiently.
- Group data that is queried together into the same document
- Separate data that is updated independently or has different access frequencies
- Denormalize read-heavy paths; accept some write duplication
- Use the bucket pattern for time-series or IoT data
Avoid Unbounded Arrays
Appending endlessly to an array field will eventually hit the 16 MB document limit and degrade performance due to document growth and in-place update failures. This is a classic MongoDB anti-pattern.
- Arrays that grow without bound cause document migrations (larger BSON)
- Large arrays slow down reads even if you only need a subset
- Solution: cap arrays or move overflow to a separate collection (bucket pattern)
- Use $slice or $push with $each and $position to limit array size
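The capped-array update looks like this; the field name and the cap of 100 are made up for illustration, and in pymongo the dict would be passed straight to update_one:

```python
# Hypothetical "recent_events" field capped at 100 entries: push new
# items and trim to the newest 100 in the same single-document update
# (single-document updates are atomic in MongoDB).
update = {
    "$push": {
        "recent_events": {
            "$each": [{"ts": 1_700_000_000, "type": "click"}],
            "$slice": -100,   # negative slice keeps the last N elements
        }
    }
}
# e.g. collection.update_one({"_id": user_id}, update)
```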
When to Normalize vs Denormalize
Normalization (references) avoids duplication but requires joins. Denormalization (embedding) duplicates data but gives fast, atomic reads. The right balance depends on your read/write ratio and consistency needs.
- Denormalize for read-heavy workloads where consistency lag is acceptable
- Normalize for data that changes frequently and is shared across documents
- Hybrid: embed summary fields, reference the full record
- Duplicated data requires an update strategy (change streams, batch sync)
ClickHouse & OLAP
OLAP Concepts
Column-Store Advantages
Column-oriented storage reads only the columns a query needs, enabling massive compression ratios and vectorized processing. This is why OLAP databases like ClickHouse can scan billions of rows per second.
- Only touched columns are loaded from disk — dramatic I/O reduction
- Same-type values compress 5-10x better than row-oriented storage
- Vectorized execution: operates on column batches, CPU-cache friendly
- Poor for point lookups or row-level updates (OLTP workloads)
MergeTree Engine
MergeTree is ClickHouse's primary table engine. Data is written in sorted "parts" that are merged in the background. The primary key defines sort order and enables efficient range scans.
- Data is stored in sorted parts; background merges reduce part count
- Primary key is a sparse index (not unique like RDBMS) — stores one entry per granule (8192 rows default)
- ORDER BY clause defines sort order; primary key defaults to it
- Supports data skipping indexes (minmax, set, bloom_filter)
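The sparse index can be modeled as one mark per granule plus a binary search — a conceptual sketch, not ClickHouse's implementation:

```python
from bisect import bisect_right

GRANULE = 8192                      # rows per granule (ClickHouse default)

rows = list(range(100_000))         # one part, sorted by primary key
marks = rows[::GRANULE]             # sparse index: first key of each granule

def granules_to_read(key_lo, key_hi):
    """Granule range [first, last) that may contain keys in [key_lo, key_hi]."""
    first = max(bisect_right(marks, key_lo) - 1, 0)
    last = bisect_right(marks, key_hi)
    return first, last
```

With 100k rows the index holds only 13 marks, and a narrow range query reads a single granule — which is why the primary key need not be unique: it only needs to bound which granules to scan.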
Materialized Views
ClickHouse materialized views act as insert triggers: they transform data on the fly as it arrives and store results in a target table. This enables real-time pre-aggregation without ETL pipelines.
- Triggered on INSERT to the source table; processes only new data
- Target table typically uses AggregatingMergeTree or SummingMergeTree
- Great for dashboards that need sub-second aggregations on live data
- Multiple materialized views can consume the same source table
ReplicatedMergeTree
ReplicatedMergeTree adds replication to MergeTree using ZooKeeper (or ClickHouse Keeper) for coordination. Each replica independently fetches and merges data parts, providing high availability.
- ZooKeeper/Keeper stores metadata and coordinates part transfers
- Each replica is eventually consistent; INSERT waits for configurable quorum
- insert_quorum setting controls how many replicas must confirm a write
- Replicas auto-recover missing parts from peers
ClickHouse vs Druid vs BigQuery
All three are OLAP systems, but they differ in architecture, operational model, and cost. Being able to articulate when you'd pick one over another shows depth in data platform thinking.
- ClickHouse: self-hosted, SQL-native, fastest for single-table scans, needs ops investment
- Druid: real-time ingestion focus, segment-based, good for mixed real-time + historical
- BigQuery: fully managed, serverless, pay-per-query, best for irregular/batch workloads
- ClickHouse shines when you need low-latency on high-QPS analytic queries
OLAP Query Patterns
Aggregation at Scale
ClickHouse excels at aggregating billions of rows in seconds. Understanding how it parallelizes GROUP BY across cores and merges partial states is key to writing efficient analytic queries.
- Parallel GROUP BY: each thread aggregates its data part, then results are merged
- Approximate functions (uniq, quantileTDigest) trade accuracy for speed
- Hash tables used for GROUP BY must fit in memory; large cardinality may spill to disk
- Pre-filter with WHERE before GROUP BY to reduce data volume
Pre-aggregation Strategies
For dashboards requiring sub-second latency, pre-aggregating data at insert time (via materialized views or summary tables) avoids scanning raw data on every query.
- SummingMergeTree: auto-sums numeric columns during merges
- AggregatingMergeTree: stores intermediate aggregate states (count, avg, quantile)
- Materialized views populate summary tables on INSERT
- Trade-off: more storage and complexity for faster reads
TTL for Data Lifecycle
ClickHouse supports TTL rules at column and table level to automatically delete or move old data. This is essential for managing storage costs on high-volume event data.
- Table TTL: DELETE or MOVE TO VOLUME/DISK after a time expression
- Column TTL: set individual columns to default value after expiry
- Enables tiered storage: hot SSD → cold HDD → delete
- TTL is evaluated during part merges, not continuously
Redis Deep Dive
Redis Data Structures
String / Hash / List / Set / ZSet Internals
Redis's five core data structures each use optimized internal encodings. Knowing the underlying representations helps you pick the right structure and predict memory usage.
- String: SDS (Simple Dynamic String); int encoding for values that fit a 64-bit signed integer
- Hash: listpack (ziplist before Redis 7.0) for small hashes, hashtable for larger ones
- List: quicklist (a linked list of listpacks, formerly ziplists) since Redis 3.2
- Set: intset for small all-integer sets, listpack for other small sets (Redis 7.2+), hashtable otherwise
- ZSet (Sorted Set): skiplist + hashtable for O(log n) rank operations; listpack for small sets
HyperLogLog
HyperLogLog is a probabilistic data structure that estimates the cardinality of a set using ~12 KB of memory, regardless of set size. Perfect for counting unique visitors, IPs, or events at scale.
- PFADD adds elements; PFCOUNT returns approximate unique count
- Standard error: ~0.81% — accurate enough for analytics
- PFMERGE combines multiple HLLs (e.g., merge daily counts into weekly)
- At most 12 KB per HLL (dense encoding) whether you add thousands or billions of elements; small HLLs use a more compact sparse encoding
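A miniature HLL (1024 registers instead of Redis's 2^14) shows the mechanism: hash each element, use the low bits to pick a register, track the maximum leading-zero run, then combine registers harmonically:

```python
import hashlib
import math

M = 1 << 10          # 1024 registers here; Redis uses 2^14 (hence ~12 KB)

def _hash64(item):
    return int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")

class TinyHLL:
    def __init__(self):
        self.reg = [0] * M

    def add(self, item):
        h = _hash64(item)
        j = h & (M - 1)                     # low 10 bits pick a register
        w = h >> 10                         # remaining 54 bits
        self.reg[j] = max(self.reg[j], 55 - w.bit_length())  # zeros run + 1

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / M)
        est = alpha * M * M / sum(2.0 ** -r for r in self.reg)
        if est <= 2.5 * M:                  # small-range correction
            zeros = self.reg.count(0)
            if zeros:
                est = M * math.log(M / zeros)
        return int(est)

# PFMERGE is just the register-wise max of two instances.
```

With 1024 registers the standard error is about 3.25%; Redis's 16384 registers bring that down to the quoted 0.81%.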
Bloom Filter (RedisBloom)
A Bloom filter answers "is this element in the set?" with no false negatives but possible false positives. RedisBloom module adds native BF.ADD / BF.EXISTS commands to Redis.
- Space-efficient: uses bit array + multiple hash functions
- False positive rate configurable at creation (e.g., 1%)
- Use case: prevent duplicate processing, cache penetration protection
- Cannot delete elements (use Cuckoo filter if deletions needed)
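A minimal Bloom filter sketch — the sizes and the salted-hash scheme are arbitrary illustrative choices, not RedisBloom's internals:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: a bit array plus k derived hash positions."""

    def __init__(self, size_bits=8192, hashes=4):
        self.m, self.k = size_bits, hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive k positions by salting one hash function (a common trick).
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:4], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item):
        # False → definitely absent; True → present or a false positive.
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))
```

The asymmetry is the whole point: `might_contain` returning False is a guarantee (safe to skip the DB lookup), while True only means "probably".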
Streams
Redis Streams is an append-only log data structure with consumer group support, similar to Kafka. With acknowledgements and the pending entries list it enables event-driven architectures with at-least-once processing semantics within Redis.
- XADD appends entries; XREAD reads from a position; XRANGE scans a range
- Consumer groups: multiple consumers share work, with per-consumer acknowledgement
- XACK confirms processing; pending entries list (PEL) tracks unacked messages
- Suitable for lightweight event streaming without deploying Kafka
Caching Patterns
Cache-Aside vs Write-Through vs Write-Behind
The three fundamental caching strategies differ in how they synchronize cache and database. Choosing the right one affects consistency, latency, and complexity.
- Cache-aside (lazy loading): app checks cache, loads from DB on miss, populates cache
- Write-through: every write goes to cache AND DB synchronously — strong consistency but higher write latency
- Write-behind (write-back): writes go to cache first, async flush to DB — fast writes but risk data loss
- Cache-aside is most common in practice; write-through for strict consistency needs
TTL Expiry Strategies
Redis uses a combination of lazy expiry (check on access) and active expiry (periodic sampling) to remove expired keys. Understanding this prevents surprises with memory usage and stale data.
- Lazy expiry: key is checked on GET; if expired, it is deleted and returns nil
- Active expiry: Redis samples 20 random keys with TTL 10 times/sec, deleting expired ones
- If >25% of sampled keys are expired, it repeats immediately (adaptive)
- Very short TTLs on many keys can cause CPU spikes during active expiry
Cache Stampede — Mutex Lock Solution
When a popular key expires, many concurrent requests may simultaneously attempt to rebuild the cache, overwhelming the database. The mutex lock pattern ensures only one thread rebuilds while others wait.
- SET key value NX EX ttl (an atomic set-if-absent with expiry) acquires a distributed lock on the rebuild key
- The lock winner fetches from DB and populates cache; losers retry with backoff
- Alternative: probabilistic early expiration (stagger TTLs)
- Another approach: never expire, use background refresh with logical TTL
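The mutex pattern in miniature — a dict stands in for Redis here, so unlike the real SET ... NX EX this check-then-set is not atomic across threads; it only illustrates the "one winner rebuilds" behavior:

```python
import time

lock_store = {}   # stands in for Redis; value is the lock's expiry time

def acquire(key, ttl):
    """Models SET key token NX EX ttl: first caller wins until expiry.
    NOT thread-safe, unlike the real Redis command it imitates."""
    now = time.time()
    if lock_store.get(key, 0) > now:
        return False                  # NX failed: someone holds the lock
    lock_store[key] = now + ttl
    return True

# Five requests hit the expired hot key "at once": one rebuilds, four back off.
winners = [acquire("lock:hot_key", ttl=5) for _ in range(5)]
```

The single winner fetches from the database and repopulates the cache; the losers sleep briefly and re-read the cache, so the database sees one rebuild instead of five.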
Distributed Lock (Redlock)
Redlock is an algorithm for distributed locking across multiple independent Redis instances. It is used when correctness matters more than what a single-instance SETNX lock can guarantee.
- Acquire lock on N/2+1 of N independent Redis masters with clock-bounded TTL
- If majority acquired within validity time, lock is held
- Controversial: Martin Kleppmann argued it is unsafe under clock skew and GC pauses
- For most practical cases, single-instance SETNX with fencing tokens suffices
Redis Clustering
Sentinel vs Cluster Mode
Sentinel provides automatic failover for a single master-replica setup. Cluster mode shards data across multiple masters. The choice depends on whether you need HA only or HA + horizontal scaling.
- Sentinel: monitors one master + replicas, promotes replica on failure, no sharding
- Cluster: data partitioned across 16384 hash slots on multiple masters
- Sentinel is simpler and sufficient when data fits on one node
- Cluster is required when dataset exceeds single-node memory
Hash Slots
Redis Cluster divides the key space into 16384 hash slots. Each master owns a subset of slots, and clients route commands based on CRC16(key) mod 16384.
- CRC16 hash of the key determines the slot; slot maps to a specific master
- Hash tags: {user:123}.profile forces keys to the same slot for multi-key commands
- Resharding moves slots between nodes with minimal downtime (MIGRATE)
- MOVED and ASK redirections tell clients to retry on the correct node
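Slot routing can be reproduced exactly: CRC-16/XMODEM over the key (or its hash tag), mod 16384:

```python
def crc16(data: bytes) -> int:
    """CRC-16/XMODEM — the checksum Redis Cluster applies to keys."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) if crc & 0x8000 else (crc << 1)
            crc &= 0xFFFF
    return crc

def key_slot(key: str) -> int:
    # Hash-tag rule: if the key contains a non-empty {...} section,
    # only that substring is hashed, pinning related keys to one slot.
    start = key.find("{")
    if start != -1:
        end = key.find("}", start + 1)
        if end > start + 1:
            key = key[start + 1:end]
    return crc16(key.encode()) % 16384
```

This is why `{user:123}.profile` and `{user:123}.settings` can participate in the same MULTI or Lua script: both hash only "user:123" and land on the same slot.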
Hot Key Problem
When a single key receives disproportionate traffic, it overloads the node owning that slot. This is a common Redis scaling bottleneck in high-traffic systems like flash sales or viral content.
- All requests for one key route to one node — no cluster parallelism
- Solution 1: key splitting — shard the value across key:1, key:2, ..., key:N
- Solution 2: local caching with short TTL to absorb reads
- Solution 3: read replicas for the hot key's slot (cluster read replicas)
Memory Eviction Policies
When Redis hits maxmemory, it must evict keys to make room. The eviction policy determines which keys are removed. Choosing the wrong policy can silently drop important data.
- noeviction: return errors on writes when full (safe but disruptive)
- allkeys-lru: evict least recently used across all keys (most common)
- volatile-lru: evict LRU only among keys with TTL set
- allkeys-lfu: evict least frequently used (Redis 4.0+, better for skewed access)
- volatile-ttl: evict keys closest to expiration
Database Selection & CAP Theory
CAP Theory
CAP Theorem Applied to Real Systems
CAP states a distributed system can only guarantee two of Consistency, Availability, and Partition tolerance simultaneously. In practice, P is non-negotiable, so the real choice is between C and A during partitions.
- CP systems (e.g., ZooKeeper, HBase): reject requests during partition to stay consistent
- AP systems (e.g., Cassandra, DynamoDB): serve requests during partition, may return stale data
- MySQL single-node is CA — but CA is meaningless in distributed contexts
- Most real systems are tunable: MongoDB (w:majority = CP, w:1 = AP-leaning)
Eventual Consistency Patterns
In eventually consistent systems, replicas converge over time. Understanding the practical patterns (read repair, anti-entropy, hinted handoff) helps you design systems that behave well under eventual consistency.
- Read repair: fix stale replica when a read detects divergence
- Anti-entropy: background Merkle-tree comparison to sync replicas
- Hinted handoff: temporarily store writes for an unavailable node, replay later
- Last-write-wins (LWW) is simple but can lose concurrent updates
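Last-write-wins fits in two lines, and the sketch below (timestamps are made up) shows exactly how a concurrent update gets lost:

```python
def lww_merge(a, b):
    """Keep the (value, timestamp) pair with the later timestamp."""
    return a if a[1] >= b[1] else b

# Two replicas accept concurrent writes to the same register:
replica_1 = ("draft edited by Alice", 1_700_000_010)
replica_2 = ("draft edited by Bob",   1_700_000_009)

merged = lww_merge(replica_1, replica_2)   # Bob's edit is silently discarded
```

Both replicas converge (the merge is deterministic), but Bob's write vanishes without any error — the trade-off LWW makes for its simplicity.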
Read-Your-Writes
Read-your-writes consistency guarantees that after a client writes, subsequent reads by the same client reflect that write. This is critical for user experience even in eventually consistent systems.
- Achieved by routing reads to the same node that handled the write
- Alternative: use a version token / timestamp; reject reads behind that token
- MongoDB: use readConcern "majority" + readPreference "primary" for the same session
- Without this, a user may save data and then see the old version on page refresh
Monotonic Reads
Monotonic read consistency guarantees a client never sees an older version of data after having seen a newer one. Without it, successive reads from different replicas can appear to "go back in time."
- Problem: load balancer routes read 1 to an up-to-date replica, read 2 to a stale one
- Solution: sticky sessions — pin a client to one replica for reads
- Alternative: track read timestamp; only read from replicas at or past that point
- MongoDB causal sessions provide monotonic reads automatically
CRDT Basics
Conflict-free Replicated Data Types are data structures that can be updated independently on different replicas and merged without conflicts. They are the theoretical foundation for many AP system designs.
- State-based (CvRDT): replicas periodically send full state, merge via join
- Operation-based (CmRDT): replicas broadcast operations, must be commutative
- Examples: G-Counter, PN-Counter, LWW-Register, OR-Set
- Used in Redis Enterprise (CRDB), Riak, and collaborative editing
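A state-based G-Counter sketch: because the merge is a per-slot max — commutative, associative, and idempotent — replicas converge to the same value regardless of merge order:

```python
class GCounter:
    """Grow-only counter (state-based CRDT): one slot per replica.
    Merge takes the per-slot max, so merges commute and converge."""

    def __init__(self, replica_id, n_replicas):
        self.id = replica_id
        self.counts = [0] * n_replicas

    def increment(self, by=1):
        self.counts[self.id] += by          # a replica only touches its slot

    def value(self):
        return sum(self.counts)

    def merge(self, other):
        self.counts = [max(a, b) for a, b in zip(self.counts, other.counts)]
```

A PN-Counter is just two G-Counters (one for increments, one for decrements) with value = P − N.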
DB Tradeoffs
RDBMS vs Document vs Column vs Graph
Each database paradigm optimizes for different access patterns. Knowing when to use which is a staple system design interview question.
- RDBMS (MySQL, PostgreSQL): structured data, complex joins, ACID — default choice for transactional workloads
- Document (MongoDB): flexible schema, nested data, horizontal scale — good for content, catalogs
- Column (ClickHouse, Cassandra): analytic aggregations or wide-row time-series
- Graph (Neo4j): relationship-heavy queries (social networks, fraud detection, recommendations)
- Pick based on primary access pattern, not hype
NewSQL Tradeoffs
NewSQL systems (CockroachDB, TiDB, Spanner) promise SQL semantics with horizontal scalability. They solve the sharding pain of MySQL but introduce new operational complexity and latency characteristics.
- Distributed SQL: automatic sharding, distributed transactions, strong consistency
- Spanner uses TrueTime (atomic clocks) for global serializable isolation
- CockroachDB and TiDB use Raft consensus for replication
- Higher write latency than single-node MySQL due to consensus overhead
- Good fit when you outgrow single-node RDBMS but need ACID guarantees
Polyglot Persistence Rationale
Polyglot persistence means using different databases for different parts of the same application, each chosen for its strengths. It is the practical application of "no single database does everything well."
- Example: MySQL for orders (ACID), Redis for sessions (speed), ClickHouse for analytics (OLAP), MongoDB for product catalog (flexible schema)
- Benefits: each store is used where it excels, better performance and scalability
- Costs: operational overhead, data synchronization complexity, more failure modes
- Need a clear data flow strategy (CDC, event sourcing, or dual writes)
Recommended Resources
- MySQL 8.0 Reference Manual — Official docs covering InnoDB internals, optimization, and administration
- MongoDB Manual — Complete reference for data modeling, aggregation, replication, and transactions
- ClickHouse Documentation — Engine types, query optimization, and operational guides
- Redis Documentation — Data types, commands, clustering, persistence, and patterns
- Baeldung: Database Concepts — Tutorials on indexing, transactions, caching, and distributed DB theory
- MySQL Optimization Guide — EXPLAIN, query tuning, index best practices, and server configuration