
Pillar 4 — Distributed Systems

The theory behind what you build every day

Core Theory (Full-Focus)

Tip: You don't need to implement Raft, but explaining it confidently in HLD is a differentiator at top-tier companies.

Consensus & Replication

Leader Election

The process by which distributed nodes agree on a single coordinator to avoid split-brain and order operations consistently.

  • Prevents conflicting writes by funnelling mutations through one node
  • Common in ZooKeeper (ZAB), etcd (Raft), and Kafka controller
  • Election triggers: leader crash, network partition, or manual failover
  • Trade-off between election speed and false-positive detection

Resources: System Design Primer · Kleppmann on Distributed Locking

Raft Consensus (Conceptual)

An understandable consensus algorithm that decomposes the problem into leader election, log replication, and safety.

  • Three roles: Leader, Follower, Candidate
  • Leader appends entries to its log and replicates to followers
  • Committed once a majority (quorum) acknowledges
  • Term numbers prevent stale leaders from accepting writes

Resources: System Design Primer · Martin Kleppmann
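
The term rule above is the heart of Raft's safety argument. A minimal sketch in Python (illustrative only, not the real Raft implementation — the class and method names are made up):

```python
# Sketch: how term numbers let a follower reject entries from a stale leader.

class Follower:
    def __init__(self):
        self.current_term = 0
        self.log = []

    def append_entries(self, leader_term, entries):
        # Raft rule: reply false if the sender's term is older than ours --
        # it is a deposed leader that has not yet learned about the new term.
        if leader_term < self.current_term:
            return False
        self.current_term = leader_term  # adopt the newer term
        self.log.extend(entries)
        return True

f = Follower()
assert f.append_entries(2, ["set x=1"]) is True   # current leader, term 2
assert f.append_entries(1, ["set x=9"]) is False  # stale leader, term 1
assert f.log == ["set x=1"]
```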

Quorum Reads/Writes

A strategy requiring W + R > N to guarantee overlap between write and read replica sets, ensuring consistency without full replication.

  • N = total replicas, W = write quorum, R = read quorum
  • W + R > N ensures at least one node has the latest write
  • Tunable: favour availability (low W) or consistency (high W)
  • Used in Cassandra, DynamoDB, and Riak

Resources: System Design Primer · Martin Kleppmann
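
The overlap guarantee is easy to demonstrate with a toy read path (a sketch, assuming each replica stores a (version, value) pair — real systems use vector clocks or timestamps):

```python
# With W + R > N, every read quorum overlaps every write quorum, so the
# highest version among any R replicas is the latest committed write.

N, W, R = 3, 2, 2
assert W + R > N  # the overlap condition

replicas = [(2, "new"), (2, "new"), (1, "old")]  # one replica is stale

def quorum_read(replicas, r):
    sampled = replicas[:r]   # in reality: any R reachable replicas
    return max(sampled)[1]   # value with the highest version wins

latest = quorum_read(replicas, R)
assert latest == "new"
```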

Gossip Protocol

A peer-to-peer communication protocol where nodes periodically exchange state information with random peers, eventually propagating data to all nodes.

  • Probabilistic: O(log N) rounds to reach all nodes
  • Resilient to node failures — no single point of failure
  • Used for failure detection, membership, and metadata propagation
  • Examples: Cassandra gossip, Consul serf, SWIM protocol

Resources: System Design Primer · Martin Kleppmann
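
The O(log N) claim can be checked with a tiny push-gossip simulation (illustrative only — real protocols like SWIM add pull, fanout, and failure suspicion):

```python
import random

random.seed(42)

# Each informed node tells one random peer per round; the informed set
# roughly doubles each round, so full dissemination takes O(log N) rounds.

N = 1000
informed = {0}
rounds = 0
while len(informed) < N:
    rounds += 1
    for node in list(informed):
        informed.add(random.randrange(N))

assert len(informed) == N
assert rounds < 30   # on the order of log2(1000) ~ 10, nowhere near O(N)
```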

Vector Clocks

A mechanism to capture causal ordering of events across distributed nodes, enabling conflict detection without a global clock.

  • Each node maintains a vector of logical counters, one per node
  • If V(a) ≤ V(b) in every position and strictly less in at least one, a causally precedes b
  • Concurrent events have incomparable vectors — require conflict resolution
  • Used in Amazon Dynamo's original design for versioning

Resources: System Design Primer · Martin Kleppmann
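
The comparison rule fits in two functions (a sketch with assumed names, not from any library):

```python
# Vector-clock comparison: a -> b iff a is <= b everywhere and < somewhere.

def happens_before(a, b):
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def concurrent(a, b):
    return not happens_before(a, b) and not happens_before(b, a)

v1 = [1, 0, 0]   # event on node 0
v2 = [2, 1, 0]   # later event that has seen v1
v3 = [0, 0, 1]   # independent event on node 2

assert happens_before(v1, v2)
assert concurrent(v1, v3)   # incomparable vectors need conflict resolution
```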

Lamport Timestamps

A simple logical clock that orders events consistently with causality; ties can be broken by node ID to yield a total order, but the clock alone cannot determine whether two events are causally related.

  • Each event increments the local counter; messages carry timestamps
  • Receiver sets its clock to max(local, received) + 1
  • If a → b then L(a) < L(b), but the converse is not necessarily true
  • Foundation for more sophisticated clocks like vector clocks

Resources: System Design Primer · Martin Kleppmann
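
The tick and receive rules above can be sketched directly (class name is illustrative):

```python
# Lamport clock: local events increment; receives apply max(local, ts) + 1.

class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self):          # local event or send
        self.time += 1
        return self.time

    def receive(self, ts):   # message arrived carrying timestamp ts
        self.time = max(self.time, ts) + 1
        return self.time

a, b = LamportClock(), LamportClock()
t_send = a.tick()            # a's clock becomes 1
t_recv = b.receive(t_send)   # b's clock jumps to max(0, 1) + 1 = 2
assert t_recv > t_send       # if send -> receive, then L(send) < L(receive)
```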

Failure Modes

Network Partitions

When a network fault divides a cluster into isolated subgroups that cannot communicate, forcing a choice between consistency and availability (CAP theorem).

  • CAP theorem: during a partition, choose CP (reject writes) or AP (allow divergence)
  • Real-world partitions are partial and transient, not binary
  • Detecting partitions: heartbeat timeouts, phi-accrual failure detector
  • Recovery involves conflict resolution (last-write-wins, CRDTs, manual merge)

Resources: System Design Primer — CAP · Martin Kleppmann

Byzantine Faults (Awareness)

Failures where a node behaves arbitrarily (sends conflicting messages, lies, or acts maliciously) rather than simply crashing.

  • Requires 3f+1 nodes to tolerate f Byzantine faults
  • Most internal systems assume crash-fail, not Byzantine
  • Relevant for blockchain, multi-tenant, or untrusted environments
  • Interview angle: know the distinction, not the full BFT algorithm

Resources: System Design Primer · Martin Kleppmann

Cascading Failures

When the failure of one component overloads its dependents, triggering a domino effect across the system.

  • Common trigger: retry storms after a single service goes down
  • Mitigations: circuit breakers, bulkheads, load shedding, backpressure
  • Timeouts without backoff amplify the problem
  • Design for graceful degradation, not just happy-path throughput

Resources: System Design Primer · Baeldung — Resilience4j

Split-Brain Scenarios

When a partition causes two sub-clusters to each elect their own leader, leading to divergent state and data corruption.

  • Classic in primary-replica setups without proper fencing
  • Prevention: quorum-based elections, fencing tokens, STONITH
  • ZooKeeper ephemeral nodes and leader epoch help detect stale leaders
  • Interview tip: tie this to leader election and network partition topics

Resources: System Design Primer · Kleppmann on Distributed Locking
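
A fencing token in miniature — this is the idea from Kleppmann's distributed-locking article, with made-up names (a real store would persist the token):

```python
# The storage layer rejects writes whose token is older than the newest it
# has seen, so a paused or deposed leader cannot clobber committed data.

class FencedStore:
    def __init__(self):
        self.highest_token = 0
        self.data = {}

    def write(self, token, key, value):
        if token < self.highest_token:
            return False             # stale leader: fenced off
        self.highest_token = token
        self.data[key] = value
        return True

store = FencedStore()
assert store.write(34, "x", "from new leader")
assert not store.write(33, "x", "from old leader")   # rejected
assert store.data["x"] == "from new leader"
```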

Kafka Depth (Full-Focus)

Tip: This is your strongest card — real production experience. Nail the guarantees section to make this a genuine differentiator.

Kafka Internals

Partition & Offset Model

Kafka's fundamental data model: topics are split into ordered, immutable partitions, and each record within a partition has a unique sequential offset.

  • Partitions enable parallelism — each consumed by one consumer in a group
  • Offsets are per-partition, monotonically increasing, and never reused
  • Ordering is guaranteed only within a partition, not across partitions
  • Partition count is a one-way door — increasing changes key routing
  • Key-based partitioning ensures related events land in the same partition

Resources: Kafka Docs — Topics · System Design Primer
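
Key-based routing in a nutshell (a simplification: Kafka's default partitioner actually uses murmur2 on the key bytes; MD5 here is just a stand-in to show the idea):

```python
import hashlib

# Same key -> same partition, so per-key ordering is preserved.

def partition_for(key: str, num_partitions: int) -> int:
    digest = hashlib.md5(key.encode()).digest()          # stand-in hash
    return int.from_bytes(digest[:4], "big") % num_partitions

p1 = partition_for("order-42", 12)
p2 = partition_for("order-42", 12)
assert p1 == p2 and 0 <= p1 < 12   # related events land together

# Changing the partition count changes the modulus, so keys may reroute --
# the "one-way door" mentioned above:
p3 = partition_for("order-42", 16)
```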

Log Segment & Compaction

Kafka stores data as append-only log segments on disk. Compaction retains only the latest value per key, enabling infinite retention for changelog topics.

  • Segments are time- or size-bounded files; old segments are deleted or compacted
  • Delete policy: remove segments older than retention.ms
  • Compact policy: keep latest record per key, discard older duplicates
  • Compacted topics power KTable state stores and CDC use cases

Resources: Kafka Docs — Compaction · Baeldung — Log Compaction
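
Compaction's keep-latest-per-key behaviour, sketched on a toy log (offsets are implied by list position; not Kafka's code):

```python
# Keep only the newest record per key; a None value is a tombstone that
# deletes the key once compaction has propagated it.

log = [("user-1", "a@x"), ("user-2", "b@y"),
       ("user-1", "a@z"), ("user-2", None)]

def compact(records):
    latest = {}
    for offset, (key, value) in enumerate(records):
        latest[key] = (offset, value)
    kept = {k: ov for k, ov in latest.items() if ov[1] is not None}
    # survivors stay in offset order
    return [(k, v) for k, (o, v) in sorted(kept.items(), key=lambda kv: kv[1][0])]

compacted = compact(log)
assert compacted == [("user-1", "a@z")]   # user-2 removed by its tombstone
```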

ISR — In-Sync Replicas

The set of replicas that are fully caught up with the leader. Only ISR members are eligible to become leader on failover.

  • A replica falls out of ISR if it lags beyond replica.lag.time.max.ms
  • min.insync.replicas defines minimum ISR size for acks=all writes
  • If ISR shrinks below min.insync.replicas, producers receive NotEnoughReplicasException
  • Monitoring ISR shrink events is critical for operational health

Resources: Kafka Docs — Replication · Baeldung — ISR

Leader Epoch

A monotonically increasing counter that identifies a leader's term, preventing stale replicas from truncating valid committed data during recovery.

  • Replaces the older high-watermark-based truncation that could lose data
  • On becoming leader, the new leader increments the epoch
  • Followers use the leader epoch to determine the correct truncation point
  • Critical for exactly-once and transactional producer correctness

Resources: Kafka Documentation · System Design Primer

Consumer Group Rebalancing

When consumers join, leave, or crash, Kafka redistributes partition ownership across the group to maintain even load.

  • Eager rebalance: revoke all partitions, then reassign (causes pause)
  • Cooperative (incremental) rebalance: only migrate affected partitions
  • Triggers: consumer crash, new consumer, topic metadata change
  • session.timeout.ms and heartbeat.interval.ms control detection speed
  • Rebalance storms can kill throughput — tune max.poll.interval.ms

Resources: Kafka Docs — Consumer Configs · Baeldung — Rebalancing

Sticky Partition Assignor

An assignment strategy that minimises partition movement during rebalances, keeping existing assignments intact when possible.

  • Reduces rebalance overhead compared to range or round-robin assignors
  • Preserves consumer-to-partition affinity across rebalances
  • Works well with cooperative rebalance protocol
  • Configure via partition.assignment.strategy on consumer

Resources: Kafka Docs — Consumer Configs · Baeldung — Rebalancing

Delivery Guarantees

At-Most-Once / At-Least-Once / Exactly-Once

The three delivery semantics that define how many times a consumer processes a message, each with distinct trade-offs.

  • At-most-once: commit offset before processing — may lose messages
  • At-least-once: commit after processing — may duplicate on crash
  • Exactly-once: idempotent producer + transactional consumer — highest overhead
  • Most production systems use at-least-once with idempotent consumers
  • Interview angle: explain when exactly-once is worth the cost

Resources: Kafka Docs — Semantics · Baeldung — Exactly Once
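
The common production pattern — at-least-once delivery plus an idempotent consumer — in sketch form (the dedup set would be a unique key in a database in real life; names are illustrative):

```python
# Redelivery after a crash is harmless because processed event IDs are
# remembered, so the side effect runs exactly once.

processed_ids = set()
side_effects = []

def handle(event_id, payload):
    if event_id in processed_ids:
        return "skipped-duplicate"
    side_effects.append(payload)       # the real work
    processed_ids.add(event_id)        # record before committing the offset
    return "processed"

assert handle("evt-1", "charge $10") == "processed"
# crash before offset commit -> the broker redelivers the same event:
assert handle("evt-1", "charge $10") == "skipped-duplicate"
assert side_effects == ["charge $10"]  # the charge happened exactly once
```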

Idempotent Producer

A Kafka producer mode that assigns a producer ID and sequence number to each record, allowing the broker to de-duplicate retries.

  • Enable with enable.idempotence=true (default since Kafka 3.0)
  • Guarantees exactly-once per partition for a single producer session
  • Requires acks=all and max.in.flight.requests.per.connection ≤ 5
  • Does not span multiple partitions — use transactions for that

Resources: Kafka Docs — Producer Configs · Baeldung — Idempotent Producer

Transactional Producer (2PC)

Enables atomic writes across multiple partitions and topics, ensuring all-or-nothing semantics via a two-phase commit protocol internal to Kafka.

  • Set transactional.id on the producer to enable transactions
  • beginTransaction() → send() → sendOffsetsToTransaction() → commitTransaction()
  • Consumers with isolation.level=read_committed see only committed records
  • Adds latency — use only when atomicity across partitions is required

Resources: Kafka Docs — Transactions · Baeldung — Exactly Once

Consumer Offset Commit Strategies

How and when a consumer marks records as processed, directly impacting delivery guarantees and rebalance behaviour.

  • Auto-commit (enable.auto.commit=true): commits on a timer, decoupled from processing, so a crash can both lose and duplicate messages
  • Sync commit (commitSync): blocks until broker confirms, safer but slower
  • Async commit (commitAsync): non-blocking, risk of out-of-order commits
  • Best practice: async in loop, sync in shutdown/rebalance listener

Resources: Kafka Docs — Consumer Configs · Baeldung — Offset Commit

Kafka Tuning

batch.size & linger.ms

Producer batching controls that trade latency for throughput by accumulating records before sending.

  • batch.size (default 16KB): max bytes per batch per partition
  • linger.ms (default 0): how long to wait for more records before sending
  • Increasing both improves throughput but adds latency
  • For high-throughput: batch.size=64KB+, linger.ms=5-20

Resources: Kafka Docs — Producer Configs · Baeldung — Kafka Producer

acks=all vs acks=1

The producer acknowledgement setting that controls durability guarantees at the cost of latency.

  • acks=0: fire-and-forget, fastest, may lose data
  • acks=1: leader acknowledges, fast but risks data loss if leader dies before replication
  • acks=all (-1): all ISR replicas acknowledge, strongest durability
  • Pair acks=all with min.insync.replicas=2 for production safety

Resources: Kafka Docs — Producer Configs · Baeldung — Reliability

max.poll.records

Controls the maximum number of records returned in a single consumer poll(), affecting processing latency and rebalance stability.

  • Default is 500; lower it for slow processing to avoid rebalance timeouts
  • Must finish processing within max.poll.interval.ms or get kicked from group
  • Higher values improve throughput for fast, lightweight consumers
  • Tune alongside max.poll.interval.ms and session.timeout.ms

Resources: Kafka Docs — Consumer Configs · Baeldung — Consumer Config

Lag Monitoring

Tracking the difference between the latest produced offset and the consumer's committed offset to detect processing bottlenecks.

  • Consumer lag = log-end-offset minus committed-offset per partition
  • Tools: kafka-consumer-groups.sh, Burrow, Prometheus + JMX exporter
  • Increasing lag signals under-provisioned consumers or slow processing
  • Alert on sustained lag growth, not momentary spikes

Resources: Kafka Docs — Monitoring · System Design Primer
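
The lag arithmetic itself is trivial; the offsets below are made-up numbers for illustration:

```python
# Consumer lag per partition = log-end-offset minus committed offset.

log_end   = {0: 1200, 1: 980, 2: 1500}   # latest produced offset
committed = {0: 1200, 1: 950, 2: 700}    # consumer's committed offset

lag = {p: log_end[p] - committed[p] for p in log_end}
total_lag = sum(lag.values())
worst = max(lag, key=lag.get)

assert lag == {0: 0, 1: 30, 2: 800}
assert worst == 2     # partition 2 is the bottleneck worth alerting on
```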

Schema Registry Basics

A centralized service (typically Confluent Schema Registry) that stores and validates Avro/Protobuf/JSON schemas for Kafka topics, enforcing compatibility.

  • Schemas are versioned and identified by a numeric ID embedded in messages
  • Compatibility modes: BACKWARD, FORWARD, FULL, NONE
  • Prevents breaking changes from reaching consumers
  • BACKWARD (default): new schema can read old data — safest for consumers

Resources: Kafka Documentation · Baeldung — Schema Registry

SQS & Async Patterns (Light)

SQS

Standard vs FIFO

AWS SQS offers two queue types with fundamentally different ordering and deduplication guarantees.

  • Standard: nearly unlimited throughput, best-effort ordering, at-least-once delivery
  • FIFO: strict ordering within message groups, exactly-once processing, 300 TPS (3000 with batching)
  • FIFO queues require MessageGroupId for ordering scope
  • Choose Standard unless ordering or deduplication is a hard requirement

Resources: AWS SQS Docs · System Design Primer

Visibility Timeout

The period during which a message is invisible to other consumers after being received, giving the current consumer time to process and delete it.

  • Default: 30 seconds; max: 12 hours
  • If processing takes longer, extend with ChangeMessageVisibility
  • On timeout expiry without deletion, message becomes visible again (retry)
  • Set ≥ expected processing time + buffer to avoid duplicate processing

Resources: AWS SQS — Visibility Timeout · Baeldung — SQS

Dead-Letter Queue (DLQ)

A secondary queue that receives messages that fail processing after a configured number of retries, preventing poison messages from blocking the pipeline.

  • Configure maxReceiveCount on the source queue's redrive policy
  • DLQ messages retain original attributes for debugging
  • Monitor DLQ depth as a key operational alert
  • Redrive: replay DLQ messages back to source queue after fixing the bug

Resources: AWS SQS — DLQ · Baeldung — SQS

Long Polling

A receive technique where the SQS call waits (up to 20 seconds) for messages to arrive before returning, reducing empty responses and cost.

  • Set WaitTimeSeconds > 0 on ReceiveMessage (or queue-level default)
  • Reduces API calls and cost compared to short polling
  • Polls all of the queue's backing servers rather than a sample, reducing false-empty responses
  • 20 seconds is the maximum and recommended value

Resources: AWS SQS — Long Polling · System Design Primer

Message Deduplication

Mechanisms to ensure a message is processed only once, critical for financial and order-processing pipelines.

  • FIFO queues: 5-minute deduplication window using MessageDeduplicationId
  • Content-based deduplication: SQS generates ID from SHA-256 of body
  • Standard queues: no built-in dedup; implement at consumer side (idempotency key in DB)
  • Combine with idempotent consumers for end-to-end exactly-once

Resources: AWS SQS — Deduplication · Baeldung — SQS

Async Patterns

Outbox Pattern

Write domain events to an "outbox" table in the same database transaction as the business operation, then relay them to the message broker asynchronously.

  • Solves the dual-write problem: DB + message broker atomicity
  • A poller or CDC connector (Debezium) reads the outbox and publishes
  • Guarantees at-least-once event delivery without distributed transactions
  • Outbox table columns: id, aggregate_type, payload, created_at, published

Resources: Baeldung — Outbox Pattern · System Design Primer
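
The single-transaction trick is the whole pattern. A sketch with SQLite (table and column names are illustrative, not a standard schema):

```python
import sqlite3

# The business row and the outbox event commit in ONE local transaction,
# so they can never diverge -- no dual-write problem.

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, total REAL)")
db.execute("""CREATE TABLE outbox (id INTEGER PRIMARY KEY,
              aggregate_type TEXT, payload TEXT, published INTEGER DEFAULT 0)""")

with db:  # atomic: both inserts commit or neither does
    db.execute("INSERT INTO orders VALUES ('o-1', 99.5)")
    db.execute("INSERT INTO outbox (aggregate_type, payload) "
               "VALUES ('Order', 'OrderCreated:o-1')")

# A relay (poller or Debezium CDC) later reads unpublished rows and
# publishes them to the broker, then flips the published flag:
pending = db.execute("SELECT payload FROM outbox WHERE published = 0").fetchall()
assert pending == [("OrderCreated:o-1",)]
```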

Saga Pattern — Choreography vs Orchestration

A pattern for managing distributed transactions across microservices using a sequence of local transactions with compensating actions on failure.

  • Choreography: services emit events and react to each other — decentralised but hard to trace
  • Orchestration: a central coordinator directs the sequence — explicit flow but single point of concern
  • Each step has a compensating transaction for rollback
  • Choose orchestration when the flow has many steps or complex branching
  • Interview favourite: "How would you handle a failed payment after order creation?"

Resources: Baeldung — Saga Pattern · System Design Primer

Event Sourcing Basics

Instead of storing current state, persist every state-changing event as an immutable log. Current state is derived by replaying events.

  • Full audit trail by design — every change is a first-class record
  • Enables temporal queries: "What was the state at time T?"
  • Snapshots prevent slow replay for aggregates with many events
  • Pairs naturally with CQRS for separate read and write models

Resources: Martin Kleppmann · System Design Primer
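
State-as-a-fold-over-events, in miniature (event shapes are made up for the example):

```python
# Current state is derived by replaying the immutable event log; a
# temporal query is just a replay of a prefix.

events = [
    {"type": "Deposited", "amount": 100},
    {"type": "Withdrawn", "amount": 30},
    {"type": "Deposited", "amount": 5},
]

def apply(balance, event):
    if event["type"] == "Deposited":
        return balance + event["amount"]
    if event["type"] == "Withdrawn":
        return balance - event["amount"]
    return balance

balance_now = 0
for e in events:                      # full replay -> current state
    balance_now = apply(balance_now, e)

balance_after_two = apply(apply(0, events[0]), events[1])
assert balance_now == 75
assert balance_after_two == 70        # "what was the state at time T?"
```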

CQRS Overview

Command Query Responsibility Segregation separates the write model (commands) from the read model (queries), allowing independent optimisation of each.

  • Write side: normalised, optimised for consistency and validation
  • Read side: denormalised, optimised for query patterns and speed
  • Eventual consistency between read and write models is the trade-off
  • Often combined with event sourcing: events update the read projections

Resources: Martin Kleppmann · System Design Primer

Resilience Patterns (Both)

Resilience

Circuit Breaker (Resilience4j)

Monitors calls to a downstream service and "trips open" after a failure threshold, failing fast instead of waiting for inevitable timeouts.

  • Three states: CLOSED (normal) → OPEN (fail fast) → HALF_OPEN (probe)
  • Configure: failureRateThreshold, waitDurationInOpenState, permittedNumberOfCallsInHalfOpenState
  • Prevents cascading failures by shedding load on unhealthy dependencies
  • Pair with a fallback to return cached/default responses when open

Resources: Baeldung — Resilience4j · System Design Primer
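
The three-state machine can be sketched in a few lines — this is the concept, not Resilience4j's implementation (which adds sliding windows, half-open call budgets, and more):

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=3, open_seconds=30):
        self.failures = 0
        self.threshold = failure_threshold
        self.open_seconds = open_seconds
        self.state = "CLOSED"
        self.opened_at = 0.0

    def call(self, fn):
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at >= self.open_seconds:
                self.state = "HALF_OPEN"          # allow a probe call
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.state == "HALF_OPEN" or self.failures >= self.threshold:
                self.state = "OPEN"               # trip (or re-trip)
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        self.state = "CLOSED"                     # successful probe closes it
        return result

cb = CircuitBreaker(failure_threshold=2, open_seconds=60)

def flaky():
    raise ConnectionError("downstream down")

for _ in range(2):
    try:
        cb.call(flaky)
    except ConnectionError:
        pass
assert cb.state == "OPEN"   # trips after the failure threshold
```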

Retry with Exponential Backoff & Jitter

Automatically retry failed calls with increasing delays and random jitter to avoid thundering herd problems.

  • Delay = base * 2^attempt + random(0, jitter)
  • Jitter decorrelates retries from multiple clients
  • Set a max retry count to avoid infinite loops
  • Only retry on transient errors (5xx, timeout), not 4xx
  • AWS recommends "full jitter" for best spread

Resources: Baeldung — Backoff & Jitter · AWS — Retry Best Practices
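
AWS's "full jitter" variant — sleep a uniform amount between zero and the exponential ceiling — fits in one function:

```python
import random

random.seed(7)

# Full jitter: delay ~ Uniform(0, min(cap, base * 2^attempt)).
# The randomness decorrelates retries from many clients that all failed
# at the same moment.

def full_jitter_delay(attempt, base=0.1, cap=10.0):
    return random.uniform(0, min(cap, base * 2 ** attempt))

delays = [full_jitter_delay(a) for a in range(6)]
assert all(0 <= d <= 10.0 for d in delays)
assert full_jitter_delay(2) <= 0.1 * 2 ** 2   # never above the exp. ceiling
```

In practice this is paired with a maximum attempt count and retried only on transient errors, per the bullets above.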

Bulkhead Isolation

Limit concurrency to a dependency so that a slow or failing service cannot exhaust the caller's entire thread pool.

  • Thread-pool bulkhead: dedicated thread pool per dependency
  • Semaphore bulkhead: limits concurrent calls without a separate pool
  • Prevents one slow dependency from consuming all resources
  • Named after ship bulkheads that contain flooding to one compartment

Resources: Baeldung — Resilience4j · System Design Primer

Rate Limiting Algorithms

Techniques to cap the number of requests a system accepts in a given window, protecting against abuse and overload.

  • Token bucket: tokens refill at a steady rate; each request costs a token. Allows bursts up to bucket capacity
  • Leaky bucket: requests drain at a fixed rate; excess is queued or rejected. Smooths output rate
  • Sliding window: counts requests in a rolling time window, more accurate than fixed windows
  • Distributed rate limiting: use Redis (INCR + TTL or Lua script) for cluster-wide limits

Resources: System Design Primer — Rate Limiting · Baeldung — Rate Limiting

Timeout & Deadline Propagation

Setting time bounds on every outbound call and propagating remaining budget through the call chain to prevent indefinite waits.

  • Every network call must have a timeout — no timeout is a bug
  • Deadline propagation: pass remaining time budget in request headers (e.g., gRPC deadlines)
  • Downstream services should check remaining budget before expensive work
  • Prevents resource leaks from hung connections piling up

Resources: Baeldung — Timeouts · System Design Primer

Fallback Strategies

Predefined alternative responses when a primary call fails, ensuring graceful degradation instead of hard errors.

  • Cache fallback: return stale cached data when upstream is down
  • Default value: return a safe static response
  • Alternate service: route to a secondary provider
  • Combine with circuit breaker: fallback fires when circuit is OPEN

Resources: Baeldung — Resilience4j · System Design Primer

Resilience Libraries

Resilience4j Annotations

Declarative resilience in Spring Boot using annotations like @CircuitBreaker, @Retry, @Bulkhead, and @RateLimiter.

  • @CircuitBreaker(name="myService", fallbackMethod="fallback")
  • @Retry(name="myService", fallbackMethod="fallback")
  • Order of decorators matters: Retry > CircuitBreaker > RateLimiter > Bulkhead
  • Configure instances in application.yml under resilience4j.circuitbreaker.instances
  • Actuator endpoints expose metrics: /actuator/circuitbreakers

Resources: Baeldung — Resilience4j · Baeldung — Spring + Resilience4j

Spring Retry

A Spring module providing @Retryable annotation for declarative retry logic with configurable backoff and recovery methods.

  • @Retryable(value=Exception.class, maxAttempts=3, backoff=@Backoff(delay=1000, multiplier=2))
  • @Recover method handles final failure after all retries exhausted
  • Enable with @EnableRetry on configuration class
  • Simpler than Resilience4j but lacks circuit breaker and bulkhead

Resources: Baeldung — Spring Retry · System Design Primer

Feign Client with Fallback

Using Spring Cloud OpenFeign's fallback mechanism to provide default responses when a remote service is unavailable.

  • Define fallback class implementing the Feign interface with default responses
  • @FeignClient(name="service", fallback=ServiceFallback.class)
  • Integrates with Resilience4j CircuitBreaker via spring-cloud-circuitbreaker-resilience4j
  • FallbackFactory gives access to the cause exception for smarter fallbacks

Resources: Baeldung — OpenFeign · Baeldung — Resilience4j

Consistent Hashing & Load Balancing (Light)

Hashing

Virtual Nodes

Mapping each physical node to multiple positions on the hash ring to improve load distribution and reduce hotspots.

  • Without vnodes, adding or removing a node shifts load only to or from one clockwise neighbour, creating hotspots
  • With vnodes, load redistributes more evenly across all nodes
  • Typical: 100-256 vnodes per physical node
  • Trade-off: more vnodes = better balance but larger routing metadata

Resources: System Design Primer — Consistent Hashing · Martin Kleppmann

Consistent Hash Ring

A ring-based hash space where keys and nodes are mapped onto the same circle; a key is assigned to the next node clockwise on the ring.

  • Adding a node only reassigns keys from its clockwise neighbour — minimal disruption
  • Removing a node shifts its keys to the next node — O(K/N) keys move
  • Used in Cassandra, DynamoDB, Memcached, and CDN routing
  • Combines with virtual nodes for even distribution

Resources: System Design Primer — Consistent Hashing · Martin Kleppmann
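
A ring with virtual nodes, sketched (illustrative, not Cassandra's implementation; MD5 is just a convenient stable hash):

```python
import bisect
import hashlib

def h(s: str) -> int:
    return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

class HashRing:
    def __init__(self, nodes, vnodes=100):
        # each physical node owns `vnodes` points on the ring
        self.ring = sorted((h(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
        self.keys = [k for k, _ in self.ring]

    def node_for(self, key):
        i = bisect.bisect(self.keys, h(key)) % len(self.keys)  # next clockwise
        return self.ring[i][1]

ring = HashRing(["a", "b", "c"])
owner = ring.node_for("user-123")
assert owner in {"a", "b", "c"}
assert ring.node_for("user-123") == owner   # routing is stable

# Removing a node only reassigns the keys it owned; everyone else's
# keys keep the same owner -- the minimal-disruption property:
smaller = HashRing([n for n in ["a", "b", "c"] if n != owner])
stable = all(
    smaller.node_for(f"k{i}") == ring.node_for(f"k{i}")
    for i in range(200) if ring.node_for(f"k{i}") != owner
)
assert stable
```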

Rendezvous Hashing

Also called "highest random weight" hashing: for each key, compute a score for every node and pick the node with the highest score.

  • No ring structure needed — each key independently selects its node
  • Adding/removing a node only affects keys that had that node as their top choice
  • Simpler to implement than consistent hashing with vnodes
  • O(N) per lookup where N = number of nodes; fine for moderate N

Resources: System Design Primer · Martin Kleppmann
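
Rendezvous hashing needs no ring at all — score every node per key and take the max (MD5 again as a stand-in hash):

```python
import hashlib

# Highest-random-weight: each key independently picks the node with the
# best score. O(N) per lookup, but trivially correct and stateless.

def score(key, node):
    return int.from_bytes(hashlib.md5(f"{key}:{node}".encode()).digest()[:8], "big")

def owner(key, nodes):
    return max(nodes, key=lambda n: score(key, n))

nodes = ["a", "b", "c", "d"]
before = {f"k{i}": owner(f"k{i}", nodes) for i in range(300)}
after = {k: owner(k, [n for n in nodes if n != "d"]) for k in before}

# Only keys whose top choice was the removed node move:
moved = [k for k in before if before[k] != after[k]]
assert all(before[k] == "d" for k in moved)
```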

Jump Consistent Hash

A fast, minimal-memory consistent hash function that maps keys to buckets with near-perfect balance and minimal reassignment.

  • O(ln N) time, O(1) memory — no ring or vnode table
  • Only supports adding/removing the last bucket — not arbitrary nodes
  • Ideal for sharding across a known, sequential set of servers
  • Google paper: "A Fast, Minimal Memory, Consistent Hash Algorithm"

Resources: System Design Primer · Martin Kleppmann
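
The whole algorithm from the Lamping & Veach paper is a handful of lines (transcribed to Python from the paper's C++):

```python
# Jump consistent hash: maps a 64-bit key to one of num_buckets buckets
# with near-perfect balance, using O(1) memory and no lookup table.

def jump_hash(key: int, num_buckets: int) -> int:
    b, j = -1, 0
    while j < num_buckets:
        b = j
        key = (key * 2862933555777941757 + 1) & 0xFFFFFFFFFFFFFFFF
        j = int((b + 1) * (1 << 31) / ((key >> 33) + 1))
    return b

assert 0 <= jump_hash(123456789, 10) < 10

# Growing from 10 to 11 buckets moves only ~1/11 of keys:
moved = sum(1 for k in range(10_000) if jump_hash(k, 10) != jump_hash(k, 11))
assert moved < 2_000   # roughly 9% in expectation, comfortably under 20%
```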

Load Balancing

Round-Robin / Weighted / Least-Connections

The three fundamental load-balancing algorithms used to distribute traffic across backend servers.

  • Round-robin: cycles through servers sequentially — simple, assumes uniform capacity
  • Weighted round-robin: servers with higher weights get proportionally more requests
  • Least-connections: routes to the server with fewest active connections — better for uneven request durations
  • Many LBs support weighted least-connections as a hybrid

Resources: System Design Primer — Load Balancer · AWS ELB Docs

L4 vs L7 LB

Layer-4 load balancers route based on TCP/UDP port and IP, while Layer-7 balancers inspect HTTP headers, paths, and cookies.

  • L4: faster, lower overhead, no TLS termination needed at LB
  • L7: content-based routing, path/host rules, header inspection, WebSocket support
  • L7 enables A/B testing, canary deployments, and API gateway patterns
  • AWS: NLB (L4) vs ALB (L7); both support TLS termination

Resources: System Design Primer — Load Balancer · AWS ELB Docs

Sticky Sessions

Binding a client to a specific backend server for the duration of a session, typically using cookies or source IP hashing.

  • Useful when server holds in-memory session state
  • ALB sticky sessions use an AWSALB cookie with configurable duration
  • Drawback: uneven load distribution if some sessions are heavier
  • Better alternative: externalize session state (Redis, DB) and use stateless servers

Resources: AWS ALB — Sticky Sessions · System Design Primer

Health Check Strategies

Mechanisms for a load balancer to verify backend server availability, removing unhealthy instances from the rotation.

  • TCP health check: verifies port is open — fast but shallow
  • HTTP health check: hits a /health endpoint, checks status code — verifies app is responsive
  • Deep health check: /health/ready verifies DB, cache, and downstream dependencies
  • Configure: interval, timeout, healthy/unhealthy thresholds
  • Separate liveness (restart me) from readiness (stop sending traffic)

Resources: AWS ALB — Health Checks · Baeldung — Spring Actuator

Recommended Resources