
Pillar 4 — Distributed Systems

The theory behind what you build every day

Core Theory (Full-Focus)

Tip: You don't need to implement Raft, but explaining it confidently in HLD is a differentiator at top-tier companies.

Consensus & Replication

Leader Election

The process by which distributed nodes agree on a single coordinator to avoid split-brain and order operations consistently.

  • Prevents conflicting writes by funnelling mutations through one node
  • Common in ZooKeeper (ZAB), etcd (Raft), and Kafka controller
  • Election triggers: leader crash, network partition, or manual failover
  • Trade-off between election speed and false-positive detection

Resources: System Design Primer · Kleppmann on Distributed Locking

Raft Consensus (Conceptual)

An understandable consensus algorithm that decomposes the problem into leader election, log replication, and safety.

  • Three roles: Leader, Follower, Candidate
  • Leader appends entries to its log and replicates to followers
  • Committed once a majority (quorum) acknowledges
  • Term numbers prevent stale leaders from accepting writes

Resources: System Design Primer · Martin Kleppmann
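
The term rule above is the heart of Raft's safety argument. A minimal sketch in Python (illustrative only, not the real Raft implementation — the class and method names are made up):

```python
# Sketch: how term numbers let a follower reject entries from a stale leader.

class Follower:
    def __init__(self):
        self.current_term = 0
        self.log = []

    def append_entries(self, leader_term, entries):
        # Raft rule: reply false if the sender's term is older than ours --
        # it is a deposed leader that has not yet learned about the new term.
        if leader_term < self.current_term:
            return False
        self.current_term = leader_term  # adopt the newer term
        self.log.extend(entries)
        return True

f = Follower()
assert f.append_entries(2, ["set x=1"]) is True   # current leader, term 2
assert f.append_entries(1, ["set x=9"]) is False  # stale leader, term 1
assert f.log == ["set x=1"]
```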

Quorum Reads/Writes

A strategy requiring W + R > N to guarantee overlap between write and read replica sets, ensuring consistency without full replication.

  • N = total replicas, W = write quorum, R = read quorum
  • W + R > N ensures at least one node has the latest write
  • Tunable: favour availability (low W) or consistency (high W)
  • Used in Cassandra, DynamoDB, and Riak

Resources: System Design Primer · Martin Kleppmann
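
The overlap guarantee is easy to demonstrate with a toy read path (a sketch, assuming each replica stores a (version, value) pair — real systems use vector clocks or timestamps):

```python
# With W + R > N, every read quorum overlaps every write quorum, so the
# highest version among any R replicas is the latest committed write.

N, W, R = 3, 2, 2
assert W + R > N  # the overlap condition

replicas = [(2, "new"), (2, "new"), (1, "old")]  # one replica is stale

def quorum_read(replicas, r):
    sampled = replicas[:r]   # in reality: any R reachable replicas
    return max(sampled)[1]   # value with the highest version wins

latest = quorum_read(replicas, R)
assert latest == "new"
```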

Gossip Protocol

A peer-to-peer communication protocol where nodes periodically exchange state information with random peers, eventually propagating data to all nodes.

  • Probabilistic: O(log N) rounds to reach all nodes
  • Resilient to node failures — no single point of failure
  • Used for failure detection, membership, and metadata propagation
  • Examples: Cassandra gossip, Consul serf, SWIM protocol

Resources: System Design Primer · Martin Kleppmann
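
The O(log N) claim can be checked with a tiny push-gossip simulation (illustrative only — real protocols like SWIM add pull, fanout, and failure suspicion):

```python
import random

random.seed(42)

# Each informed node tells one random peer per round; the informed set
# roughly doubles each round, so full dissemination takes O(log N) rounds.

N = 1000
informed = {0}
rounds = 0
while len(informed) < N:
    rounds += 1
    for node in list(informed):
        informed.add(random.randrange(N))

assert len(informed) == N
assert rounds < 30   # on the order of log2(1000) ~ 10, nowhere near O(N)
```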

Vector Clocks

A mechanism to capture causal ordering of events across distributed nodes, enabling conflict detection without a global clock.

  • Each node maintains a vector of logical counters, one per node
  • If V(a) ≤ V(b) in every position and strictly less in at least one, a causally precedes b
  • Concurrent events have incomparable vectors — require conflict resolution
  • Used in Amazon Dynamo's original design for versioning

Resources: System Design Primer · Martin Kleppmann
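
The comparison rule fits in two functions (a sketch with assumed names, not from any library):

```python
# Vector-clock comparison: a -> b iff a is <= b everywhere and < somewhere.

def happens_before(a, b):
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def concurrent(a, b):
    return not happens_before(a, b) and not happens_before(b, a)

v1 = [1, 0, 0]   # event on node 0
v2 = [2, 1, 0]   # later event that has seen v1
v3 = [0, 0, 1]   # independent event on node 2

assert happens_before(v1, v2)
assert concurrent(v1, v3)   # incomparable vectors need conflict resolution
```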

Lamport Timestamps

A simple logical clock that orders events consistently with causality; ties can be broken by node ID to yield a total order, but the clock alone cannot determine whether two events are causally related.

  • Each event increments the local counter; messages carry timestamps
  • Receiver sets its clock to max(local, received) + 1
  • If a → b then L(a) < L(b), but the converse is not necessarily true
  • Foundation for more sophisticated clocks like vector clocks

Resources: System Design Primer · Martin Kleppmann
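
The tick and receive rules above can be sketched directly (class name is illustrative):

```python
# Lamport clock: local events increment; receives apply max(local, ts) + 1.

class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self):          # local event or send
        self.time += 1
        return self.time

    def receive(self, ts):   # message arrived carrying timestamp ts
        self.time = max(self.time, ts) + 1
        return self.time

a, b = LamportClock(), LamportClock()
t_send = a.tick()            # a's clock becomes 1
t_recv = b.receive(t_send)   # b's clock jumps to max(0, 1) + 1 = 2
assert t_recv > t_send       # if send -> receive, then L(send) < L(receive)
```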

Failure Modes

Network Partitions

When a network fault divides a cluster into isolated subgroups that cannot communicate, forcing a choice between consistency and availability (CAP theorem).

  • CAP theorem: during a partition, choose CP (reject writes) or AP (allow divergence)
  • Real-world partitions are partial and transient, not binary
  • Detecting partitions: heartbeat timeouts, phi-accrual failure detector
  • Recovery involves conflict resolution (last-write-wins, CRDTs, manual merge)

Resources: System Design Primer — CAP · Martin Kleppmann

Byzantine Faults (Awareness)

Failures where a node behaves arbitrarily (sends conflicting messages, lies, or acts maliciously) rather than simply crashing.

  • Requires 3f+1 nodes to tolerate f Byzantine faults
  • Most internal systems assume crash-fail, not Byzantine
  • Relevant for blockchain, multi-tenant, or untrusted environments
  • Interview angle: know the distinction, not the full BFT algorithm

Resources: System Design Primer · Martin Kleppmann

Cascading Failures

When the failure of one component overloads its dependents, triggering a domino effect across the system.

  • Common trigger: retry storms after a single service goes down
  • Mitigations: circuit breakers, bulkheads, load shedding, backpressure
  • Timeouts without backoff amplify the problem
  • Design for graceful degradation, not just happy-path throughput

Resources: System Design Primer · Baeldung — Resilience4j

Split-Brain Scenarios

When a partition causes two sub-clusters to each elect their own leader, leading to divergent state and data corruption.

  • Classic in primary-replica setups without proper fencing
  • Prevention: quorum-based elections, fencing tokens, STONITH
  • ZooKeeper ephemeral nodes and leader epoch help detect stale leaders
  • Interview tip: tie this to leader election and network partition topics

Resources: System Design Primer · Kleppmann on Distributed Locking
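
A fencing token in miniature — this is the idea from Kleppmann's distributed-locking article, with made-up names (a real store would persist the token):

```python
# The storage layer rejects writes whose token is older than the newest it
# has seen, so a paused or deposed leader cannot clobber committed data.

class FencedStore:
    def __init__(self):
        self.highest_token = 0
        self.data = {}

    def write(self, token, key, value):
        if token < self.highest_token:
            return False             # stale leader: fenced off
        self.highest_token = token
        self.data[key] = value
        return True

store = FencedStore()
assert store.write(34, "x", "from new leader")
assert not store.write(33, "x", "from old leader")   # rejected
assert store.data["x"] == "from new leader"
```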

Kafka Depth (Full-Focus)

Tip: This is your strongest card — real production experience. Nail the guarantees section to make this a genuine differentiator.

Kafka Internals

Partition & Offset Model

Kafka's fundamental data model: topics are split into ordered, immutable partitions, and each record within a partition has a unique sequential offset.

  • Partitions enable parallelism — each consumed by one consumer in a group
  • Offsets are per-partition, monotonically increasing, and never reused
  • Ordering is guaranteed only within a partition, not across partitions
  • Partition count is a one-way door — increasing changes key routing
  • Key-based partitioning ensures related events land in the same partition

Resources: Kafka Docs — Topics · System Design Primer
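
Key-based routing in a nutshell (a simplification: Kafka's default partitioner actually uses murmur2 on the key bytes; MD5 here is just a stand-in to show the idea):

```python
import hashlib

# Same key -> same partition, so per-key ordering is preserved.

def partition_for(key: str, num_partitions: int) -> int:
    digest = hashlib.md5(key.encode()).digest()          # stand-in hash
    return int.from_bytes(digest[:4], "big") % num_partitions

p1 = partition_for("order-42", 12)
p2 = partition_for("order-42", 12)
assert p1 == p2 and 0 <= p1 < 12   # related events land together

# Changing the partition count changes the modulus, so keys may reroute --
# the "one-way door" mentioned above:
p3 = partition_for("order-42", 16)
```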

Log Segment & Compaction

Kafka stores data as append-only log segments on disk. Compaction retains only the latest value per key, enabling infinite retention for changelog topics.

  • Segments are time- or size-bounded files; old segments are deleted or compacted
  • Delete policy: remove segments older than retention.ms
  • Compact policy: keep latest record per key, discard older duplicates
  • Compacted topics power KTable state stores and CDC use cases

Resources: Kafka Docs — Compaction · Baeldung — Log Compaction
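
Compaction's keep-latest-per-key behaviour, sketched on a toy log (offsets are implied by list position; not Kafka's code):

```python
# Keep only the newest record per key; a None value is a tombstone that
# deletes the key once compaction has propagated it.

log = [("user-1", "a@x"), ("user-2", "b@y"),
       ("user-1", "a@z"), ("user-2", None)]

def compact(records):
    latest = {}
    for offset, (key, value) in enumerate(records):
        latest[key] = (offset, value)
    kept = {k: ov for k, ov in latest.items() if ov[1] is not None}
    # survivors stay in offset order
    return [(k, v) for k, (o, v) in sorted(kept.items(), key=lambda kv: kv[1][0])]

compacted = compact(log)
assert compacted == [("user-1", "a@z")]   # user-2 removed by its tombstone
```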

ISR — In-Sync Replicas

The set of replicas that are fully caught up with the leader. Only ISR members are eligible to become leader on failover.

  • A replica falls out of ISR if it lags beyond replica.lag.time.max.ms
  • min.insync.replicas defines minimum ISR size for acks=all writes
  • If ISR shrinks below min.insync.replicas, producers receive NotEnoughReplicasException
  • Monitoring ISR shrink events is critical for operational health

Resources: Kafka Docs — Replication · Baeldung — ISR

Leader Epoch

A monotonically increasing counter that identifies a leader's term, preventing stale replicas from truncating valid committed data during recovery.

  • Replaces the older high-watermark-based truncation that could lose data
  • On becoming leader, the new leader increments the epoch
  • Followers use the leader epoch to determine the correct truncation point
  • Critical for exactly-once and transactional producer correctness

Resources: Kafka Documentation · System Design Primer

Consumer Group Rebalancing

When consumers join, leave, or crash, Kafka redistributes partition ownership across the group to maintain even load.

  • Eager rebalance: revoke all partitions, then reassign (causes pause)
  • Cooperative (incremental) rebalance: only migrate affected partitions
  • Triggers: consumer crash, new consumer, topic metadata change
  • session.timeout.ms and heartbeat.interval.ms control detection speed
  • Rebalance storms can kill throughput — tune max.poll.interval.ms

Resources: Kafka Docs — Consumer Configs · Baeldung — Rebalancing

Sticky Partition Assignor

An assignment strategy that minimises partition movement during rebalances, keeping existing assignments intact when possible.

  • Reduces rebalance overhead compared to range or round-robin assignors
  • Preserves consumer-to-partition affinity across rebalances
  • Works well with cooperative rebalance protocol
  • Configure via partition.assignment.strategy on consumer

Resources: Kafka Docs — Consumer Configs · Baeldung — Rebalancing

Delivery Guarantees

At-Most-Once / At-Least-Once / Exactly-Once

The three delivery semantics that define how many times a consumer processes a message, each with distinct trade-offs.

  • At-most-once: commit offset before processing — may lose messages
  • At-least-once: commit after processing — may duplicate on crash
  • Exactly-once: idempotent producer + transactional consumer — highest overhead
  • Most production systems use at-least-once with idempotent consumers
  • Interview angle: explain when exactly-once is worth the cost

Resources: Kafka Docs — Semantics · Baeldung — Exactly Once
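
The common production pattern — at-least-once delivery plus an idempotent consumer — in sketch form (the dedup set would be a unique key in a database in real life; names are illustrative):

```python
# Redelivery after a crash is harmless because processed event IDs are
# remembered, so the side effect runs exactly once.

processed_ids = set()
side_effects = []

def handle(event_id, payload):
    if event_id in processed_ids:
        return "skipped-duplicate"
    side_effects.append(payload)       # the real work
    processed_ids.add(event_id)        # record before committing the offset
    return "processed"

assert handle("evt-1", "charge $10") == "processed"
# crash before offset commit -> the broker redelivers the same event:
assert handle("evt-1", "charge $10") == "skipped-duplicate"
assert side_effects == ["charge $10"]  # the charge happened exactly once
```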

Idempotent Producer

A Kafka producer mode that assigns a producer ID and sequence number to each record, allowing the broker to de-duplicate retries.

  • Enable with enable.idempotence=true (default since Kafka 3.0)
  • Guarantees exactly-once per partition for a single producer session
  • Requires acks=all and max.in.flight.requests.per.connection ≤ 5
  • Does not span multiple partitions — use transactions for that

Resources: Kafka Docs — Producer Configs · Baeldung — Idempotent Producer

Transactional Producer (2PC)

Enables atomic writes across multiple partitions and topics, ensuring all-or-nothing semantics via a two-phase commit protocol internal to Kafka.

  • Set transactional.id on the producer to enable transactions
  • beginTransaction() → send() → sendOffsetsToTransaction() → commitTransaction()
  • Consumers with isolation.level=read_committed see only committed records
  • Adds latency — use only when atomicity across partitions is required

Resources: Kafka Docs — Transactions · Baeldung — Exactly Once

Consumer Offset Commit Strategies

How and when a consumer marks records as processed, directly impacting delivery guarantees and rebalance behaviour.

  • Auto-commit (enable.auto.commit=true): commits on a timer, decoupled from processing, so a crash can both lose and duplicate messages
  • Sync commit (commitSync): blocks until broker confirms, safer but slower
  • Async commit (commitAsync): non-blocking, risk of out-of-order commits
  • Best practice: async in loop, sync in shutdown/rebalance listener

Resources: Kafka Docs — Consumer Configs · Baeldung — Offset Commit

Kafka Tuning

batch.size & linger.ms

Producer batching controls that trade latency for throughput by accumulating records before sending.

  • batch.size (default 16KB): max bytes per batch per partition
  • linger.ms (default 0): how long to wait for more records before sending
  • Increasing both improves throughput but adds latency
  • For high-throughput: batch.size=64KB+, linger.ms=5-20

Resources: Kafka Docs — Producer Configs · Baeldung — Kafka Producer

acks=all vs acks=1

The producer acknowledgement setting that controls durability guarantees at the cost of latency.

  • acks=0: fire-and-forget, fastest, may lose data
  • acks=1: leader acknowledges, fast but risks data loss if leader dies before replication
  • acks=all (-1): all ISR replicas acknowledge, strongest durability
  • Pair acks=all with min.insync.replicas=2 for production safety

Resources: Kafka Docs — Producer Configs · Baeldung — Reliability

max.poll.records

Controls the maximum number of records returned in a single consumer poll(), affecting processing latency and rebalance stability.

  • Default is 500; lower it for slow processing to avoid rebalance timeouts
  • Must finish processing within max.poll.interval.ms or get kicked from group
  • Higher values improve throughput for fast, lightweight consumers
  • Tune alongside max.poll.interval.ms and session.timeout.ms

Resources: Kafka Docs — Consumer Configs · Baeldung — Consumer Config

Lag Monitoring

Tracking the difference between the latest produced offset and the consumer's committed offset to detect processing bottlenecks.

  • Consumer lag = log-end-offset minus committed-offset per partition
  • Tools: kafka-consumer-groups.sh, Burrow, Prometheus + JMX exporter
  • Increasing lag signals under-provisioned consumers or slow processing
  • Alert on sustained lag growth, not momentary spikes

Resources: Kafka Docs — Monitoring · System Design Primer
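
The lag arithmetic itself is trivial; the offsets below are made-up numbers for illustration:

```python
# Consumer lag per partition = log-end-offset minus committed offset.

log_end   = {0: 1200, 1: 980, 2: 1500}   # latest produced offset
committed = {0: 1200, 1: 950, 2: 700}    # consumer's committed offset

lag = {p: log_end[p] - committed[p] for p in log_end}
total_lag = sum(lag.values())
worst = max(lag, key=lag.get)

assert lag == {0: 0, 1: 30, 2: 800}
assert worst == 2     # partition 2 is the bottleneck worth alerting on
```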

Schema Registry Basics

A centralized service (typically Confluent Schema Registry) that stores and validates Avro/Protobuf/JSON schemas for Kafka topics, enforcing compatibility.

  • Schemas are versioned and identified by a numeric ID embedded in messages
  • Compatibility modes: BACKWARD, FORWARD, FULL, NONE
  • Prevents breaking changes from reaching consumers
  • BACKWARD (default): new schema can read old data — safest for consumers

Resources: Kafka Documentation · Baeldung — Schema Registry

SQS & Async Patterns (Light)

SQS

Standard vs FIFO

AWS SQS offers two queue types with fundamentally different ordering and deduplication guarantees.

  • Standard: nearly unlimited throughput, best-effort ordering, at-least-once delivery
  • FIFO: strict ordering within message groups, exactly-once processing, 300 TPS (3000 with batching)
  • FIFO queues require MessageGroupId for ordering scope
  • Choose Standard unless ordering or deduplication is a hard requirement

Resources: AWS SQS Docs · System Design Primer

Visibility Timeout

The period during which a message is invisible to other consumers after being received, giving the current consumer time to process and delete it.

  • Default: 30 seconds; max: 12 hours
  • If processing takes longer, extend with ChangeMessageVisibility
  • On timeout expiry without deletion, message becomes visible again (retry)
  • Set ≥ expected processing time + buffer to avoid duplicate processing

Resources: AWS SQS — Visibility Timeout · Baeldung — SQS

Dead-Letter Queue (DLQ)

A secondary queue that receives messages that fail processing after a configured number of retries, preventing poison messages from blocking the pipeline.

  • Configure maxReceiveCount on the source queue's redrive policy
  • DLQ messages retain original attributes for debugging
  • Monitor DLQ depth as a key operational alert
  • Redrive: replay DLQ messages back to source queue after fixing the bug

Resources: AWS SQS — DLQ · Baeldung — SQS

Long Polling

A receive technique where the SQS call waits (up to 20 seconds) for messages to arrive before returning, reducing empty responses and cost.

  • Set WaitTimeSeconds > 0 on ReceiveMessage (or queue-level default)
  • Reduces API calls and cost compared to short polling
  • Polls all of the queue's backing servers rather than a sample, reducing false-empty responses
  • 20 seconds is the maximum and recommended value

Resources: AWS SQS — Long Polling · System Design Primer

Message Deduplication

Mechanisms to ensure a message is processed only once, critical for financial and order-processing pipelines.

  • FIFO queues: 5-minute deduplication window using MessageDeduplicationId
  • Content-based deduplication: SQS generates ID from SHA-256 of body
  • Standard queues: no built-in dedup; implement at consumer side (idempotency key in DB)
  • Combine with idempotent consumers for end-to-end exactly-once

Resources: AWS SQS — Deduplication · Baeldung — SQS

Async Patterns

Outbox Pattern

Write domain events to an "outbox" table in the same database transaction as the business operation, then relay them to the message broker asynchronously.

  • Solves the dual-write problem: DB + message broker atomicity
  • A poller or CDC connector (Debezium) reads the outbox and publishes
  • Guarantees at-least-once event delivery without distributed transactions
  • Outbox table columns: id, aggregate_type, payload, created_at, published

Resources: Baeldung — Outbox Pattern · System Design Primer
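
The single-transaction trick is the whole pattern. A sketch with SQLite (table and column names are illustrative, not a standard schema):

```python
import sqlite3

# The business row and the outbox event commit in ONE local transaction,
# so they can never diverge -- no dual-write problem.

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, total REAL)")
db.execute("""CREATE TABLE outbox (id INTEGER PRIMARY KEY,
              aggregate_type TEXT, payload TEXT, published INTEGER DEFAULT 0)""")

with db:  # atomic: both inserts commit or neither does
    db.execute("INSERT INTO orders VALUES ('o-1', 99.5)")
    db.execute("INSERT INTO outbox (aggregate_type, payload) "
               "VALUES ('Order', 'OrderCreated:o-1')")

# A relay (poller or Debezium CDC) later reads unpublished rows and
# publishes them to the broker, then flips the published flag:
pending = db.execute("SELECT payload FROM outbox WHERE published = 0").fetchall()
assert pending == [("OrderCreated:o-1",)]
```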

Saga Pattern — Choreography vs Orchestration

A pattern for managing distributed transactions across microservices using a sequence of local transactions with compensating actions on failure.

  • Choreography: services emit events and react to each other — decentralised but hard to trace
  • Orchestration: a central coordinator directs the sequence — explicit flow but single point of concern
  • Each step has a compensating transaction for rollback
  • Choose orchestration when the flow has many steps or complex branching
  • Interview favourite: "How would you handle a failed payment after order creation?"

Resources: Baeldung — Saga Pattern · System Design Primer

Event Sourcing Basics

Instead of storing current state, persist every state-changing event as an immutable log. Current state is derived by replaying events.

  • Full audit trail by design — every change is a first-class record
  • Enables temporal queries: "What was the state at time T?"
  • Snapshots prevent slow replay for aggregates with many events
  • Pairs naturally with CQRS for separate read and write models

Resources: Martin Kleppmann · System Design Primer
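
State-as-a-fold-over-events, in miniature (event shapes are made up for the example):

```python
# Current state is derived by replaying the immutable event log; a
# temporal query is just a replay of a prefix.

events = [
    {"type": "Deposited", "amount": 100},
    {"type": "Withdrawn", "amount": 30},
    {"type": "Deposited", "amount": 5},
]

def apply(balance, event):
    if event["type"] == "Deposited":
        return balance + event["amount"]
    if event["type"] == "Withdrawn":
        return balance - event["amount"]
    return balance

balance_now = 0
for e in events:                      # full replay -> current state
    balance_now = apply(balance_now, e)

balance_after_two = apply(apply(0, events[0]), events[1])
assert balance_now == 75
assert balance_after_two == 70        # "what was the state at time T?"
```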

CQRS Overview

Command Query Responsibility Segregation separates the write model (commands) from the read model (queries), allowing independent optimisation of each.

  • Write side: normalised, optimised for consistency and validation
  • Read side: denormalised, optimised for query patterns and speed
  • Eventual consistency between read and write models is the trade-off
  • Often combined with event sourcing: events update the read projections

Resources: Martin Kleppmann · System Design Primer

Resilience Patterns (Both)

Resilience

Circuit Breaker (Resilience4j)

Monitors calls to a downstream service and "trips open" after a failure threshold, failing fast instead of waiting for inevitable timeouts.

  • Three states: CLOSED (normal) → OPEN (fail fast) → HALF_OPEN (probe)
  • Configure: failureRateThreshold, waitDurationInOpenState, permittedNumberOfCallsInHalfOpenState
  • Prevents cascading failures by shedding load on unhealthy dependencies
  • Pair with a fallback to return cached/default responses when open

Resources: Baeldung — Resilience4j · System Design Primer
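
The three-state machine can be sketched in a few lines — this is the concept, not Resilience4j's implementation (which adds sliding windows, half-open call budgets, and more):

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=3, open_seconds=30):
        self.failures = 0
        self.threshold = failure_threshold
        self.open_seconds = open_seconds
        self.state = "CLOSED"
        self.opened_at = 0.0

    def call(self, fn):
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at >= self.open_seconds:
                self.state = "HALF_OPEN"          # allow a probe call
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.state == "HALF_OPEN" or self.failures >= self.threshold:
                self.state = "OPEN"               # trip (or re-trip)
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        self.state = "CLOSED"                     # successful probe closes it
        return result

cb = CircuitBreaker(failure_threshold=2, open_seconds=60)

def flaky():
    raise ConnectionError("downstream down")

for _ in range(2):
    try:
        cb.call(flaky)
    except ConnectionError:
        pass
assert cb.state == "OPEN"   # trips after the failure threshold
```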

Retry with Exponential Backoff & Jitter

Automatically retry failed calls with increasing delays and random jitter to avoid thundering herd problems.

  • Delay = base * 2^attempt + random(0, jitter)
  • Jitter decorrelates retries from multiple clients
  • Set a max retry count to avoid infinite loops
  • Only retry on transient errors (5xx, timeout), not 4xx
  • AWS recommends "full jitter" for best spread

Resources: Baeldung — Backoff & Jitter · AWS — Retry Best Practices
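
AWS's "full jitter" variant — sleep a uniform amount between zero and the exponential ceiling — fits in one function:

```python
import random

random.seed(7)

# Full jitter: delay ~ Uniform(0, min(cap, base * 2^attempt)).
# The randomness decorrelates retries from many clients that all failed
# at the same moment.

def full_jitter_delay(attempt, base=0.1, cap=10.0):
    return random.uniform(0, min(cap, base * 2 ** attempt))

delays = [full_jitter_delay(a) for a in range(6)]
assert all(0 <= d <= 10.0 for d in delays)
assert full_jitter_delay(2) <= 0.1 * 2 ** 2   # never above the exp. ceiling
```

In practice this is paired with a maximum attempt count and retried only on transient errors, per the bullets above.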

Bulkhead Isolation

Limit concurrency to a dependency so that a slow or failing service cannot exhaust the caller's entire thread pool.

  • Thread-pool bulkhead: dedicated thread pool per dependency
  • Semaphore bulkhead: limits concurrent calls without a separate pool
  • Prevents one slow dependency from consuming all resources
  • Named after ship bulkheads that contain flooding to one compartment

Resources: Baeldung — Resilience4j · System Design Primer

Rate Limiting Algorithms

Techniques to cap the number of requests a system accepts in a given window, protecting against abuse and overload.

  • Token bucket: tokens refill at a steady rate; each request costs a token. Allows bursts up to bucket capacity
  • Leaky bucket: requests drain at a fixed rate; excess is queued or rejected. Smooths output rate
  • Sliding window: counts requests in a rolling time window, more accurate than fixed windows
  • Distributed rate limiting: use Redis (INCR + TTL or Lua script) for cluster-wide limits

Resources: System Design Primer — Rate Limiting · Baeldung — Rate Limiting

Timeout & Deadline Propagation

Setting time bounds on every outbound call and propagating remaining budget through the call chain to prevent indefinite waits.

  • Every network call must have a timeout — no timeout is a bug
  • Deadline propagation: pass remaining time budget in request headers (e.g., gRPC deadlines)
  • Downstream services should check remaining budget before expensive work
  • Prevents resource leaks from hung connections piling up

Resources: Baeldung — Timeouts · System Design Primer

Fallback Strategies

Predefined alternative responses when a primary call fails, ensuring graceful degradation instead of hard errors.

  • Cache fallback: return stale cached data when upstream is down
  • Default value: return a safe static response
  • Alternate service: route to a secondary provider
  • Combine with circuit breaker: fallback fires when circuit is OPEN

Resources: Baeldung — Resilience4j · System Design Primer

Resilience Libraries

Resilience4j Annotations

Declarative resilience in Spring Boot using annotations like @CircuitBreaker, @Retry, @Bulkhead, and @RateLimiter.

  • @CircuitBreaker(name="myService", fallbackMethod="fallback")
  • @Retry(name="myService", fallbackMethod="fallback")
  • Order of decorators matters: Retry > CircuitBreaker > RateLimiter > Bulkhead
  • Configure instances in application.yml under resilience4j.circuitbreaker.instances
  • Actuator endpoints expose metrics: /actuator/circuitbreakers

Resources: Baeldung — Resilience4j · Baeldung — Spring + Resilience4j

Spring Retry

A Spring module providing @Retryable annotation for declarative retry logic with configurable backoff and recovery methods.

  • @Retryable(value=Exception.class, maxAttempts=3, backoff=@Backoff(delay=1000, multiplier=2))
  • @Recover method handles final failure after all retries exhausted
  • Enable with @EnableRetry on configuration class
  • Simpler than Resilience4j but lacks circuit breaker and bulkhead

Resources: Baeldung — Spring Retry · System Design Primer

Feign Client with Fallback

Using Spring Cloud OpenFeign's fallback mechanism to provide default responses when a remote service is unavailable.

  • Define fallback class implementing the Feign interface with default responses
  • @FeignClient(name="service", fallback=ServiceFallback.class)
  • Integrates with Resilience4j CircuitBreaker via spring-cloud-circuitbreaker-resilience4j
  • FallbackFactory gives access to the cause exception for smarter fallbacks

Resources: Baeldung — OpenFeign · Baeldung — Resilience4j

Consistent Hashing & Load Balancing (Light)

Hashing

Virtual Nodes

Mapping each physical node to multiple positions on the hash ring to improve load distribution and reduce hotspots.

  • Without vnodes, adding or removing a node shifts load only to or from one clockwise neighbour, creating hotspots
  • With vnodes, load redistributes more evenly across all nodes
  • Typical: 100-256 vnodes per physical node
  • Trade-off: more vnodes = better balance but larger routing metadata

Resources: System Design Primer — Consistent Hashing · Martin Kleppmann

Consistent Hash Ring

A ring-based hash space where keys and nodes are mapped onto the same circle; a key is assigned to the next node clockwise on the ring.

  • Adding a node only reassigns keys from its clockwise neighbour — minimal disruption
  • Removing a node shifts its keys to the next node — O(K/N) keys move
  • Used in Cassandra, DynamoDB, Memcached, and CDN routing
  • Combines with virtual nodes for even distribution

Resources: System Design Primer — Consistent Hashing · Martin Kleppmann
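
A ring with virtual nodes, sketched (illustrative, not Cassandra's implementation; MD5 is just a convenient stable hash):

```python
import bisect
import hashlib

def h(s: str) -> int:
    return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

class HashRing:
    def __init__(self, nodes, vnodes=100):
        # each physical node owns `vnodes` points on the ring
        self.ring = sorted((h(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
        self.keys = [k for k, _ in self.ring]

    def node_for(self, key):
        i = bisect.bisect(self.keys, h(key)) % len(self.keys)  # next clockwise
        return self.ring[i][1]

ring = HashRing(["a", "b", "c"])
owner = ring.node_for("user-123")
assert owner in {"a", "b", "c"}
assert ring.node_for("user-123") == owner   # routing is stable

# Removing a node only reassigns the keys it owned; everyone else's
# keys keep the same owner -- the minimal-disruption property:
smaller = HashRing([n for n in ["a", "b", "c"] if n != owner])
stable = all(
    smaller.node_for(f"k{i}") == ring.node_for(f"k{i}")
    for i in range(200) if ring.node_for(f"k{i}") != owner
)
assert stable
```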

Rendezvous Hashing

Also called "highest random weight" hashing: for each key, compute a score for every node and pick the node with the highest score.

  • No ring structure needed — each key independently selects its node
  • Adding/removing a node only affects keys that had that node as their top choice
  • Simpler to implement than consistent hashing with vnodes
  • O(N) per lookup where N = number of nodes; fine for moderate N

Resources: System Design Primer · Martin Kleppmann
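
Rendezvous hashing needs no ring at all — score every node per key and take the max (MD5 again as a stand-in hash):

```python
import hashlib

# Highest-random-weight: each key independently picks the node with the
# best score. O(N) per lookup, but trivially correct and stateless.

def score(key, node):
    return int.from_bytes(hashlib.md5(f"{key}:{node}".encode()).digest()[:8], "big")

def owner(key, nodes):
    return max(nodes, key=lambda n: score(key, n))

nodes = ["a", "b", "c", "d"]
before = {f"k{i}": owner(f"k{i}", nodes) for i in range(300)}
after = {k: owner(k, [n for n in nodes if n != "d"]) for k in before}

# Only keys whose top choice was the removed node move:
moved = [k for k in before if before[k] != after[k]]
assert all(before[k] == "d" for k in moved)
```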

Jump Consistent Hash

A fast, minimal-memory consistent hash function that maps keys to buckets with near-perfect balance and minimal reassignment.

  • O(ln N) time, O(1) memory — no ring or vnode table
  • Only supports adding/removing the last bucket — not arbitrary nodes
  • Ideal for sharding across a known, sequential set of servers
  • Google paper: "A Fast, Minimal Memory, Consistent Hash Algorithm"

Resources: System Design Primer · Martin Kleppmann
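
The whole algorithm from the Lamping & Veach paper is a handful of lines (transcribed to Python from the paper's C++):

```python
# Jump consistent hash: maps a 64-bit key to one of num_buckets buckets
# with near-perfect balance, using O(1) memory and no lookup table.

def jump_hash(key: int, num_buckets: int) -> int:
    b, j = -1, 0
    while j < num_buckets:
        b = j
        key = (key * 2862933555777941757 + 1) & 0xFFFFFFFFFFFFFFFF
        j = int((b + 1) * (1 << 31) / ((key >> 33) + 1))
    return b

assert 0 <= jump_hash(123456789, 10) < 10

# Growing from 10 to 11 buckets moves only ~1/11 of keys:
moved = sum(1 for k in range(10_000) if jump_hash(k, 10) != jump_hash(k, 11))
assert moved < 2_000   # roughly 9% in expectation, comfortably under 20%
```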

Load Balancing

Round-Robin / Weighted / Least-Connections

The three fundamental load-balancing algorithms used to distribute traffic across backend servers.

  • Round-robin: cycles through servers sequentially — simple, assumes uniform capacity
  • Weighted round-robin: servers with higher weights get proportionally more requests
  • Least-connections: routes to the server with fewest active connections — better for uneven request durations
  • Many LBs support weighted least-connections as a hybrid

Resources: System Design Primer — Load Balancer · AWS ELB Docs

L4 vs L7 LB

Layer-4 load balancers route based on TCP/UDP port and IP, while Layer-7 balancers inspect HTTP headers, paths, and cookies.

  • L4: faster, lower overhead, no TLS termination needed at LB
  • L7: content-based routing, path/host rules, header inspection, WebSocket support
  • L7 enables A/B testing, canary deployments, and API gateway patterns
  • AWS: NLB (L4) vs ALB (L7); both support TLS termination

Resources: System Design Primer — Load Balancer · AWS ELB Docs

Sticky Sessions

Binding a client to a specific backend server for the duration of a session, typically using cookies or source IP hashing.

  • Useful when server holds in-memory session state
  • ALB sticky sessions use an AWSALB cookie with configurable duration
  • Drawback: uneven load distribution if some sessions are heavier
  • Better alternative: externalize session state (Redis, DB) and use stateless servers

Resources: AWS ALB — Sticky Sessions · System Design Primer

Health Check Strategies

Mechanisms for a load balancer to verify backend server availability, removing unhealthy instances from the rotation.

  • TCP health check: verifies port is open — fast but shallow
  • HTTP health check: hits a /health endpoint, checks status code — verifies app is responsive
  • Deep health check: /health/ready verifies DB, cache, and downstream dependencies
  • Configure: interval, timeout, healthy/unhealthy thresholds
  • Separate liveness (restart me) from readiness (stop sending traffic)

Resources: AWS ALB — Health Checks · Baeldung — Spring Actuator

Recommended Resources