The process by which distributed nodes agree on a single coordinator to avoid split-brain and order operations consistently.
Resources: System Design Primer · Kleppmann on Distributed Locking
An understandable consensus algorithm that decomposes the problem into leader election, log replication, and safety.
Resources: System Design Primer · Martin Kleppmann
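A strategy requiring W + R > N (W write acknowledgements, R read responses, N replicas) so that every read set overlaps every write set in at least one replica, letting reads see the latest acknowledged write without waiting for all N replicas on each operation.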
Resources: System Design Primer · Martin Kleppmann
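The W + R > N rule can be checked by brute force. The sketch below is only an illustration; the function names are hypothetical and replicas are modelled as plain integers.

```python
from itertools import combinations

def quorums_overlap(n: int, w: int, r: int) -> bool:
    """True when every write set of size w intersects every read set of size r."""
    replicas = range(n)
    return all(
        set(write_set) & set(read_set)  # an empty intersection is falsy
        for write_set in combinations(replicas, w)
        for read_set in combinations(replicas, r)
    )

def is_strict_quorum(n: int, w: int, r: int) -> bool:
    """The arithmetic shortcut for the same property."""
    return w + r > n
```

With N = 3, choosing W = 2 and R = 2 always overlaps, while W = 1 and R = 2 can miss the single replica that took the write.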
A peer-to-peer communication protocol where nodes periodically exchange state information with random peers, eventually propagating data to all nodes.
Resources: System Design Primer · Martin Kleppmann
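To get a feel for how quickly gossip spreads, here is a small push-gossip simulation. It is a sketch under simplifying assumptions (synchronous rounds, no failures, a hypothetical `gossip_rounds` helper), not a model of any particular protocol.

```python
import random

def gossip_rounds(node_count: int, fanout: int = 1, seed: int = 0) -> int:
    """Simulate push gossip; return the rounds needed until every node has the update."""
    rng = random.Random(seed)  # seeded so runs are repeatable
    informed = {0}             # node 0 starts with the update
    rounds = 0
    while len(informed) < node_count:
        rounds += 1
        newly_informed = set()
        for node in informed:
            peers = [p for p in range(node_count) if p != node]
            # Each informed node pushes the update to `fanout` random peers.
            newly_informed.update(rng.sample(peers, min(fanout, len(peers))))
        informed |= newly_informed
    return rounds
```

With a fanout of 1 the informed set can at most double per round, so the round count grows roughly logarithmically with cluster size.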
A mechanism to capture causal ordering of events across distributed nodes, enabling conflict detection without a global clock.
Resources: System Design Primer · Martin Kleppmann
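Conflict detection with vector clocks comes down to an element-wise comparison. The sketch below represents a clock as a dict from node ID to counter; the `compare` and `merge` names are illustrative.

```python
def compare(a: dict, b: dict) -> str:
    """Classify clock a relative to clock b: equal, before, after, or concurrent."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"      # a happened before b
    if b_le_a:
        return "after"       # b happened before a
    return "concurrent"      # neither dominates: a conflict to resolve

def merge(a: dict, b: dict) -> dict:
    """Element-wise maximum, used when a node receives another node's clock."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}
```

Two writes whose clocks compare as "concurrent" were made without knowledge of each other, which is exactly the case a store must surface or reconcile.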
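A simple logical clock whose timestamps respect causality (if A happened before B, A's timestamp is smaller) and yield a total order once ties are broken by node ID. The converse does not hold: comparing two timestamps alone cannot tell whether the events are causally related or concurrent.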
Resources: System Design Primer · Martin Kleppmann
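The update rules of a Lamport clock fit in a few lines. This is a minimal sketch; the class and method names are illustrative.

```python
class LamportClock:
    """Logical clock: increment on local events, jump ahead on receive."""

    def __init__(self) -> None:
        self.time = 0

    def tick(self) -> int:
        """Record a local event."""
        self.time += 1
        return self.time

    def send(self) -> int:
        """Sending is an event; the returned timestamp travels with the message."""
        return self.tick()

    def receive(self, message_time: int) -> int:
        """Move past both the local time and the sender's timestamp."""
        self.time = max(self.time, message_time) + 1
        return self.time
```

A receive always gets a larger timestamp than the matching send, which is what keeps the ordering consistent with causality.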
When a network fault divides a cluster into isolated subgroups that cannot communicate, forcing a choice between consistency and availability (CAP theorem).
Resources: System Design Primer — CAP · Martin Kleppmann
Failures where a node behaves arbitrarily (sends conflicting messages, lies, or acts maliciously) rather than simply crashing.
Resources: System Design Primer · Martin Kleppmann
When the failure of one component overloads its dependents, triggering a domino effect across the system.
Resources: System Design Primer · Baeldung — Resilience4j
When a partition causes two sub-clusters to each elect their own leader, leading to divergent state and data corruption.
Resources: System Design Primer · Kleppmann on Distributed Locking
Kafka's fundamental data model: topics are split into ordered, immutable partitions, and each record within a partition has a unique sequential offset.
Resources: Kafka Docs — Topics · System Design Primer
Kafka stores data as append-only log segments on disk. Compaction retains at least the latest value for each key, so a changelog topic can be kept indefinitely without growing unbounded.
Resources: Kafka Docs — Compaction · Baeldung — Log Compaction
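The effect of compaction can be shown on a toy log. The sketch below is an assumption-laden simplification: records are `(offset, key, value)` tuples, and tombstones (a `None` value) are simply kept as the latest record, whereas Kafka eventually removes them.

```python
def compact(log):
    """Keep only the most recent record per key, preserving offset order.

    `log` is a list of (offset, key, value) tuples in increasing offset order.
    """
    latest = {}
    for offset, key, value in log:
        latest[key] = (offset, key, value)  # later records overwrite earlier ones
    return sorted(latest.values())          # offsets are unique, so this sorts by offset
```

Surviving records keep their original offsets; compaction leaves gaps rather than renumbering.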
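The set of replicas (including the leader) that have kept up with the leader's log within replica.lag.time.max.ms. Only ISR members are eligible to become leader on failover, unless unclean leader election is enabled (unclean.leader.election.enable=true), which trades possible data loss for availability.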
Resources: Kafka Docs — Replication · Baeldung — ISR
A monotonically increasing counter that identifies a leader's term, preventing stale replicas from truncating valid committed data during recovery.
Resources: Kafka Documentation · System Design Primer
When consumers join, leave, or crash, Kafka redistributes partition ownership across the group to maintain even load.
Resources: Kafka Docs — Consumer Configs · Baeldung — Rebalancing
An assignment strategy that minimises partition movement during rebalances, keeping existing assignments intact when possible.
Resources: Kafka Docs — Consumer Configs · Baeldung — Rebalancing
The three delivery semantics (at-most-once, at-least-once, and exactly-once) that define how many times a consumer may process a message, each with distinct trade-offs between data loss, duplication, and cost.
Resources: Kafka Docs — Semantics · Baeldung — Exactly Once
A Kafka producer mode in which the broker assigns each producer a producer ID and the producer tags records with a per-partition sequence number, allowing the broker to de-duplicate retries.
Resources: Kafka Docs — Producer Configs · Baeldung — Idempotent Producer
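The broker-side de-duplication idea can be sketched as follows. This is a simplified model with hypothetical names (`PartitionLog`, `append`); a real broker tracks sequence numbers per batch and also uses producer epochs.

```python
class PartitionLog:
    """Toy partition that rejects retried records using (producer_id, sequence)."""

    def __init__(self) -> None:
        self.records = []
        self.last_seq = {}  # producer_id -> last sequence number accepted

    def append(self, producer_id: int, seq: int, value) -> bool:
        last = self.last_seq.get(producer_id, -1)
        if seq <= last:
            return False  # duplicate of an already-written record: acknowledge, do not append
        if seq != last + 1:
            raise ValueError("out-of-order sequence number")
        self.records.append(value)
        self.last_seq[producer_id] = seq
        return True
```

A retry after a lost acknowledgement carries the same sequence number, so it is recognised and dropped instead of being written twice.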
Enables atomic writes across multiple partitions and topics, ensuring all-or-nothing semantics via a two-phase commit protocol internal to Kafka.
Resources: Kafka Docs — Transactions · Baeldung — Exactly Once
How and when a consumer marks records as processed, directly impacting delivery guarantees and rebalance behaviour.
Resources: Kafka Docs — Consumer Configs · Baeldung — Offset Commit
Producer batching controls (linger.ms and batch.size) that trade latency for throughput by accumulating records before sending.
Resources: Kafka Docs — Producer Configs · Baeldung — Kafka Producer
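The producer acknowledgement setting (acks), which trades durability against latency: acks=0 does not wait for the broker, acks=1 waits for the leader only, and acks=all waits for all in-sync replicas.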
Resources: Kafka Docs — Producer Configs · Baeldung — Reliability
Controls the maximum number of records returned by a single consumer poll(). Smaller batches shorten the time between polls, which helps a slow consumer stay within max.poll.interval.ms and avoid being removed from the group.
Resources: Kafka Docs — Consumer Configs · Baeldung — Consumer Config
Tracking the difference between the latest produced offset and the consumer's committed offset to detect processing bottlenecks.
Resources: Kafka Docs — Monitoring · System Design Primer
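Lag is a per-partition subtraction. The sketch below assumes two dicts keyed by partition number and treats a partition without a committed offset as starting from 0, which is a simplification (the real starting point depends on the consumer's reset policy).

```python
def consumer_lag(log_end_offsets: dict, committed_offsets: dict) -> dict:
    """Return {partition: records not yet processed} for one consumer group."""
    return {
        partition: end - committed_offsets.get(partition, 0)
        for partition, end in log_end_offsets.items()
    }
```

A lag that keeps growing on one partition while others stay flat usually points at a hot key or a stuck consumer rather than a general capacity problem.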
A centralized service (typically Confluent Schema Registry) that stores and validates Avro/Protobuf/JSON schemas for Kafka topics, enforcing compatibility.
Resources: Kafka Documentation · Baeldung — Schema Registry
AWS SQS offers two queue types, Standard and FIFO, with fundamentally different ordering and deduplication guarantees.
Resources: AWS SQS Docs · System Design Primer
The period during which a message is invisible to other consumers after being received, giving the current consumer time to process and delete it.
Resources: AWS SQS — Visibility Timeout · Baeldung — SQS
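The visibility-timeout behaviour can be modelled with an in-memory queue. This is a toy sketch, not the SQS API: the class, its methods, and the explicit `now` argument (used instead of a real clock) are all illustrative.

```python
class InMemoryQueue:
    """Toy queue where a received message is hidden until its timeout expires."""

    def __init__(self, visibility_timeout: float) -> None:
        self.visibility_timeout = visibility_timeout
        self.messages = {}         # message id -> body
        self.invisible_until = {}  # message id -> time it becomes visible again

    def send(self, msg_id: str, body: str) -> None:
        self.messages[msg_id] = body

    def receive(self, now: float):
        for msg_id, body in self.messages.items():
            if self.invisible_until.get(msg_id, 0) <= now:
                # Hide the message from other consumers while this one works on it.
                self.invisible_until[msg_id] = now + self.visibility_timeout
                return msg_id, body
        return None

    def delete(self, msg_id: str) -> None:
        """Called by the consumer after successful processing."""
        self.messages.pop(msg_id, None)
        self.invisible_until.pop(msg_id, None)
```

If the consumer crashes before calling `delete`, the message reappears once the timeout passes, which is why processing should be idempotent.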
A secondary queue that receives messages that fail processing after a configured number of retries, preventing poison messages from blocking the pipeline.
Resources: AWS SQS — DLQ · Baeldung — SQS
A receive technique where the SQS call waits (up to 20 seconds) for messages to arrive before returning, reducing empty responses and cost.
Resources: AWS SQS — Long Polling · System Design Primer
Mechanisms to ensure a message is processed only once, critical for financial and order-processing pipelines.
Resources: AWS SQS — Deduplication · Baeldung — SQS
Write domain events to an "outbox" table in the same database transaction as the business operation, then relay them to the message broker asynchronously.
Resources: Baeldung — Outbox Pattern · System Design Primer
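A minimal outbox sketch using SQLite from the Python standard library. The table names, the `OrderPlaced` event, and the `publish` callback are assumptions for illustration; a production relay would typically poll or tail the database log and publish to a real broker.

```python
import json
import sqlite3

def create_schema(conn: sqlite3.Connection) -> None:
    conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, total REAL)")
    conn.execute(
        "CREATE TABLE outbox ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT, sent INTEGER DEFAULT 0)"
    )

def place_order(conn: sqlite3.Connection, order_id: str, total: float) -> None:
    # One transaction: the order row and the outbox row commit together or not at all.
    with conn:
        conn.execute("INSERT INTO orders (id, total) VALUES (?, ?)", (order_id, total))
        event = {"type": "OrderPlaced", "order_id": order_id}
        conn.execute("INSERT INTO outbox (payload) VALUES (?)", (json.dumps(event),))

def relay_outbox(conn: sqlite3.Connection, publish) -> int:
    """Publish unsent events in order, marking each as sent; returns the count."""
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE sent = 0 ORDER BY id"
    ).fetchall()
    for row_id, payload in rows:
        publish(json.loads(payload))  # publish first, then mark: at-least-once delivery
        with conn:
            conn.execute("UPDATE outbox SET sent = 1 WHERE id = ?", (row_id,))
    return len(rows)
```

Because the relay marks a row only after publishing, a crash in between re-sends the event, so downstream consumers should tolerate duplicates.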
A pattern for managing distributed transactions across microservices using a sequence of local transactions with compensating actions on failure.
Resources: Baeldung — Saga Pattern · System Design Primer
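An orchestrated saga can be sketched as a list of steps, each paired with its compensating action. The `run_saga` helper is hypothetical and ignores concerns such as persistence of saga state and failures inside compensations.

```python
def run_saga(steps) -> bool:
    """Run (action, compensation) pairs in order; undo completed steps on failure.

    Returns True if every action succeeded, False if the saga was rolled back.
    """
    completed = []
    for action, compensation in steps:
        try:
            action()
        except Exception:
            # Compensate in reverse order, only for steps that actually completed.
            for undo in reversed(completed):
                undo()
            return False
        completed.append(compensation)
    return True
```

For example, if reserving stock and charging the card succeed but shipping fails, the saga refunds the card and then releases the stock.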
Instead of storing current state, persist every state-changing event to an immutable, append-only log. Current state is derived by replaying the events in order.
Resources: Martin Kleppmann · System Design Primer
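Deriving state by replay is a left fold over the event log. The account example and event names below are made up for illustration.

```python
from functools import reduce

def apply(balance: int, event: tuple) -> int:
    """Apply one (kind, amount) event to the current balance."""
    kind, amount = event
    if kind == "deposited":
        return balance + amount
    if kind == "withdrawn":
        return balance - amount
    raise ValueError(f"unknown event kind: {kind}")

def replay(events, initial: int = 0) -> int:
    """Rebuild current state from the full event history."""
    return reduce(apply, events, initial)
```

Replaying a prefix of the log gives the state at any earlier point in time, which is one of the main attractions of the approach.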
Command Query Responsibility Segregation separates the write model (commands) from the read model (queries), allowing independent optimisation of each.
Resources: Martin Kleppmann · System Design Primer
Monitors calls to a downstream service and "trips open" after a failure threshold, failing fast instead of waiting for inevitable timeouts.
Resources: Baeldung — Resilience4j · System Design Primer
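The trip-open and fail-fast behaviour can be sketched as a small state machine. This is a simplified illustration with hypothetical names and a consecutive-failure threshold; libraries such as Resilience4j use sliding windows and richer half-open handling.

```python
import time

class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the dependency."""

class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock        # injectable for tests
        self.failures = 0
        self.opened_at = None     # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("failing fast")
            # Timeout elapsed: half-open, let this one trial call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        self.failures = 0
        self.opened_at = None  # success closes the circuit
        return result
```

While the circuit is open, callers get an immediate `CircuitOpenError` instead of tying up a thread until a timeout fires.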
Automatically retry failed calls with increasing delays and random jitter to avoid thundering herd problems.
Resources: Baeldung — Backoff & Jitter · AWS — Retry Best Practices
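A sketch of retry with exponential backoff and "full jitter" (each delay is drawn uniformly between zero and the current exponential ceiling). The `retry` helper and its parameters are illustrative; real code would usually retry only on errors known to be transient.

```python
import random
import time

def retry(fn, attempts=4, base=0.1, cap=5.0, sleep=time.sleep, rng=random):
    """Call fn until it succeeds or `attempts` calls have failed."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            ceiling = min(cap, base * 2 ** attempt)  # 0.1, 0.2, 0.4, ... capped
            sleep(rng.uniform(0, ceiling))           # jitter spreads clients apart
```

Without the random component, clients that failed together would retry together and hit the recovering service in synchronized waves.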
Limit concurrency to a dependency so that a slow or failing service cannot exhaust the caller's entire thread pool.
Resources: Baeldung — Resilience4j · System Design Primer
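A semaphore-based bulkhead is one simple way to cap concurrency toward a dependency. The sketch below rejects immediately when no slot is free; the class and exception names are illustrative.

```python
import threading

class BulkheadFullError(Exception):
    """Raised when all slots for this dependency are in use."""

class Bulkhead:
    def __init__(self, max_concurrent: int) -> None:
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn, *args, **kwargs):
        # Non-blocking acquire: reject at once rather than queueing the caller.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("no free slot")
        try:
            return fn(*args, **kwargs)
        finally:
            self._slots.release()
```

Giving each downstream dependency its own bulkhead means a slow service can occupy only its own slots, leaving the rest of the caller's capacity untouched.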
Techniques to cap the number of requests a system accepts in a given window, protecting against abuse and overload.
Resources: System Design Primer — Rate Limiting · Baeldung — Rate Limiting
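One common rate-limiting technique is the token bucket, sketched below. It is only one of several options (fixed window, sliding window, and leaky bucket are others); the class name is illustrative and time is passed in explicitly to keep the example deterministic.

```python
class TokenBucket:
    """Allows bursts up to `capacity` and a sustained `rate` of requests per second."""

    def __init__(self, rate: float, capacity: float, now: float = 0.0) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start full
        self.updated = now

    def allow(self, now: float) -> bool:
        # Refill according to elapsed time, never beyond capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

The capacity sets how large a burst is tolerated, while the rate sets the long-run average that a client can sustain.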
Setting time bounds on every outbound call and propagating remaining budget through the call chain to prevent indefinite waits.
Resources: Baeldung — Timeouts · System Design Primer
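Budget propagation can be sketched with a small deadline object that is created once at the edge and passed down the call chain. The `Deadline` class is hypothetical; across process boundaries the remaining budget would typically travel in a request header or RPC metadata.

```python
import time

class Deadline:
    """Tracks the time left for a whole request, shared by every outbound call."""

    def __init__(self, budget_seconds: float, clock=time.monotonic) -> None:
        self._clock = clock
        self._expires_at = clock() + budget_seconds

    def remaining(self) -> float:
        return max(0.0, self._expires_at - self._clock())

    def timeout_for_call(self, per_call_cap: float) -> float:
        """Timeout for the next call: its own cap or the remaining budget, whichever is smaller."""
        remaining = self.remaining()
        if remaining <= 0:
            raise TimeoutError("deadline exceeded")
        return min(per_call_cap, remaining)
```

A downstream call made late in the request gets a shorter timeout than one made early, so the overall request cannot outlive its original budget.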
Predefined alternative responses when a primary call fails, ensuring graceful degradation instead of hard errors.
Resources: Baeldung — Resilience4j · System Design Primer
Declarative resilience in Spring Boot using annotations like @CircuitBreaker, @Retry, @Bulkhead, and @RateLimiter.
Resources: Baeldung — Resilience4j · Baeldung — Spring + Resilience4j
A Spring module providing the @Retryable annotation for declarative retry logic with configurable backoff, plus @Recover methods that run once retries are exhausted.
Resources: Baeldung — Spring Retry · System Design Primer
Using Spring Cloud OpenFeign's fallback mechanism to provide default responses when a remote service is unavailable.
Resources: Baeldung — OpenFeign · Baeldung — Resilience4j
Mapping each physical node to multiple positions on the hash ring to improve load distribution and reduce hotspots.
Resources: System Design Primer — Consistent Hashing · Martin Kleppmann
A ring-based hash space where keys and nodes are mapped onto the same circle; a key is assigned to the next node clockwise on the ring.
Resources: System Design Primer — Consistent Hashing · Martin Kleppmann
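A minimal hash ring, including multiple positions per physical node, might look like the sketch below. The class name, the `node#i` labelling of virtual nodes, and the use of SHA-256 are illustrative choices, not a reference implementation.

```python
import bisect
import hashlib

def _hash(value: str) -> int:
    """Map a string to a 64-bit position on the ring."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")

class HashRing:
    def __init__(self, nodes, vnodes: int = 100) -> None:
        # Each physical node is placed at `vnodes` positions on the ring.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, key: str) -> str:
        # First ring position at or after the key's hash, wrapping past the end.
        idx = bisect.bisect_left(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]
```

When a node is added, the only keys that move are those now owned by the new node; every other key keeps its previous owner.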
Also called "highest random weight" hashing: for each key, compute a score for every node and pick the node with the highest score.
Resources: System Design Primer · Martin Kleppmann
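Rendezvous hashing needs no ring structure at all, as this sketch shows. The scoring function (SHA-256 over `node:key`) is an arbitrary illustrative choice.

```python
import hashlib

def score(node: str, key: str) -> int:
    """Pseudo-random weight for a (node, key) pair."""
    digest = hashlib.sha256(f"{node}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def pick_node(nodes, key: str) -> str:
    """Choose the node with the highest score for this key."""
    return max(nodes, key=lambda node: score(node, key))
```

Removing a node only reassigns the keys that node was winning; every other key still has the same highest-scoring node. The cost is one hash per node per lookup.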
A fast, minimal-memory consistent hash function that maps keys to buckets with near-perfect balance and minimal reassignment.
Resources: System Design Primer · Martin Kleppmann
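The description above matches the jump consistent hash of Lamping and Veach; assuming that is the function meant, a Python port of the published algorithm is sketched below. Keys are 64-bit integers and buckets are numbered 0 to `num_buckets - 1`.

```python
def jump_hash(key: int, num_buckets: int) -> int:
    """Map a 64-bit integer key to a bucket in range(num_buckets)."""
    if num_buckets <= 0:
        raise ValueError("num_buckets must be positive")
    mask = 0xFFFFFFFFFFFFFFFF  # emulate unsigned 64-bit overflow
    key &= mask
    b, j = -1, 0
    while j < num_buckets:
        b = j
        key = (key * 2862933555777941757 + 1) & mask  # linear congruential step
        j = int((b + 1) * ((1 << 31) / ((key >> 33) + 1)))
    return b
```

Growing from n to n + 1 buckets moves a key only if it lands in the new bucket n. The trade-off is that buckets can only be added or removed at the end of the range, so arbitrary node removal needs an extra mapping layer.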
The three fundamental load-balancing algorithms used to distribute traffic across backend servers.
Resources: System Design Primer — Load Balancer · AWS ELB Docs
Layer-4 load balancers route based on TCP/UDP port and IP, while Layer-7 balancers inspect HTTP headers, paths, and cookies.
Resources: System Design Primer — Load Balancer · AWS ELB Docs
Binding a client to a specific backend server for the duration of a session, typically using cookies or source IP hashing.
Resources: AWS ALB — Sticky Sessions · System Design Primer
Mechanisms for a load balancer to verify backend server availability, removing unhealthy instances from the rotation.
Resources: AWS ALB — Health Checks · Baeldung — Spring Actuator