How you keep production healthy — increasingly tested at SDE 2 level
Three Pillars of Observability
Interview tip: Interviewers expect SDE 2s to articulate how they would monitor a service from day one. Frame answers around metrics, logs, and traces as complementary signals — not substitutes. Be ready to sketch a Grafana dashboard for any system you design.
Metrics
RED Method (Rate, Errors, Duration) #278
A request-scoped monitoring methodology that tracks three key signals for every microservice endpoint (a request-focused subset of Google's four golden signals).
Rate — requests per second your service is handling
Errors — the fraction of those requests that are failing (5xx, timeouts)
Duration — distribution of response latencies (use histograms, not averages)
Ideal for request-driven services; pairs with USE for infrastructure
Map each metric to a Prometheus counter or histogram in interviews
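The bookkeeping behind RED can be sketched in a few lines. This is a minimal in-process illustration, not the Prometheus client API — the class and method names are made up for the example, and a real setup would export a counter and a bucketed histogram instead of keeping raw samples:

```python
from collections import defaultdict

class RedMetrics:
    """Minimal in-process sketch of RED bookkeeping (illustrative, not Prometheus)."""

    def __init__(self):
        self.requests = defaultdict(int)    # total requests per endpoint (-> rate)
        self.errors = defaultdict(int)      # failed requests per endpoint
        self.durations = defaultdict(list)  # observed latencies in seconds

    def observe(self, endpoint, duration_s, status):
        self.requests[endpoint] += 1
        if status >= 500:                   # count 5xx as errors
            self.errors[endpoint] += 1
        self.durations[endpoint].append(duration_s)

    def error_ratio(self, endpoint):
        total = self.requests[endpoint]
        return self.errors[endpoint] / total if total else 0.0

    def p99(self, endpoint):
        # Naive percentile over raw samples; production systems use
        # histogram buckets so percentiles stay cheap and aggregatable.
        samples = sorted(self.durations[endpoint])
        idx = min(len(samples) - 1, int(0.99 * len(samples)))
        return samples[idx]

m = RedMetrics()
for _ in range(98):
    m.observe("/pay", 0.05, 200)
m.observe("/pay", 1.2, 500)
m.observe("/pay", 0.9, 503)
print(m.error_ratio("/pay"))  # 0.02
print(m.p99("/pay"))          # 1.2
```

Note the histogram point from above: storing raw durations, as this sketch does, is fine for illustration but does not scale — Prometheus histograms pre-bucket observations precisely so percentiles can be computed without retaining every sample.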
Interview tip: SRE concepts are increasingly common in SDE 2 rounds. Be ready to define SLIs for any service you discuss, explain how error budgets drive release velocity, and describe what happens when an SLO is breached. Refer to the Google SRE books — interviewers notice.
SLI / SLO / SLA
SLI Examples for APIs #291
Service Level Indicators are carefully chosen metrics that quantify how well your service is performing from the user's perspective.
Availability SLI: successful requests (status != 5xx) divided by total requests
Latency SLI: proportion of requests completed within a threshold (e.g., p99 < 300ms)
Throughput SLI: requests served per second within acceptable quality bounds
Correctness SLI: proportion of responses returning the right data (harder to measure, often via probes)
Interview pattern: "For a payment service, what SLIs would you define?" — availability + latency + correctness
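The availability and latency SLIs above reduce to simple ratio computations over a window of request records. A small sketch, assuming a hypothetical list-of-dicts request log with `status` and `latency_s` fields:

```python
def availability_sli(requests):
    """Fraction of requests in the window that did not return a 5xx."""
    good = sum(1 for r in requests if r["status"] < 500)
    return good / len(requests)

def latency_sli(requests, threshold_s=0.3):
    """Fraction of requests completed within the latency threshold (e.g. 300ms)."""
    fast = sum(1 for r in requests if r["latency_s"] <= threshold_s)
    return fast / len(requests)

window = [
    {"status": 200, "latency_s": 0.12},
    {"status": 200, "latency_s": 0.45},  # slow but successful
    {"status": 503, "latency_s": 0.08},  # fast but failed
    {"status": 200, "latency_s": 0.29},
]
print(availability_sli(window))  # 0.75
print(latency_sli(window))       # 0.75
```

The two slow/failed rows illustrate why availability and latency are separate SLIs: a request can pass one and fail the other, so neither ratio subsumes the other.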
A well-defined escalation playbook for when your service breaches its SLO turns reliability conversations from reactive firefighting into a structured process.
Immediate: auto-alert on-call, open an incident channel, assess blast radius
Short-term: roll back recent changes, enable circuit breakers, scale up capacity
Post-incident: blameless postmortem documenting timeline, root cause, and action items
Policy response: if error budget is burned, freeze non-critical deployments until budget recovers
Track "time to detect" and "time to mitigate" as meta-SLIs for your incident process
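The error-budget policy in the steps above rests on simple arithmetic: a 99.9% availability SLO over a 30-day window allows 0.1% of that window as downtime. A worked sketch (function names are illustrative):

```python
def error_budget_minutes(slo, window_days=30):
    """Total allowed downtime in minutes for an availability SLO over the window."""
    return (1 - slo) * window_days * 24 * 60

def budget_remaining(slo, bad_minutes, window_days=30):
    """Fraction of the error budget left after bad_minutes of downtime."""
    budget = error_budget_minutes(slo, window_days)
    return max(0.0, 1 - bad_minutes / budget)

# 99.9% over 30 days -> 0.001 * 43200 minutes = 43.2 minutes of budget
print(error_budget_minutes(0.999))       # 43.2
print(budget_remaining(0.999, 21.6))     # 0.5 -- half the budget burned
print(budget_remaining(0.999, 50.0))     # 0.0 -- budget exhausted: freeze deploys
```

When `budget_remaining` hits zero, the policy response above kicks in: non-critical deployments freeze until the rolling window recovers.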
Interview tip: Performance profiling questions test depth. Be ready to walk through a real scenario: "Your API latency spiked to 2s — how do you investigate?" Start with metrics (RED), narrow with traces, then profile the hot path. Know your tools (Async Profiler, flame graphs) and demonstrate you have actually used them.
Performance
Async Profiler #295
A low-overhead sampling profiler for JVM applications that captures CPU, allocation, and lock profiles without the safepoint bias of traditional profilers.
Uses AsyncGetCallTrace API to sample at any point, not just JVM safepoints — gives accurate CPU profiles
Supports CPU, wall-clock, allocation, and lock contention profiling modes
Can attach to a running JVM without restart: asprof -d 30 -f profile.html <pid>
Generates flame graphs directly in HTML — no separate tools needed
Low overhead (typically < 5%) makes it safe for production use with sampling enabled
Correctly sizing your database connection pool is one of the highest-impact performance optimizations — too few causes queuing, too many causes DB contention.
HTTP client connection pools reuse TCP connections to avoid the overhead of repeated handshakes — misconfigured pools cause latency spikes or connection leaks.
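One way to reason about "correctly sized" is Little's law: the number of in-flight queries is roughly arrival rate times mean service time, so the pool needs about that many connections plus burst headroom. A hedged sketch (the function and its `headroom` parameter are illustrative, not any pool library's API; HikariCP's own sizing guidance is a reasonable cross-check):

```python
import math

def pool_size_littles_law(qps, avg_query_s, headroom=1.2):
    """Estimate connections needed via Little's law:
    concurrency ~= arrival rate x mean service time, padded for bursts."""
    return math.ceil(qps * avg_query_s * headroom)

# 500 qps of 20ms queries -> ~10 concurrent queries, ~12 connections with headroom
print(pool_size_littles_law(500, 0.02))  # 12
```

The same estimate applies to HTTP client pools: size for expected concurrency, not for peak request rate, and watch queue wait time to detect an undersized pool before it shows up as tail latency.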