Three Pillars of Observability Light

Interview tip: Interviewers expect SDE 2s to articulate how they would monitor a service from day one. Frame answers around metrics, logs, and traces as complementary signals — not substitutes. Be ready to sketch a Grafana dashboard for any system you design.

Metrics

RED Method (Rate, Errors, Duration) #278

A request-scoped monitoring methodology that measures three request-level signals (rate, errors, and duration) for every microservice endpoint.
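The idea can be sketched as a tiny in-memory recorder for one endpoint. Everything here (class and field names) is illustrative; a real service would use a metrics library such as Micrometer or the Prometheus Java client.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal RED recorder for a single endpoint (illustrative sketch).
class EndpointStats {
    private long requests;                                     // Rate: requests in the window
    private long errors;                                       // Errors: failed requests
    private final List<Long> durationsMs = new ArrayList<>();  // Duration: latency samples

    void record(long durationMs, boolean failed) {
        requests++;
        if (failed) errors++;
        durationsMs.add(durationMs);
    }

    double errorRate() { return requests == 0 ? 0.0 : (double) errors / requests; }

    long p99Ms() {
        if (durationsMs.isEmpty()) return 0;
        List<Long> sorted = new ArrayList<>(durationsMs);
        sorted.sort(null);
        int idx = (int) Math.ceil(sorted.size() * 0.99) - 1;
        return sorted.get(Math.max(idx, 0));
    }
}
```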

USE Method (Utilization, Saturation, Errors) #279

A resource-scoped methodology for analyzing infrastructure bottlenecks in CPUs, memory, disks, and network interfaces.
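As a sketch, a USE check on one resource reduces to three numbers per resource. The field names and the thresholds in the heuristic below are illustrative, not prescriptive.

```java
// USE snapshot for a single resource (e.g. one CPU or one NIC).
class ResourceUse {
    final double utilization; // fraction of the window the resource was busy, 0.0-1.0
    final long saturation;    // extra work queued (e.g. run-queue length)
    final long errors;        // error-event count (e.g. NIC drops)

    ResourceUse(long busyMs, long windowMs, long queued, long errors) {
        this.utilization = (double) busyMs / windowMs;
        this.saturation = queued;
        this.errors = errors;
    }

    boolean bottlenecked() {
        // Illustrative heuristic: any saturation or errors, or very high utilization.
        return utilization > 0.9 || saturation > 0 || errors > 0;
    }
}
```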

Prometheus Data Model #280

Prometheus stores all data as time series — streams of timestamped values identified by a metric name plus key-value label pairs.
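The model is easiest to see in the text exposition format itself: a series is identified by metric name plus sorted label pairs. The renderer below is a hand-rolled sketch (no escaping of label values), not the real client library.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Renders one sample in Prometheus text exposition format:
//   metric_name{label="value",...} value
class Sample {
    static String render(String name, Map<String, String> labels, double value) {
        String labelStr = new TreeMap<>(labels).entrySet().stream()
                .map(e -> e.getKey() + "=\"" + e.getValue() + "\"")
                .collect(Collectors.joining(","));
        return name + (labelStr.isEmpty() ? "" : "{" + labelStr + "}") + " " + value;
    }
}
```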

Gauge vs Counter vs Histogram #281

Counter, gauge, and histogram (three of Prometheus's four core metric types; summary is the fourth) serve fundamentally different use cases — choosing the wrong type leads to incorrect queries.
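Simplified stand-ins for the three types make the differences concrete. These are illustrative, not the real Prometheus client API.

```java
import java.util.concurrent.atomic.AtomicLong;

// Counter: monotonically increasing; queried with rate()/increase() in PromQL.
class Counter {
    private final AtomicLong v = new AtomicLong();
    void inc() { v.incrementAndGet(); }
    long get() { return v.get(); }
}

// Gauge: can go up and down (queue depth, temperature); queried as-is.
class Gauge {
    private final AtomicLong v = new AtomicLong();
    void set(long x) { v.set(x); }
    long get() { return v.get(); }
}

// Histogram: counts observations into cumulative buckets for quantile estimation.
class Histogram {
    private final double[] bounds; // upper bounds, e.g. {0.1, 0.5}
    private final long[] buckets;  // cumulative counts per bound, plus the +Inf bucket

    Histogram(double[] bounds) {
        this.bounds = bounds;
        this.buckets = new long[bounds.length + 1];
    }
    void observe(double x) {
        for (int i = 0; i < bounds.length; i++) if (x <= bounds[i]) buckets[i]++;
        buckets[bounds.length]++; // +Inf bucket counts every observation
    }
    long bucket(int i) { return buckets[i]; }
}
```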

Alerting Thresholds #282

Designing alerts that fire on symptoms (high latency, error rate) rather than causes (CPU spike) reduces noise and improves incident response.
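A symptom-based check can be as small as the sketch below: fire on the user-visible error ratio rather than on a cause metric like CPU. The threshold value is illustrative.

```java
// Symptom-based alert: fire when the error ratio over a window exceeds a threshold.
class ErrorRateAlert {
    private final double threshold; // e.g. 0.01 = 1% of requests failing

    ErrorRateAlert(double threshold) { this.threshold = threshold; }

    boolean shouldFire(long errors, long total) {
        if (total == 0) return false; // no traffic means no user-visible symptom
        return (double) errors / total > threshold;
    }
}
```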

Logs

Structured Logging (JSON) #283

Emitting logs as structured JSON objects instead of plain text enables machine parsing, indexing, and querying at scale.
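A single-line JSON event looks like the sketch below. The hand-rolled encoder (minimal escaping, illustrative field names) is for demonstration only; real services would use a logging library's JSON encoder.

```java
import java.time.Instant;
import java.util.LinkedHashMap;
import java.util.Map;

// Emits one structured log event as a single-line JSON object.
class JsonLog {
    static String line(String level, String msg, Map<String, String> fields) {
        Map<String, String> all = new LinkedHashMap<>();
        all.put("ts", Instant.now().toString());
        all.put("level", level);
        all.put("msg", msg);
        all.putAll(fields);
        StringBuilder sb = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, String> e : all.entrySet()) {
            if (!first) sb.append(",");
            sb.append("\"").append(e.getKey()).append("\":\"")
              .append(e.getValue().replace("\"", "\\\"")).append("\"");
            first = false;
        }
        return sb.append("}").toString();
    }
}
```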

Correlation IDs / Trace IDs #284

A unique identifier propagated across service boundaries so that every log, metric, and trace for a single request can be correlated.
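The mechanics can be sketched with a ThreadLocal, MDC-style. The header name is a common convention rather than a standard, and in practice frameworks do this in a servlet filter and an HTTP client interceptor.

```java
import java.util.Map;
import java.util.UUID;

// MDC-style correlation-ID propagation (illustrative sketch).
class Correlation {
    static final String HEADER = "X-Correlation-Id"; // convention, not a standard
    private static final ThreadLocal<String> CURRENT = new ThreadLocal<>();

    // Incoming request: reuse the caller's ID or mint a new one.
    static String accept(Map<String, String> headers) {
        String id = headers.getOrDefault(HEADER, UUID.randomUUID().toString());
        CURRENT.set(id);
        return id;
    }

    // Outgoing call: forward the same ID downstream.
    static void inject(Map<String, String> headers) { headers.put(HEADER, CURRENT.get()); }

    static String current() { return CURRENT.get(); }
}
```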

Log Levels Strategy #285

A disciplined log-level policy ensures that production logs are actionable without drowning operators in noise.
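One concrete piece of such a policy is a minimum-level gate. The enum and class below are illustrative; frameworks like SLF4J/Logback implement this for you.

```java
// Severity levels in ascending order; production typically runs at INFO or WARN.
enum Level { DEBUG, INFO, WARN, ERROR }

// Emits only events at or above the configured minimum level.
class LevelGate {
    private final Level min;
    LevelGate(Level min) { this.min = min; }
    boolean allows(Level l) { return l.ordinal() >= min.ordinal(); }
}
```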

ELK / Loki Stack Overview #286

Centralized log aggregation stacks that collect, index, and visualize logs from distributed services for search and alerting.

Traces

Distributed Tracing Concepts #287

Distributed tracing captures the end-to-end journey of a request across multiple services, representing it as a tree of spans.
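The tree structure comes from two IDs: every span shares the request's trace ID, and each span records its parent's span ID. The class below is a sketch using common tracing vocabulary, not a real tracing API.

```java
import java.util.UUID;

// A span records one unit of work; parentSpanId links spans into the request tree.
class Span {
    final String traceId;       // shared by every span in the request
    final String spanId;        // unique to this unit of work
    final String parentSpanId;  // null for the root span
    final String name;
    long startNanos, endNanos;

    Span(String traceId, String parentSpanId, String name) {
        this.traceId = traceId;
        this.spanId = UUID.randomUUID().toString();
        this.parentSpanId = parentSpanId;
        this.name = name;
    }

    // A child span inherits the trace ID and points back at this span.
    Span child(String name) { return new Span(traceId, spanId, name); }
}
```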

OpenTelemetry Instrumentation #288

OpenTelemetry (OTel) is the CNCF standard for generating, collecting, and exporting telemetry data — the vendor-neutral future of observability.
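Context propagation in OTel defaults to the W3C Trace Context `traceparent` header, which can be sketched directly: version "00", a 32-hex-char trace ID, a 16-hex-char span ID, and trace flags ("01" = sampled). Validation is omitted here; the real SDK handles it.

```java
// Sketch of the W3C "traceparent" header that OpenTelemetry propagates by default.
class TraceParent {
    final String traceId;  // 32 hex chars (16 bytes)
    final String spanId;   // 16 hex chars (8 bytes)
    final boolean sampled;

    TraceParent(String traceId, String spanId, boolean sampled) {
        this.traceId = traceId;
        this.spanId = spanId;
        this.sampled = sampled;
    }

    String encode() {
        return "00-" + traceId + "-" + spanId + "-" + (sampled ? "01" : "00");
    }

    static TraceParent parse(String header) {
        String[] p = header.split("-");
        return new TraceParent(p[1], p[2], "01".equals(p[3]));
    }
}
```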

Jaeger / Zipkin #289

Popular open-source distributed tracing backends that store, query, and visualize trace data from instrumented services.

Trace Sampling Strategies #290

At high traffic volumes, tracing every request is impractical — sampling strategies balance observability coverage with storage and performance costs.
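The simplest strategy, head-based probabilistic sampling, can be sketched as below: the keep/drop decision is derived from the trace ID itself, so every service in the request path agrees on it. The hashing scheme is illustrative.

```java
// Head-based probabilistic sampler: same trace ID always gets the same decision.
class RatioSampler {
    private final double ratio; // e.g. 0.01 keeps roughly 1% of traces

    RatioSampler(double ratio) { this.ratio = ratio; }

    boolean sample(String traceId) {
        // Hash the trace ID into [0, 1); deterministic per trace.
        long h = traceId.hashCode() & 0x7fffffffL;
        return (h / (double) Integer.MAX_VALUE) < ratio;
    }
}
```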

SLI / SLO / SLA Light

Interview tip: SRE concepts are increasingly common in SDE 2 rounds. Be ready to define SLIs for any service you discuss, explain how error budgets drive release velocity, and describe what happens when an SLO is breached. Refer to the Google SRE books — interviewers notice.

SLI / SLO / SLA

SLI Examples for APIs #291

Service Level Indicators are carefully chosen metrics that quantify how well your service is performing from the user's perspective.
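Most API SLIs are ratios of good events to total events, expressed as a percentage. Two common examples, sketched with illustrative method names:

```java
// Ratio-style SLIs: good events / total events, as a percentage.
class Sli {
    // Availability SLI: non-5xx responses over all responses.
    static double availability(long total, long serverErrors) {
        return total == 0 ? 100.0 : 100.0 * (total - serverErrors) / total;
    }

    // Latency SLI: requests completing under the threshold over all requests.
    static double latency(long total, long underThreshold) {
        return total == 0 ? 100.0 : 100.0 * underThreshold / total;
    }
}
```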

Error Budget Concept #292

An error budget is the maximum allowable unreliability derived from your SLO — it quantifies how much risk you can take with deployments.
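The arithmetic is simple: a 99.9% SLO leaves 0.1% of requests allowed to fail. A team that has consumed its budget freezes risky releases. Method names below are illustrative.

```java
// Error budget math derived from an availability SLO.
class ErrorBudget {
    // Allowed failures in the window, e.g. 99.9% SLO over 1M requests -> 1000.
    static long budget(double sloPercent, long totalRequests) {
        return Math.round(totalRequests * (100.0 - sloPercent) / 100.0);
    }

    // Fraction of the budget already spent; >= 1.0 means the SLO is at risk.
    static double consumedFraction(double sloPercent, long totalRequests, long failed) {
        long b = budget(sloPercent, totalRequests);
        return b == 0 ? 1.0 : (double) failed / b;
    }
}
```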

SLO Violation Response #293

A well-defined escalation playbook for when your service breaches its SLO turns reliability conversations from reactive firefighting into structured process.

Toil vs Engineering Work #294

Toil is manual, repetitive, automatable, and tactical work that scales linearly with service growth — reducing it is a core SRE principle.

Performance Profiling Full-Focus

Interview tip: Performance profiling questions test depth. Be ready to walk through a real scenario: "Your API latency spiked to 2s — how do you investigate?" Start with metrics (RED), narrow with traces, then profile the hot path. Know your tools (Async Profiler, flame graphs) and demonstrate you have actually used them.

Performance

Async Profiler #295

A low-overhead sampling profiler for JVM applications that captures CPU, allocation, and lock profiles without the safepoint bias of traditional profilers.

Flame Graphs Reading #296

Flame graphs are a visualization of profiled stack traces where the x-axis represents the proportion of samples and the y-axis represents stack depth.

Thread Contention Analysis #297

Identifying lock contention and thread blocking is critical for diagnosing latency spikes in concurrent Java applications.
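The JVM exposes per-thread contention counters via JMX, the same data that jstack and profilers surface. A rising blocked count under load points at a hot monitor. Minimal probe, assuming the standard `java.lang.management` API:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// Reads how many times the current thread has blocked waiting for a monitor.
class ContentionProbe {
    static long blockedCountOfCurrentThread() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        return mx.getThreadInfo(Thread.currentThread().getId()).getBlockedCount();
    }
}
```

Sampling this counter before and after a load test, per thread, is a cheap first pass before reaching for a full profiler.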

DB Connection Pool Sizing #298

Correctly sizing your database connection pool is one of the highest-impact performance optimizations — too few causes queuing, too many causes DB contention.
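The HikariCP "About Pool Sizing" wiki page cites a PostgreSQL-derived starting point: connections = cores x 2 + effective spindle count. Treat the sketch below as a baseline to load-test from, not a final answer.

```java
// Starting-point pool size from the HikariCP wiki's cited formula.
class PoolSizing {
    static int suggestedSize(int coreCount, int effectiveSpindleCount) {
        return coreCount * 2 + effectiveSpindleCount;
    }
}
```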

HTTP Connection Pool Tuning #299

HTTP client connection pools reuse TCP connections to avoid the overhead of repeated handshakes — misconfigured pools cause latency spikes or connection leaks.
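A first-order sizing estimate comes from Little's law: concurrent connections needed is roughly throughput times latency. The numbers and method name below are illustrative; real pools also need headroom for latency spikes.

```java
// Little's law sizing sketch: connections ~= requests/sec * avg latency (sec).
class HttpPoolSizing {
    static int neededConnections(double requestsPerSecond, double avgLatencySeconds) {
        return (int) Math.ceil(requestsPerSecond * avgLatencySeconds);
    }
}
```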

Recommended Resources

Google SRE Books (Free Online)

The definitive guides to SLIs, SLOs, error budgets, toil elimination, and incident management.

Prometheus Documentation

Official docs covering data model, PromQL, alerting rules, and best practices for metrics.

OpenTelemetry Documentation

The CNCF standard for traces, metrics, and logs. Focus on Java instrumentation and the Collector.

Grafana Stack Documentation

Grafana (dashboards), Loki (logs), Tempo (traces), Mimir (metrics) — a unified observability stack.

Baeldung — Grafana + Prometheus

Practical Java-focused tutorials on integrating Spring Boot with Prometheus and Grafana.

Brendan Gregg — Performance

The authority on systems performance: USE method, flame graphs, profiling methodologies.

Baeldung — Java Profilers

Overview of JVM profiling tools: Async Profiler, JFR, VisualVM, and when to use each.

HikariCP Wiki

Essential reading for connection pool sizing, configuration, and troubleshooting in Java services.