How you keep production healthy — increasingly tested at SDE 2 level
Three Pillars of Observability
Interview tip: Interviewers expect SDE 2s to articulate how they would monitor a service from day one. Frame answers around metrics, logs, and traces as complementary signals — not substitutes. Be ready to sketch a Grafana dashboard for any system you design.
Metrics
RED Method (Rate, Errors, Duration) #278
A request-scoped monitoring methodology that tracks three key signals for every microservice endpoint (a request-focused subset of Google's four golden signals).
Rate — requests per second your service is handling
Errors — the fraction of those requests that are failing (5xx, timeouts)
Duration — distribution of response latencies (use histograms, not averages)
Ideal for request-driven services; pairs with USE for infrastructure
Map each metric to a Prometheus counter or histogram in interviews
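The bookkeeping behind RED can be sketched in a few lines. This is a minimal in-process illustration, not the Prometheus client API — the class and method names are made up for the example, and a real setup would export a counter and a bucketed histogram instead of keeping raw samples:

```python
from collections import defaultdict

class RedMetrics:
    """Minimal in-process sketch of RED bookkeeping (illustrative, not Prometheus)."""

    def __init__(self):
        self.requests = defaultdict(int)    # total requests per endpoint (-> rate)
        self.errors = defaultdict(int)      # failed requests per endpoint
        self.durations = defaultdict(list)  # observed latencies in seconds

    def observe(self, endpoint, duration_s, status):
        self.requests[endpoint] += 1
        if status >= 500:                   # count 5xx as errors
            self.errors[endpoint] += 1
        self.durations[endpoint].append(duration_s)

    def error_ratio(self, endpoint):
        total = self.requests[endpoint]
        return self.errors[endpoint] / total if total else 0.0

    def p99(self, endpoint):
        # Naive percentile over raw samples; production systems use
        # histogram buckets so percentiles stay cheap and aggregatable.
        samples = sorted(self.durations[endpoint])
        idx = min(len(samples) - 1, int(0.99 * len(samples)))
        return samples[idx]

m = RedMetrics()
for _ in range(98):
    m.observe("/pay", 0.05, 200)
m.observe("/pay", 1.2, 500)
m.observe("/pay", 0.9, 503)
print(m.error_ratio("/pay"))  # 0.02
print(m.p99("/pay"))          # 1.2
```

Note the histogram point from above: storing raw durations, as this sketch does, is fine for illustration but does not scale — Prometheus histograms pre-bucket observations precisely so percentiles can be computed without retaining every sample.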
Interview tip: SRE concepts are increasingly common in SDE 2 rounds. Be ready to define SLIs for any service you discuss, explain how error budgets drive release velocity, and describe what happens when an SLO is breached. Refer to the Google SRE books — interviewers notice.
SLI / SLO / SLA
SLI Examples for APIs #291
Service Level Indicators are carefully chosen metrics that quantify how well your service is performing from the user's perspective.
Availability SLI: successful requests (status != 5xx) divided by total requests
Latency SLI: proportion of requests completed within a threshold (e.g., p99 < 300ms)
Throughput SLI: requests served per second within acceptable quality bounds
Correctness SLI: proportion of responses returning the right data (harder to measure, often via probes)
Interview pattern: "For a payment service, what SLIs would you define?" — availability + latency + correctness
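The availability and latency SLIs above reduce to simple ratio computations over a window of request records. A small sketch, assuming a hypothetical list-of-dicts request log with `status` and `latency_s` fields:

```python
def availability_sli(requests):
    """Fraction of requests in the window that did not return a 5xx."""
    good = sum(1 for r in requests if r["status"] < 500)
    return good / len(requests)

def latency_sli(requests, threshold_s=0.3):
    """Fraction of requests completed within the latency threshold (e.g. 300ms)."""
    fast = sum(1 for r in requests if r["latency_s"] <= threshold_s)
    return fast / len(requests)

window = [
    {"status": 200, "latency_s": 0.12},
    {"status": 200, "latency_s": 0.45},  # slow but successful
    {"status": 503, "latency_s": 0.08},  # fast but failed
    {"status": 200, "latency_s": 0.29},
]
print(availability_sli(window))  # 0.75
print(latency_sli(window))       # 0.75
```

The two slow/failed rows illustrate why availability and latency are separate SLIs: a request can pass one and fail the other, so neither ratio subsumes the other.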
A well-defined escalation playbook for when your service breaches its SLO turns reliability conversations from reactive firefighting into a structured process.
Immediate: auto-alert on-call, open an incident channel, assess blast radius
Short-term: roll back recent changes, enable circuit breakers, scale up capacity
Post-incident: blameless postmortem documenting timeline, root cause, and action items
Policy response: if error budget is burned, freeze non-critical deployments until budget recovers
Track "time to detect" and "time to mitigate" as meta-SLIs for your incident process
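The error-budget policy in the steps above rests on simple arithmetic: a 99.9% availability SLO over a 30-day window allows 0.1% of that window as downtime. A worked sketch (function names are illustrative):

```python
def error_budget_minutes(slo, window_days=30):
    """Total allowed downtime in minutes for an availability SLO over the window."""
    return (1 - slo) * window_days * 24 * 60

def budget_remaining(slo, bad_minutes, window_days=30):
    """Fraction of the error budget left after bad_minutes of downtime."""
    budget = error_budget_minutes(slo, window_days)
    return max(0.0, 1 - bad_minutes / budget)

# 99.9% over 30 days -> 0.001 * 43200 minutes = 43.2 minutes of budget
print(error_budget_minutes(0.999))       # 43.2
print(budget_remaining(0.999, 21.6))     # 0.5 -- half the budget burned
print(budget_remaining(0.999, 50.0))     # 0.0 -- budget exhausted: freeze deploys
```

When `budget_remaining` hits zero, the policy response above kicks in: non-critical deployments freeze until the rolling window recovers.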
Interview tip: Performance profiling questions test depth. Be ready to walk through a real scenario: "Your API latency spiked to 2s — how do you investigate?" Start with metrics (RED), narrow with traces, then profile the hot path. Know your tools (Async Profiler, flame graphs) and demonstrate you have actually used them.
Performance
Async Profiler #295
A low-overhead sampling profiler for JVM applications that captures CPU, allocation, and lock profiles without the safepoint bias of traditional profilers.
Uses AsyncGetCallTrace API to sample at any point, not just JVM safepoints — gives accurate CPU profiles
Supports CPU, wall-clock, allocation, and lock contention profiling modes
Can attach to a running JVM without restart: asprof -d 30 -f profile.html <pid>
Generates flame graphs directly in HTML — no separate tools needed
Low overhead (typically < 5%) makes it safe for production use with sampling enabled
Correctly sizing your database connection pool is one of the highest-impact performance optimizations — too few causes queuing, too many causes DB contention.
HTTP client connection pools reuse TCP connections to avoid the overhead of repeated handshakes — misconfigured pools cause latency spikes or connection leaks.
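One way to reason about "correctly sized" is Little's law: the number of in-flight queries is roughly arrival rate times mean service time, so the pool needs about that many connections plus burst headroom. A hedged sketch (the function and its `headroom` parameter are illustrative, not any pool library's API; HikariCP's own sizing guidance is a reasonable cross-check):

```python
import math

def pool_size_littles_law(qps, avg_query_s, headroom=1.2):
    """Estimate connections needed via Little's law:
    concurrency ~= arrival rate x mean service time, padded for bursts."""
    return math.ceil(qps * avg_query_s * headroom)

# 500 qps of 20ms queries -> ~10 concurrent queries, ~12 connections with headroom
print(pool_size_littles_law(500, 0.02))  # 12
```

The same estimate applies to HTTP client pools: size for expected concurrency, not for peak request rate, and watch queue wait time to detect an undersized pool before it shows up as tail latency.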