Interview Observability

How would you investigate a latency spike using metrics, traces, and logs together? [Advanced]

Answer

To investigate a latency spike, I start with metrics to identify scope and timing, use traces to find the slow path or dependency, and use logs to inspect exact errors or state changes for affected requests.

Technical explanation

Metrics answer when the spike began, where it occurred, how many users and which endpoints were affected, and whether errors or saturation changed at the same time.
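As a rough illustration of the metrics step, the sketch below groups latency samples by endpoint and region and computes p95/p99 with the nearest-rank method. The sample data, the tuple layout, and the function names are assumptions made for this example; in practice a metrics backend would normally do this aggregation for you.

```python
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def latency_by_group(samples, pcts=(95, 99)):
    """Group (endpoint, region, latency_ms) samples and summarize each group."""
    groups = defaultdict(list)
    for endpoint, region, latency_ms in samples:
        groups[(endpoint, region)].append(latency_ms)
    return {
        key: {f"p{p}": percentile(vals, p) for p in pcts}
        for key, vals in groups.items()
    }


# Hypothetical samples: /checkout latencies range from 100 to 199 ms,
# while /search stays flat at 50 ms.
samples = (
    [("/checkout", "us-east", 100 + i) for i in range(100)]
    + [("/search", "us-east", 50)] * 100
)

report = latency_by_group(samples)
for key, stats in sorted(report.items()):
    print(key, stats)
```

Breaking the numbers down per endpoint and region is what narrows the scope: a spike confined to one group points at a specific code path or dependency rather than at the whole platform.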

Traces answer which service or dependency consumed time and whether retries or fan-out amplified latency.
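To make the trace step concrete, here is a minimal sketch that computes self time per span (span duration minus the time spent in direct children) and flags repeated calls under the same parent, which often indicate retries or fan-out. The `Span` structure and the example trace are hypothetical, and the self-time calculation assumes child spans run sequentially rather than in parallel.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]
    name: str
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self):
        return self.end_ms - self.start_ms


def self_times(spans):
    """Return span_id -> time spent in the span itself, excluding direct children."""
    child_total = defaultdict(int)
    for s in spans:
        if s.parent_id is not None:
            child_total[s.parent_id] += s.duration_ms
    # Clamp at zero because parallel children can sum to more than the parent.
    return {s.span_id: max(0, s.duration_ms - child_total[s.span_id]) for s in spans}


def repeated_calls(spans):
    """Return (parent_id, name) -> count for operations called more than once."""
    counts = Counter((s.parent_id, s.name) for s in spans if s.parent_id is not None)
    return {key: n for key, n in counts.items() if n > 1}


# Hypothetical trace: the payment authorization is slow and is called twice.
trace = [
    Span("a", None, "POST /checkout", 0, 2000),
    Span("b", "a", "cart.load", 10, 60),
    Span("c", "a", "payment.authorize", 70, 970),
    Span("d", "a", "payment.authorize", 980, 1880),
]

times = self_times(trace)
slowest = max(trace, key=lambda s: times[s.span_id])
retries = repeated_calls(trace)
print(slowest.name, times[slowest.span_id], retries)
```

Ranking by self time rather than total duration keeps the root span from always looking like the culprit, since its duration includes everything beneath it.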

Logs answer detailed causes such as timeout messages, SQL errors, throttling responses, or bad configuration.
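The log step can be sketched as a filter over structured log lines. The example below assumes logs are JSON objects carrying `trace_id`, `service`, `ts`, and `msg` fields; those field names and the sample lines are assumptions for illustration, and a real log platform would run the equivalent search through its own query language.

```python
import json


def logs_for_trace(lines, trace_id, service=None):
    """Return parsed log events for one trace, optionally limited to one service."""
    matches = []
    for line in lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # Skip lines that are not valid JSON.
        if not isinstance(event, dict):
            continue
        if event.get("trace_id") != trace_id:
            continue
        if service is not None and event.get("service") != service:
            continue
        matches.append(event)
    # Sort by timestamp so the sequence of events reads in order.
    return sorted(matches, key=lambda e: e.get("ts", ""))


# Hypothetical log lines, including one unparseable entry.
raw_lines = [
    '{"ts": "2024-01-01T10:00:02Z", "trace_id": "t1", "service": "payment", "msg": "retrying authorize"}',
    '{"ts": "2024-01-01T10:00:01Z", "trace_id": "t1", "service": "payment", "msg": "upstream timeout"}',
    '{"ts": "2024-01-01T10:00:01Z", "trace_id": "t1", "service": "cart", "msg": "cart loaded"}',
    '{"ts": "2024-01-01T10:00:03Z", "trace_id": "t2", "service": "payment", "msg": "authorized"}',
    "not json at all",
]

events = logs_for_trace(raw_lines, "t1", service="payment")
for event in events:
    print(event["ts"], event["msg"])
```

Joining on the trace ID is what ties the three signals together: the trace names the slow span, and the logs for that same request explain why it was slow.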

Hands-on example

Runbook:

1) Check p95/p99 by endpoint and region.
2) Compare traffic and error rate.
3) Check recent deploy annotations.
4) Open slow traces around the spike.
5) Identify the slow span, such as payment authorize.
6) Search Splunk logs by trace_id and service=payment.
7) Mitigate with rollback, traffic shift, or dependency escalation.
