How would you investigate a latency spike using metrics, traces, and logs together? [Advanced]
Answer
To investigate a latency spike, I start with metrics to identify scope and timing, use traces to find the slow path or dependency, and use logs to inspect exact errors or state changes for affected requests.
Technical explanation
Metrics answer when the spike started, where it occurred, how many users and which endpoints were affected, and whether errors or saturation changed at the same time.
Traces answer which service or dependency consumed time and whether retries or fan-out amplified latency.
Logs answer why, by exposing detailed causes such as timeout messages, SQL errors, throttling responses, or bad configuration.
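To make the trace step concrete, here is a minimal Python sketch of how one might find the span that consumed the most time and spot retries. The `Span` class, the span names, and the timings are hypothetical illustrations, not the data model of any particular tracing backend, and the self-time calculation assumes child spans run one after another.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    # Hypothetical, simplified span record; real tracing systems carry more fields.
    span_id: str
    parent_id: Optional[str]
    name: str
    start_ms: float
    end_ms: float

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms


def self_times(spans):
    """Return {span name: time spent in the span itself, excluding direct children}."""
    child_total = defaultdict(float)
    for s in spans:
        if s.parent_id is not None:
            child_total[s.parent_id] += s.duration_ms
    totals = defaultdict(float)
    for s in spans:
        # Assumption: children run sequentially. With parallel fan-out the
        # children can sum to more than the parent, so clamp at zero.
        totals[s.name] += max(0.0, s.duration_ms - child_total[s.span_id])
    return dict(totals)


def call_counts(spans):
    """Count spans per name; more than one call to a dependency may indicate retries."""
    counts = defaultdict(int)
    for s in spans:
        counts[s.name] += 1
    return dict(counts)


# Illustrative trace: one checkout request that calls payment authorize twice.
trace = [
    Span("a", None, "checkout", 0, 1000),
    Span("b", "a", "cart.load", 10, 60),
    Span("c", "a", "payment.authorize", 70, 470),
    Span("d", "a", "payment.authorize", 480, 980),  # second attempt (retry)
]

times = self_times(trace)
slowest = max(times, key=times.get)
print(slowest, times[slowest], call_counts(trace)[slowest])
```

In this made-up trace the payment authorize spans account for most of the request time, and the repeated call suggests a retry amplified the latency. That is the kind of finding that tells you which service's logs to search next.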
Hands-on example
Runbook:
1) Check p95/p99 by endpoint and region.
2) Compare traffic and error rate.
3) Check recent deploy annotations.
4) Open slow traces around the spike.
5) Identify slow span, such as payment authorize.
6) Search Splunk logs by trace_id and service=payment.
7) Mitigate with rollback, traffic shift, or dependency escalation.
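Steps 1 and 6 of the runbook can be sketched in plain Python. This is a rough illustration under stated assumptions: the sample latencies, endpoint names, log records, and helper functions are all hypothetical, the percentile uses the nearest-rank method (metrics backends often interpolate or use histograms instead), and the log filter stands in for a real Splunk search rather than reproducing one.

```python
import math
from collections import defaultdict


def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def tail_latency_by_endpoint(samples):
    """samples: iterable of (endpoint, latency_ms) -> {endpoint: (p95, p99)}."""
    grouped = defaultdict(list)
    for endpoint, latency_ms in samples:
        grouped[endpoint].append(latency_ms)
    return {ep: (percentile(v, 95), percentile(v, 99)) for ep, v in grouped.items()}


def logs_for_trace(records, trace_id, service):
    """Keep only log records matching both the trace ID and the service name."""
    return [
        r for r in records
        if r.get("trace_id") == trace_id and r.get("service") == service
    ]


# Step 1: hypothetical latency samples; /checkout has a slow tail, /search does not.
samples = (
    [("/checkout", 120)] * 17
    + [("/checkout", 2400)] * 3
    + [("/search", 80)] * 20
)
tails = tail_latency_by_endpoint(samples)

# Step 6: hypothetical structured log records carrying a trace_id field.
records = [
    {"trace_id": "t-1", "service": "payment", "message": "timeout calling card processor"},
    {"trace_id": "t-1", "service": "cart", "message": "cart loaded"},
    {"trace_id": "t-2", "service": "payment", "message": "authorized"},
]
matches = logs_for_trace(records, "t-1", "payment")

print(tails)
print(matches)
```

The point of the sketch is the narrowing order: tail percentiles show which endpoint regressed, and the trace ID taken from a slow trace limits the log search to the requests that were actually affected.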
More Observability interview questions
- What is observability, and how is it different from traditional monitoring? [Basic]
- What are the three pillars of observability (metrics, logs, traces)? [Basic]
- What is the difference between monitoring and observability in practice? [Basic]
- What are the four golden signals of monitoring? [Basic]
- What is the difference between the USE method and the RED method? [Basic]
- When would you use the USE method versus the RED method? [Basic]
- What is an SLI, an SLO, and an SLA, and how do they relate? [Basic]
- How do you choose good SLIs for a service? [Basic]