How would you investigate a latency spike using metrics, traces, and logs together? [Advanced]

Question

Accepted Answer

To investigate a latency spike, I start with metrics to identify scope and timing, use traces to find the slow path or dependency, and use logs to inspect exact errors or state changes for affected requests. Metrics answer when, where, how many users, which endpoints, and whether errors or saturation also changed. Traces answer which service or dependency consumed time and whether retries or fan-out amplified latency. Logs answer detailed causes such as timeout messages, SQL errors, throttling responses, or bad configuration.
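
As a rough illustration of that metrics, then traces, then logs flow, here is a minimal Python sketch. Everything in it is an assumption made for the example: the in-memory `P99_BY_MINUTE`, `SPANS`, and `LOGS` data, the helper names, and the "3x the median" spike threshold are all hypothetical, and in practice each step would be a query against your own metrics, tracing, and logging backends.

```python
from collections import defaultdict
from statistics import median

# Hypothetical in-memory telemetry standing in for real backends.
P99_BY_MINUTE = {  # minute -> p99 latency (ms) for one endpoint
    "12:00": 180, "12:01": 175, "12:02": 190,
    "12:03": 920, "12:04": 980, "12:05": 185,
}
SPANS = [  # (trace_id, minute, service, self_time_ms)
    ("t1", "12:01", "api", 40), ("t1", "12:01", "db", 120),
    ("t2", "12:03", "api", 45), ("t2", "12:03", "db", 850),
    ("t3", "12:04", "api", 50), ("t3", "12:04", "db", 900),
]
LOGS = [  # (trace_id, service, message)
    ("t1", "db", "query ok"),
    ("t2", "db", "lock wait timeout exceeded"),
    ("t3", "db", "lock wait timeout exceeded"),
    ("t3", "api", "retrying upstream call"),
]


def spike_minutes(p99_by_minute, factor=3.0):
    """Step 1 (metrics): minutes whose p99 exceeds factor x the median."""
    baseline = median(p99_by_minute.values())
    return [m for m, v in p99_by_minute.items() if v > factor * baseline]


def slowest_service(spans, minutes):
    """Step 2 (traces): service with the most self time in the spike window.

    Also returns the IDs of the traces seen in that window.
    Assumes at least one span falls inside the window.
    """
    totals = defaultdict(int)
    trace_ids = set()
    for trace_id, minute, service, self_ms in spans:
        if minute in minutes:
            totals[service] += self_ms
            trace_ids.add(trace_id)
    return max(totals, key=totals.get), trace_ids


def logs_for(logs, trace_ids, service):
    """Step 3 (logs): messages from the suspect service for affected traces."""
    return [msg for tid, svc, msg in logs if tid in trace_ids and svc == service]


minutes = spike_minutes(P99_BY_MINUTE)
service, trace_ids = slowest_service(SPANS, minutes)
messages = logs_for(LOGS, trace_ids, service)
print(minutes, service, messages)
```

The design point is that each step narrows the next: the metric fixes the time window, the traces in that window name the suspect service and the affected trace IDs, and only then are logs filtered by those IDs, which avoids searching all logs blindly.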
