How do you correlate logs and metrics during an incident? [Advanced]

Question

Accepted Answer

During an incident, I correlate logs and metrics by aligning them on shared dimensions: time range, service, environment, deployment version, trace_id, request ID, and user-impact dimensions. Metrics show scope and trend; logs explain specific errors and events. I start with symptom metrics such as error ratio, latency, and traffic to identify when and where the issue began, then use labels and annotations to narrow it down to a service, endpoint, version, and region. From there I jump to logs, filtering by trace_id or by service and time window, to find concrete exceptions, dependency errors, and state changes.
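
The workflow above can be sketched in a few lines of Python. This is a minimal illustration, not a real backend integration: the `MetricPoint` and `LogLine` record shapes, the `find_onset` and `correlated_logs` helpers, the 5% error-ratio threshold, the two-minute window, and all sample data are hypothetical and chosen only to show the idea of finding the onset in metrics and then pulling the matching logs by service, version, and time, grouped by trace_id.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Dict, Iterable, List, Optional


# Hypothetical record shapes; real field names depend on your metrics and log backends.
@dataclass(frozen=True)
class MetricPoint:
    ts: datetime
    service: str
    version: str
    requests: int
    errors: int


@dataclass(frozen=True)
class LogLine:
    ts: datetime
    service: str
    version: str
    level: str
    trace_id: str
    message: str


def find_onset(points: Iterable[MetricPoint], service: str,
               threshold: float = 0.05) -> Optional[MetricPoint]:
    """Return the earliest sample where the service's error ratio exceeds the threshold."""
    for p in sorted(points, key=lambda p: p.ts):
        if p.service == service and p.requests > 0 and p.errors / p.requests > threshold:
            return p
    return None


def correlated_logs(logs: Iterable[LogLine], onset: MetricPoint,
                    window: timedelta = timedelta(minutes=2)) -> Dict[str, List[str]]:
    """Group ERROR messages by trace_id for the onset's service and version,
    keeping only lines within +/- window of the onset timestamp."""
    grouped: Dict[str, List[str]] = {}
    for line in logs:
        same_scope = line.service == onset.service and line.version == onset.version
        in_window = abs(line.ts - onset.ts) <= window
        if same_scope and in_window and line.level == "ERROR":
            grouped.setdefault(line.trace_id, []).append(line.message)
    return grouped


# Made-up sample data for illustration only.
t0 = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
points = [
    MetricPoint(t0, "checkout", "v41", 1000, 4),
    MetricPoint(t0 + timedelta(minutes=1), "checkout", "v42", 1000, 120),
    MetricPoint(t0 + timedelta(minutes=1), "search", "v7", 800, 2),
]
logs = [
    LogLine(t0 + timedelta(seconds=70), "checkout", "v42", "ERROR", "trace-a", "payment gateway timeout"),
    LogLine(t0 + timedelta(seconds=75), "checkout", "v42", "ERROR", "trace-a", "retry budget exhausted"),
    LogLine(t0 + timedelta(seconds=80), "search", "v7", "ERROR", "trace-b", "cache miss storm"),
    LogLine(t0 + timedelta(minutes=30), "checkout", "v42", "ERROR", "trace-c", "unrelated later error"),
]

onset = find_onset(points, "checkout")
by_trace = correlated_logs(logs, onset) if onset else {}
print(onset)
print(by_trace)
```

In practice the metric step would be a query against the metrics store and the log step a query against the log store; the point of the sketch is that both are keyed on the same service, version, and time window, with trace_id tying individual log lines back to a single request.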
