How do you correlate logs and metrics during an incident? [Advanced]
Answer
During an incident, I correlate logs and metrics by aligning time range, service, environment, deployment version, trace_id, request ID, and user-impact dimensions. Metrics show scope and trend; logs explain specific errors and events.
Technical explanation
Start with symptom metrics such as error ratio, latency, and traffic to identify when and where the issue began.
Use labels and annotations to identify service, endpoint, version, and region.
Jump to logs using trace_id or service/time filters to find concrete exceptions, dependency errors, and state changes.
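The flow above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `MetricPoint` and `LogRecord` types, the `find_onset` and `logs_near` helpers, the 5% threshold, and all sample values are hypothetical, and a real setup would query the metrics and logging backends instead of in-memory lists.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class MetricPoint:
    ts: datetime
    service: str
    version: str
    error_ratio: float  # fraction of failed requests in this interval


@dataclass
class LogRecord:
    ts: datetime
    service: str
    version: str
    level: str
    trace_id: str
    message: str


def find_onset(points, threshold):
    """Return the earliest point whose error ratio exceeds the threshold, or None."""
    for p in sorted(points, key=lambda p: p.ts):
        if p.error_ratio > threshold:
            return p
    return None


def logs_near(logs, onset, before=timedelta(minutes=5), after=timedelta(minutes=15)):
    """Select ERROR logs for the same service and version around the onset time."""
    start, end = onset.ts - before, onset.ts + after
    return [
        r for r in logs
        if r.service == onset.service
        and r.version == onset.version
        and r.level == "ERROR"
        and start <= r.ts <= end
    ]


# Made-up sample data, for illustration only.
base = datetime(2024, 1, 1, 10, 0)
points = [
    MetricPoint(base + timedelta(minutes=m), "checkout", "v2.3.1", ratio)
    for m, ratio in [(0, 0.01), (5, 0.12), (10, 0.20)]
]
logs = [
    LogRecord(base + timedelta(minutes=6), "checkout", "v2.3.1", "ERROR", "t-1", "PAYMENT_TIMEOUT"),
    LogRecord(base + timedelta(minutes=7), "cart", "v1.0.0", "ERROR", "t-2", "unrelated failure"),
    LogRecord(base + timedelta(minutes=8), "checkout", "v2.3.1", "INFO", "t-3", "order placed"),
]

onset = find_onset(points, threshold=0.05)
matches = logs_near(logs, onset)
# Trace IDs are the pivot for jumping from these logs to the matching traces.
trace_ids = sorted({r.trace_id for r in matches})
```

The metric gives the onset time plus the service and version labels, and those same dimensions become the log filter, so the two signals are joined on shared keys and not by eyeballing timestamps.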
Hands-on example
Grafana shows checkout 5xx errors starting at 10:05, after version v2.3.1 was deployed. Open Splunk, set the time range to 10:00-10:20 (the earliest and latest time modifiers), and filter on service=checkout version=v2.3.1 level=ERROR. Piping the results to stats count by error_code shows PAYMENT_TIMEOUT dominating. Traces then confirm elevated latency on the payment dependency.
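For readers without Splunk at hand, the filter and the stats count by error_code step can be approximated in plain Python. This is a rough sketch and not the Splunk implementation: the `count_by` helper, the record layout, and every sample record (including the `INVENTORY_LOCK` code) are assumptions made up for illustration.

```python
from collections import Counter


def count_by(records, field):
    """Count records per value of `field`, skipping records that lack it."""
    return Counter(r[field] for r in records if field in r)


# Hypothetical, already-parsed log records within the incident window.
records = [
    {"service": "checkout", "version": "v2.3.1", "level": "ERROR", "error_code": "PAYMENT_TIMEOUT"},
    {"service": "checkout", "version": "v2.3.1", "level": "ERROR", "error_code": "PAYMENT_TIMEOUT"},
    {"service": "checkout", "version": "v2.3.1", "level": "ERROR", "error_code": "PAYMENT_TIMEOUT"},
    {"service": "checkout", "version": "v2.3.1", "level": "ERROR", "error_code": "INVENTORY_LOCK"},
    {"service": "checkout", "version": "v2.3.1", "level": "INFO"},
    {"service": "cart", "version": "v1.0.0", "level": "ERROR", "error_code": "CART_EXPIRED"},
]

# Same filter as the search: service, version, and level.
filtered = [
    r for r in records
    if r.get("service") == "checkout"
    and r.get("version") == "v2.3.1"
    and r.get("level") == "ERROR"
]

counts = count_by(filtered, "error_code")
top_code, top_count = counts.most_common(1)[0]
```

With this sample data, `top_code` is `PAYMENT_TIMEOUT`, which is the signal that points the investigation at the payment dependency.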
More Observability interview questions
- What is observability, and how is it different from traditional monitoring? [Basic]
- What are the three pillars of observability (metrics, logs, traces)? [Basic]
- What is the difference between monitoring and observability in practice? [Basic]
- What are the four golden signals of monitoring? [Basic]
- What is the difference between the USE method and the RED method? [Basic]
- When would you use the USE method versus the RED method? [Basic]
- What is an SLI, an SLO, and an SLA, and how do they relate? [Basic]
- How do you choose good SLIs for a service? [Basic]