Interview Observability

How do you correlate logs and metrics during an incident? [Advanced]

Answer

During an incident, I correlate logs and metrics by aligning time range, service, environment, deployment version, trace_id, request ID, and user-impact dimensions. Metrics show scope and trend; logs explain specific errors and events.

Technical explanation

Start with symptom metrics such as error ratio, latency, and traffic to identify when and where the issue began.

Use metric labels and dashboard annotations to narrow the issue to a specific service, endpoint, version, and region.

Jump to logs using trace_id or service/time filters to find concrete exceptions, dependency errors, and state changes.
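The three steps above can be sketched in Python. This is a minimal illustration, not a real monitoring API: the function names, the 5% error-ratio threshold, the window padding, and the sample records are all hypothetical, and in practice the metric samples and logs would come from your metrics and logging backends.

```python
from datetime import datetime, timedelta

def find_onset(samples, threshold=0.05):
    """Return the timestamp of the first sample whose error ratio exceeds threshold."""
    for ts, total, errors in samples:
        if total > 0 and errors / total > threshold:
            return ts
    return None

def log_window(onset, before=timedelta(minutes=5), after=timedelta(minutes=15)):
    """Pad the onset so the log search also covers the lead-up to the incident."""
    return onset - before, onset + after

def filter_logs(records, start, end, **labels):
    """Keep records inside [start, end] whose fields match every given label."""
    return [
        r for r in records
        if start <= r["ts"] <= end
        and all(r.get(k) == v for k, v in labels.items())
    ]

# Hypothetical data: per-minute (timestamp, total requests, errors) and log records.
day = datetime(2024, 1, 1)  # arbitrary date, for illustration only
samples = [
    (day.replace(hour=10, minute=3), 1000, 2),
    (day.replace(hour=10, minute=4), 1000, 3),
    (day.replace(hour=10, minute=5), 1000, 120),
    (day.replace(hour=10, minute=6), 1000, 150),
]
logs = [
    {"ts": day.replace(hour=9, minute=50), "service": "checkout", "level": "ERROR", "trace_id": "t0"},
    {"ts": day.replace(hour=10, minute=6), "service": "checkout", "level": "ERROR", "trace_id": "t1"},
    {"ts": day.replace(hour=10, minute=7), "service": "cart", "level": "ERROR", "trace_id": "t2"},
    {"ts": day.replace(hour=10, minute=8), "service": "checkout", "level": "INFO", "trace_id": "t3"},
]

onset = find_onset(samples)                # step 1: when did the symptom start?
start, end = log_window(onset)             # derive the log search window
matches = filter_logs(logs, start, end,    # steps 2-3: narrow by labels, then read logs
                      service="checkout", level="ERROR")
print(onset, [r["trace_id"] for r in matches])
```

The trace_id values on the matching records are what you would carry over to the tracing tool to follow individual requests.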

Hands-on example

Grafana shows checkout 5xx errors starting at 10:05, after version v2.3.1. In Splunk, search that window with earliest=10:00 latest=10:20 service=checkout version=v2.3.1 level=ERROR | stats count by error_code (time values shown in shorthand). PAYMENT_TIMEOUT dominates the results. Traces then confirm elevated latency in the payment dependency.
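For readers less familiar with Splunk, the aggregation step (count by error_code) amounts to a simple group-and-count. The sketch below mimics it in Python on hypothetical, already-filtered error records. The records, the INVENTORY_CONFLICT code, and the trace IDs are made up for illustration; only PAYMENT_TIMEOUT comes from the example above.

```python
from collections import Counter

def count_by(records, field):
    """Count records per value of `field`, skipping records that lack it."""
    return Counter(r[field] for r in records if field in r)

# Hypothetical ERROR records for checkout v2.3.1 in the incident window.
error_logs = [
    {"error_code": "PAYMENT_TIMEOUT", "trace_id": "a1"},
    {"error_code": "PAYMENT_TIMEOUT", "trace_id": "a2"},
    {"error_code": "INVENTORY_CONFLICT", "trace_id": "a3"},
    {"error_code": "PAYMENT_TIMEOUT", "trace_id": "a4"},
]

counts = count_by(error_logs, "error_code")
top_code, top_count = counts.most_common(1)[0]

# Trace IDs for the dominant error are the ones worth opening in the tracing tool.
trace_ids = [r["trace_id"] for r in error_logs if r["error_code"] == top_code]
print(top_code, top_count, trace_ids)
```

Grouping first and pulling trace IDs second keeps the investigation focused on the error that explains most of the impact rather than on whichever log line appears first.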
