How do you design alerts that page a human only when action is required? [Advanced]
Answer
I design paging alerts around four criteria: user impact, urgency, ownership, and required action. If no immediate human action is needed, the signal should become a ticket, a dashboard annotation, or an automated remediation rather than a page.
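The criteria above can be sketched as a small triage function. This is a minimal illustration, not a production router: the `Signal` fields, the `route` function, and the four destination names are all hypothetical, chosen only to mirror the answer's wording.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    """Hypothetical description of an alerting signal under review."""
    name: str
    user_impact: bool      # users are affected, or will be soon
    urgent: bool           # waiting until business hours makes it worse
    has_owner: bool        # a named on-call team is responsible
    human_action: bool     # a person has something concrete to do
    auto_remediable: bool = False  # a script can fix it without a human


def route(signal: Signal) -> str:
    """Return 'page', 'ticket', 'automate', or 'dashboard' for a signal."""
    # Automation wins whenever no human decision is required.
    if signal.auto_remediable and not signal.human_action:
        return "automate"
    # Page only when all four criteria hold at once.
    if (signal.user_impact and signal.urgent
            and signal.has_owner and signal.human_action):
        return "page"
    # Work for a human, but not right now (or not yet owned): ticket it.
    if signal.human_action:
        return "ticket"
    # Nothing to do: keep it visible for context only.
    return "dashboard"


checkout_burn = Signal("CheckoutHighBurn", True, True, True, True)
cpu_high = Signal("CPUHigh", False, False, True, True)
disk_cleanup = Signal("TmpDirFull", False, False, True, False, auto_remediable=True)
deploy_marker = Signal("DeployFinished", False, False, True, False)
```

In this sketch an urgent signal without an owner falls through to a ticket, on the assumption that a page nobody is responsible for cannot be answered; fixing ownership comes first.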
Technical explanation
- Page on symptoms, not causes: use SLO burn-rate alerts, which fire when the error budget is being spent fast enough to threaten the SLO.
- Require every page to include the service, severity, owner, runbook, dashboard, and recent-change links.
- Review past pages regularly, and remove or downgrade alerts that did not lead to action.
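A burn-rate check and a page-completeness check might look like the sketch below. Burn rate here is the observed error ratio divided by the error budget (1 minus the SLO target). The 99.9% target, the 14.4 threshold, and the long/short window pairing are assumptions borrowed from commonly cited multi-window practice for a 30-day SLO, not values from this answer; the field names in `REQUIRED_PAGE_FIELDS` are likewise hypothetical.

```python
from typing import Dict, List

# Hypothetical annotation keys; adapt to whatever your alerts carry.
REQUIRED_PAGE_FIELDS = (
    "service", "severity", "owner", "runbook", "dashboard", "recent_changes",
)


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def should_page(long_window_ratio: float, short_window_ratio: float,
                slo_target: float = 0.999, threshold: float = 14.4) -> bool:
    """Page only if both windows burn fast.

    The long window (for example 1h) shows the burn is sustained; the
    short window (for example 5m) shows it is still happening now.
    """
    return (burn_rate(long_window_ratio, slo_target) >= threshold
            and burn_rate(short_window_ratio, slo_target) >= threshold)


def missing_page_fields(alert: Dict[str, str]) -> List[str]:
    """List required fields that are absent or empty on an alert."""
    return [f for f in REQUIRED_PAGE_FIELDS if not alert.get(f)]
```

For example, `should_page(0.02, 0.03)` is true for a 99.9% SLO because both windows burn at roughly 20x and 30x the budgeted rate, while `should_page(0.02, 0.0005)` is false because the short window shows the problem has already subsided.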
Hands-on example
Convert CPUHigh pages into tickets unless CPU saturation is proven to cause SLO burn. Keep pages for a high checkout burn rate, a payment dependency outage, and data-loss risk. Add an Alertmanager inhibition rule so that pod-level alerts do not page while the service-level SLO alert is already firing.
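The inhibition idea can be illustrated with a simplified model in Python. This is not Alertmanager's implementation or configuration syntax; it only mimics the general shape of an inhibit rule (a source matcher, a target matcher, and a list of labels that must be equal on both alerts). The alert names and the `scope` and `service` labels are hypothetical.

```python
from typing import Dict, List

Alert = Dict[str, str]


def matches(alert: Alert, matchers: Dict[str, str]) -> bool:
    """True if the alert carries every label/value pair in matchers."""
    return all(alert.get(k) == v for k, v in matchers.items())


def is_inhibited(target: Alert, firing: List[Alert],
                 source_matchers: Dict[str, str],
                 target_matchers: Dict[str, str],
                 equal: List[str]) -> bool:
    """True if some other firing source alert suppresses the target."""
    if not matches(target, target_matchers):
        return False
    return any(
        src is not target                      # an alert never inhibits itself
        and matches(src, source_matchers)
        and all(src.get(label) == target.get(label) for label in equal)
        for src in firing
    )


firing = [
    {"alertname": "CheckoutSLOHighBurn", "service": "checkout", "scope": "service"},
    {"alertname": "PodCrashLooping", "service": "checkout", "scope": "pod"},
    {"alertname": "PodCrashLooping", "service": "search", "scope": "pod"},
]

# Service-level alerts silence pod-level alerts for the same service.
pages = [
    a for a in firing
    if not is_inhibited(a, firing,
                        source_matchers={"scope": "service"},
                        target_matchers={"scope": "pod"},
                        equal=["service"])
]
```

With these inputs, the checkout pod alert is suppressed because the checkout SLO alert is firing, while the search pod alert still pages because no service-level alert covers it.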