What does the 99.99% availability SLA you operate translate to in allowed downtime per month, and how do you track it?
Resume & Behavioral · Basic level
Answer
A 99.99% availability target means the service can be unavailable for only 0.01% of the measurement window. For a 30-day month, that is 4.32 minutes of allowed downtime; over a 365-day year it is about 52.6 minutes. I track it through user-facing SLIs, not just host uptime: successful request rate, critical journey success, latency against agreed thresholds, and sometimes synthetic checks. I also watch error-budget burn so we can react early instead of discovering at month end that the service missed its target.
Technical explanation
99.99% leaves a 0.01% error budget. For 30 days: 43,200 minutes × 0.0001 = 4.32 minutes.
Availability should be user-centric. A pod can be running while a critical API or user journey is failing.
Burn-rate alerting is key for four-nines services because a short, severe incident can consume the monthly budget very quickly: a full outage exhausts the entire 30-day budget in about 4.32 minutes.
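The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not a production calculation: the function names are hypothetical, burn rate is taken here to mean the observed error ratio divided by the error budget ratio, and traffic is assumed to be spread evenly across the window.

```python
def allowed_downtime_minutes(slo: float, window_days: float) -> float:
    """Minutes of full unavailability an availability SLO permits over a window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1.0 - slo)


def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' the budget is being spent.

    Assumption: burn rate = observed error ratio / error budget ratio.
    """
    return error_ratio / (1.0 - slo)


def minutes_to_exhaustion(rate: float, window_days: float) -> float:
    """Minutes until the whole window's budget is gone at a constant burn rate."""
    if rate <= 0:
        return float("inf")  # nothing is being burned
    return window_days * 24 * 60 / rate


if __name__ == "__main__":
    slo = 0.9999
    print(round(allowed_downtime_minutes(slo, 30), 2))    # 30-day month
    print(round(allowed_downtime_minutes(slo, 365), 2))   # 365-day year
    full_outage = burn_rate(1.0, slo)                     # every request failing
    print(round(minutes_to_exhaustion(full_outage, 30), 2))
```

With a full outage (error ratio 1.0) the burn rate is roughly 10,000, so the time to exhaustion matches the 4.32-minute monthly allowance. This is why a four-nines service needs a fast-burn alert and cannot rely on a month-end report.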
Hands-on example
1. Define SLI: good requests / total valid requests, where good means non-5xx and under the agreed latency threshold.
2. Create a dashboard with SLO attainment, error budget remaining, fast burn, slow burn, top incidents, and top dependency contributors.
3. During reviews, correlate downtime minutes with incident timeline, deployment history, and follow-up actions.
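The SLI definition in step 1 and the "error budget remaining" figure in step 2 could be computed along these lines. It is a sketch under stated assumptions: the `Request` record, the 300 ms threshold in the usage note, and the rule that 4xx responses are excluded from "valid" requests are all hypothetical choices, since the answer does not say how validity is decided.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Request:
    status: int        # HTTP status code
    latency_ms: float  # end-to-end latency seen by the caller


def is_valid(req: Request) -> bool:
    # Assumption: client errors (4xx) are not counted against the service.
    return not (400 <= req.status < 500)


def is_good(req: Request, latency_threshold_ms: float) -> bool:
    # Good = non-5xx and under the agreed latency threshold.
    return req.status < 500 and req.latency_ms < latency_threshold_ms


def sli(requests: Iterable[Request], latency_threshold_ms: float) -> Optional[float]:
    """Good requests / total valid requests, or None if there was no valid traffic."""
    valid = [r for r in requests if is_valid(r)]
    if not valid:
        return None
    good = sum(1 for r in valid if is_good(r, latency_threshold_ms))
    return good / len(valid)


def budget_remaining(measured_sli: float, slo: float) -> float:
    """Fraction of the error budget still unspent; negative means overspent."""
    return 1.0 - (1.0 - measured_sli) / (1.0 - slo)
```

For example, with a 300 ms threshold, a sample of one fast 200, one slow 200, one 503, and one 404 yields an SLI of 1/3, because the 404 is excluded and only the fast 200 counts as good. A measured SLI of 99.995% against a 99.99% target leaves about half of the budget.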