What is an error budget, and how have you used one to make a decision?

Question

Accepted Answer

An error budget is the amount of unreliability an SLO allows: error budget = 100% - SLO target. A 99.9% availability SLO, for example, leaves a 0.1% budget, which is roughly 43 minutes of full downtime in a 30-day window. Burn rate shows how quickly that budget is being consumed; a burn rate of 1 spends the budget exactly over the SLO window, and anything higher exhausts it early.

I measure reliability from the user's point of view. Uptime alone can hide partial failures, high latency, stale data, or degraded dependencies. I choose SLIs around critical user journeys, such as successful requests under a latency threshold, job freshness, correctness, or transaction completion.

SLOs become useful when they drive decisions: release risk, reliability investment, incident response, and error-budget trade-offs. A metric should become an SLO when it represents a user-visible promise and missing it will change engineering behavior. Keep SLOs few and trusted, and use supporting metrics such as CPU, memory, restarts, queue depth, and DB connections for diagnosis rather than as SLOs.

Hands-on example
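
The budget and burn-rate arithmetic from the answer can be sketched in a few lines of Python. This is a minimal illustration, not a production policy: the `Slo` class, the helper names, the request counts, and the decision thresholds in `release_decision` are all assumptions made up for the example, and `budget_consumed` assumes traffic is spread roughly evenly across the window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """Hypothetical SLO definition used only for this example."""
    target: float      # e.g. 0.999 means 99.9% of requests must be good
    window_days: int   # length of the SLO window, e.g. 30

    @property
    def error_budget(self) -> float:
        # Error budget = 100% - SLO target, expressed as a fraction.
        return 1.0 - self.target


def burn_rate(slo: Slo, bad: int, total: int) -> float:
    """Observed error rate divided by the budgeted error rate.

    1.0 means the budget lasts exactly one window; above 1.0 it runs out early.
    """
    if total == 0:
        return 0.0
    return (bad / total) / slo.error_budget


def budget_consumed(slo: Slo, bad: int, total: int, elapsed_days: float) -> float:
    """Fraction of the window's budget spent so far (assumes even traffic)."""
    return burn_rate(slo, bad, total) * (elapsed_days / slo.window_days)


def release_decision(consumed: float, rate: float) -> str:
    """Illustrative policy; the thresholds are assumptions, not a standard."""
    if consumed >= 1.0:
        return "freeze"      # budget exhausted: reliability work only
    if rate > 1.0:
        return "slow down"   # on track to exhaust the budget before window end
    return "ship"            # budget is healthy: normal release cadence


# Made-up numbers: 10 days into a 30-day window, 1,500 bad out of 1,000,000 requests.
slo = Slo(target=0.999, window_days=30)
rate = burn_rate(slo, bad=1_500, total=1_000_000)
consumed = budget_consumed(slo, bad=1_500, total=1_000_000, elapsed_days=10)
decision = release_decision(consumed, rate)
print(f"burn rate {rate:.2f}, budget consumed {consumed:.0%}, decision: {decision}")
```

With these made-up numbers the burn rate is about 1.5 and about half the budget is spent a third of the way through the window, so the sketch recommends slowing releases rather than freezing them. In an interview answer, this is the kind of trade-off the error budget makes explicit.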
