
What is an error budget, and how have you used one to make a decision?

Resume & Behavioral · Intermediate level

Answer

An error budget is the amount of unreliability an SLO allows: 100% minus the SLO target. To use one well, I measure reliability from the user's point of view. Uptime alone can hide partial failures, high latency, stale data, or degraded dependencies. I choose SLIs around critical journeys, such as successful requests under a latency threshold, job freshness, correctness, or transaction completion. SLOs become useful when they drive decisions about release risk, reliability investment, incident response, and error-budget trade-offs.

Technical explanation

A metric should become an SLO when it represents a user-visible promise and will change engineering behavior if missed.

Keep SLOs few and trusted. Use supporting metrics such as CPU, memory, restarts, queue depth, and DB connections for diagnosis rather than as SLOs.

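Error budget = 100% - SLO target. Burn rate shows how quickly that budget is being consumed: a burn rate of 1 uses up the budget exactly at the end of the SLO window, and anything higher exhausts it early.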
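
To make the arithmetic concrete, here is a minimal Python sketch. The function names, the 30-day window, and the 99.9% target are assumptions chosen for illustration, not values from any particular system.

```python
# Illustrative helpers; names and default values are hypothetical.

def error_budget(slo_target: float) -> float:
    """Fraction of events (or time) allowed to be bad, e.g. 0.001 for a 99.9% SLO."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    return 1.0 - slo_target


def allowed_bad_minutes(slo_target: float, window_days: int = 30) -> float:
    """Budget expressed as minutes of full outage over the SLO window."""
    return error_budget(slo_target) * window_days * 24 * 60


def burn_rate(observed_error_rate: float, slo_target: float) -> float:
    """Observed error rate divided by the budgeted error rate.

    1.0 means the budget lasts exactly one window; 5.0 means it is
    being spent five times too fast.
    """
    return observed_error_rate / error_budget(slo_target)


if __name__ == "__main__":
    target = 0.999  # assumed 99.9% SLO
    print(f"budget: {error_budget(target):.4%} of requests")
    print(f"budget: {allowed_bad_minutes(target):.1f} minutes per 30 days")
    print(f"burn rate at 0.5% errors: {burn_rate(0.005, target):.1f}")
```

Under these assumptions, a 99.9% target over 30 days leaves roughly 43 minutes of budget, and a sustained 0.5% error rate burns it about five times faster than the SLO allows.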

Hands-on example

1. Map the top user journey and define good events and total events.

2. Example API SLI: valid requests that return non-5xx under 500 ms divided by total valid requests.

3. Backtest the SLO using 30-90 days of data, then build a dashboard and burn-rate alerts.

4. Use monthly reviews to decide whether to ship faster, pause risky changes, or prioritize reliability work.
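
The steps above can be sketched in Python as follows. This is a simplified illustration, not a production implementation: the Request record, the 25% "low budget" threshold, and the wording of the decisions are hypothetical, and a real setup would compute the SLI from a metrics backend rather than an in-memory list.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Request:
    # Hypothetical record shape; real data would come from logs or metrics.
    status: int
    latency_ms: float
    valid: bool = True  # False for traffic excluded from the SLI (e.g. malformed requests)


def api_sli(requests: Iterable[Request], latency_threshold_ms: float = 500.0) -> Optional[float]:
    """Good events / total events: valid, non-5xx, under the latency threshold."""
    valid = [r for r in requests if r.valid]
    if not valid:
        return None  # no traffic, SLI undefined
    good = sum(
        1 for r in valid
        if not 500 <= r.status <= 599 and r.latency_ms < latency_threshold_ms
    )
    return good / len(valid)


def budget_remaining(sli: float, slo_target: float) -> float:
    """Fraction of the error budget left in the window (negative if overspent)."""
    budget = 1.0 - slo_target
    consumed = 1.0 - sli
    return 1.0 - consumed / budget


def review_decision(remaining: float, low_threshold: float = 0.25) -> str:
    """Map remaining budget to a review outcome; the threshold is an assumed policy."""
    if remaining <= 0.0:
        return "pause risky changes"
    if remaining < low_threshold:
        return "prioritize reliability work"
    return "ship faster"


if __name__ == "__main__":
    # Synthetic traffic: 995 good requests, 2 server errors, 3 slow responses.
    traffic = (
        [Request(200, 120.0)] * 995
        + [Request(503, 80.0)] * 2
        + [Request(200, 900.0)] * 3
    )
    sli = api_sli(traffic)
    remaining = budget_remaining(sli, slo_target=0.99)
    print(sli, remaining, review_decision(remaining))
```

Keeping the decision rule as an explicit function makes the policy easy to review and adjust; the specific thresholds should be agreed with the team rather than taken from this sketch.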
