How do you decide retry budgets to avoid retry storms in the mesh?

Istio & Service Mesh · Advanced level

Answer

I derive retry budgets from four inputs: the user-facing latency budget, the downstream service's spare capacity, whether the operation is idempotent, and how the service behaves during incidents. The goal is to recover from transient failures without multiplying traffic so much that a struggling service collapses.
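
To make the traffic-multiplication risk concrete, here is a minimal sketch of the worst-case arithmetic. It assumes every hop in a call chain retries independently with the same retry count; the function name and parameters are illustrative, not part of Istio.

```python
def worst_case_amplification(retries_per_hop: int, depth: int) -> int:
    """Upper bound on requests hitting the deepest service for one user request.

    Assumes each of `depth` hops makes 1 initial try plus `retries_per_hop`
    retries, and that every try at one hop triggers a full set at the next.
    """
    return (1 + retries_per_hop) ** depth


if __name__ == "__main__":
    for depth in (1, 2, 3):
        factor = worst_case_amplification(retries_per_hop=2, depth=depth)
        print(f"depth={depth}: up to {factor}x load on the deepest service")
```

With two retries per hop, a three-hop chain can send up to 27 requests to the deepest service for a single user request, which is why retries are usually configured at one layer rather than at every hop.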

Technical explanation

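Retries should be bounded on four axes: number of attempts, per-try timeout, total request timeout, and retry conditions. In an Istio VirtualService these map to retries.attempts, retries.perTryTimeout, the route-level timeout, and retries.retryOn.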

Non-idempotent operations need idempotency keys or should not be retried blindly by the mesh.
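
One way to express that rule is a small guard that decides whether a request may be retried automatically. This is a sketch under assumptions: the `Idempotency-Key` header name is a common convention rather than something the mesh enforces, and `is_retry_safe` is a hypothetical helper.

```python
# Methods defined as idempotent by HTTP semantics (RFC 9110).
IDEMPOTENT_METHODS = {"GET", "HEAD", "OPTIONS", "TRACE", "PUT", "DELETE"}

# Assumed header name; the application must actually deduplicate on it.
IDEMPOTENCY_HEADER = "idempotency-key"


def is_retry_safe(method: str, headers: dict[str, str]) -> bool:
    """Return True if retrying the request should not duplicate side effects."""
    if method.upper() in IDEMPOTENT_METHODS:
        return True
    # Non-idempotent methods (POST, PATCH) are only safe with a non-empty key.
    return any(
        name.lower() == IDEMPOTENCY_HEADER and value.strip()
        for name, value in headers.items()
    )


if __name__ == "__main__":
    print(is_retry_safe("GET", {}))
    print(is_retry_safe("POST", {}))
    print(is_retry_safe("POST", {"Idempotency-Key": "order-7f3a"}))
```

In a mesh, the same decision is typically made by splitting routes, so that retries are enabled only on routes matching idempotent methods or keyed requests.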

Monitor retry rate as its own signal; a retry spike often means an incident is already developing.
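
As a sketch of treating retry rate as its own signal, the monitor below keeps one retry ratio per minute and fires only when every minute in the window exceeds the threshold. The class name, the per-minute sampling, and the default threshold and window are assumptions for illustration; in practice this would usually be an alerting rule over mesh metrics.

```python
from collections import deque


class RetryRateMonitor:
    """Tracks retries / original requests per minute over a sliding window."""

    def __init__(self, threshold: float = 0.05, window_minutes: int = 5):
        self.threshold = threshold
        self.ratios = deque(maxlen=window_minutes)  # oldest sample drops off

    def record_minute(self, original_requests: int, retried_requests: int) -> None:
        # A minute with no traffic counts as a zero ratio rather than an error.
        ratio = retried_requests / original_requests if original_requests else 0.0
        self.ratios.append(ratio)

    def should_alert(self) -> bool:
        # Fire only when the window is full and every minute breached the threshold.
        window_full = len(self.ratios) == self.ratios.maxlen
        return window_full and all(r > self.threshold for r in self.ratios)


if __name__ == "__main__":
    monitor = RetryRateMonitor()
    for minute in range(5):
        monitor.record_minute(original_requests=1000, retried_requests=100)
        print(minute, monitor.should_alert())
```

Requiring a sustained breach avoids paging on a single noisy minute while still catching a retry rate that stays elevated.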

Hands-on example

Budget example:

User-facing endpoint budget: 1s.

Downstream p95 latency under normal load: 120ms.

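Policy: attempts=2, perTryTimeout=200ms, timeout=600ms. In Istio, attempts counts retries after the initial try, so this allows up to three tries of 200ms each, capped at 600ms overall, which leaves 400ms of the 1s budget for the rest of the request path.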

Alert when retry request rate exceeds 5 percent of original request rate for 5 minutes.

During brownouts, reduce or disable retries and shed load, so that retries do not add traffic to a service that is already saturated.
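
The budget example above can be sanity-checked with a few lines of arithmetic. This sketch assumes that attempts counts retries after the first try and that roughly 25ms of backoff separates tries; `RetryPolicy` and `check_policy` are hypothetical names, not an Istio API.

```python
from dataclasses import dataclass


@dataclass
class RetryPolicy:
    retries: int             # retries after the initial try
    per_try_timeout_ms: int  # deadline for each individual try
    timeout_ms: int          # overall deadline across all tries


def check_policy(policy: RetryPolicy, endpoint_budget_ms: int,
                 downstream_p95_ms: int, backoff_ms: int = 25) -> dict:
    """Compare a retry policy against the endpoint budget and downstream p95."""
    tries = 1 + policy.retries
    # Time all tries would take if each ran to its per-try deadline.
    uncapped_ms = tries * policy.per_try_timeout_ms + policy.retries * backoff_ms
    # The overall timeout cuts the sequence short if it is smaller.
    worst_case_ms = min(policy.timeout_ms, uncapped_ms)
    return {
        "worst_case_ms": worst_case_ms,
        "headroom_ms": endpoint_budget_ms - worst_case_ms,
        "fits_budget": worst_case_ms <= endpoint_budget_ms,
        # A per-try timeout below p95 would retry requests that were healthy.
        "per_try_above_p95": policy.per_try_timeout_ms > downstream_p95_ms,
    }


if __name__ == "__main__":
    policy = RetryPolicy(retries=2, per_try_timeout_ms=200, timeout_ms=600)
    print(check_policy(policy, endpoint_budget_ms=1000, downstream_p95_ms=120))
```

For the example policy, the worst case is capped at 600ms by the overall timeout, leaving 400ms of headroom in the 1s budget, and the 200ms per-try timeout sits above the 120ms p95 so normal requests are not retried prematurely.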
