How do you decide retry budgets to avoid retry storms in the mesh?
Istio & Service Mesh · Advanced level
Answer
I set retry budgets from four inputs: the user-facing latency budget, downstream capacity, whether the operation is idempotent, and how the service behaves during incidents. The goal is to recover from transient failures without multiplying traffic so much that a struggling service collapses.
Technical explanation
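Retries should be bounded on four axes: number of attempts, per-try timeout, total request timeout, and retry conditions (in an Istio VirtualService: attempts, perTryTimeout, the route timeout, and retryOn). Retries configured at several hops multiply each other, so prefer retrying at one layer of the call chain.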
Non-idempotent operations need idempotency keys or should not be retried blindly by the mesh.
Monitor retry rate as its own signal; a retry spike often means an incident is already developing.
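To make the amplification and idempotency points concrete, here is a minimal sketch. The function names and the per-hop retry counts are hypothetical, chosen only for illustration; the set of idempotent methods follows the HTTP semantics specification.

```python
from math import prod

# HTTP methods defined as idempotent by the HTTP semantics spec (RFC 9110).
IDEMPOTENT_METHODS = {"GET", "HEAD", "OPTIONS", "TRACE", "PUT", "DELETE"}


def worst_case_amplification(retries_per_hop):
    """Worst-case number of tries that reach the deepest service
    when every hop in the call chain exhausts its retries.

    retries_per_hop: retries allowed at each hop, e.g. [2, 2, 2].
    """
    return prod(1 + r for r in retries_per_hop)


def safe_to_retry(method, has_idempotency_key=False):
    """A request is safe to retry automatically if its method is
    idempotent, or if the caller supplied an idempotency key that
    the server deduplicates on (an assumption about the server)."""
    return method.upper() in IDEMPOTENT_METHODS or has_idempotency_key


# Three hops with 2 retries each: 3 * 3 * 3 = 27 tries at the last hop.
print(worst_case_amplification([2, 2, 2]))
print(safe_to_retry("POST"), safe_to_retry("POST", has_idempotency_key=True))
```

The multiplication is why a modest per-hop setting can still overwhelm a service that sits deep in the call chain.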
Hands-on example
Budget example:
User-facing endpoint budget: 1s.
Downstream normal p95: 120ms.
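Policy: attempts=2, perTryTimeout=200ms, timeout=600ms. In Istio, attempts counts retries, so this allows up to 3 tries; 3 × 200ms = 600ms, which the total timeout also caps, leaving 400ms of the 1s budget for the rest of the request path.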
Alert when retry request rate exceeds 5 percent of original request rate for 5 minutes.
During brownouts, reduce retries or shed load.
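The arithmetic behind this budget and the alert rule can be checked with a short sketch. The function names are hypothetical, the per-minute sample values are made up for illustration, and the latency bound ignores backoff between tries, which the total timeout would cap anyway.

```python
def worst_case_latency_ms(retries, per_try_timeout_ms, total_timeout_ms):
    """Upper bound on time spent in the retried call: every try times
    out, but the total timeout cuts the whole request off."""
    tries = 1 + retries  # the original try plus the allowed retries
    return min(tries * per_try_timeout_ms, total_timeout_ms)


def fits_budget(retries, per_try_timeout_ms, total_timeout_ms, budget_ms):
    """True if the worst case still fits the user-facing latency budget."""
    worst = worst_case_latency_ms(retries, per_try_timeout_ms, total_timeout_ms)
    return worst <= budget_ms


def retry_alert(retry_counts, original_counts, threshold=0.05):
    """True if the retry-to-original ratio exceeded the threshold in
    every sample of the window (e.g. five one-minute samples)."""
    if not original_counts or len(retry_counts) != len(original_counts):
        return False
    return all(
        o > 0 and r / o > threshold
        for r, o in zip(retry_counts, original_counts)
    )


# Policy from the example: 2 retries, 200ms per try, 600ms total, 1s budget.
print(worst_case_latency_ms(2, 200, 600))
print(fits_budget(2, 200, 600, budget_ms=1000))

# Five one-minute samples: retries vs. original requests.
print(retry_alert([6, 7, 8, 6, 9], [100, 100, 100, 100, 100]))
```

Requiring the ratio to hold for the whole window keeps a single noisy minute from paging anyone, at the cost of a slower alert.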