How do you decide retry budgets to avoid retry storms in the mesh?

Accepted Answer

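I derive retry budgets from four inputs: the end-user latency budget, the spare capacity of the downstream service, whether the operation is idempotent, and how retries behaved in past incidents. The goal is to recover from transient failures without multiplying traffic so much that a struggling service collapses. Every retry policy should be bounded by a maximum number of attempts, a per-try timeout, a total timeout, and an explicit set of retry conditions. Retries also compound across hops: if each of three layers makes up to 3 attempts, one user request can turn into as many as 27 calls to the deepest service. Non-idempotent operations need idempotency keys; without them, the mesh should not retry them blindly. Monitor retry rate as its own signal, because a retry spike often means an incident is already developing.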
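The arithmetic behind these limits can be sketched in a few lines of Python. This is only an illustration, not mesh configuration: RetryPolicy, worst_case_amplification, fits_latency_budget, and RetryBudget are hypothetical names, the budget is modeled as a simple ratio of retries to original requests, and the counters are assumed to be reset per time window by the caller.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetryPolicy:
    """Hypothetical policy shape; field names are illustrative, not a mesh API."""
    max_attempts: int          # total tries, including the first one
    per_try_timeout_s: float   # timeout applied to each individual attempt
    total_timeout_s: float     # overall deadline for the request


def worst_case_amplification(policy: RetryPolicy, layers: int = 1) -> int:
    """Upper bound on calls reaching the deepest service per user request,
    assuming every layer retries independently with the same policy."""
    return policy.max_attempts ** layers


def fits_latency_budget(policy: RetryPolicy, user_budget_s: float) -> bool:
    """Check that all attempts fit in the total timeout, and that the total
    timeout fits in the user latency budget. Backoff delays are ignored."""
    worst_case_s = policy.max_attempts * policy.per_try_timeout_s
    return worst_case_s <= policy.total_timeout_s <= user_budget_s


class RetryBudget:
    """Permit retries only while they stay under a fraction of original requests.

    Assumption: counters cover one time window; a real implementation would
    expire old counts (for example with a sliding window).
    """

    def __init__(self, ratio: float, min_retries: int = 0) -> None:
        self.ratio = ratio
        self.min_retries = min_retries
        self.requests = 0
        self.retries = 0

    def record_request(self) -> None:
        """Count one original (non-retry) request."""
        self.requests += 1

    def try_acquire_retry(self) -> bool:
        """Return True and count the retry if the budget still allows it."""
        allowed = max(self.min_retries, int(self.requests * self.ratio))
        if self.retries < allowed:
            self.retries += 1
            return True
        return False


if __name__ == "__main__":
    policy = RetryPolicy(max_attempts=3, per_try_timeout_s=0.5, total_timeout_s=2.0)
    print(worst_case_amplification(policy, layers=3))    # 27
    print(fits_latency_budget(policy, user_budget_s=2.5))  # True

    budget = RetryBudget(ratio=0.2)
    for _ in range(10):
        budget.record_request()
    print([budget.try_acquire_retry() for _ in range(3)])  # [True, True, False]
```

A ratio-based budget caps the extra load at a known fraction of normal traffic (20% in the usage above), whereas a per-request attempt limit alone still allows every request to retry at once during an outage.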
