
Istio & Service Mesh interview questions & answers

99 Istio & Service Mesh interview questions, each answered three ways: a concise spoken answer, a technical explanation, and a hands-on example.


All questions

  1. What is Istio, and what are the core capabilities it provides?
  2. What is the difference between the Istio control plane and data plane?
  3. What is istiod, and what does it do?
  4. What is Envoy, and what role does it play in Istio?
  5. What is the sidecar pattern, and how does Istio inject the proxy?
  6. How does automatic sidecar injection work (namespace label, webhook)?
  7. What is the Istio ambient (sidecarless) mode, and how does it differ from sidecar mode?
  8. What is the difference between ztunnel and a waypoint proxy in ambient mode?
  9. What problem does Istio solve that Kubernetes Services alone do not?
  10. How does traffic flow through an Envoy sidecar for inbound and outbound requests?
  11. What is a VirtualService, and what does it control?
  12. What is a DestinationRule, and how does it relate to a VirtualService?
  13. What is a Gateway resource, and how does it differ from a Kubernetes Ingress?
  14. What is the difference between an Istio ingress gateway and an egress gateway?
  15. What is a ServiceEntry, and when do you need one?
  16. How do you do weighted traffic splitting for a canary release in Istio?
  17. How would you implement a canary deployment progressively shifting traffic with Istio?
  18. How do you implement a blue-green deployment using Istio?
  19. What are subsets in a DestinationRule, and how are they used?
  20. How does Istio do request routing based on headers or paths?
  21. What is fault injection in Istio, and why would you use it?
  22. How do you inject delays or aborts to test resilience with Istio?
  23. What are retries in Istio, and what are the risks of misconfiguring them?
  24. What are timeouts in Istio, and how do they interact with retries?
  25. What is a circuit breaker in Istio, and how is it configured (outlier detection, connection pool)?
  26. What is outlier detection, and how does it eject unhealthy hosts?
  27. What is mutual TLS (mTLS), and how does Istio provide it automatically?
  28. What is the difference between PERMISSIVE and STRICT mTLS mode?
  29. Why would you start with PERMISSIVE mTLS during a rollout?
  30. What is a PeerAuthentication policy?
  31. What is a RequestAuthentication policy, and how does it validate JWTs?
  32. What is an AuthorizationPolicy, and how do you enforce service-to-service access control?
  33. How does Istio enable a Zero Trust posture inside the cluster?
  34. How does Istio issue and rotate workload certificates (SPIFFE/SPIRE concepts)?
  35. What telemetry does Istio provide out of the box (metrics, logs, traces)?
  36. How does Istio integrate with Prometheus for metrics?
  37. What are the golden signals Istio exposes (latency, traffic, errors, saturation)?
  38. How does Istio enable distributed tracing, and what is required from the application?
  39. Why must applications propagate trace headers even with Istio?
  40. How does Istio integrate with Grafana and Kiali, and what does Kiali show?
  41. What is the typical latency and resource overhead of the sidecar, and how do you minimise it?
  42. How do you troubleshoot a request that is failing only inside the mesh?
  43. How do you use istioctl proxy-config and proxy-status to debug Envoy?
  44. What does istioctl analyze do?
  45. How do you debug mTLS handshake failures between two services?
  46. What is a common cause of 503 errors in Istio, and how do you diagnose it?
  47. Why might traffic bypass the sidecar, and how do you verify injection?
  48. How do you exclude certain ports or IP ranges from sidecar interception?
  49. How do you handle non-HTTP (TCP) traffic in Istio?
  50. How does Istio handle headless services and StatefulSets?
  51. What is the difference between Istio and an API gateway?
  52. How do Istio Gateways relate to the Kubernetes Gateway API?
  53. What is the Kubernetes Gateway API, and how is Istio adopting it?
  54. How do you roll out Istio to existing workloads with minimal disruption (as you did at Intuit)?
  55. How do you upgrade Istio safely (canary control plane, revision tags)?
  56. What are Istio revisions and revision tags, and why use them for upgrades?
  57. How do you do a canary upgrade of the Istio control plane?
  58. What are the failure modes if istiod is unavailable?
  59. Does the data plane keep working if the control plane goes down, and why?
  60. How do you enforce that all traffic leaving the mesh goes through an egress gateway?
  61. How would you restrict which external services workloads can reach with Istio?
  62. What is locality-aware load balancing, and why does it help latency and cost?
  63. How does Istio handle multi-cluster service discovery at a high level?
  64. What is the difference between a primary-remote and a multi-primary multi-cluster setup?
  65. How do you measure the performance impact of enabling Istio?
  66. How do you decide whether a service should be in the mesh or not?
  67. When is a service mesh overkill, and what lighter alternatives exist?
  68. How do you handle secrets and certificates for the ingress gateway (TLS termination)?
  69. What is SNI-based routing, and how does the ingress gateway use it?
  70. How would you implement rate limiting in Istio (local and global)?
  71. How do you integrate an external authorization service with Istio?
  72. How does Istio interact with NetworkPolicies — do you need both?
  73. What is the difference between L4 and L7 policy enforcement in the mesh?
  74. How do you observe and reduce the error rate of a specific service via the mesh?
  75. How would you use the mesh to enforce least-privilege between microservices?
  76. How do you test an AuthorizationPolicy before enforcing it (dry-run)?
  77. How do you roll back a bad VirtualService change quickly?
  78. What metrics would you alert on for the mesh itself?
  79. How do you capacity-plan the ingress gateway?
  80. How do you handle gradual migration of services into mTLS STRICT mode?
  81. What is the impact of the sidecar on application startup and shutdown ordering?
  82. How do you ensure the sidecar is ready before the app starts taking traffic?
  83. How do you drain connections gracefully during a rolling update with Istio?
  84. What is the role of the holdApplicationUntilProxyStarts setting?
  85. How does Istio support traffic mirroring (shadowing), and why is it useful?
  86. How would you mirror production traffic to a new version for testing?
  87. How do you debug high tail latency introduced after enabling the mesh?
  88. How do you decide retry budgets to avoid retry storms in the mesh?
  89. How does Istio help with progressive delivery alongside Argo Rollouts or Flagger?
  90. What observability gaps does Istio NOT fill that you still need application instrumentation for?
  91. How do you secure the Istio control plane itself?
  92. What RBAC is needed to manage Istio resources safely?
  93. How do you prevent a team from misconfiguring routing for a shared gateway?
  94. How would you structure Istio config ownership across many teams?
  95. How do you validate mesh config changes in CI before applying?
  96. What is your rollback strategy if an Istio upgrade degrades traffic?
  97. How do you measure whether the mesh is actually improving reliability?
  98. What recent Istio feature have you evaluated, and what value would it bring?
  99. How do you justify the operational complexity of a service mesh to leadership?

What is Istio, and what are the core capabilities it provides? (Basic)

Answer

Istio is a service mesh implementation for Kubernetes and other environments. Its core capabilities are traffic management, security, and observability: routing, canary releases, retries, timeouts, mTLS, authorization, JWT validation, metrics, logs, traces, and integration with gateways.

Technical explanation

Istio provides APIs such as VirtualService, DestinationRule, Gateway, ServiceEntry, PeerAuthentication, RequestAuthentication, and AuthorizationPolicy.

The data plane can run as Envoy sidecars or, in ambient mode, through node-level ztunnel plus optional waypoint proxies.

The control plane, mainly istiod, translates high-level Istio and Kubernetes configuration into proxy configuration.

Hands-on example

Hands-on checklist:

$ istioctl install --set profile=demo -y

$ kubectl label namespace app istio-injection=enabled

$ kubectl apply -n app -f deployment.yaml

$ istioctl proxy-status

Then add a VirtualService for traffic routing, a PeerAuthentication for mTLS, and an AuthorizationPolicy for access control.
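
As a rough illustration of the last step, the security resources might look like the sketch below. It assumes the app namespace from the checklist; the app: reviews label and the productpage service account are hypothetical, and security.istio.io/v1 assumes a recent Istio release (older releases use v1beta1).

```yaml
# Require mTLS for every workload in the (assumed) "app" namespace.
apiVersion: security.istio.io/v1
kind: PeerAuthentication
metadata:
  name: default
  namespace: app
spec:
  mtls:
    mode: STRICT
---
# Allow only GET requests from the hypothetical "productpage" service account
# to workloads labeled app=reviews; other requests to those workloads are denied.
apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: reviews-allow-productpage
  namespace: app
spec:
  selector:
    matchLabels:
      app: reviews
  action: ALLOW
  rules:
  - from:
    - source:
        principals: ["cluster.local/ns/app/sa/productpage"]
    to:
    - operation:
        methods: ["GET"]
```

In a real rollout, STRICT usually comes after a PERMISSIVE phase so that clients outside the mesh are not cut off abruptly.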

What is the difference between the Istio control plane and data plane? (Basic)

Answer

The control plane computes and distributes configuration; the data plane enforces it on live traffic. In Istio, istiod is the main control-plane component, while Envoy sidecars, ingress gateways, egress gateways, ztunnel, and waypoint proxies are data-plane components.

Technical explanation

The control plane watches Kubernetes and Istio resources, validates desired state, issues certificates, and pushes xDS configuration.

The data plane processes actual packets and requests, so it applies mTLS, routing, telemetry, retries, and policy.

A key operational point is that existing data-plane proxies continue using last-known-good config if the control plane is temporarily unavailable.

Hands-on example

Debug separation:

$ kubectl get pods -n istio-system

$ istioctl proxy-status

If istiod is unhealthy, focus on config distribution and certificates. If one service is failing while proxies are synced, inspect Envoy listeners, clusters, routes, and policies for that workload.

What is istiod, and what does it do? (Basic)

Answer

istiod is Istio's main control-plane service. It combines service discovery, configuration translation, certificate authority functions, and sidecar-injection support so the mesh proxies receive the right configuration and workload identity.

Technical explanation

istiod watches Kubernetes Services, Endpoints (EndpointSlices on current clusters), pods, namespaces, and Istio CRDs.

It pushes Envoy configuration through xDS, including listeners, routes, clusters, endpoints, and secrets.

It also supports workload certificate issuance and rotation so mTLS can be automatic rather than manually managed per service.

Hands-on example

Useful commands:

$ kubectl -n istio-system get deploy,svc,pods -l app=istiod

$ kubectl -n istio-system logs deploy/istiod --tail=100

$ istioctl proxy-status

When proxies are stale or rejected, compare istiod logs with the proxy-status output before changing application code.

What is Envoy, and what role does it play in Istio? (Basic)

Answer

Envoy is the high-performance proxy Istio uses to enforce mesh behavior. In sidecar mode, each workload pod gets an Envoy proxy; at the edge, ingress and egress gateways are Envoy proxies; in ambient mode, waypoint proxies use Envoy for L7 features.

Technical explanation

Envoy can terminate and originate mTLS, route HTTP/gRPC/TCP traffic, collect metrics, enforce policies, and perform retries or circuit breaking.

Istio programs Envoy dynamically using xDS, so operators manage intent through Istio resources rather than hand-writing Envoy config.

For troubleshooting, Envoy is often the best source of truth because it shows the actual listeners, clusters, routes, and endpoints in use.

Hands-on example

Inspect Envoy for a pod:

$ istioctl proxy-config listener deploy/productpage -n app

$ istioctl proxy-config route deploy/productpage -n app

$ istioctl proxy-config cluster deploy/productpage -n app | grep reviews

If a route is missing here, the problem lies in the mesh configuration or its distribution, not in the application binary.

What is the sidecar pattern, and how does Istio inject the proxy? (Basic)

Answer

The sidecar pattern runs an auxiliary container alongside the application container in the same pod. Istio injects an Envoy sidecar so all inbound and outbound traffic can be intercepted, secured, routed, and observed without changing the application process.

Technical explanation

Injection is usually done by a Kubernetes mutating admission webhook triggered by namespace labels or revision labels.

Traffic redirection is configured by init containers or Istio CNI so application traffic flows through Envoy.

The application still listens on its normal port; Envoy becomes the policy and telemetry enforcement point around it.

Hands-on example

Verify sidecar injection:

$ kubectl label namespace payments istio-injection=enabled

$ kubectl rollout restart deploy -n payments

$ kubectl get pod -n payments -o jsonpath='{.items[0].spec.containers[*].name}'

Expected output includes the application container and istio-proxy.

How does automatic sidecar injection work (namespace label, webhook)? (Basic)

Answer

Automatic sidecar injection uses a Kubernetes mutating admission webhook. When a pod is created in a labeled namespace, the webhook patches the pod spec to add the istio-proxy container, volumes, environment, lifecycle settings, and traffic-redirection configuration.

Technical explanation

The classic label is istio-injection=enabled. For revision-based installs, teams use istio.io/rev or a revision tag.

Injection only happens when the pod is created, so existing pods must be restarted after a namespace label change.

Injection can be disabled per pod with the pod label sidecar.istio.io/inject: 'false' (older releases used an annotation of the same name) when a workload must stay outside the mesh.

Hands-on example

Example:

$ kubectl label namespace app istio.io/rev=stable --overwrite

$ kubectl rollout restart deployment -n app

$ kubectl describe pod -n app <pod> | grep -A3 istio-proxy

If the pod has only one container, check namespace labels, webhook status, and the pod's own injection labels or annotations.
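
The labels involved can be sketched declaratively as below. This is a minimal example under stated assumptions: the legacy-batch Deployment and its image are hypothetical, and the namespace uses the classic istio-injection label rather than a revision label.

```yaml
# Namespace opted in to automatic injection (classic label).
apiVersion: v1
kind: Namespace
metadata:
  name: app
  labels:
    istio-injection: enabled
---
# Hypothetical workload that opts out of injection at the pod level.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: legacy-batch
  namespace: app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: legacy-batch
  template:
    metadata:
      labels:
        app: legacy-batch
        sidecar.istio.io/inject: "false"   # pod-level opt-out
    spec:
      containers:
      - name: legacy-batch
        image: registry.example.com/legacy-batch:1.0   # hypothetical image
```

Because the webhook only runs at pod creation, changing either label has no effect on running pods until they are recreated.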

What is the Istio ambient (sidecarless) mode, and how does it differ from sidecar mode? (Basic)

Answer

Istio ambient mode is a sidecarless data-plane mode. Instead of injecting an Envoy sidecar into every pod, ambient mode uses node-level ztunnel for secure L4 mesh behavior and optional waypoint proxies when a workload needs L7 features.

Technical explanation

Sidecar mode gives each workload its own Envoy proxy, which provides very granular L7 control but adds per-pod resource overhead and lifecycle considerations.

Ambient mode reduces per-pod proxy footprint and can simplify onboarding because workloads do not need sidecar injection to join the mesh.

The tradeoff is architectural: L4 capabilities are handled by ztunnel, while L7 policy and routing require waypoint proxies.

Hands-on example

Migration sketch:

1. Install ambient components and Istio CNI.

2. Label a test namespace for ambient mode.

3. Validate L4 mTLS and basic connectivity.

4. Add a waypoint only for services that need L7 routing or authorization.

5. Update dashboards because telemetry labels can differ from sidecar mode.
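
Step 2 of the sketch above comes down to a single namespace label. The namespace name ambient-test is hypothetical; the label key and value are the ones Istio uses to enroll a namespace in ambient mode.

```yaml
# Hypothetical test namespace enrolled in ambient mode.
# Pods here are captured by ztunnel; no sidecar is injected and no restart is needed.
apiVersion: v1
kind: Namespace
metadata:
  name: ambient-test
  labels:
    istio.io/dataplane-mode: ambient
```

Removing the label takes the namespace back out of the mesh, which makes this a low-risk first step for a trial.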

What is the difference between ztunnel and a waypoint proxy in ambient mode? (Basic)

Answer

ztunnel is the node-level secure overlay component in ambient mode, while a waypoint proxy is an optional L7 Envoy proxy for a service, namespace, or security boundary. ztunnel handles L4 identity, mTLS, and routing; waypoints handle HTTP-aware features such as L7 routing and authorization.

Technical explanation

ztunnel is deployed per node and captures traffic for ambient workloads without modifying every pod.

Waypoint proxies are used when traffic needs L7 decisions based on HTTP path, method, headers, JWT claims, or advanced authorization.

This split lets teams avoid a sidecar everywhere while still enabling deeper policy where needed.

Hands-on example

Example decision:

Service A only needs encrypted service-to-service traffic: use ambient ztunnel only.

Service B needs path-based allow/deny and HTTPRoute traffic splitting: attach a waypoint to that service or namespace.

Check components:

$ kubectl get ds -n istio-system ztunnel

$ kubectl get gateways.gateway.networking.k8s.io -A | grep waypoint
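
For Service B, a waypoint is declared as a Kubernetes Gateway API resource. The sketch below assumes the Gateway API CRDs are installed and uses a hypothetical namespace named app; the gateway class, port, and protocol follow the shape Istio uses for waypoints.

```yaml
# Waypoint proxy for services in the (assumed) "app" namespace.
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: waypoint
  namespace: app
  labels:
    istio.io/waypoint-for: service   # handle traffic addressed to services
spec:
  gatewayClassName: istio-waypoint
  listeners:
  - name: mesh
    port: 15008        # HBONE tunnel port used between ztunnel and the waypoint
    protocol: HBONE
```

Traffic only flows through the waypoint once a namespace or service is labeled istio.io/use-waypoint: waypoint, so Service A can stay on ztunnel alone.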

What problem does Istio solve that Kubernetes Services alone do not? (Basic)

Answer

Kubernetes Services provide stable virtual IPs, DNS names, and basic L4 load balancing. Istio adds service identity, mTLS, L7 routing, retries, timeouts, circuit breaking, telemetry, and policy controls that Kubernetes Services alone do not provide.

Technical explanation

A Kubernetes Service does not know that version v2 should receive 5 percent of traffic or that requests with a specific header should go to a canary.

Kubernetes NetworkPolicy can control L3/L4 network access, but it does not provide HTTP-method, path, JWT-claim, or service-identity decisions at L7.

Istio complements Kubernetes rather than replacing Services; it uses Services as part of service discovery.

Hands-on example

Compare:

Kubernetes Service: app calls http://reviews.default.svc.cluster.local.

Istio VirtualService: route 90 percent to reviews v1 and 10 percent to reviews v2, with timeout, retry, and telemetry.

This gives release control without changing the application endpoint.
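
The Istio side of that comparison could look roughly like this. It is a sketch that assumes a DestinationRule already defines subsets v1 and v2 for reviews; the timeout and retry values are illustrative, not recommendations.

```yaml
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: reviews
  namespace: default
spec:
  hosts:
  - reviews.default.svc.cluster.local
  http:
  - route:
    - destination:
        host: reviews.default.svc.cluster.local
        subset: v1          # assumed to be defined in a DestinationRule
      weight: 90
    - destination:
        host: reviews.default.svc.cluster.local
        subset: v2
      weight: 10
    timeout: 2s             # illustrative overall budget
    retries:
      attempts: 2           # illustrative; keep attempts x perTryTimeout under the timeout
      perTryTimeout: 500ms
      retryOn: connect-failure,refused-stream
```

The caller keeps using the same Service hostname; only the proxies change how requests are distributed.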

How does traffic flow through an Envoy sidecar for inbound and outbound requests? (Basic)

Answer

In sidecar mode, outbound traffic from the application is redirected to the local Envoy sidecar, which applies outbound routing, mTLS, policy, and telemetry before sending to the destination. Inbound traffic reaches the destination sidecar first, then Envoy forwards it to the application container.

Technical explanation

Outbound path: app process -> local Envoy -> destination Envoy or gateway -> destination app.

Inbound path: network -> destination Envoy -> local application port.

Because Envoy is on both sides, Istio can authenticate workload identity, encrypt traffic, produce source/destination metrics, and enforce routing consistently.

Hands-on example

Trace a request:

$ kubectl exec -n app deploy/sleep -c sleep -- curl -s http://reviews:9080/reviews/1

$ istioctl proxy-config route deploy/sleep -n app

$ istioctl proxy-config cluster deploy/sleep -n app | grep reviews

If the route exists outbound but the destination listener is missing, inspect the destination proxy next.

What is a VirtualService, and what does it control? (Basic)

Answer

A VirtualService defines traffic-routing rules for one or more hosts. It matches requests on attributes such as URI path, headers, and port, then controls where they go and how: weighted destinations, retries, timeouts, fault injection, and mirroring.

Technical explanation

VirtualService is usually paired with DestinationRule when routing to named subsets such as v1 and v2.

It can apply to internal mesh traffic or to traffic entering through a Gateway.

It is a powerful production object, so changes should be reviewed, validated, and rolled out like application code.

Hands-on example

Example route by path:

apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts: [reviews]
  http:
  - match:
    - uri:
        prefix: /v2
    route:
    - destination:
        host: reviews
        subset: v2
  - route:
    - destination:
        host: reviews
        subset: v1

What is a DestinationRule, and how does it relate to a VirtualService? (Basic)

Answer

A DestinationRule defines policies for traffic after routing has selected a service destination. It commonly declares subsets, load-balancing behavior, connection-pool limits, TLS mode, and outlier detection. VirtualService chooses where traffic goes; DestinationRule defines how traffic behaves at that destination.

Technical explanation

Subsets map logical labels such as v1 and v2 to workload labels on pods.

Traffic policies can be global for a host or overridden per subset.

Without a matching DestinationRule subset, a VirtualService that references subset v2 has no upstream to route to, and callers typically see 503 errors.

Hands-on example

Example subset definition:

apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: reviews
spec:
  host: reviews
  subsets:
  - name: v1
    labels:
      version: v1
  - name: v2
    labels:
      version: v2
  trafficPolicy:
    loadBalancer:
      simple: LEAST_REQUEST

What is a Gateway resource, and how does it differ from a Kubernetes Ingress? (Basic)

Answer

An Istio Gateway configures an Envoy gateway proxy to accept traffic on specific ports, hosts, and TLS settings. Kubernetes Ingress is a simpler Kubernetes API for HTTP ingress, while Istio Gateway gives Istio-native control and is often paired with VirtualService for detailed routing.

Technical explanation

A Gateway selects gateway pods by label and describes what traffic those proxies should listen for.

A VirtualService then binds to that Gateway and defines routing to internal services.

For newer designs, Kubernetes Gateway API is increasingly preferred because it standardizes Gateway and route resources across implementations.

Hands-on example

Ingress pattern:

apiVersion: networking.istio.io/v1
kind: Gateway
metadata:
  name: public-gw
spec:
  selector:
    istio: ingressgateway
  servers:
  - port:
      number: 443
      name: https
      protocol: HTTPS
    tls:
      mode: SIMPLE
      credentialName: app-tls
    hosts: [app.example.com]

What is the difference between an Istio ingress gateway and an egress gateway? (Basic)

Answer

An ingress gateway controls traffic entering the mesh from outside, while an egress gateway controls traffic leaving the mesh to external services. Ingress is about exposing internal services safely; egress is about centralizing and auditing outbound access.

Technical explanation

Ingress gateway concerns include TLS termination, WAF/load-balancer integration, host routing, and external client authentication.

Egress gateway concerns include restricting destinations, consistent TLS origination, network allowlisting, and audit logs for outbound calls.

Both are data-plane proxies, but their security boundaries and operational runbooks are different.

Hands-on example

Egress use case:

Only the istio-egressgateway has firewall access to api.partner.com.

Workloads call the external host through ServiceEntry and VirtualService.

Network teams allow outbound internet only from the egress gateway nodes or security group, giving a single audited path.

What is a ServiceEntry, and when do you need one? (Basic)

Answer

A ServiceEntry adds external or otherwise non-Kubernetes services to Istio's service registry. I use it when mesh workloads must call an external API, database, VM, or service that Istio cannot discover from Kubernetes Services.

Technical explanation

ServiceEntry lets Istio understand the host, ports, protocols, resolution mode, and endpoints for external services.

It is required in locked-down meshes when outbound traffic policy allows only registered external services.

It can be combined with VirtualService, DestinationRule, and egress gateway routing.

Hands-on example

Example external API:

apiVersion: networking.istio.io/v1
kind: ServiceEntry
metadata:
  name: partner-api
spec:
  hosts: [api.partner.com]
  location: MESH_EXTERNAL
  ports:
  - number: 443
    name: https
    protocol: TLS
  resolution: DNS
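
The locked-down behavior mentioned in the technical explanation comes from the mesh-wide outbound traffic policy. One way to set it, assuming an istioctl-based install driven by an IstioOperator file, is sketched here.

```yaml
# Install-time mesh config: only hosts known to the service registry
# (Kubernetes Services plus ServiceEntry hosts) are reachable from the mesh.
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    outboundTrafficPolicy:
      mode: REGISTRY_ONLY   # the default, ALLOW_ANY, passes unknown hosts through
```

With REGISTRY_ONLY in place, a call to api.partner.com fails until a ServiceEntry like the one above exists, which is what makes the ServiceEntry an explicit allowlist entry.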

How do you do weighted traffic splitting for a canary release in Istio? (Basic)

Answer

Weighted traffic splitting is done with a VirtualService that sends percentages of traffic to different DestinationRule subsets. For a canary, I might route 95 percent to v1 and 5 percent to v2, observe metrics, then gradually increase v2.

Technical explanation

The DestinationRule defines subsets such as v1 and v2 based on pod labels.

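The VirtualService assigns integer weights to each subset. Current Istio treats weights as relative proportions, but keeping them summing to 100 makes them read as percentages, and older releases required exactly that.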

Canary decisions should be based on error rate, latency, saturation, and business metrics rather than time alone.

Hands-on example

Canary example:

http:
- route:
  - destination:
      host: checkout
      subset: stable
    weight: 95
  - destination:
      host: checkout
      subset: canary
    weight: 5

Watch:

$ kubectl -n istio-system port-forward svc/prometheus 9090

Query istio_requests_total and istio_request_duration_milliseconds, grouped by destination_version.

How would you implement a canary deployment progressively shifting traffic with Istio? (Basic)

Answer

I implement progressive canary by deploying the new version beside the stable version, routing a small percentage with Istio, validating telemetry and business checks, then increasing traffic in controlled steps. If SLOs degrade, I immediately set the canary weight back to zero.

Technical explanation

Start with 1 to 5 percent or header-only traffic, depending on risk.

Use automated gates for p95 latency, 5xx rate, dependency errors, and application-specific correctness signals.

Keep the old version deployed until the new version has survived normal and peak traffic.

Hands-on example

Rollout sequence:

1. Deploy checkout v2 with label version=v2.

2. Create DestinationRule subsets stable and canary.

3. Set VirtualService weights 99/1.

4. After metrics pass, move stable/canary weights to 95/5, 90/10, 75/25, 50/50, and finally 0/100.

5. Roll back by applying the previous VirtualService from Git.
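
Steps 2 and 3 can be sketched as the pair of resources below. The version: v1 label on the stable pods is an assumption; the source only states that v2 carries version=v2.

```yaml
# Step 2: subsets keyed on the pod "version" label.
apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: checkout
spec:
  host: checkout
  subsets:
  - name: stable
    labels:
      version: v1      # assumed label on the current version
  - name: canary
    labels:
      version: v2
---
# Step 3: start at 99/1; later steps only change the two weights.
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: checkout
spec:
  hosts: [checkout]
  http:
  - route:
    - destination:
        host: checkout
        subset: stable
      weight: 99
    - destination:
        host: checkout
        subset: canary
      weight: 1
```

Keeping each weight change as its own Git commit makes step 5 a plain revert.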

How do you implement a blue-green deployment using Istio? (Basic)

Answer

For blue-green deployment, I keep two complete versions available and switch traffic at the routing layer. In Istio, blue and green are DestinationRule subsets or separate services, and the VirtualService points all traffic to one environment until cutover.

Technical explanation

Blue-green is simpler than a long canary when compatibility risk is low but cutover needs to be quick.

The green environment should receive smoke tests and possibly mirrored traffic before receiving real users.

Rollback is a routing change back to blue, but database migrations must be backward-compatible or explicitly rolled back.

Hands-on example

VirtualService cutover:

Before: route 100 percent to subset blue.

After validation: route 100 percent to subset green.

Rollback: reapply the previous Git revision.

Commands:

$ kubectl rollout status deploy/checkout-green

$ kubectl apply -f virtualservice-green.yaml

$ istioctl proxy-config route deploy/istio-ingressgateway -n istio-system | grep checkout
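
The contents of virtualservice-green.yaml are not shown in the example, so the following is only a guess at its shape. It assumes a Gateway named public-gw in the same namespace, an external host app.example.com, and a DestinationRule that defines subsets blue and green for checkout.

```yaml
# Hypothetical virtualservice-green.yaml: all ingress traffic goes to green.
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: checkout
spec:
  hosts:
  - app.example.com      # assumed external host
  gateways:
  - public-gw            # assumed Gateway in the same namespace
  http:
  - route:
    - destination:
        host: checkout
        subset: green    # the pre-cutover revision pointed at "blue"
      weight: 100
```

The blue variant differs in a single field, which keeps the cutover diff and the rollback diff trivially reviewable.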

What are subsets in a DestinationRule, and how are they used? (Basic)

Answer

Subsets are named groups of endpoints for a service, usually selected by pod labels such as version: v1 or version: v2. They are defined in a DestinationRule and then referenced by VirtualService routes.

Technical explanation

Subsets let routing policy target logical versions without creating separate Kubernetes Services for every release.

Each subset can have its own traffic policy, such as load balancing, connection pools, or TLS settings.

The pod labels must match exactly, otherwise the subset has no endpoints and routing can fail with 503-style errors.

Hands-on example

Validate subset endpoints:

$ kubectl get pods -n app -l app=reviews --show-labels

$ istioctl proxy-config endpoints deploy/productpage -n app | grep reviews

If subset v2 has no endpoints, fix deployment labels or DestinationRule subset labels before changing retry policy.

How does Istio do request routing based on headers or paths? (Basic)

Answer

Istio routes by matching request attributes in a VirtualService. For HTTP traffic, it can match URI prefixes or exact paths, methods, headers, query parameters, gateways, ports, and source labels, then route to a destination subset or service.

Technical explanation

Header-based routing is useful for internal testers, beta users, or requests carrying a specific release header.

Path-based routing is common at ingress gateways for routing /api, /admin, or /static to different backends.

Match rules are evaluated in order, so specific rules should come before general catch-all routes.

Hands-on example

Header route example:

match:
- headers:
    x-canary-user:
      exact: 'true'
route:
- destination:
    host: checkout
    subset: canary

Then test:

$ curl -H 'x-canary-user: true' https://app.example.com/checkout

What is fault injection in Istio, and why would you use it? (Basic)

Answer

Fault injection intentionally adds delays or aborts to mesh traffic so teams can test timeout behavior, retry safety, fallbacks, and user impact. It is a deliberate, controlled resilience test rather than an accidental production failure.

Technical explanation

Delay faults simulate slow dependencies, network latency, or saturated downstream services.

Abort faults simulate HTTP errors such as 500 or 503 responses.

Fault injection should be scoped carefully to a test namespace, header, or small traffic segment to avoid broad production impact.

Hands-on example

Example test plan:

1. Match only requests with header x-chaos-test: true.

2. Inject a 2 second delay to ratings.

3. Confirm checkout timeout is lower than user SLA and fallback is graceful.

4. Remove the VirtualService fault rule after the test.
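
Steps 1 and 2 of the plan might translate into a VirtualService like the one below. It is a sketch: it assumes the ratings service is addressed by the short host ratings and that checkout forwards the x-chaos-test header on its call to ratings.

```yaml
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: ratings
spec:
  hosts: [ratings]
  http:
  # Only requests carrying the test header get the injected delay.
  - match:
    - headers:
        x-chaos-test:
          exact: "true"
    fault:
      delay:
        percentage:
          value: 100     # every matched request; the header already limits scope
        fixedDelay: 2s
    route:
    - destination:
        host: ratings
  # All other traffic is routed normally.
  - route:
    - destination:
        host: ratings
```

The match is evaluated by the calling workload's proxy, so if the header is not propagated to the ratings hop, the fault never triggers and the test silently passes.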

How do you inject delays or aborts to test resilience with Istio? (Basic)

Answer

Delays and aborts are configured under the fault section of a VirtualService HTTP route. A delay pauses matched requests before forwarding; an abort returns a configured error directly from the proxy.

Technical explanation

Use percentage fields to limit blast radius.

Use header matching so only test traffic is affected.

Always validate that retries, timeouts, and application fallback behavior interact as expected.

Hands-on example

Delay example:

fault:
  delay:
    percentage:
      value: 10
    fixedDelay: 2s

Abort example:

fault:
  abort:
    percentage:
      value: 5
    httpStatus: 503

Test:

$ curl -H 'x-chaos-test: true' http://checkout/

What are retries in Istio, and what are the risks of misconfiguring them? (Basic)

Answer

Retries let Envoy automatically retry failed requests under configured conditions. They can improve resilience for transient failures, but misconfigured retries can amplify load, duplicate non-idempotent operations, and create retry storms during incidents.

Technical explanation

Retries are safer for idempotent GET or read operations than for payment, order creation, or side-effecting writes.

Retry attempts must be bounded by timeout budgets and downstream capacity.

Retry policies should specify retryOn conditions, attempts, perTryTimeout, and overall route timeout.

Hands-on example

Safe-ish retry example for a read API:

retries:
  attempts: 2
  perTryTimeout: 300ms
  retryOn: gateway-error,connect-failure,refused-stream
timeout: 1s

Do not blindly apply this to POST /charge. For writes, prefer idempotency keys and explicit application-level retry design.

What are timeouts in Istio, and how do they interact with retries? (Basic)

Answer

A timeout defines the maximum time a request is allowed to take before Envoy stops waiting. Timeouts and retries must be designed together because each retry consumes part of the overall latency budget.

Technical explanation

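If the overall timeout is 1 second and perTryTimeout is 400 ms with 2 retries, the initial try plus two retries can need up to 1.2 seconds, so the overall timeout cuts off the last try and leaves no room for network and application variability.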

Too-long timeouts keep resources tied up and increase queueing; too-short timeouts cause false failures.

Timeouts should align with upstream SLOs, downstream behavior, and client expectations.

Hands-on example

Example budget:

Client SLA: 2s.

Gateway timeout: 1800ms.

Service A to Service B timeout: 800ms.

Retries: attempts=2, perTryTimeout=250ms.

Validation:

$ fortio load -qps 50 -t 2m http://checkout/

Watch p95, p99, retry count, and 5xx rate.
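
The Service A to Service B line of that budget could be expressed as follows. The host name service-b is hypothetical, and the retryOn conditions are borrowed from the retry example for read APIs rather than taken from this budget.

```yaml
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: service-b
spec:
  hosts: [service-b]          # hypothetical host for Service B
  http:
  - route:
    - destination:
        host: service-b
    timeout: 800ms            # overall budget for the A -> B call
    retries:
      attempts: 2             # two retries after the initial try
      perTryTimeout: 250ms    # worst case 3 x 250ms = 750ms, plus retry backoff
      retryOn: gateway-error,connect-failure,refused-stream
```

The 50 ms of headroom is thin once backoff between tries is counted, so the load test is what confirms whether the last retry actually completes inside the 800 ms budget.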

What is a circuit breaker in Istio, and how is it configured (outlier detection, connection pool)? (Basic)

Answer

In Istio, circuit breaking is configured through DestinationRule trafficPolicy, mainly connectionPool and outlierDetection. It protects services by limiting connections and pending requests and by temporarily ejecting unhealthy endpoints from load balancing.

Technical explanation

Connection-pool settings prevent a caller from overwhelming a downstream service with too many concurrent connections or queued requests.

Outlier detection removes endpoints that repeatedly fail, which reduces traffic to bad pods while they recover.

Circuit breaking must be tuned with realistic traffic tests because too-aggressive limits can create artificial outages.

Hands-on example

DestinationRule sketch:

trafficPolicy:
  connectionPool:
    tcp:
      maxConnections: 100
    http:
      http1MaxPendingRequests: 50
      maxRequestsPerConnection: 10
  outlierDetection:
    consecutive5xxErrors: 5
    interval: 10s
    baseEjectionTime: 30s
    maxEjectionPercent: 50

What is outlier detection, and how does it eject unhealthy hosts? (Basic)

Answer

Outlier detection is Envoy's mechanism for identifying unhealthy upstream endpoints and temporarily ejecting them from the load-balancing pool. In Istio, it is configured in a DestinationRule under trafficPolicy.outlierDetection.

Technical explanation

It can react to consecutive 5xx responses, gateway errors, local-origin failures, or success-rate based signals depending on configuration and protocol.

Ejected hosts are not removed forever; they are reintroduced after the ejection time, then evaluated again.

It complements Kubernetes readiness probes but catches runtime failures visible from client traffic.

Hands-on example

Troubleshoot ejection:

$ istioctl proxy-config clusters deploy/frontend -n app -o json | grep -i outlier

$ kubectl logs deploy/frontend -c istio-proxy -n app | grep -i outlier

Then correlate with destination pod logs and readiness status.
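
To make the ejection behaviour concrete, here is a simplified simulation of consecutive-5xx ejection. It is a sketch under stated assumptions, not Envoy's implementation: the class and method names are hypothetical, the clock is a plain number of seconds, the ejection time grows linearly with the number of ejections, and details such as the periodic sweep interval and success-rate detection are left out.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    name: str
    consecutive_5xx: int = 0
    times_ejected: int = 0
    ejected_until: float = 0.0  # seconds on the simulated clock


@dataclass
class OutlierDetector:
    consecutive_5xx_errors: int = 5
    base_ejection_time_s: float = 30.0
    max_ejection_percent: int = 50
    hosts: dict = field(default_factory=dict)

    def add_host(self, name: str) -> None:
        self.hosts[name] = Host(name)

    def healthy_hosts(self, now: float) -> list:
        """Hosts currently eligible for load balancing."""
        return [h.name for h in self.hosts.values() if h.ejected_until <= now]

    def record(self, name: str, status: int, now: float) -> bool:
        """Record one response; return True if it caused an ejection."""
        host = self.hosts[name]
        if status < 500:
            host.consecutive_5xx = 0  # any non-5xx response resets the streak
            return False
        host.consecutive_5xx += 1
        if host.consecutive_5xx < self.consecutive_5xx_errors:
            return False
        ejected = len(self.hosts) - len(self.healthy_hosts(now))
        # Refuse the ejection if it would push the pool past the cap.
        if (ejected + 1) * 100 > self.max_ejection_percent * len(self.hosts):
            return False
        host.times_ejected += 1
        host.consecutive_5xx = 0
        host.ejected_until = now + self.base_ejection_time_s * host.times_ejected
        return True


detector = OutlierDetector()
for name in "abcd":
    detector.add_host(name)
ejections = [detector.record("a", 503, now=0.0) for _ in range(5)]
print(ejections)                     # [False, False, False, False, True]
print(detector.healthy_hosts(10.0))  # ['b', 'c', 'd']
print(detector.healthy_hosts(30.0))  # ['a', 'b', 'c', 'd']
```

The model shows the two properties worth remembering in an interview: ejection is temporary, and maxEjectionPercent keeps a bad deploy from emptying the whole pool.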

What is mutual TLS (mTLS), and how does Istio provide it automatically? (Basic)

Answer

mTLS means both client and server authenticate each other using certificates, then encrypt the connection. Istio provides this automatically by issuing workload certificates, configuring proxies with identities, and using those identities during service-to-service communication.

Technical explanation

Each workload gets a SPIFFE-like identity tied to its service account and trust domain.

Envoy proxies use certificates from Istio to establish encrypted and authenticated connections.

Once mTLS is enabled, policy can reason about authenticated service identity instead of relying only on IP addresses.

Hands-on example

Check mTLS:

$ istioctl x describe pod <frontend-pod> -n app

$ istioctl proxy-config secret deploy/frontend -n app

Apply STRICT in a namespace:

apiVersion: security.istio.io/v1
kind: PeerAuthentication
metadata:
  name: default
  namespace: app
spec:
  mtls:
    mode: STRICT

What is the difference between PERMISSIVE and STRICT mTLS mode? (Basic)

Answer

PERMISSIVE mTLS accepts both plaintext and mTLS traffic, while STRICT requires mTLS. PERMISSIVE is useful during migration; STRICT is the target for strong zero-trust enforcement inside the mesh.

Technical explanation

PERMISSIVE lets meshed and non-meshed workloads communicate while sidecars or ambient enrollment are rolled out.

STRICT prevents plaintext clients from connecting to protected workloads.

A namespace should move to STRICT only after all expected callers are in the mesh and telemetry shows mTLS is being used.

Hands-on example

Migration check:

$ istioctl x describe pod <backend-pod> -n app

$ kubectl get pods -n app --show-labels

$ kubectl get pods -n app -o 'custom-columns=NAME:.metadata.name,CONTAINERS:.spec.containers[*].name'

If any required client lacks istio-proxy or ambient enrollment, do not switch that path to STRICT yet.

Why would you start with PERMISSIVE mTLS during a rollout? (Basic)

Answer

I start with PERMISSIVE mTLS because it reduces migration risk. It allows existing plaintext clients and newly meshed clients to coexist while we identify traffic paths, fix missing injection, and validate that mTLS is actually negotiated before enforcing STRICT.

Technical explanation

Large clusters often have cronjobs, legacy clients, external callers, and ad-hoc tools that are easy to miss.

PERMISSIVE mode lets telemetry expose which workloads are using mTLS without immediately causing outages.

The migration should still have a deadline; PERMISSIVE should be a rollout phase, not the final security posture.

Hands-on example

Rollout plan:

1. Enable sidecar injection or ambient in one namespace.

2. Apply PeerAuthentication PERMISSIVE.

3. Verify mTLS status (for example with istioctl x describe pod) and request metrics.

4. Fix non-mesh callers.

5. Apply STRICT during a controlled window.

6. Alert on plaintext attempts or 403/503 spikes.

What is a PeerAuthentication policy? (Basic)

Answer

A PeerAuthentication policy controls how workloads accept peer connections, especially mTLS mode. It can be applied mesh-wide, namespace-wide, or workload-specific, and it determines whether inbound traffic must use mutual TLS.

Technical explanation

PeerAuthentication is about peer identity and transport authentication, not end-user JWT authentication.

Workload-specific policies use selectors; namespace policies without selectors apply broadly in that namespace.

It is commonly used to move from PERMISSIVE to STRICT mTLS in stages.

Hands-on example

Namespace STRICT example:

apiVersion: security.istio.io/v1
kind: PeerAuthentication
metadata:
  name: default
  namespace: payments
spec:
  mtls:
    mode: STRICT

Validate:

$ istioctl analyze -n payments

$ istioctl x describe pod <api-pod> -n payments

What is a RequestAuthentication policy, and how does it validate JWTs? (Basic)

Answer

RequestAuthentication validates end-user or caller JWTs at the proxy. It tells Istio where to find the token and how to validate it using an issuer and JWKS. It authenticates the request token but does not by itself authorize access; AuthorizationPolicy enforces what is allowed.

Technical explanation

RequestAuthentication produces authenticated request principal information when the JWT is valid.

Requests with an invalid token are rejected where the policy applies, but requests with no token are still accepted by default; to make a token mandatory, add an AuthorizationPolicy that requires a request principal.

It is useful at ingress gateways and internal services that need consistent JWT validation.

Hands-on example

JWT policy sketch:

apiVersion: security.istio.io/v1
kind: RequestAuthentication
metadata:
  name: app-jwt
  namespace: app
spec:
  selector:
    matchLabels:
      app: orders
  jwtRules:
  - issuer: https://issuer.example.com/
    jwksUri: https://issuer.example.com/.well-known/jwks.json

Then add AuthorizationPolicy requiring requestPrincipals.

What is an AuthorizationPolicy, and how do you enforce service-to-service access control? (Basic)

Answer

AuthorizationPolicy enforces access control for workloads. It can allow, deny, or audit traffic based on source workload identity, namespace, principals, HTTP methods, paths, ports, IP blocks, and JWT claims.

Technical explanation

mTLS gives authenticated workload identity; AuthorizationPolicy uses that identity to enforce least privilege.

DENY policies are evaluated before ALLOW policies, so a broad DENY can break traffic across a namespace and must be scoped carefully.

Policies should be tested with dry run or limited scope before production enforcement.

Hands-on example

Allow only frontend to call orders:

apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: allow-frontend
  namespace: orders
spec:
  selector:
    matchLabels:
      app: orders
  action: ALLOW
  rules:
  - from:
    - source:
        principals: ["cluster.local/ns/frontend/sa/frontend"]
    to:
    - operation:
        methods: ["GET", "POST"]
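
It helps to be able to explain how several policies on one workload combine. The sketch below models the documented evaluation order in simplified form: a matching DENY wins, a workload with no ALLOW policies admits the request, and otherwise at least one ALLOW must match. CUSTOM and AUDIT actions are left out, and the Request and Policy types are hypothetical stand-ins rather than Istio APIs.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Request:
    principal: str  # authenticated source identity from mTLS
    method: str


@dataclass
class Policy:
    action: str  # "ALLOW" or "DENY"
    matches: Callable[[Request], bool]


def authorize(policies: List[Policy], request: Request) -> bool:
    """Return True if the request is admitted by the policy set."""
    # 1. Any matching DENY policy rejects the request.
    if any(p.action == "DENY" and p.matches(request) for p in policies):
        return False
    allow = [p for p in policies if p.action == "ALLOW"]
    # 2. With no ALLOW policies on the workload, the request is allowed.
    if not allow:
        return True
    # 3. Otherwise at least one ALLOW policy must match.
    return any(p.matches(request) for p in allow)


frontend = "cluster.local/ns/frontend/sa/frontend"
allow_frontend = Policy(
    "ALLOW",
    lambda r: r.principal == frontend and r.method in {"GET", "POST"},
)

print(authorize([allow_frontend], Request(frontend, "GET")))     # True
print(authorize([allow_frontend], Request(frontend, "DELETE")))  # False
print(authorize([], Request("cluster.local/ns/batch/sa/job", "GET")))  # True
```

The last line is the reason a default-deny policy matters: a workload with no ALLOW policy at all accepts every caller.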

How does Istio enable a Zero Trust posture inside the cluster? (Intermediate)

Answer

Istio enables zero trust by giving workloads strong identities, encrypting service-to-service traffic with mTLS, enforcing explicit authorization policies, validating request credentials, and producing audit-friendly telemetry for every service edge.

Technical explanation

Zero trust means the network location is not enough to trust a caller; identity and policy must be verified on each request path.

Istio can enforce service-account based access instead of relying only on pod IPs or flat cluster networking.

It should be combined with Kubernetes RBAC, NetworkPolicy, secret management, image security, and admission controls for a complete posture.

Hands-on example

Zero-trust rollout:

1. Standardize service accounts per workload.

2. Enable mTLS STRICT.

3. Create default-deny AuthorizationPolicy per namespace.

4. Add explicit ALLOW policies for known service edges.

5. Monitor denied traffic and fix legitimate flows through Git-reviewed policy changes.

How does Istio issue and rotate workload certificates (SPIFFE/SPIRE concepts)? (Intermediate)

Answer

Istio issues and rotates workload certificates through its CA functionality in istiod. Workload identities are commonly represented as SPIFFE-style URIs based on trust domain, namespace, and service account, which allows proxies to authenticate services rather than IP addresses.

Technical explanation

A typical identity looks like spiffe://cluster.local/ns/payments/sa/payments-api.

The proxy obtains certificates and secrets from the control plane and uses them for mTLS handshakes.

SPIRE is a separate SPIFFE implementation; Istio uses SPIFFE concepts and can integrate with external CA or trust-domain models depending on architecture.

Hands-on example

Inspect a workload cert:

$ istioctl proxy-config secret deploy/payments-api -n payments

$ istioctl proxy-config secret deploy/payments-api -n payments -o json

Check subject, SAN URI, expiration, and whether certificates are rotating before expiry.
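
When checking the SAN URI it helps to know exactly which parts to expect. This small sketch builds and parses the spiffe://<trust-domain>/ns/<namespace>/sa/<service-account> form described above; the helper names are illustrative and not part of any Istio library.

```python
from urllib.parse import urlparse


def workload_identity(trust_domain: str, namespace: str, service_account: str) -> str:
    """Build the SPIFFE-style URI used as the workload identity."""
    return f"spiffe://{trust_domain}/ns/{namespace}/sa/{service_account}"


def parse_workload_identity(uri: str) -> dict:
    """Split a workload identity URI into its three components."""
    parsed = urlparse(uri)
    parts = parsed.path.strip("/").split("/")
    if (parsed.scheme != "spiffe" or len(parts) != 4
            or parts[0] != "ns" or parts[2] != "sa"):
        raise ValueError(f"not a workload identity URI: {uri!r}")
    return {
        "trust_domain": parsed.netloc,
        "namespace": parts[1],
        "service_account": parts[3],
    }


uri = workload_identity("cluster.local", "payments", "payments-api")
print(uri)  # spiffe://cluster.local/ns/payments/sa/payments-api
print(parse_workload_identity(uri)["service_account"])  # payments-api
```

A mismatch in the trust-domain part between two clusters is a common reason for peers rejecting each other's certificates.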

What telemetry does Istio provide out of the box (metrics, logs, traces)? (Intermediate)

Answer

Istio provides telemetry for service traffic out of the box: request metrics, proxy access logs when enabled, and distributed tracing integration. The common signals include request volume, success and error counts, latency histograms, source and destination labels, response codes, and security policy effects.

Technical explanation

Metrics are generated by the proxy, so basic service graph visibility appears even when application instrumentation is incomplete.

Access logs help debug individual requests, but they should be sampled or scoped in high-volume production environments.

Tracing still requires applications to propagate trace headers so spans can be connected end-to-end.

Hands-on example

Verification:

$ kubectl -n istio-system port-forward svc/prometheus 9090

PromQL examples:

sum(rate(istio_requests_total[5m])) by (destination_workload, response_code)

histogram_quantile(0.95, sum(rate(istio_request_duration_milliseconds_bucket[5m])) by (le, destination_workload))

How does Istio integrate with Prometheus for metrics? (Intermediate)

Answer

Istio integrates with Prometheus by exposing proxy and control-plane metrics that Prometheus can scrape. The data plane emits request metrics such as istio_requests_total and duration histograms, while Istio components expose health and operational metrics.

Technical explanation

In production, Prometheus scraping is usually managed by Prometheus Operator ServiceMonitor or PodMonitor resources, or equivalent platform configuration.

Metric cardinality must be controlled; excessive labels can overload Prometheus.

Dashboards and alerts should distinguish workload traffic metrics from mesh control-plane and gateway metrics.

Hands-on example

Prometheus query examples:

Error rate:

sum(rate(istio_requests_total{response_code=~'5..'}[5m])) by (destination_workload)

/

sum(rate(istio_requests_total[5m])) by (destination_workload)

Control plane scrape check:

$ kubectl -n istio-system get svc istiod -o yaml | grep prometheus -A5

What are the golden signals Istio exposes (latency, traffic, errors, saturation)? (Intermediate)

Answer

Istio exposes the golden signals as traffic, errors, latency, and saturation-related proxy metrics. For SRE work, I alert on error-rate burn, p95/p99 latency, request volume changes, and gateway or proxy saturation rather than just pod health.

Technical explanation

Traffic is represented by request rate and byte counters.

Errors are response codes, gRPC status, reset reasons, and policy denials.

Latency is captured in histograms; saturation is inferred from proxy CPU/memory, connection counts, pending requests, and gateway load.

Hands-on example

Dashboard panels:

1. RPS by source and destination.

2. 5xx percentage by destination workload.

3. p95 and p99 latency by route.

4. Envoy CPU/memory for gateways and high-volume sidecars.

5. mTLS or authorization denials after policy changes.

How does Istio enable distributed tracing, and what is required from the application? (Intermediate)

Answer

Istio enables distributed tracing by integrating Envoy with tracing backends such as Zipkin, Jaeger, OpenTelemetry collectors, or vendor systems. Envoy can create spans at proxy boundaries, but applications must propagate trace headers for a complete trace across services.

Technical explanation

Without header propagation, each service may create separate traces that cannot be stitched into one transaction.

Common headers include traceparent, b3, x-request-id, and related context depending on the tracing stack.

Sampling policy should balance troubleshooting value with storage and performance cost.

Hands-on example

Hands-on flow:

1. Configure Istio tracing provider to send to an OpenTelemetry Collector.

2. Ensure app framework propagates W3C traceparent or B3 headers.

3. Call frontend -> checkout -> payments.

4. Open the tracing backend and verify one trace contains spans for all three services.

Why must applications propagate trace headers even with Istio? (Intermediate)

Answer

Applications must propagate trace headers because the proxy can observe hops but cannot automatically know how an application maps an inbound request to an outbound request. Header propagation carries the trace context across process boundaries.

Technical explanation

Envoy can create spans, but the application decides which outbound calls belong to the inbound request being handled.

If the app drops headers, downstream services may receive traffic but create unrelated traces.

Framework-level instrumentation with OpenTelemetry is the cleanest way to preserve context consistently.

Hands-on example

Code-level check:

Incoming request contains traceparent.

The service must copy context when calling downstream:

GET /payment HTTP/1.1

traceparent: 00-<trace-id>-<span-id>-01

Test by sending one request and checking whether frontend, checkout, and payment appear under the same trace ID.
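
In code, propagation is simply copying a known set of headers from the inbound request onto every outbound call made on its behalf. The sketch below shows that step with plain dictionaries so it stays framework-neutral; the function name is hypothetical, and in a real service an OpenTelemetry instrumentation library would normally do this for you.

```python
# Headers commonly forwarded for W3C Trace Context and B3 propagation.
TRACE_HEADERS = (
    "traceparent",
    "tracestate",
    "x-request-id",
    "b3",
    "x-b3-traceid",
    "x-b3-spanid",
    "x-b3-parentspanid",
    "x-b3-sampled",
    "x-b3-flags",
)


def outbound_trace_headers(incoming: dict) -> dict:
    """Pick the trace-context headers to attach to a downstream call."""
    # HTTP header names are case-insensitive, so normalise before matching.
    lowered = {name.lower(): value for name, value in incoming.items()}
    return {name: lowered[name] for name in TRACE_HEADERS if name in lowered}


incoming = {
    "Traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
    "X-Request-Id": "req-123",
    "Accept": "application/json",
}
print(outbound_trace_headers(incoming))
```

The service would merge the returned dictionary into the headers of its call to the payment service; unrelated headers such as Accept are intentionally not copied.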

How does Istio integrate with Grafana and Kiali, and what does Kiali show? (Intermediate)

Answer

Grafana visualizes Istio metrics, while Kiali shows the mesh topology and service graph, including traffic edges, health, request rates, response codes, mTLS status, and Istio configuration relationships. Together they help operators move from metrics to topology-aware diagnosis.

Technical explanation

Grafana is good for time-series dashboards, SLOs, and historical trends.

Kiali is useful for understanding which services call each other and whether traffic is flowing as expected through the mesh.

Kiali can also highlight misconfigurations or missing links between VirtualService, DestinationRule, Gateway, and workloads.

Hands-on example

Troubleshooting example:

1. Grafana shows checkout 5xx rate increased.

2. Kiali shows the failing edge is checkout -> payments, not checkout -> inventory.

3. Inspect Istio config for that edge.

4. Use proxy-config clusters/endpoints for checkout to confirm payments endpoints and outlier status.

What is the typical latency and resource overhead of the sidecar, and how do you minimise it? (Intermediate)

Answer

The sidecar adds CPU, memory, startup, and latency overhead because every request passes through an additional proxy. The exact overhead depends on traffic volume, protocol, telemetry, TLS, filters, and resource limits, so I measure it in my own environment rather than quoting a single universal number.

Technical explanation

Overhead is reduced by right-sizing proxy CPU/memory, limiting high-cardinality telemetry, avoiding unnecessary Envoy filters, and applying mesh only where value justifies cost.

High-QPS gateways and chatty services need dedicated capacity tests.

Ambient mode can reduce per-pod sidecar overhead, but L7 waypoint usage still needs capacity planning.

Hands-on example

Measurement plan:

Run the same load test without mesh and with mesh.

$ fortio load -qps 200 -t 10m http://checkout/

Compare p50/p95/p99 latency, CPU, memory, connection count, retries, and error rate.

Then tune proxy resources and telemetry before declaring the mesh too expensive.

How do you troubleshoot a request that is failing only inside the mesh? (Intermediate)

Answer

For a request failing only inside the mesh, I isolate whether the failure is routing, mTLS, authorization, endpoint discovery, gateway configuration, or application behavior. I compare direct pod behavior, Service behavior, and meshed behavior rather than assuming it is an application bug.

Technical explanation

Start with status: pod readiness, sidecar injection, proxy sync, and istioctl analyze.

Then inspect Envoy route, cluster, listener, endpoint, and secret config on the source and destination.

Finally inspect proxy access logs for response flags such as UF, NR, and UO, and for RBAC denials or TLS errors in the response-code details.

Hands-on example

Runbook:

$ istioctl analyze -n app

$ istioctl proxy-status

$ istioctl proxy-config route deploy/source -n app

$ istioctl proxy-config endpoints deploy/source -n app | grep destination

$ kubectl logs deploy/source -c istio-proxy -n app --tail=100

Map the error code to route, endpoint, mTLS, or policy.

How do you use istioctl proxy-config and proxy-status to debug Envoy? (Intermediate)

Answer

I use istioctl proxy-status to check whether proxies are connected and synced with istiod. I use istioctl proxy-config to inspect the actual Envoy configuration for listeners, routes, clusters, endpoints, bootstrap, and secrets.

Technical explanation

proxy-status quickly shows stale or disconnected proxies, which points to control-plane or network issues.

proxy-config answers what the proxy is actually enforcing, not what I intended to configure.

For routing bugs, route and cluster output usually finds the problem faster than reading YAML alone.

Hands-on example

Commands:

$ istioctl proxy-status

$ istioctl proxy-config listeners deploy/frontend -n app

$ istioctl proxy-config routes deploy/frontend -n app

$ istioctl proxy-config clusters deploy/frontend -n app

$ istioctl proxy-config secrets deploy/frontend -n app

If a proxy is STALE, restart only after checking why it cannot receive or apply config.

What does istioctl analyze do? (Intermediate)

Answer

istioctl analyze validates Istio and Kubernetes configuration for common mesh problems. It detects issues like invalid hosts, unreachable subsets, conflicting gateways, missing sidecars, policy mistakes, and configuration that will not behave as expected.

Technical explanation

It is useful both interactively during troubleshooting and in CI before applying changes.

It does not replace runtime testing, but it catches many preventable outages before proxies receive bad config.

Warnings should be triaged; some may be acceptable intentionally, but critical errors should block deployment.

Hands-on example

CI example:

$ istioctl analyze -A --failure-threshold Error

For a pull request, render Helm/Kustomize output first:

$ kustomize build overlays/prod > rendered.yaml

$ istioctl analyze -f rendered.yaml --failure-threshold Warning

Fail the pipeline on invalid VirtualService or DestinationRule references.

How do you debug mTLS handshake failures between two services? (Intermediate)

Answer

To debug mTLS handshake failures, I verify both workloads are in the mesh, check PeerAuthentication mode, inspect DestinationRule TLS settings, confirm certificates and trust domains, and read proxy logs for TLS or authentication errors.

Technical explanation

Common causes include one side not injected, STRICT mode with plaintext caller, wrong DestinationRule TLS mode, trust-domain mismatch, expired certificates, or traffic bypassing Envoy.

The source proxy must have the right cluster TLS configuration, and the destination proxy must have valid workload certificates.

Check mTLS status (for example with istioctl x describe pod) and proxy-config secret before changing application code.

Hands-on example

Commands:

$ istioctl x describe pod <client-pod> -n app

$ istioctl proxy-config secret deploy/client -n app

$ istioctl proxy-config cluster deploy/client -n app | grep backend

$ kubectl logs deploy/client -c istio-proxy -n app | grep -i tls

If STRICT is enabled and client has no sidecar, the fix is onboarding the client or scoping policy.

What is a common cause of 503 errors in Istio, and how do you diagnose it? (Intermediate)

Answer

A common cause of 503 in Istio is that Envoy has no healthy upstream endpoints or no valid route to the selected subset. It can also come from mTLS mismatch, outlier ejection, gateway routing errors, or upstream connection failures.

Technical explanation

If a VirtualService routes to subset v2 but DestinationRule labels do not match any pods, Envoy can return 503.

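Proxy access-log flags help narrow the class of issue: NR for no route, UF for upstream connection failure, and UH for no healthy upstream. Requests denied by AuthorizationPolicy return 403 rather than 503 and show an rbac_access_denied detail in the access log.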

Always compare Kubernetes endpoints with Envoy endpoints.

Hands-on example

Diagnosis:

$ kubectl get endpoints backend -n app

$ istioctl proxy-config endpoints deploy/frontend -n app | grep backend

$ istioctl proxy-config route deploy/frontend -n app | grep backend

$ kubectl logs deploy/frontend -c istio-proxy -n app --tail=200

Fix labels, subsets, readiness, or TLS policy based on the missing piece.
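
When reading many access-log lines, a small lookup for the response-flags field saves time. The sketch below maps a handful of common Envoy flags to short descriptions; the table is deliberately incomplete, the wording is paraphrased, and the function name is hypothetical. It assumes the flags field is comma-separated, with "-" meaning no flag.

```python
# Short, paraphrased meanings for a few common Envoy response flags.
RESPONSE_FLAGS = {
    "NR": "no route configured for the request",
    "UH": "no healthy upstream hosts",
    "UF": "upstream connection failure",
    "UO": "upstream overflow (circuit breaker tripped)",
    "UC": "upstream connection termination",
    "UT": "upstream request timeout",
    "URX": "upstream retry limit exceeded",
    "DC": "downstream connection termination",
}


def explain_flags(flags_field: str) -> list:
    """Turn the response-flags field of an access log into (flag, meaning) pairs."""
    if flags_field in ("", "-"):
        return []
    return [
        (flag, RESPONSE_FLAGS.get(flag, "unknown flag, check the Envoy access-log docs"))
        for flag in flags_field.split(",")
    ]


print(explain_flags("UF,URX"))
print(explain_flags("-"))  # []
```

UH usually points at endpoints or outlier ejection, UF at connectivity or TLS, and UO at connectionPool limits, which narrows down which of the commands above to run first.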

Why might traffic bypass the sidecar, and how do you verify injection? (Intermediate)

Answer

Traffic may bypass the sidecar if the pod was not injected, traffic uses excluded ports or IP ranges, hostNetwork is used, iptables/CNI redirection failed, the app binds or routes unusually, or an operator explicitly disabled injection or capture annotations.

Technical explanation

The first check is whether the pod actually has istio-proxy and the expected annotations.

Then verify sidecar status, listeners, and whether the traffic uses a port included in capture rules.

Bypass can create security gaps because mTLS and AuthorizationPolicy may not apply.

Hands-on example

Verification:

$ kubectl get pod <pod> -n app -o jsonpath='{.spec.containers[*].name}'

$ kubectl get pod <pod> -n app -o jsonpath='{.metadata.annotations.sidecar\.istio\.io/status}'

$ istioctl proxy-config listeners <pod> -n app

If no istio-proxy appears, restart after fixing namespace labels or revision tags.

How do you exclude certain ports or IP ranges from sidecar interception? (Intermediate)

Answer

Istio can exclude specific inbound ports, outbound ports, outbound IP ranges, or interfaces from sidecar interception using traffic.sidecar.istio.io annotations. I use this only for well-understood exceptions because exclusions bypass mesh policy and telemetry.

Technical explanation

Examples include node-local agents, backup traffic, special database clients, or ports that cannot tolerate proxy interception.

Every exclusion should be documented with owner, reason, expiry, and compensating controls.

After applying an annotation, the pod must be recreated for injection and redirection config to change.

Hands-on example

Pod annotation example:

metadata:
  annotations:
    traffic.sidecar.istio.io/excludeOutboundIPRanges: 169.254.169.254/32
    traffic.sidecar.istio.io/excludeInboundPorts: '15020'

Validate:

$ kubectl rollout restart deploy/app -n app

$ istioctl proxy-config listeners deploy/app -n app

How do you handle non-HTTP (TCP) traffic in Istio? (Intermediate)

Answer

Istio can handle non-HTTP TCP traffic with L4 routing, mTLS, telemetry, and authorization based on ports, IPs, principals, and services. It cannot apply HTTP path, method, or header rules to opaque TCP traffic.

Technical explanation

Protocol detection depends on service port names and traffic behavior, so port naming matters.

For raw TCP, VirtualService tcp routes and AuthorizationPolicy TCP rules are used.

For databases and stateful protocols, test connection pooling, long-lived connections, and failover behavior carefully.

Hands-on example

TCP ServiceEntry example for an external DB:

ports:
- number: 5432
  name: tcp-postgres
  protocol: TCP

Then policy can allow only the app service account to that port.

Test with psql and watch Envoy TCP connection metrics rather than HTTP response-code metrics.

How does Istio handle headless services and StatefulSets? (Intermediate)

Answer

Istio can work with headless services and StatefulSets, but I pay close attention to service discovery, DNS, stable pod identities, and protocol behavior. Headless services expose individual pod endpoints, which may interact differently with Envoy routing and load balancing than normal ClusterIP services.

Technical explanation

Stateful workloads often use long-lived connections and identity-sensitive peer addresses, so mesh behavior must be tested before production rollout.

Subsets can still use labels, but per-pod routing may require careful hostnames or service entries depending on the use case.

For databases or brokers, verify readiness, mTLS compatibility, connection draining, and client failover behavior.

Hands-on example

StatefulSet validation:

$ kubectl get svc mydb -o yaml | grep clusterIP

$ kubectl exec deploy/client -n app -c app -- nslookup mydb-0.mydb.default.svc.cluster.local

$ istioctl proxy-config endpoints deploy/client -n app | grep mydb

Run failover tests before enabling STRICT mTLS for the data path.

What is the difference between Istio and an API gateway? (Intermediate)

Answer

Istio and an API gateway solve overlapping but different problems. An API gateway primarily manages north-south client-to-service traffic at the edge, while Istio manages east-west service-to-service traffic inside the platform and can also provide ingress and egress gateways.

Technical explanation

API gateways often focus on developer portals, API keys, external auth, request transformation, quotas, and public API lifecycle.

Istio focuses on workload identity, mTLS, service graph telemetry, internal authorization, and traffic control across microservices.

Many mature platforms use both: an API gateway at the public edge and Istio inside the cluster.

Hands-on example

Example architecture:

Internet -> API Gateway/WAF -> Istio Ingress Gateway -> internal services.

The API gateway handles public API products and client auth.

Istio handles mTLS, internal AuthorizationPolicy, canary routing, service telemetry, and egress controls.

How do Istio Gateways relate to the Kubernetes Gateway API? (Intermediate)

Answer

Istio Gateways are Istio's native API for configuring gateway proxies. The Kubernetes Gateway API is a broader Kubernetes standard for Gateway, HTTPRoute, TCPRoute, and related resources. Istio supports Gateway API so teams can use a more portable and role-oriented model.

Technical explanation

Istio's native API usually pairs a Gateway resource with a VirtualService that binds routes to it.

The Kubernetes Gateway API separates infrastructure ownership of Gateways from application ownership of Routes.

This separation is helpful in multi-team platforms where platform teams own shared gateways and app teams own route attachments.

Hands-on example

Ownership model:

Platform team applies Gateway in infra namespace.

App team applies HTTPRoute in app namespace with parentRefs to that Gateway.

CI checks allowed hostnames and namespaces before merge.

This reduces accidental edits to a shared Istio Gateway object.

What is the Kubernetes Gateway API, and how is Istio adopting it? (Intermediate)

Answer

The Kubernetes Gateway API is a standardized Kubernetes networking API intended to be more expressive and role-oriented than Ingress. It introduces resources such as GatewayClass, Gateway, and route types. Istio supports it as a way to configure ingress and mesh traffic with standard Kubernetes APIs.

Technical explanation

GatewayClass represents the implementation type, such as Istio.

Gateway represents listener infrastructure and allowed route attachment.

HTTPRoute or TCPRoute represents application routing rules that attach to a Gateway.

Hands-on example

Gateway API sketch:

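apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: public
spec:
  gatewayClassName: istio
  listeners:
  - name: https
    port: 443
    protocol: HTTPS
    # In practice an HTTPS listener also needs a tls block with certificateRefs.
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
# metadata (name, namespace) omitted in this sketch.
spec:
  parentRefs:
  - name: public
  rules:
  - backendRefs:
    - name: checkout
      port: 8080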

How do you roll out Istio to existing workloads with minimal disruption? (Intermediate)

Answer

I would roll out Istio to existing workloads in waves, starting with low-risk namespaces, using PERMISSIVE mTLS, strong telemetry, and clear rollback. The goal is to learn real traffic patterns before enforcing strict policy or advanced routing.

Technical explanation

Start with discovery: service owners, ports, protocols, cronjobs, external dependencies, and readiness probes.

Use revision labels or namespace labels so onboarding is controlled and reversible.

Move from observe-only to mTLS PERMISSIVE, then to STRICT and AuthorizationPolicy after traffic is understood.

Hands-on example

Wave plan:

1. Install Istio with a revision.

2. Onboard one non-critical namespace.

3. Restart workloads to inject sidecars.

4. Validate logs, metrics, probes, and dependency calls.

5. Add PeerAuthentication PERMISSIVE.

6. Move to STRICT after telemetry and istioctl x describe confirm that callers use mTLS.

7. Repeat by service tier with a runbook and owner signoff.

How do you upgrade Istio safely (canary control plane, revision tags)? (Intermediate)

Answer

I upgrade Istio safely by installing the new control plane as a canary revision, moving a small set of workloads to that revision, validating telemetry and traffic, then promoting the revision tag and rolling the rest gradually. I avoid in-place upgrades that change every workload at once.

Technical explanation

Revision-based upgrades let old and new control planes coexist during validation.

Workload migration requires restart because sidecar injection happens at pod creation.

Rollback is moving the namespace revision tag back and restarting affected workloads, assuming CRDs and APIs remain compatible.

Hands-on example

Upgrade example:

$ istioctl install --set revision=1-28 -y

$ kubectl label namespace canary istio.io/rev=1-28 --overwrite

$ kubectl rollout restart deploy -n canary

$ istioctl proxy-status

After validation:

$ istioctl tag set stable --revision 1-28

What are Istio revisions and revision tags, and why use them for upgrades? (Intermediate)

Answer

Istio revisions identify different installed control-plane versions, and revision tags provide stable labels such as stable or canary that point to a specific revision. They are used to control which workloads get injected with which sidecar version during upgrades.

Technical explanation

A namespace can be labeled istio.io/rev=1-27 or istio.io/rev=stable.

Tags decouple application namespaces from raw version names, making promotion and rollback simpler.

They also support canary control-plane validation without reconfiguring the entire cluster.

Hands-on example

Example:

$ istioctl tag set stable --revision 1-27

$ istioctl tag set canary --revision 1-28

$ kubectl label namespace payments istio.io/rev=canary --overwrite

$ kubectl rollout restart deploy -n payments

If canary fails, point the namespace back to stable and restart only that namespace.
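
The indirection that tags provide can be pictured as a lookup table. This is a conceptual sketch only: the dictionary and function are hypothetical and do not reflect how istiod stores tags. It shows why promoting a tag changes the injected version for every namespace that references the tag, without relabelling any namespace.

```python
def resolve_revision(namespace_label: str, tags: dict) -> str:
    """Resolve an istio.io/rev label value to a concrete revision.

    A label that names a tag follows the tag; any other value is
    treated as a raw revision name.
    """
    return tags.get(namespace_label, namespace_label)


# State after the two `istioctl tag set` commands in the example.
tags = {"stable": "1-27", "canary": "1-28"}

print(resolve_revision("canary", tags))  # 1-28
print(resolve_revision("stable", tags))  # 1-27
print(resolve_revision("1-27", tags))    # 1-27

# Promotion: repoint the stable tag, namespace labels stay untouched.
tags["stable"] = "1-28"
print(resolve_revision("stable", tags))  # 1-28
```

Workloads still need a restart after the tag moves, because the sidecar version is fixed when the pod is created.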

How do you do a canary upgrade of the Istio control plane? (Intermediate)

Answer

A canary upgrade installs the new Istio control plane alongside the old one, then migrates a small set of workloads or namespaces to the new revision. I validate proxy sync, mTLS, routing, telemetry, gateway behavior, and application SLOs before expanding.

Technical explanation

Start with low-risk but representative workloads, not just an empty demo service.

Check CRD compatibility, deprecated fields, EnvoyFilter behavior, and custom telemetry before migration.

Gate expansion on both mesh health and application SLOs.

Hands-on example

Canary runbook:

$ istioctl install --set revision=new -y

$ kubectl label ns sample istio.io/rev=new --overwrite

$ kubectl rollout restart deploy -n sample

$ istioctl proxy-status | grep sample

$ istioctl analyze -A

Run smoke and load tests, then move one production namespace at a time.

What are the failure modes if istiod is unavailable? (Intermediate)

Answer

If istiod is unavailable, existing proxies generally continue forwarding traffic with their last-known-good configuration, but new config will not propagate, new or restarted sidecars may fail to get config or certificates, certificate rotation can be impacted, and injection or validation webhooks may fail depending on configuration.

Technical explanation

Existing data-plane traffic is not normally on the control-plane request path.

Risk increases during pod restarts, scaling events, certificate renewal windows, and config rollouts.

The blast radius depends on istiod replicas, PDBs, cluster DNS, API-server connectivity, and webhook failure policies.

Hands-on example

Failure test in staging:

1. Scale istiod to zero.

2. Confirm existing service calls still work.

3. Try creating a new injected pod.

4. Try applying a VirtualService change.

5. Restore istiod and verify proxy-status returns SYNCED.

Document exact failure behavior for your platform.

Does the data plane keep working if the control plane goes down, and why? (Intermediate)

Answer

Yes, the data plane can keep serving existing traffic if the control plane goes down because Envoy proxies already have their last accepted configuration. However, they cannot receive new routes, endpoints, certificates, or policy updates until control-plane connectivity is restored.

Technical explanation

This separation is an important resilience property of the mesh.

It does not mean the control plane is optional; prolonged outage can affect scaling, rotations, and rollout safety.

Gateways and sidecars should be monitored separately from istiod so teams know whether they have a control-plane issue or a data-plane issue.

Hands-on example

Operational check:

$ istioctl proxy-status

If proxies show connected and synced, traffic problems are likely data-plane or app-specific.

If proxies are disconnected but traffic still works, avoid risky config changes until istiod is restored and proxies resync.

How do you enforce that all traffic leaving the mesh goes through an egress gateway? (Intermediate)

Answer

To force outbound mesh traffic through an egress gateway, I combine Istio outbound traffic policy, ServiceEntry, VirtualService, DestinationRule, AuthorizationPolicy, and network controls. The mesh config routes allowed external hosts to the egress gateway, while firewall or NetworkPolicy blocks direct pod egress.

Technical explanation

Istio config alone is not enough if pods can directly reach the internet at the network layer.

ServiceEntry defines known external services; VirtualService sends that traffic through the egress gateway.

NetworkPolicy, cloud security groups, NAT rules, or firewall policy should allow outbound only from the egress gateway path.

Hands-on example

Implementation flow:

1. Set outboundTrafficPolicy to REGISTRY_ONLY if appropriate.

2. Create ServiceEntry for api.partner.com.

3. Route host through istio-egressgateway.

4. Allow only the egress gateway subnet or security group through the external firewall.

5. Test that a direct curl from a pod fails while routed egress succeeds.

How would you restrict which external services workloads can reach with Istio? (Intermediate)

Answer

I restrict external access by using REGISTRY_ONLY outbound policy, defining approved external destinations with ServiceEntry, routing sensitive traffic through egress gateways, and enforcing Kubernetes or cloud network controls so workloads cannot bypass the mesh.

Technical explanation

With REGISTRY_ONLY in effect, ServiceEntry resources act as an allowlist at the mesh layer.

AuthorizationPolicy can restrict which service accounts are allowed to call specific egress paths.

External access should be reviewed like firewall rules: owner, business justification, destination, port, data classification, and expiry.

Hands-on example

Example controls:

Allowed: payments service account -> api.payment-provider.com:443 through egress gateway.

Denied: any namespace -> random internet host.

Validation:

$ kubectl exec deploy/payments -- curl https://api.payment-provider.com

$ kubectl exec deploy/payments -- curl https://example.org

The second request should fail or be blocked.

What is locality-aware load balancing, and why does it help latency and cost? (Intermediate)

Answer

Locality-aware load balancing prefers endpoints in the same zone, region, or network locality when possible. It helps reduce latency, cross-zone or cross-region cost, and blast radius during partial failures.

Technical explanation

Kubernetes and cloud environments often label nodes with topology information such as region and zone.

Istio can use locality information and failover rules to prefer local endpoints and fail over only when needed.

This is especially valuable for multi-zone and multi-cluster services where cross-zone traffic has both performance and cost impact.

Hands-on example

Example design:

Service checkout runs in zones a, b, and c.

Clients in zone a prefer checkout pods in zone a.

If zone a endpoints become unhealthy, traffic fails over to b or c.

Measure cross-zone bytes before and after to prove latency and cost improvement.
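The prefer-local-then-fail-over behaviour in this design can be sketched in a few lines of Python. This is a deliberately simplified model, not Envoy's actual algorithm (which shifts traffic gradually by priority level); the endpoint names and the `pick_endpoints` helper are hypothetical.

```python
def pick_endpoints(client_zone, endpoints):
    """Return the endpoints a client should use.

    Prefer healthy endpoints in the client's own zone; if none are
    healthy there, fall back to healthy endpoints in any other zone.
    """
    healthy = [e for e in endpoints if e["healthy"]]
    local = [e for e in healthy if e["zone"] == client_zone]
    return local if local else healthy


# Hypothetical checkout endpoints spread over zones a, b, and c.
checkout = [
    {"name": "checkout-a1", "zone": "a", "healthy": True},
    {"name": "checkout-b1", "zone": "b", "healthy": True},
    {"name": "checkout-c1", "zone": "c", "healthy": True},
]

print([e["name"] for e in pick_endpoints("a", checkout)])

# Simulate zone a becoming unhealthy: traffic fails over to b and c.
checkout[0]["healthy"] = False
print([e["name"] for e in pick_endpoints("a", checkout)])
```

The first call keeps traffic inside zone a; after the zone a endpoint is marked unhealthy, the second call returns the endpoints in zones b and c.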

How does Istio handle multi-cluster service discovery at a high level? (Intermediate)

Answer

At a high level, Istio multi-cluster service discovery lets workloads in one cluster discover and securely call services in another cluster. It uses shared or federated trust, endpoint discovery, east-west gateways where needed, and mesh configuration that understands multiple networks and clusters.

Technical explanation

Multi-cluster designs vary by network reachability, trust model, and control-plane topology.

A flat network is simpler; separate networks commonly require east-west gateways.

Operational concerns include identity, DNS, failover, locality, certificate trust, gateway capacity, and config ownership.

Hands-on example

Validation checklist:

1. Confirm clusters share trust or have configured trust bundles.

2. Confirm remote secrets or discovery integration.

3. Deploy sample service in cluster A and caller in cluster B.

4. Verify mTLS identity across clusters.

5. Test failover and locality by draining one cluster's endpoints.

What is the difference between a primary-remote and a multi-primary multi-cluster setup? (Intermediate)

Answer

In a primary-remote setup, one primary cluster runs the control plane and remote clusters run workloads connected to that control plane. In a multi-primary setup, each cluster has its own control plane, and the control planes share discovery and trust for cross-cluster mesh behavior.

Technical explanation

Primary-remote can centralize management but creates dependency on the primary control plane for remote workloads.

Multi-primary improves control-plane locality and autonomy but adds more operational complexity.

The right choice depends on cluster count, network latency, team ownership, failure domains, and compliance boundaries.

Hands-on example

Decision example:

Two clusters in one region managed by one platform team: primary-remote may be acceptable.

Many clusters across regions with local platform ownership: multi-primary is usually more resilient.

Test by losing the control plane in one cluster and observing config updates, certificate behavior, and traffic continuity.

How do you measure the performance impact of enabling Istio? (Intermediate)

Answer

I measure Istio's performance impact by comparing baseline and mesh-enabled workloads under the same load profile. I look at p50/p95/p99 latency, CPU, memory, connection counts, request errors, retries, TLS cost, gateway saturation, and application throughput.

Technical explanation

A valid test uses representative payload sizes, concurrency, keepalive behavior, and dependency depth.

Measure both sidecar resource usage and application resource changes because proxy behavior can affect app latency and connection patterns.

Separate gateway overhead from east-west service call overhead.

Hands-on example

Experiment:

1. Deploy checkout without mesh in staging.

2. Run a 30 minute load test.

3. Enable mesh and repeat.

4. Enable mTLS STRICT and repeat.

5. Add retries/timeouts and repeat.

Report delta in p99 latency, CPU per RPS, memory per pod, and SLO error budget impact.
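One way to turn the load-test results into the reported deltas is a small comparison helper. The sketch below assumes each run is summarised as a dictionary with p99 latency, total CPU, and achieved RPS; the field names and all numbers are invented for illustration.

```python
def overhead_report(baseline, mesh):
    """Compare a mesh-enabled run against a baseline run.

    Each run is a dict with p99_ms, cpu_millicores, and rps measured
    under the same load profile.
    """
    p99_delta_ms = mesh["p99_ms"] - baseline["p99_ms"]
    base_cpu = baseline["cpu_millicores"] / baseline["rps"]
    mesh_cpu = mesh["cpu_millicores"] / mesh["rps"]
    return {
        "p99_delta_ms": p99_delta_ms,
        "p99_delta_pct": round(100 * p99_delta_ms / baseline["p99_ms"], 1),
        "cpu_millicores_per_rps_delta": round(mesh_cpu - base_cpu, 3),
    }


# Hypothetical results from two 30 minute runs at the same request rate.
baseline = {"p99_ms": 40, "cpu_millicores": 800, "rps": 1000}
with_mesh = {"p99_ms": 46, "cpu_millicores": 1100, "rps": 1000}

print(overhead_report(baseline, with_mesh))
```

Running the same helper for each step (mesh enabled, mTLS STRICT, retries and timeouts) makes it easier to see which change introduced most of the overhead.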

How do you decide whether a service should be in the mesh or not? (Advanced)

Answer

I decide based on value versus risk and cost. A service belongs in the mesh when it benefits from mTLS identity, authorization, traffic control, observability, or progressive delivery. I avoid onboarding services where proxying creates unsupported behavior, unnecessary overhead, or no meaningful platform benefit.

Technical explanation

Good candidates are internal HTTP/gRPC services with multiple callers and clear security or release-control needs.

Riskier candidates include ultra-low-latency paths, unusual protocols, hostNetwork workloads, and some stateful systems that have not been tested behind a proxy.

The decision should be explicit, documented, and revisited as mesh modes and service needs evolve.

Hands-on example

Scoring model:

Security need: 0-5

Traffic-control need: 0-5

Observability gap: 0-5

Protocol compatibility risk: 0-5

Operational owner readiness: 0-5

Onboard high-value, low-risk services first; keep exceptions with compensating controls.
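The scoring model can be made concrete with a short script. The way the five scores are combined below (value minus risk, with low owner readiness counted as risk) is an assumption made for illustration; the source only lists the dimensions, so adjust the formula to your own platform.

```python
def mesh_onboarding_score(security, traffic_control, observability_gap,
                          protocol_risk, owner_readiness):
    """Combine the five 0-5 scores into a single number.

    Assumed formula: value = security + traffic control + observability gap;
    risk = protocol risk + (5 - owner readiness). Higher is better.
    """
    scores = (security, traffic_control, observability_gap,
              protocol_risk, owner_readiness)
    if any(not 0 <= s <= 5 for s in scores):
        raise ValueError("every score must be between 0 and 5")
    value = security + traffic_control + observability_gap
    risk = protocol_risk + (5 - owner_readiness)
    return value - risk


# Hypothetical services: an internal HTTP API and a legacy TCP system.
print(mesh_onboarding_score(5, 4, 3, 1, 4))   # high value, low risk
print(mesh_onboarding_score(2, 1, 2, 5, 1))   # low value, high risk
```

Sorting services by this score gives a first-pass onboarding order; the services at the bottom are the candidates for documented exceptions with compensating controls.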

When is a service mesh overkill, and what lighter alternatives exist? (Advanced)

Answer

A service mesh is overkill when the environment has few services, simple traffic paths, limited security requirements, or a team that cannot operate the additional control plane and proxy layer. Lighter alternatives include Kubernetes Services, NetworkPolicy, API gateways, library-based retries, OpenTelemetry instrumentation, and cloud load balancer features.

Technical explanation

The mesh should solve real organizational and technical problems, not be adopted because it is fashionable.

Complexity includes upgrades, CRD governance, proxy tuning, telemetry cost, policy debugging, and incident-response training.

A lighter design may be better until the platform reaches enough service count, risk, or compliance need.

Hands-on example

Decision example:

A cluster with 6 services and one team may use Ingress, NetworkPolicy, Prometheus, and app-level OpenTelemetry.

A platform with 300 services, many teams, strict internal mTLS, and progressive delivery needs can justify Istio.

Review the decision against SLO and audit requirements.

How do you handle secrets and certificates for the ingress gateway (TLS termination)? (Advanced)

Answer

For ingress gateway TLS termination, I store certificates as Kubernetes TLS secrets or use a certificate manager integration, reference them from the Gateway using credentialName, and restrict secret access to the gateway namespace and platform automation.

Technical explanation

cert-manager is commonly used to automate issuance and renewal from an internal CA or ACME provider.

Gateway TLS mode SIMPLE terminates TLS at the gateway; PASSTHROUGH keeps TLS to the backend and uses SNI routing.

Secret governance matters: only approved automation should create or rotate gateway certificates.

Hands-on example

TLS secret example:

$ kubectl -n istio-ingress create secret tls app-tls --cert=tls.crt --key=tls.key

Gateway snippet:

tls:
  mode: SIMPLE
  credentialName: app-tls
hosts:
- app.example.com

Validate:

$ openssl s_client -connect app.example.com:443 -servername app.example.com

What is SNI-based routing, and how does the ingress gateway use it? (Advanced)

Answer

SNI-based routing uses the Server Name Indication value in the TLS ClientHello to route encrypted traffic before HTTP is decrypted. An Istio ingress gateway can match hosts in TLS PASSTHROUGH mode and send traffic to the correct backend based on SNI.

Technical explanation

SNI routing is useful when the gateway should not terminate TLS, such as when backend services own their certificates.

Because the gateway does not decrypt traffic in PASSTHROUGH mode, it cannot route based on HTTP path or headers.

For HTTP path routing, terminate TLS at the gateway or use another design that exposes HTTP metadata to the proxy.

Hands-on example

PASSTHROUGH sketch:

Gateway server:

port: 443 HTTPS
tls:
  mode: PASSTHROUGH
hosts: [secure.example.com]

VirtualService tls match:

- match:
  - sniHosts: [secure.example.com]
  route:
  - destination:
      host: secure-backend
      port:
        number: 443

How would you implement rate limiting in Istio (local and global)? (Advanced)

Answer

Istio rate limiting can be local or global. Local rate limiting is enforced independently by each proxy and is good for simple per-pod protection. Global rate limiting uses an external rate-limit service so limits can be shared across replicas and gateways.

Technical explanation

Local limits are simpler and avoid an external dependency, but each proxy has its own counter.

Global limits are better for tenant-level, API-key, or user-level quotas across multiple gateway replicas.

Rate limits should be paired with clear response codes, dashboards, and exemption processes.

Hands-on example

Implementation example:

Local: EnvoyFilter that configures Envoy's local rate limit filter (token bucket) at ingress.

Global: ingress gateway -> Envoy external rate limit filter -> rate-limit service backed by Redis.

Test:

$ hey -n 1000 -c 50 https://api.example.com/orders

Expect 429 when configured thresholds are exceeded.
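To see why local limits give each proxy its own counter, here is a minimal token bucket sketch in Python. It is a teaching model rather than Envoy's implementation; the parameter names are loosely borrowed from Envoy's token bucket settings, and the clock is passed in explicitly so the behaviour is deterministic.

```python
class TokenBucket:
    """Simplified token bucket: one instance models one proxy's local limit."""

    def __init__(self, max_tokens, tokens_per_fill, fill_interval_s):
        self.max_tokens = max_tokens
        self.tokens_per_fill = tokens_per_fill
        self.fill_interval_s = fill_interval_s
        self.tokens = max_tokens
        self.last_fill = 0.0

    def allow(self, now):
        """Return the HTTP status a request arriving at time `now` would get."""
        fills = int((now - self.last_fill) // self.fill_interval_s)
        if fills > 0:
            # Refill once per elapsed interval, capped at the bucket size.
            self.tokens = min(self.max_tokens,
                              self.tokens + fills * self.tokens_per_fill)
            self.last_fill += fills * self.fill_interval_s
        if self.tokens > 0:
            self.tokens -= 1
            return 200
        return 429


# Two gateway replicas, each with its own bucket of 3 requests per second.
replica_a = TokenBucket(max_tokens=3, tokens_per_fill=3, fill_interval_s=1.0)
replica_b = TokenBucket(max_tokens=3, tokens_per_fill=3, fill_interval_s=1.0)

print([replica_a.allow(0.0) for _ in range(5)])
print([replica_b.allow(0.0) for _ in range(5)])
```

Each replica admits three requests and rejects the rest, so the fleet admits six in total. That multiplication by replica count is the reason tenant-level or API-key quotas usually need the global rate-limit service instead.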

How do you integrate an external authorization service with Istio? (Advanced)

Answer

External authorization delegates the allow/deny decision to an external auth service through Envoy's ext_authz integration. I use it when policy depends on business context, entitlements, tenant state, or centralized authorization logic that is not practical to encode only in AuthorizationPolicy.

Technical explanation

The proxy sends selected request metadata to the external auth service.

The auth service returns allow or deny, optionally with headers to add or remove.

Availability and latency of the auth service become part of the request path, so it needs SLOs, caching strategy, and failure-mode design.

Hands-on example

Design:

Gateway receives request with JWT.

RequestAuthentication validates token.

ext_authz sends user, tenant, path, and method to authz-service.

authz-service checks entitlements and returns allow/deny.

Load test the authz service and decide fail-open vs fail-closed per route risk.
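The fail-open versus fail-closed decision can be illustrated with a small Python model of the request path. Nothing here is Envoy or Istio API: `authorize`, the entitlement callbacks, and the tenant name are hypothetical, and a timeout stands in for any failure to reach the auth service.

```python
def authorize(request, check_entitlement, fail_open=False):
    """Return the status the proxy would send for this request.

    check_entitlement models the call to the external auth service and
    may raise TimeoutError when that service is slow or unavailable.
    """
    try:
        allowed = check_entitlement(request)
    except TimeoutError:
        # Fail-open lets the request through; fail-closed denies it.
        return 200 if fail_open else 403
    return 200 if allowed else 403


def entitled(request):
    """Hypothetical rule: tenant acme may only read."""
    return request["tenant"] == "acme" and request["method"] == "GET"


def unavailable(request):
    """Simulates an auth service that does not answer in time."""
    raise TimeoutError("authz-service did not respond")


req = {"user": "alice", "tenant": "acme", "method": "GET", "path": "/orders"}
print(authorize(req, entitled))                     # normal allow
print(authorize(req, unavailable))                  # fail-closed
print(authorize(req, unavailable, fail_open=True))  # fail-open
```

High-risk routes such as payments would normally stay fail-closed, while low-risk read paths might tolerate fail-open to preserve availability during an auth outage.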

How does Istio interact with NetworkPolicies, and do you need both? (Advanced)

Answer

Istio and Kubernetes NetworkPolicies operate at different layers, and I usually want both. NetworkPolicy provides L3/L4 network segmentation enforced by the CNI, while Istio provides identity-aware mTLS and L7 policies such as method, path, and JWT-claim checks.

Technical explanation

NetworkPolicy can block bypass paths if a pod tries to avoid the sidecar or call directly at the network layer.

Istio AuthorizationPolicy can express service-account and HTTP-level intent that NetworkPolicy cannot.

Defense in depth is stronger than relying on either layer alone.

Hands-on example

Example:

NetworkPolicy allows traffic to payments only from frontend namespace on port 8080.

Istio AuthorizationPolicy allows only principal cluster.local/ns/frontend/sa/frontend and only POST /charge.

If one layer is bypassed or misconfigured, the other still reduces blast radius.

What is the difference between L4 and L7 policy enforcement in the mesh? (Advanced)

Answer

L4 policy enforcement uses connection-level attributes such as source identity, destination port, IP, and TCP protocol. L7 policy enforcement understands application protocol metadata such as HTTP method, path, headers, host, gRPC service, and JWT claims.

Technical explanation

L4 policy is generally cheaper and works for opaque TCP protocols.

L7 policy is more expressive but requires protocol awareness and, in ambient mode, usually waypoint proxies for L7 decisions.

Use L4 for broad segmentation and L7 for application-level least privilege.

Hands-on example

Example:

L4: frontend service account can connect to orders on port 8080.

L7: frontend can GET /orders and POST /orders, but cannot DELETE /orders.

Policy design starts with L4 deny-by-default, then adds L7 controls for critical APIs.
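The layering of L4 and L7 checks in this example can be modelled as two lookups with default-deny semantics. This is a conceptual sketch only, not how AuthorizationPolicy is evaluated internally; the principal string assumes a hypothetical `app` namespace.

```python
FRONTEND = "cluster.local/ns/app/sa/frontend"  # hypothetical workload identity

# L4 rule: which identity may open a connection to orders on which port.
L4_RULES = {(FRONTEND, 8080)}

# L7 rules: which HTTP operations that identity may perform.
L7_RULES = {(FRONTEND, "GET", "/orders"), (FRONTEND, "POST", "/orders")}


def l4_allows(principal, port):
    """Connection-level check: identity and destination port only."""
    return (principal, port) in L4_RULES


def l7_allows(principal, port, method, path):
    """Request-level check: must pass L4 first, then match method and path."""
    return l4_allows(principal, port) and (principal, method, path) in L7_RULES


print(l4_allows(FRONTEND, 8080))
print(l7_allows(FRONTEND, 8080, "GET", "/orders"))
print(l7_allows(FRONTEND, 8080, "DELETE", "/orders"))
```

Anything not listed is denied, which mirrors the deny-by-default starting point: the DELETE request passes the L4 check but is rejected at L7.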

How do you observe and reduce the error rate of a specific service via the mesh? (Advanced)

Answer

To observe and reduce a specific service's error rate, I first identify the failing edge, response codes, and source workloads using Istio metrics and access logs. Then I determine whether errors come from app behavior, routing, mTLS, authorization, endpoint health, retries, or downstream saturation.

Technical explanation

Mesh telemetry shows which caller-to-callee relationship is failing, which is faster than looking only at pod restarts.

Reducing error rate might involve rollback, fixing a route, changing readiness, tuning retries, ejecting bad endpoints, or adding capacity.

I avoid hiding real errors with retries until I understand the root cause.

Hands-on example

PromQL:

sum(rate(istio_requests_total{destination_workload='payments',response_code=~'5..'}[5m])) by (source_workload,response_code)

Then inspect:

$ istioctl proxy-config endpoints deploy/checkout -n app | grep payments

$ kubectl logs deploy/checkout -c istio-proxy -n app --tail=200

How would you use the mesh to enforce least-privilege between microservices? (Advanced)

Answer

I enforce least privilege by combining mTLS STRICT, dedicated service accounts, default-deny AuthorizationPolicy, explicit ALLOW rules for known service edges, JWT validation where user context matters, and CI validation so policy changes are reviewed before production.

Technical explanation

The service account becomes the workload identity, so workloads should not share a broad default service account.

Start by observing traffic to build an allowlist, but move to enforcement once owners validate required flows.

Policy should be owned as code and tested with representative requests.

Hands-on example

Least-privilege rollout:

1. Inventory edges from Istio telemetry for 14 days.

2. Replace default service accounts.

3. Apply namespace default-deny.

4. Add ALLOW policies per service edge.

5. Dry-run or canary the policy.

6. Enforce and alert on denied legitimate traffic.

How do you test an AuthorizationPolicy before enforcing it (dry-run)? (Advanced)

Answer

I test AuthorizationPolicy using dry-run where supported, narrow selectors, staging namespaces, synthetic requests, and access-log review before enforcing. Dry-run lets me see what would be denied without actually breaking production traffic.

Technical explanation

Dry-run is especially useful when introducing DENY policies or default-deny posture.

I also run positive and negative test cases: allowed caller succeeds, unauthorized caller fails, wrong method fails, wrong JWT claim fails.

After enforcement, I monitor 403 responses and RBAC response flags closely.

Hands-on example

Dry-run annotation example:

metadata:
  annotations:
    istio.io/dry-run: 'true'

Then send test traffic and inspect proxy metrics/logs for authorization decision signals.

PromQL idea:

sum(rate(istio_requests_total{response_code='403'}[5m])) by (source_workload,destination_workload)

How do you roll back a bad VirtualService change quickly? (Advanced)

Answer

To roll back a bad VirtualService quickly, I keep mesh config in Git, apply changes through CI/CD, and revert to the last known-good manifest. Operationally, the fastest rollback is usually setting weights back to the stable subset or reapplying the previous VirtualService revision.

Technical explanation

Bad VirtualService changes can cause no-route errors, wrong host matches, canary overload, or broken gateway routes.

A rollback should be a small config change, not a redeploy of every service.

I validate rollback by checking proxy routes and live error-rate recovery.

Hands-on example

Rollback commands:

$ git revert <bad-commit>

$ kubectl apply -f virtualservice.yaml

Fast emergency patch:

$ kubectl patch virtualservice checkout -n app --type merge -p '<known-good-json>'

Validate:

$ istioctl proxy-config route deploy/istio-ingressgateway -n istio-system | grep checkout

$ kubectl logs deploy/istio-ingressgateway -c istio-proxy -n istio-system --tail=100

What metrics would you alert on for the mesh itself? (Advanced)

Answer

I alert on mesh control-plane health, proxy sync, gateway health, xDS push errors, certificate expiration, injection failures, 5xx/error-rate at gateways, mTLS or authorization failures, high proxy CPU/memory, rejected config, and abnormal request latency introduced at the proxy layer.

Technical explanation

Control-plane alerts tell us whether the mesh can accept changes and support scaling events.

Data-plane alerts tell us whether user traffic is affected.

Gateway alerts need special attention because gateways are shared choke points.

Hands-on example

Alert examples:

1. istiod unavailable or no ready replicas.

2. Proxy sync stale for more than 5 minutes.

3. Ingress gateway 5xx burn rate exceeds SLO.

4. Certificate expiry under threshold.

5. Envoy memory near limit or OOMKilled.

6. Spike in RBAC denied traffic after a policy deploy.

How do you capacity-plan the ingress gateway? (Advanced)

Answer

I capacity-plan the ingress gateway like a shared production load balancer. I estimate peak RPS, concurrent connections, TLS handshakes, payload size, response size, header size, route complexity, retry behavior, CPU, memory, network throughput, and availability requirements.

Technical explanation

TLS termination and high-cardinality telemetry can be CPU expensive.

Gateway autoscaling should use meaningful signals such as CPU, request rate, active connections, and latency where available.

The gateway deployment needs pod anti-affinity, PDBs, readiness, load-balancer health checks, and safe rollout strategy.

Hands-on example

Capacity test:

$ fortio load -qps 5000 -c 200 -t 30m https://app.example.com/

Watch ingressgateway CPU, memory, downstream connections, p99 latency, 5xx, TLS errors, and node network.

Set HPA and resource requests based on tested headroom, not averages from quiet periods.
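Once the load test gives a per-pod throughput figure, the replica count follows from simple arithmetic. The sketch below assumes a target utilization that leaves headroom and a minimum replica count for availability; the function name and every number are illustrative, not measured values.

```python
import math


def gateway_replicas(peak_rps, tested_rps_per_pod,
                     target_utilization=0.6, min_replicas=2):
    """Estimate how many gateway pods are needed at peak.

    tested_rps_per_pod is what one pod sustained in the load test while
    staying within latency SLOs; target_utilization keeps headroom for
    spikes, node loss, and rollouts.
    """
    usable_rps_per_pod = tested_rps_per_pod * target_utilization
    return max(min_replicas, math.ceil(peak_rps / usable_rps_per_pod))


# Hypothetical figures: 5000 RPS peak, 1500 RPS sustained per pod in testing.
print(gateway_replicas(peak_rps=5000, tested_rps_per_pod=1500))

# A quiet service still gets the minimum replica count for availability.
print(gateway_replicas(peak_rps=100, tested_rps_per_pod=1500))
```

The result is a starting point for the HPA minimum and for resource requests; connection counts and TLS handshake rate may turn out to be the binding constraint rather than RPS.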

How do you handle gradual migration of services into mTLS STRICT mode? (Advanced)

Answer

For gradual migration to mTLS STRICT, I first enable the mesh in PERMISSIVE mode, identify all callers, verify that expected traffic uses mTLS, fix non-meshed clients, then apply STRICT at workload or namespace scope in waves.

Technical explanation

Do not switch a namespace to STRICT until batch jobs, cronjobs, external clients, probes, and legacy services are accounted for.

Use PeerAuthentication selectors for smaller blast radius when needed.

Monitor 503, TLS errors, and failed handshakes during each wave.

Hands-on example

Migration sequence:

1. PERMISSIVE namespace policy.

2. Verify mTLS status for important paths, for example with istioctl x describe pod (istioctl authn tls-check existed only in older releases).

3. Enable STRICT for one workload selector.

4. Run smoke tests from every known caller.

5. Expand to namespace-level STRICT.

6. Add alert for plaintext attempts or handshake failures.

What is the impact of the sidecar on application startup and shutdown ordering? (Advanced)

Answer

The sidecar can affect startup and shutdown because the application may start before Envoy is ready, or terminate before Envoy finishes draining connections. If not handled, this can cause early request failures during startup or dropped in-flight requests during rolling updates.

Technical explanation

Startup ordering matters when the app immediately calls dependencies or receives traffic as soon as its container starts.

Shutdown ordering matters for long-lived HTTP/gRPC connections and graceful termination.

Readiness probes, preStop hooks, terminationGracePeriodSeconds, and Istio proxy lifecycle settings should be coordinated.

Hands-on example

Practical setup:

1. Enable holdApplicationUntilProxyStarts for sensitive workloads.

2. Ensure Kubernetes readiness waits for the app and proxy.

3. Add preStop sleep or graceful shutdown in app.

4. Set terminationGracePeriodSeconds long enough for Envoy drain plus app cleanup.

5. Test rolling update under live traffic.

How do you ensure the sidecar is ready before the app starts taking traffic? (Advanced)

Answer

I ensure the sidecar is ready before traffic by using Istio's proxy readiness integration, Kubernetes readiness probes, and, for workloads that make early outbound calls, holdApplicationUntilProxyStarts or equivalent proxy-start ordering. The service should not receive traffic until both app and proxy are ready.

Technical explanation

If only the application readiness is checked, Kubernetes may send traffic before Envoy has listeners and clusters.

Istio can rewrite HTTP probes so health checks work through sidecar interception.

For strict startup dependencies, hold the app until the proxy starts to avoid bootstrap failures.

Hands-on example

Validation:

$ kubectl describe pod <pod> -n app | grep -A5 Readiness

$ kubectl get pod <pod> -n app -o jsonpath='{.status.containerStatuses[*].ready}'

$ kubectl logs <pod> -c istio-proxy -n app | grep -i ready

Run a rolling restart while a client sends continuous requests and check for startup 503s.

How do you drain connections gracefully during a rolling update with Istio? (Advanced)

Answer

To drain connections gracefully during a rolling update, I coordinate Kubernetes termination settings, application shutdown, Envoy drain duration, readiness removal, and load-balancer behavior. The pod should stop receiving new traffic before the app exits, while existing requests complete where possible.

Technical explanation

Readiness should fail first so Kubernetes removes the pod from endpoints.

The app should stop accepting new work and complete in-flight requests.

Envoy should drain downstream connections within terminationGracePeriodSeconds.

Hands-on example

Runbook:

1. Configure app graceful shutdown on SIGTERM.

2. Set terminationGracePeriodSeconds to 30-60s or workload-specific value.

3. Use preStop if needed to give endpoint removal time.

4. Configure proxy drain duration if required.

5. Load test a rolling update and verify no 5xx spike.

What is the role of the holdApplicationUntilProxyStarts setting? (Advanced)

Answer

holdApplicationUntilProxyStarts delays application container startup until the Istio proxy is ready. It is useful for workloads that make outbound calls immediately at startup or are sensitive to receiving traffic before Envoy is initialized.

Technical explanation

It reduces early connection failures caused by the app racing ahead of the sidecar.

It can increase startup time slightly, so it should be used deliberately for workloads that need it.

It does not replace readiness probes or graceful shutdown design.

Hands-on example

Enable through proxy config annotation or mesh policy depending on platform standard:

metadata:
  annotations:
    proxy.istio.io/config: |
      holdApplicationUntilProxyStarts: true

Then restart the pod and verify app logs start only after istio-proxy reports readiness.

How does Istio support traffic mirroring (shadowing), and why is it useful? (Advanced)

Answer

Traffic mirroring, or shadowing, sends a copy of live requests to another destination while the original request still goes to the primary service. It is useful for testing a new version with production-like traffic without affecting user responses.

Technical explanation

Mirrored traffic should not perform real side effects such as charging cards, sending emails, or writing authoritative data unless safely isolated.

The mirrored response is discarded, so it cannot directly affect the user's request.

Mirror percentage and destination must be controlled to avoid overloading the shadow service.

Hands-on example

VirtualService sketch:

route:
- destination:
    host: checkout
    subset: v1
  weight: 100
mirror:
  host: checkout
  subset: v2
mirrorPercentage:
  value: 10

Ensure v2 writes to a shadow database or runs in read-only mode before enabling.

How would you mirror production traffic to a new version for testing? (Advanced)

Answer

To mirror production traffic to a new version, I deploy the new version in an isolated mode, route normal traffic to stable, mirror a small percentage to the new version, and compare logs, traces, latency, and correctness metrics without returning mirrored responses to users.

Technical explanation

The shadow version must not trigger irreversible side effects.

Use separate downstream dependencies, mocked side effects, or idempotency guards.

Compare request handling, error rate, and output differences before canarying real traffic.

Hands-on example

Execution plan:

1. Deploy search-v2 with label version=v2.

2. Configure mirrorPercentage 1 percent.

3. Send v2 writes to a shadow index.

4. Compare top query results and latency.

5. Increase mirror to 10 percent if stable.

6. Move to real canary only after correctness checks pass.

How do you debug high tail latency introduced after enabling the mesh? (Advanced)

Answer

To debug high tail latency after enabling the mesh, I compare before/after latency at each hop: client, ingress gateway, source proxy, destination proxy, and application. I look for retries, connection-pool limits, mTLS CPU cost, DNS issues, telemetry overhead, EnvoyFilter cost, and downstream saturation.

Technical explanation

Tail latency is often amplified by retries, queueing, or connection limits rather than average proxy overhead.

Separate application latency from proxy-added latency using access logs, traces, and metrics from both source and destination.

Check resource throttling on istio-proxy; CPU limits can cause sharp p99 latency jumps.

Hands-on example

Debug steps:

$ kubectl top pod -n app --containers

$ istioctl proxy-config clusters deploy/frontend -n app | grep backend

PromQL: compare p99 of istio_request_duration_milliseconds by source and destination.

Temporarily disable new retries or filters in staging to isolate the regression.

How do you decide retry budgets to avoid retry storms in the mesh? (Advanced)

Answer

I decide retry budgets from the user latency budget, downstream capacity, idempotency, and incident behavior. The goal is to recover from transient failures without multiplying traffic so much that a struggling service collapses.

Technical explanation

Retries should be limited by attempts, per-try timeout, total timeout, and retry conditions.

Non-idempotent operations need idempotency keys or should not be retried blindly by the mesh.

Monitor retry rate as its own signal; a retry spike often means an incident is already developing.

Hands-on example

Budget example:

User-facing endpoint budget: 1s.

Downstream normal p95: 120ms.

Policy: attempts=2, perTryTimeout=200ms, timeout=600ms.

Alert when retry request rate exceeds 5 percent of original request rate for 5 minutes.

During brownouts, reduce retries or shed load.
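The arithmetic behind this budget is worth checking explicitly. The sketch below assumes that attempts counts retries on top of the first try and ignores retry backoff; the class and function names are hypothetical helpers, not Istio API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetryPolicy:
    attempts: int            # retries allowed in addition to the first try
    per_try_timeout_ms: int  # deadline for each individual try
    timeout_ms: int          # overall deadline for the request


def worst_case_latency_ms(policy):
    """Upper bound on how long the caller waits before getting an answer."""
    total_tries = 1 + policy.attempts
    return min(policy.timeout_ms, total_tries * policy.per_try_timeout_ms)


def max_amplification(policy):
    """Worst-case number of upstream requests per original request."""
    return 1 + policy.attempts


def retry_alert(retry_rps, original_rps, threshold=0.05):
    """True when retries exceed the given fraction of original traffic."""
    if original_rps <= 0:
        return False
    return retry_rps / original_rps > threshold


policy = RetryPolicy(attempts=2, per_try_timeout_ms=200, timeout_ms=600)
print(worst_case_latency_ms(policy))                # fits inside the 1s budget
print(max_amplification(policy))                    # load multiplier in a brownout
print(retry_alert(retry_rps=12, original_rps=200))  # 6 percent retries
```

Under these assumptions the policy stays within the 1s user budget, but a fully failing downstream could see up to three times its normal request rate from this caller alone, which is why the retry-rate alert and load shedding matter.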

How does Istio help with progressive delivery alongside Argo Rollouts or Flagger? (Advanced)

Answer

Istio works well with Argo Rollouts or Flagger by providing the traffic-routing mechanism while the progressive delivery controller manages rollout steps and analysis gates. The controller adjusts VirtualService weights based on metrics and either promotes or rolls back automatically.

Technical explanation

Istio handles the data-plane traffic split between stable and canary subsets or services.

Argo Rollouts or Flagger automates step progression, metric checks, pauses, and rollback.

The best setup includes SLO-based metrics from Prometheus plus application-specific checks.

Hands-on example

Example workflow:

Argo Rollouts creates canary ReplicaSet.

It updates Istio VirtualService from 5 percent to 20 percent to 50 percent.

AnalysisTemplate checks Prometheus 5xx rate and p95 latency.

If the metric fails, Argo sets canary weight to 0 and marks rollout failed.
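The workflow above might look like the following Argo Rollouts manifest. Treat it as a sketch: the checkout names, the image, the pause durations, the primary route name, and the error-rate-and-latency AnalysisTemplate are all hypothetical, and the referenced Services, VirtualService, and AnalysisTemplate are assumed to exist already.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: checkout                     # hypothetical workload
  namespace: app
spec:
  replicas: 4
  selector:
    matchLabels:
      app: checkout
  template:
    metadata:
      labels:
        app: checkout
    spec:
      containers:
      - name: checkout
        image: registry.example.com/checkout:v2   # placeholder image
  strategy:
    canary:
      stableService: checkout-stable   # Service selecting stable pods
      canaryService: checkout-canary   # Service selecting canary pods
      trafficRouting:
        istio:
          virtualService:
            name: checkout             # VirtualService whose weights are rewritten
            routes:
            - primary                  # named http route inside that VirtualService
      analysis:
        templates:
        - templateName: error-rate-and-latency   # hypothetical AnalysisTemplate
        startingStep: 1                # begin analysis once the canary has traffic
      steps:
      - setWeight: 5
      - pause: {duration: 5m}
      - setWeight: 20
      - pause: {duration: 5m}
      - setWeight: 50
      - pause: {duration: 5m}
```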

What observability gaps does Istio NOT fill that you still need application instrumentation for? (Advanced)

Answer

Istio does not replace application instrumentation. It shows network-level service telemetry, but it cannot fully explain business transactions, internal code paths, database query causes, cache hit logic, queue processing, or domain-specific correctness without application metrics and traces.

Technical explanation

The proxy sees requests at service boundaries, not every function call inside a process.

It cannot know why an order failed validation or which SQL query caused latency unless the app emits that context.

Use Istio telemetry with OpenTelemetry, structured logs, RED/USE metrics, and business KPIs.

Hands-on example

Example gap:

Istio shows checkout -> payment returns 500.

Application telemetry shows the actual reason: payment provider timeout after fraud-rule lookup.

Database metrics show fraud_rules query p99 increased.

Without app and DB instrumentation, the mesh only identifies the failing edge.

How do you secure the Istio control plane itself? (Advanced)

Answer

I secure the Istio control plane by isolating istio-system, restricting RBAC, limiting who can change Istio CRDs, protecting signing keys and root CA material, enabling audit logging, using supported versions, applying NetworkPolicies, and monitoring istiod health and config pushes.

Technical explanation

Anyone who can change AuthorizationPolicy, Gateway, VirtualService, EnvoyFilter, or mesh config can affect production traffic and security.

istiod should run with minimal required privileges and be protected by Kubernetes RBAC and admission controls.

Upgrade hygiene matters because the mesh is a privileged traffic-management layer.

Hands-on example

Hardening checklist:

1. No broad cluster-admin for app teams.

2. Separate platform admin role for Istio install and mesh config.

3. Admission policy blocks dangerous EnvoyFilters.

4. NetworkPolicy limits access to control-plane ports.

5. Alert on istiod restarts, xDS errors, and certificate issues.
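Item 4 of the checklist could be sketched as a NetworkPolicy like the one below. It assumes a default sidecar-mode install where istiod pods carry the app: istiod label, a monitoring namespace that scrapes metrics, and a CNI that enforces NetworkPolicy. How API server traffic to the webhook port is matched varies by CNI, so test this in a non-production cluster first.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: istiod-ingress        # hypothetical policy name
  namespace: istio-system
spec:
  podSelector:
    matchLabels:
      app: istiod
  policyTypes:
  - Ingress
  ingress:
  # xDS and certificate signing over TLS, from proxies in any namespace.
  - from:
    - namespaceSelector: {}
    ports:
    - protocol: TCP
      port: 15012
  # Injection and validation webhooks, called by the Kubernetes API server.
  # Left without a "from" clause because the API server's source identity
  # depends on the cluster and CNI.
  - ports:
    - protocol: TCP
      port: 15017
  # Control-plane metrics, only from the (assumed) monitoring namespace.
  - from:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: monitoring
    ports:
    - protocol: TCP
      port: 15014
```

The plaintext xDS port 15010 and the debug port 8080 are deliberately not opened in this sketch.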

What RBAC is needed to manage Istio resources safely? (Advanced)

Answer

RBAC should separate platform-level mesh administration from application-level route and policy ownership. App teams may manage their namespace's VirtualServices, AuthorizationPolicies, and routes, but shared gateways, mesh-wide policies, IstioOperator, EnvoyFilters, and root trust configuration should be tightly restricted.

Technical explanation

Least privilege prevents one team from accidentally breaking another team's traffic.

Cluster-scoped or shared resources require platform review and CI validation.

Use Kubernetes RBAC plus admission policies and GitOps ownership rules.

Hands-on example

Role model:

Platform Admin: install/upgrade Istio, manage root config, shared gateways.

Service Owner: manage namespace routes and policies for owned hosts.

Security Reviewer: approve DENY policies, external auth, and egress rules.

CI Bot: apply only validated manifests from approved repositories.
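The Service Owner role could be sketched as a namespaced Role and RoleBinding. The payments namespace and the payments-team group are hypothetical. EnvoyFilter, Gateway, and PeerAuthentication are left out on purpose so they stay with the platform and security roles.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: mesh-service-owner     # hypothetical role name
  namespace: payments          # hypothetical team namespace
rules:
# Namespace-local routing objects.
- apiGroups: ["networking.istio.io"]
  resources: ["virtualservices", "destinationrules"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
# Namespace-local authorization policies.
- apiGroups: ["security.istio.io"]
  resources: ["authorizationpolicies"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: mesh-service-owner
  namespace: payments
subjects:
- kind: Group
  name: payments-team          # hypothetical group from the identity provider
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: mesh-service-owner
  apiGroup: rbac.authorization.k8s.io
```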

How do you prevent a team from misconfiguring routing for a shared gateway? (Advanced)

Answer

To prevent shared-gateway misconfiguration, I use ownership boundaries, Gateway API attachment rules where possible, host allowlists, admission policies, CI validation, and GitOps review. App teams should not directly edit the shared gateway listener configuration unless they own that gateway.

Technical explanation

Shared gateways are high-blast-radius resources because a bad host, TLS, or route configuration can affect many services.

The platform team should own listeners, certificates, and allowed route namespaces.

App teams can own route objects constrained to approved hostnames and namespaces.

Hands-on example

Control example:

1. Platform owns Gateway public-gw.

2. allowedRoutes permits only selected namespaces.

3. Admission policy verifies host suffix matches team domain.

4. CI runs istioctl analyze and host-conflict checks.

5. GitOps applies after approval from platform and service owner.
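Steps 1 and 2 could be expressed with the Kubernetes Gateway API roughly as follows. The namespace, hostname, certificate Secret, and the shared-gateway-access label are assumptions made for illustration.

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: public-gw
  namespace: istio-ingress            # hypothetical platform-owned namespace
spec:
  gatewayClassName: istio
  listeners:
  - name: https
    hostname: "*.example.com"         # placeholder domain
    port: 443
    protocol: HTTPS
    tls:
      mode: Terminate
      certificateRefs:
      - name: public-gw-cert          # hypothetical TLS Secret
    allowedRoutes:
      namespaces:
        from: Selector
        selector:
          matchLabels:
            shared-gateway-access: "true"   # hypothetical opt-in label set by the platform team
```

Only routes in namespaces carrying the opt-in label can attach to this listener, so the platform team controls attachment by controlling who may set that label.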

How would you structure Istio config ownership across many teams? (Advanced)

Answer

I structure Istio config ownership by separating platform-owned, security-owned, and service-owned resources. Platform owns installation, revisions, gateways, mesh config, and global defaults. Security owns baseline mTLS and authorization standards. Service teams own namespace-local routing and policies for their services within guardrails.

Technical explanation

Clear ownership reduces outage risk from overlapping VirtualServices or conflicting policies.

Git repository layout should mirror ownership and environment promotion.

Admission controls should enforce the ownership model because documentation alone is not enough.

Hands-on example

Repo layout:

mesh-platform/: Istio install, revisions, gateways, telemetry defaults.

mesh-security/: baseline PeerAuthentication and default-deny templates.

services/<team>/<service>/: virtualservice, destinationrule, authz-policy.

CI validates each layer and prevents service repos from changing shared gateway selectors.
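The security-owned baseline in the layout above might contain manifests like these. This sketch assumes istio-system is the mesh root namespace and uses payments as a hypothetical service namespace. Older Istio releases may need security.istio.io/v1beta1.

```yaml
# Mesh-wide strict mTLS: a PeerAuthentication named "default" in the
# root namespace applies to every workload in the mesh.
apiVersion: security.istio.io/v1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system
spec:
  mtls:
    mode: STRICT
---
# Default-deny template: an AuthorizationPolicy with an empty spec
# allows nothing in its namespace until explicit ALLOW policies are added.
apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: default-deny
  namespace: payments        # hypothetical service namespace
spec: {}
```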

How do you validate mesh config changes in CI before applying? (Advanced)

Answer

I validate mesh config in CI by rendering manifests, running schema validation, istioctl analyze, policy tests, host/subset checks, and optionally deploying to an ephemeral or staging namespace for smoke tests before production GitOps sync.

Technical explanation

Many Istio outages are configuration mistakes, so static analysis provides high value.

CI should catch missing subsets, invalid gateways, host conflicts, dangerous wildcard routes, and overly broad AuthorizationPolicies.

Runtime smoke tests are still needed because static tools cannot prove application behavior.

Hands-on example

CI pipeline:

$ helm template chart/ -f values-prod.yaml > rendered.yaml

$ kubeconform -strict -ignore-missing-schemas rendered.yaml

$ istioctl analyze --use-kube=false --failure-threshold Warning rendered.yaml

Custom checks:

- No wildcard host on shared gateway without approval.

- DestinationRule subsets match deployment labels.

- AuthorizationPolicy DENY has owner and test evidence.

What is your rollback strategy if an Istio upgrade degrades traffic? (Advanced)

Answer

If an Istio upgrade degrades traffic, my rollback strategy is to stop expansion, move affected namespaces back to the previous revision or revision tag, restart affected workloads, and, if gateways are impacted, roll back gateway deployments or traffic routing first. I keep the old control plane and manifests until the rollback window closes.

Technical explanation

The fastest safe rollback depends on whether the issue is sidecar data plane, gateway data plane, control plane, CRD/API behavior, or mesh config compatibility.

Revision-based upgrades make rollback targeted instead of cluster-wide.

Before upgrading, I define objective rollback triggers such as p99 latency, 5xx burn rate, proxy crash loop, or mTLS failures.

Hands-on example

Rollback runbook:

$ istioctl tag set stable --revision old --overwrite

$ kubectl label ns payments istio.io/rev=stable --overwrite

$ kubectl rollout restart deploy -n payments

$ istioctl proxy-status | grep payments

For gateway issue:

$ kubectl rollout undo deploy/istio-ingressgateway -n istio-system

Verify SLO recovery before resuming upgrade.

How do you measure whether the mesh is actually improving reliability? (Advanced)

Answer

I measure whether the mesh improves reliability by comparing SLO outcomes before and after adoption: lower incident frequency, faster rollback, safer canaries, fewer plaintext or unauthorized paths, better service-edge visibility, reduced MTTR, and fewer release-related outages.

Technical explanation

The mesh should be judged by business and reliability outcomes, not just feature enablement.

Measure both benefits and costs: proxy overhead, operational incidents caused by mesh config, Prometheus cardinality, and platform toil.

A good adoption review includes control-plane availability, gateway availability, team onboarding speed, and policy compliance.

Hands-on example

Scorecard:

Before/after metrics:

- Release rollback time.

- Percentage of internal traffic using mTLS.

- Number of services with explicit least-privilege policy.

- MTTR for service-to-service incidents.

- p99 latency delta.

- Mesh-caused incidents per quarter.

Keep the mesh only if net reliability improves.

What recent Istio feature have you evaluated, and what value would it bring? (Advanced)

Answer

A recent Istio feature I would evaluate is ambient mode. Its value is reducing per-pod sidecar overhead and simplifying onboarding by using ztunnel for secure L4 mesh and optional waypoint proxies for L7 features where needed.

Technical explanation

Ambient mode can make mesh adoption easier for teams that are sensitive to sidecar resource cost or pod lifecycle complexity.

It changes the operational model: ztunnel handles the secure overlay, while waypoints must be designed around L7 security boundaries.

I would evaluate it through performance tests, observability changes, security policy coverage, and migration complexity rather than enabling it broadly on day one.

Hands-on example

Evaluation plan:

1. Pick one low-risk namespace.

2. Enable ambient mode and confirm ztunnel traffic capture.

3. Add a waypoint for a service needing L7 auth.

4. Compare CPU/memory, p99 latency, mTLS coverage, metrics labels, and policy behavior against sidecar mode.

5. Document unsupported cases and rollback steps.
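Steps 2 and 3 of the plan could be applied declaratively along these lines. The ambient-pilot namespace is hypothetical, and the sketch assumes Istio is installed with the ambient profile and that the Kubernetes Gateway API CRDs are present in the cluster.

```yaml
# Enroll the pilot namespace in ambient mode; ztunnel then captures its traffic.
apiVersion: v1
kind: Namespace
metadata:
  name: ambient-pilot                  # hypothetical low-risk namespace
  labels:
    istio.io/dataplane-mode: ambient
---
# Waypoint proxy for services in the namespace that need L7 policy.
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: waypoint
  namespace: ambient-pilot
  labels:
    istio.io/waypoint-for: service
spec:
  gatewayClassName: istio-waypoint
  listeners:
  - name: mesh
    port: 15008
    protocol: HBONE
```

A Service, or the whole namespace, starts using the waypoint once it carries the label istio.io/use-waypoint: waypoint. Removing that label and the dataplane-mode label is the corresponding rollback.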

How do you justify the operational complexity of a service mesh to leadership? (Advanced)

Answer

I justify service mesh complexity only when the benefits are measurable: stronger internal security, faster and safer releases, standardized traffic policy, better service-edge observability, and lower MTTR. I also present the operating cost honestly: upgrades, governance, proxy overhead, and training.

Technical explanation

Leadership cares about risk reduction, delivery speed, compliance, and operational efficiency, not just technology adoption.

A mesh should start with a targeted business case such as mTLS compliance, progressive delivery, or platform-wide service visibility.

I would propose phased adoption with success metrics and explicit exit criteria if the mesh does not deliver value.

Hands-on example

Leadership scorecard:

Benefits:

- 100 percent mTLS for tier-1 service paths.

- Canary rollback in under 5 minutes.

- Service dependency map for incident response.

- Reduced release-related incidents.

Costs:

- Proxy resource overhead.

- Platform ownership and training.

- Upgrade and config governance.

Decision: proceed only if measured benefits exceed ongoing operational cost.

Reference Notes Checked for Current Istio Terminology

Istio ambient overview: https://istio.io/latest/docs/ambient/overview/

Istio sidecar and ambient data plane modes: https://istio.io/latest/docs/overview/dataplane-modes/

Istio waypoint proxy usage: https://istio.io/latest/docs/ambient/usage/waypoint/

Istio Gateway reference: https://istio.io/latest/docs/reference/config/networking/gateway/

Istio Kubernetes Gateway API task: https://istio.io/latest/docs/tasks/traffic-management/ingress/gateway-api/

Istio AuthorizationPolicy reference: https://istio.io/latest/docs/reference/config/security/authorization-policy/

Istio AuthorizationPolicy dry run task: https://istio.io/latest/docs/tasks/security/authorization/authz-dry-run/

Istio resource annotations: https://istio.io/latest/docs/reference/config/annotations/
