Kubernetes, Docker, Helm & Podman interview questions & answers

100 Kubernetes, Docker, Helm & Podman interview questions, each answered three ways: a concise spoken answer, a technical explanation, and a hands-on example.

All questions

  1. What is Kubernetes, and what problem does it solve over running containers manually?
  2. Explain the Kubernetes control plane components (API server, etcd, scheduler, controller manager).
  3. What runs on a worker node (kubelet, kube-proxy, container runtime)?
  4. What is a Pod, and why does Kubernetes schedule Pods rather than containers?
  5. What is the difference between a Pod, a ReplicaSet, and a Deployment?
  6. How does a Deployment perform a rolling update, and how do maxSurge and maxUnavailable work?
  7. How do you roll back a Deployment, and how does Kubernetes track revisions?
  8. What is a Service, and what are the types (ClusterIP, NodePort, LoadBalancer, ExternalName)?
  9. How does a Service select its Pods, and what happens if labels do not match?
  10. What is an Ingress, and how does it differ from a LoadBalancer Service?
  11. What is the difference between a liveness, readiness, and startup probe?
  12. What happens when a liveness probe fails versus when a readiness probe fails?
  13. What are resource requests and limits, and what happens when a container exceeds each?
  14. What is the difference between a request and a limit for CPU versus memory?
  15. What is OOMKilled, and how do you diagnose and prevent it?
  16. What is a ConfigMap, and how do you consume it in a Pod?
  17. What is a Secret, and how is it different from a ConfigMap (and is it actually encrypted)?
  18. How do you encrypt Kubernetes Secrets at rest?
  19. What is a namespace, and how do you use it for isolation and quotas?
  20. What is a ResourceQuota and a LimitRange?
  21. What is the difference between a StatefulSet and a Deployment?
  22. When would you use a DaemonSet, and give a real example.
  23. What is a Job versus a CronJob?
  24. What is a HorizontalPodAutoscaler, and what metrics can it scale on?
  25. What is the difference between the HPA, VPA, and Cluster Autoscaler?
  26. How does the Cluster Autoscaler decide to add or remove nodes?
  27. What is a PersistentVolume, a PersistentVolumeClaim, and a StorageClass?
  28. What is dynamic volume provisioning, and how does a StorageClass enable it?
  29. What is the difference between a volume, a PersistentVolume, and an emptyDir?
  30. How does Kubernetes DNS work for service discovery?
  31. Explain the Kubernetes networking model and the requirement that all Pods can reach each other.
  32. What is a CNI plugin, and name a few (Calico, Cilium, AWS VPC CNI)?
  33. What is kube-proxy, and how does it implement Service routing (iptables vs IPVS)?
  34. What is a NetworkPolicy, and what is the default Pod-to-Pod behaviour without one?
  35. What is RBAC, and what are Roles, ClusterRoles, RoleBindings, and ClusterRoleBindings?
  36. What is a ServiceAccount, and how do Pods use it to talk to the API server?
  37. How does the scheduler decide where to place a Pod?
  38. What are node selectors, node affinity, and anti-affinity?
  39. What are taints and tolerations, and how do they differ from affinity?
  40. What are Pod topology spread constraints, and why use them?
  41. What is a static Pod, and how does it differ from a normally scheduled Pod?
  42. What are init containers, and when would you use one?
  43. What is a sidecar container, and give a common use case?
  44. What is etcd, and why is it critical to back it up?
  45. How would you back up and restore an etcd cluster?
  46. What is a CRD (Custom Resource Definition), and what is an Operator?
  47. What is the controller/reconciliation loop pattern in Kubernetes?
  48. How do you troubleshoot a Pod stuck in Pending?
  49. How do you troubleshoot a Pod in CrashLoopBackOff?
  50. How do you troubleshoot an ImagePullBackOff error?
  51. What does kubectl describe show that kubectl get does not?
  52. How do you view logs from a crashed (previous) container instance?
  53. What is a PodDisruptionBudget, and how does it protect availability during maintenance?
  54. What is graceful termination, and how do preStop hooks and terminationGracePeriodSeconds work?
  55. How do you safely drain and cordon a node for maintenance?
  56. What is the difference between cordon, drain, and delete on a node?
  57. What are Kubernetes QoS classes (Guaranteed, Burstable, BestEffort)?
  58. How does Kubernetes handle a node that becomes NotReady?
  59. What is the difference between horizontal scaling of Pods and scaling nodes?
  60. How do you expose a service externally on EKS, and what gets created?
  61. What is a container, and how is it different from a virtual machine?
  62. What is a container image, and what are layers?
  63. How does Docker layer caching work, and how do you order a Dockerfile to exploit it?
  64. What is the difference between an image and a container?
  65. What is the difference between CMD and ENTRYPOINT in a Dockerfile?
  66. What is the difference between COPY and ADD?
  67. What is a multi-stage build, and why does it reduce image size and risk?
  68. Why should containers run as a non-root user, and how do you enforce it?
  69. What is a distroless or scratch image, and what are the trade-offs?
  70. What is the difference between EXPOSE and publishing a port with -p?
  71. What is a Docker volume versus a bind mount?
  72. How do you reduce the size of a Docker image?
  73. What is a .dockerignore file, and why does it matter?
  74. What is the difference between the build context and the image?
  75. What is the role of the ENTRYPOINT exec form versus shell form regarding signals?
  76. How do you debug a container that exits immediately on start?
  77. What is a container registry, and how does image tagging and digests work?
  78. Why is using the latest tag in production discouraged?
  79. What is the difference between Docker and containerd?
  80. What is Podman, and how does it differ architecturally from Docker (daemonless, rootless)?
  81. What are the security advantages of Podman's rootless and daemonless design?
  82. What is a Podman pod, and how does it relate to the Kubernetes Pod concept?
  83. How do you generate Kubernetes YAML from Podman (podman generate kube)?
  84. Is the Podman CLI compatible with Docker commands, and what aliasing is possible?
  85. What is Helm, and what problem does it solve over raw manifests?
  86. What are the parts of a Helm chart (Chart.yaml, values.yaml, templates, _helpers.tpl)?
  87. How does Helm templating work, and how do values get injected?
  88. What is the difference between helm install, helm upgrade, and helm rollback?
  89. How does Helm track releases and revisions?
  90. What is the difference between helm template and helm install?
  91. How do you manage environment-specific values across dev, staging, and prod with Helm?
  92. What are Helm hooks, and when would you use them?
  93. What is a Helm subchart and chart dependency, and how is it managed?
  94. How do you secure secrets in Helm (e.g., with helm-secrets or external stores)?
  95. How does Helm compare to Kustomize, and when would you choose each?
  96. How would you validate and lint a Helm chart in CI?
  97. How do you debug a failed Helm upgrade and a release stuck in pending-upgrade?
  98. How would you design a chart to be reusable across multiple services?
  99. What recent Kubernetes feature have you used, and what value did it bring?
  100. How would you harden a Kubernetes cluster (Pod Security Standards, RBAC, network policies, image policy)?

What is Kubernetes, and what problem does it solve over running containers manually? (Basic)

Answer

Kubernetes is a container orchestration platform. Instead of manually starting containers on individual hosts, I declare the desired state, and Kubernetes handles scheduling, health checks, restarts, service discovery, scaling, and rolling deployments across a cluster.

Technical explanation

Manual container operations fail at scale because humans cannot reliably handle placement, restarts, rollouts, service discovery, and drift across many hosts.

Kubernetes works through a desired-state API and controllers, so operations become declarative and repeatable instead of host-by-host commands.

Kubernetes resources are declarative API objects; controllers continuously drive actual state toward spec.

The practical interview angle is to connect the concept to reliability: scheduling, healing, scaling, rollout safety, and clear ownership.

Use kubectl get, describe, explain, and -o yaml to move from high-level view to exact spec/status details.
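
The desired-state idea can be illustrated with a toy reconciliation loop. This is only a sketch: ToyCluster and reconcile are hypothetical names invented for the illustration, not Kubernetes APIs, and a real controller watches the API server rather than an in-memory list.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCluster:
    """Hypothetical in-memory stand-in for cluster state (not a Kubernetes API)."""
    desired_replicas: int
    running: list = field(default_factory=list)
    next_id: int = 0

    def start_pod(self) -> None:
        self.running.append(f"pod-{self.next_id}")
        self.next_id += 1

    def stop_pod(self) -> None:
        self.running.pop()


def reconcile(cluster: ToyCluster) -> str:
    """One pass of the loop: compare observed state with desired state, act on the diff."""
    diff = cluster.desired_replicas - len(cluster.running)
    if diff > 0:
        for _ in range(diff):
            cluster.start_pod()
        return f"started {diff}"
    if diff < 0:
        for _ in range(-diff):
            cluster.stop_pod()
        return f"stopped {-diff}"
    return "in sync"


cluster = ToyCluster(desired_replicas=3)
print(reconcile(cluster))    # started 3
cluster.running.pop(0)       # simulate a crashed Pod
print(reconcile(cluster))    # started 1
cluster.desired_replicas = 2 # the operator edits the spec
print(reconcile(cluster))    # stopped 1
print(reconcile(cluster))    # in sync
```

The point to draw out in an interview is that nobody issues "start" or "stop" commands; the operator only changes the desired count, and the loop works out the action.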

Hands-on example

1. Create a local lab with kind or minikube, then use it to demonstrate: compare a manual docker run deployment with a Kubernetes Deployment and Service.

2. Run kubectl get nodes -o wide, kubectl get pods -A, kubectl describe, and kubectl get -o yaml to connect the concept to actual cluster state.

3. Make one intentional change, such as a label change, image update, or replica change, and watch how the control plane reconciles it.

4. Capture the command output and convert it into an interview story: desired state, observed state, failure mode, and fix.

Explain the Kubernetes control plane components (API server, etcd, scheduler, controller manager). (Basic)

Answer

The control plane is the brain of Kubernetes. The API server is the entry point, etcd stores cluster state, the scheduler assigns unscheduled Pods to nodes, and controller managers continuously reconcile actual state back to desired state.

Technical explanation

The API server is the only supported interface for cluster state changes; components watch it and update status or desired state through it.

etcd must be protected with encryption, access control, backups, and quorum-aware operations because it is the source of truth.

Hands-on example

1. Create a local lab with kind or minikube, then use it to inspect control-plane health with kubectl get pods -n kube-system and kubectl get --raw '/readyz?verbose' (quoted so the shell does not treat the ? as a glob character).

2. Run kubectl get nodes -o wide, kubectl get pods -A, kubectl describe, and kubectl get -o yaml to connect the concept to actual cluster state.

3. Make one intentional change, such as a label change, image update, or replica change, and watch how the control plane reconciles it.

4. Capture the command output and convert it into an interview story: desired state, observed state, failure mode, and fix.

What runs on a worker node (kubelet, kube-proxy, container runtime)? (Basic)

Answer

A worker node runs kubelet, a container runtime such as containerd or CRI-O, kube-proxy or an equivalent data-plane implementation, and node-level agents like CNI, CSI, logging, and monitoring components. Kubelet is the main node agent that makes Pods real on that node.

Technical explanation

kubelet does not run containers directly; it talks to the runtime through CRI and reports Pod/node status back to the API server.

Node reliability depends on kubelet health, runtime health, disk pressure, memory pressure, network plugin health, and certificate validity.

Hands-on example

1. Create a local lab with kind or minikube, then use it to demonstrate: inspect a worker node with kubectl describe node and check kubelet/runtime conditions.

2. Run kubectl get nodes -o wide, kubectl get pods -A, kubectl describe, and kubectl get -o yaml to connect the concept to actual cluster state.

3. Make one intentional change, such as a label change, image update, or replica change, and watch how the control plane reconciles it.

4. Capture the command output and convert it into an interview story: desired state, observed state, failure mode, and fix.

What is a Pod, and why does Kubernetes schedule Pods rather than containers? (Basic)

Answer

A Pod is the smallest deployable unit in Kubernetes. Kubernetes schedules Pods rather than individual containers because containers inside a Pod share the same lifecycle, network namespace, volumes, and placement requirement.

Technical explanation

Containers in a Pod share one IP address and one port space, so they can talk to each other over localhost, and two containers in the same Pod cannot bind the same port.

Pod co-location should be used for tightly coupled containers, not as a replacement for independent microservices.

Hands-on example

1. Create a local lab with kind or minikube, then use it to demonstrate: create a two-container Pod sharing localhost and an emptyDir volume.

2. Run kubectl get nodes -o wide, kubectl get pods -A, kubectl describe, and kubectl get -o yaml to connect the concept to actual cluster state.

3. Make one intentional change, such as a label change, image update, or replica change, and watch how the control plane reconciles it.

4. Capture the command output and convert it into an interview story: desired state, observed state, failure mode, and fix.

What is the difference between a Pod, a ReplicaSet, and a Deployment? (Basic)

Answer

A Pod is an instance of one or more containers. A ReplicaSet keeps a desired number of matching Pods running. A Deployment sits above the ReplicaSet and manages rollout, rollback, and declarative updates for stateless workloads.

Technical explanation

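A Deployment owns one ReplicaSet per revision (normally only the current one has replicas), and each ReplicaSet owns its Pods through ownerReferences; editing child ReplicaSets or Pods directly creates drift that the controllers may overwrite or that is lost on the next rollout.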

Use Deployments for stateless services and StatefulSets or Jobs when identity or completion semantics matter.

Hands-on example

1. Create a local lab with kind or minikube, then use it to demonstrate: create a Deployment, then inspect the ReplicaSet and Pods it owns.

2. Run kubectl get nodes -o wide, kubectl get pods -A, kubectl describe, and kubectl get -o yaml to connect the concept to actual cluster state.

3. Make one intentional change, such as a label change, image update, or replica change, and watch how the control plane reconciles it.

4. Capture the command output and convert it into an interview story: desired state, observed state, failure mode, and fix.

How does a Deployment perform a rolling update, and how do maxSurge and maxUnavailable work? (Basic)

Answer

A Deployment rolling update creates a new ReplicaSet and gradually shifts replicas from the old ReplicaSet to the new one. maxSurge controls how many extra Pods can run above desired replicas, and maxUnavailable controls how many desired Pods can be unavailable during the rollout.

Technical explanation

maxSurge can temporarily consume more cluster capacity, so capacity planning matters before rolling out large Deployments.

Readiness probes are critical during rolling updates because they control when new Pods become eligible for traffic.
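
The arithmetic behind maxSurge and maxUnavailable can be sketched as follows. The helper names resolve and rollout_bounds are made up for this illustration; the behaviour it assumes is the documented one, where both fields default to 25%, a percentage maxSurge is rounded up, and a percentage maxUnavailable is rounded down.

```python
def resolve(value, replicas: int, round_up: bool) -> int:
    """Turn an absolute count or a percentage string such as '25%' into a Pod count."""
    if isinstance(value, str) and value.endswith("%"):
        scaled = replicas * int(value[:-1])
        # Integer arithmetic avoids float rounding surprises.
        return -(-scaled // 100) if round_up else scaled // 100
    return int(value)


def rollout_bounds(replicas: int, max_surge="25%", max_unavailable="25%"):
    """Return (max total Pods, min available Pods) allowed during the rollout."""
    surge = resolve(max_surge, replicas, round_up=True)               # maxSurge rounds up
    unavailable = resolve(max_unavailable, replicas, round_up=False)  # maxUnavailable rounds down
    return replicas + surge, replicas - unavailable


print(rollout_bounds(10))                                   # (13, 8)
print(rollout_bounds(4, max_surge=1, max_unavailable=0))    # (5, 4)
```

With 10 replicas and the defaults, the rollout may run up to 13 Pods at once and must keep at least 8 available, which is the extra capacity the cluster needs to have free before the update starts.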

Hands-on example

1. Create a local lab with kind or minikube, then use it to demonstrate: perform a rolling update with maxSurge and maxUnavailable changed.

2. Run kubectl get nodes -o wide, kubectl get pods -A, kubectl describe, and kubectl get -o yaml to connect the concept to actual cluster state.

3. Make one intentional change, such as a label change, image update, or replica change, and watch how the control plane reconciles it.

4. Capture the command output and convert it into an interview story: desired state, observed state, failure mode, and fix.

How do you roll back a Deployment, and how does Kubernetes track revisions? (Basic)

Answer

I roll back a Deployment with kubectl rollout undo, optionally targeting a revision. Kubernetes tracks rollout history through Deployment revisions and the ReplicaSets it created, so rollback usually means scaling an earlier ReplicaSet back up.

Technical explanation

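Rollback requires the previous ReplicaSets to still exist; revisionHistoryLimit (default 10) controls how many old ReplicaSets are retained. Each ReplicaSet carries its revision number in the deployment.kubernetes.io/revision annotation, which is what kubectl rollout history reports.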

Rollback does not undo external changes such as database migrations unless those are separately versioned and reversible.

Hands-on example

1. Create a local lab with kind or minikube, then use it to demonstrate: roll back a bad Deployment image and inspect rollout history.

2. Run kubectl get nodes -o wide, kubectl get pods -A, kubectl describe, and kubectl get -o yaml to connect the concept to actual cluster state.

3. Make one intentional change, such as a label change, image update, or replica change, and watch how the control plane reconciles it.

4. Capture the command output and convert it into an interview story: desired state, observed state, failure mode, and fix.

What is a Service, and what are the types (ClusterIP, NodePort, LoadBalancer, ExternalName)? (Basic)

Answer

A Service gives Pods a stable virtual endpoint and load-balances to matching endpoints. ClusterIP is internal, NodePort exposes a port on each node, LoadBalancer provisions an external load balancer through the cloud provider, and ExternalName returns a DNS CNAME to an external service.

Technical explanation

A Service's backends are recorded in EndpointSlice objects rather than referenced as Pods directly, and the control plane updates those endpoints as Pods become ready or unready.

NodePort and LoadBalancer expose traffic differently; cloud LoadBalancer behavior depends on provider/controller implementation.

Hands-on example

1. Create a local lab with kind or minikube, then use it to demonstrate: create ClusterIP, NodePort, and LoadBalancer-style Service manifests and compare behavior.

2. Run kubectl get nodes -o wide, kubectl get pods -A, kubectl describe, and kubectl get -o yaml to connect the concept to actual cluster state.

3. Make one intentional change, such as a label change, image update, or replica change, and watch how the control plane reconciles it.

4. Capture the command output and convert it into an interview story: desired state, observed state, failure mode, and fix.

How does a Service select its Pods, and what happens if labels do not match? (Basic)

Answer

A Service selects Pods using labels. If the selector does not match any Pods, the Service still exists, but its EndpointSlice has no usable endpoints, so traffic to that Service fails even though the Pods themselves may be healthy.

Technical explanation

Label discipline is an availability concern: a typo in app labels can create a Service with no endpoints while Deployment replicas look healthy.

Debug Service selection with kubectl get endpointslices, kubectl describe service, and kubectl get pods --show-labels.
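
The matching rule itself is simple enough to sketch. The function selects and the Pod names below are illustrative only; the rule it models is that a (non-empty) Service selector matches a Pod when every key/value pair in the selector is present in the Pod's labels.

```python
def selects(selector: dict, pod_labels: dict) -> bool:
    """Equality-based match: every selector pair must appear on the Pod.

    Assumes a non-empty selector; a Service without a selector is handled
    differently by Kubernetes and is out of scope for this sketch.
    """
    return all(pod_labels.get(key) == value for key, value in selector.items())


pods = {
    "web-1": {"app": "web", "tier": "frontend"},
    "web-2": {"app": "web", "tier": "frontend"},
}

good_selector = {"app": "web"}
typo_selector = {"app": "wbe"}  # one-character typo in the Service manifest

print([name for name, labels in pods.items() if selects(good_selector, labels)])  # ['web-1', 'web-2']
print([name for name, labels in pods.items() if selects(typo_selector, labels)])  # []
```

The second result is the failure mode described above: both Pods are healthy, yet the Service has no endpoints to send traffic to.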

Hands-on example

1. Create a local lab with kind or minikube, then use it to demonstrate: intentionally break a Service selector and observe empty EndpointSlices.

2. Run kubectl get nodes -o wide, kubectl get pods -A, kubectl describe, and kubectl get -o yaml to connect the concept to actual cluster state.

3. Make one intentional change, such as a label change, image update, or replica change, and watch how the control plane reconciles it.

4. Capture the command output and convert it into an interview story: desired state, observed state, failure mode, and fix.

What is an Ingress, and how does it differ from a LoadBalancer Service? (Basic)

Answer

Ingress is Layer 7 HTTP/HTTPS routing managed by an ingress controller. A LoadBalancer Service normally provisions a Layer 4 cloud load balancer for one Service, while Ingress can route many hosts and paths through one controller/load balancer.

Technical explanation

Ingress is only an API object until an ingress controller is installed; without a controller nothing actually routes traffic.

Modern clusters may also use Gateway API, but Ingress remains common for HTTP routing.

Hands-on example

1. Create a local lab with kind or minikube, then use it to demonstrate: deploy an ingress controller and route two paths to two Services.

2. Run kubectl get nodes -o wide, kubectl get pods -A, kubectl describe, and kubectl get -o yaml to connect the concept to actual cluster state.

3. Make one intentional change, such as a label change, image update, or replica change, and watch how the control plane reconciles it.

4. Capture the command output and convert it into an interview story: desired state, observed state, failure mode, and fix.

What is the difference between a liveness, readiness, and startup probe? (Basic)

Answer

A liveness probe answers whether the container should be restarted, a readiness probe answers whether it should receive traffic, and a startup probe gives slow-starting applications time to become healthy before liveness begins.

Technical explanation

Startup probes disable liveness and readiness checks until startup succeeds, which prevents premature restarts for slow applications.

Probe thresholds, timeoutSeconds, and periodSeconds must reflect real app behavior, not arbitrary defaults.

Health and resources are production controls, not just YAML fields; wrong settings cause outages, noisy restarts, bad rollouts, or wasted capacity.

Requests affect scheduling and node capacity planning; readiness affects traffic; liveness affects restart behavior.

Validate settings with real load, startup timing, memory profiles, and deployment rollout behavior.
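
A quick way to sanity-check probe settings is to compute the time budget they imply. The sketch below is an approximation that ignores timeoutSeconds and scheduling jitter, and the function name is invented for the example; the field names and the defaults quoted in the comments (periodSeconds 10, failureThreshold 3) are the standard probe fields.

```python
def probe_budget_seconds(failure_threshold: int, period_seconds: int,
                         initial_delay_seconds: int = 0) -> int:
    """Approximate time a probe may keep failing before kubelet acts on it."""
    return initial_delay_seconds + failure_threshold * period_seconds


# Startup probe with failureThreshold=30, periodSeconds=10:
# the app gets roughly five minutes to finish starting.
print(probe_budget_seconds(30, 10))   # 300

# Liveness probe left at the defaults (failureThreshold=3, periodSeconds=10):
# a hung process is restarted after roughly half a minute.
print(probe_budget_seconds(3, 10))    # 30
```

If the measured worst-case startup time is close to or above the first number, the startup probe should be loosened before the liveness probe is tightened.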

Hands-on example

1. Create a namespace and deploy a small HTTP app specifically to test: add liveness, readiness, and startup probes to a slow-starting app.

2. Add probes and resources in YAML, then run kubectl describe pod, kubectl top pod, and kubectl rollout status to observe behavior.

3. Introduce a controlled failure such as slow startup, bad health endpoint, CPU load, or memory spike.

4. Tune thresholds, requests, and limits until rollout and runtime behavior are stable, then document the production values and why.

What happens when a liveness probe fails versus when a readiness probe fails? (Basic)

Answer

When liveness fails, kubelet restarts the container. When readiness fails, Kubernetes removes the Pod from Service endpoints but does not restart it, which is useful when the process is alive but temporarily not able to serve traffic.

Technical explanation

Readiness failure is traffic control, while liveness failure is process recovery; mixing them up causes unnecessary restarts or bad traffic routing.

External dependencies are often appropriate for readiness but risky for liveness because a downstream outage can cause restart storms.

Hands-on example

1. Create a namespace and deploy a small HTTP app specifically to test: force readiness failure and liveness failure separately to compare behavior.

2. Add probes and resources in YAML, then run kubectl describe pod, kubectl top pod, and kubectl rollout status to observe behavior.

3. Introduce a controlled failure such as slow startup, bad health endpoint, CPU load, or memory spike.

4. Tune thresholds, requests, and limits until rollout and runtime behavior are stable, then document the production values and why.

What are resource requests and limits, and what happens when a container exceeds each? (Basic)

Answer

Requests are the resources Kubernetes uses for scheduling and capacity guarantees. Limits are hard ceilings enforced by cgroups; CPU above limit is throttled, while memory above limit can cause the container to be killed with OOMKilled.

Technical explanation

The scheduler uses requests, not actual usage, to decide whether a Pod fits on a node.

Limits protect shared capacity but can cause throttling or OOM if set too aggressively.
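
The scheduling rule can be illustrated with a simplified fit check. This is a sketch under stated assumptions: the function fits, the resource keys cpu_m (millicores) and memory_mi (MiB), and the numbers are all invented for the example, and the real scheduler applies many more filters than this single resource check.

```python
def fits(node_allocatable: dict, scheduled_requests: list, new_request: dict) -> bool:
    """Return True if the new Pod's requests fit next to the requests already placed.

    Only requests are summed; actual usage is deliberately not an input.
    """
    for resource, capacity in node_allocatable.items():
        already_requested = sum(r.get(resource, 0) for r in scheduled_requests)
        if already_requested + new_request.get(resource, 0) > capacity:
            return False
    return True


node = {"cpu_m": 2000, "memory_mi": 4096}
placed = [{"cpu_m": 1500, "memory_mi": 1024}]   # may be idle in practice, still counts

print(fits(node, placed, {"cpu_m": 600, "memory_mi": 512}))   # False: 2100m > 2000m
print(fits(node, placed, {"cpu_m": 500, "memory_mi": 512}))   # True
```

This is why a node can look idle in kubectl top and still reject new Pods: the requests already placed on it have reserved the capacity.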

Hands-on example

1. Create a namespace and deploy a small HTTP app specifically to test: set requests and limits and observe scheduling, CPU throttling, and memory kill behavior.

2. Add probes and resources in YAML, then run kubectl describe pod, kubectl top pod, and kubectl rollout status to observe behavior.

3. Introduce a controlled failure such as slow startup, bad health endpoint, CPU load, or memory spike.

4. Tune thresholds, requests, and limits until rollout and runtime behavior are stable, then document the production values and why.

What is the difference between a request and a limit for CPU versus memory? (Basic)

Answer

CPU requests influence scheduling and CPU shares, while CPU limits cause throttling. Memory requests influence scheduling, and memory limits are fatal when exceeded because memory is not compressible the way CPU time is.

Technical explanation

CPU is compressible: the workload can be throttled and continue slowly. Memory is non-compressible: exceeding a limit can terminate the process.

A common production pattern is memory limit with careful headroom, and CPU requests without strict CPU limits for latency-sensitive services.

Hands-on example

1. Create a namespace and deploy a small HTTP app specifically to test: compare CPU limit throttling with memory limit OOM behavior.

2. Add probes and resources in YAML, then run kubectl describe pod, kubectl top pod, and kubectl rollout status to observe behavior.

3. Introduce a controlled failure such as slow startup, bad health endpoint, CPU load, or memory spike.

4. Tune thresholds, requests, and limits until rollout and runtime behavior are stable, then document the production values and why.

What is OOMKilled, and how do you diagnose and prevent it? (Basic)

Answer

OOMKilled means the kernel's OOM killer terminated the container, either because it exceeded its memory cgroup limit or because the node itself ran out of memory; the container status shows Reason: OOMKilled, typically with exit code 137. I diagnose it with kubectl describe, previous logs, metrics, memory profiles, and node events, then fix the leak or resize requests and limits.

Technical explanation

OOMKilled can come from a real leak, bad sizing, sudden load, large startup allocation, or sidecar overhead not included in planning.

Use container_memory_working_set_bytes, application heap metrics, and previous logs to distinguish leak from legitimate sizing.

Hands-on example

1. Create a namespace and deploy a small HTTP app specifically to test: trigger and diagnose an OOMKilled container in a safe namespace.

2. Add probes and resources in YAML, then run kubectl describe pod, kubectl top pod, and kubectl rollout status to observe behavior.

3. Introduce a controlled failure such as slow startup, bad health endpoint, CPU load, or memory spike.

4. Tune thresholds, requests, and limits until rollout and runtime behavior are stable, then document the production values and why.

What is a ConfigMap, and how do you consume it in a Pod? (Basic)

Answer

A ConfigMap stores non-sensitive configuration such as files, environment variables, command arguments, or app settings. A Pod can consume it as env vars, envFrom, command arguments, or mounted files.

Technical explanation

ConfigMaps can be mounted as files or injected as env vars; mounted files are eventually refreshed when the ConfigMap changes (except subPath mounts, which are not), but env vars require a Pod restart to change.

Large or frequently changing config may need a config reload pattern rather than blind Pod restarts.
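
Both consumption styles can be shown in one manifest. The sketch below builds the objects as Python dictionaries and prints them as JSON, which kubectl accepts as well as YAML. Every name in it (app-config, config-demo, the busybox:1.36 image, the /etc/app path, the keys in data) is an assumption chosen for the example.

```python
import json

# All names and values here are illustrative assumptions, not from a real system.
configmap = {
    "apiVersion": "v1",
    "kind": "ConfigMap",
    "metadata": {"name": "app-config"},
    "data": {
        "LOG_LEVEL": "info",
        "app.properties": "feature.x=true\n",
    },
}

pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "config-demo"},
    "spec": {
        "containers": [{
            "name": "app",
            "image": "busybox:1.36",
            "command": ["sh", "-c",
                        "echo $LOG_LEVEL; cat /etc/app/app.properties; sleep 3600"],
            # Style 1: a single key injected as an environment variable.
            "env": [{
                "name": "LOG_LEVEL",
                "valueFrom": {"configMapKeyRef": {"name": "app-config", "key": "LOG_LEVEL"}},
            }],
            # Style 2: the whole ConfigMap mounted as files, one file per key.
            "volumeMounts": [{"name": "config", "mountPath": "/etc/app"}],
        }],
        "volumes": [{"name": "config", "configMap": {"name": "app-config"}}],
    },
}

bundle = {"apiVersion": "v1", "kind": "List", "items": [configmap, pod]}
print(json.dumps(bundle, indent=2))
```

Piping the output to kubectl apply -f - in a sandbox namespace creates both objects, after which the env var and the mounted file can be compared while the ConfigMap is edited.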

Configuration, secrets, namespaces, quotas, and defaults define operational boundaries for teams and environments.

RBAC and admission controls determine who can read sensitive data and who can create risky workloads.

Production clusters should treat namespace setup as a platform contract created through IaC or GitOps.

Hands-on example

1. Create a sandbox namespace and implement this exercise with declarative YAML: mount a ConfigMap as env vars and as a file.

2. Test both success and failure paths: allowed read, denied read, quota rejection, default limit application, or config reload behavior.

3. Inspect objects with kubectl describe, kubectl auth can-i, and kubectl get events to prove the control works.

4. Turn the pattern into a reusable namespace bootstrap manifest for real teams.

What is a Secret, and how is it different from a ConfigMap (and is it actually encrypted)? (Basic)

Answer

A Secret is meant for sensitive values, while a ConfigMap is for non-sensitive config. Kubernetes Secrets are base64-encoded by default and are only encrypted at rest if the cluster is configured with encryption or a managed KMS provider.

Technical explanation

base64 is encoding, not encryption; anyone with read access to the Secret object can decode it.
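To make the point concrete, here is a short Python sketch that decodes the data field of a Secret. The payload is a made-up example shaped like the .data map returned by kubectl get secret -o json. No key or passphrase is involved.

```python
import base64


def decode_secret_data(data):
    """Decode the base64 values found under a Secret's `data` field."""
    return {key: base64.b64decode(value).decode("utf-8") for key, value in data.items()}


# Hypothetical Secret payload, shaped like `.data` in `kubectl get secret -o json`.
example = {"password": base64.b64encode(b"s3cr3t").decode("ascii")}
print(decode_secret_data(example))  # plaintext is recovered without any key
```

This is why read access to Secret objects has to be treated as access to the plaintext.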

Secret safety requires RBAC, encryption at rest, restricted logging, no env dumps, and external secret rotation where possible.

Configuration, secrets, namespaces, quotas, and defaults define operational boundaries for teams and environments.

RBAC and admission controls determine who can read sensitive data and who can create risky workloads.

Production clusters should treat namespace setup as a platform contract created through IaC or GitOps.

Hands-on example

1. Create a sandbox namespace and implement this exercise with declarative YAML: create a Secret, decode it, then restrict read access with RBAC.

2. Test both success and failure paths: allowed read, denied read, quota rejection, default limit application, or config reload behavior.

3. Inspect objects with kubectl describe, kubectl auth can-i, and kubectl get events to prove the control works.

4. Turn the pattern into a reusable namespace bootstrap manifest for real teams.

How do you encrypt Kubernetes Secrets at rest? (Basic)

Answer

To encrypt Kubernetes Secrets at rest, configure API server encryption using an EncryptionConfiguration or managed KMS integration. Existing Secrets usually need to be rewritten so the new encryption provider stores them encrypted in etcd.

Technical explanation

Encryption at rest protects etcd storage and its backups, but it does not protect against anyone who is authorized to read the Secret through the API.

Rotate encryption keys carefully, and rewrite Secret objects after each change so that no data is left encrypted under a retired key or provider.

Configuration, secrets, namespaces, quotas, and defaults define operational boundaries for teams and environments.

RBAC and admission controls determine who can read sensitive data and who can create risky workloads.

Production clusters should treat namespace setup as a platform contract created through IaC or GitOps.

Hands-on example

1. Create a sandbox namespace and implement this exercise with declarative YAML: review or enable secret encryption at rest and rewrite test Secrets.

2. Test both success and failure paths: allowed read, denied read, quota rejection, default limit application, or config reload behavior.

3. Inspect objects with kubectl describe, kubectl auth can-i, and kubectl get events to prove the control works.

4. Turn the pattern into a reusable namespace bootstrap manifest for real teams.

What is a namespace, and how do you use it for isolation and quotas? (Basic)

Answer

A namespace is a logical scope for names, RBAC, quotas, policies, and operational ownership. It helps isolate teams or environments, but it is not a complete security boundary by itself unless combined with RBAC, NetworkPolicy, quotas, and Pod Security controls.

Technical explanation

Namespaces isolate object names and policy scope, but not kernels, nodes, or network traffic by themselves.

For multi-tenant clusters, pair namespaces with RBAC, quotas, NetworkPolicy, Pod Security admission, and separate node pools where needed.

Configuration, secrets, namespaces, quotas, and defaults define operational boundaries for teams and environments.

RBAC and admission controls determine who can read sensitive data and who can create risky workloads.

Production clusters should treat namespace setup as a platform contract created through IaC or GitOps.

Hands-on example

1. Create a sandbox namespace and implement this exercise with declarative YAML: create a team namespace with RBAC, quota, and network defaults.

2. Test both success and failure paths: allowed read, denied read, quota rejection, default limit application, or config reload behavior.

3. Inspect objects with kubectl describe, kubectl auth can-i, and kubectl get events to prove the control works.

4. Turn the pattern into a reusable namespace bootstrap manifest for real teams.

What is a ResourceQuota and a LimitRange? (Basic)

Answer

ResourceQuota caps aggregate resource usage in a namespace, such as CPU, memory, PVCs, Services, or object counts. LimitRange sets defaults, minimums, and maximums for individual Pods or containers so users cannot omit or exceed expected resource settings.

Technical explanation

When a ResourceQuota tracks a compute resource such as requests.cpu or limits.memory, the API server rejects Pods that omit that request or limit, unless a LimitRange supplies a default.

LimitRange can prevent one container from requesting excessive CPU/memory and can provide sane defaults for teams.
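The interaction between the two objects can be sketched as a simplified admission model in Python. It covers only CPU requests in millicores, and the class and function names are hypothetical. Real admission also handles memory, limits, object counts, and quota scopes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Container:
    cpu_request_m: Optional[int] = None  # millicores; None means the field was omitted


@dataclass
class LimitRange:
    default_request_m: int  # applied when a container omits its request
    max_request_m: int      # per-container ceiling


def admit(containers, limit_range, quota_hard_m, quota_used_m):
    """Return (admitted, reason) for a Pod in a namespace with a CPU request quota.

    LimitRange defaulting and validation run first, then the quota check.
    """
    total = 0
    for container in containers:
        request = container.cpu_request_m
        if request is None:
            if limit_range is None:
                return False, "quota tracks requests.cpu but a container omits it"
            request = limit_range.default_request_m
        if limit_range is not None and request > limit_range.max_request_m:
            return False, "container exceeds LimitRange max"
        total += request
    if quota_used_m + total > quota_hard_m:
        return False, "exceeded quota"
    return True, "ok"
```

For example, admit([Container()], LimitRange(100, 1000), 500, 0) is admitted because the LimitRange fills in a 100m request, while the same Pod with limit_range=None is rejected.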

Configuration, secrets, namespaces, quotas, and defaults define operational boundaries for teams and environments.

RBAC and admission controls determine who can read sensitive data and who can create risky workloads.

Production clusters should treat namespace setup as a platform contract created through IaC or GitOps.

Hands-on example

1. Create a sandbox namespace and implement this exercise with declarative YAML: apply ResourceQuota and LimitRange, then try valid and invalid Pod specs.

2. Test both success and failure paths: allowed read, denied read, quota rejection, default limit application, or config reload behavior.

3. Inspect objects with kubectl describe, kubectl auth can-i, and kubectl get events to prove the control works.

4. Turn the pattern into a reusable namespace bootstrap manifest for real teams.

What is the difference between a StatefulSet and a Deployment? (Basic)

Answer

A StatefulSet is for stateful workloads that need stable network identity, ordered rollout, and stable per-Pod storage. A Deployment is for stateless, interchangeable replicas where any Pod can replace another without identity concerns.

Technical explanation

StatefulSet Pods have stable ordinal names such as app-0 and stable PVCs created from volumeClaimTemplates.
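The naming scheme can be sketched in Python. The StatefulSet name, claim template name, and headless Service name passed in are placeholders; the cluster domain is assumed to be the default cluster.local.

```python
def statefulset_identities(name, replicas, claim_template, headless_service,
                           namespace="default", cluster_domain="cluster.local"):
    """Return the stable Pod name, PVC name, and DNS name for each ordinal."""
    identities = []
    for ordinal in range(replicas):
        pod = f"{name}-{ordinal}"
        identities.append({
            "pod": pod,
            # PVCs are named <volumeClaimTemplate name>-<pod name>.
            "pvc": f"{claim_template}-{pod}",
            # Per-Pod DNS record published through the governing headless Service.
            "dns": f"{pod}.{headless_service}.{namespace}.svc.{cluster_domain}",
        })
    return identities
```

Because these names are derived only from the StatefulSet name and the ordinal, a replacement Pod for app-0 reattaches to the same PVC and answers on the same DNS name.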

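StatefulSet updates and deletes are more conservative because identity and storage safety matter: with the default ordered policy, Pods are created one at a time in ordinal order and removed in reverse order, and PVCs are retained by default when the StatefulSet is scaled down or deleted.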

Kubernetes workload controllers encode different lifecycle guarantees: interchangeable replicas, stable identities, node-local agents, or finite tasks.

Storage decisions must align with durability, access mode, zone placement, backup, restore, and failover behavior.

Autoscaling should be designed with metrics, scheduling constraints, PDBs, and node capacity together.

Hands-on example

1. Deploy a workload for this exercise using kubectl apply and a small test image such as nginx, busybox, or a purpose-built app: compare a Deployment and StatefulSet using stable Pod names and PVCs.

2. Inspect ownerReferences, events, Pods, PVCs, PVs, EndpointSlices, and metrics depending on the resource being tested.

3. Create a realistic disruption: delete a Pod, scale replicas, restart a node, fill a queue, or recreate storage attachment in a test environment.

4. Write the runbook entry covering expected behavior, safe rollback, and what alarms should exist.

When would you use a DaemonSet, and give a real example. (Basic)

Answer

I use a DaemonSet when I need one Pod on every node or every matching node. Real examples are log collectors, node exporters, CNI agents, CSI node plugins, security agents, and local monitoring components.

Technical explanation

DaemonSets can target only selected nodes using nodeSelector or node affinity; tolerations do not select nodes, but they let DaemonSet Pods run on nodes that are tainted.

During node upgrades, DaemonSet Pods are normally recreated automatically on replacement nodes.

Kubernetes workload controllers encode different lifecycle guarantees: interchangeable replicas, stable identities, node-local agents, or finite tasks.

Storage decisions must align with durability, access mode, zone placement, backup, restore, and failover behavior.

Autoscaling should be designed with metrics, scheduling constraints, PDBs, and node capacity together.

Hands-on example

1. Deploy a workload for this exercise using kubectl apply and a small test image such as nginx, busybox, or a purpose-built app: deploy a node-exporter-style DaemonSet to selected nodes.

2. Inspect ownerReferences, events, Pods, PVCs, PVs, EndpointSlices, and metrics depending on the resource being tested.

3. Create a realistic disruption: delete a Pod, scale replicas, restart a node, fill a queue, or recreate storage attachment in a test environment.

4. Write the runbook entry covering expected behavior, safe rollback, and what alarms should exist.

What is a Job versus a CronJob? (Basic)

Answer

A Job runs a finite task to completion, such as a migration or batch import. A CronJob creates Jobs on a schedule, such as nightly cleanup, backup validation, or periodic reporting.

Technical explanation

Jobs track completions and retries; CronJobs add schedule, concurrency policy, and missed-run behavior.

Batch jobs need idempotency because retries can rerun the same task after partial failure.
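One way to get that idempotency is to record completion per item and skip finished items on a retry. The Python sketch below uses an in-memory set as a stand-in for a durable marker such as a database table; all names are hypothetical.

```python
def run_batch(records, already_done, handler):
    """Process each record at most once, even if the Job Pod is retried.

    `records` is an iterable of (record_id, payload) pairs. `already_done`
    stands in for a durable completion marker; here it is a plain set.
    """
    processed = []
    for record_id, payload in records:
        if record_id in already_done:
            continue                 # finished by an earlier attempt
        handler(payload)             # may raise; the Job controller retries the Pod
        already_done.add(record_id)  # record completion only after success
        processed.append(record_id)
    return processed
```

If the handler itself has side effects that cannot be repeated, it needs its own guard, because a crash between handler(payload) and the marker update still reruns that one item.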

Kubernetes workload controllers encode different lifecycle guarantees: interchangeable replicas, stable identities, node-local agents, or finite tasks.

Storage decisions must align with durability, access mode, zone placement, backup, restore, and failover behavior.

Autoscaling should be designed with metrics, scheduling constraints, PDBs, and node capacity together.

Hands-on example

1. Deploy a workload for this exercise using kubectl apply and a small test image such as nginx, busybox, or a purpose-built app: run a one-time Job and a scheduled CronJob with concurrency controls.

2. Inspect ownerReferences, events, Pods, PVCs, PVs, EndpointSlices, and metrics depending on the resource being tested.

3. Create a realistic disruption: delete a Pod, scale replicas, restart a node, fill a queue, or recreate storage attachment in a test environment.

4. Write the runbook entry covering expected behavior, safe rollback, and what alarms should exist.

What is a HorizontalPodAutoscaler, and what metrics can it scale on? (Basic)

Answer

The HorizontalPodAutoscaler changes replica count based on observed demand. It commonly scales on CPU and memory through metrics-server, and it can also scale on custom or external metrics such as queue depth, request rate, or business-specific load.

Technical explanation

HPA needs metrics availability; without metrics-server or custom metrics adapter it cannot make correct decisions.

Scaling should be based on a signal that reflects user demand, not only CPU if CPU is not the bottleneck.
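The core replica calculation the HPA applies to a metric can be sketched in Python. The rule desired = ceil(currentReplicas × currentMetric / targetMetric) and the 10% default tolerance come from the Kubernetes documentation; the min and max defaults below are illustrative assumptions, and the sketch ignores stabilization windows, readiness handling, and multiple metrics.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Compute the replica count the HPA would aim for from one metric."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: no change
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the bounds configured on the HPA object.
    return max(min_replicas, min(max_replicas, desired))
```

With 4 replicas averaging 90% CPU against a 60% target, the function returns 6; at 63% it returns 4 because the ratio falls inside the tolerance band.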

Kubernetes workload controllers encode different lifecycle guarantees: interchangeable replicas, stable identities, node-local agents, or finite tasks.

Storage decisions must align with durability, access mode, zone placement, backup, restore, and failover behavior.

Autoscaling should be designed with metrics, scheduling constraints, PDBs, and node capacity together.

Hands-on example

1. Deploy a workload for this exercise using kubectl apply and a small test image such as nginx, busybox, or a purpose-built app: configure HPA for a web Deployment and generate load.

2. Inspect ownerReferences, events, Pods, PVCs, PVs, EndpointSlices, and metrics depending on the resource being tested.

3. Create a realistic disruption: delete a Pod, scale replicas, restart a node, fill a queue, or recreate storage attachment in a test environment.

4. Write the runbook entry covering expected behavior, safe rollback, and what alarms should exist.

What is the difference between the HPA, VPA, and Cluster Autoscaler? (Basic)

Answer

HPA scales Pod replicas, VPA adjusts resource requests for Pods, and Cluster Autoscaler changes node count. They solve different layers of capacity: workload replicas, per-Pod sizing, and cluster infrastructure capacity.

Technical explanation

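HPA and VPA can conflict when both act on CPU or memory for the same workload, because VPA changes the requests that HPA utilization targets are measured against. A common arrangement is to run VPA in recommendation-only mode (updateMode: "Off") or to have HPA scale on a custom or external metric instead.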

Cluster Autoscaler reacts after Pods are unschedulable, so HPA can create demand before nodes exist.

Kubernetes workload controllers encode different lifecycle guarantees: interchangeable replicas, stable identities, node-local agents, or finite tasks.

Storage decisions must align with durability, access mode, zone placement, backup, restore, and failover behavior.

Autoscaling should be designed with metrics, scheduling constraints, PDBs, and node capacity together.

Hands-on example

1. Deploy a workload for this exercise using kubectl apply and a small test image such as nginx, busybox, or a purpose-built app: compare HPA replica scaling, VPA recommendations, and Cluster Autoscaler node behavior.

2. Inspect ownerReferences, events, Pods, PVCs, PVs, EndpointSlices, and metrics depending on the resource being tested.

3. Create a realistic disruption: delete a Pod, scale replicas, restart a node, fill a queue, or recreate storage attachment in a test environment.

4. Write the runbook entry covering expected behavior, safe rollback, and what alarms should exist.

How does the Cluster Autoscaler decide to add or remove nodes? (Basic)

Answer

Cluster Autoscaler adds nodes when Pods are unschedulable because no existing node can fit them, and it removes nodes when they are underutilized and their Pods can be safely moved. It must respect PDBs, scheduling constraints, taints, local storage, and cloud node group limits.

Technical explanation

Scale-up is triggered by unschedulable Pods, not by high node CPU alone.

Scale-down requires Pods to be movable and can be blocked by PDBs, local volumes, affinity, or system pods.

Kubernetes workload controllers encode different lifecycle guarantees: interchangeable replicas, stable identities, node-local agents, or finite tasks.

Storage decisions must align with durability, access mode, zone placement, backup, restore, and failover behavior.

Autoscaling should be designed with metrics, scheduling constraints, PDBs, and node capacity together.

Hands-on example

1. Deploy a workload for this exercise using kubectl apply and a small test image such as nginx, busybox, or a purpose-built app: create unschedulable Pods and observe Cluster Autoscaler scale-up conditions.

2. Inspect ownerReferences, events, Pods, PVCs, PVs, EndpointSlices, and metrics depending on the resource being tested.

3. Create a realistic disruption: delete a Pod, scale replicas, restart a node, fill a queue, or recreate storage attachment in a test environment.

4. Write the runbook entry covering expected behavior, safe rollback, and what alarms should exist.

What is a PersistentVolume, a PersistentVolumeClaim, and a StorageClass? (Basic)

Answer

A PersistentVolume is a cluster storage resource, a PersistentVolumeClaim is a user's request for storage, and a StorageClass defines how storage should be dynamically provisioned. Together they decouple app manifests from the storage backend implementation.

Technical explanation

StorageClass volumeBindingMode matters; WaitForFirstConsumer helps choose storage in the same zone as the scheduled Pod.

ReclaimPolicy controls whether the underlying volume is deleted or retained after the PVC is deleted.

Kubernetes workload controllers encode different lifecycle guarantees: interchangeable replicas, stable identities, node-local agents, or finite tasks.

Storage decisions must align with durability, access mode, zone placement, backup, restore, and failover behavior.

Autoscaling should be designed with metrics, scheduling constraints, PDBs, and node capacity together.

Hands-on example

1. Deploy a workload for this exercise using kubectl apply and a small test image such as nginx, busybox, or a purpose-built app: create a PVC from a StorageClass and inspect the provisioned PV.

2. Inspect ownerReferences, events, Pods, PVCs, PVs, EndpointSlices, and metrics depending on the resource being tested.

3. Create a realistic disruption: delete a Pod, scale replicas, restart a node, fill a queue, or recreate storage attachment in a test environment.

4. Write the runbook entry covering expected behavior, safe rollback, and what alarms should exist.

What is dynamic volume provisioning, and how does a StorageClass enable it? (Basic)

Answer

Dynamic volume provisioning means Kubernetes creates a backing volume automatically when a PVC references a StorageClass. The StorageClass points to a CSI provisioner and parameters such as disk type, filesystem, reclaim policy, binding mode, and expansion capability.

Technical explanation

Dynamic provisioning is normally implemented by a CSI driver such as EBS CSI, EFS CSI, Ceph CSI, or another storage provider.

allowVolumeExpansion and parameters decide whether PVC resize and disk characteristics are available.

Kubernetes workload controllers encode different lifecycle guarantees: interchangeable replicas, stable identities, node-local agents, or finite tasks.

Storage decisions must align with durability, access mode, zone placement, backup, restore, and failover behavior.

Autoscaling should be designed with metrics, scheduling constraints, PDBs, and node capacity together.

Hands-on example

1. Deploy a workload for this exercise using kubectl apply and a small test image such as nginx, busybox, or a purpose-built app: test dynamic provisioning with a CSI StorageClass and WaitForFirstConsumer.

2. Inspect ownerReferences, events, Pods, PVCs, PVs, EndpointSlices, and metrics depending on the resource being tested.

3. Create a realistic disruption: delete a Pod, scale replicas, restart a node, fill a queue, or recreate storage attachment in a test environment.

4. Write the runbook entry covering expected behavior, safe rollback, and what alarms should exist.

What is the difference between a volume, a PersistentVolume, and an emptyDir? (Basic)

Answer

A volume is any storage source mounted into a Pod. A PersistentVolume is durable cluster storage claimed through a PVC. emptyDir is temporary storage created with the Pod and deleted when the Pod is removed from the node.

Technical explanation

emptyDir is useful for cache, scratch space, and handoff between containers in a Pod; it survives container restarts but is deleted with the Pod, so it is not a backup or durable data store.

A normal Pod volume can be config, secret, projected, CSI, emptyDir, hostPath, or a PVC-backed mount.

Kubernetes workload controllers encode different lifecycle guarantees: interchangeable replicas, stable identities, node-local agents, or finite tasks.

Storage decisions must align with durability, access mode, zone placement, backup, restore, and failover behavior.

Autoscaling should be designed with metrics, scheduling constraints, PDBs, and node capacity together.

Hands-on example

1. Deploy a workload for this exercise using kubectl apply and a small test image such as nginx, busybox, or a purpose-built app: compare emptyDir and PVC-backed storage across Pod restarts.

2. Inspect ownerReferences, events, Pods, PVCs, PVs, EndpointSlices, and metrics depending on the resource being tested.

3. Create a realistic disruption: delete a Pod, scale replicas, restart a node, fill a queue, or recreate storage attachment in a test environment.

4. Write the runbook entry covering expected behavior, safe rollback, and what alarms should exist.

How does Kubernetes DNS work for service discovery? (Basic)

Answer

Kubernetes DNS is usually provided by CoreDNS. Services get predictable DNS names, ClusterIP Services resolve to the Service IP, and headless Services can return individual Pod endpoint records for direct discovery.

Technical explanation

Service DNS names follow patterns like service.namespace.svc.cluster.local, with search domains making short names work inside the namespace.
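The naming pattern and the effect of search domains can be sketched in Python. The sketch assumes the default cluster domain cluster.local and the typical three cluster search domains written into a Pod's /etc/resolv.conf; nodes may append their own search domains after these. The service and namespace names are placeholders.

```python
def service_fqdn(service, namespace, cluster_domain="cluster.local"):
    """Build the fully qualified DNS name of a Service."""
    return f"{service}.{namespace}.svc.{cluster_domain}"


def search_candidates(name, pod_namespace, cluster_domain="cluster.local"):
    """List the names tried, in order, for a relative name from a Pod."""
    search = [
        f"{pod_namespace}.svc.{cluster_domain}",  # makes short names work in-namespace
        f"svc.{cluster_domain}",                  # makes <service>.<namespace> work
        cluster_domain,
    ]
    return [f"{name}.{domain}" for domain in search]
```

From a Pod in namespace shop, the short name api first expands to api.shop.svc.cluster.local, while api.billing only resolves on the second candidate, api.billing.svc.cluster.local.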

DNS problems often come from CoreDNS health, NetworkPolicy blocking DNS traffic, kube-proxy or CNI issues, or a short Service name being resolved from the wrong namespace.

Kubernetes workload controllers encode different lifecycle guarantees: interchangeable replicas, stable identities, node-local agents, or finite tasks.

Storage decisions must align with durability, access mode, zone placement, backup, restore, and failover behavior.

Autoscaling should be designed with metrics, scheduling constraints, PDBs, and node capacity together.

Hands-on example

1. Deploy a workload for this exercise using kubectl apply and a small test image such as nginx, busybox, or a purpose-built app: test CoreDNS service discovery from a temporary debug Pod.

2. Inspect ownerReferences, events, Pods, PVCs, PVs, EndpointSlices, and metrics depending on the resource being tested.

3. Create a realistic disruption: delete a Pod, scale replicas, restart a node, fill a queue, or recreate storage attachment in a test environment.

4. Write the runbook entry covering expected behavior, safe rollback, and what alarms should exist.

Explain the Kubernetes networking model and the requirement that all Pods can reach each other. (Basic)

Answer

The Kubernetes networking model expects every Pod to have an IP and for Pods to reach other Pods without NAT, even across nodes. This model lets Services, controllers, and applications treat Pod networking consistently regardless of placement.

Technical explanation

The flat Pod network simplifies service discovery but shifts security responsibility to NetworkPolicy and application authorization.

Cloud CNIs may assign VPC-native IPs while overlay CNIs create an encapsulated cluster network.

Kubernetes networking separates identity and discovery from Pod IP churn by using Services, DNS, EndpointSlices, and routing rules.

Security is not automatic in the flat Pod network; NetworkPolicy and application auth are required for segmentation.

Cloud integrations such as EKS load balancers add provider-specific annotations, subnet tagging, health checks, and security group behavior.

Hands-on example

1. Deploy an app Pod and a temporary debug Pod to test this traffic path with nslookup, dig, curl, and kubectl get endpointslices: verify Pod-to-Pod connectivity across nodes and namespaces.

2. Add or change Service, Ingress, CNI, or NetworkPolicy resources one at a time and observe the traffic path.

3. Validate both allowed and denied flows so you know the policy is actually enforced by the CNI.

4. Record the troubleshooting path from DNS to Service to endpoint to Pod logs.

What is a CNI plugin, and name a few (Calico, Cilium, AWS VPC CNI)? (Basic)

Answer

A CNI plugin implements Pod networking: IP allocation, interface setup, routing, and often NetworkPolicy enforcement. Common examples are Calico, Cilium, Flannel, and AWS VPC CNI on EKS.

Technical explanation

CNI choice affects IP consumption, network policy support, eBPF features, observability, and cloud integration.

On EKS, AWS VPC CNI gives Pods VPC IPs, which is powerful but can make subnet IP exhaustion a scaling issue.

Kubernetes networking separates identity and discovery from Pod IP churn by using Services, DNS, EndpointSlices, and routing rules.

Security is not automatic in the flat Pod network; NetworkPolicy and application auth are required for segmentation.

Cloud integrations such as EKS load balancers add provider-specific annotations, subnet tagging, health checks, and security group behavior.

Hands-on example

1. Deploy an app Pod and a temporary debug Pod to test this traffic path with nslookup, dig, curl, and kubectl get endpointslices: compare CNI capabilities such as policy enforcement and IP allocation.

2. Add or change Service, Ingress, CNI, or NetworkPolicy resources one at a time and observe the traffic path.

3. Validate both allowed and denied flows so you know the policy is actually enforced by the CNI.

4. Record the troubleshooting path from DNS to Service to endpoint to Pod logs.

What is kube-proxy, and how does it implement Service routing (iptables vs IPVS)? (Basic)

Answer

kube-proxy watches Services and EndpointSlices and programs node-level forwarding rules. It commonly uses iptables or IPVS, while some clusters replace that path with eBPF-based implementations such as Cilium.

Technical explanation

iptables mode is common and rule-based; IPVS can be more efficient for very large service tables.

Even when kube-proxy is healthy, wrong selectors or readiness failures can still leave a Service with no endpoints.
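As an illustration of the rule-based approach, the Python sketch below imitates how iptables-mode rules spread connections: each endpoint rule matches with probability 1 divided by the number of endpoints not yet tried, and the last rule always matches. This is a simplified model, not kube-proxy code, and the function name is hypothetical.

```python
import random


def pick_endpoint(endpoints, rng=random):
    """Walk the endpoints like a chain of probabilistic iptables rules.

    Rule i matches with probability 1 / (remaining endpoints), which gives
    every endpoint an equal overall chance of being selected.
    """
    remaining = len(endpoints)
    for endpoint in endpoints:
        if rng.random() < 1.0 / remaining:
            return endpoint
        remaining -= 1
    return None  # no ready endpoints: the Service has nothing to route to
```

The None branch mirrors the case described above: the proxy can be perfectly healthy, yet an empty endpoint list leaves nothing to forward to.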

Kubernetes networking separates identity and discovery from Pod IP churn by using Services, DNS, EndpointSlices, and routing rules.

Security is not automatic in the flat Pod network; NetworkPolicy and application auth are required for segmentation.

Cloud integrations such as EKS load balancers add provider-specific annotations, subnet tagging, health checks, and security group behavior.

Hands-on example

1. Deploy an app Pod and a temporary debug Pod to test this traffic path with nslookup, dig, curl, and kubectl get endpointslices: inspect Service routing through EndpointSlices and node proxy rules.

2. Add or change Service, Ingress, CNI, or NetworkPolicy resources one at a time and observe the traffic path.

3. Validate both allowed and denied flows so you know the policy is actually enforced by the CNI.

4. Record the troubleshooting path from DNS to Service to endpoint to Pod logs.

What is a NetworkPolicy, and what is the default Pod-to-Pod behaviour without one? (Intermediate)

Answer

NetworkPolicy defines allowed ingress and egress for selected Pods. A Pod that no NetworkPolicy selects is non-isolated, so all traffic to and from it is allowed. Once a policy selects the Pod for a direction, only traffic that some policy explicitly allows gets through, provided the CNI actually enforces NetworkPolicy.

Technical explanation

Policies are additive: multiple policies combine allowed traffic rather than being evaluated in first-match order.

NetworkPolicy requires CNI support; creating policies with a CNI that ignores them gives a false sense of security.
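The additive evaluation can be sketched in Python. This is a deliberately reduced model with hypothetical field names: it covers ingress only, within one namespace, using matchLabels-style selectors and no ports.

```python
def selects(selector, labels):
    """An empty selector matches every Pod, as with Kubernetes label selectors."""
    return all(labels.get(key) == value for key, value in selector.items())


def ingress_allowed(dest_labels, source_labels, policies):
    """Decide whether traffic from a source Pod to a destination Pod is allowed.

    Each policy is {"pod_selector": {...}, "allow_from": [selector, ...]},
    where an empty allow_from list means the policy has no ingress rules.
    """
    applicable = [p for p in policies if selects(p["pod_selector"], dest_labels)]
    if not applicable:
        return True  # non-isolated Pod: all ingress is allowed
    # Additive semantics: any matching allow rule in any policy is enough.
    return any(
        selects(source, source_labels)
        for policy in applicable
        for source in policy["allow_from"]
    )
```

A default-deny policy is simply {"pod_selector": {}, "allow_from": []}; adding an allowlist policy next to it opens only the listed sources, regardless of the order in which the policies were created.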

Kubernetes networking separates identity and discovery from Pod IP churn by using Services, DNS, EndpointSlices, and routing rules.

Security is not automatic in the flat Pod network; NetworkPolicy and application auth are required for segmentation.

Cloud integrations such as EKS load balancers add provider-specific annotations, subnet tagging, health checks, and security group behavior.

Hands-on example

1. Deploy an app Pod and a temporary debug Pod to test this traffic path with nslookup, dig, curl, and kubectl get endpointslices: apply default-deny and allowlist NetworkPolicies and test with curl.

2. Add or change Service, Ingress, CNI, or NetworkPolicy resources one at a time and observe the traffic path.

3. Validate both allowed and denied flows so you know the policy is actually enforced by the CNI.

4. Record the troubleshooting path from DNS to Service to endpoint to Pod logs.

What is RBAC, and what are Roles, ClusterRoles, RoleBindings, and ClusterRoleBindings? (Intermediate)

Answer

RBAC controls who can perform which verbs on which Kubernetes resources. Roles and RoleBindings are namespace-scoped. ClusterRoles are cluster-scoped definitions that can be granted cluster-wide through a ClusterRoleBinding or reused inside a single namespace through a RoleBinding.

Technical explanation

RBAC grants are additive; Kubernetes does not have an RBAC deny rule, so least privilege requires carefully scoped grants.
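The additive model can be sketched as a union of rules in Python. This is a reduced illustration with hypothetical rule data; real RBAC rules also carry apiGroups, resourceNames, and non-resource URLs.

```python
def can_i(rules, verb, resource):
    """Return True if any rule grants the verb on the resource.

    Rules are dicts with "verbs" and "resources" lists, where "*" is a
    wildcard. There is no deny rule, so the result is a union of grants.
    """
    for rule in rules:
        verb_ok = "*" in rule["verbs"] or verb in rule["verbs"]
        resource_ok = "*" in rule["resources"] or resource in rule["resources"]
        if verb_ok and resource_ok:
            return True
    return False
```

Because nothing can subtract a permission, one broad rule such as {"verbs": ["*"], "resources": ["*"]} bound to an identity overrides every attempt to scope it elsewhere.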

ClusterRoleBinding is powerful and should be rare for human and workload identities.

Scheduling controls place workloads correctly; RBAC and ServiceAccounts decide what identities can do after placement.

Use labels consistently because Services, Deployments, affinities, policies, and topology spread all depend on label selection.

Every constraint should be testable with events: FailedScheduling, denied API calls, or observed placement.

Hands-on example

1. Create a lab namespace for this exercise with explicit labels, ServiceAccounts, roles, node labels, or taints: create a read-only Role and RoleBinding for a namespace.

2. Use kubectl auth can-i, kubectl describe pod, and scheduling events to verify the expected decision.

3. Test a negative case, such as missing permission, missing toleration, or impossible affinity, and capture the exact error.

4. Convert the validated YAML into a reusable platform pattern with clear naming and labels.

What is a ServiceAccount, and how do Pods use it to talk to the API server? (Intermediate)

Answer

A ServiceAccount is the Kubernetes identity assigned to a Pod. Pods use projected ServiceAccount tokens to authenticate to the API server, and RBAC decides what that identity is allowed to do.

Technical explanation

Modern ServiceAccount tokens are projected, time-bound, and audience-scoped compared with older long-lived token Secrets.
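A small Python sketch of how a process inside a Pod might read its projected token and inspect the expiry and audience claims. The path is the standard in-cluster mount location; the helper names are hypothetical, and the decoding is for inspection only because it does not verify the signature.

```python
import base64
import json
from pathlib import Path

# Standard in-cluster mount path for the ServiceAccount token.
TOKEN_PATH = Path("/var/run/secrets/kubernetes.io/serviceaccount/token")


def read_token(path=TOKEN_PATH):
    """Re-read the token on each use; the kubelet refreshes the projected file."""
    return Path(path).read_text(encoding="utf-8").strip()


def auth_header(token):
    """Build the bearer header sent to the API server."""
    return {"Authorization": f"Bearer {token}"}


def unverified_claims(token):
    """Decode the JWT payload WITHOUT verifying the signature."""
    payload = token.split(".")[1]
    payload += "=" * (-len(payload) % 4)  # restore stripped base64 padding
    return json.loads(base64.urlsafe_b64decode(payload))
```

Inspecting the exp and aud claims this way is a quick check that a Pod is using a time-bound, audience-scoped token rather than a long-lived one.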

Set automountServiceAccountToken: false on Pods or ServiceAccounts that do not need API access.

Scheduling controls place workloads correctly; RBAC and ServiceAccounts decide what identities can do after placement.

Use labels consistently because Services, Deployments, affinities, policies, and topology spread all depend on label selection.

Every constraint should be testable with events: FailedScheduling, denied API calls, or observed placement.

Hands-on example

1. Create a lab namespace for this exercise with explicit labels, ServiceAccounts, roles, node labels, or taints: run a Pod with a scoped ServiceAccount and test kubectl auth can-i.

2. Use kubectl auth can-i, kubectl describe pod, and scheduling events to verify the expected decision.

3. Test a negative case, such as missing permission, missing toleration, or impossible affinity, and capture the exact error.

4. Convert the validated YAML into a reusable platform pattern with clear naming and labels.

How does the scheduler decide where to place a Pod? (Intermediate)

Answer

The scheduler places a Pod by filtering nodes that cannot run it and scoring nodes that can. It evaluates resources, taints and tolerations, affinity, topology spread, volume constraints, node conditions, and plugin-specific policies.

Technical explanation

Scheduling has filter and score phases, then binding; failures usually appear as FailedScheduling events.

The scheduler does not create capacity by itself; autoscalers or operators respond to unschedulable workloads.
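The filter and score phases can be sketched as a toy scheduler in Python. The data shapes are hypothetical, only CPU and nodeSelector are considered, and the scoring rule (prefer the node left with the most free CPU) is just one possible policy, not the real plugin set.

```python
def schedule(pod, nodes):
    """Pick a node name for the Pod, or None if it is unschedulable.

    pod:   {"cpu_m": int, "node_selector": {label: value}}
    nodes: {name: {"free_cpu_m": int, "labels": {label: value}}}
    """
    # Filter phase: drop nodes that cannot run the Pod at all.
    feasible = [
        name for name, node in nodes.items()
        if node["free_cpu_m"] >= pod["cpu_m"]
        and all(node["labels"].get(k) == v for k, v in pod["node_selector"].items())
    ]
    if not feasible:
        return None  # would surface as a FailedScheduling event
    # Score phase: rank the feasible nodes and take the best one.
    return max(feasible, key=lambda name: (nodes[name]["free_cpu_m"] - pod["cpu_m"], name))
```

A None result is the point where an autoscaler or operator has to act, since the function itself can only choose among nodes that already exist.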

Scheduling controls place workloads correctly; RBAC and ServiceAccounts decide what identities can do after placement.

Use labels consistently because Services, Deployments, affinities, policies, and topology spread all depend on label selection.

Every constraint should be testable with events: FailedScheduling, denied API calls, or observed placement.

Hands-on example

1. Create a lab namespace for this exercise with explicit labels, ServiceAccounts, roles, node labels, or taints: watch scheduler decisions for Pods with resource and placement constraints.

2. Use kubectl auth can-i, kubectl describe pod, and scheduling events to verify the expected decision.

3. Test a negative case, such as missing permission, missing toleration, or impossible affinity, and capture the exact error.

4. Convert the validated YAML into a reusable platform pattern with clear naming and labels.

What are node selectors, node affinity, and anti-affinity? (Intermediate)

Answer

nodeSelector is a simple label match. Node affinity is more expressive and supports required or preferred rules, while Pod affinity and anti-affinity place Pods near or away from other Pods based on labels and topology.

Technical explanation

Required affinity is a hard constraint; preferred affinity only influences scoring, so the scheduler can still place the Pod on a node that does not satisfy it.

Anti-affinity can protect availability but may block scheduling if topology domains or labels are too strict.

Scheduling controls place workloads correctly; RBAC and ServiceAccounts decide what identities can do after placement.

Use labels consistently because Services, Deployments, affinities, policies, and topology spread all depend on label selection.

Every constraint should be testable with events: FailedScheduling, denied API calls, or observed placement.

Hands-on example

1. Create a lab namespace for this exercise with explicit labels, ServiceAccounts, roles, node labels, or taints: place Pods using nodeSelector, node affinity, and pod anti-affinity.

2. Use kubectl auth can-i, kubectl describe pod, and scheduling events to verify the expected decision.

3. Test a negative case, such as missing permission, missing toleration, or impossible affinity, and capture the exact error.

4. Convert the validated YAML into a reusable platform pattern with clear naming and labels.

What are taints and tolerations, and how do they differ from affinity? (Intermediate)

Answer

Taints repel Pods from nodes unless the Pods tolerate them. Affinity attracts or avoids placement based on labels. Taints are usually node-owned guardrails, while affinity is usually workload-owned scheduling intent.

Technical explanation

Tolerating a taint does not force placement on that node; it only allows the Pod to be scheduled there.

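NoSchedule blocks new Pods that lack a matching toleration, PreferNoSchedule is a soft version that the scheduler tries to honor but may ignore, and NoExecute also evicts already-running Pods that do not tolerate the taint.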
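The matching rules can be sketched as a small predicate. This is an illustrative model of toleration matching, not scheduler code: taints and tolerations are plain dicts, and the helper names are hypothetical.

```python
def tolerates(toleration, taint):
    """True when a single toleration matches a single taint."""
    effect = toleration.get("effect")
    if effect and effect != taint["effect"]:
        return False  # an empty effect on the toleration matches every effect
    operator = toleration.get("operator", "Equal")
    key = toleration.get("key")
    if not key:
        return operator == "Exists"  # empty key with Exists tolerates every taint
    if key != taint["key"]:
        return False
    if operator == "Exists":
        return True  # Exists ignores the taint value
    return toleration.get("value", "") == taint.get("value", "")

def can_schedule(taints, tolerations):
    """A node is eligible when every hard taint is tolerated."""
    # PreferNoSchedule is soft, so it never blocks placement in this model.
    blocking = [t for t in taints if t["effect"] in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, t) for tol in tolerations) for t in blocking)

gpu_taint = {"key": "dedicated", "value": "gpu", "effect": "NoSchedule"}
gpu_toleration = {"key": "dedicated", "operator": "Equal", "value": "gpu", "effect": "NoSchedule"}
print(can_schedule([gpu_taint], []))                # untolerated taint repels the Pod
print(can_schedule([gpu_taint], [gpu_toleration]))  # toleration allows, but does not force, placement
```

A True result only means the node stays in the candidate set; steering the Pod onto that node still needs a nodeSelector or affinity rule.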

Scheduling controls place workloads correctly; RBAC and ServiceAccounts decide what identities can do after placement.

Use labels consistently because Services, Deployments, affinities, policies, and topology spread all depend on label selection.

Every constraint should be testable with events: FailedScheduling, denied API calls, or observed placement.

Hands-on example

1. Create a lab namespace for this exercise with explicit labels, ServiceAccounts, roles, node labels, or taints: taint a node and schedule only Pods with matching tolerations.

2. Use kubectl auth can-i, kubectl describe pod, and scheduling events to verify the expected decision.

3. Test a negative case, such as missing permission, missing toleration, or impossible affinity, and capture the exact error.

4. Convert the validated YAML into a reusable platform pattern with clear naming and labels.

What are Pod topology spread constraints, and why use them? (Intermediate)

Answer

Pod topology spread constraints tell the scheduler to distribute Pods across topology domains such as zones, nodes, or racks. They are used to avoid placing too many replicas in one failure domain.

Technical explanation

maxSkew defines how uneven placement may be across topology domains.
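As a rough illustration of maxSkew, the sketch below checks which domains could accept one more Pod. It assumes a hard constraint (whenUnsatisfiable: DoNotSchedule) and that every listed domain is eligible; the function name and the per-zone counts are made up for the example.

```python
def allowed_domains(counts, max_skew):
    """Return domains where adding one Pod keeps the skew within max_skew.

    counts maps a topology domain (for example a zone) to its matching Pod count.
    Skew for a domain is its count after placement minus the global minimum count.
    """
    global_min = min(counts.values())
    return sorted(d for d, c in counts.items() if (c + 1) - global_min <= max_skew)

print(allowed_domains({"zone-a": 2, "zone-b": 2, "zone-c": 1}, max_skew=1))  # only zone-c
print(allowed_domains({"zone-a": 1, "zone-b": 1}, max_skew=1))               # either zone
```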

Topology spread is usually cleaner than hard anti-affinity for spreading many replicas across zones.

Scheduling controls place workloads correctly; RBAC and ServiceAccounts decide what identities can do after placement.

Use labels consistently because Services, Deployments, affinities, policies, and topology spread all depend on label selection.

Every constraint should be testable with events: FailedScheduling, denied API calls, or observed placement.

Hands-on example

1. Create a lab namespace for this exercise with explicit labels, ServiceAccounts, roles, node labels, or taints: spread replicas across zones or hostnames with topologySpreadConstraints.

2. Use kubectl auth can-i, kubectl describe pod, and scheduling events to verify the expected decision.

3. Test a negative case, such as missing permission, missing toleration, or impossible affinity, and capture the exact error.

4. Convert the validated YAML into a reusable platform pattern with clear naming and labels.

What is a static Pod, and how does it differ from a normally scheduled Pod? (Intermediate)

Answer

A static Pod is defined directly on a node, usually as a manifest file in the kubelet's staticPodPath (/etc/kubernetes/manifests on kubeadm clusters), and the kubelet manages it without the scheduler. Control-plane components in kubeadm clusters are often static Pods.

Technical explanation

The API server may show mirror Pods for static Pods, but the source of truth is the node's local manifest file.

Static Pods are useful for bootstrapping control-plane components before higher-level controllers are available.

Scheduling controls place workloads correctly; RBAC and ServiceAccounts decide what identities can do after placement.

Use labels consistently because Services, Deployments, affinities, policies, and topology spread all depend on label selection.

Every constraint should be testable with events: FailedScheduling, denied API calls, or observed placement.

Hands-on example

1. Create a lab namespace for this exercise with explicit labels, ServiceAccounts, roles, node labels, or taints: inspect a kubeadm static Pod manifest on a lab control-plane node.

2. Use kubectl get pod and kubectl describe pod to find the mirror Pod, and compare it with the manifest file on the node; no scheduling event is expected because the scheduler is not involved.

3. Test a negative case, such as missing permission, missing toleration, or impossible affinity, and capture the exact error.

4. Convert the validated YAML into a reusable platform pattern with clear naming and labels.

What are init containers, and when would you use one? (Intermediate)

Answer

Init containers run before application containers and must complete successfully before the app starts. I use them for setup work such as waiting for dependencies, generating config, running migrations, or fixing permissions.

Technical explanation

Init containers can use different images and permissions than the app container, reducing what the runtime container needs.

A failing init container prevents the app from starting and must be visible in events and status.

Scheduling controls place workloads correctly; RBAC and ServiceAccounts decide what identities can do after placement.

Use labels consistently because Services, Deployments, affinities, policies, and topology spread all depend on label selection.

Every constraint should be testable with events: FailedScheduling, denied API calls, or observed placement.

Hands-on example

1. Create a lab namespace for this exercise with explicit labels, ServiceAccounts, roles, node labels, or taints: use an init container to wait for a dependency before app start.

2. Use kubectl auth can-i, kubectl describe pod, and scheduling events to verify the expected decision.

3. Test a negative case, such as missing permission, missing toleration, or impossible affinity, and capture the exact error.

4. Convert the validated YAML into a reusable platform pattern with clear naming and labels.

What is a sidecar container, and give a common use case? (Intermediate)

Answer

A sidecar container runs alongside the main container in the same Pod and shares networking and volumes. Common examples are service mesh proxies, log shippers, file synchronizers, and local adapters.

Technical explanation

Sidecars should be used when lifecycle and placement are truly tied to the main app.

Native sidecar support (an init container with restartPolicy: Always) improves startup and shutdown ordering, but teams still need resource requests and readiness design for sidecars.

Scheduling controls place workloads correctly; RBAC and ServiceAccounts decide what identities can do after placement.

Use labels consistently because Services, Deployments, affinities, policies, and topology spread all depend on label selection.

Every constraint should be testable with events: FailedScheduling, denied API calls, or observed placement.

Hands-on example

1. Create a lab namespace for this exercise with explicit labels, ServiceAccounts, roles, node labels, or taints: run an app with a sidecar proxy or log shipper sharing a volume.

2. Use kubectl auth can-i, kubectl describe pod, and scheduling events to verify the expected decision.

3. Test a negative case, such as missing permission, missing toleration, or impossible affinity, and capture the exact error.

4. Convert the validated YAML into a reusable platform pattern with clear naming and labels.

What is etcd, and why is it critical to back it up? (Intermediate)

Answer

etcd is the strongly consistent key-value store backing Kubernetes cluster state. Losing etcd or restoring the wrong snapshot can mean losing cluster objects, so reliable backups and tested restore procedures are critical.

Technical explanation

etcd performance affects API responsiveness; slow disk, quorum loss, or compaction issues can appear as cluster-wide instability.

Managed Kubernetes hides etcd operations, but platform teams still need to understand backup guarantees and disaster recovery options.

Kubernetes internals follow a watch-and-reconcile model over API objects stored in etcd.

Extending Kubernetes safely requires schema validation, idempotent controllers, finalizers, ownership, and observable status conditions.

Backup and restore procedures are part of the control-plane design, not an afterthought.

Hands-on example

1. Use a disposable kubeadm or kind-based lab for this exercise: inspect etcd health and backup requirements in a kubeadm lab. Do not practice destructive control-plane work on production.

2. Inspect API objects and controller behavior with kubectl get -w, events, status fields, and logs from the relevant controller.

3. For backup/restore topics, create a snapshot, restore into a separate environment, and verify objects and workloads after recovery.

4. Document the failure scenario, recovery steps, and validation commands.

How would you back up and restore an etcd cluster? (Intermediate)

Answer

For etcd backup, I take a snapshot with etcdctl or the managed provider mechanism, store it securely, and regularly test restore in a non-production cluster. A backup is not trusted until a restore has been rehearsed.

Technical explanation

A consistent etcd restore usually recreates a cluster from the snapshot rather than merging arbitrary old state into a live cluster.

Snapshot encryption, access control, retention, and restore runbooks are as important as the snapshot command.

Kubernetes internals follow a watch-and-reconcile model over API objects stored in etcd.

Extending Kubernetes safely requires schema validation, idempotent controllers, finalizers, ownership, and observable status conditions.

Backup and restore procedures are part of the control-plane design, not an afterthought.

Hands-on example

1. Use a disposable kubeadm or kind-based lab for this exercise: take an etcd snapshot and restore it in a throwaway cluster. Do not practice destructive control-plane work on production.

2. Inspect API objects and controller behavior with kubectl get -w, events, status fields, and logs from the relevant controller.

3. For backup/restore topics, create a snapshot, restore into a separate environment, and verify objects and workloads after recovery.

4. Document the failure scenario, recovery steps, and validation commands.

What is a CRD (Custom Resource Definition), and what is an Operator? (Intermediate)

Answer

A CRD extends the Kubernetes API with a custom resource type. An Operator is a controller that watches those custom resources and reconciles real infrastructure or application state to match the custom resource spec.

Technical explanation

CRDs make custom resources first-class API objects with schema, versions, validation, and kubectl support.

Operators turn operational knowledge into controllers, for example database failover, backups, certificate issuance, or app upgrades.

Kubernetes internals follow a watch-and-reconcile model over API objects stored in etcd.

Extending Kubernetes safely requires schema validation, idempotent controllers, finalizers, ownership, and observable status conditions.

Backup and restore procedures are part of the control-plane design, not an afterthought.

Hands-on example

1. Use a disposable kubeadm or kind-based lab for this exercise: create a simple CRD and a small controller/operator example. Do not practice destructive control-plane work on production.

2. Inspect API objects and controller behavior with kubectl get -w, events, status fields, and logs from the relevant controller.

3. For backup/restore topics, create a snapshot, restore into a separate environment, and verify objects and workloads after recovery.

4. Document the failure scenario, recovery steps, and validation commands.

What is the controller/reconciliation loop pattern in Kubernetes? (Intermediate)

Answer

The reconciliation loop is the core Kubernetes controller pattern: observe current state, compare it with desired state, take one safe action, and repeat until the system converges. This makes automation resilient to partial failures and drift.

Technical explanation

Controllers must be idempotent because the same event can be processed multiple times.
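The observe, compare, act cycle can be sketched without any Kubernetes client. This is a toy model under stated assumptions: state is a list of Pod names held in memory, and `reconcile` and `run_until_converged` are hypothetical names rather than any controller framework's API.

```python
from itertools import count

def reconcile(desired_replicas, actual_pods):
    """Compare desired and actual state and return one safe action."""
    if len(actual_pods) < desired_replicas:
        # Pick the first unused name so repeated calls never collide.
        name = next(f"pod-{i}" for i in count() if f"pod-{i}" not in actual_pods)
        return ("create", name)
    if len(actual_pods) > desired_replicas:
        return ("delete", actual_pods[-1])
    return ("noop", None)  # already converged: calling again changes nothing

def run_until_converged(desired_replicas, pods, max_steps=100):
    """Apply one action per iteration until reconcile reports nothing to do."""
    pods = list(pods)
    for _ in range(max_steps):
        action, name = reconcile(desired_replicas, pods)
        if action == "noop":
            break
        if action == "create":
            pods.append(name)
        else:
            pods.remove(name)
    return pods

print(run_until_converged(3, []))                    # scales up from nothing
print(run_until_converged(1, ["pod-0", "pod-1"]))    # scales down after drift
print(reconcile(2, ["pod-0", "pod-1"]))              # idempotent once converged
```

Because each call derives its action from current state rather than from the triggering event, processing the same event twice is harmless.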

Good controllers use finalizers, status conditions, backoff, and clear ownership references to manage lifecycle safely.

Kubernetes internals follow a watch-and-reconcile model over API objects stored in etcd.

Extending Kubernetes safely requires schema validation, idempotent controllers, finalizers, ownership, and observable status conditions.

Backup and restore procedures are part of the control-plane design, not an afterthought.

Hands-on example

1. Use a disposable kubeadm or kind-based lab for this exercise: write a reconciliation loop that updates status after creating a child resource. Do not practice destructive control-plane work on production.

2. Inspect API objects and controller behavior with kubectl get -w, events, status fields, and logs from the relevant controller.

3. For backup/restore topics, create a snapshot, restore into a separate environment, and verify objects and workloads after recovery.

4. Document the failure scenario, recovery steps, and validation commands.

How do you troubleshoot a Pod stuck in Pending? (Intermediate)

Answer

A Pod stuck in Pending usually means it cannot be scheduled or bound to required resources. I check events for insufficient resources, taints, affinity conflicts, unbound PVCs, quota limits, node selectors, or autoscaler behavior.

Technical explanation

Pending can happen before the image is ever pulled because the Pod has not been assigned to a node.

Always read the scheduling events before changing container images or app code.

Troubleshooting starts from state and events: get, describe, logs, previous logs, events, and then node/runtime/network checks.

Separate scheduling failures, image pull failures, runtime failures, app failures, and traffic-routing failures so you do not fix the wrong layer.

Operational commands like drain and rollback must respect PDBs, probes, and workload disruption tolerance.

Hands-on example

1. In a non-production namespace, create this safe broken scenario: create a Pending Pod with impossible resource requests and diagnose events.

2. Follow a fixed triage order: kubectl get, describe, logs or logs --previous, events, rollout status, node status, and then runtime/network checks.

3. Fix only one variable at a time so the root cause is clear rather than accidentally masked.

4. Save the commands and final diagnosis as an interview-ready incident walkthrough.

How do you troubleshoot a Pod in CrashLoopBackOff? (Intermediate)

Answer

CrashLoopBackOff means a container starts, exits, and kubelet backs off before restarting it again. I check previous logs, exit code, command/args, probes, config, secrets, dependencies, and recent deployment changes.

Technical explanation

BackOff is a symptom, not a root cause; the root cause is usually visible in previous logs, exit code, events, or config diff.
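The back-off itself is a simple capped exponential delay. The sketch below assumes the commonly documented kubelet defaults (10 seconds, doubling, capped at 5 minutes); treat the exact numbers as an assumption for your cluster version.

```python
def restart_delays(restarts, base_s=10, cap_s=300):
    """Return the wait before each restart: doubling from base_s, capped at cap_s."""
    return [min(base_s * 2 ** i, cap_s) for i in range(restarts)]

print(restart_delays(7))
```

This is why a crash-looping Pod appears to restart less and less often: the delay grows while the underlying fault stays the same.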

Check whether probes are killing an otherwise healthy slow-starting process.

Troubleshooting starts from state and events: get, describe, logs, previous logs, events, and then node/runtime/network checks.

Separate scheduling failures, image pull failures, runtime failures, app failures, and traffic-routing failures so you do not fix the wrong layer.

Operational commands like drain and rollback must respect PDBs, probes, and workload disruption tolerance.

Hands-on example

1. In a non-production namespace, create this safe broken scenario: create a CrashLoopBackOff with bad command and use --previous logs.

2. Follow a fixed triage order: kubectl get, describe, logs or logs --previous, events, rollout status, node status, and then runtime/network checks.

3. Fix only one variable at a time so the root cause is clear rather than accidentally masked.

4. Save the commands and final diagnosis as an interview-ready incident walkthrough.

How do you troubleshoot an ImagePullBackOff error? (Intermediate)

Answer

ImagePullBackOff means the node cannot pull the image. I check the image name and tag, registry reachability, imagePullSecrets, credentials, rate limits, private registry policies, and whether the image exists for the target architecture.

Technical explanation

ImagePullBackOff usually has a clear event message such as unauthorized, not found, manifest unknown, or TLS/network failure.

Private registries need imagePullSecrets or node-level registry credentials depending on the environment.

Troubleshooting starts from state and events: get, describe, logs, previous logs, events, and then node/runtime/network checks.

Separate scheduling failures, image pull failures, runtime failures, app failures, and traffic-routing failures so you do not fix the wrong layer.

Operational commands like drain and rollback must respect PDBs, probes, and workload disruption tolerance.

Hands-on example

1. In a non-production namespace, create this safe broken scenario: break an image name and fix ImagePullBackOff using events and imagePullSecrets.

2. Follow a fixed triage order: kubectl get, describe, logs or logs --previous, events, rollout status, node status, and then runtime/network checks.

3. Fix only one variable at a time so the root cause is clear rather than accidentally masked.

4. Save the commands and final diagnosis as an interview-ready incident walkthrough.

What does kubectl describe show that kubectl get does not? (Intermediate)

Answer

kubectl get gives a compact current view. kubectl describe adds details such as events, conditions, selected node, image IDs, mounts, resource settings, probe status, and reasons behind scheduling or runtime failures.

Technical explanation

describe is especially useful because it includes chronological Events that kubectl get does not show by default.

For deeper inspection, pair describe with kubectl get -o yaml to see exact spec and status fields.

Troubleshooting starts from state and events: get, describe, logs, previous logs, events, and then node/runtime/network checks.

Separate scheduling failures, image pull failures, runtime failures, app failures, and traffic-routing failures so you do not fix the wrong layer.

Operational commands like drain and rollback must respect PDBs, probes, and workload disruption tolerance.

Hands-on example

1. In a non-production namespace, compare the output of kubectl get, kubectl describe, and kubectl get -o yaml for the same Pod.

2. Follow a fixed triage order: kubectl get, describe, logs or logs --previous, events, rollout status, node status, and then runtime/network checks.

3. Fix only one variable at a time so the root cause is clear rather than accidentally masked.

4. Save the commands and final diagnosis as an interview-ready incident walkthrough.

How do you view logs from a crashed (previous) container instance? (Intermediate)

Answer

I view logs from the previous crashed container with kubectl logs POD -c CONTAINER --previous. That is important because the current container instance may not yet have produced logs or may be stuck restarting.

Technical explanation

--previous reads logs from the last terminated container instance, which is essential in CrashLoopBackOff.

If a Pod has multiple containers, always pass -c so you do not inspect the wrong container.

Troubleshooting starts from state and events: get, describe, logs, previous logs, events, and then node/runtime/network checks.

Separate scheduling failures, image pull failures, runtime failures, app failures, and traffic-routing failures so you do not fix the wrong layer.

Operational commands like drain and rollback must respect PDBs, probes, and workload disruption tolerance.

Hands-on example

1. In a non-production namespace, create this safe broken scenario: capture logs from the previous crashed container instance.

2. Follow a fixed triage order: kubectl get, describe, logs or logs --previous, events, rollout status, node status, and then runtime/network checks.

3. Fix only one variable at a time so the root cause is clear rather than accidentally masked.

4. Save the commands and final diagnosis as an interview-ready incident walkthrough.

What is a PodDisruptionBudget, and how does it protect availability during maintenance? (Intermediate)

Answer

A PodDisruptionBudget limits voluntary disruptions by specifying minAvailable or maxUnavailable. It protects availability during node drain, upgrades, or autoscaler scale-down, but it does not prevent involuntary failures.

Technical explanation

PDBs only affect voluntary disruptions through the eviction API; node crashes can still take Pods down.

Overly strict PDBs can block upgrades and drains, so set them according to replica count and real availability needs.
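The budget arithmetic can be sketched as follows. This is a simplified model that assumes integer values only (real PDBs also accept percentages), and the function and parameter names are hypothetical.

```python
def disruptions_allowed(expected, healthy, min_available=None, max_unavailable=None):
    """Return how many Pods may be voluntarily evicted right now.

    expected: replicas the workload should have; healthy: Pods currently ready.
    Exactly one of min_available or max_unavailable must be set.
    """
    if (min_available is None) == (max_unavailable is None):
        raise ValueError("set exactly one of min_available or max_unavailable")
    if min_available is not None:
        desired_healthy = min_available
    else:
        desired_healthy = expected - max_unavailable
    return max(0, healthy - desired_healthy)

print(disruptions_allowed(expected=3, healthy=3, min_available=2))  # one eviction may proceed
print(disruptions_allowed(expected=3, healthy=2, min_available=2))  # evictions are refused
print(disruptions_allowed(expected=1, healthy=1, min_available=1))  # single replica blocks drain
```

The last case shows the overly strict budget described above: with one replica and minAvailable of 1, a drain can never evict the Pod.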

Troubleshooting starts from state and events: get, describe, logs, previous logs, events, and then node/runtime/network checks.

Separate scheduling failures, image pull failures, runtime failures, app failures, and traffic-routing failures so you do not fix the wrong layer.

Operational commands like drain and rollback must respect PDBs, probes, and workload disruption tolerance.

Hands-on example

1. In a lab cluster, create a PDB for a test Deployment in a non-production namespace and observe how it affects kubectl drain.

2. Follow a fixed triage order: kubectl get, describe, logs or logs --previous, events, rollout status, node status, and then runtime/network checks.

3. Fix only one variable at a time so the root cause is clear rather than accidentally masked.

4. Save the commands and final diagnosis as an interview-ready incident walkthrough.

What is graceful termination, and how do preStop hooks and terminationGracePeriodSeconds work? (Intermediate)

Answer

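Graceful termination starts when a Pod is marked for deletion: Kubernetes runs the preStop hook if one is defined, then sends SIGTERM to the container, and sends SIGKILL if the process has not exited by the time terminationGracePeriodSeconds runs out. The grace period starts at the deletion request, so it covers both the hook and the SIGTERM handling.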

Technical explanation

The preStop hook runs before SIGTERM is sent, and its run time counts against the grace period, so a long hook leaves the application less time to shut down.

Applications should stop accepting new work, finish in-flight work, and exit before SIGKILL.
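The time budget can be sketched as simple arithmetic. This model assumes the hook runs to completion before SIGTERM and ignores edge cases such as any short extension granted when a hook overruns; the function name and return strings are illustrative.

```python
def termination_outcome(grace_period_s, pre_stop_s, app_shutdown_s):
    """Report whether the container exits on its own or is force-killed.

    The preStop hook and the application's SIGTERM handling share one budget:
    terminationGracePeriodSeconds, counted from the deletion request.
    """
    exit_at = pre_stop_s + app_shutdown_s
    if exit_at <= grace_period_s:
        return f"clean exit at t={exit_at}s"
    return f"SIGKILL at t={grace_period_s}s"

print(termination_outcome(grace_period_s=30, pre_stop_s=5, app_shutdown_s=10))
print(termination_outcome(grace_period_s=30, pre_stop_s=25, app_shutdown_s=10))
```

In the second case the application needs only 10 seconds, but the long hook has already used most of the budget.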

Troubleshooting starts from state and events: get, describe, logs, previous logs, events, and then node/runtime/network checks.

Separate scheduling failures, image pull failures, runtime failures, app failures, and traffic-routing failures so you do not fix the wrong layer.

Operational commands like drain and rollback must respect PDBs, probes, and workload disruption tolerance.

Hands-on example

1. In a non-production namespace, deploy a test Pod with a preStop hook and a short terminationGracePeriodSeconds, delete it, and observe how it handles SIGTERM.

2. Follow a fixed triage order: kubectl get, describe, logs or logs --previous, events, rollout status, node status, and then runtime/network checks.

3. Fix only one variable at a time so the root cause is clear rather than accidentally masked.

4. Save the commands and final diagnosis as an interview-ready incident walkthrough.

How do you safely drain and cordon a node for maintenance? (Intermediate)

Answer

To maintain a node safely, I cordon it first, drain it while respecting DaemonSets and PDBs, perform the maintenance, verify node health, and then uncordon it. I watch replacement Pods and disruption budgets during the process.

Technical explanation

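Use --ignore-daemonsets for drain because drain otherwise refuses to proceed when DaemonSet-managed Pods are present, and the DaemonSet controller would recreate those Pods on the node anyway.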

Check PDB violations before maintenance so upgrades do not stall midway.

Troubleshooting starts from state and events: get, describe, logs, previous logs, events, and then node/runtime/network checks.

Separate scheduling failures, image pull failures, runtime failures, app failures, and traffic-routing failures so you do not fix the wrong layer.

Operational commands like drain and rollback must respect PDBs, probes, and workload disruption tolerance.

Hands-on example

1. In a lab cluster, cordon and drain a disposable node while watching Pods reschedule.

2. Follow a fixed triage order: kubectl get, describe, logs or logs --previous, events, rollout status, node status, and then runtime/network checks.

3. Fix only one variable at a time so the root cause is clear rather than accidentally masked.

4. Save the commands and final diagnosis as an interview-ready incident walkthrough.

What is the difference between cordon, drain, and delete on a node? (Intermediate)

Answer

cordon marks a node unschedulable, drain evicts movable Pods from the node, and delete removes the Node object from the API. They are different lifecycle operations and should not be used interchangeably.

Technical explanation

Deleting a node object does not gracefully evict workloads from a healthy node the same way drain does.

After deleting a cloud node, the cloud provider or node group may replace it depending on autoscaling settings.

Troubleshooting starts from state and events: get, describe, logs, previous logs, events, and then node/runtime/network checks.

Separate scheduling failures, image pull failures, runtime failures, app failures, and traffic-routing failures so you do not fix the wrong layer.

Operational commands like drain and rollback must respect PDBs, probes, and workload disruption tolerance.

Hands-on example

1. In a lab cluster, compare cordon, drain, and delete on a disposable node.

2. Follow a fixed triage order: kubectl get, describe, logs or logs --previous, events, rollout status, node status, and then runtime/network checks.

3. Fix only one variable at a time so the root cause is clear rather than accidentally masked.

4. Save the commands and final diagnosis as an interview-ready incident walkthrough.

What are Kubernetes QoS classes (Guaranteed, Burstable, BestEffort)? (Intermediate)

Answer

Kubernetes QoS classes are Guaranteed, Burstable, and BestEffort. They are derived from requests and limits and affect eviction priority when a node is under resource pressure.

Technical explanation

Guaranteed requires every container to set both CPU and memory limits, with each request equal to its limit (CPU request equals CPU limit, and memory request equals memory limit).

BestEffort Pods have no requests or limits and are first candidates for eviction under pressure.
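The classification rules can be sketched as a small function. This is a simplified model: containers are plain dicts, quantities are compared as literal strings (so "500m" and "0.5" would not be treated as equal), and the function name is hypothetical.

```python
def qos_class(containers):
    """Derive the QoS class from each container's requests and limits."""
    any_set = False
    guaranteed = True
    for c in containers:
        limits = c.get("limits", {})
        # When a request is omitted but a limit is set, the request defaults to the limit.
        requests = {**limits, **c.get("requests", {})}
        if requests or limits:
            any_set = True
        for resource in ("cpu", "memory"):
            if resource not in limits or requests.get(resource) != limits[resource]:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"

same = {"cpu": "500m", "memory": "256Mi"}
print(qos_class([{"requests": same, "limits": same}]))  # requests equal limits everywhere
print(qos_class([{"requests": {"cpu": "100m"}}]))       # something set, but not Guaranteed
print(qos_class([{}]))                                  # nothing set at all
```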

Troubleshooting starts from state and events: get, describe, logs, previous logs, events, and then node/runtime/network checks.

Separate scheduling failures, image pull failures, runtime failures, app failures, and traffic-routing failures so you do not fix the wrong layer.

Operational commands like drain and rollback must respect PDBs, probes, and workload disruption tolerance.

Hands-on example

1. In a non-production namespace, create Guaranteed, Burstable, and BestEffort Pods and inspect the qosClass field in each Pod's status.

2. Follow a fixed triage order: kubectl get, describe, logs or logs --previous, events, rollout status, node status, and then runtime/network checks.

3. Fix only one variable at a time so the root cause is clear rather than accidentally masked.

4. Save the commands and final diagnosis as an interview-ready incident walkthrough.

How does Kubernetes handle a node that becomes NotReady? (Intermediate)

Answer

When a node becomes NotReady, Kubernetes stops treating it as healthy, stops routing to affected endpoints as conditions update, and eventually evicts Pods after configured toleration periods. Controllers then recreate Pods elsewhere if capacity exists.

Technical explanation

Node conditions and taints such as node.kubernetes.io/not-ready influence scheduling and eviction behavior.

Stateful workloads may require careful storage reattachment before replacement Pods become healthy.

Troubleshooting starts from state and events: get, describe, logs, previous logs, events, and then node/runtime/network checks.

Separate scheduling failures, image pull failures, runtime failures, app failures, and traffic-routing failures so you do not fix the wrong layer.

Operational commands like drain and rollback must respect PDBs, probes, and workload disruption tolerance.

Hands-on example

1. In a lab cluster, simulate a NotReady node and observe the resulting taints and rescheduling.

2. Follow a fixed triage order: kubectl get, describe, logs or logs --previous, events, rollout status, node status, and then runtime/network checks.

3. Fix only one variable at a time so the root cause is clear rather than accidentally masked.

4. Save the commands and final diagnosis as an interview-ready incident walkthrough.

What is the difference between horizontal scaling of Pods and scaling nodes? (Intermediate)

Answer

Horizontal Pod scaling increases or decreases application replicas. Node scaling adds or removes infrastructure capacity. A healthy platform usually needs both: HPA creates demand for capacity, and Cluster Autoscaler supplies nodes when required.

Technical explanation

More Pods without more nodes can stay Pending if cluster capacity is exhausted.

More nodes without workload scaling does not increase application throughput unless there are replicas to run on them.
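The interaction between the two layers can be sketched with the documented HPA ratio formula plus a toy capacity check. The sketch ignores tolerance, min/max replica bounds, and stabilization windows, and `pods_per_node` is a hypothetical simplification of real bin-packing.

```python
import math

def desired_replicas(current_replicas, current_metric, target_metric):
    """HPA core formula: scale replicas by the ratio of current to target metric."""
    return math.ceil(current_replicas * current_metric / target_metric)

def pending_pods(replicas, pods_per_node, nodes):
    """Pods that cannot be placed when every node holds at most pods_per_node."""
    return max(0, replicas - pods_per_node * nodes)

wanted = desired_replicas(current_replicas=4, current_metric=90, target_metric=60)
print(wanted)                                          # HPA asks for more replicas
print(pending_pods(wanted, pods_per_node=2, nodes=2))  # some stay Pending without more nodes
print(pending_pods(wanted, pods_per_node=2, nodes=3))  # node scaling clears the backlog
```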

Troubleshooting starts from state and events: get, describe, logs, previous logs, events, and then node/runtime/network checks.

Separate scheduling failures, image pull failures, runtime failures, app failures, and traffic-routing failures so you do not fix the wrong layer.

Operational commands like drain and rollback must respect PDBs, probes, and workload disruption tolerance.

Hands-on example

1. In a lab cluster, scale replicas and node count separately under load and note when Pods go Pending.

2. Follow a fixed triage order: kubectl get, describe, logs or logs --previous, events, rollout status, node status, and then runtime/network checks.

3. Fix only one variable at a time so the root cause is clear rather than accidentally masked.

4. Save the commands and final diagnosis as an interview-ready incident walkthrough.

How do you expose a service externally on EKS, and what gets created? (Intermediate)

Answer

On EKS, I can expose a service externally with a LoadBalancer Service or an Ingress. Depending on the controller and annotations, AWS creates an NLB or an ALB along with target groups, security group rules, listeners, and a DNS name.

Technical explanation

The AWS Load Balancer Controller is commonly used for ALB Ingress and NLB/target group integrations.

Subnets, tags, security groups, target type, health checks, and annotations determine the AWS resources created.

Troubleshooting starts from state and events: get, describe, logs, previous logs, events, and then node/runtime/network checks.

Separate scheduling failures, image pull failures, runtime failures, app failures, and traffic-routing failures so you do not fix the wrong layer.

Operational commands like drain and rollback must respect PDBs, probes, and workload disruption tolerance.

Hands-on example

1. In a non-production namespace, create this safe broken scenario: expose a sample app on EKS with Service type LoadBalancer or ALB Ingress.

2. Follow a fixed triage order: kubectl get, describe, logs or logs --previous, events, rollout status, node status, and then runtime/network checks.

3. Fix only one variable at a time so the root cause is clear rather than accidentally masked.

4. Save the commands and final diagnosis as an interview-ready incident walkthrough.

What is a container, and how is it different from a virtual machine? (Intermediate)

Answer

A container packages a process with its filesystem and runtime isolation using kernel primitives. A virtual machine virtualizes hardware and runs a full guest OS, while containers share the host kernel and are usually lighter and faster to start.

Technical explanation

Containers are not a security boundary equivalent to a VM; they share the host kernel and need runtime hardening.

Their value is packaging consistency, fast startup, resource efficiency, and portable deployment workflows.

Container image quality affects supply chain, startup time, vulnerability surface, rollout reliability, and debugging workflows.

Prefer reproducible builds: pinned dependencies, small build context, deterministic Dockerfile order, non-root runtime, and immutable image references.

Understand the runtime boundary: an image is not a VM, and container isolation depends on kernel, namespaces, cgroups, capabilities, seccomp, and mounts.

Hands-on example

1. Create a tiny sample app and Dockerfile for this exercise: run the same app in a container and a VM-like environment and compare startup/isolation.

2. Build and inspect it with docker build or podman build, docker history, image inspect, and a vulnerability or size scan if available.

3. Run it locally with explicit env vars, ports, user, volumes, and signal tests depending on the question.

4. Convert the final runtime assumptions into Kubernetes fields such as image, command, args, ports, securityContext, probes, and volumeMounts.

What is a container image, and what are layers?

Intermediate

Answer

A container image is an immutable template made of layers plus metadata. Layers are content-addressed filesystem changes, which makes distribution and caching efficient because unchanged layers can be reused.

Technical explanation

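Dockerfile instructions that change the filesystem (RUN, COPY, ADD) create layers, while instructions such as ENV, LABEL, or CMD only change image metadata. Layers are reused by digest when unchanged.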

Image metadata includes config such as entrypoint, command, exposed ports, env vars, labels, and user.

Container image quality affects supply chain, startup time, vulnerability surface, rollout reliability, and debugging workflows.

Prefer reproducible builds: pinned dependencies, small build context, deterministic Dockerfile order, non-root runtime, and immutable image references.

Understand the runtime boundary: an image is not a VM, and container isolation depends on kernel, namespaces, cgroups, capabilities, seccomp, and mounts.

Hands-on example

1. Create a tiny sample app and Dockerfile for this exercise: build an image and inspect layers with docker history or podman history.

2. Build and inspect it with docker build or podman build, docker history, image inspect, and a vulnerability or size scan if available.

3. Run it locally with explicit env vars, ports, user, volumes, and signal tests depending on the question.

4. Convert the final runtime assumptions into Kubernetes fields such as image, command, args, ports, securityContext, probes, and volumeMounts.
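
To make the content-addressing idea above concrete, here is a minimal Python sketch. It is a toy model, not how Docker or a registry is implemented: each layer is a plain byte string, and the names layer_digest, layers_to_pull, and the sample contents are hypothetical. Real layer digests are computed over layer archives rather than short strings.

```python
import hashlib


def layer_digest(content: bytes) -> str:
    """Return a sha256 digest string in the style registries display."""
    return "sha256:" + hashlib.sha256(content).hexdigest()


def layers_to_pull(image_layers: list[bytes], local_store: set[str]) -> list[str]:
    """Return digests of layers in the image that are not stored locally yet."""
    return [d for d in map(layer_digest, image_layers) if d not in local_store]


# Hypothetical layer contents for two versions of the same app.
base = b"base OS files"
deps = b"installed dependencies"
app_v1 = b"application code v1"
app_v2 = b"application code v2"

# Version 1 was pulled earlier, so its three layers are already local.
store = {layer_digest(x) for x in (base, deps, app_v1)}

# Version 2 shares base and deps; only the application layer differs.
missing = layers_to_pull([base, deps, app_v2], store)
print(len(missing), "layer(s) to download")
```

Because identical content always hashes to the same digest, the unchanged base and dependency layers are recognized as already present, which is the reuse the answer describes.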

How does Docker layer caching work, and how do you order a Dockerfile to exploit it?

Intermediate

Answer

Docker caches each build step as a layer. To exploit caching, I put stable dependency installation steps before frequently changing source code, copy lock files before the rest of the app, and keep the build context small.

Technical explanation

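Changing an early Dockerfile instruction invalidates the cache for that step and every step after it. For COPY and ADD, the cache also depends on the contents of the copied files, so editing a copied file invalidates the cache from that step onward.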

Copy dependency manifests first, install dependencies, then copy frequently changing application source.

Container image quality affects supply chain, startup time, vulnerability surface, rollout reliability, and debugging workflows.

Prefer reproducible builds: pinned dependencies, small build context, deterministic Dockerfile order, non-root runtime, and immutable image references.

Understand the runtime boundary: an image is not a VM, and container isolation depends on kernel, namespaces, cgroups, capabilities, seccomp, and mounts.

Hands-on example

1. Create a tiny sample app and Dockerfile for this exercise: reorder a Dockerfile and measure cache hits and misses.

2. Build and inspect it with docker build or podman build, docker history, image inspect, and a vulnerability or size scan if available.

3. Run it locally with explicit env vars, ports, user, volumes, and signal tests depending on the question.

4. Convert the final runtime assumptions into Kubernetes fields such as image, command, args, ports, securityContext, probes, and volumeMounts.
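
The ordering advice above can be illustrated with a small simulation. This is a sketch under simplifying assumptions, not Docker's real cache algorithm: each step's cache key is a hash of the previous key, the instruction text, and the contents of any files the step copies. The function names, file names, and Dockerfile steps are hypothetical.

```python
import hashlib


def cache_keys(steps, files):
    """Chain one key per step: parent key + instruction + copied file contents."""
    keys, parent = [], ""
    for instruction, copied in steps:
        h = hashlib.sha256(parent.encode())
        h.update(instruction.encode())
        for name in sorted(copied):
            h.update(name.encode())
            h.update(files[name])
        parent = h.hexdigest()
        keys.append(parent)
    return keys


def cache_hits(old_keys, new_keys):
    """Count leading steps with unchanged keys; the first miss stops reuse."""
    hits = 0
    for old, new in zip(old_keys, new_keys):
        if old != new:
            break
        hits += 1
    return hits


everything = ["requirements.txt", "app.py"]
# Dependencies are installed before the source is copied.
good_order = [
    ("FROM python:3.12-slim", []),
    ("COPY requirements.txt .", ["requirements.txt"]),
    ("RUN pip install -r requirements.txt", []),
    ("COPY . .", everything),
]
# The whole source tree is copied before dependencies are installed.
bad_order = [
    ("FROM python:3.12-slim", []),
    ("COPY . .", everything),
    ("RUN pip install -r requirements.txt", []),
]

before = {"requirements.txt": b"flask==3.0.0\n", "app.py": b"print('v1')\n"}
after = {"requirements.txt": b"flask==3.0.0\n", "app.py": b"print('v2')\n"}

good_hits = cache_hits(cache_keys(good_order, before), cache_keys(good_order, after))
bad_hits = cache_hits(cache_keys(bad_order, before), cache_keys(bad_order, after))
print("good order reuses", good_hits, "steps; bad order reuses", bad_hits)
```

In this model a source-only change leaves the dependency install step cached in the first ordering, while the second ordering reruns it on every edit.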

What is the difference between an image and a container?

Intermediate

Answer

An image is the packaged artifact, and a container is a running or stopped instance created from that image. Multiple containers can run from the same image with different config, environment, mounts, and network settings.

Technical explanation

Deleting a container does not delete the image unless explicitly removed.

Container runtime settings such as env, mounts, command, and ports are applied when the container is created.

Container image quality affects supply chain, startup time, vulnerability surface, rollout reliability, and debugging workflows.

Prefer reproducible builds: pinned dependencies, small build context, deterministic Dockerfile order, non-root runtime, and immutable image references.

Understand the runtime boundary: an image is not a VM, and container isolation depends on kernel, namespaces, cgroups, capabilities, seccomp, and mounts.

Hands-on example

1. Create a tiny sample app and Dockerfile for this exercise: create multiple containers from the same image with different commands/env.

2. Build and inspect it with docker build or podman build, docker history, image inspect, and a vulnerability or size scan if available.

3. Run it locally with explicit env vars, ports, user, volumes, and signal tests depending on the question.

4. Convert the final runtime assumptions into Kubernetes fields such as image, command, args, ports, securityContext, probes, and volumeMounts.

What is the difference between CMD and ENTRYPOINT in a Dockerfile?

Intermediate

Answer

ENTRYPOINT defines the main executable for the container, while CMD provides default arguments or a default command. In production images, I often use exec-form ENTRYPOINT plus CMD for overridable defaults.

Technical explanation

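Shell form runs the command under /bin/sh -c, which allows shell expansion but makes the shell PID 1, so the application may not receive signals such as SIGTERM; exec form is usually better for production.

CMD can be overridden easily by passing arguments to docker run. In Kubernetes, the args field overrides CMD and the command field overrides ENTRYPOINT.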

Container image quality affects supply chain, startup time, vulnerability surface, rollout reliability, and debugging workflows.

Prefer reproducible builds: pinned dependencies, small build context, deterministic Dockerfile order, non-root runtime, and immutable image references.

Understand the runtime boundary: an image is not a VM, and container isolation depends on kernel, namespaces, cgroups, capabilities, seccomp, and mounts.

Hands-on example

1. Create a tiny sample app and Dockerfile for this exercise: test CMD and ENTRYPOINT override behavior with docker run arguments.

2. Build and inspect it with docker build or podman build, docker history, image inspect, and a vulnerability or size scan if available.

3. Run it locally with explicit env vars, ports, user, volumes, and signal tests depending on the question.

4. Convert the final runtime assumptions into Kubernetes fields such as image, command, args, ports, securityContext, probes, and volumeMounts.
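
The override behavior can be summarized as a small function. This is a sketch of the Kubernetes command/args rules as they apply to an image's ENTRYPOINT and CMD; the function name effective_argv and the sample values are hypothetical.

```python
def effective_argv(image_entrypoint, image_cmd, command=None, args=None):
    """Return the argv a container starts with.

    command overrides ENTRYPOINT; args overrides CMD.
    Setting command alone discards the image's CMD.
    """
    if command is not None:
        return list(command) + list(args or [])
    return list(image_entrypoint) + list(args if args is not None else image_cmd)


# Hypothetical image: ENTRYPOINT ["python", "app.py"], CMD ["--port", "8080"]
entrypoint = ["python", "app.py"]
cmd = ["--port", "8080"]

print(effective_argv(entrypoint, cmd))                            # image defaults
print(effective_argv(entrypoint, cmd, args=["--port", "9090"]))   # CMD replaced
print(effective_argv(entrypoint, cmd, command=["sh"]))            # ENTRYPOINT replaced
```

This is why exec-form ENTRYPOINT plus CMD works well for defaults: callers can change the arguments without needing to know the executable.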

What is the difference between COPY and ADD?

Intermediate

Answer

COPY copies files from the build context into the image. ADD can also extract local tar archives and fetch remote URLs, but I avoid ADD unless I specifically need those extra behaviors because COPY is clearer and safer.

Technical explanation

ADD remote URL behavior is less explicit and less controllable than using curl with checksum validation in a build step.

COPY is preferred for predictable builds and clearer code review.

Container image quality affects supply chain, startup time, vulnerability surface, rollout reliability, and debugging workflows.

Prefer reproducible builds: pinned dependencies, small build context, deterministic Dockerfile order, non-root runtime, and immutable image references.

Understand the runtime boundary: an image is not a VM, and container isolation depends on kernel, namespaces, cgroups, capabilities, seccomp, and mounts.

Hands-on example

1. Create a tiny sample app and Dockerfile for this exercise: compare COPY and ADD with a tar archive in a lab build.

2. Build and inspect it with docker build or podman build, docker history, image inspect, and a vulnerability or size scan if available.

3. Run it locally with explicit env vars, ports, user, volumes, and signal tests depending on the question.

4. Convert the final runtime assumptions into Kubernetes fields such as image, command, args, ports, securityContext, probes, and volumeMounts.

What is a multi-stage build, and why does it reduce image size and risk?

Advanced

Answer

A multi-stage build uses one stage to compile or package the application and a later runtime stage to contain only what is needed to run it. That reduces image size, removes build tools, and reduces vulnerability surface.

Technical explanation

Build stages can be named and artifacts copied from one stage to another with COPY --from.

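Never rely on multi-stage builds alone to protect secrets; do not pass secrets through normal build args or copied files, because they can persist in image history or intermediate layers. Where BuildKit is available, use build secret mounts (RUN --mount=type=secret) instead.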

Container image quality affects supply chain, startup time, vulnerability surface, rollout reliability, and debugging workflows.

Prefer reproducible builds: pinned dependencies, small build context, deterministic Dockerfile order, non-root runtime, and immutable image references.

Understand the runtime boundary: an image is not a VM, and container isolation depends on kernel, namespaces, cgroups, capabilities, seccomp, and mounts.

Hands-on example

1. Create a tiny sample app and Dockerfile for this exercise: convert a single-stage build into a multi-stage build.

2. Build and inspect it with docker build or podman build, docker history, image inspect, and a vulnerability or size scan if available.

3. Run it locally with explicit env vars, ports, user, volumes, and signal tests depending on the question.

4. Convert the final runtime assumptions into Kubernetes fields such as image, command, args, ports, securityContext, probes, and volumeMounts.

Why should containers run as a non-root user, and how do you enforce it?

Advanced

Answer

Containers should run as non-root so a process compromise has less privilege inside the container and less chance of dangerous host interaction. I enforce it in the Dockerfile and in Kubernetes securityContext or admission policy.

Technical explanation

Non-root must be compatible with file ownership, writable directories, and low-port binding constraints.

In Kubernetes, enforce runAsNonRoot: true, a non-zero runAsUser, allowPrivilegeEscalation: false, and dropped capabilities in the securityContext.

Container image quality affects supply chain, startup time, vulnerability surface, rollout reliability, and debugging workflows.

Prefer reproducible builds: pinned dependencies, small build context, deterministic Dockerfile order, non-root runtime, and immutable image references.

Understand the runtime boundary: an image is not a VM, and container isolation depends on kernel, namespaces, cgroups, capabilities, seccomp, and mounts.

Hands-on example

1. Create a tiny sample app and Dockerfile for this exercise: run a container as non-root and enforce Kubernetes runAsNonRoot.

2. Build and inspect it with docker build or podman build, docker history, image inspect, and a vulnerability or size scan if available.

3. Run it locally with explicit env vars, ports, user, volumes, and signal tests depending on the question.

4. Convert the final runtime assumptions into Kubernetes fields such as image, command, args, ports, securityContext, probes, and volumeMounts.
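
As an illustration of the admission-policy idea, the following Python sketch checks a container spec (represented as a plain dict) for the securityContext settings named above. It is not a real admission controller; the function name security_violations and the sample specs are hypothetical, while the field names are the standard Kubernetes ones.

```python
def security_violations(container: dict) -> list[str]:
    """List the non-root hardening rules that a container spec fails."""
    ctx = container.get("securityContext", {})
    problems = []
    if ctx.get("runAsNonRoot") is not True:
        problems.append("runAsNonRoot must be true")
    if ctx.get("runAsUser") == 0:
        problems.append("runAsUser must not be 0")
    if ctx.get("allowPrivilegeEscalation") is not False:
        problems.append("allowPrivilegeEscalation must be false")
    if "ALL" not in ctx.get("capabilities", {}).get("drop", []):
        problems.append("capabilities.drop must include ALL")
    return problems


hardened = {
    "name": "app",
    "securityContext": {
        "runAsNonRoot": True,
        "runAsUser": 10001,
        "allowPrivilegeEscalation": False,
        "capabilities": {"drop": ["ALL"]},
    },
}
unhardened = {"name": "app"}  # no securityContext at all

print(security_violations(hardened))
print(security_violations(unhardened))
```

In a real cluster the same rules would usually be enforced by Pod Security admission or a policy engine rather than by hand-written code.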

What is a distroless or scratch image, and what are the trade-offs?

Advanced

Answer

scratch and distroless images are minimal runtime images with little or no OS userland. They reduce image size and attack surface, but they make debugging harder because tools like shell, package manager, curl, or ps may be absent.

Technical explanation

Distroless images often include runtime libraries and CA certs, while scratch is completely empty unless you copy everything required.

For debugging, use ephemeral debug containers or rebuild a debug variant rather than adding tools to production images.

Container image quality affects supply chain, startup time, vulnerability surface, rollout reliability, and debugging workflows.

Prefer reproducible builds: pinned dependencies, small build context, deterministic Dockerfile order, non-root runtime, and immutable image references.

Understand the runtime boundary: an image is not a VM, and container isolation depends on kernel, namespaces, cgroups, capabilities, seccomp, and mounts.

Hands-on example

1. Create a tiny sample app and Dockerfile for this exercise: build a distroless runtime image and debug it with an ephemeral debug container.

2. Build and inspect it with docker build or podman build, docker history, image inspect, and a vulnerability or size scan if available.

3. Run it locally with explicit env vars, ports, user, volumes, and signal tests depending on the question.

4. Convert the final runtime assumptions into Kubernetes fields such as image, command, args, ports, securityContext, probes, and volumeMounts.

What is the difference between EXPOSE and publishing a port with -p?

Advanced

Answer

EXPOSE is image metadata documenting the port the application listens on. Publishing with -p or --publish creates an actual host-to-container port mapping at runtime.

Technical explanation

EXPOSE does not open firewall rules and does not publish anything by itself.

In Kubernetes, containerPort is similarly informational: omitting it does not stop the port from being reachable, and Services decide the actual routing.

Container image quality affects supply chain, startup time, vulnerability surface, rollout reliability, and debugging workflows.

Prefer reproducible builds: pinned dependencies, small build context, deterministic Dockerfile order, non-root runtime, and immutable image references.

Understand the runtime boundary: an image is not a VM, and container isolation depends on kernel, namespaces, cgroups, capabilities, seccomp, and mounts.

Hands-on example

1. Create a tiny sample app and Dockerfile for this exercise: compare EXPOSE with docker run -p using curl from the host.

2. Build and inspect it with docker build or podman build, docker history, image inspect, and a vulnerability or size scan if available.

3. Run it locally with explicit env vars, ports, user, volumes, and signal tests depending on the question.

4. Convert the final runtime assumptions into Kubernetes fields such as image, command, args, ports, securityContext, probes, and volumeMounts.

What is a Docker volume versus a bind mount?

Advanced

Answer

A Docker volume is managed by the container engine and survives container replacement. A bind mount maps a specific host path into the container, which is useful for development but can leak host coupling into production.

Technical explanation

Volumes are portable across container replacements on the same engine; bind mounts are tied to a host path and host permissions.

In Kubernetes, avoid hostPath unless absolutely necessary because it couples Pods to node filesystem layout.

Container image quality affects supply chain, startup time, vulnerability surface, rollout reliability, and debugging workflows.

Prefer reproducible builds: pinned dependencies, small build context, deterministic Dockerfile order, non-root runtime, and immutable image references.

Understand the runtime boundary: an image is not a VM, and container isolation depends on kernel, namespaces, cgroups, capabilities, seccomp, and mounts.

Hands-on example

1. Create a tiny sample app and Dockerfile for this exercise: persist data with a Docker volume and compare it with a bind mount.

2. Build and inspect it with docker build or podman build, docker history, image inspect, and a vulnerability or size scan if available.

3. Run it locally with explicit env vars, ports, user, volumes, and signal tests depending on the question.

4. Convert the final runtime assumptions into Kubernetes fields such as image, command, args, ports, securityContext, probes, and volumeMounts.

How do you reduce the size of a Docker image?

Advanced

Answer

I reduce image size with multi-stage builds, slim or distroless bases, .dockerignore, dependency pruning, cache cleanup in the same layer, and by avoiding unnecessary build tools or copied artifacts in the runtime image.

Technical explanation

Remove package manager caches in the same RUN layer where packages are installed.

Measure with docker history, dive, or build tool output rather than guessing where size comes from.

Container image quality affects supply chain, startup time, vulnerability surface, rollout reliability, and debugging workflows.

Prefer reproducible builds: pinned dependencies, small build context, deterministic Dockerfile order, non-root runtime, and immutable image references.

Understand the runtime boundary: an image is not a VM, and container isolation depends on kernel, namespaces, cgroups, capabilities, seccomp, and mounts.

Hands-on example

1. Create a tiny sample app and Dockerfile for this exercise: reduce image size and verify before/after with docker images and history.

2. Build and inspect it with docker build or podman build, docker history, image inspect, and a vulnerability or size scan if available.

3. Run it locally with explicit env vars, ports, user, volumes, and signal tests depending on the question.

4. Convert the final runtime assumptions into Kubernetes fields such as image, command, args, ports, securityContext, probes, and volumeMounts.

What is a .dockerignore file, and why does it matter?

Advanced

Answer

.dockerignore controls which files are excluded from the build context. It matters because sending secrets, git history, node_modules, test outputs, or large artifacts to the builder slows builds and can accidentally bake sensitive files into images.

Technical explanation

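With the classic builder, the whole build context is sent to the daemon before any Dockerfile instruction runs, so .dockerignore improves performance even if the files are never copied. BuildKit transfers context files more selectively, but .dockerignore still limits what a broad COPY . . can pick up.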

It also prevents accidental leakage of .env files, credentials, SSH keys, and local build artifacts.

Container image quality affects supply chain, startup time, vulnerability surface, rollout reliability, and debugging workflows.

Prefer reproducible builds: pinned dependencies, small build context, deterministic Dockerfile order, non-root runtime, and immutable image references.

Understand the runtime boundary: an image is not a VM, and container isolation depends on kernel, namespaces, cgroups, capabilities, seccomp, and mounts.

Hands-on example

1. Create a tiny sample app and Dockerfile for this exercise: add .dockerignore and measure build context size reduction.

2. Build and inspect it with docker build or podman build, docker history, image inspect, and a vulnerability or size scan if available.

3. Run it locally with explicit env vars, ports, user, volumes, and signal tests depending on the question.

4. Convert the final runtime assumptions into Kubernetes fields such as image, command, args, ports, securityContext, probes, and volumeMounts.
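
The effect on build context size can be approximated with a short Python sketch. It is deliberately simplified: real .dockerignore matching supports ** patterns and ! exceptions, which this toy filter does not implement. The function name filter_context, the file list, and the sizes are hypothetical.

```python
from fnmatch import fnmatch


def filter_context(files: dict[str, int], patterns: list[str]) -> dict[str, int]:
    """Keep files that match no ignore pattern (simplified matching rules)."""
    kept = {}
    for path, size in files.items():
        ignored = any(
            fnmatch(path, p) or path.startswith(p.rstrip("/") + "/")
            for p in patterns
        )
        if not ignored:
            kept[path] = size
    return kept


# Hypothetical project tree: path -> size in bytes.
project = {
    "app.py": 2_000,
    "requirements.txt": 100,
    ".env": 50,
    ".git/objects/pack": 40_000_000,
    "node_modules/lib/index.js": 5_000_000,
}
ignore = [".git", "node_modules", ".env", "*.log"]

context = filter_context(project, ignore)
print("before:", sum(project.values()), "bytes")
print("after: ", sum(context.values()), "bytes")
print("kept:  ", sorted(context))
```

Besides the size reduction, the .env file never reaches the builder, so it cannot be baked into a layer by accident.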

What is the difference between the build context and the image?

Advanced

Answer

The build context is the set of local files sent to the builder and available to COPY or ADD. The image is the final content-addressed artifact produced after Dockerfile instructions execute.

Technical explanation

Dockerfile COPY cannot access files outside the build context.

A large context can slow down builds, especially on remote builders, and can cause unnecessary cache invalidation even when the final image is small.

Container image quality affects supply chain, startup time, vulnerability surface, rollout reliability, and debugging workflows.

Prefer reproducible builds: pinned dependencies, small build context, deterministic Dockerfile order, non-root runtime, and immutable image references.

Understand the runtime boundary: an image is not a VM, and container isolation depends on kernel, namespaces, cgroups, capabilities, seccomp, and mounts.

Hands-on example

1. Create a tiny sample app and Dockerfile for this exercise: demonstrate that COPY can only read files inside the build context.

2. Build and inspect it with docker build or podman build, docker history, image inspect, and a vulnerability or size scan if available.

3. Run it locally with explicit env vars, ports, user, volumes, and signal tests depending on the question.

4. Convert the final runtime assumptions into Kubernetes fields such as image, command, args, ports, securityContext, probes, and volumeMounts.
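
The rule that COPY cannot reach outside the build context can be modeled as a path check. The sketch below is a toy model in Python, not the builder's actual code; the function name resolve_copy_source and the example paths are hypothetical.

```python
import posixpath


def resolve_copy_source(context_root: str, src: str) -> str:
    """Resolve a COPY source against the context root, rejecting escapes."""
    root = posixpath.normpath(context_root)
    candidate = posixpath.normpath(posixpath.join(root, src))
    if posixpath.commonpath([root, candidate]) != root:
        raise ValueError(f"{src!r} is outside the build context")
    return candidate


print(resolve_copy_source("/work/app", "src/main.py"))

try:
    resolve_copy_source("/work/app", "../secrets.txt")
except ValueError as err:
    print("rejected:", err)
```

If a build genuinely needs a file from a parent directory, the usual fix is to run the build from a higher-level context directory rather than to work around the check.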

What is the role of the ENTRYPOINT exec form versus shell form regarding signals?

Advanced

Answer

Exec form ENTRYPOINT runs the process directly, so it receives signals properly as PID 1. Shell form wraps the command in /bin/sh -c, which can swallow signals, complicate argument parsing, and break graceful shutdown.

Technical explanation

PID 1 has special signal and zombie-reaping behavior in Linux, so container entrypoint design affects graceful shutdown.

Use exec form, and add a minimal init such as tini only when the app does not reap child processes or does not handle signals correctly as PID 1.

Container image quality affects supply chain, startup time, vulnerability surface, rollout reliability, and debugging workflows.

Prefer reproducible builds: pinned dependencies, small build context, deterministic Dockerfile order, non-root runtime, and immutable image references.

Understand the runtime boundary: an image is not a VM, and container isolation depends on kernel, namespaces, cgroups, capabilities, seccomp, and mounts.

Hands-on example

1. Create a tiny sample app and Dockerfile for this exercise: compare shell-form and exec-form ENTRYPOINT signal handling.

2. Build and inspect it with docker build or podman build, docker history, image inspect, and a vulnerability or size scan if available.

3. Run it locally with explicit env vars, ports, user, volumes, and signal tests depending on the question.

4. Convert the final runtime assumptions into Kubernetes fields such as image, command, args, ports, securityContext, probes, and volumeMounts.
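
For the signal test, the application side might look like the following Python sketch, assuming a Unix host. With an exec-form ENTRYPOINT the process would receive SIGTERM directly when the container is stopped; here the script sends the signal to itself as a stand-in. The names handle_sigterm and serve are hypothetical.

```python
import os
import signal
import time

shutdown_requested = False


def handle_sigterm(signum, frame):
    """Record the request so the main loop can exit cleanly."""
    global shutdown_requested
    shutdown_requested = True


# Register the handler; a process running as PID 1 has no default
# action for SIGTERM, so an explicit handler matters.
signal.signal(signal.SIGTERM, handle_sigterm)


def serve(poll_seconds: float = 0.05, max_loops: int = 100) -> bool:
    """Loop until shutdown is requested or the loop budget runs out."""
    loops = 0
    while not shutdown_requested and loops < max_loops:
        time.sleep(poll_seconds)
        loops += 1
    return shutdown_requested


# Stand-in for the SIGTERM a container engine sends on stop.
os.kill(os.getpid(), signal.SIGTERM)
print("graceful shutdown requested:", serve())
```

Under shell form, the same signal would go to /bin/sh instead, and this handler might never run before the engine falls back to SIGKILL.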

How do you debug a container that exits immediately on start?

Advanced

Answer

To debug a container that exits immediately, I inspect logs, exit code, command, environment, image architecture, missing files, and permissions. Then I override the entrypoint or run a debug shell if the image has one.

Technical explanation

A container that exits immediately may be working as designed for a one-shot command; not every exit is a failure.

Check whether the process needs a foreground command rather than a daemonizing background command.

Container image quality affects supply chain, startup time, vulnerability surface, rollout reliability, and debugging workflows.

Prefer reproducible builds: pinned dependencies, small build context, deterministic Dockerfile order, non-root runtime, and immutable image references.

Understand the runtime boundary: an image is not a VM, and container isolation depends on kernel, namespaces, cgroups, capabilities, seccomp, and mounts.

Hands-on example

1. Create a tiny sample app and Dockerfile for this exercise: override entrypoint and inspect logs for a container that exits immediately.

2. Build and inspect it with docker build or podman build, docker history, image inspect, and a vulnerability or size scan if available.

3. Run it locally with explicit env vars, ports, user, volumes, and signal tests depending on the question.

4. Convert the final runtime assumptions into Kubernetes fields such as image, command, args, ports, securityContext, probes, and volumeMounts.
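
When reading the exit code, a small lookup helps separate engine errors from application errors. The Python sketch below follows the exit status conventions described in the Docker run reference (125, 126, 127) and the common shell convention of 128 plus the signal number; the function name describe_exit is hypothetical, and signal names assume a Unix host.

```python
import signal


def describe_exit(code: int) -> str:
    """Map a container exit code to a likely first explanation."""
    if code == 0:
        return "exited successfully (possibly a one-shot command)"
    if code == 125:
        return "the container engine itself failed to run the container"
    if code == 126:
        return "command found but could not be invoked (check permissions)"
    if code == 127:
        return "command not found (check ENTRYPOINT/CMD and PATH)"
    if code > 128:
        try:
            name = signal.Signals(code - 128).name
        except ValueError:
            name = f"signal {code - 128}"
        return f"terminated by {name}"
    return f"application error, exit code {code}"


for code in (0, 1, 127, 137, 143):
    print(code, "->", describe_exit(code))
```

The description is only a starting hypothesis; logs and the configured command still need to be checked before concluding anything.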

What is a container registry, and how do image tags and digests work?

Advanced

Answer

A registry stores and distributes container images. Tags are mutable pointers to image manifests, while digests are immutable content identifiers; production deployments should prefer versioned tags and, for strict reproducibility, digest pinning.

Technical explanation

Tags such as prod or latest can move; digests such as sha256:... identify exact content.

Registries often support vulnerability scanning, signing, retention policies, and immutability controls.

Container image quality affects supply chain, startup time, vulnerability surface, rollout reliability, and debugging workflows.

Prefer reproducible builds: pinned dependencies, small build context, deterministic Dockerfile order, non-root runtime, and immutable image references.

Understand the runtime boundary: an image is not a VM, and container isolation depends on kernel, namespaces, cgroups, capabilities, seccomp, and mounts.

Hands-on example

1. Create a tiny sample app and Dockerfile for this exercise: push an image to a registry and deploy by tag versus digest.

2. Build and inspect it with docker build or podman build, docker history, image inspect, and a vulnerability or size scan if available.

3. Run it locally with explicit env vars, ports, user, volumes, and signal tests depending on the question.

4. Convert the final runtime assumptions into Kubernetes fields such as image, command, args, ports, securityContext, probes, and volumeMounts.
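
The difference between a mutable tag and an immutable digest can be shown with a toy in-memory registry. This Python sketch is illustrative only and does not model the real registry API; the class ToyRegistry, its methods, and the manifest contents are hypothetical.

```python
import hashlib


class ToyRegistry:
    """Stores manifests by digest and keeps tags as movable pointers."""

    def __init__(self):
        self.blobs: dict[str, bytes] = {}
        self.tags: dict[str, str] = {}

    def push(self, tag: str, manifest: bytes) -> str:
        digest = "sha256:" + hashlib.sha256(manifest).hexdigest()
        self.blobs[digest] = manifest
        self.tags[tag] = digest  # re-pushing a tag moves the pointer
        return digest

    def pull(self, ref: str) -> bytes:
        digest = ref if ref.startswith("sha256:") else self.tags[ref]
        return self.blobs[digest]


registry = ToyRegistry()
first = registry.push("prod", b"manifest for build 41")
second = registry.push("prod", b"manifest for build 42")

print(registry.pull("prod"))   # follows the tag to the newest push
print(registry.pull(first))    # the digest still returns the original content
```

Deploying by digest means a later push to the same tag cannot change what runs, which is the reproducibility argument made in the answer.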

Why is using the latest tag in production discouraged?

Advanced

Answer

Using latest in production is discouraged because it is mutable and non-auditable. The same manifest can deploy different bits over time, which breaks rollback, provenance, vulnerability tracking, and incident investigation.

Technical explanation

latest also makes staged rollouts ambiguous because dev, staging, and prod may silently pull different images.

Use semantic versions, git SHA tags, build numbers, and promotion-by-digest for reproducible releases.

Container image quality affects supply chain, startup time, vulnerability surface, rollout reliability, and debugging workflows.

Prefer reproducible builds: pinned dependencies, small build context, deterministic Dockerfile order, non-root runtime, and immutable image references.

Understand the runtime boundary: an image is not a VM, and container isolation depends on kernel, namespaces, cgroups, capabilities, seccomp, and mounts.

Hands-on example

1. Create a tiny sample app and Dockerfile for this exercise: replace latest with immutable version tags and digest pinning.

2. Build and inspect it with docker build or podman build, docker history, image inspect, and a vulnerability or size scan if available.

3. Run it locally with explicit env vars, ports, user, volumes, and signal tests depending on the question.

4. Convert the final runtime assumptions into Kubernetes fields such as image, command, args, ports, securityContext, probes, and volumeMounts.

What is the difference between Docker and containerd?

Advanced

Answer

Docker is a full developer-facing container platform with CLI, API, build, networking, and image management. containerd is a lower-level container runtime used by Kubernetes through CRI and by Docker under the hood for container lifecycle.

Technical explanation

Kubernetes removed its built-in Docker integration (dockershim) in v1.24 in favor of CRI runtimes, which is why containerd is common on modern clusters.

Developers can still use Docker to build images that run on containerd because both use OCI image/runtime standards.

Container image quality affects supply chain, startup time, vulnerability surface, rollout reliability, and debugging workflows.

Prefer reproducible builds: pinned dependencies, small build context, deterministic Dockerfile order, non-root runtime, and immutable image references.

Understand the runtime boundary: an image is not a VM, and container isolation depends on kernel, namespaces, cgroups, capabilities, seccomp, and mounts.

Hands-on example

1. Create a tiny sample app and Dockerfile for this exercise: compare Docker CLI workflow with containerd/CRI usage in Kubernetes nodes.

2. Build and inspect it with docker build or podman build, docker history, image inspect, and a vulnerability or size scan if available.

3. Run it locally with explicit env vars, ports, user, volumes, and signal tests depending on the question.

4. Convert the final runtime assumptions into Kubernetes fields such as image, command, args, ports, securityContext, probes, and volumeMounts.

What is Podman, and how does it differ architecturally from Docker (daemonless, rootless)?

Advanced

Answer

Podman is an OCI-compatible container engine that is daemonless and supports rootless operation by design. Unlike Docker's traditional daemon model, Podman does not require a long-running root-owned daemon to manage containers.

Technical explanation

Daemonless means the Podman CLI launches containers directly through lower-level tooling (a monitor process such as conmon and an OCI runtime such as runc or crun) without a central always-on daemon.

Rootless mode improves local developer security and aligns well with least-privilege container practices.

Podman follows OCI standards, so images and many workflows are portable across Docker, Podman, and Kubernetes runtimes.

The key architectural difference is daemonless/rootless operation, which changes security posture and some operational behavior.

Podman is especially useful for local testing, rootless workflows, and generating starter Kubernetes manifests.

Hands-on example

1. Run a rootless Podman lab for this exercise: run a rootless Podman container and inspect process ownership.

2. Inspect the process, user namespace, network behavior, volumes, and image metadata with podman ps, inspect, logs, and exec.

3. For pod workflows, create an app plus sidecar Podman pod and test localhost communication.

4. Generate Kubernetes YAML where relevant, review it, add production fields, and apply it to a kind cluster.

What are the security advantages of Podman's rootless and daemonless design?

Advanced

Answer

Podman's rootless and daemonless design reduces the blast radius of a compromised container or client because there is no central root-owned daemon socket to attack. Rootless containers run inside user namespaces with reduced host privileges.

Technical explanation

A compromised user account in the docker group can often control the root daemon through its socket, which is effectively root on the host; rootless Podman avoids that specific socket risk.

Rootless networking and storage may have some functional or performance differences that teams must test.

Podman follows OCI standards, so images and many workflows are portable across Docker, Podman, and Kubernetes runtimes.

The key architectural difference is daemonless/rootless operation, which changes security posture and some operational behavior.

Podman is especially useful for local testing, rootless workflows, and generating starter Kubernetes manifests.

Hands-on example

1. Run a rootless Podman lab for this exercise: compare Docker daemon socket risk with rootless Podman operation.

2. Inspect the process, user namespace, network behavior, volumes, and image metadata with podman ps, inspect, logs, and exec.

3. For pod workflows, create an app plus sidecar Podman pod and test localhost communication.

4. Generate Kubernetes YAML where relevant, review it, add production fields, and apply it to a kind cluster.

What is a Podman pod, and how does it relate to the Kubernetes Pod concept?

Advanced

Answer

A Podman pod groups containers that share namespaces, especially the network namespace, similar to a Kubernetes Pod. It is useful for local testing of sidecar-style workloads before generating Kubernetes YAML.

Technical explanation

A Podman pod has an infra container that holds shared namespaces, similar to how Kubernetes manages Pod-level namespaces.

This helps test multi-container patterns like app plus sidecar locally.

Podman follows OCI standards, so images and many workflows are portable across Docker, Podman, and Kubernetes runtimes.

The key architectural difference is daemonless/rootless operation, which changes security posture and some operational behavior.

Podman is especially useful for local testing, rootless workflows, and generating starter Kubernetes manifests.

Hands-on example

1. Run a rootless Podman lab for this exercise: create a Podman pod with app and sidecar containers.

2. Inspect the process, user namespace, network behavior, volumes, and image metadata with podman ps, inspect, logs, and exec.

3. For pod workflows, create an app plus sidecar Podman pod and test localhost communication.

4. Generate Kubernetes YAML where relevant, review it, add production fields, and apply it to a kind cluster.

How do you generate Kubernetes YAML from Podman (podman generate kube)?

Advanced

Answer

podman kube generate (also available under its older name, podman generate kube) exports Kubernetes YAML for a Podman container, pod, or volume. It is helpful for turning a local OCI container experiment into a starting manifest, though I still review and productionize the generated YAML.

Technical explanation

Generated YAML is a starting point, not a complete production manifest.

Add resources, probes, securityContext, labels, namespace, Service, ConfigMaps, and Secrets before using it in a cluster.
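As a rough illustration of that review step, the sketch below takes a generated Pod manifest that has already been parsed into a dict and fills in a namespace, labels, resources, a readiness probe, and a container securityContext. The function name, the default values, and the /healthz path on port 8080 are all placeholders, not output of any Podman command.

```python
import copy


def productionize(pod: dict, namespace: str, labels: dict) -> dict:
    """Return a copy of a generated Pod manifest with baseline production fields."""
    out = copy.deepcopy(pod)  # never mutate the generated manifest in place
    meta = out.setdefault("metadata", {})
    meta["namespace"] = namespace
    meta.setdefault("labels", {}).update(labels)
    for container in out["spec"]["containers"]:
        # Placeholder budgets; real values come from load testing.
        container.setdefault("resources", {
            "requests": {"cpu": "100m", "memory": "128Mi"},
            "limits": {"memory": "256Mi"},
        })
        # Placeholder probe; the path and port are assumptions.
        container.setdefault("readinessProbe", {
            "httpGet": {"path": "/healthz", "port": 8080},
        })
        container.setdefault("securityContext", {
            "allowPrivilegeEscalation": False,
            "runAsNonRoot": True,
            "capabilities": {"drop": ["ALL"]},
        })
    return out


generated = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "web-pod"},
    "spec": {"containers": [{"name": "web", "image": "example.com/web:1.0"}]},
}
hardened = productionize(generated, "demo", {"app.kubernetes.io/name": "web"})
print(sorted(hardened["spec"]["containers"][0]))
```

Services, ConfigMaps, and Secrets are separate objects, so they would be written alongside the Pod rather than patched into it.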

Podman follows OCI standards, so images and many workflows are portable across Docker, Podman, and Kubernetes runtimes.

The key architectural difference is daemonless/rootless operation, which changes security posture and some operational behavior.

Podman is especially useful for local testing, rootless workflows, and generating starter Kubernetes manifests.

Hands-on example

1. Run a rootless Podman lab for this exercise: generate Kubernetes YAML from a Podman pod and apply it to kind.

2. Inspect the process, user namespace, network behavior, volumes, and image metadata with podman ps, inspect, logs, and exec.

3. For pod workflows, create an app plus sidecar Podman pod and test localhost communication.

4. Generate Kubernetes YAML where relevant, review it, add production fields, and apply it to a kind cluster.

Is the Podman CLI compatible with Docker commands, and what aliasing is possible?

Advanced

Answer

The Podman CLI is intentionally similar to Docker for common container commands, and many teams alias docker=podman for local workflows. Compatibility is high for basic usage, but not every Docker API, Docker Desktop feature, or Compose workflow is identical.

Technical explanation

The alias works best for basic commands such as build, run, ps, logs, exec, and push.

Features that depend on the Docker daemon API or Docker Desktop integration may need different tooling.

Podman follows OCI standards, so images and many workflows are portable across Docker, Podman, and Kubernetes runtimes.

The key architectural difference is daemonless/rootless operation, which changes security posture and some operational behavior.

Podman is especially useful for local testing, rootless workflows, and generating starter Kubernetes manifests.

Hands-on example

1. Run a rootless Podman lab for this exercise: alias docker=podman for basic commands and note incompatibilities.

2. Inspect the process, user namespace, network behavior, volumes, and image metadata with podman ps, inspect, logs, and exec.

3. For pod workflows, create an app plus sidecar Podman pod and test localhost communication.

4. Generate Kubernetes YAML where relevant, review it, add production fields, and apply it to a kind cluster.

What is Helm, and what problem does it solve over raw manifests?

Advanced

Answer

Helm is a Kubernetes package manager and templating tool. It solves the problem of repeatedly managing many raw manifests by packaging templates, default values, dependencies, release history, and upgrades into a chart workflow.

Technical explanation

Helm prevents copy-paste YAML drift by centralizing common templates and values.

It also gives release history, which raw kubectl apply does not provide by itself.

Helm separates reusable chart templates from environment-specific values and tracks release revisions in the cluster.

Always validate the rendered YAML because Kubernetes receives manifests, not templates.

Good Helm practice includes values schema, deterministic helpers, security defaults, linting, dry runs, and rollback planning.

Hands-on example

1. Create or modify a small Helm chart for this exercise: package a small app into a Helm chart instead of raw manifests.

2. Run helm lint, helm template, helm install --dry-run --debug, and kubeconform or an equivalent manifest validator.

3. Install to a test namespace, perform an upgrade with changed values, and inspect helm status, history, and rendered manifests.

4. Test failure and rollback behavior, then document the CI gates that would prevent the same issue in production.

What are the parts of a Helm chart (Chart.yaml, values.yaml, templates, _helpers.tpl)?

Advanced

Answer

A Helm chart contains Chart.yaml for metadata, values.yaml for defaults, templates for Kubernetes manifests, and helper templates such as _helpers.tpl for reusable names and labels. It can also include dependencies, tests, schemas, and CRDs.

Technical explanation

_helpers.tpl commonly centralizes fullname, labels, selector labels, and common annotations.

values.schema.json improves chart usability by validating user values before install/upgrade.

Helm separates reusable chart templates from environment-specific values and tracks release revisions in the cluster.

Always validate the rendered YAML because Kubernetes receives manifests, not templates.

Good Helm practice includes values schema, deterministic helpers, security defaults, linting, dry runs, and rollback planning.

Hands-on example

1. Create or modify a small Helm chart for this exercise: walk through Chart.yaml, values.yaml, templates, and _helpers.tpl.

2. Run helm lint, helm template, helm install --dry-run --debug, and kubeconform or an equivalent manifest validator.

3. Install to a test namespace, perform an upgrade with changed values, and inspect helm status, history, and rendered manifests.

4. Test failure and rollback behavior, then document the CI gates that would prevent the same issue in production.

How does Helm templating work, and how do values get injected?

Advanced

Answer

Helm uses Go templates to render Kubernetes YAML. Values come from chart defaults, values files, --set flags, and parent charts; Helm merges those inputs and injects them into templates before applying or printing manifests.

Technical explanation

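Values precedence matters. From lowest to highest priority: the chart's values.yaml, a parent chart's values for a subchart, values files passed with -f (later files override earlier ones), and finally --set flags.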
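The merge behind that precedence can be sketched in a few lines. This is an illustrative model, not Helm's implementation: maps are merged key by key, any other value (including a list) is replaced wholesale, and a null override removes the key. The layer contents are made-up sample values.

```python
def merge_values(base: dict, override: dict) -> dict:
    """Merge one values layer over another, the way later Helm inputs win."""
    merged = dict(base)
    for key, value in override.items():
        if value is None:
            merged.pop(key, None)  # a null override deletes the default
        elif isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_values(merged[key], value)  # maps merge deeply
        else:
            merged[key] = value  # scalars and lists are replaced, not appended
    return merged


def effective_values(*layers: dict) -> dict:
    """Apply layers from lowest to highest priority."""
    result: dict = {}
    for layer in layers:
        result = merge_values(result, layer)
    return result


chart_defaults = {"replicaCount": 1,
                  "image": {"repository": "example/app", "tag": "1.0"},
                  "args": ["--verbose"]}
values_prod = {"replicaCount": 3, "image": {"tag": "1.1"}}
set_flags = {"image": {"tag": "1.2"}, "args": ["--quiet"]}

print(effective_values(chart_defaults, values_prod, set_flags))
# {'replicaCount': 3, 'image': {'repository': 'example/app', 'tag': '1.2'}, 'args': ['--quiet']}
```

The list case is the one that surprises people most: overriding args replaces the whole list rather than appending to it.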

Templates should be deterministic and readable after rendering; clever templates that produce surprising YAML are risky.

Helm separates reusable chart templates from environment-specific values and tracks release revisions in the cluster.

Always validate the rendered YAML because Kubernetes receives manifests, not templates.

Good Helm practice includes values schema, deterministic helpers, security defaults, linting, dry runs, and rollback planning.

Hands-on example

1. Create or modify a small Helm chart for this exercise: render templates with different values files and inspect the YAML diff.

2. Run helm lint, helm template, helm install --dry-run --debug, and kubeconform or an equivalent manifest validator.

3. Install to a test namespace, perform an upgrade with changed values, and inspect helm status, history, and rendered manifests.

4. Test failure and rollback behavior, then document the CI gates that would prevent the same issue in production.

What is the difference between helm install, helm upgrade, and helm rollback?

Advanced

Answer

helm install creates a new release, helm upgrade modifies an existing release using a new chart or values, and helm rollback returns a release to a previous revision. In production I combine upgrade with atomic behavior, readiness checks, and rollback planning.

Technical explanation

--atomic can automatically roll back a failed upgrade (or delete a failed install), but only after Helm detects failure according to its wait conditions and timeout; --atomic implies --wait.

CRDs and hooks need special care because rollback may not fully revert their side effects.

Helm separates reusable chart templates from environment-specific values and tracks release revisions in the cluster.

Always validate the rendered YAML because Kubernetes receives manifests, not templates.

Good Helm practice includes values schema, deterministic helpers, security defaults, linting, dry runs, and rollback planning.

Hands-on example

1. Create or modify a small Helm chart for this exercise: install, upgrade, and roll back a Helm release.

2. Run helm lint, helm template, helm install --dry-run --debug, and kubeconform or an equivalent manifest validator.

3. Install to a test namespace, perform an upgrade with changed values, and inspect helm status, history, and rendered manifests.

4. Test failure and rollback behavior, then document the CI gates that would prevent the same issue in production.

How does Helm track releases and revisions?

Advanced

Answer

Helm tracks releases and revisions by storing release metadata in the cluster, usually as Secrets in the target namespace. That history lets Helm calculate upgrades, show status, and roll back to earlier revisions.

Technical explanation

Release metadata can become stuck if an operation is interrupted, resulting in pending-install or pending-upgrade states.

Release history retention should be managed to avoid unbounded metadata growth; helm upgrade --history-max caps the number of revisions kept per release (the default is 10).

Helm separates reusable chart templates from environment-specific values and tracks release revisions in the cluster.
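For inspecting that in-cluster record, the sketch below models how a Helm 3 release Secret is named and encoded: the Secret is called sh.helm.release.v1.<release>.v<revision>, and its release field holds gzip-compressed JSON that Helm base64-encodes before the Kubernetes API base64-encodes it again. The payload here is a tiny stand-in; a real release record carries the chart, values, and manifest.

```python
import base64
import gzip
import json


def release_secret_name(release: str, revision: int) -> str:
    """Name of the Secret that stores one revision of a Helm 3 release."""
    return f"sh.helm.release.v1.{release}.v{revision}"


def encode_release(release: dict) -> str:
    """Build a value shaped like .data.release, for the round trip below."""
    compressed = gzip.compress(json.dumps(release).encode())
    helm_layer = base64.b64encode(compressed)      # Helm's own encoding
    return base64.b64encode(helm_layer).decode()   # the Secret data encoding


def decode_release(data_release: str) -> dict:
    """Decode the .data.release field of a Helm release Secret."""
    helm_layer = base64.b64decode(data_release)
    compressed = base64.b64decode(helm_layer)
    return json.loads(gzip.decompress(compressed))


# Stand-in payload with hypothetical field values.
sample = {"name": "web", "version": 3, "info": {"status": "deployed"}}
blob = encode_release(sample)
print(release_secret_name("web", 3))           # sh.helm.release.v1.web.v3
print(decode_release(blob)["info"]["status"])  # deployed
```

In practice helm get and helm history are the safer way to read this data; decoding by hand is mainly useful when a release is stuck and the CLI output is not enough.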

Always validate the rendered YAML because Kubernetes receives manifests, not templates.

Good Helm practice includes values schema, deterministic helpers, security defaults, linting, dry runs, and rollback planning.

Hands-on example

1. Create or modify a small Helm chart for this exercise: inspect Helm release Secrets and revision history.

2. Run helm lint, helm template, helm install --dry-run --debug, and kubeconform or an equivalent manifest validator.

3. Install to a test namespace, perform an upgrade with changed values, and inspect helm status, history, and rendered manifests.

4. Test failure and rollback behavior, then document the CI gates that would prevent the same issue in production.

What is the difference between helm template and helm install?

Advanced

Answer

helm template renders manifests locally and does not create a release in the cluster. helm install renders and submits those manifests to the Kubernetes API, then records release metadata for future upgrade and rollback.

Technical explanation

helm template is excellent for CI validation and GitOps rendering because it needs no cluster access by default; the trade-off is that cluster-dependent features such as lookup return empty results.

helm install also performs release lifecycle management and optional wait/hook behavior.

Helm separates reusable chart templates from environment-specific values and tracks release revisions in the cluster.

Always validate the rendered YAML because Kubernetes receives manifests, not templates.

Good Helm practice includes values schema, deterministic helpers, security defaults, linting, dry runs, and rollback planning.

Hands-on example

1. Create or modify a small Helm chart for this exercise: compare helm template output with helm install --dry-run --debug.

2. Run helm lint, helm template, helm install --dry-run --debug, and kubeconform or an equivalent manifest validator.

3. Install to a test namespace, perform an upgrade with changed values, and inspect helm status, history, and rendered manifests.

4. Test failure and rollback behavior, then document the CI gates that would prevent the same issue in production.

How do you manage environment-specific values across dev, staging, and prod with Helm?

Advanced

Answer

I manage environment-specific Helm values with a common base values file plus environment overlays such as dev, staging, and prod. Sensitive values should come from secret management, and production overrides should be reviewed and validated in CI.

Technical explanation

Keep environment differences in values, not in copied chart templates.

Use promotion controls so prod values are intentional and reviewed rather than manually typed --set commands.
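One way to keep overrides out of hand-typed commands is to have the pipeline build the Helm invocation from the environment name. The sketch below only assembles the argument list. The file names values.yaml and values-<env>.yaml, the release name, and the chart path are assumed conventions, not anything Helm requires.

```python
ALLOWED_ENVS = ("dev", "staging", "prod")


def helm_upgrade_command(release: str, chart: str, env: str, namespace: str) -> list:
    """Build a helm upgrade command that layers an environment file over the base."""
    if env not in ALLOWED_ENVS:
        raise ValueError(f"unknown environment: {env!r}")
    return [
        "helm", "upgrade", "--install", release, chart,
        "--namespace", namespace,
        "-f", "values.yaml",         # shared base (assumed file name)
        "-f", f"values-{env}.yaml",  # environment overlay; later -f wins
    ]


print(" ".join(helm_upgrade_command("web", "./charts/web", "prod", "web-prod")))
# helm upgrade --install web ./charts/web --namespace web-prod -f values.yaml -f values-prod.yaml
```

Because the overlay file lives in version control, a production change shows up as a reviewable diff instead of a --set flag in someone's shell history.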

Helm separates reusable chart templates from environment-specific values and tracks release revisions in the cluster.

Always validate the rendered YAML because Kubernetes receives manifests, not templates.

Good Helm practice includes values schema, deterministic helpers, security defaults, linting, dry runs, and rollback planning.

Hands-on example

1. Create or modify a small Helm chart for this exercise: manage dev, staging, and prod values with layered values files.

2. Run helm lint, helm template, helm install --dry-run --debug, and kubeconform or an equivalent manifest validator.

3. Install to a test namespace, perform an upgrade with changed values, and inspect helm status, history, and rendered manifests.

4. Test failure and rollback behavior, then document the CI gates that would prevent the same issue in production.

What are Helm hooks, and when would you use them?

Advanced

Answer

Helm hooks run chart-defined Kubernetes resources at specific lifecycle points such as pre-install, post-install, pre-upgrade, or pre-delete. I use them carefully for migrations, smoke tests, or cleanup because hooks can also make releases harder to reason about.

Technical explanation

Hooks are not ordinary managed resources in the same way as normal templates, so lifecycle and cleanup annotations matter.
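Hook ordering is driven by the helm.sh/hook-weight annotation, which defaults to 0 and sorts in ascending order. The sketch below models that ordering for hooks that fire on the same event. It breaks ties by name only, which simplifies Helm's real tie-break (Helm also considers resource kind), and the hook names are invented.

```python
def hook_order(hooks: list) -> list:
    """Return hook names in the order they would run for one lifecycle event."""
    def sort_key(resource: dict):
        annotations = resource["metadata"].get("annotations", {})
        # Weights are strings in YAML; a missing weight counts as 0.
        weight = int(annotations.get("helm.sh/hook-weight", "0"))
        return (weight, resource["metadata"]["name"])
    return [r["metadata"]["name"] for r in sorted(hooks, key=sort_key)]


pre_upgrade_hooks = [
    {"kind": "Job", "metadata": {"name": "seed-data",
        "annotations": {"helm.sh/hook": "pre-upgrade"}}},
    {"kind": "Job", "metadata": {"name": "db-migrate",
        "annotations": {"helm.sh/hook": "pre-upgrade", "helm.sh/hook-weight": "-5"}}},
    {"kind": "Job", "metadata": {"name": "cache-warm",
        "annotations": {"helm.sh/hook": "pre-upgrade", "helm.sh/hook-weight": "0"}}},
]
print(hook_order(pre_upgrade_hooks))  # ['db-migrate', 'cache-warm', 'seed-data']
```

Cleanup is controlled separately by the helm.sh/hook-delete-policy annotation, which is the one to check first when old hook Jobs block a later upgrade.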

Long-running or fragile hooks are a common cause of stuck Helm releases.

Helm separates reusable chart templates from environment-specific values and tracks release revisions in the cluster.

Always validate the rendered YAML because Kubernetes receives manifests, not templates.

Good Helm practice includes values schema, deterministic helpers, security defaults, linting, dry runs, and rollback planning.

Hands-on example

1. Create or modify a small Helm chart for this exercise: create a pre-upgrade hook Job and handle hook cleanup.

2. Run helm lint, helm template, helm install --dry-run --debug, and kubeconform or an equivalent manifest validator.

3. Install to a test namespace, perform an upgrade with changed values, and inspect helm status, history, and rendered manifests.

4. Test failure and rollback behavior, then document the CI gates that would prevent the same issue in production.

What is a Helm subchart and chart dependency, and how is it managed?

Advanced

Answer

A subchart is a dependent chart packaged or pulled with a parent chart. Dependencies are declared in Chart.yaml, locked in Chart.lock, downloaded with helm dependency update, and configured through values passed to the subchart.

Technical explanation

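Subcharts are isolated: a parent chart can override a subchart's values under a key named after the subchart (or its alias), and values under global are visible to every chart, but a subchart cannot read its parent's other values.

Lock files make dependency versions reproducible in CI and production: helm dependency build fetches exactly the versions recorded in Chart.lock.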
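The value scoping between a parent chart and a subchart can be modeled as follows. This is a simplified sketch: it ignores aliases and any global defaults defined by the subchart itself, and the redis subchart with its keys is a hypothetical example.

```python
def merge(base: dict, override: dict) -> dict:
    """Deep-merge maps; any other value in override replaces the base value."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def subchart_values(sub_defaults: dict, parent_values: dict, name: str) -> dict:
    """Values a subchart's templates see, given the parent's values."""
    # Only the parent's block named after the subchart is passed down.
    scoped = merge(sub_defaults, parent_values.get(name, {}))
    # The global block is shared with every chart in the tree.
    if "global" in parent_values:
        scoped["global"] = parent_values["global"]
    return scoped


parent = {
    "replicaCount": 2,                       # invisible to the subchart
    "redis": {"auth": {"enabled": False}},   # overrides subchart defaults
    "global": {"imageRegistry": "registry.example.com"},
}
redis_defaults = {"auth": {"enabled": True, "password": ""}, "port": 6379}

print(subchart_values(redis_defaults, parent, "redis"))
# {'auth': {'enabled': False, 'password': ''}, 'port': 6379, 'global': {'imageRegistry': 'registry.example.com'}}
```

Note that replicaCount never reaches the subchart, which is why shared settings such as a registry host belong under global.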

Helm separates reusable chart templates from environment-specific values and tracks release revisions in the cluster.

Always validate the rendered YAML because Kubernetes receives manifests, not templates.

Good Helm practice includes values schema, deterministic helpers, security defaults, linting, dry runs, and rollback planning.

Hands-on example

1. Create or modify a small Helm chart for this exercise: add a subchart dependency and lock it for reproducible builds.

2. Run helm lint, helm template, helm install --dry-run --debug, and kubeconform or an equivalent manifest validator.

3. Install to a test namespace, perform an upgrade with changed values, and inspect helm status, history, and rendered manifests.

4. Test failure and rollback behavior, then document the CI gates that would prevent the same issue in production.

How do you secure secrets in Helm (e.g., with helm-secrets or external stores)?

Advanced

Answer

I do not put plaintext secrets directly in values files. I use external secret operators, cloud secret managers, SOPS or helm-secrets, sealed secrets, or runtime injection so secrets are encrypted, auditable, and rotated outside the chart repository.

Technical explanation

Encrypting secrets in Git is not the same as runtime rotation; plan both storage security and rotation behavior.

External Secrets Operator and cloud secret managers keep Kubernetes Secret generation separate from Helm chart packaging.

Helm separates reusable chart templates from environment-specific values and tracks release revisions in the cluster.

Always validate the rendered YAML because Kubernetes receives manifests, not templates.

Good Helm practice includes values schema, deterministic helpers, security defaults, linting, dry runs, and rollback planning.

Hands-on example

1. Create or modify a small Helm chart for this exercise: integrate Helm with External Secrets, SOPS, or helm-secrets.

2. Run helm lint, helm template, helm install --dry-run --debug, and kubeconform or an equivalent manifest validator.

3. Install to a test namespace, perform an upgrade with changed values, and inspect helm status, history, and rendered manifests.

4. Test failure and rollback behavior, then document the CI gates that would prevent the same issue in production.

How does Helm compare to Kustomize, and when would you choose each?

Advanced

Answer

Helm is best when I need packaging, parameters, dependencies, and release lifecycle. Kustomize is best when I want patch-based overlays on plain YAML without templates. I often use Helm for third-party apps and Kustomize or GitOps overlays for environment composition.

Technical explanation

Helm templates can express conditionals and loops; Kustomize patches existing YAML without a template language.

Many GitOps setups render Helm then apply Kustomize overlays, but complexity should be justified.

Helm separates reusable chart templates from environment-specific values and tracks release revisions in the cluster.

Always validate the rendered YAML because Kubernetes receives manifests, not templates.

Good Helm practice includes values schema, deterministic helpers, security defaults, linting, dry runs, and rollback planning.

Hands-on example

1. Create or modify a small Helm chart for this exercise: patch the same base app with Helm values and Kustomize overlays.

2. Run helm lint, helm template, helm install --dry-run --debug, and kubeconform or an equivalent manifest validator.

3. Install to a test namespace, perform an upgrade with changed values, and inspect helm status, history, and rendered manifests.

4. Test failure and rollback behavior, then document the CI gates that would prevent the same issue in production.

How would you validate and lint a Helm chart in CI?

Advanced

Answer

I validate a Helm chart in CI with helm lint, helm template, schema validation, kubeconform (kubeval is no longer maintained), unit tests where useful, policy checks, and a dry-run against a test cluster. I also check rendered resources for names, labels, probes, resources, and security context.

Technical explanation

helm lint catches chart structure issues, but rendered manifest validation catches Kubernetes API mistakes.

Policy checks should verify securityContext, resources, probes, labels, and disallowed host access.
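A policy gate of that kind can be very small. The sketch below assumes the rendered manifests have already been parsed into dicts (for example from helm template output) and reports violations for workload objects. The specific rules and message wording are illustrative, and real pipelines usually delegate this to a policy engine.

```python
WORKLOAD_KINDS = {"Deployment", "StatefulSet", "DaemonSet"}


def policy_violations(manifest: dict) -> list:
    """Return human-readable policy violations for one rendered manifest."""
    if manifest.get("kind") not in WORKLOAD_KINDS:
        return []  # only workload objects carry a Pod template
    name = manifest.get("metadata", {}).get("name", "<unnamed>")
    pod_spec = manifest["spec"]["template"]["spec"]
    problems = []
    if pod_spec.get("hostNetwork") or pod_spec.get("hostPID"):
        problems.append(f"{name}: host namespace access is not allowed")
    for container in pod_spec.get("containers", []):
        where = f"{name}/{container['name']}"
        if not container.get("resources", {}).get("requests"):
            problems.append(f"{where}: missing resources.requests")
        if "readinessProbe" not in container:
            problems.append(f"{where}: missing readinessProbe")
        security = container.get("securityContext", {})
        if security.get("allowPrivilegeEscalation") is not False:
            problems.append(f"{where}: allowPrivilegeEscalation must be false")
    return problems


rendered = {
    "kind": "Deployment",
    "metadata": {"name": "web"},
    "spec": {"template": {"spec": {
        "containers": [{"name": "app", "image": "example.com/web:1.0"}],
    }}},
}
for problem in policy_violations(rendered):
    print(problem)
```

A CI job would fail the build whenever the returned list is non-empty.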

Helm separates reusable chart templates from environment-specific values and tracks release revisions in the cluster.

Always validate the rendered YAML because Kubernetes receives manifests, not templates.

Good Helm practice includes values schema, deterministic helpers, security defaults, linting, dry runs, and rollback planning.

Hands-on example

1. Create or modify a small Helm chart for this exercise: build a CI job for helm lint, template, schema validation, and policy checks.

2. Run helm lint, helm template, helm install --dry-run --debug, and kubeconform or an equivalent manifest validator.

3. Install to a test namespace, perform an upgrade with changed values, and inspect helm status, history, and rendered manifests.

4. Test failure and rollback behavior, then document the CI gates that would prevent the same issue in production.

How do you debug a failed Helm upgrade and a release stuck in pending-upgrade?

Advanced

Answer

For a failed Helm upgrade or pending-upgrade release, I check helm status, helm history, rendered manifests, Kubernetes events, and the release Secret. Then I decide whether to roll back, fix a stuck hook, clear a failed release carefully, or rerun with --atomic after correcting the root cause.

Technical explanation

Never blindly delete Helm release Secrets in production unless you understand the exact release state and have a backup.

Use helm get manifest/values/hooks to compare intended and actual release content.
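When deciding whether and where to roll back, a small helper over the release history can remove guesswork. The sketch below assumes history entries shaped like the output of helm history -o json (a list of objects with revision and status fields). Treating the newest earlier revision marked deployed or superseded as the target is a heuristic, not a rule Helm enforces.

```python
GOOD_STATUSES = {"deployed", "superseded"}


def rollback_target(history: list):
    """Suggest a revision to roll back to, or None if no rollback is indicated."""
    ordered = sorted(history, key=lambda entry: entry["revision"])
    if not ordered or ordered[-1]["status"] == "deployed":
        return None  # nothing recorded, or the latest revision is healthy
    # Walk backwards from the revision before the broken one.
    for entry in reversed(ordered[:-1]):
        if entry["status"] in GOOD_STATUSES:
            return entry["revision"]
    return None


# Sample data in the assumed shape; the values are invented.
history = [
    {"revision": 1, "status": "superseded"},
    {"revision": 2, "status": "deployed"},
    {"revision": 3, "status": "pending-upgrade"},
]
print(rollback_target(history))  # 2
```

The suggested revision is only a starting point: compare its manifest and values with the failed revision before running helm rollback.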

Helm separates reusable chart templates from environment-specific values and tracks release revisions in the cluster.

Always validate the rendered YAML because Kubernetes receives manifests, not templates.

Good Helm practice includes values schema, deterministic helpers, security defaults, linting, dry runs, and rollback planning.

Hands-on example

1. Create or modify a small Helm chart for this exercise: debug a failed upgrade with helm status, history, get manifest, and events.

2. Run helm lint, helm template, helm install --dry-run --debug, and kubeconform or an equivalent manifest validator.

3. Install to a test namespace, perform an upgrade with changed values, and inspect helm status, history, and rendered manifests.

4. Test failure and rollback behavior, then document the CI gates that would prevent the same issue in production.

How would you design a chart to be reusable across multiple services?

Advanced

Answer

A reusable chart should expose stable values for image, probes, resources, service, ingress, autoscaling, security context, annotations, and environment, while hiding naming and label complexity in helpers. It should include values.schema.json and sane defaults.

Technical explanation

Reusable charts need a stable values contract and a schema so service teams do not edit templates directly.

Keep selector labels stable across upgrades: a Deployment's spec.selector is immutable in apps/v1, so a changed selector makes the upgrade fail, and a changed Service selector can silently stop matching Pods.
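A cheap guard is to compare the selector of the currently deployed manifest with the newly rendered one before upgrading. The sketch below works on parsed Deployment dicts. The version label in the selector is a deliberately bad, hypothetical example of the mistake being guarded against.

```python
def selector_changed(deployed: dict, rendered: dict) -> bool:
    """True if the rendered Deployment would alter spec.selector.matchLabels."""
    old = deployed["spec"]["selector"].get("matchLabels", {})
    new = rendered["spec"]["selector"].get("matchLabels", {})
    return old != new


def deployment(selector_labels: dict) -> dict:
    """Minimal Deployment skeleton for the comparison (illustrative only)."""
    return {"kind": "Deployment", "spec": {"selector": {"matchLabels": selector_labels}}}


stable = {"app.kubernetes.io/name": "web", "app.kubernetes.io/instance": "web-prod"}

# Mistake: a version label in the selector changes on every chart bump.
v1 = deployment({**stable, "app.kubernetes.io/version": "1.0"})
v2 = deployment({**stable, "app.kubernetes.io/version": "1.1"})

print(selector_changed(v1, v2))                                  # True
print(selector_changed(deployment(stable), deployment(stable)))  # False
```

Keeping version and chart labels in metadata.labels only, and out of the selector helper, avoids the problem entirely.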

Helm separates reusable chart templates from environment-specific values and tracks release revisions in the cluster.

Always validate the rendered YAML because Kubernetes receives manifests, not templates.

Good Helm practice includes values schema, deterministic helpers, security defaults, linting, dry runs, and rollback planning.

Hands-on example

1. Create or modify a small Helm chart for this exercise: design one chart that can deploy multiple HTTP services safely.

2. Run helm lint, helm template, helm install --dry-run --debug, and kubeconform or an equivalent manifest validator.

3. Install to a test namespace, perform an upgrade with changed values, and inspect helm status, history, and rendered manifests.

4. Test failure and rollback behavior, then document the CI gates that would prevent the same issue in production.

What recent Kubernetes feature have you used, and what value did it bring?

Advanced

Answer

In an interview, I would choose a feature I genuinely used and explain the operational value. A strong current example is Pod-level resources, which reached beta in Kubernetes v1.34 and lets teams express CPU and memory at Pod scope for workloads where containers share an overall Pod budget.

Technical explanation

Pod-level resources are useful for tightly coupled containers where the Pod should be treated as one budget rather than independent container budgets.
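The shared-budget idea can be checked with simple arithmetic. The sketch below assumes a parsed Pod manifest with a Pod-level resources block under spec and verifies that the containers' combined CPU requests fit inside the Pod-level request. It is a simplified model for reasoning about budgets, not the API server's validation logic, and the workload is hypothetical.

```python
def cpu_millicores(quantity: str) -> int:
    """Convert a CPU quantity such as '500m', '1', or '0.5' to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def fits_pod_cpu_request(pod: dict) -> bool:
    """True if summed container CPU requests stay within the Pod-level request."""
    budget = cpu_millicores(pod["spec"]["resources"]["requests"]["cpu"])
    used = 0
    for container in pod["spec"]["containers"]:
        requests = container.get("resources", {}).get("requests", {})
        # A container with no request of its own simply shares the Pod budget.
        used += cpu_millicores(requests.get("cpu", "0"))
    return used <= budget


pod = {
    "kind": "Pod",
    "spec": {
        "resources": {"requests": {"cpu": "1"}, "limits": {"cpu": "2"}},
        "containers": [
            {"name": "app", "resources": {"requests": {"cpu": "600m"}}},
            {"name": "log-shipper", "resources": {"requests": {"cpu": "200m"}}},
            {"name": "proxy"},  # no container-level request
        ],
    },
}
print(fits_pod_cpu_request(pod))  # True
```

In this sample the containers request 800m of a 1000m Pod budget, leaving headroom that the proxy container can draw on.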

A good interview answer should include the feature, why it mattered, how you tested it, and one limitation or rollout risk.

Hardening should be layered: authentication, authorization, admission, workload security, network segmentation, secret protection, image trust, audit, and runtime monitoring.

Use audit and warn modes to discover breakage before enforcing new policies in shared clusters.

Document exceptions with ownership, expiry, compensating controls, and evidence.

Hands-on example

1. Create a non-production namespace or cluster baseline for this exercise: evaluate a recent feature such as Pod-level resources or native sidecars in a test namespace.

2. Apply controls in layers: RBAC, ServiceAccounts, Pod Security labels, NetworkPolicy, resources, probes, image policy, secret handling, and audit logging.

3. Run negative tests such as privileged Pod rejection, denied API access, blocked network flow, unsigned image rejection, or secret read denial.

4. Move from audit/warn to enforce only after measuring impact, documenting exceptions, and wiring alerts to owners.

How would you harden a Kubernetes cluster (Pod Security Standards, RBAC, network policies, image policy)?

Advanced

Answer

I harden a Kubernetes cluster in layers: identity and RBAC, namespace isolation, Pod Security Standards, NetworkPolicies, image provenance, secrets encryption, admission control, audit logging, node hardening, patching, and continuous compliance checks. The goal is least privilege and reduced blast radius without blocking delivery.

Technical explanation

Hardening is defense in depth; no single control such as RBAC or Pod Security Standards is enough alone.

Start in audit/warn modes where possible, measure breakage, then move to enforce with documented exceptions.
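For Pod Security Standards, that staged rollout maps directly onto namespace labels. The sketch below builds the Pod Security Admission labels for two stages. The label keys and level names are the standard ones, while the two-stage plan itself is just one reasonable sequence.

```python
LEVELS = ("privileged", "baseline", "restricted")
PREFIX = "pod-security.kubernetes.io"


def pod_security_labels(enforce: str, audit: str, warn: str) -> dict:
    """Build Pod Security Admission labels for a namespace."""
    for level in (enforce, audit, warn):
        if level not in LEVELS:
            raise ValueError(f"unknown Pod Security level: {level!r}")
    return {
        f"{PREFIX}/enforce": enforce,  # violating Pods are rejected
        f"{PREFIX}/audit": audit,      # violations are recorded in the audit log
        f"{PREFIX}/warn": warn,        # violations return a warning to the client
    }


# Stage 1: enforce baseline, but surface everything restricted would break.
stage_1 = pod_security_labels(enforce="baseline", audit="restricted", warn="restricted")
# Stage 2: enforce restricted once audit and warn output is clean.
stage_2 = pod_security_labels(enforce="restricted", audit="restricted", warn="restricted")

print(stage_1)
print(stage_2)
```

Moving from stage 1 to stage 2 is then a one-label change that can be reviewed and reverted like any other manifest edit.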

Hardening should be layered: authentication, authorization, admission, workload security, network segmentation, secret protection, image trust, audit, and runtime monitoring.

Use audit and warn modes to discover breakage before enforcing new policies in shared clusters.

Document exceptions with ownership, expiry, compensating controls, and evidence.

Hands-on example

1. Create a non-production namespace or cluster baseline for this exercise: create a hardened namespace baseline with PSS, RBAC, NetworkPolicy, image controls, and audit checks.

2. Apply controls in layers: RBAC, ServiceAccounts, Pod Security labels, NetworkPolicy, resources, probes, image policy, secret handling, and audit logging.

3. Run negative tests such as privileged Pod rejection, denied API access, blocked network flow, unsigned image rejection, or secret read denial.

4. Move from audit/warn to enforce only after measuring impact, documenting exceptions, and wiring alerts to owners.

Source Note for Current Kubernetes Items

Most answers are based on stable Kubernetes, Docker/OCI, Podman, and Helm concepts. For the current-feature and hardening items, validate against the exact cluster version and vendor distribution before using in a real interview or implementation.

Kubernetes v1.34 release blog: https://kubernetes.io/blog/2025/08/27/kubernetes-v1-34-release/

Kubernetes Pod-level resources v1.34 blog: https://kubernetes.io/blog/2025/09/22/kubernetes-v1-34-pod-level-resources/

Kubernetes Pod Security Standards: https://kubernetes.io/docs/concepts/security/pod-security-standards/

Kubernetes RBAC documentation: https://kubernetes.io/docs/reference/access-authn-authz/rbac/

Kubernetes NetworkPolicy documentation: https://kubernetes.io/docs/concepts/services-networking/network-policies/
