Infrastructure as Code (Terraform, Ansible) interview questions & answers

100 Infrastructure as Code (Terraform, Ansible) interview questions, each answered three ways: a concise spoken answer, a technical explanation, and a hands-on example.

All questions

What is Infrastructure as Code, and what problems does it solve over click-ops?
What is the difference between declarative and imperative IaC, and where do Terraform and Ansible fall?
What is the difference between configuration management and provisioning?
What is Terraform, and what is the core plan/apply workflow?
What does terraform init do?
What is the Terraform state file, and why is it critical?
Why should state be stored remotely, and what backend would you use on AWS?
What is state locking, and why does it matter for teams?
How does Terraform use DynamoDB for state locking with an S3 backend?
What is a Terraform provider, and how is it versioned?
Why pin provider and Terraform versions, and how do you do it?
What is a Terraform module, and why use modules?
What is the difference between a root module and a child module?
What are input variables, outputs, and locals?
What is the difference between count and for_each, and when do you use each?
What problem does for_each solve that count creates when removing items?
What is a dynamic block, and when would you use it?
What is the difference between a data source and a resource?
What is terraform import, and when do you need it?
What is configuration drift, and how does Terraform detect it?
How do you handle resources changed manually outside Terraform?
What are Terraform workspaces, and what are their limitations for environments?
What is the difference between using workspaces and separate state files per environment?
How do you structure Terraform code for multiple environments (dev/staging/prod)?
What is the purpose of terraform plan -out and applying a saved plan?
What does terraform refresh do, and how has its behaviour changed?
What are provisioners, and why are they considered a last resort?
What is the difference between local-exec and remote-exec provisioners?
What is the lifecycle block (create_before_destroy, prevent_destroy, ignore_changes)?
When would you use ignore_changes, and what is the risk?
What is a tainted resource, and what replaced the taint command?
How do you target a specific resource with terraform apply, and why is it discouraged?
What are explicit versus implicit dependencies, and what does depends_on do?
How does Terraform build its dependency graph?
How do you pass outputs from one module to another?
What is a remote state data source, and what is the security concern with it?
How do you manage secrets in Terraform without leaking them into state?
Why does the state file potentially contain sensitive values, and how do you protect it?
What is the sensitive flag on variables and outputs?
What are Terraform meta-arguments (count, for_each, provider, depends_on, lifecycle)?
How do you test Terraform code (validate, fmt, plan, tflint, terratest)?
How would you integrate Terraform into a CI/CD pipeline safely?
What is a policy-as-code check for Terraform (OPA, Sentinel, checkov)?
How do you handle a Terraform apply that fails halfway through?
What is the difference between terraform destroy and removing a resource from code?
How do you import existing infrastructure into Terraform at scale?
What is a Terraform registry module, and how do you evaluate one for production use?
How would you write a reusable module for a standard service (as you did to cut provisioning time 70%)?
What is the difference between Terraform and CloudFormation, and when choose each?
What is OpenTofu, and why did it fork from Terraform?
What is Ansible, and how is it agentless?
How does Ansible connect to managed hosts?
What is an Ansible inventory, and what is the difference between static and dynamic inventory?
What is a playbook, a play, and a task?
What is idempotency in Ansible, and why does it matter?
Why is a raw shell or command task not idempotent, and how do you make it safe?
What is the creates argument on a command task, and how does it add idempotency?
What are Ansible modules, and why prefer them over shell commands?
What is a handler, and how is it triggered with notify?
Why do handlers run once at the end rather than immediately?
What are Ansible roles, and what is the standard directory structure?
What is the order of variable precedence in Ansible at a high level?
What are group_vars and host_vars?
What is Ansible Vault, and how do you protect secrets with it?
What is the difference between a variable and a fact?
How does Ansible gather facts, and how do you use them in conditionals?
What is a register, and how do you use the result of one task in another?
What is the when clause, and how do you write conditional tasks?
What is a loop in Ansible, and how do you iterate over a list?
What is the serial keyword, and how does it enable rolling updates?
What is max_fail_percentage, and how does it protect a rollout?
What is delegate_to, and when would you use it?
What is the difference between Ansible roles and collections?
What is a Jinja2 template, and how is it used in Ansible?
How do you run an Ansible playbook in check (dry-run) mode?
What is the difference between state: present and state: latest, and the risk of latest?
How do you handle host-specific differences across a mixed fleet?
How would you orchestrate a rolling, health-checked upgrade across servers with Ansible?
How do you integrate Ansible into CI/CD and keep playbooks tested?
What is ansible-lint, and what does it catch?
When would you choose Ansible over Terraform and vice versa?
Can Terraform and Ansible be used together, and how would you combine them?
What is Kustomize, and how does it differ from Helm?
What is a kustomization.yaml, and what does it define?
What is the difference between a base and an overlay in Kustomize?
How do overlays customise a base without copying it?
What are strategic merge patches versus JSON 6902 patches in Kustomize?
How does Kustomize handle environment-specific configuration?
What are name prefixes/suffixes and common labels in Kustomize?
What is a configMapGenerator and secretGenerator, and what is the hash suffix for?
Why does the config hash suffix help trigger rolling updates?
How is Kustomize integrated natively into kubectl (kubectl apply -k)?
When would you choose Kustomize over Helm, and can you use them together?
How does ArgoCD work with Kustomize overlays for multiple environments?
What are the trade-offs of templating (Helm) versus patching (Kustomize)?
How do you keep IaC DRY across many similar microservices?
How do you review and approve infrastructure changes safely as a team?
What recent IaC practice or tool have you adopted, and what did it improve?
How do you detect and remediate IaC drift continuously rather than only at apply time?
How would you onboard an existing manually-built environment into IaC with confidence?

What is Infrastructure as Code, and what problems does it solve over click-ops? (Basic)

Answer

Infrastructure as Code is the practice of defining infrastructure in version-controlled files instead of creating it manually through consoles. It solves inconsistent environments, undocumented click-ops, poor auditability, slow provisioning, and configuration drift by making infrastructure repeatable, reviewable, and automated.

Technical explanation

Click-ops is fast for experiments but weak for production because it leaves no reproducible desired state.

IaC gives code review, peer approval, rollback history, automated validation, and consistent environment creation.

The value is not just speed; it is operational control, auditability, and the ability to rebuild after failure.

Keep Terraform's ownership boundary clear: one state should own a resource or field, and other tools should consume published outputs instead of modifying it.

Use fmt, validate, linting, policy checks, plan review, and state locking before production applies.

Design for small blast radius by splitting state around lifecycle, permissions, and recovery boundaries.

Hands-on example

1. Build a small IaC workflow that replaces console click-ops with reviewed, version-controlled changes.

2. Create a Git repository with folders terraform/network and ansible/web. Put cloud resources in Terraform and host configuration in Ansible.

3. Add a minimal Terraform resource and run the standard workflow:

    cd terraform/network
    terraform init
    terraform fmt -check
    terraform validate
    terraform plan -out=tfplan
    terraform apply tfplan

4. Post the plan summary on the pull request and require approval. Once the infrastructure exists, return to the repository root and run Ansible:

    ansible-inventory -i inventory/aws_ec2.yml --graph
    ansible-playbook -i inventory/aws_ec2.yml ansible/web/site.yml --check --diff
    ansible-playbook -i inventory/aws_ec2.yml ansible/web/site.yml

5. Prove the benefit by rebuilding a dev environment from Git and confirming there are no manual console steps.

What is the difference between declarative and imperative IaC, and where do Terraform and Ansible fall? (Basic)

Answer

Declarative IaC describes the desired end state and lets the tool calculate the changes. Imperative automation describes the exact steps to run. Terraform is primarily declarative. Ansible runs ordered tasks, so it feels imperative, but its modules are designed to be idempotent and express desired state such as package present or service started.

Technical explanation

Declarative tools require accurate state or discovery to calculate delta safely.

Imperative tasks are useful when order matters, such as draining a node before patching it.

Many production platforms combine both: Terraform for resources and Ansible for ordered host operations.

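To make the declarative side concrete, here is a minimal sketch in Terraform's HCL. The resource names, CIDR ranges, and tag value are illustrative assumptions rather than values from a real environment. The configuration states only what should exist; Terraform infers from the reference that the subnet depends on the VPC and orders the API calls itself.

```hcl
# Illustrative only: names and CIDR ranges below are assumptions.

# Desired state: one VPC. No "create" step is written anywhere.
resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"

  tags = {
    Name = "example-vpc"
  }
}

# Referencing aws_vpc.main.id creates an implicit dependency,
# so Terraform plans the VPC before the subnet without being told the order.
resource "aws_subnet" "app" {
  vpc_id     = aws_vpc.main.id
  cidr_block = "10.0.1.0/24"
}
```

A practical check of the declarative model: a second terraform plan against unchanged infrastructure should report no changes.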

Hands-on example

1. Build a small workflow that pairs a declarative Terraform configuration with ordered Ansible tasks.

2. Create a Git repository with folders terraform/network and ansible/web. Put cloud resources in Terraform and host configuration in Ansible.

3. Add a minimal Terraform resource and run the standard workflow:

    cd terraform/network
    terraform init
    terraform fmt -check
    terraform validate
    terraform plan -out=tfplan
    terraform apply tfplan

4. Post the plan summary on the pull request and require approval. Once the infrastructure exists, return to the repository root and run Ansible:

    ansible-inventory -i inventory/aws_ec2.yml --graph
    ansible-playbook -i inventory/aws_ec2.yml ansible/web/site.yml --check --diff
    ansible-playbook -i inventory/aws_ec2.yml ansible/web/site.yml

5. Prove the benefit by rebuilding a dev environment from Git and confirming there are no manual console steps.

What is the difference between configuration management and provisioning? (Basic)

Answer

Provisioning creates infrastructure such as networks, compute, databases, IAM, and load balancers. Configuration management installs packages, writes files, configures services, and keeps hosts in the desired runtime state. Terraform is usually stronger for provisioning; Ansible is usually stronger for configuration management and orchestration.

Technical explanation

Provisioning usually changes the cloud control plane; configuration management usually changes the operating system or application runtime.

Provisioned resources often have dependencies represented as a graph, while host configuration often has ordered steps.

Ownership boundaries matter so two tools do not manage the same object or field.

Hands-on example

1. Build a small workflow that keeps provisioning (Terraform) separate from configuration management (Ansible).

2. Create a Git repository with folders terraform/network and ansible/web. Put cloud resources in Terraform and host configuration in Ansible.

3. Add a minimal Terraform resource and run the standard workflow:

    cd terraform/network
    terraform init
    terraform fmt -check
    terraform validate
    terraform plan -out=tfplan
    terraform apply tfplan

4. Post the plan summary on the pull request and require approval. Once the infrastructure exists, return to the repository root and run Ansible:

    ansible-inventory -i inventory/aws_ec2.yml --graph
    ansible-playbook -i inventory/aws_ec2.yml ansible/web/site.yml --check --diff
    ansible-playbook -i inventory/aws_ec2.yml ansible/web/site.yml

5. Prove the benefit by rebuilding a dev environment from Git and confirming there are no manual console steps.

What is Terraform, and what is the core plan/apply workflow? (Basic)

Answer

Terraform is a declarative IaC tool that compares configuration, state, and real infrastructure to produce a plan, then applies that plan to create, update, or delete resources. The standard workflow is write HCL, terraform init, terraform fmt/validate, terraform plan, review, and terraform apply.

Technical explanation

plan is the safety checkpoint: it shows what Terraform intends to create, update, replace, or destroy.

apply executes the plan and records each resource in state as its provider operation succeeds, so a partial failure still leaves state reflecting what was actually created.

A mature workflow adds fmt, validate, linting, policy checks, and approval before production apply.

Hands-on example

1. Build a small workflow that exercises the core plan/apply cycle end to end.

2. Create a Git repository with folders terraform/network and ansible/web. Put cloud resources in Terraform and host configuration in Ansible.

3. Add a minimal Terraform resource and run the standard workflow:

    cd terraform/network
    terraform init
    terraform fmt -check
    terraform validate
    terraform plan -out=tfplan
    terraform apply tfplan

4. Post the plan summary on the pull request and require approval. Once the infrastructure exists, return to the repository root and run Ansible:

    ansible-inventory -i inventory/aws_ec2.yml --graph
    ansible-playbook -i inventory/aws_ec2.yml ansible/web/site.yml --check --diff
    ansible-playbook -i inventory/aws_ec2.yml ansible/web/site.yml

5. Prove the benefit by rebuilding a dev environment from Git and confirming there are no manual console steps.

What does terraform init do? (Basic)

Answer

terraform init prepares a working directory. It configures the backend, downloads required providers, installs child modules, creates or updates dependency lock information, and makes the directory ready for plan and apply. It is safe to run repeatedly and is normally the first command in local and CI workflows.

Technical explanation

Provider plugins are resolved from required_providers and recorded in .terraform.lock.hcl.

Backend initialization decides where state will be stored and may ask to migrate existing state.

Module installation fetches registry, Git, or archive-based child modules into .terraform/modules; local-path modules are referenced in place rather than downloaded.

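The three things init processes can be seen together in one small root module. This is a sketch under assumptions: the bucket name, state key, region, and module path are hypothetical placeholders.

```hcl
# Hypothetical root module: bucket, key, region, and module path are placeholders.
terraform {
  required_version = ">= 1.6, < 2.0"

  # init resolves these constraints and records the selected
  # provider version and checksums in .terraform.lock.hcl.
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }

  # init configures this backend and offers to migrate
  # existing state if the backend settings changed.
  backend "s3" {
    bucket = "company-tfstate-dev"
    key    = "platform/network/dev.tfstate"
    region = "ap-south-1"
  }
}

# init installs this child module (for a local path, it only registers it).
module "network" {
  source = "./modules/network"
}
```

Changing any of these three blocks is the usual reason Terraform asks you to run init again before plan.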

Hands-on example

1. Create a directory with main.tf, versions.tf, and a child module reference.

2. Run terraform init and inspect what changed:

    terraform init
    find .terraform -maxdepth 3 -type f | sort
    cat .terraform.lock.hcl

3. Change only a provider version constraint and rerun terraform init -upgrade in a test branch to see provider selection update.

4. In CI, run init before validate/plan and fail the build if .terraform.lock.hcl changed but was not committed.

What is the Terraform state file, and why is it critical? (Basic)

Answer

The Terraform state file is Terraform's source of truth for mapping resource addresses in code to real provider object IDs. It is critical because Terraform cannot safely know what it manages, what changed, or what dependencies exist without state. Losing or corrupting state can lead to duplicates, orphaned resources, or destructive plans.

Technical explanation

State includes resource instance addresses, provider IDs, attributes, dependency metadata, and sometimes sensitive values.

State is not a cache that can be ignored; it is a transactional record Terraform uses to plan future changes.

State operations should be backed up, locked, encrypted, and tightly permissioned.

Hands-on example

1. Move state to a remote backend with locking so it is backed up, encrypted, and protected from concurrent writes.

2. Create or use an S3 bucket with versioning, encryption, public access blocked, and least-privilege IAM. Then configure the backend:

    terraform {
      backend "s3" {
        bucket       = "company-tfstate-prod"
        key          = "platform/network/prod.tfstate"
        region       = "ap-south-1"
        encrypt      = true
        use_lockfile = true
      }
    }

3. If you still use legacy DynamoDB locking, verify that the table's partition key is LockID, and plan a deliberate migration: S3-native locking (use_lockfile, available from Terraform 1.10) has deprecated the DynamoDB-based mechanism.

4. Open two terminals and start two applies against the same state key; confirm the second run fails with a lock error (or waits, if you pass -lock-timeout) instead of writing concurrently.

5. Recover safely by checking for an active process first; run terraform force-unlock with the lock ID only when the original run is definitely dead.

Why should state be stored remotely, and what backend would you use on AWS? (Basic)

Answer

State should be remote so teams share one authoritative state, get locking, enable recovery, and avoid state sitting on a laptop. On AWS I normally use an S3 backend with encryption, bucket versioning, least-privilege IAM, and state locking. In current Terraform, S3 lockfile locking is preferred; older setups used DynamoDB locking.

Technical explanation

S3 should have versioning, server-side encryption, public access blocked, and tight IAM policies.

Use separate state keys for separate environments and blast-radius boundaries.

Remote backends also make CI/CD practical because workers do not need local state files.

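The bucket hardening listed above can itself be expressed in Terraform. The following is a sketch, assuming AWS provider v4 or later (where these settings are separate resources) and a hypothetical bucket name. Because the backend bucket must exist before terraform init can use it, this code would typically live in a small, separate bootstrap configuration.

```hcl
# Bootstrap sketch: the bucket name is an assumption.
resource "aws_s3_bucket" "tfstate" {
  bucket = "company-tfstate-prod"
}

# Versioning lets you recover an earlier state object after a bad write.
resource "aws_s3_bucket_versioning" "tfstate" {
  bucket = aws_s3_bucket.tfstate.id

  versioning_configuration {
    status = "Enabled"
  }
}

# Default server-side encryption for every state object.
resource "aws_s3_bucket_server_side_encryption_configuration" "tfstate" {
  bucket = aws_s3_bucket.tfstate.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
  }
}

# Block every form of public access to the state bucket.
resource "aws_s3_bucket_public_access_block" "tfstate" {
  bucket                  = aws_s3_bucket.tfstate.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
```

Least-privilege IAM is deliberately left out of this sketch because the right policy depends on which roles run Terraform in your organization.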

Hands-on example

1. Implement remote state and locking on AWS with the S3 backend.

2. Create or use an S3 bucket with versioning, encryption, public access blocked, and least-privilege IAM. Then configure the backend:

    terraform {
      backend "s3" {
        bucket       = "company-tfstate-prod"
        key          = "platform/network/prod.tfstate"
        region       = "ap-south-1"
        encrypt      = true
        use_lockfile = true
      }
    }

3. If you still use legacy DynamoDB locking, verify that the table's partition key is LockID, and plan a deliberate migration: S3-native locking (use_lockfile, available from Terraform 1.10) has deprecated the DynamoDB-based mechanism.

4. Open two terminals and start two applies against the same state key; confirm the second run fails with a lock error (or waits, if you pass -lock-timeout) instead of writing concurrently.

5. Recover safely by checking for an active process first; run terraform force-unlock with the lock ID only when the original run is definitely dead.

What is state locking, and why does it matter for teams? (Basic)

Answer

State locking prevents two Terraform runs from writing the same state at the same time. It matters because concurrent applies can corrupt state or make each run act on stale assumptions. In a team, locking is mandatory for safe shared infrastructure changes.

Technical explanation

Locking protects state writes, not the cloud provider itself; it prevents simultaneous Terraform writers.

A stale lock should be force-unlocked only after proving the original run is dead.

Good pipelines also serialize runs per state (for example, one concurrent job per workspace or backend key) so that runs queue instead of failing on the lock.

Hands-on example

1. Implement remote state, then observe locking by running two applies at once.

2. Create or use an S3 bucket with versioning, encryption, public access blocked, and least-privilege IAM. Then configure the backend:

    terraform {
      backend "s3" {
        bucket       = "company-tfstate-prod"
        key          = "platform/network/prod.tfstate"
        region       = "ap-south-1"
        encrypt      = true
        use_lockfile = true
      }
    }

3. If you still use legacy DynamoDB locking, verify that the table's partition key is LockID, and plan a deliberate migration: S3-native locking (use_lockfile, available from Terraform 1.10) has deprecated the DynamoDB-based mechanism.

4. Open two terminals and start two applies against the same state key; confirm the second run fails with a lock error (or waits, if you pass -lock-timeout) instead of writing concurrently.

5. Recover safely by checking for an active process first; run terraform force-unlock with the lock ID only when the original run is definitely dead.

How does Terraform use DynamoDB for state locking with an S3 backend? (Basic)

Answer

Historically, Terraform's S3 backend used a DynamoDB table with a LockID partition key to acquire a lock before modifying state and release it afterward, which prevents concurrent writers. Current Terraform S3 backend documentation marks DynamoDB-based locking as deprecated in favor of S3 lockfile locking, so I would start new work with use_lockfile and migrate existing backends where possible.

Technical explanation

Legacy DynamoDB locking uses a table row keyed by LockID; Terraform writes the lock and removes it when done.

If a run crashes, you may need force-unlock using the lock ID, but only after confirming no active writer exists.

For new Terraform S3 backends, check current support for use_lockfile and plan a migration away from DynamoDB-based locking.

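For reference, a legacy setup has two parts: the lock table and a backend that points at it. The sketch below assumes a hypothetical table name (terraform-locks) and reuses the example bucket and key; the two blocks would normally live in different configurations, since the table must exist before the backend can lock against it.

```hcl
# Bootstrap configuration: the lock table. The table name is an assumption.
resource "aws_dynamodb_table" "tf_locks" {
  name         = "terraform-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID" # Terraform expects exactly this partition key name

  attribute {
    name = "LockID"
    type = "S"
  }
}

# Consuming configuration: legacy backend settings.
terraform {
  backend "s3" {
    bucket         = "company-tfstate-prod"
    key            = "platform/network/prod.tfstate"
    region         = "ap-south-1"
    encrypt        = true
    dynamodb_table = "terraform-locks" # deprecated in favor of use_lockfile
  }
}
```

When migrating, change the backend arguments in a reviewed pull request and rerun terraform init so the backend change is picked up deliberately rather than as a side effect.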

Hands-on example

1. Implement S3 remote state and compare lockfile locking with legacy DynamoDB locking.

2. Create or use an S3 bucket with versioning, encryption, public access blocked, and least-privilege IAM. Then configure the backend:

    terraform {
      backend "s3" {
        bucket       = "company-tfstate-prod"
        key          = "platform/network/prod.tfstate"
        region       = "ap-south-1"
        encrypt      = true
        use_lockfile = true
      }
    }

3. If you still use legacy DynamoDB locking, verify that the table's partition key is LockID, and plan a deliberate migration: S3-native locking (use_lockfile, available from Terraform 1.10) has deprecated the DynamoDB-based mechanism.

4. Open two terminals and start two applies against the same state key; confirm the second run fails with a lock error (or waits, if you pass -lock-timeout) instead of writing concurrently.

5. Recover safely by checking for an active process first; run terraform force-unlock with the lock ID only when the original run is definitely dead.

What is a Terraform provider, and how is it versioned? (Basic)

Answer

A Terraform provider is a plugin that knows how to talk to an API such as AWS, Azure, Kubernetes, GitHub, or Datadog. Providers are versioned independently from Terraform itself, and production code should constrain provider versions with required_providers and commit the dependency lock file.

Technical explanation

Providers translate Terraform CRUD operations into API calls and expose resources and data sources.

Provider aliases allow one configuration to address multiple regions, accounts, or clusters.

Provider upgrades must be reviewed because schema or behavior changes can alter plans.

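Provider aliases are easiest to understand from a small example. In this sketch the regions, the alias name use1, and the bucket names are all assumptions chosen for illustration.

```hcl
# Default provider configuration (region is an assumption).
provider "aws" {
  region = "ap-south-1"
}

# A second configuration of the same provider, selected by alias.
provider "aws" {
  alias  = "use1"
  region = "us-east-1"
}

# Uses the default configuration because no provider argument is set.
resource "aws_s3_bucket" "primary" {
  bucket = "company-prod-logs"
}

# Opts in to the aliased configuration explicitly.
resource "aws_s3_bucket" "replica" {
  provider = aws.use1
  bucket   = "company-prod-logs-replica"
}
```

The same pattern extends to multiple accounts by giving each aliased block its own credentials or assumed role.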

Hands-on example

1. Pin the Terraform CLI and provider versions for a configuration.

2. Add versions.tf:

    terraform {
      required_version = ">= 1.6, < 2.0"

      required_providers {
        aws = {
          source  = "hashicorp/aws"
          version = "~> 5.0"
        }
      }
    }

3. Run terraform init and commit .terraform.lock.hcl. If CI runs on a different platform from developer machines, run terraform providers lock with a -platform flag for each platform so the lock file carries checksums for all of them.

4. Test an upgrade in a branch with terraform init -upgrade, review the provider changelog, and compare plans before merging.

Why pin provider and Terraform versions, and how do you do it? (Basic)

Answer

Version pinning makes Terraform runs reproducible. I constrain the Terraform CLI with required_version, constrain providers with required_providers, and commit .terraform.lock.hcl so CI and developers install the same provider versions and checksums. Without pinning, an upstream provider release can change plan behavior unexpectedly.

Technical explanation

required_version constrains the Terraform CLI feature set used by the configuration.

required_providers constrains provider source and version range.

The lock file pins checksums and selected provider versions for reproducible installs.

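For orientation, an entry in .terraform.lock.hcl has roughly the shape below. The version number is an illustrative assumption and the hash values are placeholders; terraform init generates the real file, which should be committed but not edited by hand.

```hcl
# Generated by terraform init; shown only to illustrate the structure.
provider "registry.terraform.io/hashicorp/aws" {
  version     = "5.31.0" # exact version selected (illustrative)
  constraints = "~> 5.0" # constraint copied from required_providers

  # Checksums of the provider packages (placeholders, not real values).
  hashes = [
    "h1:<placeholder-checksum>",
    "zh:<placeholder-checksum>",
  ]
}
```

The constraint allows a range, while the version line records the single release every run will install until someone runs terraform init -upgrade.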

Hands-on example

1. Pin the Terraform CLI and provider versions, then rehearse an upgrade.

2. Add versions.tf:

    terraform {
      required_version = ">= 1.6, < 2.0"

      required_providers {
        aws = {
          source  = "hashicorp/aws"
          version = "~> 5.0"
        }
      }
    }

3. Run terraform init and commit .terraform.lock.hcl. If CI runs on a different platform from developer machines, run terraform providers lock with a -platform flag for each platform so the lock file carries checksums for all of them.

4. Test an upgrade in a branch with terraform init -upgrade, review the provider changelog, and compare plans before merging.

What is a Terraform module, and why use modules? (Basic)

Answer

A Terraform module is a directory of Terraform configuration that exposes inputs and outputs. Modules are used to standardize patterns, reduce duplication, encode security defaults, and give teams a reusable interface for common infrastructure such as VPCs, EKS clusters, IAM roles, and service stacks.

Technical explanation

A module should hide implementation details but expose the decisions callers legitimately need.

Good modules include examples, validation, sane defaults, outputs, documentation, and versioned releases.

A module that exposes too little becomes a black box; a module that exposes everything becomes an unsafe free-for-all wrapper.

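Validation and safe defaults are the parts of a module interface that callers feel most directly. The sketch below shows both for a hypothetical bucket module; the force_destroy variable and the exact naming rule are assumptions added for illustration.

```hcl
variable "name" {
  type        = string
  description = "Bucket name. Must be lowercase and 3-63 characters."

  # Reject bad input at plan time instead of failing on the provider API call.
  validation {
    condition     = can(regex("^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$", var.name))
    error_message = "The name must be 3-63 characters of lowercase letters, digits, dots, or hyphens."
  }
}

# Hypothetical input: a safe default that callers must override deliberately.
variable "force_destroy" {
  type        = bool
  description = "Allow deleting a non-empty bucket. Off by default as a safety measure."
  default     = false
}
```

Exposing a single boolean with a safe default, rather than passing every provider argument through, is one way to stay between the black-box and free-for-all extremes.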

Hands-on example

1. Write a small reusable module that wraps an S3 bucket with standard tags.

2. Create modules/s3_bucket with variables, locals, a resource, and an output:

    variable "name" { type = string }

    variable "tags" {
      type    = map(string)
      default = {}
    }

    locals { common_tags = merge(var.tags, { managed_by = "terraform" }) }

    resource "aws_s3_bucket" "this" {
      bucket = var.name
      tags   = local.common_tags
    }

    output "bucket_id" { value = aws_s3_bucket.this.id }

3. Call it from the root module and pass the output to another module:

    module "logs" {
      source = "../../modules/s3_bucket"
      name   = "company-prod-logs"
      tags   = var.tags
    }

    module "app" {
      source        = "../../modules/app"
      log_bucket_id = module.logs.bucket_id
    }

4. Add validation, README examples, and a minimal Terratest or plan test before releasing v1.0.0.

What is the difference between a root module and a child module? (Basic)

Answer

The root module is the Terraform directory where you run terraform plan and apply. A child module is called from another module using a module block. The root module wires environment-specific inputs, providers, and backends; child modules package reusable infrastructure logic.

Technical explanation

The root module owns backend configuration and the environment's entrypoint.

Child modules should be reusable and should not contain backend configuration; Terraform only honors the backend block in the root module.

Provider configuration can be passed from root to child modules for multi-account or multi-region setups.

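Passing a provider configuration from the root to a child module looks roughly like this. It is a sketch: the alias name replica, the region, and the bucket name are assumptions, and the child module is assumed to be a simple bucket module at modules/s3_bucket that uses the aws provider.

```hcl
# Root module: owns provider configuration.
provider "aws" {
  alias  = "replica"
  region = "us-east-1"
}

# The providers map tells the child which configuration to use
# wherever its own code refers to the "aws" provider.
module "replica_logs" {
  source = "../../modules/s3_bucket"
  name   = "company-prod-logs-replica"

  providers = {
    aws = aws.replica
  }
}
```

The child module stays free of region and credential details, which is what lets the same module be instantiated once per account or region.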

Hands-on example

1. Build one child module and call it from a root module to see where each responsibility lives.

2. Create modules/s3_bucket with variables, locals, a resource, and an output:

    variable "name" { type = string }

    variable "tags" {
      type    = map(string)
      default = {}
    }

    locals { common_tags = merge(var.tags, { managed_by = "terraform" }) }

    resource "aws_s3_bucket" "this" {
      bucket = var.name
      tags   = local.common_tags
    }

    output "bucket_id" { value = aws_s3_bucket.this.id }

3. Call it from the root module and pass the output to another module:

    module "logs" {
      source = "../../modules/s3_bucket"
      name   = "company-prod-logs"
      tags   = var.tags
    }

    module "app" {
      source        = "../../modules/app"
      log_bucket_id = module.logs.bucket_id
    }

4. Add validation, README examples, and a minimal Terratest or plan test before releasing v1.0.0.

What are input variables, outputs, and locals? (Basic)

Answer

Input variables are parameters passed into a module. Outputs return values from a module, often for use by callers or other systems. Locals are named expressions used internally to simplify repeated logic. Variables are the module interface, outputs are the exported result, and locals are private implementation helpers.

Technical explanation

Variables can have types, defaults, descriptions, validation, and sensitive markings.

Outputs can expose IDs, ARNs, endpoints, and computed values for downstream modules or humans.

Locals improve readability by naming common expressions and normalizing inputs.

Hands-on example

1. Write a small module that uses input variables, locals, and outputs together.

2. Create modules/s3_bucket with variables, locals, resource, and outputs:

variable "name" { type = string }

variable "tags" {
  type    = map(string)
  default = {}
}

locals { common_tags = merge(var.tags, { managed_by = "terraform" }) }

resource "aws_s3_bucket" "this" {
  bucket = var.name
  tags   = local.common_tags
}

output "bucket_id" { value = aws_s3_bucket.this.id }

3. Call it from the root module and pass the output to another module:

module "logs" {
  source = "../../modules/s3_bucket"
  name   = "company-prod-logs"
  tags   = var.tags
}

module "app" {
  source        = "../../modules/app"
  log_bucket_id = module.logs.bucket_id
}

4. Add validation, README examples, and a minimal Terratest or plan test before releasing v1.0.0.
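
The variable features named in the technical explanation (type, description, validation, sensitive) can be sketched as follows. The variable names and the allowed environment values are hypothetical and only serve the illustration.

```hcl
variable "environment" {
  type        = string
  description = "Deployment environment; the allowed values are an assumption."

  validation {
    condition     = contains(["dev", "stage", "prod"], var.environment)
    error_message = "The environment value must be one of dev, stage, or prod."
  }
}

variable "db_password" {
  type        = string
  description = "Hypothetical secret; sensitive hides it in plan output."
  sensitive   = true
}

locals {
  # Private helper: callers cannot read or override this value.
  name_prefix = "app-${var.environment}"
}

output "name_prefix" {
  description = "Prefix that callers can reuse when naming related resources."
  value       = local.name_prefix
}
```

Marking a variable sensitive redacts it from CLI output, but the value is still written to state, so the state backend needs its own access controls.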

What is the difference between count and for_each, and when do you use each? (Basic)

Answer

count creates N instances addressed by numeric index; for_each creates instances keyed by map or set keys. I use count for simple on/off or identical numbered resources, and for_each when each instance has a stable identity such as subnet name, user name, or environment key.

Technical explanation

count instances are addressed like aws_instance.web[0]; for_each instances are addressed like aws_instance.web["blue"].

Stable addressing is important because address changes can force replacement or state moves.

for_each requires keys known during planning.

Hands-on example

1. Model a set of repeated resources, choosing deliberately between count and for_each.

2. Prefer stable keys with for_each:

variable "subnets" {
  type = map(object({ cidr = string, az = string }))
}

resource "aws_subnet" "private" {
  for_each          = var.subnets
  vpc_id            = aws_vpc.main.id
  cidr_block        = each.value.cidr
  availability_zone = each.value.az
  tags              = { Name = "private-${each.key}" }
}

3. For nested blocks, use dynamic only when the input list genuinely drives repeated nested configuration:

dynamic "ingress" {
  for_each = var.ingress_rules
  content {
    from_port   = ingress.value.port
    to_port     = ingress.value.port
    protocol    = "tcp"
    cidr_blocks = ingress.value.cidrs
  }
}

4. Remove one key and run plan; confirm only that keyed instance is affected rather than later list indexes shifting.
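
For the other half of the comparison, the on/off use of count mentioned in the answer might look like this sketch. The variable name and log group name are hypothetical.

```hcl
variable "enable_flow_logs" {
  type    = bool
  default = false
}

resource "aws_cloudwatch_log_group" "flow" {
  # count acts as a switch: zero instances when the flag is off, one when on.
  count = var.enable_flow_logs ? 1 : 0
  name  = "/vpc/flow-logs" # hypothetical name
}

output "flow_log_group_name" {
  # one() returns the single element, or null when the list is empty.
  value = one(aws_cloudwatch_log_group.flow[*].name)
}
```

There is no identity to preserve here, only presence or absence, which is why a numeric index is harmless in this case.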

What problem does for_each solve that count creates when removing items? (Basic)

Answer

for_each solves the index-shift problem caused by count. With count, removing an item from the middle of a list shifts every later index, so Terraform plans to update or replace resources that did not really change. With for_each, each object is addressed by a stable key, so removing one key does not rename the rest.

Technical explanation

Index shifts are especially dangerous for IAM users, security group rules, subnets, and DNS records.

Maps with stable keys express identity better than positional lists.

If migrating from count to for_each, use moved blocks or state moves to avoid replacement.

Hands-on example

1. Model repeated resources so that removing one item does not disturb the others.

2. Prefer stable keys with for_each:

variable "subnets" {
  type = map(object({ cidr = string, az = string }))
}

resource "aws_subnet" "private" {
  for_each          = var.subnets
  vpc_id            = aws_vpc.main.id
  cidr_block        = each.value.cidr
  availability_zone = each.value.az
  tags              = { Name = "private-${each.key}" }
}

3. For nested blocks, use dynamic only when the input list genuinely drives repeated nested configuration:

dynamic "ingress" {
  for_each = var.ingress_rules
  content {
    from_port   = ingress.value.port
    to_port     = ingress.value.port
    protocol    = "tcp"
    cidr_blocks = ingress.value.cidrs
  }
}

4. Remove one key and run plan; confirm only that keyed instance is affected rather than later list indexes shifting.
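
The count-to-for_each migration mentioned in the technical explanation can be sketched with moved blocks, which are available in Terraform 1.1 and later. The resource name, user names, and index-to-key mapping are assumptions for illustration.

```hcl
variable "user_names" {
  type    = set(string)
  default = ["alice", "bob"] # hypothetical users
}

# Previously declared with count over a list; now keyed by name.
resource "aws_iam_user" "team" {
  for_each = var.user_names
  name     = each.key
}

# Tell Terraform that the old indexed addresses are the same objects,
# so the plan shows moves instead of destroy and create.
moved {
  from = aws_iam_user.team[0]
  to   = aws_iam_user.team["alice"]
}

moved {
  from = aws_iam_user.team[1]
  to   = aws_iam_user.team["bob"]
}
```

Each moved block has to match the index that the object actually held in state, so it is worth checking terraform state list before writing them.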

What is a dynamic block, and when would you use it? (Basic)

Answer

A dynamic block generates repeatable nested blocks inside a resource. I use it when a resource needs zero or more nested blocks based on input, such as security group ingress rules, listener rules, EBS block devices, or Kubernetes container ports. It should not be overused when simple explicit blocks are clearer.

Technical explanation

dynamic blocks generate nested configuration blocks, not top-level resources.

Use them when input data drives repeated nested structures.

Keep them simple; too many dynamic layers make modules hard to read and debug.

Hands-on example

1. Model repeated resources and repeated nested blocks side by side to see where dynamic applies.

2. Prefer stable keys with for_each:

variable "subnets" {
  type = map(object({ cidr = string, az = string }))
}

resource "aws_subnet" "private" {
  for_each          = var.subnets
  vpc_id            = aws_vpc.main.id
  cidr_block        = each.value.cidr
  availability_zone = each.value.az
  tags              = { Name = "private-${each.key}" }
}

3. For nested blocks, use dynamic only when the input list genuinely drives repeated nested configuration:

dynamic "ingress" {
  for_each = var.ingress_rules
  content {
    from_port   = ingress.value.port
    to_port     = ingress.value.port
    protocol    = "tcp"
    cidr_blocks = ingress.value.cidrs
  }
}

4. Remove one key and run plan; confirm only that keyed instance is affected rather than later list indexes shifting.
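
A dynamic block only makes sense inside its parent resource, so here is a sketch of the fragment above in context. The variable shapes and the security group name are assumptions.

```hcl
variable "vpc_id" {
  type = string
}

variable "ingress_rules" {
  type = list(object({
    port  = number
    cidrs = list(string)
  }))
  default = []
}

resource "aws_security_group" "web" {
  name   = "web" # hypothetical name
  vpc_id = var.vpc_id

  # One ingress block is generated per element; an empty list yields none.
  dynamic "ingress" {
    for_each = var.ingress_rules
    content {
      from_port   = ingress.value.port
      to_port     = ingress.value.port
      protocol    = "tcp"
      cidr_blocks = ingress.value.cidrs
    }
  }
}
```

If the rule set were fixed at two or three entries, writing the ingress blocks out explicitly would probably be easier to read than this.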

What is the difference between a data source and a resource? (Basic)

Answer

A resource tells Terraform to create and manage an object. A data source reads an existing object or computed information without owning its lifecycle. For example, aws_vpc creates a VPC, while data.aws_ami reads the latest AMI ID for use by an instance.

Technical explanation

Data sources are read during planning when their arguments are already known; otherwise the read is deferred to apply. Referencing their values creates dependencies like any other reference.

Resources are owned by the current state and can be created, updated, or destroyed by Terraform.

Do not use a data source when Terraform should own lifecycle; do not create a resource when another team owns it.

Hands-on example

1. Create one resource and one data source side by side.

2. Use a data source for existing information and a resource for owned infrastructure:

data "aws_ami" "ubuntu" {
  most_recent = true
  owners      = ["099720109477"] # Canonical

  filter {
    name   = "name"
    values = ["ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-*"]
  }
}

resource "aws_instance" "web" {
  ami           = data.aws_ami.ubuntu.id
  instance_type = "t3.micro"
}

3. Run plan and explain that Terraform reads the AMI but owns the EC2 instance lifecycle.

4. Delete the instance from code and see Terraform plan destruction; remove the data source and see no remote object deletion.
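
The "another team owns it" case from the technical explanation might be sketched like this: the network is looked up, never created. The tag names and values are hypothetical.

```hcl
# Read a VPC that a different team or state owns.
data "aws_vpc" "shared" {
  tags = {
    Name = "shared-network" # assumed tag on the existing VPC
  }
}

# Read the private subnets inside that VPC.
data "aws_subnets" "private" {
  filter {
    name   = "vpc-id"
    values = [data.aws_vpc.shared.id]
  }

  tags = {
    Tier = "private" # assumed tagging convention
  }
}

output "private_subnet_ids" {
  value = data.aws_subnets.private.ids
}
```

Removing these blocks later changes nothing in AWS, because this configuration never owned the VPC or the subnets.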

What is terraform import, and when do you need it? (Basic)

Answer

terraform import brings an existing real-world object under Terraform state management. It is needed when infrastructure was created manually, by another tool, or before Terraform adoption. Terraform 1.5 and later also support import blocks, so imports can be reviewed, automated, and run through the normal plan/apply workflow.

Technical explanation

Import only updates Terraform state; you must still write configuration that matches the imported object.

Import blocks make imports reviewable and repeatable in pull requests.

After import, run plan and reconcile differences until the plan is clean or intentionally corrective.

Hands-on example

1. Onboard an existing resource that was created outside Terraform.

2. Inventory the existing object, then create a matching resource block and import block:

import {
  to = aws_s3_bucket.logs
  id = "existing-company-logs"
}

resource "aws_s3_bucket" "logs" {
  bucket = "existing-company-logs"
}

3. Run terraform plan -generate-config-out=generated.tf in a scratch branch when supported for the resource, then clean up the generated configuration to match module standards.

4. Apply the import, run a normal plan, and reconcile every proposed change as intentional, accidental drift, or provider default noise.

5. Repeat in small batches and do not enable automated production apply until no-op plans are reliable.
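
For the small batches in step 5, one option is an import block with for_each, which assumes Terraform 1.7 or later. The bucket names and the adopted resource name below are hypothetical.

```hcl
locals {
  # Map of stable keys to the names of buckets that already exist.
  existing_buckets = {
    logs   = "existing-company-logs"
    assets = "existing-company-assets" # hypothetical second bucket
  }
}

import {
  for_each = local.existing_buckets
  to       = aws_s3_bucket.adopted[each.key]
  id       = each.value
}

resource "aws_s3_bucket" "adopted" {
  for_each = local.existing_buckets
  bucket   = each.value
}
```

Adding one key per pull request keeps each import plan small enough to review line by line.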

What is configuration drift, and how does Terraform detect it? (Basic)

Answer

Configuration drift is when real infrastructure no longer matches Terraform code or state, usually because of manual changes, external automation, or provider-side defaults. Terraform detects drift during refresh in plan/apply by reading current remote objects and comparing them with state and desired configuration.

Technical explanation

Drift can appear as an in-place update, replacement, deletion, or state-only change in the plan.

Some drift is intentional, for example autoscaling capacity, and should be handled with ownership boundaries or ignore_changes.

Regular drift detection reduces surprises during emergency applies.

Hands-on example

1. Set up scheduled drift detection and a triage process.

2. Schedule a read-only plan job per workspace:

terraform init -input=false
status=0
terraform plan -input=false -detailed-exitcode -out=drift.tfplan || status=$?
if [ "$status" -ne 1 ]; then terraform show -json drift.tfplan > drift.json; fi

3. Interpret exit code 0 as no drift, 2 as changes present, and 1 as an error requiring investigation.

4. For state-only synchronization, use refresh-only review:

terraform plan -refresh-only -out=refresh.tfplan

terraform apply refresh.tfplan

5. Open a ticket that classifies drift as revert, codify, ignore because externally owned, or remove from Terraform ownership.
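
The intentional-drift case from the technical explanation, autoscaling capacity, might be handled as in this sketch. The variable names and sizes are assumptions.

```hcl
variable "private_subnet_ids" {
  type = list(string)
}

variable "launch_template_id" {
  type = string
}

resource "aws_autoscaling_group" "web" {
  name                = "web" # hypothetical name
  min_size            = 2
  max_size            = 10
  desired_capacity    = 2 # initial value only; scaling policies change it later
  vpc_zone_identifier = var.private_subnet_ids

  launch_template {
    id      = var.launch_template_id
    version = "$Latest"
  }

  lifecycle {
    # The autoscaler owns this field, so its changes are not reported as drift.
    ignore_changes = [desired_capacity]
  }
}
```

Everything else on the group, including min_size and max_size, is still compared on every plan.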

How do you handle resources changed manually outside Terraform? (Basic)

Answer

For manual changes outside Terraform, I first run plan to understand the drift. Then I choose one of three actions: codify the intended change, revert the manual change, or update/import/remove state if ownership changed. I do not blindly apply until I know whether Terraform will undo or preserve the change.

Technical explanation

First classify the manual change as intended, accidental, or externally owned.

If intended, update code and apply; if accidental, let Terraform revert it; if ownership changed, update state and documentation.

The worst response is to use ignore_changes just to silence a plan without understanding the cause.

Hands-on example

1. Set up drift detection so that manual changes surface in a plan before anyone applies.

2. Schedule a read-only plan job per workspace:

terraform init -input=false
status=0
terraform plan -input=false -detailed-exitcode -out=drift.tfplan || status=$?
if [ "$status" -ne 1 ]; then terraform show -json drift.tfplan > drift.json; fi

3. Interpret exit code 0 as no drift, 2 as changes present, and 1 as an error requiring investigation.

4. For state-only synchronization, use refresh-only review:

terraform plan -refresh-only -out=refresh.tfplan

terraform apply refresh.tfplan

5. Open a ticket that classifies drift as revert, codify, ignore because externally owned, or remove from Terraform ownership.
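
When the triage outcome is "remove from Terraform ownership", a removed block is one way to do it in code rather than with state commands. This sketch assumes Terraform 1.7 or later and a hypothetical resource address whose resource block has already been deleted from the configuration.

```hcl
# Stop managing the bucket without deleting it in AWS.
removed {
  from = aws_s3_bucket.legacy_logs # hypothetical address

  lifecycle {
    destroy = false # forget the object in state, keep the real bucket
  }
}
```

Because the hand-off goes through plan and apply, reviewers can see that the object will be forgotten rather than destroyed.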

What are Terraform workspaces, and what are their limitations for environments? (Basic)

Answer

Terraform workspaces let the same configuration directory maintain multiple state instances, selected with terraform workspace select. They are useful for lightweight isolation, but they are limited for full environment separation because backend settings, credentials, policies, and directory-level controls often need to differ by environment.

Technical explanation

Workspaces do not automatically isolate IAM permissions, backend configuration, or deployment approvals.

They can be useful for ephemeral review environments when the blast radius is small.

For critical environments, separate roots and state are clearer and easier to govern.

Hands-on example

1. Create environment separation that does not rely on workspaces alone.

2. Use reusable modules with separate root modules:

infra/
  modules/
    vpc/
  envs/
    dev/main.tf
    stage/main.tf
    prod/main.tf

3. Give each environment a separate backend key and credentials boundary:

terraform {
  backend "s3" {
    bucket       = "company-tfstate-prod"
    key          = "envs/prod/network.tfstate"
    region       = "ap-south-1"
    use_lockfile = true # S3-native state locking; requires Terraform 1.10 or later
  }
}

4. Run PR plans for all environments but require manual approval and stricter policy for prod.

5. Use workspaces only for low-risk ephemeral variants after documenting their limitations.
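
For the low-risk variants in step 5, a workspace-driven configuration typically keys its differences off terraform.workspace, roughly as sketched here. The prefix and instance sizes are assumptions.

```hcl
locals {
  # The selected workspace name is the only per-environment input.
  name_prefix = "app-${terraform.workspace}"

  instance_types = {
    default = "t3.micro"
    prod    = "m5.large" # hypothetical sizing
  }

  # Unknown workspaces fall back to the smallest size.
  instance_type = lookup(local.instance_types, terraform.workspace, "t3.micro")
}
```

The sketch also shows the limitation: backend, credentials, and approvals are identical for every workspace, and only values inside the configuration vary.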

What is the difference between using workspaces and separate state files per environment? (Basic)

Answer

Workspaces multiplex state within one configuration, while separate state files or directories make environment boundaries explicit. For production, I prefer separate state per environment with clear backend keys, permissions, and CI gates. Workspaces are acceptable for simple dev/test variants but can hide risk if prod and dev share too much configuration surface.

Technical explanation

Separate state makes blast radius and access control explicit.

Workspaces can reduce duplication but may encourage one-size-fits-all environment design.

Mature platforms often use modules for reuse and separate roots for isolation.

Hands-on example

1. Create environment separation with one state file per environment.

2. Use reusable modules with separate root modules:

infra/
  modules/
    vpc/
  envs/
    dev/main.tf
    stage/main.tf
    prod/main.tf

3. Give each environment a separate backend key and credentials boundary:

terraform {
  backend "s3" {
    bucket       = "company-tfstate-prod"
    key          = "envs/prod/network.tfstate"
    region       = "ap-south-1"
    use_lockfile = true # S3-native state locking; requires Terraform 1.10 or later
  }
}

4. Run PR plans for all environments but require manual approval and stricter policy for prod.

5. Use workspaces only for low-risk ephemeral variants after documenting their limitations.
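
A thin environment root in the layout above might look like this sketch for dev. The dev bucket name and the cidr_block input of the vpc module are assumptions.

```hcl
# infra/envs/dev/main.tf
terraform {
  backend "s3" {
    bucket       = "company-tfstate-dev" # assumed: one state bucket per environment
    key          = "envs/dev/network.tfstate"
    region       = "ap-south-1"
    use_lockfile = true
  }
}

module "vpc" {
  source     = "../../modules/vpc"
  cidr_block = "10.10.0.0/16" # hypothetical module input
}
```

The prod root would differ only in its backend settings and input values, which is what makes the boundary between environments visible in a diff.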

How do you structure Terraform code for multiple environments (dev/staging/prod)? (Basic)

Answer

I structure multi-environment Terraform with reusable modules plus thin environment root modules. Each environment has its own backend key or backend configuration, variable values, plan pipeline, and approval rules. Shared logic belongs in modules; environment-specific sizing, account IDs, regions, and feature flags belong in env roots or tfvars.

Technical explanation

Keep global/shared resources separate from app-specific resources when their lifecycles differ.

Avoid one giant state file; use meaningful boundaries such as network, cluster, data, and service.

Each environment should be independently plannable and recoverable.

Hands-on example

1. Create environment separation for dev, stage, and prod.

2. Use reusable modules with separate root modules:

infra/
  modules/
    vpc/
  envs/
    dev/main.tf
    stage/main.tf
    prod/main.tf

3. Give each environment a separate backend key and credentials boundary:

terraform {
  backend "s3" {
    bucket       = "company-tfstate-prod"
    key          = "envs/prod/network.tfstate"
    region       = "ap-south-1"
    use_lockfile = true # S3-native state locking; requires Terraform 1.10 or later
  }
}

4. Run PR plans for all environments but require manual approval and stricter policy for prod.

5. Use workspaces only for low-risk ephemeral variants after documenting their limitations.
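
Once state is split along boundaries such as network and service, one state can consume another's outputs read-only. This is a sketch using the terraform_remote_state data source; the private_subnet_ids output name is an assumption about what the network state publishes.

```hcl
# In the service root: read outputs published by the network state.
data "terraform_remote_state" "network" {
  backend = "s3"

  config = {
    bucket = "company-tfstate-prod"
    key    = "envs/prod/network.tfstate"
    region = "ap-south-1"
  }
}

locals {
  # Only root-level outputs of the other state are visible here.
  private_subnet_ids = data.terraform_remote_state.network.outputs.private_subnet_ids
}
```

The reading side needs read access to the whole state object, so some teams prefer to publish such values through a parameter store instead.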

What is the purpose of terraform plan -out and applying a saved plan? (Basic)

Answer

terraform plan -out writes an exact saved execution plan. Applying that saved plan ensures the approved actions are the actions that execute, instead of recalculating a different plan later. It is useful in CI/CD because reviewers approve a concrete artifact before apply.

Technical explanation

A saved plan includes variable values and provider decisions at plan time; treat it as sensitive.

Do not edit configuration between saved plan and apply unless you generate a new plan. Terraform also rejects a saved plan as stale if the state has changed since the plan was created.

Saved plans are useful for audit trails because the approval references a concrete artifact.

Hands-on example

1. Create a CI job that generates and publishes a saved plan artifact.

2. Run:

terraform init

terraform plan -out=tfplan

terraform show -no-color tfplan > tfplan.txt

terraform show -json tfplan > tfplan.json

3. Require reviewers to inspect tfplan.txt or a summarized PR comment before approving.

4. In the protected apply job, download the exact artifact and run terraform apply tfplan, not a fresh unreviewed apply.

What does terraform refresh do, and how has its behaviour changed? (Basic)

Answer

terraform refresh used to update state to match remote objects, but the standalone command is deprecated. The safer current workflow is terraform plan -refresh-only and terraform apply -refresh-only, because they let you review state-only changes before writing them.

Technical explanation

Normal plan and apply do an implicit refresh, but refresh-only mode focuses on updating state without changing remote infrastructure.

Standalone refresh could update state without a reviewed plan, which made it easier to hide dangerous drift.

Use refresh-only when the real infrastructure is correct and state needs to catch up.

Hands-on example

1. Set up drift handling that uses refresh-only plans instead of the standalone refresh command.

2. Schedule a read-only plan job per workspace:

terraform init -input=false
status=0
terraform plan -input=false -detailed-exitcode -out=drift.tfplan || status=$?
if [ "$status" -ne 1 ]; then terraform show -json drift.tfplan > drift.json; fi

3. Interpret exit code 0 as no drift, 2 as changes present, and 1 as an error requiring investigation.

4. For state-only synchronization, use refresh-only review:

terraform plan -refresh-only -out=refresh.tfplan

terraform apply refresh.tfplan

5. Open a ticket that classifies drift as revert, codify, ignore because externally owned, or remove from Terraform ownership.

What are provisioners, and why are they considered a last resort? (Basic)

Answer

Provisioners run scripts or commands during resource creation or destruction. They are a last resort because they are hard to model declaratively, can be non-idempotent, make dependency behavior fragile, and often hide configuration that belongs in images, cloud-init, Ansible, or a platform-native API.

Technical explanation

Provisioners run after resource creation or before destruction but are not part of the provider's normal desired-state model.

They can fail after the resource exists. A failed creation-time provisioner marks the resource as tainted, so the next apply plans to destroy and recreate it, which can make retries confusing.

Prefer provider-native resources, cloud-init, images, or Ansible for post-provision configuration.

Hands-on example

1. Demonstrate the risk of a provisioner and a declarative replacement for it.

2. Avoid this as a default pattern:

provisioner "remote-exec" {
  inline = ["sudo yum install -y nginx", "sudo systemctl enable --now nginx"]
}

3. Replace it with a baked AMI, user_data, SSM document, or Ansible role. For example, put bootstrap in cloud-init and detailed config in Ansible:

ansible-playbook -i inventory/aws_ec2.yml site.yml --limit tag_Role_web --check --diff

4. If a provisioner remains necessary, make it small, retry-safe, logged, and documented as a temporary bridge.
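
The cloud-init replacement from step 3 can be sketched as user_data on the instance itself. The ami_id variable is an assumption, and the packages mirror the remote-exec example above.

```hcl
variable "ami_id" {
  type = string
}

resource "aws_instance" "web" {
  ami           = var.ami_id
  instance_type = "t3.micro"

  # cloud-init runs this on first boot; no SSH connection from Terraform is needed.
  user_data = <<-EOT
    #cloud-config
    packages:
      - nginx
    runcmd:
      - systemctl enable --now nginx
  EOT
}
```

Bootstrap failures then show up in the instance's cloud-init logs rather than as a half-finished terraform apply.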

What is the difference between local-exec and remote-exec provisioners? (Basic)

Answer

local-exec runs a command on the machine executing Terraform. remote-exec connects to the created resource and runs commands there. local-exec is useful for local notifications or external tooling; remote-exec is usually fragile and should be replaced with baked images, cloud-init, SSM, or configuration management when possible.

Technical explanation

local-exec depends on the runner environment, installed tools, and credentials.

remote-exec depends on network reachability, SSH/WinRM readiness, host keys, and bootstrap timing.

Both should be isolated, logged, and avoided for core lifecycle management.

Hands-on example

1. Demonstrate the risk of remote-exec and a replacement for it.

2. Avoid this as a default pattern:

provisioner "remote-exec" {
  inline = ["sudo yum install -y nginx", "sudo systemctl enable --now nginx"]
}

3. Replace it with a baked AMI, user_data, SSM document, or Ansible role. For example, put bootstrap in cloud-init and detailed config in Ansible:

ansible-playbook -i inventory/aws_ec2.yml site.yml --limit tag_Role_web --check --diff

4. If a provisioner remains necessary, make it small, retry-safe, logged, and documented as a temporary bridge.
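
For contrast, a small local-exec of the kind the answer calls acceptable might look like this sketch. It assumes Terraform 1.4 or later for terraform_data; the release_id variable and the log file name are hypothetical.

```hcl
variable "release_id" {
  type = string
}

resource "terraform_data" "announce" {
  # Replacing this resource, and so re-running the command, happens
  # only when the release identifier changes.
  triggers_replace = [var.release_id]

  provisioner "local-exec" {
    # Runs on the machine executing Terraform, not on any remote host.
    command = "echo \"release $RELEASE_ID applied\" >> deploy.log"

    environment = {
      RELEASE_ID = var.release_id
    }
  }
}
```

The command depends on the runner having a shell and a writable working directory, which is the runner-environment coupling named in the technical explanation.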

What is the lifecycle block (create_before_destroy, prevent_destroy, ignore_changes)? (Basic)

Answer

The lifecycle block customizes resource behavior. create_before_destroy reduces downtime by replacing before deleting when supported. prevent_destroy protects critical resources from accidental deletion. ignore_changes tells Terraform not to act on selected attribute drift, usually when another system legitimately manages that field.

Technical explanation

create_before_destroy can require unique names or extra capacity because old and new resources coexist.

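prevent_destroy is a guardrail, not a backup strategy; it fails the plan if deletion is proposed, but it stops protecting the resource once the resource block itself is removed from the configuration.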

ignore_changes should be narrow and documented.

Hands-on example

1. Practice lifecycle and dependency controls on a protected resource.

2. Add lifecycle rules deliberately:

resource "aws_db_instance" "prod" {
  identifier = "prod-db"
  # other required arguments omitted for brevity

  lifecycle {
    prevent_destroy = true
    ignore_changes  = [allocated_storage]
  }
}

3. Force a reviewed replacement with:

terraform plan -replace='aws_instance.web["blue"]' -out=replace.tfplan

terraform apply replace.tfplan

4. Use depends_on only when references do not express the ordering. Then run terraform graph or inspect the plan to explain the dependency path.

5. If you use -target for recovery, immediately follow with a full terraform plan.
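
The unique-name caveat for create_before_destroy can be sketched with a security group that uses name_prefix instead of a fixed name. The prefix and the vpc_id variable are assumptions.

```hcl
variable "vpc_id" {
  type = string
}

resource "aws_security_group" "app" {
  # A generated suffix lets the old and new groups exist at the same time.
  name_prefix = "app-"
  vpc_id      = var.vpc_id

  lifecycle {
    create_before_destroy = true
  }
}
```

With a fixed name, the replacement would fail on creation because the old group still holds that name.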

When would you use ignore_changes, and what is the risk? (Basic)

Answer

ignore_changes is useful when an external controller manages an attribute, such as an autoscaler changing desired capacity or Kubernetes adding annotations. The risk is that Terraform will intentionally stop detecting meaningful drift for that field, so it can hide misconfiguration or security changes if used too broadly.

Technical explanation

Use it for fields owned by controllers, not for fields that encode compliance or security posture.

Review ignored attributes periodically because they become blind spots.

Prefer module design that separates ownership before reaching for ignore_changes.

Hands-on example

1. Practice lifecycle controls, including a narrowly scoped ignore_changes.

2. Add lifecycle rules deliberately:

resource "aws_db_instance" "prod" {
  identifier = "prod-db"
  # other required arguments omitted for brevity

  lifecycle {
    prevent_destroy = true
    ignore_changes  = [allocated_storage]
  }
}

3. Force a reviewed replacement with:

terraform plan -replace='aws_instance.web["blue"]' -out=replace.tfplan

terraform apply replace.tfplan

4. Use depends_on only when references do not express the ordering. Then run terraform graph or inspect the plan to explain the dependency path.

5. If you use -target for recovery, immediately follow with a full terraform plan.
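
A narrow ignore_changes can target a single map key rather than a whole attribute, as in this sketch. The LastPatched tag, assumed to be written by an external patching tool, and the ami_id variable are hypothetical.

```hcl
variable "ami_id" {
  type = string
}

resource "aws_instance" "worker" {
  ami           = var.ami_id
  instance_type = "t3.micro"

  tags = {
    Name = "worker"
  }

  lifecycle {
    # Ignore only the externally managed tag; every other tag is still checked.
    ignore_changes = [tags["LastPatched"]]
  }
}
```

Writing ignore_changes = [tags] instead would hide every tag change, including ones that matter for cost allocation or compliance.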

What is a tainted resource, and what replaced the taint command? (Basic)

Answer

A tainted resource is one that Terraform has marked in state for replacement on the next apply. The older terraform taint command is deprecated as of Terraform v0.15.2. The recommended replacement is the -replace option, as in terraform apply -replace='resource.address', because the replacement appears in the plan and can be reviewed before execution.

Technical explanation

-replace is explicit for one plan/apply operation instead of leaving a hidden taint in shared state.

It works well for degraded instances, corrupted volumes, or resources that need recreation after manual repair.

Review the replacement plan for dependencies and downtime.

Hands-on example

1. Practice lifecycle controls and a reviewed replacement with -replace.

2. Add lifecycle rules deliberately:

resource "aws_db_instance" "prod" {
  identifier = "prod-db"
  # other required arguments omitted for brevity

  lifecycle {
    prevent_destroy = true
    ignore_changes  = [allocated_storage]
  }
}

3. Force a reviewed replacement with:

terraform plan -replace='aws_instance.web["blue"]' -out=replace.tfplan

terraform apply replace.tfplan

4. Use depends_on only when references do not express the ordering. Then run terraform graph or inspect the plan to explain the dependency path.

5. If you use -target for recovery, immediately follow with a full terraform plan.

How do you target a specific resource with terraform apply, and why is it discouraged? (Basic)

Answer

You can target a resource with terraform plan -target=resource.address or terraform apply -target=resource.address. It is discouraged as normal practice because Terraform then considers only the targeted resource and what it depends on, which can leave the rest of the configuration partially converged. I reserve it for recovery or troubleshooting and follow it with a normal full plan.

Technical explanation

-target skips resources outside the target's dependency chain, including dependents that would normally be updated in the same run.

It is acceptable for recovery, bootstrapping a dependency, or narrowing a failing troubleshooting run.

Always run a normal plan afterward to confirm full convergence.

Hands-on example

1. Practice lifecycle and dependency controls for: How do you target a specific resource with terraform apply, and why is it discouraged?

2. Add lifecycle rules deliberately:

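resource "aws_db_instance" "prod" {
  identifier = "prod-db"
  # Other required arguments (engine, instance_class, and so on) are omitted here.

  lifecycle {
    prevent_destroy = true                # fail any plan that would destroy this instance
    ignore_changes  = [allocated_storage] # do not revert storage changes made outside Terraform
  }
}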

3. Force a reviewed replacement with:

terraform plan -replace='aws_instance.web["blue"]' -out=replace.tfplan

terraform apply replace.tfplan

4. Use depends_on only when references do not express the ordering. Then run terraform graph or inspect the plan to explain the dependency path.

5. If you use -target for recovery, immediately follow with a full terraform plan.

What are explicit versus implicit dependencies, and what does depends_on do?

Basic

Answer

Implicit dependencies come from references, such as an instance using a subnet ID. Explicit dependencies use depends_on when no data reference expresses the required ordering. depends_on should be used sparingly because overusing it makes plans less parallel and can hide weak module interfaces.

Technical explanation

References such as subnet_id = aws_subnet.private.id are preferred because they carry data and ordering.

depends_on is useful for side effects not represented in values, such as IAM propagation or module-level sequencing.

Unnecessary depends_on can make values unknown during plan and reduce parallelism.
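
The two kinds of dependency can be sketched side by side. This is a minimal illustration assuming the AWS provider; the resource names, CIDR ranges, bucket ARN, and the ami_id variable are hypothetical and chosen only to show where each edge comes from.

```hcl
variable "ami_id" {
  type        = string
  description = "Hypothetical AMI to launch."
}

resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
}

resource "aws_subnet" "private" {
  vpc_id     = aws_vpc.main.id # implicit: the subnet waits for the VPC
  cidr_block = "10.0.1.0/24"
}

resource "aws_iam_role" "app" {
  name = "example-app-role"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { Service = "ec2.amazonaws.com" }
    }]
  })
}

resource "aws_iam_role_policy" "read_bucket" {
  name = "example-read-bucket"
  role = aws_iam_role.app.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect   = "Allow"
      Action   = ["s3:GetObject"]
      Resource = "arn:aws:s3:::example-bucket/*"
    }]
  })
}

resource "aws_iam_instance_profile" "app" {
  name = "example-app-profile"
  role = aws_iam_role.app.name
}

resource "aws_instance" "app" {
  ami                  = var.ami_id
  instance_type        = "t3.micro"
  subnet_id            = aws_subnet.private.id             # implicit dependency
  iam_instance_profile = aws_iam_instance_profile.app.name # implicit dependency

  # Explicit: no argument above references the policy, but software on the
  # instance is assumed to need it at boot, so the ordering is stated by hand.
  depends_on = [aws_iam_role_policy.read_bucket]
}
```

Only the last edge needs depends_on; every other ordering falls out of the attribute references.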

Hands-on example

1. Practice lifecycle and dependency controls for: What are explicit versus implicit dependencies, and what does depends_on do?

2. Add lifecycle rules deliberately:

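resource "aws_db_instance" "prod" {
  identifier = "prod-db"
  # Other required arguments (engine, instance_class, and so on) are omitted here.

  lifecycle {
    prevent_destroy = true                # fail any plan that would destroy this instance
    ignore_changes  = [allocated_storage] # do not revert storage changes made outside Terraform
  }
}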

3. Force a reviewed replacement with:

terraform plan -replace='aws_instance.web["blue"]' -out=replace.tfplan

terraform apply replace.tfplan

4. Use depends_on only when references do not express the ordering. Then run terraform graph or inspect the plan to explain the dependency path.

5. If you use -target for recovery, immediately follow with a full terraform plan.

How does Terraform build its dependency graph?

Intermediate

Answer

Terraform builds a dependency graph from resource references, provider configuration, module calls, meta-arguments, data sources, and explicit depends_on edges. It uses the graph to order reads, creates, updates, replacements, and destroys while parallelizing independent work.

Technical explanation

The graph lets Terraform order creates and destroys differently: dependencies are created first and destroyed last, and a replacement is destroy-then-create by default unless create_before_destroy is set.

Unknown values at plan time can affect graph decisions and may delay evaluation until apply.

Understanding the graph helps debug cycles and unexpected ordering.
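
A common cycle, offered here as an illustration rather than something the text above names, is two security groups whose inline rules reference each other. The sketch below assumes the AWS provider; the group names, port, and vpc_id variable are hypothetical. It breaks the cycle by declaring the groups without inline rules and attaching the rules as separate resources, so every edge points from a rule to a group.

```hcl
variable "vpc_id" {
  type = string
}

resource "aws_security_group" "web" {
  name   = "example-web"
  vpc_id = var.vpc_id
}

resource "aws_security_group" "db" {
  name   = "example-db"
  vpc_id = var.vpc_id
}

# Each rule depends on both groups; neither group depends on a rule,
# so the graph stays acyclic.
resource "aws_security_group_rule" "db_from_web" {
  type                     = "ingress"
  security_group_id        = aws_security_group.db.id
  source_security_group_id = aws_security_group.web.id
  from_port                = 5432
  to_port                  = 5432
  protocol                 = "tcp"
}

resource "aws_security_group_rule" "web_to_db" {
  type                     = "egress"
  security_group_id        = aws_security_group.web.id
  source_security_group_id = aws_security_group.db.id # destination group for egress rules
  from_port                = 5432
  to_port                  = 5432
  protocol                 = "tcp"
}
```

Running terraform graph on a configuration like this shows the two groups as independent nodes that can be created in parallel, with both rules following them.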

Hands-on example

1. Practice lifecycle and dependency controls for: How does Terraform build its dependency graph?

2. Add lifecycle rules deliberately:

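resource "aws_db_instance" "prod" {
  identifier = "prod-db"
  # Other required arguments (engine, instance_class, and so on) are omitted here.

  lifecycle {
    prevent_destroy = true                # fail any plan that would destroy this instance
    ignore_changes  = [allocated_storage] # do not revert storage changes made outside Terraform
  }
}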

3. Force a reviewed replacement with:

terraform plan -replace='aws_instance.web["blue"]' -out=replace.tfplan

terraform apply replace.tfplan

4. Use depends_on only when references do not express the ordering. Then run terraform graph or inspect the plan to explain the dependency path.

5. If you use -target for recovery, immediately follow with a full terraform plan.

How do you pass outputs from one module to another?

Intermediate

Answer

Outputs from one module are passed to another by referencing module.<name>.<output>. The producing module defines an output block, and the root module wires that value into an input variable on the consuming module. This keeps dependencies explicit and avoids hidden remote lookups.

Technical explanation

Outputs are the clean contract between modules.

Passing values through the root module makes dependencies visible in code review.

Avoid making child modules read each other's remote state directly unless a deliberate stack boundary exists.

Hands-on example

1. Write a reusable module for: How do you pass outputs from one module to another?

2. Create modules/s3_bucket with variables, locals, resource, and outputs:

variable "name" { type = string }

variable "tags" {
  type    = map(string)
  default = {}
}

locals { common_tags = merge(var.tags, { managed_by = "terraform" }) }

resource "aws_s3_bucket" "this" {
  bucket = var.name
  tags   = local.common_tags
}

output "bucket_id" { value = aws_s3_bucket.this.id }

3. Call it from the root module and pass the output to another module:

module "logs" {
  source = "../../modules/s3_bucket"
  name   = "company-prod-logs"
  tags   = var.tags
}

module "app" {
  source        = "../../modules/app"
  log_bucket_id = module.logs.bucket_id # output of one module wired into an input of another
}

4. Add validation, README examples, and a minimal Terratest or plan test before releasing v1.0.0.

What is a remote state data source, and what is the security concern with it?

Intermediate

Answer

A remote state data source reads outputs from another Terraform state. It is convenient for cross-stack integration, but the security concern is that state often contains sensitive data and the reader may gain access to more than the intended outputs. I prefer narrower interfaces such as SSM Parameter Store, Secrets Manager, or explicitly published outputs when possible.

Technical explanation

Remote state coupling can make refactors hard because consumers depend on another workspace's output names.

State access often grants more data than one output, so apply least privilege carefully.

A safer pattern is to publish only required IDs to a controlled service catalog or parameter store.

Hands-on example

1. Create a producer workspace that outputs a VPC ID and a consumer that reads it.

2. Example consumer:

data "terraform_remote_state" "network" {
  backend = "s3"
  config = {
    bucket = "company-tfstate-prod"
    key    = "network/prod.tfstate"
    region = "ap-south-1"
  }
}

module "app" {
  source = "../../modules/app"
  vpc_id = data.terraform_remote_state.network.outputs.vpc_id
}

3. Review who can read the entire referenced state, not only vpc_id.

4. Replace the remote state dependency with SSM Parameter Store or a service catalog entry if access is too broad.
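
The replacement in step 4 might look like the sketch below, assuming the AWS provider. The parameter path is a hypothetical naming convention, and the two halves live in separate configurations with separate state.

```hcl
# --- Producer configuration (network stack) ---
resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
}

# Publish only the ID that consumers need.
resource "aws_ssm_parameter" "vpc_id" {
  name  = "/platform/prod/network/vpc_id" # hypothetical path
  type  = "String"
  value = aws_vpc.main.id
}

# --- Consumer configuration (app stack) ---
# Reads one parameter instead of the producer's whole state file.
data "aws_ssm_parameter" "vpc_id" {
  name = "/platform/prod/network/vpc_id"
}

module "app" {
  source = "../../modules/app"
  # The provider marks this value as sensitive, so plans show it redacted.
  vpc_id = data.aws_ssm_parameter.vpc_id.value
}
```

The consumer's IAM policy can then allow ssm:GetParameter on that single path rather than read access to the state bucket.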

How do you manage secrets in Terraform without leaking them into state?

Intermediate

Answer

You cannot completely avoid secret values in Terraform state if Terraform manages a resource attribute that stores the secret. The safer pattern is to avoid generating or passing plaintext secrets through Terraform, reference secret ARNs or names instead of values, restrict state access, encrypt state, and let runtime systems fetch secrets directly.

Technical explanation

Marking a value sensitive hides output but does not remove it from state.

If Terraform must set a secret value, assume state readers can see it and govern access accordingly.

Prefer secret metadata and runtime retrieval over secret material in IaC.

Hands-on example

1. Harden secret handling for: How do you manage secrets in Terraform without leaking them into state?

2. Bad pattern: passing plaintext database passwords as Terraform variables and outputting them.

3. Better pattern: create or reference secret metadata and let runtime fetch the value:

resource "aws_secretsmanager_secret" "db" { name = "prod/db/password" }

# Application IAM can read this secret ARN; Terraform does not need to output the value.

output "db_secret_arn" { value = aws_secretsmanager_secret.db.arn }

4. Mark any unavoidable sensitive input or output with sensitive = true, but still treat the state backend as secret storage.

5. Verify S3 state encryption, IAM read restrictions, audit logs, and CI log redaction.
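
For an RDS database specifically, one way to apply the better pattern is to let the service own the password. The sketch below assumes an AWS provider version that supports manage_master_user_password; the identifier, sizing, and username are hypothetical.

```hcl
resource "aws_db_instance" "app" {
  identifier        = "example-app-db"
  engine            = "postgres"
  instance_class    = "db.t3.micro"
  allocated_storage = 20
  username          = "app_admin"

  # RDS generates the master password and stores it in Secrets Manager,
  # so no password argument passes through Terraform.
  manage_master_user_password = true

  skip_final_snapshot = true # for the sketch only; keep final snapshots in production
}

# Only the secret's ARN is exposed; applications fetch the value at runtime.
output "db_master_secret_arn" {
  value = aws_db_instance.app.master_user_secret[0].secret_arn
}
```

State then holds the secret's ARN rather than the password itself, which narrows what a state reader can learn.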

Why does the state file potentially contain sensitive values, and how do you protect it?

Intermediate

Answer

State can contain sensitive values because providers must record enough information to detect drift and update resources. Marking an output sensitive hides it from CLI display, but the raw state may still contain it. Protect state with encryption, access controls, versioning, audit logs, short-lived credentials, and minimal sharing.

Technical explanation

Encrypt state at rest and in transit, and enable object versioning for recovery.

Limit state read permissions because read can be as sensitive as write.

Scrub CI logs and artifacts so plans and outputs do not expose secret material.
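
A sketch of an S3 backend with encryption and locking turned on, plus versioning on the state bucket. The bucket name follows the examples used elsewhere in this guide; the state key, KMS key ARN, and lock table name are hypothetical placeholders.

```hcl
terraform {
  backend "s3" {
    bucket     = "company-tfstate-prod"
    key        = "app/prod.tfstate" # hypothetical state key
    region     = "ap-south-1"
    encrypt    = true               # server-side encryption at rest
    kms_key_id = "arn:aws:kms:ap-south-1:111122223333:key/EXAMPLE-KEY-ID" # placeholder ARN

    # DynamoDB-based locking; recent Terraform releases also offer
    # S3-native locking, so check which your version recommends.
    dynamodb_table = "terraform-locks" # hypothetical table name
  }
}

# Usually managed from a separate bootstrap configuration:
# versioning lets you recover an earlier state object.
resource "aws_s3_bucket_versioning" "state" {
  bucket = "company-tfstate-prod"

  versioning_configuration {
    status = "Enabled"
  }
}
```

Encryption protects the object at rest; the bucket policy and KMS key policy still decide who can read it.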

Hands-on example

1. Harden secret handling for: Why does the state file potentially contain sensitive values, and how do you protect it?

2. Bad pattern: passing plaintext database passwords as Terraform variables and outputting them.

3. Better pattern: create or reference secret metadata and let runtime fetch the value:

resource "aws_secretsmanager_secret" "db" { name = "prod/db/password" }

# Application IAM can read this secret ARN; Terraform does not need to output the value.

output "db_secret_arn" { value = aws_secretsmanager_secret.db.arn }

4. Mark any unavoidable sensitive input or output with sensitive = true, but still treat the state backend as secret storage.

5. Verify S3 state encryption, IAM read restrictions, audit logs, and CI log redaction.

What is the sensitive flag on variables and outputs?

Intermediate

Answer

The sensitive flag redacts a variable's or output's value in Terraform's plan and apply output, which reduces accidental exposure in logs. It does not encrypt or remove the value in the state file, and terraform output -json or terraform output <name> still prints it. Anyone who can read state may still access sensitive values.

Technical explanation

Use sensitive = true for variables and outputs that contain tokens, passwords, private keys, or generated secrets.

Do not rely on sensitive for compliance if state access is broad.

Combine it with backend encryption, access control, and secret-store design.
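
A minimal sketch of the flag on an input and an output, assuming the AWS provider. The api_token variable and the parameter name are hypothetical, and the output exists only to show the syntax; in practice, avoid outputting secret material at all.

```hcl
variable "api_token" {
  type      = string
  sensitive = true # redacted in plan and apply output
}

resource "aws_ssm_parameter" "api_token" {
  name  = "/example/app/api_token" # hypothetical path
  type  = "SecureString"
  value = var.api_token # shown as (sensitive value) in the plan, still written to state
}

output "api_token" {
  value     = var.api_token
  sensitive = true # Terraform rejects an output built from a sensitive value without this
}
```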

Hands-on example

1. Harden secret handling for: What is the sensitive flag on variables and outputs?

2. Bad pattern: passing plaintext database passwords as Terraform variables and outputting them.

3. Better pattern: create or reference secret metadata and let runtime fetch the value:

resource "aws_secretsmanager_secret" "db" { name = "prod/db/password" }

# Application IAM can read this secret ARN; Terraform does not need to output the value.

output "db_secret_arn" { value = aws_secretsmanager_secret.db.arn }

4. Mark any unavoidable sensitive input or output with sensitive = true, but still treat the state backend as secret storage.

5. Verify S3 state encryption, IAM read restrictions, audit logs, and CI log redaction.

What are Terraform meta-arguments (count, for_each, provider, depends_on, lifecycle)?

Intermediate

Answer

Terraform meta-arguments are language features that modify how resources or modules behave. Common examples are count, for_each, provider, depends_on, and lifecycle. They are not provider-specific arguments; they control Terraform's graph, instance addressing, provider selection, and lifecycle decisions.

Technical explanation

Meta-arguments are handled by Terraform itself rather than by the provider, so they behave the same way for every resource type.

They affect addressing, dependencies, lifecycle, and provider binding.

Changing a meta-argument can change resource addresses, so review plans carefully.
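
A short sketch combining count, provider, and lifecycle on one resource, assuming the AWS provider. The regions, bucket name, and the enable_replica_bucket variable are hypothetical.

```hcl
provider "aws" {
  region = "ap-south-1"
}

# A second configuration of the same provider, selected by alias.
provider "aws" {
  alias  = "replica"
  region = "ap-southeast-1"
}

variable "enable_replica_bucket" {
  type    = bool
  default = false
}

resource "aws_s3_bucket" "replica" {
  # count: zero or one instance, addressed as aws_s3_bucket.replica[0]
  count = var.enable_replica_bucket ? 1 : 0

  # provider: bind this resource to the aliased configuration
  provider = aws.replica

  bucket = "company-prod-logs-replica"

  # lifecycle: ignore tag edits made outside Terraform
  lifecycle {
    ignore_changes = [tags]
  }
}
```

Switching this resource from count to for_each later would change its address from replica[0] to replica["key"], which is why such changes need a moved block or a careful plan review.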

Hands-on example

1. Model repeated resources for: What are Terraform meta-arguments (count, for_each, provider, depends_on, lifecycle)?

2. Prefer stable keys with for_each:

variable "subnets" {

type = map(object({ cidr = string, az = string }))

}

resource "aws_subnet" "private" {

for_each = var.subnets

vpc_id = aws_vpc.main.id

cidr_block = each.value.cidr

availability_zone = each.value.az

tags = { Name = "private-${each.key}" }

}

3. For nested blocks, use dynamic only when the input list genuinely drives repeated nested configuration:

# Placed inside a resource "aws_security_group" block:
dynamic "ingress" {
  for_each = var.ingress_rules
  content {
    from_port   = ingress.value.port
    to_port     = ingress.value.port
    protocol    = "tcp"
    cidr_blocks = ingress.value.cidrs
  }
}

4. Remove one key and run plan; confirm only that keyed instance is affected rather than later list indexes shifting.

How do you test Terraform code (validate, fmt, plan, tflint, terratest)?

Intermediate

Answer

I test Terraform in layers: terraform fmt for style, validate for syntax and provider schema, tflint for static rules, policy tools for guardrails, plan review for behavior, and Terratest or integration tests for real provisioning where risk justifies it.

Technical explanation

Static validation is fast and should run on every pull request.

Integration tests are slower and should focus on reusable modules or high-risk resources.

The plan itself is a test artifact because it shows the behavioral delta.
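
Treating the plan as a test artifact can also be automated. The sketch below swaps Terratest for Terraform's built-in test framework (terraform test, assumed here to be Terraform 1.6 or later) and targets the modules/s3_bucket module used elsewhere in this guide. The file path and variable values are hypothetical, and the plan still needs provider credentials unless the provider is mocked.

```hcl
# modules/s3_bucket/tests/bucket.tftest.hcl (hypothetical path)

variables {
  name = "example-plan-test-bucket"
  tags = { team = "platform" }
}

run "adds_managed_by_tag" {
  command = plan # assert against the plan without creating anything

  assert {
    condition     = aws_s3_bucket.this.bucket == var.name
    error_message = "The bucket name should come from var.name."
  }

  assert {
    condition     = aws_s3_bucket.this.tags["managed_by"] == "terraform"
    error_message = "The module should always add managed_by = terraform."
  }
}
```

Running terraform test from the module directory executes each run block and reports failed assertions.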

Hands-on example

1. Build a safe IaC delivery workflow for: How do you test Terraform code (validate, fmt, plan, tflint, terratest)?

2. Pull request job:

terraform fmt -check

terraform init -backend=false

terraform validate

tflint --recursive

checkov -d .

terraform init

terraform plan -out=tfplan

terraform show -json tfplan > tfplan.json

3. Policy job evaluates plan JSON for public exposure, missing encryption, IAM wildcards, and destructive changes.

4. Apply job runs only after approval, uses remote state locking, short-lived cloud credentials, and applies the saved plan artifact.

5. For failures, rerun plan, inspect state and cloud objects, and fix root cause before any state surgery.

How would you integrate Terraform into a CI/CD pipeline safely?

Intermediate

Answer

A safe Terraform CI/CD pipeline separates plan and apply, uses remote state locking, assumes least-privilege credentials, publishes the saved plan, requires review for production, runs policy and security checks before apply, and prevents concurrent runs per workspace.

Technical explanation

Use short-lived cloud credentials through OIDC or workload identity rather than long-lived static keys.

Separate read-only PR plans from privileged apply jobs.

Protect production applies with environment approvals and concurrency controls.

Hands-on example

1. Build a safe IaC delivery workflow for: How would you integrate Terraform into a CI/CD pipeline safely?

2. Pull request job:

terraform fmt -check

terraform init -backend=false

terraform validate

tflint --recursive

checkov -d .

terraform init

terraform plan -out=tfplan

terraform show -json tfplan > tfplan.json

3. Policy job evaluates plan JSON for public exposure, missing encryption, IAM wildcards, and destructive changes.

4. Apply job runs only after approval, uses remote state locking, short-lived cloud credentials, and applies the saved plan artifact.

5. For failures, rerun plan, inspect state and cloud objects, and fix root cause before any state surgery.

What is a policy-as-code check for Terraform (OPA, Sentinel, checkov)?

Intermediate

Answer

Policy-as-code checks evaluate Terraform plans or configuration against organizational rules. Examples include OPA/Rego, Sentinel, Checkov, Conftest, and tfsec-style checks. They catch issues such as public S3 buckets, unencrypted storage, unrestricted security groups, missing tags, and disallowed instance types before apply.

Technical explanation

Evaluate both configuration and generated plan JSON where possible; plan checks catch computed behavior.

Policies should be clear, versioned, tested, and owned like application code.

Start with high-value controls: encryption, public access, IAM wildcards, tagging, and destructive changes.

Hands-on example

1. Build a safe IaC delivery workflow for: What is a policy-as-code check for Terraform (OPA, Sentinel, checkov)?

2. Pull request job:

terraform fmt -check

terraform init -backend=false

terraform validate

tflint --recursive

checkov -d .

terraform init

terraform plan -out=tfplan

terraform show -json tfplan > tfplan.json

3. Policy job evaluates plan JSON for public exposure, missing encryption, IAM wildcards, and destructive changes.

4. Apply job runs only after approval, uses remote state locking, short-lived cloud credentials, and applies the saved plan artifact.

5. For failures, rerun plan, inspect state and cloud objects, and fix root cause before any state surgery.

How do you handle a Terraform apply that fails halfway through?

Intermediate

Answer

If apply fails halfway, I do not rerun blindly. I inspect the error, run terraform plan to see the actual remaining delta, check state for created resources, import or remove state only if needed, fix the root cause, and re-apply. Terraform state should reflect successful operations even if the overall apply failed.

Technical explanation

Terraform records successful resource operations as it goes, so partial success is normal after failures.

Manual cleanup may be needed if a provider created an object but failed before state was updated.

State surgery should be rare, backed up, and peer-reviewed.

Hands-on example

1. Build a safe IaC delivery workflow for: How do you handle a Terraform apply that fails halfway through?

2. Pull request job:

terraform fmt -check

terraform init -backend=false

terraform validate

tflint --recursive

checkov -d .

terraform init

terraform plan -out=tfplan

terraform show -json tfplan > tfplan.json

3. Policy job evaluates plan JSON for public exposure, missing encryption, IAM wildcards, and destructive changes.

4. Apply job runs only after approval, uses remote state locking, short-lived cloud credentials, and applies the saved plan artifact.

5. For failures, rerun plan, inspect state and cloud objects, and fix root cause before any state surgery.

What is the difference between terraform destroy and removing a resource from code?

Intermediate

Answer

terraform destroy plans deletion of every resource managed by the current state. Removing a resource from code plans destruction of just that resource, because Terraform treats it as no longer desired. If the goal is to stop managing the resource without deleting it, use a removed block with lifecycle { destroy = false } (Terraform 1.7 and later) or terraform state rm, with approvals.

Technical explanation

Removing code means 'this should no longer exist' unless you explicitly remove state ownership without destruction.

Destroying an entire workspace is a high-risk operation and should require strong approval.

Decommissioning should include backups, dependency checks, and post-destroy verification.
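
Handing a resource off without deleting it can be sketched with a removed block, assuming Terraform 1.7 or later. The address aws_s3_bucket.legacy_logs is hypothetical; the matching resource block must already be deleted from the configuration.

```hcl
# The resource "aws_s3_bucket" "legacy_logs" block has been deleted from code.
# This block tells Terraform to forget it instead of destroying it.
removed {
  from = aws_s3_bucket.legacy_logs

  lifecycle {
    destroy = false # drop from state, leave the real bucket in place
  }
}
```

The plan should then report the bucket as no longer managed rather than as a destroy, and the removed block can be deleted after the apply.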

Hands-on example

1. Build a safe IaC delivery workflow for: What is the difference between terraform destroy and removing a resource from code?

2. Pull request job:

terraform fmt -check

terraform init -backend=false

terraform validate

tflint --recursive

checkov -d .

terraform init

terraform plan -out=tfplan

terraform show -json tfplan > tfplan.json

3. Policy job evaluates plan JSON for public exposure, missing encryption, IAM wildcards, and destructive changes.

4. Apply job runs only after approval, uses remote state locking, short-lived cloud credentials, and applies the saved plan artifact.

5. For failures, rerun plan, inspect state and cloud objects, and fix root cause before any state surgery.

How do you import existing infrastructure into Terraform at scale?

Intermediate

Answer

At scale, I import existing infrastructure by inventorying resources, grouping them by ownership and risk, generating resource and import blocks, running plans in small batches, reconciling drift, and adding tests and policy. Modern import blocks and bulk import workflows are better than one-off CLI imports because they are reviewable and repeatable.

Technical explanation

Start with read-only inventory and tagging so ownership is clear before import.

Import low-risk resources first to validate naming and module design.

For large estates, automate discovery and generated import blocks but review each batch.

Hands-on example

1. Onboard an existing resource for: How do you import existing infrastructure into Terraform at scale?

2. Inventory the existing object, then create a matching resource block and import block:

import {

to = aws_s3_bucket.logs

id = "existing-company-logs"

}

resource "aws_s3_bucket" "logs" {

bucket = "existing-company-logs"

}

3. Alternatively, in a scratch branch, keep only the import block (without a resource block) and run terraform plan -generate-config-out=generated.tf where the resource type supports it, then clean up the generated configuration to match module standards.

4. Apply the import, run a normal plan, and reconcile every proposed change as intentional, accidental drift, or provider default noise.

5. Repeat in small batches and do not enable automated production apply until no-op plans are reliable.
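
Batches of similar resources can share one import block. The sketch below assumes Terraform 1.7 or later, where import blocks accept for_each; the map keys and the second bucket name are hypothetical inventory entries.

```hcl
locals {
  # Hypothetical inventory: Terraform key => existing bucket name
  existing_buckets = {
    logs  = "existing-company-logs"
    audit = "existing-company-audit"
  }
}

# One import per map entry, each bound to the matching resource instance.
import {
  for_each = local.existing_buckets
  to       = aws_s3_bucket.imported[each.key]
  id       = each.value
}

resource "aws_s3_bucket" "imported" {
  for_each = local.existing_buckets
  bucket   = each.value
}
```

Growing the map one reviewed batch at a time keeps each plan small enough to reconcile by hand.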

What is a Terraform registry module, and how do you evaluate one for production use?

Intermediate

Answer

A Terraform registry module is a published reusable module from the public or private registry. For production, I evaluate maintainership, source code quality, versioning, examples, inputs/outputs, security defaults, issue history, release cadence, license, and whether it allows the controls my organization requires.

Technical explanation

A good module has narrow scope and predictable behavior.

Avoid unmaintained modules, modules with excessive permissions, or modules that hide security decisions.

Pin module versions and read changelogs before upgrades.
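
A sketch of a version-constrained registry module call, using the public terraform-aws-modules/vpc/aws module as an example. The constraint and input values are assumptions for illustration; check the module's documentation for the release you actually adopt.

```hcl
module "vpc" {
  source = "terraform-aws-modules/vpc/aws"

  # Assumed constraint: allows 5.x updates only. Use an exact version
  # string here instead if upgrades should always be an explicit change.
  version = "~> 5.0"

  name            = "example-prod"
  cidr            = "10.0.0.0/16"
  azs             = ["ap-south-1a", "ap-south-1b"]
  private_subnets = ["10.0.1.0/24", "10.0.2.0/24"]
}
```

The dependency lock file records provider selections, not module versions, so the version argument is the only thing holding a registry module in place.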

Hands-on example

1. Build a safe IaC delivery workflow for: What is a Terraform registry module, and how do you evaluate one for production use?

2. Pull request job:

terraform fmt -check

terraform init -backend=false

terraform validate

tflint --recursive

checkov -d .

terraform init

terraform plan -out=tfplan

terraform show -json tfplan > tfplan.json

3. Policy job evaluates plan JSON for public exposure, missing encryption, IAM wildcards, and destructive changes.

4. Apply job runs only after approval, uses remote state locking, short-lived cloud credentials, and applies the saved plan artifact.

5. For failures, rerun plan, inspect state and cloud objects, and fix root cause before any state surgery.

How would you write a reusable module for a standard service (as you did to cut provisioning time 70%)?

Intermediate

Answer

For a reusable service module, I define a clean interface, encode secure defaults, expose only necessary knobs, create outputs needed by downstream systems, and add examples and tests. The goal is to turn repeated manual provisioning into a small, approved module call that is fast, consistent, and safe.

Technical explanation

A service module should encode defaults for logging, monitoring, tags, encryption, IAM boundaries, and deployment patterns.

Expose sizing and optional features as typed inputs with validation.

Prove reuse through examples for dev and prod and a test that provisions the minimal valid service.
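
Typed inputs with validation might look like the sketch below. The variable names, allowed values, and limits are hypothetical; a real service module would choose its own.

```hcl
variable "environment" {
  type = string

  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "environment must be one of dev, staging, or prod."
  }
}

variable "instance_count" {
  type    = number
  default = 2

  validation {
    condition     = var.instance_count >= 1 && var.instance_count <= 10
    error_message = "instance_count must be between 1 and 10."
  }
}

# Optional feature exposed as a typed knob with a secure default.
variable "enable_access_logs" {
  type    = bool
  default = true
}
```

Invalid values fail at plan time with the module's own error message, before any provider call is made.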

Hands-on example

1. Write a reusable module for: How would you write a reusable module for a standard service (as you did to cut provisioning time 70%)?

2. Create modules/s3_bucket with variables, locals, resource, and outputs:

variable "name" { type = string }

variable "tags" {
  type    = map(string)
  default = {}
}

locals { common_tags = merge(var.tags, { managed_by = "terraform" }) }

resource "aws_s3_bucket" "this" {
  bucket = var.name
  tags   = local.common_tags
}

output "bucket_id" { value = aws_s3_bucket.this.id }

3. Call it from the root module and pass the output to another module:

module "logs" {
  source = "../../modules/s3_bucket"
  name   = "company-prod-logs"
  tags   = var.tags
}

module "app" {
  source        = "../../modules/app"
  log_bucket_id = module.logs.bucket_id # output of one module wired into an input of another
}

4. Add validation, README examples, and a minimal Terratest or plan test before releasing v1.0.0.

What is the difference between Terraform and CloudFormation, and when choose each?

Intermediate

Answer

Terraform is multi-cloud and provider-based, with HCL modules and a broad ecosystem. CloudFormation is AWS-native, deeply integrated with AWS services, and supported directly by AWS. I choose Terraform for cross-cloud or standard platform workflows; CloudFormation when AWS-native support, StackSets, or service coverage is the decisive factor.

Technical explanation

Terraform's provider ecosystem is broader; CloudFormation's AWS integration can be deeper on day-zero AWS features.

CloudFormation state is managed by AWS stacks; Terraform state is managed by the selected backend.

Operational familiarity and governance requirements often decide the choice.

Hands-on example

1. Build a safe IaC delivery workflow for: What is the difference between Terraform and CloudFormation, and when choose each?

2. Pull request job:

terraform fmt -check

terraform init -backend=false

terraform validate

tflint --recursive

checkov -d .

terraform init

terraform plan -out=tfplan

terraform show -json tfplan > tfplan.json

3. Policy job evaluates plan JSON for public exposure, missing encryption, IAM wildcards, and destructive changes.

4. Apply job runs only after approval, uses remote state locking, short-lived cloud credentials, and applies the saved plan artifact.

5. For failures, rerun plan, inspect state and cloud objects, and fix root cause before any state surgery.

What is OpenTofu, and why did it fork from Terraform?

Intermediate

Answer

OpenTofu is the community-driven open-source fork of Terraform created after HashiCorp changed Terraform's license to BUSL. It aims to preserve an open, vendor-neutral IaC tool with a familiar Terraform-compatible workflow. I consider it when license posture, governance, or open-source guarantees matter to the organization.

Technical explanation

OpenTofu remains familiar to Terraform users, but compatibility and feature divergence should be tested per workspace.

A migration decision should consider provider support, CI tooling, policy tooling, and organizational license requirements.

Do not mix Terraform and OpenTofu against the same state without a deliberate, tested migration plan.

Keep Terraform's ownership boundary clear: one state should own a resource or field, and other tools should consume published outputs instead of modifying it.

Use fmt, validate, linting, policy checks, plan review, and state locking before production applies.

Design for small blast radius by splitting state around lifecycle, permissions, and recovery boundaries.

Hands-on example

1. Build a safe IaC delivery workflow with OpenTofu. The CLI binary is tofu, and the subcommands used here mirror Terraform's.

2. Pull request job:

tofu fmt -check

tofu init -backend=false

tofu validate

tflint --recursive

checkov -d .

tofu init

tofu plan -out=tfplan

tofu show -json tfplan > tfplan.json

3. Policy job evaluates plan JSON for public exposure, missing encryption, IAM wildcards, and destructive changes.

4. Apply job runs only after approval, uses remote state locking, short-lived cloud credentials, and applies the saved plan artifact.

5. For failures, rerun plan, inspect state and cloud objects, and fix root cause before any state surgery.

What is Ansible, and how is it agentless? (Intermediate)

Answer

Ansible is an automation and configuration management tool that runs tasks from a control node against managed hosts. It is agentless because managed Linux hosts usually only need SSH and Python, while Windows hosts use WinRM; there is no long-running Ansible agent to install.

Technical explanation

Agentless reduces bootstrap requirements and makes Ansible attractive for heterogeneous fleets.

The control node pushes modules to targets for execution and collects structured results.

Managed hosts still need network reachability, credentials, privilege escalation, and suitable interpreters.

Prefer idempotent modules over shell so repeated runs are safe and change reporting is meaningful.

Separate reusable role logic from inventory-specific variables so the same automation works across environments.

Run lint, syntax checks, check mode where useful, and staged rollouts before production-wide changes.

Hands-on example

1. Create a minimal control-node workflow; the managed hosts need only SSH access and Python, with no agent installed.

2. Inventory example:

[web]
web1 ansible_host=10.0.1.10 ansible_user=ec2-user
web2 ansible_host=10.0.1.11 ansible_user=ec2-user

[web:vars]
ansible_become=true

3. Playbook example:

---
- name: Configure web hosts
  hosts: web
  become: true
  tasks:
    - name: Ensure nginx is installed
      ansible.builtin.package:
        name: nginx
        state: present

    - name: Ensure nginx is running
      ansible.builtin.service:
        name: nginx
        state: started
        enabled: true

4. Run ansible -m ping web first, then ansible-playbook site.yml --check --diff, then the real run.

How does Ansible connect to managed hosts? (Intermediate)

Answer

Ansible connects to managed hosts using connection plugins, most commonly SSH for Linux/Unix and WinRM for Windows. Inventory defines the hosts and variables, and Ansible authenticates using SSH keys, passwords, Kerberos, cloud identity, or other supported mechanisms.

Technical explanation

Connection details can be inventory variables such as ansible_host, ansible_user, ansible_port, and ansible_connection.

become enables privilege escalation where tasks need root or administrator rights.

For cloud fleets, dynamic inventory can populate connection metadata automatically.

Prefer idempotent modules over shell so repeated runs are safe and change reporting is meaningful.

Separate reusable role logic from inventory-specific variables so the same automation works across environments.

Run lint, syntax checks, check mode where useful, and staged rollouts before production-wide changes.
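
As a sketch of how the connection variables above fit together, the YAML inventory below sets them per group. The file name, host names, addresses, the Windows account, and the NTLM transport are illustrative assumptions, not values from a real environment.

```yaml
# inventory/hosts.yml (hypothetical file name)
all:
  children:
    linux:
      hosts:
        web1:
          ansible_host: 10.0.1.10
      vars:
        ansible_connection: ssh
        ansible_user: ec2-user
        ansible_port: 22
        ansible_become: true          # escalate privileges for tasks that need root
    windows:
      hosts:
        win1:
          ansible_host: 10.0.2.10     # assumed address
      vars:
        ansible_connection: winrm
        ansible_user: Administrator   # assumed account
        ansible_port: 5986            # WinRM over HTTPS
        ansible_winrm_transport: ntlm # assumed authentication transport
```

Running ansible-inventory -i inventory/hosts.yml --list shows the merged variables each host would receive, which is a quick way to confirm the connection settings before any playbook runs.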

Hands-on example

1. Create a minimal control workflow that connects over SSH using the connection variables in the inventory.

2. Inventory example:

[web]
web1 ansible_host=10.0.1.10 ansible_user=ec2-user
web2 ansible_host=10.0.1.11 ansible_user=ec2-user

[web:vars]
ansible_become=true

3. Playbook example:

---
- name: Configure web hosts
  hosts: web
  become: true
  tasks:
    - name: Ensure nginx is installed
      ansible.builtin.package:
        name: nginx
        state: present

    - name: Ensure nginx is running
      ansible.builtin.service:
        name: nginx
        state: started
        enabled: true

4. Run ansible -m ping web first, then ansible-playbook site.yml --check --diff, then the real run.

What is an Ansible inventory, and what is the difference between static and dynamic inventory? (Intermediate)

Answer

An Ansible inventory lists the hosts and groups Ansible can manage. Static inventory is a file in INI or YAML. Dynamic inventory is generated from an external source such as AWS, Azure, GCP, VMware, or a CMDB, so host lists and metadata stay current automatically.

Technical explanation

Inventory groups model roles, environments, regions, and lifecycle stages.

Dynamic inventory prevents stale host lists when instances scale up or down.

Inventory variables should describe differences, not hide application logic.

Prefer idempotent modules over shell so repeated runs are safe and change reporting is meaningful.

Separate reusable role logic from inventory-specific variables so the same automation works across environments.

Run lint, syntax checks, check mode where useful, and staged rollouts before production-wide changes.
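
For the dynamic case, a minimal sketch using the amazon.aws.aws_ec2 inventory plugin might look like the following. It assumes the amazon.aws collection and its Python dependencies are installed on the control node; the region, the Role tag, and the Environment tag are assumptions made for illustration.

```yaml
# inventory/aws_ec2.yml
# The plugin expects the file name to end in aws_ec2.yml or aws_ec2.yaml.
plugin: amazon.aws.aws_ec2
regions:
  - us-east-1                   # assumed region
filters:
  instance-state-name: running  # ignore stopped or terminated instances
  "tag:Role": web               # assumed tag used to select web servers
keyed_groups:
  - key: tags.Environment       # builds groups such as env_prod from an assumed tag
    prefix: env
compose:
  ansible_host: private_ip_address   # connect over the private address
```

Running ansible-inventory -i inventory/aws_ec2.yml --graph prints the groups the plugin generated, so the result can be compared with what a static file would have listed.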

Hands-on example

1. Create a minimal control workflow driven by a static INI inventory.

2. Inventory example:

[web]
web1 ansible_host=10.0.1.10 ansible_user=ec2-user
web2 ansible_host=10.0.1.11 ansible_user=ec2-user

[web:vars]
ansible_become=true

3. Playbook example:

---
- name: Configure web hosts
  hosts: web
  become: true
  tasks:
    - name: Ensure nginx is installed
      ansible.builtin.package:
        name: nginx
        state: present

    - name: Ensure nginx is running
      ansible.builtin.service:
        name: nginx
        state: started
        enabled: true

4. Run ansible -m ping web first, then ansible-playbook site.yml --check --diff, then the real run.

What is a playbook, a play, and a task? (Intermediate)

Answer

A playbook is a YAML file containing one or more plays. A play maps hosts to roles or tasks. A task calls a module with arguments to enforce one piece of desired state, such as installing a package, rendering a template, or restarting a service.

Technical explanation

A play targets hosts and defines how to execute against them.

Tasks are executed in order unless strategies or async behavior alter execution.

Roles are commonly included from plays to keep playbooks readable.

Prefer idempotent modules over shell so repeated runs are safe and change reporting is meaningful.

Separate reusable role logic from inventory-specific variables so the same automation works across environments.

Run lint, syntax checks, check mode where useful, and staged rollouts before production-wide changes.
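
To make the three levels concrete, here is a small sketch of one playbook file containing two plays, each with a single task. The db group and the data directory path are assumptions for illustration.

```yaml
---
# Play 1: maps the web group to its tasks
- name: Configure web hosts
  hosts: web
  become: true
  tasks:
    - name: Ensure nginx is installed     # a task is one module call
      ansible.builtin.package:
        name: nginx
        state: present

# Play 2: a second play in the same playbook, targeting another group
- name: Configure database hosts
  hosts: db                                # assumed group
  become: true
  tasks:
    - name: Ensure the data directory exists
      ansible.builtin.file:
        path: /var/lib/appdata             # assumed path
        state: directory
        mode: "0750"
```

Plays run in the order they appear in the file, and each play's tasks run in order within it.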

Hands-on example

1. Create a minimal playbook containing one play with two tasks.

2. Inventory example:

[web]
web1 ansible_host=10.0.1.10 ansible_user=ec2-user
web2 ansible_host=10.0.1.11 ansible_user=ec2-user

[web:vars]
ansible_become=true

3. Playbook example:

---
- name: Configure web hosts
  hosts: web
  become: true
  tasks:
    - name: Ensure nginx is installed
      ansible.builtin.package:
        name: nginx
        state: present

    - name: Ensure nginx is running
      ansible.builtin.service:
        name: nginx
        state: started
        enabled: true

4. Run ansible -m ping web first, then ansible-playbook site.yml --check --diff, then the real run.

What is idempotency in Ansible, and why does it matter? (Intermediate)

Answer

Idempotency means running the same automation multiple times should produce the same final state without unnecessary changes. In Ansible, idempotent modules report changed only when they actually modify the host, which makes repeated runs safe and enables reliable handlers and drift correction.

Technical explanation

Idempotency is what makes configuration management safe as a recurring operation.

Handlers rely on correct changed status; false changes can cause unnecessary restarts.

Idempotent playbooks are easier to run in CI, during incidents, and on schedules.

Prefer idempotent modules over shell so repeated runs are safe and change reporting is meaningful.

Separate reusable role logic from inventory-specific variables so the same automation works across environments.

Run lint, syntax checks, check mode where useful, and staged rollouts before production-wide changes.

Hands-on example

1. Make a task idempotent so that repeated runs report changed only when something was actually modified.

2. Replace an unsafe command with a module where possible:

- name: Install nginx idempotently
  ansible.builtin.package:
    name: nginx
    state: present

3. If command is unavoidable, add guards:

# Assumes init-db writes the marker file when it succeeds.
# creates alone gives correct reporting: changed when the command runs,
# ok when the marker already exists.
- name: Initialize application database once
  ansible.builtin.command:
    cmd: /opt/app/bin/init-db
    creates: /var/lib/app/.db_initialized

4. Run the playbook twice; the second run should report ok rather than changed for already-converged tasks.

Why is a raw shell or command task not idempotent, and how do you make it safe? (Intermediate)

Answer

Raw shell or command tasks are not inherently idempotent because Ansible cannot know whether the command changed anything. You make them safer with creates, removes, changed_when, failed_when, check_mode guards, or by replacing them with a purpose-built module.

Technical explanation

A command might create a user, append a line, or restart a service every time unless guarded.

Use modules such as package, service, lineinfile, copy, template, user, and file when possible.

If command is unavoidable, explicitly define changed_when and failed_when.

Prefer idempotent modules over shell so repeated runs are safe and change reporting is meaningful.

Separate reusable role logic from inventory-specific variables so the same automation works across environments.

Run lint, syntax checks, check mode where useful, and staged rollouts before production-wide changes.
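
When no marker file exists to guard on, changed_when and failed_when can be driven by the registered result instead. The sketch below assumes a hypothetical migrate command that prints the word applied when it changes something and exits with code 2 when there is nothing to do.

```yaml
- name: Apply pending schema migrations
  ansible.builtin.command:
    cmd: /opt/app/bin/migrate --apply    # hypothetical command and flag
  register: migrate_result
  # Report a change only when the tool says it did something (assumed output).
  changed_when: "'applied' in migrate_result.stdout"
  # Treat exit code 2 as "nothing to do" rather than a failure (assumed convention).
  failed_when: migrate_result.rc not in [0, 2]
```

Both expressions refer to migrate_result, so the register line is required; without it the conditions would reference an undefined variable.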

Hands-on example

1. Make a command task safe to rerun.

2. Replace an unsafe command with a module where possible:

- name: Install nginx idempotently
  ansible.builtin.package:
    name: nginx
    state: present

3. If command is unavoidable, add guards:

# Assumes init-db writes the marker file when it succeeds.
# creates alone gives correct reporting: changed when the command runs,
# ok when the marker already exists.
- name: Initialize application database once
  ansible.builtin.command:
    cmd: /opt/app/bin/init-db
    creates: /var/lib/app/.db_initialized

4. Run the playbook twice; the second run should report ok rather than changed for already-converged tasks.

What is the creates argument on a command task, and how does it add idempotency? (Intermediate)

Answer

creates is an argument for command or shell-style tasks that tells Ansible to skip the command when a specific file already exists. It adds idempotency for one-time commands such as extracting an archive, initializing a database, or generating a marker after a migration.

Technical explanation

creates is checked before the command runs.

It is useful for marker-file workflows, but the marker must represent the real completed state.

For reversible tasks, removes can be the counterpart guard.

Prefer idempotent modules over shell so repeated runs are safe and change reporting is meaningful.

Separate reusable role logic from inventory-specific variables so the same automation works across environments.

Run lint, syntax checks, check mode where useful, and staged rollouts before production-wide changes.
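
A short sketch of creates and its counterpart removes side by side; the archive path and the VERSION marker file are assumptions for illustration.

```yaml
- name: Unpack the release archive once
  ansible.builtin.command:
    cmd: tar -xzf /tmp/app.tar.gz -C /opt/app
    creates: /opt/app/VERSION      # skipped when this file already exists

- name: Clean up the downloaded archive
  ansible.builtin.command:
    cmd: rm /tmp/app.tar.gz
    removes: /tmp/app.tar.gz       # skipped when this file is already gone
```

In real playbooks the unarchive and file modules would usually be the better choice for these two steps; the commands are used here only to show how the guards behave.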

Hands-on example

1. Use creates to guard a one-time command.

2. Replace an unsafe command with a module where possible:

- name: Install nginx idempotently
  ansible.builtin.package:
    name: nginx
    state: present

3. If command is unavoidable, add guards:

# Assumes init-db writes the marker file when it succeeds.
# creates alone gives correct reporting: changed when the command runs,
# ok when the marker already exists.
- name: Initialize application database once
  ansible.builtin.command:
    cmd: /opt/app/bin/init-db
    creates: /var/lib/app/.db_initialized

4. Run the playbook twice; the second run should report ok rather than changed for already-converged tasks.

What are Ansible modules, and why prefer them over shell commands? (Intermediate)

Answer

Ansible modules are reusable units that implement desired-state logic for a target system. I prefer modules over shell because modules understand idempotency, check mode, return values, errors, and platform differences better than hand-written commands.

Technical explanation

Modules return structured JSON results that can be registered and tested.

Modules support check mode and diff mode more reliably than shell.

Purpose-built modules reduce quoting, parsing, and platform portability problems.

Prefer idempotent modules over shell so repeated runs are safe and change reporting is meaningful.

Separate reusable role logic from inventory-specific variables so the same automation works across environments.

Run lint, syntax checks, check mode where useful, and staged rollouts before production-wide changes.
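
As an illustration of structured results, the sketch below registers the output of the stat module and tests one of its fields, with no parsing of command output involved. The variable name nginx_conf is an arbitrary choice.

```yaml
- name: Inspect the nginx config file
  ansible.builtin.stat:
    path: /etc/nginx/nginx.conf
  register: nginx_conf

- name: Fail early if the config is missing
  ansible.builtin.assert:
    that:
      - nginx_conf.stat.exists
    fail_msg: /etc/nginx/nginx.conf was not found
```

The equivalent shell approach would need a test command plus return-code handling, and it would behave less predictably in check mode.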

Hands-on example

1. Prefer a module over a shell command, and guard the command when no module fits.

2. Replace an unsafe command with a module where possible:

- name: Install nginx idempotently
  ansible.builtin.package:
    name: nginx
    state: present

3. If command is unavoidable, add guards:

# Assumes init-db writes the marker file when it succeeds.
# creates alone gives correct reporting: changed when the command runs,
# ok when the marker already exists.
- name: Initialize application database once
  ansible.builtin.command:
    cmd: /opt/app/bin/init-db
    creates: /var/lib/app/.db_initialized

4. Run the playbook twice; the second run should report ok rather than changed for already-converged tasks.

What is a handler, and how is it triggered with notify? (Intermediate)

Answer

A handler is a special task that runs only when notified by a changed task. For example, template a config file and notify Restart nginx only if the template changed. This avoids unnecessary restarts and ties operational actions to real changes.

Technical explanation

Handlers are normal tasks listed under handlers but triggered by notify.

They run only when the notifying task reports changed.

Multiple tasks can notify the same handler and it will still run once by default.

Prefer idempotent modules over shell so repeated runs are safe and change reporting is meaningful.

Separate reusable role logic from inventory-specific variables so the same automation works across environments.

Run lint, syntax checks, check mode where useful, and staged rollouts before production-wide changes.

Hands-on example

1. Notify a handler from a task so the restart happens only when the config changes.

2. Template a config and notify one restart:

tasks:
  - name: Render nginx config
    ansible.builtin.template:
      src: nginx.conf.j2
      dest: /etc/nginx/nginx.conf
      validate: 'nginx -t -c %s'
    notify: Restart nginx

handlers:
  - name: Restart nginx
    ansible.builtin.service:
      name: nginx
      state: restarted

3. Change two template-managed files that both notify Restart nginx and observe that the handler runs once at the end.

4. If later tasks require the restarted service immediately, insert meta: flush_handlers at that point and document why.

Why do handlers run once at the end rather than immediately? (Intermediate)

Answer

Handlers run once at the end of a play by default so multiple changes can trigger one restart instead of several restarts. They are deduplicated by handler name. If an immediate restart is required before later tasks, use meta: flush_handlers deliberately.

Technical explanation

End-of-play execution reduces service churn during a run.

meta: flush_handlers is available when later tasks depend on the handler's effect.

Handler names should be unique and stable because notify references names.

Prefer idempotent modules over shell so repeated runs are safe and change reporting is meaningful.

Separate reusable role logic from inventory-specific variables so the same automation works across environments.

Run lint, syntax checks, check mode where useful, and staged rollouts before production-wide changes.
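
A minimal sketch of flushing handlers mid-play, for the case where a later task depends on the restarted service. The local health URL is an assumption.

```yaml
tasks:
  - name: Render nginx config
    ansible.builtin.template:
      src: nginx.conf.j2
      dest: /etc/nginx/nginx.conf
    notify: Restart nginx

  # Run any notified handlers now instead of at the end of the play.
  - name: Run pending handlers now
    ansible.builtin.meta: flush_handlers

  - name: Verify nginx answers after the restart
    ansible.builtin.uri:
      url: http://localhost/         # assumed local endpoint
      status_code: 200

handlers:
  - name: Restart nginx
    ansible.builtin.service:
      name: nginx
      state: restarted
```

If the template task reports ok rather than changed, nothing is notified and the flush does nothing, so the verification step simply runs against the already-running service.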

Hands-on example

1. Observe that notified handlers are deduplicated and run at the end of the play.

2. Template a config and notify one restart:

tasks:
  - name: Render nginx config
    ansible.builtin.template:
      src: nginx.conf.j2
      dest: /etc/nginx/nginx.conf
      validate: 'nginx -t -c %s'
    notify: Restart nginx

handlers:
  - name: Restart nginx
    ansible.builtin.service:
      name: nginx
      state: restarted

3. Change two template-managed files that both notify Restart nginx and observe that the handler runs once at the end.

4. If later tasks require the restarted service immediately, insert meta: flush_handlers at that point and document why.

What are Ansible roles, and what is the standard directory structure? (Intermediate)

Answer

Ansible roles package tasks, handlers, templates, files, defaults, variables, and metadata into reusable units. The standard structure includes tasks/main.yml, handlers/main.yml, templates/, files/, defaults/main.yml, vars/main.yml, and meta/main.yml.

Technical explanation

defaults/main.yml is for low-precedence defaults callers can override.

vars/main.yml has higher precedence than inventory and play variables, so use it sparingly, for values callers should not override.

Roles become easier to test and version when each role has a clear responsibility.

Prefer idempotent modules over shell so repeated runs are safe and change reporting is meaningful.

Separate reusable role logic from inventory-specific variables so the same automation works across environments.

Run lint, syntax checks, check mode where useful, and staged rollouts before production-wide changes.
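
The split between defaults and vars could look like the sketch below for a web role. Apart from web_listen_port, the variable names and values are hypothetical.

```yaml
# roles/web/defaults/main.yml: low precedence, meant to be overridden by callers
web_listen_port: 80
web_worker_processes: auto

# roles/web/vars/main.yml: high precedence, internal to the role
web_package_name: nginx
web_config_path: /etc/nginx/nginx.conf
```

A caller that sets web_listen_port in inventory or in the play overrides the default, while the values in vars/main.yml stay fixed unless extra vars are used.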

Hands-on example

1. Package reusable Ansible content as a role with a single clear responsibility.

2. Create a role structure:

roles/web/
  defaults/main.yml
  vars/main.yml
  tasks/main.yml
  handlers/main.yml
  templates/nginx.conf.j2
  files/
  meta/main.yml

3. Call the role from a playbook:

- name: Configure web tier
  hosts: web
  roles:
    - role: web
      vars:
        web_listen_port: 8080

4. If distributing modules/plugins/roles together, package them as a collection and pin it in requirements.yml.

What is the order of variable precedence in Ansible at a high level? (Intermediate)

Answer

At a high level, more specific and later sources override broader defaults. Role defaults sit at the bottom, inventory group and host variables are in the middle, play and task vars are higher, and extra vars passed with -e always win. I avoid relying on obscure precedence and keep variable ownership clear.

Technical explanation

Extra vars override every other variable source, so control who can pass them in CI.

Role defaults are the safest place for configurable defaults.

Good naming conventions reduce accidental variable collisions.

Prefer idempotent modules over shell so repeated runs are safe and change reporting is meaningful.

Separate reusable role logic from inventory-specific variables so the same automation works across environments.

Run lint, syntax checks, check mode where useful, and staged rollouts before production-wide changes.

Hands-on example

1. Model the same variable at several precedence levels.

2. Create inventory variables:

group_vars/web.yml:

app_port: 8080
package_name_by_os:
  RedHat: httpd
  Debian: apache2

host_vars/web1.yml:

app_port: 9090

3. Use facts and variables in a task:

- name: Install OS-specific web package
  ansible.builtin.package:
    name: "{{ package_name_by_os[ansible_facts['os_family']] }}"
    state: present
  when: ansible_facts['os_family'] in package_name_by_os

4. Run ansible-playbook site.yml -e app_port=7070 in a lab to see extra vars override lower-precedence values.

What are group_vars and host_vars? (Intermediate)

Answer

group_vars define variables for a group of hosts, while host_vars define variables for a single host. They let the same playbook adapt to environments, regions, roles, or individual host differences without embedding conditionals everywhere.

Technical explanation

group_vars/all applies broadly; group-specific files apply to that group; host_vars applies to one host.

Variables can be organized as directories with multiple files for readability.

Use inventory hierarchy carefully when a host belongs to multiple groups.

Prefer idempotent modules over shell so repeated runs are safe and change reporting is meaningful.

Separate reusable role logic from inventory-specific variables so the same automation works across environments.

Run lint, syntax checks, check mode where useful, and staged rollouts before production-wide changes.

Hands-on example

1. Define group-level variables and override one of them for a single host.

2. Create inventory variables:

group_vars/web.yml:

app_port: 8080
package_name_by_os:
  RedHat: httpd
  Debian: apache2

host_vars/web1.yml:

app_port: 9090

3. Use facts and variables in a task:

- name: Install OS-specific web package
  ansible.builtin.package:
    name: "{{ package_name_by_os[ansible_facts['os_family']] }}"
    state: present
  when: ansible_facts['os_family'] in package_name_by_os

4. Run ansible-playbook site.yml -e app_port=7070 in a lab to see extra vars override lower-precedence values.

What is Ansible Vault, and how do you protect secrets with it? (Intermediate)

Answer

Ansible Vault encrypts sensitive YAML values or files so secrets can be stored with playbooks without being readable in plaintext. I use vault IDs, separate secrets by environment, avoid printing secret values, and integrate decryption with CI/CD using controlled credentials.

Technical explanation

Vault can encrypt whole files or individual values.

Use no_log for tasks that might print decrypted values.

Rotate vault passwords or vault identities according to your secrets policy.

Prefer idempotent modules over shell so repeated runs are safe and change reporting is meaningful.

Separate reusable role logic from inventory-specific variables so the same automation works across environments.

Run lint, syntax checks, check mode where useful, and staged rollouts before production-wide changes.
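
One common layout, sketched below, keeps a plaintext variable that points at a value stored in a Vault-encrypted file, and marks the consuming task with no_log. The prod vault ID, the file paths, the vault_db_password name, and the template are all assumptions for illustration.

```yaml
# Encrypt the secrets file on the control node ("prod" is an assumed vault ID):
#   ansible-vault encrypt --vault-id prod@prompt group_vars/prod/vault.yml
# Supply the same vault ID when running the playbook:
#   ansible-playbook site.yml --vault-id prod@prompt
---
# group_vars/prod/vars.yml: plaintext, references the encrypted value
db_password: "{{ vault_db_password }}"
---
# Task that consumes the secret without printing it in the run output
- name: Render database credentials
  ansible.builtin.template:
    src: db.conf.j2              # hypothetical template
    dest: /etc/app/db.conf
    mode: "0600"
  no_log: true
```

Keeping the variable name visible in the plaintext file makes it searchable, while the secret itself lives only in the encrypted file.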

Hands-on example

1. Model the non-secret inventory variables in plaintext; secret values belong in a separate Vault-encrypted file.

2. Create inventory variables:

group_vars/web.yml:

app_port: 8080
package_name_by_os:
  RedHat: httpd
  Debian: apache2

host_vars/web1.yml:

app_port: 9090

3. Use facts and variables in a task:

- name: Install OS-specific web package
  ansible.builtin.package:
    name: "{{ package_name_by_os[ansible_facts['os_family']] }}"
    state: present
  when: ansible_facts['os_family'] in package_name_by_os

4. Run ansible-playbook site.yml -e app_port=7070 in a lab to see extra vars override lower-precedence values.

What is the difference between a variable and a fact? (Intermediate)

Answer

A variable is a value you define or pass into Ansible. A fact is information discovered from a managed host, such as OS family, IP addresses, CPU count, memory, and distribution version. Facts are gathered at runtime and can drive conditional logic.

Technical explanation

Facts are host-derived and can change over time.

Variables can come from inventory, roles, playbooks, command line, or registered results.

Caching facts can speed large runs but requires freshness awareness.

Prefer idempotent modules over shell so repeated runs are safe and change reporting is meaningful.

Separate reusable role logic from inventory-specific variables so the same automation works across environments.

Run lint, syntax checks, check mode where useful, and staged rollouts before production-wide changes.

Hands-on example

1. Combine variables you define with facts gathered from the host.

2. Create inventory variables:

group_vars/web.yml:

app_port: 8080
package_name_by_os:
  RedHat: httpd
  Debian: apache2

host_vars/web1.yml:

app_port: 9090

3. Use facts and variables in a task:

- name: Install OS-specific web package
  ansible.builtin.package:
    name: "{{ package_name_by_os[ansible_facts['os_family']] }}"
    state: present
  when: ansible_facts['os_family'] in package_name_by_os

4. Run ansible-playbook site.yml -e app_port=7070 in a lab to see extra vars override lower-precedence values.

How does Ansible gather facts, and how do you use them in conditionals? (Intermediate)

Answer

Ansible gathers facts with the setup module at the start of a play when gather_facts is true, which is the default. Those facts are available under ansible_facts and commonly used in when clauses to branch by OS family, distribution version, network interface, or hardware capability.

Technical explanation

gather_facts can be disabled for speed when facts are not needed.

Fact names help write portable tasks such as ansible_facts['os_family'] == 'RedHat'.

For custom facts, use local facts or set_fact carefully.

Prefer idempotent modules over shell so repeated runs are safe and change reporting is meaningful.

Separate reusable role logic from inventory-specific variables so the same automation works across environments.

Run lint, syntax checks, check mode where useful, and staged rollouts before production-wide changes.
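
The sketch below disables implicit gathering, collects a reduced set of facts explicitly, and then branches on one of them. It assumes the min subset includes the distribution facts that provide os_family; verify that on your Ansible version before relying on it.

```yaml
- name: Branch on gathered facts
  hosts: web
  gather_facts: false                # skip the implicit setup run
  become: true
  tasks:
    - name: Gather only a minimal fact subset
      ansible.builtin.setup:
        gather_subset:
          - min                      # assumed to contain os_family

    - name: Install Apache on RedHat-family hosts
      ansible.builtin.package:
        name: httpd
        state: present
      when: ansible_facts['os_family'] == 'RedHat'

    - name: Install Apache on Debian-family hosts
      ansible.builtin.package:
        name: apache2
        state: present
      when: ansible_facts['os_family'] == 'Debian'
```

On large fleets, gathering fewer facts shortens each run, at the cost of having to know which facts the later tasks actually need.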

Hands-on example

1. Use a gathered fact in a when conditional.

2. Create inventory variables:

group_vars/web.yml:

app_port: 8080
package_name_by_os:
  RedHat: httpd
  Debian: apache2

host_vars/web1.yml:

app_port: 9090

3. Use facts and variables in a task:

- name: Install OS-specific web package
  ansible.builtin.package:
    name: "{{ package_name_by_os[ansible_facts['os_family']] }}"
    state: present
  when: ansible_facts['os_family'] in package_name_by_os

4. Run ansible-playbook site.yml -e app_port=7070 in a lab to see extra vars override lower-precedence values.

What is a register, and how do you use the result of one task in another? (Advanced)

Answer

register stores the result of a task in a variable. The result can include stdout, stderr, return code, changed status, skipped status, and module-specific fields. I use it to make later tasks conditional on real command or module output.

Technical explanation

Registered variables are scoped to the host executing the task.

Check stdout_lines for line-oriented output and rc for command status.

Registered results are commonly combined with when, changed_when, and failed_when.

Prefer idempotent modules over shell so repeated runs are safe and change reporting is meaningful.

Separate reusable role logic from inventory-specific variables so the same automation works across environments.

Run lint, syntax checks, check mode where useful, and staged rollouts before production-wide changes.

Hands-on example

1. Register a task result and let a later task react to it.

2. Example:

- name: Check app health
  ansible.builtin.uri:
    url: http://localhost:8080/health
    status_code: 200
  register: health
  failed_when: false

- name: Restart app only when health check failed
  ansible.builtin.service:
    name: myapp  # hypothetical service name
    state: restarted
  when: health.status | default(0) != 200

- name: Install required packages
  ansible.builtin.package:
    name: "{{ item }}"
    state: present
  loop:
    - nginx
    - curl
    - jq

3. Run once with the service healthy and once after stopping it; confirm the conditional task changes behavior based on the registered result.

4. Use loop_control.label when iterating over dictionaries to keep output readable.

What is the when clause, and how do you write conditional tasks? (Advanced)

Answer

The when clause makes a task conditional. It evaluates a Jinja2 expression without wrapping the entire expression in {{ }}. I use when for OS-specific tasks, feature flags, registered results, and rollout conditions.

Technical explanation

when expressions should be readable and based on clear variables or facts.

Use boolean variables instead of complex string comparisons where possible.

Combine conditions with and, or, in, is defined, and filters.

Prefer idempotent modules over shell so repeated runs are safe and change reporting is meaningful.

Separate reusable role logic from inventory-specific variables so the same automation works across environments.

Run lint, syntax checks, check mode where useful, and staged rollouts before production-wide changes.
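
A brief sketch of combining conditions: items in a when list are joined with a logical and. The app_port and enable_firewall variables are assumptions, and debug stands in for a real firewall task.

```yaml
- name: Report the port that would be opened on supported hosts
  ansible.builtin.debug:
    msg: "Would open port {{ app_port }}"
  when:
    - ansible_facts['os_family'] in ['RedHat', 'Debian']
    - app_port is defined
    - enable_firewall | default(false) | bool   # assumed feature flag
```

Note that the conditions are bare Jinja2 expressions; only the msg string, which is a template rather than a condition, uses {{ }}.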

Hands-on example

1. Make a task conditional on a registered result using when.

2. Example:

- name: Check app health
  ansible.builtin.uri:
    url: http://localhost:8080/health
    status_code: 200
  register: health
  failed_when: false

- name: Restart app only when health check failed
  ansible.builtin.service:
    name: myapp  # hypothetical service name
    state: restarted
  when: health.status | default(0) != 200

- name: Install required packages
  ansible.builtin.package:
    name: "{{ item }}"
    state: present
  loop:
    - nginx
    - curl
    - jq

3. Run once with the service healthy and once after stopping it; confirm the conditional task changes behavior based on the registered result.

4. Use loop_control.label when iterating over dictionaries to keep output readable.

What is a loop in Ansible, and how do you iterate over a list? (Advanced)

Answer

A loop repeats a task for each item in a list or other iterable. It is used to install multiple packages, create users, render multiple files, or call APIs for a set of inputs. For complex data, each item can be a dictionary with named fields.

Technical explanation

loop replaces older with_items patterns in modern playbooks.

Use loop_control to customize the loop variable or label.

For nested or complex loops, consider restructuring data to keep tasks readable.

Prefer idempotent modules over shell so repeated runs are safe and change reporting is meaningful.

Separate reusable role logic from inventory-specific variables so the same automation works across environments.

Run lint, syntax checks, check mode where useful, and staged rollouts before production-wide changes.
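
For items that are dictionaries rather than plain strings, a sketch like the following reads named fields from each item and uses loop_control.label to keep the output short. The user names and shells are illustrative.

```yaml
- name: Create application users
  ansible.builtin.user:
    name: "{{ item.name }}"
    shell: "{{ item.shell }}"
    state: present
  loop:
    - { name: deploy, shell: /bin/bash }
    - { name: metrics, shell: /usr/sbin/nologin }
  loop_control:
    label: "{{ item.name }}"     # print only the user name for each iteration
```

Without the label, each iteration prints the whole item dictionary, which becomes noisy and can expose values you would rather not log.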

Hands-on example

1. Iterate over a list with loop, alongside register and when.

2. Example:

- name: Check app health
  ansible.builtin.uri:
    url: http://localhost:8080/health
    status_code: 200
  register: health
  failed_when: false

- name: Restart app only when health check failed
  ansible.builtin.service:
    name: myapp  # hypothetical service name
    state: restarted
  when: health.status | default(0) != 200

- name: Install required packages
  ansible.builtin.package:
    name: "{{ item }}"
    state: present
  loop:
    - nginx
    - curl
    - jq

3. Run once with the service healthy and once after stopping it; confirm the conditional task changes behavior based on the registered result.

4. Use loop_control.label when iterating over dictionaries to keep output readable.

What is the serial keyword, and how does it enable rolling updates? (Advanced)

Answer

serial limits how many hosts in a play are processed at one time. It enables rolling updates by applying changes to a small batch, validating health, and then moving to the next batch instead of changing the whole fleet at once.

Technical explanation

serial can be a number, percentage, or list of batch sizes.

Combine it with health checks and load balancer draining for safe deployments.

A rolling update is only safe if each batch is validated before the next batch.

Prefer idempotent modules over shell so repeated runs are safe and change reporting is meaningful.

Separate reusable role logic from inventory-specific variables so the same automation works across environments.

Run lint, syntax checks, check mode where useful, and staged rollouts before production-wide changes.
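
When serial is given as a list, batch sizes can grow as confidence builds. The sizes below are an illustrative choice, and debug stands in for the real upgrade tasks.

```yaml
- name: Rolling app upgrade with growing batches
  hosts: app
  serial:
    - 1          # canary: a single host first
    - "25%"      # then a quarter of the hosts in the play
    - "50%"      # then half; the last size repeats until no hosts remain
  tasks:
    - name: Placeholder for the real upgrade tasks
      ansible.builtin.debug:
        msg: "Upgrading {{ inventory_hostname }}"
```

Starting with one host limits the damage of a bad release to a single machine before larger batches begin.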

Hands-on example

1. Orchestrate a rolling update in small batches with serial.

2. Playbook skeleton:

- name: Rolling app upgrade
  hosts: app
  serial: 2
  max_fail_percentage: 20
  tasks:
    - name: Drain host from load balancer
      ansible.builtin.command: /usr/local/bin/lbctl drain {{ inventory_hostname }}
      delegate_to: localhost

    - name: Upgrade app package
      ansible.builtin.package:
        name: myapp  # hypothetical package name
        state: latest
      notify: Restart app

    - name: Restart now so the health check sees the new version
      ansible.builtin.meta: flush_handlers

    - name: Wait for health
      ansible.builtin.uri:
        url: http://{{ inventory_hostname }}:8080/health
        status_code: 200
      register: health
      retries: 12
      delay: 5
      until: health.status == 200

    - name: Add host back to load balancer
      ansible.builtin.command: /usr/local/bin/lbctl enable {{ inventory_hostname }}
      delegate_to: localhost

  handlers:
    - name: Restart app
      ansible.builtin.service:
        name: myapp  # hypothetical service name
        state: restarted

3. Test against a staging group with serial: 1, then increase batch size after measuring recovery time.

4. Confirm a failed health check stops the rollout before most hosts are touched.

What is max_fail_percentage, and how does it protect a rollout? (Advanced)

Answer

max_fail_percentage stops a play when failures exceed an allowed percentage within a batch. It protects rollouts by preventing a bad change from continuing across the fleet after too many hosts fail.

Technical explanation

When serial is set, the threshold is evaluated per batch, and the play aborts only when the failure percentage exceeds the value; matching it exactly is not enough.

Tune it based on fleet size and service redundancy.

Combine it with any_errors_fatal for stricter orchestration when one failure should stop all.

Prefer idempotent modules over shell so repeated runs are safe and change reporting is meaningful.

Separate reusable role logic from inventory-specific variables so the same automation works across environments.

Run lint, syntax checks, check mode where useful, and staged rollouts before production-wide changes.
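
A worked case makes the threshold concrete: with serial: 5 and max_fail_percentage: 20, one failed host is exactly 20 percent and the play continues, while two failed hosts are 40 percent and the play aborts. For the stricter option mentioned above, the sketch below uses any_errors_fatal; the config-check command path is an assumption borrowed from other examples on this page.

```yaml
# Sketch only: the check command is an assumed path.
- name: Strict rolling upgrade
  hosts: app
  serial: 5
  any_errors_fatal: true   # a single failed host ends the play for all hosts
  tasks:
    - name: Check that the app config is valid
      ansible.builtin.command: /usr/local/bin/app --check-config /etc/app/app.conf
      changed_when: false
```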

Hands-on example

1. Orchestrate a rolling update that aborts when too many hosts in a batch fail.

2. Playbook skeleton:

- name: Rolling app upgrade
  hosts: app
  serial: 2
  max_fail_percentage: 20
  tasks:
    - name: Drain host from load balancer
      ansible.builtin.command: /usr/local/bin/lbctl drain {{ inventory_hostname }}
      delegate_to: localhost

    - name: Upgrade app package
      ansible.builtin.package:
        # app_package is a placeholder; set it to a version-pinned package
        # spec so that state: present installs the approved release.
        name: "{{ app_package }}"
        state: present
      notify: Restart app

    - name: Run the restart handler now instead of at the end of the play
      ansible.builtin.meta: flush_handlers

    - name: Wait for health
      ansible.builtin.uri:
        url: http://{{ inventory_hostname }}:8080/health
        status_code: 200
      register: health   # required by the until condition below
      retries: 12
      delay: 5
      until: health.status == 200

    - name: Add host back to load balancer
      ansible.builtin.command: /usr/local/bin/lbctl enable {{ inventory_hostname }}
      delegate_to: localhost

3. Test against a staging group with serial: 1, then increase batch size after measuring recovery time.

4. Confirm a failed health check stops the rollout before most hosts are touched.

What is delegate_to, and when would you use it? (Advanced)

Answer

delegate_to runs a task on a different host than the current inventory host. I use it for load balancer registration, API calls from localhost, database migration coordination, bastion-side checks, or centralized monitoring updates during a rollout.

Technical explanation

delegate_to: localhost is common for API calls from the control node.

By default, facts gathered by a delegated task are assigned to the current inventory host; setting delegate_facts: true assigns them to the delegated host instead.

Delegation is a clean way to coordinate external systems during per-host operations.

Prefer idempotent modules over shell so repeated runs are safe and change reporting is meaningful.

Separate reusable role logic from inventory-specific variables so the same automation works across environments.

Run lint, syntax checks, check mode where useful, and staged rollouts before production-wide changes.
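
The delegate_facts behavior is easier to follow with a small play. This is a sketch under assumptions: the dbservers group name is hypothetical, and the play targets the app group only for illustration. Each fact set ends up stored under the database host it was gathered from rather than under the app host running the task.

```yaml
# Sketch only: "dbservers" is a hypothetical inventory group.
- name: Gather database facts while targeting app hosts
  hosts: app
  tasks:
    - name: Collect facts from each database server
      ansible.builtin.setup:
      delegate_to: "{{ item }}"
      delegate_facts: true   # store facts on the delegated host, not the app host
      loop: "{{ groups['dbservers'] }}"
```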

Hands-on example

1. Orchestrate a rolling update that delegates load balancer calls to the control node.

2. Playbook skeleton:

- name: Rolling app upgrade
  hosts: app
  serial: 2
  max_fail_percentage: 20
  tasks:
    - name: Drain host from load balancer
      ansible.builtin.command: /usr/local/bin/lbctl drain {{ inventory_hostname }}
      delegate_to: localhost

    - name: Upgrade app package
      ansible.builtin.package:
        # app_package is a placeholder; set it to a version-pinned package
        # spec so that state: present installs the approved release.
        name: "{{ app_package }}"
        state: present
      notify: Restart app

    - name: Run the restart handler now instead of at the end of the play
      ansible.builtin.meta: flush_handlers

    - name: Wait for health
      ansible.builtin.uri:
        url: http://{{ inventory_hostname }}:8080/health
        status_code: 200
      register: health   # required by the until condition below
      retries: 12
      delay: 5
      until: health.status == 200

    - name: Add host back to load balancer
      ansible.builtin.command: /usr/local/bin/lbctl enable {{ inventory_hostname }}
      delegate_to: localhost

3. Test against a staging group with serial: 1, then increase batch size after measuring recovery time.

4. Confirm a failed health check stops the rollout before most hosts are touched.

What is the difference between Ansible roles and collections? (Advanced)

Answer

A role is a reusable automation structure for tasks, handlers, defaults, and templates. A collection is a distribution package that can include roles, modules, plugins, playbooks, and documentation under a namespace. Collections are how Ansible content is shared and versioned at ecosystem scale.

Technical explanation

Collections use fully qualified collection names such as ansible.builtin.copy or community.mysql.mysql_db.

Roles can live inside collections, but roles can also be standalone.

Pin collection versions in requirements.yml for reproducible automation.

Prefer idempotent modules over shell so repeated runs are safe and change reporting is meaningful.

Separate reusable role logic from inventory-specific variables so the same automation works across environments.

Run lint, syntax checks, check mode where useful, and staged rollouts before production-wide changes.

Hands-on example

1. Package reusable Ansible content as a role, and as a collection when modules or plugins ship with it.

2. Create a role structure:

roles/web/
  defaults/main.yml
  tasks/main.yml
  handlers/main.yml
  templates/nginx.conf.j2
  files/
  meta/main.yml

3. Call the role from a playbook:

- name: Configure web tier
  hosts: web
  roles:
    - role: web
      vars:
        web_listen_port: 8080

4. If distributing modules/plugins/roles together, package them as a collection and pin it in requirements.yml.
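
A pinned requirements.yml could look like the sketch below. The collection names are real, but the version ranges are illustrative assumptions; use the versions your team has actually tested.

```yaml
# requirements.yml (version ranges are illustrative)
collections:
  - name: community.mysql
    version: ">=3.8.0,<4.0.0"
  - name: ansible.posix
    version: ">=1.5.0,<2.0.0"
```

Install the pinned set with ansible-galaxy collection install -r requirements.yml so that CI and operators resolve the same content.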

What is a Jinja2 template, and how is it used in Ansible? (Advanced)

Answer

A Jinja2 template is a text file with variables, loops, and conditionals that Ansible renders for each host. It is commonly used for application config files, systemd units, Nginx configs, and Kubernetes manifests where values vary by host or environment.

Technical explanation

Templates use host variables and facts to produce host-specific files.

Validate rendered configs before replacing critical files when modules support validate.

A template change often notifies a handler to reload or restart a service.

Prefer idempotent modules over shell so repeated runs are safe and change reporting is meaningful.

Separate reusable role logic from inventory-specific variables so the same automation works across environments.

Run lint, syntax checks, check mode where useful, and staged rollouts before production-wide changes.

Hands-on example

1. Create templates/app.conf.j2:

port={{ app_port }}
environment={{ env }}
{% for upstream in upstreams %}
upstream={{ upstream }}
{% endfor %}

2. Render it safely:

- name: Render app config
  ansible.builtin.template:
    src: app.conf.j2
    dest: /etc/app/app.conf
    owner: root
    group: root
    mode: '0644'
    validate: '/usr/local/bin/app --check-config %s'
  notify: Restart app

3. Run with --diff to show exactly what changed before the handler restarts the service.

4. Add defaults for app_port and upstreams so the role works predictably.
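
The defaults from step 4 might look like the sketch below. The role path and every value are assumptions chosen for illustration; inventory variables in group_vars or host_vars override them per environment.

```yaml
# roles/app/defaults/main.yml (hypothetical role name; values are illustrative)
app_port: 8080
env: dev
upstreams:
  - "10.0.0.11:8080"
  - "10.0.0.12:8080"
```

Because role defaults have the lowest variable precedence, they act as a safe fallback rather than a setting that silently wins.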

How do you run an Ansible playbook in check (dry-run) mode? (Advanced)

Answer

Check mode is Ansible's dry-run mode, invoked with --check. It predicts changes without applying them for modules that support check mode. I combine it with --diff for file changes, but I still treat it as a signal rather than a perfect guarantee for every module.

Technical explanation

Not every module fully supports check mode, so inspect skipped or unsupported tasks.

--diff shows what file changes would be made, which is useful for review.

Use check mode in PR or pre-prod, not as the only production safety gate.

Prefer idempotent modules over shell so repeated runs are safe and change reporting is meaningful.

Separate reusable role logic from inventory-specific variables so the same automation works across environments.

Run lint, syntax checks, check mode where useful, and staged rollouts before production-wide changes.
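
Tasks can also opt in or out of check mode individually, which helps when a dry run would otherwise skip a read-only command that later tasks depend on. The sketch below is illustrative: the --version and migrate arguments of the app binary are assumptions, while check_mode and ansible_check_mode are standard Ansible features.

```yaml
# Sketch only: the app binary's arguments are assumed.
- name: Read the running app version (read-only, so run it even under --check)
  ansible.builtin.command: /usr/local/bin/app --version
  register: app_version
  changed_when: false
  check_mode: false

- name: Skip the migration step during a dry run
  ansible.builtin.command: /usr/local/bin/app migrate
  when: not ansible_check_mode
```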

Hands-on example

1. Add check mode to the Ansible safety checks that run before a deployment.

2. CI commands:

ansible-playbook --syntax-check site.yml
ansible-lint .
yamllint .
ansible-playbook -i inventory/stage site.yml --check --diff

3. Avoid state: latest in production unless the rollout is explicitly an upgrade window. Prefer pinned versions or approved repositories:

- name: Install approved app version
  ansible.builtin.package:
    # app_package_spec is a placeholder for a version-pinned package spec.
    name: "{{ app_package_spec }}"
    state: present

4. For mixed fleets, drive differences through group_vars, host_vars, and facts rather than copied playbooks.

5. Gate production runs behind review and record playbook version, inventory, operator, and output artifact.

What is the difference between state: present and state: latest, and the risk of latest? (Advanced)

Answer

state: present ensures a package or object exists, usually without upgrading if it is already installed. state: latest upgrades to the newest available version. latest is risky in production because repository changes can create unplanned upgrades and inconsistent versions across batches.

Technical explanation

latest couples deployment to repository state at execution time.

present is more stable, especially when package versions are pinned.

Use explicit versions or staged repositories for controlled upgrades.

Prefer idempotent modules over shell so repeated runs are safe and change reporting is meaningful.

Separate reusable role logic from inventory-specific variables so the same automation works across environments.

Run lint, syntax checks, check mode where useful, and staged rollouts before production-wide changes.
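
The contrast can be sketched with two tasks. This example assumes a Debian-family host (apt uses name=version for pinning) and a hypothetical package called myapp with an app_version variable defined in reviewed inventory. The never tag is a built-in Ansible tag that keeps the floating install from running unless upgrade-window is requested explicitly.

```yaml
# Sketch only: "myapp" and app_version are hypothetical.
- name: Pinned install (repeatable; the version comes from reviewed vars)
  ansible.builtin.apt:
    name: "myapp={{ app_version }}"
    state: present

- name: Floating install (result depends on the repository at run time)
  ansible.builtin.apt:
    name: myapp
    state: latest
  tags: [never, upgrade-window]
```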

Hands-on example

1. Add safety checks that keep package versions under control.

2. CI commands:

ansible-playbook --syntax-check site.yml
ansible-lint .
yamllint .
ansible-playbook -i inventory/stage site.yml --check --diff

3. Avoid state: latest in production unless the rollout is explicitly an upgrade window. Prefer pinned versions or approved repositories:

- name: Install approved app version
  ansible.builtin.package:
    # app_package_spec is a placeholder for a version-pinned package spec.
    name: "{{ app_package_spec }}"
    state: present

4. For mixed fleets, drive differences through group_vars, host_vars, and facts rather than copied playbooks.

5. Gate production runs behind review and record playbook version, inventory, operator, and output artifact.

How do you handle host-specific differences across a mixed fleet? (Advanced)

Answer

For a mixed fleet, I separate common logic from host-specific differences using groups, group_vars, host_vars, facts, OS-family conditionals, role defaults, and clearly named variables. I avoid copying playbooks per host because that creates drift and unreviewed snowflakes.

Technical explanation

OS-specific package names and service names belong in variables or vars files.

Facts can select the correct branch without duplicating entire roles.

Document intentional host exceptions so they do not become unmanaged snowflakes.

Prefer idempotent modules over shell so repeated runs are safe and change reporting is meaningful.

Separate reusable role logic from inventory-specific variables so the same automation works across environments.

Run lint, syntax checks, check mode where useful, and staged rollouts before production-wide changes.
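
One common way to keep OS differences in variables is to load a vars file named after the OS family fact. The sketch below assumes a role with vars/Debian.yml and vars/RedHat.yml; the variable names web_package and web_service are hypothetical.

```yaml
# vars/Debian.yml would define: web_package: apache2, web_service: apache2
# vars/RedHat.yml would define: web_package: httpd,   web_service: httpd
- name: Load variables for this host's OS family
  ansible.builtin.include_vars: "{{ ansible_facts['os_family'] }}.yml"

- name: Install the web server package
  ansible.builtin.package:
    name: "{{ web_package }}"
    state: present

- name: Ensure the web server is running
  ansible.builtin.service:
    name: "{{ web_service }}"
    state: started
    enabled: true
```

The tasks stay identical across the fleet; only the small vars files differ, which keeps the differences reviewable.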

Hands-on example

1. Add Ansible safety checks for: How do you handle host-specific differences across a mixed fleet?

2. CI commands:

ansible-playbook --syntax-check site.yml

ansible-lint .

yamllint .

ansible-playbook -i inventory/stage site.yml --check --diff

3. Avoid state: latest in production unless the rollout is explicitly an upgrade window. Prefer pinned versions or approved repositories:

- name: Install approved app version

ansible.builtin.package:

state: present

4. For mixed fleets, drive differences through group_vars, host_vars, and facts rather than copied playbooks.

5. Gate production runs behind review and record playbook version, inventory, operator, and output artifact.

How would you orchestrate a rolling, health-checked upgrade across servers with Ansible? (Advanced)

Answer

For a rolling, health-checked upgrade, I use serial to batch hosts, drain each host from traffic, upgrade packages or deploy artifacts, restart services through handlers, run health checks, re-add the host, and fail fast if health does not recover.

Technical explanation

The workflow should remove a host from service before mutation and add it back only after health passes.

Handlers should restart services only when files or packages changed.

A failed batch should stop the rollout and leave enough capacity serving traffic.

Prefer idempotent modules over shell so repeated runs are safe and change reporting is meaningful.

Separate reusable role logic from inventory-specific variables so the same automation works across environments.

Run lint, syntax checks, check mode where useful, and staged rollouts before production-wide changes.
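
A handler that restarts the service only on change could be declared as in the sketch below. The service name app is an assumption; the handler name has to match the notify value used by the task that triggers it.

```yaml
# Play-level section, alongside tasks (service name is assumed).
handlers:
  - name: Restart app        # must match the notify: value exactly
    ansible.builtin.service:
      name: app
      state: restarted
```

Handlers normally run at the end of the play, so a rolling upgrade that health-checks each host needs meta: flush_handlers before the check.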

Hands-on example

1. Orchestrate a rolling, health-checked upgrade.

2. Playbook skeleton:

- name: Rolling app upgrade
  hosts: app
  serial: 2
  max_fail_percentage: 20
  tasks:
    - name: Drain host from load balancer
      ansible.builtin.command: /usr/local/bin/lbctl drain {{ inventory_hostname }}
      delegate_to: localhost

    - name: Upgrade app package
      ansible.builtin.package:
        # app_package is a placeholder; set it to a version-pinned package
        # spec so that state: present installs the approved release.
        name: "{{ app_package }}"
        state: present
      notify: Restart app

    - name: Run the restart handler now instead of at the end of the play
      ansible.builtin.meta: flush_handlers

    - name: Wait for health
      ansible.builtin.uri:
        url: http://{{ inventory_hostname }}:8080/health
        status_code: 200
      register: health   # required by the until condition below
      retries: 12
      delay: 5
      until: health.status == 200

    - name: Add host back to load balancer
      ansible.builtin.command: /usr/local/bin/lbctl enable {{ inventory_hostname }}
      delegate_to: localhost

3. Test against a staging group with serial: 1, then increase batch size after measuring recovery time.

4. Confirm a failed health check stops the rollout before most hosts are touched.

How do you integrate Ansible into CI/CD and keep playbooks tested? (Advanced)

Answer

I integrate Ansible into CI/CD by linting YAML and Ansible rules, testing roles with Molecule or ephemeral instances, running check mode where useful, scanning secrets, requiring code review, and using controlled credentials for production runs.

Technical explanation

Molecule can test roles with containers or VMs before merge.

CI should run syntax checks, linting, and targeted test scenarios.

Production execution should be auditable and use locked dependencies.

Prefer idempotent modules over shell so repeated runs are safe and change reporting is meaningful.

Separate reusable role logic from inventory-specific variables so the same automation works across environments.

Run lint, syntax checks, check mode where useful, and staged rollouts before production-wide changes.
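
As one possible shape for the CI stage, the sketch below assumes GitHub Actions; the workflow file name, job name, and Python version are illustrative, and any CI system that can run the same commands works equally well. The check-mode run against a staging inventory is left out because it needs credentials that a pull request job usually should not hold.

```yaml
# .github/workflows/ansible-ci.yml (assumes GitHub Actions; names are illustrative)
name: ansible-ci
on:
  pull_request:
jobs:
  lint-and-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - name: Install tooling
        run: pip install ansible ansible-lint yamllint
      - name: Syntax check
        run: ansible-playbook --syntax-check site.yml
      - name: Lint
        run: |
          yamllint .
          ansible-lint .
```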

Hands-on example

1. Add Ansible safety checks to the CI pipeline.

2. CI commands:

ansible-playbook --syntax-check site.yml
ansible-lint .
yamllint .
ansible-playbook -i inventory/stage site.yml --check --diff

3. Avoid state: latest in production unless the rollout is explicitly an upgrade window. Prefer pinned versions or approved repositories:

- name: Install approved app version
  ansible.builtin.package:
    # app_package_spec is a placeholder for a version-pinned package spec.
    name: "{{ app_package_spec }}"
    state: present

4. For mixed fleets, drive differences through group_vars, host_vars, and facts rather than copied playbooks.

5. Gate production runs behind review and record playbook version, inventory, operator, and output artifact.

What is ansible-lint, and what does it catch? (Advanced)

Answer

ansible-lint is a static analysis tool for Ansible content. It catches risky patterns, style issues, deprecated syntax, missing names, non-idempotent commands, YAML issues, and best-practice violations before playbooks reach production.

Technical explanation

ansible-lint codifies team standards and Ansible best practices.

Rule violations should either be fixed or explicitly justified with narrow skips.

Run it locally and in CI so feedback is early.

Prefer idempotent modules over shell so repeated runs are safe and change reporting is meaningful.

Separate reusable role logic from inventory-specific variables so the same automation works across environments.

Run lint, syntax checks, check mode where useful, and staged rollouts before production-wide changes.
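
The difference between fixing a finding and skipping it narrowly can be sketched with the no-changed-when rule, which flags command tasks that do not declare when they change something. The legacy-cleanup script is a hypothetical example of a task the team has decided to exempt.

```yaml
# Preferred: fix the finding. changed_when makes the command's result explicit.
- name: Drain host from load balancer
  ansible.builtin.command: /usr/local/bin/lbctl drain {{ inventory_hostname }}
  delegate_to: localhost
  changed_when: true

# Fallback: a narrow skip for one rule on one task (hypothetical script).
- name: Run legacy cleanup script
  ansible.builtin.command: /usr/local/bin/legacy-cleanup  # noqa: no-changed-when
```

A per-task noqa comment keeps the exemption visible in review, unlike a global skip list that hides every future violation of the rule.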

Hands-on example

1. Add ansible-lint to the Ansible safety checks.

2. CI commands:

ansible-playbook --syntax-check site.yml
ansible-lint .
yamllint .
ansible-playbook -i inventory/stage site.yml --check --diff

3. Avoid state: latest in production unless the rollout is explicitly an upgrade window. Prefer pinned versions or approved repositories:

- name: Install approved app version
  ansible.builtin.package:
    # app_package_spec is a placeholder for a version-pinned package spec.
    name: "{{ app_package_spec }}"
    state: present

4. For mixed fleets, drive differences through group_vars, host_vars, and facts rather than copied playbooks.

5. Gate production runs behind review and record playbook version, inventory, operator, and output artifact.

When would you choose Ansible over Terraform and vice versa? (Advanced)

Answer

I choose Terraform when I need to provision infrastructure and own its lifecycle through cloud APIs. I choose Ansible when I need to configure systems, orchestrate tasks, or perform procedural changes across hosts. Terraform is best at managing a desired infrastructure graph; Ansible is best at operational automation and host configuration.

Technical explanation

The dividing line is lifecycle ownership: Terraform owns cloud objects; Ansible configures or orchestrates running systems.

Terraform should not be used as a general remote command runner.

Ansible should not replace Terraform for complex graph-based cloud dependencies.

Prefer idempotent modules over shell so repeated runs are safe and change reporting is meaningful.

Separate reusable role logic from inventory-specific variables so the same automation works across environments.

Run lint, syntax checks, check mode where useful, and staged rollouts before production-wide changes.

Hands-on example

1. Combine Terraform and Ansible with a clear ownership split.

2. Terraform provisions instances and outputs inventory data:

output "web_private_ips" { value = aws_instance.web[*].private_ip }

3. CI writes a temporary inventory from the Terraform output, giving each host a unique inventory name (web1, web2, and so on) so that entries do not overwrite one another:

terraform output -json web_private_ips | jq -r '.[]' | awk '{print "web" NR " ansible_host=" $1}' > inventory.ini

4. Then Ansible configures the hosts:

ansible-playbook -i inventory.ini site.yml --check --diff

ansible-playbook -i inventory.ini site.yml

5. Document that Terraform owns cloud objects and Ansible owns host configuration to prevent dual ownership.

Can Terraform and Ansible be used together, and how would you combine them? (Advanced)

Answer

Terraform and Ansible work well together. Terraform can provision infrastructure and output inventory or endpoints. Ansible can then configure hosts or deploy software. The key is to keep ownership boundaries clear so both tools do not fight over the same setting.

Technical explanation

Use Terraform outputs to produce dynamic inventory or publish endpoints.

Run Ansible after Terraform only when infrastructure is ready and reachable.

Avoid dual ownership of tags, security groups, config files, or Kubernetes fields.

Prefer idempotent modules over shell so repeated runs are safe and change reporting is meaningful.

Separate reusable role logic from inventory-specific variables so the same automation works across environments.

Run lint, syntax checks, check mode where useful, and staged rollouts before production-wide changes.
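
An alternative to writing an inventory file from Terraform outputs is a dynamic inventory plugin that discovers hosts by tag. The sketch below uses the amazon.aws.aws_ec2 inventory plugin and rests on assumptions: Terraform is expected to tag the instances with Role=web, the region is illustrative, and the control node needs the amazon.aws collection and AWS credentials.

```yaml
# inventory/aws_ec2.yml (the file name must end in aws_ec2.yml or aws_ec2.yaml)
plugin: amazon.aws.aws_ec2
regions:
  - us-east-1                       # illustrative region
filters:
  tag:Role: web                     # assumes Terraform sets this tag
  instance-state-name: running
hostnames:
  - private-ip-address
keyed_groups:
  - key: tags.Role                  # produces a group such as role_web
    prefix: role
```

With this approach Terraform still owns the tags, and Ansible only reads them, which keeps the ownership boundary intact.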

Hands-on example

1. Combine Terraform and Ansible in one pipeline.

2. Terraform provisions instances and outputs inventory data:

output "web_private_ips" { value = aws_instance.web[*].private_ip }

3. CI writes a temporary inventory from the Terraform output, giving each host a unique inventory name (web1, web2, and so on) so that entries do not overwrite one another:

terraform output -json web_private_ips | jq -r '.[]' | awk '{print "web" NR " ansible_host=" $1}' > inventory.ini

4. Then Ansible configures the hosts:

ansible-playbook -i inventory.ini site.yml --check --diff

ansible-playbook -i inventory.ini site.yml

5. Document that Terraform owns cloud objects and Ansible owns host configuration to prevent dual ownership.

What is Kustomize, and how does it differ from Helm? (Advanced)

Answer

Kustomize customizes Kubernetes YAML without templates by composing bases and overlays with patches, generators, labels, and name transformations. Helm packages and templates charts using values. Kustomize is patch-oriented and YAML-native; Helm is package and template-oriented.

Technical explanation

Kustomize overlays are easier to review because the base Kubernetes YAML remains visible.

Helm is better for reusable application packaging and chart distribution.

Both can be part of GitOps, but uncontrolled template complexity hurts reviewability.

Keep source manifests or IaC definitions readable enough that reviewers can understand the final desired state.

Use overlays, modules, or roles for reuse, but keep environment-specific differences explicit and reviewable.

Validate generated output in CI before applying it through kubectl, Argo CD, Terraform, or Ansible.

Hands-on example

1. Create a Kustomize base and a prod overlay.

2. Base files:

base/deployment.yaml
base/service.yaml
base/kustomization.yaml

base/kustomization.yaml:

resources:
  - deployment.yaml
  - service.yaml
commonLabels:
  app.kubernetes.io/name: payments

3. prod overlay (overlays/prod/kustomization.yaml):

resources:
  - ../../base
namePrefix: prod-
namespace: payments-prod
images:
  - name: ghcr.io/company/payments
    newTag: 1.8.4
configMapGenerator:
  - name: app-config
    literals:
      - LOG_LEVEL=info
patches:
  - path: replica-patch.yaml

4. Render and apply:

kubectl kustomize overlays/prod

kubectl diff -k overlays/prod

kubectl apply -k overlays/prod

5. In GitOps, point Argo CD at overlays/prod and let it render, compare, and sync the desired state.

What is a kustomization.yaml, and what does it define? (Advanced)

Answer

kustomization.yaml is the control file for Kustomize. It declares resources, bases, patches, generators, images, namespaces, labels, annotations, prefixes, suffixes, and other transformations that produce the final Kubernetes manifests.

Technical explanation

The file is declarative and represents the build recipe for final manifests.

It can pull resources from local paths or remote bases depending on policy.

Keep kustomization.yaml small: put common YAML in the base's resource files and only environment-specific deltas in overlays.

Keep source manifests or IaC definitions readable enough that reviewers can understand the final desired state.

Use overlays, modules, or roles for reuse, but keep environment-specific differences explicit and reviewable.

Validate generated output in CI before applying it through kubectl, Argo CD, Terraform, or Ansible.

Hands-on example

1. Create a Kustomize base and overlay, each with its own kustomization.yaml.

2. Base files:

base/deployment.yaml
base/service.yaml
base/kustomization.yaml

base/kustomization.yaml:

resources:
  - deployment.yaml
  - service.yaml
commonLabels:
  app.kubernetes.io/name: payments

3. prod overlay (overlays/prod/kustomization.yaml):

resources:
  - ../../base
namePrefix: prod-
namespace: payments-prod
images:
  - name: ghcr.io/company/payments
    newTag: 1.8.4
configMapGenerator:
  - name: app-config
    literals:
      - LOG_LEVEL=info
patches:
  - path: replica-patch.yaml

4. Render and apply:

kubectl kustomize overlays/prod

kubectl diff -k overlays/prod

kubectl apply -k overlays/prod

5. In GitOps, point Argo CD at overlays/prod and let it render, compare, and sync the desired state.

What is the difference between a base and an overlay in Kustomize? (Advanced)

Answer

A base is a reusable set of Kubernetes manifests. An overlay references a base and applies environment-specific changes such as replicas, image tags, labels, patches, namespace, or generated config. Bases define common intent; overlays define differences.

Technical explanation

The base should be environment-neutral.

Overlays should contain only the deltas needed for a target environment.

This model reduces copy-paste and makes production differences reviewable.

Keep source manifests or IaC definitions readable enough that reviewers can understand the final desired state.

Use overlays, modules, or roles for reuse, but keep environment-specific differences explicit and reviewable.

Validate generated output in CI before applying it through kubectl, Argo CD, Terraform, or Ansible.

Hands-on example

1. Create a Kustomize base and a prod overlay that carries only the prod-specific changes.

2. Base files:

base/deployment.yaml
base/service.yaml
base/kustomization.yaml

base/kustomization.yaml:

resources:
  - deployment.yaml
  - service.yaml
commonLabels:
  app.kubernetes.io/name: payments

3. prod overlay (overlays/prod/kustomization.yaml):

resources:
  - ../../base
namePrefix: prod-
namespace: payments-prod
images:
  - name: ghcr.io/company/payments
    newTag: 1.8.4
configMapGenerator:
  - name: app-config
    literals:
      - LOG_LEVEL=info
patches:
  - path: replica-patch.yaml

4. Render and apply:

kubectl kustomize overlays/prod

kubectl diff -k overlays/prod

kubectl apply -k overlays/prod

5. In GitOps, point Argo CD at overlays/prod and let it render, compare, and sync the desired state.

How do overlays customise a base without copying it? (Advanced)

Answer

Overlays customize a base by referencing it and applying patches or transformations rather than copying the YAML. This keeps common resources in one place and lets dev, staging, and prod differ only where they must.

Technical explanation

Patches are applied at build time, so source control contains both common YAML and environment deltas.

This prevents divergence where dev and prod copies silently drift apart.

Small overlays are easier to audit than full copied manifests.

Keep source manifests or IaC definitions readable enough that reviewers can understand the final desired state.

Use overlays, modules, or roles for reuse, but keep environment-specific differences explicit and reviewable.

Validate generated output in CI before applying it through kubectl, Argo CD, Terraform, or Ansible.
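
A patch file in an overlay only needs enough fields to identify the object plus the values that change. The sketch below shows what a replica patch might contain; the Deployment name payments and the replica count are assumptions, and the name is written as it appears in the base.

```yaml
# overlays/prod/replica-patch.yaml (name and count are assumed)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments   # the name as written in base/deployment.yaml
spec:
  replicas: 5
```

Everything not listed in the patch, such as containers, probes, and volumes, continues to come from the base.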

Hands-on example

1. Create a Kustomize base and an overlay that references it instead of copying it.

2. Base files:

base/deployment.yaml
base/service.yaml
base/kustomization.yaml

base/kustomization.yaml:

resources:
  - deployment.yaml
  - service.yaml
commonLabels:
  app.kubernetes.io/name: payments

3. prod overlay (overlays/prod/kustomization.yaml):

resources:
  - ../../base
namePrefix: prod-
namespace: payments-prod
images:
  - name: ghcr.io/company/payments
    newTag: 1.8.4
configMapGenerator:
  - name: app-config
    literals:
      - LOG_LEVEL=info
patches:
  - path: replica-patch.yaml

4. Render and apply:

kubectl kustomize overlays/prod

kubectl diff -k overlays/prod

kubectl apply -k overlays/prod

5. In GitOps, point Argo CD at overlays/prod and let it render, compare, and sync the desired state.

What are strategic merge patches versus JSON 6902 patches in Kustomize? (Advanced)

Answer

Strategic merge patches are Kubernetes-aware YAML patches that merge fields using Kubernetes schema behavior. JSON 6902 patches are explicit operation lists such as add, replace, and remove against JSON paths. Strategic merge is often simpler for Kubernetes objects; JSON 6902 is precise for targeted edits.

Technical explanation

Strategic merge depends on Kubernetes merge semantics and may not work for every custom resource the same way.

JSON 6902 is explicit and works well when you need precise path operations.

Choose the patch type that is easiest to review and least surprising.

Keep source manifests or IaC definitions readable enough that reviewers can understand the final desired state.

Use overlays, modules, or roles for reuse, but keep environment-specific differences explicit and reviewable.

Validate generated output in CI before applying it through kubectl, Argo CD, Terraform, or Ansible.
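
The two patch styles can express the same change, which makes the trade-off easy to compare. In the sketch below, both entries set the replica count; in practice only one would be used. The Deployment name payments is an assumption, and the JSON 6902 replace operation assumes spec.replicas is already set in the base.

```yaml
# overlays/prod/kustomization.yaml (fragment; use one of the two entries)
patches:
  # Strategic merge patch: a partial Deployment merged by Kubernetes schema rules.
  - path: replica-patch.yaml
  # JSON 6902 patch: explicit operations against a selected target.
  - target:
      group: apps
      version: v1
      kind: Deployment
      name: payments
    patch: |-
      - op: replace
        path: /spec/replicas
        value: 5
```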

Hands-on example

1. Create a Kustomize base and an overlay that patches it.

2. Base files:

base/deployment.yaml
base/service.yaml
base/kustomization.yaml

base/kustomization.yaml:

resources:
  - deployment.yaml
  - service.yaml
commonLabels:
  app.kubernetes.io/name: payments

3. prod overlay (overlays/prod/kustomization.yaml):

resources:
  - ../../base
namePrefix: prod-
namespace: payments-prod
images:
  - name: ghcr.io/company/payments
    newTag: 1.8.4
configMapGenerator:
  - name: app-config
    literals:
      - LOG_LEVEL=info
patches:
  - path: replica-patch.yaml

4. Render and apply:

kubectl kustomize overlays/prod

kubectl diff -k overlays/prod

kubectl apply -k overlays/prod

5. In GitOps, point Argo CD at overlays/prod and let it render, compare, and sync the desired state.

How does Kustomize handle environment-specific configuration? (Advanced)

Answer

Kustomize handles environment-specific configuration through overlays, patches, name prefixes or suffixes, namespaces, image overrides, common labels, and generators. Each environment can have a small overlay that transforms a shared base.

Technical explanation

Common patterns include overlays/dev, overlays/stage, and overlays/prod.

Image tags and replicas are frequent environment-specific differences.

Secrets should still be handled with a secure process; generated Secret YAML alone is not secret management.

Keep source manifests or IaC definitions readable enough that reviewers can understand the final desired state.

Use overlays, modules, or roles for reuse, but keep environment-specific differences explicit and reviewable.

Validate generated output in CI before applying it through kubectl, Argo CD, Terraform, or Ansible.
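
A dev overlay built on the same base might differ only in the fields sketched below. The image tag, replica count, and Deployment name payments are illustrative assumptions.

```yaml
# overlays/dev/kustomization.yaml (values are illustrative)
resources:
  - ../../base
namePrefix: dev-
namespace: payments-dev
images:
  - name: ghcr.io/company/payments
    newTag: 1.9.0-rc1
replicas:
  - name: payments      # assumed Deployment name in the base
    count: 1
configMapGenerator:
  - name: app-config
    literals:
      - LOG_LEVEL=debug
```

Reading this file next to the prod overlay shows every intentional difference between the two environments in a few lines.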

Hands-on example

1. Create a Kustomize base and an environment-specific overlay.

2. Base files:

base/deployment.yaml
base/service.yaml
base/kustomization.yaml

base/kustomization.yaml:

resources:
  - deployment.yaml
  - service.yaml
commonLabels:
  app.kubernetes.io/name: payments

3. prod overlay (overlays/prod/kustomization.yaml):

resources:
  - ../../base
namePrefix: prod-
namespace: payments-prod
images:
  - name: ghcr.io/company/payments
    newTag: 1.8.4
configMapGenerator:
  - name: app-config
    literals:
      - LOG_LEVEL=info
patches:
  - path: replica-patch.yaml

4. Render and apply:

kubectl kustomize overlays/prod

kubectl diff -k overlays/prod

kubectl apply -k overlays/prod

5. In GitOps, point Argo CD at overlays/prod and let it render, compare, and sync the desired state.

What are name prefixes/suffixes and common labels in Kustomize? (Advanced)

Answer

Name prefixes and suffixes modify resource names, commonly to avoid collisions between environments or variants. Common labels apply labels across all resources for ownership, selection, observability, or cost allocation. They are cross-cutting transformations.

Technical explanation

Prefixes and suffixes can avoid name collision but may affect references and external integrations.

Labels support selectors, dashboards, policy, and ownership queries.

Use commonLabels carefully: it also adds the labels to selectors, and a Deployment's selector cannot be changed after creation.

Keep source manifests or IaC definitions readable enough that reviewers can understand the final desired state.

Use overlays, modules, or roles for reuse, but keep environment-specific differences explicit and reviewable.

Validate generated output in CI before applying it through kubectl, Argo CD, Terraform, or Ansible.
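
Where labels are needed on existing workloads without touching their selectors, the labels field with includeSelectors: false is one option. The sketch below is illustrative; the label keys, values, and the name suffix are assumptions.

```yaml
# overlays/prod/kustomization.yaml (fragment; label values are illustrative)
namePrefix: prod-
nameSuffix: -v2
labels:
  - pairs:
      app.kubernetes.io/part-of: payments-platform
      team: payments
    includeSelectors: false   # label the objects only; leave selectors untouched
```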

Hands-on example

1. Create a Kustomize base and overlay that use common labels and a name prefix.

2. Base files:

base/deployment.yaml
base/service.yaml
base/kustomization.yaml

base/kustomization.yaml:

resources:
  - deployment.yaml
  - service.yaml
commonLabels:
  app.kubernetes.io/name: payments

3. prod overlay (overlays/prod/kustomization.yaml):

resources:
  - ../../base
namePrefix: prod-
namespace: payments-prod
images:
  - name: ghcr.io/company/payments
    newTag: 1.8.4
configMapGenerator:
  - name: app-config
    literals:
      - LOG_LEVEL=info
patches:
  - path: replica-patch.yaml

4. Render and apply:

kubectl kustomize overlays/prod

kubectl diff -k overlays/prod

kubectl apply -k overlays/prod

5. In GitOps, point Argo CD at overlays/prod and let it render, compare, and sync the desired state.
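To make the effect of namePrefix and commonLabels easier to picture, here is a minimal Python sketch that mimics the two transformations on a manifest held as a dict. It is a simplified model, not Kustomize's implementation, and the function name apply_prefix_and_labels is hypothetical. It shows why a common label ends up in selectors as well as in metadata.

```python
import copy


def apply_prefix_and_labels(manifest, name_prefix, common_labels):
    """Return a copy of a Deployment or Service dict with a prefix and labels applied.

    Simplified model of the namePrefix and commonLabels transformers,
    not Kustomize's real code.
    """
    out = copy.deepcopy(manifest)
    meta = out.setdefault("metadata", {})
    meta["name"] = name_prefix + meta["name"]
    meta.setdefault("labels", {}).update(common_labels)

    spec = out.setdefault("spec", {})
    if out["kind"] == "Deployment":
        # The common label also lands in the selector and the Pod template labels.
        spec.setdefault("selector", {}).setdefault("matchLabels", {}).update(common_labels)
        template_meta = spec.setdefault("template", {}).setdefault("metadata", {})
        template_meta.setdefault("labels", {}).update(common_labels)
    elif out["kind"] == "Service":
        spec.setdefault("selector", {}).update(common_labels)
    return out


deployment = {
    "kind": "Deployment",
    "metadata": {"name": "payments"},
    "spec": {
        "selector": {"matchLabels": {"app": "payments"}},
        "template": {"metadata": {"labels": {"app": "payments"}}},
    },
}

rendered = apply_prefix_and_labels(
    deployment, "prod-", {"app.kubernetes.io/name": "payments"}
)
print(rendered["metadata"]["name"])  # prod-payments
print(rendered["spec"]["selector"]["matchLabels"])
```

The selector now carries the extra label, which is the reason a common label added to an already-deployed workload needs care.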

What is a configMapGenerator and secretGenerator, and what is the hash suffix for? (Advanced)

Answer

configMapGenerator and secretGenerator create ConfigMaps and Secrets from literals, files, or env files. The generated resource name usually includes a content hash suffix so a change in config creates a new object name.

Technical explanation

Generators produce resources during build, so the generated output should be reviewed in CI.

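The hash suffix is derived from the generated object's content, so any config change produces a new resource name, and Kustomize rewrites references to that name in the same build. The suffix can be turned off with generatorOptions.disableNameSuffixHash.

secretGenerator only constructs Secret manifests; the values are base64-encoded, not encrypted, so the source files still need protection.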

Hands-on example

1. Create a Kustomize base and a prod overlay that generate a ConfigMap with configMapGenerator, then check the hash suffix in the rendered name.

2. Base files:

base/deployment.yaml
base/service.yaml
base/kustomization.yaml

base/kustomization.yaml:

resources:
  - deployment.yaml
  - service.yaml
commonLabels:
  app.kubernetes.io/name: payments

3. prod overlay (overlays/prod/kustomization.yaml):

resources:
  - ../../base
namePrefix: prod-
namespace: payments-prod
images:
  - name: ghcr.io/company/payments
    newTag: 1.8.4
configMapGenerator:
  - name: app-config
    literals:
      - LOG_LEVEL=info
patches:
  - path: replica-patch.yaml

4. Render and apply:

kubectl kustomize overlays/prod
kubectl diff -k overlays/prod
kubectl apply -k overlays/prod

5. In GitOps, point Argo CD at overlays/prod and let it render, compare, and sync the desired state.
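The content-hash idea behind generated names can be sketched in a few lines of Python. This is an illustration only: Kustomize uses its own hashing and encoding scheme, so real suffixes will not match the ones produced here, and the helper generated_name is hypothetical.

```python
import hashlib
import json


def generated_name(base_name, data):
    """Return base_name plus a short suffix derived from the config content.

    Illustrative only: not the algorithm Kustomize uses.
    """
    canonical = json.dumps(data, sort_keys=True).encode("utf-8")
    suffix = hashlib.sha256(canonical).hexdigest()[:10]
    return f"{base_name}-{suffix}"


before = generated_name("app-config", {"LOG_LEVEL": "info"})
after = generated_name("app-config", {"LOG_LEVEL": "debug"})
same = generated_name("app-config", {"LOG_LEVEL": "info"})

print(before != after)  # True: changed content, new name
print(before == same)   # True: identical content, identical name
```

Because the name is a pure function of the content, an unchanged config renders to the same name on every build, and only a real change produces a new object.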

Why does the config hash suffix help trigger rolling updates? (Advanced)

Answer

The hash suffix helps trigger rolling updates because a Deployment that references the generated ConfigMap or Secret sees its Pod template reference change when the config content changes. That changes the ReplicaSet template hash and causes Kubernetes to roll new Pods.

Technical explanation

Kubernetes rolls a Deployment when spec.template changes.

A generated config name change updates the volume or envFrom reference in the Pod template.

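This avoids stale Pods that keep running with old config: environment variables are read only at container start, and many applications never reload mounted files.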

Hands-on example

1. Create a Kustomize base and a prod overlay with a generated ConfigMap; if the Deployment references app-config, a config change renames the ConfigMap and triggers a rollout.

2. Base files:

base/deployment.yaml
base/service.yaml
base/kustomization.yaml

base/kustomization.yaml:

resources:
  - deployment.yaml
  - service.yaml
commonLabels:
  app.kubernetes.io/name: payments

3. prod overlay (overlays/prod/kustomization.yaml):

resources:
  - ../../base
namePrefix: prod-
namespace: payments-prod
images:
  - name: ghcr.io/company/payments
    newTag: 1.8.4
configMapGenerator:
  - name: app-config
    literals:
      - LOG_LEVEL=info
patches:
  - path: replica-patch.yaml

4. Render and apply:

kubectl kustomize overlays/prod
kubectl diff -k overlays/prod
kubectl apply -k overlays/prod

5. In GitOps, point Argo CD at overlays/prod and let it render, compare, and sync the desired state.
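The rollout mechanism can be modelled as a comparison of Pod templates. The sketch below is a stand-in for the Deployment controller, not its actual hashing code; template_fingerprint, pod_template, and the two hashed ConfigMap names are hypothetical. It contrasts a stable ConfigMap name with hash-suffixed names.

```python
import hashlib
import json


def template_fingerprint(pod_template):
    """Hash a Pod template dict; a stand-in for the controller's pod-template-hash."""
    canonical = json.dumps(pod_template, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()[:10]


def pod_template(config_map_name):
    """Build a minimal Pod template that loads env vars from a ConfigMap."""
    return {
        "spec": {
            "containers": [
                {
                    "name": "payments",
                    "image": "ghcr.io/company/payments:1.8.4",
                    "envFrom": [{"configMapRef": {"name": config_map_name}}],
                }
            ]
        }
    }


# Stable name: the ConfigMap's content may change, but the template does not.
stable_before = template_fingerprint(pod_template("app-config"))
stable_after = template_fingerprint(pod_template("app-config"))

# Hash-suffixed names (made-up suffixes): the reference itself changes.
hashed_before = template_fingerprint(pod_template("app-config-aaaa111111"))
hashed_after = template_fingerprint(pod_template("app-config-bbbb222222"))

print(stable_before == stable_after)  # True: nothing to roll
print(hashed_before == hashed_after)  # False: template changed, Pods are replaced
```

With a stable name the Deployment has no reason to roll, which is the gap the hash suffix closes.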

How is Kustomize integrated natively into kubectl (kubectl apply -k)? (Advanced)

Answer

Kustomize is integrated into kubectl. kubectl kustomize builds the rendered manifests, and kubectl apply -k applies a kustomization directory directly. This makes Kustomize usable without a separate templating command in simple workflows.

Technical explanation

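kubectl kustomize <dir> prints the rendered manifests without touching the cluster; kubectl apply -k <dir> builds and applies in one step. The Kustomize version bundled with kubectl can lag behind the standalone kustomize binary.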

GitOps controllers also render Kustomize internally.

Rendering in CI helps catch patch errors before sync or apply.

Hands-on example

1. Create a Kustomize base and a prod overlay, then render and apply them with kubectl alone.

2. Base files:

base/deployment.yaml
base/service.yaml
base/kustomization.yaml

base/kustomization.yaml:

resources:
  - deployment.yaml
  - service.yaml
commonLabels:
  app.kubernetes.io/name: payments

3. prod overlay (overlays/prod/kustomization.yaml):

resources:
  - ../../base
namePrefix: prod-
namespace: payments-prod
images:
  - name: ghcr.io/company/payments
    newTag: 1.8.4
configMapGenerator:
  - name: app-config
    literals:
      - LOG_LEVEL=info
patches:
  - path: replica-patch.yaml

4. Render and apply:

kubectl kustomize overlays/prod
kubectl diff -k overlays/prod
kubectl apply -k overlays/prod

5. In GitOps, point Argo CD at overlays/prod and let it render, compare, and sync the desired state.

When would you choose Kustomize over Helm, and can you use them together? (Advanced)

Answer

I choose Kustomize when I already have Kubernetes YAML and need clean overlays without templating. I choose Helm when I need chart packaging, broad configurability, dependencies, or third-party application installs. They can be combined, but the pipeline should clearly separate rendering from patching.

Technical explanation

Helm plus Kustomize is common for third-party charts that need final organization-specific patches.

Do not stack too many rendering layers or debugging becomes hard.

Prefer one primary packaging model per service.

Hands-on example

1. Create a Kustomize base and a prod overlay for a service whose plain YAML you already own, the case where Kustomize is the simpler choice.

2. Base files:

base/deployment.yaml
base/service.yaml
base/kustomization.yaml

base/kustomization.yaml:

resources:
  - deployment.yaml
  - service.yaml
commonLabels:
  app.kubernetes.io/name: payments

3. prod overlay (overlays/prod/kustomization.yaml):

resources:
  - ../../base
namePrefix: prod-
namespace: payments-prod
images:
  - name: ghcr.io/company/payments
    newTag: 1.8.4
configMapGenerator:
  - name: app-config
    literals:
      - LOG_LEVEL=info
patches:
  - path: replica-patch.yaml

4. Render and apply:

kubectl kustomize overlays/prod
kubectl diff -k overlays/prod
kubectl apply -k overlays/prod

5. In GitOps, point Argo CD at overlays/prod and let it render, compare, and sync the desired state.

How does ArgoCD work with Kustomize overlays for multiple environments? (Advanced)

Answer

Argo CD can point an Application at a Git path containing a Kustomize overlay. For multiple environments, each Argo CD Application references a different overlay path or branch, and Argo CD renders Kustomize, compares desired state with the cluster, and syncs changes.

Technical explanation

Each environment overlay can map to a separate Argo CD Application or ApplicationSet entry.

Argo CD compares rendered manifests with live cluster state and reports drift.

Promotion can be implemented by changing Git paths, branches, or image tags.

Hands-on example

1. Create a Kustomize base and a prod overlay that an Argo CD Application can point at.

2. Base files:

base/deployment.yaml
base/service.yaml
base/kustomization.yaml

base/kustomization.yaml:

resources:
  - deployment.yaml
  - service.yaml
commonLabels:
  app.kubernetes.io/name: payments

3. prod overlay (overlays/prod/kustomization.yaml):

resources:
  - ../../base
namePrefix: prod-
namespace: payments-prod
images:
  - name: ghcr.io/company/payments
    newTag: 1.8.4
configMapGenerator:
  - name: app-config
    literals:
      - LOG_LEVEL=info
patches:
  - path: replica-patch.yaml

4. Render and apply:

kubectl kustomize overlays/prod
kubectl diff -k overlays/prod
kubectl apply -k overlays/prod

5. In GitOps, point Argo CD at overlays/prod and let it render, compare, and sync the desired state.
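One Application per overlay can be sketched as data. The Python below builds Argo CD Application manifests as dicts, one per environment, and prints the prod one as JSON (which is also valid YAML). The repository URL, the argocd namespace, the default project, the main branch, and the dev and stage overlay names are assumptions for illustration; only overlays/prod and payments-prod come from the example above.

```python
import json

REPO_URL = "https://example.com/company/payments-deploy.git"  # hypothetical repository


def argocd_application(env):
    """Build an Argo CD Application manifest (as a dict) for one overlay path."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Application",
        "metadata": {"name": f"payments-{env}", "namespace": "argocd"},
        "spec": {
            "project": "default",
            "source": {
                "repoURL": REPO_URL,
                "targetRevision": "main",
                # Each environment differs only in the overlay path it renders.
                "path": f"overlays/{env}",
            },
            "destination": {
                "server": "https://kubernetes.default.svc",
                "namespace": f"payments-{env}",
            },
        },
    }


applications = [argocd_application(env) for env in ("dev", "stage", "prod")]
print(json.dumps(applications[-1], indent=2))
```

An ApplicationSet expresses the same loop declaratively; the sketch only shows that the per-environment difference reduces to a path and a destination.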

What are the trade-offs of templating (Helm) versus patching (Kustomize)? (Advanced)

Answer

Helm templating is flexible but can become complex because the final YAML is generated from templates and values. Kustomize patching keeps base YAML visible and modifies it through overlays, but it can become awkward for heavily parameterized applications. The trade-off is package flexibility versus manifest transparency.

Technical explanation

Templating can hide invalid YAML until render time.

Patching can be clearer but less ergonomic for large option matrices.

The best choice depends on whether you are packaging software or customizing known manifests.

Hands-on example

1. Create a Kustomize base and a prod overlay to see the patching side of the trade-off: the base stays plain YAML and the overlay lists only the differences.

2. Base files:

base/deployment.yaml
base/service.yaml
base/kustomization.yaml

base/kustomization.yaml:

resources:
  - deployment.yaml
  - service.yaml
commonLabels:
  app.kubernetes.io/name: payments

3. prod overlay (overlays/prod/kustomization.yaml):

resources:
  - ../../base
namePrefix: prod-
namespace: payments-prod
images:
  - name: ghcr.io/company/payments
    newTag: 1.8.4
configMapGenerator:
  - name: app-config
    literals:
      - LOG_LEVEL=info
patches:
  - path: replica-patch.yaml

4. Render and apply:

kubectl kustomize overlays/prod
kubectl diff -k overlays/prod
kubectl apply -k overlays/prod

5. In GitOps, point Argo CD at overlays/prod and let it render, compare, and sync the desired state.
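The two approaches can be contrasted in miniature. In this Python sketch, string.Template stands in for a text templating engine and a recursive dict merge stands in for an overlay patch; neither is how Helm or Kustomize is implemented, and all names are hypothetical. The point is that a text template accepts any value until the output is checked, while a patch works on structured data that stays readable as-is.

```python
import copy
from string import Template

# Templating: the manifest is text with placeholders filled in at render time.
DEPLOYMENT_TEMPLATE = Template(
    "kind: Deployment\n"
    "metadata:\n"
    "  name: $name\n"
    "spec:\n"
    "  replicas: $replicas\n"
)


def render_template(values):
    """Substitute values into the text template without any validation."""
    return DEPLOYMENT_TEMPLATE.substitute(values)


# Patching: the base is structured data and the overlay lists only the differences.
BASE = {"kind": "Deployment", "metadata": {"name": "payments"}, "spec": {"replicas": 1}}


def merge_patch(base, patch):
    """Recursively merge patch into a copy of base (dict-only simplification)."""
    merged = copy.deepcopy(base)
    for key, value in patch.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_patch(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# The template happily renders a nonsensical replica count; nothing flags it here.
templated = render_template({"name": "payments", "replicas": "three"})
patched = merge_patch(BASE, {"spec": {"replicas": 3}})

print(templated)
print(patched["spec"]["replicas"])  # 3
```

Either output should still be validated in CI; the difference is where a mistake first becomes visible.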

How do you keep IaC DRY across many similar microservices? (Advanced)

Answer

To keep IaC DRY across many microservices, I use reusable Terraform modules, standard service blueprints, shared Kustomize bases, overlays for differences, Ansible roles, versioned templates, and CI checks. DRY should not mean hiding important differences; it should standardize the boring parts and expose safe inputs.

Technical explanation

DRY must be balanced with explicitness; over-abstraction makes reviews harder.

Use versioned modules/bases so services can upgrade intentionally.

Standard CI templates enforce consistency without forcing every service into identical infrastructure.

Hands-on example

1. Implement a team workflow that keeps shared building blocks in one place and per-environment differences explicit.

2. Use a repository layout that separates reusable building blocks from environment entrypoints:

iac/
  terraform/modules/
  terraform/envs/dev|stage|prod/
  ansible/roles/
  kubernetes/base/
  kubernetes/overlays/dev|stage|prod/

3. For every pull request, generate Terraform plans, render Kustomize output, run ansible-lint, and attach summaries for review.

4. Require owners to approve changes touching IAM, networking, data stores, secrets, and production overlays.

5. After merge, apply through controlled pipelines with state locking, audit logs, and drift detection tickets for anything changed manually.

How do you review and approve infrastructure changes safely as a team? (Advanced)

Answer

I review infrastructure changes safely by requiring plans in pull requests, checking policy and security rules, using saved plans for apply, locking state, separating dev and prod permissions, requiring approvals for destructive changes, and keeping audit trails for who approved and applied each change.

Technical explanation

Plans and rendered manifests should be attached to pull requests.

Approvals should focus on blast radius, data loss, IAM, public exposure, and cost impact.

Emergency paths should exist but still leave an audit trail and follow-up review.

Hands-on example

1. Implement a team workflow in which every infrastructure change is planned, reviewed, and applied through a pipeline.

2. Use a repository layout that separates reusable building blocks from environment entrypoints:

iac/
  terraform/modules/
  terraform/envs/dev|stage|prod/
  ansible/roles/
  kubernetes/base/
  kubernetes/overlays/dev|stage|prod/

3. For every pull request, generate Terraform plans, render Kustomize output, run ansible-lint, and attach summaries for review.

4. Require owners to approve changes touching IAM, networking, data stores, secrets, and production overlays.

5. After merge, apply through controlled pipelines with state locking, audit logs, and drift detection tickets for anything changed manually.
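Part of the review step can be automated by reading the plan as JSON. The sketch below assumes the plan was exported with terraform show -json and flags every resource whose planned actions include a delete, which covers both plain destroys and replacements. The function name destructive_changes and the sample addresses are made up for illustration.

```python
import json


def destructive_changes(plan):
    """Return addresses of resources that the plan would delete or replace."""
    flagged = []
    for change in plan.get("resource_changes", []):
        actions = change.get("change", {}).get("actions", [])
        # A replacement shows up as a delete paired with a create.
        if "delete" in actions:
            flagged.append(change["address"])
    return flagged


# Trimmed sample shaped like `terraform show -json` output; addresses are made up.
sample_plan = json.loads("""
{
  "resource_changes": [
    {"address": "aws_s3_bucket.logs", "change": {"actions": ["update"]}},
    {"address": "aws_db_instance.main", "change": {"actions": ["delete", "create"]}},
    {"address": "aws_iam_role.ci", "change": {"actions": ["no-op"]}}
  ]
}
""")

print(destructive_changes(sample_plan))  # ['aws_db_instance.main']
```

A pipeline could post this list on the pull request and require an extra approval whenever it is not empty.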

What recent IaC practice or tool have you adopted, and what did it improve? (Advanced)

Answer

A recent IaC practice I would highlight is moving from ad-hoc imports and manual reviews to reviewable import blocks, drift detection, and policy-as-code in CI. It improves confidence because existing resources can be onboarded through code review and risky changes are caught before apply.

Technical explanation

A good answer should name a concrete practice and measurable outcome.

Examples include OpenTofu evaluation, import blocks, OIDC federation, drift checks, policy-as-code, or Kustomize build validation.

Tie the practice to reliability, security, speed, or reduced incidents.

Hands-on example

1. Implement a team workflow that builds the new practice, for example plan review and drift detection, into the pull request and apply pipeline.

2. Use a repository layout that separates reusable building blocks from environment entrypoints:

iac/
  terraform/modules/
  terraform/envs/dev|stage|prod/
  ansible/roles/
  kubernetes/base/
  kubernetes/overlays/dev|stage|prod/

3. For every pull request, generate Terraform plans, render Kustomize output, run ansible-lint, and attach summaries for review.

4. Require owners to approve changes touching IAM, networking, data stores, secrets, and production overlays.

5. After merge, apply through controlled pipelines with state locking, audit logs, and drift detection tickets for anything changed manually.

How do you detect and remediate IaC drift continuously rather than only at apply time? (Advanced)

Answer

Continuous drift detection means regularly comparing desired IaC state with live infrastructure outside normal apply windows. I use scheduled read-only plans, HCP Terraform (formerly Terraform Cloud) drift detection or equivalent pipelines, cloud configuration services such as AWS Config, policy scanners, and alerts that create tickets or pull requests for remediation.

Technical explanation

Drift checks should be read-only by default and alert rather than auto-remediate risky changes.

Some drift is expected when another controller owns a field; define ownership before alerting.

Track drift MTTR so teams know whether detection actually improves operations.

Hands-on example

1. Set up a scheduled drift check that runs outside normal apply windows.

2. Schedule a read-only plan job per workspace:

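terraform init -input=false
status=0
terraform plan -detailed-exitcode -input=false -out=drift.tfplan || status=$?
# Export the plan only when one was written (exit code 0 or 2).
if [ "$status" -ne 1 ]; then terraform show -json drift.tfplan > drift.json; fi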

3. Interpret exit code 0 as no drift, 2 as changes present, and 1 as an error requiring investigation.

4. For state-only synchronization, use refresh-only review:

terraform plan -refresh-only -out=refresh.tfplan

terraform apply refresh.tfplan

5. Open a ticket that classifies drift as revert, codify, ignore because externally owned, or remove from Terraform ownership.
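The exit-code handling from step 3 can be wrapped in a small helper for a scheduled job. This is a sketch under assumptions: the function names and the outcome labels are hypothetical, and run_drift_check expects terraform init to have already run in the working directory. Only the classification is exercised at the bottom, so the snippet runs without Terraform installed.

```python
import subprocess


def classify_plan_exit(code):
    """Map a `terraform plan -detailed-exitcode` exit status to a drift outcome."""
    if code == 0:
        return "no-drift"
    if code == 2:
        return "drift"
    # Exit code 1 (or anything unexpected) means the plan itself failed.
    return "error"


def run_drift_check(workdir):
    """Run a plan in workdir and return the outcome label.

    Assumes `terraform init` has already been run in workdir.
    """
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-out=drift.tfplan"],
        cwd=workdir,
    )
    return classify_plan_exit(result.returncode)


# Hypothetical ticket categories mirroring the classification in step 5.
REMEDIATIONS = ("revert", "codify", "ignore-externally-owned", "remove-from-terraform")

for code in (0, 1, 2):
    print(code, classify_plan_exit(code))
```

Treating every unexpected exit status as an error keeps a broken check from being mistaken for a clean one.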

How would you onboard an existing manually-built environment into IaC with confidence? (Advanced)

Answer

To onboard a manually built environment, I inventory resources, identify ownership boundaries, create modules or resource blocks, import in small batches, compare generated configuration with standards, fix drift deliberately, add tests and policy, and only then enable automated applies for production.

Technical explanation

Do not import everything into a single state file; choose lifecycle boundaries first.

Run no-op plans before enabling automated apply to prove code matches reality.

Preserve rollback options and backups during the transition.

Hands-on example

1. Onboard one existing resource, here an S3 bucket, as a pilot for the wider environment.

2. Inventory the existing object, then create a matching resource block and import block:

import {
  to = aws_s3_bucket.logs
  id = "existing-company-logs"
}

resource "aws_s3_bucket" "logs" {
  bucket = "existing-company-logs"
}

3. Alternatively, for resources that have an import block but no resource block yet, run terraform plan -generate-config-out=generated.tf (Terraform 1.5 or later) in a scratch branch, then clean up the generated configuration to match module standards.

4. Apply the import, run a normal plan, and reconcile every proposed change as intentional, accidental drift, or provider default noise.

5. Repeat in small batches and do not enable automated production apply until no-op plans are reliable.
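The gate in step 5 can be expressed as a check over recent plan exports. The sketch below assumes each plan is the parsed output of terraform show -json and treats a plan as clean when every resource change is a no-op or a data-source read. The function names and the threshold of three consecutive clean runs are assumptions, not a Terraform feature.

```python
HARMLESS_ACTIONS = (["no-op"], ["read"])


def pending_changes(plan):
    """Return addresses of resources whose planned actions would change something."""
    pending = []
    for change in plan.get("resource_changes", []):
        actions = change.get("change", {}).get("actions", [])
        if actions not in HARMLESS_ACTIONS:
            pending.append(change["address"])
    return pending


def ready_for_auto_apply(recent_plans, required_clean_runs=3):
    """True when the most recent `required_clean_runs` plans all propose no changes."""
    if len(recent_plans) < required_clean_runs:
        return False
    latest = recent_plans[-required_clean_runs:]
    return all(not pending_changes(plan) for plan in latest)


# Made-up samples shaped like trimmed `terraform show -json` output.
clean = {"resource_changes": [
    {"address": "aws_s3_bucket.logs", "change": {"actions": ["no-op"]}}
]}
drifted = {"resource_changes": [
    {"address": "aws_s3_bucket.logs", "change": {"actions": ["update"]}}
]}

print(ready_for_auto_apply([drifted, clean, clean, clean]))  # True
print(ready_for_auto_apply([clean, clean, drifted]))         # False
```

Requiring several clean runs in a row guards against a single lucky plan taken between two manual changes.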
