How do you handle a Terraform apply that fails halfway through?
Infrastructure as Code (Terraform, Ansible) · Intermediate level
Answer
If an apply fails halfway, I do not rerun it blindly. I read the error, run terraform plan to see the actual remaining delta, and compare state against the real cloud objects to find anything that was created but not recorded. I import or remove state entries only if needed, fix the root cause, and then re-apply. Terraform state should already reflect the operations that succeeded, even though the overall apply failed.
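The first step after a failure, a fresh plan, can be scripted so that the next action depends on what Terraform reports. The sketch below relies on `terraform plan -detailed-exitcode`, which exits 0 when nothing is left to change, 1 on error, and 2 when changes remain. The function name `triage_after_failed_apply`, the plan file name `tfplan.recovery`, and the messages are hypothetical and only illustrate the idea.

```shell
# Sketch: decide the next step after a failed apply, based on a fresh plan.
# triage_after_failed_apply is a hypothetical helper, not a Terraform command.
triage_after_failed_apply() {
  local rc=0
  # -detailed-exitcode: 0 = no changes, 1 = error, 2 = changes still pending.
  # Plan output goes to stderr so stdout carries only the verdict line.
  terraform plan -detailed-exitcode -out=tfplan.recovery >&2 || rc=$?
  case "$rc" in
    0) echo "converged: nothing left to apply" ;;
    2) echo "pending: review tfplan.recovery before applying" ;;
    *) echo "error: plan failed, fix configuration or credentials first" ;;
  esac
  return "$rc"
}
```

A non-zero return lets a CI job stop here, so a human reviews the remaining delta before anything is applied again.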
Technical explanation
Terraform records successful resource operations as it goes, so partial success is normal after failures.
Manual cleanup may be needed if a provider created an object but failed before state was updated.
State surgery should be rare, backed up, and peer-reviewed.
Keep Terraform's ownership boundary clear: one state should own a resource or field, and other tools should consume published outputs instead of modifying it.
Run fmt, validate, linting, policy checks, and plan review before production applies, and keep state locking enabled while they run.
Design for small blast radius by splitting state around lifecycle, permissions, and recovery boundaries.
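To illustrate the point about backed-up state surgery, here is a minimal sketch that snapshots state before any manual change. It uses `terraform state pull`, which prints the current state to stdout for both local and remote backends. The helper name `backup_state`, the backup file naming, and the resource addresses in the comments are assumptions made for the example.

```shell
# Sketch: snapshot state before any manual state change.
# backup_state is a hypothetical helper; the file name pattern is an assumption.
backup_state() {
  local backup
  backup="state-backup-$(date +%Y%m%dT%H%M%S).tfstate"
  # state pull writes the current state (local or remote) to stdout.
  terraform state pull > "$backup" || return 1
  # An empty file means nothing was captured, so refuse to continue.
  [ -s "$backup" ] || { echo "backup is empty: $backup" >&2; return 1; }
  echo "$backup"
}

# Example session (addresses and IDs below are placeholders):
#   backup=$(backup_state)
#   terraform state list                            # what Terraform believes it owns
#   terraform import aws_s3_bucket.logs my-bucket   # adopt an object created but not recorded
#   terraform state rm aws_s3_bucket.logs           # forget an object without destroying it
```

Keeping the snapshot next to the change request gives the peer reviewer something concrete to compare against after the surgery.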
Hands-on example
1. Build a safe IaC delivery workflow so that a half-failed apply is easy to diagnose and recover from.
2. Pull request job:
terraform fmt -check
terraform init -backend=false
terraform validate
tflint --recursive
checkov -d .
terraform init
terraform plan -out=tfplan
terraform show -json tfplan > tfplan.json
3. Policy job evaluates plan JSON for public exposure, missing encryption, IAM wildcards, and destructive changes.
4. Apply job runs only after approval, uses remote state locking, short-lived cloud credentials, and applies the saved plan artifact.
5. For failures, rerun plan, inspect state and cloud objects, and fix root cause before any state surgery.
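As one possible shape for the policy job in step 3, the sketch below fails when the plan JSON contains any delete action. It assumes `jq` is installed and that `tfplan.json` was produced by `terraform show -json tfplan`. The function name `check_no_deletes` is hypothetical, and a real policy job would also cover public exposure, encryption, and IAM wildcards, which are omitted here.

```shell
# Sketch: fail the policy job if the saved plan deletes anything.
# Assumes jq is available; check_no_deletes is a hypothetical helper name.
check_no_deletes() {
  local plan_json="${1:-tfplan.json}"
  local deleted
  # resource_changes[].change.actions lists the planned actions per resource.
  # "delete" appears for destroys and also for replacements.
  deleted=$(jq -r '.resource_changes[]?
    | select(.change.actions | any(. == "delete"))
    | .address' "$plan_json") || return 2
  if [ -n "$deleted" ]; then
    echo "destructive changes planned for:" >&2
    echo "$deleted" >&2
    return 1
  fi
  echo "no destructive changes"
}
```

Running the same check on the plan produced after a failed apply shows whether recovery would destroy or replace anything before the plan is approved.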