What is max_fail_percentage, and how does it protect a rollout?
Infrastructure as Code (Terraform, Ansible) · Advanced level
Answer
max_fail_percentage is a play-level keyword that aborts the play when the share of failed hosts in the current batch exceeds the given percentage. It protects a rollout by stopping a bad change at the batch where it first goes wrong, instead of letting it continue across the rest of the fleet.
Technical explanation
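The threshold is evaluated per batch (the hosts selected by serial) and must be exceeded, not merely reached: with serial: 4, a value of 49 aborts the play once 2 hosts fail, while a value of 50 does not.
Tune it to batch size and service redundancy: the fewer hosts the service can afford to lose at once, the lower the threshold should be.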
Combine it with any_errors_fatal: true on a play or block for stricter orchestration where a single failure should stop the play on all hosts.
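The two keywords can be sketched together in one play. This is a minimal illustration, not a reference implementation: the group name app and the service name myapp follow the skeleton in the hands-on example, and the artifact path is hypothetical.

```yaml
- name: Rolling restart with a failure budget
  hosts: app                      # group name assumed
  serial: 5
  max_fail_percentage: 20
  # 1 of 5 hosts failed = 20%: threshold not exceeded, the play continues.
  # 2 of 5 hosts failed = 40%: threshold exceeded, the play aborts.
  tasks:
    - name: Steps where a single failure must stop every host
      any_errors_fatal: true
      block:
        - name: Check that the release artifact exists
          ansible.builtin.stat:
            path: /opt/myapp/release.tar.gz   # hypothetical path
          register: artifact
          failed_when: not artifact.stat.exists

    - name: Restart app service
      ansible.builtin.service:
        name: myapp               # service name assumed
        state: restarted
```

The block acts as a hard gate, while the restart task is allowed to fail on one host per batch before the play is aborted.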
Prefer idempotent modules over shell so repeated runs are safe and change reporting is meaningful.
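As a rough illustration of the difference, the two tasks below create the same directory. The path /opt/myapp is a hypothetical example.

```yaml
# Shell version: reports "changed" on every run, even when the directory
# already exists, so the change report says nothing useful.
- name: Create app directory (shell)
  ansible.builtin.shell: mkdir -p /opt/myapp   # hypothetical path

# Module version: reports "changed" only when it actually creates the
# directory or corrects its mode, so repeated runs are quiet and safe.
- name: Create app directory (module)
  ansible.builtin.file:
    path: /opt/myapp
    state: directory
    mode: "0755"
```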
Separate reusable role logic from inventory-specific variables so the same automation works across environments.
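One possible layout is sketched below as three small files in a single listing. The file paths and the variable names myapp_port and myapp_health_path are assumptions made for illustration.

```yaml
# roles/myapp/defaults/main.yml: role defaults, identical in every environment
---
myapp_port: 8080
myapp_health_path: /health

# inventories/staging/group_vars/app.yml: staging-only override
---
myapp_port: 8081

# roles/myapp/tasks/main.yml: role logic refers only to variables
---
- name: Wait for health
  ansible.builtin.uri:
    url: "http://{{ inventory_hostname }}:{{ myapp_port }}{{ myapp_health_path }}"
    status_code: 200
```

The role never hard-codes an environment-specific value, so pointing the same playbook at a different inventory is enough to change its behavior.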
Run ansible-lint, ansible-playbook --syntax-check, check mode (--check) where useful, and staged rollouts before production-wide changes.
Hands-on example
1. Orchestrate a rolling update that uses max_fail_percentage to halt the rollout when a batch fails.
2. Playbook skeleton:
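    - name: Rolling app upgrade
      hosts: app
      serial: 2                  # update two hosts per batch
      max_fail_percentage: 20    # with two hosts per batch, one failure (50%) aborts the play
      tasks:
        - name: Drain host from load balancer
          ansible.builtin.command: /usr/local/bin/lbctl drain {{ inventory_hostname }}
          delegate_to: localhost

        - name: Upgrade app package
          ansible.builtin.package:
            name: myapp
            state: latest        # "present" would not upgrade an already installed package
          notify: Restart app

        - name: Run the restart handler now, before the health check
          ansible.builtin.meta: flush_handlers

        - name: Wait for health
          ansible.builtin.uri:
            url: http://{{ inventory_hostname }}:8080/health
            status_code: 200
          register: health
          until: health.status == 200
          retries: 12
          delay: 5

        - name: Add host back to load balancer
          ansible.builtin.command: /usr/local/bin/lbctl enable {{ inventory_hostname }}
          delegate_to: localhost

      handlers:
        # Assumption: the service is named after the package; adjust as needed.
        - name: Restart app
          ansible.builtin.service:
            name: myapp
            state: restarted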
3. Test against a staging group with serial: 1, then increase batch size after measuring recovery time.
4. Confirm a failed health check stops the rollout before most hosts are touched.
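One way to rehearse step 4 in staging is to force a failure on a chosen host and watch the play abort. The task below is only a sketch: the variable simulate_failure_on is hypothetical and would be passed on the command line with -e. Place the task before the step that adds the host back to the load balancer.

```yaml
- name: Simulate a failed health check on one staging host
  ansible.builtin.fail:
    msg: "Simulated health check failure on {{ inventory_hostname }}"
  # simulate_failure_on is a hypothetical extra var; when it is not set,
  # the default empty string matches no host and the task is skipped.
  when: inventory_hostname == (simulate_failure_on | default(''))
```

With serial: 2 and max_fail_percentage: 20, a single forced failure is 50% of the batch, so the play should abort before any later batch is touched.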