AWS interview questions & answers
100 AWS interview questions, each answered three ways: a concise spoken answer, a technical explanation, and a hands-on example.
All questions
- What is the AWS shared responsibility model, and where is the line between AWS and the customer?
- Explain the difference between a Region, an Availability Zone, and an Edge Location.
- What is a VPC, and what are its core components (subnets, route tables, IGW, NAT)?
- Difference between a public and a private subnet, and how does each reach the internet?
- What is the difference between a Security Group and a Network ACL?
- Are Security Groups stateful or stateless? What about NACLs?
- What is an Internet Gateway versus a NAT Gateway, and when do you need each?
- How does a NAT Gateway differ from a NAT instance?
- Explain VPC peering and its limitations (e.g., non-transitive routing).
- What is a Transit Gateway and when would you use it over peering?
- What are VPC endpoints, and what is the difference between Gateway and Interface endpoints?
- Explain the difference between IAM users, groups, roles, and policies.
- What is the difference between an IAM role and an IAM user, and when do you use a role?
- How does an EC2 instance assume a role, and why is that better than embedding keys?
- What is the difference between an identity-based and a resource-based policy?
- Explain how an IAM policy is evaluated when there is both an allow and an explicit deny.
- What is an IAM permissions boundary, and when would you use one?
- What are AWS Organizations and Service Control Policies (SCPs)?
- What is the principle of least privilege, and how do you enforce it on AWS?
- How would you give a Kubernetes pod scoped AWS permissions (IRSA / IAM Roles for Service Accounts)?
- What are the main EC2 instance families, and how do you choose one?
- Explain On-Demand, Reserved, Spot, and Savings Plans pricing and when to use each.
- What is an Auto Scaling Group, and what are the scaling policy types?
- Difference between a launch template and a launch configuration.
- How does an Elastic Load Balancer work, and what are ALB vs NLB vs CLB?
- When would you choose an ALB over an NLB?
- What is a target group, and how do health checks work on a load balancer?
- Explain the difference between vertical and horizontal scaling on AWS.
- What are the S3 storage classes, and how do you pick between them?
- How does S3 lifecycle management work?
- Explain S3 bucket policies versus IAM policies versus ACLs.
- How do you make an S3 bucket private and prevent public exposure?
- What is S3 versioning, and how does it interact with lifecycle rules?
- Explain S3 encryption options: SSE-S3, SSE-KMS, SSE-C, and client-side.
- What is the difference between EBS and instance store?
- What are the EBS volume types and their use cases?
- How do EBS snapshots work, and are they incremental?
- What is EFS, and when would you use it over EBS or S3?
- Explain Amazon RDS and the engines it supports.
- What is Multi-AZ in RDS, and how does failover work?
- Difference between RDS Multi-AZ and read replicas.
- What is Amazon Aurora, and how does it differ from standard RDS?
- What is DynamoDB, and when would you choose it over RDS?
- Explain DynamoDB partition keys, sort keys, and the importance of access patterns.
- What is ElastiCache, and what is the difference between Redis and Memcached on it?
- What is AWS Lambda, and what are its key limits (timeout, memory, package size)?
- Explain a Lambda cold start and how to reduce it.
- How do you trigger a Lambda function - name several event sources.
- What is API Gateway, and how does it integrate with Lambda?
- What is Amazon ECR, and how does it relate to EKS and Docker?
- What is Amazon EKS, and what does AWS manage versus what you manage?
- Explain the EKS control plane versus worker node responsibilities.
- What are managed node groups versus self-managed nodes versus Fargate on EKS?
- How does the AWS VPC CNI assign pod networking on EKS?
- What is the AWS Load Balancer Controller, and what does it provision?
- How do you authenticate kubectl to an EKS cluster (aws-auth / access entries)?
- What is CloudWatch, and what is the difference between metrics, logs, and alarms?
- What are CloudWatch custom metrics, and how do you publish them?
- What is the difference between CloudWatch Logs and CloudTrail?
- What does CloudTrail record, and why does it matter for security and audits?
- What is AWS Config, and how does it differ from CloudTrail?
- What is Route 53, and what routing policies does it support?
- Explain Route 53 health checks and failover routing.
- What is the difference between a CNAME and an Alias record in Route 53?
- What is CloudFront, and how does it improve performance and reduce cost?
- What is AWS KMS, and what is the difference between an AWS-managed and a customer-managed key?
- What is the difference between KMS and Secrets Manager?
- What is the difference between Secrets Manager and SSM Parameter Store?
- How does Secrets Manager rotation work?
- What is AWS CloudFormation, and how does it compare to Terraform?
- What are CloudFormation change sets and drift detection?
- What is the AWS Well-Architected Framework, and what are its pillars?
- How would you design a highly available web application across multiple AZs?
- How do you design for disaster recovery - explain RTO and RPO and the DR strategies.
- Compare backup-and-restore, pilot light, warm standby, and multi-site DR strategies.
- How would you secure data in transit and at rest across an AWS workload?
- How do you detect and reduce unexpected AWS cost increases?
- What AWS tools help with cost visibility (Cost Explorer, Budgets, CUR)?
- How would you right-size EC2 and RDS instances to cut spend without hurting reliability?
- Explain how you would migrate a workload from on-prem to AWS.
- What is the difference between SQS and SNS, and when do you use each?
- What is the difference between an SQS standard and FIFO queue?
- What is EventBridge, and how does it differ from SNS?
- What is AWS Systems Manager, and how is Session Manager safer than SSH bastions?
- How would you patch a fleet of EC2 instances at scale?
- What is an AMI, and how do you build standardised, hardened images?
- How do you troubleshoot an EC2 instance that is unreachable over SSH?
- How do you debug intermittent 5xx errors behind an ALB?
- How would you architect a multi-account AWS strategy with Organizations and landing zones?
- What is AWS WAF, and what kinds of attacks does it mitigate?
- What is GuardDuty, and what does it detect?
- What is the difference between GuardDuty, Inspector, and Security Hub?
- How do you enforce tagging and governance across many AWS accounts?
- What is the difference between a service quota and a rate limit, and how do you handle throttling?
- How would you set up centralised logging across multiple AWS accounts?
- What is cross-account access, and how do you implement it securely with roles?
- How do you rotate and manage access keys, and why prefer roles over long-lived keys?
- Explain how you would design a secure, private EKS cluster with no public API endpoint.
- What recent AWS service or feature have you adopted, and what problem did it solve for you?
- How would you design a landing zone for a new organisation adopting AWS at scale?
Explain the difference between a Region, an Availability Zone, and an Edge Location. (Basic)
Answer
A Region is a separate geographic AWS area, an Availability Zone is an isolated failure domain inside a Region, and an Edge Location is part of AWS's global edge network for services like CloudFront and Route 53. I use Regions for geography and compliance, AZs for high availability, and Edge Locations for user-facing performance.
Technical explanation
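Regions are isolated from one another by design. Each Region contains multiple Availability Zones, and each AZ is one or more discrete data centers with independent power and networking, connected to the other AZs in the Region over low-latency links. Edge Locations serve edge services such as CloudFront rather than normal compute placement.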
AWS foundation answers should clarify ownership boundaries, global infrastructure concepts, failure domains, and the service-specific split between AWS-managed and customer-managed responsibilities.
A strong interview answer connects definitions to architecture decisions: compliance, latency, blast radius, operational ownership, and high availability.
Always state that the exact responsibility or placement decision depends on the specific AWS service and workload requirements.
Hands-on example
1. Choose a simple workload such as a web API with S3 and RDS, then map each component to AWS-owned and customer-owned responsibilities.
2. Place the workload in one Region, spread compute across at least two AZs, and put static assets behind CloudFront to show the Region/AZ/Edge distinction.
3. Create a responsibility matrix covering IAM, encryption, patching, networking, data, backups, monitoring, and incident response.
4. Use that matrix as the interview-ready explanation of how AWS concepts become production operating controls.
What is a VPC, and what are its core components (subnets, route tables, IGW, NAT)? (Basic)
Answer
A VPC is a logically isolated network in AWS. Its core pieces are CIDR ranges, subnets, route tables, internet gateways, NAT gateways, security groups, NACLs, and endpoints; together they define placement, routing, ingress, egress, and segmentation.
Technical explanation
The VPC CIDR must be planned up front to avoid overlap with peered VPCs, Transit Gateway attachments, and on-prem networks.
In AWS networking, always separate placement, routing, and filtering: subnets place resources, route tables decide next hops, and SG/NACL rules filter traffic.
Design for failure domains by spreading public, private, and data subnets across multiple AZs and avoiding single-AZ dependencies where production availability matters.
Troubleshooting should follow packet flow: source, SG, NACL, route table, endpoint/NAT/IGW/TGW, destination SG, and service listener.
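CIDR planning can be checked before anything is created. The sketch below is a minimal illustration using only Python's standard `ipaddress` module; the network names and ranges are made-up examples, not values from any real environment.

```python
import ipaddress


def overlapping_pairs(cidrs):
    """Return every pair of named CIDR blocks that overlap."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in cidrs.items()}
    names = sorted(nets)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if nets[a].overlaps(nets[b])
    ]


def carve_subnets(vpc_cidr, new_prefix, count):
    """Split a VPC CIDR into `count` equal subnets of the given prefix length."""
    subnets = list(ipaddress.ip_network(vpc_cidr).subnets(new_prefix=new_prefix))
    if count > len(subnets):
        raise ValueError("VPC CIDR is too small for that many subnets")
    return [str(s) for s in subnets[:count]]


# Hypothetical address plan: the on-prem range collides with the app VPC.
networks = {
    "app-vpc": "10.0.0.0/16",
    "shared-vpc": "10.1.0.0/16",
    "on-prem": "10.0.128.0/17",
}
print(overlapping_pairs(networks))
print(carve_subnets("10.0.0.0/16", 20, 4))
```

Running a check like this in CI against the address plan catches overlaps before a peering or Transit Gateway attachment fails on them.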
Hands-on example
1. Create a sandbox VPC with two AZs, public subnets, private subnets, route tables, IGW, NAT Gateway, security groups, and one VPC endpoint relevant to the topic.
2. Deploy a small test instance or pod in the correct subnet and validate routing with curl, traceroute where allowed, and VPC Flow Logs.
3. Change one control at a time - route, SG, NACL, endpoint policy, NAT, or TGW route - and observe exactly how connectivity changes.
4. Document the final production pattern as an architecture diagram plus a troubleshooting checklist.
Difference between a public and a private subnet, and how does each reach the internet? (Basic)
Answer
A public subnet has a route to an internet gateway and can host internet-facing resources if public addressing and security rules allow it. A private subnet has no direct inbound internet path; it usually reaches outbound internet through NAT or uses private VPC endpoints.
Technical explanation
A subnet is public because of routing to an IGW, not because of its name; public IP assignment and security rules also matter.
In AWS networking, always separate placement, routing, and filtering: subnets place resources, route tables decide next hops, and SG/NACL rules filter traffic.
Design for failure domains by spreading public, private, and data subnets across multiple AZs and avoiding single-AZ dependencies where production availability matters.
Troubleshooting should follow packet flow: source, SG, NACL, route table, endpoint/NAT/IGW/TGW, destination SG, and service listener.
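The point that routing, not naming, makes a subnet public can be shown with a small classifier. This is a sketch under stated assumptions: the route table is modeled as a plain list of dictionaries (a hypothetical shape, not an AWS API response), and only the default IPv4 route is inspected.

```python
def default_route_target(route_table):
    """Return the target of the 0.0.0.0/0 route, or None if there is none."""
    for route in route_table:
        if route["destination"] == "0.0.0.0/0":
            return route["target"]
    return None


def classify_subnet(route_table):
    """Classify a subnet from its route table, ignoring its name entirely."""
    target = default_route_target(route_table)
    if target is None:
        return "isolated"
    if target.startswith("igw-"):
        return "public"
    if target.startswith("nat-"):
        return "private (egress via NAT)"
    return "private"


# Hypothetical route tables; the IDs are placeholders.
public_rt = [
    {"destination": "10.0.0.0/16", "target": "local"},
    {"destination": "0.0.0.0/0", "target": "igw-0example"},
]
private_rt = [
    {"destination": "10.0.0.0/16", "target": "local"},
    {"destination": "0.0.0.0/0", "target": "nat-0example"},
]
print(classify_subnet(public_rt), "/", classify_subnet(private_rt))
```

A resource in a subnet classified as public still needs a public IP and permissive security rules before it is actually reachable.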
Hands-on example
1. Create a sandbox VPC with two AZs, public subnets, private subnets, route tables, IGW, NAT Gateway, security groups, and one VPC endpoint relevant to the topic.
2. Deploy a small test instance or pod in the correct subnet and validate routing with curl, traceroute where allowed, and VPC Flow Logs.
3. Change one control at a time - route, SG, NACL, endpoint policy, NAT, or TGW route - and observe exactly how connectivity changes.
4. Document the final production pattern as an architecture diagram plus a troubleshooting checklist.
What is the difference between a Security Group and a Network ACL? (Basic)
Answer
Security Groups are stateful resource-level firewalls, while Network ACLs are stateless subnet-level firewalls. I use Security Groups for normal workload access control and NACLs for coarse subnet guardrails or explicit deny use cases.
Technical explanation
Security Groups support references to other groups, which is cleaner than IP-based rules for dynamic compute fleets.
In AWS networking, always separate placement, routing, and filtering: subnets place resources, route tables decide next hops, and SG/NACL rules filter traffic.
Design for failure domains by spreading public, private, and data subnets across multiple AZs and avoiding single-AZ dependencies where production availability matters.
Troubleshooting should follow packet flow: source, SG, NACL, route table, endpoint/NAT/IGW/TGW, destination SG, and service listener.
Hands-on example
1. Create a sandbox VPC with two AZs, public subnets, private subnets, route tables, IGW, NAT Gateway, security groups, and one VPC endpoint relevant to the topic.
2. Deploy a small test instance or pod in the correct subnet and validate routing with curl, traceroute where allowed, and VPC Flow Logs.
3. Change one control at a time - route, SG, NACL, endpoint policy, NAT, or TGW route - and observe exactly how connectivity changes.
4. Document the final production pattern as an architecture diagram plus a troubleshooting checklist.
Are Security Groups stateful or stateless? What about NACLs? (Basic)
Answer
Security Groups are stateful, so allowed request traffic automatically permits the response path. NACLs are stateless, so both inbound and outbound rules must allow the flow, including return ports.
Technical explanation
NACL troubleshooting often requires checking ephemeral return ports because the response path is not automatically allowed.
In AWS networking, always separate placement, routing, and filtering: subnets place resources, route tables decide next hops, and SG/NACL rules filter traffic.
Design for failure domains by spreading public, private, and data subnets across multiple AZs and avoiding single-AZ dependencies where production availability matters.
Troubleshooting should follow packet flow: source, SG, NACL, route table, endpoint/NAT/IGW/TGW, destination SG, and service listener.
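The ephemeral-port issue is easier to see in a toy model. The sketch below is a deliberate simplification: rules are allow-only port ranges, and NACL rule numbers, deny rules, protocols, and CIDRs are left out. All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    direction: str  # "in" or "out"
    port_from: int
    port_to: int


def rule_allows(rules, direction, port):
    """True if any allow rule in that direction covers the port."""
    return any(
        r.direction == direction and r.port_from <= port <= r.port_to
        for r in rules
    )


def security_group_allows_round_trip(rules, server_port):
    # Stateful: once the inbound request is allowed, the reply is tracked
    # and permitted without any outbound rule.
    return rule_allows(rules, "in", server_port)


def nacl_allows_round_trip(rules, server_port, client_port):
    # Stateless: the request is checked inbound on the server port, and the
    # reply is checked outbound on the client's ephemeral port.
    return rule_allows(rules, "in", server_port) and rule_allows(
        rules, "out", client_port
    )


https_in = [Rule("in", 443, 443)]
with_return_ports = https_in + [Rule("out", 1024, 65535)]

print(security_group_allows_round_trip(https_in, 443))
print(nacl_allows_round_trip(https_in, 443, 50000))
print(nacl_allows_round_trip(with_return_ports, 443, 50000))
```

The same inbound-only rule set passes as a Security Group and fails as a NACL until the outbound ephemeral range is added.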
Hands-on example
1. Create a sandbox VPC with two AZs, public subnets, private subnets, route tables, IGW, NAT Gateway, security groups, and one VPC endpoint relevant to the topic.
2. Deploy a small test instance or pod in the correct subnet and validate routing with curl, traceroute where allowed, and VPC Flow Logs.
3. Change one control at a time - route, SG, NACL, endpoint policy, NAT, or TGW route - and observe exactly how connectivity changes.
4. Document the final production pattern as an architecture diagram plus a troubleshooting checklist.
What is an Internet Gateway versus a NAT Gateway, and when do you need each? (Basic)
Answer
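An Internet Gateway is the VPC's attachment point to the internet: subnets that route 0.0.0.0/0 to it are public, and resources there with public IPs can both send and receive internet traffic. A NAT Gateway sits in a public subnet and lets private resources initiate outbound internet connections without being directly reachable from the internet.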
Technical explanation
For production private subnet egress, use one NAT Gateway per AZ to avoid cross-AZ dependency and unnecessary data charges.
In AWS networking, always separate placement, routing, and filtering: subnets place resources, route tables decide next hops, and SG/NACL rules filter traffic.
Design for failure domains by spreading public, private, and data subnets across multiple AZs and avoiding single-AZ dependencies where production availability matters.
Troubleshooting should follow packet flow: source, SG, NACL, route table, endpoint/NAT/IGW/TGW, destination SG, and service listener.
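The one-NAT-Gateway-per-AZ recommendation amounts to a routing rule: each private subnet's default route should point at the NAT Gateway in its own AZ. A minimal sketch, assuming subnets and NAT Gateways are described by simple dictionaries with placeholder IDs:

```python
def private_default_routes(private_subnets, nat_gateways):
    """Map each private subnet to the NAT Gateway in its own AZ.

    private_subnets: {subnet_id: availability_zone}
    nat_gateways:    {availability_zone: nat_gateway_id}
    """
    plan = {}
    for subnet_id, az in private_subnets.items():
        if az not in nat_gateways:
            # Falling back to another AZ's NAT would add a cross-AZ dependency.
            raise ValueError(f"no NAT Gateway in {az} for {subnet_id}")
        plan[subnet_id] = {"destination": "0.0.0.0/0", "target": nat_gateways[az]}
    return plan


# Hypothetical inventory.
subnets = {"subnet-app-a": "us-east-1a", "subnet-app-b": "us-east-1b"}
nats = {"us-east-1a": "nat-a", "us-east-1b": "nat-b"}
for subnet_id, route in private_default_routes(subnets, nats).items():
    print(subnet_id, "->", route["target"])
```

Raising an error when an AZ has no NAT Gateway makes the single-AZ dependency visible at plan time instead of during an outage.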
Hands-on example
1. Create a sandbox VPC with two AZs, public subnets, private subnets, route tables, IGW, NAT Gateway, security groups, and one VPC endpoint relevant to the topic.
2. Deploy a small test instance or pod in the correct subnet and validate routing with curl, traceroute where allowed, and VPC Flow Logs.
3. Change one control at a time - route, SG, NACL, endpoint policy, NAT, or TGW route - and observe exactly how connectivity changes.
4. Document the final production pattern as an architecture diagram plus a troubleshooting checklist.
How does a NAT Gateway differ from a NAT instance? (Basic)
Answer
A NAT Gateway is AWS-managed, highly available within an AZ, and simpler to operate. A NAT instance is self-managed EC2-based NAT; it can be customized, but I own patching, scaling, throughput, and failover.
Technical explanation
NAT instances can support custom inspection, but that flexibility comes with operational ownership and scaling risk.
In AWS networking, always separate placement, routing, and filtering: subnets place resources, route tables decide next hops, and SG/NACL rules filter traffic.
Design for failure domains by spreading public, private, and data subnets across multiple AZs and avoiding single-AZ dependencies where production availability matters.
Troubleshooting should follow packet flow: source, SG, NACL, route table, endpoint/NAT/IGW/TGW, destination SG, and service listener.
Hands-on example
1. Create a sandbox VPC with two AZs, public subnets, private subnets, route tables, IGW, NAT Gateway, security groups, and one VPC endpoint relevant to the topic.
2. Deploy a small test instance or pod in the correct subnet and validate routing with curl, traceroute where allowed, and VPC Flow Logs.
3. Change one control at a time - route, SG, NACL, endpoint policy, NAT, or TGW route - and observe exactly how connectivity changes.
4. Document the final production pattern as an architecture diagram plus a troubleshooting checklist.
Explain VPC peering and its limitations (e.g., non-transitive routing). (Basic)
Answer
VPC peering privately connects two non-overlapping VPCs, but routing is non-transitive. If A peers with B and B peers with C, A does not reach C through B; at scale, that is why Transit Gateway often becomes cleaner.
Technical explanation
VPC peering route tables must be updated on both sides and overlapping CIDRs are not supported.
In AWS networking, always separate placement, routing, and filtering: subnets place resources, route tables decide next hops, and SG/NACL rules filter traffic.
Design for failure domains by spreading public, private, and data subnets across multiple AZs and avoiding single-AZ dependencies where production availability matters.
Troubleshooting should follow packet flow: source, SG, NACL, route table, endpoint/NAT/IGW/TGW, destination SG, and service listener.
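Non-transitive routing and the full-mesh growth that follows from it can be modeled in a few lines. This is a toy model with hypothetical VPC names; it ignores route tables and security rules and only asks whether a direct peering exists.

```python
def can_reach(peerings, src, dst):
    """Peering is non-transitive: only a direct peering connection counts."""
    return src == dst or frozenset((src, dst)) in peerings


def full_mesh_size(vpc_count):
    """Peering connections needed for every VPC to reach every other VPC."""
    return vpc_count * (vpc_count - 1) // 2


peerings = {frozenset(("A", "B")), frozenset(("B", "C"))}
print(can_reach(peerings, "A", "B"))  # direct peering exists
print(can_reach(peerings, "A", "C"))  # B does not forward between A and C
print(full_mesh_size(10))
```

The quadratic growth of the mesh is the practical reason a hub such as Transit Gateway becomes attractive as the VPC count rises.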
Hands-on example
1. Create a sandbox VPC with two AZs, public subnets, private subnets, route tables, IGW, NAT Gateway, security groups, and one VPC endpoint relevant to the topic.
2. Deploy a small test instance or pod in the correct subnet and validate routing with curl, traceroute where allowed, and VPC Flow Logs.
3. Change one control at a time - route, SG, NACL, endpoint policy, NAT, or TGW route - and observe exactly how connectivity changes.
4. Document the final production pattern as an architecture diagram plus a troubleshooting checklist.
What is a Transit Gateway and when would you use it over peering? (Basic)
Answer
Transit Gateway is a managed regional network hub for many VPCs, VPNs, and Direct Connect attachments. I choose it over peering when I need hub-and-spoke connectivity, routing domains, multi-account networking, or hybrid connectivity at scale.
Technical explanation
Transit Gateway route tables let you segment prod, non-prod, inspection, and shared-services routing domains.
In AWS networking, always separate placement, routing, and filtering: subnets place resources, route tables decide next hops, and SG/NACL rules filter traffic.
Design for failure domains by spreading public, private, and data subnets across multiple AZs and avoiding single-AZ dependencies where production availability matters.
Troubleshooting should follow packet flow: source, SG, NACL, route table, endpoint/NAT/IGW/TGW, destination SG, and service listener.
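Routing-domain segmentation on a Transit Gateway comes from two relationships: each attachment is associated with one route table, and each route table only learns routes from the attachments that propagate into it. The sketch below models just that, with hypothetical attachment and route table names and no static routes.

```python
def can_route(associations, propagations, src, dst):
    """Traffic from `src` is looked up in the route table `src` is associated
    with; that table only knows attachments that propagate into it."""
    table = associations[src]
    return dst in propagations.get(table, set())


# Hypothetical layout: prod and dev can both reach shared services,
# but not each other.
associations = {
    "prod-vpc": "prod-rt",
    "dev-vpc": "nonprod-rt",
    "shared-vpc": "shared-rt",
}
propagations = {
    "prod-rt": {"shared-vpc"},
    "nonprod-rt": {"shared-vpc"},
    "shared-rt": {"prod-vpc", "dev-vpc"},
}

print(can_route(associations, propagations, "prod-vpc", "shared-vpc"))
print(can_route(associations, propagations, "prod-vpc", "dev-vpc"))
```

Return traffic needs the reverse lookup to succeed as well, which is why the shared route table propagates both spokes.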
Hands-on example
1. Create a sandbox VPC with two AZs, public subnets, private subnets, route tables, IGW, NAT Gateway, security groups, and one VPC endpoint relevant to the topic.
2. Deploy a small test instance or pod in the correct subnet and validate routing with curl, traceroute where allowed, and VPC Flow Logs.
3. Change one control at a time - route, SG, NACL, endpoint policy, NAT, or TGW route - and observe exactly how connectivity changes.
4. Document the final production pattern as an architecture diagram plus a troubleshooting checklist.
What are VPC endpoints, and what is the difference between Gateway and Interface endpoints? (Basic)
Answer
VPC endpoints keep traffic to supported AWS services on private AWS networking instead of the public internet. Gateway endpoints exist only for S3 and DynamoDB and work as route-table targets; Interface endpoints use PrivateLink ENIs with private IPs in my subnets and cover many AWS and partner services.
Technical explanation
Endpoint policies and private DNS are as important as endpoint creation because they control what resources can be reached privately.
In AWS networking, always separate placement, routing, and filtering: subnets place resources, route tables decide next hops, and SG/NACL rules filter traffic.
Design for failure domains by spreading public, private, and data subnets across multiple AZs and avoiding single-AZ dependencies where production availability matters.
Troubleshooting should follow packet flow: source, SG, NACL, route table, endpoint/NAT/IGW/TGW, destination SG, and service listener.
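An endpoint policy is an ordinary IAM-style JSON document attached to the endpoint. As an illustration, the sketch below builds one that limits an S3 endpoint to a single bucket; the bucket name and the chosen actions are assumptions for the example, not a recommended baseline.

```python
import json


def s3_endpoint_policy(bucket_name):
    """Endpoint policy that only lets traffic through to one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowOnlyOneBucket",
                "Effect": "Allow",
                "Principal": "*",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
                "Resource": [
                    f"arn:aws:s3:::{bucket_name}",
                    f"arn:aws:s3:::{bucket_name}/*",
                ],
            }
        ],
    }


# "example-app-data" is a placeholder bucket name.
policy = s3_endpoint_policy("example-app-data")
print(json.dumps(policy, indent=2))
```

The endpoint policy only narrows what can pass through the endpoint; the caller still needs its own IAM permissions, and the bucket policy still applies.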
Hands-on example
1. Create a sandbox VPC with two AZs, public subnets, private subnets, route tables, IGW, NAT Gateway, security groups, and one VPC endpoint relevant to the topic.
2. Deploy a small test instance or pod in the correct subnet and validate routing with curl, traceroute where allowed, and VPC Flow Logs.
3. Change one control at a time - route, SG, NACL, endpoint policy, NAT, or TGW route - and observe exactly how connectivity changes.
4. Document the final production pattern as an architecture diagram plus a troubleshooting checklist.
Explain the difference between IAM users, groups, roles, and policies. (Basic)
Answer
IAM users are long-lived identities, groups organize users, roles are assumable identities with temporary credentials, and policies define permissions. In production, I prefer roles and federation over static IAM users and access keys.
Technical explanation
Roles with STS temporary credentials are preferred for humans through federation and workloads through instance profiles or service identities.
IAM evaluation is layered: identity policies, resource policies, trust policies, boundaries, SCPs, session policies, and explicit denies all contribute to the final decision.
Prefer temporary credentials through STS, roles, IAM Identity Center, instance profiles, IRSA, or OIDC federation instead of long-lived access keys.
Use conditions, resource ARNs, tags, MFA requirements, external IDs, source account/source ARN constraints, and Access Analyzer to reduce blast radius.
Hands-on example
1. Create a least-privilege IAM role for a small workload, including trust policy, permission policy, tags, and CloudTrail visibility.
2. Test the role with aws sts get-caller-identity and one allowed action, then deliberately test one denied action.
3. Run IAM Access Analyzer or policy simulation and refine broad actions/resources before production.
4. Record the access pattern in IaC and require review for future policy changes.
What is the difference between an IAM role and an IAM user, and when do you use a role? (Basic)
Answer
An IAM user is persistent and often has long-lived credentials; an IAM role is assumed temporarily by a trusted principal. I use roles for AWS services, workloads, federation, and cross-account access because temporary credentials reduce credential-leak risk.
Technical explanation
A role has a trust policy that controls who may assume it and permission policies that control what it may do afterward.
IAM evaluation is layered: identity policies, resource policies, trust policies, boundaries, SCPs, session policies, and explicit denies all contribute to the final decision.
Prefer temporary credentials through STS, roles, IAM Identity Center, instance profiles, IRSA, or OIDC federation instead of long-lived access keys.
Use conditions, resource ARNs, tags, MFA requirements, external IDs, source account/source ARN constraints, and Access Analyzer to reduce blast radius.
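The two halves of a role can be written out side by side. This is a minimal sketch, assuming a role meant for EC2 that only reads one bucket; the bucket name is a placeholder and the helper names are made up for the example.

```python
import json


def ec2_trust_policy():
    """Trust policy: WHO may assume the role (here, the EC2 service)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "ec2.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def read_only_bucket_policy(bucket_name):
    """Permission policy: WHAT the role may do once assumed."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket_name}/*"],
            }
        ],
    }


trust = ec2_trust_policy()
permissions = read_only_bucket_policy("example-app-data")
print(json.dumps({"trust": trust, "permissions": permissions}, indent=2))
```

Keeping the two documents separate in IaC makes reviews easier: changes to who can assume the role and changes to what it can do show up as different diffs.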
Hands-on example
1. Create a least-privilege IAM role for a small workload, including trust policy, permission policy, tags, and CloudTrail visibility.
2. Test the role with aws sts get-caller-identity and one allowed action, then deliberately test one denied action.
3. Run IAM Access Analyzer or policy simulation and refine broad actions/resources before production.
4. Record the access pattern in IaC and require review for future policy changes.
How does an EC2 instance assume a role, and why is that better than embedding keys? (Basic)
Answer
EC2 assumes a role through an instance profile and retrieves temporary credentials from the Instance Metadata Service. This is better than embedding keys because credentials rotate automatically, expire, are auditable, and can be tightly scoped.
Technical explanation
Enforce IMDSv2 on EC2 so metadata credentials are harder to steal through SSRF-style attacks.
IAM evaluation is layered: identity policies, resource policies, trust policies, boundaries, SCPs, session policies, and explicit denies all contribute to the final decision.
Prefer temporary credentials through STS, roles, IAM Identity Center, instance profiles, IRSA, or OIDC federation instead of long-lived access keys.
Use conditions, resource ARNs, tags, MFA requirements, external IDs, source account/source ARN constraints, and Access Analyzer to reduce blast radius.
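The IMDSv2 flow is a session token request followed by authenticated reads. The sketch below builds those requests with the standard library; `fetch_role_credentials` is a hypothetical helper that only works when run on an EC2 instance with an instance profile, so it is defined but not called here. In practice the AWS SDKs do this automatically.

```python
import json
import urllib.request

IMDS = "http://169.254.169.254/latest"


def token_request(ttl_seconds=21600):
    """IMDSv2 step 1: PUT for a session token with a TTL header."""
    return urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": str(ttl_seconds)},
    )


def credentials_request(token, role_name=""):
    """IMDSv2 step 2: GET with the token; an empty role name lists the role."""
    return urllib.request.Request(
        f"{IMDS}/meta-data/iam/security-credentials/{role_name}",
        headers={"X-aws-ec2-metadata-token": token},
    )


def fetch_role_credentials(timeout=2):
    """Only works on an EC2 instance that has an instance profile attached."""
    with urllib.request.urlopen(token_request(), timeout=timeout) as resp:
        token = resp.read().decode()
    with urllib.request.urlopen(credentials_request(token), timeout=timeout) as resp:
        role_name = resp.read().decode().strip()
    with urllib.request.urlopen(
        credentials_request(token, role_name), timeout=timeout
    ) as resp:
        return json.loads(resp.read())  # temporary keys plus an expiry time


print(token_request().get_method(), token_request().full_url)
```

Because the token step requires a PUT with a custom header, a simple SSRF that can only issue GET requests cannot complete the flow, which is the protection IMDSv2 adds.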
Hands-on example
1. Create a least-privilege IAM role for a small workload, including trust policy, permission policy, tags, and CloudTrail visibility.
2. Test the role with aws sts get-caller-identity and one allowed action, then deliberately test one denied action.
3. Run IAM Access Analyzer or policy simulation and refine broad actions/resources before production.
4. Record the access pattern in IaC and require review for future policy changes.
What is the difference between an identity-based and a resource-based policy? (Basic)
Answer
Identity-based policies attach to users, groups, or roles and define what they can do. Resource-based policies attach to resources like S3 buckets, KMS keys, queues, and Lambda functions and define who can access that resource.
Technical explanation
Cross-account access needs permission on both sides: an identity policy in the caller account and a resource policy or role trust policy in the target account.
IAM evaluation is layered: identity policies, resource policies, trust policies, boundaries, SCPs, session policies, and explicit denies all contribute to the final decision.
Prefer temporary credentials through STS, roles, IAM Identity Center, instance profiles, IRSA, or OIDC federation instead of long-lived access keys.
Use conditions, resource ARNs, tags, MFA requirements, external IDs, source account/source ARN constraints, and Access Analyzer to reduce blast radius.
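The both-sides rule for cross-account access can be reduced to a small truth table. This is a deliberately simplified model: it ignores explicit denies, SCPs, boundaries, and service-specific exceptions (KMS key policies, for example, behave differently), and the account IDs are placeholders.

```python
def access_allowed(caller_account, resource_account, identity_allows, resource_allows):
    """Simplified decision, with no explicit denies or other limiting layers.

    Same account: an allow in either the identity policy or the resource
    policy is generally enough.
    Cross account: both the caller's identity policy and the target's
    resource policy (or role trust policy) must allow.
    """
    if caller_account == resource_account:
        return identity_allows or resource_allows
    return identity_allows and resource_allows


# Placeholder account IDs.
print(access_allowed("111111111111", "111111111111", True, False))
print(access_allowed("111111111111", "222222222222", True, False))
print(access_allowed("111111111111", "222222222222", True, True))
```

The second case is the common interview trap: a caller with full permissions in its own account still gets denied until the target account grants access.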
Hands-on example
1. Create a least-privilege IAM role for a small workload, including trust policy, permission policy, tags, and CloudTrail visibility.
2. Test the role with aws sts get-caller-identity and one allowed action, then deliberately test one denied action.
3. Run IAM Access Analyzer or policy simulation and refine broad actions/resources before production.
4. Record the access pattern in IaC and require review for future policy changes.
Explain how an IAM policy is evaluated when there is both an allow and an explicit deny. (Basic)
Answer
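IAM evaluation starts from an implicit deny. AWS first checks every applicable policy for an explicit deny, which always wins. Only if there is none does it look for an allow, and that allow must hold across the relevant identity, resource, boundary, session, and organization policy layers.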
Technical explanation
Permissions boundaries and SCPs limit maximum permissions; they do not grant access by themselves.
IAM evaluation is layered: identity policies, resource policies, trust policies, boundaries, SCPs, session policies, and explicit denies all contribute to the final decision.
Prefer temporary credentials through STS, roles, IAM Identity Center, instance profiles, IRSA, or OIDC federation instead of long-lived access keys.
Use conditions, resource ARNs, tags, MFA requirements, external IDs, source account/source ARN constraints, and Access Analyzer to reduce blast radius.
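The deny-beats-allow ordering can be sketched as a tiny evaluator. This is a simplified model of a single policy set, not the full AWS evaluation logic: it ignores conditions, principals, boundaries, SCPs, and session policies, and it matches wildcards case-sensitively even though IAM action names are case-insensitive.

```python
from fnmatch import fnmatchcase


def matches(statement, action, resource):
    """True if the statement's Action and Resource patterns cover the request."""
    return any(fnmatchcase(action, p) for p in statement["Action"]) and any(
        fnmatchcase(resource, p) for p in statement["Resource"]
    )


def evaluate(statements, action, resource):
    applicable = [s for s in statements if matches(s, action, resource)]
    if any(s["Effect"] == "Deny" for s in applicable):
        return "explicit deny"  # always wins, regardless of any allow
    if any(s["Effect"] == "Allow" for s in applicable):
        return "allow"
    return "implicit deny"  # nothing matched, so the default applies


# Hypothetical policy set: broad S3 allow, targeted delete deny on prod.
statements = [
    {"Effect": "Allow", "Action": ["s3:*"], "Resource": ["*"]},
    {
        "Effect": "Deny",
        "Action": ["s3:DeleteObject"],
        "Resource": ["arn:aws:s3:::prod-bucket/*"],
    },
]

print(evaluate(statements, "s3:DeleteObject", "arn:aws:s3:::prod-bucket/report.csv"))
print(evaluate(statements, "s3:GetObject", "arn:aws:s3:::prod-bucket/report.csv"))
print(evaluate(statements, "ec2:RunInstances", "*"))
```

The three calls cover the three possible outcomes: an explicit deny overriding a matching allow, a plain allow, and the implicit deny for an action nothing mentions.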
Hands-on example
1. Create a least-privilege IAM role for a small workload, including trust policy, permission policy, tags, and CloudTrail visibility.
2. Test the role with aws sts get-caller-identity and one allowed action, then deliberately test one denied action.
3. Run IAM Access Analyzer or policy simulation and refine broad actions/resources before production.
4. Record the access pattern in IaC and require review for future policy changes.
What is an IAM permissions boundary, and when would you use one? (Basic)
Answer
A permissions boundary sets the maximum permissions a user or role can receive. It does not grant access by itself; it limits delegated administrators or automation from creating identities with broader permissions than allowed.
Technical explanation
Boundaries are useful when platform teams delegate role creation to application teams or CI/CD pipelines.
IAM evaluation is layered: identity policies, resource policies, trust policies, boundaries, SCPs, session policies, and explicit denies all contribute to the final decision.
Prefer temporary credentials through STS, roles, IAM Identity Center, instance profiles, IRSA, or OIDC federation instead of long-lived access keys.
Use conditions, resource ARNs, tags, MFA requirements, external IDs, source account/source ARN constraints, and Access Analyzer to reduce blast radius.
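A boundary acts as an intersection, not a grant. The sketch below models policies as flat sets of action names, which is an assumption made for clarity; real policies also carry resources, conditions, and wildcards.

```python
def effective_actions(identity_allows, boundary_allows):
    """Only actions allowed by BOTH the identity policy and the boundary remain."""
    return set(identity_allows) & set(boundary_allows)


# Hypothetical delegated role: the app team attached a broad policy,
# but the platform team's boundary caps what can take effect.
identity_policy = {"s3:GetObject", "s3:PutObject", "iam:CreateUser"}
boundary = {"s3:GetObject", "s3:PutObject", "sqs:SendMessage"}

print(sorted(effective_actions(identity_policy, boundary)))
```

Two things fall out of the intersection: `iam:CreateUser` is dropped because the boundary excludes it, and `sqs:SendMessage` is not gained because the boundary alone grants nothing.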
Hands-on example
1. Create a least-privilege IAM role for a small workload, including trust policy, permission policy, tags, and CloudTrail visibility.
2. Test the role with aws sts get-caller-identity and one allowed action, then deliberately test one denied action.
3. Run IAM Access Analyzer or policy simulation and refine broad actions/resources before production.
4. Record the access pattern in IaC and require review for future policy changes.
What are AWS Organizations and Service Control Policies (SCPs)? (Basic)
Answer
AWS Organizations manages multiple AWS accounts centrally, and SCPs define maximum permissions for accounts or OUs. SCPs are guardrails: they restrict what can be done, but they do not grant permissions by themselves.
Technical explanation
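SCPs work best as deny guardrails for dangerous actions such as disabling CloudTrail or using unapproved Regions. They apply to member accounts, including their root users, but not to the organization's management account.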
IAM evaluation is layered: identity policies, resource policies, trust policies, boundaries, SCPs, session policies, and explicit denies all contribute to the final decision.
Prefer temporary credentials through STS, roles, IAM Identity Center, instance profiles, IRSA, or OIDC federation instead of long-lived access keys.
Use conditions, resource ARNs, tags, MFA requirements, external IDs, source account/source ARN constraints, and Access Analyzer to reduce blast radius.
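Both guardrails mentioned above can be expressed in one SCP document. The sketch below is illustrative only: the approved Regions and the list of exempted global-service actions are assumptions that would need to be adapted and tested per organization before being attached to an OU.

```python
import json


def guardrail_scp(approved_regions, global_service_actions):
    """Build an SCP that protects CloudTrail and blocks unapproved Regions."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ProtectCloudTrail",
                "Effect": "Deny",
                "Action": ["cloudtrail:StopLogging", "cloudtrail:DeleteTrail"],
                "Resource": "*",
            },
            {
                "Sid": "DenyUnapprovedRegions",
                "Effect": "Deny",
                # Global services are exempted so they keep working.
                "NotAction": list(global_service_actions),
                "Resource": "*",
                "Condition": {
                    "StringNotEquals": {"aws:RequestedRegion": list(approved_regions)}
                },
            },
        ],
    }


# Example values only; the exemption list here is not exhaustive.
scp = guardrail_scp(
    approved_regions=["eu-west-1", "eu-central-1"],
    global_service_actions=["iam:*", "organizations:*", "sts:*", "support:*"],
)
print(json.dumps(scp, indent=2))
```

Because every statement is a Deny, the SCP restricts without granting anything; identities in the account still need their own IAM allows.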
Hands-on example
1. Create a least-privilege IAM role for a small workload, including trust policy, permission policy, tags, and CloudTrail visibility.
2. Test the role with aws sts get-caller-identity and one allowed action, then deliberately test one denied action.
3. Run IAM Access Analyzer or policy simulation and refine broad actions/resources before production.
4. Record the access pattern in IaC and require review for future policy changes.
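The guardrail idea above can be sketched as a policy document built in Python. This is a minimal sketch, not a production SCP: the function name, the approved Region set, and the list of exempted global services are assumptions chosen for illustration, and the output would still need to be attached to an OU through AWS Organizations.

```python
import json


def build_guardrail_scp(approved_regions):
    """Return an SCP document (as a dict) containing two deny guardrails."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Block attempts to stop or delete audit logging.
                "Sid": "DenyCloudTrailTampering",
                "Effect": "Deny",
                "Action": ["cloudtrail:StopLogging", "cloudtrail:DeleteTrail"],
                "Resource": "*",
            },
            {
                # Deny everything outside the approved Regions, except a few
                # global services (illustrative list, adjust per organization).
                "Sid": "DenyUnapprovedRegions",
                "Effect": "Deny",
                "NotAction": ["iam:*", "organizations:*", "sts:*", "support:*"],
                "Resource": "*",
                "Condition": {
                    "StringNotEquals": {
                        "aws:RequestedRegion": sorted(approved_regions)
                    }
                },
            },
        ],
    }


scp = build_guardrail_scp({"eu-west-1", "us-east-1"})
print(json.dumps(scp, indent=2))
```

Both statements are denies, which matches the point that SCPs set a ceiling and never grant permissions on their own.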
What is the principle of least privilege, and how do you enforce it on AWS? (Basic)
Answer
Least privilege means granting only the actions, resources, and duration required for a task. On AWS I enforce it with scoped IAM roles, conditions, boundaries, SCPs, access analysis, CloudTrail review, and policy-as-code checks.
Technical explanation
Least privilege is iterative: start from required actions, observe CloudTrail/access data, and tighten before production.
IAM evaluation is layered: identity policies, resource policies, trust policies, boundaries, SCPs, session policies, and explicit denies all contribute to the final decision.
Prefer temporary credentials through STS, roles, IAM Identity Center, instance profiles, IRSA, or OIDC federation instead of long-lived access keys.
Use conditions, resource ARNs, tags, MFA requirements, external IDs, source account/source ARN constraints, and Access Analyzer to reduce blast radius.
Hands-on example
1. Create a least-privilege IAM role for a small workload, including trust policy, permission policy, tags, and CloudTrail visibility.
2. Test the role with aws sts get-caller-identity and one allowed action, then deliberately test one denied action.
3. Run IAM Access Analyzer or policy simulation and refine broad actions/resources before production.
4. Record the access pattern in IaC and require review for future policy changes.
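To make the "explicit deny wins, everything else is implicitly denied" rule concrete, here is a deliberately simplified evaluator. It is a toy model under stated assumptions: it handles only `Effect`, `Action`, and `Resource` with wildcard matching, and ignores conditions, principals, boundaries, SCPs, and session policies. The bucket name and statements are hypothetical.

```python
from fnmatch import fnmatchcase


def _matches(patterns, value):
    """True if value matches any pattern; '*' is treated as a wildcard."""
    if isinstance(patterns, str):
        patterns = [patterns]
    return any(fnmatchcase(value, pattern) for pattern in patterns)


def evaluate(statements, action, resource):
    """Return 'Deny', 'Allow', or 'ImplicitDeny' for one request."""
    decision = "ImplicitDeny"  # nothing is allowed unless a statement says so
    for stmt in statements:
        if _matches(stmt["Action"], action) and _matches(stmt["Resource"], resource):
            if stmt["Effect"] == "Deny":
                return "Deny"  # an explicit deny always wins
            decision = "Allow"
    return decision


statements = [
    {"Effect": "Allow", "Action": ["s3:GetObject"],
     "Resource": "arn:aws:s3:::example-reports/team-a/*"},
    {"Effect": "Deny", "Action": "s3:*",
     "Resource": "arn:aws:s3:::example-reports/team-a/secret/*"},
]

allowed = evaluate(statements, "s3:GetObject", "arn:aws:s3:::example-reports/team-a/q1.csv")
denied = evaluate(statements, "s3:GetObject", "arn:aws:s3:::example-reports/team-a/secret/keys.txt")
implicit = evaluate(statements, "s3:PutObject", "arn:aws:s3:::example-reports/team-a/q1.csv")
print(allowed, denied, implicit)
```

The third request shows why least privilege starts from required actions: anything not listed is denied without needing a rule.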
How would you give a Kubernetes pod scoped AWS permissions (IRSA / IAM Roles for Service Accounts)? (Basic)
Answer
For EKS pods, I use IRSA or EKS Pod Identity to map a Kubernetes service account to a scoped IAM role. That gives each workload temporary least-privilege AWS credentials instead of sharing node-role permissions or storing static keys.
Technical explanation
IRSA uses the cluster OIDC provider and STS AssumeRoleWithWebIdentity; EKS Pod Identity offers a newer managed pattern with similar least-privilege goals.
IAM evaluation is layered: identity policies, resource policies, trust policies, boundaries, SCPs, session policies, and explicit denies all contribute to the final decision.
Prefer temporary credentials through STS, roles, IAM Identity Center, instance profiles, IRSA, or OIDC federation instead of long-lived access keys.
Use conditions, resource ARNs, tags, MFA requirements, external IDs, source account/source ARN constraints, and Access Analyzer to reduce blast radius.
Hands-on example
1. Create a least-privilege IAM role for a small workload, including trust policy, permission policy, tags, and CloudTrail visibility.
2. Test the role with aws sts get-caller-identity and one allowed action, then deliberately test one denied action.
3. Run IAM Access Analyzer or policy simulation and refine broad actions/resources before production.
4. Record the access pattern in IaC and require review for future policy changes.
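For IRSA specifically, the role's trust policy is what ties it to a single service account. The sketch below builds such a trust policy as a Python dict. The account ID, issuer URL, namespace, and service account name are placeholders, and the helper name is hypothetical.

```python
def irsa_trust_policy(account_id, oidc_issuer_url, namespace, service_account):
    """Build a trust policy that lets one Kubernetes service account assume the role."""
    issuer = oidc_issuer_url
    if issuer.startswith("https://"):
        issuer = issuer[len("https://"):]  # condition keys use the issuer without the scheme
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{issuer}"
                },
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    "StringEquals": {
                        # Scope the role to exactly one namespace and service account.
                        f"{issuer}:sub": f"system:serviceaccount:{namespace}:{service_account}",
                        f"{issuer}:aud": "sts.amazonaws.com",
                    }
                },
            }
        ],
    }


policy = irsa_trust_policy(
    account_id="111122223333",  # placeholder account
    oidc_issuer_url="https://oidc.eks.eu-west-1.amazonaws.com/id/EXAMPLE1234",  # placeholder issuer
    namespace="payments",
    service_account="invoice-worker",
)
print(policy["Statement"][0]["Condition"]["StringEquals"])
```

Pinning the `sub` claim is the important design choice here: without it, any service account in the cluster could assume the role.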
What are the main EC2 instance families, and how do you choose one? (Basic)
Answer
EC2 families include general purpose (which contains the burstable T instances), compute optimized, memory optimized, storage optimized, and accelerated computing. I pick based on the bottleneck: CPU, memory, disk, network, GPU, latency, and cost per unit of work.
Technical explanation
Instance selection should consider Graviton compatibility, EBS bandwidth, ENA/network performance, and price-performance, not just vCPU and RAM.
Compute design should balance availability, scaling speed, startup time, instance limits, health checks, and deployment rollback, not just raw instance size.
Autoscaling and load balancing only work well when health checks reflect readiness and when applications externalize state.
Cost optimization should be tied to utilization data and workload tolerance for interruption, commitment, and architecture changes.
Hands-on example
1. Build a launch template or workload definition with IAM role, security groups, user data/bootstrap, health endpoint, and CloudWatch metrics.
2. Place compute behind an ALB/NLB or scaling group and run a controlled load test to observe scaling and health behavior.
3. Tune scaling policy, warmup/cooldown, target group health checks, and rollback procedure.
4. Compare cost and reliability after the test, then promote the configuration through IaC.
Explain On-Demand, Reserved, Spot, and Savings Plans pricing and when to use each. (Basic)
Answer
On-Demand is flexible pay-as-you-go, Reserved Instances and Savings Plans discount predictable usage, and Spot is cheap spare capacity that can be interrupted. I combine them based on workload predictability and interruption tolerance.
Technical explanation
Spot needs interruption handling and diversified instance types; commitments need coverage and utilization tracking.
Compute design should balance availability, scaling speed, startup time, instance limits, health checks, and deployment rollback, not just raw instance size.
Autoscaling and load balancing only work well when health checks reflect readiness and when applications externalize state.
Cost optimization should be tied to utilization data and workload tolerance for interruption, commitment, and architecture changes.
Hands-on example
1. Build a launch template or workload definition with IAM role, security groups, user data/bootstrap, health endpoint, and CloudWatch metrics.
2. Place compute behind an ALB/NLB or scaling group and run a controlled load test to observe scaling and health behavior.
3. Tune scaling policy, warmup/cooldown, target group health checks, and rollback procedure.
4. Compare cost and reliability after the test, then promote the configuration through IaC.
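One way to reason about combining purchase options is a small blended-cost calculation. The hourly rates below are made-up placeholders, not AWS prices, and the model is a rough sketch: it assumes the steady baseline is covered by a commitment, the extra non-interruptible demand runs On-Demand, and interruption-tolerant work runs on Spot.

```python
from decimal import Decimal

# Hypothetical hourly rates per instance, for illustration only.
RATES = {
    "on_demand": Decimal("0.10"),
    "committed": Decimal("0.06"),  # Reserved Instance or Savings Plan rate
    "spot": Decimal("0.03"),
}


def blended_hourly_cost(baseline, peak, interruptible, rates=RATES):
    """Cost of one hour when the baseline is committed and bursts are On-Demand.

    baseline:      instances that run all the time (covered by commitment)
    peak:          non-interruptible instances needed during this hour
    interruptible: instances running interruption-tolerant work on Spot
    """
    committed_cost = baseline * rates["committed"]  # paid whether used or not
    burst_cost = max(peak - baseline, 0) * rates["on_demand"]
    spot_cost = interruptible * rates["spot"]
    return committed_cost + burst_cost + spot_cost


mixed = blended_hourly_cost(baseline=10, peak=14, interruptible=6)
all_on_demand = 20 * RATES["on_demand"]
print(mixed, all_on_demand)
```

The commitment is charged even when `peak` drops below `baseline`, which is why coverage and utilization tracking matter before buying.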
What is an Auto Scaling Group, and what are the scaling policy types? (Basic)
Answer
An Auto Scaling Group maintains desired EC2 capacity, replaces unhealthy instances, and scales using policies such as target tracking, step scaling, scheduled scaling, and predictive scaling. I prefer metrics that reflect real demand, not only CPU.
Technical explanation
Scaling on request count per target can be better than CPU for web services because it follows user demand more directly.
Compute design should balance availability, scaling speed, startup time, instance limits, health checks, and deployment rollback, not just raw instance size.
Autoscaling and load balancing only work well when health checks reflect readiness and when applications externalize state.
Cost optimization should be tied to utilization data and workload tolerance for interruption, commitment, and architecture changes.
Hands-on example
1. Build a launch template or workload definition with IAM role, security groups, user data/bootstrap, health endpoint, and CloudWatch metrics.
2. Place compute behind an ALB/NLB or scaling group and run a controlled load test to observe scaling and health behavior.
3. Tune scaling policy, warmup/cooldown, target group health checks, and rollback procedure.
4. Compare cost and reliability after the test, then promote the configuration through IaC.
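The intuition behind target tracking can be sketched as a proportional calculation. This is a simplification and an assumption about the general idea, not the service's actual algorithm: real target tracking also accounts for warmup, cooldown, and conservative scale-in.

```python
import math


def target_tracking_desired(current_capacity, metric_value, target_value,
                            min_size, max_size):
    """Estimate the capacity needed to bring a per-instance metric back to target."""
    if target_value <= 0:
        raise ValueError("target_value must be positive")
    # If each instance sees 1.5x the target load, we need about 1.5x the instances.
    raw = math.ceil(current_capacity * metric_value / target_value)
    # The group never goes outside its configured bounds.
    return max(min_size, min(max_size, raw))


# 4 instances each handling 1500 requests against a target of 1000 per instance.
scale_out = target_tracking_desired(4, 1500, 1000, min_size=2, max_size=10)
capped = target_tracking_desired(4, 1500, 1000, min_size=2, max_size=5)
scale_in = target_tracking_desired(4, 200, 1000, min_size=2, max_size=10)
print(scale_out, capped, scale_in)
```

The `capped` case is worth raising in an interview: a maximum size that is too low silently turns a scaling problem into a latency problem.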
Difference between a launch template and a launch configuration. (Basic)
Answer
Launch templates are the modern EC2 launch definition with versioning and support for newer features. Launch configurations are legacy and less capable, so I use launch templates for ASGs and controlled rollouts.
Technical explanation
Launch template versioning enables safe AMI and bootstrap rollouts with ASG instance refresh.
Compute design should balance availability, scaling speed, startup time, instance limits, health checks, and deployment rollback, not just raw instance size.
Autoscaling and load balancing only work well when health checks reflect readiness and when applications externalize state.
Cost optimization should be tied to utilization data and workload tolerance for interruption, commitment, and architecture changes.
Hands-on example
1. Build a launch template or workload definition with IAM role, security groups, user data/bootstrap, health endpoint, and CloudWatch metrics.
2. Place compute behind an ALB/NLB or scaling group and run a controlled load test to observe scaling and health behavior.
3. Tune scaling policy, warmup/cooldown, target group health checks, and rollback procedure.
4. Compare cost and reliability after the test, then promote the configuration through IaC.
How does an Elastic Load Balancer work, and what are ALB vs NLB vs CLB? (Basic)
Answer
Elastic Load Balancing distributes traffic to healthy targets. ALB is Layer 7 for HTTP/HTTPS routing, NLB is Layer 4 for TCP/UDP/TLS and high performance/static IP needs, and CLB is legacy for older patterns.
Technical explanation
ALB is application-aware; NLB is transport-focused; CLB should generally be treated as legacy.
Compute design should balance availability, scaling speed, startup time, instance limits, health checks, and deployment rollback, not just raw instance size.
Autoscaling and load balancing only work well when health checks reflect readiness and when applications externalize state.
Cost optimization should be tied to utilization data and workload tolerance for interruption, commitment, and architecture changes.
Hands-on example
1. Build a launch template or workload definition with IAM role, security groups, user data/bootstrap, health endpoint, and CloudWatch metrics.
2. Place compute behind an ALB/NLB or scaling group and run a controlled load test to observe scaling and health behavior.
3. Tune scaling policy, warmup/cooldown, target group health checks, and rollback procedure.
4. Compare cost and reliability after the test, then promote the configuration through IaC.
When would you choose an ALB over an NLB? (Basic)
Answer
I choose ALB when I need HTTP features such as host/path routing, redirects, headers, WAF, authentication, or Kubernetes ingress. I choose NLB for non-HTTP protocols, static IPs, low latency, or transport-level load balancing.
Technical explanation
ALB integrates well with WAF and HTTP routing rules, making it a strong default for web apps and EKS ingress.
Compute design should balance availability, scaling speed, startup time, instance limits, health checks, and deployment rollback, not just raw instance size.
Autoscaling and load balancing only work well when health checks reflect readiness and when applications externalize state.
Cost optimization should be tied to utilization data and workload tolerance for interruption, commitment, and architecture changes.
Hands-on example
1. Build a launch template or workload definition with IAM role, security groups, user data/bootstrap, health endpoint, and CloudWatch metrics.
2. Place compute behind an ALB/NLB or scaling group and run a controlled load test to observe scaling and health behavior.
3. Tune scaling policy, warmup/cooldown, target group health checks, and rollback procedure.
4. Compare cost and reliability after the test, then promote the configuration through IaC.
What is a target group, and how do health checks work on a load balancer? (Basic)
Answer
A target group is the backend pool for a load balancer. Health checks determine whether each target should receive traffic, so the health endpoint must reflect real readiness and dependencies, not just process liveness.
Technical explanation
A shallow health check can pass while real traffic fails, so readiness should reflect critical dependencies where appropriate.
Compute design should balance availability, scaling speed, startup time, instance limits, health checks, and deployment rollback, not just raw instance size.
Autoscaling and load balancing only work well when health checks reflect readiness and when applications externalize state.
Cost optimization should be tied to utilization data and workload tolerance for interruption, commitment, and architecture changes.
Hands-on example
1. Build a launch template or workload definition with IAM role, security groups, user data/bootstrap, health endpoint, and CloudWatch metrics.
2. Place compute behind an ALB/NLB or scaling group and run a controlled load test to observe scaling and health behavior.
3. Tune scaling policy, warmup/cooldown, target group health checks, and rollback procedure.
4. Compare cost and reliability after the test, then promote the configuration through IaC.
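Health check thresholds behave like a small state machine: a target changes state only after enough consecutive results disagree with its current state. The sketch below models that idea. The function name and default thresholds are illustrative assumptions, and real load balancers add intervals, timeouts, and draining on top.

```python
def track_health(results, healthy_threshold=3, unhealthy_threshold=2,
                 initially_healthy=False):
    """Return the target's health state after each check result.

    results: iterable of booleans, True meaning the check passed.
    """
    healthy = initially_healthy
    streak = 0  # consecutive results that contradict the current state
    states = []
    for passed in results:
        if passed == healthy:
            streak = 0  # result agrees with current state, so reset the streak
        else:
            streak += 1
            needed = healthy_threshold if passed else unhealthy_threshold
            if streak >= needed:
                healthy = passed
                streak = 0
        states.append(healthy)
    return states


checks = [True, True, True, False, True, False, False]
states = track_health(checks)
print(states)
```

A single failed check in the middle of the sequence does not remove the target, which is why thresholds and intervals need tuning together during the load test.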
Explain the difference between vertical and horizontal scaling on AWS. (Basic)
Answer
Vertical scaling means making one resource bigger; horizontal scaling means adding more resources. Horizontal scaling gives better elasticity and resilience for stateless workloads, while vertical scaling is simpler but limited and may require downtime.
Technical explanation
Horizontal scaling requires externalized state; local sessions or local files break elasticity.
Compute design should balance availability, scaling speed, startup time, instance limits, health checks, and deployment rollback, not just raw instance size.
Autoscaling and load balancing only work well when health checks reflect readiness and when applications externalize state.
Cost optimization should be tied to utilization data and workload tolerance for interruption, commitment, and architecture changes.
Hands-on example
1. Build a launch template or workload definition with IAM role, security groups, user data/bootstrap, health endpoint, and CloudWatch metrics.
2. Place compute behind an ALB/NLB or scaling group and run a controlled load test to observe scaling and health behavior.
3. Tune scaling policy, warmup/cooldown, target group health checks, and rollback procedure.
4. Compare cost and reliability after the test, then promote the configuration through IaC.
What are the S3 storage classes, and how do you pick between them? (Basic)
Answer
S3 storage classes trade off cost, availability and resilience, access frequency, and retrieval time. I use Standard for hot data, Intelligent-Tiering when access is uncertain, IA for infrequent access, and Glacier classes for archive and compliance retention.
Technical explanation
Retrieval fees and minimum storage duration can erase savings if objects are transitioned too aggressively.
S3 security should start with Block Public Access, least-privilege IAM/bucket policies, encryption, ownership controls, and CloudTrail or S3 data-event visibility for sensitive buckets.
Cost management depends on lifecycle policies, storage classes, version retention, object size, retrieval fees, and access patterns.
Operationally, validate bucket policies, KMS permissions, lifecycle effects, and restore behavior before applying broad production changes.
Hands-on example
1. Create a non-production bucket with Block Public Access, bucket owner enforced object ownership, default encryption, and scoped IAM access.
2. Add a policy control relevant to the question, such as deny non-TLS, require SSE-KMS, or restrict access to a VPC endpoint.
3. Enable versioning or lifecycle where relevant, upload test objects, and verify transitions, deletes, restores, and access-denied behavior.
4. Review Access Analyzer, Config, CloudTrail, and Storage Lens before applying the pattern to production.
How does S3 lifecycle management work? (Basic)
Answer
S3 lifecycle rules automatically transition or expire objects, noncurrent versions, and incomplete multipart uploads. They are used for cost optimization and retention, but must match access patterns and compliance requirements.
Technical explanation
Versioned buckets need lifecycle rules for current objects, noncurrent versions, and delete markers.
S3 security should start with Block Public Access, least-privilege IAM/bucket policies, encryption, ownership controls, and CloudTrail or S3 data-event visibility for sensitive buckets.
Cost management depends on lifecycle policies, storage classes, version retention, object size, retrieval fees, and access patterns.
Operationally, validate bucket policies, KMS permissions, lifecycle effects, and restore behavior before applying broad production changes.
Hands-on example
1. Create a non-production bucket with Block Public Access, bucket owner enforced object ownership, default encryption, and scoped IAM access.
2. Add a policy control relevant to the question, such as deny non-TLS, require SSE-KMS, or restrict access to a VPC endpoint.
3. Enable versioning or lifecycle where relevant, upload test objects, and verify transitions, deletes, restores, and access-denied behavior.
4. Review Access Analyzer, Config, CloudTrail, and Storage Lens before applying the pattern to production.
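A lifecycle rule covering current objects, noncurrent versions, and incomplete multipart uploads can be sketched as a plain dict in the shape the S3 lifecycle API accepts. The prefix, day counts, and the `check_rule` sanity check are illustrative assumptions, not recommended values.

```python
def lifecycle_configuration(prefix):
    """Build a lifecycle configuration with one rule scoped to a prefix."""
    return {
        "Rules": [
            {
                "ID": "archive-and-expire",
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                # Current versions: move to cheaper classes, then expire.
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": 365},
                # Noncurrent versions: remove 30 days after being superseded.
                "NoncurrentVersionExpiration": {"NoncurrentDays": 30},
                # Clean up uploads that were started but never completed.
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
            }
        ]
    }


def check_rule(rule):
    """Local sanity check: transitions must be ordered and precede expiration."""
    days = [t["Days"] for t in rule.get("Transitions", [])]
    if days != sorted(set(days)):
        raise ValueError("transition days must be strictly increasing")
    expiry = rule.get("Expiration", {}).get("Days")
    if expiry is not None and days and expiry <= days[-1]:
        raise ValueError("expiration must come after the last transition")
    return True


config = lifecycle_configuration("logs/")
print(check_rule(config["Rules"][0]))
```

Keeping the rule as data makes it easy to review in IaC and to test against a non-production bucket before rollout.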
Explain S3 bucket policies versus IAM policies versus ACLs. (Basic)
Answer
S3 bucket policies are resource policies, IAM policies are identity policies, and ACLs are older object/bucket grants. I avoid ACLs in modern designs and use bucket owner enforced ownership, Block Public Access, IAM, and bucket policies.
Technical explanation
Disabling ACLs with bucket owner enforced ownership reduces confusing access paths.
S3 security should start with Block Public Access, least-privilege IAM/bucket policies, encryption, ownership controls, and CloudTrail or S3 data-event visibility for sensitive buckets.
Cost management depends on lifecycle policies, storage classes, version retention, object size, retrieval fees, and access patterns.
Operationally, validate bucket policies, KMS permissions, lifecycle effects, and restore behavior before applying broad production changes.
Hands-on example
1. Create a non-production bucket with Block Public Access, bucket owner enforced object ownership, default encryption, and scoped IAM access.
2. Add a policy control relevant to the question, such as deny non-TLS, require SSE-KMS, or restrict access to a VPC endpoint.
3. Enable versioning or lifecycle where relevant, upload test objects, and verify transitions, deletes, restores, and access-denied behavior.
4. Review Access Analyzer, Config, CloudTrail, and Storage Lens before applying the pattern to production.
How do you make an S3 bucket private and prevent public exposure? (Basic)
Answer
To keep an S3 bucket private, I enable Block Public Access, disable ACLs with bucket owner enforced mode, avoid public policies, grant only scoped roles, require TLS/encryption where needed, and monitor with Access Analyzer and Config.
Technical explanation
Public exposure can come from bucket policy, ACLs, presigned URLs, broad IAM roles, or CloudFront origin misconfiguration.
S3 security should start with Block Public Access, least-privilege IAM/bucket policies, encryption, ownership controls, and CloudTrail or S3 data-event visibility for sensitive buckets.
Cost management depends on lifecycle policies, storage classes, version retention, object size, retrieval fees, and access patterns.
Operationally, validate bucket policies, KMS permissions, lifecycle effects, and restore behavior before applying broad production changes.
Hands-on example
1. Create a non-production bucket with Block Public Access, bucket owner enforced object ownership, default encryption, and scoped IAM access.
2. Add a policy control relevant to the question, such as deny non-TLS, require SSE-KMS, or restrict access to a VPC endpoint.
3. Enable versioning or lifecycle where relevant, upload test objects, and verify transitions, deletes, restores, and access-denied behavior.
4. Review Access Analyzer, Config, CloudTrail, and Storage Lens before applying the pattern to production.
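Two of the controls named above, Block Public Access and a deny on non-TLS requests, can be written down as data. This is a minimal sketch with a placeholder bucket name; applying these settings would still require the S3 API or IaC.

```python
# All four Block Public Access settings turned on.
PUBLIC_ACCESS_BLOCK = {
    "BlockPublicAcls": True,
    "IgnorePublicAcls": True,
    "BlockPublicPolicy": True,
    "RestrictPublicBuckets": True,
}


def deny_insecure_transport_policy(bucket):
    """Bucket policy that rejects any request not sent over TLS."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                # Cover both the bucket itself and every object in it.
                "Resource": [
                    f"arn:aws:s3:::{bucket}",
                    f"arn:aws:s3:::{bucket}/*",
                ],
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            }
        ],
    }


policy = deny_insecure_transport_policy("example-private-bucket")
print(all(PUBLIC_ACCESS_BLOCK.values()), policy["Statement"][0]["Sid"])
```

The policy contains only a deny, so it does not open access to anyone; access is still granted separately through scoped roles.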
What is S3 versioning, and how does it interact with lifecycle rules? (Basic)
Answer
S3 versioning keeps previous object versions and adds a delete marker instead of permanently deleting on a simple delete. Lifecycle rules should manage noncurrent versions; otherwise versioning protects recoverability but silently grows storage cost.
Technical explanation
Versioning is excellent for recovery but must be paired with lifecycle cost controls.
S3 security should start with Block Public Access, least-privilege IAM/bucket policies, encryption, ownership controls, and CloudTrail or S3 data-event visibility for sensitive buckets.
Cost management depends on lifecycle policies, storage classes, version retention, object size, retrieval fees, and access patterns.
Operationally, validate bucket policies, KMS permissions, lifecycle effects, and restore behavior before applying broad production changes.
Hands-on example
1. Create a non-production bucket with Block Public Access, bucket owner enforced object ownership, default encryption, and scoped IAM access.
2. Add a policy control relevant to the question, such as deny non-TLS, require SSE-KMS, or restrict access to a VPC endpoint.
3. Enable versioning or lifecycle where relevant, upload test objects, and verify transitions, deletes, restores, and access-denied behavior.
4. Review Access Analyzer, Config, CloudTrail, and Storage Lens before applying the pattern to production.
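The interaction between versions, delete markers, and lifecycle cleanup is easier to see in a toy in-memory model. This is not the S3 API: the class and its methods are hypothetical, and `expire_noncurrent` stands in for a noncurrent-version lifecycle rule using a version count instead of days.

```python
class VersionedBucket:
    """Toy model of a versioned bucket; a body of None represents a delete marker."""

    def __init__(self):
        self._versions = {}  # key -> list of (version_id, body), newest last
        self._counter = 0

    def _next_id(self):
        self._counter += 1
        return f"v{self._counter}"

    def put(self, key, body):
        version_id = self._next_id()
        self._versions.setdefault(key, []).append((version_id, body))
        return version_id

    def delete(self, key):
        """A simple delete only adds a delete marker; older versions remain."""
        version_id = self._next_id()
        self._versions.setdefault(key, []).append((version_id, None))
        return version_id

    def get(self, key, version_id=None):
        versions = self._versions.get(key, [])
        if version_id is None:
            if not versions or versions[-1][1] is None:
                raise KeyError(key)  # latest version is a delete marker
            return versions[-1][1]
        for vid, body in versions:
            if vid == version_id and body is not None:
                return body
        raise KeyError((key, version_id))

    def expire_noncurrent(self, key, keep=0):
        """Drop noncurrent versions, keeping the current one plus `keep` older ones."""
        versions = self._versions.get(key, [])
        if versions:
            self._versions[key] = versions[-(keep + 1):]

    def stored_versions(self, key):
        return len(self._versions.get(key, []))


bucket = VersionedBucket()
bucket.put("report.csv", "draft")
second = bucket.put("report.csv", "final")
bucket.delete("report.csv")
recovered = bucket.get("report.csv", version_id=second)
before = bucket.stored_versions("report.csv")
bucket.expire_noncurrent("report.csv")
after = bucket.stored_versions("report.csv")
print(recovered, before, after)
```

The object looks deleted to normal reads, yet three entries are still stored and billed until cleanup runs, which is the cost risk described above.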
Explain S3 encryption options: SSE-S3, SSE-KMS, SSE-C, and client-side. (Intermediate)
Answer
SSE-S3 uses S3-managed encryption keys, SSE-KMS uses KMS keys with stronger control and audit, SSE-C uses customer-provided keys, and client-side encryption happens before upload. For sensitive enterprise data I usually prefer SSE-KMS.
Technical explanation
SSE-KMS adds key policy and KMS quota considerations but gives better audit and separation of duties.
S3 security should start with Block Public Access, least-privilege IAM/bucket policies, encryption, ownership controls, and CloudTrail or S3 data-event visibility for sensitive buckets.
Cost management depends on lifecycle policies, storage classes, version retention, object size, retrieval fees, and access patterns.
Operationally, validate bucket policies, KMS permissions, lifecycle effects, and restore behavior before applying broad production changes.
Hands-on example
1. Create a non-production bucket with Block Public Access, bucket owner enforced object ownership, default encryption, and scoped IAM access.
2. Add a policy control relevant to the question, such as deny non-TLS, require SSE-KMS, or restrict access to a VPC endpoint.
3. Enable versioning or lifecycle where relevant, upload test objects, and verify transitions, deletes, restores, and access-denied behavior.
4. Review Access Analyzer, Config, CloudTrail, and Storage Lens before applying the pattern to production.
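For the SSE-KMS preference, the default-encryption setting and a policy that rejects uploads not requesting SSE-KMS can be sketched as follows. The bucket name and key ARN are placeholders, and the helper names are hypothetical.

```python
def default_encryption(kms_key_arn):
    """Default bucket encryption using a customer managed KMS key."""
    return {
        "Rules": [
            {
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": kms_key_arn,
                },
                # Bucket keys reduce the number of calls made to KMS.
                "BucketKeyEnabled": True,
            }
        ]
    }


def require_sse_kms_policy(bucket):
    """Bucket policy that denies uploads that do not request SSE-KMS."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyNonKmsUploads",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": f"arn:aws:s3:::{bucket}/*",
                "Condition": {
                    "StringNotEquals": {
                        "s3:x-amz-server-side-encryption": "aws:kms"
                    }
                },
            }
        ],
    }


encryption = default_encryption(
    "arn:aws:kms:eu-west-1:111122223333:key/EXAMPLE-KEY-ID"  # placeholder ARN
)
policy = require_sse_kms_policy("example-sensitive-bucket")
print(encryption["Rules"][0]["ApplyServerSideEncryptionByDefault"]["SSEAlgorithm"])
```

Callers also need permission on the KMS key itself, so the key policy should be tested together with the bucket policy.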
What is the difference between EBS and instance store? (Intermediate)
Answer
EBS is persistent block storage attached to EC2. It persists across stop/start, and survives instance termination when DeleteOnTermination is disabled for the volume. Instance store is local ephemeral storage that is lost on stop, termination, or host failure, so it is only for rebuildable temporary data.
Technical explanation
EBS is AZ-scoped persistent block storage; instance store is host-local and ephemeral.
AWS storage choices are based on access model: block storage for disks, object storage for objects, shared file storage for POSIX file access, and ephemeral storage for rebuildable temporary data.
Performance must be evaluated at both the storage layer and instance/network layer; a high-performance volume cannot exceed instance bandwidth limits.
Backups are only useful when restore is tested, retention is aligned to policy, and encryption/cross-account/cross-Region protection is considered.
Hands-on example
1. Provision the storage option in a test environment with encryption, tags, backups, and monitoring enabled.
2. Run a workload-specific benchmark for IOPS, throughput, latency, concurrency, or shared-file behavior.
3. Create and restore a backup or snapshot to prove recovery rather than only creation.
4. Document the selected storage type, limits, cost assumptions, and restore runbook.
What are the EBS volume types and their use cases? (Intermediate)
Answer
gp3 is the common general-purpose EBS default, io2 is for high-IOPS low-latency critical workloads, st1 is throughput HDD, and sc1 is cold HDD. I choose by IOPS, throughput, latency, instance limits, durability, and cost.
Technical explanation
Volume performance must be compared with the EC2 instance's own EBS bandwidth limits.
AWS storage choices are based on access model: block storage for disks, object storage for objects, shared file storage for POSIX file access, and ephemeral storage for rebuildable temporary data.
Performance must be evaluated at both the storage layer and instance/network layer; a high-performance volume cannot exceed instance bandwidth limits.
Backups are only useful when restore is tested, retention is aligned to policy, and encryption/cross-account/cross-Region protection is considered.
Hands-on example
1. Provision the storage option in a test environment with encryption, tags, backups, and monitoring enabled.
2. Run a workload-specific benchmark for IOPS, throughput, latency, concurrency, or shared-file behavior.
3. Create and restore a backup or snapshot to prove recovery rather than only creation.
4. Document the selected storage type, limits, cost assumptions, and restore runbook.
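The point about instance limits reduces to taking a minimum. The sketch below uses made-up throughput figures, not published limits for any volume or instance type.

```python
def effective_throughput_mibps(volume_limits, instance_limit):
    """Aggregate throughput is capped by the instance's own EBS bandwidth."""
    return min(sum(volume_limits), instance_limit)


# Two volumes provisioned for 1000 MiB/s each on an instance capped at 1250 MiB/s.
capped = effective_throughput_mibps([1000, 1000], instance_limit=1250)
# One small volume on the same instance: the volume is the bottleneck instead.
uncapped = effective_throughput_mibps([125], instance_limit=1250)
print(capped, uncapped)
```

In the first case 750 MiB/s of provisioned volume throughput is paid for but unreachable, which is the situation the benchmark step is meant to catch.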
How do EBS snapshots work, and are they incremental? (Intermediate)
Answer
EBS snapshots are point-in-time backups and are incremental after the first snapshot. Each snapshot can restore a full volume, but application consistency still requires database-aware backup or quiescing for stateful workloads.
Technical explanation
Snapshots are block-incremental internally, but every snapshot presents a complete restore point.
AWS storage choices are based on access model: block storage for disks, object storage for objects, shared file storage for POSIX file access, and ephemeral storage for rebuildable temporary data.
Performance must be evaluated at both the storage layer and instance/network layer; a high-performance volume cannot exceed instance bandwidth limits.
Backups are only useful when restore is tested, retention is aligned to policy, and encryption/cross-account/cross-Region protection is considered.
Hands-on example
1. Provision the storage option in a test environment with encryption, tags, backups, and monitoring enabled.
2. Run a workload-specific benchmark for IOPS, throughput, latency, concurrency, or shared-file behavior.
3. Create and restore a backup or snapshot to prove recovery rather than only creation.
4. Document the selected storage type, limits, cost assumptions, and restore runbook.
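"Incremental storage, complete restore point" can be illustrated with a toy block model. This is a conceptual sketch only: the class is hypothetical, and real EBS snapshots manage block references internally, so that deleting an older snapshot does not break newer ones.

```python
class SnapshotStore:
    """Toy model: each snapshot stores only blocks changed since the previous one."""

    def __init__(self):
        self._deltas = []  # one dict of {block_index: data} per snapshot

    def take(self, volume):
        """Snapshot a volume (dict of block_index -> data); return (id, blocks_stored)."""
        previous = self.restore(len(self._deltas) - 1) if self._deltas else {}
        delta = {i: data for i, data in volume.items() if previous.get(i) != data}
        self._deltas.append(delta)
        return len(self._deltas) - 1, len(delta)

    def restore(self, snapshot_id):
        """Rebuild the full volume contents as of the given snapshot."""
        blocks = {}
        for delta in self._deltas[: snapshot_id + 1]:
            blocks.update(delta)
        return blocks


store = SnapshotStore()
volume = {0: "a", 1: "b", 2: "c"}
first_id, first_stored = store.take(volume)    # first snapshot stores every block
volume[1] = "B"
second_id, second_stored = store.take(volume)  # second stores only the changed block
print(first_stored, second_stored, store.restore(second_id))
```

Only one block is stored for the second snapshot, yet restoring it yields the whole volume.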
What is EFS, and when would you use it over EBS or S3? (Intermediate)
Answer
EFS is managed elastic NFS shared storage mountable by multiple compute nodes across AZs. I use it when applications need shared POSIX file access; EBS is block storage and S3 is object storage.
Technical explanation
EFS security combines mount targets, security groups, POSIX permissions, access points, IAM, and encryption.
AWS storage choices are based on access model: block storage for disks, object storage for objects, shared file storage for POSIX file access, and ephemeral storage for rebuildable temporary data.
Performance must be evaluated at both the storage layer and instance/network layer; a high-performance volume cannot exceed instance bandwidth limits.
Backups are only useful when restore is tested, retention is aligned to policy, and encryption/cross-account/cross-Region protection is considered.
Hands-on example
1. Provision the storage option in a test environment with encryption, tags, backups, and monitoring enabled.
2. Run a workload-specific benchmark for IOPS, throughput, latency, concurrency, or shared-file behavior.
3. Create and restore a backup or snapshot to prove recovery rather than only creation.
4. Document the selected storage type, limits, cost assumptions, and restore runbook.
Explain Amazon RDS and the engines it supports. (Intermediate)
Answer
RDS is AWS's managed relational database service for engines such as PostgreSQL, MySQL, MariaDB, Oracle, SQL Server, and Aurora. AWS manages infrastructure operations, while I still own schema, queries, access, tuning, and data security.
Technical explanation
Managed RDS still requires customer ownership of schema design, indexes, users, parameter tuning, and upgrade testing.
Database service choice depends on data model, access patterns, consistency requirements, operational burden, scaling model, and failure tolerance.
Managed databases still require customer ownership of schema, indexes, queries, IAM/network security, backups, upgrades, and restore validation.
Production data design should include encryption, private networking, least-privilege access, monitoring, backups, failover testing, and cost controls.
Hands-on example
1. Create the database/cache in private subnets with encryption, least-privilege security groups, backup/retention settings, and monitoring enabled.
2. Run a representative workload test and capture latency, throughput, connection count, CPU, memory, I/O, and errors.
3. Test the failure path: failover, replica lag, cache loss, throttling, or restore depending on the service.
4. Write the operational runbook covering access, backup, scaling, alarms, and rollback.
What is Multi-AZ in RDS, and how does failover work? (Intermediate)
Answer
RDS Multi-AZ is primarily a high-availability feature. RDS maintains a synchronously replicated standby in another AZ and, during infrastructure failure or maintenance, fails over by pointing the same DB endpoint at the standby, so applications must handle brief reconnects and retries.
Technical explanation
Applications must reconnect through the RDS endpoint after failover and should avoid long DNS caching.
Database service choice depends on data model, access patterns, consistency requirements, operational burden, scaling model, and failure tolerance.
Managed databases still require customer ownership of schema, indexes, queries, IAM/network security, backups, upgrades, and restore validation.
Production data design should include encryption, private networking, least-privilege access, monitoring, backups, failover testing, and cost controls.
Hands-on example
1. Create the database/cache in private subnets with encryption, least-privilege security groups, backup/retention settings, and monitoring enabled.
2. Run a representative workload test and capture latency, throughput, connection count, CPU, memory, I/O, and errors.
3. Test the failure path: failover, replica lag, cache loss, throttling, or restore depending on the service.
4. Write the operational runbook covering access, backup, scaling, alarms, and rollback.
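The "reconnect and retry" behavior expected from the application can be sketched with a small helper. It is a minimal sketch under assumptions: `connect` and `operation` are caller-supplied callables, failures surface as `ConnectionError`, and the operation is safe to repeat. The fake connection below exists only to demonstrate the flow.

```python
import time


def run_with_reconnect(connect, operation, attempts=5, base_delay=0.5,
                       sleep=time.sleep):
    """Run operation on a fresh connection, retrying with exponential backoff."""
    last_error = None
    for attempt in range(attempts):
        try:
            # Opening a new connection each time re-resolves the endpoint,
            # so the client follows it to the new primary after failover.
            conn = connect()
            return operation(conn)
        except ConnectionError as exc:
            last_error = exc
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise last_error


class FakeConnection:
    def query(self):
        return "ok"


failures = {"remaining": 2}  # simulate two failed connects during failover


def flaky_connect():
    if failures["remaining"] > 0:
        failures["remaining"] -= 1
        raise ConnectionError("endpoint is failing over")
    return FakeConnection()


delays = []  # record backoff delays instead of actually sleeping
result = run_with_reconnect(flaky_connect, lambda conn: conn.query(),
                            sleep=delays.append)
print(result, delays)
```

Retrying writes this way is only safe when they are idempotent, which is worth stating explicitly in the runbook.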
Difference between RDS Multi-AZ and read replicas. (Intermediate)
Answer
Multi-AZ is for availability and failover; read replicas are for read scaling and sometimes DR. Classic Multi-AZ standby is not for reads, while replicas are usually asynchronous and can lag.
Technical explanation
Read replica lag must be monitored before sending user-facing reads to replicas.
Database service choice depends on data model, access patterns, consistency requirements, operational burden, scaling model, and failure tolerance.
Managed databases still require customer ownership of schema, indexes, queries, IAM/network security, backups, upgrades, and restore validation.
Production data design should include encryption, private networking, least-privilege access, monitoring, backups, failover testing, and cost controls.
Hands-on example
1. Create the database/cache in private subnets with encryption, least-privilege security groups, backup/retention settings, and monitoring enabled.
2. Run a representative workload test and capture latency, throughput, connection count, CPU, memory, I/O, and errors.
3. Test the failure path: failover, replica lag, cache loss, throttling, or restore depending on the service.
4. Write the operational runbook covering access, backup, scaling, alarms, and rollback.
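A lag-aware read router shows why replica lag must be checked before serving user-facing reads. The endpoint names and threshold are hypothetical, and in practice the lag values would come from monitoring, for example the RDS `ReplicaLag` CloudWatch metric.

```python
def choose_read_endpoint(replica_lags, max_lag_seconds, primary="primary"):
    """Pick the freshest replica within the lag budget, else fall back to the primary.

    replica_lags: dict of endpoint name -> lag in seconds (None if unknown).
    """
    fresh = {
        name: lag
        for name, lag in replica_lags.items()
        if lag is not None and lag <= max_lag_seconds
    }
    if not fresh:
        return primary  # no replica is fresh enough for this read
    return min(fresh, key=fresh.get)


lags = {"replica-a": 0.8, "replica-b": 12.0, "replica-c": None}
normal = choose_read_endpoint(lags, max_lag_seconds=2)
strict = choose_read_endpoint(lags, max_lag_seconds=0.5)
print(normal, strict)
```

Falling back to the primary protects correctness but shifts load onto it, so the threshold is a trade-off rather than a fixed rule.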
What is Amazon Aurora, and how does it differ from standard RDS? (Intermediate)
Answer
Aurora is AWS's cloud-native MySQL/PostgreSQL-compatible relational engine with distributed storage, fast replication, reader endpoints, and strong HA features. I choose it for scale and resilience after validating compatibility and cost.
Technical explanation
Aurora compatibility should be tested, especially extensions, engine versions, query plans, and operational cost.
Database service choice depends on data model, access patterns, consistency requirements, operational burden, scaling model, and failure tolerance.
Managed databases still require customer ownership of schema, indexes, queries, IAM/network security, backups, upgrades, and restore validation.
Production data design should include encryption, private networking, least-privilege access, monitoring, backups, failover testing, and cost controls.
Hands-on example
1. Create the database/cache in private subnets with encryption, least-privilege security groups, backup/retention settings, and monitoring enabled.
2. Run a representative workload test and capture latency, throughput, connection count, CPU, memory, I/O, and errors.
3. Test the failure path: failover, replica lag, cache loss, throttling, or restore depending on the service.
4. Write the operational runbook covering access, backup, scaling, alarms, and rollback.
What is DynamoDB, and when would you choose it over RDS? Intermediate
Answer
DynamoDB is managed NoSQL for high-scale, low-latency key-value/document workloads. I choose it over RDS when access patterns are known and denormalized; I choose RDS for relational joins, constraints, and SQL flexibility.
Technical explanation
DynamoDB table design starts from exact queries, not normalized relational modeling.
Database service choice depends on data model, access patterns, consistency requirements, operational burden, scaling model, and failure tolerance.
Managed databases still require customer ownership of schema, indexes, queries, IAM/network security, backups, upgrades, and restore validation.
Production data design should include encryption, private networking, least-privilege access, monitoring, backups, failover testing, and cost controls.
Hands-on example
1. Create the table with encryption at rest, least-privilege IAM policies, point-in-time recovery, and monitoring enabled; use a VPC gateway endpoint if access must stay on private networking.
2. Run a representative workload test and capture latency, throughput, connection count, CPU, memory, I/O, and errors.
3. Test the failure path: failover, replica lag, cache loss, throttling, or restore depending on the service.
4. Write the operational runbook covering access, backup, scaling, alarms, and rollback.
Explain DynamoDB partition keys, sort keys, and the importance of access patterns. Intermediate
Answer
DynamoDB partition keys distribute data and traffic; sort keys order related items and support range queries. Good table design starts with access patterns because poor keys cause hot partitions, expensive indexes, and throttling.
Technical explanation
High-cardinality partition keys help avoid hot partitions and uneven throughput.
Database service choice depends on data model, access patterns, consistency requirements, operational burden, scaling model, and failure tolerance.
Managed databases still require customer ownership of schema, indexes, queries, IAM/network security, backups, upgrades, and restore validation.
Production data design should include encryption, private networking, least-privilege access, monitoring, backups, failover testing, and cost controls.
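To make the "start from access patterns" idea concrete, here is a small in-memory sketch of a table with a partition key and a sort key. It is not the DynamoDB API; `TinyTable` and the key formats (`CUSTOMER#…`, `ORDER#…`) are hypothetical. It only illustrates how a composite sort key lets one partition answer a range-style query.

```python
from collections import defaultdict
from typing import Dict, List


class TinyTable:
    """In-memory stand-in for a table with a partition key (pk) and sort key (sk)."""

    def __init__(self) -> None:
        self._partitions: Dict[str, Dict[str, dict]] = defaultdict(dict)

    def put(self, pk: str, sk: str, item: dict) -> None:
        self._partitions[pk][sk] = {"pk": pk, "sk": sk, **item}

    def query(self, pk: str, sk_prefix: str = "") -> List[dict]:
        """Items in ONE partition whose sort key starts with sk_prefix, in sort-key order."""
        partition = self._partitions.get(pk, {})
        return [partition[sk] for sk in sorted(partition) if sk.startswith(sk_prefix)]


table = TinyTable()
table.put("CUSTOMER#42", "ORDER#2024-01-15#A1", {"total": 30})
table.put("CUSTOMER#42", "ORDER#2024-03-02#B7", {"total": 55})
table.put("CUSTOMER#42", "PROFILE", {"name": "Ada"})
table.put("CUSTOMER#77", "ORDER#2024-03-09#C3", {"total": 12})

# Access pattern: "orders for customer 42 in March 2024".
march_orders = table.query("CUSTOMER#42", "ORDER#2024-03")
print([item["sk"] for item in march_orders])
```

The query never scans other customers' partitions, which is the property the key design is meant to buy. A low-cardinality partition key (for example, a status flag) would funnel most traffic into a few partitions instead.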
Hands-on example
1. Create the table with encryption at rest, least-privilege IAM policies, point-in-time recovery, and monitoring enabled; use a VPC gateway endpoint if access must stay on private networking.
2. Run a representative workload test and capture latency, throughput, connection count, CPU, memory, I/O, and errors.
3. Test the failure path: failover, replica lag, cache loss, throttling, or restore depending on the service.
4. Write the operational runbook covering access, backup, scaling, alarms, and rollback.
What is ElastiCache, and what is the difference between Redis and Memcached on it? Intermediate
Answer
ElastiCache provides managed in-memory caching. Redis-compatible engines are better for rich data structures, replication, persistence options, pub/sub, and counters; Memcached is simpler for ephemeral distributed object caching.
Technical explanation
A cache must tolerate misses, eviction, stampedes, and full cache loss without corrupting source-of-truth data.
Database service choice depends on data model, access patterns, consistency requirements, operational burden, scaling model, and failure tolerance.
Managed databases still require customer ownership of schema, indexes, queries, IAM/network security, backups, upgrades, and restore validation.
Production data design should include encryption, private networking, least-privilege access, monitoring, backups, failover testing, and cost controls.
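The requirement that a cache tolerate misses, stampedes, and full loss can be sketched with a cache-aside wrapper. This is an in-process illustration only: the `CacheAside` class and its dictionary stand in for a real cache client, and the single lock limits stampedes within one process, not across a fleet.

```python
import threading
from typing import Callable, Dict


class CacheAside:
    """Cache-aside reads with a lock so concurrent misses load the source once."""

    def __init__(self, load_from_source: Callable[[str], str]) -> None:
        self._load = load_from_source
        self._cache: Dict[str, str] = {}
        self._lock = threading.Lock()
        self.source_reads = 0  # how often we had to hit the source of truth

    def get(self, key: str) -> str:
        value = self._cache.get(key)
        if value is not None:
            return value
        with self._lock:
            # Re-check: another thread may have filled the entry while we waited.
            value = self._cache.get(key)
            if value is None:
                self.source_reads += 1
                value = self._load(key)
                self._cache[key] = value
        return value

    def flush(self) -> None:
        """Simulate eviction or full cache loss."""
        self._cache.clear()


cache = CacheAside(lambda key: key.upper())  # hypothetical source lookup
cache.get("a")
cache.get("a")   # served from cache
cache.flush()    # total cache loss
cache.get("a")   # still correct, just slower
print(cache.source_reads)
```

Because the source of truth is only ever read, never written through the cache, losing the cache costs latency but cannot corrupt data.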
Hands-on example
1. Create the database/cache in private subnets with encryption, least-privilege security groups, backup/retention settings, and monitoring enabled.
2. Run a representative workload test and capture latency, throughput, connection count, CPU, memory, I/O, and errors.
3. Test the failure path: failover, replica lag, cache loss, throttling, or restore depending on the service.
4. Write the operational runbook covering access, backup, scaling, alarms, and rollback.
What is AWS Lambda, and what are its key limits (timeout, memory, package size)? Intermediate
Answer
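Lambda runs code in response to events without managing servers. Key limits include a 15-minute (900-second) maximum timeout, memory configurable from 128 MB to 10,240 MB, a deployment package of 50 MB zipped for direct upload and 250 MB unzipped (or a container image up to 10 GB), a 6 MB payload for synchronous invocations, /tmp storage configurable from 512 MB to 10,240 MB, and account-level and reserved concurrency controls. Quotas change over time, so confirm the numbers against the current Lambda quotas page.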
Technical explanation
Concurrency limits and downstream capacity are often more important operationally than the function code itself.
Serverless design shifts server operations to AWS but increases the importance of event semantics, timeouts, retries, idempotency, concurrency, and downstream protection.
Observability must include structured logs, metrics, traces, DLQs or failure destinations, and alarms on errors, throttles, duration, and age/lag where applicable.
Cold starts, package size, runtime choice, and VPC dependencies should be measured against p95/p99 latency rather than assumed.
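A quick way to reason about concurrency versus downstream capacity is the usual approximation that concurrent executions ≈ request rate × average duration. The sketch below applies it; the function names and the example figures are assumptions for illustration, not measured values.

```python
import math


def estimated_concurrency(requests_per_second: float, avg_duration_ms: float) -> int:
    """Concurrent executions ≈ request rate × average duration (in seconds)."""
    return math.ceil(requests_per_second * avg_duration_ms / 1000)


def fits_downstream(concurrency: int, downstream_max_connections: int) -> bool:
    """True when every concurrent execution could hold one downstream connection."""
    return concurrency <= downstream_max_connections


needed = estimated_concurrency(requests_per_second=100, avg_duration_ms=500)
print(needed, fits_downstream(needed, downstream_max_connections=40))
```

If the estimate exceeds what a database or third-party API can accept, reserved concurrency or a queue in front of the function is usually a safer fix than tuning the code.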
Hands-on example
1. Build a small event-driven flow using Lambda plus the relevant event source such as API Gateway, S3, SQS, or EventBridge.
2. Configure timeout, memory, reserved concurrency, IAM role, structured logs, metrics, tracing, and DLQ or failure destination.
3. Inject duplicate events, timeouts, and downstream failures to validate retries and idempotency.
4. Create alarms for errors, throttles, duration, iterator age or queue age, and DLQ messages.
Explain a Lambda cold start and how to reduce it. Intermediate
Answer
A Lambda cold start is extra latency when AWS initializes a new execution environment. I reduce it with smaller packages, efficient runtime/init code, client reuse, provisioned concurrency for critical paths, and careful VPC/dependency design.
Technical explanation
Provisioned concurrency reduces cold starts for latency-sensitive APIs but adds cost.
Serverless design shifts server operations to AWS but increases the importance of event semantics, timeouts, retries, idempotency, concurrency, and downstream protection.
Observability must include structured logs, metrics, traces, DLQs or failure destinations, and alarms on errors, throttles, duration, and age/lag where applicable.
Cold starts, package size, runtime choice, and VPC dependencies should be measured against p95/p99 latency rather than assumed.
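The "client reuse" advice can be sketched as follows: expensive setup happens once per execution environment and is reused by later (warm) invocations. The `_get_client` helper and the dictionary standing in for a real SDK client are hypothetical; `INIT_COUNT` exists only to make the reuse observable.

```python
import time
from typing import Any, Dict

INIT_COUNT = 0  # counts how many times the expensive setup actually ran
_client: Dict[str, Any] = {}  # placeholder for a real SDK or database client


def _get_client() -> Dict[str, Any]:
    """Create the (hypothetical) expensive client once per execution environment."""
    global INIT_COUNT
    if not _client:
        INIT_COUNT += 1
        _client["created_at"] = time.time()
    return _client


def handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
    client = _get_client()  # reused on warm invocations
    return {
        "init_count": INIT_COUNT,
        "client_age_s": time.time() - client["created_at"],
    }


handler({}, None)          # first call pays the setup cost
print(handler({}, None))   # second call reuses the client
```

Reuse does not remove the first cold start; it only keeps later invocations in the same environment from paying the setup cost again.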
Hands-on example
1. Build a small event-driven flow using Lambda plus the relevant event source such as API Gateway, S3, SQS, or EventBridge.
2. Configure timeout, memory, reserved concurrency, IAM role, structured logs, metrics, tracing, and DLQ or failure destination.
3. Inject duplicate events, timeouts, and downstream failures to validate retries and idempotency.
4. Create alarms for errors, throttles, duration, iterator age or queue age, and DLQ messages.
How do you trigger a Lambda function - name several event sources. Intermediate
Answer
Lambda can be triggered by API Gateway, ALB, S3, SQS, SNS, EventBridge, DynamoDB Streams, Kinesis, Step Functions, CloudWatch Logs, Cognito, and direct SDK calls. The trigger determines retry, batching, and failure behavior.
Technical explanation
Each event source has different retry, batching, ordering, and failure semantics; idempotency is mandatory.
Serverless design shifts server operations to AWS but increases the importance of event semantics, timeouts, retries, idempotency, concurrency, and downstream protection.
Observability must include structured logs, metrics, traces, DLQs or failure destinations, and alarms on errors, throttles, duration, and age/lag where applicable.
Cold starts, package size, runtime choice, and VPC dependencies should be measured against p95/p99 latency rather than assumed.
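Because most event sources can deliver the same event more than once, a handler needs a deduplication step. The sketch below shows the idea with SQS-style records keyed by `messageId`. The in-memory set is an assumption that stands in for a durable store (for example, a table written with a conditional put) and would not survive across execution environments.

```python
from typing import Any, Callable, Dict, List


def make_idempotent_processor(side_effect: Callable[[Dict[str, Any]], None]):
    """Wrap a side effect so each message ID is applied at most once."""
    seen: set = set()  # NOT durable; a real system needs shared storage

    def process(records: List[Dict[str, Any]]) -> int:
        applied = 0
        for record in records:
            message_id = record["messageId"]
            if message_id in seen:
                continue  # duplicate delivery: skip the side effect
            side_effect(record)
            seen.add(message_id)
            applied += 1
        return applied

    return process


ledger: List[Dict[str, Any]] = []
process = make_idempotent_processor(ledger.append)
batch = [
    {"messageId": "m1", "body": "charge order 1"},
    {"messageId": "m1", "body": "charge order 1"},  # duplicate delivery
    {"messageId": "m2", "body": "charge order 2"},
]
print(process(batch), process(batch))
```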
Hands-on example
1. Build a small event-driven flow using Lambda plus the relevant event source such as API Gateway, S3, SQS, or EventBridge.
2. Configure timeout, memory, reserved concurrency, IAM role, structured logs, metrics, tracing, and DLQ or failure destination.
3. Inject duplicate events, timeouts, and downstream failures to validate retries and idempotency.
4. Create alarms for errors, throttles, duration, iterator age or queue age, and DLQ messages.
What is API Gateway, and how does it integrate with Lambda? Intermediate
Answer
API Gateway is a managed API front door for exposing, securing, throttling, and routing APIs. With Lambda integration, it handles the HTTP request path and invokes Lambda as the backend, with authorization, logging, and throttling controls.
Technical explanation
API Gateway throttling and authorization protect Lambda and downstream systems before code execution.
Serverless design shifts server operations to AWS but increases the importance of event semantics, timeouts, retries, idempotency, concurrency, and downstream protection.
Observability must include structured logs, metrics, traces, DLQs or failure destinations, and alarms on errors, throttles, duration, and age/lag where applicable.
Cold starts, package size, runtime choice, and VPC dependencies should be measured against p95/p99 latency rather than assumed.
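With Lambda proxy integration, API Gateway passes the HTTP request to the function as an event and expects a response object containing `statusCode`, optional `headers`, and a string `body`. The handler below is a minimal sketch of that contract; the `name` query parameter and the greeting are made up for illustration.

```python
import json
from typing import Any, Dict


def _response(status: int, payload: Dict[str, Any]) -> Dict[str, Any]:
    return {
        "statusCode": status,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(payload),  # the body must be a string, not a dict
    }


def handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
    """Read a query parameter from the proxy event and return a proxy-shaped response."""
    # queryStringParameters is None when the request has no query string.
    params = event.get("queryStringParameters") or {}
    name = params.get("name")
    if not name:
        return _response(400, {"error": "missing 'name' query parameter"})
    return _response(200, {"message": f"hello, {name}"})


print(handler({"queryStringParameters": {"name": "ada"}}, None))
```

Authorization and throttling are configured on the API Gateway side, so a rejected or throttled request never reaches this code.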
Hands-on example
1. Build a small event-driven flow using Lambda plus the relevant event source such as API Gateway, S3, SQS, or EventBridge.
2. Configure timeout, memory, reserved concurrency, IAM role, structured logs, metrics, tracing, and DLQ or failure destination.
3. Inject duplicate events, timeouts, and downstream failures to validate retries and idempotency.
4. Create alarms for errors, throttles, duration, iterator age or queue age, and DLQ messages.
What is Amazon ECR, and how does it relate to EKS and Docker? Intermediate
Answer
ECR is AWS's managed container registry for Docker/OCI images. CI builds images, pushes them to ECR, and EKS or ECS pulls them for deployment, ideally using immutable tags, scanning, lifecycle policies, and IAM-controlled access.
Technical explanation
Immutable tags and digest pinning reduce supply-chain ambiguity and deployment drift.
EKS is managed Kubernetes, not no-ops Kubernetes: IAM, networking, add-ons, node strategy, upgrades, RBAC, policies, and workload reliability remain customer responsibilities.
Workload identity, private networking, image security, ingress standards, autoscaling, and observability are foundational controls for production clusters.
Troubleshooting EKS requires separating control-plane, node, CNI, scheduler, ingress, and application failure domains.
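Digest pinning can be enforced with a simple check in CI or an admission step. The sketch below classifies image references; the function names and the sample repository are hypothetical, and the check is purely syntactic (it does not contact a registry).

```python
import re

_DIGEST_RE = re.compile(r"@sha256:[0-9a-f]{64}$")


def is_digest_pinned(image_ref: str) -> bool:
    """True when the reference ends in an immutable sha256 digest."""
    return bool(_DIGEST_RE.search(image_ref))


def uses_floating_tag(image_ref: str) -> bool:
    """True for references with no digest that have no tag or use ':latest'."""
    if is_digest_pinned(image_ref):
        return False
    last_segment = image_ref.rsplit("/", 1)[-1]
    if ":" not in last_segment:
        return True  # no tag means the runtime defaults to 'latest'
    return last_segment.rsplit(":", 1)[-1] == "latest"


repo = "123456789012.dkr.ecr.us-east-1.amazonaws.com/app"  # hypothetical repository
print(is_digest_pinned(repo + "@sha256:" + "a" * 64), uses_floating_tag(repo + ":latest"))
```

A versioned tag such as `:1.4.2` passes the floating-tag check but is only as stable as the repository's tag-immutability setting, which is why digests are the stronger guarantee.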
Hands-on example
1. Create or use an EKS sandbox cluster with private subnets, managed add-ons, workload IAM, and a sample namespace.
2. Deploy a small container from ECR using Kubernetes manifests or Helm, with readiness/liveness probes and resource requests.
3. Add ingress/load balancing, pod IAM, logging, metrics, and network/security controls relevant to the question.
4. Test node replacement, pod rescheduling, image pulls, access control, and rollback.
What is Amazon EKS, and what does AWS manage versus what you manage? Intermediate
Answer
EKS is AWS-managed Kubernetes. AWS manages the control plane infrastructure, while I manage workloads, nodes or Fargate choices, add-ons, IAM, networking, security, observability, upgrades, and reliability practices.
Technical explanation
EKS removes control-plane infrastructure work but not Kubernetes platform engineering.
EKS is managed Kubernetes, not no-ops Kubernetes: IAM, networking, add-ons, node strategy, upgrades, RBAC, policies, and workload reliability remain customer responsibilities.
Workload identity, private networking, image security, ingress standards, autoscaling, and observability are foundational controls for production clusters.
Troubleshooting EKS requires separating control-plane, node, CNI, scheduler, ingress, and application failure domains.
Hands-on example
1. Create or use an EKS sandbox cluster with private subnets, managed add-ons, workload IAM, and a sample namespace.
2. Deploy a small container from ECR using Kubernetes manifests or Helm, with readiness/liveness probes and resource requests.
3. Add ingress/load balancing, pod IAM, logging, metrics, and network/security controls relevant to the question.
4. Test node replacement, pod rescheduling, image pulls, access control, and rollback.
Explain the EKS control plane versus worker node responsibilities. Intermediate
Answer
The EKS control plane includes managed API server and etcd; worker nodes run kubelet, pods, networking components, and container runtime. AWS manages control-plane infrastructure, while customer operations focus heavily on nodes and workloads.
Technical explanation
Pods stuck in Pending usually point to node capacity, scheduling constraints, quotas, taints, or CNI issues rather than the control plane.
EKS is managed Kubernetes, not no-ops Kubernetes: IAM, networking, add-ons, node strategy, upgrades, RBAC, policies, and workload reliability remain customer responsibilities.
Workload identity, private networking, image security, ingress standards, autoscaling, and observability are foundational controls for production clusters.
Troubleshooting EKS requires separating control-plane, node, CNI, scheduler, ingress, and application failure domains.
Hands-on example
1. Create or use an EKS sandbox cluster with private subnets, managed add-ons, workload IAM, and a sample namespace.
2. Deploy a small container from ECR using Kubernetes manifests or Helm, with readiness/liveness probes and resource requests.
3. Add ingress/load balancing, pod IAM, logging, metrics, and network/security controls relevant to the question.
4. Test node replacement, pod rescheduling, image pulls, access control, and rollback.
What are managed node groups versus self-managed nodes versus Fargate on EKS? Intermediate
Answer
Managed node groups simplify EC2 worker lifecycle, self-managed nodes give more control, and Fargate runs pods without node management but with constraints. I often mix them based on workload control, cost, and operational needs.
Technical explanation
Fargate reduces node management but does not support DaemonSets or privileged containers, and it restricts some storage and networking patterns.
EKS is managed Kubernetes, not no-ops Kubernetes: IAM, networking, add-ons, node strategy, upgrades, RBAC, policies, and workload reliability remain customer responsibilities.
Workload identity, private networking, image security, ingress standards, autoscaling, and observability are foundational controls for production clusters.
Troubleshooting EKS requires separating control-plane, node, CNI, scheduler, ingress, and application failure domains.
Hands-on example
1. Create or use an EKS sandbox cluster with private subnets, managed add-ons, workload IAM, and a sample namespace.
2. Deploy a small container from ECR using Kubernetes manifests or Helm, with readiness/liveness probes and resource requests.
3. Add ingress/load balancing, pod IAM, logging, metrics, and network/security controls relevant to the question.
4. Test node replacement, pod rescheduling, image pulls, access control, and rollback.
How does the AWS VPC CNI assign pod networking on EKS? Intermediate
Answer
The AWS VPC CNI gives pods routable VPC IPs by assigning ENI secondary IPs or prefixes from subnets to nodes. This makes pod-to-VPC communication simple but turns subnet IP capacity and ENI limits into scaling concerns.
Technical explanation
Subnet IP exhaustion is a common EKS scaling failure mode when using the VPC CNI.
EKS is managed Kubernetes, not no-ops Kubernetes: IAM, networking, add-ons, node strategy, upgrades, RBAC, policies, and workload reliability remain customer responsibilities.
Workload identity, private networking, image security, ingress standards, autoscaling, and observability are foundational controls for production clusters.
Troubleshooting EKS requires separating control-plane, node, CNI, scheduler, ingress, and application failure domains.
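The ENI and subnet limits can be estimated with two small formulas. In secondary-IP mode (no prefix delegation), the commonly used per-node estimate is ENIs × (IPv4 addresses per ENI − 1) + 2, and a VPC subnet loses 5 addresses to AWS reservations. The instance figures in the example (3 ENIs, 10 addresses each) are example inputs; check the EC2 documentation for the instance type you actually use.

```python
def max_pods(enis: int, ipv4_per_eni: int) -> int:
    """Per-node pod estimate for the VPC CNI in secondary-IP mode.

    Each ENI's primary address is not handed to pods (hence the -1); the +2
    accounts for host-network pods that do not consume a VPC address.
    """
    return enis * (ipv4_per_eni - 1) + 2


def subnet_usable_ips(prefix_length: int) -> int:
    """Usable IPv4 addresses in a VPC subnet (AWS reserves 5 per subnet)."""
    return 2 ** (32 - prefix_length) - 5


pods_per_node = max_pods(enis=3, ipv4_per_eni=10)
print(pods_per_node, subnet_usable_ips(24))
```

Comparing the two numbers shows why small subnets run out quickly: a handful of nodes with warm IP pools can consume most of a /24 before the pods are even scheduled.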
Hands-on example
1. Create or use an EKS sandbox cluster with private subnets, managed add-ons, workload IAM, and a sample namespace.
2. Deploy a small container from ECR using Kubernetes manifests or Helm, with readiness/liveness probes and resource requests.
3. Add ingress/load balancing, pod IAM, logging, metrics, and network/security controls relevant to the question.
4. Test node replacement, pod rescheduling, image pulls, access control, and rollback.
What is the AWS Load Balancer Controller, and what does it provision? Intermediate
Answer
The AWS Load Balancer Controller watches Kubernetes Ingress and Service resources and provisions AWS ALBs or NLBs. It lets teams declare load-balancing through Kubernetes manifests while AWS resources are reconciled automatically.
Technical explanation
The controller needs scoped AWS IAM permissions, usually through workload identity, because it creates AWS load balancer resources.
EKS is managed Kubernetes, not no-ops Kubernetes: IAM, networking, add-ons, node strategy, upgrades, RBAC, policies, and workload reliability remain customer responsibilities.
Workload identity, private networking, image security, ingress standards, autoscaling, and observability are foundational controls for production clusters.
Troubleshooting EKS requires separating control-plane, node, CNI, scheduler, ingress, and application failure domains.
Hands-on example
1. Create or use an EKS sandbox cluster with private subnets, managed add-ons, workload IAM, and a sample namespace.
2. Deploy a small container from ECR using Kubernetes manifests or Helm, with readiness/liveness probes and resource requests.
3. Add ingress/load balancing, pod IAM, logging, metrics, and network/security controls relevant to the question.
4. Test node replacement, pod rescheduling, image pulls, access control, and rollback.
How do you authenticate kubectl to an EKS cluster (aws-auth / access entries)? Intermediate
Answer
kubectl authenticates to EKS with AWS IAM, and Kubernetes then authorizes the request. Historically, IAM principals were mapped to Kubernetes identities in the aws-auth ConfigMap; access entries provide an EKS API-managed access model, but RBAC still controls in-cluster permissions.
Technical explanation
Authentication is IAM-based, but Kubernetes RBAC or access policies still define what the caller can do.
EKS is managed Kubernetes, not no-ops Kubernetes: IAM, networking, add-ons, node strategy, upgrades, RBAC, policies, and workload reliability remain customer responsibilities.
Workload identity, private networking, image security, ingress standards, autoscaling, and observability are foundational controls for production clusters.
Troubleshooting EKS requires separating control-plane, node, CNI, scheduler, ingress, and application failure domains.
Hands-on example
1. Create or use an EKS sandbox cluster with private subnets, managed add-ons, workload IAM, and a sample namespace.
2. Deploy a small container from ECR using Kubernetes manifests or Helm, with readiness/liveness probes and resource requests.
3. Add ingress/load balancing, pod IAM, logging, metrics, and network/security controls relevant to the question.
4. Test node replacement, pod rescheduling, image pulls, access control, and rollback.
What is CloudWatch, and what is the difference between metrics, logs, and alarms? Intermediate
Answer
CloudWatch provides metrics, logs, dashboards, and alarms. Metrics are numeric time series, logs are event records, and alarms evaluate metric conditions to trigger notifications or automation.
Technical explanation
Metrics tell you symptoms and trends; logs give context; alarms must be actionable and tied to user impact.
Observability should answer symptoms, cause, scope, and owner: metrics show trends and alerts, logs provide context, traces connect calls, and audit logs attribute changes.
Alert only on actionable conditions such as user impact, fast SLO burn, saturation, unhealthy capacity, or security-sensitive changes.
Centralize retention and access policies so operational debugging and audit investigations are possible without exposing sensitive logs unnecessarily.
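One way to keep alarms actionable is to require several breaching datapoints before firing, in the spirit of CloudWatch's "M out of N" evaluation. The function below is a simplified model of that idea, not CloudWatch's actual evaluation logic; in particular it ignores missing-data handling.

```python
from typing import Sequence


def alarm_state(datapoints: Sequence[float], threshold: float,
                evaluation_periods: int, datapoints_to_alarm: int) -> str:
    """Return 'ALARM' when at least M of the last N datapoints exceed the threshold."""
    if len(datapoints) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    window = datapoints[-evaluation_periods:]
    breaching = sum(1 for value in window if value > threshold)
    return "ALARM" if breaching >= datapoints_to_alarm else "OK"


# Hypothetical p99 latency samples (ms) with a 500 ms threshold, 3 of 5 to alarm.
print(alarm_state([120, 900, 880, 140, 910], 500, 5, 3))
print(alarm_state([120, 900, 130, 140, 150], 500, 5, 3))
```

A single spike stays in `OK`, while a sustained breach fires, which is usually closer to real user impact than alerting on one bad datapoint.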
Hands-on example
1. Enable the relevant telemetry source: CloudWatch metrics/logs, CloudTrail, Config, ALB logs, VPC Flow Logs, or application structured logs.
2. Create a dashboard and one actionable alarm tied to user impact or security risk.
3. Trigger a controlled change or failure and verify that the signal appears with enough context to identify owner and root cause.
4. Document the query, dashboard link, alarm routing, and runbook action.
What are CloudWatch custom metrics, and how do you publish them? Intermediate
Answer
Custom CloudWatch metrics are application or business metrics that AWS does not emit by default. I publish them through SDK/CLI, CloudWatch Agent, or embedded metric format, and use them for SLOs, alarms, dashboards, and scaling.
Technical explanation
Avoid high-cardinality custom metric dimensions such as userId because they can create high cost and noisy dashboards.
Observability should answer symptoms, cause, scope, and owner: metrics show trends and alerts, logs provide context, traces connect calls, and audit logs attribute changes.
Alert only on actionable conditions such as user impact, fast SLO burn, saturation, unhealthy capacity, or security-sensitive changes.
Centralize retention and access policies so operational debugging and audit investigations are possible without exposing sensitive logs unnecessarily.
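As a sketch of the embedded metric format (EMF) option, the function below builds a structured log line with the `_aws` metadata block that declares a namespace, dimension set, and metric definition. The namespace, dimension values, and metric name are illustrative assumptions; the line only becomes a metric when it is written to CloudWatch Logs, for example from a Lambda function's standard output.

```python
import json
import time
from typing import Dict


def emf_record(namespace: str, dimensions: Dict[str, str],
               metric_name: str, value: float, unit: str = "Milliseconds") -> str:
    """Build one embedded-metric-format log line as a JSON string."""
    record = {
        "_aws": {
            "Timestamp": int(time.time() * 1000),  # epoch milliseconds
            "CloudWatchMetrics": [{
                "Namespace": namespace,
                "Dimensions": [sorted(dimensions)],  # one dimension set, by key name
                "Metrics": [{"Name": metric_name, "Unit": unit}],
            }],
        },
        **dimensions,          # dimension values live at the top level
        metric_name: value,    # so does the metric value
    }
    return json.dumps(record)


# Low-cardinality dimensions only: service and operation, never a user ID.
print(emf_record("ExampleApp", {"Service": "checkout", "Operation": "pay"},
                 "CheckoutLatency", 182.5))
```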
Hands-on example
1. Enable the relevant telemetry source: CloudWatch metrics/logs, CloudTrail, Config, ALB logs, VPC Flow Logs, or application structured logs.
2. Create a dashboard and one actionable alarm tied to user impact or security risk.
3. Trigger a controlled change or failure and verify that the signal appears with enough context to identify owner and root cause.
4. Document the query, dashboard link, alarm routing, and runbook action.
What is the difference between CloudWatch Logs and CloudTrail? Intermediate
Answer
CloudWatch Logs stores application and service log events; CloudTrail records AWS API activity. Logs explain workload behavior, while CloudTrail explains who changed what in the AWS control plane.
Technical explanation
Security investigations often correlate CloudTrail API calls with CloudWatch application logs and VPC Flow Logs.
Observability should answer symptoms, cause, scope, and owner: metrics show trends and alerts, logs provide context, traces connect calls, and audit logs attribute changes.
Alert only on actionable conditions such as user impact, fast SLO burn, saturation, unhealthy capacity, or security-sensitive changes.
Centralize retention and access policies so operational debugging and audit investigations are possible without exposing sensitive logs unnecessarily.
Hands-on example
1. Enable the relevant telemetry source: CloudWatch metrics/logs, CloudTrail, Config, ALB logs, VPC Flow Logs, or application structured logs.
2. Create a dashboard and one actionable alarm tied to user impact or security risk.
3. Trigger a controlled change or failure and verify that the signal appears with enough context to identify owner and root cause.
4. Document the query, dashboard link, alarm routing, and runbook action.
What does CloudTrail record, and why does it matter for security and audits? Intermediate
Answer
CloudTrail records AWS account activity such as IAM changes, security group updates, S3 policy changes, and optional data events. It matters for audit, forensics, incident response, and change attribution.
Technical explanation
Organization trails and protected log buckets are important so workload account admins cannot tamper with audit logs.
Observability should answer symptoms, cause, scope, and owner: metrics show trends and alerts, logs provide context, traces connect calls, and audit logs attribute changes.
Alert only on actionable conditions such as user impact, fast SLO burn, saturation, unhealthy capacity, or security-sensitive changes.
Centralize retention and access policies so operational debugging and audit investigations are possible without exposing sensitive logs unnecessarily.
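Change attribution from a CloudTrail record comes down to a few fields: `userIdentity`, `eventSource`, `eventName`, `eventTime`, and `sourceIPAddress`. The sketch below extracts them from a trimmed record. The sample record is invented for illustration (hypothetical account ID, user, and documentation-range IP address) and omits most fields a real record carries.

```python
import json
from typing import Any, Dict

# Invented, trimmed sample; real records contain many more fields.
SAMPLE_RECORD = json.loads("""
{
  "eventTime": "2024-01-01T12:00:00Z",
  "eventSource": "ec2.amazonaws.com",
  "eventName": "AuthorizeSecurityGroupIngress",
  "awsRegion": "us-east-1",
  "sourceIPAddress": "203.0.113.10",
  "userIdentity": {
    "type": "IAMUser",
    "arn": "arn:aws:iam::123456789012:user/example-admin"
  }
}
""")


def summarize(record: Dict[str, Any]) -> Dict[str, str]:
    """Answer who / what / when / from where for one CloudTrail record."""
    identity = record.get("userIdentity", {})
    return {
        "who": identity.get("arn", identity.get("type", "unknown")),
        "what": f'{record["eventSource"]}:{record["eventName"]}',
        "when": record["eventTime"],
        "from": record.get("sourceIPAddress", "unknown"),
    }


print(summarize(SAMPLE_RECORD))
```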
Hands-on example
1. Enable the relevant telemetry source: CloudWatch metrics/logs, CloudTrail, Config, ALB logs, VPC Flow Logs, or application structured logs.
2. Create a dashboard and one actionable alarm tied to user impact or security risk.
3. Trigger a controlled change or failure and verify that the signal appears with enough context to identify owner and root cause.
4. Document the query, dashboard link, alarm routing, and runbook action.
What is AWS Config, and how does it differ from CloudTrail? Intermediate
Answer
AWS Config records resource configuration history and compliance; CloudTrail records API calls. Config shows what changed and whether it is compliant, while CloudTrail shows who made the change and when.
Technical explanation
Config compliance rules detect drift and noncompliance; CloudTrail identifies the API caller behind the change.
Observability should answer symptoms, cause, scope, and owner: metrics show trends and alerts, logs provide context, traces connect calls, and audit logs attribute changes.
Alert only on actionable conditions such as user impact, fast SLO burn, saturation, unhealthy capacity, or security-sensitive changes.
Centralize retention and access policies so operational debugging and audit investigations are possible without exposing sensitive logs unnecessarily.
Hands-on example
1. Enable the relevant telemetry source: CloudWatch metrics/logs, CloudTrail, Config, ALB logs, VPC Flow Logs, or application structured logs.
2. Create a dashboard and one actionable alarm tied to user impact or security risk.
3. Trigger a controlled change or failure and verify that the signal appears with enough context to identify owner and root cause.
4. Document the query, dashboard link, alarm routing, and runbook action.
What is Route 53, and what routing policies does it support? Intermediate
Answer
Route 53 is AWS DNS with hosted zones, domain registration, health checks, and routing policies such as simple, weighted, latency, failover, geolocation, geoproximity, IP-based, and multivalue routing.
Technical explanation
Weighted routing is useful for canary migration, latency routing for regional performance, and failover routing for DNS-level DR.
DNS and CDN design must account for caching behavior, TTLs, origin protection, health signals, TLS, and global user latency.
Route 53 routing policies and CloudFront cache policies should be chosen based on the real traffic-management goal, not because they are available.
Always test failover, cache invalidation, header/cookie/query-string behavior, and origin access controls before production cutover.
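For weighted routing, each record receives a share of DNS responses equal to its weight divided by the sum of all weights in the group. The helper below computes those shares for a canary split; the record names and weights are hypothetical, and the all-zero-weights case is deliberately not modeled.

```python
from typing import Dict


def traffic_share(weights: Dict[str, int]) -> Dict[str, float]:
    """Expected fraction of DNS responses per record: weight / sum(weights)."""
    total = sum(weights.values())
    if total == 0:
        # This sketch does not model the case where every weight is 0.
        raise ValueError("at least one record needs a non-zero weight")
    return {name: weight / total for name, weight in weights.items()}


print(traffic_share({"current": 3, "canary": 1}))
```

These are shares of DNS answers, not of requests: resolver caching and TTLs mean the observed request split only approaches the configured ratio over time.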
Hands-on example
1. Create a test hosted zone or subdomain and route traffic to a controlled ALB, API, S3/CloudFront origin, or secondary Region.
2. Configure the relevant policy (weighted, failover, alias, cache behavior, OAC, or health check) and keep TTLs low during testing.
3. Use dig/curl and CloudFront/Route 53 logs or metrics to verify routing, caching, TLS, and failover behavior.
4. Increase TTLs and tighten origin access after validation.
Explain Route 53 health checks and failover routing. Intermediate
Answer
Route 53 health checks monitor endpoint or alarm health and can drive failover routing. DNS failover is useful, but TTLs, client caching, and data replication determine real recovery time.
Technical explanation
Health checks should validate real readiness; a static port check can produce unsafe failover decisions.
DNS and CDN design must account for caching behavior, TTLs, origin protection, health signals, TLS, and global user latency.
Route 53 routing policies and CloudFront cache policies should be chosen based on the real traffic-management goal, not because they are available.
Always test failover, cache invalidation, header/cookie/query-string behavior, and origin access controls before production cutover.
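A rough way to see why TTLs matter for recovery time is to add detection time to cache expiry. The estimate below assumes detection takes about (check interval × failure threshold) and that clients honor the record TTL; it ignores resolver behavior that exceeds the TTL as well as data-replication delay, so treat it as an optimistic bound. The function name and sample values are illustrative.

```python
def estimated_failover_seconds(request_interval_s: int,
                               failure_threshold: int,
                               record_ttl_s: int) -> int:
    """Approximate DNS failover time: detection window plus one full TTL."""
    detection = request_interval_s * failure_threshold
    return detection + record_ttl_s


# 30-second checks, 3 consecutive failures, 60-second TTL.
print(estimated_failover_seconds(30, 3, 60))
# Same checks with a 1-hour TTL: the TTL dominates.
print(estimated_failover_seconds(30, 3, 3600))
```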
Hands-on example
1. Create a test hosted zone or subdomain and route traffic to a controlled ALB, API, S3/CloudFront origin, or secondary Region.
2. Configure the relevant policy (weighted, failover, alias, cache behavior, OAC, or health check) and keep TTLs low during testing.
3. Use dig/curl and CloudFront/Route 53 logs or metrics to verify routing, caching, TLS, and failover behavior.
4. Increase TTLs and tighten origin access after validation.
What is the difference between a CNAME and an Alias record in Route 53? Intermediate
Answer
A CNAME maps a DNS name to another name and usually cannot be used at the zone apex. A Route 53 Alias points to AWS resources, can be used at the apex, and is preferred for ALB, CloudFront, API Gateway, and similar targets.
Technical explanation
Alias records are AWS-specific and can be used at the zone apex where CNAME normally cannot.
DNS and CDN design must account for caching behavior, TTLs, origin protection, health signals, TLS, and global user latency.
Route 53 routing policies and CloudFront cache policies should be chosen based on the real traffic-management goal, not because they are available.
Always test failover, cache invalidation, header/cookie/query-string behavior, and origin access controls before production cutover.
Hands-on example
1. Create a test hosted zone or subdomain and route traffic to a controlled ALB, API, S3/CloudFront origin, or secondary Region.
2. Configure the relevant policy (weighted, failover, alias, cache behavior, OAC, or health check) and keep TTLs low during testing.
3. Use dig/curl and CloudFront/Route 53 logs or metrics to verify routing, caching, TLS, and failover behavior.
4. Increase TTLs and tighten origin access after validation.
What is CloudFront, and how does it improve performance and reduce cost? Intermediate
Answer
CloudFront is AWS's CDN. It caches and serves content from edge locations close to users, reducing latency, origin load, and sometimes cost, while adding controls like TLS, WAF integration, signed URLs, and origin protection.
Technical explanation
Cache key design controls correctness: headers, cookies, and query strings should be included only when needed.
DNS and CDN design must account for caching behavior, TTLs, origin protection, health signals, TLS, and global user latency.
Route 53 routing policies and CloudFront cache policies should be chosen based on the real traffic-management goal, not because they are available.
Always test failover, cache invalidation, header/cookie/query-string behavior, and origin access controls before production cutover.
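The effect of cache key design can be shown with a small model: only allowlisted query strings and headers take part in the key, so requests that differ in anything else share one cached object. This is a conceptual sketch, not CloudFront's actual key algorithm, and the `cache_key` function and sample URLs are hypothetical.

```python
from typing import Dict, Iterable, Tuple
from urllib.parse import parse_qsl, urlsplit


def cache_key(url: str, headers: Dict[str, str],
              allowed_query: Iterable[str] = (),
              allowed_headers: Iterable[str] = ()) -> Tuple:
    """Build a key from the path plus only the allowlisted query params and headers."""
    parts = urlsplit(url)
    query_allow = set(allowed_query)
    header_allow = {name.lower() for name in allowed_headers}
    query = tuple(sorted(
        (k, v) for k, v in parse_qsl(parts.query) if k in query_allow
    ))
    hdrs = tuple(sorted(
        (k.lower(), v) for k, v in headers.items() if k.lower() in header_allow
    ))
    return (parts.path, query, hdrs)


a = cache_key("https://example.com/img?size=large&utm_source=mail",
              {"User-Agent": "browser-a"}, allowed_query=["size"])
b = cache_key("https://example.com/img?utm_source=ad&size=large",
              {"User-Agent": "browser-b"}, allowed_query=["size"])
print(a == b)
```

Adding `User-Agent` or a tracking parameter to the allowlist would split these two requests into separate cache entries, lowering the hit ratio without changing what the origin returns.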
Hands-on example
1. Create a test hosted zone or subdomain and route traffic to a controlled ALB, API, S3/CloudFront origin, or secondary Region.
2. Configure the relevant policy (weighted, failover, alias, cache behavior, OAC, or health check) and keep TTLs low during testing.
3. Use dig/curl and CloudFront/Route 53 logs or metrics to verify routing, caching, TLS, and failover behavior.
4. Increase TTLs and tighten origin access after validation.
What is AWS KMS, and what is the difference between an AWS-managed and a customer-managed key? Intermediate
Answer
KMS manages cryptographic keys used by AWS services and applications. AWS-managed keys are service-managed with limited control; customer-managed keys give custom policy, audit, rotation, grants, aliases, and lifecycle control.
Technical explanation
KMS key policies are foundational; IAM permissions alone do not help if the key policy blocks the access path.
Key and secret controls must combine IAM policy, resource policy, KMS key policy, rotation, audit logging, and application refresh behavior.
Do not confuse encryption with authorization: encrypted data is still exposed if decrypt and read permissions are too broad.
Secret rotation must include monitoring and rollback because a failed rotation can become a production outage.
Hands-on example
1. Create a test KMS key, secret or parameter, IAM role, and workload that retrieves the value at runtime.
2. Scope permissions to the specific secret/parameter and KMS key, then test allowed and denied reads.
3. If rotation is relevant, run a manual rotation and confirm the application refreshes safely.
4. Add CloudTrail/CloudWatch alarms for failed rotation, denied decrypts, and suspicious access.
What is the difference between KMS and Secrets Manager?Advanced
Answer
KMS manages encryption keys; Secrets Manager stores and rotates secrets. Secrets Manager often uses KMS underneath, but it is the secret lifecycle and retrieval service, while KMS is the key control plane.
Technical explanation
A workload may need both secretsmanager:GetSecretValue and kms:Decrypt to read a secret encrypted by a customer-managed key.
Key and secret controls must combine IAM policy, resource policy, KMS key policy, rotation, audit logging, and application refresh behavior.
Do not confuse encryption with authorization: encrypted data is still exposed if decrypt and read permissions are too broad.
Secret rotation must include monitoring and rollback because a failed rotation can become a production outage.
Hands-on example
1. Create a test KMS key, secret or parameter, IAM role, and workload that retrieves the value at runtime.
2. Scope permissions to the specific secret/parameter and KMS key, then test allowed and denied reads.
3. If rotation is relevant, run a manual rotation and confirm the application refreshes safely.
4. Add CloudTrail/CloudWatch alarms for failed rotation, denied decrypts, and suspicious access.
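As a sketch of step 2, the identity policy below grants exactly the two actions mentioned in the technical explanation, each scoped to one resource. The helper name and both ARNs are placeholders made up for illustration; the KMS key policy must also allow the role, since IAM permissions alone are not enough.

```python
import json

def secret_read_policy(secret_arn, kms_key_arn):
    """Return an IAM policy document allowing one secret to be read and decrypted."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadOneSecret",
                "Effect": "Allow",
                "Action": "secretsmanager:GetSecretValue",
                "Resource": secret_arn,
            },
            {
                "Sid": "DecryptWithOneKey",
                "Effect": "Allow",
                "Action": "kms:Decrypt",
                "Resource": kms_key_arn,
            },
        ],
    }

# Placeholder ARNs for illustration only.
policy = secret_read_policy(
    "arn:aws:secretsmanager:us-east-1:111122223333:secret:app/db-EXAMPLE",
    "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID",
)
policy_json = json.dumps(policy, indent=2)
```

Attaching this to the workload role and then attempting to read a different secret is a quick way to run the denied-read test.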
What is the difference between Secrets Manager and SSM Parameter Store?Advanced
Answer
Secrets Manager is best for secrets needing rotation and version staging. SSM Parameter Store is often simpler for hierarchical configuration, feature flags, AMI IDs, and lower-rotation SecureString parameters.
Technical explanation
Parameter Store is strong for hierarchical config paths; Secrets Manager is stronger for secret lifecycle and rotation.
Key and secret controls must combine IAM policy, resource policy, KMS key policy, rotation, audit logging, and application refresh behavior.
Do not confuse encryption with authorization: encrypted data is still exposed if decrypt and read permissions are too broad.
Secret rotation must include monitoring and rollback because a failed rotation can become a production outage.
Hands-on example
1. Create a test KMS key, secret or parameter, IAM role, and workload that retrieves the value at runtime.
2. Scope permissions to the specific secret/parameter and KMS key, then test allowed and denied reads.
3. If rotation is relevant, run a manual rotation and confirm the application refreshes safely.
4. Add CloudTrail/CloudWatch alarms for failed rotation, denied decrypts, and suspicious access.
How does Secrets Manager rotation work?Advanced
Answer
Secrets Manager rotation normally uses a Lambda function that runs four steps: create a new secret version (createSecret), apply it to the target service (setSecret), test it (testSecret), and promote it to current (finishSecret). It must be tested carefully because broken rotation can cause application authentication outages.
Technical explanation
Applications must refresh secrets safely; storing rotated secrets permanently in environment variables defeats the rotation model.
Key and secret controls must combine IAM policy, resource policy, KMS key policy, rotation, audit logging, and application refresh behavior.
Do not confuse encryption with authorization: encrypted data is still exposed if decrypt and read permissions are too broad.
Secret rotation must include monitoring and rollback because a failed rotation can become a production outage.
Hands-on example
1. Create a test KMS key, secret or parameter, IAM role, and workload that retrieves the value at runtime.
2. Scope permissions to the specific secret/parameter and KMS key, then test allowed and denied reads.
3. If rotation is relevant, run a manual rotation and confirm the application refreshes safely.
4. Add CloudTrail/CloudWatch alarms for failed rotation, denied decrypts, and suspicious access.
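The rotation flow can be rehearsed without AWS using an in-memory model. The sketch below is a simplification: `FakeSecret`, `rotate`, and the callbacks are hypothetical names, while the four step names and the AWSCURRENT, AWSPENDING, and AWSPREVIOUS staging labels follow the Secrets Manager rotation model.

```python
import secrets

class FakeSecret:
    """In-memory stand-in for one secret with version staging labels."""
    def __init__(self, value):
        self.versions = {"v1": value}
        self.labels = {"AWSCURRENT": "v1"}

    def value(self, label="AWSCURRENT"):
        return self.versions[self.labels[label]]

def rotate(secret, apply_to_service, test_login):
    # createSecret: generate a new value and stage it as AWSPENDING.
    new_id = f"v{len(secret.versions) + 1}"
    secret.versions[new_id] = secrets.token_urlsafe(16)
    secret.labels["AWSPENDING"] = new_id
    # setSecret: apply the pending credential to the target service.
    apply_to_service(secret.value("AWSPENDING"))
    # testSecret: verify the pending credential before promoting it.
    if not test_login(secret.value("AWSPENDING")):
        raise RuntimeError("pending secret failed validation; AWSCURRENT unchanged")
    # finishSecret: promote pending to current and keep the old one as AWSPREVIOUS.
    secret.labels["AWSPREVIOUS"] = secret.labels["AWSCURRENT"]
    secret.labels["AWSCURRENT"] = secret.labels.pop("AWSPENDING")

# A dict stands in for the database that holds the real password.
database = {"password": "old"}
secret = FakeSecret("old")
rotate(secret,
       apply_to_service=lambda v: database.update(password=v),
       test_login=lambda v: database["password"] == v)
```

After a successful run the application reading AWSCURRENT gets the new password and AWSPREVIOUS still holds the old one. If the test step fails, AWSCURRENT is not moved, which is the situation a failed-rotation alarm should catch.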
What is AWS CloudFormation, and how does it compare to Terraform?Advanced
Answer
CloudFormation is AWS-native IaC; Terraform is multi-provider IaC with external state. I choose based on ecosystem, governance, provider scope, module maturity, state model, and team standardization.
Technical explanation
Terraform state security and locking are operational responsibilities; CloudFormation stack state lives inside AWS.
Infrastructure as code should use reviewable plans/change sets, reusable modules, policy checks, drift detection, and controlled rollout pipelines.
Architecture reviews should produce prioritized risk remediation with owners and dates, not just high-level best-practice statements.
State, stack outputs, secrets, and deployment permissions must be secured because IaC pipelines often have powerful privileges.
Hands-on example
1. Model the resource or architecture through CloudFormation or Terraform rather than console changes.
2. Review the plan/change set for replacements, deletes, security exposure, and cost-impacting changes.
3. Apply in non-production, run validation tests, then promote through approval to production.
4. Run drift detection or state comparison afterward and remediate manual changes through code.
What is a CloudFormation change set and a drift detection?Advanced
Answer
A CloudFormation change set previews stack changes before execution. Drift detection compares live resource configuration with the stack's expected configuration to find manual or external changes.
Technical explanation
A change set protects before deployment; drift detection finds differences after deployment.
Infrastructure as code should use reviewable plans/change sets, reusable modules, policy checks, drift detection, and controlled rollout pipelines.
Architecture reviews should produce prioritized risk remediation with owners and dates, not just high-level best-practice statements.
State, stack outputs, secrets, and deployment permissions must be secured because IaC pipelines often have powerful privileges.
Hands-on example
1. Model the resource or architecture through CloudFormation or Terraform rather than console changes.
2. Review the plan/change set for replacements, deletes, security exposure, and cost-impacting changes.
3. Apply in non-production, run validation tests, then promote through approval to production.
4. Run drift detection or state comparison afterward and remediate manual changes through code.
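Conceptually, drift detection is a property-by-property comparison between what the template declares and what is live. The sketch below shows that idea on plain dictionaries; it is not how CloudFormation implements the feature, and `detect_drift` and the sample properties are hypothetical.

```python
def detect_drift(expected, actual):
    """Return {property: (expected, actual)} for every property that differs."""
    drift = {}
    for key in expected.keys() | actual.keys():
        if expected.get(key) != actual.get(key):
            drift[key] = (expected.get(key), actual.get(key))
    return drift

# Expected configuration from the stack versus a manually resized instance.
template_props = {"InstanceType": "t3.micro", "Monitoring": True}
live_props = {"InstanceType": "t3.large", "Monitoring": True}
drift = detect_drift(template_props, live_props)
```

In this example only `InstanceType` is reported. The remediation is to change the template and redeploy, or to revert the manual change, so that code remains the source of truth.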
What is the AWS Well-Architected Framework, and what are its pillars?Advanced
Answer
The AWS Well-Architected Framework reviews workloads across Operational Excellence, Security, Reliability, Performance Efficiency, Cost Optimization, and Sustainability. I use it to turn architecture risk into prioritized actions.
Technical explanation
Well-Architected reviews should produce tracked remediation items, not just discussion notes.
Infrastructure as code should use reviewable plans/change sets, reusable modules, policy checks, drift detection, and controlled rollout pipelines.
Architecture reviews should produce prioritized risk remediation with owners and dates, not just high-level best-practice statements.
State, stack outputs, secrets, and deployment permissions must be secured because IaC pipelines often have powerful privileges.
Hands-on example
1. Model the resource or architecture through CloudFormation or Terraform rather than console changes.
2. Review the plan/change set for replacements, deletes, security exposure, and cost-impacting changes.
3. Apply in non-production, run validation tests, then promote through approval to production.
4. Run drift detection or state comparison afterward and remediate manual changes through code.
How would you design a highly available web application across multiple AZs?Advanced
Answer
For a highly available web app, I deploy across multiple AZs with an ALB, private app subnets, autoscaling compute, managed multi-AZ data services, monitoring, backups, and safe deployment/rollback patterns.
Technical explanation
High availability needs multiple AZs, health checks, autoscaling, resilient data services, and tested failure behavior.
Availability design should start from business impact, RTO/RPO, dependency mapping, and failure-mode testing, not only from deploying resources in multiple AZs.
Stateless compute, resilient data stores, health checks, rollback, backups, and game days are all required to prove resilience.
Lower recovery targets cost more and require more automation, replicated data, pre-provisioned capacity, and regularly tested runbooks.
Hands-on example
1. Draw the workload dependency map, then define target RTO/RPO with the business owner.
2. Implement multi-AZ or multi-Region components required by those targets, including data replication and automated provisioning.
3. Run a game day: instance failure, AZ impairment, database failover, restore test, or regional failover depending on scope.
4. Measure actual recovery time/data loss and update the architecture or runbook if targets are missed.
How do you design for disaster recovery - explain RTO and RPO and the DR strategies.Advanced
Answer
DR design starts with RTO and RPO. Backup-restore, pilot light, warm standby, and multi-site are increasing levels of readiness, cost, and complexity for decreasing recovery time and data loss.
Technical explanation
Lower RTO/RPO requires more pre-provisioning, replication, automation, and regular DR exercises.
Availability design should start from business impact, RTO/RPO, dependency mapping, and failure-mode testing, not only from deploying resources in multiple AZs.
Stateless compute, resilient data stores, health checks, rollback, backups, and game days are all required to prove resilience.
Lower recovery targets cost more and require more automation, replicated data, pre-provisioned capacity, and regularly tested runbooks.
Hands-on example
1. Draw the workload dependency map, then define target RTO/RPO with the business owner.
2. Implement multi-AZ or multi-Region components required by those targets, including data replication and automated provisioning.
3. Run a game day: instance failure, AZ impairment, database failover, restore test, or regional failover depending on scope.
4. Measure actual recovery time/data loss and update the architecture or runbook if targets are missed.
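Step 4 can be reduced to simple timestamp arithmetic: achieved RTO is the time from failure to recovery, and achieved RPO is the age of the last replicated data at the moment of failure. The sketch below assumes you record those three timestamps during the game day; the function name and the sample times and targets are invented for illustration.

```python
from datetime import datetime, timedelta

def evaluate_game_day(failure_at, last_replicated_at, recovered_at, rto_target, rpo_target):
    """Compare measured recovery time and data loss against the agreed targets."""
    achieved_rto = recovered_at - failure_at
    achieved_rpo = failure_at - last_replicated_at
    return {
        "rto": achieved_rto,
        "rpo": achieved_rpo,
        "rto_met": achieved_rto <= rto_target,
        "rpo_met": achieved_rpo <= rpo_target,
    }

# Sample exercise: data was 2 minutes behind, recovery took 45 minutes.
result = evaluate_game_day(
    failure_at=datetime(2024, 5, 1, 10, 0),
    last_replicated_at=datetime(2024, 5, 1, 9, 58),
    recovered_at=datetime(2024, 5, 1, 10, 45),
    rto_target=timedelta(minutes=30),
    rpo_target=timedelta(minutes=5),
)
```

In this sample the RPO target is met but the RTO target is missed, which points at recovery automation or pre-provisioned capacity rather than replication.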
Compare backup-and-restore, pilot light, warm standby, and multi-site DR strategies.Advanced
Answer
Backup-and-restore is cheapest and slowest; pilot light keeps core pieces ready; warm standby runs a scaled-down environment; multi-site runs active environments. The right choice depends on business RTO/RPO and cost tolerance.
Technical explanation
The four DR strategies are a cost-versus-readiness spectrum; choose based on business impact.
Availability design should start from business impact, RTO/RPO, dependency mapping, and failure-mode testing, not only from deploying resources in multiple AZs.
Stateless compute, resilient data stores, health checks, rollback, backups, and game days are all required to prove resilience.
Lower recovery targets cost more and require more automation, replicated data, pre-provisioned capacity, and regularly tested runbooks.
Hands-on example
1. Draw the workload dependency map, then define target RTO/RPO with the business owner.
2. Implement multi-AZ or multi-Region components required by those targets, including data replication and automated provisioning.
3. Run a game day: instance failure, AZ impairment, database failover, restore test, or regional failover depending on scope.
4. Measure actual recovery time/data loss and update the architecture or runbook if targets are missed.
How would you secure data in transit and at rest across an AWS workload?Advanced
Answer
I secure data in transit with TLS and data at rest with encryption across S3, EBS, RDS, DynamoDB, EFS, snapshots, backups, and logs. The critical part is pairing encryption with IAM, KMS policy, secrets management, and audit controls.
Technical explanation
Encryption without least-privilege decrypt access is incomplete security.
Enforce TLS at every hop: HTTPS listeners with ACM certificates on CloudFront, ALB, and API Gateway, plus resource policies that deny requests where aws:SecureTransport is false.
Enable encryption at rest by default for S3, EBS, RDS, DynamoDB, EFS, snapshots, backups, and logs, using customer-managed KMS keys where you need control over the key policy.
Use CloudTrail to audit key usage and AWS Config rules to detect unencrypted resources.
Hands-on example
1. Inventory data stores and data flows, and classify which ones carry sensitive data.
2. Enable default encryption with a test KMS key on a sandbox S3 bucket and EBS volume, and require TLS on the endpoints in front of them.
3. Test that a plaintext HTTP request and a principal without kms:Decrypt are both denied.
4. Add Config rules or alarms for unencrypted resources and denied decrypts.
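One concrete in-transit control for the answer above is an S3 bucket policy that rejects any request not made over TLS, using the `aws:SecureTransport` condition key. The sketch below builds that policy document; the helper name and bucket name are placeholders.

```python
import json

def deny_insecure_transport(bucket_name):
    """Return a bucket policy that denies all S3 actions over plain HTTP."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                # Cover both the bucket itself and every object in it.
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            }
        ],
    }

# Placeholder bucket name for illustration only.
tls_policy = deny_insecure_transport("example-sandbox-bucket")
tls_policy_json = json.dumps(tls_policy, indent=2)
```

Because this is an explicit deny, it applies even to principals whose identity policies allow the action. It complements, rather than replaces, encryption at rest and least-privilege decrypt access.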
How do you detect and reduce unexpected AWS cost increases?Advanced
Answer
For unexpected AWS cost spikes, I detect with Budgets, Cost Anomaly Detection, Cost Explorer, CUR, and tags, then identify the driver and remediate with rightsizing, lifecycle policies, scheduling, architecture changes, or commitments.
Technical explanation
Cost spikes should be triaged like incidents: scope, driver, stop bleeding, root cause, prevention.
Cost analysis should be based on tagged usage, CUR/Cost Explorer data, service-level owners, and usage-type drivers rather than account-level totals only.
Every cost reduction should be checked against reliability, performance, security, and operational risk.
Use budgets and anomaly detection for early signal, then use rightsizing, lifecycle, commitments, scheduling, and architecture fixes for remediation.
Hands-on example
1. Use Cost Explorer or CUR/Athena to identify the top cost driver by account, service, tag, Region, usage type, and daily delta.
2. Validate the operational cause with service metrics such as utilization, log volume, NAT bytes, snapshot growth, or data transfer.
3. Apply a targeted fix - rightsizing, lifecycle, retention, endpoint, schedule, or commitment - with a rollback plan.
4. Track savings, performance, and reliability for at least one billing cycle.
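Step 1 amounts to grouping cost by service and usage type and ranking the day-over-day change. The sketch below does that on in-memory rows, assuming you have already exported them from Cost Explorer or CUR; the row layout, function name, and sample figures are made up for illustration.

```python
from collections import defaultdict

def top_cost_drivers(rows, baseline_day, spike_day, top_n=3):
    """rows: iterable of (day, service, usage_type, cost). Returns the largest increases."""
    delta = defaultdict(float)
    for day, service, usage_type, cost in rows:
        if day == spike_day:
            delta[(service, usage_type)] += cost
        elif day == baseline_day:
            delta[(service, usage_type)] -= cost
    # Largest positive change first.
    return sorted(delta.items(), key=lambda item: item[1], reverse=True)[:top_n]

# Invented sample data: NAT traffic jumps while storage barely moves.
sample_rows = [
    ("2024-05-01", "EC2-Other", "NatGateway-Bytes", 10.0),
    ("2024-05-02", "EC2-Other", "NatGateway-Bytes", 95.0),
    ("2024-05-01", "S3", "TimedStorage-ByteHrs", 20.0),
    ("2024-05-02", "S3", "TimedStorage-ByteHrs", 21.0),
]
drivers = top_cost_drivers(sample_rows, "2024-05-01", "2024-05-02")
```

The top entry here is the NAT usage type, which would then be validated against NAT gateway byte metrics before choosing a fix such as a VPC endpoint.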
What AWS tools help with cost visibility (Cost Explorer, Budgets, CUR)?Advanced
Answer
Cost visibility tools include Cost Explorer, Budgets, Cost and Usage Report, Cost Anomaly Detection, Compute Optimizer, Trusted Advisor, and billing dashboards. CUR plus tags is the most powerful for deep allocation.
Technical explanation
CUR with resource IDs and tags is the most flexible base for chargeback, showback, and FinOps analytics.
Cost analysis should be based on tagged usage, CUR/Cost Explorer data, service-level owners, and usage-type drivers rather than account-level totals only.
Every cost reduction should be checked against reliability, performance, security, and operational risk.
Use budgets and anomaly detection for early signal, then use rightsizing, lifecycle, commitments, scheduling, and architecture fixes for remediation.
Hands-on example
1. Use Cost Explorer or CUR/Athena to identify the top cost driver by account, service, tag, Region, usage type, and daily delta.
2. Validate the operational cause with service metrics such as utilization, log volume, NAT bytes, snapshot growth, or data transfer.
3. Apply a targeted fix - rightsizing, lifecycle, retention, endpoint, schedule, or commitment - with a rollback plan.
4. Track savings, performance, and reliability for at least one billing cycle.
How would you right-size EC2 and RDS instances to cut spend without hurting reliability?Advanced
Answer
Rightsizing EC2 and RDS means comparing actual CPU, memory, I/O, network, latency, connections, and peak patterns against capacity. I change gradually, test performance, and preserve failover and reliability headroom.
Technical explanation
Average CPU is not enough for rightsizing; p95/p99, memory, I/O, failover headroom, and business peaks matter.
Cost analysis should be based on tagged usage, CUR/Cost Explorer data, service-level owners, and usage-type drivers rather than account-level totals only.
Every cost reduction should be checked against reliability, performance, security, and operational risk.
Use budgets and anomaly detection for early signal, then use rightsizing, lifecycle, commitments, scheduling, and architecture fixes for remediation.
Hands-on example
1. Use Cost Explorer or CUR/Athena to identify the top cost driver by account, service, tag, Region, usage type, and daily delta.
2. Validate the operational cause with service metrics such as utilization, log volume, NAT bytes, snapshot growth, or data transfer.
3. Apply a targeted fix - rightsizing, lifecycle, retention, endpoint, schedule, or commitment - with a rollback plan.
4. Track savings, performance, and reliability for at least one billing cycle.
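The point about averages hiding peaks is easy to show with a small calculation. The sketch below compares mean CPU with a nearest-rank p95 and applies a rough rule that halving instance size doubles utilization. That rule and the 80% ceiling are simplifying assumptions, and the function names and samples are hypothetical.

```python
import math
from statistics import mean

def percentile(samples, pct):
    """Nearest-rank percentile of a list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def safe_to_halve(cpu_samples, ceiling=80.0):
    """Assume utilization doubles on an instance half the size; check p95 stays under the ceiling."""
    return percentile(cpu_samples, 95) * 2 <= ceiling

# Invented samples: mostly idle, but busy 10% of the time.
cpu = [10.0] * 90 + [70.0] * 10
average = mean(cpu)
p95 = percentile(cpu, 95)
downsize_ok = safe_to_halve(cpu)
```

The average looks like an obvious downsizing candidate, but the p95 shows the instance is heavily used during peaks, so the check rejects halving it. Memory, I/O, and failover headroom need the same treatment before any change.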
Explain how you would migrate a workload from on-prem to AWS.Advanced
Answer
For on-prem to AWS migration, I discover dependencies, classify workloads by migration strategy, build a landing zone, design connectivity/security/observability, migrate in waves, validate data, cut over safely, and optimize afterward.
Technical explanation
Dependency mapping is usually the hardest part of migration because hidden flows break cutovers.
Migration planning requires application discovery, dependency mapping, network/security foundations, data movement strategy, cutover plan, rollback plan, and post-migration optimization.
Choose rehost, replatform, refactor, repurchase, retain, or retire per workload based on risk and business value.
Pilot with a low-risk workload before migrating critical systems, and validate performance, data integrity, monitoring, and operations.
Hands-on example
1. Inventory applications, dependencies, data stores, network flows, identities, and compliance constraints.
2. Create the landing zone, connectivity, security baseline, monitoring, and backup patterns before migrating production.
3. Migrate a pilot workload, validate data and performance, then cut over with DNS TTL reduced and rollback documented.
4. After cutover, right-size and modernize instead of preserving all on-prem assumptions.
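Once dependencies are inventoried, they can be turned into candidate migration waves with a topological sort. The sketch below uses Python's standard `graphlib` and assumes a dependencies-first ordering, which is only one heuristic; real wave planning also weighs risk and business value. The application names are invented.

```python
from graphlib import TopologicalSorter

def migration_waves(dependencies):
    """dependencies maps each app to the set of apps it depends on.

    Returns a list of waves; every app in a wave depends only on apps in earlier waves.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if dependencies are circular
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves

# Invented dependency map for illustration.
waves = migration_waves({
    "web": {"api"},
    "api": {"db", "auth"},
    "reports": {"db"},
    "db": set(),
    "auth": set(),
})
```

A circular dependency surfaces as an error at planning time, which is far cheaper than discovering the hidden flow during cutover.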
What is the difference between SQS and SNS, and when do you use each?Advanced
Answer
SQS is a queue for decoupled work processing; SNS is pub/sub fan-out notification. They are often combined by publishing to SNS and subscribing multiple SQS queues so each consumer can process independently.
Technical explanation
SNS plus SQS creates durable fan-out with independent retry and DLQ behavior for each consumer.
Messaging services decouple producers and consumers, but the reliability model depends on ordering, retries, DLQs, idempotency, batching, and backpressure.
At-least-once delivery means consumers must safely handle duplicate messages unless the business can tolerate duplicates.
Monitor queue depth, age of oldest message, DLQ count, consumer errors, throttles, and downstream saturation.
Hands-on example
1. Create a small producer and consumer using the relevant service - SQS, SNS, or EventBridge.
2. Add retry, DLQ, visibility timeout or event rule configuration, and idempotency key handling.
3. Inject duplicate messages, consumer failure, and downstream throttling to observe retry and backpressure behavior.
4. Alarm on age, DLQ count, failed deliveries, throttles, and consumer error rate.
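Steps 2 and 3 can be prototyped in memory before touching a real queue. The sketch below shows an idempotent consumer that skips duplicates and dead-letters a message after repeated failures. With SQS the move to a DLQ is done by the queue's redrive policy rather than by the consumer, so the class, its names, and the `idempotency_key` field are a simplified, hypothetical model.

```python
class IdempotentConsumer:
    """Processes each idempotency key at most once and parks poison messages."""
    def __init__(self, handler, max_receives=3):
        self.handler = handler
        self.max_receives = max_receives
        self.processed = set()
        self.dead_letters = []

    def receive(self, message, receive_count=1):
        key = message["idempotency_key"]
        if key in self.processed:
            return "duplicate-skipped"
        try:
            self.handler(message)
        except Exception:
            if receive_count >= self.max_receives:
                self.dead_letters.append(message)
                return "dead-lettered"
            return "retry"
        # Record the key only after the handler succeeds.
        self.processed.add(key)
        return "processed"

charges = []
consumer = IdempotentConsumer(lambda m: charges.append(m["amount"]))
order = {"idempotency_key": "order-1", "amount": 25}
first = consumer.receive(order)
second = consumer.receive(order)  # simulated at-least-once redelivery
```

The second delivery is skipped, so the customer is charged once. In production the processed-key set would live in a durable store so that it survives consumer restarts.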
What is the difference between an SQS standard and FIFO queue?Advanced
Answer
SQS Standard offers high throughput with at-least-once delivery and best-effort ordering. FIFO preserves ordering within message groups and supports deduplication, but requires careful design for throughput and parallelism.
Technical explanation
FIFO ordering is per message group, so message group design determines parallelism.
Messaging services decouple producers and consumers, but the reliability model depends on ordering, retries, DLQs, idempotency, batching, and backpressure.
At-least-once delivery means consumers must safely handle duplicate messages unless the business can tolerate duplicates.
Monitor queue depth, age of oldest message, DLQ count, consumer errors, throttles, and downstream saturation.
Hands-on example
1. Create a small producer and consumer using the relevant service - SQS, SNS, or EventBridge.
2. Add retry, DLQ, visibility timeout or event rule configuration, and idempotency key handling.
3. Inject duplicate messages, consumer failure, and downstream throttling to observe retry and backpressure behavior.
4. Alarm on age, DLQ count, failed deliveries, throttles, and consumer error rate.
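To see why message group design determines parallelism, the sketch below splits a stream into per-group queues. Order is preserved inside each group, and the number of groups is the upper bound on how many consumers can work at once. This is an in-memory illustration with hypothetical names, not the SQS API.

```python
from collections import deque

def partition_by_group(messages):
    """Split (group_id, body) pairs into per-group queues, preserving arrival order."""
    groups = {}
    for group_id, body in messages:
        groups.setdefault(group_id, deque()).append(body)
    return groups

# Invented stream: events for two customers interleaved.
stream = [
    ("customer-a", "created"),
    ("customer-b", "created"),
    ("customer-a", "paid"),
    ("customer-b", "cancelled"),
    ("customer-a", "shipped"),
]
groups = partition_by_group(stream)
max_parallel_consumers = len(groups)
```

Using one group ID for everything would give strict global order but no parallelism, while one group per customer keeps each customer's events ordered and lets customers be processed concurrently.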
What is EventBridge, and how does it differ from SNS?Advanced
Answer
EventBridge is an event bus with filtering, routing, schedules, SaaS integrations, archives, and replay. SNS is simpler pub/sub fan-out; I use EventBridge for event-driven architecture routing and SNS for straightforward notification fan-out.
Technical explanation
EventBridge is better for event routing and governance; SNS is simpler for pub/sub fan-out.
Messaging services decouple producers and consumers, but the reliability model depends on ordering, retries, DLQs, idempotency, batching, and backpressure.
At-least-once delivery means consumers must safely handle duplicate messages unless the business can tolerate duplicates.
Monitor queue depth, age of oldest message, DLQ count, consumer errors, throttles, and downstream saturation.
Hands-on example
1. Create a small producer and consumer using the relevant service - SQS, SNS, or EventBridge.
2. Add retry, DLQ, visibility timeout or event rule configuration, and idempotency key handling.
3. Inject duplicate messages, consumer failure, and downstream throttling to observe retry and backpressure behavior.
4. Alarm on age, DLQ count, failed deliveries, throttles, and consumer error rate.
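Content-based filtering is the main thing EventBridge adds over plain fan-out. The sketch below is a deliberately simplified matcher that supports only exact-value lists and nested objects; real EventBridge patterns support more operators, and `matches` is a hypothetical helper, not part of any AWS SDK.

```python
def matches(pattern, event):
    """Return True if every field in the pattern matches the event.

    Leaf values in the pattern are lists of allowed values; nested dicts recurse.
    """
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True

# Route only EC2 instances that stopped or terminated.
pattern = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
    "detail": {"state": ["stopped", "terminated"]},
}
stopped_event = {
    "source": "aws.ec2",
    "detail-type": "EC2 Instance State-change Notification",
    "detail": {"state": "stopped", "instance-id": "i-0123456789abcdef0"},
}
running_event = {**stopped_event, "detail": {"state": "running"}}
```

The stopped event matches and the running event does not. With SNS, comparable filtering is done through subscription filter policies, which is part of why EventBridge is usually chosen when routing rules are central to the design.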
What is AWS Systems Manager, and how is Session Manager safer than SSH bastions?Advanced
Answer
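Systems Manager provides operational capabilities like Session Manager, Run Command, Patch Manager, Parameter Store, Automation, and Inventory. Session Manager is safer than SSH bastions because it needs no inbound SSH port or bastion host, authorizes access through IAM instead of shared SSH keys, and can log session activity to CloudTrail, CloudWatch Logs, or S3.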
Technical explanation
Session Manager needs SSM Agent, IAM permissions, and network access to SSM endpoints or the internet.
Operations at scale should prefer managed access, automation, immutable infrastructure, repeatable runbooks, and auditability over manual host-by-host changes.
Troubleshooting should isolate layers: identity, network, host, application, dependency, deployment, and AWS service signals.
Patch, access, AMI, and incident workflows must be tested and measurable so they do not depend on tribal knowledge.
Hands-on example
1. Set up a sandbox EC2 fleet with SSM Agent, IAM instance role, CloudWatch Agent, hardened AMI baseline, and no unnecessary inbound access.
2. Perform the operation through automation: Session Manager, Run Command, Patch Manager, Image Builder, ASG instance refresh, or a runbook.
3. Introduce a realistic failure and use logs, metrics, status checks, and reachability tools to troubleshoot layer by layer.
4. Update the runbook and define the alarm or compliance check that would catch the issue next time.
How would you patch a fleet of EC2 instances at scale?Advanced
Answer
To patch EC2 fleets, I use SSM Patch Manager or immutable AMI replacement. I test patches, roll out by waves, use maintenance windows, track compliance, and prefer golden images for autoscaled stateless fleets.
Technical explanation
Immutable patching with new AMIs reduces drift for autoscaled fleets.
Operations at scale should prefer managed access, automation, immutable infrastructure, repeatable runbooks, and auditability over manual host-by-host changes.
Troubleshooting should isolate layers: identity, network, host, application, dependency, deployment, and AWS service signals.
Patch, access, AMI, and incident workflows must be tested and measurable so they do not depend on tribal knowledge.
Hands-on example
1. Set up a sandbox EC2 fleet with SSM Agent, IAM instance role, CloudWatch Agent, hardened AMI baseline, and no unnecessary inbound access.
2. Perform the operation through automation: Session Manager, Run Command, Patch Manager, Image Builder, ASG instance refresh, or a runbook.
3. Introduce a realistic failure and use logs, metrics, status checks, and reachability tools to troubleshoot layer by layer.
4. Update the runbook and define the alarm or compliance check that would catch the issue next time.
What is an AMI, and how do you build standardised, hardened images?Advanced
Answer
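An AMI (Amazon Machine Image) is the template EC2 uses to launch instances: it defines the root volume contents (operating system and installed software), launch permissions, and block device mappings. It is distinct from a launch template, which references an AMI along with instance settings. I build hardened AMIs with Image Builder or Packer, applying patches, CIS controls, agents, IMDSv2, vulnerability scans, no embedded secrets, and versioned promotion.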
Technical explanation
Never bake secrets into AMIs; use roles and secret stores at runtime.
Operations at scale should prefer managed access, automation, immutable infrastructure, repeatable runbooks, and auditability over manual host-by-host changes.
Troubleshooting should isolate layers: identity, network, host, application, dependency, deployment, and AWS service signals.
Patch, access, AMI, and incident workflows must be tested and measurable so they do not depend on tribal knowledge.
Hands-on example
1. Set up a sandbox EC2 fleet with SSM Agent, IAM instance role, CloudWatch Agent, hardened AMI baseline, and no unnecessary inbound access.
2. Perform the operation through automation: Session Manager, Run Command, Patch Manager, Image Builder, ASG instance refresh, or a runbook.
3. Introduce a realistic failure and use logs, metrics, status checks, and reachability tools to troubleshoot layer by layer.
4. Update the runbook and define the alarm or compliance check that would catch the issue next time.
How do you troubleshoot an EC2 instance that is unreachable over SSH?Advanced
Answer
For unreachable SSH, I check instance state, status checks, IP path, security groups, NACLs, routes, public/VPN path, key/user, sshd, host firewall, disk, CPU, and logs. Session Manager is often the fastest recovery path.
Technical explanation
Separate network reachability, authentication, and host-health checks so you do not chase the wrong layer.
Operations at scale should prefer managed access, automation, immutable infrastructure, repeatable runbooks, and auditability over manual host-by-host changes.
Troubleshooting should isolate layers: identity, network, host, application, dependency, deployment, and AWS service signals.
Patch, access, AMI, and incident workflows must be tested and measurable so they do not depend on tribal knowledge.
Hands-on example
1. Set up a sandbox EC2 fleet with SSM Agent, IAM instance role, CloudWatch Agent, hardened AMI baseline, and no unnecessary inbound access.
2. Perform the operation through automation: Session Manager, Run Command, Patch Manager, Image Builder, ASG instance refresh, or a runbook.
3. Introduce a realistic failure and use logs, metrics, status checks, and reachability tools to troubleshoot layer by layer.
4. Update the runbook and define the alarm or compliance check that would catch the issue next time.
How do you debug intermittent 5xx errors behind an ALB?Advanced
Answer
For intermittent 5xx errors behind an ALB, I separate ELB-generated errors from target errors, inspect ALB metrics/access logs, correlate by target/path/AZ/deployment, then check health checks, timeouts, app logs, dependencies, and scaling events.
Technical explanation
Compare HTTPCode_ELB_5XX_Count with HTTPCode_Target_5XX_Count to determine whether the ALB or the backend generated the errors.
Operations at scale should prefer managed access, automation, immutable infrastructure, repeatable runbooks, and auditability over manual host-by-host changes.
Troubleshooting should isolate layers: identity, network, host, application, dependency, deployment, and AWS service signals.
Patch, access, AMI, and incident workflows must be tested and measurable so they do not depend on tribal knowledge.
Hands-on example
1. Set up a sandbox EC2 fleet with SSM Agent, IAM instance role, CloudWatch Agent, hardened AMI baseline, and no unnecessary inbound access.
2. Perform the operation through automation: Session Manager, Run Command, Patch Manager, Image Builder, ASG instance refresh, or a runbook.
3. Introduce a realistic failure and use logs, metrics, status checks, and reachability tools to troubleshoot layer by layer.
4. Update the runbook and define the alarm or compliance check that would catch the issue next time.
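The same ALB-versus-target split can be made per request from access logs, which record both an `elb_status_code` and a `target_status_code`. The sketch below assumes the log lines have already been parsed into dictionaries with those two fields; the function name and sample records are invented.

```python
from collections import Counter

def classify_5xx(records):
    """Count 5xx responses by who generated them: the target or the load balancer."""
    counts = Counter()
    for record in records:
        elb_code = record["elb_status_code"]
        target_code = record["target_status_code"]
        if not elb_code.startswith("5"):
            continue
        # A target status of "-" means the target never returned a response.
        origin = "target" if target_code.startswith("5") else "elb"
        counts[(origin, elb_code)] += 1
    return counts

# Invented sample of parsed access log entries.
parsed = [
    {"elb_status_code": "200", "target_status_code": "200"},
    {"elb_status_code": "500", "target_status_code": "500"},
    {"elb_status_code": "502", "target_status_code": "-"},
    {"elb_status_code": "504", "target_status_code": "-"},
    {"elb_status_code": "504", "target_status_code": "-"},
]
breakdown = classify_5xx(parsed)
```

Target-generated 500s point at application code or dependencies, while load balancer 502s and 504s usually point at connection handling, keep-alive and idle timeout mismatches, or slow targets.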
How would you architect a multi-account AWS strategy with Organizations and landing zones?Advanced
Answer
A multi-account strategy uses Organizations and OUs to isolate workloads, environments, security, networking, logs, and shared services. It improves blast-radius control, governance, billing visibility, and account-level ownership.
Technical explanation
Separate accounts reduce blast radius and make ownership, billing, quotas, and security boundaries clearer.
Multi-account governance should combine preventive controls such as SCPs with detective controls such as Config, GuardDuty, Inspector, Security Hub, CloudTrail, and Access Analyzer.
Central security, logging, and networking accounts reduce blast radius and protect evidence from workload account compromise.
Every control needs an owner, exception process, alert route, and remediation workflow or it becomes shelfware.
Hands-on example
1. Create a multi-account sandbox or use separate dev/security/logging accounts to test the control pattern.
2. Enable the relevant organization-level service or guardrail, then generate a controlled finding or denied action.
3. Route findings to Security Hub, EventBridge, ticketing, SIEM, or an incident channel with ownership metadata.
4. Document the exception process, remediation automation, and evidence required for audit.
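For step 2, a small Service Control Policy makes a good first guardrail because its effect is easy to test with a denied action. The sketch below builds an SCP that blocks member accounts from leaving the organization and from stopping or deleting CloudTrail trails. The helper name is hypothetical and the statements are a starting point, not a complete baseline.

```python
import json

def guardrail_scp():
    """Return an SCP document with two preventive deny statements."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyLeaveOrganization",
                "Effect": "Deny",
                "Action": "organizations:LeaveOrganization",
                "Resource": "*",
            },
            {
                "Sid": "ProtectCloudTrail",
                "Effect": "Deny",
                "Action": ["cloudtrail:StopLogging", "cloudtrail:DeleteTrail"],
                "Resource": "*",
            },
        ],
    }

scp = guardrail_scp()
scp_json = json.dumps(scp, indent=2)
```

SCPs only set the maximum permissions for an account and grant nothing themselves, so attach this to a sandbox OU first and confirm that an administrator in a member account receives an access-denied error for the blocked actions.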
What is AWS WAF, and what kinds of attacks does it mitigate?Advanced
Answer
AWS WAF protects HTTP applications behind CloudFront, ALB, API Gateway, or AppSync from application-layer attacks like SQL injection, XSS, bad bots, abusive IPs, and request floods using managed and custom rules.
Technical explanation
Start WAF managed rules in count mode to tune false positives before blocking production traffic.
Multi-account governance should combine preventive controls such as SCPs with detective controls such as Config, GuardDuty, Inspector, Security Hub, CloudTrail, and Access Analyzer.
Central security, logging, and networking accounts reduce blast radius and protect evidence from workload account compromise.
Every control needs an owner, exception process, alert route, and remediation workflow or it becomes shelfware.
Hands-on example
1. Create a multi-account sandbox or use separate dev/security/logging accounts to test the control pattern.
2. Enable the relevant organization-level service or guardrail, then generate a controlled finding or denied action.
3. Route findings to Security Hub, EventBridge, ticketing, SIEM, or an incident channel with ownership metadata.
4. Document the exception process, remediation automation, and evidence required for audit.
What is GuardDuty, and what does it detect?Advanced
Answer
GuardDuty is managed threat detection using AWS telemetry such as CloudTrail events, VPC Flow Logs, DNS query logs, and supported workload signals. It detects suspicious behavior like credential misuse, reconnaissance, malicious IP communication, and workload threats.
Technical explanation
GuardDuty findings need routing, ownership, and response automation; detection alone is not incident response.
Multi-account governance should combine preventive controls such as SCPs with detective controls such as Config, GuardDuty, Inspector, Security Hub, CloudTrail, and Access Analyzer.
Central security, logging, and networking accounts reduce blast radius and protect evidence from workload account compromise.
Every control needs an owner, exception process, alert route, and remediation workflow or it becomes shelfware.
Hands-on example
1. Create a multi-account sandbox or use separate dev/security/logging accounts to test the control pattern.
2. Enable the relevant organization-level service or guardrail, then generate a controlled finding or denied action.
3. Route findings to Security Hub, EventBridge, ticketing, SIEM, or an incident channel with ownership metadata.
4. Document the exception process, remediation automation, and evidence required for audit.
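Routing findings usually starts with an EventBridge rule. The sketch below shows an event pattern that selects high-severity GuardDuty findings, together with a local function that mirrors it for unit tests. The threshold of 7.0 follows GuardDuty's High severity band; `matches_pattern` and the sample events are illustrative assumptions, not part of any AWS SDK.

```python
HIGH_SEVERITY = 7.0  # GuardDuty rates findings at 7.0 and above as High

# Event pattern for an EventBridge rule that forwards high-severity findings.
EVENT_PATTERN = {
    "source": ["aws.guardduty"],
    "detail-type": ["GuardDuty Finding"],
    "detail": {"severity": [{"numeric": [">=", HIGH_SEVERITY]}]},
}


def matches_pattern(event: dict) -> bool:
    """Local mirror of EVENT_PATTERN, useful for testing routing logic offline."""
    return (
        event.get("source") == "aws.guardduty"
        and event.get("detail-type") == "GuardDuty Finding"
        and float(event.get("detail", {}).get("severity", 0)) >= HIGH_SEVERITY
    )


high = {
    "source": "aws.guardduty",
    "detail-type": "GuardDuty Finding",
    "detail": {"severity": 8, "type": "UnauthorizedAccess:IAMUser/InstanceCredentialExfiltration.OutsideAWS"},
}
low = {
    "source": "aws.guardduty",
    "detail-type": "GuardDuty Finding",
    "detail": {"severity": 2, "type": "Recon:EC2/PortProbeUnprotectedPort"},
}
print(matches_pattern(high), matches_pattern(low))
```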
What is the difference between GuardDuty, Inspector, and Security Hub? (Advanced)
Answer
GuardDuty detects suspicious activity, Inspector finds vulnerabilities in workloads and images, and Security Hub aggregates findings and posture checks. Together they support detection, vulnerability management, and centralized prioritization.
Technical explanation
Security Hub is a findings aggregator and posture tool; it does not replace GuardDuty or Inspector.
Multi-account governance should combine preventive controls such as SCPs with detective controls such as Config, GuardDuty, Inspector, Security Hub, CloudTrail, and Access Analyzer.
Central security, logging, and networking accounts reduce blast radius and protect evidence from workload account compromise.
Every control needs an owner, exception process, alert route, and remediation workflow or it becomes shelfware.
Hands-on example
1. Create a multi-account sandbox or use separate dev/security/logging accounts to test the control pattern.
2. Enable the relevant organization-level service or guardrail, then generate a controlled finding or denied action.
3. Route findings to Security Hub, EventBridge, ticketing, SIEM, or an incident channel with ownership metadata.
4. Document the exception process, remediation automation, and evidence required for audit.
How do you enforce tagging and governance across many AWS accounts? (Advanced)
Answer
For tagging and governance, I combine tag policies, SCPs, IAM conditions, Config rules, CI/CD checks, cost allocation tags, and remediation workflows. Required tags should be small, consistent, and tied to ownership and billing.
Technical explanation
Governance fails when tags are optional and unused; link them to billing, ownership, access, and automation.
Multi-account governance should combine preventive controls such as SCPs with detective controls such as Config, GuardDuty, Inspector, Security Hub, CloudTrail, and Access Analyzer.
Central security, logging, and networking accounts reduce blast radius and protect evidence from workload account compromise.
Every control needs an owner, exception process, alert route, and remediation workflow or it becomes shelfware.
Hands-on example
1. Create a multi-account sandbox or use separate dev/security/logging accounts to test the control pattern.
2. Enable the relevant organization-level service or guardrail, then generate a controlled finding or denied action.
3. Route findings to Security Hub, EventBridge, ticketing, SIEM, or an incident channel with ownership metadata.
4. Document the exception process, remediation automation, and evidence required for audit.
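Two of the controls named in the answer, a CI/CD check and a preventive deny on untagged launches, can be sketched together. This is a minimal sketch: the tag keys in `REQUIRED_TAGS` and both helper names are hypothetical, and the deny statements would need to be placed in an SCP or IAM policy and tested before use.

```python
REQUIRED_TAGS = ("Owner", "CostCenter", "Environment")  # hypothetical required set


def missing_tags(resource_tags: dict, required=REQUIRED_TAGS) -> list:
    """CI/CD-style check: return required tag keys that are absent or blank."""
    return [key for key in required if not (resource_tags.get(key) or "").strip()]


def deny_untagged_launch_statements(required=REQUIRED_TAGS) -> list:
    """One Deny statement per tag key, blocking ec2:RunInstances when it is absent.

    Separate statements are used because keys inside one condition block are
    ANDed, so a single statement would only deny when every tag is missing.
    """
    return [
        {
            "Sid": f"DenyRunInstancesWithout{key}",
            "Effect": "Deny",
            "Action": "ec2:RunInstances",
            "Resource": "arn:aws:ec2:*:*:instance/*",
            "Condition": {"Null": {f"aws:RequestTag/{key}": "true"}},
        }
        for key in required
    ]


gaps = missing_tags({"Owner": "team-a", "Environment": "dev"})
statements = deny_untagged_launch_statements()
print(gaps, len(statements))
```

Running the same `REQUIRED_TAGS` list through both the pipeline check and the policy generator keeps the preventive and detective controls from drifting apart.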
What is the difference between a service quota and a rate limit, and how do you handle throttling? (Advanced)
Answer
A service quota limits resource quantities or capacity; a rate limit throttles request speed. I manage quotas proactively and handle throttling with backoff, jitter, batching, caching, and client-side rate control.
Technical explanation
Backoff without jitter can create synchronized retry storms during throttling events.
Service quotas apply per account and usually per Region; many are adjustable through Service Quotas, while others are fixed limits that must be designed around.
API rate limits are enforced per request and surface as errors such as ThrottlingException, RequestLimitExceeded, or HTTP 429, so clients must treat them as retryable.
The AWS SDKs already retry throttled calls with backoff; tune the retry mode and maximum attempts rather than layering uncoordinated retries on top.
Track quota utilization with CloudWatch usage metrics and alarms so that increases are requested before a launch or scaling event hits the limit.
Hands-on example
1. List the quotas that matter for the workload in Service Quotas and note which are adjustable and which are per Region.
2. Create CloudWatch alarms on quota utilization and request increases ahead of expected growth.
3. Load-test a non-production client until the API throttles, and confirm the errors are retried with exponential backoff and jitter.
4. Reduce call volume with batching, caching, and client-side rate limiting, then re-run the test and compare throttle counts.
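Backoff with jitter is easier to reason about in code. The sketch below uses the "full jitter" variant, where each delay is drawn uniformly between zero and an exponentially growing cap. `ThrottledError` is a hypothetical stand-in for a real throttling exception, and the base, cap, and attempt count are illustrative defaults rather than recommended values.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for a throttling error such as ThrottlingException."""


def full_jitter_delay(attempt: int, base: float = 0.1, cap: float = 5.0, rng=random.random) -> float:
    """Delay before retry `attempt` (0-based): random in [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap, base * (2 ** attempt))


def call_with_backoff(operation, max_attempts: int = 5, sleep=time.sleep, rng=random.random):
    """Call `operation`, retrying on ThrottledError with full-jitter backoff."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted; surface the error to the caller
            sleep(full_jitter_delay(attempt, rng=rng))


# Usage: an operation that is throttled twice and then succeeds.
calls = {"count": 0}
recorded_sleeps = []


def flaky_operation():
    calls["count"] += 1
    if calls["count"] <= 2:
        raise ThrottledError()
    return "ok"


result = call_with_backoff(flaky_operation, sleep=recorded_sleeps.append, rng=lambda: 0.5)
print(result, recorded_sleeps)
```

Injecting `sleep` and `rng` keeps the retry logic testable without real waiting; in production the defaults apply. The randomness is what prevents many clients from retrying at the same instant.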
How would you set up centralised logging across multiple AWS accounts? (Advanced)
Answer
Centralized logging uses dedicated log archive/security accounts, organization trails, Config aggregation, VPC/ALB/WAF/app log delivery, encryption, retention, restricted access, and searchable storage through Athena, SIEM, or log analytics.
Technical explanation
High-volume logs need retention, filtering, partitioning, and lifecycle design to avoid runaway cost.
Multi-account governance should combine preventive controls such as SCPs with detective controls such as Config, GuardDuty, Inspector, Security Hub, CloudTrail, and Access Analyzer.
Central security, logging, and networking accounts reduce blast radius and protect evidence from workload account compromise.
Every control needs an owner, exception process, alert route, and remediation workflow or it becomes shelfware.
Hands-on example
1. Create a multi-account sandbox or use separate dev/security/logging accounts to test the control pattern.
2. Enable the relevant organization-level service or guardrail, then generate a controlled finding or denied action.
3. Route findings to Security Hub, EventBridge, ticketing, SIEM, or an incident channel with ownership metadata.
4. Document the exception process, remediation automation, and evidence required for audit.
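Retention and lifecycle design for the log archive bucket can be expressed as data. The sketch below builds a rule in the shape S3 accepts for a bucket lifecycle configuration; the helper name `log_lifecycle_rule`, the `AWSLogs/` prefix, and the 90-day and 2555-day values are assumptions chosen for illustration, not recommended retention periods.

```python
def log_lifecycle_rule(prefix: str, archive_after_days: int, expire_after_days: int) -> dict:
    """Build one S3 lifecycle rule: move logs to Glacier, then expire them."""
    if not 0 < archive_after_days < expire_after_days:
        raise ValueError("archive_after_days must be positive and less than expire_after_days")
    rule_suffix = prefix.strip("/").replace("/", "-") or "all"
    return {
        "ID": f"retain-{rule_suffix}",
        "Status": "Enabled",
        "Filter": {"Prefix": prefix},
        # Older logs are rarely queried, so cheaper storage is usually acceptable.
        "Transitions": [{"Days": archive_after_days, "StorageClass": "GLACIER"}],
        "Expiration": {"Days": expire_after_days},
    }


lifecycle = {"Rules": [log_lifecycle_rule("AWSLogs/", 90, 2555)]}
print(lifecycle["Rules"][0]["ID"])
```

Validating the day values in code catches the common mistake of an expiry that fires before the archive transition.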
What is cross-account access, and how do you implement it securely with roles? (Advanced)
Answer
Cross-account access is usually implemented with a role in the target account, a trust policy for the source principal, least-privilege permissions, and STS AssumeRole temporary credentials. Conditions, MFA, external IDs, and CloudTrail improve security.
Technical explanation
A cross-account role has two sides: trust policy for who can assume it and permission policy for what it can do.
The caller also needs an identity-based policy in the source account that allows sts:AssumeRole on the target role ARN; both sides must allow the call.
Use an external ID in the trust policy when a third party assumes the role, which mitigates the confused deputy problem.
AssumeRole returns temporary credentials that expire, and each assumption is recorded in CloudTrail.
Hands-on example
1. Create a role in the target account with a trust policy that names the specific source role or account, not a wildcard principal.
2. Attach a least-privilege permission policy to the role, and add conditions such as an external ID, MFA, or aws:PrincipalOrgID where they apply.
3. Allow sts:AssumeRole on that role ARN for the source principal, then assume the role and confirm that only the intended actions succeed.
4. Review the AssumeRole events in CloudTrail and alert on unexpected callers or session names.
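The trust-policy side of a cross-account role can be sketched as a small builder. This is an illustration under stated assumptions: the function name `cross_account_trust_policy`, the account ID `111111111111`, the role name, and the external ID value are all placeholders.

```python
import json


def cross_account_trust_policy(source_principal_arn: str, external_id: str = None, require_mfa: bool = False) -> dict:
    """Build a trust policy that lets one named principal assume the role."""
    statement = {
        "Effect": "Allow",
        "Principal": {"AWS": source_principal_arn},
        "Action": "sts:AssumeRole",
    }
    conditions = {}
    if external_id:
        # Guards against the confused deputy problem for third-party access.
        conditions["StringEquals"] = {"sts:ExternalId": external_id}
    if require_mfa:
        conditions["Bool"] = {"aws:MultiFactorAuthPresent": "true"}
    if conditions:
        statement["Condition"] = conditions
    return {"Version": "2012-10-17", "Statement": [statement]}


trust = cross_account_trust_policy(
    "arn:aws:iam::111111111111:role/ci-deployer",  # placeholder source role
    external_id="example-external-id",
)
print(json.dumps(trust, indent=2))
```

The permission policy attached to the role is a separate document; this builder only covers who may assume it.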
How do you rotate and manage access keys, and why prefer roles over long-lived keys? (Advanced)
Answer
I prefer roles and federation over long-lived access keys. If keys are unavoidable, I scope their permissions, store them securely, monitor last use, rotate with validation, disable before deleting, and alert on old or unused keys.
Technical explanation
OIDC federation for CI/CD is safer than storing static AWS keys in pipeline variables.
Roles deliver temporary credentials through STS that expire automatically, so a leak has a bounded lifetime; a static access key stays valid until someone deactivates or deletes it.
An IAM user can hold two access keys at once, which allows rotation without downtime: create the second key, move workloads to it, deactivate the old key, then delete it.
The IAM credential report and access key last-used data show key age and usage, which makes stale or unused keys easy to find.
Hands-on example
1. Generate the IAM credential report and list keys by age and last-used date.
2. Replace pipeline and workload keys with roles: instance profiles, IRSA, or OIDC federation for CI/CD.
3. For keys that must remain, create a second key, update the consumer, verify traffic on the new key, deactivate the old key, and delete it after a quiet period.
4. Alert on keys older than the rotation policy and on keys that have never been used.
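Alerting on old or unused keys comes down to a date comparison. The sketch below works on simplified key records that merge fields IAM reports separately (`AccessKeyId`, `Status`, `CreateDate`, and a last-used date); the record layout, the function name `keys_needing_action`, and the 90-day and 30-day thresholds are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone


def keys_needing_action(keys: list, now: datetime, max_age_days: int = 90, unused_days: int = 30) -> list:
    """Return (key id, reason) pairs for active keys that are too old or never used."""
    findings = []
    for key in keys:
        if key["Status"] != "Active":
            continue  # inactive keys are already on the way out
        age = now - key["CreateDate"]
        if age > timedelta(days=max_age_days):
            findings.append((key["AccessKeyId"], "rotate"))
        elif key.get("LastUsedDate") is None and age > timedelta(days=unused_days):
            findings.append((key["AccessKeyId"], "unused"))
    return findings


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
inventory = [  # hypothetical records
    {"AccessKeyId": "key-old", "Status": "Active",
     "CreateDate": now - timedelta(days=200), "LastUsedDate": now - timedelta(days=1)},
    {"AccessKeyId": "key-unused", "Status": "Active",
     "CreateDate": now - timedelta(days=45), "LastUsedDate": None},
    {"AccessKeyId": "key-ok", "Status": "Active",
     "CreateDate": now - timedelta(days=10), "LastUsedDate": now},
    {"AccessKeyId": "key-disabled", "Status": "Inactive",
     "CreateDate": now - timedelta(days=400), "LastUsedDate": None},
]
findings = keys_needing_action(inventory, now)
print(findings)
```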
Explain how you would design a secure, private EKS cluster with no public API endpoint. (Advanced)
Answer
For a private EKS cluster, I disable public API endpoint access, use private subnets, private connectivity for admins, VPC endpoints or controlled egress, scoped IAM, RBAC, private image pulls, and centralized logs.
Technical explanation
Private EKS clusters often fail operationally because required VPC endpoints for ECR, STS, logs, SSM, or Secrets Manager are missing.
With the public endpoint disabled, the Kubernetes API is reachable only from inside the VPC or from connected networks, so administrators and CI/CD need a VPN, Direct Connect, Transit Gateway, or a bastion or SSM-managed host.
Nodes in private subnets without NAT need interface endpoints for the services they call plus an S3 gateway endpoint, because ECR stores image layers in S3.
Combine IAM-level controls such as scoped cluster access and IRSA with Kubernetes RBAC and network policies so that network isolation is not the only defense.
Hands-on example
1. Create the cluster with private endpoint access enabled and public endpoint access disabled, with nodes in private subnets.
2. Add the VPC endpoints the nodes need, such as ECR (API and Docker), S3, STS, CloudWatch Logs, and SSM, and confirm that nodes join and pull images without internet egress.
3. Connect to the API through private connectivity only, and verify that kubectl fails from the public internet.
4. Enable control plane logging, map IAM principals to least-privilege RBAC, and test pod permissions with IRSA.
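Missing VPC endpoints are easy to detect before nodes fail to join. The sketch below compares the endpoint service names present in a VPC against a required list built from the services named above. The list itself is an assumption to adapt per cluster (S3 is added here because ECR image layers are served from S3), and `missing_endpoints` is a hypothetical helper that expects the existing service names to be fetched separately.

```python
# Assumed baseline; adjust to the add-ons and workloads the cluster actually runs.
REQUIRED_INTERFACE_SERVICES = ("ecr.api", "ecr.dkr", "sts", "logs", "ssm", "secretsmanager")
REQUIRED_GATEWAY_SERVICES = ("s3",)


def required_endpoint_names(region: str) -> list:
    """Full endpoint service names, e.g. com.amazonaws.eu-west-1.ecr.api."""
    services = REQUIRED_INTERFACE_SERVICES + REQUIRED_GATEWAY_SERVICES
    return [f"com.amazonaws.{region}.{service}" for service in services]


def missing_endpoints(region: str, existing_service_names) -> list:
    """Return required endpoint service names that the VPC does not have yet."""
    existing = set(existing_service_names)
    return [name for name in required_endpoint_names(region) if name not in existing]


present = ["com.amazonaws.eu-west-1.ecr.api", "com.amazonaws.eu-west-1.s3"]
gaps = missing_endpoints("eu-west-1", present)
print(gaps)
```

Running a check like this in the IaC pipeline turns a confusing node bootstrap failure into an explicit, named gap.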
What recent AWS service or feature have you adopted, and what problem did it solve for you? (Advanced)
Answer
A strong recent-feature answer should name a real feature, the problem it solved, and the measured outcome. A credible example is EKS access entries replacing fragile aws-auth edits with API-managed, auditable cluster access.
Technical explanation
Personalize the feature example with a real migration, trade-off, metric, or operational improvement.
A recent-feature answer should describe the old pain point, why the feature was selected, how it was rolled out, and what operational risk changed.
For EKS access entries, the key technical value is moving cluster access management from fragile ConfigMap editing toward API-managed, auditable access definitions.
Still validate least privilege, RBAC behavior, migration from legacy mappings, break-glass access, and CloudTrail evidence.
Hands-on example
1. Inventory the current manual or legacy process and identify the failure modes, such as access drift, lockout risk, or slow approvals.
2. Implement the feature in a non-production account or cluster first using IaC and a clear rollback path.
3. Migrate one low-risk workload or team, validate behavior, and compare metrics such as manual changes, incident count, deployment time, or audit evidence.
4. Roll out gradually and document the adoption story in STAR format for interviews.
How would you design a landing zone for a new organisation adopting AWS at scale? (Advanced)
Answer
A landing zone for AWS at scale establishes accounts, OUs, identity, networking, logging, security guardrails, tagging, budgets, account vending, and baseline IaC so teams can move fast within controlled boundaries.
Technical explanation
A landing zone should make the secure path the easy path through automated account vending and standard baselines.
A mature AWS foundation standardizes identity, accounts, networking, logging, security, tags, budgets, and deployment guardrails before teams scale usage.
The platform should provide paved roads: account vending, baseline modules, CI/CD patterns, observability, and clear ownership.
Guardrails should enable safe self-service rather than forcing every team through manual platform tickets.
Hands-on example
1. Create OUs, baseline accounts, IAM Identity Center permission sets, central logging, security services, network baselines, budgets, and required tags.
2. Define preventive guardrails with SCPs and detective guardrails with Config, GuardDuty, Security Hub, CloudTrail, and Access Analyzer.
3. Build account vending so new accounts receive standard VPC, logging, KMS, budget, tags, and CI/CD bootstrap automatically.
4. Test with a new workload account and verify developers can deploy safely without bypassing governance.
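The preventive guardrails in step 2 are typically a handful of deny statements. The following is a minimal sketch of a baseline SCP, assuming a Region allow-list and an abbreviated set of exempted global services; the function names, the exemption list, and the Region values are illustrative and would need review before attaching the policy to an OU. The size check reflects the documented 5120-character limit for SCP documents.

```python
import json

SCP_MAX_CHARS = 5120  # documented maximum size of an SCP document


def baseline_scp(allowed_regions: list) -> dict:
    """Build a small baseline SCP: protect org membership, CloudTrail, and Regions."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyLeavingOrganization",
                "Effect": "Deny",
                "Action": "organizations:LeaveOrganization",
                "Resource": "*",
            },
            {
                "Sid": "ProtectCloudTrail",
                "Effect": "Deny",
                "Action": ["cloudtrail:StopLogging", "cloudtrail:DeleteTrail"],
                "Resource": "*",
            },
            {
                "Sid": "DenyOutsideAllowedRegions",
                "Effect": "Deny",
                # Abbreviated exemption list for global services (assumption).
                "NotAction": ["iam:*", "organizations:*", "sts:*", "support:*"],
                "Resource": "*",
                "Condition": {"StringNotEquals": {"aws:RequestedRegion": allowed_regions}},
            },
        ],
    }


def policy_size(policy: dict) -> int:
    """Size of the policy when serialized without extra whitespace."""
    return len(json.dumps(policy, separators=(",", ":")))


scp = baseline_scp(["eu-west-1", "eu-central-1"])
print(policy_size(scp), policy_size(scp) <= SCP_MAX_CHARS)
```

Generating the policy from code lets account vending apply the same guardrail to every new OU and makes the size limit a test rather than a deployment surprise.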