Interview questions › Databases & Caching
Databases & Caching interview questions & answers
100 Databases & Caching interview questions, each answered three ways: a concise spoken answer, a technical explanation, and a hands-on example.
Tip: paste the job description + your resume into our free resume checker to see which of these skills the role actually requires.
All questions
- What is Amazon RDS, and what does it manage for you versus self-managed databases?
- What database engines does RDS support?
- What is the difference between RDS and Aurora?
- What is Multi-AZ in RDS, and how does automatic failover work?
- How long does an RDS Multi-AZ failover typically take, and what triggers it?
- What is the difference between Multi-AZ and a read replica?
- When would you use a read replica, and can it become a standalone database?
- Can a read replica be in a different region, and why would you do that?
- What is the difference between automated backups and manual snapshots in RDS?
- What is the backup retention period, and what is point-in-time recovery?
- How does point-in-time recovery work under the hood (snapshots plus transaction logs)?
- What is an RDS parameter group, and give an example of a parameter you would tune?
- What is an RDS option group?
- How do you perform a major version upgrade of RDS with minimal downtime?
- Walk me through a safe RDS PostgreSQL/MySQL upgrade with a rollback plan (as you did at Intuit).
- What is a blue-green deployment for RDS, and how does it reduce upgrade risk?
- How do you test that an application is compatible before a database major upgrade?
- How do you encrypt an RDS instance at rest, and can you encrypt an existing unencrypted one?
- How do you enforce encryption in transit (SSL/TLS) to RDS?
- How do you rotate database credentials, and how does Secrets Manager help?
- How do you monitor RDS performance (CloudWatch, Performance Insights, Enhanced Monitoring)?
- What is Performance Insights, and what does it help you find?
- What are common causes of high RDS CPU, and how do you investigate?
- How do you scale an RDS database vertically and what is the downtime impact?
- How do you scale reads versus writes on a relational database?
- What is connection pooling, and why does it matter for RDS (e.g., RDS Proxy)?
- What is RDS Proxy, and what problem does it solve?
- What is the difference between PostgreSQL and MySQL from an operator's view?
- What is MVCC, and how does PostgreSQL implement it?
- What is VACUUM in PostgreSQL, and why is autovacuum important?
- What is transaction wraparound in PostgreSQL, and why is it dangerous?
- What is the difference between a clustered and a non-clustered index?
- What is an index, and what are the trade-offs of adding one?
- How do you decide which columns to index?
- What is a composite index, and does column order matter?
- What is a covering index?
- How do you find and fix a slow query (EXPLAIN / EXPLAIN ANALYZE)?
- What does an execution plan tell you, and what is a sequential scan versus an index scan?
- What are the SQL isolation levels, and what anomalies does each prevent?
- What is the difference between a dirty read, non-repeatable read, and phantom read?
- What is a deadlock, and how do databases detect and resolve them?
- What is the difference between OLTP and OLAP workloads?
- What is database normalisation, and when would you denormalise?
- What is the difference between a primary key, a unique key, and a foreign key?
- What is the difference between DELETE, TRUNCATE, and DROP?
- What is the difference between an INNER JOIN and a LEFT JOIN?
- What is a transaction, and what do the ACID properties mean?
- What is WAL (write-ahead logging), and how does it support durability and replication?
- What is synchronous versus asynchronous replication, and the trade-offs?
- How do you migrate a database with near-zero downtime (e.g., using DMS or logical replication)?
- What is AWS DMS, and when would you use it?
- What is schema migration, and how do you run migrations safely in production?
- How do you make a schema change backward compatible during a rolling deploy?
- Why are some ALTER TABLE operations dangerous on a large table, and how do you mitigate?
- What is connection exhaustion, and how do you prevent it?
- What is ElastiCache, and what does it manage?
- What is the difference between ElastiCache Redis and Memcached?
- What is Redis, and what core data types does it support?
- What is Valkey, and why did the community fork it from Redis?
- Walk me through your ElastiCache Redis to Valkey migration: motivation, plan, and validation.
- How do you migrate from Redis to Valkey with minimal downtime?
- How is Valkey wire-compatible with Redis, and what risks remain in a migration?
- What are common caching patterns (cache-aside, read-through, write-through, write-behind)?
- What is cache-aside (lazy loading), and what are its failure modes?
- What is a cache stampede / thundering herd, and how do you prevent it?
- What is cache invalidation, and why is it hard?
- What are Redis eviction policies (e.g., LRU, LFU, volatile vs allkeys)?
- What happens when Redis reaches maxmemory?
- What is a TTL, and how do you choose one for cached data?
- What is the difference between Redis persistence options RDB and AOF?
- What are the durability trade-offs between RDB snapshots and AOF?
- What is Redis replication, and how does a replica stay in sync?
- What is Redis Sentinel, and what does it provide?
- What is Redis Cluster, and how does sharding with hash slots work?
- What is the difference between Redis Sentinel and Redis Cluster?
- How does ElastiCache provide high availability and automatic failover?
- What is a cache hit ratio, and how do you improve it?
- How do you monitor Redis/ElastiCache (evictions, memory, latency, hit ratio)?
- What is the impact of a large key or hot key in Redis, and how do you handle it?
- Why are some Redis commands (KEYS, FLUSHALL) dangerous in production?
- What is pipelining in Redis, and how does it improve throughput?
- What is the difference between Redis as a cache and Redis as a primary datastore?
- How would you use Redis for rate limiting?
- How would you use Redis for distributed locking, and what are the caveats?
- How do you secure Redis (AUTH, TLS, network isolation, no public exposure)?
- What is the difference between strong and eventual consistency, and where does caching sit?
- How do you keep a cache and database consistent?
- What is read-your-writes consistency, and how can caching break it?
- How would you design caching for a read-heavy microservice?
- How do you decide what to cache and what not to cache?
- How do you handle cache warming after a restart or failover?
- What is the blast radius of a cache outage, and how do you make the app degrade gracefully?
- How do you size an ElastiCache cluster (memory, nodes, shards)?
- What backup and restore options exist for ElastiCache Redis?
- How do you test a Redis or database failover without impacting users?
- How do you handle connection storms against the cache after a deploy?
- What is the difference between a write-through and write-behind cache regarding data loss?
- How would you debug rising latency on a service that depends on Redis and RDS together?
- What recent database or caching change have you made, and what did it improve?
- How do you plan capacity and storage autoscaling for a growing RDS database?
What is Amazon RDS, and what does it manage for you versus self-managed databases?Basic
Answer
Amazon RDS is AWS managed relational database service. AWS handles provisioning, backups, patching, storage operations, monitoring integration, and HA primitives, while the customer still owns schema design, query tuning, access control, data correctness, capacity choices, and application connection behavior.
Technical explanation
RDS removes much of the undifferentiated DBA work but it does not remove database engineering accountability.
The shared responsibility line matters: AWS manages the platform; the team manages workload design, users, queries, parameters, and safe changes.
A strong SRE answer mentions both managed-service benefits and remaining operational risks such as bad SQL, connection storms, and poor indexing.
Hands-on example
Create a small encrypted PostgreSQL instance, then validate what remains your responsibility:
$ aws rds create-db-instance --db-instance-identifier app-dev-pg --engine postgres --db-instance-class db.t4g.micro --allocated-storage 20 --backup-retention-period 7 --storage-encrypted
After creation: create least-privilege app users, restrict the security group to app subnets, enable slow query logging, test restore, and baseline CPU, connections, and query latency.
What database engines does RDS support?Basic
Answer
RDS supports multiple relational engines, including PostgreSQL, MySQL, MariaDB, Oracle, SQL Server, and Db2. Aurora is part of the RDS family but is a separate AWS-built engine compatible with MySQL and PostgreSQL. I always verify version and feature availability by Region before finalizing design.
Technical explanation
Engine choice affects HA behavior, licensing, extensions, replication, upgrade method, parameters, and operational tooling.
Open-source engines are often simpler from a licensing perspective, while Oracle, SQL Server, and Db2 need edition and option review.
Aurora should be evaluated separately because its storage, replica, failover, and endpoint behavior differs from standard RDS instances.
Hands-on example
Discovery example:
$ aws rds describe-db-engine-versions --engine postgres --query "DBEngineVersions[].EngineVersion" --output text
Decision checklist: required extensions, Region/version support, HA target, licensing, read scaling, backup/restore needs, and team experience.
What is the difference between RDS and Aurora?Basic
Answer
RDS is the managed service umbrella for standard database engines; Aurora is AWS cloud-native relational engine compatible with MySQL and PostgreSQL. Aurora uses a distributed storage layer and cluster endpoints, while standard RDS is closer to a managed DB instance with optional Multi-AZ and replicas.
Technical explanation
Aurora decouples compute from a replicated storage layer more than standard RDS.
Standard RDS can be simpler for lift-and-shift workloads; Aurora can help with read scale, failover behavior, and cloud-native features.
Compatibility testing is mandatory because Aurora-compatible is not the same as identical for every extension, parameter, or query plan.
Hands-on example
Example architecture: use the Aurora writer endpoint for writes and the reader endpoint for read-only traffic:
APP_DB_WRITE_HOST=orders.cluster-abc.us-east-1.rds.amazonaws.com
APP_DB_READ_HOST=orders.cluster-ro-abc.us-east-1.rds.amazonaws.com
Run integration and explain-plan tests before choosing Aurora over standard RDS.
What is Multi-AZ in RDS, and how does automatic failover work?Basic
Answer
RDS Multi-AZ is a high-availability deployment where RDS maintains a standby or alternate writer in another Availability Zone and automatically fails over when the primary cannot serve safely. Traditional Multi-AZ DB instance failover is commonly around 60 to 120 seconds, but application recovery can be longer.
Technical explanation
Multi-AZ is for availability, not read scaling in the traditional DB instance model.
Failover can be triggered by infrastructure failure, AZ disruption, maintenance, storage/network issues, or manual reboot with failover.
Applications must reconnect, retry idempotently, respect DNS, and handle in-flight transaction failure.
Hands-on example
Staging failover test:
$ while true; do date; psql "$DB_URL" -c "select now();" && echo ok || echo failed; sleep 1; done
$ aws rds reboot-db-instance --db-instance-identifier orders-stage --force-failover
Measure first failure, first success after recovery, API 5xx, p95 latency, and whether any pod restart was required.
How long does an RDS Multi-AZ failover typically take, and what triggers it?Basic
Answer
RDS Multi-AZ is a high-availability deployment where RDS maintains a standby or alternate writer in another Availability Zone and automatically fails over when the primary cannot serve safely. Traditional Multi-AZ DB instance failover is commonly around 60 to 120 seconds, but application recovery can be longer.
Technical explanation
Multi-AZ is for availability, not read scaling in the traditional DB instance model.
Failover can be triggered by infrastructure failure, AZ disruption, maintenance, storage/network issues, or manual reboot with failover.
Applications must reconnect, retry idempotently, respect DNS, and handle in-flight transaction failure.
Hands-on example
Staging failover test:
$ while true; do date; psql "$DB_URL" -c "select now();" && echo ok || echo failed; sleep 1; done
$ aws rds reboot-db-instance --db-instance-identifier orders-stage --force-failover
Measure first failure, first success after recovery, API 5xx, p95 latency, and whether any pod restart was required.
What is the difference between Multi-AZ and a read replica?Basic
Answer
Multi-AZ protects the writer for HA and failover; read replicas are readable copies used for read scaling, reporting, migration, and DR. Read replicas are generally asynchronous, can lag, can often be cross-Region, and can be promoted to standalone databases when needed.
Technical explanation
Use Multi-AZ for availability and read replicas for read workload isolation or migration targets.
Replica lag means read-your-writes is not guaranteed from replicas.
Cross-Region replicas are useful for DR and locality, but require RPO/RTO, secrets, DNS, KMS, and failover runbooks.
Hands-on example
Read/write routing example:
POST /orders -> primary endpoint
GET /orders/{id} immediately after create -> primary endpoint
GET /orders/report -> read replica endpoint
Promotion command for migration or DR:
$ aws rds promote-read-replica --db-instance-identifier orders-replica-dr
Before promotion, check replica lag and stop writes if a clean cutover is required.
When would you use a read replica, and can it become a standalone database?Basic
Answer
Multi-AZ protects the writer for HA and failover; read replicas are readable copies used for read scaling, reporting, migration, and DR. Read replicas are generally asynchronous, can lag, can often be cross-Region, and can be promoted to standalone databases when needed.
Technical explanation
Use Multi-AZ for availability and read replicas for read workload isolation or migration targets.
Replica lag means read-your-writes is not guaranteed from replicas.
Cross-Region replicas are useful for DR and locality, but require RPO/RTO, secrets, DNS, KMS, and failover runbooks.
Hands-on example
Read/write routing example:
POST /orders -> primary endpoint
GET /orders/{id} immediately after create -> primary endpoint
GET /orders/report -> read replica endpoint
Promotion command for migration or DR:
$ aws rds promote-read-replica --db-instance-identifier orders-replica-dr
Before promotion, check replica lag and stop writes if a clean cutover is required.
Can a read replica be in a different region, and why would you do that?Basic
Answer
Multi-AZ protects the writer for HA and failover; read replicas are readable copies used for read scaling, reporting, migration, and DR. Read replicas are generally asynchronous, can lag, can often be cross-Region, and can be promoted to standalone databases when needed.
Technical explanation
Use Multi-AZ for availability and read replicas for read workload isolation or migration targets.
Replica lag means read-your-writes is not guaranteed from replicas.
Cross-Region replicas are useful for DR and locality, but require RPO/RTO, secrets, DNS, KMS, and failover runbooks.
Hands-on example
Read/write routing example:
POST /orders -> primary endpoint
GET /orders/{id} immediately after create -> primary endpoint
GET /orders/report -> read replica endpoint
Promotion command for migration or DR:
$ aws rds promote-read-replica --db-instance-identifier orders-replica-dr
Before promotion, check replica lag and stop writes if a clean cutover is required.
What is the difference between automated backups and manual snapshots in RDS?Basic
Answer
Automated backups are retained for a configured window and support point-in-time recovery. Manual snapshots are retained until deleted and are useful before risky changes or for long-term copies. PITR restores from a base backup plus transaction logs to create a new database at a chosen timestamp.
Technical explanation
PITR does not rewind the existing DB in place; it restores a new instance that must be validated and cut over or used for data repair.
Retention should be based on business recovery needs and detection time for corruption, not just default cost settings.
Restore testing is part of the backup strategy because backup success does not prove recovery success.
Hands-on example
PITR example:
$ aws rds restore-db-instance-to-point-in-time --source-db-instance-identifier orders-prod --target-db-instance-identifier orders-pitr-20260630 --restore-time 2026-06-30T10:14:30Z
Then compare row counts, extract repaired rows if needed, run smoke tests, and decide between partial repair and full cutover.
What is the backup retention period, and what is point-in-time recovery?Basic
Answer
Automated backups are retained for a configured window and support point-in-time recovery. Manual snapshots are retained until deleted and are useful before risky changes or for long-term copies. PITR restores from a base backup plus transaction logs to create a new database at a chosen timestamp.
Technical explanation
PITR does not rewind the existing DB in place; it restores a new instance that must be validated and cut over or used for data repair.
Retention should be based on business recovery needs and detection time for corruption, not just default cost settings.
Restore testing is part of the backup strategy because backup success does not prove recovery success.
Hands-on example
PITR example:
$ aws rds restore-db-instance-to-point-in-time --source-db-instance-identifier orders-prod --target-db-instance-identifier orders-pitr-20260630 --restore-time 2026-06-30T10:14:30Z
Then compare row counts, extract repaired rows if needed, run smoke tests, and decide between partial repair and full cutover.
How does point-in-time recovery work under the hood (snapshots plus transaction logs)?Basic
Answer
Automated backups are retained for a configured window and support point-in-time recovery. Manual snapshots are retained until deleted and are useful before risky changes or for long-term copies. PITR restores from a base backup plus transaction logs to create a new database at a chosen timestamp.
Technical explanation
PITR does not rewind the existing DB in place; it restores a new instance that must be validated and cut over or used for data repair.
Retention should be based on business recovery needs and detection time for corruption, not just default cost settings.
Restore testing is part of the backup strategy because backup success does not prove recovery success.
Hands-on example
PITR example:
$ aws rds restore-db-instance-to-point-in-time --source-db-instance-identifier orders-prod --target-db-instance-identifier orders-pitr-20260630 --restore-time 2026-06-30T10:14:30Z
Then compare row counts, extract repaired rows if needed, run smoke tests, and decide between partial repair and full cutover.
What is an RDS parameter group, and give an example of a parameter you would tune?Basic
Answer
An RDS parameter group is a managed set of database engine configuration parameters. It controls behavior such as logging, memory-related settings, connection limits, TLS enforcement flags, query planner settings, and extension preload settings.
Technical explanation
Some parameters are dynamic; others require reboot, so the apply type must be checked before production change.
Parameter groups should be managed through IaC and promoted through environments.
Changing max_connections, logging, or planner parameters can have broad performance impact and should be measured.
Hands-on example
PostgreSQL slow query visibility example:
Set log_min_duration_statement = 500
Set shared_preload_libraries = pg_stat_statements
Then run:
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
SELECT query, calls, total_exec_time, mean_exec_time FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 10;
What is an RDS option group?Basic
Answer
An RDS option group enables engine-specific optional capabilities that are not simple parameter settings, especially for Oracle, SQL Server, and some MySQL/MariaDB features. It can attach features such as native backup/restore, audit plugins, or other engine options depending on support.
Technical explanation
Parameter groups tune engine settings; option groups attach optional engine features or integrations.
Some options require IAM roles, S3 access, ports, license review, or downtime.
Treat option group changes as production changes with compatibility and rollback review.
Hands-on example
Evaluation example:
$ aws rds describe-option-group-options --engine-name sqlserver-se
Before enabling an option, confirm engine edition/version support, required IAM role, port exposure, downtime/reboot requirement, and whether the option can be removed later.
How do you perform a major version upgrade of RDS with minimal downtime?Basic
Answer
For a major RDS upgrade, I avoid untested in-place changes on critical systems. I prefer blue/green deployment, logical replication, DMS, or a replica-style approach so most upgrade work and validation happens away from production before a controlled cutover.
Technical explanation
Major upgrades can change query plans, parameters, extensions, authentication behavior, reserved words, and client compatibility.
Blue/green reduces risk by creating a synchronized green environment where the new version can be tested before switchover.
Rollback must be designed before cutover; after writes hit the new version, rollback may require restore or reverse migration.
Hands-on example
Upgrade runbook:
1. Restore prod snapshot to staging and upgrade it.
2. Run app regression, migrations, batch jobs, and top SQL comparisons.
3. Create blue/green or logical target.
4. Freeze schema changes.
5. Cut over during approved window.
6. Monitor p95 latency, DB load, locks, error rate, and slow queries.
7. Keep old environment read-only until rollback window expires.
Walk me through a safe RDS PostgreSQL/MySQL upgrade with a rollback plan (as you did at Intuit).Basic
Answer
For a major RDS upgrade, I avoid untested in-place changes on critical systems. I prefer blue/green deployment, logical replication, DMS, or a replica-style approach so most upgrade work and validation happens away from production before a controlled cutover.
Technical explanation
Major upgrades can change query plans, parameters, extensions, authentication behavior, reserved words, and client compatibility.
Blue/green reduces risk by creating a synchronized green environment where the new version can be tested before switchover.
Rollback must be designed before cutover; after writes hit the new version, rollback may require restore or reverse migration.
Hands-on example
Upgrade runbook:
1. Restore prod snapshot to staging and upgrade it.
2. Run app regression, migrations, batch jobs, and top SQL comparisons.
3. Create blue/green or logical target.
4. Freeze schema changes.
5. Cut over during approved window.
6. Monitor p95 latency, DB load, locks, error rate, and slow queries.
7. Keep old environment read-only until rollback window expires.
What is a blue-green deployment for RDS, and how does it reduce upgrade risk?Basic
Answer
For a major RDS upgrade, I avoid untested in-place changes on critical systems. I prefer blue/green deployment, logical replication, DMS, or a replica-style approach so most upgrade work and validation happens away from production before a controlled cutover.
Technical explanation
Major upgrades can change query plans, parameters, extensions, authentication behavior, reserved words, and client compatibility.
Blue/green reduces risk by creating a synchronized green environment where the new version can be tested before switchover.
Rollback must be designed before cutover; after writes hit the new version, rollback may require restore or reverse migration.
Hands-on example
Upgrade runbook:
1. Restore prod snapshot to staging and upgrade it.
2. Run app regression, migrations, batch jobs, and top SQL comparisons.
3. Create blue/green or logical target.
4. Freeze schema changes.
5. Cut over during approved window.
6. Monitor p95 latency, DB load, locks, error rate, and slow queries.
7. Keep old environment read-only until rollback window expires.
How do you test that an application is compatible before a database major upgrade?Basic
Answer
For a major RDS upgrade, I avoid untested in-place changes on critical systems. I prefer blue/green deployment, logical replication, DMS, or a replica-style approach so most upgrade work and validation happens away from production before a controlled cutover.
Technical explanation
Major upgrades can change query plans, parameters, extensions, authentication behavior, reserved words, and client compatibility.
Blue/green reduces risk by creating a synchronized green environment where the new version can be tested before switchover.
Rollback must be designed before cutover; after writes hit the new version, rollback may require restore or reverse migration.
Hands-on example
Upgrade runbook:
1. Restore prod snapshot to staging and upgrade it.
2. Run app regression, migrations, batch jobs, and top SQL comparisons.
3. Create blue/green or logical target.
4. Freeze schema changes.
5. Cut over during approved window.
6. Monitor p95 latency, DB load, locks, error rate, and slow queries.
7. Keep old environment read-only until rollback window expires.
How do you encrypt an RDS instance at rest, and can you encrypt an existing unencrypted one?Basic
Answer
New RDS instances should be encrypted at creation with KMS. Existing unencrypted RDS storage cannot simply be encrypted in place; the normal method is snapshot, copy snapshot with encryption enabled, restore a new encrypted instance, validate, and cut over.
Technical explanation
Encryption covers storage and related backups/snapshots according to RDS and KMS behavior.
KMS key policy and cross-account or cross-Region permissions must be planned.
Because restore creates a new endpoint, cutover and rollback must be part of the migration plan.
Hands-on example
Encryption conversion example:
$ aws rds create-db-snapshot --db-instance-identifier legacy-orders --db-snapshot-identifier legacy-orders-pre-encryption
$ aws rds copy-db-snapshot --source-db-snapshot-identifier legacy-orders-pre-encryption --target-db-snapshot-identifier legacy-orders-encrypted --kms-key-id alias/rds-prod
$ aws rds restore-db-instance-from-db-snapshot --db-instance-identifier orders-encrypted --db-snapshot-identifier legacy-orders-encrypted
How do you enforce encryption in transit (SSL/TLS) to RDS?Basic
Answer
Encryption in transit is enforced by requiring TLS from clients and configuring the engine to reject non-TLS connections where supported. For PostgreSQL on RDS, rds.force_ssl plus client sslmode settings are common. For MySQL, require SSL for users and trust the RDS CA bundle.
Technical explanation
At-rest encryption and in-transit encryption solve different problems and both are expected for sensitive workloads.
Clients should verify the server certificate, not just turn on encryption without validation.
CA rotation must be planned because old clients may fail if trust stores are stale.
Hands-on example
PostgreSQL TLS validation:
$ psql "host=orders.xxxxxx.us-east-1.rds.amazonaws.com dbname=orders user=app sslmode=verify-full sslrootcert=/etc/ssl/rds-ca.pem"
SELECT ssl, version, cipher FROM pg_stat_ssl WHERE pid = pg_backend_pid();
Expected: ssl = true with an approved TLS version/cipher.
How do you rotate database credentials, and how does Secrets Manager help?Basic
Answer
I rotate DB credentials through AWS Secrets Manager or an equivalent secrets system. The rotation process updates the database user password, updates secret versions, and ensures applications pick up the new credential without hardcoded values or manual redeploy risk.
Technical explanation
Use least-privilege application users instead of the master user.
Rotation must account for connection pools and long-lived pods using old credentials.
Alternating-user rotation can provide lower downtime because one credential remains valid while the other rotates.
Hands-on example
Rotation checklist:
1. Store prod/orders/db in Secrets Manager.
2. Grant only the app role secretsmanager:GetSecretValue.
3. Enable tested rotation Lambda.
4. Restart or reload app pools safely.
5. Monitor failed logins and connection errors.
CLI:
$ aws secretsmanager describe-secret --secret-id prod/orders/db
How do you monitor RDS performance (CloudWatch, Performance Insights, Enhanced Monitoring)?Basic
Answer
RDS performance monitoring should combine CloudWatch, Performance Insights, Enhanced Monitoring, database-native views, logs, and application SLOs. CloudWatch shows resource symptoms; Performance Insights shows database load, waits, and top SQL; the app tells whether users are impacted.
Technical explanation
Key metrics include CPU, connections, free memory, free storage, IOPS, read/write latency, disk queue depth, replica lag, and DB load.
High CPU is investigated through top SQL, waits, connection count, recent deployments, locks, and query plans.
The best incident response correlates application latency with database load and the exact SQL or wait event causing it.
Hands-on example
Incident triage SQL:
SELECT state, wait_event_type, wait_event, count(*) FROM pg_stat_activity GROUP BY 1,2,3 ORDER BY count(*) DESC;
Then open Performance Insights, identify top SQL by DB load, run EXPLAIN (ANALYZE, BUFFERS) in staging, and mitigate with query change, index, pool cap, feature flag, or scale-up.
What is Performance Insights, and what does it help you find?Basic
Answer
RDS performance monitoring should combine CloudWatch, Performance Insights, Enhanced Monitoring, database-native views, logs, and application SLOs. CloudWatch shows resource symptoms; Performance Insights shows database load, waits, and top SQL; the app tells whether users are impacted.
Technical explanation
Key metrics include CPU, connections, free memory, free storage, IOPS, read/write latency, disk queue depth, replica lag, and DB load.
High CPU is investigated through top SQL, waits, connection count, recent deployments, locks, and query plans.
The best incident response correlates application latency with database load and the exact SQL or wait event causing it.
Hands-on example
Incident triage SQL:
SELECT state, wait_event_type, wait_event, count(*) FROM pg_stat_activity GROUP BY 1,2,3 ORDER BY count(*) DESC;
Then open Performance Insights, identify top SQL by DB load, run EXPLAIN (ANALYZE, BUFFERS) in staging, and mitigate with query change, index, pool cap, feature flag, or scale-up.
What are common causes of high RDS CPU, and how do you investigate?Basic
Answer
RDS performance monitoring should combine CloudWatch, Performance Insights, Enhanced Monitoring, database-native views, logs, and application SLOs. CloudWatch shows resource symptoms; Performance Insights shows database load, waits, and top SQL; the app tells whether users are impacted.
Technical explanation
Key metrics include CPU, connections, free memory, free storage, IOPS, read/write latency, disk queue depth, replica lag, and DB load.
High CPU is investigated through top SQL, waits, connection count, recent deployments, locks, and query plans.
The best incident response correlates application latency with database load and the exact SQL or wait event causing it.
Hands-on example
Incident triage SQL:
SELECT state, wait_event_type, wait_event, count(*) FROM pg_stat_activity GROUP BY 1,2,3 ORDER BY count(*) DESC;
Then open Performance Insights, identify top SQL by DB load, run EXPLAIN (ANALYZE, BUFFERS) in staging, and mitigate with query change, index, pool cap, feature flag, or scale-up.
How do you scale an RDS database vertically and what is the downtime impact?Basic
Answer
Vertical scaling changes the RDS DB instance class to add CPU, memory, or network capacity. It can require downtime or a failover depending on engine, class change, Multi-AZ setup, and whether the change is applied immediately or in a maintenance window.
Technical explanation
Scale-up can buy time but does not fix bad SQL, locks, missing indexes, or connection storms.
Multi-AZ can reduce impact but still requires app reconnect and retry behavior.
Scale-down should be tested because less memory can reduce cache hit ratio and change query behavior.
Hands-on example
Scale-up example:
$ aws rds modify-db-instance --db-instance-identifier orders-prod --db-instance-class db.r7g.2xlarge --apply-immediately
Watch RDS events, p95 latency, DB load, CPU, free memory, slow queries, and application connection errors during the change.
How do you scale reads versus writes on a relational database?Basic
Answer
Reads can be scaled with replicas, caching, materialized views, query optimization, denormalized read models, and analytics offload. Writes are harder because consistency and uniqueness usually concentrate around one writer, so write scaling may require partitioning, sharding, queues, batching, or architecture changes.
Technical explanation
Read scaling is safe only when stale reads are acceptable or read-after-write paths use the primary.
Write scaling is constrained by transactions, constraints, referential integrity, and cross-shard behavior.
Most teams should tune queries and cache reads before redesigning writes.
Hands-on example
Example split:
POST /orders -> primary RDS writer
GET /products -> Redis/Valkey cache
GET /reports -> read replica or warehouse
Heavy post-processing -> queue and worker
This keeps OLTP writes short and moves expensive reads away from the writer.
What is connection pooling, and why does it matter for RDS (e.g., RDS Proxy)?Basic
Answer
Connection pooling reuses a bounded set of database connections and prevents unbounded clients from exhausting the database. RDS Proxy is a managed pooling proxy that helps with bursty traffic, Lambda concurrency, failover recovery, and credential rotation.
Technical explanation
Each DB connection consumes memory and CPU; in Kubernetes total connections equal pod count times pool size.
RDS Proxy multiplexes when session state allows it, but pinned sessions reduce benefits.
Prevention includes pool caps, backoff with jitter, timeouts, RDS Proxy/PgBouncer, and avoiding unbounded concurrency.
Hands-on example
Pool sizing example:
DB safe app budget = 850 connections
App pods = 50
Max pool per pod = floor(850/50) = 17
Choose 10 to 15 and load test.
PostgreSQL check:
SELECT usename, application_name, state, count(*) FROM pg_stat_activity GROUP BY 1,2,3 ORDER BY count(*) DESC;
What is RDS Proxy, and what problem does it solve?Basic
Answer
Connection pooling reuses a bounded set of database connections and prevents unbounded clients from exhausting the database. RDS Proxy is a managed pooling proxy that helps with bursty traffic, Lambda concurrency, failover recovery, and credential rotation.
Technical explanation
Each DB connection consumes memory and CPU; in Kubernetes total connections equal pod count times pool size.
RDS Proxy multiplexes when session state allows it, but pinned sessions reduce benefits.
Prevention includes pool caps, backoff with jitter, timeouts, RDS Proxy/PgBouncer, and avoiding unbounded concurrency.
Hands-on example
Pool sizing example:
DB safe app budget = 850 connections
App pods = 50
Max pool per pod = floor(850/50) = 17
Choose 10 to 15 and load test.
PostgreSQL check:
SELECT usename, application_name, state, count(*) FROM pg_stat_activity GROUP BY 1,2,3 ORDER BY count(*) DESC;
What is the difference between PostgreSQL and MySQL from an operator's view?Basic
Answer
PostgreSQL and MySQL differ operationally in storage behavior, MVCC implementation, replication, indexing, planner behavior, extensions, and maintenance. PostgreSQL MVCC keeps row versions for concurrent snapshots, and VACUUM/autovacuum is essential to clean dead tuples and prevent wraparound.
Technical explanation
PostgreSQL readers do not usually block writers, but old row versions create dead tuples and bloat if vacuum falls behind.
Long-running transactions can prevent cleanup and increase wraparound risk.
Operational checks should include dead tuples, transaction age, autovacuum timing, slow queries, and extension compatibility.
Hands-on example
PostgreSQL health checks:
SELECT relname, n_live_tup, n_dead_tup, last_autovacuum FROM pg_stat_user_tables ORDER BY n_dead_tup DESC LIMIT 20;
SELECT datname, age(datfrozenxid) AS xid_age FROM pg_database ORDER BY xid_age DESC;
If values are high, investigate long transactions, tune autovacuum per hot table, and schedule safe VACUUM work.
What is MVCC, and how does PostgreSQL implement it?Basic
Answer
PostgreSQL and MySQL differ operationally in storage behavior, MVCC implementation, replication, indexing, planner behavior, extensions, and maintenance. PostgreSQL MVCC keeps row versions for concurrent snapshots, and VACUUM/autovacuum is essential to clean dead tuples and prevent wraparound.
Technical explanation
PostgreSQL readers do not usually block writers, but old row versions create dead tuples and bloat if vacuum falls behind.
Long-running transactions can prevent cleanup and increase wraparound risk.
Operational checks should include dead tuples, transaction age, autovacuum timing, slow queries, and extension compatibility.
Hands-on example
PostgreSQL health checks:
SELECT relname, n_live_tup, n_dead_tup, last_autovacuum FROM pg_stat_user_tables ORDER BY n_dead_tup DESC LIMIT 20;
SELECT datname, age(datfrozenxid) AS xid_age FROM pg_database ORDER BY xid_age DESC;
If values are high, investigate long transactions, tune autovacuum per hot table, and schedule safe VACUUM work.
What is VACUUM in PostgreSQL, and why is autovacuum important?Basic
Answer
PostgreSQL and MySQL differ operationally in storage behavior, MVCC implementation, replication, indexing, planner behavior, extensions, and maintenance. PostgreSQL MVCC keeps row versions for concurrent snapshots, and VACUUM/autovacuum is essential to clean dead tuples and prevent wraparound.
Technical explanation
PostgreSQL readers do not usually block writers, but old row versions create dead tuples and bloat if vacuum falls behind.
Long-running transactions can prevent cleanup and increase wraparound risk.
Operational checks should include dead tuples, transaction age, autovacuum timing, slow queries, and extension compatibility.
Hands-on example
PostgreSQL health checks:
SELECT relname, n_live_tup, n_dead_tup, last_autovacuum FROM pg_stat_user_tables ORDER BY n_dead_tup DESC LIMIT 20;
SELECT datname, age(datfrozenxid) AS xid_age FROM pg_database ORDER BY xid_age DESC;
If values are high, investigate long transactions, tune autovacuum per hot table, and schedule safe VACUUM work.
What is transaction wraparound in PostgreSQL, and why is it dangerous?Basic
Answer
PostgreSQL and MySQL differ operationally in storage behavior, MVCC implementation, replication, indexing, planner behavior, extensions, and maintenance. PostgreSQL MVCC keeps row versions for concurrent snapshots, and VACUUM/autovacuum is essential to clean dead tuples and prevent wraparound.
Technical explanation
PostgreSQL readers do not usually block writers, but old row versions create dead tuples and bloat if vacuum falls behind.
Long-running transactions can prevent cleanup and increase wraparound risk.
Operational checks should include dead tuples, transaction age, autovacuum timing, slow queries, and extension compatibility.
Hands-on example
PostgreSQL health checks:
SELECT relname, n_live_tup, n_dead_tup, last_autovacuum FROM pg_stat_user_tables ORDER BY n_dead_tup DESC LIMIT 20;
SELECT datname, age(datfrozenxid) AS xid_age FROM pg_database ORDER BY xid_age DESC;
If values are high, investigate long transactions, tune autovacuum per hot table, and schedule safe VACUUM work.
What is the difference between a clustered and a non-clustered index?Basic
Answer
A clustered index controls or reflects the physical row order in engines that support it; a non-clustered index is a separate lookup structure that points to the table row. In MySQL InnoDB the primary key is clustered, while PostgreSQL normally uses separate indexes and only physically reorders with CLUSTER when explicitly run.
Technical explanation
Use real workload evidence from pg_stat_statements, slow logs, Performance Insights, or traces before adding indexes.
EXPLAIN shows the plan; EXPLAIN ANALYZE runs the query and compares estimated versus actual rows and timing.
Sequential scans are not always bad; for small tables or low-selectivity filters they may be optimal.
Hands-on example
Index tuning example:
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders WHERE customer_id = $1 ORDER BY created_at DESC LIMIT 20;
CREATE INDEX CONCURRENTLY idx_orders_customer_created ON orders(customer_id, created_at DESC);
Re-run EXPLAIN and confirm lower execution time, fewer buffers read, and no large sort. For covering reads, add INCLUDE columns where appropriate.
What is an index, and what are the trade-offs of adding one?Basic
Answer
An index is a structure that lets the database find rows without scanning the full table. It improves selected reads, joins, ordering, and uniqueness checks, but it costs storage and slows writes because the index must be maintained.
Technical explanation
Use real workload evidence from pg_stat_statements, slow logs, Performance Insights, or traces before adding indexes.
EXPLAIN shows the plan; EXPLAIN ANALYZE runs the query and compares estimated versus actual rows and timing.
Sequential scans are not always bad; for small tables or low-selectivity filters they may be optimal.
Hands-on example
Index tuning example:
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders WHERE customer_id = $1 ORDER BY created_at DESC LIMIT 20;
CREATE INDEX CONCURRENTLY idx_orders_customer_created ON orders(customer_id, created_at DESC);
Re-run EXPLAIN and confirm lower execution time, fewer buffers read, and no large sort. For covering reads, add INCLUDE columns where appropriate.
How do you decide which columns to index?Intermediate
Answer
I decide indexes from real workload evidence: WHERE predicates, JOIN keys, ORDER BY, GROUP BY, uniqueness rules, and top expensive queries. I prioritize selective columns and query-specific composite indexes, then verify with EXPLAIN ANALYZE.
Technical explanation
Use real workload evidence from pg_stat_statements, slow logs, Performance Insights, or traces before adding indexes.
EXPLAIN shows the plan; EXPLAIN ANALYZE runs the query and compares estimated versus actual rows and timing.
Sequential scans are not always bad; for small tables or low-selectivity filters they may be optimal.
Hands-on example
Index tuning example:
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders WHERE customer_id = $1 ORDER BY created_at DESC LIMIT 20;
CREATE INDEX CONCURRENTLY idx_orders_customer_created ON orders(customer_id, created_at DESC);
Re-run EXPLAIN and confirm lower execution time, fewer buffers read, and no large sort. For covering reads, add INCLUDE columns where appropriate.
What is a composite index, and does column order matter?Intermediate
Answer
A composite index covers multiple columns, and column order matters because the optimizer can use the leftmost prefix most effectively. Equality filters usually come before range or sort columns, but the final decision is based on the exact query and data distribution.
Technical explanation
Use real workload evidence from pg_stat_statements, slow logs, Performance Insights, or traces before adding indexes.
EXPLAIN shows the plan; EXPLAIN ANALYZE runs the query and compares estimated versus actual rows and timing.
Sequential scans are not always bad; for small tables or low-selectivity filters they may be optimal.
Hands-on example
Index tuning example:
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders WHERE customer_id = $1 ORDER BY created_at DESC LIMIT 20;
CREATE INDEX CONCURRENTLY idx_orders_customer_created ON orders(customer_id, created_at DESC);
Re-run EXPLAIN and confirm lower execution time, fewer buffers read, and no large sort. For covering reads, add INCLUDE columns where appropriate.
What is a covering index?Intermediate
Answer
A covering index includes all columns needed by a query so the database can answer from the index without reading the base table, when visibility rules allow it. In PostgreSQL, INCLUDE columns are commonly used for this pattern.
Technical explanation
Use real workload evidence from pg_stat_statements, slow logs, Performance Insights, or traces before adding indexes.
EXPLAIN shows the plan; EXPLAIN ANALYZE runs the query and compares estimated versus actual rows and timing.
Sequential scans are not always bad; for small tables or low-selectivity filters they may be optimal.
Hands-on example
Index tuning example:
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders WHERE customer_id = $1 ORDER BY created_at DESC LIMIT 20;
CREATE INDEX CONCURRENTLY idx_orders_customer_created ON orders(customer_id, created_at DESC);
Re-run EXPLAIN and confirm lower execution time, fewer buffers read, and no large sort. For covering reads, add INCLUDE columns where appropriate.
How do you find and fix a slow query (EXPLAIN / EXPLAIN ANALYZE)?Intermediate
Answer
I find a slow query from traces, slow logs, Performance Insights, or pg_stat_statements, then inspect it with EXPLAIN or EXPLAIN ANALYZE. I fix the root cause through better indexes, query rewrite, updated statistics, batching, or schema changes.
Technical explanation
Use real workload evidence from pg_stat_statements, slow logs, Performance Insights, or traces before adding indexes.
EXPLAIN shows the plan; EXPLAIN ANALYZE runs the query and compares estimated versus actual rows and timing.
Sequential scans are not always bad; for small tables or low-selectivity filters they may be optimal.
Hands-on example
Index tuning example:
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders WHERE customer_id = $1 ORDER BY created_at DESC LIMIT 20;
CREATE INDEX CONCURRENTLY idx_orders_customer_created ON orders(customer_id, created_at DESC);
Re-run EXPLAIN and confirm lower execution time, fewer buffers read, and no large sort. For covering reads, add INCLUDE columns where appropriate.
What does an execution plan tell you, and what is a sequential scan versus an index scan?Intermediate
Answer
An execution plan shows how the database will access, join, sort, and filter data. A sequential scan reads the table broadly; an index scan uses an index to find matching rows. The right choice depends on table size, selectivity, and cost.
Technical explanation
Use real workload evidence from pg_stat_statements, slow logs, Performance Insights, or traces before adding indexes.
EXPLAIN shows the plan; EXPLAIN ANALYZE runs the query and compares estimated versus actual rows and timing.
Sequential scans are not always bad; for small tables or low-selectivity filters they may be optimal.
Hands-on example
Index tuning example:
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders WHERE customer_id = $1 ORDER BY created_at DESC LIMIT 20;
CREATE INDEX CONCURRENTLY idx_orders_customer_created ON orders(customer_id, created_at DESC);
Re-run EXPLAIN and confirm lower execution time, fewer buffers read, and no large sort. For covering reads, add INCLUDE columns where appropriate.
What are the SQL isolation levels, and what anomalies does each prevent?Intermediate
Answer
Isolation levels define what concurrent transactions are allowed to see. The standard levels are Read Uncommitted, Read Committed, Repeatable Read, and Serializable; stronger isolation prevents more anomalies but can increase locking or retry needs.
Technical explanation
Stronger isolation can improve correctness but may increase blocking or retry requirements.
Deadlocks are handled by aborting a victim transaction; the application must retry the whole transaction safely.
Short transactions, consistent lock ordering, proper indexes, and idempotent retry logic are essential.
Hands-on example
Safe transfer example:
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
SELECT id FROM accounts WHERE id IN (1,2) ORDER BY id FOR UPDATE;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;
On serialization failure or deadlock, retry the whole transaction, not just one statement.
What is the difference between a dirty read, non-repeatable read, and phantom read?Intermediate
Answer
A dirty read sees uncommitted data, a non-repeatable read sees the same row change between reads in one transaction, and a phantom read sees the result set of a predicate change because rows were inserted, deleted, or updated by another transaction.
Technical explanation
Stronger isolation can improve correctness but may increase blocking or retry requirements.
Deadlocks are handled by aborting a victim transaction; the application must retry the whole transaction safely.
Short transactions, consistent lock ordering, proper indexes, and idempotent retry logic are essential.
Hands-on example
Safe transfer example:
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
SELECT id FROM accounts WHERE id IN (1,2) ORDER BY id FOR UPDATE;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;
On serialization failure or deadlock, retry the whole transaction, not just one statement.
What is a deadlock, and how do databases detect and resolve them?Intermediate
Answer
A deadlock occurs when transactions wait on each other in a cycle. Databases detect the cycle and abort one transaction, so applications must retry the full transaction and engineers should fix lock ordering, transaction length, and indexes.
Technical explanation
Stronger isolation can improve correctness but may increase blocking or retry requirements.
Deadlocks are handled by aborting a victim transaction; the application must retry the whole transaction safely.
Short transactions, consistent lock ordering, proper indexes, and idempotent retry logic are essential.
Hands-on example
Safe transfer example:
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
SELECT id FROM accounts WHERE id IN (1,2) ORDER BY id FOR UPDATE;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;
On serialization failure or deadlock, retry the whole transaction, not just one statement.
What is the difference between OLTP and OLAP workloads?Intermediate
Answer
OLTP workloads serve many small, concurrent transactions with low latency and strong correctness. OLAP workloads serve analytical scans, aggregations, and reports over large historical data, so heavy OLAP should usually be isolated from the OLTP primary.
Technical explanation
OLTP is optimized for short, indexed, concurrent transactions; OLAP is optimized for large scans, joins, and aggregations.
Running OLAP reports on the OLTP primary can consume I/O, CPU, buffers, and locks needed by user traffic.
Common designs isolate analytics through read replicas, CDC pipelines, warehouses, materialized views, or summary tables.
Hands-on example
Example separation:
Checkout API reads/writes orders on the primary RDS instance.
Monthly revenue report runs on a warehouse fed by CDC.
Immediate mitigation for a heavy reporting user:
ALTER ROLE reporting_user SET statement_timeout = "5min";
What is database normalisation, and when would you denormalise?Intermediate
Answer
Normalization reduces duplication and update anomalies by modeling entities in separate related tables. Denormalization intentionally duplicates or precomputes data for read performance, but it adds consistency and reconciliation work.
Technical explanation
Normalization reduces duplicated facts and prevents update anomalies.
Denormalization improves specific reads but creates consistency and reconciliation work.
The normalized OLTP schema can remain source of truth while read models, materialized views, or caches serve optimized reads.
Hands-on example
Example denormalization:
orders(customer_id) references customers(id).
Add orders.customer_name_at_purchase for immutable order history.
Backfill:
UPDATE orders o SET customer_name_at_purchase = c.name FROM customers c WHERE c.id = o.customer_id;
What is the difference between a primary key, a unique key, and a foreign key?Intermediate
Answer
A primary key uniquely identifies each row and cannot be null. A unique key enforces uniqueness for a column or column set. A foreign key enforces that values in one table refer to valid rows in another table.
Technical explanation
Primary keys identify rows and are often referenced by foreign keys.
Unique constraints enforce business uniqueness such as email or tenant plus name.
Foreign keys protect referential integrity but need indexing and migration planning on large tables.
Hands-on example
Example schema:
CREATE TABLE customers (id uuid PRIMARY KEY, email text UNIQUE NOT NULL);
CREATE TABLE orders (id uuid PRIMARY KEY, customer_id uuid NOT NULL REFERENCES customers(id));
Add an index on orders(customer_id) for join and delete/update performance.
What is the difference between DELETE, TRUNCATE, and DROP?Intermediate
Answer
DELETE removes selected rows, usually with a WHERE clause. TRUNCATE removes all rows from a table more quickly but with stronger table-level implications. DROP removes the table object and its definition entirely.
Technical explanation
DELETE is selective and can be batched, but huge deletes create logs, bloat, locks, and replica lag.
TRUNCATE is fast for emptying a table but usually takes stronger locks and affects the whole table.
DROP removes schema and data, so it requires backups, approvals, and migration controls.
Hands-on example
Safe batch delete:
DELETE FROM events WHERE id IN (SELECT id FROM events WHERE created_at < now() - interval '180 days' LIMIT 10000);
Repeat with sleeps and monitor locks, WAL, replica lag, and table bloat.
What is the difference between an INNER JOIN and a LEFT JOIN?Intermediate
Answer
An INNER JOIN returns only matching rows from both tables. A LEFT JOIN returns every row from the left table and matching right-table rows, with NULLs when no right-side match exists.
Technical explanation
INNER JOIN filters to rows that match on both sides.
LEFT JOIN preserves all left-table rows and fills right-side columns with NULL when no match exists.
Putting right-table filters in WHERE can accidentally turn a LEFT JOIN into an INNER JOIN.
Hands-on example
Find customers with no orders:
SELECT c.id, c.email FROM customers c LEFT JOIN orders o ON o.customer_id = c.id WHERE o.id IS NULL;
Support it with:
CREATE INDEX CONCURRENTLY idx_orders_customer_id ON orders(customer_id);
What is a transaction, and what do the ACID properties mean?Intermediate
Answer
A transaction is an all-or-nothing unit of work. ACID means Atomicity, Consistency, Isolation, and Durability: the database commits complete valid changes, controls concurrency, and preserves committed data across crashes.
Technical explanation
Stronger isolation can improve correctness but may increase blocking or retry requirements.
Deadlocks are handled by aborting a victim transaction; the application must retry the whole transaction safely.
Short transactions, consistent lock ordering, proper indexes, and idempotent retry logic are essential.
Hands-on example
Safe transfer example:
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
SELECT id FROM accounts WHERE id IN (1,2) ORDER BY id FOR UPDATE;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;
On serialization failure or deadlock, retry the whole transaction, not just one statement.
What is WAL (write-ahead logging), and how does it support durability and replication?Intermediate
Answer
WAL, or write-ahead logging, records changes durably before data pages are considered committed. It enables crash recovery, replication, and point-in-time recovery because the database can replay committed changes.
Technical explanation
WAL/binlog volume affects storage, backup, replica lag, and restore behavior.
Near-zero downtime migration requires initial load, CDC, validation, short write drain, cutover, and rollback planning.
DMS helps move data, but schema conversion, application compatibility, and data validation remain engineering responsibilities.
Hands-on example
Migration runbook:
1. Create target database and schema.
2. Start full-load plus CDC using DMS or logical replication.
3. Monitor lag and task errors.
4. Compare row counts and critical aggregates.
5. Stop writes briefly.
6. Wait for lag to reach zero.
7. Switch endpoint/secret.
8. Monitor app errors and DB load.
9. Keep source read-only until rollback window closes.
What is synchronous versus asynchronous replication, and the trade-offs?Intermediate
Answer
Synchronous replication waits for a replica acknowledgement before commit returns, reducing data-loss risk but increasing latency. Asynchronous replication returns faster but permits replica lag and possible data loss during failover.
Technical explanation
WAL/binlog volume affects storage, backup, replica lag, and restore behavior.
Near-zero downtime migration requires initial load, CDC, validation, short write drain, cutover, and rollback planning.
DMS helps move data, but schema conversion, application compatibility, and data validation remain engineering responsibilities.
Hands-on example
Migration runbook:
1. Create target database and schema.
2. Start full-load plus CDC using DMS or logical replication.
3. Monitor lag and task errors.
4. Compare row counts and critical aggregates.
5. Stop writes briefly.
6. Wait for lag to reach zero.
7. Switch endpoint/secret.
8. Monitor app errors and DB load.
9. Keep source read-only until rollback window closes.
How do you migrate a database with near-zero downtime (e.g., using DMS or logical replication)?Intermediate
Answer
Near-zero downtime migration uses initial load plus ongoing change replication, then a short write drain and controlled cutover. The core tasks are schema prep, CDC, validation, lag monitoring, endpoint switch, and rollback planning.
Technical explanation
WAL/binlog volume affects storage, backup, replica lag, and restore behavior.
Near-zero downtime migration requires initial load, CDC, validation, short write drain, cutover, and rollback planning.
DMS helps move data, but schema conversion, application compatibility, and data validation remain engineering responsibilities.
Hands-on example
Migration runbook:
1. Create target database and schema.
2. Start full-load plus CDC using DMS or logical replication.
3. Monitor lag and task errors.
4. Compare row counts and critical aggregates.
5. Stop writes briefly.
6. Wait for lag to reach zero.
7. Switch endpoint/secret.
8. Monitor app errors and DB load.
9. Keep source read-only until rollback window closes.
What is AWS DMS, and when would you use it?Intermediate
Answer
AWS DMS is a managed migration and replication service that can perform full loads and change data capture between supported sources and targets. I use it for cloud migrations, heterogeneous migrations, and low-downtime moves when native replication is not ideal.
Technical explanation
WAL/binlog volume affects storage, backup, replica lag, and restore behavior.
Near-zero downtime migration requires initial load, CDC, validation, short write drain, cutover, and rollback planning.
DMS helps move data, but schema conversion, application compatibility, and data validation remain engineering responsibilities.
Hands-on example
Migration runbook:
1. Create target database and schema.
2. Start full-load plus CDC using DMS or logical replication.
3. Monitor lag and task errors.
4. Compare row counts and critical aggregates.
5. Stop writes briefly.
6. Wait for lag to reach zero.
7. Switch endpoint/secret.
8. Monitor app errors and DB load.
9. Keep source read-only until rollback window closes.
What is schema migration, and how do you run migrations safely in production?Intermediate
Answer
Schema migration is a controlled change to database structure or data. I run it safely by versioning migrations, testing on production-like data, using backward-compatible steps, limiting locks, batching backfills, and monitoring rollout.
Technical explanation
Old and new app versions may run at the same time, so schema must support both during rollout.
Large table rewrites, blocking locks, foreign-key validation, and index builds can cause outages if done naively.
Always rehearse on production-sized data and set lock_timeout and statement_timeout.
Hands-on example
Expand-contract example:
ALTER TABLE users ADD COLUMN full_name text;
Deploy app that dual-writes name and full_name and reads COALESCE(full_name, name).
Backfill in batches.
After all old app versions are gone, add constraints and later drop the old column.
Foreign key safer pattern:
ALTER TABLE orders ADD CONSTRAINT orders_customer_fk FOREIGN KEY (customer_id) REFERENCES customers(id) NOT VALID;
ALTER TABLE orders VALIDATE CONSTRAINT orders_customer_fk;
How do you make a schema change backward compatible during a rolling deploy?Intermediate
Answer
For rolling deploys, I make schema changes backward compatible with expand-and-contract: add new structures first, deploy code that supports old and new, backfill, switch reads, then remove old structures later.
Technical explanation
Old and new app versions may run at the same time, so schema must support both during rollout.
Large table rewrites, blocking locks, foreign-key validation, and index builds can cause outages if done naively.
Always rehearse on production-sized data and set lock_timeout and statement_timeout.
Hands-on example
Expand-contract example:
ALTER TABLE users ADD COLUMN full_name text;
Deploy app that dual-writes name and full_name and reads COALESCE(full_name, name).
Backfill in batches.
After all old app versions are gone, add constraints and later drop the old column.
Foreign key safer pattern:
ALTER TABLE orders ADD CONSTRAINT orders_customer_fk FOREIGN KEY (customer_id) REFERENCES customers(id) NOT VALID;
ALTER TABLE orders VALIDATE CONSTRAINT orders_customer_fk;
Why are some ALTER TABLE operations dangerous on a large table, and how do you mitigate?Intermediate
Answer
Large ALTER TABLE operations are dangerous because they can lock tables, rewrite data, generate huge WAL/binlogs, increase replica lag, and exhaust storage. I mitigate with concurrent operations, batches, NOT VALID constraints, and rehearsals.
Technical explanation
Old and new app versions may run at the same time, so schema must support both during rollout.
Large table rewrites, blocking locks, foreign-key validation, and index builds can cause outages if done naively.
Always rehearse on production-sized data and set lock_timeout and statement_timeout.
Hands-on example
Expand-contract example:
ALTER TABLE users ADD COLUMN full_name text;
Deploy app that dual-writes name and full_name and reads COALESCE(full_name, name).
Backfill in batches.
After all old app versions are gone, add constraints and later drop the old column.
Foreign key safer pattern:
ALTER TABLE orders ADD CONSTRAINT orders_customer_fk FOREIGN KEY (customer_id) REFERENCES customers(id) NOT VALID;
ALTER TABLE orders VALIDATE CONSTRAINT orders_customer_fk;
What is connection exhaustion, and how do you prevent it?Intermediate
Answer
Connection pooling reuses a bounded set of database connections and prevents unbounded clients from exhausting the database. RDS Proxy is a managed pooling proxy that helps with bursty traffic, Lambda concurrency, failover recovery, and credential rotation.
Technical explanation
Each DB connection consumes memory and CPU; in Kubernetes total connections equal pod count times pool size.
RDS Proxy multiplexes when session state allows it, but pinned sessions reduce benefits.
Prevention includes pool caps, backoff with jitter, timeouts, RDS Proxy/PgBouncer, and avoiding unbounded concurrency.
Hands-on example
Pool sizing example:
DB safe app budget = 850 connections
App pods = 50
Max pool per pod = floor(850/50) = 17
Choose 10 to 15 and load test.
PostgreSQL check:
SELECT usename, application_name, state, count(*) FROM pg_stat_activity GROUP BY 1,2,3 ORDER BY count(*) DESC;
What is ElastiCache, and what does it manage?Intermediate
Answer
ElastiCache is AWS managed in-memory caching for Valkey, Redis OSS, and Memcached. AWS manages nodes, patching, monitoring integration, replication/failover features, and backups where supported; I own keys, TTLs, consistency, and fallback behavior.
Technical explanation
Managed cache still requires correct key design, TTLs, invalidation, memory sizing, security, and fallback behavior.
Redis/Valkey is better for counters, rate limits, sorted sets, locks, streams, and HA patterns; Memcached is good for simple ephemeral object cache.
The cache should not become a hidden source of truth unless durability, backup, and failover are deliberately designed.
Hands-on example
Cache-aside example:
GET product:v1:123
If hit, return cached value.
If miss, query RDS, then SETEX product:v1:123 300 <json>.
Redis data type examples:
INCR api:tenant:42:requests
HSET user:123 tier premium
ZADD leaderboard 9100 user:123
XADD orders:events * order_id 1001 status created
What is the difference between ElastiCache Redis and Memcached?Intermediate
Answer
Redis or Valkey supports rich data structures, persistence-related features, replication, clustering, Lua, streams, and pub/sub. Memcached is simpler and best for volatile key-value object caching without complex data structures.
Technical explanation
Managed cache still requires correct key design, TTLs, invalidation, memory sizing, security, and fallback behavior.
Redis/Valkey is better for counters, rate limits, sorted sets, locks, streams, and HA patterns; Memcached is good for simple ephemeral object cache.
The cache should not become a hidden source of truth unless durability, backup, and failover are deliberately designed.
Hands-on example
Cache-aside example:
GET product:v1:123
If hit, return cached value.
If miss, query RDS, then SETEX product:v1:123 300 <json>.
Redis data type examples:
INCR api:tenant:42:requests
HSET user:123 tier premium
ZADD leaderboard 9100 user:123
XADD orders:events * order_id 1001 status created
What is Redis, and what core data types does it support?Intermediate
Answer
Redis is an in-memory data structure server used for caching, counters, queues, rate limits, sessions, and coordination. Core types include strings, hashes, lists, sets, sorted sets, streams, bitmaps, HyperLogLog, and geospatial indexes.
Technical explanation
Managed cache still requires correct key design, TTLs, invalidation, memory sizing, security, and fallback behavior.
Redis/Valkey is better for counters, rate limits, sorted sets, locks, streams, and HA patterns; Memcached is good for simple ephemeral object cache.
The cache should not become a hidden source of truth unless durability, backup, and failover are deliberately designed.
Hands-on example
Cache-aside example:
GET product:v1:123
If hit, return cached value.
If miss, query RDS, then SETEX product:v1:123 300 <json>.
Redis data type examples:
INCR api:tenant:42:requests
HSET user:123 tier premium
ZADD leaderboard 9100 user:123
XADD orders:events * order_id 1001 status created
What is Valkey, and why did the community fork it from Redis?Intermediate
Answer
Valkey is an open-source Redis-compatible datastore forked after Redis changed its license. It preserves a community-governed open-source path while aiming to keep Redis protocol and command compatibility.
Technical explanation
Wire compatibility does not guarantee every feature, module, metric, or managed-service setting behaves identically.
If the workload is pure cache, cutover can tolerate misses; if Redis holds authoritative data, migration must be much stricter.
Validation should compare hit ratio, latency, evictions, memory, CPU, connection count, app errors, and failover behavior.
Hands-on example
Migration runbook:
1. Inventory Redis version, commands, clients, Lua scripts, modules, ACLs, cluster mode, and parameters.
2. Create equivalent Valkey cluster.
3. Restore backup or use online migration if supported.
4. Test staging app and replay synthetic traffic.
5. Lower TTLs and warm hot keys.
6. Switch endpoint/secret.
7. Monitor p95 latency, hit ratio, evictions, CPU, memory, and errors.
8. Keep old cluster for rollback window.
Walk me through your ElastiCache Redis to Valkey migration: motivation, plan, and validation.Intermediate
Answer
A Redis to Valkey migration should be driven by support, licensing, cost, or platform-standardization goals, then executed through inventory, compatibility testing, staging validation, cutover, metrics comparison, and rollback planning.
Technical explanation
Wire compatibility does not guarantee every feature, module, metric, or managed-service setting behaves identically.
If the workload is pure cache, cutover can tolerate misses; if Redis holds authoritative data, migration must be much stricter.
Validation should compare hit ratio, latency, evictions, memory, CPU, connection count, app errors, and failover behavior.
Hands-on example
Migration runbook:
1. Inventory Redis version, commands, clients, Lua scripts, modules, ACLs, cluster mode, and parameters.
2. Create equivalent Valkey cluster.
3. Restore backup or use online migration if supported.
4. Test staging app and replay synthetic traffic.
5. Lower TTLs and warm hot keys.
6. Switch endpoint/secret.
7. Monitor p95 latency, hit ratio, evictions, CPU, memory, and errors.
8. Keep old cluster for rollback window.
How do you migrate from Redis to Valkey with minimal downtime?Intermediate
Answer
For minimal downtime Redis to Valkey migration, I first classify whether the cache is disposable or authoritative. Pure caches can use warmup and endpoint cutover; authoritative data needs replication, backup/restore, dual-write, or a write freeze.
Technical explanation
Wire compatibility does not guarantee every feature, module, metric, or managed-service setting behaves identically.
If the workload is pure cache, cutover can tolerate misses; if Redis holds authoritative data, migration must be much stricter.
Validation should compare hit ratio, latency, evictions, memory, CPU, connection count, app errors, and failover behavior.
Hands-on example
Migration runbook:
1. Inventory Redis version, commands, clients, Lua scripts, modules, ACLs, cluster mode, and parameters.
2. Create equivalent Valkey cluster.
3. Restore backup or use online migration if supported.
4. Test staging app and replay synthetic traffic.
5. Lower TTLs and warm hot keys.
6. Switch endpoint/secret.
7. Monitor p95 latency, hit ratio, evictions, CPU, memory, and errors.
8. Keep old cluster for rollback window.
How is Valkey wire-compatible with Redis, and what risks remain in a migration?Intermediate
Answer
Valkey wire compatibility means normal Redis clients and protocol-level commands usually work, but risks remain around version differences, modules, Lua scripts, cluster behavior, managed-service restrictions, metrics, and ACL assumptions.
Technical explanation
Wire compatibility does not guarantee every feature, module, metric, or managed-service setting behaves identically.
If the workload is pure cache, cutover can tolerate misses; if Redis holds authoritative data, migration must be much stricter.
Validation should compare hit ratio, latency, evictions, memory, CPU, connection count, app errors, and failover behavior.
Hands-on example
Migration runbook:
1. Inventory Redis version, commands, clients, Lua scripts, modules, ACLs, cluster mode, and parameters.
2. Create equivalent Valkey cluster.
3. Restore backup or use online migration if supported.
4. Test staging app and replay synthetic traffic.
5. Lower TTLs and warm hot keys.
6. Switch endpoint/secret.
7. Monitor p95 latency, hit ratio, evictions, CPU, memory, and errors.
8. Keep old cluster for rollback window.
What are common caching patterns (cache-aside, read-through, write-through, write-behind)?Intermediate
Answer
Cache-aside loads from the DB on miss, read-through lets the cache layer load, write-through writes cache and DB synchronously, and write-behind writes to cache first then flushes asynchronously. Each pattern trades consistency, latency, and loss risk differently.
Technical explanation
TTL controls the trade-off between freshness and backend load.
Stampedes are prevented with request coalescing, locks, TTL jitter, stale-while-revalidate, and rate limits.
Invalidation is hard because writes can happen through many services; outbox events make invalidation more reliable after DB commit.
Hands-on example
Cache-aside with stampede protection:
1. GET product:v2:123.
2. On miss, acquire SET product:v2:123:lock 1 NX PX 5000.
3. Lock holder reads RDS and SETEX product:v2:123 300+jitter value.
4. Other callers return stale value or wait briefly.
5. After product update, commit DB then DEL product:v2:123 or publish invalidation via outbox.
What is cache-aside (lazy loading), and what are its failure modes?Intermediate
Answer
Cache-aside means the app checks cache first, reads the database on miss, writes the value to cache with TTL, and returns it. Its failure modes are stale data, stampedes, inconsistent invalidation, and DB overload during cache failure.
Technical explanation
TTL controls the trade-off between freshness and backend load.
Stampedes are prevented with request coalescing, locks, TTL jitter, stale-while-revalidate, and rate limits.
Invalidation is hard because writes can happen through many services; outbox events make invalidation more reliable after DB commit.
Hands-on example
Cache-aside with stampede protection:
1. GET product:v2:123.
2. On miss, acquire SET product:v2:123:lock 1 NX PX 5000.
3. Lock holder reads RDS and SETEX product:v2:123 300+jitter value.
4. Other callers return stale value or wait briefly.
5. After product update, commit DB then DEL product:v2:123 or publish invalidation via outbox.
What is a cache stampede / thundering herd, and how do you prevent it?Intermediate
Answer
A cache stampede happens when many clients miss the same hot key at once and all hit the backend. I prevent it with request coalescing, TTL jitter, locks, stale-while-revalidate, warmup, and fallback rate limits.
Technical explanation
TTL controls the trade-off between freshness and backend load.
Stampedes are prevented with request coalescing, locks, TTL jitter, stale-while-revalidate, and rate limits.
Invalidation is hard because writes can happen through many services; outbox events make invalidation more reliable after DB commit.
Hands-on example
Cache-aside with stampede protection:
1. GET product:v2:123.
2. On miss, acquire SET product:v2:123:lock 1 NX PX 5000.
3. Lock holder reads RDS and SETEX product:v2:123 300+jitter value.
4. Other callers return stale value or wait briefly.
5. After product update, commit DB then DEL product:v2:123 or publish invalidation via outbox.
What is cache invalidation, and why is it hard?Intermediate
Answer
Cache invalidation is deciding when cached data is no longer correct and removing or refreshing it. It is hard because writes can come from many paths and caches can exist at multiple layers.
Technical explanation
TTL controls the trade-off between freshness and backend load.
Stampedes are prevented with request coalescing, locks, TTL jitter, stale-while-revalidate, and rate limits.
Invalidation is hard because writes can happen through many services; outbox events make invalidation more reliable after DB commit.
Hands-on example
Cache-aside with stampede protection:
1. GET product:v2:123.
2. On miss, acquire SET product:v2:123:lock 1 NX PX 5000.
3. Lock holder reads RDS and SETEX product:v2:123 300+jitter value.
4. Other callers return stale value or wait briefly.
5. After product update, commit DB then DEL product:v2:123 or publish invalidation via outbox.
What are Redis eviction policies (e.g., LRU, LFU, volatile vs allkeys)?Advanced
Answer
Redis eviction policies decide what Redis removes when maxmemory is reached. allkeys policies can evict any key; volatile policies evict only TTL keys; LRU and LFU approximate recency and frequency; noeviction makes writes fail.
Technical explanation
LRU approximates least recently used; LFU approximates least frequently used; volatile policies protect non-TTL keys but can fail writes if no TTL keys are eligible.
TTL jitter prevents synchronized expiry and thundering herds.
Memory alerts must fire before maxmemory to avoid eviction storms or write failures.
Hands-on example
Policy example for pure cache:
maxmemory-policy allkeys-lfu
SETEX product:v2:123 642 <json>
642 seconds = 600 base + random jitter.
Incident checks:
INFO memory
INFO stats
CONFIG GET maxmemory-policy
Watch evicted_keys, keyspace_hits, keyspace_misses, app cache errors, and RDS fallback load.
What happens when Redis reaches maxmemory?Advanced
Answer
When Redis reaches maxmemory, it either evicts eligible keys based on policy or rejects memory-growing writes if no key can be evicted. Reads may still work, but latency and miss rate can degrade quickly.
Technical explanation
LRU approximates least recently used; LFU approximates least frequently used; volatile policies protect non-TTL keys but can fail writes if no TTL keys are eligible.
TTL jitter prevents synchronized expiry and thundering herds.
Memory alerts must fire before maxmemory to avoid eviction storms or write failures.
Hands-on example
Policy example for pure cache:
maxmemory-policy allkeys-lfu
SETEX product:v2:123 642 <json>
642 seconds = 600 base + random jitter.
Incident checks:
INFO memory
INFO stats
CONFIG GET maxmemory-policy
Watch evicted_keys, keyspace_hits, keyspace_misses, app cache errors, and RDS fallback load.
What is a TTL, and how do you choose one for cached data?Advanced
Answer
TTL is the expiration time for a key. I choose it from freshness needs, update frequency, miss cost, database capacity, and user impact of staleness, and I add jitter to avoid synchronized expiry.
Technical explanation
LRU approximates least recently used; LFU approximates least frequently used; volatile policies protect non-TTL keys but can fail writes if no TTL keys are eligible.
TTL jitter prevents synchronized expiry and thundering herds.
Memory alerts must fire before maxmemory to avoid eviction storms or write failures.
Hands-on example
Policy example for pure cache:
maxmemory-policy allkeys-lfu
SETEX product:v2:123 642 <json>
642 seconds = 600 base + random jitter.
Incident checks:
INFO memory
INFO stats
CONFIG GET maxmemory-policy
Watch evicted_keys, keyspace_hits, keyspace_misses, app cache errors, and RDS fallback load.
What is the difference between Redis persistence options RDB and AOF?Advanced
Answer
RDB creates point-in-time snapshot files, while AOF appends write commands to a log for replay. RDB is compact and simple but can lose recent writes; AOF improves durability at the cost of disk and write overhead.
Technical explanation
RDB snapshots are compact point-in-time files and are simple to restore.
AOF logs write commands and can reduce data loss depending on fsync policy.
For a pure cache, persistence may be less important; for authoritative Redis data, persistence and backup testing are mandatory.
Hands-on example
Self-managed example:
save 900 1
save 300 10
appendonly yes
appendfsync everysec
Test by writing known keys, forcing restart in staging, and verifying which keys survive.
What are the durability trade-offs between RDB snapshots and AOF?Advanced
Answer
RDB has lower overhead and faster compact backups but can lose changes after the last snapshot. AOF can reduce loss depending on fsync policy, but increases disk usage, write cost, and rewrite complexity.
Technical explanation
RDB has lower overhead but loses changes since the last snapshot.
AOF with everysec can limit loss but adds disk usage and rewrite overhead.
Neither protects against logical deletes by itself; backups and restore tests are still needed.
Hands-on example
Decision example:
Rate-limit counters: RDB or no persistence may be acceptable.
Shopping cart stored only in Redis: use replication, backups, and stronger persistence, or move source of truth to RDS.
Document the RPO and prove it with a crash test.
What is Redis replication, and how does a replica stay in sync?Advanced
Answer
Redis replication copies data from a primary to replicas through initial synchronization and ongoing write streams. It improves read scale and HA, but is usually asynchronous, so replicas can lag.
Technical explanation
Initial sync copies a snapshot or backlog; ongoing sync applies the primary write stream.
Replication lag can appear during high write volume, network issues, or slow replicas.
Replica reads can be stale, so strongly consistent reads should use the primary.
Hands-on example
Replication check:
INFO replication
Review role, connected_replicas, master_repl_offset, replica offsets, and link status.
Alert on sustained lag or broken replica links.
What is Redis Sentinel, and what does it provide?Advanced
Answer
Redis Sentinel monitors a primary-replica deployment, detects primary failure, elects a new primary, and provides discovery. It gives failover but not sharding.
Technical explanation
Sentinel monitors Redis primary and replicas and decides failover through quorum.
It promotes a replica and tells Sentinel-aware clients where the new primary is.
It does not shard data; it is an HA mechanism for a non-clustered dataset.
Hands-on example
Sentinel failover flow:
1. Sentinels detect primary down.
2. Quorum agrees.
3. A leader selects a replica.
4. Replica is promoted.
5. Other replicas follow it.
6. Clients reconnect through Sentinel discovery.
What is Redis Cluster, and how does sharding with hash slots work?Advanced
Answer
Redis Cluster shards the keyspace across 16,384 hash slots owned by primary nodes, with replicas for HA. Clients must be cluster-aware and multi-key operations need keys in the same slot.
Technical explanation
Redis Cluster splits keys across 16,384 hash slots.
Each primary owns slots and replicas provide HA for those slots.
Clients must handle MOVED/ASK redirects, and multi-key operations need same-slot keys.
Hands-on example
Hash-tag example:
cart:{user123}:items
cart:{user123}:total
CLUSTER KEYSLOT cart:{user123}:items
CLUSTER KEYSLOT cart:{user123}:total
Both should map to the same slot.
What is the difference between Redis Sentinel and Redis Cluster?Advanced
Answer
Sentinel provides failover for one primary dataset without sharding. Redis Cluster provides horizontal sharding plus failover per shard, but requires cluster-aware clients and careful key design.
Technical explanation
Sentinel is simpler when one primary dataset fits on one node and you need failover.
Cluster is used when memory or throughput requires multiple primaries.
Cluster adds client and key-design complexity, especially for multi-key operations.
Hands-on example
Selection example:
20 GB simple cache: primary plus replica with managed failover may be enough.
900 GB tenant cache: cluster mode with shard planning is more appropriate.
Validate client cluster support before choosing Cluster.
How does ElastiCache provide high availability and automatic failover?Advanced
Answer
ElastiCache provides HA through replication groups, Multi-AZ replicas, node health monitoring, automatic failover, and endpoint updates. Applications still need reconnect and retry behavior during failover.
Technical explanation
ElastiCache uses replication groups, replicas, Multi-AZ placement, health checks, and automatic failover.
Cluster-mode enabled deployments fail over per shard.
Applications must handle dropped connections, endpoint updates, and retry behavior.
Hands-on example
Staging test:
Run GET/SET loop with latency measurement.
Trigger test failover.
Monitor CurrConnections, EngineCPUUtilization, ReplicationLag, CacheHitRate, Evictions, and app p95 latency.
What is a cache hit ratio, and how do you improve it?Advanced
Answer
Cache hit ratio is hits divided by total lookups. I improve it by caching the right data, sizing memory, avoiding evictions, using stable keys, warming hot keys, choosing good TTLs, and avoiding unnecessary invalidation.
Technical explanation
Hit ratio equals hits divided by hits plus misses.
A global hit ratio can hide a low-hit critical endpoint, so break down by key prefix or API.
Improving hit ratio requires better key choice, TTLs, memory sizing, warmup, and avoiding unnecessary invalidation.
Hands-on example
Calculation:
keyspace_hits=900000
keyspace_misses=100000
Hit ratio = 900000 / 1000000 = 90%.
If evictions are high, add memory, reduce value size, or tune TTL/policy.
How do you monitor Redis/ElastiCache (evictions, memory, latency, hit ratio)?Advanced
Answer
I monitor Redis or ElastiCache with hit ratio, latency, CPU, memory, evictions, fragmentation, connections, rejected connections, replication lag, slowlog, command stats, and application fallback rate.
Technical explanation
Global hit ratio can hide bad key patterns; monitor by endpoint or key prefix where possible.
Large keys and hot keys can overload a single Redis thread or shard even when total cluster capacity looks fine.
Use SCAN and UNLINK instead of KEYS and mass DEL; restrict dangerous commands through ACLs or operational policy.
Hands-on example
Operational commands:
INFO stats
INFO memory
INFO commandstats
SLOWLOG GET 10
LATENCY LATEST
redis-cli --bigkeys -h <host> -p 6379
Safe cleanup pattern:
redis-cli --scan --pattern "session:*" | while read key; do redis-cli UNLINK "$key"; done
Pipeline warmup in batches instead of one command per network round trip.
What is the impact of a large key or hot key in Redis, and how do you handle it?Advanced
Answer
A large key increases memory, network, replication, and latency cost; a hot key concentrates too much traffic on one key or shard. I handle them by splitting keys, redesigning data structures, local caching, and request coalescing.
Technical explanation
Large keys increase memory, network, replication, and command latency cost.
Hot keys overload one shard or event loop even if total cluster capacity looks healthy.
Fixes include splitting keys, better data structures, local cache, key sharding, and request coalescing.
Hands-on example
Investigation:
redis-cli --bigkeys -h <host> -p 6379
SCAN 0 MATCH product:* COUNT 1000
MEMORY USAGE product:v2:123
Avoid KEYS in production.
Why are some Redis commands (KEYS, FLUSHALL) dangerous in production?Advanced
Answer
KEYS can block Redis by scanning the full keyspace, and FLUSHALL can delete the entire cache or datastore. In production I use SCAN/UNLINK, namespace isolation, ACL restrictions, and controlled cleanup jobs.
Technical explanation
Global hit ratio can hide bad key patterns; monitor by endpoint or key prefix where possible.
Large keys and hot keys can overload a single Redis thread or shard even when total cluster capacity looks fine.
Use SCAN and UNLINK instead of KEYS and mass DEL; restrict dangerous commands through ACLs or operational policy.
Hands-on example
Operational commands:
INFO stats
INFO memory
INFO commandstats
SLOWLOG GET 10
LATENCY LATEST
redis-cli --bigkeys -h <host> -p 6379
Safe cleanup pattern:
redis-cli --scan --pattern "session:*" | while read key; do redis-cli UNLINK "$key"; done
Pipeline warmup in batches instead of one command per network round trip.
What is pipelining in Redis, and how does it improve throughput?Advanced
Answer
Pipelining sends multiple commands before waiting for replies, reducing network round trips and increasing throughput. It is a transport optimization, not a transaction, and pipeline size must be bounded.
Technical explanation
Pipelining reduces round trips for many small independent commands.
It does not make Redis execute commands in parallel and does not provide transaction semantics.
Pipeline batch size must be capped to avoid memory pressure and tail-latency spikes.
Hands-on example
Python-style warmup:
pipe = redis.pipeline(transaction=False)
for product in products:
pipe.setex(f"product:v2:{product.id}", 600, serialize(product))
pipe.execute()
Execute every 500 commands instead of building an unbounded pipeline.
What is the difference between Redis as a cache and Redis as a primary datastore?Advanced
Answer
Redis as a cache is rebuildable from another source of truth. Redis as a primary datastore is authoritative, so persistence, backup, failover, no unsafe eviction, and recovery testing become mandatory.
Technical explanation
Cache data should be safely rebuildable from a durable source.
Primary-store data needs persistence, backups, no unsafe eviction, HA, and restore testing.
Financial records and durable ledgers should usually live in a transactional database, not only Redis.
Hands-on example
Classification:
Product details copied from RDS: cache.
User sessions: needs datastore-level HA policy.
Billing counters: needs durable queue, DB write, or reconciliation.
Set eviction to noeviction if data must not disappear.
How would you use Redis for rate limiting?Advanced
Answer
For rate limiting, Redis commonly uses atomic counters with INCR and EXPIRE for fixed windows or sorted sets for sliding windows. Lua scripts help make multi-step limiter logic atomic.
Technical explanation
Rate limits can use INCR/EXPIRE or sorted sets, ideally through Lua for atomicity.
Distributed locks should use SET key value NX PX ttl with unique token and safe compare-and-delete release; they are better for efficiency than hard correctness.
Security baseline is private subnets, SG restrictions, TLS, AUTH/ACLs, secret rotation, and no public exposure.
Hands-on example
Rate limit example:
MULTI
INCR rl:tenant:42:minute:202606301015
EXPIRE rl:tenant:42:minute:202606301015 120
EXEC
Lock example:
SET lock:refresh:product:123 <uuid> NX PX 5000
Release only if GET lock value matches your uuid. Never DEL blindly.
How would you use Redis for distributed locking, and what are the caveats?Advanced
Answer
Redis distributed locks commonly use SET key value NX PX ttl with a unique token and compare-and-delete release. They are useful for efficiency locks but require caution for correctness-critical workflows.
Technical explanation
Rate limits can use INCR/EXPIRE or sorted sets, ideally through Lua for atomicity.
Distributed locks should use SET key value NX PX ttl with unique token and safe compare-and-delete release; they are better for efficiency than hard correctness.
Security baseline is private subnets, SG restrictions, TLS, AUTH/ACLs, secret rotation, and no public exposure.
Hands-on example
Rate limit example:
MULTI
INCR rl:tenant:42:minute:202606301015
EXPIRE rl:tenant:42:minute:202606301015 120
EXEC
Lock example:
SET lock:refresh:product:123 <uuid> NX PX 5000
Release only if GET lock value matches your uuid. Never DEL blindly.
How do you secure Redis (AUTH, TLS, network isolation, no public exposure)?Advanced
Answer
Redis security starts with private networking, no public exposure, security groups, TLS, AUTH or ACLs, least-privilege users, secret rotation, and blocking dangerous commands for application users.
Technical explanation
Network isolation is the first control: Redis should not be public.
AUTH/ACLs limit what application users can do, including blocking admin commands.
TLS, secret rotation, security groups, and private subnets complete the baseline.
Hands-on example
Security baseline:
Private subnets only.
Security group allows 6379 only from app SG.
Transit encryption enabled.
App ACL allows GET/SET but denies FLUSHALL/CONFIG.
Secret stored in Secrets Manager and rotated.
What is the difference between strong and eventual consistency, and where does caching sit?Advanced
Answer
Strong consistency means reads reflect the latest committed write; eventual consistency means replicas or caches can be temporarily stale. Caches usually introduce eventual consistency unless reads bypass or refresh after writes.
Technical explanation
Caches can serve older values until TTL expiry or invalidation.
Read-after-write paths often need primary reads or synchronous cache invalidation.
Different data has different consistency tolerance: product descriptions differ from balances.
Hands-on example
Profile update flow:
UPDATE users SET name = ... WHERE id = 123;
After commit: DEL user:v2:123.
Return updated profile directly from the write response or bypass cache briefly for that user.
How do you keep a cache and database consistent?Advanced
Answer
I keep cache and DB consistent by treating the DB as source of truth, using TTLs, invalidating after commits, versioning keys, and using outbox-driven invalidation so cache events are not lost.
Technical explanation
Database should remain the source of truth unless the cache is intentionally designed as a datastore.
Keep cache and DB consistent through TTLs, delete-after-commit, versioned keys, outbox-driven invalidation, and primary reads after writes.
Graceful degradation requires low cache timeouts, circuit breakers, stale-if-safe responses, DB fallback limits, and cache warming.
Hands-on example
Outbox invalidation example:
BEGIN;
UPDATE products SET price = 19.99 WHERE id = 123;
INSERT INTO outbox(event_type, payload) VALUES ("ProductUpdated", "{id:123}");
COMMIT;
Worker reads outbox, DEL product:v2:123, publishes event, marks outbox processed.
Warm top keys after failover with a throttled job and stop if RDS CPU or app latency rises.
What is read-your-writes consistency, and how can caching break it?Advanced
Answer
Read-your-writes means a user sees their own update immediately. Caching can break it by returning stale cached values or replica reads, so I bypass cache or invalidate/update it on write.
Technical explanation
Read-your-writes is strongest for the user who just performed the write.
Cache and read replicas can both break it immediately after update.
Solutions include delete/update cache on write, primary read after write, version checks, or short cache-bypass windows.
Hands-on example
Example:
PUT /profile updates RDS and deletes user:v2:123.
GET /profile for the same user during the next 10 seconds reads primary DB or rebuilt cache.
Public profile reads can remain cache-aside if stale reads are acceptable.
How would you design caching for a read-heavy microservice?Advanced
Answer
For a read-heavy microservice, I cache hot, expensive, safe-to-stale data using clear key naming, TTL plus jitter, explicit invalidation, request coalescing, warming, and fallback protection for the database.
Technical explanation
Start from metrics: hot endpoints, expensive queries, result size, and allowed staleness.
Use Redis/Valkey shared cache for common data and local cache for tiny ultra-hot config.
Protect the database from miss storms with coalescing, warmup, and fallback rate limits.
Hands-on example
Read-heavy product cache:
product:v3:{id} TTL 15m+jitter
product-price:v1:{id} TTL 60s plus explicit invalidation
Redis timeout 50ms.
On miss, coalesce requests and rate-limit DB fallback.
How do you decide what to cache and what not to cache?Advanced
Answer
I cache data that is frequently read, expensive to fetch or compute, reasonably small, and tolerant of staleness. I avoid caching rapidly changing, sensitive, authorization-heavy, cheap, or strict-consistency data.
Technical explanation
Good cache candidates are hot, expensive, small, and stale-tolerant.
Bad candidates are rapidly changing, sensitive, huge, cheap to query, or strict-consistency data.
Authorization-heavy data requires special care because stale permissions are a security risk.
Hands-on example
Decision examples:
Cache product category list: yes.
Cache payment balance: no, read primary.
Cache user permissions for 1 hour: risky; prefer short TTL or token/revocation design.
How do you handle cache warming after a restart or failover?Advanced
Answer
Cache warming preloads hot keys after restart, failover, or new cluster creation so user traffic does not create a cold-cache miss storm. It must be throttled and monitored to avoid overloading the database.
Technical explanation
Cold caches can shift traffic to RDS and create a secondary outage.
Warm only known hot keys and throttle the job.
Stop warming if source DB or application SLOs degrade.
Hands-on example
Warmup job:
for key in $(cat top_keys.txt); do curl -s "https://product-service/internal/cache-warm?key=$key" >/dev/null; sleep 0.05; done
Stop if RDS CPU > 70% or app p95 exceeds SLO.
What is the blast radius of a cache outage, and how do you make the app degrade gracefully?Advanced
Answer
A cache outage can cause anything from minor latency to full outage. I reduce blast radius with low timeouts, circuit breakers, DB fallback limits, stale-if-safe responses, local last-known-good data, and graceful feature degradation.
Technical explanation
Cache failure should not automatically become total service failure unless the cache is an intentional primary store.
Use low timeouts and circuit breakers to avoid tying up app threads.
Degrade non-critical features and protect DB fallback with rate limits.
Hands-on example
Failure policy:
Redis connect_timeout=50ms, read_timeout=50ms, max_retries=1.
Product miss: DB fallback with limiter.
Recommendations failure: omit module.
Feature config failure: use last-known-good local copy.
How do you size an ElastiCache cluster (memory, nodes, shards)?Advanced
Answer
I size ElastiCache from working-set memory, key/value overhead, fragmentation, CPU, network, connection count, replica needs, shard distribution, failover headroom, and growth forecast.
Technical explanation
Failover testing must measure the whole application recovery path, not just service events.
Connection storms are prevented with pool caps, rolling deploys, jittered startup, exponential backoff, and readiness gates.
For interview stories, state the business driver, migration plan, validation, rollback, and measurable result such as p95 latency, hit ratio, cost, or error-rate improvement.
Hands-on example
Capacity example:
Current RDS free storage = 500 GB
Growth = 20 GB/day
Days remaining = 25
Action threshold = 45 days, so act now.
SQL size check:
SELECT relname, pg_size_pretty(pg_total_relation_size(oid)) FROM pg_class WHERE relkind = 'r' ORDER BY pg_total_relation_size(oid) DESC LIMIT 20;
Plan: enable storage autoscaling with max cap, archive old data, review index bloat, test restore time, and evaluate partitioning.
What backup and restore options exist for ElastiCache Redis?Advanced
Answer
ElastiCache Redis or Valkey backup options include automatic backups, manual snapshots, and snapshot restore into a new cluster where supported. Pure caches may be rebuilt; authoritative data needs tested restore.
Technical explanation
Automatic backups and manual snapshots protect Redis/Valkey data where supported.
Restore usually creates or populates a new cluster and must be validated.
Memcached is volatile; rebuild rather than restore is normally the model.
Hands-on example
Restore test:
1. Take manual snapshot before upgrade.
2. Restore to staging.
3. Connect test app.
4. Verify key count and sample keys.
5. Run failover test.
6. Document restore time and data loss window.
How do you test a Redis or database failover without impacting users?Advanced
Answer
I test database or Redis failover through staging and approved game days, measuring application recovery, reconnect behavior, error rate, latency, and alerting. The complete app path matters more than the service event alone.
Technical explanation
Failover testing must include clients, retries, DNS/endpoints, connection pools, and alerts.
Staging comes first; production tests need approval, abort criteria, and rollback ownership.
Success is measured by application recovery, not only the managed-service event completing.
Hands-on example
Game day:
Pre-check backups and replicas.
Trigger RDS or ElastiCache failover in a controlled window.
Watch p95 latency, 5xx, reconnect logs, DB/cache metrics, and alerts.
Pass if app recovers without manual pod restart.
How do you handle connection storms against the cache after a deploy?Advanced
Answer
Connection storms happen when many clients reconnect simultaneously after deploy, failover, scaling, or network issues. I prevent them with pool caps, rolling deploys, jittered startup, exponential backoff, and readiness gates.
Technical explanation
Pods, functions, or clients can all reconnect at once and exceed DB/cache limits.
Backoff with jitter and pool caps prevent synchronized retries.
Rolling deploy settings and readiness probes must not hammer dependencies.
Hands-on example
Kubernetes guardrails:
maxSurge: 10%
maxUnavailable: 0
DB_POOL_MAX=10
REDIS_POOL_MAX=20
CONNECT_BACKOFF_JITTER=true
Add random startup sleep before opening warm connections.
What is the difference between a write-through and write-behind cache regarding data loss?Advanced
Answer
Write-through updates cache and database synchronously, giving stronger consistency but higher write latency. Write-behind writes to cache first and flushes later, improving latency but risking data loss without durable buffering.
Technical explanation
Write-through favors consistency but couples success to both cache and database write paths.
Write-behind favors latency and burst absorption but can lose data before flush.
Write-behind needs durable queueing, idempotent workers, retries, DLQ, and reconciliation.
Hands-on example
Use write-through or direct DB transaction for inventory updates.
Use write-behind only for tolerable-loss events such as page-view counters.
If used: append to Kafka/SQS, update DB idempotently, monitor queue lag, and reconcile totals.
How would you debug rising latency on a service that depends on Redis and RDS together?Advanced
Answer
To debug latency involving both Redis and RDS, I break the path into cache hit/miss, Redis latency, DB query time, connection wait, and app processing, then correlate hit ratio, Redis metrics, RDS DB load, and top SQL.
Technical explanation
First determine whether Redis is slow, cache misses increased, or RDS is slow independently.
Break request latency into cache time, hit/miss, DB time, connection wait, and app time.
Correlate deploy markers with cache hit ratio, Redis latency, RDS DB load, and top SQL.
Hands-on example
Triage graph:
service p95/p99 latency
Redis hit ratio and evictions
Redis command latency
RDS CPU, connections, DB load
Performance Insights top SQL
If hit ratio drops and RDS load rises, investigate TTL, invalidation, warmup, or eviction.
What recent database or caching change have you made, and what did it improve?Advanced
Answer
A strong recent-change answer should describe business driver, risk controls, migration or upgrade plan, validation, rollback, and measured improvement. Redis to Valkey migration or RDS major version upgrade are good examples.
Technical explanation
The story should show ownership from problem framing through validation.
Metrics make the answer credible: p95 latency, hit ratio, error rate, cost, recovery time, or incident reduction.
Mention rollback and what you monitored during and after the change.
Hands-on example
STAR example:
Situation: Redis lifecycle and license direction required a supported Valkey path.
Task: migrate without user impact.
Action: inventory, staging Valkey, traffic replay, failover test, TTL reduction, hot-key warmup, endpoint cutover.
Result: stable p95, stable hit ratio, no incident, lower platform risk.
How do you plan capacity and storage autoscaling for a growing RDS database?Advanced
Answer
For RDS capacity, I plan storage, IOPS, throughput, CPU, memory, connections, WAL/binlog growth, backup/restore time, and data growth. Storage autoscaling is useful but does not replace forecasting and schema/data lifecycle work.
Technical explanation
RDS storage autoscaling can prevent out-of-space incidents but does not shrink storage.
Growth affects backup time, restore time, vacuum, indexes, migrations, and query plans.
Capacity planning should forecast storage, IOPS, throughput, CPU, memory, connections, and logs.
Hands-on example
Forecast:
Free storage = 500 GB
Growth = 20 GB/day
Days remaining = 25
Threshold = 45 days, so act now.
SQL:
SELECT relname, pg_size_pretty(pg_total_relation_size(oid)) FROM pg_class WHERE relkind = 'r' ORDER BY pg_total_relation_size(oid) DESC LIMIT 20;
Final Revision Checklist
Answer the first section in 60 to 90 seconds.
Use hands-on examples to prove production experience.
Mention monitoring, rollback, and validation for every risky change.
Tie caching answers to consistency and failure modes.
Tie RDS answers to HA, backups, performance, and safe operations.