Interview questions › Databases & Caching

Databases & Caching interview questions & answers

100 Databases & Caching interview questions, each answered three ways: a concise spoken answer, a technical explanation, and a hands-on example.

Tip: paste the job description + your resume into our free resume checker to see which of these skills the role actually requires.

All questions

What is Amazon RDS, and what does it manage for you versus self-managed databases?
What database engines does RDS support?
What is the difference between RDS and Aurora?
What is Multi-AZ in RDS, and how does automatic failover work?
How long does an RDS Multi-AZ failover typically take, and what triggers it?
What is the difference between Multi-AZ and a read replica?
When would you use a read replica, and can it become a standalone database?
Can a read replica be in a different region, and why would you do that?
What is the difference between automated backups and manual snapshots in RDS?
What is the backup retention period, and what is point-in-time recovery?
How does point-in-time recovery work under the hood (snapshots plus transaction logs)?
What is an RDS parameter group, and give an example of a parameter you would tune?
What is an RDS option group?
How do you perform a major version upgrade of RDS with minimal downtime?
Walk me through a safe RDS PostgreSQL/MySQL upgrade with a rollback plan (as you did at Intuit).
What is a blue-green deployment for RDS, and how does it reduce upgrade risk?
How do you test that an application is compatible before a database major upgrade?
How do you encrypt an RDS instance at rest, and can you encrypt an existing unencrypted one?
How do you enforce encryption in transit (SSL/TLS) to RDS?
How do you rotate database credentials, and how does Secrets Manager help?
How do you monitor RDS performance (CloudWatch, Performance Insights, Enhanced Monitoring)?
What is Performance Insights, and what does it help you find?
What are common causes of high RDS CPU, and how do you investigate?
How do you scale an RDS database vertically and what is the downtime impact?
How do you scale reads versus writes on a relational database?
What is connection pooling, and why does it matter for RDS (e.g., RDS Proxy)?
What is RDS Proxy, and what problem does it solve?
What is the difference between PostgreSQL and MySQL from an operator's view?
What is MVCC, and how does PostgreSQL implement it?
What is VACUUM in PostgreSQL, and why is autovacuum important?
What is transaction wraparound in PostgreSQL, and why is it dangerous?
What is the difference between a clustered and a non-clustered index?
What is an index, and what are the trade-offs of adding one?
How do you decide which columns to index?
What is a composite index, and does column order matter?
What is a covering index?
How do you find and fix a slow query (EXPLAIN / EXPLAIN ANALYZE)?
What does an execution plan tell you, and what is a sequential scan versus an index scan?
What are the SQL isolation levels, and what anomalies does each prevent?
What is the difference between a dirty read, non-repeatable read, and phantom read?
What is a deadlock, and how do databases detect and resolve them?
What is the difference between OLTP and OLAP workloads?
What is database normalisation, and when would you denormalise?
What is the difference between a primary key, a unique key, and a foreign key?
What is the difference between DELETE, TRUNCATE, and DROP?
What is the difference between an INNER JOIN and a LEFT JOIN?
What is a transaction, and what do the ACID properties mean?
What is WAL (write-ahead logging), and how does it support durability and replication?
What is synchronous versus asynchronous replication, and the trade-offs?
How do you migrate a database with near-zero downtime (e.g., using DMS or logical replication)?
What is AWS DMS, and when would you use it?
What is schema migration, and how do you run migrations safely in production?
How do you make a schema change backward compatible during a rolling deploy?
Why are some ALTER TABLE operations dangerous on a large table, and how do you mitigate?
What is connection exhaustion, and how do you prevent it?
What is ElastiCache, and what does it manage?
What is the difference between ElastiCache Redis and Memcached?
What is Redis, and what core data types does it support?
What is Valkey, and why did the community fork it from Redis?
Walk me through your ElastiCache Redis to Valkey migration: motivation, plan, and validation.
How do you migrate from Redis to Valkey with minimal downtime?
How is Valkey wire-compatible with Redis, and what risks remain in a migration?
What are common caching patterns (cache-aside, read-through, write-through, write-behind)?
What is cache-aside (lazy loading), and what are its failure modes?
What is a cache stampede / thundering herd, and how do you prevent it?
What is cache invalidation, and why is it hard?
What are Redis eviction policies (e.g., LRU, LFU, volatile vs allkeys)?
What happens when Redis reaches maxmemory?
What is a TTL, and how do you choose one for cached data?
What is the difference between Redis persistence options RDB and AOF?
What are the durability trade-offs between RDB snapshots and AOF?
What is Redis replication, and how does a replica stay in sync?
What is Redis Sentinel, and what does it provide?
What is Redis Cluster, and how does sharding with hash slots work?
What is the difference between Redis Sentinel and Redis Cluster?
How does ElastiCache provide high availability and automatic failover?
What is a cache hit ratio, and how do you improve it?
How do you monitor Redis/ElastiCache (evictions, memory, latency, hit ratio)?
What is the impact of a large key or hot key in Redis, and how do you handle it?
Why are some Redis commands (KEYS, FLUSHALL) dangerous in production?
What is pipelining in Redis, and how does it improve throughput?
What is the difference between Redis as a cache and Redis as a primary datastore?
How would you use Redis for rate limiting?
How would you use Redis for distributed locking, and what are the caveats?
How do you secure Redis (AUTH, TLS, network isolation, no public exposure)?
What is the difference between strong and eventual consistency, and where does caching sit?
How do you keep a cache and database consistent?
What is read-your-writes consistency, and how can caching break it?
How would you design caching for a read-heavy microservice?
How do you decide what to cache and what not to cache?
How do you handle cache warming after a restart or failover?
What is the blast radius of a cache outage, and how do you make the app degrade gracefully?
How do you size an ElastiCache cluster (memory, nodes, shards)?
What backup and restore options exist for ElastiCache Redis?
How do you test a Redis or database failover without impacting users?
How do you handle connection storms against the cache after a deploy?
What is the difference between a write-through and write-behind cache regarding data loss?
How would you debug rising latency on a service that depends on Redis and RDS together?
What recent database or caching change have you made, and what did it improve?
How do you plan capacity and storage autoscaling for a growing RDS database?

What is Amazon RDS, and what does it manage for you versus self-managed databases?Basic

Answer

Amazon RDS is AWS managed relational database service. AWS handles provisioning, backups, patching, storage operations, monitoring integration, and HA primitives, while the customer still owns schema design, query tuning, access control, data correctness, capacity choices, and application connection behavior.

Technical explanation

RDS removes much of the undifferentiated DBA work but it does not remove database engineering accountability.

The shared responsibility line matters: AWS manages the platform; the team manages workload design, users, queries, parameters, and safe changes.

A strong SRE answer mentions both managed-service benefits and remaining operational risks such as bad SQL, connection storms, and poor indexing.

Hands-on example

Create a small encrypted PostgreSQL instance, then validate what remains your responsibility:

$ aws rds create-db-instance --db-instance-identifier app-dev-pg --engine postgres --db-instance-class db.t4g.micro --allocated-storage 20 --backup-retention-period 7 --storage-encrypted

After creation: create least-privilege app users, restrict the security group to app subnets, enable slow query logging, test restore, and baseline CPU, connections, and query latency.

What database engines does RDS support?Basic

Answer

RDS supports multiple relational engines, including PostgreSQL, MySQL, MariaDB, Oracle, SQL Server, and Db2. Aurora is part of the RDS family but is a separate AWS-built engine compatible with MySQL and PostgreSQL. I always verify version and feature availability by Region before finalizing design.

Technical explanation

Engine choice affects HA behavior, licensing, extensions, replication, upgrade method, parameters, and operational tooling.

Open-source engines are often simpler from a licensing perspective, while Oracle, SQL Server, and Db2 need edition and option review.

Aurora should be evaluated separately because its storage, replica, failover, and endpoint behavior differs from standard RDS instances.

Hands-on example

Discovery example:

$ aws rds describe-db-engine-versions --engine postgres --query "DBEngineVersions[].EngineVersion" --output text

Decision checklist: required extensions, Region/version support, HA target, licensing, read scaling, backup/restore needs, and team experience.

What is the difference between RDS and Aurora?Basic

Answer

RDS is the managed service umbrella for standard database engines; Aurora is AWS cloud-native relational engine compatible with MySQL and PostgreSQL. Aurora uses a distributed storage layer and cluster endpoints, while standard RDS is closer to a managed DB instance with optional Multi-AZ and replicas.

Technical explanation

Aurora decouples compute from a replicated storage layer more than standard RDS.

Standard RDS can be simpler for lift-and-shift workloads; Aurora can help with read scale, failover behavior, and cloud-native features.

Compatibility testing is mandatory because Aurora-compatible is not the same as identical for every extension, parameter, or query plan.

Hands-on example

Example architecture: use the Aurora writer endpoint for writes and the reader endpoint for read-only traffic:

APP_DB_WRITE_HOST=orders.cluster-abc.us-east-1.rds.amazonaws.com

APP_DB_READ_HOST=orders.cluster-ro-abc.us-east-1.rds.amazonaws.com

Run integration and explain-plan tests before choosing Aurora over standard RDS.

What is Multi-AZ in RDS, and how does automatic failover work?Basic

Answer

RDS Multi-AZ is a high-availability deployment where RDS maintains a standby or alternate writer in another Availability Zone and automatically fails over when the primary cannot serve safely. Traditional Multi-AZ DB instance failover is commonly around 60 to 120 seconds, but application recovery can be longer.

Technical explanation

Multi-AZ is for availability, not read scaling in the traditional DB instance model.

Failover can be triggered by infrastructure failure, AZ disruption, maintenance, storage/network issues, or manual reboot with failover.

Applications must reconnect, retry idempotently, respect DNS, and handle in-flight transaction failure.

Hands-on example

Staging failover test:

$ while true; do date; psql "$DB_URL" -c "select now();" && echo ok || echo failed; sleep 1; done

$ aws rds reboot-db-instance --db-instance-identifier orders-stage --force-failover

Measure first failure, first success after recovery, API 5xx, p95 latency, and whether any pod restart was required.

How long does an RDS Multi-AZ failover typically take, and what triggers it?Basic

Answer

Technical explanation

Multi-AZ is for availability, not read scaling in the traditional DB instance model.

Failover can be triggered by infrastructure failure, AZ disruption, maintenance, storage/network issues, or manual reboot with failover.

Applications must reconnect, retry idempotently, respect DNS, and handle in-flight transaction failure.

Hands-on example

Staging failover test:

$ while true; do date; psql "$DB_URL" -c "select now();" && echo ok || echo failed; sleep 1; done

$ aws rds reboot-db-instance --db-instance-identifier orders-stage --force-failover

Measure first failure, first success after recovery, API 5xx, p95 latency, and whether any pod restart was required.

What is the difference between Multi-AZ and a read replica?Basic

Answer

Multi-AZ protects the writer for HA and failover; read replicas are readable copies used for read scaling, reporting, migration, and DR. Read replicas are generally asynchronous, can lag, can often be cross-Region, and can be promoted to standalone databases when needed.

Technical explanation

Use Multi-AZ for availability and read replicas for read workload isolation or migration targets.

Replica lag means read-your-writes is not guaranteed from replicas.

Cross-Region replicas are useful for DR and locality, but require RPO/RTO, secrets, DNS, KMS, and failover runbooks.

Hands-on example

Read/write routing example:

POST /orders -> primary endpoint

GET /orders/{id} immediately after create -> primary endpoint

GET /orders/report -> read replica endpoint

Promotion command for migration or DR:

$ aws rds promote-read-replica --db-instance-identifier orders-replica-dr

Before promotion, check replica lag and stop writes if a clean cutover is required.

When would you use a read replica, and can it become a standalone database?Basic

Answer

Technical explanation

Use Multi-AZ for availability and read replicas for read workload isolation or migration targets.

Replica lag means read-your-writes is not guaranteed from replicas.

Cross-Region replicas are useful for DR and locality, but require RPO/RTO, secrets, DNS, KMS, and failover runbooks.

Hands-on example

Read/write routing example:

POST /orders -> primary endpoint

GET /orders/{id} immediately after create -> primary endpoint

GET /orders/report -> read replica endpoint

Promotion command for migration or DR:

$ aws rds promote-read-replica --db-instance-identifier orders-replica-dr

Before promotion, check replica lag and stop writes if a clean cutover is required.

Can a read replica be in a different region, and why would you do that?Basic

Answer

Technical explanation

Use Multi-AZ for availability and read replicas for read workload isolation or migration targets.

Replica lag means read-your-writes is not guaranteed from replicas.

Cross-Region replicas are useful for DR and locality, but require RPO/RTO, secrets, DNS, KMS, and failover runbooks.

Hands-on example

Read/write routing example:

POST /orders -> primary endpoint

GET /orders/{id} immediately after create -> primary endpoint

GET /orders/report -> read replica endpoint

Promotion command for migration or DR:

$ aws rds promote-read-replica --db-instance-identifier orders-replica-dr

Before promotion, check replica lag and stop writes if a clean cutover is required.

What is the difference between automated backups and manual snapshots in RDS?Basic

Answer

Automated backups are retained for a configured window and support point-in-time recovery. Manual snapshots are retained until deleted and are useful before risky changes or for long-term copies. PITR restores from a base backup plus transaction logs to create a new database at a chosen timestamp.

Technical explanation

PITR does not rewind the existing DB in place; it restores a new instance that must be validated and cut over or used for data repair.

Retention should be based on business recovery needs and detection time for corruption, not just default cost settings.

Restore testing is part of the backup strategy because backup success does not prove recovery success.

Hands-on example

PITR example:

$ aws rds restore-db-instance-to-point-in-time --source-db-instance-identifier orders-prod --target-db-instance-identifier orders-pitr-20260630 --restore-time 2026-06-30T10:14:30Z

Then compare row counts, extract repaired rows if needed, run smoke tests, and decide between partial repair and full cutover.

What is the backup retention period, and what is point-in-time recovery?Basic

Answer

Technical explanation

PITR does not rewind the existing DB in place; it restores a new instance that must be validated and cut over or used for data repair.

Retention should be based on business recovery needs and detection time for corruption, not just default cost settings.

Restore testing is part of the backup strategy because backup success does not prove recovery success.

Hands-on example

PITR example:

$ aws rds restore-db-instance-to-point-in-time --source-db-instance-identifier orders-prod --target-db-instance-identifier orders-pitr-20260630 --restore-time 2026-06-30T10:14:30Z

Then compare row counts, extract repaired rows if needed, run smoke tests, and decide between partial repair and full cutover.

How does point-in-time recovery work under the hood (snapshots plus transaction logs)?Basic

Answer

Technical explanation

PITR does not rewind the existing DB in place; it restores a new instance that must be validated and cut over or used for data repair.

Retention should be based on business recovery needs and detection time for corruption, not just default cost settings.

Restore testing is part of the backup strategy because backup success does not prove recovery success.

Hands-on example

PITR example:

$ aws rds restore-db-instance-to-point-in-time --source-db-instance-identifier orders-prod --target-db-instance-identifier orders-pitr-20260630 --restore-time 2026-06-30T10:14:30Z

Then compare row counts, extract repaired rows if needed, run smoke tests, and decide between partial repair and full cutover.

What is an RDS parameter group, and give an example of a parameter you would tune?Basic

Answer

An RDS parameter group is a managed set of database engine configuration parameters. It controls behavior such as logging, memory-related settings, connection limits, TLS enforcement flags, query planner settings, and extension preload settings.

Technical explanation

Some parameters are dynamic; others require reboot, so the apply type must be checked before production change.

Parameter groups should be managed through IaC and promoted through environments.

Changing max_connections, logging, or planner parameters can have broad performance impact and should be measured.

Hands-on example

PostgreSQL slow query visibility example:

Set log_min_duration_statement = 500

Set shared_preload_libraries = pg_stat_statements

Then run:

CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

SELECT query, calls, total_exec_time, mean_exec_time FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 10;

What is an RDS option group?Basic

Answer

An RDS option group enables engine-specific optional capabilities that are not simple parameter settings, especially for Oracle, SQL Server, and some MySQL/MariaDB features. It can attach features such as native backup/restore, audit plugins, or other engine options depending on support.

Technical explanation

Parameter groups tune engine settings; option groups attach optional engine features or integrations.

Some options require IAM roles, S3 access, ports, license review, or downtime.

Treat option group changes as production changes with compatibility and rollback review.

Hands-on example

Evaluation example:

$ aws rds describe-option-group-options --engine-name sqlserver-se

Before enabling an option, confirm engine edition/version support, required IAM role, port exposure, downtime/reboot requirement, and whether the option can be removed later.

How do you perform a major version upgrade of RDS with minimal downtime?Basic

Answer

For a major RDS upgrade, I avoid untested in-place changes on critical systems. I prefer blue/green deployment, logical replication, DMS, or a replica-style approach so most upgrade work and validation happens away from production before a controlled cutover.

Technical explanation

Major upgrades can change query plans, parameters, extensions, authentication behavior, reserved words, and client compatibility.

Blue/green reduces risk by creating a synchronized green environment where the new version can be tested before switchover.

Rollback must be designed before cutover; after writes hit the new version, rollback may require restore or reverse migration.

Hands-on example

Upgrade runbook:

1. Restore prod snapshot to staging and upgrade it.

2. Run app regression, migrations, batch jobs, and top SQL comparisons.

3. Create blue/green or logical target.

4. Freeze schema changes.

5. Cut over during approved window.

6. Monitor p95 latency, DB load, locks, error rate, and slow queries.

7. Keep old environment read-only until rollback window expires.

Walk me through a safe RDS PostgreSQL/MySQL upgrade with a rollback plan (as you did at Intuit).Basic

Answer

Technical explanation

Major upgrades can change query plans, parameters, extensions, authentication behavior, reserved words, and client compatibility.

Blue/green reduces risk by creating a synchronized green environment where the new version can be tested before switchover.

Rollback must be designed before cutover; after writes hit the new version, rollback may require restore or reverse migration.

Hands-on example

Upgrade runbook:

1. Restore prod snapshot to staging and upgrade it.

2. Run app regression, migrations, batch jobs, and top SQL comparisons.

3. Create blue/green or logical target.

4. Freeze schema changes.

5. Cut over during approved window.

6. Monitor p95 latency, DB load, locks, error rate, and slow queries.

7. Keep old environment read-only until rollback window expires.

What is a blue-green deployment for RDS, and how does it reduce upgrade risk?Basic

Answer

Technical explanation

Major upgrades can change query plans, parameters, extensions, authentication behavior, reserved words, and client compatibility.

Blue/green reduces risk by creating a synchronized green environment where the new version can be tested before switchover.

Rollback must be designed before cutover; after writes hit the new version, rollback may require restore or reverse migration.

Hands-on example

Upgrade runbook:

1. Restore prod snapshot to staging and upgrade it.

2. Run app regression, migrations, batch jobs, and top SQL comparisons.

3. Create blue/green or logical target.

4. Freeze schema changes.

5. Cut over during approved window.

6. Monitor p95 latency, DB load, locks, error rate, and slow queries.

7. Keep old environment read-only until rollback window expires.

How do you test that an application is compatible before a database major upgrade?Basic

Answer

Technical explanation

Major upgrades can change query plans, parameters, extensions, authentication behavior, reserved words, and client compatibility.

Blue/green reduces risk by creating a synchronized green environment where the new version can be tested before switchover.

Rollback must be designed before cutover; after writes hit the new version, rollback may require restore or reverse migration.

Hands-on example

Upgrade runbook:

1. Restore prod snapshot to staging and upgrade it.

2. Run app regression, migrations, batch jobs, and top SQL comparisons.

3. Create blue/green or logical target.

4. Freeze schema changes.

5. Cut over during approved window.

6. Monitor p95 latency, DB load, locks, error rate, and slow queries.

7. Keep old environment read-only until rollback window expires.

How do you encrypt an RDS instance at rest, and can you encrypt an existing unencrypted one?Basic

Answer

New RDS instances should be encrypted at creation with KMS. Existing unencrypted RDS storage cannot simply be encrypted in place; the normal method is snapshot, copy snapshot with encryption enabled, restore a new encrypted instance, validate, and cut over.

Technical explanation

Encryption covers storage and related backups/snapshots according to RDS and KMS behavior.

KMS key policy and cross-account or cross-Region permissions must be planned.

Because restore creates a new endpoint, cutover and rollback must be part of the migration plan.

Hands-on example

Encryption conversion example:

$ aws rds create-db-snapshot --db-instance-identifier legacy-orders --db-snapshot-identifier legacy-orders-pre-encryption

$ aws rds copy-db-snapshot --source-db-snapshot-identifier legacy-orders-pre-encryption --target-db-snapshot-identifier legacy-orders-encrypted --kms-key-id alias/rds-prod

$ aws rds restore-db-instance-from-db-snapshot --db-instance-identifier orders-encrypted --db-snapshot-identifier legacy-orders-encrypted

How do you enforce encryption in transit (SSL/TLS) to RDS?Basic

Answer

Encryption in transit is enforced by requiring TLS from clients and configuring the engine to reject non-TLS connections where supported. For PostgreSQL on RDS, rds.force_ssl plus client sslmode settings are common. For MySQL, require SSL for users and trust the RDS CA bundle.

Technical explanation

At-rest encryption and in-transit encryption solve different problems and both are expected for sensitive workloads.

Clients should verify the server certificate, not just turn on encryption without validation.

CA rotation must be planned because old clients may fail if trust stores are stale.

Hands-on example

PostgreSQL TLS validation:

$ psql "host=orders.xxxxxx.us-east-1.rds.amazonaws.com dbname=orders user=app sslmode=verify-full sslrootcert=/etc/ssl/rds-ca.pem"

SELECT ssl, version, cipher FROM pg_stat_ssl WHERE pid = pg_backend_pid();

Expected: ssl = true with an approved TLS version/cipher.

How do you rotate database credentials, and how does Secrets Manager help?Basic

Answer

I rotate DB credentials through AWS Secrets Manager or an equivalent secrets system. The rotation process updates the database user password, updates secret versions, and ensures applications pick up the new credential without hardcoded values or manual redeploy risk.

Technical explanation

Use least-privilege application users instead of the master user.

Rotation must account for connection pools and long-lived pods using old credentials.

Alternating-user rotation can provide lower downtime because one credential remains valid while the other rotates.

Hands-on example

Rotation checklist:

1. Store prod/orders/db in Secrets Manager.

2. Grant only the app role secretsmanager:GetSecretValue.

3. Enable tested rotation Lambda.

4. Restart or reload app pools safely.

5. Monitor failed logins and connection errors.

CLI:

$ aws secretsmanager describe-secret --secret-id prod/orders/db

How do you monitor RDS performance (CloudWatch, Performance Insights, Enhanced Monitoring)?Basic

Answer

RDS performance monitoring should combine CloudWatch, Performance Insights, Enhanced Monitoring, database-native views, logs, and application SLOs. CloudWatch shows resource symptoms; Performance Insights shows database load, waits, and top SQL; the app tells whether users are impacted.

Technical explanation

Key metrics include CPU, connections, free memory, free storage, IOPS, read/write latency, disk queue depth, replica lag, and DB load.

High CPU is investigated through top SQL, waits, connection count, recent deployments, locks, and query plans.

The best incident response correlates application latency with database load and the exact SQL or wait event causing it.

Hands-on example

Incident triage SQL:

SELECT state, wait_event_type, wait_event, count(*) FROM pg_stat_activity GROUP BY 1,2,3 ORDER BY count(*) DESC;

Then open Performance Insights, identify top SQL by DB load, run EXPLAIN (ANALYZE, BUFFERS) in staging, and mitigate with query change, index, pool cap, feature flag, or scale-up.

What is Performance Insights, and what does it help you find?Basic

Answer

Technical explanation

Key metrics include CPU, connections, free memory, free storage, IOPS, read/write latency, disk queue depth, replica lag, and DB load.

High CPU is investigated through top SQL, waits, connection count, recent deployments, locks, and query plans.

The best incident response correlates application latency with database load and the exact SQL or wait event causing it.

Hands-on example

Incident triage SQL:

SELECT state, wait_event_type, wait_event, count(*) FROM pg_stat_activity GROUP BY 1,2,3 ORDER BY count(*) DESC;

Then open Performance Insights, identify top SQL by DB load, run EXPLAIN (ANALYZE, BUFFERS) in staging, and mitigate with query change, index, pool cap, feature flag, or scale-up.

What are common causes of high RDS CPU, and how do you investigate?Basic

Answer

Technical explanation

Key metrics include CPU, connections, free memory, free storage, IOPS, read/write latency, disk queue depth, replica lag, and DB load.

High CPU is investigated through top SQL, waits, connection count, recent deployments, locks, and query plans.

The best incident response correlates application latency with database load and the exact SQL or wait event causing it.

Hands-on example

Incident triage SQL:

SELECT state, wait_event_type, wait_event, count(*) FROM pg_stat_activity GROUP BY 1,2,3 ORDER BY count(*) DESC;

Then open Performance Insights, identify top SQL by DB load, run EXPLAIN (ANALYZE, BUFFERS) in staging, and mitigate with query change, index, pool cap, feature flag, or scale-up.

How do you scale an RDS database vertically and what is the downtime impact?Basic

Answer

Vertical scaling changes the RDS DB instance class to add CPU, memory, or network capacity. It can require downtime or a failover depending on engine, class change, Multi-AZ setup, and whether the change is applied immediately or in a maintenance window.

Technical explanation

Scale-up can buy time but does not fix bad SQL, locks, missing indexes, or connection storms.

Multi-AZ can reduce impact but still requires app reconnect and retry behavior.

Scale-down should be tested because less memory can reduce cache hit ratio and change query behavior.

Hands-on example

Scale-up example:

$ aws rds modify-db-instance --db-instance-identifier orders-prod --db-instance-class db.r7g.2xlarge --apply-immediately

Watch RDS events, p95 latency, DB load, CPU, free memory, slow queries, and application connection errors during the change.

How do you scale reads versus writes on a relational database?Basic

Answer

Reads can be scaled with replicas, caching, materialized views, query optimization, denormalized read models, and analytics offload. Writes are harder because consistency and uniqueness usually concentrate around one writer, so write scaling may require partitioning, sharding, queues, batching, or architecture changes.

Technical explanation

Read scaling is safe only when stale reads are acceptable or read-after-write paths use the primary.

Write scaling is constrained by transactions, constraints, referential integrity, and cross-shard behavior.

Most teams should tune queries and cache reads before redesigning writes.

Hands-on example

Example split:

POST /orders -> primary RDS writer

GET /products -> Redis/Valkey cache

GET /reports -> read replica or warehouse

Heavy post-processing -> queue and worker

This keeps OLTP writes short and moves expensive reads away from the writer.

What is connection pooling, and why does it matter for RDS (e.g., RDS Proxy)?Basic

Answer

Connection pooling reuses a bounded set of database connections and prevents unbounded clients from exhausting the database. RDS Proxy is a managed pooling proxy that helps with bursty traffic, Lambda concurrency, failover recovery, and credential rotation.

Technical explanation

Each DB connection consumes memory and CPU; in Kubernetes total connections equal pod count times pool size.

RDS Proxy multiplexes when session state allows it, but pinned sessions reduce benefits.

Prevention includes pool caps, backoff with jitter, timeouts, RDS Proxy/PgBouncer, and avoiding unbounded concurrency.

Hands-on example

Pool sizing example:

DB safe app budget = 850 connections

App pods = 50

Max pool per pod = floor(850/50) = 17

Choose 10 to 15 and load test.

PostgreSQL check:

SELECT usename, application_name, state, count(*) FROM pg_stat_activity GROUP BY 1,2,3 ORDER BY count(*) DESC;

What is RDS Proxy, and what problem does it solve?Basic

Answer

Technical explanation

Each DB connection consumes memory and CPU; in Kubernetes total connections equal pod count times pool size.

RDS Proxy multiplexes when session state allows it, but pinned sessions reduce benefits.

Prevention includes pool caps, backoff with jitter, timeouts, RDS Proxy/PgBouncer, and avoiding unbounded concurrency.

Hands-on example

Pool sizing example:

DB safe app budget = 850 connections

App pods = 50

Max pool per pod = floor(850/50) = 17

Choose 10 to 15 and load test.

PostgreSQL check:

SELECT usename, application_name, state, count(*) FROM pg_stat_activity GROUP BY 1,2,3 ORDER BY count(*) DESC;

What is the difference between PostgreSQL and MySQL from an operator's view?Basic

Answer

PostgreSQL and MySQL differ operationally in storage behavior, MVCC implementation, replication, indexing, planner behavior, extensions, and maintenance. PostgreSQL MVCC keeps row versions for concurrent snapshots, and VACUUM/autovacuum is essential to clean dead tuples and prevent wraparound.

Technical explanation

PostgreSQL readers do not usually block writers, but old row versions create dead tuples and bloat if vacuum falls behind.

Long-running transactions can prevent cleanup and increase wraparound risk.

Operational checks should include dead tuples, transaction age, autovacuum timing, slow queries, and extension compatibility.

Hands-on example

PostgreSQL health checks:

SELECT relname, n_live_tup, n_dead_tup, last_autovacuum FROM pg_stat_user_tables ORDER BY n_dead_tup DESC LIMIT 20;

SELECT datname, age(datfrozenxid) AS xid_age FROM pg_database ORDER BY xid_age DESC;

If values are high, investigate long transactions, tune autovacuum per hot table, and schedule safe VACUUM work.

What is MVCC, and how does PostgreSQL implement it?Basic

Answer

Technical explanation

PostgreSQL readers do not usually block writers, but old row versions create dead tuples and bloat if vacuum falls behind.

Long-running transactions can prevent cleanup and increase wraparound risk.

Operational checks should include dead tuples, transaction age, autovacuum timing, slow queries, and extension compatibility.

Hands-on example

PostgreSQL health checks:

SELECT relname, n_live_tup, n_dead_tup, last_autovacuum FROM pg_stat_user_tables ORDER BY n_dead_tup DESC LIMIT 20;

SELECT datname, age(datfrozenxid) AS xid_age FROM pg_database ORDER BY xid_age DESC;

If values are high, investigate long transactions, tune autovacuum per hot table, and schedule safe VACUUM work.

What is VACUUM in PostgreSQL, and why is autovacuum important?Basic

Answer

Technical explanation

PostgreSQL readers do not usually block writers, but old row versions create dead tuples and bloat if vacuum falls behind.

Long-running transactions can prevent cleanup and increase wraparound risk.

Operational checks should include dead tuples, transaction age, autovacuum timing, slow queries, and extension compatibility.

Hands-on example

PostgreSQL health checks:

SELECT relname, n_live_tup, n_dead_tup, last_autovacuum FROM pg_stat_user_tables ORDER BY n_dead_tup DESC LIMIT 20;

SELECT datname, age(datfrozenxid) AS xid_age FROM pg_database ORDER BY xid_age DESC;

If values are high, investigate long transactions, tune autovacuum per hot table, and schedule safe VACUUM work.

What is transaction wraparound in PostgreSQL, and why is it dangerous?Basic

Answer

Technical explanation

PostgreSQL readers do not usually block writers, but old row versions create dead tuples and bloat if vacuum falls behind.

Long-running transactions can prevent cleanup and increase wraparound risk.

Operational checks should include dead tuples, transaction age, autovacuum timing, slow queries, and extension compatibility.

Hands-on example

PostgreSQL health checks:

SELECT relname, n_live_tup, n_dead_tup, last_autovacuum FROM pg_stat_user_tables ORDER BY n_dead_tup DESC LIMIT 20;

SELECT datname, age(datfrozenxid) AS xid_age FROM pg_database ORDER BY xid_age DESC;

If values are high, investigate long transactions, tune autovacuum per hot table, and schedule safe VACUUM work.

What is the difference between a clustered and a non-clustered index?Basic

Answer

A clustered index controls or reflects the physical row order in engines that support it; a non-clustered index is a separate lookup structure that points to the table row. In MySQL InnoDB the primary key is clustered, while PostgreSQL normally uses separate indexes and only physically reorders with CLUSTER when explicitly run.

Technical explanation

Use real workload evidence from pg_stat_statements, slow logs, Performance Insights, or traces before adding indexes.

EXPLAIN shows the plan; EXPLAIN ANALYZE runs the query and compares estimated versus actual rows and timing.

Sequential scans are not always bad; for small tables or low-selectivity filters they may be optimal.

Hands-on example

Index tuning example:

EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders WHERE customer_id = $1 ORDER BY created_at DESC LIMIT 20;

CREATE INDEX CONCURRENTLY idx_orders_customer_created ON orders(customer_id, created_at DESC);

Re-run EXPLAIN and confirm lower execution time, fewer buffers read, and no large sort. For covering reads, add INCLUDE columns where appropriate.

What is an index, and what are the trade-offs of adding one?Basic

Answer

An index is a structure that lets the database find rows without scanning the full table. It improves selected reads, joins, ordering, and uniqueness checks, but it costs storage and slows writes because the index must be maintained.

Technical explanation

Use real workload evidence from pg_stat_statements, slow logs, Performance Insights, or traces before adding indexes.

EXPLAIN shows the plan; EXPLAIN ANALYZE runs the query and compares estimated versus actual rows and timing.

Sequential scans are not always bad; for small tables or low-selectivity filters they may be optimal.

Hands-on example

Index tuning example:

EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders WHERE customer_id = $1 ORDER BY created_at DESC LIMIT 20;

CREATE INDEX CONCURRENTLY idx_orders_customer_created ON orders(customer_id, created_at DESC);

Re-run EXPLAIN and confirm lower execution time, fewer buffers read, and no large sort. For covering reads, add INCLUDE columns where appropriate.

How do you decide which columns to index?Intermediate

Answer

I decide indexes from real workload evidence: WHERE predicates, JOIN keys, ORDER BY, GROUP BY, uniqueness rules, and top expensive queries. I prioritize selective columns and query-specific composite indexes, then verify with EXPLAIN ANALYZE.

Technical explanation

Use real workload evidence from pg_stat_statements, slow logs, Performance Insights, or traces before adding indexes.

EXPLAIN shows the plan; EXPLAIN ANALYZE runs the query and compares estimated versus actual rows and timing.

Sequential scans are not always bad; for small tables or low-selectivity filters they may be optimal.

Hands-on example

Index tuning example:

EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders WHERE customer_id = $1 ORDER BY created_at DESC LIMIT 20;

CREATE INDEX CONCURRENTLY idx_orders_customer_created ON orders(customer_id, created_at DESC);

Re-run EXPLAIN and confirm lower execution time, fewer buffers read, and no large sort. For covering reads, add INCLUDE columns where appropriate.

What is a composite index, and does column order matter?Intermediate

Answer

A composite index covers multiple columns, and column order matters because the optimizer can use the leftmost prefix most effectively. Equality filters usually come before range or sort columns, but the final decision is based on the exact query and data distribution.

Technical explanation

Use real workload evidence from pg_stat_statements, slow logs, Performance Insights, or traces before adding indexes.

EXPLAIN shows the plan; EXPLAIN ANALYZE runs the query and compares estimated versus actual rows and timing.

Sequential scans are not always bad; for small tables or low-selectivity filters they may be optimal.

Hands-on example

Index tuning example:

EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders WHERE customer_id = $1 ORDER BY created_at DESC LIMIT 20;

CREATE INDEX CONCURRENTLY idx_orders_customer_created ON orders(customer_id, created_at DESC);

Re-run EXPLAIN and confirm lower execution time, fewer buffers read, and no large sort. For covering reads, add INCLUDE columns where appropriate.

What is a covering index?Intermediate

Answer

A covering index includes all columns needed by a query so the database can answer from the index without reading the base table, when visibility rules allow it. In PostgreSQL, INCLUDE columns are commonly used for this pattern.

Technical explanation

Use real workload evidence from pg_stat_statements, slow logs, Performance Insights, or traces before adding indexes.

EXPLAIN shows the plan; EXPLAIN ANALYZE runs the query and compares estimated versus actual rows and timing.

Sequential scans are not always bad; for small tables or low-selectivity filters they may be optimal.

Hands-on example

Index tuning example:

EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders WHERE customer_id = $1 ORDER BY created_at DESC LIMIT 20;

CREATE INDEX CONCURRENTLY idx_orders_customer_created ON orders(customer_id, created_at DESC);

Re-run EXPLAIN and confirm lower execution time, fewer buffers read, and no large sort. For covering reads, add INCLUDE columns where appropriate.

How do you find and fix a slow query (EXPLAIN / EXPLAIN ANALYZE)?Intermediate

Answer

I find a slow query from traces, slow logs, Performance Insights, or pg_stat_statements, then inspect it with EXPLAIN or EXPLAIN ANALYZE. I fix the root cause through better indexes, query rewrite, updated statistics, batching, or schema changes.

Technical explanation

Use real workload evidence from pg_stat_statements, slow logs, Performance Insights, or traces before adding indexes.

EXPLAIN shows the plan; EXPLAIN ANALYZE runs the query and compares estimated versus actual rows and timing.

Sequential scans are not always bad; for small tables or low-selectivity filters they may be optimal.

Hands-on example

Index tuning example:

EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders WHERE customer_id = $1 ORDER BY created_at DESC LIMIT 20;

CREATE INDEX CONCURRENTLY idx_orders_customer_created ON orders(customer_id, created_at DESC);

Re-run EXPLAIN and confirm lower execution time, fewer buffers read, and no large sort. For covering reads, add INCLUDE columns where appropriate.

What does an execution plan tell you, and what is a sequential scan versus an index scan?Intermediate

Answer

An execution plan shows how the database will access, join, sort, and filter data. A sequential scan reads the table broadly; an index scan uses an index to find matching rows. The right choice depends on table size, selectivity, and cost.

Technical explanation

Use real workload evidence from pg_stat_statements, slow logs, Performance Insights, or traces before adding indexes.

EXPLAIN shows the plan; EXPLAIN ANALYZE runs the query and compares estimated versus actual rows and timing.

Sequential scans are not always bad; for small tables or low-selectivity filters they may be optimal.

Hands-on example

Index tuning example:

EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders WHERE customer_id = $1 ORDER BY created_at DESC LIMIT 20;

CREATE INDEX CONCURRENTLY idx_orders_customer_created ON orders(customer_id, created_at DESC);

Re-run EXPLAIN and confirm lower execution time, fewer buffers read, and no large sort. For covering reads, add INCLUDE columns where appropriate.

What are the SQL isolation levels, and what anomalies does each prevent?Intermediate

Answer

Isolation levels define what concurrent transactions are allowed to see. The standard levels are Read Uncommitted, Read Committed, Repeatable Read, and Serializable; stronger isolation prevents more anomalies but can increase locking or retry needs.

Technical explanation

Stronger isolation can improve correctness but may increase blocking or retry requirements.

Deadlocks are handled by aborting a victim transaction; the application must retry the whole transaction safely.

Short transactions, consistent lock ordering, proper indexes, and idempotent retry logic are essential.

Hands-on example

Safe transfer example:

BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;

SELECT id FROM accounts WHERE id IN (1,2) ORDER BY id FOR UPDATE;

UPDATE accounts SET balance = balance - 100 WHERE id = 1;

UPDATE accounts SET balance = balance + 100 WHERE id = 2;

COMMIT;

On serialization failure or deadlock, retry the whole transaction, not just one statement.

What is the difference between a dirty read, non-repeatable read, and phantom read?Intermediate

Answer

A dirty read sees uncommitted data, a non-repeatable read sees the same row change between reads in one transaction, and a phantom read sees the result set of a predicate change because rows were inserted, deleted, or updated by another transaction.

Technical explanation

Stronger isolation can improve correctness but may increase blocking or retry requirements.

Deadlocks are handled by aborting a victim transaction; the application must retry the whole transaction safely.

Short transactions, consistent lock ordering, proper indexes, and idempotent retry logic are essential.

Hands-on example

Safe transfer example:

BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;

SELECT id FROM accounts WHERE id IN (1,2) ORDER BY id FOR UPDATE;

UPDATE accounts SET balance = balance - 100 WHERE id = 1;

UPDATE accounts SET balance = balance + 100 WHERE id = 2;

COMMIT;

On serialization failure or deadlock, retry the whole transaction, not just one statement.

What is a deadlock, and how do databases detect and resolve them?Intermediate

Answer

A deadlock occurs when transactions wait on each other in a cycle. Databases detect the cycle and abort one transaction, so applications must retry the full transaction and engineers should fix lock ordering, transaction length, and indexes.

Technical explanation

Stronger isolation can improve correctness but may increase blocking or retry requirements.

Deadlocks are handled by aborting a victim transaction; the application must retry the whole transaction safely.

Short transactions, consistent lock ordering, proper indexes, and idempotent retry logic are essential.

Hands-on example

Safe transfer example:

BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;

SELECT id FROM accounts WHERE id IN (1,2) ORDER BY id FOR UPDATE;

UPDATE accounts SET balance = balance - 100 WHERE id = 1;

UPDATE accounts SET balance = balance + 100 WHERE id = 2;

COMMIT;

On serialization failure or deadlock, retry the whole transaction, not just one statement.

What is the difference between OLTP and OLAP workloads?Intermediate

Answer

OLTP workloads serve many small, concurrent transactions with low latency and strong correctness. OLAP workloads serve analytical scans, aggregations, and reports over large historical data, so heavy OLAP should usually be isolated from the OLTP primary.

Technical explanation

OLTP is optimized for short, indexed, concurrent transactions; OLAP is optimized for large scans, joins, and aggregations.

Running OLAP reports on the OLTP primary can consume I/O, CPU, buffers, and locks needed by user traffic.

Common designs isolate analytics through read replicas, CDC pipelines, warehouses, materialized views, or summary tables.

Hands-on example

Example separation:

Checkout API reads/writes orders on the primary RDS instance.

Monthly revenue report runs on a warehouse fed by CDC.

Immediate mitigation for a heavy reporting user:

ALTER ROLE reporting_user SET statement_timeout = "5min";

What is database normalisation, and when would you denormalise?Intermediate

Answer

Normalization reduces duplication and update anomalies by modeling entities in separate related tables. Denormalization intentionally duplicates or precomputes data for read performance, but it adds consistency and reconciliation work.

Technical explanation

Normalization reduces duplicated facts and prevents update anomalies.

Denormalization improves specific reads but creates consistency and reconciliation work.

The normalized OLTP schema can remain source of truth while read models, materialized views, or caches serve optimized reads.

Hands-on example

Example denormalization:

orders(customer_id) references customers(id).

Add orders.customer_name_at_purchase for immutable order history.

Backfill:

UPDATE orders o SET customer_name_at_purchase = c.name FROM customers c WHERE c.id = o.customer_id;

What is the difference between a primary key, a unique key, and a foreign key?Intermediate

Answer

A primary key uniquely identifies each row and cannot be null. A unique key enforces uniqueness for a column or column set. A foreign key enforces that values in one table refer to valid rows in another table.

Technical explanation

Primary keys identify rows and are often referenced by foreign keys.

Unique constraints enforce business uniqueness such as email or tenant plus name.

Foreign keys protect referential integrity but need indexing and migration planning on large tables.

Hands-on example

Example schema:

CREATE TABLE customers (id uuid PRIMARY KEY, email text UNIQUE NOT NULL);

CREATE TABLE orders (id uuid PRIMARY KEY, customer_id uuid NOT NULL REFERENCES customers(id));

Add an index on orders(customer_id) for join and delete/update performance.

What is the difference between DELETE, TRUNCATE, and DROP?Intermediate

Answer

DELETE removes selected rows, usually with a WHERE clause. TRUNCATE removes all rows from a table more quickly but with stronger table-level implications. DROP removes the table object and its definition entirely.

Technical explanation

DELETE is selective and can be batched, but huge deletes create logs, bloat, locks, and replica lag.

TRUNCATE is fast for emptying a table but usually takes stronger locks and affects the whole table.

DROP removes schema and data, so it requires backups, approvals, and migration controls.

Hands-on example

Safe batch delete:

DELETE FROM events WHERE id IN (SELECT id FROM events WHERE created_at < now() - interval '180 days' LIMIT 10000);

Repeat with sleeps and monitor locks, WAL, replica lag, and table bloat.

What is the difference between an INNER JOIN and a LEFT JOIN?Intermediate

Answer

An INNER JOIN returns only matching rows from both tables. A LEFT JOIN returns every row from the left table and matching right-table rows, with NULLs when no right-side match exists.

Technical explanation

INNER JOIN filters to rows that match on both sides.

LEFT JOIN preserves all left-table rows and fills right-side columns with NULL when no match exists.

Putting right-table filters in WHERE can accidentally turn a LEFT JOIN into an INNER JOIN.

Hands-on example

Find customers with no orders:

SELECT c.id, c.email FROM customers c LEFT JOIN orders o ON o.customer_id = c.id WHERE o.id IS NULL;

Support it with:

CREATE INDEX CONCURRENTLY idx_orders_customer_id ON orders(customer_id);

What is a transaction, and what do the ACID properties mean?Intermediate

Answer

A transaction is an all-or-nothing unit of work. ACID means Atomicity, Consistency, Isolation, and Durability: the database commits complete valid changes, controls concurrency, and preserves committed data across crashes.

Technical explanation

Stronger isolation can improve correctness but may increase blocking or retry requirements.

Deadlocks are handled by aborting a victim transaction; the application must retry the whole transaction safely.

Short transactions, consistent lock ordering, proper indexes, and idempotent retry logic are essential.

Hands-on example

Safe transfer example:

BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;

SELECT id FROM accounts WHERE id IN (1,2) ORDER BY id FOR UPDATE;

UPDATE accounts SET balance = balance - 100 WHERE id = 1;

UPDATE accounts SET balance = balance + 100 WHERE id = 2;

COMMIT;

On serialization failure or deadlock, retry the whole transaction, not just one statement.

What is WAL (write-ahead logging), and how does it support durability and replication?Intermediate

Answer

WAL, or write-ahead logging, records changes durably before data pages are considered committed. It enables crash recovery, replication, and point-in-time recovery because the database can replay committed changes.

Technical explanation

WAL/binlog volume affects storage, backup, replica lag, and restore behavior.

Near-zero downtime migration requires initial load, CDC, validation, short write drain, cutover, and rollback planning.

DMS helps move data, but schema conversion, application compatibility, and data validation remain engineering responsibilities.

Hands-on example

Migration runbook:

1. Create target database and schema.

2. Start full-load plus CDC using DMS or logical replication.

3. Monitor lag and task errors.

4. Compare row counts and critical aggregates.

5. Stop writes briefly.

6. Wait for lag to reach zero.

7. Switch endpoint/secret.

8. Monitor app errors and DB load.

9. Keep source read-only until rollback window closes.

What is synchronous versus asynchronous replication, and the trade-offs?Intermediate

Answer

Synchronous replication waits for a replica acknowledgement before commit returns, reducing data-loss risk but increasing latency. Asynchronous replication returns faster but permits replica lag and possible data loss during failover.

Technical explanation

WAL/binlog volume affects storage, backup, replica lag, and restore behavior.

Near-zero downtime migration requires initial load, CDC, validation, short write drain, cutover, and rollback planning.

DMS helps move data, but schema conversion, application compatibility, and data validation remain engineering responsibilities.

Hands-on example

Migration runbook:

1. Create target database and schema.

2. Start full-load plus CDC using DMS or logical replication.

3. Monitor lag and task errors.

4. Compare row counts and critical aggregates.

5. Stop writes briefly.

6. Wait for lag to reach zero.

7. Switch endpoint/secret.

8. Monitor app errors and DB load.

9. Keep source read-only until rollback window closes.

How do you migrate a database with near-zero downtime (e.g., using DMS or logical replication)?Intermediate

Answer

Near-zero downtime migration uses initial load plus ongoing change replication, then a short write drain and controlled cutover. The core tasks are schema prep, CDC, validation, lag monitoring, endpoint switch, and rollback planning.

Technical explanation

WAL/binlog volume affects storage, backup, replica lag, and restore behavior.

Near-zero downtime migration requires initial load, CDC, validation, short write drain, cutover, and rollback planning.

DMS helps move data, but schema conversion, application compatibility, and data validation remain engineering responsibilities.

Hands-on example

Migration runbook:

1. Create target database and schema.

2. Start full-load plus CDC using DMS or logical replication.

3. Monitor lag and task errors.

4. Compare row counts and critical aggregates.

5. Stop writes briefly.

6. Wait for lag to reach zero.

7. Switch endpoint/secret.

8. Monitor app errors and DB load.

9. Keep source read-only until rollback window closes.

What is AWS DMS, and when would you use it?Intermediate

Answer

AWS DMS is a managed migration and replication service that can perform full loads and change data capture between supported sources and targets. I use it for cloud migrations, heterogeneous migrations, and low-downtime moves when native replication is not ideal.

Technical explanation

WAL/binlog volume affects storage, backup, replica lag, and restore behavior.

Near-zero downtime migration requires initial load, CDC, validation, short write drain, cutover, and rollback planning.

DMS helps move data, but schema conversion, application compatibility, and data validation remain engineering responsibilities.

Hands-on example

Migration runbook:

1. Create target database and schema.

2. Start full-load plus CDC using DMS or logical replication.

3. Monitor lag and task errors.

4. Compare row counts and critical aggregates.

5. Stop writes briefly.

6. Wait for lag to reach zero.

7. Switch endpoint/secret.

8. Monitor app errors and DB load.

9. Keep source read-only until rollback window closes.

What is schema migration, and how do you run migrations safely in production?Intermediate

Answer

Schema migration is a controlled change to database structure or data. I run it safely by versioning migrations, testing on production-like data, using backward-compatible steps, limiting locks, batching backfills, and monitoring rollout.

Technical explanation

Old and new app versions may run at the same time, so schema must support both during rollout.

Large table rewrites, blocking locks, foreign-key validation, and index builds can cause outages if done naively.

Always rehearse on production-sized data and set lock_timeout and statement_timeout.

Hands-on example

Expand-contract example:

ALTER TABLE users ADD COLUMN full_name text;

Deploy app that dual-writes name and full_name and reads COALESCE(full_name, name).

Backfill in batches.

After all old app versions are gone, add constraints and later drop the old column.

Foreign key safer pattern:

ALTER TABLE orders ADD CONSTRAINT orders_customer_fk FOREIGN KEY (customer_id) REFERENCES customers(id) NOT VALID;

ALTER TABLE orders VALIDATE CONSTRAINT orders_customer_fk;

How do you make a schema change backward compatible during a rolling deploy?Intermediate

Answer

For rolling deploys, I make schema changes backward compatible with expand-and-contract: add new structures first, deploy code that supports old and new, backfill, switch reads, then remove old structures later.

Technical explanation

Old and new app versions may run at the same time, so schema must support both during rollout.

Large table rewrites, blocking locks, foreign-key validation, and index builds can cause outages if done naively.

Always rehearse on production-sized data and set lock_timeout and statement_timeout.

Hands-on example

Expand-contract example:

ALTER TABLE users ADD COLUMN full_name text;

Deploy app that dual-writes name and full_name and reads COALESCE(full_name, name).

Backfill in batches.

After all old app versions are gone, add constraints and later drop the old column.

Foreign key safer pattern:

ALTER TABLE orders ADD CONSTRAINT orders_customer_fk FOREIGN KEY (customer_id) REFERENCES customers(id) NOT VALID;

ALTER TABLE orders VALIDATE CONSTRAINT orders_customer_fk;

Why are some ALTER TABLE operations dangerous on a large table, and how do you mitigate?Intermediate

Answer

Large ALTER TABLE operations are dangerous because they can lock tables, rewrite data, generate huge WAL/binlogs, increase replica lag, and exhaust storage. I mitigate with concurrent operations, batches, NOT VALID constraints, and rehearsals.

Technical explanation

Old and new app versions may run at the same time, so schema must support both during rollout.

Large table rewrites, blocking locks, foreign-key validation, and index builds can cause outages if done naively.

Always rehearse on production-sized data and set lock_timeout and statement_timeout.

Hands-on example

Expand-contract example:

ALTER TABLE users ADD COLUMN full_name text;

Deploy app that dual-writes name and full_name and reads COALESCE(full_name, name).

Backfill in batches.

After all old app versions are gone, add constraints and later drop the old column.

Foreign key safer pattern:

ALTER TABLE orders ADD CONSTRAINT orders_customer_fk FOREIGN KEY (customer_id) REFERENCES customers(id) NOT VALID;

ALTER TABLE orders VALIDATE CONSTRAINT orders_customer_fk;

What is connection exhaustion, and how do you prevent it?Intermediate

Answer

Technical explanation

Each DB connection consumes memory and CPU; in Kubernetes total connections equal pod count times pool size.

RDS Proxy multiplexes when session state allows it, but pinned sessions reduce benefits.

Prevention includes pool caps, backoff with jitter, timeouts, RDS Proxy/PgBouncer, and avoiding unbounded concurrency.

Hands-on example

Pool sizing example:

DB safe app budget = 850 connections

App pods = 50

Max pool per pod = floor(850/50) = 17

Choose 10 to 15 and load test.

PostgreSQL check:

SELECT usename, application_name, state, count(*) FROM pg_stat_activity GROUP BY 1,2,3 ORDER BY count(*) DESC;

What is ElastiCache, and what does it manage?Intermediate

Answer

ElastiCache is AWS managed in-memory caching for Valkey, Redis OSS, and Memcached. AWS manages nodes, patching, monitoring integration, replication/failover features, and backups where supported; I own keys, TTLs, consistency, and fallback behavior.

Technical explanation

Managed cache still requires correct key design, TTLs, invalidation, memory sizing, security, and fallback behavior.

Redis/Valkey is better for counters, rate limits, sorted sets, locks, streams, and HA patterns; Memcached is good for simple ephemeral object cache.

The cache should not become a hidden source of truth unless durability, backup, and failover are deliberately designed.

Hands-on example

Cache-aside example:

GET product:v1:123

If hit, return cached value.

If miss, query RDS, then SETEX product:v1:123 300 <json>.

Redis data type examples:

INCR api:tenant:42:requests

HSET user:123 tier premium

ZADD leaderboard 9100 user:123

XADD orders:events * order_id 1001 status created

What is the difference between ElastiCache Redis and Memcached?Intermediate

Answer

Redis or Valkey supports rich data structures, persistence-related features, replication, clustering, Lua, streams, and pub/sub. Memcached is simpler and best for volatile key-value object caching without complex data structures.

Technical explanation

Managed cache still requires correct key design, TTLs, invalidation, memory sizing, security, and fallback behavior.

Redis/Valkey is better for counters, rate limits, sorted sets, locks, streams, and HA patterns; Memcached is good for simple ephemeral object cache.

The cache should not become a hidden source of truth unless durability, backup, and failover are deliberately designed.

Hands-on example

Cache-aside example:

GET product:v1:123

If hit, return cached value.

If miss, query RDS, then SETEX product:v1:123 300 <json>.

Redis data type examples:

INCR api:tenant:42:requests

HSET user:123 tier premium

ZADD leaderboard 9100 user:123

XADD orders:events * order_id 1001 status created

What is Redis, and what core data types does it support?Intermediate

Answer

Redis is an in-memory data structure server used for caching, counters, queues, rate limits, sessions, and coordination. Core types include strings, hashes, lists, sets, sorted sets, streams, bitmaps, HyperLogLog, and geospatial indexes.

Technical explanation

Managed cache still requires correct key design, TTLs, invalidation, memory sizing, security, and fallback behavior.

Redis/Valkey is better for counters, rate limits, sorted sets, locks, streams, and HA patterns; Memcached is good for simple ephemeral object cache.

The cache should not become a hidden source of truth unless durability, backup, and failover are deliberately designed.

Hands-on example

Cache-aside example:

GET product:v1:123

If hit, return cached value.

If miss, query RDS, then SETEX product:v1:123 300 <json>.

Redis data type examples:

INCR api:tenant:42:requests

HSET user:123 tier premium

ZADD leaderboard 9100 user:123

XADD orders:events * order_id 1001 status created

What is Valkey, and why did the community fork it from Redis?Intermediate

Answer

Valkey is an open-source Redis-compatible datastore forked after Redis changed its license. It preserves a community-governed open-source path while aiming to keep Redis protocol and command compatibility.

Technical explanation

Wire compatibility does not guarantee every feature, module, metric, or managed-service setting behaves identically.

If the workload is pure cache, cutover can tolerate misses; if Redis holds authoritative data, migration must be much stricter.

Validation should compare hit ratio, latency, evictions, memory, CPU, connection count, app errors, and failover behavior.

Hands-on example

Migration runbook:

1. Inventory Redis version, commands, clients, Lua scripts, modules, ACLs, cluster mode, and parameters.

2. Create equivalent Valkey cluster.

3. Restore backup or use online migration if supported.

4. Test staging app and replay synthetic traffic.

5. Lower TTLs and warm hot keys.

6. Switch endpoint/secret.

7. Monitor p95 latency, hit ratio, evictions, CPU, memory, and errors.

8. Keep old cluster for rollback window.

Walk me through your ElastiCache Redis to Valkey migration: motivation, plan, and validation.Intermediate

Answer

A Redis to Valkey migration should be driven by support, licensing, cost, or platform-standardization goals, then executed through inventory, compatibility testing, staging validation, cutover, metrics comparison, and rollback planning.

Technical explanation

Wire compatibility does not guarantee every feature, module, metric, or managed-service setting behaves identically.

If the workload is pure cache, cutover can tolerate misses; if Redis holds authoritative data, migration must be much stricter.

Validation should compare hit ratio, latency, evictions, memory, CPU, connection count, app errors, and failover behavior.

Hands-on example

Migration runbook:

1. Inventory Redis version, commands, clients, Lua scripts, modules, ACLs, cluster mode, and parameters.

2. Create equivalent Valkey cluster.

3. Restore backup or use online migration if supported.

4. Test staging app and replay synthetic traffic.

5. Lower TTLs and warm hot keys.

6. Switch endpoint/secret.

7. Monitor p95 latency, hit ratio, evictions, CPU, memory, and errors.

8. Keep old cluster for rollback window.

How do you migrate from Redis to Valkey with minimal downtime?Intermediate

Answer

For minimal downtime Redis to Valkey migration, I first classify whether the cache is disposable or authoritative. Pure caches can use warmup and endpoint cutover; authoritative data needs replication, backup/restore, dual-write, or a write freeze.

Technical explanation

Wire compatibility does not guarantee every feature, module, metric, or managed-service setting behaves identically.

If the workload is pure cache, cutover can tolerate misses; if Redis holds authoritative data, migration must be much stricter.

Validation should compare hit ratio, latency, evictions, memory, CPU, connection count, app errors, and failover behavior.

Hands-on example

Migration runbook:

1. Inventory Redis version, commands, clients, Lua scripts, modules, ACLs, cluster mode, and parameters.

2. Create equivalent Valkey cluster.

3. Restore backup or use online migration if supported.

4. Test staging app and replay synthetic traffic.

5. Lower TTLs and warm hot keys.

6. Switch endpoint/secret.

7. Monitor p95 latency, hit ratio, evictions, CPU, memory, and errors.

8. Keep old cluster for rollback window.

How is Valkey wire-compatible with Redis, and what risks remain in a migration?Intermediate

Answer

Valkey wire compatibility means normal Redis clients and protocol-level commands usually work, but risks remain around version differences, modules, Lua scripts, cluster behavior, managed-service restrictions, metrics, and ACL assumptions.

Technical explanation

Wire compatibility does not guarantee every feature, module, metric, or managed-service setting behaves identically.

If the workload is pure cache, cutover can tolerate misses; if Redis holds authoritative data, migration must be much stricter.

Validation should compare hit ratio, latency, evictions, memory, CPU, connection count, app errors, and failover behavior.

Hands-on example

Migration runbook:

1. Inventory Redis version, commands, clients, Lua scripts, modules, ACLs, cluster mode, and parameters.

2. Create equivalent Valkey cluster.

3. Restore backup or use online migration if supported.

4. Test staging app and replay synthetic traffic.

5. Lower TTLs and warm hot keys.

6. Switch endpoint/secret.

7. Monitor p95 latency, hit ratio, evictions, CPU, memory, and errors.

8. Keep old cluster for rollback window.

What are common caching patterns (cache-aside, read-through, write-through, write-behind)?Intermediate

Answer

Cache-aside loads from the DB on miss, read-through lets the cache layer load, write-through writes cache and DB synchronously, and write-behind writes to cache first then flushes asynchronously. Each pattern trades consistency, latency, and loss risk differently.

Technical explanation

TTL controls the trade-off between freshness and backend load.

Stampedes are prevented with request coalescing, locks, TTL jitter, stale-while-revalidate, and rate limits.

Invalidation is hard because writes can happen through many services; outbox events make invalidation more reliable after DB commit.

Hands-on example

Cache-aside with stampede protection:

1. GET product:v2:123.

2. On miss, acquire SET product:v2:123:lock 1 NX PX 5000.

3. Lock holder reads RDS and SETEX product:v2:123 300+jitter value.

4. Other callers return stale value or wait briefly.

5. After product update, commit DB then DEL product:v2:123 or publish invalidation via outbox.

What is cache-aside (lazy loading), and what are its failure modes?Intermediate

Answer

Cache-aside means the app checks cache first, reads the database on miss, writes the value to cache with TTL, and returns it. Its failure modes are stale data, stampedes, inconsistent invalidation, and DB overload during cache failure.

Technical explanation

TTL controls the trade-off between freshness and backend load.

Stampedes are prevented with request coalescing, locks, TTL jitter, stale-while-revalidate, and rate limits.

Invalidation is hard because writes can happen through many services; outbox events make invalidation more reliable after DB commit.

Hands-on example

Cache-aside with stampede protection:

1. GET product:v2:123.

2. On miss, acquire SET product:v2:123:lock 1 NX PX 5000.

3. Lock holder reads RDS and SETEX product:v2:123 300+jitter value.

4. Other callers return stale value or wait briefly.

5. After product update, commit DB then DEL product:v2:123 or publish invalidation via outbox.

What is a cache stampede / thundering herd, and how do you prevent it?Intermediate

Answer

A cache stampede happens when many clients miss the same hot key at once and all hit the backend. I prevent it with request coalescing, TTL jitter, locks, stale-while-revalidate, warmup, and fallback rate limits.

Technical explanation

TTL controls the trade-off between freshness and backend load.

Stampedes are prevented with request coalescing, locks, TTL jitter, stale-while-revalidate, and rate limits.

Invalidation is hard because writes can happen through many services; outbox events make invalidation more reliable after DB commit.

Hands-on example

Cache-aside with stampede protection:

1. GET product:v2:123.

2. On miss, acquire SET product:v2:123:lock 1 NX PX 5000.

3. Lock holder reads RDS and SETEX product:v2:123 300+jitter value.

4. Other callers return stale value or wait briefly.

5. After product update, commit DB then DEL product:v2:123 or publish invalidation via outbox.

What is cache invalidation, and why is it hard?Intermediate

Answer

Cache invalidation is deciding when cached data is no longer correct and removing or refreshing it. It is hard because writes can come from many paths and caches can exist at multiple layers.

Technical explanation

TTL controls the trade-off between freshness and backend load.

Stampedes are prevented with request coalescing, locks, TTL jitter, stale-while-revalidate, and rate limits.

Invalidation is hard because writes can happen through many services; outbox events make invalidation more reliable after DB commit.

Hands-on example

Cache-aside with stampede protection:

1. GET product:v2:123.

2. On miss, acquire SET product:v2:123:lock 1 NX PX 5000.

3. Lock holder reads RDS and SETEX product:v2:123 300+jitter value.

4. Other callers return stale value or wait briefly.

5. After product update, commit DB then DEL product:v2:123 or publish invalidation via outbox.

What are Redis eviction policies (e.g., LRU, LFU, volatile vs allkeys)?Advanced

Answer

Redis eviction policies decide what Redis removes when maxmemory is reached. allkeys policies can evict any key; volatile policies evict only TTL keys; LRU and LFU approximate recency and frequency; noeviction makes writes fail.

Technical explanation

LRU approximates least recently used; LFU approximates least frequently used; volatile policies protect non-TTL keys but can fail writes if no TTL keys are eligible.

TTL jitter prevents synchronized expiry and thundering herds.

Memory alerts must fire before maxmemory to avoid eviction storms or write failures.

Hands-on example

Policy example for pure cache:

maxmemory-policy allkeys-lfu

SETEX product:v2:123 642 <json>

642 seconds = 600 base + random jitter.

Incident checks:

INFO memory

INFO stats

CONFIG GET maxmemory-policy

Watch evicted_keys, keyspace_hits, keyspace_misses, app cache errors, and RDS fallback load.

What happens when Redis reaches maxmemory?Advanced

Answer

When Redis reaches maxmemory, it either evicts eligible keys based on policy or rejects memory-growing writes if no key can be evicted. Reads may still work, but latency and miss rate can degrade quickly.

Technical explanation

LRU approximates least recently used; LFU approximates least frequently used; volatile policies protect non-TTL keys but can fail writes if no TTL keys are eligible.

TTL jitter prevents synchronized expiry and thundering herds.

Memory alerts must fire before maxmemory to avoid eviction storms or write failures.

Hands-on example

Policy example for pure cache:

maxmemory-policy allkeys-lfu

SETEX product:v2:123 642 <json>

642 seconds = 600 base + random jitter.

Incident checks:

INFO memory

INFO stats

CONFIG GET maxmemory-policy

Watch evicted_keys, keyspace_hits, keyspace_misses, app cache errors, and RDS fallback load.

What is a TTL, and how do you choose one for cached data?Advanced

Answer

TTL is the expiration time for a key. I choose it from freshness needs, update frequency, miss cost, database capacity, and user impact of staleness, and I add jitter to avoid synchronized expiry.

Technical explanation

LRU approximates least recently used; LFU approximates least frequently used; volatile policies protect non-TTL keys but can fail writes if no TTL keys are eligible.

TTL jitter prevents synchronized expiry and thundering herds.

Memory alerts must fire before maxmemory to avoid eviction storms or write failures.

Hands-on example

Policy example for pure cache:

maxmemory-policy allkeys-lfu

SETEX product:v2:123 642 <json>

642 seconds = 600 base + random jitter.

Incident checks:

INFO memory

INFO stats

CONFIG GET maxmemory-policy

Watch evicted_keys, keyspace_hits, keyspace_misses, app cache errors, and RDS fallback load.

What is the difference between Redis persistence options RDB and AOF?Advanced

Answer

RDB creates point-in-time snapshot files, while AOF appends write commands to a log for replay. RDB is compact and simple but can lose recent writes; AOF improves durability at the cost of disk and write overhead.

Technical explanation

RDB snapshots are compact point-in-time files and are simple to restore.

AOF logs write commands and can reduce data loss depending on fsync policy.

For a pure cache, persistence may be less important; for authoritative Redis data, persistence and backup testing are mandatory.

Hands-on example

Self-managed example:

save 900 1

save 300 10

appendonly yes

appendfsync everysec

Test by writing known keys, forcing restart in staging, and verifying which keys survive.

What are the durability trade-offs between RDB snapshots and AOF?Advanced

Answer

RDB has lower overhead and faster compact backups but can lose changes after the last snapshot. AOF can reduce loss depending on fsync policy, but increases disk usage, write cost, and rewrite complexity.

Technical explanation

RDB has lower overhead but loses changes since the last snapshot.

AOF with everysec can limit loss but adds disk usage and rewrite overhead.

Neither protects against logical deletes by itself; backups and restore tests are still needed.

Hands-on example

Decision example:

Rate-limit counters: RDB or no persistence may be acceptable.

Shopping cart stored only in Redis: use replication, backups, and stronger persistence, or move source of truth to RDS.

Document the RPO and prove it with a crash test.

What is Redis replication, and how does a replica stay in sync?Advanced

Answer

Redis replication copies data from a primary to replicas through initial synchronization and ongoing write streams. It improves read scale and HA, but is usually asynchronous, so replicas can lag.

Technical explanation

Initial sync copies a snapshot or backlog; ongoing sync applies the primary write stream.

Replication lag can appear during high write volume, network issues, or slow replicas.

Replica reads can be stale, so strongly consistent reads should use the primary.

Hands-on example

Replication check:

INFO replication

Review role, connected_replicas, master_repl_offset, replica offsets, and link status.

Alert on sustained lag or broken replica links.

What is Redis Sentinel, and what does it provide?Advanced

Answer

Redis Sentinel monitors a primary-replica deployment, detects primary failure, elects a new primary, and provides discovery. It gives failover but not sharding.

Technical explanation

Sentinel monitors Redis primary and replicas and decides failover through quorum.

It promotes a replica and tells Sentinel-aware clients where the new primary is.

It does not shard data; it is an HA mechanism for a non-clustered dataset.

Hands-on example

Sentinel failover flow:

1. Sentinels detect primary down.

2. Quorum agrees.

3. A leader selects a replica.

4. Replica is promoted.

5. Other replicas follow it.

6. Clients reconnect through Sentinel discovery.

What is Redis Cluster, and how does sharding with hash slots work?Advanced

Answer

Redis Cluster shards the keyspace across 16,384 hash slots owned by primary nodes, with replicas for HA. Clients must be cluster-aware and multi-key operations need keys in the same slot.

Technical explanation

Redis Cluster splits keys across 16,384 hash slots.

Each primary owns slots and replicas provide HA for those slots.

Clients must handle MOVED/ASK redirects, and multi-key operations need same-slot keys.

Hands-on example

Hash-tag example:

cart:{user123}:items

cart:{user123}:total

CLUSTER KEYSLOT cart:{user123}:items

CLUSTER KEYSLOT cart:{user123}:total

Both should map to the same slot.

What is the difference between Redis Sentinel and Redis Cluster?Advanced

Answer

Sentinel provides failover for one primary dataset without sharding. Redis Cluster provides horizontal sharding plus failover per shard, but requires cluster-aware clients and careful key design.

Technical explanation

Sentinel is simpler when one primary dataset fits on one node and you need failover.

Cluster is used when memory or throughput requires multiple primaries.

Cluster adds client and key-design complexity, especially for multi-key operations.

Hands-on example

Selection example:

20 GB simple cache: primary plus replica with managed failover may be enough.

900 GB tenant cache: cluster mode with shard planning is more appropriate.

Validate client cluster support before choosing Cluster.

How does ElastiCache provide high availability and automatic failover?Advanced

Answer

ElastiCache provides HA through replication groups, Multi-AZ replicas, node health monitoring, automatic failover, and endpoint updates. Applications still need reconnect and retry behavior during failover.

Technical explanation

ElastiCache uses replication groups, replicas, Multi-AZ placement, health checks, and automatic failover.

Cluster-mode enabled deployments fail over per shard.

Applications must handle dropped connections, endpoint updates, and retry behavior.

Hands-on example

Staging test:

Run GET/SET loop with latency measurement.

Trigger test failover.

Monitor CurrConnections, EngineCPUUtilization, ReplicationLag, CacheHitRate, Evictions, and app p95 latency.

What is a cache hit ratio, and how do you improve it?Advanced

Answer

Cache hit ratio is hits divided by total lookups. I improve it by caching the right data, sizing memory, avoiding evictions, using stable keys, warming hot keys, choosing good TTLs, and avoiding unnecessary invalidation.

Technical explanation

Hit ratio equals hits divided by hits plus misses.

A global hit ratio can hide a low-hit critical endpoint, so break down by key prefix or API.

Improving hit ratio requires better key choice, TTLs, memory sizing, warmup, and avoiding unnecessary invalidation.

Hands-on example

Calculation:

keyspace_hits=900000

keyspace_misses=100000

Hit ratio = 900000 / 1000000 = 90%.

If evictions are high, add memory, reduce value size, or tune TTL/policy.

How do you monitor Redis/ElastiCache (evictions, memory, latency, hit ratio)?Advanced

Answer

I monitor Redis or ElastiCache with hit ratio, latency, CPU, memory, evictions, fragmentation, connections, rejected connections, replication lag, slowlog, command stats, and application fallback rate.

Technical explanation

Global hit ratio can hide bad key patterns; monitor by endpoint or key prefix where possible.

Large keys and hot keys can overload a single Redis thread or shard even when total cluster capacity looks fine.

Use SCAN and UNLINK instead of KEYS and mass DEL; restrict dangerous commands through ACLs or operational policy.

Hands-on example

Operational commands:

INFO stats

INFO memory

INFO commandstats

SLOWLOG GET 10

LATENCY LATEST

redis-cli --bigkeys -h <host> -p 6379

Safe cleanup pattern:

redis-cli --scan --pattern "session:*" | while read key; do redis-cli UNLINK "$key"; done

Pipeline warmup in batches instead of one command per network round trip.

What is the impact of a large key or hot key in Redis, and how do you handle it?Advanced

Answer

A large key increases memory, network, replication, and latency cost; a hot key concentrates too much traffic on one key or shard. I handle them by splitting keys, redesigning data structures, local caching, and request coalescing.

Technical explanation

Large keys increase memory, network, replication, and command latency cost.

Hot keys overload one shard or event loop even if total cluster capacity looks healthy.

Fixes include splitting keys, better data structures, local cache, key sharding, and request coalescing.

Hands-on example

Investigation:

redis-cli --bigkeys -h <host> -p 6379

SCAN 0 MATCH product:* COUNT 1000

MEMORY USAGE product:v2:123

Avoid KEYS in production.

Why are some Redis commands (KEYS, FLUSHALL) dangerous in production?Advanced

Answer

KEYS can block Redis by scanning the full keyspace, and FLUSHALL can delete the entire cache or datastore. In production I use SCAN/UNLINK, namespace isolation, ACL restrictions, and controlled cleanup jobs.

Technical explanation

Global hit ratio can hide bad key patterns; monitor by endpoint or key prefix where possible.

Large keys and hot keys can overload a single Redis thread or shard even when total cluster capacity looks fine.

Use SCAN and UNLINK instead of KEYS and mass DEL; restrict dangerous commands through ACLs or operational policy.

Hands-on example

Operational commands:

INFO stats

INFO memory

INFO commandstats

SLOWLOG GET 10

LATENCY LATEST

redis-cli --bigkeys -h <host> -p 6379

Safe cleanup pattern:

redis-cli --scan --pattern "session:*" | while read key; do redis-cli UNLINK "$key"; done

Pipeline warmup in batches instead of one command per network round trip.

What is pipelining in Redis, and how does it improve throughput?Advanced

Answer

Pipelining sends multiple commands before waiting for replies, reducing network round trips and increasing throughput. It is a transport optimization, not a transaction, and pipeline size must be bounded.

Technical explanation

Pipelining reduces round trips for many small independent commands.

It does not make Redis execute commands in parallel and does not provide transaction semantics.

Pipeline batch size must be capped to avoid memory pressure and tail-latency spikes.

Hands-on example

Python-style warmup:

pipe = redis.pipeline(transaction=False)

for product in products:

pipe.setex(f"product:v2:{product.id}", 600, serialize(product))

pipe.execute()

Execute every 500 commands instead of building an unbounded pipeline.

What is the difference between Redis as a cache and Redis as a primary datastore?Advanced

Answer

Redis as a cache is rebuildable from another source of truth. Redis as a primary datastore is authoritative, so persistence, backup, failover, no unsafe eviction, and recovery testing become mandatory.

Technical explanation

Cache data should be safely rebuildable from a durable source.

Primary-store data needs persistence, backups, no unsafe eviction, HA, and restore testing.

Financial records and durable ledgers should usually live in a transactional database, not only Redis.

Hands-on example

Classification:

Product details copied from RDS: cache.

User sessions: needs datastore-level HA policy.

Billing counters: needs durable queue, DB write, or reconciliation.

Set eviction to noeviction if data must not disappear.

How would you use Redis for rate limiting?Advanced

Answer

For rate limiting, Redis commonly uses atomic counters with INCR and EXPIRE for fixed windows or sorted sets for sliding windows. Lua scripts help make multi-step limiter logic atomic.

Technical explanation

Rate limits can use INCR/EXPIRE or sorted sets, ideally through Lua for atomicity.

Distributed locks should use SET key value NX PX ttl with unique token and safe compare-and-delete release; they are better for efficiency than hard correctness.

Security baseline is private subnets, SG restrictions, TLS, AUTH/ACLs, secret rotation, and no public exposure.

Hands-on example

Rate limit example:

MULTI

INCR rl:tenant:42:minute:202606301015

EXPIRE rl:tenant:42:minute:202606301015 120

EXEC

Lock example:

SET lock:refresh:product:123 <uuid> NX PX 5000

Release only if GET lock value matches your uuid. Never DEL blindly.

How would you use Redis for distributed locking, and what are the caveats?Advanced

Answer

Redis distributed locks commonly use SET key value NX PX ttl with a unique token and compare-and-delete release. They are useful for efficiency locks but require caution for correctness-critical workflows.

Technical explanation

Rate limits can use INCR/EXPIRE or sorted sets, ideally through Lua for atomicity.

Distributed locks should use SET key value NX PX ttl with unique token and safe compare-and-delete release; they are better for efficiency than hard correctness.

Security baseline is private subnets, SG restrictions, TLS, AUTH/ACLs, secret rotation, and no public exposure.

Hands-on example

Rate limit example:

MULTI

INCR rl:tenant:42:minute:202606301015

EXPIRE rl:tenant:42:minute:202606301015 120

EXEC

Lock example:

SET lock:refresh:product:123 <uuid> NX PX 5000

Release only if GET lock value matches your uuid. Never DEL blindly.

How do you secure Redis (AUTH, TLS, network isolation, no public exposure)?Advanced

Answer

Redis security starts with private networking, no public exposure, security groups, TLS, AUTH or ACLs, least-privilege users, secret rotation, and blocking dangerous commands for application users.

Technical explanation

Network isolation is the first control: Redis should not be public.

AUTH/ACLs limit what application users can do, including blocking admin commands.

TLS, secret rotation, security groups, and private subnets complete the baseline.

Hands-on example

Security baseline:

Private subnets only.

Security group allows 6379 only from app SG.

Transit encryption enabled.

App ACL allows GET/SET but denies FLUSHALL/CONFIG.

Secret stored in Secrets Manager and rotated.

What is the difference between strong and eventual consistency, and where does caching sit?Advanced

Answer

Strong consistency means reads reflect the latest committed write; eventual consistency means replicas or caches can be temporarily stale. Caches usually introduce eventual consistency unless reads bypass or refresh after writes.

Technical explanation

Caches can serve older values until TTL expiry or invalidation.

Read-after-write paths often need primary reads or synchronous cache invalidation.

Different data has different consistency tolerance: product descriptions differ from balances.

Hands-on example

Profile update flow:

UPDATE users SET name = ... WHERE id = 123;

After commit: DEL user:v2:123.

Return updated profile directly from the write response or bypass cache briefly for that user.

How do you keep a cache and database consistent?Advanced

Answer

I keep cache and DB consistent by treating the DB as source of truth, using TTLs, invalidating after commits, versioning keys, and using outbox-driven invalidation so cache events are not lost.

Technical explanation

Database should remain the source of truth unless the cache is intentionally designed as a datastore.

Keep cache and DB consistent through TTLs, delete-after-commit, versioned keys, outbox-driven invalidation, and primary reads after writes.

Graceful degradation requires low cache timeouts, circuit breakers, stale-if-safe responses, DB fallback limits, and cache warming.

Hands-on example

Outbox invalidation example:

BEGIN;

UPDATE products SET price = 19.99 WHERE id = 123;

INSERT INTO outbox(event_type, payload) VALUES ("ProductUpdated", "{id:123}");

COMMIT;

Worker reads outbox, DEL product:v2:123, publishes event, marks outbox processed.

Warm top keys after failover with a throttled job and stop if RDS CPU or app latency rises.

What is read-your-writes consistency, and how can caching break it?Advanced

Answer

Read-your-writes means a user sees their own update immediately. Caching can break it by returning stale cached values or replica reads, so I bypass cache or invalidate/update it on write.

Technical explanation

Read-your-writes is strongest for the user who just performed the write.

Cache and read replicas can both break it immediately after update.

Solutions include delete/update cache on write, primary read after write, version checks, or short cache-bypass windows.

Hands-on example

Example:

PUT /profile updates RDS and deletes user:v2:123.

GET /profile for the same user during the next 10 seconds reads primary DB or rebuilt cache.

Public profile reads can remain cache-aside if stale reads are acceptable.

How would you design caching for a read-heavy microservice?Advanced

Answer

For a read-heavy microservice, I cache hot, expensive, safe-to-stale data using clear key naming, TTL plus jitter, explicit invalidation, request coalescing, warming, and fallback protection for the database.

Technical explanation

Start from metrics: hot endpoints, expensive queries, result size, and allowed staleness.

Use Redis/Valkey shared cache for common data and local cache for tiny ultra-hot config.

Protect the database from miss storms with coalescing, warmup, and fallback rate limits.

Hands-on example

Read-heavy product cache:

product:v3:{id} TTL 15m+jitter

product-price:v1:{id} TTL 60s plus explicit invalidation

Redis timeout 50ms.

On miss, coalesce requests and rate-limit DB fallback.

How do you decide what to cache and what not to cache?Advanced

Answer

I cache data that is frequently read, expensive to fetch or compute, reasonably small, and tolerant of staleness. I avoid caching rapidly changing, sensitive, authorization-heavy, cheap, or strict-consistency data.

Technical explanation

Good cache candidates are hot, expensive, small, and stale-tolerant.

Bad candidates are rapidly changing, sensitive, huge, cheap to query, or strict-consistency data.

Authorization-heavy data requires special care because stale permissions are a security risk.

Hands-on example

Decision examples:

Cache product category list: yes.

Cache payment balance: no, read primary.

Cache user permissions for 1 hour: risky; prefer short TTL or token/revocation design.

How do you handle cache warming after a restart or failover?Advanced

Answer

Cache warming preloads hot keys after restart, failover, or new cluster creation so user traffic does not create a cold-cache miss storm. It must be throttled and monitored to avoid overloading the database.

Technical explanation

Cold caches can shift traffic to RDS and create a secondary outage.

Warm only known hot keys and throttle the job.

Stop warming if source DB or application SLOs degrade.

Hands-on example

Warmup job:

for key in $(cat top_keys.txt); do curl -s "https://product-service/internal/cache-warm?key=$key" >/dev/null; sleep 0.05; done

Stop if RDS CPU > 70% or app p95 exceeds SLO.

What is the blast radius of a cache outage, and how do you make the app degrade gracefully?Advanced

Answer

A cache outage can cause anything from minor latency to full outage. I reduce blast radius with low timeouts, circuit breakers, DB fallback limits, stale-if-safe responses, local last-known-good data, and graceful feature degradation.

Technical explanation

Cache failure should not automatically become total service failure unless the cache is an intentional primary store.

Use low timeouts and circuit breakers to avoid tying up app threads.

Degrade non-critical features and protect DB fallback with rate limits.

Hands-on example

Failure policy:

Redis connect_timeout=50ms, read_timeout=50ms, max_retries=1.

Product miss: DB fallback with limiter.

Recommendations failure: omit module.

Feature config failure: use last-known-good local copy.

How do you size an ElastiCache cluster (memory, nodes, shards)?Advanced

Answer

I size ElastiCache from working-set memory, key/value overhead, fragmentation, CPU, network, connection count, replica needs, shard distribution, failover headroom, and growth forecast.

Technical explanation

Failover testing must measure the whole application recovery path, not just service events.

Connection storms are prevented with pool caps, rolling deploys, jittered startup, exponential backoff, and readiness gates.

For interview stories, state the business driver, migration plan, validation, rollback, and measurable result such as p95 latency, hit ratio, cost, or error-rate improvement.

Hands-on example

Capacity example:

Current RDS free storage = 500 GB

Growth = 20 GB/day

Days remaining = 25

Action threshold = 45 days, so act now.

SQL size check:

SELECT relname, pg_size_pretty(pg_total_relation_size(oid)) FROM pg_class WHERE relkind = 'r' ORDER BY pg_total_relation_size(oid) DESC LIMIT 20;

Plan: enable storage autoscaling with max cap, archive old data, review index bloat, test restore time, and evaluate partitioning.

What backup and restore options exist for ElastiCache Redis?Advanced

Answer

ElastiCache Redis or Valkey backup options include automatic backups, manual snapshots, and snapshot restore into a new cluster where supported. Pure caches may be rebuilt; authoritative data needs tested restore.

Technical explanation

Automatic backups and manual snapshots protect Redis/Valkey data where supported.

Restore usually creates or populates a new cluster and must be validated.

Memcached is volatile; rebuild rather than restore is normally the model.

Hands-on example

Restore test:

1. Take manual snapshot before upgrade.

2. Restore to staging.

3. Connect test app.

4. Verify key count and sample keys.

5. Run failover test.

6. Document restore time and data loss window.

How do you test a Redis or database failover without impacting users?Advanced

Answer

I test database or Redis failover through staging and approved game days, measuring application recovery, reconnect behavior, error rate, latency, and alerting. The complete app path matters more than the service event alone.

Technical explanation

Failover testing must include clients, retries, DNS/endpoints, connection pools, and alerts.

Staging comes first; production tests need approval, abort criteria, and rollback ownership.

Success is measured by application recovery, not only the managed-service event completing.

Hands-on example

Game day:

Pre-check backups and replicas.

Trigger RDS or ElastiCache failover in a controlled window.

Watch p95 latency, 5xx, reconnect logs, DB/cache metrics, and alerts.

Pass if app recovers without manual pod restart.

How do you handle connection storms against the cache after a deploy?Advanced

Answer

Connection storms happen when many clients reconnect simultaneously after deploy, failover, scaling, or network issues. I prevent them with pool caps, rolling deploys, jittered startup, exponential backoff, and readiness gates.

Technical explanation

Pods, functions, or clients can all reconnect at once and exceed DB/cache limits.

Backoff with jitter and pool caps prevent synchronized retries.

Rolling deploy settings and readiness probes must not hammer dependencies.

Hands-on example

Kubernetes guardrails:

maxSurge: 10%

maxUnavailable: 0

DB_POOL_MAX=10

REDIS_POOL_MAX=20

CONNECT_BACKOFF_JITTER=true

Add random startup sleep before opening warm connections.

What is the difference between a write-through and write-behind cache regarding data loss?Advanced

Answer

Write-through updates cache and database synchronously, giving stronger consistency but higher write latency. Write-behind writes to cache first and flushes later, improving latency but risking data loss without durable buffering.

Technical explanation

Write-through favors consistency but couples success to both cache and database write paths.

Write-behind favors latency and burst absorption but can lose data before flush.

Write-behind needs durable queueing, idempotent workers, retries, DLQ, and reconciliation.

Hands-on example

Use write-through or direct DB transaction for inventory updates.

Use write-behind only for tolerable-loss events such as page-view counters.

If used: append to Kafka/SQS, update DB idempotently, monitor queue lag, and reconcile totals.

How would you debug rising latency on a service that depends on Redis and RDS together?Advanced

Answer

To debug latency involving both Redis and RDS, I break the path into cache hit/miss, Redis latency, DB query time, connection wait, and app processing, then correlate hit ratio, Redis metrics, RDS DB load, and top SQL.

Technical explanation

First determine whether Redis is slow, cache misses increased, or RDS is slow independently.

Break request latency into cache time, hit/miss, DB time, connection wait, and app time.

Correlate deploy markers with cache hit ratio, Redis latency, RDS DB load, and top SQL.

Hands-on example

Triage graph:

service p95/p99 latency

Redis hit ratio and evictions

Redis command latency

RDS CPU, connections, DB load

Performance Insights top SQL

If hit ratio drops and RDS load rises, investigate TTL, invalidation, warmup, or eviction.

What recent database or caching change have you made, and what did it improve?Advanced

Answer

A strong recent-change answer should describe business driver, risk controls, migration or upgrade plan, validation, rollback, and measured improvement. Redis to Valkey migration or RDS major version upgrade are good examples.

Technical explanation

The story should show ownership from problem framing through validation.

Metrics make the answer credible: p95 latency, hit ratio, error rate, cost, recovery time, or incident reduction.

Mention rollback and what you monitored during and after the change.

Hands-on example

STAR example:

Situation: Redis lifecycle and license direction required a supported Valkey path.

Task: migrate without user impact.

Action: inventory, staging Valkey, traffic replay, failover test, TTL reduction, hot-key warmup, endpoint cutover.

Result: stable p95, stable hit ratio, no incident, lower platform risk.

How do you plan capacity and storage autoscaling for a growing RDS database?Advanced

Answer

For RDS capacity, I plan storage, IOPS, throughput, CPU, memory, connections, WAL/binlog growth, backup/restore time, and data growth. Storage autoscaling is useful but does not replace forecasting and schema/data lifecycle work.

Technical explanation

RDS storage autoscaling can prevent out-of-space incidents but does not shrink storage.

Growth affects backup time, restore time, vacuum, indexes, migrations, and query plans.

Capacity planning should forecast storage, IOPS, throughput, CPU, memory, connections, and logs.

Hands-on example

Forecast:

Free storage = 500 GB

Growth = 20 GB/day

Days remaining = 25

Threshold = 45 days, so act now.

SQL:

SELECT relname, pg_size_pretty(pg_total_relation_size(oid)) FROM pg_class WHERE relkind = 'r' ORDER BY pg_total_relation_size(oid) DESC LIMIT 20;

Final Revision Checklist

Answer the first section in 60 to 90 seconds.

Use hands-on examples to prove production experience.

Mention monitoring, rollback, and validation for every risky change.

Tie caching answers to consistency and failure modes.

Tie RDS answers to HA, backups, performance, and safe operations.

← All interview topics