Observability interview questions & answers

100 Observability interview questions, each answered three ways: a concise spoken answer, a technical explanation, and a hands-on example.

All questions

What is observability, and how is it different from traditional monitoring? [Basic]
What are the three pillars of observability (metrics, logs, traces)? [Basic]
What is the difference between monitoring and observability in practice? [Basic]
What are the four golden signals of monitoring? [Basic]
What is the difference between the USE method and the RED method? [Basic]
When would you use the USE method versus the RED method? [Basic]
What is an SLI, an SLO, and an SLA, and how do they relate? [Basic]
How do you choose good SLIs for a service? [Basic]
How do you set an SLO target, and why not just aim for 100%? [Basic]
What is an error budget, and how do you use it to balance reliability and velocity? [Basic]
What happens operationally when an error budget is exhausted? [Basic]
What is a burn rate, and how do you alert on it? [Basic]
Why are multi-window, multi-burn-rate alerts better than a single threshold? [Basic]
What is the difference between a symptom-based and a cause-based alert, and which is better? [Basic]
What makes a good alert, and how do you avoid alert fatigue? [Basic]
What is Prometheus, and what is its data model? [Basic]
What is a time series in Prometheus, and how is it identified (metric name plus labels)? [Basic]
What are the Prometheus metric types (counter, gauge, histogram, summary)? [Basic]
What is the difference between a counter and a gauge? [Basic]
What is the difference between a histogram and a summary, and the trade-offs? [Basic]
How does Prometheus collect metrics — what is the pull model and scraping? [Basic]
What is a scrape target, and how does service discovery find them? [Basic]
What is an exporter, and name a few common ones (node_exporter, etc.)? [Basic]
What is the difference between an exporter and instrumenting your app directly? [Basic]
What is the Pushgateway, and why is it discouraged for most cases? [Basic]
What is PromQL, and what is an instant vector versus a range vector? [Basic]
How does the rate() function work, and why use it on counters? [Basic]
What is the difference between rate() and irate()? [Basic]
How do you compute a 95th percentile latency from a histogram in PromQL (histogram_quantile)? [Basic]
What is aggregation in PromQL (sum, avg, by, without)? [Basic]
What is a recording rule, and when would you use one? [Basic]
What is an alerting rule, and how does it differ from a recording rule? [Basic]
What is Alertmanager, and what does it handle that Prometheus does not? [Basic]
How does Alertmanager grouping, inhibition, and silencing work? [Intermediate]
What is alert routing, and how do you send different alerts to different teams? [Intermediate]
How does Prometheus handle high cardinality, and why is it a problem? [Intermediate]
What causes a cardinality explosion, and how do you prevent it? [Intermediate]
How does Prometheus storage (TSDB) work, and what is the retention model? [Intermediate]
How do you scale Prometheus for long-term storage and high availability (Thanos, Cortex, Mimir)? [Intermediate]
What is the difference between Thanos and Cortex/Mimir at a high level? [Intermediate]
How do you make Prometheus highly available? [Intermediate]
How does Prometheus integrate with Kubernetes service discovery? [Intermediate]
What is the Prometheus Operator, and what are ServiceMonitor and PodMonitor? [Intermediate]
How do you instrument an application with a Prometheus client library? [Intermediate]
What labels should and should not be put on a metric? [Intermediate]
How do you alert on something that should be happening but is not (absence of data)? [Intermediate]
What is the difference between black-box and white-box monitoring? [Intermediate]
What is a synthetic / black-box check, and what would you monitor with one? [Intermediate]
How do you monitor a batch job or cron that runs infrequently? [Intermediate]
What is Grafana, and how does it relate to Prometheus? [Intermediate]
How do you design a useful dashboard, and what is the difference from an alert? [Intermediate]
What is Splunk, and what is it primarily used for? [Intermediate]
What is the Splunk data pipeline (input, parsing, indexing, search)? [Intermediate]
What is an index in Splunk, and how do you decide indexing strategy? [Intermediate]
What is the difference between a Splunk forwarder, indexer, and search head? [Intermediate]
What is the difference between a universal and a heavy forwarder? [Intermediate]
What is SPL (Search Processing Language)? [Intermediate]
What is the difference between search-time and index-time field extraction? [Intermediate]
Why is index-time configuration expensive, and when do you use it? [Intermediate]
How do you write an efficient Splunk search, and why filter early? [Intermediate]
What is the role of the stats, eval, and timechart commands in SPL? [Intermediate]
What is the difference between stats and eventstats? [Intermediate]
What is a Splunk source, sourcetype, and host? [Intermediate]
How do you control Splunk costs and license/ingest volume? [Intermediate]
What is data sampling or filtering at ingest, and why does it matter for cost? [Intermediate]
How do you reduce noisy or low-value log ingestion? [Intermediate]
What is a Splunk saved search and a scheduled alert? [Advanced]
How do you build a Splunk dashboard, and when is it better than Grafana? [Advanced]
What is data retention and a bucket lifecycle (hot, warm, cold, frozen) in Splunk? [Advanced]
How do you correlate logs and metrics during an incident? [Advanced]
What is Wavefront (Tanzu Observability), and what is it used for? [Advanced]
What is the Wavefront data model and query language (WQL)? [Advanced]
How does Wavefront ingest metrics (proxy, direct ingestion, collectors)? [Advanced]
What is the Wavefront proxy, and why use it? [Advanced]
How does Wavefront handle high-cardinality metrics compared to Prometheus? [Advanced]
How do you build alerts in Wavefront, and what is a smart alert? [Advanced]
How do you do anomaly detection in Wavefront? [Advanced]
How would you choose between Prometheus, Splunk, and Wavefront for a given signal? [Advanced]
What is the difference between metrics-based and log-based alerting, and the cost implications? [Advanced]
What is a telemetry pipeline, and why might you put one in front of your backends? [Advanced]
What is OpenTelemetry, and what problem does it solve? [Advanced]
What is the difference between the OpenTelemetry SDK and the Collector? [Advanced]
How would you design ingestion controls to manage observability cost at scale? [Advanced]
How do you decide sampling rates for traces? [Advanced]
What is head-based versus tail-based sampling for traces? [Advanced]
What is distributed tracing, and what is a span and a trace? [Advanced]
How does context propagation work across services in tracing? [Advanced]
What is cardinality cost, and how does it differ between metrics and logs? [Advanced]
How do you design alerts that page a human only when action is required? [Advanced]
What is the difference between a page, a ticket, and a dashboard signal? [Advanced]
How would you reduce alert noise across many teams (deduplication, correlation, AIOps)? [Advanced]
What is event correlation, and how does it reduce incident noise? [Advanced]
How would you measure observability coverage across services? [Advanced]
How do you instrument a service so that an on-call engineer can debug it without code changes? [Advanced]
How would you build an SLO dashboard and tie alerts to error-budget burn? [Advanced]
How do you trigger automated remediation from an observability signal? [Advanced]
How would you investigate a latency spike using metrics, traces, and logs together? [Advanced]
What recent observability practice or tool have you adopted, and what improved? [Advanced]
How do you prevent a single noisy service from blowing up observability costs for everyone? [Advanced]
How would you run a monthly operational review using observability data and SLO trends? [Advanced]

What is observability, and how is it different from traditional monitoring? [Basic]

Answer

Observability is the ability to understand the internal state of a system from the signals it emits. Traditional monitoring tells me whether known checks are healthy; observability lets me ask new questions during unknown failure modes using metrics, logs, traces, events, and context.

Technical explanation

Monitoring is usually built around predefined dashboards and thresholds such as CPU greater than 80 percent or HTTP 5xx greater than 2 percent.

Observability focuses on debuggability: high-quality telemetry, useful dimensions, service ownership, correlation IDs, and enough context to explain why something is happening.

In SRE terms, monitoring is best treated as one part of observability rather than a competing practice. A mature platform uses both: alerts for known user-impacting symptoms and exploratory telemetry for investigation.

Hands-on example

Hands-on: for a checkout service, expose request_count, request_duration, and error_count metrics, emit structured JSON logs with trace_id and order_id_hash, and propagate W3C trace context. When latency spikes, start from the SLO alert, open the latency dashboard, jump to slow traces for checkout to payment, then inspect only the correlated logs for those trace IDs.
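
A minimal Python sketch of the log side of this example, assuming the trace ID arrives in a W3C `traceparent` header of the version-00 form `version-traceid-parentid-flags`. The function names, the 16-character hash truncation, and the extra `service` field are illustrative assumptions, not part of any library.

```python
import hashlib
import json


def trace_id_from_traceparent(header):
    """Return the 32-hex-character trace ID from a version-00 traceparent header, or None."""
    parts = header.strip().split("-")
    if len(parts) != 4 or len(parts[1]) != 32:
        return None
    return parts[1]


def hash_order_id(order_id):
    """Hash the order ID so logs can be correlated without storing the raw value."""
    return hashlib.sha256(order_id.encode("utf-8")).hexdigest()[:16]


def make_log_line(message, trace_id, order_id, **fields):
    """Build one structured JSON log line carrying trace_id and order_id_hash."""
    record = {
        "message": message,
        "trace_id": trace_id,
        "order_id_hash": hash_order_id(order_id),
    }
    record.update(fields)
    return json.dumps(record, sort_keys=True)


if __name__ == "__main__":
    tid = trace_id_from_traceparent(
        "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
    )
    print(make_log_line("payment timeout", tid, "order-123", service="checkout"))
```

Because every line carries the same trace_id as the spans, the responder can filter logs down to the slow traces instead of searching the whole stream.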

What are the three pillars of observability (metrics, logs, traces)? [Basic]

Answer

The three classic pillars are metrics, logs, and traces. Metrics show numeric trends over time, logs provide event-level detail, and traces show the path and timing of a request across services.

Technical explanation

Metrics are low-cardinality time series, good for alerting, capacity planning, SLOs, and trend analysis.

Logs are discrete records, good for explaining a specific event, error, state transition, or audit trail.

Traces connect spans from multiple services so we can see where time was spent and which dependency caused delay or failure.

Hands-on example

Example: a payment timeout appears as an elevated p95 latency metric, the trace shows checkout spent 1.8 seconds in payment-authorize, and the payment logs for the trace_id show an upstream gateway timeout. The three signals together give both scope and root-cause evidence.

What is the difference between monitoring and observability in practice? [Basic]

Answer

In practice, monitoring answers known questions like 'is the service up?', while observability helps answer unknown questions like 'why is only one tenant seeing latency after this deployment?'. Monitoring is alerting and visibility; observability is investigation capability.

Technical explanation

A monitoring-only setup often has many infrastructure alerts but weak request context, so incidents require guessing.

An observable system includes service-level signals, consistent labels, trace propagation, structured logs, deployment markers, and ownership metadata.

The practical test is whether an on-call engineer can debug a novel production issue without adding code or waiting for another deployment.

Hands-on example

Hands-on checklist: trigger a test failure, then try to answer: affected endpoint, affected version, affected region, customer impact, slow dependency, deploy correlation, and error class. If the telemetry cannot answer these, the system is monitored but not truly observable.

What are the four golden signals of monitoring? [Basic]

Answer

The four golden signals are latency, traffic, errors, and saturation. They are a strong starting point for service monitoring because they map closely to user experience and system limits.

Technical explanation

Latency measures how long requests take, usually with percentiles such as p50, p95, and p99.

Traffic measures demand, such as requests per second, messages per second, or bytes per second.

Errors measure failed work, and saturation measures how close a resource is to exhaustion, such as CPU, memory, thread pools, queues, or connection pools.

Hands-on example

PromQL examples: sum(rate(http_requests_total[5m])) for traffic, sum(rate(http_requests_total{status=~'5..'}[5m])) / sum(rate(http_requests_total[5m])) for error rate, and histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) for latency. For saturation, node_cpu_seconds_total is a counter, so derive utilization from it, for example 1 - avg(rate(node_cpu_seconds_total{mode='idle'}[5m])), or use a gauge such as queue_depth directly.

What is the difference between the USE method and the RED method? [Basic]

Answer

The USE method focuses on resources: utilization, saturation, and errors. The RED method focuses on request-driven services: rate, errors, and duration. USE is best for infrastructure components; RED is best for user-facing or request-serving services.

Technical explanation

USE asks whether a resource is busy, overloaded, or failing. It fits nodes, disks, NICs, JVM pools, database connections, and queues.

RED asks how much work is coming in, how many requests fail, and how long they take. It fits APIs, microservices, and RPC dependencies.

Both methods complement each other: RED detects user symptoms, while USE helps explain underlying resource causes.

Hands-on example

Example: if checkout p95 latency rises, RED tells you the service symptom. Then USE on the pods shows CPU throttling and queue saturation. The fix may be to raise CPU requests/limits, optimize code, or scale replicas.

When would you use the USE method versus the RED method? [Basic]

Answer

I use RED when I am monitoring a service or API from the user/request perspective, and USE when I am monitoring the health of an underlying resource. During incidents, I usually start with RED and drill into USE.

Technical explanation

For an HTTP service: request rate, error rate, and duration are the primary service health indicators.

For a host, disk, cache, broker, or database pool: utilization, saturation, and errors are more meaningful.

In dashboards, I group them by layer: service RED at the top, dependency RED next, and infrastructure USE below.

Hands-on example

Hands-on: create one dashboard row for checkout RED: RPS, 5xx percent, p95/p99 latency. Create another row for USE: CPU throttling, memory pressure, JVM heap, DB connection pool usage, and Kafka lag. When alerts fire, the layout supports top-down troubleshooting.

What is an SLI, an SLO, and an SLA, and how do they relate? [Basic]

Answer

An SLI is the measurement, an SLO is the target, and an SLA is a contractual promise. For example, request availability is an SLI, 99.9 percent monthly availability is an SLO, and customer credits for missing 99.9 percent are part of an SLA.

Technical explanation

SLIs should measure user-visible reliability, not only infrastructure health.

SLOs are internal reliability objectives that guide engineering decisions and alerting.

SLAs are external agreements with legal or financial consequences, so they are often less aggressive than internal SLOs.

Hands-on example

Example: define successful checkout requests as HTTP 2xx/3xx completed within 500 ms. SLI = good requests / total eligible requests. SLO = 99.5 percent over 28 days. SLA = 99.0 percent availability with customer credits. Alert on burn rate against the SLO, not on raw CPU.

How do you choose good SLIs for a service? [Basic]

Answer

Good SLIs are user-centric, measurable, attributable, and hard to game. I choose SLIs that represent the experience users actually care about: availability, latency, correctness, freshness, and durability depending on the service.

Technical explanation

For synchronous APIs, good SLIs are success ratio and latency below a threshold.

For pipelines, good SLIs include freshness, completeness, and processing delay.

Avoid SLIs that only measure internals, such as pod count or CPU, unless the user impact is direct and proven.

Hands-on example

Hands-on: for an order API, define good events as POST /orders returning 2xx within 750 ms, excluding client 4xx validation errors. In Prometheus, create a numerator for good requests and a denominator for total eligible requests, then graph the ratio by service and environment.
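
The good/eligible split above can be sketched in Python as follows. This assumes "within 750 ms" means less than or equal to 750 ms and that all 4xx responses are excluded from the denominator; the `Request` type and `sli_ratio` name are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int
    duration_ms: float


def sli_ratio(requests, latency_threshold_ms=750.0):
    """Return good / eligible, or None when there are no eligible requests."""
    # Client 4xx errors are excluded from the SLI entirely.
    eligible = [r for r in requests if not 400 <= r.status < 500]
    if not eligible:
        return None
    good = sum(
        1
        for r in eligible
        if 200 <= r.status < 300 and r.duration_ms <= latency_threshold_ms
    )
    return good / len(eligible)


if __name__ == "__main__":
    sample = [
        Request(201, 120.0),  # good
        Request(201, 900.0),  # eligible but too slow
        Request(400, 30.0),   # excluded: client validation error
        Request(503, 80.0),   # eligible but failed
    ]
    print(sli_ratio(sample))
```

In Prometheus the same split usually becomes two counters (or one counter with a bounded outcome label), so the ratio can be computed with rate() over any window.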

How do you set an SLO target, and why not just aim for 100%? [Basic]

Answer

I set an SLO based on user expectations, business impact, dependency limits, historical performance, and the cost of reliability. I do not aim for 100 percent because perfect reliability is usually impossible, extremely expensive, and slows safe change.

Technical explanation

A good SLO is strict enough that meeting it keeps users satisfied, yet realistic enough to allow releases, maintenance, and controlled failure.

100 percent creates a zero error budget, meaning any failure would block change even when users are not materially impacted.

SLOs should be revisited as architecture, traffic, and customer expectations change.

Hands-on example

Example: historical checkout availability is 99.94 percent. Set an initial SLO of 99.9 percent over 28 days, leaving about 40 minutes of error budget per 28-day window (0.1 percent of 40,320 minutes). If the team burns less than 25 percent of budget for several quarters, consider raising the target.
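
The budget arithmetic is easy to check with a short Python sketch. It assumes a time-based availability SLO; the function name is illustrative.

```python
def error_budget_minutes(slo_target, window_days):
    """Minutes of allowed unavailability for a time-based SLO over the window."""
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


if __name__ == "__main__":
    # 99.9 percent over a 28-day window versus a 30-day month.
    print(round(error_budget_minutes(0.999, 28), 2))
    print(round(error_budget_minutes(0.999, 30), 2))
```

A 28-day window gives roughly 40.3 minutes, while a 30-day month gives roughly 43.2 minutes, so it is worth stating the window whenever a budget is quoted in minutes.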

What is an error budget, and how do you use it to balance reliability and velocity? [Basic]

Answer

An error budget is the allowed amount of unreliability under an SLO. It balances reliability and delivery speed: when the budget is healthy, teams can ship normally; when it is being burned too fast, reliability work takes priority.

Technical explanation

For a 99.9 percent availability SLO, the error budget is 0.1 percent of eligible requests or time in the SLO window.

Error budgets turn reliability from an opinion into an engineering control loop.

They help avoid both extremes: reckless feature velocity and over-investment in unnecessary reliability.

Hands-on example

Hands-on: create a 28-day SLO for checkout. If the burn rate is below 1x, continue normal releases. If a deployment consumes 30 percent of the 28-day budget in one hour, freeze risky releases, roll back, and require a post-incident reliability fix before continuing.

What happens operationally when an error budget is exhausted? [Basic]

Answer

When an error budget is exhausted, the team should reduce change risk and focus on restoring reliability. That usually means freezing non-critical releases, prioritizing incident fixes, improving tests or rollback, and reviewing whether the SLO or architecture is appropriate.

Technical explanation

The goal is not punishment; it is a safety mechanism that aligns product and engineering around user impact.

Actions should be predefined in an error-budget policy so decisions are not negotiated during an incident.

Once the burn rate returns to normal and corrective work is complete, normal release velocity can resume.

Hands-on example

Example policy: if budget remaining is below 10 percent, only emergency fixes ship. If budget is negative, require leadership approval for releases, complete root-cause actions, add missing alerts/runbooks, and review top reliability risks in the next weekly ops meeting.
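
Writing the policy down as code makes the thresholds unambiguous. This is a minimal sketch of the example policy above, with the remaining budget expressed as a fraction of the full budget; the function name and returned strings are assumptions for illustration.

```python
def release_policy(budget_remaining_fraction):
    """Map remaining error budget (1.0 = untouched, negative = overspent) to a release rule."""
    if budget_remaining_fraction < 0:
        return "leadership approval required"
    if budget_remaining_fraction < 0.10:
        return "emergency fixes only"
    return "normal releases"


if __name__ == "__main__":
    for remaining in (0.60, 0.05, -0.20):
        print(remaining, "->", release_policy(remaining))
```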

What is a burn rate, and how do you alert on it? [Basic]

Answer

Burn rate is how fast a service is consuming its error budget compared with the allowed rate. A 2x burn rate means the service is consuming budget twice as fast as planned; a 14x burn rate means it can exhaust a 28-day budget in about two days.

Technical explanation

Burn-rate alerts are more SLO-aligned than raw error thresholds because they account for the reliability target and time window.

Fast burn detects severe incidents quickly; slow burn catches smaller issues that accumulate over hours or days.

The alert expression usually divides current error ratio by the allowed error ratio.

Hands-on example

PromQL sketch: error_ratio = sum(rate(http_requests_total{status=~'5..'}[5m])) / sum(rate(http_requests_total[5m])). For a 99.9 percent SLO, allowed error ratio is 0.001. burn_rate = error_ratio / 0.001. Page on high burn over short and medium windows.
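
The same calculation as a Python sketch, useful for sanity-checking alert thresholds. The function names are illustrative; the inputs are an observed error ratio and the SLO target.

```python
def burn_rate(error_ratio, slo_target):
    """Observed error ratio divided by the error ratio the SLO allows."""
    allowed = 1.0 - slo_target
    if allowed <= 0:
        raise ValueError("a 100 percent SLO leaves no error budget to burn")
    return error_ratio / allowed


def days_to_exhaustion(rate, window_days):
    """Days until the whole budget is gone if this burn rate is sustained."""
    if rate <= 0:
        raise ValueError("burn rate must be positive")
    return window_days / rate


if __name__ == "__main__":
    rate = burn_rate(error_ratio=0.014, slo_target=0.999)
    print(round(rate, 2), round(days_to_exhaustion(rate, 28), 2))
```

With a 1.4 percent error ratio against a 99.9 percent SLO, the burn rate is about 14x, which would exhaust a 28-day budget in about two days.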

Why are multi-window, multi-burn-rate alerts better than a single threshold? [Basic]

Answer

Multi-window, multi-burn-rate alerts are better because they catch fast, severe incidents quickly while avoiding noise from brief spikes. They also catch slow burns that would not page immediately but would still exhaust the error budget.

Technical explanation

A single threshold can be too noisy during short spikes or too slow during gradual degradation.

Pairing a short window with a longer confirmation window improves precision.

Common patterns include fast-burn pages and slow-burn tickets or lower-urgency alerts.

Hands-on example

Example: page when burn rate is greater than 14x for both 5 minutes and 1 hour. Create a ticket when burn rate is greater than 3x for both 30 minutes and 6 hours. This catches real budget threats without waking people for harmless one-minute blips.
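
A small Python sketch of that decision logic, assuming burn rates have already been computed per window. The dictionary keys and the `classify` name are hypothetical; the thresholds are the ones from the example.

```python
def classify(burn_by_window):
    """Return 'page', 'ticket', or 'none' from burn rates keyed by window."""
    # Fast burn: both the short and the long window must exceed the threshold.
    if burn_by_window["5m"] > 14 and burn_by_window["1h"] > 14:
        return "page"
    # Slow burn: lower threshold, longer windows, lower urgency.
    if burn_by_window["30m"] > 3 and burn_by_window["6h"] > 3:
        return "ticket"
    return "none"


if __name__ == "__main__":
    brief_spike = {"5m": 20.0, "1h": 2.0, "30m": 4.0, "6h": 0.5}
    sustained_outage = {"5m": 16.0, "1h": 16.0, "30m": 16.0, "6h": 5.0}
    slow_leak = {"5m": 4.0, "1h": 4.0, "30m": 4.0, "6h": 3.5}
    for name, burns in [
        ("brief_spike", brief_spike),
        ("sustained_outage", sustained_outage),
        ("slow_leak", slow_leak),
    ]:
        print(name, "->", classify(burns))
```

The brief spike exceeds the 5-minute threshold but not the 1-hour one, so nobody is paged; the short window mainly makes the alert stop firing quickly once the problem is fixed.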

What is the difference between a symptom-based and a cause-based alert, and which is better? [Basic]

Answer

A symptom-based alert fires on user-visible impact, such as high error rate or missed latency SLO. A cause-based alert fires on a suspected reason, such as CPU high or disk almost full. For paging, symptom-based alerts are usually better; cause alerts are useful for tickets and diagnostics.

Technical explanation

Symptom alerts are less noisy because they correspond to user pain and require action.

Cause alerts can be valuable when a condition will definitely become user-impacting, such as disk full in 30 minutes.

Good alerting separates page-worthy symptoms from dashboard or ticket-worthy causes.

Hands-on example

Example: do not page only because CPU is 85 percent. Page because checkout error-budget burn is high. Put CPU, memory, throttling, DB pool, and queue depth on the runbook dashboard so the responder can find the cause after the page.

What makes a good alert, and how do you avoid alert fatigue? [Basic]

Answer

A good alert is actionable, urgent, owned, accurate, and tied to user impact. To avoid alert fatigue, I page only for conditions that require immediate human action and route non-urgent issues to tickets or dashboards.

Technical explanation

Every page should have a clear owner, severity, runbook, dashboard link, and expected first action.

Deduplicate related alerts and inhibit downstream noise when a known upstream dependency is failing.

Review noisy alerts after incidents and regularly delete alerts that are not useful.

Hands-on example

Hands-on alert review: export last 30 days of pages, group by alert name and service, calculate pages per service and percent actionable, then remove or downgrade alerts with no action taken. Add runbooks to the top 10 remaining alerts.
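
The review can be scripted once the pages are exported. This Python sketch assumes each exported page is a dict with hypothetical `alert`, `service`, and `actionable` fields; real paging tools use their own export formats.

```python
from collections import defaultdict


def summarize_pages(pages):
    """Group pages by (alert, service) and compute count and percent actionable."""
    totals = defaultdict(lambda: [0, 0])  # key -> [page count, actionable count]
    for page in pages:
        key = (page["alert"], page["service"])
        totals[key][0] += 1
        if page["actionable"]:
            totals[key][1] += 1
    return {
        key: {"pages": count, "percent_actionable": 100.0 * acted / count}
        for key, (count, acted) in totals.items()
    }


def downgrade_candidates(summary):
    """Alerts where no page led to any action: remove or downgrade these first."""
    return sorted(k for k, v in summary.items() if v["percent_actionable"] == 0.0)


if __name__ == "__main__":
    pages = [
        {"alert": "HighCPU", "service": "checkout", "actionable": False},
        {"alert": "HighCPU", "service": "checkout", "actionable": False},
        {"alert": "SLOBurn", "service": "checkout", "actionable": True},
        {"alert": "SLOBurn", "service": "checkout", "actionable": False},
    ]
    summary = summarize_pages(pages)
    print(summary)
    print(downgrade_candidates(summary))
```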

What is Prometheus, and what is its data model? [Basic]

Answer

Prometheus is an open-source monitoring and alerting system built around a time-series data model. Each sample belongs to a metric name plus a set of labels, and PromQL is used to query and aggregate those series.

Technical explanation

Prometheus scrapes metrics over HTTP, stores them in a local TSDB, evaluates rules, and sends alerts to Alertmanager.

The data model is dimensional: labels such as job, instance, namespace, pod, method, and status allow slicing and aggregation.

It is strong for metrics and alerting, but long-term retention and global querying often require systems such as Thanos, Cortex, or Mimir.

Hands-on example

Minimal scrape config:

scrape_configs:
  - job_name: 'checkout'
    static_configs:
      - targets: ['checkout.default.svc:9100']

Then query: up{job='checkout'} and rate(http_requests_total{job='checkout'}[5m]).

What is a time series in Prometheus, and how is it identified (metric name plus labels)? [Basic]

Answer

A Prometheus time series is a stream of timestamped samples uniquely identified by the metric name and the full label set. Changing any label value creates a different time series.

Technical explanation

For example, http_requests_total{method='GET',status='200',pod='a'} and http_requests_total{method='GET',status='500',pod='a'} are different series.

This model is powerful for aggregation but dangerous if labels have unbounded values.

Prometheus stores numeric samples; labels are metadata used for selection and grouping.

Hands-on example

Hands-on: query count by (__name__) ({job='checkout'}) to see how many series each metric name has, then count without (instance, pod) (http_requests_total) to see how many series collapse into each remaining label combination once the per-pod labels are aggregated away. Never add user_id, request_id, or order_id as metric labels because each value creates a new series.
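
To make the identity rule concrete, here is a Python sketch that models a series key as the metric name plus the full label set, and estimates the worst-case series count as the product of distinct values per label. This is an illustration of the concept, not how Prometheus stores series internally; the function names and value counts are assumptions.

```python
def series_key(name, labels):
    """A series is identified by its metric name plus the complete label set."""
    return (name, frozenset(labels.items()))


def worst_case_series(distinct_values_per_label):
    """Upper bound on series count: product of distinct values for each label."""
    total = 1
    for count in distinct_values_per_label.values():
        total *= count
    return total


if __name__ == "__main__":
    a = series_key("http_requests_total", {"method": "GET", "status": "200", "pod": "a"})
    b = series_key("http_requests_total", {"method": "GET", "status": "500", "pod": "a"})
    print(a == b)  # one label value differs, so these are different series

    bounded = {"method": 5, "status": 10, "pod": 20}
    print(worst_case_series(bounded))
    print(worst_case_series({**bounded, "user_id": 100_000}))
```

Adding a single unbounded label multiplies the bound by its number of distinct values, which is why IDs belong in logs and traces rather than metric labels.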

What are the Prometheus metric types (counter, gauge, histogram, summary)? [Basic]

Answer

The main Prometheus metric types are counter, gauge, histogram, and summary. Counters only increase, gauges go up and down, histograms bucket observations, and summaries calculate quantiles on the client side.

Technical explanation

Counters fit totals such as requests, errors, retries, and bytes processed.

Gauges fit current values such as memory usage, queue depth, in-flight requests, and temperature.

Histograms and summaries are used for distributions such as latency or payload size, with histograms generally preferred when server-side aggregation is needed.

Hands-on example

Instrumentation example: create http_requests_total as a counter, in_flight_requests as a gauge, and http_request_duration_seconds as a histogram with buckets matching SLO thresholds such as 0.1, 0.25, 0.5, 1, and 2.5 seconds.

What is the difference between a counter and a gauge? [Basic]

Answer

A counter represents a cumulative value that only increases except when the process restarts. A gauge represents a current value that can go up or down.

Technical explanation

Use rate() or increase() on counters to calculate activity over time.

Use gauges directly or with avg_over_time, max_over_time, or min_over_time depending on the question.

Using a gauge for request totals or a counter for queue depth creates misleading dashboards and alerts.

Hands-on example

Example: http_requests_total is a counter, so query rate(http_requests_total[5m]). queue_depth is a gauge, so query max_over_time(queue_depth[10m]) or avg(queue_depth) by (queue). Do not apply rate() to queue_depth unless it explicitly represents a cumulative total.

What is the difference between a histogram and a summary, and the trade-offs? [Basic]

Answer

Histograms bucket observations and allow server-side aggregation and percentile calculation with histogram_quantile. Summaries calculate quantiles in the client and are harder to aggregate across instances. I usually prefer histograms for service latency in distributed systems.

Technical explanation

Histograms produce bucket time series such as le='0.5', le='1', and le='+Inf'.

Summaries can provide accurate client-side quantiles for one process but cannot be correctly averaged across replicas.

Histogram bucket choice matters: buckets should align to user-relevant thresholds and SLO objectives.

Hands-on example

PromQL: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)). This gives p95 per service across all replicas, which is a key reason histograms are preferred over summaries for fleet-level dashboards.
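
To show why buckets can be aggregated while precomputed quantiles cannot, here is a simplified Python sketch of quantile estimation from cumulative buckets. It follows the same linear-interpolation idea as histogram_quantile but is not Prometheus's implementation; the function names are illustrative, and it assumes 0 < q <= 1, a lowest bucket that starts at 0, and at least one finite bucket plus +Inf.

```python
import math


def merge_replicas(replica_buckets):
    """Sum cumulative bucket counts per upper bound across replicas."""
    merged = {}
    for buckets in replica_buckets:
        for bound, count in buckets:
            merged[bound] = merged.get(bound, 0) + count
    return sorted(merged.items())


def estimate_quantile(q, buckets):
    """Estimate a quantile from (upper_bound, cumulative_count) pairs."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    if len(buckets) < 2 or buckets[-1][0] != math.inf:
        raise ValueError("need at least one finite bucket and a +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == math.inf:
                # The quantile falls in the overflow bucket: report the
                # highest finite bound, since no upper limit is known.
                return buckets[-2][0]
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


if __name__ == "__main__":
    replica_a = [(0.1, 30), (0.25, 45), (0.5, 50), (1.0, 50), (math.inf, 50)]
    replica_b = [(0.1, 20), (0.25, 35), (0.5, 45), (1.0, 49), (math.inf, 50)]
    fleet = merge_replicas([replica_a, replica_b])
    print(estimate_quantile(0.95, fleet))
```

The estimate is only as precise as the bucket boundaries, which is why buckets should sit close to the SLO thresholds that matter.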

How does Prometheus collect metrics — what is the pull model and scraping? [Basic]

Answer

Prometheus normally uses a pull model: it periodically scrapes HTTP endpoints exposed by targets. Each target publishes metrics in the Prometheus exposition format, and Prometheus stores the scraped samples in its TSDB.

Technical explanation

The pull model makes target health visible through the up metric and keeps scraping configuration under platform control.

Service discovery dynamically finds targets from Kubernetes, EC2, Consul, file configs, or static configs.

Push is reserved for limited cases such as short-lived batch jobs that cannot be scraped reliably.

Hands-on example

Example: an app exposes /metrics on port 8080. Prometheus discovers the pod through Kubernetes labels, scrapes every 30 seconds, and records up{job='checkout'} plus application metrics. If scraping fails, up becomes 0 and target status shows the scrape error.
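
What the target serves on /metrics is plain text. The Python sketch below renders a metric in the text exposition format by hand to show its shape; in practice a client library generates this. The `render_metric` helper is hypothetical and skips label-value escaping for brevity.

```python
def render_metric(name, metric_type, help_text, samples):
    """Render one metric family as exposition text.

    samples is a list of (labels_dict, value) pairs.
    """
    lines = [
        f"# HELP {name} {help_text}",
        f"# TYPE {name} {metric_type}",
    ]
    for labels, value in samples:
        if labels:
            label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(
        render_metric(
            "http_requests_total",
            "counter",
            "Total HTTP requests.",
            [
                ({"method": "GET", "status": "200"}, 42),
                ({"method": "GET", "status": "500"}, 3),
            ],
        ),
        end="",
    )
```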

What is a scrape target, and how does service discovery find them? [Basic]

Answer

A scrape target is an endpoint Prometheus will scrape for metrics, typically host:port plus a metrics path. Service discovery finds targets automatically and attaches labels that can be relabeled before storage.

Technical explanation

In Kubernetes, targets can come from pods, services, endpoints, EndpointSlices, or custom resources through the Prometheus Operator.

Relabeling controls which targets are kept and how discovery metadata becomes Prometheus labels.

Good target labels should identify job, instance, namespace, service, pod, and environment without adding unbounded values.

Hands-on example

Hands-on: open Prometheus UI -> Status -> Targets. Verify the checkout target is UP, check last scrape duration and error, then query up{namespace='prod', service='checkout'}. If the target is missing, inspect ServiceMonitor labels and endpoint port names.

What is an exporter, and name a few common ones (node_exporter, etc.)? [Basic]

Answer

An exporter is a process that exposes metrics for software that does not natively expose Prometheus metrics. Common exporters include node_exporter, blackbox_exporter, mysqld_exporter, postgres_exporter, redis_exporter, and kube-state-metrics.

Technical explanation

Exporters translate system, database, queue, or appliance statistics into Prometheus metrics.

They are useful when you cannot or should not modify the monitored application.

Exporter quality matters because bad label design or expensive collection can create operational problems.

Hands-on example

Example: deploy node_exporter as a DaemonSet to expose CPU, memory, disk, and network metrics for every node. Prometheus scrapes each node_exporter, then dashboards show node_filesystem_avail_bytes, node_cpu_seconds_total, and node_network_receive_bytes_total.

What is the difference between an exporter and instrumenting your app directly? [Basic]

Answer

An exporter exposes metrics from an external system or runtime, while direct instrumentation adds metrics inside the application code. Exporters are good for infrastructure and third-party systems; direct instrumentation is better for business and service-level behavior.

Technical explanation

Exporters can show Redis memory, node CPU, or database connection stats, but they cannot know that a checkout failed due to a payment validation rule.

Direct instrumentation captures domain-specific metrics such as orders_created_total, payment_authorization_duration_seconds, and business error classes.

A strong observability design uses both exporter metrics and application metrics.

Hands-on example

Hands-on: use redis_exporter for cache hit ratio and memory fragmentation. Instrument the checkout application directly for checkout_attempts_total, checkout_success_total, and dependency latency. Alert on checkout SLO first, then use Redis exporter metrics for diagnosis.

What is the Pushgateway, and why is it discouraged for most cases? [Basic]

Answer

The Pushgateway lets short-lived jobs push metrics that Prometheus can later scrape. It is discouraged for most cases because it bypasses normal target health semantics, can become a bottleneck, and can leave stale metrics if lifecycle cleanup is not handled.

Technical explanation

Prometheus recommends the pull model for most services because it naturally exposes target availability through up.

Pushgateway is appropriate for service-level batch job results, not per-instance machine metrics or long-running services.

If used, metrics should include grouping keys carefully and be deleted when the job is no longer relevant.

Hands-on example

Example: a nightly reconciliation job pushes reconciliation_last_success_timestamp_seconds and reconciliation_records_processed_total to Pushgateway. Alert if time() - reconciliation_last_success_timestamp_seconds > 27 * 3600 (27 hours, in seconds). Do not push per-pod CPU metrics or per-request metrics through Pushgateway.

What is PromQL, and what is an instant vector versus a range vector? [Basic]

Answer

PromQL is Prometheus's query language. An instant vector is a set of time series at one evaluation timestamp; a range vector is a set of time series with samples over a time window such as 5 minutes.

Technical explanation

Instant vectors are used for current values, aggregations, and alert expressions at a point in time.

Range vectors are required by functions such as rate(), increase(), avg_over_time(), max_over_time(), and histogram calculations.

Understanding vector types prevents common errors such as applying sum directly to a range vector.

Hands-on example

Examples: http_requests_total is an instant vector. http_requests_total[5m] is a range vector. rate(http_requests_total[5m]) converts the range vector into an instant vector of per-second rates, which can then be aggregated with sum by (service).

How does the rate() function work, and why use it on counters? [Basic]

Answer

rate() calculates the per-second average increase of a counter over a range window and accounts for counter resets. It should be used on counters because raw counters only show lifetime totals and are not useful for current traffic or error rate.

Technical explanation

Counters reset when a process restarts, and rate() handles that reset logic.

The range should be several scrape intervals long; for a 30-second scrape interval, 5 minutes is a common starting point.

Use increase() when you want the total increase over the window rather than a per-second rate.

Hands-on example

Example: requests per second = sum(rate(http_requests_total[5m])) by (service). Five-minute error percentage = 100 * sum(rate(http_requests_total{status=~'5..'}[5m])) / sum(rate(http_requests_total[5m])).
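To make the reset handling concrete, here is a minimal Python sketch of the idea. It is a simplified model, not Prometheus's implementation: it skips the extrapolation to window boundaries that the real rate() applies, and simple_increase and simple_rate are hypothetical names.

```python
def simple_increase(samples):
    """Total counter increase across (timestamp, value) samples, treating any drop as a reset."""
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A counter only goes up, so a lower value means the process restarted
        # and the counter began again from zero: count the full new value.
        total += cur - prev if cur >= prev else cur
    return total


def simple_rate(samples):
    """Average per-second increase between the first and last sample."""
    elapsed = samples[-1][0] - samples[0][0]
    return simple_increase(samples) / elapsed


# 30-second scrapes; the counter resets between t=60 and t=90.
window = [(0, 100), (30, 130), (60, 160), (90, 20), (120, 50)]
print(simple_increase(window))  # 110.0
print(simple_rate(window))      # 110 requests over 120 seconds
```

Subtracting the first raw value from the last would give a negative number here, which is why raw counters are never graphed directly.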

What is the difference between rate() and irate()? [Basic]

Answer

rate() uses all samples in the range and gives a smoothed average rate. irate() uses the last two samples and reacts faster, but it is noisier. I use rate() for alerts and dashboards, and reserve irate() for short-term debugging graphs.

Technical explanation

rate() is more stable and better for SLOs, alerting, and capacity trends.

irate() can reveal sudden spikes but can also produce false impressions when scrape intervals or traffic are uneven.

For low-traffic services, longer rate windows are usually better than irate().

Hands-on example

Hands-on: graph rate(container_cpu_usage_seconds_total[5m]) for normal CPU trend and irate(container_cpu_usage_seconds_total[1m]) while debugging a sudden CPU burst. Do not page on irate unless you have proven it is stable and actionable.
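The difference between the two functions is easy to see on a handful of samples. The Python sketch below is a simplified illustration (no reset handling, no extrapolation), and avg_rate and instant_rate are hypothetical names standing in for rate() and irate().

```python
def avg_rate(samples):
    """Per-second rate across the whole window, first sample to last."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)


def instant_rate(samples):
    """Per-second rate from only the last two samples, as irate() does."""
    (t0, v0), (t1, v1) = samples[-2], samples[-1]
    return (v1 - v0) / (t1 - t0)


# CPU-seconds counter scraped every 15s; a burst lands in the final interval.
cpu_seconds = [(0, 0.0), (15, 1.5), (30, 3.0), (45, 4.5), (60, 18.0)]
print(avg_rate(cpu_seconds))      # about 0.3 cores averaged over the minute
print(instant_rate(cpu_seconds))  # about 0.9 cores in the last 15 seconds
```

The same burst looks three times larger through the last-two-samples view, which is useful when debugging and risky as a paging signal.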

How do you compute a 95th percentile latency from a histogram in PromQL (histogram_quantile)? [Basic]

Answer

To compute p95 latency from a Prometheus histogram, apply rate() to the _bucket series, aggregate by le and the dimensions you want, then pass that to histogram_quantile(0.95, ...).

Technical explanation

The le label defines bucket boundaries and must be preserved until histogram_quantile runs.

For fleet-level latency, sum bucket rates across instances before calculating the quantile.

Bucket design controls accuracy; include buckets around your SLO thresholds.

Hands-on example

PromQL: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{service='checkout'}[5m])) by (le)). For p95 by route: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, route)).
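For intuition about why bucket design matters, the following Python sketch estimates a quantile from cumulative buckets by linear interpolation. It is a simplified model under stated assumptions (positive bounds, a final +Inf bucket, at least one observation); bucket_quantile and the sample rates are hypothetical.

```python
import math


def bucket_quantile(q, buckets):
    """Estimate a quantile from cumulative (upper_bound, count) buckets."""
    buckets = sorted(buckets)
    total = buckets[-1][1]  # the +Inf bucket counts every observation
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # The quantile lies past the last finite bucket: report that bound.
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count


# Hypothetical summed bucket rates for one service, keyed by le.
rates = [(0.1, 50.0), (0.5, 90.0), (1.0, 100.0), (math.inf, 100.0)]
print(bucket_quantile(0.95, rates))  # 0.75
```

The answer can only be as precise as the bucket it lands in: here p95 is somewhere between 0.5s and 1.0s, and the estimate is simply the interpolated midpoint.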

What is aggregation in PromQL (sum, avg, by, without)? [Basic]

Answer

Aggregation in PromQL combines series using operators such as sum, avg, min, max, count, topk, and quantile. The by clause keeps only the listed labels; without drops the listed labels and groups by the rest.

Technical explanation

sum by (service) groups all matching series into one result per service.

sum without(instance, pod) removes replica-level labels while keeping the other labels.

Correct aggregation is essential to avoid double counting or accidentally hiding a bad instance.

Hands-on example

Examples: sum(rate(http_requests_total[5m])) by (service) gives RPS per service. sum without(pod, instance) (rate(container_cpu_usage_seconds_total[5m])) aggregates away pod identity. topk(10, sum by (pod) (rate(container_cpu_usage_seconds_total[5m]))) finds top CPU consumers.
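The grouping behavior of by and without can be sketched in a few lines of Python. This is an illustration of the semantics only; sum_by, sum_without, and the sample series are hypothetical.

```python
from collections import defaultdict


def sum_by(series, keep):
    """sum by (keep...): group on the listed labels only."""
    out = defaultdict(float)
    for labels, value in series:
        # A missing label behaves like an empty value.
        out[tuple((k, labels.get(k, "")) for k in keep)] += value
    return dict(out)


def sum_without(series, drop):
    """sum without (drop...): group on every label except the listed ones."""
    out = defaultdict(float)
    for labels, value in series:
        key = tuple(sorted((k, v) for k, v in labels.items() if k not in drop))
        out[key] += value
    return dict(out)


series = [
    ({"service": "checkout", "pod": "a", "zone": "eu"}, 10.0),
    ({"service": "checkout", "pod": "b", "zone": "eu"}, 5.0),
    ({"service": "cart", "pod": "c", "zone": "us"}, 2.0),
]
print(sum_by(series, ["service"]))   # one result per service
print(sum_without(series, {"pod"}))  # service and zone survive, pod is gone
```

Note that without keeps the zone label automatically, while by would silently drop it unless it is listed.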

What is a recording rule, and when would you use one? [Basic]

Answer

A recording rule precomputes and stores the result of a PromQL expression as a new time series. I use recording rules for expensive, frequently used, or standardized queries such as service:error_ratio:rate5m.

Technical explanation

Recording rules improve dashboard performance and make alert expressions simpler and more consistent.

They are evaluated on a schedule by Prometheus and stored like normal metrics.

Naming should be consistent and indicate level, metric, operation, and window.

Hands-on example

Example rule:

groups:
  - name: service-slo
    rules:
      - record: service:http_error_ratio:rate5m
        expr: sum(rate(http_requests_total{status=~'5..'}[5m])) by (service) / sum(rate(http_requests_total[5m])) by (service)

Then alerts and dashboards reuse service:http_error_ratio:rate5m.

What is an alerting rule, and how does it differ from a recording rule? [Basic]

Answer

An alerting rule evaluates a condition and creates an alert when the condition is true for the configured duration. A recording rule stores a query result for reuse; an alerting rule produces notifications through Alertmanager.

Technical explanation

Recording rules are about performance, standardization, and reusable derived metrics.

Alerting rules are about detecting actionable conditions and attaching labels and annotations.

A good pattern is to use recording rules for SLO math, then alerting rules for burn-rate thresholds.

Hands-on example

Example: record service:http_error_ratio:rate5m. Then alert: if service:http_error_ratio:rate5m > 0.02 for 10m, fire HighErrorRate with labels severity='page' and annotations pointing to the runbook and dashboard.
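The "for 10m" part is what separates a blip from an alert. The Python sketch below models the inactive, pending, and firing states across successive rule evaluations; it is a simplified illustration, and alert_states and for_evals (the for duration counted in evaluation intervals) are hypothetical names.

```python
def alert_states(values, threshold, for_evals):
    """Return the alert state after each rule evaluation."""
    states, breaching = [], 0
    for value in values:
        if value > threshold:
            breaching += 1
            # Fire only once the condition has held for the whole 'for' duration.
            states.append("firing" if breaching > for_evals else "pending")
        else:
            breaching = 0  # any healthy evaluation resets the timer
            states.append("inactive")
    return states


# Error ratio at each evaluation, threshold 0.02, 'for' equal to two intervals.
ratios = [0.01, 0.03, 0.04, 0.05, 0.01, 0.03]
print(alert_states(ratios, 0.02, 2))
# ['inactive', 'pending', 'pending', 'firing', 'inactive', 'pending']
```

Only the firing state reaches Alertmanager as a notification candidate, so a short spike that recovers while pending never pages anyone.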

What is Alertmanager, and what does it handle that Prometheus does not? [Basic]

Answer

Alertmanager receives alerts from Prometheus and handles deduplication, grouping, routing, silencing, inhibition, and notification delivery. Prometheus detects alert conditions; Alertmanager decides how and when people are notified.

Technical explanation

Routing sends alerts to teams or tools based on labels such as service, team, severity, and environment.

Grouping prevents a flood of separate notifications for related alerts.

Silencing and inhibition suppress expected or redundant alerts without changing alert rules.

Hands-on example

Hands-on: add labels team='payments' and severity='page' to a Prometheus alert. Configure Alertmanager to route team='payments' to the payments PagerDuty receiver, group by alertname and service, and inhibit lower-severity pod alerts when a service-level page is firing.

How does Alertmanager grouping, inhibition, and silencing work? [Intermediate]

Answer

Grouping bundles related alerts into fewer notifications, inhibition suppresses alerts when another higher-level alert is firing, and silencing temporarily mutes matching alerts for planned work or known issues.

Technical explanation

Grouping is configured with group_by, group_wait, group_interval, and repeat_interval.

Inhibition is rule-based, often suppressing instance or pod alerts when a cluster or service alert is already active.

Silences should be time-bound, labeled, and include a reason so they do not hide real incidents indefinitely.

Hands-on example

Example: during node maintenance, create a silence matching instance='node-17' for two hours. Separately, configure inhibition so PodDown warnings are suppressed when KubernetesNodeNotReady is firing for the same node.
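The "for the same node" condition is the important detail: an inhibition rule pairs a source alert with a target alert and lists the labels that must be equal on both. The Python sketch below models that matching logic only; is_inhibited and the rule dictionary are hypothetical and cover equality matchers, not regexes.

```python
def is_inhibited(alert, firing, rule):
    """True if a firing alert matches the rule's source and shares the 'equal' labels."""
    if any(alert.get(k) != v for k, v in rule["target"].items()):
        return False  # the rule does not apply to this alert
    for source in firing:
        source_matches = all(source.get(k) == v for k, v in rule["source"].items())
        same_scope = all(alert.get(k) == source.get(k) for k in rule["equal"])
        if source_matches and same_scope:
            return True
    return False


rule = {
    "source": {"alertname": "KubernetesNodeNotReady"},
    "target": {"alertname": "PodDown", "severity": "warning"},
    "equal": ["node"],
}
firing = [{"alertname": "KubernetesNodeNotReady", "node": "node-17"}]
on_bad_node = {"alertname": "PodDown", "severity": "warning", "node": "node-17"}
elsewhere = {"alertname": "PodDown", "severity": "warning", "node": "node-03"}
print(is_inhibited(on_bad_node, firing, rule))  # True
print(is_inhibited(elsewhere, firing, rule))    # False
```

Without the equal list, one unhealthy node would mute PodDown warnings across the whole cluster.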

What is alert routing, and how do you send different alerts to different teams? [Intermediate]

Answer

Alert routing maps alerts to receivers based on labels. I route by team, service, severity, environment, and sometimes region so that the right owner receives the alert through the right channel.

Technical explanation

Routing requires consistent alert labels; missing team or service labels usually cause paging chaos.

Severity should control channel: page, ticket, chat, or email.

Routes should have a safe default receiver for unmatched alerts, but the goal is to eliminate unmatched production alerts.

Hands-on example

Alertmanager sketch:

route:
  receiver: platform-default
  routes:
    - matchers: [team="payments", severity="page"]
      receiver: payments-pager
    - matchers: [team="payments", severity="ticket"]
      receiver: payments-jira
    - matchers: [environment="dev"]
      receiver: dev-slack
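How an alert walks this tree can be sketched in Python. This is a deliberately flat model under stated assumptions: equality matchers only, no nested routes, and no continue flag; pick_receiver is a hypothetical helper.

```python
def pick_receiver(alert, routes, default):
    """Return the receiver of the first route whose matchers all equal the alert's labels."""
    for route in routes:
        if all(alert.get(label) == value for label, value in route["match"].items()):
            return route["receiver"]
    return default  # the safe fallback for unmatched alerts


routes = [
    {"match": {"team": "payments", "severity": "page"}, "receiver": "payments-pager"},
    {"match": {"team": "payments", "severity": "ticket"}, "receiver": "payments-jira"},
    {"match": {"environment": "dev"}, "receiver": "dev-slack"},
]
print(pick_receiver({"team": "payments", "severity": "page"}, routes, "platform-default"))
print(pick_receiver({"team": "search", "severity": "page"}, routes, "platform-default"))
```

The second alert lands on platform-default, which is exactly the kind of unmatched production page the routing design should aim to eliminate. Order also matters: a dev alert labeled team=payments and severity=page would reach payments-pager before the dev route is considered.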

How does Prometheus handle high cardinality, and why is it a problem? [Intermediate]

Answer

Prometheus handles high cardinality poorly when too many unique label combinations create too many time series. It increases memory, disk, CPU, query latency, and remote-write cost, and can make Prometheus unstable.

Technical explanation

Cardinality is the number of unique time series, not just the number of metric names.

Labels like user_id, request_id, session_id, full URL, IP address, or order ID can create unbounded series.

Prevention is better than cleanup: enforce metric naming and label-review standards before production.

Hands-on example

Hands-on: run topk(20, count by (__name__)({__name__=~'.+'})) to identify the metrics with the most series. PromQL has no single query that ranks every label by its number of values, so use the TSDB status page, promtool tsdb analyze, or mimirtool for per-label breakdowns. Then drop or rewrite the offending labels at scrape time, or fix the instrumentation, before retention cost grows.

What causes a cardinality explosion, and how do you prevent it? [Intermediate]

Answer

A cardinality explosion happens when a metric label has many or unbounded values, or when multiple labels multiply together unexpectedly. It is prevented by limiting labels to stable, bounded dimensions and reviewing instrumentation before release.

Technical explanation

Common causes are raw URL paths with IDs, customer IDs, pod UIDs, exception messages, dynamic queue names, and per-request labels.

Use templated routes like /orders/{id} instead of /orders/12345.

Apply metric relabeling carefully, but fixing instrumentation at source is the best solution.

Hands-on example

Example: replace http_requests_total{path='/users/123/orders/456'} with http_requests_total{route='/users/{user_id}/orders/{order_id}', method='GET', status='200'}. For user-level debugging, use logs or traces, not metric labels.
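When the framework does not expose the matched route pattern, a fallback is to normalize the raw path before using it as a label. The Python sketch below assumes that any path segment made only of digits is an identifier; route_template is a hypothetical helper, and it emits a generic {id} placeholder rather than named parameters.

```python
import re

# Assumption for this sketch: a segment consisting only of digits is an ID.
_ID_SEGMENT = re.compile(r"/\d+(?=/|$)")


def route_template(path):
    """Collapse numeric path segments so the label value stays bounded."""
    return _ID_SEGMENT.sub("/{id}", path)


print(route_template("/users/123/orders/456"))  # /users/{id}/orders/{id}
print(route_template("/v2/health"))             # /v2/health
```

Taking the route pattern from the router itself is still the better option, because a regex cannot recognize UUIDs, slugs, or other non-numeric identifiers without further rules.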

How does Prometheus storage (TSDB) work, and what is the retention model? [Intermediate]

Answer

Prometheus stores samples in its local time-series database, keeping recent data in an in-memory head block protected by a write-ahead log and compacting older data into immutable blocks on disk. Retention is controlled by time and/or size, after which old blocks are deleted.

Technical explanation

The TSDB is optimized for recent local queries and rule evaluation.

Retention flags such as storage.tsdb.retention.time and storage.tsdb.retention.size define how much data stays locally.

For long-term historical data, remote write or object-storage-based systems such as Thanos, Cortex, or Mimir are commonly used.

Hands-on example

Hands-on: set Prometheus args to --storage.tsdb.retention.time=15d and --storage.tsdb.retention.size=100GB. Monitor prometheus_tsdb_head_series, prometheus_tsdb_storage_blocks_bytes, and disk usage. If head series grows sharply, investigate cardinality before increasing disk.
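A rough capacity check helps when choosing these two flags. The Python sketch below uses the common sizing rule of retention seconds times ingested samples per second times bytes per sample; the 2 bytes per sample default and the ingest rate are assumptions for illustration, and needed_disk_bytes is a hypothetical helper.

```python
def needed_disk_bytes(retention_days, samples_per_second, bytes_per_sample=2.0):
    """Rough local TSDB size: retention seconds x ingest rate x bytes per sample."""
    return retention_days * 86_400 * samples_per_second * bytes_per_sample


# Hypothetical load: 100k samples/s (for example 3M active series scraped every 30s).
estimate = needed_disk_bytes(15, 100_000)
print(f"{estimate / 1e9:.1f} GB")  # 259.2 GB
```

At that ingest rate, the 100GB size limit would start deleting blocks well before the 15-day time limit is reached, so the two settings should be sized together.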

How do you scale Prometheus for long-term storage and high availability (Thanos, Cortex, Mimir)? [Intermediate]

Answer

To scale Prometheus for long-term storage and HA, I run at least two Prometheus replicas per shard, use remote write or sidecars, and query long-term data through systems such as Thanos, Cortex, or Mimir. The exact choice depends on tenancy, scale, and operational model.

Technical explanation

Prometheus itself is single-node per shard, so horizontal scale usually means functional sharding and federation or remote-write architectures.

Thanos adds sidecar upload, object storage, global querying, compaction, and deduplication around Prometheus.

Cortex and Mimir are horizontally scalable, multi-tenant metrics backends designed for remote-write ingestion and large-scale querying.

Hands-on example

Example design: run two Prometheus replicas for each Kubernetes cluster. Remote write to Mimir for 13-month retention. Use Grafana to query Mimir for historical dashboards and local Prometheus for low-latency rule evaluation. Configure replica labels for deduplication.

What is the difference between Thanos and Cortex/Mimir at a high level? [Intermediate]

Answer

At a high level, Thanos extends existing Prometheus deployments with global query, object-store retention, and replica deduplication, while Cortex and Mimir are distributed, multi-tenant remote-write backends. Thanos often fits Prometheus-first environments; Mimir/Cortex fit centralized SaaS-like metrics platforms.

Technical explanation

Thanos commonly uses sidecars that upload Prometheus blocks to object storage.

Cortex and Mimir ingest samples through remote write and split ingestion, storage, compaction, and query across microservices.

Mimir is Grafana Labs' fork of Cortex, hardened in production and with strong multi-tenant operational features; Cortex continues separately as a CNCF project.

Hands-on example

Decision example: if each cluster already runs Prometheus and you need global views and cheap long retention, choose Thanos. If you operate a central platform for many teams with tenant quotas and remote-write ingestion, choose Mimir or Cortex-style architecture.

How do you make Prometheus highly available? [Intermediate]

Answer

I make Prometheus highly available by running two or more identical replicas scraping the same targets, giving each replica a unique external label, and deduplicating results in the query layer or long-term backend. I also keep Alertmanager highly available.

Technical explanation

Prometheus replicas do not coordinate scraping; they independently scrape the same targets.

Deduplication is handled by tools such as Thanos Query or Mimir, based on replica labels.

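Alertmanager deduplicates identical alerts sent by multiple Prometheus replicas, and clustering stops its own instances from each sending the same notification. Both replicas must send identical label sets, so the replica label has to be dropped from alerts before they leave Prometheus.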

Hands-on example

Hands-on: deploy prometheus-k8s-0 and prometheus-k8s-1 with external_labels cluster='prod-a' and replica='0'/'1'. Configure both to send alerts to a clustered Alertmanager, dropping the replica label with alert_relabel_configs so the two copies of each alert deduplicate. In Thanos Query, set --query.replica-label=replica so dashboards show one deduplicated series.
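Query-time deduplication comes down to ignoring the replica label when comparing series. The Python sketch below is a simplified model of that idea, not how any particular query layer is implemented; dedupe_replicas is a hypothetical helper.

```python
def dedupe_replicas(series, replica_label="replica"):
    """Keep one series per label set once the replica label is ignored."""
    seen = {}
    for labels, samples in series:
        key = tuple(sorted((k, v) for k, v in labels.items() if k != replica_label))
        # Simplification: the first replica wins. Real query layers merge samples
        # and can switch replicas when one of them has gaps.
        seen.setdefault(key, samples)
    return seen


scraped = [
    ({"__name__": "up", "job": "checkout", "cluster": "prod-a", "replica": "0"}, [1, 1, 1]),
    ({"__name__": "up", "job": "checkout", "cluster": "prod-a", "replica": "1"}, [1, 1, 1]),
    ({"__name__": "up", "job": "cart", "cluster": "prod-a", "replica": "0"}, [1, 0, 1]),
]
print(len(dedupe_replicas(scraped)))  # 2
```

This only works if the replica label is the sole difference between the two copies, which is why every other external label must be identical across replicas.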

How does Prometheus integrate with Kubernetes service discovery? [Intermediate]

Answer

Prometheus integrates with Kubernetes service discovery by watching Kubernetes API resources and turning pods, services, endpoints, EndpointSlices, nodes, and ingresses into scrape targets with metadata labels.

Technical explanation

Prometheus can use native kubernetes_sd_configs or higher-level Prometheus Operator objects.

Relabeling maps Kubernetes metadata into stable labels and filters targets based on annotations or selectors.

RBAC must allow Prometheus to list and watch the required resources.

Hands-on example

Example: with the Prometheus Operator, create a Service exposing port name 'metrics' and a ServiceMonitor selecting app='checkout'. Verify targets in Prometheus UI. If missing, check ServiceMonitor namespaceSelector, service labels, endpoint port name, and Prometheus serviceMonitorSelector.

What is the Prometheus Operator, and what are ServiceMonitor and PodMonitor? [Intermediate]

Answer

The Prometheus Operator manages Prometheus, Alertmanager, and related configuration as Kubernetes custom resources. ServiceMonitor and PodMonitor define how Prometheus discovers and scrapes service endpoints or pods.

Technical explanation

ServiceMonitor is usually preferred when a stable Service fronts application pods.

PodMonitor is useful when scraping pods directly, such as sidecars, daemon workloads, or jobs without a service.

The Operator converts these custom resources into Prometheus scrape configuration.

Hands-on example

ServiceMonitor example:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: checkout
spec:
  selector:
    matchLabels:
      app: checkout
  endpoints:
    - port: metrics
      interval: 30s

Then confirm the target appears under Status -> Targets.

How do you instrument an application with a Prometheus client library? [Intermediate]

Answer

To instrument an application, I add a Prometheus client library, define a small set of well-named counters, gauges, and histograms, expose a /metrics endpoint, and ensure labels are bounded and useful.

Technical explanation

Start with service RED metrics: request count, error count, and request duration histogram.

Add domain metrics only when they drive SLOs, capacity planning, or operational debugging.

Protect cardinality by using route templates, status classes, dependency names, and environment labels rather than IDs.

Hands-on example

Python Flask sketch:

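from flask import Flask
from prometheus_client import CONTENT_TYPE_LATEST, Counter, Histogram, generate_latest

app = Flask(__name__)

# Bounded labels only: route template, method, and status code.
REQ = Counter('http_requests_total', 'Requests', ['route', 'method', 'status'])
LAT = Histogram('http_request_duration_seconds', 'Latency', ['route'])

@app.get('/metrics')  # Flask 2.0+ shortcut for a GET-only route
def metrics():
    # Serve the default registry in the Prometheus text exposition format.
    return generate_latest(), 200, {'Content-Type': CONTENT_TYPE_LATEST}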

What labels should and should not be put on a metric? [Intermediate]

Answer

Metric labels should describe stable, bounded dimensions that are useful for aggregation, such as service, route, method, status, environment, cluster, and dependency. They should not contain unbounded identifiers such as user_id, request_id, session_id, raw URL, or exception message.

Technical explanation

Labels multiply time-series cardinality, so every new label must have operational value.

Keep label semantics consistent across services so dashboards and alerts can be reused.

Use logs and traces for high-cardinality investigative fields, not metrics.

Hands-on example

Example: good metric: http_requests_total{service='checkout', route='/orders/{id}', method='POST', status='201'}. Bad metric: http_requests_total{url='/orders/abc123', user='u987', request_id='...'}. The bad version is expensive and hard to aggregate.
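The cost of one extra label is multiplicative, which a quick calculation makes obvious. In the Python sketch below the label counts are hypothetical, and worst_case_series gives an upper bound, since not every combination necessarily occurs in practice.

```python
from math import prod


def worst_case_series(label_values):
    """Upper bound on series for one metric: the product of distinct values per label."""
    return prod(label_values.values())


# Distinct values per label (hypothetical numbers).
bounded = {"service": 1, "route": 20, "method": 4, "status": 8}
unbounded = {**bounded, "user": 50_000}
print(worst_case_series(bounded))    # 640
print(worst_case_series(unbounded))  # 32000000
```

A single user label turns a metric with a few hundred series into one with tens of millions, and that is per service and before histograms multiply it again by the bucket count.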

How do you alert on something that should be happening but is not (absence of data)? [Intermediate]

Answer

To alert on something that should be happening but is not, I use absence checks, stale timestamp gauges, or expected-rate checks. The safest pattern is often a last_success_timestamp metric for jobs and pipelines.

Technical explanation

absent() is useful when an entire metric or target disappears, but it must be scoped carefully.

For cron jobs, export last success time and alert when it is too old.

For event streams, alert on rate below expected baseline only when the business schedule says traffic should exist.

Hands-on example

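PromQL examples: absent(up{job='checkout'} == 1) fires when no checkout target is up, including when the job has disappeared from service discovery. For a cron: time() - batch_last_success_timestamp_seconds{job='billing-reconcile'} > 26*3600. For Kafka: sum(rate(events_consumed_total[15m])) == 0 and on() (hour() >= 9 and hour() < 17) limits the zero-traffic check to business hours, here assumed to be 09:00-17:00 UTC.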

What is the difference between black-box and white-box monitoring? [Intermediate]

Answer

Black-box monitoring tests behavior from the outside, like a user or dependency would see it. White-box monitoring uses internal metrics from the system itself. Both are needed: black-box confirms external experience; white-box explains internal causes.

Technical explanation

Black-box checks catch DNS, TLS, routing, load balancer, and end-to-end failures that internal metrics may miss.

White-box metrics show service internals, resource saturation, dependency errors, and code-level behavior.

For paging, black-box or SLO symptom checks are often stronger than infrastructure-only white-box alerts.

Hands-on example

Example: use blackbox_exporter to probe https://app.example.com/health from multiple regions. Use white-box Prometheus metrics to monitor checkout request duration, dependency error rate, and queue depth. If black-box fails but service metrics look fine, check DNS, ingress, CDN, and TLS.

What is a synthetic / black-box check, and what would you monitor with one? [Intermediate]

Answer

A synthetic or black-box check is an automated external probe that simulates a user or client action. I use it for critical user journeys, public endpoints, DNS/TLS validity, login, checkout, and dependency reachability.

Technical explanation

Synthetic checks validate the full path, including network, gateway, routing, certificate, and application availability.

They should be simple enough to be reliable and scoped enough to avoid creating fake business transactions incorrectly.

For complex flows, use test accounts and cleanup logic.

Hands-on example

Hands-on: deploy blackbox_exporter and configure an HTTP probe for /health and a TLS probe for certificate expiry. Because blackbox_exporter only issues single probes, run the login -> add item -> checkout preview journey as a scripted synthetic in a separate tool. Alert if probes fail from two regions for more than five minutes.

How do you monitor a batch job or cron that runs infrequently? [Intermediate]

Answer

For infrequent batch jobs or cron jobs, I monitor last success time, last completion status, runtime, records processed, and age of output data. I avoid relying only on process-level metrics because the job may not be running when Prometheus scrapes.

Technical explanation

A timestamp gauge is the most reliable SLI for 'has this job succeeded recently?'.

Runtime histograms help catch slow jobs before they miss deadlines.

Pushgateway can be used for service-level job results, but stale metrics and cleanup must be handled.

Hands-on example

Example: a daily billing job exports billing_last_success_timestamp_seconds, billing_last_run_duration_seconds, and billing_records_processed_total. Alert when time() - billing_last_success_timestamp_seconds > 27 * 3600 (27 hours, in seconds) or when runtime exceeds twice the historical p95.
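The two alert conditions reduce to simple comparisons on the exported gauges. The Python sketch below shows that logic outside PromQL; job_alerts and the alert names JobStale and JobSlow are hypothetical, and all timestamps are Unix seconds.

```python
def job_alerts(now, last_success, last_duration, p95_duration, max_age_hours=27):
    """Return alert names for a batch job based on its exported gauges."""
    alerts = []
    # Staleness: the last successful run is older than the allowed age.
    if now - last_success > max_age_hours * 3600:
        alerts.append("JobStale")
    # Slowness: the latest run took more than twice the historical p95.
    if last_duration > 2 * p95_duration:
        alerts.append("JobSlow")
    return alerts


now = 1_700_000_000
print(job_alerts(now, now - 30 * 3600, 600, 500))   # ['JobStale']
print(job_alerts(now, now - 2 * 3600, 1200, 500))   # ['JobSlow']
```

The 27-hour allowance on a 24-hour schedule leaves room for a late start or a slow run without paging, while still catching a fully missed night.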

What is Grafana, and how does it relate to Prometheus? [Intermediate]

Answer

Grafana is a visualization and dashboarding platform. Prometheus stores and queries metrics; Grafana connects to Prometheus and other data sources to build dashboards, panels, variables, and visual investigation workflows.

Technical explanation

Grafana is not usually the source of truth for metrics; the backend such as Prometheus, Mimir, Thanos, Loki, or Elasticsearch is.

Grafana adds dashboard organization, templating, annotations, alerting options, and cross-source visualization.

Good Grafana dashboards should reflect service ownership and incident workflows, not just random metric panels.

Hands-on example

Hands-on: add Prometheus as a Grafana data source. Create dashboard variables cluster, namespace, and service. Add panels for RPS, error ratio, p95 latency, saturation, and recent deploy annotations. Link each panel to logs or traces using service and trace_id where possible.

How do you design a useful dashboard, and what is the difference from an alert? [Intermediate]

Answer

A useful dashboard is designed for a specific workflow: executive health, service operation, incident triage, or capacity planning. Alerts should wake people only for action; dashboards should provide context for humans who are already investigating.

Technical explanation

Start dashboards with user-impact signals, then dependency signals, then infrastructure causes.

Avoid dashboards with hundreds of unprioritized panels because they slow responders down.

A dashboard can show many conditions, but an alert must represent a clear action and owner.

Hands-on example

Dashboard layout: row 1: SLO compliance, burn rate, current incident status. Row 2: RED metrics by endpoint. Row 3: dependency latency and error rate. Row 4: CPU, memory, throttling, queue depth, DB pool. Add links to runbook, deployment history, Splunk logs, and trace search.

What is Splunk, and what is it primarily used for? [Intermediate]

Answer

Splunk is a platform for ingesting, indexing, searching, analyzing, and alerting on machine data, especially logs and events. It is primarily used for log analytics, security analytics, audit, troubleshooting, and operational intelligence.

Technical explanation

Splunk can ingest many data types, but its strength is searchable event data with powerful SPL queries.

It is commonly used for security investigations, compliance retention, incident debugging, and business-event analytics.

Because ingest and retention can be expensive, data onboarding and filtering strategy are important.

Hands-on example

Example: during a checkout incident, search Splunk for index=app sourcetype=checkout service=checkout trace_id=<id>. Use the trace ID from Grafana or OpenTelemetry to find exact error logs, then aggregate by error_code and deployment_version.

What is the Splunk data pipeline (input, parsing, indexing, search)? [Intermediate]

Answer

The Splunk data pipeline moves data through input, parsing, indexing, and search. Inputs receive data, parsing breaks it into events and applies metadata, indexing stores searchable data, and search heads run SPL queries over the indexes.

Technical explanation

Forwarders collect and send data from hosts or applications.

Parsing includes line breaking, timestamp recognition, source type assignment, and some transformations.

Indexing writes events into buckets, and search heads distribute searches across indexers.

Hands-on example

Hands-on flow: a Universal Forwarder tails /var/log/checkout/app.log and tags events with sourcetype=checkout_json. Indexers break the stream into events, parse timestamps, and store the events in index=prod_app. The search head then runs: index=prod_app sourcetype=checkout_json level=ERROR | stats count by error_code service.

What is an index in Splunk, and how do you decide indexing strategy? [Intermediate]

Answer

A Splunk index is a logical repository for events and their indexed data. I design indexes around retention, access control, data domain, volume, and compliance requirements, not around every small application.

Technical explanation

Separate indexes when data needs different retention, RBAC, sensitivity, or cost controls.

Use source, sourcetype, host, service, and fields to distinguish data inside an index.

Too many indexes increase management overhead; too few make access and retention difficult.

Hands-on example

Example strategy: index=prod_app for application logs retained 30 days, index=security for auth/security events retained 365 days, index=audit for compliance events retained 7 years, and index=dev_app retained 7 days. Limit team access by index and role.

What is the difference between a Splunk forwarder, indexer, and search head? [Intermediate]

Answer

A Splunk forwarder collects and forwards data, an indexer receives and stores indexed events, and a search head provides the UI/API and coordinates searches across indexers.

Technical explanation

Universal Forwarders are lightweight agents on hosts or nodes.

Indexers handle parsing, indexing, storage, bucket management, and search execution over local data.

Search heads manage SPL execution plans, knowledge objects, dashboards, saved searches, and user access.

Hands-on example

Troubleshooting example: if logs are missing, check forwarder status and outputs.conf first. If data is received but not searchable, check indexer queues and index config. If one user's dashboard fails, inspect search head permissions, macros, and saved searches.

What is the difference between a universal and a heavy forwarder? [Intermediate]

Answer

A Universal Forwarder is lightweight and mainly forwards data with minimal parsing. A Heavy Forwarder is a full Splunk instance that can parse, filter, route, and transform data before sending it onward.

Technical explanation

Universal Forwarders are preferred for most endpoint log collection because they are efficient and easy to operate.

Heavy Forwarders are used when data must be parsed, masked, routed, or enriched before indexing.

Heavy Forwarders require more CPU, memory, maintenance, and configuration governance.

Hands-on example

Example: install Universal Forwarders on application servers to send JSON logs. Use a Heavy Forwarder at the network boundary to mask sensitive fields, drop debug events, and route security logs to index=security and application logs to index=prod_app.

What is SPL (Search Processing Language)? [Intermediate]

Answer

SPL, or Search Processing Language, is Splunk's query language for searching, filtering, transforming, correlating, and visualizing events. It uses a pipeline model where each command processes results from the previous stage.

Technical explanation

SPL starts by selecting indexed data, usually with index, sourcetype, host, and time constraints.

Transforming commands such as stats, chart, timechart, and top summarize events into tables or charts.

eval, rex, fields, where, lookup, transaction, and join add analysis and enrichment, but performance depends heavily on filtering early.

Hands-on example

Example SPL: index=prod_app sourcetype=checkout_json earliest=-15m level=ERROR | stats count by service error_code | sort - count. This quickly answers which service and error code dominate recent failures.

What is the difference between search-time and index-time field extraction? [Intermediate]

Answer

Search-time field extraction happens when a search runs; index-time extraction happens during parsing, before events are written, and stores the extracted fields in the index itself, which increases index size and ingestion work. Search-time is more flexible; index-time is more permanent and expensive to change.

Technical explanation

Search-time extractions are preferred for most fields because they can be changed without reindexing data.

Index-time extractions are used when fields must be indexed for performance, routing, masking, or compliance requirements.

Poor index-time decisions can increase storage and create long-term maintenance problems.

Hands-on example

Example: extract error_code and order_type at search time from JSON logs. Use index-time processing only to set host, source, sourcetype, timestamp, line breaking, routing, or to mask a secret before data is written.

Why is index-time configuration expensive, and when do you use it? [Intermediate]

Answer

Index-time configuration is expensive because it affects ingestion, storage, license usage, parsing queues, and sometimes requires reindexing to correct historical data. I use it only when there is a strong operational, security, or performance reason.

Technical explanation

Index-time transformations are applied before data is stored, so mistakes can permanently change indexed data.

They can increase CPU load on parsing/indexing tiers and add operational complexity.

Valid uses include timestamp correction, line breaking, routing, nullQueue filtering, sourcetype assignment, and sensitive-data masking.

Hands-on example

Example: if logs contain credit-card numbers, use index-time masking or filtering before indexing because search-time masking is too late. For a normal application field like feature_flag, use search-time extraction instead.

How do you write an efficient Splunk search, and why filter early? [Intermediate]

Answer

An efficient Splunk search filters early with index, sourcetype, host, time range, and selective terms before applying expensive commands. Filtering early reduces the event set and lowers search latency and resource use.

Technical explanation

Always specify the smallest reasonable time window.

Use indexed fields and simple terms before regex, join, transaction, or broad wildcards.

Project fields early with fields when large events are not needed, and summarize with stats instead of returning raw events.

Hands-on example

Poor search: index=* error | regex message="timeout.*payment". Better: index=prod_app sourcetype=checkout_json service=checkout earliest=-30m error_code=PAYMENT_TIMEOUT | stats count by host, version. The better query restricts index, sourcetype, time range, and field values before any transforming command runs.

What is the role of the stats, eval, and timechart commands in SPL? [Intermediate]

Answer

In SPL, stats calculates aggregate summaries, eval creates or modifies fields using expressions, and timechart builds time-series aggregations for charts. Together they cover most operational log analytics use cases.

Technical explanation

stats is used for counts, averages, percentiles, distinct counts, and grouping by fields.

eval is used for derived fields such as severity normalization, boolean flags, or latency buckets.

timechart groups results into time buckets, which makes it ideal for trends and incident timelines.

Hands-on example

Examples:

index=prod_app level=ERROR | stats count by service,error_code

index=prod_app | eval is_error=if(level="ERROR",1,0) | stats sum(is_error) by service

index=prod_app service=checkout | timechart span=5m count by level

What is the difference between stats and eventstats? [Intermediate]

Answer

stats transforms events into aggregate results, while eventstats computes aggregates and adds them back to each original event. Use stats when you only need the summary; use eventstats when you still need event-level detail plus the aggregate context.

Technical explanation

stats reduces the result set and removes fields not in the aggregation.

eventstats preserves original events and appends aggregate values such as average latency by service.

eventstats can be more expensive because it keeps many events in the pipeline.

Hands-on example

Example: index=prod_app service=checkout | eventstats avg(duration_ms) as avg_ms by endpoint | where duration_ms > 2*avg_ms. This keeps each slow event while comparing it to its endpoint's average. With stats alone, the raw events would be gone.
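
The difference can be mimicked outside Splunk with a short Python sketch. The sample events and function names are made up for illustration; the first function behaves like stats (only summaries survive), the second like eventstats (each event is kept and annotated).

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sample events.
events = [
    {"endpoint": "/orders", "duration_ms": 100},
    {"endpoint": "/orders", "duration_ms": 120},
    {"endpoint": "/orders", "duration_ms": 800},
    {"endpoint": "/cart", "duration_ms": 50},
    {"endpoint": "/cart", "duration_ms": 60},
]

def avg_by_endpoint(rows):
    """stats-like: one summary value per endpoint; the raw events are gone."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["endpoint"]].append(row["duration_ms"])
    return {endpoint: mean(values) for endpoint, values in groups.items()}

def with_endpoint_avg(rows):
    """eventstats-like: every event is kept and gains an avg_ms field."""
    averages = avg_by_endpoint(rows)
    return [{**row, "avg_ms": averages[row["endpoint"]]} for row in rows]

# Equivalent of: | where duration_ms > 2*avg_ms
slow = [e for e in with_endpoint_avg(events) if e["duration_ms"] > 2 * e["avg_ms"]]
print(avg_by_endpoint(events))
print(slow)
```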

What is a Splunk source, sourcetype, and host? [Intermediate]

Answer

In Splunk, source is where the data came from, sourcetype describes the data format, and host identifies the machine or logical host that produced the data. These fields are foundational for search, parsing, and governance.

Technical explanation

source may be a file path, API input, stream, or object name.

sourcetype controls parsing behavior and field extraction conventions.

host helps identify origin, but in Kubernetes it may need careful design because pod, node, and container identities differ.

Hands-on example

Example: source=/var/log/checkout/app.log, sourcetype=checkout_json, host=ip-10-0-2-17. A search can start with index=prod_app sourcetype=checkout_json host=ip-10-0-2-17 to inspect logs from one node or container source.

How do you control Splunk costs and license/ingest volume? [Intermediate]

Answer

I control Splunk cost by managing ingest volume, retention, index strategy, data value, filtering, sampling, compression, and search efficiency. The biggest lever is to avoid ingesting low-value or duplicate data in the first place.

Technical explanation

Define log levels and retention by environment: production errors and audits have higher value than dev debug logs.

Filter or route noisy data before indexing when it has no incident, compliance, or analytics value.

Use metrics or traces for high-frequency numeric signals instead of logging every event.

Hands-on example

Hands-on: create a daily ingest report by index, sourcetype, service, and log level. Find top producers from the license_usage.log events in the _internal index. Reduce DEBUG logs in prod, drop health-check access logs, shorten dev retention, and move high-volume numeric telemetry to metrics.

What is data sampling or filtering at ingest, and why does it matter for cost? [Intermediate]

Answer

Ingest sampling or filtering reduces the amount of data sent to the backend by dropping, transforming, or sampling events before indexing. It matters because high-volume low-value telemetry drives license, storage, and search cost.

Technical explanation

Filtering removes events that are not useful, such as successful health checks or repetitive debug logs.

Sampling keeps a representative subset, useful for high-volume success events but risky for rare errors.

Never sample compliance, security, audit, or error data unless the business has explicitly approved it.

Hands-on example

Example policy: keep 100 percent of ERROR and WARN logs, keep 100 percent of audit events, sample successful access logs at 10 percent for high-volume endpoints, and drop Kubernetes readiness/liveness probe logs at ingestion. Validate savings and investigation impact monthly.
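
A policy like this can be sketched as a single decision function. The field names (level, audit, probe, type, request_id) are assumptions about the log schema, and in practice the rule would live in a collector or forwarder rather than in Python. The sketch uses a hash of the request ID so the 10 percent sample is deterministic.

```python
import hashlib

def _bucket(key: str) -> int:
    """Map a key to a stable bucket in the range 0-99."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100

def keep_event(event: dict) -> bool:
    """Return True if the event should be forwarded to the backend."""
    if event.get("level") in ("ERROR", "WARN") or event.get("audit"):
        return True                               # never sample errors, warnings, audits
    if event.get("probe") in ("readiness", "liveness"):
        return False                              # drop Kubernetes probe logs
    if event.get("type") == "access":
        return _bucket(event["request_id"]) < 10  # keep roughly 10 percent
    return True                                   # unclassified events are kept
```

Checking errors and audits first is a deliberate choice: a later filtering rule can then never drop them by accident.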

How do you reduce noisy or low-value log ingestion? [Intermediate]

Answer

I reduce noisy log ingestion by fixing log levels at the source, removing duplicate logs, filtering known low-value patterns, using structured logs, and moving repetitive numeric signals to metrics. Governance is more effective than after-the-fact cleanup.

Technical explanation

The application should not log every successful request at high detail unless required.

Infrastructure logs such as health checks, sidecar access logs, and retry noise should be sampled or summarized.

A log contract should define required fields, allowed levels, PII rules, and retention.

Hands-on example

Hands-on: analyze Splunk ingest by source and sourcetype. Identify that 35 percent is /health access logs. Add ingress/collector filtering to drop health checks, change app success logs to INFO summaries, preserve errors, and verify incident debugging still has trace_id and request context.

What is a Splunk saved search and a scheduled alert? [Advanced]

Answer

A Splunk saved search is a stored SPL query with permissions and optional schedule. A scheduled alert is a saved search that runs on a schedule and triggers an action when defined conditions are met.

Technical explanation

Saved searches standardize repeated analysis and can back dashboards, reports, or alerts.

Scheduled alerts should use efficient SPL, bounded time windows, and clear trigger conditions.

Alert actions can notify email, webhook, ITSM, on-call tooling, or custom integrations.

Hands-on example

Example: save a search: index=prod_app service=checkout earliest=-5m level=ERROR | stats count by error_code. Schedule every five minutes. Trigger if count > 100 for error_code=PAYMENT_TIMEOUT, then send a webhook to the incident system with dashboard and runbook links.

How do you build a Splunk dashboard, and when is it better than Grafana? [Advanced]

Answer

A Splunk dashboard is built from SPL searches, panels, inputs, tokens, and visualizations. It is better than Grafana when the primary workflow is log/event investigation, security analytics, audit drill-down, or complex SPL correlation. Grafana is usually better for metric-heavy SLO dashboards.

Technical explanation

Splunk dashboards are excellent for drilling from an aggregate into raw events.

Grafana shines when time-series metrics from Prometheus/Mimir/Thanos are the primary data source.

Many incident workflows use both: Grafana for service symptoms and Splunk for correlated logs.

Hands-on example

Example: create a Splunk dashboard with inputs for service, environment, and trace_id. Panels show error trend, top error codes, recent deploy versions, and raw correlated logs. Link to it from the Grafana SLO dashboard using service and time range variables.

What is data retention and a bucket lifecycle (hot, warm, cold, frozen) in Splunk? [Advanced]

Answer

Splunk stores indexed data in buckets that move through lifecycle stages such as hot, warm, cold, and frozen. Hot buckets are actively written, warm and cold are searchable historical buckets, and frozen data is archived or deleted based on retention policy.

Technical explanation

Retention is controlled by size and time settings such as maxTotalDataSizeMB and frozenTimePeriodInSecs.

Hot and warm storage is usually faster and more expensive; cold storage can be larger and slower.

Frozen is not searchable unless archived data is restored or handled through a separate process.

Hands-on example

Example: prod_app logs retain 30 days, security logs retain 365 days, and audit logs archive to object storage for 7 years. Configure indexes.conf retention settings, monitor bucket growth, and test restore procedures for frozen audit data.
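
Because frozenTimePeriodInSecs is expressed in seconds, the day-based targets have to be converted. A small sketch of that arithmetic follows; the index names are hypothetical stand-ins for the retention classes in the example.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Retention targets from the example; index names are hypothetical.
retention_days = {"prod_app": 30, "security": 365}

def frozen_time_period_secs(days: int) -> int:
    """Convert a retention period in days to seconds."""
    return days * SECONDS_PER_DAY

for index, days in retention_days.items():
    print(f"[{index}] frozenTimePeriodInSecs = {frozen_time_period_secs(days)}")
```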

How do you correlate logs and metrics during an incident? [Advanced]

Answer

During an incident, I correlate logs and metrics by aligning time range, service, environment, deployment version, trace_id, request ID, and user-impact dimensions. Metrics show scope and trend; logs explain specific errors and events.

Technical explanation

Start with symptom metrics such as error ratio, latency, and traffic to identify when and where the issue began.

Use labels and annotations to identify service, endpoint, version, and region.

Jump to logs using trace_id or service/time filters to find concrete exceptions, dependency errors, and state changes.

Hands-on example

Hands-on: Grafana shows checkout 5xx started at 10:05 after version v2.3.1. Open Splunk with the time range set to 10:00-10:20 and filter on service=checkout version=v2.3.1 level=ERROR. Piping the results to stats count by error_code shows PAYMENT_TIMEOUT dominating. Traces confirm payment dependency latency.

What is Wavefront (Tanzu Observability), and what is it used for? [Advanced]

Answer

Wavefront, later known as Tanzu Observability by Wavefront and now reflected in Broadcom documentation as DX OpenExplore, is a high-scale streaming observability platform used for metrics, histograms, traces, dashboards, and alerts.

Technical explanation

It is known for high ingest rates, dimensional tags, fast analytics over time-series data, and advanced alerting/anomaly capabilities.

It is often used for infrastructure, Kubernetes, application metrics, and platform-level observability across large estates.

In interviews, I refer to both names because many environments still call it Wavefront while current docs may use newer branding.

Hands-on example

Example use case: send Kubernetes, cloud, JVM, and application metrics through collectors or proxies into Wavefront/DX OpenExplore. Build dashboards for service RED metrics, cluster saturation, and deployment events, then configure alerts on error rate, latency, and anomaly patterns.

What is the Wavefront data model and query language (WQL)? [Advanced]

Answer

Wavefront's data model is dimensional time-series data with metric names, sources, and point tags. WQL, or Wavefront Query Language, is used to query, transform, aggregate, and alert on time series, histograms, and events.

Technical explanation

A typical metric has a name, numeric value, timestamp, source, and tags such as env, service, cluster, or region.

WQL functions support filtering, aggregation, alignment, rates, percentiles, joins, and anomaly-style analysis.

Like Prometheus, data shape and tag cardinality are critical for cost and performance.

Hands-on example

Example WQL-style query: ts(app.checkout.request.latency, env=prod and service=checkout) to chart latency. Use aggregate functions by service or region, then build an alert when p95 latency remains above the SLO threshold for a sustained window.

How does Wavefront ingest metrics (proxy, direct ingestion, collectors)? [Advanced]

Answer

Wavefront/DX OpenExplore can ingest metrics through proxies, direct ingestion APIs, SDKs, agents, collectors, and integrations such as Telegraf, Kubernetes, cloud integrations, or OpenTelemetry pipelines depending on the environment.

Technical explanation

The proxy is common for controlled enterprise ingestion because it centralizes buffering, filtering, preprocessing, and egress control.

Direct ingestion can be useful for cloud integrations or applications that can safely send to the service endpoint.

Collectors normalize data from infrastructure, Kubernetes, applications, and cloud services before forwarding.

Hands-on example

Hands-on ingestion design: run a pool of Wavefront proxies behind a load balancer. Agents and app SDKs send metrics to the proxy. The proxy applies preprocessor rules for tag normalization and filtering, then forwards to the observability backend.

What is the Wavefront proxy, and why use it? [Advanced]

Answer

The Wavefront proxy is an ingestion gateway that receives metrics locally and forwards them to Wavefront/DX OpenExplore. I use it to centralize egress, buffer traffic, apply preprocessing rules, normalize tags, and improve reliability.

Technical explanation

A proxy reduces the need for every app or agent to connect directly to the SaaS endpoint.

It can apply point filtering and point alteration rules before data is sent.

In production, multiple proxies behind a load balancer avoid a single point of failure and increase throughput.

Hands-on example

Example: deploy three proxies in Kubernetes, expose them through an internal Service, and point Telegraf and app SDKs to wavefront-proxy.monitoring.svc:2878. Add preprocessor rules to drop dev debug metrics and normalize env tags to prod, stage, or dev.
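
To make the proxy input concrete, here is a minimal sketch that formats one point in the plain-text Wavefront data format (name, value, timestamp, source, then point tags). The helper name, metric, source, and tag values are illustrative assumptions; real deployments would more likely use an agent or SDK than hand-built lines.

```python
import time

def wavefront_line(name, value, source, tags=None, timestamp=None):
    """Format one point as: <name> <value> <timestamp> source=<source> key="value" ..."""
    ts = int(time.time()) if timestamp is None else int(timestamp)
    tag_text = "".join(f' {k}="{v}"' for k, v in sorted((tags or {}).items()))
    return f"{name} {value} {ts} source={source}{tag_text}"

line = wavefront_line(
    "app.checkout.request.latency", 182.5, "checkout-7d9f",
    tags={"env": "prod", "service": "checkout"}, timestamp=1700000000,
)
print(line)
```

Sending the point would then amount to writing the line plus a newline to the proxy address from the example, assuming the proxy listens for this format on port 2878.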

How does Wavefront handle high-cardinality metrics compared to Prometheus? [Advanced]

Answer

Wavefront-style platforms are designed for high-cardinality dimensional metrics at large ingest scale, while vanilla Prometheus is more sensitive to high series cardinality because each Prometheus server has local memory and TSDB limits. That said, both require disciplined tag and label design.

Technical explanation

Prometheus cardinality directly affects scrape, memory, disk, and query performance on each server or remote backend.

Wavefront/DX OpenExplore is built as a centralized streaming analytics backend, so it can handle larger dimensional datasets, but cost and query performance still depend on data shape.

Neither platform should receive unbounded request IDs or raw user IDs as metric dimensions.

Hands-on example

Design example: allow tags such as service, cluster, region, endpoint_template, and status_class. Reject tags such as request_id, session_id, email, full_url, and stacktrace. Track top metrics by points per second and unique tag combinations monthly.
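
The allow-list idea can be sketched in a few lines of Python. The allowed tag names come from the example above; the function names are hypothetical, and a real control would sit in the collector or proxy rather than in application code.

```python
ALLOWED_TAGS = {"service", "cluster", "region", "endpoint_template", "status_class"}

def split_tags(tags: dict):
    """Return (accepted, rejected): tags on the allow-list and the names of the rest."""
    accepted = {k: v for k, v in tags.items() if k in ALLOWED_TAGS}
    rejected = sorted(k for k in tags if k not in ALLOWED_TAGS)
    return accepted, rejected

def unique_tag_combinations(points) -> int:
    """Count distinct accepted tag sets, a rough proxy for the number of series."""
    return len({tuple(sorted(split_tags(p)[0].items())) for p in points})
```

With request_id stripped, two points that differ only by request collapse into one series instead of two.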

How do you build alerts in Wavefront, and what is a smart alert? [Advanced]

Answer

Wavefront alerts are built from queries and conditions evaluated over time. A smart alert uses dynamic behavior or noise reduction features to capture real anomalies and reduce false positives compared with simple static thresholds.

Technical explanation

Basic alerts compare a query result to a fixed threshold for a duration.

More advanced alerts use baselines, anomaly detection, missing data handling, composite conditions, or linked alerts.

Like any alerting system, severity and routing should map to required human action.

Hands-on example

Example: create an alert on p95 checkout latency for env=prod. Static condition: p95 > 750 ms for 10 minutes. Smart condition: current latency deviates significantly from normal same-time-of-day baseline and error rate also increases, reducing false pages during normal traffic peaks.

How do you do anomaly detection in Wavefront? [Advanced]

Answer

Anomaly detection in Wavefront is done by comparing current metric behavior against historical or statistical baselines, then alerting when deviation is significant and sustained. It is useful when static thresholds are hard to set.

Technical explanation

Anomaly detection works best on metrics with stable seasonality or predictable patterns.

It should be combined with impact signals so normal business spikes do not page humans.

Validate anomalies against SLOs, deployments, and incidents before trusting them for paging.

Hands-on example

Example: monitor payment authorization latency, which normally rises during business hours. Use an anomaly/baseline query to compare current p95 to the expected band for that time. Page only if anomaly is sustained and checkout error-budget burn is also elevated.
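
The baseline-plus-impact logic can be sketched with generic statistics. This is not Wavefront's built-in anomaly functionality; the three-standard-deviation band, the three-sample persistence rule, and the burn-rate threshold are all assumptions chosen for illustration.

```python
from statistics import mean, stdev

def is_anomalous(history, current, k=3.0):
    """True when current is more than k standard deviations above the baseline mean."""
    return current > mean(history) + k * stdev(history)

def should_page(history, recent, burn_rate, k=3.0, min_points=3, burn_threshold=1.0):
    """Page only if the last min_points samples are all anomalous and budget burn is elevated."""
    sustained = len(recent) >= min_points and all(
        is_anomalous(history, value, k) for value in recent[-min_points:]
    )
    return sustained and burn_rate > burn_threshold

# Hypothetical p95 latency (ms) at the same hour on previous days.
same_hour_history = [400, 420, 410, 430, 415]
print(should_page(same_hour_history, [470, 480, 475], burn_rate=2.0))
```

Here history is meant to hold values from comparable time windows, so that a normal business-hours rise sits inside the band.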

How would you choose between Prometheus, Splunk, and Wavefront for a given signal? [Advanced]

Answer

I choose Prometheus for Kubernetes-native metrics and SLO alerting, Splunk for log/event search and audit investigation, and Wavefront/DX OpenExplore for high-scale dimensional metrics, analytics, and advanced alerting. The right tool depends on the signal and workflow.

Technical explanation

Metrics are best for alerting, SLOs, trends, and capacity. Logs are best for detailed events and forensic analysis. Traces are best for request-path debugging.

Prometheus is excellent close to workloads; Splunk is excellent for logs and security; Wavefront-style platforms shine as centralized metrics analytics backends.

A mature environment often uses all three with correlation links.

Hands-on example

Example decision: use Prometheus for service:error_budget_burn alerts, Grafana for SLO dashboards, Splunk for correlated logs using trace_id, and Wavefront for cross-region infrastructure analytics and anomaly detection at large scale.

What is the difference between metrics-based and log-based alerting, and the cost implications? [Advanced]

Answer

Metrics-based alerting evaluates pre-aggregated numeric time series and is usually cheaper, faster, and more reliable for paging. Log-based alerting searches event data and is useful for rare conditions or specific error patterns, but it can be more expensive and noisy at scale.

Technical explanation

Metrics are compact and purpose-built for alert evaluation, making them ideal for SLO burn, latency, traffic, and saturation.

Logs carry richer context but require high-volume ingestion and search processing.

Use log alerts sparingly for conditions that cannot be represented safely as metrics, such as specific audit violations or unique fatal error signatures.

Hands-on example

Example: page on Prometheus error-budget burn for checkout. Create a lower-volume Splunk alert for a specific security pattern such as repeated admin login failures from one IP. Do not run a search across all application logs every minute that pages on any generic 'error' match.

What is a telemetry pipeline, and why might you put one in front of your backends? [Advanced]

Answer

A telemetry pipeline receives, processes, filters, enriches, samples, routes, and exports observability data before it reaches backends. I put one in front of backends to control cost, quality, security, routing, and vendor flexibility.

Technical explanation

Pipelines can drop noisy data, redact sensitive fields, normalize attributes, and enforce tenant quotas.

They decouple instrumentation from backend choice by supporting multiple exporters.

OpenTelemetry Collector is a common vendor-neutral pipeline component.

Hands-on example

Example: apps send OTLP traces, metrics, and logs to an OpenTelemetry Collector gateway. Processors add environment and team tags, redact PII, sample traces, drop debug logs in prod, then export metrics to Prometheus remote write/Mimir, logs to Splunk, and traces to Tempo or a vendor APM.

What is OpenTelemetry, and what problem does it solve? [Advanced]

Answer

OpenTelemetry is a vendor-neutral observability framework for generating, collecting, processing, and exporting telemetry such as traces, metrics, and logs. It solves the problem of every tool requiring different agents, SDKs, and instrumentation formats.

Technical explanation

OTel provides APIs, SDKs, semantic conventions, instrumentation libraries, the OTLP protocol, and the Collector.

It reduces vendor lock-in because telemetry can be sent to multiple open-source or commercial backends.

It is especially valuable for distributed tracing and consistent resource/service attributes across languages.

Hands-on example

Hands-on: instrument a Java service with the OpenTelemetry Java agent, set OTEL_SERVICE_NAME=checkout and OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317 (4317 is the conventional OTLP gRPC port and 4318 the OTLP HTTP port, so the exporter protocol must match the port), then configure the Collector to export traces to Tempo, metrics to Prometheus/Mimir, and logs to Splunk.

What is the difference between the OpenTelemetry SDK and the Collector? [Advanced]

Answer

The OpenTelemetry SDK runs inside the application process to create telemetry, while the Collector is a separate service or agent that receives, processes, and exports telemetry. The SDK is for instrumentation; the Collector is for pipeline control.

Technical explanation

SDKs implement APIs, sampling, span creation, metric readers, resource attributes, and exporters inside the application process.

The Collector supports receivers, processors, exporters, and extensions for routing, batching, filtering, enrichment, and sampling.

Using the Collector prevents every application from needing direct credentials and backend-specific configuration.

Hands-on example

Example: application SDK exports OTLP to a local Collector agent. The agent batches and forwards to a gateway Collector. The gateway performs tail sampling, PII filtering, tenant routing, and exports to observability backends. App teams only configure service name and endpoint.

How would you design ingestion controls to manage observability cost at scale? [Advanced]

Answer

I design ingestion controls with budgets, quotas, sampling, filtering, retention tiers, cardinality limits, and ownership tags. The goal is to preserve high-value debugging and SLO signals while preventing uncontrolled telemetry growth.

Technical explanation

Controls should exist at source, collector, backend, and review-process levels.

Every signal should have an owner, purpose, retention class, and cost visibility.

High-cardinality labels, debug logs, and unsampled traces need explicit approval in production.

Hands-on example

Hands-on design: require service.name, team, env, and cost_center attributes. At the collector, drop health-check logs, hash or remove PII, enforce max label cardinality policies, sample successful traces, keep error traces, and route audit logs to longer-retention Splunk indexes.

How do you decide sampling rates for traces? [Advanced]

Answer

I choose trace sampling rates based on traffic volume, incident value, latency/error risk, compliance needs, and backend cost. I keep all or most errors and rare critical paths, while sampling high-volume successful traffic more aggressively.

Technical explanation

Uniform sampling is simple but can miss rare failures in high-volume systems.

Rules-based sampling can retain errors, slow requests, VIP tenants, or critical endpoints.

Sampling decisions should be reviewed with actual trace volume and incident usefulness, not guessed once and forgotten.

Hands-on example

Example policy: keep 100 percent of traces with error=true, 100 percent of checkout payment flows, 10 percent of normal checkout success traces, and 1 percent of high-volume read-only catalog requests. Revisit rates monthly based on backend cost and debugging gaps.
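
Before committing to rates like these, it helps to estimate what they retain. The sketch below applies the example rates to made-up daily trace counts; the volumes and category names are assumptions, so only the arithmetic carries over.

```python
# Hypothetical daily trace counts per category, paired with the example sampling rates.
policy = {
    "error":            {"traces_per_day": 20_000,    "rate": 1.00},
    "checkout_payment": {"traces_per_day": 100_000,   "rate": 1.00},
    "checkout_success": {"traces_per_day": 1_000_000, "rate": 0.10},
    "catalog_read":     {"traces_per_day": 5_000_000, "rate": 0.01},
}

def kept_per_day(policy):
    """Expected number of traces retained per day for each category."""
    return {name: round(p["traces_per_day"] * p["rate"]) for name, p in policy.items()}

kept = kept_per_day(policy)
total_in = sum(p["traces_per_day"] for p in policy.values())
overall_rate = sum(kept.values()) / total_in
print(kept)
print(f"overall retained fraction: {overall_rate:.1%}")
```

Rerunning the calculation with real volumes at each monthly review shows whether a rate change would meaningfully move backend cost.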

What is head-based versus tail-based sampling for traces? [Advanced]

Answer

Head-based sampling decides whether to keep a trace at the beginning of the request. Tail-based sampling decides after seeing the complete trace, so it can keep errors, slow traces, or specific outcomes more intelligently.

Technical explanation

Head sampling is simple, cheap, and can run in the SDK, but it may drop the rare bad trace before knowing it is bad.

Tail sampling needs a collector or backend that buffers spans until the trace outcome is known.

Tail sampling gives better incident value but increases pipeline complexity, memory use, and the delay before traces reach the backend.

Hands-on example

Example: use head sampling at 10 percent in low-criticality services. For checkout, send all spans to an OTel Collector gateway using tail sampling rules: keep all error traces, all traces above 1 second, and 5 percent of normal successes.
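
The two decision points can be contrasted in a short sketch. The span dictionaries and their field names are assumptions, and a real deployment would configure this in the SDK and the Collector rather than hand-write it; the sketch only shows what information each approach has when it decides.

```python
def head_sample(trace_id: str, rate: float) -> bool:
    """Head-based: decide from the trace ID alone, before any outcome is known."""
    # Treat the last 8 hex digits of the trace ID as a number in [0, 1).
    return int(trace_id[-8:], 16) / 0x1_0000_0000 < rate

def tail_keep(spans: list, success_rate: float = 0.05) -> bool:
    """Tail-based: decide after all spans of the trace have been buffered."""
    if any(span["error"] for span in spans):
        return True                                   # keep all error traces
    duration_ms = max(s["end_ms"] for s in spans) - min(s["start_ms"] for s in spans)
    if duration_ms > 1000:
        return True                                   # keep traces slower than 1 second
    return head_sample(spans[0]["trace_id"], success_rate)  # sample normal successes
```

The head sampler never sees the error or duration fields, which is exactly why it can drop the rare bad trace.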

What is distributed tracing, and what is a span and a trace? [Advanced]

Answer

Distributed tracing follows a request across service boundaries. A trace is the full end-to-end request journey; a span is one timed operation within that journey, such as an HTTP handler, database call, queue publish, or downstream RPC.

Technical explanation

Each span has a trace ID, span ID, parent span ID (absent on the root span), timestamps, attributes, events, and status.

Traces reveal dependency latency, fan-out, retries, and where errors occur in a call chain.

Tracing is most valuable when service names, routes, status codes, and error attributes follow consistent conventions.

Hands-on example

Example: checkout request trace includes spans: ingress -> checkout POST /orders -> inventory reserve -> payment authorize -> database insert -> Kafka publish. If p95 latency rises, the trace waterfall shows payment authorize consumes most of the time.
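
A simplified data model makes the waterfall reading concrete. The Span class below is a stripped-down illustration, not the OpenTelemetry SDK type, and the timings are invented to match the example.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]   # None for the root span
    name: str
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms

def slowest_child(spans, parent_id):
    """Return the direct child of parent_id with the longest duration."""
    children = [s for s in spans if s.parent_id == parent_id]
    return max(children, key=lambda s: s.duration_ms)

# Hypothetical trace: all spans share trace_id "t1".
trace = [
    Span("t1", "a", None, "checkout POST /orders", 0, 900),
    Span("t1", "b", "a", "inventory reserve", 10, 90),
    Span("t1", "c", "a", "payment authorize", 100, 820),
    Span("t1", "d", "a", "database insert", 830, 870),
]
print(slowest_child(trace, "a").name)
```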

How does context propagation work across services in tracing? [Advanced]

Answer

Context propagation passes trace context across service boundaries so spans can be linked into one trace. It usually uses HTTP headers such as W3C traceparent and tracestate, and equivalent metadata for messaging or RPC systems.

Technical explanation

The caller injects trace context into outbound requests; the callee extracts it and creates child spans.

Without propagation, each service creates separate traces and root-cause analysis becomes fragmented.

For asynchronous systems, context should be propagated through message headers while respecting security and privacy rules.

Hands-on example

Hands-on: ensure services use OTel auto-instrumentation or middleware for HTTP/gRPC. Verify outgoing requests contain traceparent. For Kafka, propagate trace context in message headers. In the trace UI, confirm checkout, payment, inventory, and notification spans share the same trace ID.
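
What inject and extract do with the W3C traceparent header can be sketched directly. The validation here is simplified (for instance, it does not reject all-zero IDs), and the function names are illustrative; real services should rely on the OTel propagators rather than this code.

```python
import re
import secrets

# version - 32 hex trace ID - 16 hex parent span ID - 2 hex flags
TRACEPARENT_RE = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def extract(headers: dict):
    """Callee side: parse traceparent into (trace_id, parent_span_id, sampled)."""
    match = TRACEPARENT_RE.match(headers.get("traceparent", ""))
    if not match:
        return None                        # no valid context: a new trace would start here
    _version, trace_id, span_id, flags = match.groups()
    return trace_id, span_id, bool(int(flags, 16) & 0x01)

def inject(trace_id: str, sampled: bool = True) -> dict:
    """Caller side: mint a new span ID and build the outbound header."""
    span_id = secrets.token_hex(8)         # 16 hex characters
    return {"traceparent": f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"}
```

The trace ID is carried through unchanged while each hop mints its own span ID, which is what lets the trace UI stitch spans from different services together.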

What is cardinality cost, and how does it differ between metrics and logs? [Advanced]

Answer

Cardinality cost is the resource and financial impact of unique dimensional combinations. In metrics, cardinality creates new time series and directly affects memory, storage, and query cost. In logs, high-cardinality fields are expected, but ingest volume and indexing/search patterns drive cost.

Technical explanation

Metrics should use bounded labels because every unique label set is a new series.

Logs can contain request_id or user_id for investigation, but logging every event at high volume still creates ingest and storage cost.

Traces also have cardinality concerns through attributes, but sampling and backend indexing policies determine cost impact.

Hands-on example

Example: put user_id in logs and traces for controlled debugging, but never in Prometheus labels. For metrics, expose tenant_tier="enterprise" instead of tenant_id. For logs, control retention and indexing of user_id based on privacy and investigation requirements.
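
The effect of swapping tenant_id for tenant_tier can be shown with simple arithmetic: the worst-case series count for a metric is the product of the distinct values of each label. The cardinalities below are made-up numbers for illustration.

```python
from math import prod

def series_count(label_cardinalities: dict) -> int:
    """Upper bound on time series for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())

# Hypothetical cardinalities: 20 services, 5 status classes, 3 tiers or 10,000 tenants.
bounded = series_count({"service": 20, "status_class": 5, "tenant_tier": 3})
unbounded = series_count({"service": 20, "status_class": 5, "tenant_id": 10_000})
print(bounded, unbounded)
```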

How do you design alerts that page a human only when action is required? [Advanced]

Answer

I design human-page alerts around user impact, urgency, ownership, and required action. If no immediate human action is needed, the signal should become a ticket, dashboard annotation, or automated remediation rather than a page.

Technical explanation

Use SLO burn-rate alerts for page-worthy service symptoms.

Require every page to include service, severity, owner, runbook, dashboard, and recent-change links.

Tune alerts with historical page reviews and remove alerts that do not lead to action.

Hands-on example

Hands-on: convert CPUHigh pages into tickets unless CPU saturation is proven to cause SLO burn. Keep pages for checkout high burn rate, payment dependency outage, and data loss risk. Add Alertmanager inhibition so pod-level alerts do not page when the service-level SLO alert is already firing.

What is the difference between a page, a ticket, and a dashboard signal? [Advanced]

Answer

A page is for urgent human action now, a ticket is for non-urgent work that should be tracked, and a dashboard signal is context for investigation or review. Mixing these creates fatigue and poor prioritization.

Technical explanation

Pages should be rare, actionable, and user-impacting or immediately risk-bearing.

Tickets are appropriate for capacity trends, flaky jobs, low-severity policy violations, or slow error-budget burn.

Dashboard signals can be useful without triggering workflow, such as dependency latency trends or deploy markers.

Hands-on example

Example: 20x SLO burn for checkout = page. Disk projected full in seven days = ticket. CPU at 70 percent during expected peak = dashboard signal. One failed synthetic probe from one region for one minute = dashboard or warning, not a page.

How would you reduce alert noise across many teams (deduplication, correlation, AIOps)? [Advanced]

Answer

To reduce alert noise across many teams, I standardize alert labels, deduplicate related alerts, correlate symptoms with causes, use inhibition, enforce ownership metadata, and review noisy alerts as an operational metric. AIOps can help, but good hygiene comes first.

Technical explanation

Normalize labels such as service, team, environment, severity, cluster, and alert_type.

Group alerts by incident context so one dependency outage does not create hundreds of pages.

Use event correlation to identify shared causes such as a bad deployment, region outage, or database failure.

Hands-on example

Hands-on: build a weekly report of alerts by team, alertname, service, and action taken. For top noisy alerts, add grouping or inhibition, convert non-actionable pages to tickets, and update runbooks. Use correlation rules to link pod restarts, 5xx errors, and deploy events.

What is event correlation, and how does it reduce incident noise? [Advanced]

Answer

Event correlation groups related alerts, logs, changes, and topology signals into a smaller number of incident candidates. It reduces noise by showing that many symptoms likely share one cause.

Technical explanation

Correlation can use time proximity, service dependency maps, Kubernetes ownership, deployment events, region, node, or common error signatures.

It improves incident response by reducing duplicate triage and highlighting blast radius.

Correlation should not hide severity; it should preserve evidence while reducing notification volume.

Hands-on example

Example: a node failure triggers PodDown, ReplicaUnavailable, service error-rate, and synthetic alerts. Correlation links them by node, namespace, and time, then presents one incident: 'node-17 failure impacting checkout pods' with related alerts attached.
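
A toy version of this grouping, using only a shared node label and time proximity, might look like the sketch below. It assumes every alert has already been enriched with node and a Unix timestamp ts, which real service-level alerts often are not; the alert names and the five-minute window are also assumptions.

```python
def correlate(alerts, window_s=300):
    """Group alerts that share a node and fire within window_s of the group's first alert."""
    incidents = []
    open_by_node = {}
    for alert in sorted(alerts, key=lambda a: a["ts"]):
        incident = open_by_node.get(alert["node"])
        if incident is None or alert["ts"] - incident["start"] > window_s:
            incident = {"node": alert["node"], "start": alert["ts"], "alerts": []}
            open_by_node[alert["node"]] = incident
            incidents.append(incident)
        incident["alerts"].append(alert["name"])   # evidence is attached, not discarded
    return incidents

alerts = [
    {"name": "PodDown", "node": "node-17", "ts": 1000},
    {"name": "ReplicaUnavailable", "node": "node-17", "ts": 1030},
    {"name": "DiskPressure", "node": "node-3", "ts": 1050},
    {"name": "CheckoutErrorRate", "node": "node-17", "ts": 1090},
    {"name": "SyntheticProbeFailed", "node": "node-17", "ts": 1120},
]
incidents = correlate(alerts)
print(incidents)
```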

How would you measure observability coverage across services? [Advanced]

Answer

I measure observability coverage by checking whether every production service has owned metrics, logs, traces, dashboards, alerts, SLOs, runbooks, and correlation metadata. Coverage should be measured against operational outcomes, not just whether an agent is installed.

Technical explanation

Required attributes include service name, owner/team, environment, version, cluster, and runbook links.

Coverage should include signal quality: useful labels, structured logs, trace propagation, and actionable alerts.

Review coverage as part of production readiness and monthly operational reviews.

Hands-on example

Hands-on scorecard: for each service, mark RED metrics present, p95/p99 latency available, structured logs with trace_id, traces across dependencies, SLO defined, burn-rate alert configured, dashboard link, runbook link, and owner label. Track percent complete by team.
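
The scorecard reduces to a small calculation. The check names below are hypothetical keys derived from the list above, and the service records are assumed to come from some inventory or catalog export.

```python
CHECKS = [
    "red_metrics", "latency_percentiles", "structured_logs_trace_id", "traces",
    "slo_defined", "burn_rate_alert", "dashboard_link", "runbook_link", "owner_label",
]

def coverage_percent(service: dict) -> float:
    """Percentage of scorecard checks that a service passes."""
    passed = sum(1 for check in CHECKS if service.get(check))
    return 100.0 * passed / len(CHECKS)

def coverage_by_team(services) -> dict:
    """Average coverage percentage per team."""
    by_team = {}
    for service in services:
        by_team.setdefault(service["team"], []).append(coverage_percent(service))
    return {team: sum(scores) / len(scores) for team, scores in by_team.items()}
```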

How do you instrument a service so that an on-call engineer can debug it without code changes? [Advanced]

Answer

I instrument a service with standardized metrics, structured logs, distributed traces, correlation IDs, deployment metadata, dependency spans, and runbook links so on-call can debug without code changes. The goal is predictable telemetry for every request path.

Technical explanation

Metrics should cover RED, dependency health, queue depth, resource saturation, and business-critical counters.

Logs should be structured, sampled responsibly, and include trace_id, service, version, tenant tier, and error code.

Traces should include meaningful span names and attributes but avoid sensitive data.

Hands-on example

Implementation example: add OTel auto-instrumentation, Prometheus /metrics, JSON logging middleware, trace_id injection into logs, deployment annotations, health/readiness endpoints, and dashboards generated from service templates. Validate by running a failure drill before production launch.
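
The trace_id injection step is the part most often left vague, so here is a minimal standard-library sketch of it. It assumes the request's trace ID is stored in a context variable; in a real service that value would normally come from the active OpenTelemetry span rather than being set by hand. The names current_trace_id, JsonFormatter, and build_logger, and the service and version values, are hypothetical.

```python
import contextvars
import io
import json
import logging

# Hypothetical per-request context; normally populated from the active trace span.
current_trace_id = contextvars.ContextVar("current_trace_id", default="")


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object with correlation fields."""

    def __init__(self, service, version):
        super().__init__()
        self.service = service
        self.version = version

    def format(self, record):
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            "service": self.service,
            "version": self.version,
            "trace_id": current_trace_id.get(),
        }
        return json.dumps(payload)


def build_logger(stream, service="checkout", version="1.4.2"):
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service, version))
    logger = logging.getLogger(f"{service}.json")
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


stream = io.StringIO()
log = build_logger(stream)
current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
log.error("payment authorize timed out")
record = json.loads(stream.getvalue())
print(record)
```

With trace_id on every log line, on-call can pivot from a slow trace to the exact log events for that request without adding new log statements.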

How would you build an SLO dashboard and tie alerts to error-budget burn? [Advanced]

Answer

An SLO dashboard should show SLI value, SLO target, remaining error budget, burn rate across multiple windows, incident links, and the service signals needed to explain budget burn. Alerts should be based on error-budget burn, not unrelated infrastructure thresholds.

Technical explanation

The top of the dashboard should answer: are users impacted, how fast is budget burning, and how much budget remains?

Supporting rows should show latency, error rate, traffic, saturation, dependencies, and recent deploys.

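Define the SLI and burn-rate expressions once as recording rules, and have both the dashboard panels and the alert rules query those recorded series so the two can never disagree.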

Hands-on example

Hands-on: create recording rules for good_requests, total_requests, error_ratio, and burn_rate for 5m, 1h, 6h, and 3d windows. Grafana panels show 28-day compliance, budget remaining, fast-burn page status, slow-burn ticket status, and links to Splunk and traces.
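
The arithmetic behind those recording rules can be sketched in Python. This is an illustration of the math only, not Prometheus rule syntax. The thresholds (14.4 for the fast-burn page on the 1h and 5m windows, 1.0 for the slow-burn ticket on the 3d and 6h windows) are commonly cited starting points for a 30-day SLO window and are assumptions here; they would need tuning for a 28-day window and for the service's traffic.

```python
def error_ratio(good, total):
    """Fraction of requests that were not good; zero when there was no traffic."""
    return 0.0 if total == 0 else 1.0 - good / total


def burn_rate(good, total, slo_target):
    """How many times faster than 'exactly on budget' the budget is being spent."""
    budget = 1.0 - slo_target
    return error_ratio(good, total) / budget


def evaluate(windows, slo_target, fast=14.4, slow=1.0):
    """Return (decision, rates) using a long window plus a short confirming window."""
    rates = {w: burn_rate(good, total, slo_target) for w, (good, total) in windows.items()}
    if rates["1h"] > fast and rates["5m"] > fast:
        return "page", rates
    if rates["3d"] > slow and rates["6h"] > slow:
        return "ticket", rates
    return "ok", rates


# Hypothetical (good, total) request counts per window for a 99.9% SLO.
fast_burn = {
    "5m": (9_800, 10_000),
    "1h": (117_600, 120_000),
    "6h": (712_800, 720_000),
    "3d": (8_630_000, 8_640_000),
}
slow_burn = {
    "5m": (9_999, 10_000),
    "1h": (119_988, 120_000),
    "6h": (718_560, 720_000),
    "3d": (8_627_040, 8_640_000),
}
page_decision, page_rates = evaluate(fast_burn, slo_target=0.999)
ticket_decision, ticket_rates = evaluate(slow_burn, slo_target=0.999)
print(page_decision, ticket_decision)
```

Requiring both the long and the short window to exceed the threshold means the alert fires only while the problem is still happening, which is what makes the page actionable.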

How do you trigger automated remediation from an observability signal? [Advanced]

Answer

Automated remediation should be triggered only from reliable, well-scoped observability signals and guarded with safety controls. I use it for known, reversible actions such as restart, rollback, scale-out, cache flush, or traffic shift, not for ambiguous incidents.

Technical explanation

The signal should be high confidence, ideally symptom plus known cause, and the remediation should be idempotent or reversible.

Guardrails include rate limits, blast-radius limits, approval for risky actions, audit logs, and automatic rollback of remediation if it worsens SLOs.

Runbooks should define when automation is allowed and when humans must be involved.

Hands-on example

Example: if queue_depth > threshold, consumer lag increases, and CPU is below saturation, trigger KEDA/HPA scale-out for workers. If error-budget burn continues after scale-out, stop further automation and page the owning team with remediation actions logged.
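
The decision logic, including the stop condition, can be sketched as a small pure function. All names and thresholds below are hypothetical. Treating a saturated CPU as a reason to page is an assumption of this sketch, since the example above only covers the case where CPU is below saturation. The function decides what to do; actually triggering KEDA or HPA is out of scope.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    queue_depth: int
    lag_increasing: bool
    cpu_utilization: float  # 0.0 to 1.0
    burn_rate: float        # error-budget burn rate, 1.0 means exactly on budget


def decide(signals, scale_outs_done, *, depth_threshold=10_000,
           cpu_saturation=0.85, max_scale_outs=2, burn_limit=1.0):
    """Return 'none', 'scale_out', or 'page' for the current signals."""
    backlog = signals.queue_depth > depth_threshold and signals.lag_increasing
    if not backlog:
        return "none"
    # Assumption: saturated workers fall outside the known-safe case, so hand off to a human.
    if signals.cpu_utilization >= cpu_saturation:
        return "page"
    # Budget still burning after an earlier scale-out: stop automating.
    if scale_outs_done > 0 and signals.burn_rate > burn_limit:
        return "page"
    # Guardrail: cap the number of automated actions per incident.
    if scale_outs_done >= max_scale_outs:
        return "page"
    return "scale_out"


first = decide(Signals(20_000, True, 0.40, 0.5), scale_outs_done=0)
after_failed_fix = decide(Signals(20_000, True, 0.40, 3.0), scale_outs_done=1)
print(first, after_failed_fix)
```

Keeping the decision separate from the action makes it easy to unit test the guardrails and to write every decision to an audit log.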

How would you investigate a latency spike using metrics, traces, and logs together? [Advanced]

Answer

To investigate a latency spike, I start with metrics to identify scope and timing, use traces to find the slow path or dependency, and use logs to inspect exact errors or state changes for affected requests.

Technical explanation

Metrics answer when, where, how many users, which endpoints, and whether errors or saturation also changed.

Traces answer which service or dependency consumed time and whether retries or fan-out amplified latency.

Logs answer detailed causes such as timeout messages, SQL errors, throttling responses, or bad configuration.

Hands-on example

Runbook: 1) Check p95/p99 by endpoint and region. 2) Compare traffic and error rate. 3) Check recent deploy annotations. 4) Open slow traces around the spike. 5) Identify slow span, such as payment authorize. 6) Search Splunk logs by trace_id and service=payment. 7) Mitigate with rollback, traffic shift, or dependency escalation.
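
Step 5, identifying the slow span, is something a tracing UI normally does for you, but the idea is easy to show. The sketch below ranks spans by self time, meaning a span's own duration minus the time spent in its direct children. It assumes child spans run sequentially and do not overlap, and the span records are made up for the example.

```python
def slowest_span(spans):
    """Return the span with the most self time (duration minus direct children)."""
    child_time = {}
    for span in spans:
        parent = span["parent_id"]
        if parent is not None:
            child_time[parent] = child_time.get(parent, 0) + span["duration_ms"]

    def self_time(span):
        return span["duration_ms"] - child_time.get(span["span_id"], 0)

    return max(spans, key=self_time)


# Hypothetical spans from one slow checkout request.
trace = [
    {"span_id": "a", "parent_id": None, "name": "POST /checkout", "duration_ms": 2400},
    {"span_id": "b", "parent_id": "a", "name": "cart load", "duration_ms": 80},
    {"span_id": "c", "parent_id": "a", "name": "payment authorize", "duration_ms": 2200},
    {"span_id": "d", "parent_id": "c", "name": "db query", "duration_ms": 50},
]
culprit = slowest_span(trace)
print(culprit["name"])
```

Ranking by self time rather than total duration avoids blaming the root span, which is always the longest simply because it contains everything else.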

What recent observability practice or tool have you adopted, and what improved? [Advanced]

Answer

A recent observability practice I have adopted is using OpenTelemetry as a standard instrumentation and collection layer, combined with SLO-based alerting. It improved vendor flexibility and trace correlation, and it reduced alert noise by focusing pages on user-impacting burn rates.

Technical explanation

OpenTelemetry standardizes service names, resource attributes, trace context, and export paths across languages.

A collector pipeline lets platform teams manage sampling, filtering, enrichment, and routing centrally.

SLO-based alerting moved the team away from CPU-style pages toward user-impacting conditions.

Hands-on example

Interview example: I would describe migrating one service first: enable OTel auto-instrumentation, route telemetry through a Collector, add trace_id to logs, build an SLO dashboard, and replace noisy pod alerts with error-budget burn alerts. The result is faster triage and fewer non-actionable pages.

How do you prevent a single noisy service from blowing up observability costs for everyone? [Advanced]

Answer

I prevent one noisy service from blowing up shared observability costs with quotas, ownership tags, ingestion limits, cardinality policies, sampling, retention tiers, and review gates. Cost must be visible to the service owner.

Technical explanation

Each telemetry stream should include team/service/cost-center attributes so chargeback or showback is possible.

Collectors and backends should enforce limits on bytes, points per second, spans per second, and label cardinality.

Noisy services should be throttled or routed to shorter retention rather than degrading the platform for everyone.

Hands-on example

Hands-on: create per-team telemetry budgets. If service=catalog exceeds its metrics cardinality budget, the collector drops disallowed labels and sends warnings. If logs exceed daily quota, DEBUG logs are dropped first, errors are preserved, and the owning team receives a cost report.
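
Both enforcement rules can be sketched as small functions. This is an illustration of the policy, not collector configuration. The allowed-label set, the function names, and the 1.5x overage tier at which INFO is also dropped are assumptions made for the example.

```python
# Hypothetical label policy for shared metrics.
ALLOWED_LABELS = {"service", "team", "env", "status_code"}
ALWAYS_KEEP = {"ERROR", "CRITICAL"}


def enforce_labels(labels, allowed=ALLOWED_LABELS):
    """Drop labels outside the policy and report which ones were removed."""
    kept = {k: v for k, v in labels.items() if k in allowed}
    dropped = sorted(set(labels) - set(kept))
    return kept, dropped


def admit_log(level, bytes_used_today, daily_quota_bytes):
    """Decide whether to ingest a log line: shed DEBUG first, always keep errors."""
    if level in ALWAYS_KEEP:
        return True
    usage = bytes_used_today / daily_quota_bytes
    if usage <= 1.0:
        return True
    if usage <= 1.5:
        return level != "DEBUG"
    # Far over quota (assumed tier): shed INFO as well.
    return level not in ("DEBUG", "INFO")


kept, dropped = enforce_labels(
    {"service": "catalog", "team": "browse", "user_id": "u-1", "request_id": "r-9"}
)
print(kept, dropped)
print(admit_log("DEBUG", 120, 100), admit_log("ERROR", 200, 100))
```

The dropped list is what would feed the warning and the cost report sent to the owning team.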

How would you run a monthly operational review using observability data and SLO trends? [Advanced]

Answer

In a monthly operational review, I use observability data to examine SLO compliance, error-budget trends, incidents, alert quality, capacity risks, cost, and top reliability actions. The output should be decisions and owners, not just dashboards.

Technical explanation

Review which services met or missed SLOs, where error budget was spent, and whether incidents had repeat causes.

Analyze alert volume, pages per service, false positives, missing alerts, and mean time to detect/resolve.

Track observability cost and coverage gaps by team, then prioritize improvements for the next month.

Hands-on example

Agenda: 1) SLO scorecard by service. 2) Top five budget burns and incident themes. 3) Alert noise and paging health. 4) Capacity and cost trends. 5) Coverage gaps in metrics/logs/traces. 6) Action register with owners, due dates, and expected reliability impact.
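
Agenda items 1 and 2 can be generated from request counts. The sketch below builds a scorecard sorted by budget consumed; the service names, counts, and field names are hypothetical, and it assumes a simple request-based availability SLI.

```python
def scorecard(services):
    """Build per-service SLO rows, worst budget consumption first."""
    rows = []
    for svc in services:
        achieved = svc["good"] / svc["total"]
        budget = 1.0 - svc["slo"]
        consumed = (1.0 - achieved) / budget
        rows.append({
            "service": svc["name"],
            "achieved": achieved,
            "met_slo": achieved >= svc["slo"],
            "budget_consumed_pct": 100.0 * consumed,
        })
    return sorted(rows, key=lambda r: r["budget_consumed_pct"], reverse=True)


# Hypothetical monthly request counts.
month = [
    {"name": "search", "slo": 0.995, "good": 4_990_000, "total": 5_000_000},
    {"name": "checkout", "slo": 0.999, "good": 9_985_000, "total": 10_000_000},
]
rows = scorecard(month)
for row in rows:
    status = "met" if row["met_slo"] else "missed"
    print(f"{row['service']}: {status}, {row['budget_consumed_pct']:.0f}% of budget used")
```

Sorting by budget consumed puts the services that most need a decision at the top of the review.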

Source Notes

Prometheus metric types: https://prometheus.io/docs/concepts/metric_types/

Prometheus histograms and summaries: https://prometheus.io/docs/practices/histograms/

Prometheus Alertmanager: https://prometheus.io/docs/alerting/latest/alertmanager/

Prometheus Pushgateway guidance: https://prometheus.io/docs/practices/pushing/

Prometheus instrumentation practices: https://prometheus.io/docs/practices/instrumentation/

OpenTelemetry documentation: https://opentelemetry.io/docs/

OpenTelemetry Collector: https://opentelemetry.io/docs/collector/

Splunk data pipeline: https://docs.splunk.com/Splexicon:Datapipeline

Splunk Search Reference / SPL: https://docs.splunk.com/Documentation/Splunk/8.2.12/SearchReference/WhatsInThisManual

Splunk bucket lifecycle: https://docs.splunk.com/Documentation/Splunk/8.2.12/Indexer/HowSplunkstoresindexes

DX OpenExplore / Wavefront overview: https://techdocs.broadcom.com/us/en/ca-enterprise-software/it-operations-management/dx-openexplore/saas.html

DX OpenExplore Wavefront Query Language reference: https://techdocs.broadcom.com/us/en/ca-enterprise-software/it-operations-management/dx-openexplore/saas/query-language/query_language_reference.html

DX OpenExplore Wavefront proxy: https://techdocs.broadcom.com/us/en/ca-enterprise-software/it-operations-management/dx-openexplore/saas/data-and-proxy/proxies.html
