Interview Observability

How do you monitor a batch job or cron that runs infrequently? [Intermediate]

Answer

For infrequent batch jobs or cron jobs, I monitor the last success time, last completion status, runtime, records processed, and the age of the output data. I avoid relying only on process-level metrics because a short-lived job has usually exited by the time Prometheus scrapes.

Technical explanation

A timestamp gauge is the most reliable SLI for 'has this job succeeded recently?'.

Runtime histograms help catch slow jobs before they miss deadlines.
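As a minimal sketch of the two points above, the wrapper below records a run's outcome into a plain dictionary. The function name run_and_record and the job_* metric names are hypothetical, and the dictionary stands in for whatever metrics client the job really uses. The key detail is that the last-success timestamp is only updated when the run succeeds, so a failing job makes that gauge grow stale.

```python
import time
from typing import Callable, Dict


def run_and_record(job: Callable[[], int], metrics: Dict[str, float]) -> Dict[str, float]:
    """Run job() and update metrics in place.

    job is assumed to return the number of records it processed.
    """
    started = time.monotonic()
    try:
        records = job()
    except Exception:
        # Failure: record the status but leave the last-success timestamp untouched.
        metrics["job_last_run_success"] = 0.0
    else:
        metrics["job_last_run_success"] = 1.0
        metrics["job_records_processed"] = float(records)
        metrics["job_last_success_timestamp_seconds"] = time.time()
    finally:
        # Duration is recorded for successful and failed runs alike.
        metrics["job_last_run_duration_seconds"] = time.monotonic() - started
    return metrics
```

Swallowing the exception keeps the sketch short; a real job would probably log it and exit non-zero after recording the metrics.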

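Pushgateway can be used for service-level batch job results, but it never expires pushed metrics on its own: the last pushed values keep being scraped until they are overwritten or deleted. Stale series therefore have to be cleaned up explicitly.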
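A rough sketch of pushing and cleaning up, using only the standard library against Pushgateway's HTTP API (PUT or DELETE on /metrics/job/<job>). The helper names, the gateway URL, and the choice to treat every value as a gauge are assumptions for illustration; in practice an official client library would usually handle this.

```python
import urllib.parse
import urllib.request
from typing import Dict


def render_metrics(metrics: Dict[str, float]) -> bytes:
    """Render values as gauges in the Prometheus text exposition format."""
    lines = []
    for name, value in sorted(metrics.items()):
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return ("\n".join(lines) + "\n").encode("utf-8")


def _group_url(gateway: str, job: str) -> str:
    return f"{gateway.rstrip('/')}/metrics/job/{urllib.parse.quote(job, safe='')}"


def build_push_request(gateway: str, job: str, metrics: Dict[str, float]) -> urllib.request.Request:
    """PUT replaces every metric in the job's group with the new payload."""
    return urllib.request.Request(
        _group_url(gateway, job),
        data=render_metrics(metrics),
        method="PUT",
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )


def build_delete_request(gateway: str, job: str) -> urllib.request.Request:
    """DELETE removes the job's group, which is the explicit cleanup step."""
    return urllib.request.Request(_group_url(gateway, job), method="DELETE")


def send(request: urllib.request.Request) -> int:
    """Send the request and return the HTTP status code."""
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

For example, send(build_push_request("http://localhost:9091", "billing", metrics)) at the end of a run, assuming a gateway is listening on that hypothetical address.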

Hands-on example

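Example: a daily billing job exports billing_last_success_timestamp_seconds, billing_last_run_duration_seconds, and billing_records_processed_total. Alert when time() - billing_last_success_timestamp_seconds > 27 * 3600 (27 hours expressed in seconds, since the timestamp metric is in seconds; this leaves about three hours of slack over the daily schedule), or when the runtime is more than twice the historical p95.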
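The two alert conditions can be checked by hand with a small sketch like the one below. It is not how Prometheus evaluates rules; the function names and the nearest-rank p95 are assumptions chosen to keep the arithmetic easy to follow.

```python
from typing import Sequence

MAX_AGE_SECONDS = 27 * 3600  # 27 hours, matching the threshold in the example


def is_stale(now: float, last_success_ts: float, max_age: float = MAX_AGE_SECONDS) -> bool:
    """True when the last success is older than max_age seconds."""
    return now - last_success_ts > max_age


def p95(samples: Sequence[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty sample."""
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def is_slow(duration: float, history: Sequence[float], factor: float = 2.0) -> bool:
    """True when duration exceeds factor times the historical p95."""
    if not history:
        return False  # no baseline yet, so do not alert
    return duration > factor * p95(history)
```

With 19 past runs of 100 seconds and one of 200 seconds, the nearest-rank p95 is 100 seconds, so a 250 second run counts as slow while a 150 second run does not.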
