How do you monitor a batch job or cron that runs infrequently? [Intermediate]

Question

Accepted Answer

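For infrequent batch or cron jobs, I monitor the last success time, the last completion status, runtime, records processed, and the age of the output data. I avoid relying only on process-level metrics, because a short-lived job is usually not running when Prometheus scrapes. A gauge holding the timestamp of the last successful run is the most reliable SLI for 'has this job succeeded recently?': alert when the current time minus that timestamp exceeds the expected interval plus some slack. Tracking runtime, for example with a histogram, helps catch jobs that are slowing down before they miss deadlines. Pushgateway can be used to report the results of service-level batch jobs, but it never expires pushed metrics on its own, so stale series must be detected and cleaned up explicitly.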
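As a rough illustration of the approach above, here is a minimal, stdlib-only Python sketch that renders job results in the Prometheus text exposition format and pushes them to a Pushgateway. The metric names, the job name `nightly_export`, the gateway address `http://localhost:9091`, and the helper names are all hypothetical, chosen for illustration. Duration is recorded as a plain gauge here to keep the sketch short, not as a histogram. It uses POST rather than PUT on the assumption that a failed run should leave the previously pushed last-success timestamp in place.

```python
import time
import urllib.request


def render_metrics(success: bool, started: float, finished: float, records: int) -> str:
    """Render one run's results in the Prometheus text exposition format."""
    lines = [
        "# TYPE batch_job_last_run_timestamp_seconds gauge",
        f"batch_job_last_run_timestamp_seconds {finished:.3f}",
        "# TYPE batch_job_last_run_success gauge",
        f"batch_job_last_run_success {1 if success else 0}",
        "# TYPE batch_job_duration_seconds gauge",
        f"batch_job_duration_seconds {finished - started:.3f}",
        "# TYPE batch_job_records_processed gauge",
        f"batch_job_records_processed {records}",
    ]
    if success:
        # Only emitted on success, so a failed run does not overwrite it.
        lines += [
            "# TYPE batch_job_last_success_timestamp_seconds gauge",
            f"batch_job_last_success_timestamp_seconds {finished:.3f}",
        ]
    # The exposition format requires a trailing newline.
    return "\n".join(lines) + "\n"


def push(body: str, gateway: str = "http://localhost:9091", job: str = "nightly_export") -> None:
    """POST metrics to the Pushgateway grouping key for this job.

    POST replaces only metrics with the same names in the group;
    PUT would replace the whole group.
    """
    req = urllib.request.Request(
        f"{gateway}/metrics/job/{job}",
        data=body.encode("utf-8"),
        method="POST",
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        resp.read()


def run_and_report(job_fn) -> None:
    """Run job_fn (expected to return a record count) and push the outcome."""
    started = time.time()
    success, records = True, 0
    try:
        records = job_fn()
    except Exception:
        success = False
    push(render_metrics(success, started, time.time(), records))
```

With metrics like these, an alert on an expression such as `time() - batch_job_last_success_timestamp_seconds > 90000` (roughly 25 hours, an assumed threshold for a daily job) would fire whether the job failed or never ran at all. When a job is retired, its group can be removed with an HTTP DELETE against the same `/metrics/job/<job>` path so the stale series does not linger.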

