How do you monitor a batch job or cron that runs infrequently? [Intermediate]
Answer
For infrequent batch jobs or cron jobs, I monitor last success time, last completion status, runtime, records processed, and age of output data. I avoid relying only on process-level metrics because the job may not be running when Prometheus scrapes.
Technical explanation
A timestamp gauge is the most reliable SLI for 'has this job succeeded recently?'.
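Runtime metrics (a last-run duration gauge, or a histogram across runs) help catch slow jobs before they miss deadlines.
Pushgateway can be used for service-level batch job results, but it never expires pushed series: the last pushed value keeps being scraped until it is overwritten or deleted. Alert on the age of the last success timestamp rather than on the metric's presence, and delete the groups of retired jobs.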
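The points above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the helper names (render_metrics, push_metrics, run_billing_job), the metric types, and the Pushgateway address http://localhost:9091 are assumptions. It uses only the standard library and the Pushgateway's PUT /metrics/job/<job> endpoint, which replaces all metrics in that job's group.

```python
import time
import urllib.request


def render_metrics(last_success_ts, duration_s, records):
    """Render job results in the Prometheus text exposition format."""
    samples = [
        ("billing_last_success_timestamp_seconds", "gauge",
         "Unix time of the last successful billing run.", float(last_success_ts)),
        ("billing_last_run_duration_seconds", "gauge",
         "Wall-clock duration of the last billing run.", float(duration_s)),
        ("billing_records_processed_total", "counter",
         "Records processed by the last billing run.", float(records)),
    ]
    lines = []
    for name, metric_type, help_text, value in samples:
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} {metric_type}")
        lines.append(f"{name} {value!r}")
    # The exposition format requires a trailing newline.
    return "\n".join(lines) + "\n"


def push_metrics(body, gateway_url="http://localhost:9091", job="billing"):
    """PUT the rendered metrics to an assumed Pushgateway address."""
    request = urllib.request.Request(
        f"{gateway_url}/metrics/job/{job}",
        data=body.encode("utf-8"),
        method="PUT",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


def run_billing_job(process, now=time.time):
    """Run the job and return metrics text only if it succeeded.

    `process` is a hypothetical callable that returns the number of
    records handled and raises on failure. On failure nothing is
    rendered, so the previously pushed success timestamp stays in place
    and keeps ageing.
    """
    start = now()
    records = process()
    end = now()
    return render_metrics(end, end - start, records)
```

A caller would typically run push_metrics(run_billing_job(process)) as the last step of the job, so a crash before that point leaves the old timestamp untouched and the staleness alert fires.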
Hands-on example
Example: a daily billing job exports billing_last_success_timestamp_seconds, billing_last_run_duration_seconds, and billing_records_processed_total. Alert when time() - billing_last_success_timestamp_seconds > 97200 (27 hours, which leaves a 3-hour grace period past the daily schedule) or when billing_last_run_duration_seconds is more than twice the job's historical p95 runtime.
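The two alert conditions can be written out as plain Python to make the thresholds concrete. This is a sketch under stated assumptions: the function names are hypothetical, the 27-hour limit and 2x factor come from the example, and the p95 is taken over a list of past run durations that the caller supplies. In practice the same logic would live in Prometheus alerting rules.

```python
import statistics

# 27 hours: the daily schedule plus a 3-hour grace period.
STALENESS_LIMIT_SECONDS = 27 * 3600


def is_stale(now, last_success_ts, limit_s=STALENESS_LIMIT_SECONDS):
    """True when the last successful run is older than the limit."""
    return now - last_success_ts > limit_s


def p95(durations):
    """95th percentile of past run durations (needs at least two values)."""
    # n=20 yields 19 cut points at 5% steps; the last one is the 95th percentile.
    return statistics.quantiles(durations, n=20, method="inclusive")[18]


def is_too_slow(duration_s, history, factor=2.0):
    """True when a run took more than `factor` times the historical p95."""
    return duration_s > factor * p95(history)
```

Comparing against a multiple of the p95 rather than a fixed number of seconds lets the threshold follow the job as its data volume grows, at the cost of needing enough history for the percentile to be meaningful.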