Intermediate, System-Design
30 min

Design a Distributed Job Scheduler

Scheduling, Reliability, Databases, Messaging
Interview Question

Design a reliable, horizontally scalable scheduler (distributed cron) that supports one-off and recurring jobs with retries and idempotency.

Key Points to Cover
  • Job model: due time, retry policy, idempotency keys
  • Leader election or partitioned ownership; fencing tokens
  • Durable storage with visibility timeouts and leases
  • At-least-once execution with dedupe and DLQs
  • Observability: job states, SLAs, replays, and pause/resume
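The first three points can be sketched together as a job record plus a lease with a fencing token. This is a minimal in-memory illustration, not a definitive design: `Job`, `JobStore`, and every field name are hypothetical, and a real scheduler would perform the lease step as an atomic conditional update in durable storage.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Job:
    job_id: str
    due_time: float                  # epoch seconds when the job becomes runnable
    idempotency_key: str             # lets downstream work dedupe repeated runs
    max_attempts: int = 3            # retry policy, reduced to an attempt budget
    attempts: int = 0
    state: str = "pending"           # pending -> leased -> done
    lease_owner: Optional[str] = None
    lease_expires_at: float = 0.0
    fencing_token: int = 0           # bumped on every successful lease


class JobStore:
    """In-memory stand-in for a durable table (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self.jobs: Dict[str, Job] = {}

    def add(self, job: Job) -> None:
        self.jobs[job.job_id] = job

    def lease(self, job_id: str, worker: str, now: float, ttl: float) -> Optional[int]:
        """Claim a due job whose lease is free or expired; return a fencing token."""
        job = self.jobs[job_id]
        if job.state == "done" or job.due_time > now:
            return None
        if job.state == "leased" and job.lease_expires_at > now:
            return None  # another worker still holds a live lease
        job.state = "leased"
        job.lease_owner = worker
        job.lease_expires_at = now + ttl
        job.attempts += 1
        job.fencing_token += 1
        return job.fencing_token

    def complete(self, job_id: str, token: int) -> bool:
        """Accept completion only from the holder of the latest token."""
        job = self.jobs[job_id]
        if job.state != "leased" or token != job.fencing_token:
            return False  # stale worker whose lease was taken over
        job.state = "done"
        return True
```

If worker A leases a job, stalls past its lease expiry, and worker B then leases the same job, A's later `complete` call carries an older token and is rejected. That rejection is the purpose of the fencing token.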
Evaluation Rubric
Clear job model & idempotency: 25% weight
Safe partition/lease ownership: 25% weight
Durable storage & retries/DLQs: 25% weight
Operational visibility & controls: 25% weight
Hints
  • 💡Use time wheels or priority queues for timers.
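As a rough illustration of the priority-queue option, the sketch below keeps pending jobs in a min-heap ordered by due time. The `TimerQueue` class and its method names are assumptions made for this example; it covers only the in-memory timer, not persistence or distribution.

```python
import heapq
import itertools
from typing import List, Optional, Tuple


class TimerQueue:
    """Min-heap of (due_time, seq, job_id); seq breaks ties in insertion order."""

    def __init__(self) -> None:
        self._heap: List[Tuple[float, int, str]] = []
        self._seq = itertools.count()

    def schedule(self, job_id: str, due_time: float) -> None:
        heapq.heappush(self._heap, (due_time, next(self._seq), job_id))

    def pop_due(self, now: float) -> List[str]:
        """Remove and return every job with due_time <= now, earliest first."""
        due: List[str] = []
        while self._heap and self._heap[0][0] <= now:
            _, _, job_id = heapq.heappop(self._heap)
            due.append(job_id)
        return due

    def next_due(self) -> Optional[float]:
        """Earliest pending due time, so the caller knows how long it can sleep."""
        return self._heap[0][0] if self._heap else None
```

Heap insertion and removal cost O(log n) each. A timing wheel trades that for roughly constant-time insertion at the price of a fixed tick granularity, which is one way to frame the choice in an interview.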
Common Pitfalls to Avoid
  • ⚠️**Lack of Idempotency:** Failing to design for idempotency can lead to data corruption or inconsistent system states when jobs are retried.
  • ⚠️**Single Point of Failure:** Not implementing distributed coordination or leader election for job dispatch can make the scheduler vulnerable to single-node failures.
  • ⚠️**Job Loss During Failures:** Without visibility timeouts or durable queues, a job held by a crashed worker may never be retried and is effectively lost.
  • ⚠️**Clock Skew:** Comparing an absolute `due_time` against each node's local clock can fire jobs early or late when the clocks disagree; relative timers or an event-driven approach may be more robust.
  • ⚠️**Infinite Retries Without Backoff:** Aggressively retrying failed jobs without exponential backoff can overload downstream services and the scheduler itself.
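The last pitfall can be made concrete with a small retry-policy sketch: capped exponential backoff with full jitter, plus a bounded attempt budget that ends in a dead-letter queue. The function names, default values, and the `"retry"` / `"dead_letter"` labels are illustrative assumptions, not a prescribed policy.

```python
import random
from typing import Optional


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 300.0,
                  rng: Optional[random.Random] = None) -> float:
    """Full-jitter backoff: uniform delay in [0, min(cap, base * 2**attempt)]."""
    rng = rng or random.Random()
    exponent = min(attempt, 32)          # bound the exponent; the cap dominates anyway
    ceiling = min(cap, base * (2 ** exponent))
    return rng.uniform(0.0, ceiling)


def next_action(attempts_made: int, max_attempts: int) -> str:
    """Retry while the attempt budget remains, then route to the dead-letter queue."""
    return "retry" if attempts_made < max_attempts else "dead_letter"
```

The jitter spreads out retries from many jobs that failed at the same moment, so they do not all hit the downstream service again in the same instant. The attempt budget keeps a permanently failing job from retrying forever.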
Potential Follow-up Questions
  • How to avoid clock skew issues?
  • How to guarantee exactly-once for certain jobs?
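For the exactly-once question, one commonly discussed direction is to keep at-least-once delivery and make the effect idempotent, so that duplicate deliveries become harmless. The sketch below is a hypothetical in-memory version keyed on the job's idempotency key; `IdempotentExecutor` is an invented name. In a real system the key record and the side effect would need to commit atomically, which this example does not attempt.

```python
from typing import Callable, Dict


class IdempotentExecutor:
    """Runs an effect at most once per idempotency key (in-memory stand-in)."""

    def __init__(self) -> None:
        self._results: Dict[str, str] = {}

    def run(self, idempotency_key: str, effect: Callable[[], str]) -> str:
        if idempotency_key in self._results:
            # Duplicate delivery: return the recorded result, skip the effect.
            return self._results[idempotency_key]
        result = effect()
        self._results[idempotency_key] = result
        return result
```

A second `run` call with the same key returns the stored result without invoking `effect` again. This gives effectively-once behavior from the caller's point of view, not true exactly-once delivery.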