Design a Distributed Job Scheduler

Interview Question

Design a reliable, horizontally scalable scheduler (distributed cron) that supports one-off and recurring jobs with retries and idempotency.

Key Points to Cover

Evaluation Rubric

Clear job model & idempotency (25% weight)

Safe partition/lease ownership (25% weight)

Durable storage & retries/DLQs (25% weight)

Operational visibility & controls (25% weight)

Hints

Common Pitfalls to Avoid

⚠️ **Lack of Idempotency:** If job handlers are not idempotent, a retried job can apply its side effects twice, corrupting data or leaving the system in an inconsistent state.
⚠️ **Single Point of Failure:** Dispatching jobs from one node, with no distributed coordination or leader election, means a single node failure can halt the scheduler.
⚠️ **Job Loss During Failures:** Without visibility timeouts or durable queues, a job claimed by a worker that then crashes is never redelivered and is lost permanently.
⚠️ **Clock Skew:** Clocks on different nodes drift apart, so comparing an absolute `due_time` against each node's local clock can fire jobs early or late; relative timers or an event-driven approach may be more robust.
⚠️ **Infinite Retries Without Backoff:** Retrying failed jobs immediately and without limit, rather than with exponential backoff, can overload downstream services and the scheduler itself.
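
Two of the pitfalls above, missing idempotency and retries without backoff, can be illustrated together. The following is a minimal sketch, not a reference design: the names `Scheduler`, `execute`, `backoff_delay`, and `always_fails`, and the constants `BASE_DELAY_S`, `MAX_DELAY_S`, and `MAX_ATTEMPTS`, are all hypothetical, and the in-memory set and list stand in for what would need to be durable storage and a real dead-letter queue.

```python
import random
from dataclasses import dataclass, field

# Hypothetical defaults chosen for illustration only.
BASE_DELAY_S = 1.0
MAX_DELAY_S = 300.0
MAX_ATTEMPTS = 5


def backoff_delay(attempt, rng=random):
    """Exponential backoff with full jitter.

    Returns a delay drawn uniformly from [0, min(cap, base * 2**attempt)].
    """
    ceiling = min(MAX_DELAY_S, BASE_DELAY_S * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


@dataclass
class Scheduler:
    # Assumption: a real system would keep both of these in durable storage.
    completed: set = field(default_factory=set)      # idempotency keys already applied
    dead_letter: list = field(default_factory=list)  # keys that exhausted their retries

    def execute(self, key, handler, attempt):
        """Run one attempt of a job.

        Returns (status, delay_seconds), where status is one of
        "skipped", "done", "retry", or "dead".
        """
        # Idempotency check: a key that already succeeded is not run again.
        if key in self.completed:
            return ("skipped", None)
        try:
            handler()
        except Exception:
            next_attempt = attempt + 1
            if next_attempt >= MAX_ATTEMPTS:
                # Stop retrying and park the job for manual inspection.
                self.dead_letter.append(key)
                return ("dead", None)
            return ("retry", backoff_delay(next_attempt))
        self.completed.add(key)
        return ("done", None)


def always_fails():
    """Hypothetical handler that simulates a failing downstream call."""
    raise RuntimeError("downstream unavailable")
```

A caller would reschedule the job after the returned delay when the status is `"retry"`. Deriving the key from the job ID plus the scheduled fire time (for example `"report:2024-01-01T00:00"`) is one way to make each occurrence of a recurring job deduplicate independently. Note that this sketch records success only after the handler returns, so a crash between the two steps would still cause a re-run; handlers need to tolerate that.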

Potential Follow-up Questions