Intermediate | System-Design
30 min
Design a Distributed Job Scheduler
Scheduling, Reliability, Databases, Messaging
Interview Question
Design a reliable, horizontally scalable scheduler (distributed cron) that supports one-off and recurring jobs with retries and idempotency.
Key Points to Cover
- Job model: due time, retry policy, idempotency keys
- Leader election or partitioned ownership; fencing tokens
- Durable storage with visibility timeouts and leases
- At-least-once execution with dedupe and DLQs
- Observability: job states, SLAs, replays, and pause/resume
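As a minimal sketch of how the job model, leases, and fencing tokens in the points above could fit together, the following uses an in-memory dictionary in place of durable storage. The names `Job`, `JobStore`, `claim`, and `complete` are illustrative assumptions, not part of the question; a real store would need `claim` and `complete` to be atomic conditional updates.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Job:
    job_id: str
    due_time: float               # epoch seconds when the job becomes runnable
    idempotency_key: str          # lets the handler dedupe repeated deliveries
    max_attempts: int = 5         # retry policy: attempts before dead-lettering
    attempts: int = 0
    state: str = "PENDING"        # PENDING -> RUNNING -> DONE
    lease_owner: Optional[str] = None
    lease_expires_at: float = 0.0
    fencing_token: int = 0        # incremented on every successful claim


class JobStore:
    """In-memory stand-in (assumption) for a durable job table."""

    def __init__(self) -> None:
        self.jobs: Dict[str, Job] = {}

    def add(self, job: Job) -> None:
        self.jobs[job.job_id] = job

    def claim(self, worker: str, now: float, lease_seconds: float) -> Optional[Job]:
        """Lease the earliest due job whose lease is free or has expired."""
        for job in sorted(self.jobs.values(), key=lambda j: j.due_time):
            lease_free = job.state == "PENDING" or (
                job.state == "RUNNING" and job.lease_expires_at <= now
            )
            if lease_free and job.due_time <= now:
                job.state = "RUNNING"
                job.lease_owner = worker
                job.lease_expires_at = now + lease_seconds
                job.fencing_token += 1
                job.attempts += 1
                return job
        return None

    def complete(self, job_id: str, token: int) -> bool:
        """Accept completion only from the holder of the latest token."""
        job = self.jobs[job_id]
        if job.state != "RUNNING" or token != job.fencing_token:
            return False  # stale worker: its lease was taken over
        job.state = "DONE"
        job.lease_owner = None
        return True
```

In this sketch a worker that stalls past its lease can still finish its work, but its completion is rejected because another worker has since claimed the job with a higher token. That is the property fencing tokens are meant to provide.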
Evaluation Rubric
Clear job model & idempotency: 25% weight
Safe partition/lease ownership: 25% weight
Durable storage & retries/DLQs: 25% weight
Operational visibility & controls: 25% weight
Hints
- 💡Use time wheels or priority queues for timers.
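The priority-queue option in the hint can be sketched with a binary heap keyed on due time. `TimerQueue` and its method names are assumptions made for illustration; a time wheel would replace the heap with fixed-width time buckets.

```python
import heapq
import itertools
from typing import List, Optional, Tuple


class TimerQueue:
    """Min-heap of (due_time, sequence, job_id); earliest due job first."""

    def __init__(self) -> None:
        self._heap: List[Tuple[float, int, str]] = []
        self._seq = itertools.count()  # tie-breaker keeps insertion order

    def schedule(self, due_time: float, job_id: str) -> None:
        heapq.heappush(self._heap, (due_time, next(self._seq), job_id))

    def pop_due(self, now: float) -> List[str]:
        """Remove and return every job whose due time has passed."""
        due = []
        while self._heap and self._heap[0][0] <= now:
            _, _, job_id = heapq.heappop(self._heap)
            due.append(job_id)
        return due

    def next_wakeup(self, now: float) -> Optional[float]:
        """Seconds until the next job is due, or None if the queue is empty."""
        if not self._heap:
            return None
        return max(0.0, self._heap[0][0] - now)
```

A dispatcher loop could sleep for `next_wakeup(now)` seconds and then hand the result of `pop_due(now)` to workers, rather than polling on a fixed interval.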
Common Pitfalls to Avoid
- ⚠️**Lack of Idempotency:** Failing to design for idempotency can lead to data corruption or inconsistent system states when jobs are retried.
- ⚠️**Single Point of Failure:** Not implementing distributed coordination or leader election for job dispatch can make the scheduler vulnerable to single-node failures.
- ⚠️**Job Loss During Failures:** If a worker crashes mid-job and there is no visibility timeout or durable queue, the job is never redelivered and is effectively lost.
- ⚠️**Clock Skew:** Comparing `due_time` against each node's local wall clock is fragile because clocks drift between machines; evaluating due times against a single time source, or using relative timers or an event-driven approach, is usually more robust.
- ⚠️**Infinite Retries Without Backoff:** Retrying failed jobs immediately and without limit can overload downstream services and the scheduler itself; use exponential backoff and cap the number of attempts.
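For the retry pitfall, one common approach is capped exponential backoff with random jitter, plus a dead-letter decision once the attempt budget is spent. The sketch below assumes that approach; the function names, the `base` and `cap` defaults, and the `"DEAD_LETTER"` and `"RETRY"` labels are all hypothetical.

```python
import random
from typing import Optional, Tuple


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 300.0,
                  rng=random) -> float:
    """Delay before the next retry, drawn uniformly from
    [0, min(cap, base * 2**(attempt - 1))] ("full jitter")."""
    exponent = min(max(attempt - 1, 0), 32)  # clamp to avoid huge powers
    ceiling = min(cap, base * (2 ** exponent))
    return rng.uniform(0.0, ceiling)


def next_action(attempts: int, max_attempts: int, now: float,
                rng=random) -> Tuple[str, Optional[float]]:
    """Decide what to do after a failed attempt.

    Returns ("DEAD_LETTER", None) once the attempt budget is used up,
    otherwise ("RETRY", time at which the job becomes due again).
    """
    if attempts >= max_attempts:
        return ("DEAD_LETTER", None)
    return ("RETRY", now + backoff_delay(attempts, rng=rng))
```

The jitter spreads retries out so that many jobs failing at the same moment do not all come back at the same moment, and the cap bounds how long any single job waits.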
Potential Follow-up Questions
- ❓How would you avoid clock skew issues?
- ❓How would you guarantee exactly-once execution for certain jobs?