Advanced · System Design
45 min
Design a Real-Time Analytics Pipeline
Streaming Data · Messaging · OLAP
Interview Question
Design a pipeline to ingest, process, and query billions of events per day with second-level latency and exactly-once semantics.
Key Points to Cover
- Ingestion: Kafka/Kinesis with partitions sized for throughput (see the sizing sketch after this list)
- Processing: Flink/Spark Structured Streaming with checkpoints (see the streaming sketch after this list)
- Storage: hot (ClickHouse/Druid/BigQuery) + cold (S3 + Parquet)
- Semantics: idempotency/outbox, watermarking, late data handling
- Serving: pre-aggregations, rollups, multi-tenant isolation
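A quick back-of-envelope calculation makes the partition-sizing point concrete. Every number below (5B events/day, 1 KB payloads, a 3x peak factor, a 10 MB/s per-partition write budget) is an illustrative assumption, not a figure from the prompt:

```python
import math

# Back-of-envelope Kafka partition sizing; all constants are assumptions.
events_per_day = 5_000_000_000     # "billions per day", assumed 5B
avg_event_bytes = 1_000            # assumed average payload size
peak_factor = 3                    # assumed peak-to-average traffic ratio

avg_rate = events_per_day / 86_400                               # ~57,870 events/s
peak_mb_per_s = avg_rate * peak_factor * avg_event_bytes / 1e6   # ~174 MB/s

per_partition_mb_per_s = 10        # conservative per-partition write budget
partitions = math.ceil(peak_mb_per_s / per_partition_mb_per_s)   # 18

print(f"~{avg_rate:,.0f} events/s avg, ~{peak_mb_per_s:.0f} MB/s peak, "
      f"{partitions} partitions before headroom")
```

Provision extra headroom (say 2x): adding Kafka partitions later changes the key-to-partition mapping, which disrupts per-key ordering for existing keys.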
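The processing and semantics bullets combine into a small job. Below is a minimal Spark Structured Streaming sketch showing the fault-tolerance levers named above: a checkpoint location, an event-time watermark, and windowed aggregation. The broker address, topic name, and event schema are placeholders, and the console sink stands in for an idempotent production sink:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, window
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

# Requires the spark-sql-kafka connector on the classpath.
spark = SparkSession.builder.appName("events-pipeline").getOrCreate()

# Placeholder event schema; a real pipeline would pull this from a registry.
schema = StructType([
    StructField("event_id", StringType()),
    StructField("tenant_id", StringType()),
    StructField("event_time", TimestampType()),
])

events = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder address
    .option("subscribe", "events")                     # placeholder topic
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("e"))
    .select("e.*"))

counts = (events
    # The watermark bounds state size and sets how long to wait for late data.
    .withWatermark("event_time", "10 minutes")
    .groupBy(window(col("event_time"), "1 minute"), col("tenant_id"))
    .count())

query = (counts.writeStream
    .outputMode("update")
    .format("console")  # swap for an idempotent hot-store sink in production
    # Checkpointing lets the job resume from the last committed offsets.
    .option("checkpointLocation", "s3://bucket/checkpoints/events")
    .start())
query.awaitTermination()
```

Note that exactly-once end to end still depends on the sink being idempotent or transactional; a replayable source plus checkpoints only guarantees that no input is skipped.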
Evaluation Rubric
- High-throughput ingestion design: 25% weight
- Fault-tolerant streaming processing: 25% weight
- Hot/cold storage and rollups: 25% weight
- Low-latency query serving and isolation: 25% weight
Hints
- 💡Plan for backfills and schema evolution.
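One lightweight way to act on this hint, assuming JSON payloads rather than a schema registry, is a tolerant reader: fields added in later schema versions get defaults, so backfilled history and live traffic parse through the same code path. The field names here are hypothetical:

```python
import json

# Defaults for fields added in later schema versions (hypothetical names),
# so v1 events replayed during a backfill and v2 live events share one path.
DEFAULTS = {"schema_version": 1, "region": "unknown"}

def parse_event(raw: bytes) -> dict:
    # Parsed fields override the defaults, so newer fields win when present.
    return {**DEFAULTS, **json.loads(raw)}

# An old event missing "region" still parses cleanly.
print(parse_event(b'{"event_id": "e1", "schema_version": 1}'))
```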
Common Pitfalls to Avoid
- ⚠️**Underestimating Throughput Requirements:** Incorrectly sizing Kafka partitions or Kinesis shards leading to backpressure and increased latency.
- ⚠️**Failing to Configure Checkpointing Correctly:** Insufficiently frequent or improperly configured checkpoints in Flink/Spark leading to potential data loss or duplication.
- ⚠️**Ignoring Idempotency in Sink Operations:** Processed data being written to the hot store multiple times due to retries without proper idempotency checks (see the dedup sketch after this list).
- ⚠️**Complex Orchestration and Monitoring:** Running many disparate components without integrated monitoring and alerting, making it hard to diagnose issues quickly.
- ⚠️**Assuming Default Settings are Sufficient:** Not tuning Kafka brokers, Flink/Spark configurations, or database parameters for the extreme scale and low-latency requirements.
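The idempotency pitfall above is typically closed at the sink. Here is a minimal dedup sketch using confluent-kafka with manual offset commits; the broker, topic, group ID, and write_to_hot_store are placeholders, messages are assumed to be keyed by event ID, and the in-memory set stands in for a durable keyed store:

```python
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "broker:9092",  # placeholder address
    "group.id": "hot-store-writer",      # placeholder group
    "enable.auto.commit": False,         # commit only after a durable write
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["enriched-events"])  # placeholder topic

seen_ids = set()  # stand-in for a durable keyed store (RocksDB/Redis) with TTL

def write_to_hot_store(payload: bytes) -> None:
    ...  # hypothetical sink call, e.g. an upsert keyed on event_id

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    event_id = msg.key().decode()        # assumes the producer keys by event ID
    if event_id not in seen_ids:         # drop replays caused by retries/restarts
        write_to_hot_store(msg.value())
        seen_ids.add(event_id)
    # At-least-once delivery plus an idempotent write is effectively exactly-once.
    consumer.commit(message=msg, asynchronous=False)
```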
Potential Follow-up Questions
- ❓How would you handle out-of-order events?
- ❓How would you guarantee exactly-once semantics end to end?