Advanced · System Design
45 min
Design a Real-Time Analytics Pipeline
Streaming Data · Messaging · OLAP
Interview Question
Design a pipeline to ingest, process, and query billions of events per day with second-level latency and exactly-once semantics.
Key Points to Cover
- Ingestion: Kafka/Kinesis with partitions sized for throughput (see the sizing sketch after this list)
- Processing: Flink/Spark Structured Streaming with checkpoints (see the streaming sketch after this list)
- Storage: hot (ClickHouse/Druid/BigQuery) + cold (S3 + Parquet)
- Semantics: idempotency/outbox, watermarking, late data handling
- Serving: pre-aggregations, rollups, multi-tenant isolation
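A quick back-of-envelope calculation makes the partition-sizing point concrete. Every number below (5B events/day, 1 KB payloads, a 3x peak factor, a 10 MB/s per-partition write budget) is an illustrative assumption, not a figure from the prompt:

```python
import math

# Back-of-envelope Kafka partition sizing; all constants are assumptions.
events_per_day = 5_000_000_000     # "billions per day", assumed 5B
avg_event_bytes = 1_000            # assumed average payload size
peak_factor = 3                    # assumed peak-to-average traffic ratio

avg_rate = events_per_day / 86_400                               # ~57,870 events/s
peak_mb_per_s = avg_rate * peak_factor * avg_event_bytes / 1e6   # ~174 MB/s

per_partition_mb_per_s = 10        # conservative per-partition write budget
partitions = math.ceil(peak_mb_per_s / per_partition_mb_per_s)   # 18

print(f"~{avg_rate:,.0f} events/s avg, ~{peak_mb_per_s:.0f} MB/s peak, "
      f"{partitions} partitions before headroom")
```

Provision extra headroom (say 2x): adding Kafka partitions later changes the key-to-partition mapping, which disrupts per-key ordering for existing keys.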
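The processing and semantics bullets combine into a small job. Below is a minimal Spark Structured Streaming sketch showing the fault-tolerance levers named above: a checkpoint location, an event-time watermark, and windowed aggregation. The broker address, topic name, and event schema are placeholders, and the console sink stands in for an idempotent production sink:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, window
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

# Requires the spark-sql-kafka connector on the classpath.
spark = SparkSession.builder.appName("events-pipeline").getOrCreate()

# Placeholder event schema; a real pipeline would pull this from a registry.
schema = StructType([
    StructField("event_id", StringType()),
    StructField("tenant_id", StringType()),
    StructField("event_time", TimestampType()),
])

events = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder address
    .option("subscribe", "events")                     # placeholder topic
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("e"))
    .select("e.*"))

counts = (events
    # The watermark bounds state size and sets how long to wait for late data.
    .withWatermark("event_time", "10 minutes")
    .groupBy(window(col("event_time"), "1 minute"), col("tenant_id"))
    .count())

query = (counts.writeStream
    .outputMode("update")
    .format("console")  # swap for an idempotent hot-store sink in production
    # Checkpointing lets the job resume from the last committed offsets.
    .option("checkpointLocation", "s3://bucket/checkpoints/events")
    .start())
query.awaitTermination()
```

Note that exactly-once end to end still depends on the sink being idempotent or transactional; a replayable source plus checkpoints only guarantees that no input is skipped.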
Evaluation Rubric
- High-throughput ingestion design: 25% weight
- Fault-tolerant streaming processing: 25% weight
- Hot/cold storage and rollups: 25% weight
- Low-latency query serving and isolation: 25% weight
Hints
- 💡Plan for backfills and schema evolution.
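One lightweight way to act on this hint, assuming JSON payloads rather than a schema registry, is a tolerant reader: fields added in later schema versions get defaults, so backfilled history and live traffic parse through the same code path. The field names here are hypothetical:

```python
import json

# Defaults for fields added in later schema versions (hypothetical names),
# so v1 events replayed during a backfill and v2 live events share one path.
DEFAULTS = {"schema_version": 1, "region": "unknown"}

def parse_event(raw: bytes) -> dict:
    # Parsed fields override the defaults, so newer fields win when present.
    return {**DEFAULTS, **json.loads(raw)}

# An old event missing "region" still parses cleanly.
print(parse_event(b'{"event_id": "e1", "schema_version": 1}'))
```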
Common Pitfalls to Avoid
- ⚠️**Underestimating Throughput Requirements:** Incorrectly sizing Kafka partitions or Kinesis shards leading to backpressure and increased latency.
- ⚠️**Failing to Configure Checkpointing Correctly:** Insufficiently frequent or improperly configured checkpoints in Flink/Spark leading to potential data loss or duplication.
- ⚠️**Ignoring Idempotency in Sink Operations:** Processed data being written to the hot store multiple times due to retries without proper idempotency checks (see the dedup sketch after this list).
- ⚠️**Complex Orchestration and Monitoring:** Running many disparate components without integrated monitoring and alerting, making it hard to diagnose issues quickly.
- ⚠️**Assuming Default Settings are Sufficient:** Not tuning Kafka brokers, Flink/Spark configurations, or database parameters for the extreme scale and low-latency requirements.
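The idempotency pitfall above is typically closed at the sink. Here is a minimal dedup sketch using confluent-kafka with manual offset commits; the broker, topic, group ID, and write_to_hot_store are placeholders, messages are assumed to be keyed by event ID, and the in-memory set stands in for a durable keyed store:

```python
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "broker:9092",  # placeholder address
    "group.id": "hot-store-writer",      # placeholder group
    "enable.auto.commit": False,         # commit only after a durable write
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["enriched-events"])  # placeholder topic

seen_ids = set()  # stand-in for a durable keyed store (RocksDB/Redis) with TTL

def write_to_hot_store(payload: bytes) -> None:
    ...  # hypothetical sink call, e.g. an upsert keyed on event_id

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    event_id = msg.key().decode()        # assumes the producer keys by event ID
    if event_id not in seen_ids:         # drop replays caused by retries/restarts
        write_to_hot_store(msg.value())
        seen_ids.add(event_id)
    # At-least-once delivery plus an idempotent write is effectively exactly-once.
    consumer.commit(message=msg, asynchronous=False)
```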
Potential Follow-up Questions
- ❓How would you handle out-of-order events?
- ❓How would you guarantee exactly-once semantics end to end?