Advanced · System Design · 45 min

Design a Real-Time Analytics Pipeline

Streaming · Data · Messaging · OLAP
Interview Question

Design a pipeline to ingest, process, and query billions of events per day with second-level latency and exactly-once semantics.

Key Points to Cover
  • Ingestion: Kafka/Kinesis with partitions sized for throughput (see the producer sketch below)
  • Processing: Flink/Spark Structured Streaming with checkpoints (see the streaming sketch below)
  • Storage: hot (ClickHouse/Druid/BigQuery) + cold (S3 + Parquet) (see the rollup sketch below)
  • Semantics: idempotency/outbox, watermarking, late data handling (watermarking and idempotent sinks are sketched below)
  • Serving: pre-aggregations, rollups, multi-tenant isolation (see the serving sketch below)
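
For the ingestion point, a minimal producer sketch using the confluent-kafka client: enabling idempotence with acks=all protects against broker-side duplicates on retry, and keying by user keeps per-user ordering within a partition. The broker address, topic name (`events.raw`), and the `user_id` key field are illustrative assumptions; partition count itself is a capacity decision (target MB/s per partition) made when the topic is created.

```python
# A minimal ingestion sketch with confluent-kafka; names are assumptions.
import json
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "localhost:9092",   # assumed broker address
    "enable.idempotence": True,              # broker dedups retried batches
    "acks": "all",                           # wait for the full ISR
    "linger.ms": 20,                         # small batching delay for throughput
    "compression.type": "lz4",
})

def on_delivery(err, msg):
    if err is not None:
        # Surface delivery failures; in production route to a dead-letter path.
        print(f"delivery failed: {err}")

def ingest(event: dict) -> None:
    # Keying by user_id keeps one user's events ordered within a partition.
    producer.produce(
        topic="events.raw",                  # assumed topic name
        key=str(event["user_id"]).encode(),
        value=json.dumps(event).encode(),
        on_delivery=on_delivery,
    )
    producer.poll(0)  # serve delivery callbacks without blocking

if __name__ == "__main__":
    ingest({"user_id": 42, "type": "click", "event_time": "2024-05-01T12:00:00Z"})
    producer.flush()
```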
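
For the processing point, a streaming sketch with Spark Structured Streaming: the checkpoint location makes the windowed aggregation restartable from committed offsets and state, and the watermark bounds state while still accepting events up to ten minutes late. Broker, topic, schema, and the S3 checkpoint path are assumptions, and the console sink stands in for a real hot-store sink.

```python
# A minimal streaming-processing sketch (requires the spark-sql-kafka connector).
# Broker, topic, schema, and checkpoint path are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("events-agg").getOrCreate()

schema = StructType([
    StructField("user_id", LongType()),
    StructField("type", StringType()),
    StructField("event_time", StringType()),  # ISO-8601 string, cast below
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", "events.raw")
       .load())

events = (raw
          .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
          .select("e.*")
          .withColumn("event_time", F.to_timestamp("event_time")))

# The watermark bounds state size and defines how long late events are accepted;
# checkpointing lets the query restart without losing or recomputing state.
counts = (events
          .withWatermark("event_time", "10 minutes")
          .groupBy(F.window("event_time", "1 minute"), "type")
          .count())

query = (counts.writeStream
         .outputMode("update")
         .format("console")  # stand-in for a real sink such as the hot store
         .option("checkpointLocation", "s3a://analytics/checkpoints/events-agg")
         .start())
query.awaitTermination()
```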
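
For the cold-storage side, a rollup sketch: a batch job compacts raw Parquet events on S3 into hourly pre-aggregates that the hot store or serving layer can load cheaply. Bucket layout and column names (`tenant_id`, `event_time`, `user_id`) are assumptions.

```python
# A rollup sketch: raw Parquet on the cold path -> hourly pre-aggregates.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("hourly-rollup").getOrCreate()

raw = spark.read.parquet("s3a://analytics/cold/events/dt=2024-05-01/")  # assumed layout

rollup = (raw
          .withColumn("hour", F.date_trunc("hour", F.col("event_time")))
          .groupBy("hour", "tenant_id", "type")
          .agg(F.count("*").alias("events"),
               F.approx_count_distinct("user_id").alias("unique_users")))

# Overwriting a date-partitioned output path keeps backfills idempotent:
# re-running the job for a day replaces that day's rollups instead of appending.
(rollup.write
 .mode("overwrite")
 .parquet("s3a://analytics/rollups/hourly/dt=2024-05-01/"))
```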
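
For the serving point, a sketch under the assumption that the hourly rollups are loaded into ClickHouse and queried through the clickhouse-driver package: dashboards hit a pre-aggregated table rather than raw events, and ordering the table by tenant keeps per-tenant scans narrow, which is one building block of multi-tenant isolation (quotas and resource limits would complete it). Host, table, and column names are assumptions.

```python
# A serving-layer sketch, assuming rollups land in ClickHouse; names are assumptions.
from clickhouse_driver import Client

client = Client(host="localhost")  # assumed ClickHouse host

# SummingMergeTree collapses rows sharing the ORDER BY key by summing numeric
# columns; distinct counts would instead need AggregatingMergeTree with
# uniqState/uniqMerge. Ordering by tenant_id first keeps per-tenant scans narrow.
client.execute("""
    CREATE TABLE IF NOT EXISTS events_hourly (
        hour      DateTime,
        tenant_id UInt64,
        type      String,
        events    UInt64
    ) ENGINE = SummingMergeTree()
    PARTITION BY toYYYYMMDD(hour)
    ORDER BY (tenant_id, type, hour)
""")

# A dashboard query reads the rollup, not raw events, so latency stays low
# even at billions of raw events per day.
rows = client.execute(
    """
    SELECT hour, type, sum(events) AS events
    FROM events_hourly
    WHERE tenant_id = %(tenant)s
      AND hour >= now() - INTERVAL 1 DAY
    GROUP BY hour, type
    ORDER BY hour
    """,
    {"tenant": 42},
)
print(rows)
```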
Evaluation Rubric
  • High-throughput ingestion design: 25% weight
  • Fault-tolerant streaming processing: 25% weight
  • Hot/cold storage and rollups: 25% weight
  • Low-latency query serving and isolation: 25% weight
Hints
  • 💡 Plan for backfills and schema evolution.
Common Pitfalls to Avoid
  • ⚠️ **Underestimating Throughput Requirements:** Undersizing Kafka partitions or Kinesis shards, leading to backpressure and rising end-to-end latency.
  • ⚠️ **Misconfiguring Checkpointing:** Checkpoints in Flink/Spark that are too infrequent or incorrectly configured, leading to long recovery times and data loss or duplication on restart.
  • ⚠️ **Ignoring Idempotency in Sink Operations:** Retries writing the same processed data to the hot store more than once because the sink has no idempotency key or dedup step (see the sketch after this list).
  • ⚠️ **Complex Orchestration Without Integrated Monitoring:** Many moving parts but no unified monitoring and alerting across them, making issues hard to diagnose quickly.
  • ⚠️ **Assuming Default Settings Are Sufficient:** Not tuning Kafka brokers, Flink/Spark jobs, or database parameters for the required scale and latency.
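
One way to avoid the idempotent-sink pitfall, sketched under two assumptions: events carry a unique `event_id` column, and the `events` streaming DataFrame from the processing sketch above is in scope. Duplicates are dropped within the watermark, and each micro-batch is written to a path derived from its epoch id, so a retried batch overwrites its own output instead of appending duplicates.

```python
# Idempotent hot-store write sketch; reuses `events` from the streaming sketch
# above and assumes an event_id column. Paths are illustrative assumptions.
from pyspark.sql import DataFrame

def write_batch(batch_df: DataFrame, epoch_id: int) -> None:
    # Deterministic output location per epoch: a retried micro-batch replaces
    # its own previous (possibly partial) output rather than appending.
    (batch_df.write
     .mode("overwrite")
     .parquet(f"s3a://analytics/hot-staging/epoch={epoch_id}/"))

deduped = (events
           .withWatermark("event_time", "10 minutes")
           # Including event_time in the key lets Spark expire dedup state.
           .dropDuplicates(["event_id", "event_time"]))

query = (deduped.writeStream
         .foreachBatch(write_batch)
         .option("checkpointLocation", "s3a://analytics/checkpoints/hot-sink")
         .start())
```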
Potential Follow-up Questions
  • How to handle out-of-order events?
  • How to guarantee exactly-once? (see the transactional sketch after this list)
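
Exactly-once has several layers (idempotent producers, transactional offset commits, idempotent sinks). Below is a minimal sketch of the Kafka-side transactional consume-transform-produce loop using confluent-kafka; broker address, topics, group id, and transactional id are illustrative assumptions, and each real worker needs its own stable transactional id.

```python
# Transactional consume-transform-produce sketch; names are assumptions.
from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "enrich-v1",
    "enable.auto.commit": False,          # offsets commit inside the transaction
    "isolation.level": "read_committed",  # skip data from aborted transactions
})
producer = Producer({
    "bootstrap.servers": "localhost:9092",
    "transactional.id": "enrich-v1-worker-0",
})

consumer.subscribe(["events.raw"])
producer.init_transactions()

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    producer.begin_transaction()
    try:
        # Transform and forward; the output record and the consumed offset
        # become visible atomically when the transaction commits.
        producer.produce("events.enriched", key=msg.key(), value=msg.value())
        producer.send_offsets_to_transaction(
            consumer.position(consumer.assignment()),
            consumer.consumer_group_metadata(),
        )
        producer.commit_transaction()
    except Exception:
        producer.abort_transaction()
```

End-to-end exactly-once still requires an idempotent or transactional sink on the hot-store side, as noted in the pitfalls above; handling out-of-order events is covered by the watermark in the streaming sketch earlier.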