Advanced · Technical
5 min
Scalable ETL Pipeline Design
ETL · Data Engineering · Scalability
Interview Question
How would you design a scalable ETL pipeline for processing terabytes of data daily with low latency?
Key Points to Cover
- Design data ingestion with Kafka or Kinesis (ingestion sketch below)
- Apply a distributed processing framework such as Spark or Flink (processing sketch below)
- Handle schema evolution and validate incoming data (validation sketch below)
- Optimize storage formats (Parquet, ORC) and partitioning (storage sketch below)
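The sketches below walk through these points in order. First, ingestion: a minimal producer using the kafka-python client. The broker address, topic name, and message fields are illustrative assumptions, not values from the question.

```python
# Ingestion sketch with kafka-python. Broker ("localhost:9092") and
# topic ("events") are assumptions for illustration only.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    acks="all",              # wait for full ISR acknowledgment for durability
    linger_ms=50,            # small batching window to raise throughput
    compression_type="gzip", # trade CPU for network and storage savings
)

producer.send("events", {"user_id": 42, "action": "click"})
producer.flush()
```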
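Next, distributed processing: a PySpark Structured Streaming job that consumes the same (assumed) topic, parses JSON against an explicit schema, and maintains a running aggregate. It presumes the spark-sql-kafka connector is on the classpath, and the checkpoint path is illustrative.

```python
# Distributed stream-processing sketch with PySpark Structured Streaming.
# Broker, topic, and checkpoint path are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import LongType, StringType, StructField, StructType

spark = SparkSession.builder.appName("etl-stream").getOrCreate()

schema = StructType([
    StructField("user_id", LongType()),
    StructField("action", StringType()),
])

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "events")
    .load()
    # Kafka delivers raw bytes; decode and parse against the explicit schema.
    .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
    .select("e.*")
)

# Running count per action type; the checkpoint makes the query restartable.
query = (
    events.groupBy("action").count()
    .writeStream.outputMode("complete")
    .format("console")
    .option("checkpointLocation", "/tmp/etl-checkpoint")
    .start()
)
```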
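For validation and schema evolution, one common pattern is to parse permissively and quarantine records that fail, rather than failing the whole job. This sketch assumes a DataFrame raw_df with a string column raw holding raw JSON, and reuses the spark session and schema defined above; the dead-letter and events paths are hypothetical.

```python
# Validation sketch: from_json returns null for rows that do not match
# the schema, so malformed records can be split off instead of crashing.
from pyspark.sql import functions as F

parsed = raw_df.withColumn("e", F.from_json(F.col("raw"), schema))
valid = parsed.where(F.col("e").isNotNull()).select("e.*")
rejects = parsed.where(F.col("e").isNull()).select("raw")

# Quarantine malformed rows in a dead-letter location for later inspection.
rejects.write.mode("append").json("s3a://my-bucket/dead-letter/")

# Schema evolution on read: merge the schemas recorded in Parquet footers,
# so files written before and after a column was added read together.
history = spark.read.option("mergeSchema", "true").parquet("s3a://my-bucket/events/")
```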
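Finally, storage: writing columnar Parquet partitioned by event date lets date-bounded queries prune partitions instead of scanning the full dataset. The bucket path is again hypothetical, and valid is the cleaned DataFrame from the validation sketch.

```python
# Storage sketch: append-only, date-partitioned Parquet layout.
# current_date() is a placeholder; a real pipeline would derive
# event_date from an event-time timestamp field.
out = valid.withColumn("event_date", F.current_date())

(
    out.write.mode("append")
    .partitionBy("event_date")  # one directory per day enables pruning
    .parquet("s3a://my-bucket/events/")
)
```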
Evaluation Rubric
- Defines scalable ingestion strategy (30% weight)
- Explains distributed processing design (30% weight)
- Optimizes storage and query efficiency (20% weight)
- Handles schema evolution and validation (20% weight)
Hints
- 💡 Think about streaming vs. batch trade-offs.
Potential Follow-up Questions
- ❓ How would you monitor ETL latency? (a minimal monitoring sketch follows this list)
- ❓ What about handling corrupt records? (see the validation sketch above)
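On the first follow-up: Structured Streaming exposes per-batch metrics through the query handle, which is one low-effort way to watch latency. The sketch assumes the query object from the processing sketch above and only prints the trigger duration; a real pipeline would export these numbers to a metrics system.

```python
# Monitoring sketch: poll per-batch progress from the streaming query.
import time

while query.isActive:
    progress = query.lastProgress  # None until the first micro-batch finishes
    if progress:
        trigger_ms = progress.get("durationMs", {}).get("triggerExecution")
        print(f"batch={progress['batchId']} trigger_ms={trigger_ms}")
    time.sleep(10)
```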