Interview Question
Read replicas are falling minutes behind the primary. How do you diagnose replication lag and remediate it safely?
Key Points to Cover
- Measure lag on the primary via pg_stat_replication (write_lag, flush_lag, replay_lag) and by comparing WAL positions (LSNs)
- Check I/O saturation, network bandwidth, and fsync settings
- Verify wal_level, and tune max_wal_size and checkpoint settings (e.g. checkpoint_timeout, checkpoint_completion_target)
- Shorten or break up heavy long-running transactions, and address VACUUM problems
- Consider logical replication or scaling read traffic
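As a minimal sketch of the first point, the snippet below turns two WAL positions into a byte lag. It assumes PostgreSQL 10 or later (where the `pg_stat_replication` columns are named `replay_lsn` and `replay_lag`); the query is shown only as a string, and the helper names `lsn_to_int` and `lag_bytes` are hypothetical, not part of any library. PostgreSQL's own `pg_wal_lsn_diff()` does the same arithmetic server-side.

```python
# Query to run on the primary; column names assume PostgreSQL 10+.
LAG_QUERY = """
SELECT application_name,
       pg_current_wal_lsn() AS primary_lsn,
       replay_lsn,
       replay_lag
FROM pg_stat_replication;
"""


def lsn_to_int(lsn: str) -> int:
    """Convert an LSN such as '16/B374D848' to a 64-bit byte position.

    An LSN is printed as two hexadecimal numbers: the high 32 bits,
    a slash, then the low 32 bits.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def lag_bytes(primary_lsn: str, replica_lsn: str) -> int:
    """Bytes of WAL the replica still has to process (hypothetical helper)."""
    return lsn_to_int(primary_lsn) - lsn_to_int(replica_lsn)


if __name__ == "__main__":
    # Illustrative values only, not taken from a real server.
    print(lag_bytes("16/B374D848", "16/B374D000"))  # prints 2120
```

Byte lag and time lag answer different questions: a replica can be few bytes behind yet minutes stale on an idle primary, so it is worth reading both.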
Evaluation Rubric
Accurately measures and reads lag signals: 30% weight
Identifies I/O/network/transaction bottlenecks: 30% weight
Applies WAL/checkpoint tuning steps: 20% weight
Keeps replicas consistent while remediating: 20% weight
Hints
- 💡Watch for table bloat caused by lagging VACUUM, and review autovacuum settings.
Common Pitfalls to Avoid
- ⚠️Focusing solely on the replica without checking the primary's write load.
- ⚠️Assuming the network is always the bottleneck without proper testing.
- ⚠️Aggressively disabling `fsync` without understanding the data durability implications.
- ⚠️Making large, untargeted configuration changes without isolating the impact of each.
- ⚠️Overlooking specific query or transaction patterns on the primary that generate extreme write loads.
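One way to avoid guessing at the bottleneck is to compare the successive WAL positions that `pg_stat_replication` reports (`sent_lsn`, `write_lsn`, `flush_lsn`, `replay_lsn`) against the primary's current position and see where the largest gap sits. The sketch below is a heuristic under stated assumptions: the function name `widest_gap` and the cause attached to each stage are illustrative rules of thumb, not a definitive diagnosis, and the sample values are made up.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert an LSN such as '0/5000000' to a 64-bit byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


# Stage labels with the usual suspect for each gap (assumed rule of thumb).
STAGES = [
    "primary -> sent (WAL sender or network)",
    "sent -> write (network or replica receive)",
    "write -> flush (replica disk I/O)",
    "flush -> replay (replay speed or recovery conflicts)",
]


def widest_gap(primary_lsn, sent_lsn, write_lsn, flush_lsn, replay_lsn):
    """Return (stage_label, gaps_in_bytes) for the stage with the most lag."""
    positions = [
        lsn_to_int(p)
        for p in (primary_lsn, sent_lsn, write_lsn, flush_lsn, replay_lsn)
    ]
    # Gap between each consecutive pair of positions, clamped at zero.
    gaps = {
        stage: max(ahead - behind, 0)
        for stage, ahead, behind in zip(STAGES, positions, positions[1:])
    }
    return max(gaps, key=gaps.get), gaps


if __name__ == "__main__":
    stage, gaps = widest_gap(
        "0/5000000", "0/5000000", "0/5000000", "0/4FFF000", "0/1000000"
    )
    print(stage)
    for name, size in gaps.items():
        print(f"{size:>12} bytes  {name}")
```

A large flush-to-replay gap, as in the sample values, points at the replica applying WAL slowly rather than at the network, which is the kind of evidence to gather before changing any setting.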
Potential Follow-up Questions
- ❓When would you switch to logical replication?
- ❓How would you fail over with minimal data loss?