AdvancedTechnical
5 min
PostgreSQL Replication Lag Troubleshooting
PostgreSQLDatabasesReliability
Advertisement
Interview Question
Read replicas are falling minutes behind the primary. How do you diagnose replication lag and remediate it safely?
Key Points to Cover
- Measure lag via pg_stat_replication and WAL positions
- Check I/O saturation, network bandwidth, and fsync settings
- Tune wal_level, max_wal_size, checkpoint settings
- Reduce heavy long-running transactions or VACUUM issues
- Consider logical replication or scaling read traffic
Evaluation Rubric
Accurately measures and reads lag signals30% weight
Identifies I/O/network/transaction bottlenecks30% weight
Applies WAL/checkpoint tuning steps20% weight
Keeps replicas consistent while remediating20% weight
Hints
- 💡Watch for vacuum bloat and autovacuum settings.
Common Pitfalls to Avoid
- ⚠️Focusing solely on the replica without checking the primary's write load.
- ⚠️Assuming the network is always the bottleneck without proper testing.
- ⚠️Aggressively disabling `fsync` without understanding the data durability implications.
- ⚠️Making large, untargeted configuration changes without isolating the impact of each.
- ⚠️Not considering the possibility of specific query or transaction patterns on the primary causing extreme write loads.
Potential Follow-up Questions
- ❓When to switch to logical replication?
- ❓How to failover with minimal data loss?
Advertisement