Beginner, Scenario
5 min
Health Check Misconfiguration Causing Flapping
Load Balancing, Reliability, Operations
Interview Question
Instances are flapping in and out of load balancers due to aggressive health checks. How do you detect and fix this without masking real failures?
Key Points to Cover
- Correlate LB health check logs with app readiness/liveness
- Adjust thresholds/intervals/timeout to match latency profile
- Differentiate startup vs steady-state checks
- Add graceful shutdown with connection draining
- Monitor false-positive/negative rates after changes
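The threshold point above can be illustrated with a small simulation. This is a minimal sketch, not any particular load balancer's algorithm: `HealthTracker`, `count_transitions`, and the sample latencies are hypothetical names and values chosen for illustration. It counts how often an instance changes state when a check must fail or pass several times in a row before the state flips.

```python
from dataclasses import dataclass


@dataclass
class HealthTracker:
    """Hypothetical model of consecutive-result health thresholds."""
    unhealthy_threshold: int  # consecutive failures needed to mark unhealthy
    healthy_threshold: int    # consecutive successes needed to mark healthy
    healthy: bool = True
    streak: int = 0           # consecutive results that contradict the current state
    transitions: int = 0      # number of state flips, i.e. the flapping count

    def observe(self, check_passed: bool) -> bool:
        if check_passed == self.healthy:
            # Result agrees with the current state, so reset the streak.
            self.streak = 0
        else:
            self.streak += 1
            limit = self.unhealthy_threshold if self.healthy else self.healthy_threshold
            if self.streak >= limit:
                self.healthy = check_passed
                self.streak = 0
                self.transitions += 1
        return self.healthy


def count_transitions(latencies_ms, timeout_ms, unhealthy_threshold, healthy_threshold):
    """Replay check latencies; a check passes if it finishes within the timeout."""
    tracker = HealthTracker(unhealthy_threshold, healthy_threshold)
    for latency in latencies_ms:
        tracker.observe(latency <= timeout_ms)
    return tracker.transitions


# Assumed sample data: occasional slow responses on an otherwise healthy instance.
spiky = [100, 900, 120, 950, 110, 100, 980, 105]
aggressive = count_transitions(spiky, timeout_ms=500, unhealthy_threshold=1, healthy_threshold=1)
tolerant = count_transitions(spiky, timeout_ms=500, unhealthy_threshold=3, healthy_threshold=2)

# A sustained outage is still detected with the tolerant settings.
outage = count_transitions([900] * 5, timeout_ms=500, unhealthy_threshold=3, healthy_threshold=2)
```

In this sketch the tolerant settings ignore isolated slow checks, yet three consecutive failures still remove the instance, so a real outage is not masked.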
Evaluation Rubric
Shows evidence of flapping and cause: 30% weight
Tunes health check parameters properly: 30% weight
Handles startup/shutdown gracefully: 20% weight
Validates reduction in flapping: 20% weight
Hints
- 💡Use separate readiness and liveness probes.
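One way to picture the hint is a service that exposes two endpoints backed by different conditions. The sketch below uses only the Python standard library. The paths `/livez` and `/readyz`, the `ProbeState` class, and the `make_handler` helper are assumptions made for illustration, not a required convention.

```python
from http.server import BaseHTTPRequestHandler


class ProbeState:
    """Hypothetical holder for the flags that drive the two probes."""

    def __init__(self):
        self.started = False   # set to True once warm-up (caches, pools) has finished
        self.draining = False  # set to True when graceful shutdown begins

    def liveness_status(self) -> int:
        # Liveness only says the process can respond; it ignores warm-up and draining,
        # so a slow start or a drain does not trigger a restart.
        return 200

    def readiness_status(self) -> int:
        # Readiness says the instance should receive traffic right now.
        return 200 if self.started and not self.draining else 503


def make_handler(state: ProbeState):
    """Build a request handler class bound to the given probe state."""

    class ProbeHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path == "/livez":
                status = state.liveness_status()
            elif self.path == "/readyz":
                status = state.readiness_status()
            else:
                status = 404
            self.send_response(status)
            self.send_header("Content-Length", "0")
            self.end_headers()

        def log_message(self, format, *args):
            # Keep probe traffic out of the default stderr access log.
            pass

    return ProbeHandler


# To serve the probes (assumed address and port):
#   from http.server import ThreadingHTTPServer
#   ThreadingHTTPServer(("127.0.0.1", 8080), make_handler(state)).serve_forever()
state = ProbeState()
```

Pointing the load balancer at the readiness path means an instance that is warming up or draining leaves rotation without being reported as dead. Setting `draining` before closing connections gives the load balancer time to stop sending new requests.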
Common Pitfalls to Avoid
- ⚠️Simply increasing the health check timeout or interval without understanding the underlying cause.
- ⚠️Failing to correlate load balancer logs with application-specific logs.
- ⚠️Treating startup health checks the same as steady-state health checks.
- ⚠️Not considering the application's inherent latency and resource constraints when configuring health checks.
- ⚠️Ignoring the possibility that the health check endpoint itself might be faulty or experiencing performance issues within the application.
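To avoid tuning blindly, the timeout can be derived from measured health endpoint latency, and the resulting detection delay can be estimated before rollout. The following is a rough sketch under stated assumptions: the function names, the 1.5x headroom factor, and the detection-time formula are illustrative, and real load balancers schedule checks differently.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def suggest_timeout_ms(latencies_ms, pct=99, headroom=1.5):
    """Assumed heuristic: a high percentile of observed check latency plus headroom."""
    return percentile(latencies_ms, pct) * headroom


def worst_case_detection_s(interval_s, timeout_s, unhealthy_threshold):
    """Approximate upper bound on time to mark a dead instance unhealthy.

    Assumes checks start at a fixed interval and the instance dies right after
    a passing check, so the last failing check starts after
    unhealthy_threshold intervals and then runs until it times out.
    """
    return unhealthy_threshold * interval_s + timeout_s


# Assumed sample: health endpoint latencies of 1..100 ms.
observed = list(range(1, 101))
timeout_ms = suggest_timeout_ms(observed)
detection_s = worst_case_detection_s(interval_s=10, timeout_s=2, unhealthy_threshold=3)
```

Comparing the estimated detection delay against how long the service can tolerate routing to a dead instance shows whether looser settings are still acceptable.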
Potential Follow-up Questions
- ❓How would you set SLO-aware thresholds?
- ❓Should you gate deploys on health trends?