AdvancedTechnical
5 min
Diagnosing Intermittent p99 Latency Spikes
PerformanceSRENetworking
Advertisement
Interview Question
A critical API has intermittent p99 latency spikes without increased error rates. How would you isolate the cause and stabilize tail latency?
Key Points to Cover
- Correlate spikes with GC, CPU steal, IO wait, or network retransmits
- Instrument queue depths, thread pools, and timeouts
- Add hedged requests or request collapsing where safe
- Apply connection pooling and tune TCP parameters
Evaluation Rubric
Correlates latency spikes to system signals35% weight
Instruments queues/pools/timeouts25% weight
Proposes tail-latency mitigation patterns20% weight
Validates with trace/metrics before/after20% weight
Hints
- 💡Check head-of-line blocking and Nagle/Delayed ACKs.
Common Pitfalls to Avoid
- ⚠️Focusing only on average latency and missing tail latency issues.
- ⚠️Insufficient instrumentation, leading to a lack of visibility into internal application behavior.
- ⚠️Making broad, unverified assumptions about the cause without data correlation.
- ⚠️Implementing mitigation strategies without fully understanding their impact on consistency or other performance metrics.
- ⚠️Not considering the impact of external dependencies or upstream/downstream services on the API's latency.
Potential Follow-up Questions
- ❓When to use hedged requests?
- ❓How do retries affect queues?
Advertisement