Diagnosing Intermittent p99 Latency Spikes

Interview Question

A critical API has intermittent p99 latency spikes without increased error rates. How would you isolate the cause and stabilize tail latency?

Key Points to Cover

Evaluation Rubric

Correlates latency spikes to system signals35% weight

Instruments queues/pools/timeouts25% weight

Proposes tail-latency mitigation patterns20% weight

Validates with trace/metrics before/after20% weight

Hints

Common Pitfalls to Avoid

⚠️Focusing only on average latency and missing tail latency issues.
⚠️Insufficient instrumentation, leading to a lack of visibility into internal application behavior.
⚠️Making broad, unverified assumptions about the cause without data correlation.
⚠️Implementing mitigation strategies without fully understanding their impact on consistency or other performance metrics.
⚠️Not considering the impact of external dependencies or upstream/downstream services on the API's latency.

Potential Follow-up Questions