High CPU Steal Time on VMs

Interview Question

Services on certain VMs show latency spikes correlated with CPU steal time. How do you investigate and mitigate?

Key Points to Cover

Evaluation Rubric

Measures steal time correctly (30% weight)

Links contention to latency impact (30% weight)

Applies workload/infra mitigations (20% weight)

Adds ongoing monitoring/alerts (20% weight)
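
As an illustration of the first rubric item, steal time on a Linux guest can be measured by sampling the aggregate `cpu` line of `/proc/stat` twice and taking the steal delta as a share of the total CPU time delta. The following is a minimal sketch under that assumption; the function names (`parse_cpu_line`, `steal_percent`, `sample_steal`) and the one-second default interval are illustrative choices, not part of the original question.

```python
import time

# Order of the first eight counters on the "cpu" line of /proc/stat (Linux).
# The guest and guest_nice counters that follow are already included in
# user and nice, so they are left out of the total on purpose.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into a dict of tick counts."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    # Very old kernels expose fewer columns; treat missing counters as zero.
    values += [0] * (len(FIELDS) - len(values))
    return dict(zip(FIELDS, values))


def steal_percent(before, after):
    """Return steal time as a percentage of all CPU time between two samples."""
    deltas = {name: after[name] - before[name] for name in FIELDS}
    total = sum(deltas.values())
    return 100.0 * deltas["steal"] / total if total > 0 else 0.0


def sample_steal(interval=1.0, path="/proc/stat"):
    """Sample /proc/stat twice, `interval` seconds apart (Linux only)."""
    def read():
        with open(path) as f:
            return parse_cpu_line(f.readline())

    before = read()
    time.sleep(interval)
    return steal_percent(before, read())
```

Calling `sample_steal()` on the affected VM gives a point-in-time reading that should roughly agree with the `%steal` column of `mpstat` or the `st` value in `top`. The counters are cumulative since boot, so only the difference between two samples is meaningful.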

Hints

Common Pitfalls to Avoid

⚠️ Assuming CPU steal time always means a 'noisy neighbor' without first ruling out bottlenecks on the VM itself (e.g., I/O wait, memory pressure).
⚠️ Relying solely on VM-level tools, which can mask the true source of contention, instead of verifying steal time against hypervisor or provider metrics when those are accessible.
⚠️ Focusing only on migrating the affected VM without considering whether the workload can be optimized to reduce its CPU footprint.
⚠️ Failing to correlate latency spikes with steal time on a shared timeline, leading to misguided troubleshooting.
⚠️ Lacking alerts for sustained high CPU steal time, which makes problem-solving reactive rather than proactive.
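
The last two pitfalls can be made concrete with a small sketch: one helper that checks whether latency and steal time actually move together over aligned sampling windows, and one that flags sustained steal rather than a single spike. This assumes the two series have already been collected at the same timestamps. The helper names, the 10% threshold, and the three-sample window are hypothetical values chosen for illustration and should be tuned per environment.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat series carries no correlation signal
    return cov / sqrt(var_x * var_y)


def sustained_breach(steal_samples, threshold=10.0, min_consecutive=3):
    """True if steal % stays at or above `threshold` for `min_consecutive` samples in a row."""
    run = 0
    for value in steal_samples:
        run = run + 1 if value >= threshold else 0
        if run >= min_consecutive:
            return True
    return False


# Hypothetical, time-aligned samples: steal % and p99 latency in milliseconds.
steal = [1.0, 2.0, 12.0, 15.0, 11.0, 2.0]
p99_ms = [40.0, 42.0, 95.0, 120.0, 90.0, 45.0]

print(f"correlation: {pearson(steal, p99_ms):.2f}")
print(f"alert: {sustained_breach(steal)}")
```

A coefficient near zero suggests the latency spikes have another cause (for example I/O wait or memory pressure), while a strong positive value supports the contention hypothesis without proving causation. Requiring several consecutive breaches keeps the alert from firing on brief, harmless bursts of steal.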

Potential Follow-up Questions