Intermediate | Scenario
10 min

High CPU Steal Time on VMs

Cloud, Linux, Performance
Interview Question

Services on certain VMs show latency spikes correlated with CPU steal time. How do you investigate and mitigate?

Key Points to Cover
  • Confirm steal time via the st column in vmstat or top, or via hypervisor metrics
  • Correlate with noisy neighbors or host contention
  • Migrate workloads/instances or change instance family/placement
  • Right-size CPU and enable CPU pinning/affinity if applicable
  • Work with provider; add SLO alerts for steal time
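
As a rough illustration of the first key point, the sketch below derives a steal percentage from two samples of the aggregate cpu line in /proc/stat, which is the same counter that vmstat and top summarize. The function names (read_cpu_times, steal_percent, sample_steal) and the one-second default interval are assumptions made for this example, not part of any standard tool.

```python
import time


def read_cpu_times(stat_text):
    """Parse the aggregate 'cpu' line of /proc/stat text into a list of ints."""
    for line in stat_text.splitlines():
        if line.startswith("cpu "):
            return [int(field) for field in line.split()[1:]]
    raise ValueError("no aggregate cpu line found")


def steal_percent(before, after):
    """Return steal time as a percentage of all CPU time between two samples."""
    # Field order: user nice system idle iowait irq softirq steal [guest guest_nice]
    # guest and guest_nice are already included in user and nice, so only the
    # first eight fields are summed to avoid double counting.
    deltas = [a - b for a, b in zip(after[:8], before[:8])]
    total = sum(deltas)
    if len(deltas) < 8 or total <= 0:
        return 0.0
    return 100.0 * deltas[7] / total


def sample_steal(interval=1.0, path="/proc/stat"):
    """Sample /proc/stat twice, `interval` seconds apart (Linux guests only)."""
    with open(path) as f:
        before = read_cpu_times(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = read_cpu_times(f.read())
    return steal_percent(before, after)
```

On a Linux guest, calling sample_steal(5.0) repeatedly during a latency spike gives a per-interval steal figure that can be compared against the provider's host-level metrics.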
Evaluation Rubric
Measures steal time correctly: 30% weight
Links contention to latency impact: 30% weight
Applies workload/infra mitigations: 20% weight
Adds ongoing monitoring/alerts: 20% weight
Hints
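  • 💡 Dedicated hosts remove neighbor contention. On burstable instance types, exhausted CPU credits can also surface as steal time, so check the credit balance.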
Common Pitfalls to Avoid
  • ⚠️Assuming CPU steal time is always a 'noisy neighbor' issue without ruling out other performance bottlenecks on the VM itself (e.g., I/O wait, memory pressure).
  • ⚠️Not verifying CPU steal time at the hypervisor level when provider metrics are accessible, and relying solely on VM-level tools, which can sometimes mask the true source.
  • ⚠️Focusing only on migrating the affected VM without considering the possibility of optimizing the workload to reduce its CPU footprint.
  • ⚠️Failing to correlate latency spikes with steal time accurately, leading to misguided troubleshooting efforts.
  • ⚠️Not having appropriate alerting in place for high CPU steal time, leading to reactive rather than proactive problem-solving.
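
Two of the pitfalls above, weak correlation and missing alerts, can be sketched in a few lines. The example assumes steal percentages and latency values have already been collected at the same timestamps. The helper names and the 10% threshold with three consecutive samples are hypothetical starting points, not recommended values.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat series carries no correlation signal
    return cov / math.sqrt(var_x * var_y)


def sustained_breach(steal_pcts, threshold=10.0, min_consecutive=3):
    """True if steal stays above `threshold` for `min_consecutive` samples in a row."""
    # threshold and min_consecutive are illustrative assumptions; tune per workload.
    run = 0
    for pct in steal_pcts:
        run = run + 1 if pct > threshold else 0
        if run >= min_consecutive:
            return True
    return False
```

A coefficient near 1 supports the steal-time hypothesis, while a weak one suggests checking I/O wait or memory pressure first. Requiring several consecutive breaches keeps a single noisy sample from paging anyone.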
Potential Follow-up Questions
  • When would you use dedicated instances?
  • How does cgroup CPU quota affect latency?