Interview Question
Services on certain VMs show latency spikes correlated with CPU steal time. How do you investigate and mitigate?
Key Points to Cover
- Confirm steal time via vmstat (the st column), top (the st value on the CPU line), or hypervisor metrics
- Correlate with noisy neighbors or host contention
- Migrate workloads/instances or change instance family/placement
- Right-size CPU and enable CPU pinning/affinity if applicable
- Work with provider; add SLO alerts for steal time
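The first key point, measuring steal time, can be sketched by sampling the aggregate cpu line of /proc/stat twice and computing the share of elapsed CPU time that went to steal. This is a minimal sketch, assuming a Linux guest whose kernel reports the steal field (the eighth value on the cpu line). The function names and the 5-second interval are illustrative choices, not part of the original question.

```python
import time


def parse_cpu_line(line):
    """Return the first eight counters of the aggregate 'cpu' line.

    Order: user, nice, system, idle, iowait, irq, softirq, steal.
    guest and guest_nice are skipped because the kernel already
    counts them inside user and nice.
    """
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in parts[1:]]
    if len(values) < 8:
        raise ValueError("this kernel does not report a steal field")
    return values[:8]


def steal_percent(before, after):
    """Percentage of CPU time stolen between two /proc/stat samples."""
    b = parse_cpu_line(before)
    a = parse_cpu_line(after)
    deltas = [y - x for x, y in zip(b, a)]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    return 100.0 * deltas[7] / total


def read_cpu_line(path="/proc/stat"):
    """Read the first line of /proc/stat (the aggregate cpu line)."""
    with open(path) as f:
        return f.readline()


def sample_steal(interval_seconds=5.0):
    """Take two live samples and return steal percent (Linux only)."""
    first = read_cpu_line()
    time.sleep(interval_seconds)
    return steal_percent(first, read_cpu_line())
```

Calling sample_steal() on the affected VM during a latency spike and again during a quiet period gives a quick guest-side comparison before escalating to provider metrics.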
Evaluation Rubric
Measures steal time correctly: 30% weight
Links contention to latency impact: 30% weight
Applies workload/infra mitigations: 20% weight
Adds ongoing monitoring/alerts: 20% weight
Hints
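- 💡Dedicated hosts remove contention from other tenants. On burstable instance types, check CPU credits: throttling after credits run out can also show up as steal time.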
Common Pitfalls to Avoid
- ⚠️Assuming CPU steal time is always a 'noisy neighbor' issue without ruling out other performance bottlenecks on the VM itself (e.g., I/O wait, memory pressure).
- ⚠️Relying solely on VM-level tools, which show that steal occurred but not what caused it, instead of also checking hypervisor or provider metrics when they are accessible.
- ⚠️Focusing only on migrating the affected VM without considering the possibility of optimizing the workload to reduce its CPU footprint.
- ⚠️Failing to correlate latency spikes with steal time accurately, leading to misguided troubleshooting efforts.
- ⚠️Not having appropriate alerting in place for high CPU steal time, leading to reactive rather than proactive problem-solving.
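The correlation pitfall above can be checked numerically rather than by eyeballing dashboards. The sketch below computes a Pearson correlation coefficient between steal percentages and latency measurements taken over the same time windows. It assumes the two series are already aligned window by window; the function name and the sample values are hypothetical, for illustration only.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series.

    Returns 0.0 when either series is constant, since the
    coefficient is undefined in that case.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-minute samples: steal percent and p99 latency in ms.
steal_pct = [0.5, 0.7, 6.2, 9.8, 1.1, 0.4]
p99_latency_ms = [42, 45, 130, 210, 50, 41]

print(f"steal vs p99 latency correlation: {pearson(steal_pct, p99_latency_ms):.2f}")
```

A coefficient near 1 supports the steal-time hypothesis. A weak coefficient suggests checking I/O wait, memory pressure, or other bottlenecks on the VM itself before migrating anything.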
Potential Follow-up Questions
- ❓When to use dedicated instances?
- ❓How does cgroup CPU quota affect latency?