Intermediate · Scenario
10 min
Node CPU Thrashing
Linux · Performance · Troubleshooting
Interview Question
One node in your cluster shows 100% CPU usage along with spikes in context switching. How do you troubleshoot it?
Key Points to Cover
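- Run top, htop, or mpstat to confirm CPU saturation and see whether the time is spent in user, system, iowait, or steal
- Check context-switch and interrupt rates (for example with vmstat, or pidstat -w per process)
- Identify runaway processes or noisy neighbors
- Adjust CPU limits or reschedule workloads onto other nodes
- Tune kernel parameters if the problem is systemic rather than tied to one workload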
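The first two checks above can be sketched in a few lines of Python that read the same counters top and vmstat rely on. This is a minimal illustration, not a replacement for those tools: it assumes a Linux node with the standard /proc/stat layout, and the function names (parse_proc_stat, summarize, sample) are made up for this example.

```python
import time


def parse_proc_stat(text):
    """Return (busy, total, ctxt) counters parsed from /proc/stat contents."""
    busy = total = ctxt = 0
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "cpu":
            # Aggregate line: user nice system idle iowait irq softirq steal.
            # Guest time is already counted in user, so only 8 fields are summed.
            values = [int(v) for v in fields[1:9]]
            values += [0] * (8 - len(values))  # pad if a kernel reports fewer fields
            total = sum(values)
            busy = total - values[3] - values[4]  # exclude idle and iowait
        elif fields[0] == "ctxt":
            ctxt = int(fields[1])  # context switches since boot
    return busy, total, ctxt


def summarize(before, after, interval_s):
    """Return (cpu_busy_percent, context_switches_per_second) for two snapshots."""
    busy0, total0, ctxt0 = parse_proc_stat(before)
    busy1, total1, ctxt1 = parse_proc_stat(after)
    elapsed = total1 - total0
    cpu_pct = 100.0 * (busy1 - busy0) / elapsed if elapsed > 0 else 0.0
    return cpu_pct, (ctxt1 - ctxt0) / interval_s


def sample(interval_s=1.0, path="/proc/stat"):
    """Take two live snapshots interval_s apart (Linux only)."""
    with open(path) as f:
        before = f.read()
    time.sleep(interval_s)
    with open(path) as f:
        after = f.read()
    return summarize(before, after, interval_s)
```

Calling sample() on the affected node and on a healthy peer gives a quick comparison. A busy percentage near 100 together with a context-switch rate far above the peer's suggests scheduling churn rather than one long-running computation.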
Evaluation Rubric
Uses CPU monitoring tools (30% weight)
Finds process/neighbors causing thrash (30% weight)
Mitigates via limits/rescheduling (20% weight)
Mentions tuning or safeguards (20% weight)
Hints
- 💡 Noisy-neighbor issues are common in shared clusters.
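One way to tell a workload squeezed by its own CPU limit apart from one suffering from a neighbor is to look at throttling counters. The sketch below assumes cgroup v2, where each group exposes a cpu.stat file with nr_periods and nr_throttled. The helper names are invented for this example, and the path to pass in depends on how the node lays out its cgroups.

```python
def parse_cpu_stat(text):
    """Parse 'key value' lines from a cgroup v2 cpu.stat file into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def throttle_ratio(before, after):
    """Fraction of enforcement periods in which the cgroup was throttled."""
    periods = after.get("nr_periods", 0) - before.get("nr_periods", 0)
    throttled = after.get("nr_throttled", 0) - before.get("nr_throttled", 0)
    return throttled / periods if periods > 0 else 0.0


def read_cpu_stat(path):
    """Read one cpu.stat file; the caller supplies the cgroup-specific path."""
    with open(path) as f:
        return parse_cpu_stat(f.read())
```

A ratio close to 1.0 across two readings means the group hits its quota in almost every period, which points at the limit itself. A ratio near zero on a slow workload leaves contention from other tenants as the more likely explanation.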
Common Pitfalls to Avoid
- ⚠️ Jumping to conclusions and killing processes without proper investigation, potentially causing data corruption or service disruption.
- ⚠️ Focusing solely on CPU usage without considering context switching, which often points to the *reason* for high CPU.
- ⚠️ Not checking for noisy neighbors in shared environments (VMs, containers), leading to misdiagnosis.
- ⚠️ Neglecting to examine system logs and other performance metrics (I/O, network) that might provide crucial context.
- ⚠️ Failing to document the troubleshooting steps and the identified root cause, hindering future problem-solving and prevention.
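To investigate before killing anything, it helps to see which processes are actually doing the switching. The following sketch ranks processes by the growth of the voluntary_ctxt_switches and nonvoluntary_ctxt_switches counters in /proc/<pid>/status between two snapshots. The function names are illustrative, and the counters are per task, so for multithreaded processes the per-thread files under /proc/<pid>/task/ would need the same treatment.

```python
import os


def parse_status(text):
    """Extract the name and context-switch counters from /proc/<pid>/status text."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return {
        "name": fields.get("Name", "?"),
        "voluntary": int(fields.get("voluntary_ctxt_switches", "0")),
        "nonvoluntary": int(fields.get("nonvoluntary_ctxt_switches", "0")),
    }


def snapshot(proc_root="/proc"):
    """Return {pid: parsed status} for every readable process (Linux only)."""
    result = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "status")) as f:
                result[int(entry)] = parse_status(f.read())
        except OSError:
            continue  # process exited or is not readable
    return result


def rank_switchers(before, after, top=5):
    """Return (pid, name, voluntary_delta, nonvoluntary_delta), busiest first."""
    rows = []
    for pid, now in after.items():
        prev = before.get(pid)
        if prev is None:
            continue  # started after the first snapshot
        vol = now["voluntary"] - prev["voluntary"]
        nonvol = now["nonvoluntary"] - prev["nonvoluntary"]
        rows.append((pid, now["name"], vol, nonvol))
    rows.sort(key=lambda r: r[2] + r[3], reverse=True)
    return rows[:top]
```

Roughly speaking, a large nonvoluntary delta means the process keeps getting preempted (CPU contention), while a large voluntary delta means it keeps blocking on I/O, locks, or sleeps. The two patterns call for different fixes, which is why looking only at CPU percentage can mislead.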
Potential Follow-up Questions
- ❓ How do you prevent CPU starvation?
- ❓ What about CPU pinning?
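For the pinning follow-up, a small sketch can make the discussion concrete. It uses os.sched_getaffinity and os.sched_setaffinity, which Python exposes on Linux only. The helper names choose_cpus and pin are invented here, and in an orchestrated cluster pinning would more likely be configured through the scheduler or container runtime than called by hand.

```python
import os


def choose_cpus(allowed, wanted):
    """Return the requested CPUs that the process is actually allowed to use."""
    usable = set(allowed) & set(wanted)
    if not usable:
        raise ValueError("none of the requested CPUs are available to this process")
    return usable


def pin(pid, wanted):
    """Restrict pid (0 means the calling process) to the wanted CPUs. Linux only."""
    usable = choose_cpus(os.sched_getaffinity(pid), wanted)
    os.sched_setaffinity(pid, usable)
    return os.sched_getaffinity(pid)
```

For instance, pin(0, {2, 3}) would restrict the calling process to CPUs 2 and 3 if both are in its allowed set. Pinning can reduce migrations and cache misses for latency-sensitive work, but it also removes the scheduler's freedom to rebalance, so it is usually reserved for a few well-understood workloads.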