Intermediate | Scenario
10 min

Node CPU Thrashing

Linux | Performance | Troubleshooting
Interview Question

One node in your cluster shows 100% CPU usage with context switching spikes. How do you troubleshoot?

Key Points to Cover
  • Run top/htop/mpstat to confirm CPU saturation
  • Check context-switch and interrupt rates (vmstat, pidstat -w, /proc/interrupts)
  • Identify runaway processes or noisy neighbors
  • Adjust CPU limits or reschedule workloads
  • Tune kernel parameters if the problem is systemic rather than tied to one workload
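
The first three points can be sketched in Python as a rough stand-in for what mpstat, vmstat and pidstat -w report. This is a minimal sketch, assuming a Linux node with /proc mounted; the function names (parse_proc_stat, rates, parse_task_switches, top_switchers) are illustrative and not part of any tool. Note that the per-process counters in /proc/<pid>/status are cumulative since the task started and describe the main thread only, so treat the ranking as a starting point rather than a verdict.

```python
"""Sketch: sample CPU saturation and context-switch rate, then rank processes."""
import time
from pathlib import Path


def parse_proc_stat(text):
    """Return (busy_jiffies, total_jiffies, context_switches) from /proc/stat text."""
    busy = total = ctxt = 0
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "cpu":  # aggregate line only, not cpu0, cpu1, ...
            # user nice system idle iowait irq softirq steal (guest is already in user)
            values = [int(v) for v in fields[1:9]]
            values += [0] * (8 - len(values))
            total = sum(values)
            busy = total - (values[3] + values[4])  # exclude idle + iowait
        elif fields[0] == "ctxt":
            ctxt = int(fields[1])
    return busy, total, ctxt


def rates(before, after, interval_s):
    """CPU busy percentage and context switches per second between two samples."""
    busy0, total0, ctxt0 = before
    busy1, total1, ctxt1 = after
    dtotal = total1 - total0
    cpu_pct = 100.0 * (busy1 - busy0) / dtotal if dtotal else 0.0
    return cpu_pct, (ctxt1 - ctxt0) / interval_s


def parse_task_switches(status_text):
    """Return (name, voluntary, nonvoluntary) from /proc/<pid>/status text."""
    info = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        info[key] = value.strip()
    return (info.get("Name", "?"),
            int(info.get("voluntary_ctxt_switches", 0)),
            int(info.get("nonvoluntary_ctxt_switches", 0)))


def top_switchers(limit=5, proc=Path("/proc")):
    """Rank processes by nonvoluntary switches (a sign of being preempted)."""
    rows = []
    for status in proc.glob("[0-9]*/status"):
        try:
            name, vol, nonvol = parse_task_switches(status.read_text())
        except (OSError, ValueError):
            continue  # process exited while we were scanning
        rows.append((nonvol, vol, name, status.parent.name))
    return sorted(rows, reverse=True)[:limit]


def main(interval_s=1.0):
    stat = Path("/proc/stat")
    if not stat.exists():
        print("no /proc/stat here; this sketch is Linux-only")
        return
    first = parse_proc_stat(stat.read_text())
    time.sleep(interval_s)
    second = parse_proc_stat(stat.read_text())
    cpu_pct, ctxt_per_s = rates(first, second, interval_s)
    print(f"CPU busy: {cpu_pct:.1f}%  context switches/s: {ctxt_per_s:.0f}")
    for nonvol, vol, name, pid in top_switchers():
        print(f"pid={pid} {name}: nonvoluntary={nonvol} voluntary={vol}")


if __name__ == "__main__":
    main()
```

A high nonvoluntary count suggests tasks are being preempted because too many runnable threads compete for the CPU, while a high voluntary count usually points at blocking on I/O or locks. That distinction helps decide between adjusting limits, rescheduling workloads, or looking at the application itself.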
Evaluation Rubric
Uses CPU monitoring tools (30% weight)
Finds process/neighbors causing thrash (30% weight)
Mitigates via limits/rescheduling (20% weight)
Mentions tuning or safeguards (20% weight)
Hints
  • 💡 Noisy-neighbor issues are common in shared clusters.
Common Pitfalls to Avoid
  • ⚠️Jumping to conclusions and killing processes without proper investigation, potentially causing data corruption or service disruption.
  • ⚠️Focusing solely on CPU usage without considering context switching, which often points to the *reason* for high CPU.
  • ⚠️Not checking for noisy neighbors in shared environments (VMs, containers), leading to misdiagnosis.
  • ⚠️Neglecting to examine system logs and other performance metrics (I/O, network) that might provide crucial context.
  • ⚠️Failing to document the troubleshooting steps and the identified root cause, hindering future problem-solving and prevention.
Potential Follow-up Questions
  • How do you prevent CPU starvation?
  • What about CPU pinning?
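
For the CPU pinning follow-up, one way to illustrate the idea is with the affinity calls in Python's standard library, which wrap the Linux sched_setaffinity interface. This is a minimal sketch, assuming a Linux host; parse_cpu_list and pin are hypothetical helper names, and the CPU list format ("0-3,8") mirrors the one taskset accepts. In an orchestrated cluster the scheduler or container runtime would normally manage pinning, so this only shows the underlying mechanism.

```python
"""Sketch: pin a process to a fixed set of CPUs (Linux only)."""
import os


def parse_cpu_list(spec):
    """Expand a CPU list such as "0-3,8" into a set of ints."""
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-", 1)
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return cpus


def pin(pid, spec):
    """Restrict pid (0 means the calling process) to the CPUs in spec."""
    os.sched_setaffinity(pid, parse_cpu_list(spec))
    return os.sched_getaffinity(pid)


if __name__ == "__main__" and hasattr(os, "sched_setaffinity"):
    allowed = sorted(os.sched_getaffinity(0))
    print("allowed before:", allowed)
    # Pin this process to the first CPU it is already allowed to use.
    print("allowed after: ", sorted(pin(0, str(allowed[0]))))
```

Pinning keeps a latency-sensitive task on the same cores, which reduces migrations and cache misses, but it also removes the scheduler's freedom to move work away from a busy core, so it can make starvation worse if other tasks share the pinned CPUs.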