Node CPU Thrashing

Interview Question

One node in your cluster shows 100% CPU usage with context switching spikes. How do you troubleshoot?

Key Points to Cover

Evaluation Rubric

Uses CPU monitoring tools (30% weight)

Finds process/neighbors causing thrash (30% weight)

Mitigates via limits/rescheduling (20% weight)

Mentions tuning or safeguards (20% weight)
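
As one illustration of the "finds process causing thrash" criterion, the sketch below ranks processes by how many involuntary context switches they accumulated between two samples. It assumes a Linux node, where /proc/<pid>/status exposes the cumulative counters voluntary_ctxt_switches and nonvoluntary_ctxt_switches. The function names (parse_ctxt_switches, snapshot_status, rank_by_involuntary_delta) are hypothetical helpers chosen for this example, not part of any standard tool.

```python
import os
import re


def parse_ctxt_switches(status_text):
    """Extract context-switch counters from the text of /proc/<pid>/status."""
    counts = {}
    for key in ("voluntary_ctxt_switches", "nonvoluntary_ctxt_switches"):
        match = re.search(rf"^{key}:\s+(\d+)$", status_text, re.MULTILINE)
        counts[key] = int(match.group(1)) if match else 0
    return counts


def snapshot_status():
    """Read /proc/<pid>/status for every visible process (Linux only)."""
    snapshot = {}
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/status") as handle:
                snapshot[int(entry)] = handle.read()
        except OSError:
            continue  # the process exited between listing and reading
    return snapshot


def rank_by_involuntary_delta(before, after):
    """Rank PIDs by involuntary context switches gained between two snapshots.

    The counters are cumulative since process start, so only the difference
    between two samples says anything about current behaviour.
    """
    deltas = []
    for pid, text in after.items():
        if pid not in before:
            continue  # process started after the first sample
        now = parse_ctxt_switches(text)["nonvoluntary_ctxt_switches"]
        then = parse_ctxt_switches(before[pid])["nonvoluntary_ctxt_switches"]
        deltas.append((pid, now - then))
    return sorted(deltas, key=lambda item: item[1], reverse=True)
```

On a live node, calling snapshot_status() twice a few seconds apart and passing both results to rank_by_involuntary_delta gives a rough ordering of the processes being preempted most often. A high involuntary count usually suggests more runnable threads than available cores, which is a hint rather than proof of the root cause.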

Hints

Common Pitfalls to Avoid

⚠️ Jumping to conclusions and killing processes without proper investigation, potentially causing data corruption or service disruption.
⚠️ Focusing solely on CPU usage and ignoring the context-switch rate, which often points to the reason for the high CPU.
⚠️ Not checking for noisy neighbors in shared environments (VMs, containers), leading to misdiagnosis.
⚠️ Neglecting to examine system logs and other performance metrics (I/O, network) that might provide crucial context.
⚠️ Failing to document the troubleshooting steps and the identified root cause, hindering future problem-solving and prevention.
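
To make the second and third pitfalls concrete, here is a minimal sketch that looks at CPU usage, context-switch rate, and steal time together instead of CPU alone. It assumes a Linux node and the /proc/stat layout, where the aggregate "cpu" line lists user, nice, system, idle, iowait, irq, softirq, and steal jiffies, and the "ctxt" line holds the total context switches since boot. The helper names parse_proc_stat and summarize are hypothetical, and any threshold for "too many" switches is workload-specific.

```python
CPU_FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_proc_stat(stat_text):
    """Return aggregate CPU jiffies and the context-switch total from /proc/stat text."""
    snapshot = {}
    for line in stat_text.splitlines():
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "cpu":  # aggregate line only, not cpu0, cpu1, ...
            snapshot.update(zip(CPU_FIELDS, map(int, parts[1:9])))
        elif parts[0] == "ctxt":
            snapshot["ctxt"] = int(parts[1])
    return snapshot


def summarize(before, after, interval_s):
    """Compare two parsed snapshots taken interval_s seconds apart."""
    deltas = {name: after[name] - before[name] for name in CPU_FIELDS}
    total = sum(deltas.values())
    if total == 0:
        return {"busy_pct": 0.0, "steal_pct": 0.0, "ctxt_per_s": 0.0}
    busy = total - deltas["idle"] - deltas["iowait"]
    return {
        # share of time the CPUs were not idle or waiting on I/O
        "busy_pct": 100 * busy / total,
        # time the hypervisor gave to other guests; a noisy-neighbor hint on VMs
        "steal_pct": 100 * deltas["steal"] / total,
        # system-wide context switches per second over the interval
        "ctxt_per_s": (after["ctxt"] - before["ctxt"]) / interval_s,
    }
```

In practice the two snapshots would come from reading /proc/stat twice, a few seconds apart. A high busy percentage combined with a sharply rising context-switch rate suggests scheduling contention, while a high steal percentage suggests the pressure may originate outside the guest.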

Potential Follow-up Questions