Node Disk Full

Interview Question

One of your production nodes is reporting 100% disk usage and workloads are failing. How do you investigate and resolve this?

Key Points to Cover

Evaluation Rubric

Identifies full disk using system tools (30% weight)

Finds culprit dirs/files (30% weight)

Applies cleanup/expansion appropriately (20% weight)

Mentions preventive monitoring (20% weight)
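
The first and last rubric items (identifying the full filesystem, and monitoring to catch it earlier) can be illustrated with a small script. This is a minimal sketch, not a replacement for `df -h`: the function names, the 90% threshold, and the candidate paths are assumptions chosen for illustration, and the percentages may differ slightly from `df` output because of how reserved blocks are counted.

```python
import shutil


def usage_percent(path):
    """Return used space on the filesystem containing `path`, as a percentage."""
    total, used, _free = shutil.disk_usage(path)
    return 100.0 * used / total if total else 0.0


def full_filesystems(paths, threshold=90.0):
    """Return (path, percent) pairs at or above `threshold`, fullest first."""
    results = []
    for path in paths:
        try:
            pct = usage_percent(path)
        except OSError:
            continue  # path does not exist or is unreadable on this node
        if pct >= threshold:
            results.append((path, pct))
    return sorted(results, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical candidate paths; the real mount layout varies per node
    # and per container runtime, so adjust this list to match the host.
    candidates = ["/", "/var", "/var/log", "/var/lib/kubelet",
                  "/var/lib/containerd", "/tmp"]
    for path, pct in full_filesystems(candidates, threshold=90.0):
        print(f"{path}: {pct:.1f}% used")
```

Run periodically with a lower threshold, the same check could feed an alert before the node reaches 100%; in practice a node-level metrics agent usually fills that role.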

Hints

Common Pitfalls to Avoid

⚠️ Failing to immediately identify the specific filesystem that is full.
⚠️ Spending too much time on log analysis without first using `df` and `du` to pinpoint the largest consumers.
⚠️ Not considering container runtime storage (image layers, volumes) as a potential cause in a Kubernetes environment.
⚠️ Making broad assumptions about the cause without systematically investigating the node and its workloads.
⚠️ Deleting files indiscriminately without understanding their purpose, potentially causing further system instability.
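
Pinpointing the largest consumers before deleting anything can be sketched as follows. This is roughly what `du` piped through `sort` gives you, written as a hedged example: `tree_size` and `largest_children` are hypothetical helper names, sizes are apparent file sizes rather than allocated blocks, and hard-linked files are counted once per link.

```python
import os


def tree_size(path):
    """Sum apparent file sizes under `path` without following symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(path, onerror=lambda err: None):
        for name in filenames:
            try:
                total += os.lstat(os.path.join(dirpath, name)).st_size
            except OSError:
                pass  # file vanished or is unreadable; skip it
    return total


def largest_children(root, top=5):
    """Return up to `top` (size_bytes, path) pairs for entries directly
    under `root`, largest first."""
    sizes = []
    with os.scandir(root) as entries:
        for entry in entries:
            try:
                if entry.is_dir(follow_symlinks=False):
                    size = tree_size(entry.path)
                else:
                    size = entry.stat(follow_symlinks=False).st_size
            except OSError:
                continue  # entry disappeared or cannot be inspected
            sizes.append((size, entry.path))
    return sorted(sizes, reverse=True)[:top]
```

Calling `largest_children("/var", top=10)` on the full filesystem and then repeating it on the biggest result narrows the search one level at a time, which keeps the investigation systematic and makes it clear what a file or directory is before it is removed.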

Potential Follow-up Questions