AdvancedScenario
15 min
Kernel Panic and Node Reboot Loop
LinuxKernelReliability
Advertisement
Interview Question
A production node repeatedly reboots due to kernel panics under load. Outline your triage and containment steps.
Key Points to Cover
- Isolate node from schedulers and drain workloads
- Collect crash dumps/kdump and kernel logs for signatures
- Check recent kernel/driver updates and hardware errors
- Test with alternative kernel or disable problematic drivers/modules
- Plan staged rollout/rollback and add health automation
Evaluation Rubric
Safely isolates/cordons the node30% weight
Captures crash evidence for RCA30% weight
Proposes kernel/driver mitigations20% weight
Rolls changes safely with automation20% weight
Hints
- 💡Look for hardware ECC/machine check events.
Potential Follow-up Questions
- ❓How do you automate node quarantine?
- ❓What’s your rollback strategy?
Advertisement