Troubleshooting Scenarios
Real-world problem-solving and incident response
Your application's database response times have increased by 300% over the last hour. Users are complaining about slow page loads. How do you investigate and resolve this?
Production dashboards show a sharp increase in HTTP 5xx responses from the web tier over the last 10 minutes, but traffic volume is normal. Describe your step-by-step triage and remediation.
A pod has been stuck in Pending state for over 15 minutes. Walk me through how you would troubleshoot and resolve this issue.
One of your production nodes is reporting 100% disk usage and workloads are failing. How do you investigate and resolve this?
Your production website is suddenly showing SSL errors for users. How do you troubleshoot and fix this?
Your services suddenly cannot resolve domain names, breaking connectivity to dependencies. Walk me through your triage.
A service shows steadily increasing memory usage until it crashes. How do you investigate and remediate this memory leak?
Your application logs show frequent database connection pool exhaustion errors. How do you investigate and fix this?
Users in one region report very slow page loads, but the rest of the world is fine. How do you troubleshoot this CDN performance issue?
Your Kafka consumer groups are showing high lag and messages are processing slowly. How do you investigate and remediate this?
Your API's average latency jumped from 100ms to 2s without an increase in traffic. How would you investigate?
A Kubernetes pod keeps restarting with CrashLoopBackOff. How do you debug and resolve this?
Half your nodes cannot communicate with the other half due to a suspected network partition. How do you investigate and respond?
Your RabbitMQ/SQS queue has millions of unprocessed messages. What steps do you take?
One node in your cluster shows 100% CPU usage with context switching spikes. How do you troubleshoot?
Multiple services are failing with timeout errors when calling an internal API. How do you approach debugging?
Your cloud provider reports an ongoing outage in one region, affecting your services. How do you respond?
Your centralized logging pipeline stops ingesting logs from multiple services. How do you debug?
A critical dependency is failing and your service's circuit breaker has opened. How do you handle this situation?
On a Linux production node, `df -h` reports the filesystem nearly full, but `du -sh /` shows far less used space. How do you reconcile the discrepancy and free space safely?
Clients intermittently receive 401 Unauthorized even with valid JWTs. Walk through diagnosing and fixing the issue.
A service starts failing with EMFILE errors. Describe how you identify the cause and fix it.
A cache eviction triggers a surge of requests to the origin, causing overload. How do you diagnose and prevent cache stampede?
Nightly maintenance jobs overlap and create resource contention and backlog. Explain your triage and prevention.
Services on certain VMs show latency spikes correlated with CPU steal time. How do you investigate and mitigate?
Throughput bottlenecks appear on a subset of shards. Outline your approach to identify and mitigate hot partitions.
After a certificate rotation, services in the mesh begin failing with 503s. How do you diagnose and restore traffic?
Users report that data they just wrote is not visible when reading immediately. Outline your investigation and mitigation.
Outbound calls from your private subnets start failing intermittently. Investigation suggests SNAT port exhaustion. How do you confirm and fix?
New pods are failing with ImagePullBackOff and registry logs show rate limiting/throttling. How do you restore service quickly and prevent recurrence?
A production node repeatedly reboots due to kernel panics under load. Outline your triage and containment steps.
A distributed database starts rejecting writes or showing anomalies due to detected clock skew on some nodes. How do you diagnose and stabilize?
Canary rollout passed, but a portion of clients hit decommissioned IPs due to stale DNS caches. What do you do?
After tightening TLS settings, some clients fail during handshake. How do you triage and restore compatibility without weakening security?
Disk usage looks low, but new files cannot be created and services fail with ENOSPC. How do you confirm inode exhaustion and fix it?
Sudden latency spikes correlate with saturated server thread pools. How do you diagnose and remediate safely?
A misconfigured deployment invalidates most CDN cache objects at once, flooding the origin. What's your triage and prevention plan?
Instances are flapping in and out of load balancers due to aggressive health checks. How do you detect and fix this without masking real failures?
Background jobs calling a cloud provider API start failing with 429 Too Many Requests during peak hours. How do you stabilize and prevent it?
Your DNS servers are overloaded with suspicious traffic patterns resembling amplification. How do you detect, mitigate, and protect?
Alerts based on log ingestion are delayed by 15 minutes. Walk through diagnosing and fixing pipeline slowness.
An application shows sudden latency spikes due to cloud storage IOPS limits being hit. How do you confirm and fix?
Users face 503 errors during scheduled DB maintenance. How do you minimize impact and handle gracefully?
A container is consistently OOMKilled under normal workload. How do you debug and fix it?
CI pipelines are blocked by flaky integration tests. How do you triage and stabilize pipelines?
Internal and external clients see different DNS answers, causing failures. How do you debug and fix split-horizon issues?
Your Kubernetes etcd cluster shows high fsync latency, causing API server slowness. How do you troubleshoot and resolve?
Clients receive 431 Request Header Fields Too Large errors. Walk me through how you identify and remediate.