🔧

Troubleshooting Scenarios

Real-world problem-solving and incident response

48 Questions
45 min session
All Difficulty Levels
Database Performance Degradation
Intermediate

Your application's database response times have increased by 300% over the last hour. Users are complaining about slow page loads. How do you investigate and resolve this?

10 min•Scenario
Sudden 5xx Spike on Web Tier
Intermediate

Production dashboards show a sharp increase in HTTP 5xx responses from the web tier over the last 10 minutes, but traffic volume is normal. Describe your step-by-step triage and remediation.

10 min•Scenario
Kubernetes Pod Stuck in Pending
Intermediate

A pod has been stuck in Pending state for over 15 minutes. Walk me through how you would troubleshoot and resolve this issue.

10 min•Scenario
Node Disk Full
Intermediate

One of your production nodes is reporting 100% disk usage and workloads are failing. How do you investigate and resolve this?

10 min•Scenario
Expired SSL Certificate
Beginner

Your production website is suddenly showing SSL errors for users. How do you troubleshoot and fix this?

5 min•Scenario
DNS Resolution Failure
Intermediate

Your services suddenly cannot resolve domain names, breaking connectivity to dependencies. Walk me through your triage.

10 min•Scenario
Application Memory Leak
Advanced

A service shows steadily increasing memory usage until it crashes. How do you investigate and remediate this memory leak?

15 min•Scenario
Database Connection Exhaustion
Intermediate

Your application logs show frequent database connection pool exhaustion errors. How do you investigate and fix this?

10 min•Scenario
Slow CDN Performance
Intermediate

Users in one region report very slow page loads, but the rest of the world is fine. How do you troubleshoot this CDN performance issue?

10 min•Scenario
Kafka Consumer Lag
Advanced

Your Kafka consumer groups are showing high lag and messages are processing slowly. How do you investigate and remediate this?

15 min•Scenario
API Latency Spike
Intermediate

Your API’s average latency jumped from 100ms to 2s without an increase in traffic. How would you investigate?

10 min•Scenario
Service in CrashLoopBackOff
Intermediate

The pods behind a Kubernetes service keep restarting and report CrashLoopBackOff. How do you debug and resolve this?

10 min•Scenario
Network Partition in Distributed System
Advanced

Half your nodes cannot communicate with the other half due to a suspected network partition. How do you investigate and respond?

15 min•Scenario
Message Queue Backlog
Intermediate

Your RabbitMQ/SQS queue has millions of unprocessed messages. What steps do you take?

10 min•Scenario
Node CPU Thrashing
Intermediate

One node in your cluster shows 100% CPU usage with spikes in context switching. How do you troubleshoot?

10 min•Scenario
Service Timeout Errors
Intermediate

Multiple services are failing with timeout errors when calling an internal API. How do you approach debugging?

10 min•Scenario
Cloud Provider Outage
Advanced

Your cloud provider reports an ongoing outage in one region, affecting your services. How do you respond?

15 min•Scenario
Log Ingestion Failure
Intermediate

Your centralized logging pipeline stops ingesting logs from multiple services. How do you debug?

10 min•Scenario
Circuit Breaker Tripped
Intermediate

A critical dependency is failing and your service’s circuit breaker has opened. How do you handle this situation?

10 min•Scenario
du and df Show Different Disk Usage
Intermediate

On a Linux production node, `df -h` reports the filesystem nearly full, but `du -sh /` shows far less used space. How do you reconcile the discrepancy and free space safely?

10 min•Scenario
Intermittent 401s Due to JWT/Clock Skew
Intermediate

Clients intermittently receive 401 Unauthorized even with valid JWTs. Walk through diagnosing and fixing the issue.

10 min•Scenario
EMFILE: Too Many Open Files
Intermediate

A service starts failing with EMFILE errors. Describe how you identify the cause and fix it.

10 min•Scenario
Cache Stampede / Thundering Herd
Intermediate

A cache eviction triggers a surge of requests to the origin, overloading it. How do you diagnose and prevent a cache stampede?

10 min•Scenario
Overlapping Cron Jobs Causing Backlog
Beginner

Nightly maintenance jobs overlap and create resource contention and backlog. Explain your triage and prevention.

5 min•Scenario
High CPU Steal Time on VMs
Intermediate

Services on certain VMs show latency spikes correlated with CPU steal time. How do you investigate and mitigate?

10 min•Scenario
Hot Partition in a Sharded Database
Advanced

Throughput bottlenecks appear on a subset of shards. Outline your approach to identify and mitigate hot partitions.

15 min•Scenario
Service Mesh mTLS Certificate Rotation Failure
Advanced

After a certificate rotation, services in the mesh begin failing with 503s. How do you diagnose and restore traffic?

15 min•Scenario
Read-After-Write Inconsistency
Intermediate

Users report that data they just wrote is not visible when reading immediately. Outline your investigation and mitigation.

10 min•Scenario
SNAT Port Exhaustion via NAT Gateway
Advanced

Outbound calls from your private subnets start failing intermittently. Investigation suggests SNAT port exhaustion. How do you confirm and fix?

15 min•Scenario
Container Image Pulls Throttled by Registry
Intermediate

New pods are failing with ImagePullBackOff and registry logs show rate limiting/throttling. How do you restore service quickly and prevent recurrence?

10 min•Scenario
Kernel Panic and Node Reboot Loop
Advanced

A production node repeatedly reboots due to kernel panics under load. Outline your triage and containment steps.

15 min•Scenario
Clock Skew Breaking Distributed DB Writes
Advanced

A distributed database starts rejecting writes or showing anomalies due to detected clock skew on some nodes. How do you diagnose and stabilize?

15 min•Scenario
Clients Using Stale DNS Cache
Intermediate

Canary rollout passed, but a portion of clients hit decommissioned IPs due to stale DNS caches. What do you do?

10 min•Scenario
TLS Handshake Failures from Cipher Mismatch
Intermediate

After tightening TLS settings, some clients fail during handshake. How do you triage and restore compatibility without weakening security?

10 min•Scenario
Inode Exhaustion on Filesystem
Intermediate

Disk usage looks low, but new files cannot be created and services fail with ENOSPC. How do you confirm inode exhaustion and fix it?

10 min•Scenario
Thread Pool Exhaustion Causing Latency
Intermediate

Sudden latency spikes correlate with saturated server thread pools. How do you diagnose and remediate safely?

10 min•Scenario
CDN Invalidation Storm Causes Origin Overload
Intermediate

A misconfigured deployment invalidates most CDN cache objects at once, flooding the origin. What’s your triage and prevention plan?

10 min•Scenario
Health Check Misconfiguration Causing Flapping
Beginner

Instances are flapping in and out of load balancers due to aggressive health checks. How do you detect and fix this without masking real failures?

5 min•Scenario
Cloud API Throttling (429) Causing Failures
Intermediate

Background jobs calling a cloud provider API start failing with 429 Too Many Requests during peak hours. How do you stabilize and prevent it?

10 min•Scenario
DNS Amplification DDoS Attack
Advanced

Your DNS servers are overloaded with suspicious traffic patterns resembling amplification. How do you detect, mitigate, and protect?

15 min•Scenario
Slow Log Pipeline Delaying Alerts
Intermediate

Alerts based on log ingestion are delayed by 15 minutes. Walk through diagnosing and fixing pipeline slowness.

10 min•Scenario
Storage IOPS Throttling
Intermediate

An application shows sudden latency spikes due to cloud storage IOPS limits being hit. How do you confirm and fix?

10 min•Scenario
HTTP 503 Errors During DB Maintenance
Beginner

Users see 503 errors during scheduled DB maintenance. How do you minimize the impact and handle the maintenance window gracefully?

5 min•Scenario
Container OOMKilled Repeatedly
Intermediate

A container is consistently OOMKilled under normal workload. How do you debug and fix it?

10 min•Scenario
Flaky Integration Tests Blocking Releases
Intermediate

CI pipelines are blocked by flaky integration tests. How do you triage and stabilize pipelines?

10 min•Scenario
DNS Split-Horizon Misconfiguration
Intermediate

Internal and external clients see different DNS answers, causing failures. How do you debug and fix split-horizon issues?

10 min•Scenario
etcd Disk Latency Causing Cluster Issues
Advanced

Your Kubernetes etcd cluster shows high fsync latency, causing API server slowness. How do you troubleshoot and resolve?

15 min•Scenario
HTTP Header Bloat Causing 431 Errors
Beginner

Clients receive 431 Request Header Fields Too Large errors. Walk me through how you identify and remediate.

5 min•Scenario