🔧

Troubleshooting Scenarios

Real-world problem-solving and incident response

48 Questions
45 min session
All Difficulty Levels
Database Performance Degradation
Intermediate
Database, Performance, Troubleshooting, +1

Your application's database response times have increased by 300% over the last hour. Users are complaining about slow page loads. How do you investigate and resolve this?

10 min • Scenario
Sudden 5xx Spike on Web Tier
Intermediate
HTTP, Web, Observability, +1

Production dashboards show a sharp increase in HTTP 5xx responses from the web tier over the last 10 minutes, but traffic volume is normal. Describe your step-by-step triage and remediation.

10 min • Scenario
Kubernetes Pod Stuck in Pending
Intermediate
Kubernetes, Containers, Scheduling, +1

A pod has been stuck in Pending state for over 15 minutes. Walk me through how you would troubleshoot and resolve this issue.

10 min • Scenario
Node Disk Full
Intermediate
Linux, Storage, Kubernetes, +1

One of your production nodes is reporting 100% disk usage and workloads are failing. How do you investigate and resolve this?

10 min • Scenario
Expired SSL Certificate
Beginner
Security, Certificates, Networking

Your production website is suddenly showing SSL errors for users. How do you troubleshoot and fix this?

5 min • Scenario
DNS Resolution Failure
Intermediate
DNS, Networking, Reliability

Your services suddenly cannot resolve domain names, breaking connectivity to dependencies. Walk me through your triage.

10 min • Scenario
Application Memory Leak
Advanced
Memory, Profiling, Debugging

A service shows steadily increasing memory usage until it crashes. How do you investigate and remediate this memory leak?

15 min • Scenario
Database Connection Exhaustion
Intermediate
Databases, Connection Pooling, Troubleshooting

Your application logs show frequent database connection pool exhaustion errors. How do you investigate and fix this?

10 min • Scenario
Slow CDN Performance
Intermediate
CDN, Networking, Performance

Users in one region report very slow page loads, but the rest of the world is fine. How do you troubleshoot this CDN performance issue?

10 min • Scenario
Kafka Consumer Lag
Advanced
Kafka, Messaging, Performance

Your Kafka consumer groups are showing high lag and messages are processing slowly. How do you investigate and remediate this?

15 min • Scenario
API Latency Spike
Intermediate
API, Latency, Performance, +1

Your API's average latency jumped from 100ms to 2s without an increase in traffic. How would you investigate?

10 min • Scenario
Service in CrashLoopBackOff
Intermediate
Kubernetes, Containers, Reliability

A Kubernetes workload keeps restarting and its pods are stuck in CrashLoopBackOff. How do you debug and resolve this?

10 min • Scenario
Network Partition in Distributed System
Advanced
Networking, Distributed Systems, Reliability

Half your nodes cannot communicate with the other half due to a suspected network partition. How do you investigate and respond?

15 min • Scenario
Message Queue Backlog
Intermediate
Messaging, Queues, Performance

Your RabbitMQ/SQS queue has millions of unprocessed messages. What steps do you take?

10 min • Scenario
Node CPU Thrashing
Intermediate
Linux, Performance, Troubleshooting

One node in your cluster shows 100% CPU usage with context switching spikes. How do you troubleshoot?

10 min • Scenario
Service Timeout Errors
Intermediate
Networking, APIs, Timeouts

Multiple services are failing with timeout errors when calling an internal API. How do you approach debugging?

10 min • Scenario
Cloud Provider Outage
Advanced
Cloud, Reliability, Incident Response

Your cloud provider reports an ongoing outage in one region, affecting your services. How do you respond?

15 min • Scenario
Log Ingestion Failure
Intermediate
Logging, Monitoring, Troubleshooting

Your centralized logging pipeline stops ingesting logs from multiple services. How do you debug?

10 min • Scenario
Circuit Breaker Tripped
Intermediate
Resilience, Reliability, Microservices

A critical dependency is failing and your service's circuit breaker has opened. How do you handle this situation?

10 min • Scenario
du and df Show Different Disk Usage
Intermediate
Linux, Storage, Troubleshooting

On a Linux production node, `df -h` reports the filesystem nearly full, but `du -sh /` shows far less used space. How do you reconcile the discrepancy and free space safely?

10 min • Scenario
Intermittent 401s Due to JWT/Clock Skew
Intermediate
Security, Auth, Time Sync

Clients intermittently receive 401 Unauthorized even with valid JWTs. Walk through diagnosing and fixing the issue.

10 min • Scenario
EMFILE: Too Many Open Files
Intermediate
Linux, Limits, Reliability

A service starts failing with EMFILE errors. Describe how you identify the cause and fix it.

10 min • Scenario
Cache Stampede / Thundering Herd
Intermediate
Caching, Performance, Reliability

A cache eviction triggers a surge of requests to the origin, causing overload. How do you diagnose and prevent cache stampede?

10 min • Scenario
Overlapping Cron Jobs Causing Backlog
Beginner
Scheduling, Operations, Performance

Nightly maintenance jobs overlap and create resource contention and backlog. Explain your triage and prevention.

5 min • Scenario
High CPU Steal Time on VMs
Intermediate
Cloud, Linux, Performance

Services on certain VMs show latency spikes correlated with CPU steal time. How do you investigate and mitigate?

10 min • Scenario
Hot Partition in a Sharded Database
Advanced
Databases, Sharding, Scalability

Throughput bottlenecks appear on a subset of shards. Outline your approach to identify and mitigate hot partitions.

15 min • Scenario
Service Mesh mTLS Certificate Rotation Failure
Advanced
Service Mesh, Security, Networking

After a certificate rotation, services in the mesh begin failing with 503s. How do you diagnose and restore traffic?

15 min • Scenario
Read-After-Write Inconsistency
Intermediate
Consistency, Caching, Databases

Users report that data they just wrote is not visible when reading immediately. Outline your investigation and mitigation.

10 min • Scenario
SNAT Port Exhaustion via NAT Gateway
Advanced
Cloud Networking, NAT, Scalability

Outbound calls from your private subnets start failing intermittently. Investigation suggests SNAT port exhaustion. How do you confirm and fix?

15 min • Scenario
Container Image Pulls Throttled by Registry
Intermediate
Kubernetes, Containers, Supply Chain, +1

New pods are failing with ImagePullBackOff and registry logs show rate limiting/throttling. How do you restore service quickly and prevent recurrence?

10 min • Scenario
Kernel Panic and Node Reboot Loop
Advanced
Linux, Kernel, Reliability

A production node repeatedly reboots due to kernel panics under load. Outline your triage and containment steps.

15 min • Scenario
Clock Skew Breaking Distributed DB Writes
Advanced
Databases, Distributed Systems, Time Sync

A distributed database starts rejecting writes or showing anomalies due to detected clock skew on some nodes. How do you diagnose and stabilize?

15 min • Scenario
Clients Using Stale DNS Cache
Intermediate
DNS, Networking, Release Engineering

Canary rollout passed, but a portion of clients hit decommissioned IPs due to stale DNS caches. What do you do?

10 min • Scenario
TLS Handshake Failures from Cipher Mismatch
Intermediate
Security, TLS, Networking

After tightening TLS settings, some clients fail during handshake. How do you triage and restore compatibility without weakening security?

10 min • Scenario
Inode Exhaustion on Filesystem
Intermediate
Linux, Storage, Operations

Disk usage looks low, but new files cannot be created and services fail with ENOSPC. How do you confirm inode exhaustion and fix it?

10 min • Scenario
Thread Pool Exhaustion Causing Latency
Intermediate
Performance, Concurrency, APIs

Sudden latency spikes correlate with saturated server thread pools. How do you diagnose and remediate safely?

10 min • Scenario
CDN Invalidation Storm Causes Origin Overload
Intermediate
CDN, Caching, Performance

A misconfigured deployment invalidates most CDN cache objects at once, flooding the origin. What's your triage and prevention plan?

10 min • Scenario
Health Check Misconfiguration Causing Flapping
Beginner
Load Balancing, Reliability, Operations

Instances are flapping in and out of load balancers due to aggressive health checks. How do you detect and fix this without masking real failures?

5 min • Scenario
Cloud API Throttling (429) Causing Failures
Intermediate
Cloud, Rate Limiting, Reliability

Background jobs calling a cloud provider API start failing with 429 Too Many Requests during peak hours. How do you stabilize and prevent it?

10 min • Scenario
DNS Amplification DDoS Attack
Advanced
DNS, Security, DDoS

Your DNS servers are overloaded with suspicious traffic patterns resembling amplification. How do you detect, mitigate, and protect?

15 min • Scenario
Slow Log Pipeline Delaying Alerts
Intermediate
Logging, Monitoring, Pipelines

Alerts based on log ingestion are delayed by 15 minutes. Walk through diagnosing and fixing pipeline slowness.

10 min • Scenario
Storage IOPS Throttling
Intermediate
Storage, Cloud, Performance

An application shows sudden latency spikes due to cloud storage IOPS limits being hit. How do you confirm and fix?

10 min • Scenario
HTTP 503 Errors During DB Maintenance
Beginner
Databases, Maintenance, Availability

Users face 503 errors during scheduled DB maintenance. How do you minimize the impact and handle the maintenance window gracefully?

5 min • Scenario
Container OOMKilled Repeatedly
Intermediate
Containers, Kubernetes, Memory

A container is consistently OOMKilled under normal workload. How do you debug and fix it?

10 min • Scenario
Flaky Integration Tests Blocking Releases
Intermediate
CI/CD, Testing, Reliability

CI pipelines are blocked by flaky integration tests. How do you triage and stabilize pipelines?

10 min • Scenario
DNS Split-Horizon Misconfiguration
Intermediate
DNS, Networking, Configuration

Internal and external clients see different DNS answers, causing failures. How do you debug and fix split-horizon issues?

10 min • Scenario
etcd Disk Latency Causing Cluster Issues
Advanced
Kubernetes, etcd, Storage

Your Kubernetes etcd cluster shows high fsync latency, causing API server slowness. How do you troubleshoot and resolve?

15 min • Scenario
HTTP Header Bloat Causing 431 Errors
Beginner
HTTP, Networking, Debugging

Clients receive 431 Request Header Fields Too Large errors. Walk me through how you identify and remediate.

5 min • Scenario