Troubleshooting Scenarios

Real-world problem-solving and incident response

48 Questions

45 min session

All Difficulty Levels

Database Performance Degradation

Intermediate

Database Performance Troubleshooting (+1 more)

Your application's database response times have increased by 300% over the last hour. Users are complaining about slow page loads. How do you investigate and resolve this?

10 min•Scenario

Sudden 5xx Spike on Web Tier

Intermediate

HTTP Web Observability (+1 more)

Production dashboards show a sharp increase in HTTP 5xx responses from the web tier over the last 10 minutes, but traffic volume is normal. Describe your step-by-step triage and remediation.

10 min•Scenario

Kubernetes Pod Stuck in Pending

Intermediate

Kubernetes Containers Scheduling (+1 more)

A pod has been stuck in Pending state for over 15 minutes. Walk me through how you would troubleshoot and resolve this issue.

10 min•Scenario

Node Disk Full

Intermediate

Linux Storage Kubernetes (+1 more)

One of your production nodes is reporting 100% disk usage and workloads are failing. How do you investigate and resolve this?

10 min•Scenario

Expired SSL Certificate

Beginner

Security Certificates Networking

Your production website is suddenly showing SSL errors for users. How do you troubleshoot and fix this?

5 min•Scenario

DNS Resolution Failure

Intermediate

DNS Networking Reliability

Your services suddenly cannot resolve domain names, breaking connectivity to dependencies. Walk me through your triage.

10 min•Scenario

Application Memory Leak

Advanced

Memory Profiling Debugging

A service shows steadily increasing memory usage until it crashes. How do you investigate and remediate this memory leak?

15 min•Scenario

Database Connection Exhaustion

Intermediate

Databases Connection Pooling Troubleshooting

Your application logs show frequent database connection pool exhaustion errors. How do you investigate and fix this?

10 min•Scenario

Slow CDN Performance

Intermediate

CDN Networking Performance

Users in one region report very slow page loads, but the rest of the world is fine. How do you troubleshoot this CDN performance issue?

10 min•Scenario

Kafka Consumer Lag

Advanced

Kafka Messaging Performance

Your Kafka consumer groups are showing high lag and messages are processing slowly. How do you investigate and remediate this?

15 min•Scenario

API Latency Spike

Intermediate

API Latency Performance (+1 more)

Your API’s average latency jumped from 100ms to 2s without an increase in traffic. How would you investigate?

10 min•Scenario

Service in CrashLoopBackOff

Intermediate

Kubernetes Containers Reliability

A Kubernetes service keeps restarting with CrashLoopBackOff. How do you debug and resolve this?

10 min•Scenario

Network Partition in Distributed System

Advanced

Networking Distributed Systems Reliability

Half your nodes cannot communicate with the other half due to a suspected network partition. How do you investigate and respond?

15 min•Scenario

Message Queue Backlog

Intermediate

Messaging Queues Performance

Your RabbitMQ/SQS queue has millions of unprocessed messages. What steps do you take?

10 min•Scenario

Node CPU Thrashing

Intermediate

Linux Performance Troubleshooting

One node in your cluster shows 100% CPU usage with context-switching spikes. How do you troubleshoot this?

10 min•Scenario

Service Timeout Errors

Intermediate

Networking APIs Timeouts

Multiple services are failing with timeout errors when calling an internal API. How do you approach debugging?

10 min•Scenario

Cloud Provider Outage

Advanced

Cloud Reliability Incident Response

Your cloud provider reports an ongoing outage in one region, affecting your services. How do you respond?

15 min•Scenario

Log Ingestion Failure

Intermediate

Logging Monitoring Troubleshooting

Your centralized logging pipeline stops ingesting logs from multiple services. How do you debug this?

10 min•Scenario

Circuit Breaker Tripped

Intermediate

Resilience Reliability Microservices

A critical dependency is failing and your service’s circuit breaker has opened. How do you handle this situation?

10 min•Scenario

du and df Show Different Disk Usage

Intermediate

Linux Storage Troubleshooting

On a Linux production node, `df -h` reports the filesystem nearly full, but `du -sh /` shows far less used space. How do you reconcile the discrepancy and free space safely?

10 min•Scenario

Intermittent 401s Due to JWT/Clock Skew

Intermediate

Security Auth Time Sync

Clients intermittently receive 401 Unauthorized even with valid JWTs. Walk through diagnosing and fixing the issue.

10 min•Scenario

EMFILE: Too Many Open Files

Intermediate

Linux Limits Reliability

A service starts failing with EMFILE errors. Describe how you identify the cause and fix it.

10 min•Scenario

Cache Stampede / Thundering Herd

Intermediate

Caching Performance Reliability

A cache eviction triggers a surge of requests to the origin, overloading it. How do you diagnose and prevent a cache stampede?

10 min•Scenario

Overlapping Cron Jobs Causing Backlog

Beginner

Scheduling Operations Performance

Nightly maintenance jobs overlap and create resource contention and backlog. Explain your triage and prevention.

5 min•Scenario

High CPU Steal Time on VMs

Intermediate

Cloud Linux Performance

Services on certain VMs show latency spikes correlated with CPU steal time. How do you investigate and mitigate this?

10 min•Scenario

Hot Partition in a Sharded Database

Advanced

Databases Sharding Scalability

Throughput bottlenecks appear on a subset of shards. Outline your approach to identify and mitigate hot partitions.

15 min•Scenario

Service Mesh mTLS Certificate Rotation Failure

Advanced

Service Mesh Security Networking

After a certificate rotation, services in the mesh begin failing with 503s. How do you diagnose and restore traffic?

15 min•Scenario

Read-After-Write Inconsistency

Intermediate

Consistency Caching Databases

Users report that data they just wrote is not visible when reading immediately. Outline your investigation and mitigation.

10 min•Scenario

SNAT Port Exhaustion via NAT Gateway

Advanced

Cloud Networking NAT Scalability

Outbound calls from your private subnets start failing intermittently. Investigation suggests SNAT port exhaustion. How do you confirm and fix it?

15 min•Scenario

Container Image Pulls Throttled by Registry

Intermediate

Kubernetes Containers Supply Chain (+1 more)

New pods are failing with ImagePullBackOff and registry logs show rate limiting/throttling. How do you restore service quickly and prevent recurrence?

10 min•Scenario

Kernel Panic and Node Reboot Loop

Advanced

Linux Kernel Reliability

A production node repeatedly reboots due to kernel panics under load. Outline your triage and containment steps.

15 min•Scenario

Clock Skew Breaking Distributed DB Writes

Advanced

Databases Distributed Systems Time Sync

A distributed database starts rejecting writes or showing anomalies due to detected clock skew on some nodes. How do you diagnose the skew and stabilize the cluster?

15 min•Scenario

Clients Using Stale DNS Cache

Intermediate

DNS Networking Release Engineering

The canary rollout passed, but some clients are hitting decommissioned IPs because of stale DNS caches. What do you do?

10 min•Scenario

TLS Handshake Failures from Cipher Mismatch

Intermediate

Security TLS Networking

After tightening TLS settings, some clients fail during handshake. How do you triage and restore compatibility without weakening security?

10 min•Scenario

Inode Exhaustion on Filesystem

Intermediate

Linux Storage Operations

Disk usage looks low, but new files cannot be created and services fail with ENOSPC. How do you confirm inode exhaustion and fix it?

10 min•Scenario

Thread Pool Exhaustion Causing Latency

Intermediate

Performance Concurrency APIs

Sudden latency spikes correlate with saturated server thread pools. How do you diagnose and remediate safely?

10 min•Scenario

CDN Invalidation Storm Causes Origin Overload

Intermediate

CDN Caching Performance

A misconfigured deployment invalidates most CDN cache objects at once, flooding the origin. What’s your triage and prevention plan?

10 min•Scenario

Health Check Misconfiguration Causing Flapping

Beginner

Load Balancing Reliability Operations

Instances are flapping in and out of load balancers due to aggressive health checks. How do you detect and fix this without masking real failures?

5 min•Scenario

Cloud API Throttling (429) Causing Failures

Intermediate

Cloud Rate Limiting Reliability

Background jobs calling a cloud provider API start failing with 429 Too Many Requests during peak hours. How do you stabilize and prevent it?

10 min•Scenario

DNS Amplification DDoS Attack

Advanced

DNS Security DDoS

Your DNS servers are overloaded with suspicious traffic patterns resembling amplification. How do you detect the attack, mitigate it, and protect against recurrence?

15 min•Scenario

Slow Log Pipeline Delaying Alerts

Intermediate

Logging Monitoring Pipelines

Alerts based on log ingestion are delayed by 15 minutes. Walk through diagnosing and fixing pipeline slowness.

10 min•Scenario

Storage IOPS Throttling

Intermediate

Storage Cloud Performance

An application shows sudden latency spikes because it is hitting cloud storage IOPS limits. How do you confirm and fix this?

10 min•Scenario

HTTP 503 Errors During DB Maintenance

Beginner

Databases Maintenance Availability

Users face 503 errors during scheduled DB maintenance. How do you minimize the impact and handle the maintenance gracefully?

5 min•Scenario

Container OOMKilled Repeatedly

Intermediate

Containers Kubernetes Memory

A container is consistently OOMKilled under normal workload. How do you debug and fix it?

10 min•Scenario

Flaky Integration Tests Blocking Releases

Intermediate

CI/CD Testing Reliability

CI pipelines are blocked by flaky integration tests. How do you triage the failures and stabilize the pipelines?

10 min•Scenario

DNS Split-Horizon Misconfiguration

Intermediate

DNS Networking Configuration

Internal and external clients see different DNS answers, causing failures. How do you debug and fix split-horizon issues?

10 min•Scenario

etcd Disk Latency Causing Cluster Issues

Advanced

Kubernetes etcd Storage

Your Kubernetes etcd cluster shows high fsync latency, causing API server slowness. How do you troubleshoot and resolve this?

15 min•Scenario

HTTP Header Bloat Causing 431 Errors

Beginner

HTTP Networking Debugging

Clients receive 431 Request Header Fields Too Large errors. Walk me through how you identify and remediate the cause.

5 min•Scenario
