Reliability
All interview questions related to Reliability
What is the difference between readiness and liveness probes in Kubernetes?
Read replicas are falling minutes behind the primary. How do you diagnose replication lag and remediate it safely?
You need reliable event publication coupled with database writes. Describe how you'd implement the outbox pattern and ensure idempotency end to end.
In a streaming system under bursty load, how do you implement backpressure to prevent overload and cascading failures?
What does idempotency mean in APIs, and how would you design idempotent operations in REST or gRPC services?
Design a service to send notifications via email, SMS, and push at scale with retries, templates, and user preferences.
Design a highly available checkout/cart system handling flash sales, inventory reservations, payments, and order confirmation.
Design a payment gateway supporting multiple processors, 3-D Secure, refunds, settlements, and PCI concerns.
Design a reliable, horizontally scalable scheduler (distributed cron) that supports one-off and recurring jobs with retries and idempotency.
Design a multi-tenant API gateway that handles routing, auth, rate limiting, request/response transformations, canarying, and observability across regions.
Design a distributed cache that supports eviction policies, consistency across nodes, replication, and client-side failover.
Design a global publish/subscribe system with millions of subscribers, durable delivery, and filtering.
Your services suddenly cannot resolve domain names, breaking connectivity to dependencies. Walk me through your triage.
A Kubernetes service keeps restarting with CrashLoopBackOff. How do you debug and resolve this?
Half your nodes cannot communicate with the other half due to a suspected network partition. How do you investigate and respond?
Your cloud provider reports an ongoing outage in one region, affecting your services. How do you respond?
A critical dependency is failing and your service's circuit breaker has opened. How do you handle this situation?
A service starts failing with EMFILE errors. Describe how you identify the cause and fix it.
A cache eviction triggers a surge of requests to the origin, causing overload. How do you diagnose and prevent cache stampede?
New pods are failing with ImagePullBackOff and registry logs show rate limiting/throttling. How do you restore service quickly and prevent recurrence?
A production node repeatedly reboots due to kernel panics under load. Outline your triage and containment steps.
Instances are flapping in and out of load balancers due to aggressive health checks. How do you detect and fix this without masking real failures?
Background jobs calling a cloud provider API start failing with 429 Too Many Requests during peak hours. How do you stabilize and prevent it?
CI pipelines are blocked by flaky integration tests. How do you triage and stabilize pipelines?
Tell me about a time you discovered a near-miss that could have caused data loss. How did you handle it and what changed?
Tell me about a time when you performed a root cause analysis after a failure. How did you conduct it, and what changes did you implement afterward?