IntermediateScenario
10 min
Sudden 5xx Spike on Web Tier
HTTPWebObservabilityIncident Response
Advertisement
Interview Question
Production dashboards show a sharp increase in HTTP 5xx responses from the web tier over the last 10 minutes, but traffic volume is normal. Describe your step-by-step triage and remediation.
Key Points to Cover
- Confirm scope with metrics (5xx rate, latency, saturation) and traces
- Check recent deploys/feature flags and roll back if correlated
- Inspect app/web logs and upstream dependency health
- Validate config/env changes and TLS/cert validity
- Apply circuit breakers/retries; widen capacity if needed
Evaluation Rubric
Uses metrics/logs/traces to scope incident30% weight
Considers recent changes and flag rollbacks25% weight
Checks upstream/downstream dependencies25% weight
Applies safe mitigation and rollback20% weight
Hints
- 💡Look for config drift, expired certs, or downstream timeouts.
Potential Follow-up Questions
- ❓How would you prevent regressions?
- ❓Which SLOs/alerts would you set?
Advertisement