IntermediateScenario
10 min
Sudden 5xx Spike on Web Tier
HTTPWebObservabilityIncident Response
Advertisement
Interview Question
Production dashboards show a sharp increase in HTTP 5xx responses from the web tier over the last 10 minutes, but traffic volume is normal. Describe your step-by-step triage and remediation.
Key Points to Cover
- Confirm scope with metrics (5xx rate, latency, saturation) and traces
- Check recent deploys/feature flags and roll back if correlated
- Inspect app/web logs and upstream dependency health
- Validate config/env changes and TLS/cert validity
- Apply circuit breakers/retries; widen capacity if needed
Evaluation Rubric
Uses metrics/logs/traces to scope incident30% weight
Considers recent changes and flag rollbacks25% weight
Checks upstream/downstream dependencies25% weight
Applies safe mitigation and rollback20% weight
Hints
- 💡Look for config drift, expired certs, or downstream timeouts.
Common Pitfalls to Avoid
- ⚠️Jumping to conclusions without confirming the scope of the issue.
- ⚠️Failing to check recent changes (deploys, flags) which are often the primary culprit.
- ⚠️Not correlating web tier errors with the health of its upstream dependencies.
- ⚠️Spending too much time in one area (e.g., deep log diving) without exploring other potential causes.
- ⚠️Lack of a clear rollback strategy before attempting a fix or hotfix.
Potential Follow-up Questions
- ❓How would you prevent regressions?
- ❓Which SLOs/alerts would you set?
Advertisement