Advertisement
Interview Question
Production dashboards show a sharp increase in HTTP 5xx responses from the web tier over the last 10 minutes, but traffic volume is normal. Describe your step-by-step triage and remediation.
Key Points to Cover
- Confirm scope with metrics (5xx rate, latency, saturation) and traces
- Check recent deploys/feature flags and roll back if correlated
- Inspect app/web logs and upstream dependency health
- Validate config/env changes and TLS/cert validity
- Apply circuit breakers/retries; widen capacity if needed
Evaluation Rubric
Uses metrics/logs/traces to scope incident30% weight
Considers recent changes and flag rollbacks25% weight
Checks upstream/downstream dependencies25% weight
Applies safe mitigation and rollback20% weight
Hints
- 💡Look for config drift, expired certs, or downstream timeouts.
Common Pitfalls to Avoid
- ⚠️Jumping to conclusions without confirming the scope of the issue.
- ⚠️Failing to check recent changes (deploys, flags) which are often the primary culprit.
- ⚠️Not correlating web tier errors with the health of its upstream dependencies.
- ⚠️Spending too much time in one area (e.g., deep log diving) without exploring other potential causes.
- ⚠️Lack of a clear rollback strategy before attempting a fix or hotfix.
Potential Follow-up Questions
- ❓How would you prevent regressions?
- ❓Which SLOs/alerts would you set?
Advertisement
Related Questions
Questions that share similar topics with this one
HTTP Idempotent Methods
Intermediate📞 Phone Screen•2 min•Phone
HTTP/1.1 vs HTTP/2
Intermediate📞 Phone Screen•2 min•Phone
Common HTTP Status Codes
Beginner📞 Phone Screen•2 min•Phone
How DNS Resolution Works
Intermediate📞 Phone Screen•2 min•Phone
HTTP Keep-Alive & Connection Pooling
Intermediate📞 Phone Screen•2 min•Phone