Sudden 5xx Spike on Web Tier

HTTP Web Observability Incident Response

Interview Question

Production dashboards show a sharp increase in HTTP 5xx responses from the web tier over the last 10 minutes, but traffic volume is normal. Describe your step-by-step triage and remediation.

Key Points to Cover

Confirm scope with metrics (5xx rate, latency, saturation) and traces
Check recent deploys/feature flags and roll back if correlated
Inspect app/web logs and upstream dependency health
Validate config/env changes and TLS/cert validity
Apply circuit breakers and bounded retries with backoff; add capacity if saturation is the cause
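
As a rough illustration of the first two points, the sketch below breaks the 5xx rate down by a chosen dimension so that a spike confined to one release, host, or route stands out. The record fields (`status`, `release`, `route`) and the sample data are assumptions made for the example; real access logs or metrics labels will differ.

```python
from collections import Counter


def error_rate_by(records, key):
    """Return {value: fraction of 5xx responses} for each distinct value of `key`."""
    totals, errors = Counter(), Counter()
    for rec in records:
        totals[rec[key]] += 1
        if 500 <= rec["status"] <= 599:
            errors[rec[key]] += 1
    return {value: errors[value] / totals[value] for value in totals}


# Hypothetical parsed access-log records (field names are assumptions).
sample = [
    {"status": 200, "release": "v41", "route": "/cart"},
    {"status": 502, "release": "v42", "route": "/cart"},
    {"status": 500, "release": "v42", "route": "/checkout"},
    {"status": 200, "release": "v41", "route": "/checkout"},
    {"status": 200, "release": "v42", "route": "/cart"},
]

rates = error_rate_by(sample, "release")
for release, rate in sorted(rates.items()):
    print(f"{release}: {rate:.0%} 5xx")
```

If the errors cluster on the newest release, that supports a rollback; if they are spread evenly across releases and hosts, a shared dependency or configuration change is the more likely suspect.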

Evaluation Rubric

Uses metrics/logs/traces to scope incident (30% weight)

Considers recent changes and flag rollbacks (25% weight)

Checks upstream/downstream dependencies (25% weight)

Applies safe mitigation and rollback (20% weight)

Hints

💡Look for config drift, expired certs, or downstream timeouts.
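
One way to check the expired-certificate hint quickly is sketched below using only the Python standard library. The helper names are made up for this example, and the host passed to `fetch_not_after` would be whichever endpoint is under suspicion.

```python
import socket
import ssl
import time


def days_until_expiry(not_after, now=None):
    """Days left before a certificate's notAfter time (negative if already expired)."""
    now = time.time() if now is None else now
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400


def fetch_not_after(host, port=443, timeout=5.0):
    """Return the notAfter string of the certificate served by host:port."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

Note that with the default verifying context, a certificate that has already expired makes the handshake raise `ssl.SSLCertVerificationError`; that failure is itself a strong signal, so the day count is mainly useful for spotting certificates that are about to expire.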

Common Pitfalls to Avoid

⚠️Jumping to conclusions without confirming the scope of the issue.
⚠️Failing to check recent changes (deploys, flags) which are often the primary culprit.
⚠️Not correlating web tier errors with the health of its upstream dependencies.
⚠️Spending too much time in one area (e.g., deep log diving) without exploring other potential causes.
⚠️Attempting a fix or hotfix without a clear rollback strategy.

Potential Follow-up Questions

❓How would you prevent regressions?
❓Which SLOs/alerts would you set?
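
For the SLO question, one common approach (an assumption here, not something this question prescribes) is to alert on error-budget burn rate rather than on raw 5xx counts. A minimal sketch, with illustrative numbers:

```python
def burn_rate(errors, total, slo_target):
    """Ratio of the observed error rate to the error rate the SLO allows.

    A value of 1.0 means the error budget is being spent exactly at the
    rate that would exhaust it at the end of the SLO window.
    """
    if total == 0:
        return 0.0
    return (errors / total) / (1.0 - slo_target)


# Illustrative numbers: 150 failed requests out of 10,000 against a 99.9% SLO.
rate = burn_rate(errors=150, total=10_000, slo_target=0.999)
print(f"burn rate: {rate:.1f}x")
```

A candidate could then propose paging when the burn rate over a short window exceeds a chosen threshold, and ticketing on a slower burn over a longer window; the thresholds themselves depend on the SLO window and the team's tolerance.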
