Interview Questions/Troubleshooting Scenarios/Sudden 5xx Spike on Web Tier
IntermediateScenario
10 min

Sudden 5xx Spike on Web Tier

HTTPWebObservabilityIncident Response
Advertisement
Interview Question

Production dashboards show a sharp increase in HTTP 5xx responses from the web tier over the last 10 minutes, but traffic volume is normal. Describe your step-by-step triage and remediation.

Key Points to Cover
  • Confirm scope with metrics (5xx rate, latency, saturation) and traces
  • Check recent deploys/feature flags and roll back if correlated
  • Inspect app/web logs and upstream dependency health
  • Validate config/env changes and TLS/cert validity
  • Apply circuit breakers/retries; widen capacity if needed
Evaluation Rubric
Uses metrics/logs/traces to scope incident30% weight
Considers recent changes and flag rollbacks25% weight
Checks upstream/downstream dependencies25% weight
Applies safe mitigation and rollback20% weight
Hints
  • 💡Look for config drift, expired certs, or downstream timeouts.
Common Pitfalls to Avoid
  • ⚠️Jumping to conclusions without confirming the scope of the issue.
  • ⚠️Failing to check recent changes (deploys, flags) which are often the primary culprit.
  • ⚠️Not correlating web tier errors with the health of its upstream dependencies.
  • ⚠️Spending too much time in one area (e.g., deep log diving) without exploring other potential causes.
  • ⚠️Lack of a clear rollback strategy before attempting a fix or hotfix.
Potential Follow-up Questions
  • How would you prevent regressions?
  • Which SLOs/alerts would you set?
Advertisement