Advanced · System-Design
45 min
Design a Monitoring System
System Design · Monitoring · Observability · Kubernetes
Interview Question
Design a monitoring and alerting system for a microservices architecture running on Kubernetes. Consider metrics, logs, traces, and alerting.
Key Points to Cover
- Three pillars: Metrics (Prometheus), Logs (ELK/Loki), Traces (Jaeger)
- Service mesh integration for automatic instrumentation
- Alerting hierarchy: severity levels and escalation
- Dashboards for different audiences (dev, ops, business)
- SLI/SLO definition and error budget tracking
- Cost considerations and data retention policies
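The SLI/SLO and alerting-hierarchy points can be made concrete with a small calculation. The following is a minimal sketch, assuming a request-based availability SLI; the class and function names are hypothetical, and the burn-rate thresholds in `severity` are illustrative values chosen for the example, not recommendations from the question.

```python
from dataclasses import dataclass


@dataclass
class SloWindow:
    """Request counts observed over one SLO window (for example, 30 days)."""
    total_requests: int
    failed_requests: int
    slo_target: float  # e.g. 0.999 means "99.9% of requests succeed"

    def sli(self) -> float:
        """Availability SLI: fraction of requests that succeeded."""
        if self.total_requests == 0:
            return 1.0
        return 1 - self.failed_requests / self.total_requests

    def error_budget(self) -> float:
        """Number of failed requests the SLO tolerates in this window."""
        return (1 - self.slo_target) * self.total_requests

    def budget_remaining(self) -> float:
        """Fraction of the error budget still unspent (negative if overspent)."""
        budget = self.error_budget()
        if budget == 0:
            return 0.0 if self.failed_requests else 1.0
        return 1 - self.failed_requests / budget


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Speed of budget consumption; 1.0 spends the budget exactly over the window."""
    return error_ratio / (1 - slo_target)


def severity(rate: float) -> str:
    """Map a burn rate to a severity tier (thresholds are illustrative assumptions)."""
    if rate >= 10:
        return "page"    # wake someone up
    if rate >= 2:
        return "ticket"  # handle during working hours
    return "none"


window = SloWindow(total_requests=1_000_000, failed_requests=250, slo_target=0.999)
print(f"SLI={window.sli():.5f} budget_left={window.budget_remaining():.0%}")
print(severity(burn_rate(error_ratio=0.02, slo_target=0.999)))
```

Tying severity to burn rate rather than to raw error counts is one way to keep alerts actionable: a page fires only when the budget is being spent fast enough to threaten the SLO.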
Evaluation Rubric
- Identifies key monitoring components (25% weight)
- Designs scalable architecture (25% weight)
- Shows K8s and microservices integration (25% weight)
- Considers operational aspects (SLOs, alerting) (25% weight)
Hints
- 💡 Think about the three pillars of observability
- 💡 Consider different user personas
Common Pitfalls to Avoid
- ⚠️ **Alert Fatigue:** Firing too many alerts, or alerts with no clear action attached, so that engineers start ignoring warnings and may miss critical incidents.
- ⚠️ **Siloed Observability Tools:** Running disconnected tools for metrics, logs, and traces with no correlation between them, which makes it slow and difficult to piece together a coherent picture during incident investigation.
- ⚠️ **Lack of Standardized Instrumentation:** Inconsistent or missing instrumentation across microservices, particularly for distributed tracing, which prevents tracking requests end-to-end.
- ⚠️ **Over-reliance on Infrastructure Metrics:** Focusing solely on resource utilization (CPU, memory, disk) and neglecting application-specific business metrics or user experience metrics (e.g., transaction success rates, customer journey latency).
- ⚠️ **Insufficient Log Context/Structure:** Emitting unstructured logs, or logs without key contextual fields (e.g., trace IDs, request IDs, user IDs), which makes efficient searching, filtering, and debugging nearly impossible.
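The log-context pitfall is easier to discuss with a concrete shape in mind. Below is a minimal sketch using only Python's standard `logging` and `json` modules; the `JsonFormatter` class, the `build_logger` helper, and the list of context fields are assumptions made for illustration, not a prescribed schema.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    # Hypothetical context fields; align these with your own conventions.
    CONTEXT_FIELDS = ("trace_id", "request_id", "user_id", "service")

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Copy context passed via `extra=` only when it is present.
        for field in self.CONTEXT_FIELDS:
            value = getattr(record, field, None)
            if value is not None:
                payload[field] = value
        return json.dumps(payload)


def build_logger(name: str) -> logging.Logger:
    """Create a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


log = build_logger("checkout")
log.info(
    "payment authorized",
    extra={"trace_id": "abc123", "request_id": "req-42", "service": "checkout"},
)
```

Because every line carries the same `trace_id` as the corresponding trace, a log backend can join logs to traces, which also addresses the siloed-tools pitfall.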
Potential Follow-up Questions
- ❓ How would you handle high-cardinality metrics?
- ❓ How would you approach cross-cluster monitoring?
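For the high-cardinality question, one common line of answer is to bound label values before they reach the metrics backend. The sketch below shows two such techniques under stated assumptions: `normalize_route` and `LabelLimiter` are hypothetical helpers written for this example, not part of any metrics library, and the patterns cover only numeric IDs and UUIDs.

```python
import re

# Path segments that would otherwise create one time series per distinct value.
_UUID = re.compile(
    r"^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$"
)
_NUMERIC = re.compile(r"^\d+$")


def normalize_route(path: str) -> str:
    """Collapse unbounded path segments into placeholders for use as a label."""
    segments = []
    # Drop the query string, then inspect each path segment.
    for segment in path.split("?")[0].split("/"):
        if _UUID.match(segment):
            segments.append(":uuid")
        elif _NUMERIC.match(segment):
            segments.append(":id")
        else:
            segments.append(segment)
    return "/".join(segments)


class LabelLimiter:
    """Cap the number of distinct values a label may take; overflow maps to 'other'."""

    def __init__(self, max_values: int):
        self.max_values = max_values
        self._seen = set()

    def clamp(self, value: str) -> str:
        if value in self._seen:
            return value
        if len(self._seen) < self.max_values:
            self._seen.add(value)
            return value
        return "other"


print(normalize_route("/users/42/orders/123e4567-e89b-12d3-a456-426614174000?page=2"))
```

Values that genuinely need per-entity detail, such as user IDs, usually fit better in logs or trace attributes than in metric labels.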