Advanced · System-Design · 45 min

Design a Monitoring System

System Design, Monitoring, Observability, Kubernetes
Interview Question

Design a monitoring and alerting system for a microservices architecture running on Kubernetes. Consider metrics, logs, traces, and alerting.

Key Points to Cover
  • Three pillars: Metrics (Prometheus), Logs (ELK/Loki), Traces (Jaeger)
  • Service mesh integration for automatic instrumentation
  • Alerting hierarchy: severity levels and escalation
  • Dashboards for different audiences (dev, ops, business)
  • SLI/SLO definition and error budget tracking
  • Cost considerations and data retention policies
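
To make the SLI/SLO and error budget point concrete, here is a minimal sketch of the arithmetic for an availability SLO. The `SloWindow` class, its field names, and the sample numbers are illustrative assumptions, not part of the question; a real system would compute these from stored metrics rather than raw counters.

```python
from dataclasses import dataclass


@dataclass
class SloWindow:
    """Request counts for one SLO window (hypothetical structure)."""
    total_requests: int
    failed_requests: int
    slo_target: float  # e.g. 0.999 for a 99.9% availability SLO

    def sli(self) -> float:
        """Availability SLI: fraction of requests that succeeded."""
        if self.total_requests == 0:
            return 1.0
        return 1.0 - self.failed_requests / self.total_requests

    def allowed_failures(self) -> float:
        """Error budget expressed as a number of failed requests."""
        return (1.0 - self.slo_target) * self.total_requests

    def budget_remaining(self) -> float:
        """Fraction of the error budget still unspent (negative if overspent)."""
        allowed = self.allowed_failures()
        if allowed == 0:
            return 0.0 if self.failed_requests else 1.0
        return 1.0 - self.failed_requests / allowed


# Assumed sample numbers: 1M requests, 400 failures, 99.9% target.
window = SloWindow(total_requests=1_000_000, failed_requests=400, slo_target=0.999)
print(f"SLI={window.sli():.4%} budget left={window.budget_remaining():.1%}")
```

In an interview, the useful part is the relationship: the budget is `(1 - target) * total`, and alerting or release policy can key off how fast it is being spent.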
Evaluation Rubric
  • Identifies key monitoring components (25% weight)
  • Designs scalable architecture (25% weight)
  • Shows K8s and microservices integration (25% weight)
  • Considers operational aspects such as SLOs and alerting (25% weight)
Hints
  • 💡 Think about the three pillars of observability
  • 💡 Consider different user personas
Common Pitfalls to Avoid
  • ⚠️**Alert Fatigue:** Over-alerting or alerts lacking clear actionability, leading engineers to ignore warnings and potentially miss critical incidents.
  • ⚠️**Siloed Observability Tools:** Having disconnected tools for metrics, logs, and traces without correlation, making it extremely difficult and time-consuming to piece together a coherent picture during incident investigation.
  • ⚠️**Lack of Standardized Instrumentation:** Inconsistent or missing instrumentation across different microservices, particularly for distributed tracing, hindering the ability to track requests end-to-end.
  • ⚠️**Over-reliance on Infrastructure Metrics:** Focusing solely on resource utilization (CPU, memory, disk) and neglecting application-specific business metrics or user experience metrics (e.g., transaction success rates, customer journey latency).
  • ⚠️**Insufficient Log Context/Structure:** Generating unstructured logs or logs without crucial contextual information (e.g., trace IDs, request IDs, user IDs), making efficient searching, filtering, and debugging nearly impossible.
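
As an illustration of the log context pitfall, the sketch below emits one JSON object per log line and attaches a trace ID, using only the Python standard library. The `current_trace_id` variable, the `JsonFormatter` class, and the field names are assumptions made for this example; in a real service the trace ID would usually be set by request middleware from the incoming trace header.

```python
import json
import logging
from contextvars import ContextVar

# Hypothetical per-request context; middleware would set this on each request.
current_trace_id: ContextVar[str] = ContextVar("current_trace_id", default="")


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # The trace ID lets logs be joined with traces during an incident.
            "trace_id": current_trace_id.get(),
        }
        return json.dumps(payload)


def build_logger(name: str) -> logging.Logger:
    """Create a logger that writes structured JSON lines to stderr."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


current_trace_id.set("abc123")  # assumed value for the example
build_logger("checkout").info("payment %s", "ok")
```

Because every line carries the same `trace_id` field, a log backend such as Loki or Elasticsearch can filter on it and link directly to the matching trace.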
Potential Follow-up Questions
  • How would you handle high-cardinality metrics?
  • What about cross-cluster monitoring?
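
For the high-cardinality follow-up, a rough way to reason about the problem is that each unique combination of label values becomes its own time series. The sketch below computes a worst-case series count per metric and flags labels above a limit; the function names, label counts, and limit are hypothetical, and the product is an upper bound that assumes every combination actually occurs.

```python
from math import prod
from typing import Dict, List


def estimate_series(label_values: Dict[str, int]) -> int:
    """Worst-case series count: product of distinct values per label."""
    return prod(label_values.values())


def labels_over_limit(label_values: Dict[str, int], limit: int) -> List[str]:
    """Names of labels whose distinct-value count exceeds the limit."""
    return sorted(name for name, count in label_values.items() if count > limit)


# Assumed label cardinalities for one HTTP request counter.
bounded = {"service": 50, "endpoint": 20, "status": 5}
with_user = {**bounded, "user_id": 100_000}

print(estimate_series(bounded))            # 5000
print(estimate_series(with_user))          # 500000000
print(labels_over_limit(with_user, 1000))  # ['user_id']
```

The usual answers follow from this: drop or aggregate unbounded labels such as user IDs at ingestion, and move per-user detail into logs or traces instead of metrics.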