Advanced · System Design · 45 min

Design an ML Model Serving Platform

ML Serving · Monitoring · Scalability
Interview Question

Design a platform that serves ML models at scale, with model versioning, monitoring, A/B testing, and efficient GPU utilization.

Key Points to Cover
  • Model registry/versioning and deployment API
  • Serving infra: CPU vs GPU autoscaling, batching
  • Monitoring: latency, drift, canary/A-B testing
  • Rollbacks, rollouts, and shadow deployments
  • Cost optimization: multi-tenancy, spot GPUs
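The registry/versioning and rollback points above can be sketched as a minimal in-memory service. This is an illustrative design under assumed names (`ModelRegistry`, `promote`, `rollback`), not the API of any real platform: versions are immutable records, and "live" is just a mutable pointer, so rollback is a metadata change rather than a redeploy.

```python
import time
from dataclasses import dataclass, field

@dataclass(frozen=True)
class ModelVersion:
    """Immutable record of one registered model artifact."""
    name: str
    version: int
    artifact_uri: str  # e.g. "s3://models/fraud/2/model.pt" (hypothetical path)
    created_at: float = field(default_factory=time.time)

class ModelRegistry:
    """Versions are append-only; 'live' is a pointer, so rollback is cheap."""

    def __init__(self):
        self._versions = {}  # name -> {version_number: ModelVersion}
        self._live = {}      # name -> currently serving version number

    def register(self, name, artifact_uri):
        versions = self._versions.setdefault(name, {})
        v = max(versions, default=0) + 1  # monotonically increasing versions
        versions[v] = ModelVersion(name, v, artifact_uri)
        return versions[v]

    def promote(self, name, version):
        if version not in self._versions.get(name, {}):
            raise KeyError(f"{name} v{version} is not registered")
        self._live[name] = version

    def rollback(self, name):
        # Repoint 'live' at the most recent earlier version.
        current = self._live[name]
        older = [v for v in self._versions[name] if v < current]
        if not older:
            raise RuntimeError("no earlier version to roll back to")
        self._live[name] = max(older)

    def live(self, name):
        return self._versions[name][self._live[name]]
```

In an interview, the key point is the separation: artifacts are immutable and content-addressed in blob storage, while routing state (which version is live, which gets canary traffic) lives in a small strongly-consistent store.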
Evaluation Rubric
  • Clear registry & versioning strategy (25% weight)
  • Efficient GPU/CPU serving design (25% weight)
  • Strong monitoring and rollback plan (25% weight)
  • Cost optimization and tenancy handling (25% weight)
Hints
  • 💡 Batching inference requests improves GPU throughput at the cost of a small per-request latency penalty; bound that penalty with a max-wait deadline.
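The batching hint can be made concrete with a micro-batcher sketch. The two knobs (`max_batch_size`, `max_wait_s`) are the standard throughput/latency trade-off; the class name and defaults here are illustrative assumptions:

```python
import queue
import threading
import time

class MicroBatcher:
    """Collects individual requests and flushes a batch when either
    max_batch_size is reached or max_wait_s elapses (illustrative defaults)."""

    def __init__(self, predict_fn, max_batch_size=8, max_wait_s=0.01):
        self.predict_fn = predict_fn  # takes a list of inputs, returns a list
        self.max_batch_size = max_batch_size
        self.max_wait_s = max_wait_s
        self.q = queue.Queue()
        threading.Thread(target=self._loop, daemon=True).start()

    def submit(self, x):
        # Callers block until their item's batch has been scored.
        done, out = threading.Event(), {}
        self.q.put((x, done, out))
        done.wait()
        return out["y"]

    def _loop(self):
        while True:
            batch = [self.q.get()]  # block for the first request
            deadline = time.monotonic() + self.max_wait_s
            while len(batch) < self.max_batch_size:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    break
                try:
                    batch.append(self.q.get(timeout=remaining))
                except queue.Empty:
                    break
            # One model call amortizes per-invocation overhead across the batch.
            ys = self.predict_fn([x for x, _, _ in batch])
            for (_, done, out), y in zip(batch, ys):
                out["y"] = y
                done.set()
```

Production servers (e.g. Triton, TorchServe) implement the same idea natively as "dynamic batching"; in a design discussion, the deadline is what keeps p99 latency bounded under light load.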
Common Pitfalls to Avoid
  • ⚠️Underestimating the complexity of GPU management and autoscaling in a heterogeneous environment.
  • ⚠️Lack of robust versioning strategies leading to deployment conflicts and rollback difficulties.
  • ⚠️Insufficient monitoring of model drift, resulting in silently degrading prediction quality.
  • ⚠️Over-provisioning resources instead of tuning autoscaling parameters, leading to unnecessary cost.
  • ⚠️Poor integration between the model training pipeline and the serving infrastructure, causing deployment bottlenecks.
Potential Follow-up Questions
  • How do you support online vs offline inference?
  • How do you ensure fairness in A/B testing?
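For the A/B fairness follow-up, one standard answer is deterministic hash bucketing: assignment is sticky per user, and salting by experiment name keeps arms uncorrelated across concurrent tests. A minimal sketch (function name and 50/50 split are illustrative):

```python
import hashlib

def assign_variant(user_id: str, experiment: str, split_a: int = 50) -> str:
    """Sticky A/B assignment: hash (experiment, user) into a 0-99 bucket.
    The same user always lands in the same arm for a given experiment,
    and the experiment-name salt decorrelates arms across experiments."""
    h = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(h, 16) % 100
    return "A" if bucket < split_a else "B"
```

Stickiness matters for model A/B tests in particular: if a user bounces between model versions mid-session, metrics attribute one user's behavior to both arms and the comparison is biased.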