Design Netflix
Netflix serves 250M+ subscribers across 190+ countries with 15,000+ titles. This design covers microservice architecture, content delivery with Open Connect CDN, and ML-powered personalization.
- Scale — 250M+ subscribers, 1B+ hours watched/month
- CDN — Open Connect appliances in 6000+ ISPs
- Personalization — Saves $1B/year in reduced churn
Netflix is a masterclass in building a resilient microservice ecosystem with content delivery at planetary scale.
Requirements Clarification
Functional Requirements
- Browse and search content catalog
- Stream videos with adaptive quality
- Personalized recommendations
- Multiple profiles per account
- Download for offline viewing
- Parental controls
- Multiple device support
Non-Functional Requirements
- Availability: 99.99% uptime
- Latency: Video starts in < 2 seconds
- Throughput: Sustain traffic on the order of 15% of global downstream internet bandwidth
- Consistency: Eventual for recommendations
- Scale: 250M subscribers, 1B hours/month
Netflix's architecture is a microservice ecosystem with 1000+ services. The key insight: separate content delivery, metadata, recommendations, and billing into independent services.
Back-of-the-Envelope Estimation
Bandwidth Estimation
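Average throughput is the average number of concurrent streams (monthly viewing hours divided by the 30 x 24 = 720 hours in a month) multiplied by the average bitrate:
$$T = \frac{H}{30 \times 24} \times B$$
Here,
- $H$ = Hours watched per month
- $B$ = Average bitrate
- $T$ = Average throughput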
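As a rough numeric check, the estimate can be worked through in a few lines of Python. This is a minimal sketch: the 1B hours/month figure comes from the scale stated above, while the 5 Mbps average bitrate and the function name are assumptions chosen only for illustration.

```python
HOURS_PER_MONTH = 30 * 24  # 720 hours in a 30-day month


def average_throughput_bps(hours_watched_per_month: float, avg_bitrate_bps: float) -> float:
    """Average aggregate throughput in bits per second."""
    # Average number of streams playing at any given instant.
    concurrent_streams = hours_watched_per_month / HOURS_PER_MONTH
    return concurrent_streams * avg_bitrate_bps


hours_watched = 1e9      # stated scale: 1B hours watched per month
assumed_bitrate = 5e6    # ASSUMPTION: 5 Mbps average bitrate, not from the source

throughput = average_throughput_bps(hours_watched, assumed_bitrate)
print(f"{throughput / 1e12:.2f} Tbps average")  # prints "6.94 Tbps average"
```

This is an average over the whole month. Evening peaks would sit well above it, so capacity planning would normally apply a peak-to-average multiplier on top.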
Storage Estimation
Content library:
- 15,000 titles x 50 versions (encodes at different resolutions, bitrates, and codecs) = 750,000 files
- Average file size: 5 GB
- Total: 750,000 x 5 GB = 3,750,000 GB = 3.75 PB
- With 3x replication: 3.75 PB x 3 = 11.25 PB
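The same storage arithmetic can be kept as a small script so the inputs are easy to vary. This is a sketch using the figures above with decimal units (1 PB = 1,000,000 GB); the function and constant names are illustrative.

```python
GB_PER_PB = 1_000_000  # decimal units: 1 PB = 1,000,000 GB


def storage_estimate(titles: int, versions_per_title: int,
                     avg_file_gb: float, replication: int):
    """Return (file_count, raw_pb, replicated_pb) for a content library."""
    files = titles * versions_per_title
    raw_pb = files * avg_file_gb / GB_PER_PB
    return files, raw_pb, raw_pb * replication


files, raw_pb, replicated_pb = storage_estimate(
    titles=15_000, versions_per_title=50, avg_file_gb=5, replication=3
)
print(files, raw_pb, replicated_pb)  # prints "750000 3.75 11.25"
```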
High-Level Architecture
Open Connect CDN
Definition: Open Connect Architecture
Netflix deploys custom appliances (OCAs) inside 6000+ ISP networks globally. Each OCA stores popular content and serves it directly to end users. The top 1000 titles account for ~90% of traffic.
Netflix pre-positions content on OCAs based on popularity predictions. OCAs are refreshed during off-peak hours. This reduces internet backbone traffic and improves streaming quality.
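One simple way to picture popularity-based pre-positioning is a greedy fill: rank titles by predicted views per gigabyte and load them until the appliance is full. The sketch below is only an illustration of that idea, not Netflix's actual fill algorithm. The `Title` fields, `plan_oca_fill`, and the sample numbers are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Title:
    title_id: str
    size_gb: float
    predicted_views: float  # hypothetical regional popularity forecast


def plan_oca_fill(catalog: List[Title], capacity_gb: float) -> List[Title]:
    """Greedy fill: best predicted views per GB first, skipping titles that do not fit."""
    ranked = sorted(catalog, key=lambda t: t.predicted_views / t.size_gb, reverse=True)
    chosen, used_gb = [], 0.0
    for title in ranked:
        if used_gb + title.size_gb <= capacity_gb:
            chosen.append(title)
            used_gb += title.size_gb
    return chosen


catalog = [
    Title("A", size_gb=40, predicted_views=9000),  # 225 views/GB
    Title("B", size_gb=60, predicted_views=6000),  # 100 views/GB
    Title("C", size_gb=30, predicted_views=1500),  # 50 views/GB
    Title("D", size_gb=50, predicted_views=500),   # 10 views/GB
]
plan = plan_oca_fill(catalog, capacity_gb=130)
print([t.title_id for t in plan])  # prints "['A', 'B', 'C']"
```

A real planner would also account for per-region demand, file-level granularity (individual encodes rather than whole titles), and the off-peak fill window.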
Recommendation System
Definition: Two-Tier Recommendation
Netflix uses a two-tier approach: (1) Row Generation determines which rows to show ("Trending Now", "Because you watched X"), (2) Ranking ranks items within each row by predicted engagement.
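The two tiers can be sketched as two small functions: one that picks which rows to show and one that orders the items inside a row. This is a simplified illustration under stated assumptions: a row's value is taken to be the mean score of the items it would display, and `score` stands in for whatever engagement model is used. All names and sample scores are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Score = Callable[[str], float]


def rank_row(items: List[str], score: Score, row_size: int) -> List[str]:
    """Tier 2 (ranking): order items in a row by predicted engagement."""
    return sorted(items, key=score, reverse=True)[:row_size]


def select_rows(candidates: Dict[str, List[str]], score: Score,
                max_rows: int, row_size: int) -> List[str]:
    """Tier 1 (row generation): keep the rows whose visible items score best on average."""
    def row_value(name: str) -> float:
        visible = rank_row(candidates[name], score, row_size)
        return sum(score(i) for i in visible) / len(visible)

    non_empty = [name for name, items in candidates.items() if items]
    return sorted(non_empty, key=row_value, reverse=True)[:max_rows]


def build_homepage(candidates: Dict[str, List[str]], score: Score,
                   max_rows: int = 2, row_size: int = 3) -> List[Tuple[str, List[str]]]:
    rows = select_rows(candidates, score, max_rows, row_size)
    return [(name, rank_row(candidates[name], score, row_size)) for name in rows]


scores = {"t1": 0.9, "t2": 0.4, "t3": 0.7, "t4": 0.2, "t5": 0.6, "t6": 0.1}
candidates = {
    "Trending Now": ["t2", "t1", "t3"],
    "Because you watched X": ["t4", "t5"],
    "New Releases": ["t6", "t4"],
}
homepage = build_homepage(candidates, score=lambda t: scores[t])
for name, items in homepage:
    print(name, items)
```

Separating the tiers lets each be tuned on its own: row selection can weigh diversity or freshness, while in-row ranking focuses purely on predicted engagement.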
Engagement Prediction
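$$P(\text{watch} \mid u, i) = \sigma\left(f(u)^\top f(i)\right)$$
Here,
- $P(\text{watch} \mid u, i)$ = Probability of user $u$ watching title $i$
- $\sigma$ = Sigmoid activation
- $f$ = Feature embedding function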
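One common form of such a model, assumed here for illustration, scores a user-title pair by passing the dot product of their embeddings through a sigmoid. The function names and the example vectors are hypothetical. In practice the embeddings would be learned from viewing history.

```python
import math
from typing import Sequence


def sigmoid(x: float) -> float:
    """Squash a real-valued score into the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_watch_probability(user_embedding: Sequence[float],
                              item_embedding: Sequence[float]) -> float:
    """Sigmoid of the dot product between user and item embeddings."""
    if len(user_embedding) != len(item_embedding):
        raise ValueError("embeddings must have the same dimension")
    logit = sum(u * i for u, i in zip(user_embedding, item_embedding))
    return sigmoid(logit)


user_vec = [0.5, -1.0, 2.0]   # hypothetical learned user embedding
item_vec = [1.0, 0.5, 0.25]   # hypothetical learned title embedding

# Dot product: 0.5 - 0.5 + 0.5 = 0.5, and sigmoid(0.5) is about 0.622.
probability = predict_watch_probability(user_vec, item_vec)
print(f"{probability:.3f}")  # prints "0.622"
```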
Chaos Engineering
Definition: Chaos Monkey
Netflix pioneered chaos engineering with Chaos Monkey, which randomly terminates production instances. The Simian Army includes Latency Monkey, Conformity Monkey, and Security Monkey for comprehensive resilience testing.
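Netflix also rehearses region failure on a regular basis: Chaos Kong exercises simulate the loss of an entire AWS region and shift traffic to the remaining regions. This verifies that the multi-region architecture works under real failure conditions.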
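The random-termination idea behind Chaos Monkey can be sketched against an in-memory fleet. This is a toy illustration, not the real tool's API: `pick_victim`, `run_chaos_round`, the fleet layout, and the rule that groups with a single instance are skipped are all assumptions made for the example.

```python
import random
from typing import Callable, Dict, List, Optional, Tuple

Fleet = Dict[str, List[str]]  # service group name -> running instance ids


def pick_victim(fleet: Fleet, rng: random.Random,
                min_group_size: int = 2) -> Optional[Tuple[str, str]]:
    """Pick a random instance from a random group that has redundancy."""
    # ASSUMPTION: never terminate the last instance of a group.
    eligible = sorted(g for g, instances in fleet.items()
                      if len(instances) >= min_group_size)
    if not eligible:
        return None
    group = rng.choice(eligible)
    return group, rng.choice(fleet[group])


def run_chaos_round(fleet: Fleet, terminate: Callable[[str, str], None],
                    rng: random.Random) -> Optional[Tuple[str, str]]:
    """Terminate at most one instance and report which one was chosen."""
    victim = pick_victim(fleet, rng)
    if victim is not None:
        terminate(*victim)
    return victim


fleet = {
    "playback": ["i-1", "i-2", "i-3"],
    "search": ["i-4", "i-5"],
    "billing": ["i-6"],  # single instance, so it is never selected
}


def terminate_in_memory(group: str, instance: str) -> None:
    fleet[group].remove(instance)


victim = run_chaos_round(fleet, terminate_in_memory, random.Random(42))
print("terminated:", victim)
```

In a real deployment `terminate` would call the cloud provider's API, and the value of the exercise comes from checking that user-facing metrics stay healthy after each termination.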
Data Model
Content Schema
Each title record includes:
- A unique title identifier
- A content type: movie, series, or documentary
- An array of genre tags
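A title record with these three attributes could be modeled as follows. This is a minimal sketch: the field names (`title_id`, `content_type`, `genres`) and the `to_record` helper are hypothetical, since the original schema listing is not shown here.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict, List


class ContentType(Enum):
    MOVIE = "movie"
    SERIES = "series"
    DOCUMENTARY = "documentary"


@dataclass
class Title:
    title_id: str                                    # unique title identifier
    content_type: ContentType                        # movie / series / documentary
    genres: List[str] = field(default_factory=list)  # array of genre tags

    def to_record(self) -> Dict[str, Any]:
        """Flatten to a plain dict, e.g. for storage in a document database."""
        return {
            "title_id": self.title_id,
            "type": self.content_type.value,
            "genres": list(self.genres),
        }


example = Title("tt-001", ContentType.SERIES, ["drama", "sci-fi"])
print(example.to_record())
```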
Practice Exercises
- CDN Design: How does Netflix decide which content to pre-position on each OCA?
- Recommendations: How would you handle the cold-start problem for new users?
- Resilience: Design a graceful degradation strategy when the recommendation service is down.
- Streaming: Compare Netflix's Open Connect with YouTube's multi-CDN approach.
Key Takeaways:
- Netflix runs 1000+ microservices behind the Zuul API gateway
- Open Connect CDN deploys appliances inside ISPs for low-latency delivery
- Recommendation system uses two-tier row generation + ranking
- Chaos engineering ensures resilience through deliberate failure injection
- Multi-region active-active deployment with automatic failover
What to Learn Next
-> Design YouTube: Video streaming and transcoding pipelines.
-> Design Uber: Real-time location and dispatch systems.
-> Circuit Breaker: Preventing cascade failures.
-> Sidecar Pattern: Service mesh and sidecar proxies.
-> Saga Pattern: Distributed transactions.
-> Back Pressure: Load management in streaming systems.