Design WhatsApp
WhatsApp serves 2B+ users with 100B+ messages daily. This design explores building a globally distributed messaging platform with end-to-end encryption, media delivery, and real-time presence.
- Scale — 2B users, 100B messages/day, 50M messages/second peak
- Latency — Message delivery under 100ms for 95% of messages
- Encryption — End-to-end encryption using Signal Protocol
Designing for WhatsApp means solving the hardest problems in distributed messaging at planetary scale.
Requirements Clarification
Functional Requirements
- One-to-one text messaging with delivery/read receipts
- Group messaging (up to 1024 members)
- Media sharing (images, videos, documents up to 2GB)
- Online/offline presence indicators
- Message history and synchronization across devices
- End-to-end encryption (E2EE)
Non-Functional Requirements
- Availability: 99.99% uptime
- Latency: < 100ms for message delivery (P95)
- Durability: Messages must not be lost once sent
- Consistency: Causal ordering within conversations
- Scale: 2B registered users, 500M daily active users
The critical insight: WhatsApp is a store-and-forward system, not a direct connection system. Messages are queued on servers and delivered when recipients come online.
Back-of-the-Envelope Estimation
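Message Throughput
$$\text{QPS}_{avg} = \frac{M_{day}}{S_{day}} = \frac{100 \times 10^{9}}{86{,}400} \approx 1.16 \times 10^{6}\ \text{messages/s}$$
Here,
- $M_{day}$ = Messages per day
- $S_{day}$ = Seconds in a day
- $\text{QPS}_{avg}$ = Average QPS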
Storage Estimation
Average message size: 100 bytes (text), 100KB (media average)
Text storage per day: 100B × 100 bytes = 10 TB/day
Media storage per day (assuming 5% messages have media): 5B × 100KB = 500 TB/day
Total storage per year: (10 TB + 500 TB) × 365 ≈ 186 PB/year
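The estimates above can be reproduced with a few lines of arithmetic. This is a minimal sketch that uses only the figures stated in this section; the constant names are illustrative, and decimal units (1 TB = 10^12 bytes) are assumed.

```python
# Back-of-the-envelope numbers taken from the estimates above.
MESSAGES_PER_DAY = 100_000_000_000   # 100B messages/day
SECONDS_PER_DAY = 86_400
TEXT_BYTES = 100                     # average text message size
MEDIA_BYTES = 100_000                # average media size (100 KB)
MEDIA_PERCENT = 5                    # share of messages that carry media

TB = 10**12  # decimal terabyte (assumption)
PB = 10**15  # decimal petabyte (assumption)

avg_qps = MESSAGES_PER_DAY / SECONDS_PER_DAY
text_per_day = MESSAGES_PER_DAY * TEXT_BYTES
media_per_day = MESSAGES_PER_DAY * MEDIA_PERCENT // 100 * MEDIA_BYTES
storage_per_year = (text_per_day + media_per_day) * 365

print(f"average QPS:    {avg_qps:,.0f} messages/s")
print(f"text per day:   {text_per_day / TB:.0f} TB")
print(f"media per day:  {media_per_day / TB:.0f} TB")
print(f"total per year: {storage_per_year / PB:.0f} PB")
```

Media dominates: at these assumptions it accounts for roughly 98% of the yearly total, which is why media is usually stored and served separately from message metadata.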
Bandwidth Estimation
$$\text{Bandwidth} = \text{QPS} \times S_{msg}$$
Here,
- $\text{QPS}$ = Queries per second
- $S_{msg}$ = Average message size in bytes
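As a worked example of the bandwidth formula, the sketch below plugs in the figures from the storage estimate. It assumes every message carries about 100 bytes of text and 5% additionally carry a 100 KB attachment; the function name and the blended-size calculation are illustrative, not part of the original estimate.

```python
def bandwidth_bytes_per_sec(qps: float, avg_message_bytes: float) -> float:
    """Bandwidth = queries per second x average message size."""
    return qps * avg_message_bytes

# Inputs reused from the storage estimate (assumptions of this sketch).
messages_per_day = 100_000_000_000
seconds_per_day = 86_400
avg_message_bytes = 100 + 100_000 * 5 // 100   # text + 5% chance of 100 KB media

ingress = bandwidth_bytes_per_sec(messages_per_day / seconds_per_day,
                                  avg_message_bytes)
ingress_gbps = ingress * 8 / 1e9

print(f"blended message size: {avg_message_bytes} bytes")
print(f"average ingress: {ingress / 1e9:.2f} GB/s ({ingress_gbps:.1f} Gbit/s)")
```

This covers inbound traffic at average load only. Outbound traffic is at least as large for 1:1 chats and larger for groups, since each group message is delivered to many recipients.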
High-Level Architecture
Core Components Deep Dive
1. Connection Manager
Maintains persistent WebSocket connections with clients:
Definition: Connection Mapping
Each user maintains exactly one active connection per device. The connection manager keeps a mapping (user_id, device_id) → server_id. When a message arrives, the router looks up this mapping to find which server holds the recipient's connection.
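from collections import defaultdict

class ConnectionManager:
    """Tracks which server holds each (user, device) connection."""

    def __init__(self):
        self.user_connections = defaultdict(dict)    # user_id -> {device_id: server_id}
        self.server_connections = defaultdict(dict)  # server_id -> {connection_id: user_id}

    def register(self, user_id, device_id, server_id, connection_id):
        # connection_id identifies the socket on the server that accepted it.
        self.user_connections[user_id][device_id] = server_id
        self.server_connections[server_id][connection_id] = user_id

    def get_servers(self, user_id):
        # Servers holding at least one live connection for this user.
        return set(self.user_connections.get(user_id, {}).values())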
2. Message Router
Routes messages between senders and receivers:
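Message Routing Complexity
$$T_{direct} = O(1), \qquad T_{group} = O(N)$$
Here,
- $T_{direct}$ = Direct message lookup time (a single lookup in the connection mapping)
- $T_{group}$ = Group fan-out time (one lookup per member of a group of size $N$)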
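The difference between a direct lookup and a group fan-out can be sketched as follows. This is an illustrative in-memory router, not WhatsApp's implementation: the class name `MessageRouter`, its methods, and the offline queue are assumptions made for the example.

```python
from collections import defaultdict

class MessageRouter:
    """Toy router: one mapping lookup per recipient, fan-out for groups."""

    def __init__(self):
        self.connections = defaultdict(dict)    # user_id -> {device_id: server_id}
        self.groups = {}                        # group_id -> set of user_ids
        self.offline_queue = defaultdict(list)  # user_id -> undelivered messages

    def connect(self, user_id, device_id, server_id):
        self.connections[user_id][device_id] = server_id

    def route_direct(self, recipient_id, message):
        """Return {server_id: [device_id, ...]}; queue the message if offline."""
        devices = self.connections.get(recipient_id)
        if not devices:
            self.offline_queue[recipient_id].append(message)
            return {}
        targets = defaultdict(list)
        for device_id, server_id in devices.items():
            targets[server_id].append(device_id)
        return dict(targets)

    def route_group(self, group_id, sender_id, message):
        """Fan out to every member except the sender: one lookup per member."""
        return {
            member: self.route_direct(member, message)
            for member in self.groups[group_id]
            if member != sender_id
        }

router = MessageRouter()
router.connect("alice", "phone", "server-1")
router.groups["family"] = {"alice", "bob", "carol"}
plan = router.route_group("family", "carol", "hi")
print(plan["alice"], plan["bob"], router.offline_queue["bob"])
```

The direct path does constant work per recipient, while the group path repeats that work for every member, so its cost grows linearly with group size.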
3. Message Flow (1:1 Chat)
4. End-to-End Encryption
Definition: Signal Protocol
WhatsApp uses the Signal Protocol for E2EE. Each message is encrypted with a unique key derived by the Double Ratchet algorithm. Keys are exchanged using the X3DH (Extended Triple Diffie-Hellman) key agreement protocol.
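Key Derivation
$$(RK', CK_0) = \mathrm{KDF}_{RK}(RK, \mathrm{DH}_{out}), \qquad (CK_{n+1}, MK_n) = \mathrm{KDF}_{CK}(CK_n)$$
Here,
- $MK_n$ = Message key for message n
- $RK$ = Root ratchet key
- $CK_n$ = Symmetric ratchet chain key
- $\mathrm{DH}_{out}$ = Diffie-Hellman output of the current ratchet step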
E2EE means the server never sees plaintext messages. This limits server-side features like message search and spam detection. WhatsApp uses sender keys for group encryption to reduce overhead.
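The per-message key derivation can be illustrated with the symmetric-key half of the Double Ratchet. The sketch below follows the construction recommended in the Double Ratchet specification (HMAC-SHA256 keyed by the chain key, with the constants 0x01 and 0x02 as inputs). It deliberately omits the Diffie-Hellman ratchet, X3DH, and the actual message encryption, and the function names are chosen for this example.

```python
import hashlib
import hmac

def kdf_chain(chain_key: bytes) -> tuple[bytes, bytes]:
    """One symmetric ratchet step: returns (next_chain_key, message_key)."""
    message_key = hmac.new(chain_key, b"\x01", hashlib.sha256).digest()
    next_chain_key = hmac.new(chain_key, b"\x02", hashlib.sha256).digest()
    return next_chain_key, message_key

def derive_message_keys(chain_key: bytes, count: int) -> list[bytes]:
    """Advance the chain `count` times, yielding one fresh key per message."""
    keys = []
    for _ in range(count):
        chain_key, message_key = kdf_chain(chain_key)
        keys.append(message_key)
    return keys

# Placeholder chain key; in the real protocol it comes from the root ratchet.
initial_chain_key = bytes(32)
message_keys = derive_message_keys(initial_chain_key, 3)
for n, key in enumerate(message_keys):
    print(n, key.hex()[:16])
```

Because each step is a one-way function of the previous chain key, compromising the current chain key does not reveal earlier message keys.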
5. Group Messaging
Definition: Sender Key Distribution
For groups, each sender generates a unique sender key and distributes it to all other members over the existing pairwise encrypted sessions. The sender then encrypts each message once with this key, and every member decrypts it with their copy of the sender key. When a member joins or leaves, a new sender key is distributed.
Group Encryption Cost
$$C_{dist} = O(N), \qquad C_{msg} = O(1)$$
Here,
- $N$ = Group size
- $C_{dist}$ = Key distribution cost
- $C_{msg}$ = Per-message encryption cost
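To make the saving concrete, the sketch below counts encryption operations for a sender under two schemes: encrypting every message separately for each member, versus distributing a sender key once and then encrypting each message once. It is a simplified cost model that assumes no membership changes; the function names are illustrative.

```python
def pairwise_encryptions(group_size: int, messages: int) -> int:
    """Each message is encrypted separately for every other member."""
    return (group_size - 1) * messages

def sender_key_encryptions(group_size: int, messages: int) -> int:
    """One key distribution to every other member, then one encryption per message."""
    return (group_size - 1) + messages

group_size, messages = 1024, 100
pairwise = pairwise_encryptions(group_size, messages)
with_sender_key = sender_key_encryptions(group_size, messages)
print(f"pairwise:   {pairwise}")
print(f"sender key: {with_sender_key}")
```

The one-time distribution cost is paid again whenever the sender key is rotated, so groups with frequent membership changes see less benefit than this static model suggests.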
6. Presence System
Definition: Presence Protocol
Presence is maintained with a heartbeat mechanism. Clients send a presence update every 30 seconds, and the presence service stores each one in Redis with a TTL of 60 seconds. If no heartbeat arrives before the TTL expires, the entry expires and the user is marked offline. The 60-second TTL tolerates one missed heartbeat.
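The heartbeat-plus-TTL behaviour can be sketched without a Redis dependency. The class below is an in-memory stand-in that mimics a key with an expiry time; `PresenceStore` and its injectable `clock` parameter are assumptions of this example, and a production service would rely on Redis key expiry instead.

```python
import time

class PresenceStore:
    """In-memory stand-in for a key-value store with per-key TTL."""

    def __init__(self, ttl_seconds=60, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock          # injectable so tests can control time
        self._expires_at = {}       # user_id -> expiry timestamp

    def heartbeat(self, user_id):
        # Each heartbeat refreshes the expiry, like re-setting a key with a TTL.
        self._expires_at[user_id] = self.clock() + self.ttl

    def is_online(self, user_id):
        expires = self._expires_at.get(user_id)
        if expires is None:
            return False
        if self.clock() >= expires:
            del self._expires_at[user_id]   # lazily drop the expired entry
            return False
        return True

# Simulated clock: heartbeat at t=0 and t=30, then the client goes silent.
now = [0]
store = PresenceStore(ttl_seconds=60, clock=lambda: now[0])
store.heartbeat("alice")
now[0] = 30
store.heartbeat("alice")
now[0] = 89
print(store.is_online("alice"))
now[0] = 90
print(store.is_online("alice"))
```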
Data Model
Message Table Schema
- message_id = Unique message ID (ULID)
- chat_id = Conversation identifier
- sender_id = Sender user ID
- timestamp = Unix timestamp (ms)
- status = sent/delivered/read
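One way to express this record in code is a small dataclass with a forward-only status. The field names below (`message_id`, `chat_id`, `sender_id`, `timestamp`, `status`) are assumptions chosen to match the descriptions in the schema, the example ID is a placeholder string rather than a generated ULID, and the encrypted payload column is left out.

```python
from dataclasses import dataclass
from enum import Enum

class Status(Enum):
    SENT = "sent"
    DELIVERED = "delivered"
    READ = "read"

# Receipts may arrive out of order, so status only ever moves forward.
_STATUS_ORDER = [Status.SENT, Status.DELIVERED, Status.READ]

@dataclass
class Message:
    message_id: str   # time-sortable unique ID (ULID in the schema)
    chat_id: str      # conversation identifier, also the partition key
    sender_id: str
    timestamp: int    # Unix time in milliseconds
    status: Status = Status.SENT

    def advance(self, new_status: Status) -> bool:
        """Apply a receipt; ignore it if it would move the status backwards."""
        if _STATUS_ORDER.index(new_status) > _STATUS_ORDER.index(self.status):
            self.status = new_status
            return True
        return False

msg = Message("example-message-id", "chat-1", "alice", 1_700_000_000_000)
print(msg.advance(Status.READ), msg.advance(Status.DELIVERED), msg.status.value)
```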
Scaling Strategies
Message Storage Partitioning
Messages are partitioned by chat_id using consistent hashing:
Partition Assignment
$$\text{partition} = \text{hash}(\text{chat\_id}) \bmod N$$
Here,
- $\text{chat\_id}$ = Conversation identifier
- $N$ = Total number of partitions
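The sketch below contrasts plain modulo assignment with a consistent-hash ring for the same chat_id key. It is a minimal illustration: the `HashRing` class, the use of SHA-256, and the count of 100 virtual nodes per server are assumptions, not details from the original design.

```python
import bisect
import hashlib

def stable_hash(key: str) -> int:
    """Deterministic 64-bit hash (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

def partition_for(chat_id: str, num_partitions: int) -> int:
    """Modulo assignment: simple, but most keys move when num_partitions changes."""
    return stable_hash(chat_id) % num_partitions

class HashRing:
    """Consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=100):
        self._ring = sorted(
            (stable_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, chat_id: str) -> str:
        # First ring point clockwise from the key's hash, wrapping at the end.
        idx = bisect.bisect_right(self._points, stable_hash(chat_id)) % len(self._ring)
        return self._ring[idx][1]

chats = [f"chat-{i}" for i in range(1000)]
before = HashRing(["db-a", "db-b", "db-c"])
after = HashRing(["db-a", "db-b", "db-c", "db-d"])
moved = [c for c in chats if before.node_for(c) != after.node_for(c)]
print(f"{len(moved)} of {len(chats)} chats moved after adding db-d")
```

With the ring, every chat that moves after a node is added lands on the new node, so existing nodes never exchange data with each other during a resize.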
Push vs Pull for Delivery
WhatsApp uses a hybrid approach: long polling for clients with unstable connections, and push notifications for offline clients. The server maintains a delivery queue per user, so undelivered messages wait there until the client reconnects.
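A per-user delivery queue can be sketched as below. The design choice illustrated here, keeping a message queued until the client acknowledges it, is an assumption of this example (it gives at-least-once delivery); the class and method names are hypothetical.

```python
from collections import defaultdict, deque

class DeliveryQueue:
    """Per-user queue: messages stay queued until the client acknowledges them."""

    def __init__(self):
        self._pending = defaultdict(deque)   # user_id -> deque of (message_id, payload)

    def enqueue(self, user_id, message_id, payload):
        self._pending[user_id].append((message_id, payload))

    def pending(self, user_id):
        """Messages to (re)send on reconnect, oldest first. Nothing is removed here."""
        return list(self._pending[user_id])

    def ack(self, user_id, message_id):
        """Drop a message only after the client confirms it was received."""
        queue = self._pending[user_id]
        self._pending[user_id] = deque(
            item for item in queue if item[0] != message_id
        )

queue = DeliveryQueue()
queue.enqueue("bob", "m1", b"ciphertext-1")
queue.enqueue("bob", "m2", b"ciphertext-2")
first_attempt = queue.pending("bob")      # connection drops before any ack
queue.ack("bob", "m1")                    # after reconnect, m1 is confirmed
print([mid for mid, _ in queue.pending("bob")])
```

Because a message can be sent again after a dropped connection, clients need to deduplicate by message ID.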
Practice Exercises
- Design: How would you implement message synchronization across multiple devices for the same user? Consider ordering and conflict resolution.
- Scale: WhatsApp has 50M concurrent users. Estimate the number of WebSocket connections needed and the memory overhead per connection.
- Reliability: Design a mechanism to ensure messages are never lost, even if the sender's device crashes after sending but before receiving acknowledgment.
- Optimization: How would you reduce bandwidth usage for users on slow networks? Propose a compression strategy for text and media.
Key Takeaways:
- WhatsApp uses a store-and-forward model with persistent connections
- End-to-end encryption via Signal Protocol limits server-side processing
- Group messaging uses sender keys to minimize encryption overhead
- Partition by chat_id for message storage scalability
- Hybrid push/pull for delivery optimization
What to Learn Next
- Design Instagram: Photo sharing, feeds, and media delivery at scale.
- Design Twitter: Real-time feeds, fan-out, and timeline generation.
- Design YouTube: Video streaming, transcoding, and CDN delivery.
- Design Netflix: Content delivery, recommendation, and adaptive streaming.
- Circuit Breaker Pattern: Preventing cascade failures in distributed systems.
- Back Pressure: Managing load in message-driven architectures.