System Design Problems
Design a Chat System
A chat system enables real-time messaging between users. Services like WhatsApp, Slack, and Discord handle billions of messages daily with sub-second delivery and presence tracking; some, such as WhatsApp, also encrypt messages end to end.
- Real-time Delivery: Messages delivered in < 100 ms between online users
- Persistence: Message history stored durably and searchable
- Group Support: 1-on-1 and group conversations with up to 1,000 members
The core challenge is maintaining persistent WebSocket connections for millions of concurrent users while ensuring message ordering and delivery guarantees.
Requirements
Functional Requirements
- 1-on-1 messaging between users
- Group messaging (up to 1000 members)
- Online/offline presence indicators
- Read receipts and typing indicators
- Media sharing (images, files, voice messages)
- Message history and search
- Push notifications for offline users
Non-Functional Requirements
- Latency: Message delivery < 100ms for online users
- Ordering: Messages within a conversation are strictly ordered
- Durability: No messages lost; persisted to database
- Consistency: Eventual consistency across devices
- Scalability: 500M users, 50M concurrent connections
Chat systems are connection-heavy. Maintaining 50M WebSocket connections requires careful resource management: at ~10 KB of memory per connection, connection state alone totals ~500 GB (50M × 10 KB).
Back-of-the-Envelope Estimation
Chat Traffic Estimation
- 500M total users, 50M daily active, 50M concurrent connections at peak
- Average messages per user per day: 40
- Messages/day: 50M × 40 = 2B messages
- QPS_write = 2B / 86400 ≈ 23,000 QPS
- Peak write (3x): ~70,000 QPS
- Storage: 2B × 200 bytes × 365 days = ~146 TB/year
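The arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch: the function name `estimate_chat_traffic` and its parameters are illustrative, and the inputs are the figures from the list (50M daily active users, 40 messages per user per day, 200 bytes per message, 3x peak factor).

```python
SECONDS_PER_DAY = 86_400


def estimate_chat_traffic(daily_active_users, messages_per_user_per_day,
                          bytes_per_message, peak_factor=3):
    """Return daily volume, average/peak write QPS, and yearly storage in TB."""
    messages_per_day = daily_active_users * messages_per_user_per_day
    avg_write_qps = messages_per_day / SECONDS_PER_DAY
    peak_write_qps = avg_write_qps * peak_factor
    # Decimal terabytes (1 TB = 10^12 bytes), matching the estimate above.
    storage_tb_per_year = messages_per_day * bytes_per_message * 365 / 1e12
    return {
        "messages_per_day": messages_per_day,
        "avg_write_qps": avg_write_qps,
        "peak_write_qps": peak_write_qps,
        "storage_tb_per_year": storage_tb_per_year,
    }


estimate = estimate_chat_traffic(
    daily_active_users=50_000_000,
    messages_per_user_per_day=40,
    bytes_per_message=200,
)
```

The exact values (about 23,148 average and 69,444 peak writes per second) round to the ~23,000 and ~70,000 QPS quoted above. The storage figure covers raw message bodies only, not indexes, replication, or media.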
API Design
POST /api/v1/conversations
Request: { "type": "group", "members": ["u1", "u2"], "name": "Project Team" }
Response: { "conversation_id": "c_123" }
GET /api/v1/conversations/{id}/messages?before=msg_456&limit=50
Response: { "messages": [...], "has_more": true }
WebSocket: wss://chat.example.com/ws?token={jwt}
→ Send: { "type": "message", "conversation_id": "c_123", "content": "Hello!" }
← Recv: { "type": "message", "from": "u_456", "content": "Hello!" }
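The `before` and `limit` parameters on the history endpoint describe cursor-based pagination. The sketch below shows one way the server side might compute a page from an in-memory list. It assumes the `before` cursor resolves to a per-conversation sequence number; the `Message` type and `fetch_page` function are hypothetical names, not part of the API above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    seq: int       # assumed per-conversation sequence number used as cursor
    sender: str
    content: str


def fetch_page(messages, before_seq=None, limit=50):
    """Return the newest `limit` messages older than `before_seq`.

    `messages` must be sorted by ascending `seq`. The result mirrors the
    response shape above: a list of messages plus a `has_more` flag.
    """
    older = [m for m in messages if before_seq is None or m.seq < before_seq]
    page = older[-limit:] if limit > 0 else []
    return {"messages": page, "has_more": len(older) > len(page)}
```

A client scrolling back through history would pass the smallest `seq` of the page it already holds as the next `before_seq`, and stop when `has_more` is false.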
High-Level Architecture
Detailed Design
WebSocket Connection Management
Maintaining persistent connections requires a stateful gateway:
Definition: WebSocket Gateway
The WebSocket gateway manages persistent TCP connections with clients. Each gateway server handles thousands of connections and routes messages to the correct destination using a session lookup service.
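Gateway Capacity
$$N_{\text{servers}} = \left\lceil \frac{C_{\text{total}}}{C_{\text{max}}} \right\rceil$$
Here,
- $C_{\text{total}}$ = Total WebSocket connections
- $C_{\text{max}}$ = Max connections per server (e.g., 100K)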
Gateway Scaling
50M concurrent connections / 100K per gateway = 500 gateway servers
Each gateway uses epoll/kqueue for efficient I/O multiplexing.
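The gateway sizing can be checked with a short calculation. The sketch below uses the figures from this section (50M connections, 100K connections per gateway, ~10 KB per connection); the function names are illustrative, and rounding up to a whole server is an assumption about how a partial server would be handled.

```python
def gateways_needed(total_connections, max_per_gateway):
    """Number of gateway servers, rounded up to a whole server."""
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-total_connections // max_per_gateway)


def connection_memory_gb(connections, kb_per_connection=10):
    """Memory for connection state in decimal GB (1 GB = 10^6 KB)."""
    return connections * kb_per_connection / 1_000_000


fleet_size = gateways_needed(50_000_000, 100_000)
fleet_memory_gb = connection_memory_gb(50_000_000)
per_gateway_memory_gb = connection_memory_gb(100_000)
```

At these numbers each gateway holds about 1 GB of connection state, so the per-server connection limit is more likely to be set by file descriptors, CPU, and network bandwidth than by memory. A real deployment would also add headroom for failover and reconnect storms.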
Message Flow
- Sender's gateway receives message via WebSocket
- Gateway publishes to Kafka topic (partitioned by conversation_id)
- Chat service processes message (validation, persistence)
- Chat service looks up recipient's gateway
- If recipient is online, deliver via their gateway's WebSocket
- If recipient is offline, send push notification
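The steps above can be sketched as a small in-memory router. This is an illustration under simplifying assumptions, not a production design: the Kafka hop is omitted, a plain dictionary stands in for the session lookup service, and the `ChatRouter` class and its attributes are hypothetical names.

```python
class ChatRouter:
    """Toy model of the chat service: persist, then deliver or push."""

    def __init__(self):
        self.sessions = {}    # user_id -> gateway_id (session lookup stand-in)
        self.store = []       # persisted messages (database stand-in)
        self.delivered = []   # (gateway_id, user_id, message) WebSocket sends
        self.pushes = []      # (user_id, message) push notifications

    def connect(self, user_id, gateway_id):
        self.sessions[user_id] = gateway_id

    def disconnect(self, user_id):
        self.sessions.pop(user_id, None)

    def handle_message(self, sender, recipients, conversation_id, content):
        if not content:
            raise ValueError("empty message")      # validation step
        message = {
            "type": "message",
            "conversation_id": conversation_id,
            "from": sender,
            "content": content,
        }
        self.store.append(message)                 # persist before delivery
        for user_id in recipients:
            if user_id == sender:
                continue
            gateway_id = self.sessions.get(user_id)
            if gateway_id is not None:             # online: deliver via gateway
                self.delivered.append((gateway_id, user_id, message))
            else:                                  # offline: push notification
                self.pushes.append((user_id, message))
        return message
```

Persisting before delivery means a recipient who misses the real-time send can still fetch the message from history on reconnect.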
Message Ordering
Ensure messages within a conversation are ordered:
Message Sequence
$$\text{order key} = (c,\ s_c)$$
Here,
- $s_c$ = Monotonically increasing sequence number per conversation
- $c$ = Unique conversation identifier
Use Kafka's partition ordering: all messages for the same conversation go to the same partition, guaranteeing order within a conversation.
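To illustrate why keying by conversation preserves order, the sketch below maps a conversation ID to a partition with a stable hash and assigns per-conversation sequence numbers. It is a stand-in rather than Kafka's actual behavior: the Java client's default partitioner hashes keys with murmur2, while CRC32 is used here only because it ships with the standard library. `NUM_PARTITIONS`, `partition_for`, and `SequenceAssigner` are hypothetical names.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 12  # hypothetical partition count for the topic


def partition_for(conversation_id, num_partitions=NUM_PARTITIONS):
    """Map a conversation to a partition with a stable hash of its ID."""
    return zlib.crc32(conversation_id.encode("utf-8")) % num_partitions


class SequenceAssigner:
    """Hands out increasing sequence numbers, independently per conversation."""

    def __init__(self):
        self._last = defaultdict(int)

    def assign(self, conversation_id):
        self._last[conversation_id] += 1
        return self._last[conversation_id]
```

Because the hash depends only on the conversation ID, every message in a conversation lands on the same partition and is consumed in the order it was appended. Changing the partition count remaps keys, so ordering across a repartition needs separate handling.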
Presence System
Track online/offline status using Redis:
// User comes online
SET presence:user_123 "online" EX 300 // 5-minute TTL
// Heartbeat (renew every 60 seconds)
EXPIRE presence:user_123 300
// Check presence
GET presence:user_123 // Returns "online" or nil
Use TTL-based presence with heartbeats. If a user's connection drops without a clean disconnect, the TTL expires and the user is marked offline automatically.
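The TTL behavior can be modeled without Redis to make the expiry logic explicit. The sketch below is an in-memory approximation, assuming a 300-second TTL as in the commands above; the `PresenceTracker` class is a hypothetical name, and the injectable `clock` exists only so the expiry can be exercised without waiting.

```python
import time


class PresenceTracker:
    """In-memory model of TTL-based presence (stand-in for SET ... EX)."""

    def __init__(self, ttl_seconds=300, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._expires_at = {}  # user_id -> time at which presence lapses

    def heartbeat(self, user_id):
        """Mark the user online and restart the TTL."""
        self._expires_at[user_id] = self.clock() + self.ttl

    def is_online(self, user_id):
        expires = self._expires_at.get(user_id)
        if expires is None:
            return False
        if self.clock() >= expires:
            # TTL lapsed without a heartbeat: treat as an unclean disconnect.
            del self._expires_at[user_id]
            return False
        return True
```

With a 60-second heartbeat and a 300-second TTL, a client can miss several heartbeats before being shown as offline. A shorter TTL detects drops faster at the cost of more false offline flips on flaky networks.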
Read Receipts
Track which messages have been read:
// Store last read message ID per user per conversation
SET read:user_123:conv_456 msg_789
// When recipient reads messages, update
SET read:user_123:conv_456 msg_790
// Sender queries read status
GET read:user_123:conv_456 // Returns last read msg ID
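One detail the commands above leave open is what happens when updates arrive out of order, for example from two devices. The sketch below keeps the read marker from moving backwards. It assumes message IDs can be compared as per-conversation sequence numbers, which the original does not state; `ReadReceipts` and its methods are hypothetical names.

```python
class ReadReceipts:
    """Tracks the last-read position per (user, conversation)."""

    def __init__(self):
        self._last_read = {}  # (user_id, conversation_id) -> sequence number

    def mark_read(self, user_id, conversation_id, seq):
        """Advance the marker; stale or duplicate updates are ignored."""
        key = (user_id, conversation_id)
        self._last_read[key] = max(self._last_read.get(key, 0), seq)
        return self._last_read[key]

    def last_read(self, user_id, conversation_id):
        return self._last_read.get((user_id, conversation_id), 0)

    def has_read(self, user_id, conversation_id, seq):
        """True if the user has read the message with this sequence number."""
        return self.last_read(user_id, conversation_id) >= seq
```

Storing a single marker per user per conversation keeps the cost constant regardless of message count, since everything at or below the marker is treated as read.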
Practice Exercises
- Design: How would you implement end-to-end encryption for a chat system? What are the key management challenges?
- Scale: If a group chat has 1000 members and one user sends a message, estimate the fan-out cost. How would you optimize for large groups?
- Reliability: Design a system to ensure exactly-once message delivery. How do you handle duplicate messages from network retries?
- Search: How would you implement full-text search across millions of chat messages? What indexing strategy would you use?
Key Takeaways:
- WebSocket gateways manage persistent connections; scale horizontally with connection-aware load balancing
- Kafka provides ordered, durable message delivery partitioned by conversation_id
- TTL-based presence with heartbeats enables automatic offline detection
- Cassandra is well suited to message storage thanks to its write-optimized LSM-tree storage engine and time-series-friendly data model
- Fan-out for group messages requires careful optimization for large groups
What to Learn Next
-> Networking Fundamentals: TCP/IP, WebSockets, and connection management.
-> Message Queues: Kafka, RabbitMQ, and event-driven messaging.
-> Design News Feed: Fan-out strategies and real-time content delivery.
-> Design Notification System: Multi-channel notification delivery and retry logic.
-> Databases: Choosing Cassandra for time-series message data.
-> Caching Strategies: Session management and presence tracking with Redis.