Design a Chat System

A chat system enables real-time messaging between users. Services like WhatsApp, Slack, and Discord handle billions of messages daily with sub-second delivery and presence tracking; WhatsApp additionally encrypts messages end to end.

  • Real-time Delivery — Messages delivered in < 100ms between online users
  • Persistence — Message history stored durably and searchable
  • Group Support — 1-on-1 and group conversations with up to 1,000 members

The core challenge is maintaining persistent WebSocket connections for millions of concurrent users while ensuring message ordering and delivery guarantees.

Requirements

Functional Requirements

  • 1-on-1 messaging between users
  • Group messaging (up to 1000 members)
  • Online/offline presence indicators
  • Read receipts and typing indicators
  • Media sharing (images, files, voice messages)
  • Message history and search
  • Push notifications for offline users

Non-Functional Requirements

  • Latency: Message delivery < 100ms for online users
  • Ordering: Messages within a conversation are strictly ordered
  • Durability: No messages lost; persisted to database
  • Consistency: Eventual consistency across devices
  • Scalability: 500M users, 50M concurrent connections

Chat systems are connection-heavy. Maintaining 50M WebSocket connections requires careful resource management: at roughly 10 KB of memory per connection, connection state alone totals about 500 GB across the gateway fleet.

Back-of-the-Envelope Estimation

Chat Traffic Estimation

  • 500M total users, 50M daily active, up to 50M concurrent connections (the figure used in the scalability requirement and the gateway sizing)
  • Average messages per user per day: 40
  • Messages/day: 50M × 40 = 2B messages
  • QPS_write = 2B / 86400 ≈ 23,000 QPS
  • Peak write (3x): ~70,000 QPS
  • Storage: 2B × 200 bytes × 365 days = ~146 TB/year
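
The arithmetic above can be checked with a few lines of Python. This is only a sketch of the estimate: the variable names are illustrative, the 200-byte average message size and the 3x peak factor are taken from the bullets above, and TB is treated as 10^12 bytes.

```python
SECONDS_PER_DAY = 86_400

# Inputs taken from the estimation bullets above.
daily_active_users = 50_000_000
messages_per_user_per_day = 40
avg_message_bytes = 200      # assumed average message size
peak_factor = 3              # assumed peak-to-average ratio

messages_per_day = daily_active_users * messages_per_user_per_day
avg_write_qps = messages_per_day / SECONDS_PER_DAY
peak_write_qps = avg_write_qps * peak_factor
storage_bytes_per_year = messages_per_day * avg_message_bytes * 365

print(f"messages/day : {messages_per_day:,}")
print(f"avg write QPS: {avg_write_qps:,.0f}")
print(f"peak QPS     : {peak_write_qps:,.0f}")
print(f"storage/year : {storage_bytes_per_year / 10**12:.0f} TB")
```

The unrounded values come out slightly below the rounded figures in the list (about 23,148 average and 69,444 peak writes per second), which is why the list uses "≈" and "~".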

API Design

POST /api/v1/conversations
Request:  { "type": "group", "members": ["u1", "u2"], "name": "Project Team" }
Response: { "conversation_id": "c_123" }

GET /api/v1/conversations/{id}/messages?before=msg_456&limit=50
Response: { "messages": [...], "has_more": true }

WebSocket: wss://chat.example.com/ws?token={jwt}
  → Send: { "type": "message", "conversation_id": "c_123", "content": "Hello!" }
  ← Recv: { "type": "message", "from": "u_456", "content": "Hello!" }
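
The `before` and `limit` parameters of the history endpoint describe cursor-based pagination. A minimal sketch of that behaviour follows, assuming messages are held oldest-to-newest in an in-memory list that stands in for the message store; `page_messages` and the `id` field are hypothetical names, not part of the API above.

```python
from typing import Optional


def page_messages(history: list[dict], before: Optional[str] = None,
                  limit: int = 50) -> dict:
    """Return up to `limit` messages that are older than the `before` cursor.

    `history` is ordered oldest to newest. With no cursor, the newest
    messages are returned.
    """
    end = len(history)
    if before is not None:
        ids = [m["id"] for m in history]
        end = ids.index(before)  # raises ValueError for an unknown cursor
    start = max(0, end - limit)
    return {"messages": history[start:end], "has_more": start > 0}


history = [{"id": f"msg_{i}", "content": f"text {i}"} for i in range(1, 8)]
page = page_messages(history, before="msg_6", limit=3)
print([m["id"] for m in page["messages"]], page["has_more"])
```

A client would pass the ID of the oldest message it received as the next `before` value and stop once `has_more` is false. A real service would typically run a range query on the message store rather than scan a list.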

High-Level Architecture

[Figure: Chat System Architecture. Components shown: Clients (Mobile, Desktop, Web), WebSocket Gateway, Connection Mgr, Session Store, Message Router, Presence Tracker, Chat Service, Message Handler, Fan-out Service, Notification Svc, Redis (Sessions), Cassandra (Messages), S3 (Media)]

Detailed Design

WebSocket Connection Management

Maintaining persistent connections requires a stateful gateway:

Definition: WebSocket Gateway

The WebSocket gateway manages persistent TCP connections with clients. Each gateway server holds a large number of connections (on the order of 100K in the capacity estimate) and routes messages to the correct destination using a session lookup service.

Gateway Capacity

$$gateways = \frac{concurrent\_connections}{connections\_per\_gateway}$$

Here,

  • concurrent_connections = Total WebSocket connections
  • connections_per_gateway = Max connections per server (e.g., 100K)

Gateway Scaling

50M concurrent connections / 100K per gateway = 500 gateway servers

Each gateway uses epoll/kqueue for efficient I/O multiplexing.
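
The sizing formula can be sketched as a small helper. The function name `gateways_needed` and the `headroom_percent` parameter are assumptions added for illustration; the source only gives the plain division.

```python
def gateways_needed(concurrent_connections: int,
                    connections_per_gateway: int,
                    headroom_percent: int = 0) -> int:
    """Number of gateway servers, rounded up.

    headroom_percent reserves part of each server's capacity, for example
    to absorb reconnect storms when another gateway fails (an assumption,
    not a figure from the lesson).
    """
    if connections_per_gateway <= 0 or not 0 <= headroom_percent < 100:
        raise ValueError("invalid capacity or headroom")
    usable = connections_per_gateway * (100 - headroom_percent) // 100
    if usable == 0:
        raise ValueError("headroom leaves no usable capacity")
    # Integer ceiling division avoids floating-point rounding.
    return -(-concurrent_connections // usable)


print(gateways_needed(50_000_000, 100_000))                       # 500
print(gateways_needed(50_000_000, 100_000, headroom_percent=20))  # 625
```

Running servers at their hard limit leaves no room for failover, so a fleet is usually sized with some headroom; the 20% value here is only an example.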

Message Flow

[Figure: Message routing through Kafka. Sender → Gateway A (Sender) → Message Queue (Kafka) → Gateway B, Gateway C → Receivers]

  1. Sender's gateway receives message via WebSocket
  2. Gateway publishes to Kafka topic (partitioned by conversation_id)
  3. Chat service processes message (validation, persistence)
  4. Chat service looks up recipient's gateway
  5. If recipient is online, deliver via their gateway's WebSocket
  6. If recipient is offline, send push notification
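
Steps 3 to 6 can be sketched in memory as follows. All class and field names are hypothetical: a dict stands in for the session lookup service, plain lists stand in for the database and the push provider, and calling `handle` stands in for consuming a message from Kafka (steps 1 and 2 are not modelled).

```python
from dataclasses import dataclass, field


@dataclass
class Gateway:
    """Stand-in for a gateway server; records what it would send."""
    name: str
    delivered: list = field(default_factory=list)

    def deliver(self, user_id: str, message: dict) -> None:
        self.delivered.append((user_id, message))


@dataclass
class ChatService:
    sessions: dict   # user_id -> Gateway, only for online users
    members: dict    # conversation_id -> list of member user_ids
    stored: list = field(default_factory=list)   # stand-in for the database
    pushes: list = field(default_factory=list)   # stand-in for push notifications

    def handle(self, message: dict) -> None:
        # Step 3: validate and persist before any delivery attempt.
        if not message.get("content"):
            raise ValueError("empty message")
        self.stored.append(message)
        for user_id in self.members[message["conversation_id"]]:
            if user_id == message["from"]:
                continue
            gateway = self.sessions.get(user_id)             # step 4: lookup
            if gateway is not None:
                gateway.deliver(user_id, message)            # step 5: online
            else:
                self.pushes.append((user_id, message))       # step 6: offline


gateway_b = Gateway("gateway-b")
service = ChatService(sessions={"u2": gateway_b},
                      members={"c_123": ["u1", "u2", "u3"]})
msg = {"conversation_id": "c_123", "from": "u1", "content": "Hello!"}
service.handle(msg)
print(len(gateway_b.delivered), len(service.pushes))
```

Persisting before delivery means a crash between the two steps leaves the message recoverable from history instead of lost.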

Message Ordering

Ensure messages within a conversation are ordered:

Message Sequence

$$seq\_id = \text{atomic\_increment}(conversation\_id)$$

Here,

  • seq_id = Monotonically increasing sequence number per conversation
  • conversation_id = Unique conversation identifier

Use Kafka's partition ordering: all messages for the same conversation go to the same partition, guaranteeing order within a conversation.
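
A minimal sketch of both ideas, under stated assumptions: `SequenceAllocator` is a single-process stand-in for an atomic counter (a shared store would be needed across servers), and `partition_for` uses an MD5-based hash purely for illustration. It is not Kafka's own partitioner, which hashes the record key differently.

```python
import hashlib
from collections import defaultdict
from itertools import count


class SequenceAllocator:
    """Per-conversation counter, in memory and for one process only."""

    def __init__(self) -> None:
        self._counters = defaultdict(lambda: count(1))

    def next_seq(self, conversation_id: str) -> int:
        return next(self._counters[conversation_id])


def partition_for(conversation_id: str, num_partitions: int) -> int:
    """Map a conversation to a partition with a stable hash.

    Python's built-in hash() is randomized per process for strings, so a
    digest is used to keep the mapping stable across restarts.
    """
    digest = hashlib.md5(conversation_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions


allocator = SequenceAllocator()
print(allocator.next_seq("c_123"), allocator.next_seq("c_123"))  # 1 2
print(allocator.next_seq("c_456"))                               # 1
print(partition_for("c_123", 12) == partition_for("c_123", 12))  # True
```

Because the partition depends only on conversation_id, every message in a conversation lands on the same partition. Changing the partition count changes the mapping, so it is best fixed up front.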

Presence System

Track online/offline status using Redis:

// User comes online
SET presence:user_123 "online" EX 300  // 5-minute TTL

// Heartbeat (renew every 60 seconds)
EXPIRE presence:user_123 300

// Check presence
GET presence:user_123  // Returns "online" or nil

Use TTL-based presence with heartbeats. If a user's connection drops without a clean disconnect, the TTL expires and the user is marked offline automatically.
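
The TTL-and-heartbeat behaviour can be illustrated without Redis. The sketch below is an in-memory stand-in: `PresenceTracker` and its methods are hypothetical names, and the injectable clock exists only so expiry can be demonstrated without waiting five minutes.

```python
import time


class PresenceTracker:
    """In-memory imitation of a key with a TTL that heartbeats renew."""

    def __init__(self, ttl_seconds: float = 300, clock=time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._expires_at = {}  # user_id -> expiry time

    def heartbeat(self, user_id: str) -> None:
        # Covers both the initial SET ... EX and the later EXPIRE renewals.
        self._expires_at[user_id] = self._clock() + self._ttl

    def is_online(self, user_id: str) -> bool:
        expires_at = self._expires_at.get(user_id)
        if expires_at is None:
            return False
        if self._clock() >= expires_at:
            del self._expires_at[user_id]  # lazy expiry on read
            return False
        return True


now = [0.0]  # fake clock, mutated to simulate elapsed time
tracker = PresenceTracker(ttl_seconds=300, clock=lambda: now[0])
tracker.heartbeat("user_123")
now[0] = 299.0
print(tracker.is_online("user_123"))  # True: still inside the TTL
now[0] = 300.0
print(tracker.is_online("user_123"))  # False: no heartbeat arrived in time
```

With a 60-second heartbeat and a 300-second TTL, several heartbeats in a row must be missed before a user is shown as offline. That trades slower offline detection for fewer false offline flips on flaky networks.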

Read Receipts

Track which messages have been read:

// Store last read message ID per user per conversation
SET read:user_123:conv_456 msg_789

// When recipient reads messages, update
SET read:user_123:conv_456 msg_790

// Sender queries read status
GET read:user_123:conv_456  // Returns last read msg ID
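
One detail the commands above leave open is what happens when updates arrive out of order, for example from two devices. The sketch below assumes the stored marker is the numeric per-conversation sequence number from the Message Ordering section rather than an opaque message ID, so markers can be compared; `ReadReceipts` and its methods are hypothetical names.

```python
class ReadReceipts:
    """Tracks the highest sequence number each user has read per conversation."""

    def __init__(self) -> None:
        self._last_read = {}  # (user_id, conversation_id) -> seq_id

    def mark_read(self, user_id: str, conversation_id: str, seq_id: int) -> int:
        key = (user_id, conversation_id)
        # Only move forward: a late update from another device must not
        # move the marker backwards.
        self._last_read[key] = max(self._last_read.get(key, 0), seq_id)
        return self._last_read[key]

    def has_read(self, user_id: str, conversation_id: str, seq_id: int) -> bool:
        return self._last_read.get((user_id, conversation_id), 0) >= seq_id


receipts = ReadReceipts()
receipts.mark_read("user_123", "conv_456", 790)
receipts.mark_read("user_123", "conv_456", 789)  # stale update, ignored
print(receipts.has_read("user_123", "conv_456", 790))  # True
print(receipts.has_read("user_123", "conv_456", 791))  # False
```

Storing one marker per user per conversation keeps the cost constant regardless of how many messages the conversation holds.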

Practice Exercises

  1. Design: How would you implement end-to-end encryption for a chat system? What are the key management challenges?

  2. Scale: If a group chat has 1000 members and one user sends a message, estimate the fan-out cost. How would you optimize for large groups?

  3. Reliability: Design a system to ensure exactly-once message delivery. How do you handle duplicate messages from network retries?

  4. Search: How would you implement full-text search across millions of chat messages? What indexing strategy would you use?

Key Takeaways:

  • WebSocket gateways manage persistent connections; scale horizontally with connection-aware load balancing
  • Kafka provides durable delivery with ordering guaranteed within each partition; partitioning by conversation_id keeps every conversation in order
  • TTL-based presence with heartbeats enables automatic offline detection
  • Cassandra is a good fit for message storage because its LSM-tree storage engine is write-optimized and its data model suits time-ordered data
  • Fan-out for group messages requires careful optimization for large groups

What to Learn Next

-> Networking Fundamentals: TCP/IP, WebSockets, and connection management.

-> Message Queues: Kafka, RabbitMQ, and event-driven messaging.

-> Design News Feed: Fan-out strategies and real-time content delivery.

-> Design Notification System: Multi-channel notification delivery and retry logic.

-> Databases: Choosing Cassandra for time-series message data.

-> Caching Strategies: Session management and presence tracking with Redis.
