System Design Problems
Design Google Docs
Google Docs enables multiple users to edit the same document simultaneously with changes visible in real-time. The system must resolve concurrent edits without conflicts, maintain document consistency, and handle hundreds of simultaneous collaborators.
- Real-time Collaboration β Changes visible to all users within 100ms
- Conflict Resolution β Operational Transformation ensures consistency
- Offline Support β Continue editing offline; sync when reconnected
The core challenge is Operational Transformation (OT): transforming concurrent operations so they produce the same result regardless of execution order.
Requirements
Functional Requirements
- Multiple users can edit the same document simultaneously
- Changes appear in real-time (within 100ms)
- Cursor position visible for all collaborators
- Edit history with ability to undo/redo
- Comments and suggestions
- Offline editing with sync on reconnect
- Document sharing with permissions
Non-Functional Requirements
- Latency: Changes visible in < 100ms
- Consistency: All users see the same document state
- Scale: 100 concurrent editors per document
- Durability: Never lose edits
- Availability: 99.99%
Google Docs was originally built on Operational Transformation (OT). Modern systems like Figma use CRDTs (Conflict-free Replicated Data Types) which don't require a central server for conflict resolution.
Back-of-the-Envelope Estimation
Google Docs Capacity
- 2 billion documents, 1 billion active users
- Average document: 20 KB text
- Total storage: 2B Γ 20 KB = 40 TB
- Concurrent editing sessions: 10M (1% of users)
- Operations per second: 10M Γ 10 ops/sec = 100M ops/sec
- WebSocket connections: 10M concurrent
Operational Transformation
DfOperational Transformation (OT)
OT is a technique for consistency in collaborative editing. Each user's edit is represented as an operation. When concurrent operations arrive, they are transformed against each other so they produce the same document state regardless of execution order.
OT Transformation
Here,
- =First operation (e.g., insert 'a' at position 5)
- =Second operation (e.g., insert 'b' at position 3)
- =Transformed op1 (position adjusted for op2)
- =Transformed op2 (position adjusted for op1)
OT Transformation Example
User A: Insert 'a' at position 5 User B: Insert 'b' at position 3
Without transformation:
- A executes first: ...b...a... (A at 5, B at 3)
- B executes first: ...b...a... (B at 3, A at 5) Both produce the same result β
With transformation:
- A receives B's op: Insert 'a' at position 6 (shifted right by 1)
- B receives A's op: Insert 'b' at position 3 (no shift) Both produce: ...b...a... β
OT Properties
DfTP1 (Transformation Property 1)
For any two concurrent operations o1 and o2: transform(o1, o2) applied to state S produces the same result as transform(o2, o1) applied to state S.
DfTP2 (Transformation Property 2)
For any operation o and transformed operations o1', o2': transform(compose(o1', o2'), o) = compose(transform(o1', o), transform(o2', o'))
TP1 ensures convergence (all users reach the same state). TP2 ensures that composing transformed operations is equivalent to transforming composed operations.
Insert Operation Transformation
Insert-Insert Transformation
Here,
- =Position of first insert
- =Position of second insert
- =Characters being inserted
Insert-Insert Transformation
User A: Insert 'X' at position 5 User B: Insert 'Y' at position 3
Transform(A, B):
- A's position 5 >= B's position 3, so A shifts right: Insert 'X' at position 6
Transform(B, A):
- B's position 3 < A's position 5, so B stays: Insert 'Y' at position 3
Both produce: ...Y...X...
Delete Operation Transformation
Delete-Insert Transformation
Here,
- =Position of delete
- =Position of insert
High-Level Architecture
Detailed Design
Operation Model
DfOperation
An operation represents a single atomic edit. Operations are: insert(character, position), delete(position), or retain(count). Each operation has a version number indicating the base document version it was created from.
Operation = {
type: "insert" | "delete" | "retain",
position: int,
character: char (for insert),
version: int, // Base version this op was created from
client_id: string, // Who created this op
timestamp: long
}
Server-side OT Flow
- Client sends operation with base version V
- Server receives operation
- Server transforms operation against all operations after version V
- Server applies transformed operation to document
- Server increments version to V+1
- Server broadcasts transformed operation to all other clients
- Server persists operation to operation log
The server is the single source of truth. Clients send operations; the server transforms and applies them. This central authority prevents divergence.
Client-side OT Flow
- User makes local edit β create operation with current local version
- Apply operation locally immediately (optimistic update)
- Send operation to server
- Receive transformed operation from server
- Adjust local state based on server's transformed operation
Version Vector
DfVersion Vector
A version vector tracks the operation history per client. It enables the server to determine which operations an client has seen and which transformations are needed.
Version Vector: {
client_a: 15, // Client A has applied 15 ops
client_b: 12, // Client B has applied 12 ops
client_c: 18 // Client C has applied 18 ops
}
When Client A (version 15) sends an operation, the server checks which operations after version 15 need to be transformed against. If other clients have applied operations 16-18, A's operation is transformed against those.
Persistence Strategy
| Component | Storage | Reason |
|---|---|---|
| Operation Log | Append-only log | Durable, enables replay |
| Document Snapshots | Periodic snapshots | Fast document loading |
| Operation Buffer | In-memory (Redis) | Recent ops for transformation |
Take snapshots every 100 operations. To load a document, load the latest snapshot and replay subsequent operations. This balances load time with storage cost.
Scaling OT Servers
DfDocument Affinity
Document affinity routes all operations for a document to the same OT server. This ensures all operations for a document are processed in order without distributed coordination.
Server Assignment
Here,
- =Unique document identifier
- =Number of OT servers
Document affinity means one document's operations are processed by one server. If a document has > 100 concurrent editors, the single server may become a bottleneck. Consider splitting very active documents across multiple servers.
Practice Exercises
-
Algorithm: Implement the OT transform function for insert-insert and delete-insert cases. Prove that TP1 and TP2 hold for your implementation.
-
Scale: If a document has 100 concurrent editors, each making 10 operations/second, the server processes 1000 ops/sec for one document. Estimate the server resources needed.
-
Offline: Design a system for offline editing that syncs changes when the user reconnects. How do you handle conflicts with the online version?
-
Comparison: Compare OT and CRDTs for collaborative editing. What are the trade-offs in terms of complexity, server requirements, and undo/redo support?
Key Takeaways:
- OT ensures convergence by transforming concurrent operations against each other
- The server is the single source of truth; clients send operations, server transforms and broadcasts
- Document affinity routes all operations for a document to the same server
- Snapshots every 100 operations balance load time with storage cost
- Version vectors track per-client operation history for transformation
What to Learn Next
-> Design Google Drive File storage, sync, and conflict resolution.
-> Design Chat System Real-time messaging with WebSocket and presence.
-> Distributed Consensus Raft, Paxos, and consistency in distributed systems.
-> Data Replication Replicating document state across servers.
-> Message Queues Broadcasting operations to collaborators.
-> Design News Feed Real-time updates with WebSocket.