Data Systems
Data Replication
Replication copies data across multiple nodes to improve availability, reduce latency, and provide fault tolerance. The challenge is keeping replicas consistent as data changes.
- Availability β Replicas serve reads even if primary fails
- Latency β Users read from geographically close replicas
- Durability β Data survives individual node failures
Replication is the foundation of fault-tolerant distributed systems.
Leader-Follower Replication
The most common replication topology: one leader accepts writes, followers replicate the log.
DfLeader-Follower Replication
In leader-follower (primary-backup) replication, one node (the leader) processes all write operations. Changes are logged and shipped to follower nodes, which apply them asynchronously or synchronously. Followers serve read queries, distributing read load across the cluster.
Synchronous vs Asynchronous Replication
DfSynchronous Replication
Synchronous replication waits for all replicas to confirm a write before acknowledging it to the client. Guarantees strong consistency but increases write latency proportional to the slowest replica.
DfAsynchronous Replication
Asynchronous replication acknowledges writes immediately and propagates changes to replicas in the background. Provides lower latency but allows temporary inconsistency (replication lag).
| Property | Synchronous | Asynchronous |
|---|---|---|
| Consistency | Strong | Eventual |
| Write latency | High (waits for all replicas) | Low (single node) |
| Availability | Lower (all replicas must be up) | Higher |
| Data loss risk | None (on commit) | Up to replication lag |
| Use case | Financial transactions | Social media feeds |
Most production systems use semi-synchronous replication: the leader waits for at least one follower to confirm before acknowledging. This provides durability without the latency cost of waiting for all replicas.
Replication Lag
Replication Lag
Here,
- =Time difference between leader and follower at time t
- =Timestamp of latest committed write on leader
- =Timestamp of latest applied write on follower
Read-After-Write Consistency
A user writes their profile name then immediately reads it from a different server. If that server is a follower with replication lag, they see the old name.
Solution: Route reads-after-writes to the leader, or wait for replication to catch up (read-your-writes consistency).
Multi-Leader Replication
Multiple nodes accept writes, each replicating to all others.
DfMulti-Leader Replication
Multi-leader (multi-master) replication allows multiple nodes to accept write operations independently. Each leader replicates its changes to all other leaders. This enables writes at multiple data centers and offline operation, but introduces conflict resolution challenges.
When to Use Multi-Leader
- Multi-datacenter deployments for low-latency writes
- Offline clients that sync when reconnected
- Collaborative editing where multiple users edit simultaneously
Multi-leader replication is significantly more complex than single-leader. Conflicts must be detected and resolved. Only use multi-leader when single-leader cannot meet your requirements.
Conflict Resolution
When two leaders update the same key concurrently, a conflict occurs.
DfConflict Resolution Strategies
- Last Write Wins (LWW): Timestamp-based, simple but loses data
- Merge: Application-defined merge function (e.g., union of sets)
- Custom: Application-specific resolution based on business logic
Last Write Wins
Here,
- =The write that is kept
- =Wall-clock or logical timestamp of the write
Avoid LWW when data loss is unacceptable. For example, two users adding items to a shopping cart simultaneously β LWW loses one user's items. Instead, merge the carts (union of items).
Replication Topologies
| Topology | Pros | Cons |
|---|---|---|
| Linear | Simple, ordered | Slow propagation, single point of failure |
| Star | Fast propagation from center | Hub is bottleneck |
| Circular | No central bottleneck | Slow, fragile if one node fails |
| All-to-All | Fastest propagation | Conflict resolution complex |
Log-Based Replication
Most modern systems replicate via an append-only log.
DfReplication Log
A replication log is an append-only sequence of events. The leader appends new entries; followers fetch and apply them in order. This guarantees that followers converge to the same state given the same initial state and log. Examples: WAL (PostgreSQL), binlog (MySQL), oplog (MongoDB).
Log-based replication is the foundation for change data capture (CDC) and event sourcing. Tools like Debezium stream database changes as events, enabling real-time data pipelines.
Practice Exercises
-
Conceptual: Explain why synchronous replication is rarely used for all replicas in production. What is the alternative?
-
Design: Design a replication strategy for a global social media app where users write in their local region but must read their own writes immediately. How do you handle replication lag?
-
Conflict Resolution: Two users edit the same document offline. User A saves at 10:00:01, User B saves at 10:00:02. Both sync at 10:00:05. Design a conflict resolution strategy that preserves both users' intent.
-
Analysis: Compare linear, star, and all-to-all replication topologies for a 5-datacenter deployment. Which topology minimizes propagation delay? Which is most resilient?
Key Takeaways:
- Leader-follower replication is the most common topology; followers replicate the leader's log
- Synchronous replication provides strong consistency; asynchronous provides lower latency
- Multi-leader replication enables multi-datacenter writes but requires conflict resolution
- Log-based replication (WAL, binlog) guarantees ordered application of changes
- Replication lag affects consistency guarantees; read-your-writes requires special handling
What to Learn Next
-> Data Partitioning Horizontal partitioning, range vs hash partitioning, and rebalancing.
-> CAP Theorem Consistency models, availability, and partition tolerance.
-> Consistent Hashing Hash rings, virtual nodes, and load distribution.
-> Distributed Consensus Raft, Paxos, and leader election algorithms.
-> Databases SQL vs NoSQL, indexing, replication, and sharding.
-> Event-Driven Architecture Event sourcing, CQRS, and saga patterns.