Data Replication

Replication copies data across multiple nodes to improve availability, reduce latency, and provide fault tolerance. The challenge is keeping replicas consistent as data changes.

Availability: replicas keep serving reads even if the primary fails
Latency: users read from a geographically close replica
Durability: data survives the failure of individual nodes

Replication is the foundation of fault-tolerant distributed systems.

Leader-Follower Replication

The most common replication topology: one leader accepts writes, followers replicate the log.

Definition: Leader-Follower Replication

In leader-follower (primary-backup) replication, one node (the leader) processes all write operations. Changes are logged and shipped to follower nodes, which apply them asynchronously or synchronously. Followers serve read queries, distributing read load across the cluster.
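The write path described above can be sketched in a few lines. This is a minimal in-memory illustration, not a production design: the `Leader` and `Follower` classes and the key name are hypothetical, and the leader ships each change to followers synchronously for simplicity.

```python
from dataclasses import dataclass, field


@dataclass
class Follower:
    """Hypothetical follower: holds a copy of the data and serves reads."""
    data: dict = field(default_factory=dict)

    def apply(self, key, value):
        self.data[key] = value

    def read(self, key):
        return self.data.get(key)


@dataclass
class Leader:
    """Hypothetical leader: the only node that accepts writes."""
    followers: list
    data: dict = field(default_factory=dict)
    log: list = field(default_factory=list)

    def write(self, key, value):
        self.log.append((key, value))    # record the change in the log
        self.data[key] = value           # apply it locally
        for follower in self.followers:  # ship it to every follower
            follower.apply(key, value)


followers = [Follower(), Follower()]
leader = Leader(followers=followers)
leader.write("user:1:name", "Ada")

# Reads can now be served by any follower.
name_from_follower = followers[1].read("user:1:name")
```

Clients send every write to `leader.write` and may spread reads across the followers.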

Synchronous vs Asynchronous Replication

Definition: Synchronous Replication

Synchronous replication waits for all replicas to confirm a write before acknowledging it to the client. This guarantees that every replica holds the write once it is acknowledged (strong consistency), but write latency is set by the slowest replica, and a single unavailable replica blocks writes.

Definition: Asynchronous Replication

Asynchronous replication acknowledges a write as soon as the leader has applied it and propagates the change to replicas in the background. This gives lower write latency but allows temporary inconsistency between the leader and its followers (replication lag).

Property	Synchronous	Asynchronous
Consistency	Strong	Eventual
Write latency	High (waits for all replicas)	Low (single node)
Availability	Lower (all replicas must be up)	Higher
Data loss risk	None (on commit)	Up to replication lag
Use case	Financial transactions	Social media feeds

Many production systems use semi-synchronous replication: the leader waits for at least one follower to confirm before acknowledging. This keeps every acknowledged write on at least two nodes without the latency cost of waiting for all replicas.
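To make the latency trade-off concrete, the sketch below computes how long the leader waits before acknowledging a write in each mode. It is a simplified model under stated assumptions: the function name `ack_latency_ms` and the round-trip times are made up for illustration, followers are contacted in parallel, and the leader's own commit time is ignored.

```python
def ack_latency_ms(follower_rtts_ms, required_acks):
    """Time until the leader can acknowledge a write.

    follower_rtts_ms: round-trip time to each follower (illustrative values).
    required_acks: number of follower confirmations the leader waits for.
    """
    if not 0 <= required_acks <= len(follower_rtts_ms):
        raise ValueError("required_acks must be between 0 and the follower count")
    if required_acks == 0:
        return 0  # asynchronous: no follower confirmation needed
    # Followers answer in parallel, so the wait ends when the
    # required_acks-th fastest follower has confirmed.
    return sorted(follower_rtts_ms)[required_acks - 1]


rtts = [5, 40, 120]  # one nearby follower, two remote ones

asynchronous = ack_latency_ms(rtts, required_acks=0)
semi_synchronous = ack_latency_ms(rtts, required_acks=1)
synchronous = ack_latency_ms(rtts, required_acks=len(rtts))
```

With these numbers, waiting for one follower costs only the round trip to the nearest follower, while fully synchronous replication pays for the slowest one.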

Replication Lag

$$lag(t) = t_{leader} - t_{follower}$$

Here,

$lag(t)$ = time difference between leader and follower at time $t$
$t_{leader}$ = timestamp of the latest committed write on the leader
$t_{follower}$ = timestamp of the latest applied write on the follower
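As a worked example of the formula, the snippet below computes the lag for two followers. The follower names and timestamps are invented for illustration.

```python
from datetime import datetime, timedelta


def replication_lag(t_leader: datetime, t_follower: datetime) -> timedelta:
    """lag = t_leader - t_follower, as defined above."""
    return t_leader - t_follower


# Timestamp of the latest committed write on the leader.
t_leader = datetime(2024, 1, 1, 12, 0, 5)

# Timestamp of the latest applied write on each (hypothetical) follower.
followers = {
    "follower-a": datetime(2024, 1, 1, 12, 0, 5),  # fully caught up
    "follower-b": datetime(2024, 1, 1, 12, 0, 3),  # two seconds behind
}

lags = {name: replication_lag(t_leader, t) for name, t in followers.items()}
```

A lag of zero means the follower has applied everything the leader has committed.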

Read-After-Write Consistency

A user writes their profile name then immediately reads it from a different server. If that server is a follower with replication lag, they see the old name.

Solution: Route a user's reads of data they recently wrote to the leader, or have the follower wait until it has caught up to that write before answering. Either approach provides read-your-writes consistency.
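One way to implement the first option is to remember when each user last wrote and send that user's reads to the leader for a short window afterwards. The sketch below assumes a hypothetical `ReadRouter` class and a fixed window (`max_lag_seconds`) chosen to exceed the expected replication lag; neither comes from a specific system.

```python
import time


class ReadRouter:
    """Hypothetical router that sends recent writers to the leader."""

    def __init__(self, max_lag_seconds=5.0, clock=time.monotonic):
        self.max_lag_seconds = max_lag_seconds  # assumed upper bound on lag
        self.clock = clock
        self.last_write_at = {}  # user_id -> time of that user's last write

    def record_write(self, user_id):
        self.last_write_at[user_id] = self.clock()

    def choose_replica(self, user_id):
        last = self.last_write_at.get(user_id)
        if last is not None and self.clock() - last < self.max_lag_seconds:
            return "leader"    # follower may not have the user's write yet
        return "follower"      # safe to read from a follower


router = ReadRouter(max_lag_seconds=5.0)
router.record_write("user-1")
target_for_writer = router.choose_replica("user-1")
target_for_other = router.choose_replica("user-2")
```

Only the user who just wrote is routed to the leader; everyone else keeps reading from followers, so the leader's read load stays small.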

Multi-Leader Replication

Multiple nodes accept writes, each replicating to all others.

Definition: Multi-Leader Replication

Multi-leader (multi-master) replication allows multiple nodes to accept write operations independently. Each leader replicates its changes to all other leaders. This enables writes at multiple data centers and offline operation, but introduces conflict resolution challenges.

When to Use Multi-Leader

Multi-datacenter deployments for low-latency writes
Offline clients that sync when reconnected
Collaborative editing where multiple users edit simultaneously

Multi-leader replication is significantly more complex than single-leader. Conflicts must be detected and resolved. Only use multi-leader when single-leader cannot meet your requirements.

Conflict Resolution

When two leaders update the same key concurrently, a conflict occurs.

Definition: Conflict Resolution Strategies

Last Write Wins (LWW): Timestamp-based, simple but loses data
Merge: Application-defined merge function (e.g., union of sets)
Custom: Application-specific resolution based on business logic

Last Write Wins

$$winner = \arg\max_{write} \; timestamp(write)$$

Here,

$winner$ = the write that is kept
$timestamp(write)$ = wall-clock or logical timestamp of the write
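The rule above is a plain maximum over timestamps. A minimal sketch, assuming a hypothetical `Write` record with a numeric timestamp:

```python
from typing import NamedTuple


class Write(NamedTuple):
    key: str
    value: str
    timestamp: float  # wall-clock or logical timestamp


def last_write_wins(writes):
    """Keep the write with the largest timestamp; all others are discarded."""
    return max(writes, key=lambda w: w.timestamp)


# Two leaders accepted concurrent writes to the same key.
write_a = Write("profile:name", "Alice", timestamp=10.0)
write_b = Write("profile:name", "Alicia", timestamp=10.5)

winner = last_write_wins([write_a, write_b])
```

Here `write_a` is silently dropped. With wall-clock timestamps, clock skew between leaders can also cause the write that actually happened later to lose, and equal timestamps need an additional tiebreaker that this sketch does not define.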

Avoid LWW when data loss is unacceptable. For example, if two users add items to a shared shopping cart at the same time, LWW keeps only one version of the cart and discards the other user's items. Instead, merge the carts (union of items).
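The cart merge suggested above can be sketched as a set union. The function name `merge_carts` and the item names are illustrative, and the sketch assumes carts only ever grow.

```python
def merge_carts(*carts):
    """Merge concurrent versions of a cart by taking the union of their items."""
    merged = set()
    for cart in carts:
        merged |= cart
    return merged


# The same cart was updated concurrently on two leaders.
cart_on_leader_1 = {"book", "pen"}
cart_on_leader_2 = {"book", "lamp"}

merged = merge_carts(cart_on_leader_1, cart_on_leader_2)
```

Unlike LWW, neither user's additions are lost. A plain union cannot express removals, though: an item deleted on one leader reappears if another leader's version still contains it, so carts that support deletion need extra bookkeeping.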

Replication Topologies

Topology	Pros	Cons
Linear	Simple, ordered	Slow propagation, single point of failure
Star	Fast propagation from center	Hub is bottleneck
Circular	No central bottleneck	Slow, fragile if one node fails
All-to-All	Fastest propagation	Conflict resolution complex
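The propagation column of the table can be checked with a small breadth-first search that counts the worst-case number of hops a write needs to reach every node. This is a rough model under assumptions not stated in the table: five nodes, bidirectional links in the linear and star topologies, a one-directional ring for the circular topology, and equal cost per hop.

```python
from collections import deque


def worst_case_hops(edges, nodes):
    """Largest hop count any write needs to reach all nodes, or None if some
    node can never be reached."""
    worst = 0
    for origin in nodes:
        dist = {origin: 0}
        queue = deque([origin])
        while queue:
            node = queue.popleft()
            for nxt in edges.get(node, ()):
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        if len(dist) < len(nodes):
            return None
        worst = max(worst, max(dist.values()))
    return worst


nodes = range(5)
topologies = {
    "linear": {i: [j for j in (i - 1, i + 1) if 0 <= j < 5] for i in nodes},
    "star": {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]},
    "circular": {i: [(i + 1) % 5] for i in nodes},
    "all-to-all": {i: [j for j in nodes if j != i] for i in nodes},
}

hops = {name: worst_case_hops(edges, nodes) for name, edges in topologies.items()}
```

Under these assumptions all-to-all needs one hop, star needs two (leaf to hub to leaf), and linear and circular need four.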

Log-Based Replication

Most modern systems replicate via an append-only log.

Definition: Replication Log

A replication log is an append-only sequence of events. The leader appends new entries; followers fetch and apply them in order. As long as each entry is applied deterministically, followers that start from the same initial state and apply the same log converge to the same state. Examples: WAL (PostgreSQL), binlog (MySQL), oplog (MongoDB).
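The fetch-and-apply loop can be sketched with a list as the log and an offset per follower. The classes `ReplicationLog` and `LogFollower` and the `("set" | "delete", key, value)` entry format are assumptions made for this example and do not mirror the WAL, binlog, or oplog formats.

```python
class ReplicationLog:
    """Hypothetical append-only log owned by the leader."""

    def __init__(self):
        self.entries = []

    def append(self, entry):
        self.entries.append(entry)
        return len(self.entries) - 1  # offset of the new entry

    def read_from(self, offset):
        return self.entries[offset:]


class LogFollower:
    """Hypothetical follower that applies entries in log order."""

    def __init__(self):
        self.state = {}
        self.next_offset = 0  # first entry not yet applied

    def catch_up(self, log):
        for op, key, value in log.read_from(self.next_offset):
            if op == "set":
                self.state[key] = value
            elif op == "delete":
                self.state.pop(key, None)
            self.next_offset += 1


log = ReplicationLog()
log.append(("set", "a", 1))
log.append(("set", "b", 2))

follower = LogFollower()
follower.catch_up(log)            # applies offsets 0 and 1

log.append(("delete", "a", None))
log.append(("set", "b", 3))
follower.catch_up(log)            # resumes at offset 2

late_follower = LogFollower()     # joins later and replays the whole log
late_follower.catch_up(log)
```

Both followers end in the same state even though one applied the log in two batches and the other replayed it from the start, which is the convergence property described above.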

Log-based replication is the foundation for change data capture (CDC) and event sourcing. Tools like Debezium stream database changes as events, enabling real-time data pipelines.

Practice Exercises

Conceptual: Explain why synchronous replication is rarely used for all replicas in production. What is the alternative?
Design: Design a replication strategy for a global social media app where users write in their local region but must read their own writes immediately. How do you handle replication lag?
Conflict Resolution: Two users edit the same document offline. User A saves at 10:00:01, User B saves at 10:00:02. Both sync at 10:00:05. Design a conflict resolution strategy that preserves both users' intent.
Analysis: Compare linear, star, and all-to-all replication topologies for a 5-datacenter deployment. Which topology minimizes propagation delay? Which is most resilient?

Key Takeaways:

Leader-follower replication is the most common topology; followers replicate the leader's log
Synchronous replication provides strong consistency; asynchronous provides lower latency
Multi-leader replication enables multi-datacenter writes but requires conflict resolution
Log-based replication (WAL, binlog) guarantees ordered application of changes
Replication lag affects consistency guarantees; read-your-writes requires special handling

What to Learn Next

Data Partitioning: horizontal partitioning, range vs hash partitioning, and rebalancing.

CAP Theorem: consistency models, availability, and partition tolerance.

Consistent Hashing: hash rings, virtual nodes, and load distribution.

Distributed Consensus: Raft, Paxos, and leader election algorithms.

Databases: SQL vs NoSQL, indexing, replication, and sharding.

Event-Driven Architecture: event sourcing, CQRS, and saga patterns.
