Theory
Distributed Consensus
For a distributed system to function correctly, nodes must agree on a single value or sequence of values. Consensus algorithms ensure this agreement despite node failures and network partitions.
- Agreement β All non-faulty nodes decide the same value
- Validity β Decided values were proposed by some node
- Termination β Algorithm eventually terminates
Consensus is the fundamental building block for replicated state machines.
The Consensus Problem
Multiple nodes must agree on a single value, even when some nodes fail or messages are delayed.
DfByzantine Fault Tolerance
A Byzantine fault is any fault where a node can behave arbitrarily β sending different values to different nodes, refusing to respond, or actively colluding. A Byzantine Fault Tolerant (BFT) system can reach consensus even if up to f out of 3f+1 nodes are Byzantine faulty.
DfCrash Fault Tolerance
A crash fault is when a node stops responding entirely. A Crash Fault Tolerant (CFT) system can reach consensus if a majority of nodes are alive. Raft and Paxos are crash fault tolerant.
Most distributed systems use crash fault tolerance (CFT) because Byzantine faults are rare in controlled environments. BFT is used in blockchain and adversarial settings where nodes cannot be trusted.
FLP Impossibility
DfFLP Impossibility
The FLP Impossibility result (Fischer, Lynch, Paterson, 1985) proves that in an asynchronous system where even one node can crash, no deterministic consensus protocol can guarantee termination. Practical algorithms circumvent this by using randomization or partial synchrony assumptions.
Paxos
The foundational consensus algorithm, introduced by Leslie Lamport.
DfPaxos
Paxos is a family of consensus protocols where nodes take roles (proposer, acceptor, learner) to agree on a value. The basic version (Single-Decree Paxos) agrees on a single value. Multi-Paxos agrees on a sequence of values by electing a stable leader. Paxos is provably correct but notoriously difficult to implement.
Paxos Phases
Majority Quorum
Here,
- =Total number of nodes
- =Minimum nodes needed for agreement
Paxos Quorum Calculation
With 5 nodes, quorum = floor(5/2) + 1 = 3. Any two quorums overlap by at least 1 node, ensuring that once a value is accepted by a quorum, any subsequent quorum will see it.
Raft
A consensus algorithm designed for understandability.
DfRaft
Raft achieves consensus by electing a leader that manages replication. The leader receives client requests, appends them to its log, and replicates entries to followers. Once a majority confirms an entry, it is committed. Raft decomposes consensus into leader election, log replication, and safety.
Raft Leader Election
Raft Properties
| Property | Description |
|---|---|
| Election Safety | At most one leader per term |
| Leader Append-Only | Leader never overwrites/deletes entries |
| Log Matching | If two logs contain entry with same index and term, all prior entries are identical |
| Leader Completeness | If entry committed in term T, present in leader's log for all higher terms |
| State Machine Safety | If server applied entry at index i, no other server applies different entry at i |
Raft vs Paxos
| Aspect | Raft | Paxos |
|---|---|---|
| Design goal | Understandability | Theoretical optimality |
| Leader | Strong leader required | Can work without stable leader |
| Log structure | Strict ordering | Flexible |
| Complexity | Moderate | High |
| Implementations | etcd, CockroachDB | Google Chubby, Spanner |
ZAB (ZooKeeper Atomic Broadcast)
Used by Apache ZooKeeper for coordination.
DfZAB Protocol
ZAB is a crash fault tolerant consensus protocol used by ZooKeeper. It uses a leader-based approach where the leader processes write requests and broadcasts changes via a two-phase commit. ZAB provides a total ordering of operations, enabling consistent distributed coordination.
ZAB differs from Raft in that it focuses on atomic broadcast (total order of operations) rather than log replication. ZooKeeper uses ZAB to provide primitives like distributed locks, leader election, and configuration management.
Leader Election
A critical subproblem in distributed systems.
DfLeader Election
Leader election is the process of selecting one node from a group to act as coordinator. A leader manages operations, resolves conflicts, and provides a single point of truth. If the leader fails, a new election is triggered. Common algorithms include Bully, Raft, and ZooKeeper-based election.
Election Timeout
Here,
- =Random election timeout to avoid split votes
Practice Exercises
-
Conceptual: Explain why consensus requires a majority quorum. What happens with a 3-node cluster during a network partition?
-
Raft: Trace through a Raft leader election with 5 nodes. Node 3's election timer fires first. Walk through the message exchanges.
-
Comparison: Compare Raft and Paxos for a distributed key-value store. When would you choose one over the other?
-
FLP Impossibility: The FLP result says deterministic consensus is impossible in asynchronous systems with crash failures. How do practical algorithms like Raft circumvent this?
Key Takeaways:
- Distributed consensus ensures all nodes agree on a single value despite failures
- Paxos is the foundational algorithm; Raft prioritizes understandability
- Leader election centralizes coordination, simplifying replication
- FLP Impossibility proves deterministic consensus cannot guarantee termination in async systems
- Practical algorithms use randomization or partial synchrony assumptions
- Used by etcd, CockroachDB, ZooKeeper, and distributed databases
What to Learn Next
-> CAP Theorem Consistency models, availability, and partition tolerance.
-> Data Replication Leader-follower, multi-leader, and conflict resolution.
-> Consistent Hashing Hash rings, virtual nodes, and load distribution.
-> Databases SQL vs NoSQL, indexing, replication, and sharding.
-> Service Mesh Envoy, Istio, and sidecar proxy patterns.
-> Event-Driven Architecture Event sourcing, CQRS, and saga patterns.