Theory

Distributed Consensus

For a distributed system to function correctly, nodes must agree on a single value or sequence of values. Consensus algorithms ensure this agreement despite node failures and network partitions.

Agreement — All non-faulty nodes decide the same value
Validity — Decided values were proposed by some node
Termination — Algorithm eventually terminates

Consensus is the fundamental building block for replicated state machines.

The Consensus Problem

Multiple nodes must agree on a single value, even when some nodes fail or messages are delayed.

DfByzantine Fault Tolerance

A Byzantine fault is any fault where a node can behave arbitrarily — sending different values to different nodes, refusing to respond, or actively colluding. A Byzantine Fault Tolerant (BFT) system can reach consensus even if up to f out of 3f+1 nodes are Byzantine faulty.

DfCrash Fault Tolerance

A crash fault is when a node stops responding entirely. A Crash Fault Tolerant (CFT) system can reach consensus if a majority of nodes are alive. Raft and Paxos are crash fault tolerant.

Most distributed systems use crash fault tolerance (CFT) because Byzantine faults are rare in controlled environments. BFT is used in blockchain and adversarial settings where nodes cannot be trusted.

FLP Impossibility

DfFLP Impossibility

The FLP Impossibility result (Fischer, Lynch, Paterson, 1985) proves that in an asynchronous system where even one node can crash, no deterministic consensus protocol can guarantee termination. Practical algorithms circumvent this by using randomization or partial synchrony assumptions.

Paxos

The foundational consensus algorithm, introduced by Leslie Lamport.

DfPaxos

Paxos is a family of consensus protocols where nodes take roles (proposer, acceptor, learner) to agree on a value. The basic version (Single-Decree Paxos) agrees on a single value. Multi-Paxos agrees on a sequence of values by electing a stable leader. Paxos is provably correct but notoriously difficult to implement.

Paxos Phases

Majority Quorum

quorum = \lfloor \frac{n}{2} \rfloor + 1

Here,

$n$ =Total number of nodes
$quorum$ =Minimum nodes needed for agreement

Paxos Quorum Calculation

With 5 nodes, quorum = floor(5/2) + 1 = 3. Any two quorums overlap by at least 1 node, ensuring that once a value is accepted by a quorum, any subsequent quorum will see it.

Raft

A consensus algorithm designed for understandability.

DfRaft

Raft achieves consensus by electing a leader that manages replication. The leader receives client requests, appends them to its log, and replicates entries to followers. Once a majority confirms an entry, it is committed. Raft decomposes consensus into leader election, log replication, and safety.

Raft Leader Election

Raft Properties

Property	Description
Election Safety	At most one leader per term
Leader Append-Only	Leader never overwrites/deletes entries
Log Matching	If two logs contain entry with same index and term, all prior entries are identical
Leader Completeness	If entry committed in term T, present in leader's log for all higher terms
State Machine Safety	If server applied entry at index i, no other server applies different entry at i

Raft vs Paxos

Aspect	Raft	Paxos
Design goal	Understandability	Theoretical optimality
Leader	Strong leader required	Can work without stable leader
Log structure	Strict ordering	Flexible
Complexity	Moderate	High
Implementations	etcd, CockroachDB	Google Chubby, Spanner

ZAB (ZooKeeper Atomic Broadcast)

Used by Apache ZooKeeper for coordination.

DfZAB Protocol

ZAB is a crash fault tolerant consensus protocol used by ZooKeeper. It uses a leader-based approach where the leader processes write requests and broadcasts changes via a two-phase commit. ZAB provides a total ordering of operations, enabling consistent distributed coordination.

ZAB differs from Raft in that it focuses on atomic broadcast (total order of operations) rather than log replication. ZooKeeper uses ZAB to provide primitives like distributed locks, leader election, and configuration management.

Leader Election

A critical subproblem in distributed systems.

DfLeader Election

Leader election is the process of selecting one node from a group to act as coordinator. A leader manages operations, resolves conflicts, and provides a single point of truth. If the leader fails, a new election is triggered. Common algorithms include Bully, Raft, and ZooKeeper-based election.

Election Timeout

timeout = random(150ms, 300ms)

Here,

$timeout$ =Random election timeout to avoid split votes

Practice Exercises

Conceptual: Explain why consensus requires a majority quorum. What happens with a 3-node cluster during a network partition?
Raft: Trace through a Raft leader election with 5 nodes. Node 3's election timer fires first. Walk through the message exchanges.
Comparison: Compare Raft and Paxos for a distributed key-value store. When would you choose one over the other?
FLP Impossibility: The FLP result says deterministic consensus is impossible in asynchronous systems with crash failures. How do practical algorithms like Raft circumvent this?

Key Takeaways:

Distributed consensus ensures all nodes agree on a single value despite failures
Paxos is the foundational algorithm; Raft prioritizes understandability
Leader election centralizes coordination, simplifying replication
FLP Impossibility proves deterministic consensus cannot guarantee termination in async systems
Practical algorithms use randomization or partial synchrony assumptions
Used by etcd, CockroachDB, ZooKeeper, and distributed databases

What to Learn Next

-> CAP Theorem Consistency models, availability, and partition tolerance.

-> Data Replication Leader-follower, multi-leader, and conflict resolution.

-> Consistent Hashing Hash rings, virtual nodes, and load distribution.

-> Databases SQL vs NoSQL, indexing, replication, and sharding.

-> Service Mesh Envoy, Istio, and sidecar proxy patterns.

-> Event-Driven Architecture Event sourcing, CQRS, and saga patterns.

Distributed Consensus

Distributed Consensus

The Consensus Problem

DfByzantine Fault Tolerance

DfCrash Fault Tolerance

FLP Impossibility

DfFLP Impossibility

Paxos

DfPaxos

Paxos Phases

Majority Quorum

Paxos Quorum Calculation

Raft

DfRaft

Raft Leader Election

Raft Properties

Raft vs Paxos

ZAB (ZooKeeper Atomic Broadcast)

DfZAB Protocol

Leader Election

DfLeader Election

Election Timeout

Practice Exercises

What to Learn Next

Premium Content

Need Expert System Design Help?