Paxos Algorithm was the work by Leslie Lamport. It is the foundation for solving many fundamental problem in distributed system. Famous systems that make use of the algorithm included but not limited to Google Chubby, ZooKeeper. Recently he was awarded the Turing Award 2013 which is a great honour to him & well deserved, this makes a good opportunity for me to go over the Paxos Algorithms again, and explain it a bit here for myself to keep my memory fresh.
What paxos algorithm trying to address is the consensus problem. There’re a few consensus algorithms out there addressing the issue, namely 2-Phase-Commit (2PC) & 3-Phase-Commit (3PC). Though, each of those has its own weakness / scenario that it couldn’t handle, 2PC blocks/stucks when there’s node failure; 3PC upon network partition, will produce inconsistency. That’s where Paxos Algorithm comes into play.
Paxos Algorithm’s behaviour is known to be difficult to grasp, yet the algorithm itself is simple to understand.
The basis for a consensus algorithms are:
- Only a value that has been proposed may be chosen
- Only a single value is chosen
- A process never learns that a value has been chosen unless it actually has been
Paxos suggests the following three classes of nodes to perform the corresponding functions:
- Proposers - To propose a value
- Acceptors - Choose which proposal to accept from proposers
- Learners - Decide which value to be chosen based on acceptors’ acceptance
Choosing a value
With proposers proposing values, acceptor need to decide which value to accept, and initially, acceptor might not have sufficient knowledge or reference to decide whether a value given by a proposer should be rejected or not. So, to start and make progress, the first invariant would be defined as follows:
P1. Acceptor accept first proposal it receives
When acceptors have the knowledge to determine if a proposed value should be accepted or not, quorum is used as the strategy – where majority wins. However, this raised a problem, what if there’s multiple proposers proposing multiple values at about the same time, while one of the acceptor failed and each of the proposed values got hold of the same accept count, failing to learn which value was chosen. In ordering to allow restarting, each of the proposal now are tagged with a unique sequence number which is monotonically increasing, e.g. Values Pairs: (local clock, local id), together with the second invariant to reach agreement:
P2. If v is chosen, every higher proposal chosen has value v
Since, a value to be chosen, a proposal must be accepted by at least one acceptor. So, P2 can be satisfied by satifying:
P2a. If v is chosen, every higher proposal accepted has value v
Here, P2a requires the algorithms to control the future, where every higher-numbered proposal should have value v if it’s chosen. This is impossible in certain scenario e.g. “An acceptor recovered from failure without accepting any value before; while at the same time, a new proposer arose & proposed a new value (the recovered acceptor will accept it due to P1, violating P2a.” With respect to that:
P2b. If v is chosen, every higher proposal issued has value v
Since a proposal must be proposed before it can be accepted by acceptor, P2b implies P2a and thus P2 (P2b => P2a, P2a => P2, P2b => P2). So proposer have to be able to issue proposal that either its value is the highest-numbered proposal majority accepted or itself is the only proposal acceptors ever accepted. These forms P2c:
P2c. If any proposal (n, v) is issued, there is a set S of a majority of acceptors such that either
- (a) no one in S has accepted any proposal numbered less than n
- (b) v is the value of the highest proposal among all proposals less than n accepted by acceptors in S
To achieve P2c, a mechanism with two phases: Prepare Phase & Accept Phase was introduced:
Send requests numbered n to some set of acceptors, extracting a promise from acceptors where it won’t accept proposal less than n; In case it has accepted any value before, return value of the highest numbered proposal
After majority of the acceptors responded in the prepare phase, proposer can then issue a accept request numbered n and value v, (n, v), where v is the value of the highest numbered proposal among the responses, or is any value from the proposer if responses are of empty value.
The two phases help making sure issued proposals contains proper value (arbitary or (if any) value from highest numbered proposals accepted), thereby acceptors accepting correct values.
One important thing to note though, the entire system is asynchonous which means eventual termination won’t be guaranteed - (Imagine a scenario where, Proposer 1 issued a prepare request P1 where it was responsed positively by majority of the acceptors. Right before Proposer 1 could issue an accept request, Proposer 2 issued a prepare request P2 (P2 > P1) where it was completed with positive responses from majority of the acceptors, causing accept phase of Proposer 1 that follows to fail. If this goes on forever, there will be no value chosen afterall). To resolve this, there’s the need to have eventual leader election which ensures liveness. Say for each round of censensus, a distinguished proposer would be elected such that it’s the one which should issue a proposal or its proposal have a higher priority over the others. Evenutally the system would act like there’s only one proposer and thus termination.
Learning a chosen value
When acceptors accept a value, it can then inform the learners about the chosen value. Learners could then find out which value / proposal has been accepted by a majority of acceptors, and decide if that’s the chosen value. And in case of message loss or node failure, learners could also ask acceptors what value have been accepted, or to go further, it can ask whether a value has been accepted by having a proposer issue a proposal.