ZooKeeper Series: Voting

ZooKeeper's election process defaults to the FastLeaderElection class, which starts Messenger to send and receive election information when FastLeaderElection starts. After the election is completed, one Leader and several Follower are selected.

First understand several concepts:

Epoch: The voting cycle is used to distinguish each round, and each time a new leader-follower relationship is established, there will be a unique epoch value to identify. It is as if the emperor had to have a year number to distinguish him from the emperors before or after him. Start just by reading saved currentEpoch and acceptedEpoch values from the file. The default is - 1.

Zxid: Transaction ID, which represents the serial number of Zookeeper's current write operation, ensures that the write operation is executed sequentially. One scenario is that when Follower's maximum zxid is greater than Leader's zxid, Leader sends TRUNC packages to Follower to truncate redundant transactions to ensure consistency with Leader data. Zxid is initialized in the loadDataBase method of ZxDataBase. Each time a write transaction zxid is executed, it adds 1, defaulting to 0.

Election Rules: First judge the Epoch of both sides and retain the large Epoch; then look at the zxid of both sides and retain the large zxid side. The final result is that one of the parties (Epoch < 32 | zxid) will be elected Leader.

Messenger is used to exchange election information between clusters. Quorum Cnx Manager is used at the bottom of Messenger to manage the Socket data transfer between the local machine and other machines in the cluster. The communication flow chart is shown below.

All LOOKING machines in the cluster start the election process at the same time. At the end of the election, a Leader and multiple Follower are generated. Then each Follower connects to Leader for transaction initialization and data synchronization.

Observer does not participate in the election, but receives Leader's information synchronization requirements.

QuorumPeer will set the initial state to LOOKING when it starts, then start the FastLeader Election election process, first send itself as a Vote group to other QuorumPeer, and the data structure of the group is Notification. At the same time, start the sending and receiving thread of Messenage and call the lookForLeader method to initiate a round of electing the Leader.

In the Leader/Follower state, Messenger automatically replies to the Leader message if it receives the Notification package sent by the remote LOOKING node looking for the Leader; in the LOOKING state, Messenger replies to the current Vote node (temporary Leader) message to the other party.

The default election rules stipulate that when more than half of QuorumPeers agree with the same Leader, the election process ends and each QuorumPeer sets itself as Leader, Follower or Observer. Of course, interested readers can also try to create their own election rules class, as long as the implementation of Quorum Verifier.

Prerequisites for Fast leader selection:

1. Each QuorumPeer knows the ip addresses of other QuorumPeers and the total number of QuorumPeers.

2. Each QuorumPeer initially initiates a vote and chooses itself as leader. Send vote notification s to all other QuorumPeers and wait for a reply.

3. Processing vote notification messages according to QuorumPeer's status.

The provisional Leader generated during the election process becomes Vote, and the Vote generated in the last round after the election is the new Leader.

QuorumPeer calls the lookForLeader method for each election initiation. First, it sets itself to LOOKING state. This method is the main method of FastLeaderElection class. The specific flow is as follows:

  • First, update the election cycle logicalclock and send yourself as a leader to all other server s as a vote.
  • Then enter the cycle of this round of voting until they are no longer LOOKING.
  1. Get a network packet from recvqueue (the recvqueue data comes from Messenger), and check whether you want to reconnect and resend your vote if you don't receive the packet.
  2. Judge the voting status of the opponent after receiving the vote.
    • If the other party's voting cycle (Epoch) is greater than its own cycle (Epoch), then empty its receipt of the voting set recvset, and compare itself as a candidate with the leader of the other party's voting, select the big one as a new vote, and then send it to everyone. Here, the size of comparison is obtained by comparing the binary (zxid, sid). If the size of zxid is large, otherwise the size of sid is large.
    • If the opponent's voting cycle is less than his own, he will ignore the other party's voting.
    • If the cycle is equal, then compare the other party's votes and their own candidates, select the big ones as new candidates, and then send them to everyone.
    • Then it judges whether the votes received at present can draw the conclusion of who is leader, mainly by judging whether the current candidate leader has the majority in the votes received.
    • If the candidate leader takes up the majority of the votes received, wait for the finalizeWait clock to see if anyone changes the candidate leader, and if so, put the vote in the recvqueue and recycle it.
  2. OBSERVING: If the other party is an observer, because it has no right to vote, ignore it.
  • If the opponent and himself have another clock cycle, indicating that the other party has completed the election, if the other party says that it is the leader, then we will take it as the leader, otherwise we will compare whether the leader elected by the other party has a majority in his own place, and the leader confirmed his willingness to be the leader, if both have passed, we will take it as the leader. Candidates for Leadership
  • If the other party and himself are not in a clock cycle, it means that they hang up and recover. At this time, the votes of others are collected into a separate set of outofelection (which can be seen from the name is not used for election judgment). If the other party's votes are in the majority of the outofelection, and the leader also holds the majority of the votes. Confirm your willingness to be a leader, update your election cycle logicalclock and change your status to FOLLOWING or LEADING


The lookForLeader code is long, so let's first look at its body structure.

public Vote lookForLeader() throws InterruptedException {
     try {
        HashMap<Long, Vote> recvset = new HashMap<Long, Vote>();
        HashMap<Long, Vote> outofelection = new HashMap<Long, Vote>();
       int notTimeout = finalizeWait;
             updateProposal(getInitId(), getInitLastLoggedZxid(), 
        while ((self.getPeerState() == ServerState.LOOKING) && (!stop)){           
            Notification n = recvqueue.poll(notTimeout,
            if (self.getCurrentAndNextConfigVoters().contains(n.sid)) {
               switch (n.state) {
                  case LOOKING:
                       {A Code}
                    case OBSERVING:
                        LOG.debug("Notification from observer: " + n.sid);
                    case FOLLOWING:
                    case LEADING:                    
                        {B Code}
                        LOG.warn("Notification state unrecoginized: " + n.state
                              + " (n.state), " + n.sid + " (n.sid)");
            return null;

Specific analysis is as follows:

First, through the sendNotification method, I tell the cluster that I am looking for a Leader. Other machines in the cluster will receive Notification in the Messenger process, and then they will reply who should be the current Leader.

Fast Leader Election collects enough Notification messages to determine who is the legitimate Reader. To this end, it makes the following judgments for each Notification message:

A. The sender replying to Notification is also in LOOKING status

If the sender responding to Notification is also in LOOKING status, it does not know who the final leader is. At this time, FastLeader Election is compared with the sender to see whose potential leader has the largest Epoch and zxid, and whose largest is the current candidate leader. Then broadcast the candidate leader to the sender. Can change their follow-up Leader. At the same time, we can judge whether our collection of Nootifications can reach more than half of the conditions to determine the final Leader. If we can decide, we can set the final Leader and withdraw from the election process.

case LOOKING:   
   // If notification > current, replace and send messages out
   if (n.electionEpoch > logicalclock.get()) {
       if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                       getInitId(), getInitLastLoggedZxid(), getPeerEpoch())) {
            updateProposal(n.leader, n.zxid, n.peerEpoch);
       } else {
   } else if (n.electionEpoch < logicalclock.get()) {
   } else if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,proposedLeader, proposedZxid, proposedEpoch)) {
       updateProposal(n.leader, n.zxid, n.peerEpoch); sendNotifications();
   recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, 
   if (termPredicate(recvset, new Vote(proposedLeader, proposedZxid,
                            logicalclock.get(), proposedEpoch))) {
      while((n = recvqueue.poll(finalizeWait, TimeUnit.MILLISECONDS)) != null){
         if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,proposedLeader, proposedZxid, proposedEpoch)){
      if (n == null) {
           self.setPeerState((proposedLeader == self.getId()) ?
                                        ServerState.LEADING: learningState());
           Vote endVote = new Vote(proposedLeader, proposedZxid, proposedEpoch);
        return endVote;

The total OrderPredicate method is used to determine whether the Vote sent by the other party is an updated Reader candidate, and if so, to update the local proposed Leader. The main code is as follows:

protected boolean totalOrderPredicate(long newId, long newZxid, long newEpoch, long curId, long curZxid, long curEpoch) {
        return ((newEpoch > curEpoch) ||
                ((newEpoch == curEpoch) &&  ((newZxid > curZxid) ||
 ((newZxid == curZxid) && (newId > curId)))));

termPredicate is used to determine whether the selected Leader condition is satisfied. The implementation of Quorum Verifier interface sets up the Leader and exits the FastLeader Election process if it is satisfied. The main code is as follows:

private boolean termPredicate(HashMap<Long, Vote> votes, Vote vote) {
        SyncedLearnerTracker voteSet = new SyncedLearnerTracker();
        if (self.getLastSeenQuorumVerifier() != null
                && self.getLastSeenQuorumVerifier().getVersion() > self
                        .getQuorumVerifier().getVersion()) {
        for (Map.Entry<Long, Vote> entry : votes.entrySet()) {
            if (vote.equals(entry.getValue())) {
        return voteSet.hasAllQuorums();


B. The sender replying to Notification is also Leader or Follower

If the sender replying to Notification is Leader or Follower, the process is relatively simple. The Notification message is saved in recvset and the termPredicate method is called to determine whether the Leader can be determined and the election process can be terminated.

Code snippet:

     if(n.electionEpoch == logicalclock.get()){
         recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, 
         if(termPredicate(recvset, new Vote(n.leader, n.zxid, n.electionEpoch,
n.peerEpoch, n.state))  && checkLeader(outofelection, n.leader, 
n.electionEpoch)) {
            self.setPeerState((n.leader == self.getId()) ?
                                  ServerState.LEADING: learningState());
            Vote endVote = new Vote(n.leader, n.zxid, n.peerEpoch);
            return endVote;

Keywords: Big Data Zookeeper socket network less

Added by dotBz on Mon, 09 Sep 2019 09:16:47 +0300