In the previous article, Source Code Analysis RocketMQ DLedger Multi-Replica Leader Election, we analyzed leader election in detail. This article analyzes the implementation of log replication.
According to the Raft protocol, once the cluster finishes leader election, the Leader (primary) node accepts client requests, while the follower (secondary) nodes only synchronize data from the Leader and do not process read or write requests. This differs significantly from the read-write separation of a master-slave (M-S) architecture.
Building on the previous article, this article starts directly from the entry point where the Leader processes client requests: the handleAppend method of DLedgerServer.
1. Basic process of log replication
Before formally analyzing RocketMQ DLedger's multi-replica log replication, let's first look at the request protocol fields a client uses to send log entries. The meaning of each field is as follows; a Java sketch of these fields follows the list.
- String group: the name of the group (raft group) the cluster belongs to.
- String remoteId: the ID of the destination node of the request.
- String localId: the ID of the node sending the request.
- int code: the response code returned for the request.
- String leaderId = null: the ID of the Leader in the cluster.
- long term: the current election term of the cluster.
- byte[] body: the data to be sent.
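Since the class diagram is not reproduced here, the sketch below lists these fields as a plain Java class for orientation; it only mirrors the list above, and the actual request class in DLedger carries more (getters/setters, serialization support, and so on).

// Sketch only: field names follow the list above; this is not the literal DLedger class.
public class AppendEntryRequestSketch {
    private String group;           // name of the group (raft group) the cluster belongs to
    private String remoteId;        // ID of the destination node of the request
    private String localId;         // ID of the node sending the request
    private int code;               // response code returned for the request
    private String leaderId = null; // ID of the Leader in the cluster
    private long term;              // current election term of the cluster
    private byte[] body;            // data to be sent
}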
The entry point for handling a log append request is the handleAppend method of DLedgerServer.
DLedgerServer#handleAppend
PreConditions.check(memberState.getSelfId().equals(request.getRemoteId()), DLedgerResponseCode.UNKNOWN_MEMBER, "%s != %s", request.getRemoteId(), memberState.getSelfId());
PreConditions.check(memberState.getGroup().equals(request.getGroup()), DLedgerResponseCode.UNKNOWN_GROUP, "%s != %s", request.getGroup(), memberState.getGroup());
PreConditions.check(memberState.isLeader(), DLedgerResponseCode.NOT_LEADER);
Step1: First verify the validity of the request:
- If the requested node ID is not the current processing node, an exception is thrown.
- If the requested cluster is not the cluster of the current node, an exception is thrown.
- If the current node is not the primary node, an exception is thrown.
DLedgerServer#handleAppend
long currTerm = memberState.currTerm();
if (dLedgerEntryPusher.isPendingFull(currTerm)) { // @1
    AppendEntryResponse appendEntryResponse = new AppendEntryResponse();
    appendEntryResponse.setGroup(memberState.getGroup());
    appendEntryResponse.setCode(DLedgerResponseCode.LEADER_PENDING_FULL.getCode());
    appendEntryResponse.setTerm(currTerm);
    appendEntryResponse.setLeaderId(memberState.getSelfId());
    return AppendFuture.newCompletedFuture(-1, appendEntryResponse);
} else { // @2
    DLedgerEntry dLedgerEntry = new DLedgerEntry();
    dLedgerEntry.setBody(request.getBody());
    DLedgerEntry resEntry = dLedgerStore.appendAsLeader(dLedgerEntry);
    return dLedgerEntryPusher.waitAck(resEntry);
}
Step2: If the pending (preprocessing) queue is full, reject the client request and return the LEADER_PENDING_FULL error code; otherwise, wrap the request in a DLedgerEntry, call the appendAsLeader method of dLedgerStore to append the log, then wait synchronously for replication ACKs from the follower nodes via the waitAck method of dLedgerEntryPusher, and finally return the result to the caller.
- Code @1: If the push queue of dLedgerEntryPusher is full, return an append result with the error code LEADER_PENDING_FULL.
- Code @2: Append the message on the Leader and broadcast it to the follower nodes; if no acknowledgment from the followers is received within the specified time, the append is considered failed.
Next, we look at these three key points:
- How to determine if the Push queue is full
- How the Leader node stores data
- How the Leader node waits for replication ACKs from the follower nodes
1.1 How to determine if the Push queue is full
DLedgerEntryPusher#isPendingFull
public boolean isPendingFull(long currTerm) {
    checkTermForPendingMap(currTerm, "isPendingFull"); // @1
    return pendingAppendResponsesByTerm.get(currTerm).size() > dLedgerConfig.getMaxPendingRequestsNum(); // @2
}
There are two main steps:
Code @1: Check whether the current term exists in pendingAppendResponsesByTerm; if not, initialize an entry for it. The structure is Map<Long /* term */, ConcurrentMap>.
Code @2: Check whether the number of responses currently being waited on from the follower nodes exceeds the maximum, controlled by the maxPendingRequestsNum configuration, which defaults to 10000.
The logic above is simple, but where does the data in this ConcurrentMap come from? We will find out as we go further.
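To make the structure concrete, here is a minimal sketch of the pending map and its lazy initialization, based only on the calls shown in this article; the generic types and the helper body are assumptions, not the literal DLedger source.

// Sketch (requires java.util.Map, java.util.concurrent.ConcurrentMap/ConcurrentHashMap):
// one inner map per term, keyed by entry index, holding the future returned to the client.
private final Map<Long /* term */, ConcurrentMap<Long /* entry index */, AppendFuture<AppendEntryResponse>>>
        pendingAppendResponsesByTerm = new ConcurrentHashMap<>();

private void checkTermForPendingMap(long term, String env) {
    // Lazily create the per-term map the first time this term is seen.
    if (!pendingAppendResponsesByTerm.containsKey(term)) {
        logger.info("Initialize the pending append map in {} for term {}", env, term);
        pendingAppendResponsesByTerm.putIfAbsent(term, new ConcurrentHashMap<>());
    }
}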
1.2 Leader node stores data
Data storage on the Leader node is mainly implemented by the appendAsLeader method of DLedgerStore. DLedger provides both a memory-based and a file-based storage implementation; this article focuses on the file-based one, DLedgerMmapFileStore.
The following analyzes the data storage process, whose entry is the appendAsLeader method of DLedgerMmapFileStore.
DLedgerMmapFileStore#appendAsLeader
PreConditions.check(memberState.isLeader(), DLedgerResponseCode.NOT_LEADER);
PreConditions.check(!isDiskFull, DLedgerResponseCode.DISK_FULL);
Step1: First determine whether data can be appended, based on the following two conditions:
- The current node must be in the Leader state; otherwise an exception is thrown.
- The disk must not be full. The disk is considered full when the usage of the DLedger root directory or data file directory exceeds the maximum allowed ratio, which defaults to 85% (a sketch of this usage-ratio check follows the list).
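As an illustration of the usage-ratio check, here is a self-contained sketch; the real check lives in DLedger's storage layer, and only the 85% default mentioned above is taken from the article.

import java.io.File;

// Sketch: decide whether the partition holding a directory is "full" for append purposes.
public final class DiskUsageCheck {

    public static boolean isDiskFull(String path, double maxUsedRatio) {
        File dir = new File(path);
        if (!dir.exists() || dir.getTotalSpace() <= 0) {
            return false;
        }
        double usedRatio = 1.0 - (double) dir.getUsableSpace() / dir.getTotalSpace();
        return usedRatio > maxUsedRatio;
    }

    public static void main(String[] args) {
        // 0.85 mirrors the default threshold mentioned above.
        System.out.println(isDiskFull(System.getProperty("java.io.tmpdir"), 0.85));
    }
}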
ByteBuffer dataBuffer = localEntryBuffer.get();
ByteBuffer indexBuffer = localIndexBuffer.get();
Step2: Obtain a data buffer and an index buffer from thread-local variables. The ByteBuffer used for data has a fixed capacity of 4MB, and the index ByteBuffer has the length of two index entries, fixed at 64 bytes.
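A minimal sketch of how such thread-local buffers can be declared, using the sizes stated above (4MB for data, 64 bytes for two 32-byte index entries); the declarations are illustrative.

// Sketch (requires java.nio.ByteBuffer): one reusable data buffer and one index buffer per thread.
private static final ThreadLocal<ByteBuffer> localEntryBuffer =
        ThreadLocal.withInitial(() -> ByteBuffer.allocate(4 * 1024 * 1024)); // 4MB for the entry
private static final ThreadLocal<ByteBuffer> localIndexBuffer =
        ThreadLocal.withInitial(() -> ByteBuffer.allocate(64)); // two 32-byte index units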
DLedgerEntryCoder.encode(entry, dataBuffer);

public static void encode(DLedgerEntry entry, ByteBuffer byteBuffer) {
    byteBuffer.clear();
    int size = entry.computSizeInBytes();
    //always put magic on the first position
    byteBuffer.putInt(entry.getMagic());
    byteBuffer.putInt(size);
    byteBuffer.putLong(entry.getIndex());
    byteBuffer.putLong(entry.getTerm());
    byteBuffer.putLong(entry.getPos());
    byteBuffer.putInt(entry.getChannel());
    byteBuffer.putInt(entry.getChainCrc());
    byteBuffer.putInt(entry.getBodyCrc());
    byteBuffer.putInt(entry.getBody().length);
    byteBuffer.put(entry.getBody());
    byteBuffer.flip();
}
Step3: Write the DLedgerEntry into the ByteBuffer. Note that every write first calls ByteBuffer's clear method to reset the buffer, which shows that each append can hold at most 4MB of data.
DLedgerMmapFileStore#appendAsLeader
synchronized (memberState) {
    PreConditions.check(memberState.isLeader(), DLedgerResponseCode.NOT_LEADER, null);
    // ... omit code
}
Step4: Lock the state machine and check again that the node is still the Leader.
DLedgerMmapFileStore#appendAsLeader
long nextIndex = ledgerEndIndex + 1;
entry.setIndex(nextIndex);
entry.setTerm(memberState.currTerm());
entry.setMagic(CURRENT_MAGIC);
DLedgerEntryCoder.setIndexTerm(dataBuffer, nextIndex, memberState.currTerm(), CURRENT_MAGIC);
Step5: Set the entryIndex and entryTerm (election term) of the current log entry, then write the magic number, entryIndex and entryTerm into the ByteBuffer.
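setIndexTerm overwrites fields that encode() has already written, so the entry does not have to be re-encoded. The sketch below follows the buffer layout shown above (magic, size, index, term, ...); treat the offsets as an assumption rather than the literal DLedgerEntryCoder source.

// Sketch: rewrite magic, index and term in place; the buffer position is restored afterwards.
public static void setIndexTerm(ByteBuffer byteBuffer, long index, long term, int magic) {
    byteBuffer.mark();                              // remember the current read position
    byteBuffer.putInt(magic);                       // magic number at offset 0
    byteBuffer.position(byteBuffer.position() + 4); // skip the 4-byte size field
    byteBuffer.putLong(index);                      // overwrite the entry index
    byteBuffer.putLong(term);                       // overwrite the election term
    byteBuffer.reset();                             // restore the original position
}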
DLedgerMmapFileStore#appendAsLeader
long prePos = dataFileList.preAppend(dataBuffer.remaining());
entry.setPos(prePos);
PreConditions.check(prePos != -1, DLedgerResponseCode.DISK_ERROR, null);
DLedgerEntryCoder.setPos(dataBuffer, prePos);
Step6: Calculate the starting offset of the new entry (the preAppend implementation of dataFileList is analyzed below), and write this offset into the log's ByteBuffer.
DLedgerMmapFileStore#appendAsLeader
for (AppendHook writeHook : appendHooks) {
    writeHook.doHook(entry, dataBuffer.slice(), DLedgerEntry.BODY_OFFSET);
}
Step7: Execute the hook function.
DLedgerMmapFileStore#appendAsLeader
long dataPos = dataFileList.append(dataBuffer.array(), 0, dataBuffer.remaining());
PreConditions.check(dataPos != -1, DLedgerResponseCode.DISK_ERROR, null);
PreConditions.check(dataPos == prePos, DLedgerResponseCode.DISK_ERROR, null);
Step8: Append the data to the page cache. This method is described in more detail later.
DLedgerMmapFileStore#appendAsLeader
DLedgerEntryCoder.encodeIndex(dataPos, entrySize, CURRENT_MAGIC, nextIndex, memberState.currTerm(), indexBuffer);
long indexPos = indexFileList.append(indexBuffer.array(), 0, indexBuffer.remaining(), false);
PreConditions.check(indexPos == entry.getIndex() * INDEX_UNIT_SIZE, DLedgerResponseCode.DISK_ERROR, null);
Step9: Build an entry index and append the index data to the pagecache.
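For reference, here is a sketch of what encodeIndex writes, consistent with the 32-byte index unit (INDEX_UNIT_SIZE) implied by the check above: magic, physical position, entry size, entry index and term. The exact field order is an assumption.

// Sketch of one 32-byte index unit (4 + 8 + 4 + 8 + 8 bytes); requires java.nio.ByteBuffer.
public static void encodeIndex(long pos, int size, int magic, long index, long term, ByteBuffer byteBuffer) {
    byteBuffer.clear();
    byteBuffer.putInt(magic);  // 4 bytes: magic number
    byteBuffer.putLong(pos);   // 8 bytes: physical offset of the entry in the data file
    byteBuffer.putInt(size);   // 4 bytes: total size of the entry
    byteBuffer.putLong(index); // 8 bytes: entry index
    byteBuffer.putLong(term);  // 8 bytes: election term
    byteBuffer.flip();
}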
DLedgerMmapFileStore#appendAsLeader
ledgerEndIndex++;
ledgerEndTerm = memberState.currTerm();
if (ledgerBeginIndex == -1) {
    ledgerBeginIndex = ledgerEndIndex;
}
updateLedgerEndIndexAndTerm();
Step10: Increment ledgerEndIndex (the sequence number of the next entry) and set the ledgerEndIndex and ledgerEndTerm of the Leader node's state machine.
Data appending on the Leader node ends here; the implementation details of the storage-related methods are analyzed later in this article.
1.3 Leader node waits for replication ACKs from follower nodes
Its implementation entry is the waitAck method of dLedgerEntryPusher.
DLedgerEntryPusher#waitAck
public CompletableFuture<AppendEntryResponse> waitAck(DLedgerEntry entry) {
    updatePeerWaterMark(entry.getTerm(), memberState.getSelfId(), entry.getIndex()); // @1
    if (memberState.getPeerMap().size() == 1) { // @2
        AppendEntryResponse response = new AppendEntryResponse();
        response.setGroup(memberState.getGroup());
        response.setLeaderId(memberState.getSelfId());
        response.setIndex(entry.getIndex());
        response.setTerm(entry.getTerm());
        response.setPos(entry.getPos());
        return AppendFuture.newCompletedFuture(entry.getPos(), response);
    } else {
        checkTermForPendingMap(entry.getTerm(), "waitAck");
        AppendFuture<AppendEntryResponse> future = new AppendFuture<>(dLedgerConfig.getMaxWaitAckTimeMs()); // @3
        future.setPos(entry.getPos());
        CompletableFuture<AppendEntryResponse> old = pendingAppendResponsesByTerm.get(entry.getTerm()).put(entry.getIndex(), future); // @4
        if (old != null) {
            logger.warn("[MONITOR] get old wait at index={}", entry.getIndex());
        }
        wakeUpDispatchers(); // @5
        return future;
    }
}
- Code @1: Update the push watermark (replication progress) of the current node.
- Code @2: If the cluster contains only one node, no forwarding is needed and a successful result is returned directly.
- Code @3: Build an AppendFuture for the append response and set its timeout, which defaults to 2500 ms and can be changed via the maxWaitAckTimeMs configuration.
- Code @4: Put the newly built future into the pending result map.
- Code @5: Wake up the entry-forwarding threads, which push data from the Leader node to the follower nodes.
Next, the key points are explained separately.
1.3.1 updatePeerWaterMark method
DLedgerEntryPusher#updatePeerWaterMark
private void updatePeerWaterMark(long term, String peerId, long index) { // Code @1
    synchronized (peerWaterMarksByTerm) {
        checkTermForWaterMark(term, "updatePeerWaterMark"); // Code @2
        if (peerWaterMarksByTerm.get(term).get(peerId) < index) { // Code @3
            peerWaterMarksByTerm.get(term).put(peerId, index);
        }
    }
}
Code @1: Start with a brief description of the three parameters of this method:
- long term: the current election term.
- String peerId: the ID of the current node (here the Leader itself, since waitAck passes memberState.getSelfId()).
- long index: the index (sequence number) of the entry just appended.
Code @2: Initialize the peerWaterMarksByTerm data structure; its layout is Map<Long /* term */, Map<String /* peerId */, Long /* entry index */>> (a sketch of this structure follows below).
Code @3: If the index stored in peerWaterMarksByTerm for this peer is smaller than the index of the current entry, update it.
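A minimal sketch of the water-mark structure and its lazy initialization, following the layout described above; the helper body (including seeding each peer with -1) is illustrative, not the literal source.

// Sketch: replication progress per term and per peer (requires java.util.Map and
// java.util.concurrent.ConcurrentMap/ConcurrentHashMap).
private final Map<Long /* term */, ConcurrentMap<String /* peer id */, Long /* entry index */>>
        peerWaterMarksByTerm = new ConcurrentHashMap<>();

private void checkTermForWaterMark(long term, String env) {
    // Lazily create the per-term map, seeding every known peer with -1 (nothing replicated yet).
    if (!peerWaterMarksByTerm.containsKey(term)) {
        logger.info("Initialize the water mark in {} for term {}", env, term);
        ConcurrentMap<String, Long> waterMarks = new ConcurrentHashMap<>();
        for (String peer : memberState.getPeerMap().keySet()) {
            waterMarks.put(peer, -1L);
        }
        peerWaterMarksByTerm.putIfAbsent(term, waterMarks);
    }
}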
1.3.2 wakeUpDispatchers Details
DLedgerEntryPusher#wakeUpDispatchers
public void wakeUpDispatchers() {
    for (EntryDispatcher dispatcher : dispatcherMap.values()) {
        dispatcher.wakeup();
    }
}
This method simply traverses the dispatchers (forwarders) and wakes each one up. The key class here is EntryDispatcher, so before going into detail, let's look at how this collection is initialized.
DLedgerEntryPusher constructor
for (String peer : memberState.getPeerMap().keySet()) {
    if (!peer.equals(memberState.getSelfId())) {
        dispatcherMap.put(peer, new EntryDispatcher(peer, logger));
    }
}
That is, when the DLedgerEntryPusher is constructed, an EntryDispatcher object is created for every node in the cluster other than the current node, i.e. one per follower.
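The wakeup() call relies on a typical service-thread wait/notify pattern: the dispatcher loop parks in a waitForRunning-style method between push batches, and wakeup() releases it immediately instead of letting it sleep for the full interval. The class below is only an illustrative sketch of that pattern, not DLedger's actual base class.

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative sketch of the wait/wakeup idea behind dispatcher.wakeup().
public abstract class SketchServiceThread extends Thread {
    private final AtomicBoolean hasNotified = new AtomicBoolean(false);
    private volatile CountDownLatch waitPoint = new CountDownLatch(1);

    public void wakeup() {
        if (hasNotified.compareAndSet(false, true)) {
            waitPoint.countDown(); // release a parked dispatcher loop right away
        }
    }

    protected void waitForRunning(long intervalMs) throws InterruptedException {
        if (hasNotified.compareAndSet(true, false)) {
            return; // a wakeup arrived since the last iteration, skip waiting
        }
        waitPoint = new CountDownLatch(1);
        waitPoint.await(intervalMs, TimeUnit.MILLISECONDS); // park until wakeup() or timeout
        hasNotified.set(false);
    }
}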
Clearly, log replication is driven by DLedgerEntryPusher. For space reasons, that part will be continued in the next article.
The storage-related implementation was not analyzed in detail in the Leader append flow above. To complete the picture, let's now analyze its core methods.
2. Log Storage Implementation Details
This section focuses on the preAppend and append methods of the MmapFileList.
The design of the storage module can be found in my earlier article, Source Code Analysis RocketMQ DLedger Multi-copy Storage Implementation; MmapFileList in DLedger plays the same role as MappedFileQueue in RocketMQ.
2.1 MmapFileList preAppend Details
This method eventually calls the preAppend method with two parameters, so let's look directly at the preAppend method with two parameters.
MmapFileList#preAppend
public long preAppend(int len, boolean useBlank) { // @1
    MmapFile mappedFile = getLastMappedFile(); // @2 start
    if (null == mappedFile || mappedFile.isFull()) {
        mappedFile = getLastMappedFile(0);
    }
    if (null == mappedFile) {
        logger.error("Create mapped file for {}", storePath);
        return -1;
    } // @2 end
    int blank = useBlank ? MIN_BLANK_LEN : 0;
    if (len + blank > mappedFile.getFileSize() - mappedFile.getWrotePosition()) { // @3
        if (blank < MIN_BLANK_LEN) {
            logger.error("Blank {} should ge {}", blank, MIN_BLANK_LEN);
            return -1;
        } else {
            ByteBuffer byteBuffer = ByteBuffer.allocate(mappedFile.getFileSize() - mappedFile.getWrotePosition()); // @4
            byteBuffer.putInt(BLANK_MAGIC_CODE); // @5
            byteBuffer.putInt(mappedFile.getFileSize() - mappedFile.getWrotePosition()); // @6
            if (mappedFile.appendMessage(byteBuffer.array())) { // @7
                //need to set the wrote position
                mappedFile.setWrotePosition(mappedFile.getFileSize());
            } else {
                logger.error("Append blank error for {}", storePath);
                return -1;
            }
            mappedFile = getLastMappedFile(0);
            if (null == mappedFile) {
                logger.error("Create mapped file for {}", storePath);
                return -1;
            }
        }
    }
    return mappedFile.getFileFromOffset() + mappedFile.getWrotePosition(); // @8
}
Code @1: First, the meaning of the parameters:
- int len: the length (in bytes) requested for the write.
- boolean useBlank: whether to pad the remainder of the file with a blank entry if needed; defaults to true.
Code @2: Get the last file, which is the file you are currently writing.
Code @3: Handle the case where the requested space exceeds the remaining writable space of the current file; code @4-@7 is the handling logic.
Code @4: Allocate a ByteBuffer whose size equals the remaining bytes of the current file.
Code @5: Write the magic number first.
Code @6: Then write the size, equal to the total remaining size of the current file.
Code @7: Write the blank bytes. The purpose of code @4-@7 is to fill the rest of the file with a blank entry carrying the magic number and size, so that it can be parsed easily.
Code @8: Return the starting physical offset at which the log will be written: in the current file if there was enough room, otherwise at the beginning of the newly created file.
From the code above, it is easy to see that this method returns the starting physical offset at which the log will be written.
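To make the roll-over rule concrete, here is a small self-contained sketch of the decision preAppend makes at code @3, assuming MIN_BLANK_LEN is 8 bytes (a 4-byte magic number plus a 4-byte size for the blank entry).

// Sketch: should the rest of the current file be padded and a new file started?
public final class PreAppendSketch {
    static final int MIN_BLANK_LEN = 8; // assumption: 4-byte magic + 4-byte size

    static boolean needsRollOver(int len, boolean useBlank, int fileSize, int wrotePosition) {
        int blank = useBlank ? MIN_BLANK_LEN : 0;
        // If the entry plus the reserved blank header does not fit into the remaining space,
        // the remainder is filled with a blank entry and writing continues in a new file.
        return len + blank > fileSize - wrotePosition;
    }

    public static void main(String[] args) {
        // A 1 GB file with only 824 bytes left cannot hold a 900-byte entry: roll over.
        int fileSize = 1024 * 1024 * 1024;
        System.out.println(needsRollOver(900, true, fileSize, fileSize - 824)); // true
    }
}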
2.2 MmapFileList append Details
The append method with four parameters is eventually invoked; its code is as follows:
MmapFileList#append
public long append(byte[] data, int pos, int len, boolean useBlank) { // @1
    if (preAppend(len, useBlank) == -1) {
        return -1;
    }
    MmapFile mappedFile = getLastMappedFile(); // @2
    long currPosition = mappedFile.getFileFromOffset() + mappedFile.getWrotePosition(); // @3
    if (!mappedFile.appendMessage(data, pos, len)) { // @4
        logger.error("Append error for {}", storePath);
        return -1;
    }
    return currPosition;
}
Code @1: First, the parameters:
- byte[] data: the data to be written, that is, the log entry to be appended.
- int pos: the position in the data byte array from which to start reading.
- int len: the number of bytes to be written.
- boolean useBlank: whether to use blank padding; defaults to true.
Code @2: Get the last file, which is the current writable file.
Code @3: Get the current write pointer.
Code @4: Append message.
Finally, let's look at appendMessage, the specific message append implementation logic.
DefaultMmapFile#appendMessage
public boolean appendMessage(final byte[] data, final int offset, final int length) {
    int currentPos = this.wrotePosition.get();
    if ((currentPos + length) <= this.fileSize) {
        ByteBuffer byteBuffer = this.mappedByteBuffer.slice(); // @1
        byteBuffer.position(currentPos);
        byteBuffer.put(data, offset, length);
        this.wrotePosition.addAndGet(length);
        return true;
    }
    return false;
}
The point worth highlighting in this method is that the write goes through mappedByteBuffer, which is created by FileChannel's map method and is what we usually refer to as the page cache: the appended message is first written to the page cache.
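For completeness, here is a self-contained sketch of creating a MappedByteBuffer through FileChannel's map method and writing through it; the file path and size are illustrative only.

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;

// Sketch: writes through a MappedByteBuffer land in the page cache first.
public final class MmapSketch {
    public static void main(String[] args) throws IOException {
        int fileSize = 1024 * 1024; // illustrative file size
        try (RandomAccessFile raf = new RandomAccessFile("mmap-demo.data", "rw")) {
            MappedByteBuffer mappedByteBuffer =
                    raf.getChannel().map(FileChannel.MapMode.READ_WRITE, 0, fileSize);
            // Appending uses a slice positioned at the current write pointer, as in appendMessage.
            ByteBuffer slice = mappedByteBuffer.slice();
            slice.position(0);
            slice.put("hello dledger".getBytes(StandardCharsets.UTF_8));
            mappedByteBuffer.force(); // an explicit force() flushes the page cache to disk
        }
    }
}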
This article has detailed the first two steps the Leader node takes when processing a client append request: determining whether the push queue is full and storing the message on the Leader node. Due to length constraints, data synchronization between nodes will be detailed in the next article.
Before moving on to the next article, let's think about the following questions:
- If the Leader appends successfully (the data is written to the page cache), but synchronization to the follower nodes fails, or the Leader goes down, how does the cluster keep its data consistent?
Recommended reading: other articles in the RocketMQ DLedger source analysis series.
1. RocketMQ Multi-Replica Preamble: A Preliminary Exploration of the Raft Protocol
2. Source Code Analysis RocketMQ DLedger Multi-Replica Leader Election
3. Source Code Analysis RocketMQ DLedger Multi-copy Storage Implementation
Originally published on 2019-09-15.
Author: Dingwei, author of Inside RocketMQ Technology.
This article comes from Middleware Circle of Interest; follow Middleware Circle of Interest to learn more.