X. namenode working mechanism of HDFS


I. fsimage and edit files

1. Basic concepts

txid:
The namenode assigns a unique id, called a txid, to every operation event (add, delete, and change operations). txids increase monotonically: each new operation gets the previous txid plus 1.

fsimage:
It is the on-disk image of the metadata that the namenode keeps in memory. An fsimage usually does not include the most recent operation events, so it generally lags behind the metadata in memory. It is not an operation log: it holds the serialized inodes of all directories and files of the HDFS file system. Files are named fsimage_txid, where the trailing txid is the txid of the latest operation event recorded in that fsimage.

edits:
It is the operation log that records every add, delete, and change operation handled by the namenode. If you know MySQL, it is similar to MySQL's binary log.

2. Directory structure of namenode

[root@bigdata121 tmp]# tree dfs/name
dfs/name
├── current
│   ├── edits_0000000000000000001-0000000000000000002
│   ├── edits_0000000000000000003-0000000000000000004
│   ├── edits_0000000000000000005-0000000000000000006
│   ├── edits_0000000000000000007-0000000000000000008
│   ├── edits_0000000000000000009-0000000000000000009
│   ├── edits_0000000000000000010-0000000000000000011
│   ├── edits_0000000000000000012-0000000000000000013
│   ├── edits_0000000000000000014-0000000000000000015
│   ├── edits_0000000000000000016-0000000000000000017
│   ├── edits_0000000000000000018-0000000000000000019
│   ├── edits_0000000000000000020-0000000000000000021
│   ├── edits_0000000000000000022-0000000000000000024
│   ├── edits_0000000000000000025-0000000000000000026
│   ├── edits_inprogress_0000000000000000027
│   ├── fsimage_0000000000000000024
│   ├── fsimage_0000000000000000024.md5
│   ├── fsimage_0000000000000000026
│   ├── fsimage_0000000000000000026.md5
│   ├── seen_txid
│   └── VERSION
└── in_use.lock

Simplified, the structure looks like this:

dfs/name
├── current
│   ├── edits_txid1-txid2                 Finalized edits files produced by rolling; there may be many
│   ├── edits_inprogress_txid3            The edits file currently in use
│   ├── fsimage_0000000000000000024       fsimage file
│   ├── fsimage_0000000000000000024.md5   MD5 checksum of the fsimage file
│   ├── seen_txid                         Records the latest txid
│   └── VERSION                           Basic identifying information about the HDFS cluster
└── in_use.lock                           Lock file that prevents several namenodes from being started on this directory

(1) Contents of the VERSION file

# Unique identifier of the file system namespace. With federation, HDFS can have several namenodes; each has its own namespaceID and manages its own block pool.
namespaceID=983105879
# Cluster ID, globally unique
clusterID=CID-c12b7022-0c51-49c5-942f-edc889d37fee
# Creation time of the namenode's storage directory. Older descriptions state that a freshly formatted directory has cTime=0; in this example it holds a millisecond timestamp (the same value appears at the end of blockpoolID). After a file system upgrade, the value is updated to a new timestamp.
cTime=1558262787574
# Marks whether the storage directory belongs to a namenode or a datanode
storageType=NAME_NODE
# A block pool ID identifies a block pool and is globally unique across clusters. It is generated when a new namespace is created (as part of formatting), which is more reliable than configuring it by hand. The NN persists the blockpoolID to disk and loads it again on every later startup.
blockpoolID=BP-473222668-192.168.50.121-1558262787574
# A negative integer that identifies the version of the on-disk layout of HDFS's persistent data structures. It is unrelated to the Hadoop release number.
layoutVersion=-63
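
Since VERSION is a plain key=value file, its fields are easy to read from a script. Below is a minimal sketch in Python; the function name parse_version_file is made up for this illustration and is not part of Hadoop, and the sample text is simply the file shown above.

```python
def parse_version_file(text):
    """Parse key=value lines into a dict, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


# Sample content copied from the VERSION file above.
sample = """\
namespaceID=983105879
clusterID=CID-c12b7022-0c51-49c5-942f-edc889d37fee
cTime=1558262787574
storageType=NAME_NODE
blockpoolID=BP-473222668-192.168.50.121-1558262787574
layoutVersion=-63
"""

props = parse_version_file(sample)
print(props["storageType"], int(props["layoutVersion"]))
```

On a real node you would pass in the contents of dfs/name/current/VERSION instead of the sample string.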

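(2) seen_txid
This file records the latest txid, written at the most recent edit log roll or checkpoint. The namenode reads it at startup to check how far the edits must be replayed.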

(3) The directory structure of the SNN (SecondaryNameNode) is the same as that of the namenode, except that the most recent edits files are missing.

3. How the names of fsimage and edits files relate

The names of the fsimage and edits files above all end in a long string of digits. What is it? It is the txid. The two naming schemes follow a few rules.

edits file:
All finalized edits files are named edits_txid1-txid2, which gives the range of txids of the operation events recorded in that file. edits_inprogress_txid is the edits file currently in use; its txid is the first operation event recorded in it, and newer events are appended as they arrive.

fsimage file:
Named fsimage_txid, where txid is the latest operation event recorded in that fsimage file. Note that edits are merged into an fsimage only when a checkpoint is triggered; until then they stay unmerged. So in general the txid of the newest edits file is larger than that of the newest fsimage.
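
These naming rules are enough to work out, from file names alone, which edits files contain operations newer than the latest fsimage. The following Python sketch illustrates the idea; the function name plan_replay and the regular expressions are assumptions made for this example, not Hadoop code.

```python
import re

EDITS_RE = re.compile(r"^edits_(\d+)-(\d+)$")
INPROGRESS_RE = re.compile(r"^edits_inprogress_(\d+)$")
FSIMAGE_RE = re.compile(r"^fsimage_(\d+)$")


def plan_replay(filenames):
    """Return (txid of the newest fsimage, edits files holding newer txids)."""
    image_txids = []
    for name in filenames:
        match = FSIMAGE_RE.match(name)
        if match:
            image_txids.append(int(match.group(1)))
    latest_image = max(image_txids)

    to_replay = []
    # txids are zero-padded, so sorting the names also sorts by txid.
    for name in sorted(filenames):
        finalized = EDITS_RE.match(name)
        if finalized and int(finalized.group(2)) > latest_image:
            to_replay.append(name)
        elif INPROGRESS_RE.match(name):
            to_replay.append(name)
    return latest_image, to_replay


# A shortened, hypothetical directory listing in the same style as above.
files = [
    "edits_0000000000000000022-0000000000000000024",
    "edits_0000000000000000025-0000000000000000026",
    "edits_inprogress_0000000000000000027",
    "fsimage_0000000000000000024",
    "fsimage_0000000000000000024.md5",
    "seen_txid",
    "VERSION",
]
latest, replay = plan_replay(files)
print(latest, replay)
```

With fsimage_...024 as the newest image, the segment 22-24 is already covered, so only the segment 25-26 and the in-progress file remain to be replayed.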

4. View the contents of fsimage file

//Format: hdfs oiv -p <output format> -i <input file> -o <output file>
[root@bigdata121 current]# hdfs oiv -p XML -i fsimage_0000000000000000037 -o /tmp/fsimage37.xml

As mentioned earlier, fsimage mainly records metadata: the directory structure stored in HDFS, the files under each directory, and the meta information of those directories and files. Here is part of the output:

<fsimage>
    <version>
        <layoutVersion>-63</layoutVersion>
        <onDiskVersion>1</onDiskVersion>
        <oivRevision>17e75c2a11685af3e043aa5e604dc831e5b14674</oivRevision>
    </version>
    <NameSection>
        <namespaceId>983105879</namespaceId>
        <genstampV1>1000</genstampV1>
        <genstampV2>1014</genstampV2>
        <genstampV1Limit>0</genstampV1Limit>
        <lastAllocatedBlockId>1073741837</lastAllocatedBlockId>
        <txid>334</txid>
    </NameSection>
    <INodeSection>
        <lastInodeId>16407</lastInodeId>
        <numInodes>16</numInodes>

        <!-- From here on, focus on the directory structure and meta information. -->
        <inode>
            <id>16386</id>
            <type>DIRECTORY</type>    <!-- A directory; its name is test -->
            <name>test</name>
            <mtime>1558263065070</mtime>  <!-- Modification time -->
            <permission>root:supergroup:0755</permission> <!-- Permissions -->
            <nsquota>-1</nsquota>
            <dsquota>-1</dsquota>
        </inode>
        <inode>
            <id>16387</id>    <!-- A file named edit_new.xml -->
            <type>FILE</type>
            <name>edit_new.xml</name>
            <replication>2</replication>
            <mtime>1558263065045</mtime>
            <atime>1558269494520</atime>
            <preferredBlockSize>134217728</preferredBlockSize>
            <permission>root:supergroup:0644</permission>  <!-- Permissions -->
            <blocks>   <!-- Block information for the file -->
                <block>
                    <id>1073741825</id>
                    <genstamp>1001</genstamp>
                    <numBytes>580</numBytes>
                </block>
            </blocks>
            <storagePolicyId>0</storagePolicyId>
        </inode>
    </INodeSection>
</fsimage>

The fsimage output above records the directory structure of the current file system and the corresponding meta information. The edits file, in contrast, records the operations performed on the file system.
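
If you want to pull the directory and file entries out of such a dump with a script, the XML can be read with Python's standard library. This is only a sketch: it assumes a well-formed XML file in the layout shown above, the sample string is a trimmed copy of that output, and list_inodes is a name invented for this example.

```python
import xml.etree.ElementTree as ET


def list_inodes(xml_text):
    """Return one dict per <inode> element found in an fsimage XML dump."""
    root = ET.fromstring(xml_text)
    inodes = []
    for inode in root.iter("inode"):
        inodes.append({
            "id": int(inode.findtext("id")),
            "type": inode.findtext("type"),
            "name": inode.findtext("name"),
            "permission": inode.findtext("permission"),
            # Directories have no <block> children, so this list stays empty.
            "blocks": [int(b.findtext("id")) for b in inode.iter("block")],
        })
    return inodes


# Trimmed sample in the same layout as the dump above.
sample = """
<fsimage>
  <INodeSection>
    <inode>
      <id>16386</id>
      <type>DIRECTORY</type>
      <name>test</name>
      <permission>root:supergroup:0755</permission>
    </inode>
    <inode>
      <id>16387</id>
      <type>FILE</type>
      <name>edit_new.xml</name>
      <permission>root:supergroup:0644</permission>
      <blocks>
        <block>
          <id>1073741825</id>
          <genstamp>1001</genstamp>
          <numBytes>580</numBytes>
        </block>
      </blocks>
    </inode>
  </INodeSection>
</fsimage>
"""

inodes = list_inodes(sample)
for entry in inodes:
    print(entry["type"], entry["name"], entry["blocks"])
```

For a real dump, read the file produced by hdfs oiv (for example /tmp/fsimage37.xml) and pass its contents to the function.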

5. View the contents of the edits file

//Format: hdfs oev -p <output format, XML by default> -i <input file> -o <output file>
[root@bigdata121 current]# hdfs oev -i edits_inprogress_0000000000000000038 -o /tmp/edits_inprogess.xml

Again, here is an excerpt of the output:

<?xml version="1.0" encoding="UTF-8"?>
<EDITS>
  <EDITS_VERSION>-63</EDITS_VERSION>
  <RECORD>
    <OPCODE>OP_START_LOG_SEGMENT</OPCODE>   <!-- Operation type; here, the start of a log segment -->
    <DATA>
      <TXID>38</TXID>  <!-- The transaction id (txid); it is unique -->
    </DATA>
  </RECORD>
</EDITS>

Each RECORD describes one operation. For example:
  <RECORD>
    <OPCODE>OP_ADD_BLOCK</OPCODE>     <!-- A block is added to a file; this appears while a file is being uploaded -->
    <DATA>
      <TXID>34</TXID>
      <PATH>/jdk-8u144-linux-x64.tar.gz._COPYING_</PATH>
      <BLOCK>
        <BLOCK_ID>1073741825</BLOCK_ID>
        <NUM_BYTES>134217728</NUM_BYTES>
        <GENSTAMP>1001</GENSTAMP>
      </BLOCK>
      <BLOCK>
        <BLOCK_ID>1073741826</BLOCK_ID>
        <NUM_BYTES>0</NUM_BYTES>
        <GENSTAMP>1002</GENSTAMP>
      </BLOCK>
      <RPC_CLIENTID></RPC_CLIENTID>
      <RPC_CALLID>-2</RPC_CALLID>
    </DATA>
  </RECORD>

An OP_ADD record, for example, represents adding a file. Such a record generally contains useful information such as:
File path (PATH)
Modification time (MTIME)
Access time (ATIME)
Client name (CLIENT_NAME)
Client machine (CLIENT_MACHINE)
Permissions (PERMISSION_STATUS)
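
To get a quick overview of what an edits file contains, the XML produced by hdfs oev can be summarized with a few lines of Python. The sketch below counts the operation types and reports the txid range; summarize_edits is a made-up helper name, and the sample records are illustrative rather than taken from a real cluster.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def summarize_edits(xml_text):
    """Return (operation counts, (smallest txid, largest txid)) for an edits XML dump."""
    root = ET.fromstring(xml_text)
    opcodes = Counter()
    txids = []
    for record in root.iter("RECORD"):
        opcodes[record.findtext("OPCODE")] += 1
        txids.append(int(record.findtext("DATA/TXID")))
    return opcodes, (min(txids), max(txids))


# Illustrative sample in the same layout as the oev output above.
sample = """
<EDITS>
  <EDITS_VERSION>-63</EDITS_VERSION>
  <RECORD>
    <OPCODE>OP_START_LOG_SEGMENT</OPCODE>
    <DATA>
      <TXID>38</TXID>
    </DATA>
  </RECORD>
  <RECORD>
    <OPCODE>OP_ADD_BLOCK</OPCODE>
    <DATA>
      <TXID>39</TXID>
      <PATH>/jdk-8u144-linux-x64.tar.gz._COPYING_</PATH>
    </DATA>
  </RECORD>
</EDITS>
"""

counts, txid_range = summarize_edits(sample)
print(dict(counts), txid_range)
```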

6. Manually roll the edit log

Format: hdfs dfsadmin -rollEdits

7. Configuration of NN,SNN,DN data directory

(1) hadoop.tmp.dir is configured

If hadoop.tmp.dir is configured in core-site.xml, the data directories are as follows:

NN:  ${hadoop.tmp.dir}/dfs/name            fsimage and edits files are stored in this directory
SNN: ${hadoop.tmp.dir}/dfs/namesecondary   data directory of the SNN
DN:  ${hadoop.tmp.dir}/dfs/data            data directory of the datanode

(2) Setting the directories separately

If hadoop.tmp.dir is not set, the data directories of NN, SNN, and DN should be set manually. Otherwise the data files are created under /tmp/hadoop-root/dfs/ (that is, /tmp/hadoop-<user name>/dfs/, here for the user root). The parameters are:

/* All of these are set in hdfs-site.xml */
// If only dfs.namenode.name.dir is set, fsimage and edits files are both stored in that directory.
NN:  dfs.namenode.name.dir         storage path of fsimage
     dfs.namenode.edits.dir        storage path of edits

DN:  dfs.datanode.data.dir         storage directory of the datanode
SNN: dfs.namenode.checkpoint.dir   storage directory of the SNN
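
How these settings fall back on each other can be sketched in a few lines of Python. The fallback values below follow the defaults in core-default.xml and hdfs-default.xml as I understand them; resolve_dirs and its user argument are hypothetical names for this illustration, and Hadoop itself resolves ${...} placeholders in its own Configuration class.

```python
def resolve_dirs(conf, user="root"):
    """Return the effective data directories for a dict of explicitly set properties."""
    tmp = conf.get("hadoop.tmp.dir", f"/tmp/hadoop-{user}")
    name_dir = conf.get("dfs.namenode.name.dir", f"file://{tmp}/dfs/name")
    return {
        "dfs.namenode.name.dir": name_dir,
        # edits fall back to the fsimage directory when not set separately
        "dfs.namenode.edits.dir": conf.get("dfs.namenode.edits.dir", name_dir),
        "dfs.datanode.data.dir": conf.get(
            "dfs.datanode.data.dir", f"file://{tmp}/dfs/data"),
        "dfs.namenode.checkpoint.dir": conf.get(
            "dfs.namenode.checkpoint.dir", f"file://{tmp}/dfs/namesecondary"),
    }


# Nothing configured: everything ends up under /tmp/hadoop-root/dfs/.
defaults = resolve_dirs({})
# Only the fsimage directory is set (hypothetical path): edits follow it.
custom = resolve_dirs({"dfs.namenode.name.dir": "file:///data/nn"})
print(defaults["dfs.namenode.name.dir"])
print(custom["dfs.namenode.edits.dir"])
```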

(3) Setting multiple namenode directories

When the working directory of the namenode is set separately, dfs.namenode.name.dir can be given several values separated by commas. hdfs namenode -format then formats every listed directory, and during operation all of them hold identical contents. This can serve as an additional backup of the namenode data. For example:

<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:///${hadoop.tmp.dir}/dfs/name1,file:///${hadoop.tmp.dir}/dfs/name2</value>
</property>

II. Workflow of namenode and SNN

1. namenode startup stage

(1) When the namenode is started for the first time (that is, right after it has been formatted), the fsimage and edits files are created automatically. On later starts, the namenode loads the latest fsimage into memory and then replays the edits that follow it, up to the latest operation event (determined from the txid recorded in the seen_txid file), so that memory ends up holding the latest meta information. It then creates a new edits file, named edits_inprogress_txid, to record subsequent operations.
(2) A client sends an add, delete, or change request to the namenode.
(3) The namenode responds to the request.
(4) When the namenode handles an add, delete, or change operation, it first writes the operation to the edits file and modifies the metadata in memory only after that write has succeeded. This guarantees that the latest operation is always on persistent storage such as disk, so operation records are not lost by accident.
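
The "write the log first, then change memory" rule and the replay at startup can be illustrated with a toy model. This is a deliberately simplified sketch in Python, not how the namenode is implemented: a list stands in for the edits file, a dict for the in-memory metadata, and the class and method names are invented for the example.

```python
class ToyNameNode:
    """Toy model: a list stands in for the edits file, a dict for memory."""

    def __init__(self, image=None, image_txid=0):
        self.namespace = dict(image or {})  # metadata held in memory
        self.edits = []                     # stand-in for the on-disk edits file
        self.last_txid = image_txid

    @staticmethod
    def apply(namespace, op, path):
        if op == "create":
            namespace[path] = {"type": "FILE"}
        elif op == "delete":
            namespace.pop(path, None)

    def handle(self, op, path):
        if op not in ("create", "delete"):
            raise ValueError(f"unknown operation: {op}")
        txid = self.last_txid + 1
        self.edits.append((txid, op, path))   # 1. record the operation first
        self.apply(self.namespace, op, path)  # 2. only then change memory
        self.last_txid = txid
        return txid

    @classmethod
    def restart(cls, image, image_txid, edits):
        """Load an image, then replay every logged operation newer than it."""
        nn = cls(image, image_txid)
        for txid, op, path in edits:
            if txid > image_txid:
                cls.apply(nn.namespace, op, path)
                nn.last_txid = txid
        return nn


nn = ToyNameNode()
nn.handle("create", "/a")   # txid 1
nn.handle("create", "/b")   # txid 2
nn.handle("delete", "/a")   # txid 3

# Restart from an image taken after txid 1, replaying txid 2 and 3.
recovered = ToyNameNode.restart({"/a": {"type": "FILE"}}, 1, nn.edits)
print(sorted(recovered.namespace), recovered.last_txid)
```

Because every operation reaches the log before memory changes, replaying the log on top of an older image reproduces the same state as before the restart.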

2. SNN working stage

(1) At the configured check interval, the SNN asks the namenode whether a checkpoint is needed, and the namenode returns the result.
(2) If the answer is yes, the SNN asks the namenode to perform a checkpoint.
(3) The namenode starts the checkpoint. It first rolls the edits file currently in use: the rolled file is named edits_txid1-txid2, and a new edits file named edits_inprogress_(txid2+1) is created. Rolling keeps the merge from interfering with the service the namenode provides to clients; after the roll, operation records are written to the new edits file as usual.
(4) The latest fsimage (the one with the largest txid in its name) and the edits files from that txid up to the latest txid are copied to the SNN. Compare the txid in the fsimage name with the txids in the edits names: edits files whose txids are not larger than the fsimage's do not need to be copied.
(5) The SNN reads the copied fsimage and edits files into memory and merges them.
(6) The merge produces a new fsimage file named fsimage.chkpoint.
(7) fsimage.chkpoint is copied to the namenode.
(8) The namenode renames the received fsimage.chkpoint to fsimage_txid, where txid is the latest operation recorded in this fsimage file.
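
The merge in steps (4) to (8) boils down to "take the image, apply every newer operation, and name the result after the last txid applied". The Python sketch below shows that idea on toy data. The data layout (segments as tuples, operations as create/delete pairs) and the function names are assumptions for illustration only; the real SNN works on binary fsimage and edits files.

```python
def apply_op(namespace, op, path):
    if op == "create":
        namespace[path] = {"type": "FILE"}
    elif op == "delete":
        namespace.pop(path, None)


def checkpoint(image, image_txid, segments):
    """Merge edits segments into a copy of the image.

    segments is a list of (first_txid, last_txid, ops), where ops holds
    one (op, path) pair per txid in that range.
    """
    merged = dict(image)
    merged_txid = image_txid
    for first_txid, last_txid, ops in sorted(segments, key=lambda s: s[0]):
        if last_txid <= image_txid:
            continue  # everything in this segment is already in the image
        for offset, (op, path) in enumerate(ops):
            txid = first_txid + offset
            if txid > merged_txid:
                apply_op(merged, op, path)
                merged_txid = txid
    return merged, merged_txid


image = {"/test": {"type": "DIRECTORY"}}   # stands in for fsimage_...024
segments = [
    (22, 24, [("create", "/old"), ("delete", "/old"), ("create", "/test")]),
    (25, 26, [("create", "/a"), ("delete", "/test")]),
]
new_image, new_txid = checkpoint(image, 24, segments)
new_name = f"fsimage_{new_txid:019d}"
print(new_name, sorted(new_image))
```

The segment 22-24 is skipped because the image already covers it, which matches the remark in step (4) that older edits files need not be copied.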

3. checkpoint check parameter configuration

hdfs-default.xml

---------------checkpoint time interval------------------------
<!-- Time interval between checkpoints, in seconds; the default is one hour. -->
<property>
  <name>dfs.namenode.checkpoint.period</name>
  <value>3600</value>
</property>

---------------Number of operations------------------------
<!-- Another checkpoint trigger: a checkpoint is started once the number of operations recorded in edits reaches the value specified here. -->
<property>
  <name>dfs.namenode.checkpoint.txns</name>
  <value>1000000</value>
  <description>Number of operations</description>
</property>

<!-- How often the number of operations is checked; 60 seconds by default. -->
<property>
  <name>dfs.namenode.checkpoint.check.period</name>
  <value>60</value>
</property>

4. Conditions for triggering checkpoint

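(1) The size of the edits file exceeds 64 MB. This size-based condition comes from Hadoop 1.x (fs.checkpoint.size); none of the parameters listed above controls it.
(2) The time since the last checkpoint exceeds dfs.namenode.checkpoint.period, 3600 seconds by default.
(3) The number of operations recorded in the edits files reaches dfs.namenode.checkpoint.txns, 1000000 by default.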
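
The time-based and count-based conditions combine with a simple "or". Here is a minimal Python sketch of that decision, using the default values from the configuration above; should_checkpoint and its argument names are invented for this example and only approximate what the SNN does internally.

```python
def should_checkpoint(seconds_since_last, uncheckpointed_txns,
                      period=3600, txns=1_000_000):
    """True if either the time-based or the count-based condition is met.

    period corresponds to dfs.namenode.checkpoint.period and
    txns to dfs.namenode.checkpoint.txns.
    """
    return seconds_since_last >= period or uncheckpointed_txns >= txns


print(should_checkpoint(120, 5_000))        # neither condition met
print(should_checkpoint(4000, 5_000))       # period exceeded
print(should_checkpoint(120, 1_000_000))    # operation count reached
```

A check like this would run once every dfs.namenode.checkpoint.check.period seconds (60 by default).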


Added by p3rk5 on Thu, 17 Oct 2019 00:57:57 +0300