Hadoop3.3.1 compilation, installation and deployment tutorial


   it's best to recompile the source code when building Hadoop, because some functions of Hadoop must coordinate Java class files and library files generated by Native code through JNT. To run Native code in linux system, first compile Native into [. so] file of target CPU architecture. Different processor architectures need to compile the dynamic library [. so] file of the corresponding platform in order to be executed correctly. Therefore, it is best to recompile the Hadoop source code and make the [. so] file correspond to your processor.
  if you want to directly install the official compiled version, you can operate according to another blog: Hadoop3. Fully distributed deployment of X on centos

1. Environmental preparation

  1. Three virtual machines,,, Installing CentOS7 virtual machine on VMware 15 under win10
  2. JDK (self prepared)
  3. hadoop installation package (official website download address: https://hadoop.apache.org/releases.html)
  4. The installation package required for compilation and installation needs to be self fetched:
    Link: https://pan.baidu.com/s/11adzPBvnq0louRr3qUhfoA
    Extraction code: 6666

2. Create user

  1. Create a hadoop user and modify the password of the hadoop user
[root@localhost hadoop-3.3.1]# useradd hadoop
[root@localhost hadoop-3.3.1]# passwd hadoop
  1. vim /etc/sudoers configures the hadoop user to have root permission, so that sudo can execute the command with root permission later. Add a line below% wheeel, as shown below:
%wheel 	ALL=(ALL)	ALL
hadoop	ALL=(ALL) 	ALL
  1. Modify / data directory owner and group
chown -R hadoop:hadoop /data/
  1. Add address mapping to the three virtual machines in turn
vim /etc/hosts
 Add the following three lines to the end of the file hadoop1 hadoop2 hadoop3
  1. Close the firewall (production can't do this, just open several designated ports)
firewall-cmd --state    #View firewall status
systemctl stop firewalld.service  #Stop firewalld service
systemctl disable firewalld.service  #Disable firewalld service at startup

3. Password free login

  1. To / home / hadoop / In SSH / directory, use hadoop user to execute SSH keygen - t RSA, and then press enter three times to generate two file IDs_ RSA (private key), id_rsa.pub (public key)
  2. Execute the following command to copy the public key to the machine to be logged in without secret, and repeat these two steps on the other two machines at one time
  1. Now the hadoop users of the three machines can log in without secret. Add a root user to log in to the other two machines without secret. Use the root user to execute the following command
cd ~
cd .ssh
ssh-keygen -t rsa
  1. Explanation of file functions in. ssh folder
file namefunction
known_hostsRecord the public key of the computer accessed by ssh
id_rsaGenerated private key
id_rsa.pubGenerated public key
authorized_keysStore the authorized secret free login server public key

4. Compile and install

  1. First, use root to install and compile related dependencies
yum install -y gcc gcc-c++
yum install -y make cmake
yum install -y autoconf automake libtool curl
yum install -y lzo-devel zlib-devel openssl openssl-devel ncurses-devel
yum install -y snappy snappy-devel bzip2 bzip2-devel lzo lzo-devel lzop libXtst
  1. Install cmake manually. The version of cmake installed by default is too low and the source code cannot be compiled. It is recommended to install cmake above version 3.6. I have version 3.13 here
# Uninstall installed cmake
yum erase cmake

# Upload the installation package and unzip it
tar -zxvf cmake-3.13.5.tar.gz -C /data/

# Compile and install
cd /data/cmake-3.13.5/
make && make install

# verification
[root@localhost cmake-3.13.5]# cmake -version
cmake version 3.13.5

#If the version is not displayed correctly, you can disconnect the ssh connection and log in again
  1. Manually install snappy
# Uninstall installed snappy
cd /usr/local/lib
rm -rf libsnappy*

# Upload and decompress
cd /data/soft/
tar -zxvf snappy-1.1.3.tar.gz -C /data/

# Compile and install
cd /data/snappy-1.1.3/
make && make install

# Verify installation
[root@localhost snappy-1.1.3]# ls -lh /usr/local/lib | grep snappy
-rw-r--r--. 1 root root 511K Jan 14 13:07 libsnappy.a
-rwxr-xr-x. 1 root root  955 Jan 14 13:07 libsnappy.la
lrwxrwxrwx. 1 root root   18 Jan 14 13:07 libsnappy.so -> libsnappy.so.1.3.0
lrwxrwxrwx. 1 root root   18 Jan 14 13:07 libsnappy.so.1 -> libsnappy.so.1.3.0
-rwxr-xr-x. 1 root root 253K Jan 14 13:07 libsnappy.so.1.3.0
  1. Installing and configuring maven
# Decompression installation
cd /data/soft/
tar -zxvf apache-maven-3.5.4-bin.tar.gz -C /data/

# Configure environment variables
vim /etc/profile
export MAVEN_HOME=/data/apache-maven-3.5.4
export MAVEN_OPTS="-Xms4096m -Xmx4096m"
export PATH=:$MAVEN_HOME/bin:$PATH

# Verify that the installation was successful
[root@localhost apache-maven-3.5.4]# mvn -v
Apache Maven 3.5.4 (1edded0938998edf8bf061f1ceb3cfdeccf443fe; 2018-06-18T02:33:14+08:00)
Maven home: /data/apache-maven-3.5.4
Java version: 1.8.0_211, vendor: Oracle Corporation, runtime: /usr/java/jdk1.8.0_211/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "linux", version: "3.10.0-957.el7.x86_64", arch: "amd64", family: "unix"

# Add maven alicloud warehouse address
vim /data/apache-maven-3.5.4/conf/settings.xml
		 <name>aliyun maven</name>
  1. Installing ProtocolBuffer 2.5.0
# decompression
cd /data/soft/
tar -zxvf protobuf-2.5.0.tar.gz -C /data/

# Compile and install
cd /data/protobuf-2.5.0/
make && make install

# Verify that the installation was successful
[root@localhost protobuf-2.5.0]# protoc --version
libprotoc 2.5.0
  1. Upload the installation package to the virtual machine
  2. Execute the command tar -zxvf hadoop-3.3.1 tar. GZ - C / data / extract to the / data directory
  3. Compile hadoop (compilation is very slow and takes about an hour)
cd /data/hadoop-3.3.1-src/
mvn clean package -Pdist,native -DskipTests -Dtar -Dbundle.snappy -Dsnappy.lib=/usr/local/lib

#Parameter Description:
Pdist,native : Recompile the generated hadoop Dynamic library;
DskipTests : Skip test
Dtar : Finally, put the document in tar pack
Dbundle.snappy : add to snappy Compression support [downloading from the official website by default is not supported]
Dsnappy.lib=/usr/local/lib : finger snappy The library path after installation on the compiling machine
  1. After successful compilation, the installation package path is as follows
[root@hadoop111 target]# pwd
[root@hadoop111 target]# ll
total 501664
drwxr-xr-x.  2 root root        28 Jan 14 16:56 antrun
drwxr-xr-x.  3 root root        22 Jan 14 16:56 classes
drwxr-xr-x. 10 root root       215 Jan 14 16:56 hadoop-3.3.1
-rw-r--r--.  1 root root 513703809 Jan 14 16:56 hadoop-3.3.1.tar.gz
drwxr-xr-x.  3 root root        22 Jan 14 16:56 maven-shared-archive-resources
drwxr-xr-x.  3 root root        22 Jan 14 16:56 test-classes
drwxr-xr-x.  2 root root         6 Jan 14 16:56 test-dir
  1. Unzip the compiled installation package
tar -zxvf hadoop-3.3.1.tar.gz -C /data/
  1. Enter the / data/hadoop-3.3.1/etc/hadoop path and execute the command VIM core site XML, edit the core configuration file and add the following content:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

    <!-- to configure NameNode of URL -->
    <!-- appoint hadoop Storage directory of data,yes hadoop Basic configuration of file system dependency,The default location is/tmp/{$user}It is a temporary directory. Once it is affected by external factors such as power failure,/tmp/${user}Everything under will be lost -->
    <!-- to configure HDFS The static user used for web page login is hadoop -->
  1. Execute the command VIM HDFS site XML, edit the HDFS configuration file, and add the following content:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <!-- NameNode Path on the local file system where the namespace and transaction logs are stored -->
    <!-- DataNode Path on the local file system where the namespace and transaction logs are stored  -->
    <!-- NameNode web End access address-->
    <!-- SecondaryNameNode web End access address-->
  1. Execute the command VIM YARN site XML edit the YARN configuration file and add the following:
<?xml version="1.0"?>

    <!-- appoint MR go shuffle -->
    <!-- appoint ResourceManager Address of-->
    <!-- Minimum memory limit allocated for each container request resource manager (512) M) -->
    <!-- Maximum memory allocated per container request limit resource manager (4) G) -->
    <!-- Virtual memory ratio, the default is 2.1,Set here to 4x -->
    <!-- Inheritance of environment variables -->
  1. Execute the command VIM mapred site XML edit MapReduce configuration file and add the following content:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

    <!-- implement MapReduce How to: yarn/local -->
  1. Under the / data/hadoop-3.3.1/etc/hadoop path, execute the command vim workers to configure workers (Note: no spaces are allowed at the end of the content added in the file, and no empty lines are allowed in the file)
  1. Create corresponding directory
mkdir /data/hadoop-3.3.1/data/datanode
mkdir /data/hadoop-3.3.1/data/tmp
  1. Change the user of hadoop installation package
chown -R hadoop:hadoop hadoop-3.3.1/
chown -R hadoop:hadoop hadoop-3.3.1/*
  1. Switch to the hadoop user and execute the following command to distribute the configured hadoop installation package to the other two machines (the other two machines check the user of the installation package and change to the hadoop user in installation step 10)
scp -r hadoop-3.3.1 root@
scp -r hadoop-3.3.1 root@
  1. Add environment variables to the three virtual machines in turn, edit the / etc/profile file, add the following contents, then save the source /etc/profile, and execute the hadoop version command to check whether the addition is successful
export HADOOP_HOME=/data/hadoop-3.3.1
  1. The cluster starts for the first time, The NameNode needs to be formatted on the primary node (both use hadoop users) (Note: formatting NameNode will generate a new cluster id, resulting in inconsistent cluster IDS between NameNode and datanode, and the cluster cannot find past data. If the cluster reports an error during operation and needs to reformat NameNode, be sure to stop the NameNode and datanode process, delete the data and logs directories of all machines, and then format it (chemical)
hdfs namenode -format
  1. Starting HDFS with hadoop users
  1. Start YARN on
  1. jps check whether the service processes of the three virtual machines are as shown in the following table
  1. View the HDFS NameNode on the Web (you can view the HDFS directory structure in utilities = > browse the file system)
  2. View YARN's ResourceManager on the Web

5. Basic cluster test

  1. Upload files to cluster
[hadoop@localhost hadoop-3.3.1]$ hadoop fs -mkdir /input
[hadoop@localhost hadoop-3.3.1]$ hadoop fs -put /data/input/1.txt /input
  1. Go to the HDFS file storage path to view the contents of HDFS files stored on disk
[hadoop@localhost subdir0]$ pwd
[hadoop@localhost subdir0]$ ls
blk_1073741825  blk_1073741825_1001.meta
[hadoop@localhost subdir0]$ cat blk_1073741825
hello hadoop
stream data
flink spark
  1. Download File
[hadoop@localhost hadoop-3.3.1]$ hadoop fs -get /input/1.txt /data/output/
[hadoop@localhost hadoop-3.3.1]$ ls /data/output/
[hadoop@localhost hadoop-3.3.1]$ cat /data/output/1.txt 
hello hadoop
stream data
flink spark
  1. Execute the wordcount program
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.1.jar wordcount /input /output
see/output/File contents under( windows browser web When pulling files from the page to view, you need to C:\Windows\System32\drivers\etc\hosts Add 2 to.4 Address mapping described in section)
data	1
flink	1
hadoop	1
hello	1
spark	1
stream	1
  1. Calculate the PI (in the calculation command, 2 indicates the number of threads to be calculated and 50 indicates the number of investment points. The larger the value, the more accurate the calculated pi value is)
yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.1.jar pi 2 50
Job Finished in 23.948 seconds
Estimated value of Pi is 3.20000000000000000000

6. Configure history server

  in order to view the historical operation of the program, you need to configure the history server. The specific configuration steps are as follows:

  1. vim mapred-site.xml to edit the MapReduce configuration file and add the following contents (required for all three virtual machines):
<!-- Historical server address -->
<!-- History server web End address -->
  1. Start the history server at
On: mapred --daemon start historyserver
 close: mapred --daemon stop historyserver
15299 DataNode
15507 NodeManager
15829 Jps
15769 JobHistoryServer
15132 NameNode
  1. View JobHistory

7. Configure log aggregation

   log aggregation concept: after the application runs, upload the program running log information to the HDFS system.

   advantages of log aggregation function: you can easily view the details of program operation, which is convenient for development and debugging.

  1. vim yarn-site. Configure yarn site. XML XML, add the following content (required for all three sets):
<!-- Enable log aggregation -->
<!-- Set log aggregation server address -->
<!-- Set the log retention time to 7 days -->
<!-- Configuring running logs hdfs Storage path on -->
<!-- The configuration run logs are stored in the hdfs Storage path on -->
  1. Restart the service and execute the wordcount program
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.1.jar wordcount /input /output
  1. view log

8. Summary of cluster start / stop commands

  • Each module starts / stops separately (ssh is configured on the premise)
  1. Overall start / stop HDFS
  1. Overall start / stop of YARN
  • Each service component starts / stops one by one
  1. Start / stop HDFS components separately
hdfs --daemon start/stop namenode/datanode/secondarynamenode
  1. Start / stop YARN
yarn --daemon start/stop  resourcemanager/nodemanager

9. Cluster clustering script

  1. vim myhadoop.sh add the following and save

if [ $# -lt 1 ]
    echo "No Args Input..."
    exit ;

case $1 in
        echo " =================== start-up hadoop colony ==================="

        echo " --------------- start-up hdfs ---------------"
        ssh "/data/hadoop-3.3.1/sbin/start-dfs.sh"
        echo " --------------- start-up yarn ---------------"
        ssh "/data/hadoop-3.3.1/sbin/start-yarn.sh"
        echo " --------------- start-up historyserver ---------------"
        ssh "/data/hadoop-3.3.1/bin/mapred --daemon start historyserver"
        echo " =================== close hadoop colony ==================="

        echo " --------------- close historyserver ---------------"
        ssh "/data/hadoop-3.3.1/bin/mapred --daemon stop historyserver"
        echo " --------------- close yarn ---------------"
        ssh "/data/hadoop-3.3.1/sbin/stop-yarn.sh"
        echo " --------------- close hdfs ---------------"
        ssh "/data/hadoop-3.3.1/sbin/stop-dfs.sh"
    echo "Input Args Error..."
  1. chmod +x myhadoop.sh grant script execution permission
  2. Start / stop cluster

10. Description of common port numbers

Port nameHadoop2.xHadoop3.x
NameNode internal communication port8020 / 90008020 / 9000 / 9820
NameNode HTTP UI500709870
MapReduce view task execution port80888088
History server communication port1988819888

Keywords: Big Data Hadoop hdfs

Added by roxki on Fri, 14 Jan 2022 13:21:03 +0200