Hadoop download address
https://archive.apache.org/dist/hadoop/common/hadoop-3.1.3/
1, Hadoop installation
1. Upload hadoop-3.1.3.tar.gz to the /opt/software directory on the Linux machine
hadoop-3.1.3.tar.gz
2. Extract hadoop-3.1.3.tar.gz to /opt/server/
[linux@node1 software]$ tar -zxvf hadoop-3.1.3.tar.gz -C /opt/server/
3. Modify /etc/profile.d/yes.sh and add the environment variables
[linux@node1 server]$ sudo vim /etc/profile.d/yes.sh
#HADOOP_HOME
export HADOOP_HOME=/opt/server/hadoop-3.1.3
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
4. Make the modified file take effect
[linux@node1 server]$ source /etc/profile
5. Test whether the installation is successful
[linux@node1 server]$ hadoop version
Hadoop 3.1.3
Source code repository https://gitbox.apache.org/repos/asf/hadoop.git -r ba631c436b806728f8ec2f54ab1e289526c90579
Compiled by ztang on 2019-09-12T02:47Z
Compiled with protoc 2.5.0
From source with checksum ec785077c385118ac91aadde5ec9799
This command was run using /opt/server/hadoop-3.1.3/share/hadoop/common/hadoop-common-3.1.3.jar
2, Hadoop directory structure
1. View Hadoop directory structure
[linux@node1 hadoop-3.1.3]$ ll
total 176
drwxr-xr-x. 2 linux linux    183 Sep 12 2019 bin
drwxr-xr-x. 3 linux linux     20 Sep 12 2019 etc
drwxr-xr-x. 2 linux linux    106 Sep 12 2019 include
drwxr-xr-x. 3 linux linux     20 Sep 12 2019 lib
drwxr-xr-x. 4 linux linux    288 Sep 12 2019 libexec
-rw-rw-r--. 1 linux linux 147145 Sep  4 2019 LICENSE.txt
-rw-rw-r--. 1 linux linux  21867 Sep  4 2019 NOTICE.txt
-rw-rw-r--. 1 linux linux   1366 Sep  4 2019 README.txt
drwxr-xr-x. 3 linux linux   4096 Sep 12 2019 sbin
drwxr-xr-x. 4 linux linux     31 Sep 12 2019 share
2. Important directories
(1) bin directory: stores scripts for operating Hadoop-related services (HDFS, YARN)
(2) etc directory: stores the Hadoop configuration files
(3) lib directory: stores Hadoop's native libraries (used for compressing and decompressing data)
(4) sbin directory: stores scripts for starting and stopping Hadoop-related services
(5) share directory: stores Hadoop's dependency jars, documentation, and official examples
3, Configure cluster
1. Core configuration file: core-site.xml
<!-- Specify the address of the NameNode in HDFS -->
<property>
    <name>fs.defaultFS</name>
    <value>hdfs://node1:9820</value>
</property>
<!-- Specify the directory where Hadoop stores files generated at runtime -->
<property>
    <name>hadoop.tmp.dir</name>
    <value>/opt/server/hadoop-3.1.3/data</value>
</property>
<!-- Static user for operating HDFS through the web interface -->
<property>
    <name>hadoop.http.staticuser.user</name>
    <value>linux</value>
</property>
<!-- Compatibility configuration for Hive later -->
<!-- Hosts from which the linux (superuser) account is allowed to proxy -->
<property>
    <name>hadoop.proxyuser.linux.hosts</name>
    <value>*</value>
</property>
<!-- Groups whose members the linux (superuser) account is allowed to proxy -->
<property>
    <name>hadoop.proxyuser.linux.groups</name>
    <value>*</value>
</property>
<!-- Users the linux (superuser) account is allowed to proxy -->
<property>
    <name>hadoop.proxyuser.linux.users</name>
    <value>*</value>
</property>
2.HDFS configuration file: hdfs-site.xml
<!-- NameNode web access address -->
<property>
    <name>dfs.namenode.http-address</name>
    <value>node1:9870</value>
</property>
<!-- SecondaryNameNode web access address -->
<property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>node3:9868</value>
</property>
<!-- Set the HDFS replication factor to 1 for the test environment -->
<property>
    <name>dfs.replication</name>
    <value>1</value>
</property>
<!-- Where the NameNode stores the name table on the local file system -->
<property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///opt/server/hadoop-3.1.3/dfs/name</value>
</property>
<!-- Where a DataNode stores its blocks on the local file system -->
<property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///opt/server/hadoop-3.1.3/dfs/data</value>
</property>
3.YARN configuration file: yarn-site.xml
<!-- Specify mapreduce_shuffle as the auxiliary service for MapReduce -->
<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
</property>
<!-- Specify the address of the ResourceManager -->
<property>
    <name>yarn.resourcemanager.hostname</name>
    <value>node2</value>
</property>
<!-- Environment variables to be inherited -->
<property>
    <name>yarn.nodemanager.env-whitelist</name>
    <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
</property>
<!-- Minimum and maximum memory YARN is allowed to allocate to a container -->
<property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>512</value>
</property>
<property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>4096</value>
</property>
<!-- Amount of physical memory the NodeManager is allowed to manage for containers -->
<property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>4096</value>
</property>
<!-- Disable YARN's physical-memory and virtual-memory limit checks -->
<property>
    <name>yarn.nodemanager.pmem-check-enabled</name>
    <value>false</value>
</property>
<property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
</property>
4.MapReduce configuration file: mapred-site.xml
<!-- Specify that MapReduce programs run on YARN -->
<property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
</property>
5. Cluster nodes file: workers
node1
node2
node3
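In Hadoop 3.x the workers file sits with the other configuration files at $HADOOP_HOME/etc/hadoop/workers; it should contain exactly one host name per line, with no trailing spaces or blank lines. For example:
[linux@node1 hadoop]$ vim /opt/server/hadoop-3.1.3/etc/hadoop/workers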
4, Configure history server
To be able to view the history of completed jobs, you need to configure the history server. The specific configuration steps are as follows:
1) Configure mapred-site.xml
[linux@node1 hadoop]$ vim mapred-site.xml
Add the following configuration to this file.
<!-- History server address -->
<property>
    <name>mapreduce.jobhistory.address</name>
    <value>node1:10020</value>
</property>
<!-- History server web address -->
<property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>node1:19888</value>
</property>
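For reference, once this configuration has been distributed, the history server can be started on node1 with the mapred daemon command (the same command used in the cluster startup script later in this document) and verified with jps:
[linux@node1 hadoop]$ mapred --daemon start historyserver
[linux@node1 hadoop]$ jps | grep JobHistoryServer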
5, Configure log aggregation
Log aggregation concept: after an application finishes running, its log information is uploaded to HDFS.
Benefits of log aggregation: you can conveniently view the details of a program run, which makes development and debugging easier.
Note: to enable log aggregation, you need to restart the NodeManager, ResourceManager, and HistoryServer.
The specific steps to enable log aggregation are as follows:
1) Configure yarn-site.xml
[linux@node1 hadoop]$ vim yarn-site.xml
Add the following configuration to this file.
<!-- Enable log aggregation -->
<property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
</property>
<!-- Set the log aggregation server URL -->
<property>
    <name>yarn.log.server.url</name>
    <value>http://node1:19888/jobhistory/logs</value>
</property>
<!-- Set the log retention time to 7 days -->
<property>
    <name>yarn.log-aggregation.retain-seconds</name>
    <value>604800</value>
</property>
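After distributing this configuration and restarting YARN and the history server, the aggregated logs of a finished application can be viewed through the history server web UI or with the yarn logs command; the application ID below is only a placeholder:
[linux@node1 hadoop]$ yarn logs -applicationId application_1600000000000_0001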
6, Start cluster
1. Distribute hadoop files
[linux@node1 server]$ xsync hadoop-3.1.3/
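xsync is not a standard Hadoop or Linux command; it is assumed here to be a small user-supplied distribution script that rsyncs a path to the other cluster nodes. A minimal sketch of such a script, with the host list hard-coded as an assumption:
#!/bin/bash
# Minimal xsync sketch: copy each argument to the same location on the other nodes.
# The host list is assumed from this cluster; adjust it to match your environment.
for host in node2 node3
do
    echo "==================== $host ===================="
    for arg in "$@"
    do
        src=$(readlink -f "$arg")            # absolute path of the file or directory
        dest_dir=$(dirname "$src")           # parent directory to recreate on the remote host
        ssh "$host" "mkdir -p $dest_dir"     # ensure the parent directory exists remotely
        rsync -av "$src" "$host:$dest_dir"   # sync the file or directory itself
    done
done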
2. Start hadoop
If the cluster is being started for the first time, you need to format the NameNode on node1 (note: before formatting, be sure to stop any NameNode and DataNode processes started previously, and delete the data and logs directories, as sketched below)
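A minimal cleanup sketch, assuming the data directory is the one configured through hadoop.tmp.dir above and the logs directory sits under the installation; run the removal on every node and adjust the paths if your layout differs:
[linux@node1 server]$ stop-dfs.sh
[linux@node1 server]$ rm -rf /opt/server/hadoop-3.1.3/data /opt/server/hadoop-3.1.3/logs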
[linux@node1 server]$ hdfs namenode -format
Start HDFS on the node where the NameNode is configured (node1)
[linux@node1 server]$ start-dfs.sh
Start YARN on the node where the ResourceManager is configured (node2)
[linux@node2 opt]$ start-yarn.sh
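As a quick sanity check under the configuration above, jps on each node should show roughly the following daemons (the JobHistoryServer appears on node1 only after it has been started):
[linux@node1 ~]$ jps    # NameNode, DataNode, NodeManager
[linux@node2 ~]$ jps    # ResourceManager, DataNode, NodeManager
[linux@node3 ~]$ jps    # SecondaryNameNode, DataNode, NodeManager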
View the HDFS web UI in a browser:
http://node1:9870/
View the YARN web UI in a browser:
http://node2:8088
View SecondaryNameNode information
http://node3:9868
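As an optional end-to-end check, the example jar shipped with Hadoop can run a small wordcount job on the cluster; the /input and /output paths below are arbitrary choices for illustration:
[linux@node1 server]$ hadoop fs -mkdir /input
[linux@node1 server]$ hadoop fs -put $HADOOP_HOME/README.txt /input
[linux@node1 server]$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar wordcount /input /output
[linux@node1 server]$ hadoop fs -cat /output/part-r-00000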
7, Cluster startup script
#!/bin/bash

if [ $# -lt 1 ]
then
    echo "No Args Input..."
    exit
fi

case $1 in
"start")
    echo " =================== starting hadoop cluster ==================="

    echo " --------------- starting hdfs ---------------"
    ssh node1 "/opt/server/hadoop-3.1.3/sbin/start-dfs.sh"
    echo " --------------- starting yarn ---------------"
    ssh node2 "/opt/server/hadoop-3.1.3/sbin/start-yarn.sh"
    echo " --------------- starting historyserver ---------------"
    ssh node1 "/opt/server/hadoop-3.1.3/bin/mapred --daemon start historyserver"
;;
"stop")
    echo " =================== stopping hadoop cluster ==================="

    echo " --------------- stopping historyserver ---------------"
    ssh node1 "/opt/server/hadoop-3.1.3/bin/mapred --daemon stop historyserver"
    echo " --------------- stopping yarn ---------------"
    ssh node2 "/opt/server/hadoop-3.1.3/sbin/stop-yarn.sh"
    echo " --------------- stopping hdfs ---------------"
    ssh node1 "/opt/server/hadoop-3.1.3/sbin/stop-dfs.sh"
;;
*)
    echo "Input Args Error..."
;;
esac
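The script's file name is not specified above; assuming it is saved as, say, hdp.sh in a directory on the PATH (such as ~/bin on node1), it can be made executable and used like this:
[linux@node1 ~]$ chmod +x ~/bin/hdp.sh
[linux@node1 ~]$ hdp.sh start
[linux@node1 ~]$ hdp.sh stop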
8, Cluster time synchronization
Time synchronization approach: designate one machine as the time server, and have all other machines in the cluster synchronize with it on a regular schedule, for example every ten minutes.
Specific operation of configuring time synchronization:
1) Time server configuration (must be root)
(0) Check the status of the ntpd service and whether it is enabled at boot on all nodes
[linux@node1 ~]$ sudo systemctl status ntpd
[linux@node1 ~]$ sudo systemctl is-enabled ntpd
(1) Stop the ntpd service and disable it at boot on all nodes
[linux@node1 ~]$ sudo systemctl stop ntpd
[linux@node1 ~]$ sudo systemctl disable ntpd
(2) Modify the ntp.conf configuration file of node1
[linux@node1 ~]$ sudo vim /etc/ntp.conf
The modifications are as follows:
a) Modification 1 (authorize all machines in the 192.168.1.0-192.168.1.255 network segment to query and synchronize time from this machine)
Change
#restrict 192.168.1.0 mask 255.255.255.0 nomodify notrap
to
restrict 192.168.1.0 mask 255.255.255.0 nomodify notrap
b) Modification 2 (the cluster is on a LAN and does not use time from the Internet)
Change
server 0.centos.pool.ntp.org iburst
server 1.centos.pool.ntp.org iburst
server 2.centos.pool.ntp.org iburst
server 3.centos.pool.ntp.org iburst
to
#server 0.centos.pool.ntp.org iburst
#server 1.centos.pool.ntp.org iburst
#server 2.centos.pool.ntp.org iburst
#server 3.centos.pool.ntp.org iburst
c) Addition 3 (when this node loses its network connection, it can still use its local clock as the time source and provide time synchronization to the other nodes in the cluster)
server 127.127.1.0
fudge 127.127.1.0 stratum 10
(3) Modify the /etc/sysconfig/ntpd file of node1
[linux@node1 ~]$ sudo vim /etc/sysconfig/ntpd
Add the following content (synchronize the hardware clock with the system time)
SYNC_HWCLOCK=yes
(4) Restart the ntpd service
[linux@node1 ~]$ sudo systemctl start ntpd
(5) Set the ntpd service to start at boot
[linux@node1 ~]$ sudo systemctl enable ntpd
2) Other machine configurations (must be root)
(1) Configure other machines to synchronize with the time server once every 10 minutes
[linux@node2 ~]$ sudo crontab -e
The scheduled task is as follows:
*/10 * * * * /usr/sbin/ntpdate node1
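To trigger a synchronization immediately instead of waiting for the cron schedule, the same command can also be run manually (shown on node2 for illustration):
[linux@node2 ~]$ sudo /usr/sbin/ntpdate node1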
(2) Modify the time on another machine
[linux@node2 ~]$ sudo date -s "2017-9-11 11:11:11"
(3) Ten minutes later, check whether the machine has synchronized with the time server
[linux@node2 ~]$ sudo date
Note: when testing, you can adjust 10 minutes to 1 minute to save time.