Hadoop download address
https://archive.apache.org/dist/hadoop/common/hadoop-3.1.3/
1, Hadoop installation
1. Upload hadoop-3.1.3.tar.gz to the /opt/software directory on the Linux machine
hadoop-3.1.3.tar.gz
2. Extract hadoop-3.1.3.tar.gz to /opt/server/
[linux@node1 software]$ tar -zxvf hadoop-3.1.3.tar.gz -C /opt/server/
3. Modify /etc/profile.d/yes.sh and add the environment variables
[linux@node1 server]$ sudo vim /etc/profile.d/yes.sh
#HADOOP_HOME
export HADOOP_HOME=/opt/server/hadoop-3.1.3
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
4. Make the modified file take effect
[linux@node1 server]$ source /etc/profile
5. Test whether the installation is successful
[linux@node1 server]$ hadoop version
Hadoop 3.1.3
Source code repository https://gitbox.apache.org/repos/asf/hadoop.git -r ba631c436b806728f8ec2f54ab1e289526c90579
Compiled by ztang on 2019-09-12T02:47Z
Compiled with protoc 2.5.0
From source with checksum ec785077c385118ac91aadde5ec9799
This command was run using /opt/server/hadoop-3.1.3/share/hadoop/common/hadoop-common-3.1.3.jar
2, Hadoop directory structure
1. View Hadoop directory structure
[linux@node1 hadoop-3.1.3]$ ll
total 176
drwxr-xr-x. 2 linux linux    183 Sep 12 2019 bin
drwxr-xr-x. 3 linux linux     20 Sep 12 2019 etc
drwxr-xr-x. 2 linux linux    106 Sep 12 2019 include
drwxr-xr-x. 3 linux linux     20 Sep 12 2019 lib
drwxr-xr-x. 4 linux linux    288 Sep 12 2019 libexec
-rw-rw-r--. 1 linux linux 147145 Sep  4 2019 LICENSE.txt
-rw-rw-r--. 1 linux linux  21867 Sep  4 2019 NOTICE.txt
-rw-rw-r--. 1 linux linux   1366 Sep  4 2019 README.txt
drwxr-xr-x. 3 linux linux   4096 Sep 12 2019 sbin
drwxr-xr-x. 4 linux linux     31 Sep 12 2019 share
2. Important directories
(1) bin directory: stores scripts for operating Hadoop-related services (HDFS, YARN)
(2) etc directory: stores the Hadoop configuration files
(3) lib directory: stores Hadoop's native libraries (used for compressing and decompressing data)
(4) sbin directory: stores scripts for starting and stopping Hadoop-related services
(5) share directory: stores Hadoop's dependency jars, documentation, and official examples
3, Configure cluster
1. Core configuration file: core-site.xml
<!-- Specify the address of the NameNode in HDFS -->
<property>
    <name>fs.defaultFS</name>
    <value>hdfs://node1:9820</value>
</property>
<!-- Specify the directory where Hadoop stores files generated at runtime -->
<property>
    <name>hadoop.tmp.dir</name>
    <value>/opt/server/hadoop-3.1.3/data</value>
</property>
<!-- Static user for operating HDFS through the web interface -->
<property>
    <name>hadoop.http.staticuser.user</name>
    <value>linux</value>
</property>
<!-- Compatibility configuration for Hive later -->
<!-- Hosts from which the linux (superuser) account is allowed to proxy -->
<property>
    <name>hadoop.proxyuser.linux.hosts</name>
    <value>*</value>
</property>
<!-- Groups whose members the linux (superuser) account is allowed to proxy -->
<property>
    <name>hadoop.proxyuser.linux.groups</name>
    <value>*</value>
</property>
<!-- Users the linux (superuser) account is allowed to proxy -->
<property>
    <name>hadoop.proxyuser.linux.users</name>
    <value>*</value>
</property>
2.HDFS configuration file: hdfs-site.xml
<!-- NameNode web access address -->
<property>
    <name>dfs.namenode.http-address</name>
    <value>node1:9870</value>
</property>
<!-- SecondaryNameNode web access address -->
<property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>node3:9868</value>
</property>
<!-- Set the HDFS replication factor to 1 for the test environment -->
<property>
    <name>dfs.replication</name>
    <value>1</value>
</property>
<!-- Where the NameNode stores the name table on the local file system -->
<property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///opt/server/hadoop-3.1.3/dfs/name</value>
</property>
<!-- Where a DataNode stores its blocks on the local file system -->
<property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///opt/server/hadoop-3.1.3/dfs/data</value>
</property>
3.YARN configuration file: yarn-site.xml
<!-- Specify mapreduce_shuffle as the auxiliary service for MapReduce -->
<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
</property>
<!-- Specify the address of the ResourceManager -->
<property>
    <name>yarn.resourcemanager.hostname</name>
    <value>node2</value>
</property>
<!-- Environment variables to be inherited -->
<property>
    <name>yarn.nodemanager.env-whitelist</name>
    <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
</property>
<!-- Minimum and maximum memory YARN is allowed to allocate to a container -->
<property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>512</value>
</property>
<property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>4096</value>
</property>
<!-- Amount of physical memory the NodeManager is allowed to manage for containers -->
<property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>4096</value>
</property>
<!-- Disable YARN's physical-memory and virtual-memory limit checks -->
<property>
    <name>yarn.nodemanager.pmem-check-enabled</name>
    <value>false</value>
</property>
<property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
</property>
4.MapReduce configuration file: mapred-site.xml
<!-- Specify that MapReduce programs run on YARN -->
<property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
</property>
5. Cluster nodes file: workers
node1
node2
node3
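In Hadoop 3.x the workers file sits with the other configuration files at $HADOOP_HOME/etc/hadoop/workers; it should contain exactly one host name per line, with no trailing spaces or blank lines. For example:
[linux@node1 hadoop]$ vim /opt/server/hadoop-3.1.3/etc/hadoop/workers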
4, Configure history server
To be able to view the history of completed jobs, you need to configure the history server. The specific configuration steps are as follows:
1) Configure mapred-site.xml
[linux@node1 hadoop]$ vim mapred-site.xml
Add the following configuration to this file.
<!-- History server address -->
<property>
    <name>mapreduce.jobhistory.address</name>
    <value>node1:10020</value>
</property>
<!-- History server web address -->
<property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>node1:19888</value>
</property>
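For reference, once this configuration has been distributed, the history server can be started on node1 with the mapred daemon command (the same command used in the cluster startup script later in this document) and verified with jps:
[linux@node1 hadoop]$ mapred --daemon start historyserver
[linux@node1 hadoop]$ jps | grep JobHistoryServer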
5, Configure log aggregation
Log aggregation concept: after an application finishes running, its log information is uploaded to HDFS.
Benefits of log aggregation: you can conveniently view the details of a program run, which makes development and debugging easier.
Note: to enable log aggregation, you need to restart the NodeManager, ResourceManager, and HistoryServer.
The specific steps to enable log aggregation are as follows:
1) Configure yarn-site.xml
[linux@node1 hadoop]$ vim yarn-site.xml
Add the following configuration to this file.
<!-- Enable log aggregation -->
<property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
</property>
<!-- Set the log aggregation server URL -->
<property>
    <name>yarn.log.server.url</name>
    <value>http://node1:19888/jobhistory/logs</value>
</property>
<!-- Set the log retention time to 7 days -->
<property>
    <name>yarn.log-aggregation.retain-seconds</name>
    <value>604800</value>
</property>
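After distributing this configuration and restarting YARN and the history server, the aggregated logs of a finished application can be viewed through the history server web UI or with the yarn logs command; the application ID below is only a placeholder:
[linux@node1 hadoop]$ yarn logs -applicationId application_1600000000000_0001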
6, Start cluster
1. Distribute hadoop files
[linux@node1 server]$ xsync hadoop-3.1.3/
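xsync is not a standard Hadoop or Linux command; it is assumed here to be a small user-supplied distribution script that rsyncs a path to the other cluster nodes. A minimal sketch of such a script, with the host list hard-coded as an assumption:
#!/bin/bash
# Minimal xsync sketch: copy each argument to the same location on the other nodes.
# The host list is assumed from this cluster; adjust it to match your environment.
for host in node2 node3
do
    echo "==================== $host ===================="
    for arg in "$@"
    do
        src=$(readlink -f "$arg")            # absolute path of the file or directory
        dest_dir=$(dirname "$src")           # parent directory to recreate on the remote host
        ssh "$host" "mkdir -p $dest_dir"     # ensure the parent directory exists remotely
        rsync -av "$src" "$host:$dest_dir"   # sync the file or directory itself
    done
done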
2. Start hadoop
If the cluster is being started for the first time, you need to format the NameNode on node1 (note: before formatting, be sure to stop any NameNode and DataNode processes started previously, and delete the data and logs directories, as sketched below)
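A minimal cleanup sketch, assuming the data directory is the one configured through hadoop.tmp.dir above and the logs directory sits under the installation; run the removal on every node and adjust the paths if your layout differs:
[linux@node1 server]$ stop-dfs.sh
[linux@node1 server]$ rm -rf /opt/server/hadoop-3.1.3/data /opt/server/hadoop-3.1.3/logs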
[linux@node1 server]$ hdfs namenode -format
Start HDFS on the node where the NameNode is configured (node1)
[linux@node1 server]$ start-dfs.sh
Start YARN on the node where the ResourceManager is configured (node2)
[linux@node2 opt]$ start-yarn.sh
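As a quick sanity check under the configuration above, jps on each node should show roughly the following daemons (the JobHistoryServer appears on node1 only after it has been started):
[linux@node1 ~]$ jps    # NameNode, DataNode, NodeManager
[linux@node2 ~]$ jps    # ResourceManager, DataNode, NodeManager
[linux@node3 ~]$ jps    # SecondaryNameNode, DataNode, NodeManager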
View the HDFS web UI in a browser:
http://node1:9870/
View the YARN web UI in a browser:
http://node2:8088
View SecondaryNameNode information
http://node3:9868
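As an optional end-to-end check, the example jar shipped with Hadoop can run a small wordcount job on the cluster; the /input and /output paths below are arbitrary choices for illustration:
[linux@node1 server]$ hadoop fs -mkdir /input
[linux@node1 server]$ hadoop fs -put $HADOOP_HOME/README.txt /input
[linux@node1 server]$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar wordcount /input /output
[linux@node1 server]$ hadoop fs -cat /output/part-r-00000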
7, Cluster startup script
#!/bin/bash

if [ $# -lt 1 ]
then
    echo "No Args Input..."
    exit
fi

case $1 in
"start")
    echo " =================== starting hadoop cluster ==================="

    echo " --------------- starting hdfs ---------------"
    ssh node1 "/opt/server/hadoop-3.1.3/sbin/start-dfs.sh"
    echo " --------------- starting yarn ---------------"
    ssh node2 "/opt/server/hadoop-3.1.3/sbin/start-yarn.sh"
    echo " --------------- starting historyserver ---------------"
    ssh node1 "/opt/server/hadoop-3.1.3/bin/mapred --daemon start historyserver"
;;
"stop")
    echo " =================== stopping hadoop cluster ==================="

    echo " --------------- stopping historyserver ---------------"
    ssh node1 "/opt/server/hadoop-3.1.3/bin/mapred --daemon stop historyserver"
    echo " --------------- stopping yarn ---------------"
    ssh node2 "/opt/server/hadoop-3.1.3/sbin/stop-yarn.sh"
    echo " --------------- stopping hdfs ---------------"
    ssh node1 "/opt/server/hadoop-3.1.3/sbin/stop-dfs.sh"
;;
*)
    echo "Input Args Error..."
;;
esac
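The script's file name is not specified above; assuming it is saved as, say, hdp.sh in a directory on the PATH (such as ~/bin on node1), it can be made executable and used like this:
[linux@node1 ~]$ chmod +x ~/bin/hdp.sh
[linux@node1 ~]$ hdp.sh start
[linux@node1 ~]$ hdp.sh stop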
8, Cluster time synchronization
Time synchronization approach: designate one machine as the time server, and have all other machines in the cluster synchronize with it on a regular schedule, for example every ten minutes.
Specific operation of configuring time synchronization:
1) Time server configuration (must be root)
(0) Check the status of the ntpd service and whether it is enabled at boot on all nodes
[linux@node1 ~]$ sudo systemctl status ntpd
[linux@node1 ~]$ sudo systemctl is-enabled ntpd
(1) Stop the ntpd service and disable it at boot on all nodes
[linux@node1 ~]$ sudo systemctl stop ntpd
[linux@node1 ~]$ sudo systemctl disable ntpd
(2) Modify the ntp.conf configuration file of node1
[linux@node1 ~]$ sudo vim /etc/ntp.conf
The modifications are as follows:
a) Modification 1 (authorize all machines in the 192.168.1.0-192.168.1.255 network segment to query and synchronize time from this machine)
Change
#restrict 192.168.1.0 mask 255.255.255.0 nomodify notrap
to
restrict 192.168.1.0 mask 255.255.255.0 nomodify notrap
b) Modification 2 (the cluster is on a LAN and does not use time from the Internet)
Change
server 0.centos.pool.ntp.org iburst
server 1.centos.pool.ntp.org iburst
server 2.centos.pool.ntp.org iburst
server 3.centos.pool.ntp.org iburst
to
#server 0.centos.pool.ntp.org iburst
#server 1.centos.pool.ntp.org iburst
#server 2.centos.pool.ntp.org iburst
#server 3.centos.pool.ntp.org iburst
c) Addition 3 (when this node loses its network connection, it can still use its local clock as the time source and provide time synchronization to the other nodes in the cluster)
server 127.127.1.0
fudge 127.127.1.0 stratum 10
(3) Modify the /etc/sysconfig/ntpd file of node1
[linux@node1 ~]$ sudo vim /etc/sysconfig/ntpd
Add the following content (synchronize the hardware clock with the system time)
SYNC_HWCLOCK=yes
(4) Restart the ntpd service
[linux@node1 ~]$ sudo systemctl start ntpd
(5) Set the ntpd service to start at boot
[linux@node1 ~]$ sudo systemctl enable ntpd
2) Other machine configurations (must be root)
(1) Configure other machines to synchronize with the time server once every 10 minutes
[linux@node2 ~]$ sudo crontab -e
The scheduled task is as follows:
*/10 * * * * /usr/sbin/ntpdate node1
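To trigger a synchronization immediately instead of waiting for the cron schedule, the same command can also be run manually (shown on node2 for illustration):
[linux@node2 ~]$ sudo /usr/sbin/ntpdate node1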
(2) Modify the time on another machine
[linux@node2 ~]$ sudo date -s "2017-9-11 11:11:11"
(3) Ten minutes later, check whether the machine has synchronized with the time server
[linux@node2 ~]$ sudo date
Note: when testing, you can adjust 10 minutes to 1 minute to save time.