01 Hadoop Learning Notes - Hadoop Cluster Setup

1 Introduction

1. What is it?
Hadoop is open-source software from Apache for big data storage and computing.
2. What can it do?
a. big data storage  b. distributed computing  c. (compute) resource scheduling
3. Features

High performance, low cost, high efficiency, and high reliability; general-purpose and simple to use.

4. Versions
4.1 Open-source community version
Fast update iteration, but lower compatibility and stability

https://hadoop.apache.org/

4.2 Commercial distributions
Better compatibility and stability, but paid.
5. Components
HDFS (distributed file storage)
MapReduce (distributed data processing; a code-level computing framework)
YARN (resource management and task scheduling, introduced in Hadoop 2.0)
6. Hadoop cluster
Hadoop cluster = HDFS cluster + YARN cluster
HDFS cluster

Master role: NameNode
Slave role: DataNode
Auxiliary role: SecondaryNameNode

YARN cluster

Master role: ResourceManager
Slave role: NodeManager

In this experimental deployment, the HDFS cluster therefore has one master NN, three slave DNs, and one auxiliary SNN,

and the YARN cluster has one master RM and three slave NMs.

2 Deployment

1. Set hostnames (run each command on its corresponding node)

 hostnamectl  set-hostname hadoop-node01
 hostnamectl  set-hostname hadoop-node02
 hostnamectl  set-hostname hadoop-node03

2. Set hosts
Add these mappings on all three nodes; it is also suggested to set them on the host machine (the computer running the VMs).

cat /etc/hosts

192.168.67.200 hadoop-node01
192.168.67.201 hadoop-node02
192.168.67.202 hadoop-node03
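
A quick check that the mappings work, run from any node (a simple sanity check, not part of the original steps):

ping -c 1 hadoop-node02
ping -c 1 hadoop-node03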

3. Turn off the firewall (on all three nodes)

systemctl stop firewalld
systemctl disable firewalld

4. SSH passwordless login

All operations are performed on node 1. Passwordless SSH is what allows the later start scripts and scp commands to operate remotely across the Linux hosts.

# node1 generates a public/private key pair (press Enter through all prompts)
ssh-keygen

# node1 configures passwordless login to all three nodes (itself included)
ssh-copy-id hadoop-node01
ssh-copy-id hadoop-node02
ssh-copy-id hadoop-node03
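
To confirm passwordless login works, each command below should print the remote hostname without prompting for a password:

ssh hadoop-node02 hostname
ssh hadoop-node03 hostname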

5. Time synchronization
If the ntpdate command is not available, install it with yum (see below).

ntpdate ntp5.aliyun.com
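
If the command is missing, it can be installed from the standard CentOS 7 repos (run on every node):

yum install -y ntpdate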


6. Create working directories (needed on all three nodes; see the loop sketch below)

mkdir -p /export/server      # software installation directory
mkdir -p /export/data        # software data directory
mkdir -p /export/software    # installation package directory
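
Since all three nodes need the same layout, one shortcut is to create the directories remotely over the passwordless SSH configured in step 4 (a sketch assuming step 4 succeeded):

for h in hadoop-node01 hadoop-node02 hadoop-node03; do
    ssh root@$h "mkdir -p /export/server /export/data /export/software"
done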

7. Install JDK 1.8 on the first machine

Configure the environment variables and make them take effect: add the export lines below to /etc/profile, then source it.

[root@localhost server]# vi /etc/profile
export JAVA_HOME=/export/server/jdk1.8.0_241
export PATH=$PATH:$JAVA_HOME/bin
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
[root@localhost server]# source /etc/profile
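
For reference, a plausible unpack step before editing the profile, assuming the JDK tarball was uploaded to /export/software (the filename matches the one listed later under /export/server):

cd /export/software
tar -zxvf jdk-8u241-linux-x64.tar.gz -C /export/server/

# after source /etc/profile, verify the JDK is on the PATH
java -version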

From node 1, remote-copy the JDK to the other two machines:

scp -r /export/server/jdk1.8.0_241/ root@hadoop-node02:/export/server/
scp -r /export/server/jdk1.8.0_241/ root@hadoop-node03:/export/server/

The environment variable file can be copied the same way; remember to run source /etc/profile on each target node afterwards:

scp /etc/profile root@hadoop-node02:/etc/profile
scp /etc/profile root@hadoop-node03:/etc/profile

8. Hadoop installation

Directory structure:

  • bin: basic management scripts

  • sbin: start/stop scripts

  • etc: configuration files

  • share: compiled jar packages and examples
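
For reference, a plausible unpack step, again assuming the tarball (named later under /export/server) was uploaded to /export/software:

cd /export/software
tar -zxvf hadoop-3.3.0-Centos7-64-with-snappy.tar.gz -C /export/server/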

9. Modify the configuration

All of the following files are under /export/server/hadoop-3.3.0/etc/hadoop/.

hadoop-env.sh

# Append at the end of the file; the *_USER variables let the start scripts run the daemons as root
export JAVA_HOME=/export/server/jdk1.8.0_241

export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root 

core-site.xml

<!-- Set the default file system; Hadoop supports the local file system, HDFS, GFS, Alibaba/Amazon cloud storage, and others -->
<property>
    <name>fs.defaultFS</name>
    <value>hdfs://hadoop-node01:8020</value>
</property>

<!-- Set the local path where Hadoop stores data -->
<property>
    <name>hadoop.tmp.dir</name>
    <value>/export/data/hadoop-3.3.0</value>
</property>

<!-- Set the static user identity for the HDFS web UI -->
<property>
    <name>hadoop.http.staticuser.user</name>
    <value>root</value>
</property>

<!-- Proxy user settings for Hive integration -->
<property>
    <name>hadoop.proxyuser.root.hosts</name>
    <value>*</value>
</property>

<property>
    <name>hadoop.proxyuser.root.groups</name>
    <value>*</value>
</property>

<!-- File system trash retention time, in minutes (1440 = 1 day) -->
<property>
    <name>fs.trash.interval</name>
    <value>1440</value>
</property>

hdfs-site.xml

<!-- Set the host where the SecondaryNameNode (SNN) process runs -->
<property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>hadoop-node02:9868</value>
</property>

mapred-site.xml

<!-- Default MapReduce run mode: yarn = cluster mode, local = local mode -->
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>

<!-- MapReduce JobHistory server address -->
<property>
  <name>mapreduce.jobhistory.address</name>
  <value>hadoop-node01:10020</value>
</property>
 
<!-- MapReduce JobHistory server web UI address -->
<property>
  <name>mapreduce.jobhistory.webapp.address</name>
  <value>hadoop-node01:19888</value>
</property>

<property>
  <name>yarn.app.mapreduce.am.env</name>
  <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>

<property>
  <name>mapreduce.map.env</name>
  <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>

<property>
  <name>mapreduce.reduce.env</name>
  <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>

yarn-site.xml

<!-- Set the host where the YARN cluster's master role (ResourceManager) runs -->
<property>
	<name>yarn.resourcemanager.hostname</name>
	<value>hadoop-node01</value>
</property>

<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
</property>

<!-- Whether to enforce physical memory limits on containers -->
<property>
    <name>yarn.nodemanager.pmem-check-enabled</name>
    <value>false</value>
</property>

<!-- Whether to enforce virtual memory limits on containers -->
<property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
</property>

<!-- Enable log aggregation -->
<property>
  <name>yarn.log-aggregation-enable</name>
  <value>true</value>
</property>

<!-- Set the YARN log server URL (points to the JobHistory server) -->
<property>
    <name>yarn.log.server.url</name>
    <value>http://hadoop-node01:19888/jobhistory/logs</value>
</property>

<!-- Retain aggregated logs for 7 days (604800 seconds) -->
<property>
  <name>yarn.log-aggregation.retain-seconds</name>
  <value>604800</value>
</property>

workers (lists the hosts that run the slave roles, DataNode and NodeManager):

vi workers

hadoop-node01
hadoop-node02
hadoop-node03

10. Distribute the installation package to the other machines

cd /export/server

scp -r hadoop-3.3.0 root@hadoop-node02:$PWD
scp -r hadoop-3.3.0 root@hadoop-node03:$PWD

11. Add Hadoop to the environment variables (sync /etc/profile to the other two nodes afterwards, as in step 7)

vim /etc/profile

export HADOOP_HOME=/export/server/hadoop-3.3.0
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

source /etc/profile

After typing hadoop, the available commands are listed. (This is a good point to take a VM snapshot.)
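
A quick sanity check that the variables took effect:

hadoop version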
12. HDFS initialization (format)

Run the format command exactly once, and only on node 1 (the NameNode); reformatting an existing cluster destroys its HDFS metadata.

hdfs namenode -format

Initialization succeeded if the log reports that the storage directory "has been successfully formatted".

13. Start the services from node 1

Start the HDFS service:

start-dfs.sh


Verify that the service started successfully

node1

[root@localhost server]# jps
3732 Jps
3445 DataNode
3323 NameNode

node2

[root@localhost ~]# jps
14880 DataNode
14993 SecondaryNameNode
15116 Jps

node3

[root@localhost ~]# jps
2680 Jps
2618 DataNode

Start the YARN service only on the first machine

start-yarn.sh

Verify that the service started successfully

node01

[root@localhost server]# jps
4321 Jps
3987 NodeManager
3445 DataNode
3864 ResourceManager
3323 NameNode

node02

[root@localhost ~]# jps
14880 DataNode
14993 SecondaryNameNode
15475 NodeManager
15615 Jps

node03

[root@localhost ~]# jps
2738 NodeManager
2618 DataNode
2828 Jps
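
Rather than logging in to each machine, the checks can be gathered in one loop over the passwordless SSH from step 4 (a convenience sketch):

for h in hadoop-node01 hadoop-node02 hadoop-node03; do
    echo "===== $h ====="
    # if jps is not on the non-interactive PATH, use /export/server/jdk1.8.0_241/bin/jps
    ssh root@$h jps
done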

3 System UI and Simple Usage

HDFS web UI:
http://hadoop-node01:9870/

The file system can be browsed from the UI (Utilities > Browse the file system).

YARN web UI:
http://hadoop-node01:8088/

4 Basic System Operations

HDFS

[root@localhost server]# hadoop fs -ls /
[root@localhost server]# hadoop fs -mkdir /hellohdfs
[root@localhost server]# ls
hadoop-3.3.0  hadoop-3.3.0-Centos7-64-with-snappy.tar.gz  hadoop-3.3.0-src.tar.gz  jdk1.8.0_241  jdk-8u241-linux-x64.tar.gz
[root@localhost server]# echo hello > hello.txt
[root@localhost server]# hadoop fs -put hello.txt /hellohdfs
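
A few more everyday HDFS commands worth trying at this point (all standard hadoop fs subcommands):

hadoop fs -cat /hellohdfs/hello.txt     # print the file's contents
hadoop fs -get /hellohdfs/hello.txt .   # download it to the local directory
hadoop fs -rm /hellohdfs/hello.txt      # delete it (kept in trash for 1440 minutes per core-site.xml)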

YARN

cd /export/server/hadoop-3.3.0/share/hadoop/mapreduce
Case 1: calculating pi

[root@localhost mapreduce]# hadoop jar hadoop-mapreduce-examples-3.3.0.jar pi 2 2
Number of Maps  = 2
Samples per Map = 2
Wrote input for Map #0
Wrote input for Map #1
Starting Job
2022-03-06 15:44:32,206 INFO client.DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at hadoop-node01/192.168.67.200:8032
2022-03-06 15:44:34,363 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/root/.staging/job_1646551219840_0001
2022-03-06 15:44:35,119 INFO input.FileInputFormat: Total input files to process : 2
2022-03-06 15:44:35,395 INFO mapreduce.JobSubmitter: number of splits:2
2022-03-06 15:44:36,376 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1646551219840_0001
2022-03-06 15:44:36,377 INFO mapreduce.JobSubmitter: Executing with tokens: []
2022-03-06 15:44:37,306 INFO conf.Configuration: resource-types.xml not found
2022-03-06 15:44:37,308 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2022-03-06 15:44:39,455 INFO impl.YarnClientImpl: Submitted application application_1646551219840_0001
2022-03-06 15:44:39,947 INFO mapreduce.Job: The url to track the job: http://hadoop-node01:8088/proxy/application_1646551219840_0001/
2022-03-06 15:44:39,948 INFO mapreduce.Job: Running job: job_1646551219840_0001
2022-03-06 15:45:10,432 INFO mapreduce.Job: Job job_1646551219840_0001 running in uber mode : false
2022-03-06 15:45:10,437 INFO mapreduce.Job:  map 0% reduce 0%
2022-03-06 15:45:41,275 INFO mapreduce.Job:  map 100% reduce 0%
2022-03-06 15:45:57,587 INFO mapreduce.Job:  map 100% reduce 100%
2022-03-06 15:45:58,682 INFO mapreduce.Job: Job job_1646551219840_0001 completed successfully
2022-03-06 15:45:59,575 INFO mapreduce.Job: Counters: 54
        File System Counters
                FILE: Number of bytes read=50
                FILE: Number of bytes written=795297
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=536
                HDFS: Number of bytes written=215
                HDFS: Number of read operations=13
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=3
                HDFS: Number of bytes read erasure-coded=0
        Job Counters
                Launched map tasks=2
                Launched reduce tasks=1
                Data-local map tasks=2
                Total time spent by all maps in occupied slots (ms)=52488
                Total time spent by all reduces in occupied slots (ms)=13859
                Total time spent by all map tasks (ms)=52488
                Total time spent by all reduce tasks (ms)=13859
                Total vcore-milliseconds taken by all map tasks=52488
                Total vcore-milliseconds taken by all reduce tasks=13859
                Total megabyte-milliseconds taken by all map tasks=53747712
                Total megabyte-milliseconds taken by all reduce tasks=14191616
        Map-Reduce Framework
                Map input records=2
                Map output records=4
                Map output bytes=36
                Map output materialized bytes=56
                Input split bytes=300
                Combine input records=0
                Combine output records=0
                Reduce input groups=2
                Reduce shuffle bytes=56
                Reduce input records=4
                Reduce output records=0
                Spilled Records=8
                Shuffled Maps =2
                Failed Shuffles=0
                Merged Map outputs=2
                GC time elapsed (ms)=1004
                CPU time spent (ms)=4930
                Physical memory (bytes) snapshot=504107008
                Virtual memory (bytes) snapshot=8212619264
                Total committed heap usage (bytes)=263778304
                Peak Map Physical memory (bytes)=198385664
                Peak Map Virtual memory (bytes)=2737340416
                Peak Reduce Physical memory (bytes)=109477888
                Peak Reduce Virtual memory (bytes)=2741846016
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=236
        File Output Format Counters
                Bytes Written=97
Job Finished in 88.0 seconds
Estimated value of Pi is 4.00000000000000000000
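
The estimate of 4.0 is this coarse because only 2 maps x 2 samples = 4 points were generated, and all of them fell inside the quarter circle. The first argument is the number of map tasks and the second the number of samples per map, so larger values tighten the estimate, for example:

hadoop jar hadoop-mapreduce-examples-3.3.0.jar pi 10 1000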

Case 2: word count
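
mywords.txt is simply any local text file; if none is at hand, something like this works (contents invented for illustration):

echo -e "hello hadoop\nhello hdfs\nhello yarn\nhello world" > mywords.txt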

hadoop fs -mkdir -p /wordcount/input
hadoop fs -put mywords.txt /wordcount/input
[root@localhost mapreduce]# hadoop jar hadoop-mapreduce-examples-3.3.0.jar wordcount /wordcount/input /wordcount/output
2022-03-06 15:52:46,025 INFO client.DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at hadoop-node01/192.168.67.200:8032
2022-03-06 15:52:48,220 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/root/.staging/job_1646551219840_0002
2022-03-06 15:52:49,299 INFO input.FileInputFormat: Total input files to process : 1
2022-03-06 15:52:49,673 INFO mapreduce.JobSubmitter: number of splits:1
2022-03-06 15:52:50,671 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1646551219840_0002
2022-03-06 15:52:50,672 INFO mapreduce.JobSubmitter: Executing with tokens: []
2022-03-06 15:52:51,504 INFO conf.Configuration: resource-types.xml not found
2022-03-06 15:52:51,505 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2022-03-06 15:52:51,755 INFO impl.YarnClientImpl: Submitted application application_1646551219840_0002
2022-03-06 15:52:52,019 INFO mapreduce.Job: The url to track the job: http://hadoop-node01:8088/proxy/application_1646551219840_0002/
2022-03-06 15:52:52,036 INFO mapreduce.Job: Running job: job_1646551219840_0002
2022-03-06 15:53:17,361 INFO mapreduce.Job: Job job_1646551219840_0002 running in uber mode : false
2022-03-06 15:53:17,380 INFO mapreduce.Job:  map 0% reduce 0%
2022-03-06 15:53:41,194 INFO mapreduce.Job:  map 100% reduce 0%
2022-03-06 15:54:08,530 INFO mapreduce.Job:  map 100% reduce 100%
2022-03-06 15:54:09,601 INFO mapreduce.Job: Job job_1646551219840_0002 completed successfully
2022-03-06 15:54:10,177 INFO mapreduce.Job: Counters: 54
        File System Counters
                FILE: Number of bytes read=3367
                FILE: Number of bytes written=536181
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=4681
                HDFS: Number of bytes written=2551
                HDFS: Number of read operations=8
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=2
                HDFS: Number of bytes read erasure-coded=0
        Job Counters
                Launched map tasks=1
                Launched reduce tasks=1
                Data-local map tasks=1
                Total time spent by all maps in occupied slots (ms)=18564
                Total time spent by all reduces in occupied slots (ms)=23913
                Total time spent by all map tasks (ms)=18564
                Total time spent by all reduce tasks (ms)=23913
                Total vcore-milliseconds taken by all map tasks=18564
                Total vcore-milliseconds taken by all reduce tasks=23913
                Total megabyte-milliseconds taken by all map tasks=19009536
                Total megabyte-milliseconds taken by all reduce tasks=24486912
        Map-Reduce Framework
                Map input records=86
                Map output records=428
                Map output bytes=5358
                Map output materialized bytes=3367
                Input split bytes=118
                Combine input records=428
                Combine output records=204
                Reduce input groups=204
                Reduce shuffle bytes=3367
                Reduce input records=204
                Reduce output records=204
                Spilled Records=408
                Shuffled Maps =1
                Failed Shuffles=0
                Merged Map outputs=1
                GC time elapsed (ms)=583
                CPU time spent (ms)=4690
                Physical memory (bytes) snapshot=285466624
                Virtual memory (bytes) snapshot=5475692544
                Total committed heap usage (bytes)=139329536
                Peak Map Physical memory (bytes)=190091264
                Peak Map Virtual memory (bytes)=2733432832
                Peak Reduce Physical memory (bytes)=95375360
                Peak Reduce Virtual memory (bytes)=2742259712
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=4563
        File Output Format Counters
                Bytes Written=2551

After the job finishes, HDFS automatically creates the /wordcount/output directory.

Browse into the directory in the web UI, download the result file, and open it with Notepad++.
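
Alternatively, read the result straight from the shell; with the default single reducer the output file is named part-r-00000:

hadoop fs -cat /wordcount/output/part-r-00000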

Note: if the cluster runs in virtual machines, add the VMs' IP and hostname mappings to the host machine's hosts file before using the web UI from the host; otherwise downloads will fail, because the download links use the cluster hostnames.
