34 Hadoop introduction and cluster setup

Hadoop composition

HDFS Architecture Overview

  • NameNode(nn): stores file metadata, such as the file name, directory structure, file attributes, the block list of each file, and the DataNodes on which each block resides
  • DataNode(dn): stores file block data, plus the checksums of that block data, in the local file system
  • Secondary NameNode(2nn): periodically backs up the NameNode's metadata
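
To see this metadata in practice once the cluster built below is running, HDFS's fsck tool will print exactly these structures for a path: the files, their block lists, and the DataNodes that hold each block (a quick sanity check, not required for the setup):

hdfs fsck / -files -blocks -locations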

Overview of YARN architecture

  • ResourceManager(rm): handles client requests and manages and schedules the resources of the whole cluster
  • NodeManager(nm): manages the resources of a single node and runs the containers
  • ApplicationMaster(am): manages a single application: splitting the data, requesting resources, and monitoring tasks with fault tolerance
  • Container: an abstraction of node resources (memory, CPU, disk, network) in which tasks run

MapReduce Architecture Overview

  • MapReduce splits a computation into two phases: the Map phase processes the input data in parallel, and the Reduce phase aggregates the Map outputs

Big data ecosystem

Recommended system framework

Prepare a template virtual machine (CentOS 7, 4 GB memory, 50 GB disk)

Install the base packages

yum install -y epel-release
yum install -y psmisc nc net-tools rsync vim lrzsz ntp libzstd openssl-static tree iotop git

Turn off the firewall and disable it from starting at boot

systemctl stop firewalld
systemctl disable firewalld
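
An optional check that the firewall is off and will stay off after a reboot:

systemctl status firewalld
systemctl is-enabled firewalld    # should print "disabled"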

Create a regular user and set its password

useradd lixuan
passwd lixuan

Give the lixuan user root privileges

vim /etc/sudoers
## Allow root to run any commands anywhere
root    ALL=(ALL)     ALL
lixuan   ALL=(ALL)     NOPASSWD:ALL
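
A quick way to verify that the sudoers change took effect (optional):

su - lixuan
sudo ls /root    # should succeed without prompting for a password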

Create directories in /opt and change their owner and group

mkdir /opt/module
mkdir /opt/software
chown lixuan:lixuan /opt/module 
chown lixuan:lixuan /opt/software

Uninstall the OpenJDK that ships with the virtual machine

rpm -qa | grep -i java | xargs -n1 rpm -e --nodeps

Restart the virtual machine

reboot

Clone virtual machine node01

Modify the static IP of the cloned machine (this must be changed on every clone)

vim /etc/sysconfig/network-scripts/ifcfg-ens33
DEVICE=ens33
TYPE=Ethernet
ONBOOT=yes
BOOTPROTO=static
NAME="ens33"
IPADDR=192.168.50.100
PREFIX=24
GATEWAY=192.168.50.2
DNS1=192.168.50.2

Check the VMware Virtual Network Editor (the NAT subnet and gateway must match the settings above)

Modify the clone's hostname

vim /etc/hostname

Configure host file

vim /etc/hosts
192.168.50.100 node01
192.168.50.110 node02
192.168.50.120 node03
192.168.50.130 node04

Restart

Modify the hosts file of the Windows host (C:\Windows\System32\drivers\etc\hosts)

Install JDK

ls /opt/software/

hadoop-3.1.3.tar.gz  jdk-8u212-linux-x64.tar.gz

Unzip JDK

tar -zxvf jdk-8u212-linux-x64.tar.gz -C /opt/module/

Configure JDK environment variables

sudo vim /etc/profile.d/my_env.sh
#JAVA_HOME
export JAVA_HOME=/opt/module/jdk1.8.0_212
export PATH=$PATH:$JAVA_HOME/bin
source /etc/profile
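
Verify that the JDK is visible:

java -version
# should print: java version "1.8.0_212"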

Install Hadoop

tar -zxvf hadoop-3.1.3.tar.gz -C /opt/module/

Add Hadoop to environment variable

sudo vim /etc/profile.d/my_env.sh
#HADOOP_HOME
export HADOOP_HOME=/opt/module/hadoop-3.1.3
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
source /etc/profile
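
Verify that Hadoop is on the PATH:

hadoop version
# should print: Hadoop 3.1.3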

Write cluster distribution script xsync

scp

scp    -r          $pdir/$fname              $user@$host:$pdir/$fname
# command  recursive  source path/file name    destination user@host:path/file name
scp -r /opt/module/jdk1.8.0_212  lixuan@node02:/opt/module
scp -r lixuan@node01:/opt/module/* lixuan@node03:/opt/module

rsync

  • It is mainly used for backup and mirroring. It is fast, copies only the differences (files identical on both sides are skipped), and supports symbolic links

    rsync    -av       $pdir/$fname              $user@$host:$pdir/$fname
    # command  options   source path/file name    destination user@host:path/file name
    
    rsync -av /opt/software/* lixuan@node02:/opt/software
    

Write xsync

cd /home/lixuan
mkdir bin
cd bin
vim xsync
#!/bin/bash
#1. Check the number of arguments
if [ $# -lt 1 ]
then
  echo "Usage: xsync <file or directory>..."
  exit
fi
#2. Loop over every machine in the cluster
for host in node01 node02 node03 node04
do
  echo ====================  $host  ====================
  #3. Loop over the given files and directories, sending them one by one
  for file in "$@"
  do
    #4. Check that the file exists
    if [ -e "$file" ]
    then
      #5. Get the parent directory (resolving symlinks with -P)
      pdir=$(cd -P "$(dirname "$file")"; pwd)
      #6. Get the file name
      fname=$(basename "$file")
      ssh $host "mkdir -p $pdir"
      rsync -av "$pdir/$fname" $host:"$pdir"
    else
      echo "$file does not exist!"
    fi
  done
done
chmod +x xsync
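
CentOS 7's default ~/.bash_profile normally adds $HOME/bin to the PATH, so after logging in again the script can be called directly, e.g. to distribute itself to the other nodes:

xsync /home/lixuan/bin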

SSH password free login configuration

ssh-keygen -t rsa    # press Enter three times to accept the defaults
ssh-copy-id node02
ssh-copy-id node03
  • The other machines must perform the same operation (each node generates its own key and copies it to the others)
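
A quick check that password-free login works:

ssh node02 hostname    # should print node02 without prompting for a password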

Cluster deployment planning

        node01                node02                          node03
HDFS    NameNode, DataNode    DataNode                        SecondaryNameNode, DataNode
YARN    NodeManager           ResourceManager, NodeManager    NodeManager

Configure cluster

core-site.xml

cd $HADOOP_HOME/etc/hadoop
vim core-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
    <!-- Specify the NameNode address -->
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://node01:9820</value>
    </property>

    <!-- Specify the directory where Hadoop stores its data -->
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/opt/module/hadoop-3.1.3/data</value>
    </property>

    <!-- Set lixuan as the static user for HDFS web UI logins -->
    <property>
        <name>hadoop.http.staticuser.user</name>
        <value>lixuan</value>
    </property>

    <!-- Hosts from which lixuan (superuser) may act as a proxy -->
    <property>
        <name>hadoop.proxyuser.lixuan.hosts</name>
        <value>*</value>
    </property>

    <!-- Groups whose members lixuan (superuser) may impersonate -->
    <property>
        <name>hadoop.proxyuser.lixuan.groups</name>
        <value>*</value>
    </property>

    <!-- Users that lixuan (superuser) may impersonate -->
    <property>
        <name>hadoop.proxyuser.lixuan.users</name>
        <value>*</value>
    </property>

</configuration>

hdfs-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
    <!-- NameNode web UI address -->
    <property>
        <name>dfs.namenode.http-address</name>
        <value>node01:9870</value>
    </property>
    <!-- Secondary NameNode web UI address -->
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>node03:9868</value>
    </property>
</configuration>

yarn-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
    <!-- Specify the shuffle auxiliary service for MapReduce -->
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <!-- Specify the ResourceManager address -->
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>node02</value>
    </property>
    <!-- Environment variables to inherit -->
    <property>
        <name>yarn.nodemanager.env-whitelist</name>
        <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
    </property>
    <!-- Minimum and maximum memory YARN may allocate to a single container -->
    <property>
        <name>yarn.scheduler.minimum-allocation-mb</name>
        <value>512</value>
    </property>
    <property>
        <name>yarn.scheduler.maximum-allocation-mb</name>
        <value>4096</value>
    </property>
    <!-- Total physical memory the NodeManager may hand out to containers -->
    <property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>4096</value>
    </property>
    <!-- Disable YARN's physical and virtual memory limit checks -->
    <property>
        <name>yarn.nodemanager.pmem-check-enabled</name>
        <value>false</value>
    </property>
    <property>
        <name>yarn.nodemanager.vmem-check-enabled</name>
        <value>false</value>
    </property>
</configuration>

Configure log aggregation

  • Add the following configuration to yarn-site.xml (a restart note follows the snippet)

    <!-- Enable log aggregation -->
    <property>
        <name>yarn.log-aggregation-enable</name>
        <value>true</value>
    </property>
    <!-- Set log aggregation server address -->
    <property>  
        <name>yarn.log.server.url</name>  
        <value>http://node01:19888/jobhistory/logs</value>
    </property>
    <!-- Set the log retention time to 7 days -->
    <property>
        <name>yarn.log-aggregation.retain-seconds</name>
        <value>604800</value>
    </property>
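
Log aggregation only takes effect after the affected daemons are restarted. A minimal sketch, assuming the cluster is already running (if you configure everything before the first start, no restart is needed):

ssh node01 "/opt/module/hadoop-3.1.3/bin/mapred --daemon stop historyserver"
ssh node02 "/opt/module/hadoop-3.1.3/sbin/stop-yarn.sh"
ssh node02 "/opt/module/hadoop-3.1.3/sbin/start-yarn.sh"
ssh node01 "/opt/module/hadoop-3.1.3/bin/mapred --daemon start historyserver"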
    

mapred-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
    <!-- Run MapReduce programs on YARN -->
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

Configure history server

  • Add the following configuration to mapred-site.xml

    <!-- History server address -->
    <property>
        <name>mapreduce.jobhistory.address</name>
        <value>node01:10020</value>
    </property>
    
    <!-- History server web UI address -->
    <property>
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>node01:19888</value>
    </property>
    

Distribute the configuration files

xsync /opt/module/hadoop-3.1.3/etc/hadoop/

Configure workers

vim /opt/module/hadoop-3.1.3/etc/hadoop/workers
#Add the following content. There must be no trailing spaces and no blank lines in the file

node01
node02
node03
#Synchronize all node profiles
xsync /opt/module/hadoop-3.1.3/etc

Start the cluster

First start

  • If this is the first startup, you need to format the NameNode on node01

    hdfs namenode -format
    
  • To format again after the cluster has been running, delete the data and logs directories on all machines before formatting
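
After formatting, start HDFS on node01 and YARN on node02, matching the deployment plan above (these are the same commands the myhadoop.sh script below wraps):

ssh node01 "/opt/module/hadoop-3.1.3/sbin/start-dfs.sh"
ssh node02 "/opt/module/hadoop-3.1.3/sbin/start-yarn.sh"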

Start the history server on node01

mapred --daemon start historyserver
jps
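
With all daemons running, the web UIs should be reachable at the addresses below (9870, 9868, and 19888 come from the configs above; 8088 is YARN's default ResourceManager web port):

http://node01:9870     # NameNode web UI
http://node03:9868     # Secondary NameNode web UI
http://node02:8088     # YARN ResourceManager web UI
http://node01:19888    # JobHistory server web UI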

Write jpsall script

#!/bin/bash
# Run jps on every cluster node, hiding the Jps process itself
for host in node01 node02 node03
do
    echo =============== $host ===============
    ssh $host jps | grep -v Jps
done
chmod +x jpsall
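
Distribute the script with xsync so it works from any node, then run it:

xsync /home/lixuan/bin
jpsall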

Write the cluster startup and shutdown script myhadoop.sh

#!/bin/bash
if [ $# -lt 1 ]
then
    echo "Usage: myhadoop.sh start|stop"
    exit
fi
case $1 in
"start")
        echo " =================== starting hadoop cluster ==================="

        echo " --------------- starting hdfs ---------------"
        ssh node01 "/opt/module/hadoop-3.1.3/sbin/start-dfs.sh"
        echo " --------------- starting yarn ---------------"
        ssh node02 "/opt/module/hadoop-3.1.3/sbin/start-yarn.sh"
        echo " --------------- starting historyserver ---------------"
        ssh node01 "/opt/module/hadoop-3.1.3/bin/mapred --daemon start historyserver"
;;
"stop")
        echo " =================== stopping hadoop cluster ==================="

        echo " --------------- stopping historyserver ---------------"
        ssh node01 "/opt/module/hadoop-3.1.3/bin/mapred --daemon stop historyserver"
        echo " --------------- stopping yarn ---------------"
        ssh node02 "/opt/module/hadoop-3.1.3/sbin/stop-yarn.sh"
        echo " --------------- stopping hdfs ---------------"
        ssh node01 "/opt/module/hadoop-3.1.3/sbin/stop-dfs.sh"
;;
*)
    echo "Input Args Error: expected start or stop"
;;
esac
chmod +x myhadoop.sh
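
Usage, assuming the script lives in /home/lixuan/bin like the others:

myhadoop.sh start
myhadoop.sh stop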
