34 Hadoop introduction and cluster setup

Hadoop composition

HDFS Architecture Overview

  • NameNode(nn): stores file metadata, such as the file name, directory structure, file attributes, the block list of each file, and the DataNodes on which each block resides
  • DataNode(dn): stores file block data, plus the checksums of that block data, in the local file system
  • Secondary NameNode(2nn): periodically backs up the NameNode's metadata
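
To see this metadata in practice once the cluster built below is running, HDFS's fsck tool will print exactly these structures for a path: the files, their block lists, and the DataNodes that hold each block (a quick sanity check, not required for the setup):

hdfs fsck / -files -blocks -locations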

Overview of YARN architecture

  • ResourceManager(rm): handles client requests and manages and schedules the resources of the whole cluster
  • NodeManager(nm): manages the resources of a single node and runs the containers
  • ApplicationMaster(am): manages a single application: splitting the data, requesting resources, and monitoring tasks with fault tolerance
  • Container: an abstraction of node resources (memory, CPU, disk, network) in which tasks run

MapReduce Architecture Overview

  • MapReduce splits a computation into two phases: the Map phase processes the input data in parallel, and the Reduce phase aggregates the Map outputs

Big data ecosystem

Recommended system framework

Prepare a template virtual machine (CentOS 7, 4 GB memory, 50 GB disk)

Install the base packages

yum install -y epel-release
yum install -y psmisc nc net-tools rsync vim lrzsz ntp libzstd openssl-static tree iotop git

Turn off the firewall and disable it from starting at boot

systemctl stop firewalld
systemctl disable firewalld
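
An optional check that the firewall is off and will stay off after a reboot:

systemctl status firewalld
systemctl is-enabled firewalld    # should print "disabled"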

Create a regular user and set its password

useradd lixuan
passwd lixuan

Give the lixuan user root privileges

vim /etc/sudoers
## Allow root to run any commands anywhere
root    ALL=(ALL)     ALL
lixuan   ALL=(ALL)     NOPASSWD:ALL
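
A quick way to verify that the sudoers change took effect (optional):

su - lixuan
sudo ls /root    # should succeed without prompting for a password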

Create directories in /opt and change their owner and group

mkdir /opt/module
mkdir /opt/software
chown lixuan:lixuan /opt/module 
chown lixuan:lixuan /opt/software

Uninstall the OpenJDK that ships with the virtual machine

rpm -qa | grep -i java | xargs -n1 rpm -e --nodeps

Restart the virtual machine

reboot

Clone virtual machine node01

Modify the static IP of the cloned machine (this must be changed on every clone)

vim /etc/sysconfig/network-scripts/ifcfg-ens33
DEVICE=ens33
TYPE=Ethernet
ONBOOT=yes
BOOTPROTO=static
NAME="ens33"
IPADDR=192.168.50.100
PREFIX=24
GATEWAY=192.168.50.2
DNS1=192.168.50.2

Check the VMware Virtual Network Editor (the NAT subnet and gateway must match the settings above)

Modify the clone's hostname

vim /etc/hostname

Configure host file

vim /etc/hosts
192.168.50.100 node01
192.168.50.110 node02
192.168.50.120 node03
192.168.50.130 node04

Restart

Modify the hosts file of the Windows host (C:\Windows\System32\drivers\etc\hosts)

Install JDK

ls /opt/software/

hadoop-3.1.3.tar.gz  jdk-8u212-linux-x64.tar.gz

Unzip JDK

tar -zxvf jdk-8u212-linux-x64.tar.gz -C /opt/module/

Configure JDK environment variables

sudo vim /etc/profile.d/my_env.sh
#JAVA_HOME
export JAVA_HOME=/opt/module/jdk1.8.0_212
export PATH=$PATH:$JAVA_HOME/bin
source /etc/profile
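
Verify that the JDK is visible:

java -version
# should print: java version "1.8.0_212"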

Install Hadoop

tar -zxvf hadoop-3.1.3.tar.gz -C /opt/module/

Add Hadoop to environment variable

sudo vim /etc/profile.d/my_env.sh
#HADOOP_HOME
export HADOOP_HOME=/opt/module/hadoop-3.1.3
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
source /etc/profile
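
Verify that Hadoop is on the PATH:

hadoop version
# should print: Hadoop 3.1.3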

Write cluster distribution script xsync

scp

scp    -r          $pdir/$fname              $user@$host:$pdir/$fname
# command  recursive  source path/file name    destination user@host:path/file name
scp -r /opt/module/jdk1.8.0_212  lixuan@node02:/opt/module
scp -r lixuan@node01:/opt/module/* lixuan@node03:/opt/module

rsync

  • It is mainly used for backup and mirroring. It is fast, copies only the differences (files identical on both sides are skipped), and supports symbolic links

    rsync    -av       $pdir/$fname              $user@$host:$pdir/$fname
    # command  options   source path/file name    destination user@host:path/file name
    
    rsync -av /opt/software/* lixuan@node02:/opt/software
    

Write xsync

cd /home/lixuan
mkdir bin
cd bin
vim xsync
#!/bin/bash
#1. Check the number of arguments
if [ $# -lt 1 ]
then
  echo "Usage: xsync <file or directory>..."
  exit
fi
#2. Loop over every machine in the cluster
for host in node01 node02 node03 node04
do
  echo ====================  $host  ====================
  #3. Loop over the given files and directories, sending them one by one
  for file in "$@"
  do
    #4. Check that the file exists
    if [ -e "$file" ]
    then
      #5. Get the parent directory (resolving symlinks with -P)
      pdir=$(cd -P "$(dirname "$file")"; pwd)
      #6. Get the file name
      fname=$(basename "$file")
      ssh $host "mkdir -p $pdir"
      rsync -av "$pdir/$fname" $host:"$pdir"
    else
      echo "$file does not exist!"
    fi
  done
done
chmod +x xsync
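
CentOS 7's default ~/.bash_profile normally adds $HOME/bin to the PATH, so after logging in again the script can be called directly, e.g. to distribute itself to the other nodes:

xsync /home/lixuan/bin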

SSH password free login configuration

ssh-keygen -t rsa    # press Enter three times to accept the defaults
ssh-copy-id node02
ssh-copy-id node03
  • The other machines must perform the same operation (each node generates its own key and copies it to the others)
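
A quick check that password-free login works:

ssh node02 hostname    # should print node02 without prompting for a password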

Cluster deployment planning

        node01                node02                          node03
HDFS    NameNode, DataNode    DataNode                        SecondaryNameNode, DataNode
YARN    NodeManager           ResourceManager, NodeManager    NodeManager

Configure cluster

core-site.xml

cd $HADOOP_HOME/etc/hadoop
vim core-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
    <!-- Specify the NameNode address -->
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://node01:9820</value>
    </property>

    <!-- Specify the directory where Hadoop stores its data -->
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/opt/module/hadoop-3.1.3/data</value>
    </property>

    <!-- Set lixuan as the static user for HDFS web UI logins -->
    <property>
        <name>hadoop.http.staticuser.user</name>
        <value>lixuan</value>
    </property>

    <!-- Hosts from which lixuan (superuser) may act as a proxy -->
    <property>
        <name>hadoop.proxyuser.lixuan.hosts</name>
        <value>*</value>
    </property>

    <!-- Groups whose members lixuan (superuser) may impersonate -->
    <property>
        <name>hadoop.proxyuser.lixuan.groups</name>
        <value>*</value>
    </property>

    <!-- Users that lixuan (superuser) may impersonate -->
    <property>
        <name>hadoop.proxyuser.lixuan.users</name>
        <value>*</value>
    </property>

</configuration>

hdfs-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
    <!-- NameNode web UI address -->
    <property>
        <name>dfs.namenode.http-address</name>
        <value>node01:9870</value>
    </property>
    <!-- Secondary NameNode web UI address -->
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>node03:9868</value>
    </property>
</configuration>

yarn-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
    <!-- Specify the shuffle auxiliary service for MapReduce -->
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <!-- Specify the ResourceManager address -->
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>node02</value>
    </property>
    <!-- Environment variables to inherit -->
    <property>
        <name>yarn.nodemanager.env-whitelist</name>
        <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
    </property>
    <!-- Minimum and maximum memory YARN may allocate to a single container -->
    <property>
        <name>yarn.scheduler.minimum-allocation-mb</name>
        <value>512</value>
    </property>
    <property>
        <name>yarn.scheduler.maximum-allocation-mb</name>
        <value>4096</value>
    </property>
    <!-- Total physical memory the NodeManager may hand out to containers -->
    <property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>4096</value>
    </property>
    <!-- Disable YARN's physical and virtual memory limit checks -->
    <property>
        <name>yarn.nodemanager.pmem-check-enabled</name>
        <value>false</value>
    </property>
    <property>
        <name>yarn.nodemanager.vmem-check-enabled</name>
        <value>false</value>
    </property>
</configuration>

Configure log aggregation

  • Add the following configuration to yarn-site.xml (a restart note follows the snippet)

    <!-- Enable log aggregation -->
    <property>
        <name>yarn.log-aggregation-enable</name>
        <value>true</value>
    </property>
    <!-- Set log aggregation server address -->
    <property>  
        <name>yarn.log.server.url</name>  
        <value>http://node01:19888/jobhistory/logs</value>
    </property>
    <!-- Set the log retention time to 7 days -->
    <property>
        <name>yarn.log-aggregation.retain-seconds</name>
        <value>604800</value>
    </property>
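
Log aggregation only takes effect after the affected daemons are restarted. A minimal sketch, assuming the cluster is already running (if you configure everything before the first start, no restart is needed):

ssh node01 "/opt/module/hadoop-3.1.3/bin/mapred --daemon stop historyserver"
ssh node02 "/opt/module/hadoop-3.1.3/sbin/stop-yarn.sh"
ssh node02 "/opt/module/hadoop-3.1.3/sbin/start-yarn.sh"
ssh node01 "/opt/module/hadoop-3.1.3/bin/mapred --daemon start historyserver"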
    

mapred-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
    <!-- Run MapReduce programs on YARN -->
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

Configure history server

  • Add the following configuration to mapred-site.xml

    <!-- History server address -->
    <property>
        <name>mapreduce.jobhistory.address</name>
        <value>node01:10020</value>
    </property>
    
    <!-- History server web UI address -->
    <property>
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>node01:19888</value>
    </property>
    

Distribute the configuration files

xsync /opt/module/hadoop-3.1.3/etc/hadoop/

Configure workers

vim /opt/module/hadoop-3.1.3/etc/hadoop/workers
#Add the following content. There must be no trailing spaces and no blank lines in the file

node01
node02
node03
#Synchronize all node profiles
xsync /opt/module/hadoop-3.1.3/etc

Start the cluster

First start

  • If this is the first startup, you need to format the NameNode on node01

    hdfs namenode -format
    
  • To format again after the cluster has been running, delete the data and logs directories on all machines before formatting
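
After formatting, start HDFS on node01 and YARN on node02, matching the deployment plan above (these are the same commands the myhadoop.sh script below wraps):

ssh node01 "/opt/module/hadoop-3.1.3/sbin/start-dfs.sh"
ssh node02 "/opt/module/hadoop-3.1.3/sbin/start-yarn.sh"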

Start the history server on node01

mapred --daemon start historyserver
jps
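
With all daemons running, the web UIs should be reachable at the addresses below (9870, 9868, and 19888 come from the configs above; 8088 is YARN's default ResourceManager web port):

http://node01:9870     # NameNode web UI
http://node03:9868     # Secondary NameNode web UI
http://node02:8088     # YARN ResourceManager web UI
http://node01:19888    # JobHistory server web UI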

Write jpsall script

#!/bin/bash
# Run jps on every cluster node, hiding the Jps process itself
for host in node01 node02 node03
do
    echo =============== $host ===============
    ssh $host jps | grep -v Jps
done
chmod +x jpsall
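
Distribute the script with xsync so it works from any node, then run it:

xsync /home/lixuan/bin
jpsall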

Write the cluster startup and shutdown script myhadoop.sh

#!/bin/bash
if [ $# -lt 1 ]
then
    echo "Usage: myhadoop.sh start|stop"
    exit
fi
case $1 in
"start")
        echo " =================== starting hadoop cluster ==================="

        echo " --------------- starting hdfs ---------------"
        ssh node01 "/opt/module/hadoop-3.1.3/sbin/start-dfs.sh"
        echo " --------------- starting yarn ---------------"
        ssh node02 "/opt/module/hadoop-3.1.3/sbin/start-yarn.sh"
        echo " --------------- starting historyserver ---------------"
        ssh node01 "/opt/module/hadoop-3.1.3/bin/mapred --daemon start historyserver"
;;
"stop")
        echo " =================== stopping hadoop cluster ==================="

        echo " --------------- stopping historyserver ---------------"
        ssh node01 "/opt/module/hadoop-3.1.3/bin/mapred --daemon stop historyserver"
        echo " --------------- stopping yarn ---------------"
        ssh node02 "/opt/module/hadoop-3.1.3/sbin/stop-yarn.sh"
        echo " --------------- stopping hdfs ---------------"
        ssh node01 "/opt/module/hadoop-3.1.3/sbin/stop-dfs.sh"
;;
*)
    echo "Input Args Error: expected start or stop"
;;
esac
chmod +x myhadoop.sh
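
Usage, assuming the script lives in /home/lixuan/bin like the others:

myhadoop.sh start
myhadoop.sh stop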
