Hadoop cluster ecosystem setup

Hadoop cluster setup (continuously updated)

Resource files used in this article: extraction code eeee

1: Preparations to complete before starting the setup

  1. A Linux server has been set up
  2. The server can reach the public network (e.g. ping www.baidu.com succeeds)
  3. Xshell connection (can be omitted)
  4. Check the server version information
  5. Update system components
yum update
  6. Stop the firewall
systemctl stop firewalld
  7. Disable the firewall so it does not start on boot
systemctl disable firewalld
  8. Change the host name (optional)
hostnamectl set-hostname MrBun
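
Before moving on, it may help to confirm that these steps took effect. A minimal check, assuming a CentOS/RHEL system with systemd (which the yum and systemctl commands above imply):

systemctl is-active firewalld     # expect: inactive
systemctl is-enabled firewalld    # expect: disabled
hostnamectl                       # the static hostname should now be MrBun (if you changed it)
ping -c 3 www.baidu.com           # confirms public network access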

2: Java JDK installation

Hadoop is built on Java, so it needs a Java environment. All resource files are in the link at the beginning of the article; if the link has expired, please leave a message and it will be updated!!!
Some Linux systems come with their own Java environment. If yours does, you can skip this step. Of course, you can also install another version of the Java JDK.

  • Create java directory
mkdir -p /usr/java
  • Enter the java directory
cd /usr/java
  • Upload Java JDK

If you used Xshell in the preparation step, you can install lrzsz here with the command yum -y install lrzsz.
Use the rz -y command to upload files, and the sz command to download files to the local machine.
Of course, you can also upload and download with third-party software, or transfer files with the scp command.
There are many ways to do this; you can search for whichever suits you.
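
For example, a minimal scp sketch run from the local machine (the user, server IP, and local file path here are placeholders for your own environment):

scp ./jdk-8u171-linux-x64.tar.gz root@192.168.25.128:/usr/java/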

After uploading, use the ll command to list the files in the current directory.

  • Extract to current directory
tar -zxvf jdk-8u171-linux-x64.tar.gz -C /usr/java/
  • Check that the archive extracted successfully; you can then delete the installation package (keeping it is also fine)
Delete command
rm -rf jdk-8u171-linux-x64.tar.gz

  • Configure environment variables
vim /etc/profile

Add at the end of the document

export JAVA_HOME=/usr/java/jdk1.8.0_171
export CLASSPATH=$JAVA_HOME/lib/
export PATH=$PATH:$JAVA_HOME/bin
export PATH JAVA_HOME CLASSPATH

  • Whenever you add or change environment variables, re-source the file for them to take effect
source /etc/profile
  • Check whether java is installed successfully
java -version
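
As a quick sanity check that the variables point where they should (using the paths configured above):

echo $JAVA_HOME                 # expect: /usr/java/jdk1.8.0_171
$JAVA_HOME/bin/java -version    # should report version 1.8.0_171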


3: Hadoop installation

Hadoop has a distributed architecture: large data-processing tasks are completed on the different machines (nodes) of a cluster. This article uses Hadoop 2.x.
Hadoop 2.x has four basic components:
1: Hadoop Common
2: Hadoop Distributed File System (HDFS)
3: MapReduce
4: YARN
Hadoop Common provides the configuration, interface, and utility support that components 2, 3, and 4 build on.
HDFS provides storage for data.
MapReduce provides computation over data.
YARN handles resource scheduling.

  • Create Hadoop working directory
mkdir -p /usr/hadoop
cd /usr/hadoop
  • Upload, decompress and delete the compressed package
tar -zxvf hadoop-2.7.3.tar.gz -C /usr/hadoop/
rm -rf hadoop-2.7.3.tar.gz

  • Configure environment variables
vim /etc/profile

Add the following configuration items

export HADOOP_HOME=/usr/hadoop/hadoop-2.7.3
export CLASSPATH=$CLASSPATH:$HADOOP_HOME/lib
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin

  • Activate environment variable
source /etc/profile
  • Check whether the installation is successful
hadoop version
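
As with the JDK, a quick sanity check under the paths configured above:

echo $HADOOP_HOME               # expect: /usr/hadoop/hadoop-2.7.3
hadoop version | head -n 1      # first line should read: Hadoop 2.7.3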

Five configuration files need to be edited after installation:
1: hadoop-env.sh
Hadoop reads the Java environment from this file at runtime, so JAVA_HOME must be written into it.
2: core-site.xml
This file sets the path for temporary files and the HDFS communication address used while Hadoop is running.
3: hdfs-site.xml
This file sets the paths where HDFS stores namespace metadata and data blocks at runtime.
4: yarn-site.xml
Sets information related to resource scheduling.
5: mapred-site.xml
Sets the framework MapReduce runs on.

Enter the configuration file directory

cd /usr/hadoop/hadoop-2.7.3/etc/hadoop
  • hadoop-env.sh
vim hadoop-env.sh
export JAVA_HOME=/usr/java/jdk1.8.0_171

  • core-site.xml
vim core-site.xml

The localhost here is the host name from the preparation step; if you changed the host name, remember to make the corresponding change here. The same applies to the configuration files that follow. Note that each group of <property> elements below goes inside the <configuration> element of its file.

<property>
  <name>hadoop.proxyuser.root.groups</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.root.hosts</name>
  <value>*</value>
</property>
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:9000</value>
</property>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/usr/hadoop/hadoop-2.7.3/hdfs/tmp</value>
  <description>A base for other temporary directories.</description>
</property>
<property>
  <name>io.file.buffer.size</name>
  <value>131072</value>
</property>
<property>
  <name>fs.checkpoint.period</name>
  <value>60</value>
</property>
<property>
  <name>fs.checkpoint.size</name>
  <value>67108864</value>
</property>

  • hdfs-site.xml
vim hdfs-site.xml
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:/usr/hadoop/hadoop-2.7.3/hdfs/name</value>
  <final>true</final>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:/usr/hadoop/hadoop-2.7.3/hdfs/data</value>
  <final>true</final>
</property>
<property>
  <name>dfs.namenode.secondary.http-address</name>
  <value>localhost:9001</value>
</property>
<property>
  <name>dfs.webhdfs.enabled</name>
  <value>true</value>
</property>
<property>
  <name>dfs.permissions</name>
  <value>false</value>
</property>

  • yarn-site.xml
vim yarn-site.xml
<!-- Specify the address of the ResourceManager -->
<property>
  <name>yarn.resourcemanager.address</name>
  <value>localhost:18040</value>
</property>
<property>
  <name>yarn.resourcemanager.scheduler.address</name>
  <value>localhost:18030</value>
</property>
<property>
  <name>yarn.resourcemanager.webapp.address</name>
  <value>localhost:18088</value>
</property>
<property>
  <name>yarn.resourcemanager.resource-tracker.address</name>
  <value>localhost:18025</value>
</property>
<property>
  <name>yarn.resourcemanager.admin.address</name>
  <value>localhost:18141</value>
</property>
<!-- Specify how reducers fetch data -->
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>

  • mapred-site.xml
Hadoop does not provide this file by default, so we create it ourselves from the template:
cp mapred-site.xml.template mapred-site.xml
vim mapred-site.xml
<!-- Specify that MapReduce runs on YARN -->
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>

  • Initialize the Hadoop NameNode
hadoop namenode -format

During formatting, Hadoop creates the NameNode metadata directory and the temporary file directory on the local disk, using the paths set in the configuration files; in other words, it lays down the storage framework of HDFS. If no error is reported and "Exiting with status 0" appears, the format succeeded.
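
One rough way to double-check, assuming the dfs.namenode.name.dir path configured above: the format should have created a current/ directory under the name directory.

ls /usr/hadoop/hadoop-2.7.3/hdfs/name/current    # expect files such as VERSION, fsimage_* and seen_txid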

  • Start Hadoop
cd /usr/hadoop/hadoop-2.7.3/
sbin/start-all.sh

start-all.sh starts all services
stop-all.sh stops all services
Both are in the sbin directory

The first time you connect you will be asked whether to continue connecting; type yes and press Enter. The password requested is the current user's password (if you are root, enter the root password). The services will then start. Entering the password every time is tedious, so the next step sets up passwordless SSH.

The started processes and their roles:

NameNode: mainly responsible for managing metadata, such as file names, directory structure, attributes, and the storage locations of data blocks.
DataNode: responsible for the actual storage and retrieval of data blocks.
ResourceManager: responsible for the unified management and allocation of all computing resources in the cluster.
NodeManager: the agent on each machine; it manages containers, monitors their resource usage, and reports usage to the ResourceManager.
SecondaryNameNode: an auxiliary to the NameNode with two roles: image backup and periodic merging of the edit log with the image. Note: it is not a backup of the NameNode.
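
A quick way to confirm these daemons are actually running is the jps tool that ships with the JDK; on this single-node setup all five should appear (process IDs will differ):

jps
# expected, in any order (plus the Jps process itself):
# NameNode
# DataNode
# SecondaryNameNode
# ResourceManager
# NodeManager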
  • Passwordless SSH
cd ~                                # return to the home directory
ssh localhost                       # localhost is the host name
ll -a
cd .ssh                             # enter the .ssh folder
ssh-keygen -t rsa                   # create the key pair; just press Enter at every prompt
cat id_rsa.pub >> authorized_keys   # append the public key to authorized_keys
chmod 600 authorized_keys           # set the required permissions

After setting up passwordless SSH, you no longer need to enter a password when starting or stopping the Hadoop cluster.
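
To confirm, ssh into the host once more; it should log in without asking for a password:

ssh localhost    # should no longer prompt for a password
exit             # return to the original session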

View the Hadoop cluster you just built through a browser.
In the browser's address bar, enter the IP address of the Linux server followed by port 50070.
For example:
http://192.168.25.128:50070
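
Besides the web UI, a small HDFS smoke test from the shell can confirm the cluster is usable (the /test directory name here is just an example):

hdfs dfs -mkdir /test
hdfs dfs -ls /    # the listing should now show /test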

At this point, the Hadoop cluster has been built.
Updated on May 9, 2021
