Hadoop cluster setup (continuously updated)
The resource files used in this article are available via the link; extraction code: eeee
1: Preparations to be completed before starting construction
- A Linux server that is already set up
- The server can access the public network (for example, ping www.baidu.com succeeds)
- An Xshell connection (optional)
- Server version information
- Update system components
yum update
- Turn off the firewall
systemctl stop firewalld
- Prevent the firewall from starting automatically at boot (a quick check follows this list)
systemctl disable firewalld
- Change host name (optional)
hostnamectl set-hostname MrBun
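A quick way to confirm that the firewall is off and the host name change took effect (a sketch assuming firewalld and systemd, as used above):
systemctl status firewalld   # Should show the service as inactive and disabled
hostname                     # Should print the new host name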
2: Java jdk installation
Hadoop runs on top of Java, so a Java environment is required. All resource files are in the link at the beginning of the article; if the links expire, please leave a comment and I will update them.
Some Linux distributions ship with a Java environment. If yours does, you can skip this step. You can also install a different version of the Java JDK.
- Create java directory
mkdir -p /usr/java
- Enter the java directory
cd /usr/java
- Upload Java JDK
If you connected with Xshell in the preparation step, you can install lrzsz here with yum -y install lrzsz.
Then use rz -y to upload files and use sz to download files to your local machine.
You can also use a third-party tool, or transfer files with the scp command; there are many ways to do this (see the example below).
After uploading, run ll to list the files in the current directory.
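For example, if you prefer scp, a command of the following shape (run from your local machine; the IP address is the example server address used later in this article) uploads the JDK archive to the server:
scp jdk-8u171-linux-x64.tar.gz root@192.168.25.128:/usr/java/   # Copy the archive into /usr/java on the server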
- Extract to current directory
tar -zxvf jdk-8u171-linux-x64.tar.gz -C /usr/java/
- Check that the extraction succeeded; you can then delete the installation package (or keep it)
Delete command: rm -rf jdk-8u171-linux-x64.tar.gz
- Configure environment variables
vim /etc/profile
Add at the end of the document
export JAVA_HOME=/usr/java/jdk1.8.0_171
export CLASSPATH=$JAVA_HOME/lib/
export PATH=$PATH:$JAVA_HOME/bin
export PATH JAVA_HOME CLASSPATH
- After adding or changing environment variables, re-source the file for them to take effect
source /etc/profile
- Check whether java is installed successfully
java -version
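If java -version does not print the expected version, a quick sanity check of the environment variables set above is:
echo $JAVA_HOME   # Should print /usr/java/jdk1.8.0_171
which java        # Shows which java binary is actually on the PATH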
3: Hadoop installation
Hadoop is a distributed framework: large data-processing tasks are split across the different machines (nodes) of a cluster. This article uses Hadoop 2.x.
Hadoop 2.x has four basic components:
1: Hadoop Common
2: Hadoop Distributed File System (HDFS)
3: MapReduce
4: YARN
Hadoop Common provides the shared configuration, utilities, and interfaces used by the other three components.
HDFS provides data storage.
MapReduce provides computation over that data.
YARN handles resource scheduling.
- Create Hadoop working directory
mkdir -p /usr/hadoop
cd /usr/hadoop
- Upload, decompress and delete the compressed package
tar -zxvf hadoop-2.7.3.tar.gz -C /usr/hadoop/
rm -rf hadoop-2.7.3.tar.gz
- Configure environment variables
vim /etc/profile
Add the following configuration items
export HADOOP_HOME=/usr/hadoop/hadoop-2.7.3
export CLASSPATH=$CLASSPATH:$HADOOP_HOME/lib
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
- Activate environment variable
source /etc/profile
- Check whether the installation is successful
hadoop version
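Similarly, if hadoop version is not found, check that the variables above were picked up:
echo $HADOOP_HOME   # Should print /usr/hadoop/hadoop-2.7.3
which hadoop        # Should point into $HADOOP_HOME/bin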
Five configuration files need to be edited after installation:
1: hadoop-env.sh
At runtime Hadoop reads its Java environment from this file, so JAVA_HOME must be written into it.
2: core-site.xml
Sets the temporary file directory Hadoop uses at runtime and the default HDFS address (how clients communicate with HDFS).
3: hdfs-site.xml
Sets where HDFS stores its namespace metadata (NameNode) and data blocks (DataNode).
4: yarn-site.xml
Sets information related to resource scheduling.
5: mapred-site.xml
Sets how MapReduce jobs are run (which framework they run on).
Enter the configuration file directory
cd /usr/hadoop/hadoop-2.7.3/etc/hadoop
- hadoop-env.sh
vim hadoop-env.sh
export JAVA_HOME=/usr/java/jdk1.8.0_171
- core-site.xml
vim core-site.xml
The localhost here is the host name from the preparation step. If you changed the host name, remember to make the corresponding change; the same applies to the later configuration files.
<property>
    <name>hadoop.proxyuser.root.groups</name>
    <value>*</value>
</property>
<property>
    <name>hadoop.proxyuser.root.hosts</name>
    <value>*</value>
</property>
<property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
</property>
<property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/hadoop/hadoop-2.7.3/hdfs/tmp</value>
    <description>A base for other temporary directories.</description>
</property>
<property>
    <name>io.file.buffer.size</name>
    <value>131072</value>
</property>
<property>
    <name>fs.checkpoint.period</name>
    <value>60</value>
</property>
<property>
    <name>fs.checkpoint.size</name>
    <value>67108864</value>
</property>
- hdfs-site.xml
vim hdfs-site.xml
<property>
    <name>dfs.replication</name>
    <value>1</value>
</property>
<property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/usr/hadoop/hadoop-2.7.3/hdfs/name</value>
    <final>true</final>
</property>
<property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/usr/hadoop/hadoop-2.7.3/hdfs/data</value>
    <final>true</final>
</property>
<property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>localhost:9001</value>
</property>
<property>
    <name>dfs.webhdfs.enabled</name>
    <value>true</value>
</property>
<property>
    <name>dfs.permissions</name>
    <value>false</value>
</property>
- yarn-site.xml
vim yarn-site.xml
<!-- ResourceManager address -->
<property>
    <name>yarn.resourcemanager.address</name>
    <value>localhost:18040</value>
</property>
<property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>localhost:18030</value>
</property>
<property>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>localhost:18088</value>
</property>
<property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>localhost:18025</value>
</property>
<property>
    <name>yarn.resourcemanager.admin.address</name>
    <value>localhost:18141</value>
</property>
<!-- How reducers obtain data -->
<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
</property>
<property>
    <name>yarn.nodemanager.auxservices.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
- mapred-site.xml
Hadoop does not provide this file directly, so create it from the template:
cp mapred-site.xml.template mapred-site.xml
vim mapred-site.xml
<property>
    <!-- Run MapReduce on YARN -->
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
</property>
- Hadoop name node initialization
hadoop namenode -format
Formatting creates the NameNode directory and the temporary file directory on the local disk, using the paths set in the configuration files; in other words, it lays down the on-disk structure for HDFS. If no error is reported and "Exiting with status 0" appears, the format succeeded.
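One way to sanity-check the result, assuming the dfs.namenode.name.dir path configured above, is to look at the newly created name directory:
ls /usr/hadoop/hadoop-2.7.3/hdfs/name/current   # Should contain a VERSION file and an fsimage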
- Open Hadoop
cd /usr/hadoop/hadoop-2.7.3/
sbin/start-all.sh
start-all.sh starts all services
stop-all.sh stops all services
Both scripts are in the sbin directory
When asked whether to continue connecting, type yes and press Enter. The password requested is the current user's password; if you are running as root, enter the root password. The services will then start. Entering the password every time is tedious, so the next step sets up passwordless SSH.
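After the scripts finish, you can check which daemons actually came up with the jps command that ships with the JDK; with the single-node configuration used here it should list the five processes described in the table below, plus Jps itself:
jps   # Lists the running Java processes and their PIDs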
name | effect |
---|---|
NameNode | Manages metadata such as file names, the directory structure, file attributes, and the locations of data blocks |
DataNode | Handles the actual storage and retrieval of data blocks |
ResourceManager | Manages and allocates all of the cluster's computing resources |
NodeManager | The agent on each machine; manages containers, monitors their resource usage, and reports resource usage to the ResourceManager |
SecondaryNameNode | An auxiliary tool for the NameNode with two roles: keeping an image backup and periodically merging the edit log into the image. Note: it is not a standby NameNode |
- Passwordless SSH
cd ~                                 # Return to the home directory
ssh localhost                        # localhost is the host name
ll -a                                # List files, including hidden ones
cd .ssh                              # Enter the .ssh folder
ssh-keygen -t rsa                    # Create the public/private key pair; press Enter at every prompt
cat id_rsa.pub >> authorized_keys    # Append the public key to authorized_keys
chmod 600 authorized_keys            # Restrict the file's permissions
After passwordless SSH is configured, you no longer need to enter a password when starting or stopping the Hadoop cluster.
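As an alternative to appending the key by hand, most systems ship the ssh-copy-id helper, which does the same thing (you will be asked for the password one last time):
ssh-keygen -t rsa        # Skip if you already created the key pair above
ssh-copy-id localhost    # Copies the public key into authorized_keys and sets the permissions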
View the newly built Hadoop cluster through a browser
In the browser's address bar, enter the Linux server's IP address followed by port 50070
For example:
http://192.168.25.128:50070
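Since yarn-site.xml above sets yarn.resourcemanager.webapp.address to port 18088, the YARN ResourceManager web UI should likewise be reachable on that port, for example:
http://192.168.25.128:18088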
At this point, the Hadoop cluster setup is complete
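As a final smoke test (a minimal sketch, assuming the paths used in this article), you can exercise HDFS and run one of the MapReduce examples that ship with Hadoop 2.7.3:
hdfs dfs -mkdir -p /test   # Create a directory in HDFS
hdfs dfs -ls /             # The new directory should show up here
hadoop jar /usr/hadoop/hadoop-2.7.3/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar pi 2 10   # Estimate pi with 2 map tasks, 10 samples each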
Updated on May 9, 2021