Installation and Introduction of Hadoop

1 Big data

1.1 Concept of big data

Big data is an IT-industry term for collections of data that cannot be captured, managed, and processed with conventional software tools within a tolerable time.
It is a massive, fast-growing, and diversified information asset that requires new processing models in order to provide stronger decision-making power, insight and discovery, and process-optimization capabilities.

1.2 Features of big data (the 5 Vs)
  • Volume
  • Velocity
  • Variety
  • Value (high value, low value density)
  • Veracity
1.3 Units of data
bit, Byte, KB, MB, GB, TB, PB, EB (1 Byte = 8 bits; from Byte upward, each unit is 1024 times the previous one)
1.4 Data sources of big data
Business system databases, logs, purchased data interfaces, and web crawlers

2 Hadoop

2.1 Source of Hadoop's ideas: Google
	Google's three papers, the "troika" of big data:
	The file system GFS
	The computation framework MapReduce
	The big table storage system BigTable
2.2 Development of Hadoop
1. The father of Hadoop: Doug Cutting
2. Doug Cutting spent about two years of spare time implementing HDFS and MapReduce
3. GFS and HDFS
	Google								Hadoop
	GFS (Google File System)			HDFS
	MapReduce							MapReduce
2.3 Introduction to Hadoop

Official website: http://hadoop.apache.org/

An open-source, reliable framework for distributed storage and distributed computing.

2.4 Distributions of Hadoop
  • Apache Hadoop
  • CDH (Cloudera's Distribution including Apache Hadoop)
  • HDP (Hortonworks Data Platform)

3 Important modules of Hadoop

  • Hadoop Common: common utilities that support the other Hadoop modules, including serialization, RPC protocols, etc.
  • Hadoop Distributed File System (HDFS): the distributed file system; provides storage
  • Hadoop YARN: a framework for job scheduling and cluster resource management; provides task scheduling and resource management
  • Hadoop MapReduce: a YARN-based system for parallel processing of large data sets; provides computation

4 HDFS architecture

  • Master/slave architecture
NameNode
	Manages metadata (data that describes the data: storage hosts, paths, number of replicas, data size, ...);
	Handles clients' requests to read and write data;
	Is the master node.

DataNode
	Stores and manages the actual data blocks;
	Is the slave node.

Client
	Initiates file read/write requests.

SecondaryNameNode
	Periodically merges and synchronizes the NameNode's metadata (checkpointing).
  • HDFS read and write process
HDFS write process:
The client sends a write request to the NameNode.
If the request succeeds, the NameNode records the file in its metadata and returns a list of DataNodes.
The client splits the file into blocks of 128 MB (the default) and writes each block to the corresponding DataNodes.
(Note that there are some further details here: the client writes a block to the nearest DataNode first,
which then forwards the block to the next DataNode that stores a replica of it;
the data is sent in the form of packets through a pipeline established across the DataNodes.)
When the write is complete, the client sends a completion signal to the NameNode.

HDFS read process:
The client sends a read request to the NameNode.
The NameNode looks up the metadata,
obtains the list of block locations for the file, and returns it.
The client reads the data from the nearest DataNode holding each block.
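Once the pseudo-distributed cluster from section 8 is running, you can see this block and replica layout for yourself with the fsck tool (a minimal sketch, run from ${HADOOP_HOME}; the path is only a hypothetical example):

$ bin/hdfs fsck /input/test/somefile.dat -files -blocks -locations   //Shows each block of the file and the DataNodes holding its replicas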

5 YARN architecture

Main processes

ResourceManager
    Handles client requests
    Starts and monitors the ApplicationMaster
    Monitors the NodeManagers
    Allocates and schedules resources

NodeManager
    Manages the resources on a single node
    Handles commands from the ResourceManager
    Handles commands from the ApplicationMaster

Client
    Submits applications and requests to the cluster

ApplicationMaster
    Splits the input data
    Requests resources for the application and assigns them to its internal tasks
    Monitors tasks and provides fault tolerance

Container
	An abstraction of the task execution environment;
	It encapsulates multi-dimensional resources such as CPU and memory,
	as well as task-related information such as environment variables and launch commands.

Flow of submitting a task to YARN

1. The client sends a request to the ResourceManager to submit the task.
2. The ResourceManager sends a command to a NodeManager node to start the ApplicationMaster.
3. After the ApplicationMaster starts, it requests resources from the ResourceManager.
4. The ApplicationMaster sends commands to the NodeManagers where the resources are located to start the tasks, and it monitors the status of these tasks.
5. When the tasks finish, they report back to the ApplicationMaster.
6. After all tasks are completed, the ApplicationMaster reports back to the ResourceManager, and the resources used by the task are reclaimed.
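You can watch this flow from the command line once the cluster is up (a sketch, run from ${HADOOP_HOME}; the application ID shown is only a placeholder):

$ bin/yarn application -list                                  //Lists running applications and their state
$ bin/yarn logs -applicationId application_XXXXXXXXXX_0001    //Prints container logs of a finished application (requires the log aggregation configured in step 12 below)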

6 MapReduce

A distributed parallel computing framework: a Map phase processes the input data in parallel, and a Reduce phase aggregates the intermediate results.
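The idea can be imitated on a single machine with ordinary shell tools (a conceptual sketch only, not how Hadoop actually executes a job), using a file such as the wordcount.txt created in the wordcount example later in this document: splitting lines into words plays the role of Map, sorting plays the role of Shuffle, and counting each group plays the role of Reduce.

$ cat wordcount.txt | tr ' ' '\n' | sort | uniq -c   //map: one word per line; shuffle: sort groups identical words; reduce: count each group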

7 Hadoop operating modes

  • Standalone (local) mode

    No daemons are started; all programs run in a single JVM. Generally used for development.

  • Pseudo-distributed mode

    There is only one node, and all daemons run on that node. Used for development and testing.

  • Fully distributed mode

    There are multiple nodes. Generally used for testing and production.

8 Setting up a pseudo-distributed cluster

8.1 Linux network configuration

1. Modify the host name (it cannot start with a number or contain special characters)

# vi /etc/hostname 

2. Modify the IP address

# vi /etc/sysconfig/network-scripts/ifcfg-ens33
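A typical static-IP configuration in ifcfg-ens33 looks roughly like this (a sketch only; the IP matches the host mapping below, while the netmask, gateway, and DNS values are assumptions that must match your own network):

BOOTPROTO=static
ONBOOT=yes
IPADDR=192.168.47.81
NETMASK=255.255.255.0
GATEWAY=192.168.47.2
DNS1=192.168.47.2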

3. Configure the host mapping

# vi /etc/hosts
192.168.47.81 linux01

Also add the host mapping on Windows (in C:\Windows\System32\drivers\etc\hosts):
192.168.47.81 linux01
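To verify that the mapping works, you can ping the host name (a quick sanity check; -c 3 limits the ping to three packets):

# ping -c 3 linux01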

8.2 JDK installation

Use an ordinary (non-root) user to manage the cluster.

Create two directories:

# mkdir /opt/software /opt/modules

/opt/software stores the packages (*.tar.gz)

/opt/modules is where the software is installed

Change the ownership:

# chown -R hadoop:hadoop /opt/modules /opt/software

Install the JDK (first remove any preinstalled JDK):

# rpm -qa | grep jdk
# rpm -e jdkxxxxxxxx --nodeps    ## uninstall without checking dependencies

$ tar -zxf /opt/software/jdk-8u112-linux-x64.tar.gz -C /opt/modules/

Configure environment variables

# vim /etc/profile
//Add the following:
#JAVA_HOME
export JAVA_HOME=/opt/modules/jdk1.8.0_112
export PATH=$PATH:$JAVA_HOME/bin

Make configuration effective

# source /etc/profile
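To confirm that the JDK is on the PATH, check the version (the exact version string depends on your JDK build):

$ java -version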

8.3 Linux firewall and SELinux

# systemctl stop firewalld       ## stop the firewall
# systemctl disable firewalld    ## do not start it at boot
# systemctl status firewalld     ## check the firewall status

Disable SELinux:

# vi /etc/sysconfig/selinux
SELINUX=disabled

# reboot
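After the reboot, you can verify that SELinux is off (getenforce should print Disabled, assuming the utility is installed):

# getenforce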

8.4 Configure Hadoop

${HADOOP_HOME} denotes the directory where Hadoop is installed.

Unzip hadoop-2.x.tar.gz
## pay attention to the user and permissions
$ tar -zxvf /opt/software/hadoop-2.6.0-cdh5.14.2.tar.gz -C /opt/modules/
1. Configure Hadoop's Java environment support

Modify

hadoop-env.sh, mapred-env.sh, yarn-env.sh

and configure the following in all three files:

export JAVA_HOME=/opt/modules/jdk1.8.0_112
2. HDFS-related configuration

Configure core-site.xml

<!-- NameNode address; 8020 is the RPC port that serves as the access entry point -->
<property>
    <name>fs.defaultFS</name>
    <value>hdfs://linux01:8020</value>
</property>

<!-- Base directory for the files Hadoop generates at runtime, where the metadata is stored -->
<property>
    <name>hadoop.tmp.dir</name>
    <value>/opt/modules/hadoop-2.6.0-cdh5.14.2/data</value>
</property>

hdfs-site.xml

<!-- Number of replicas for files stored on HDFS; set to 1 for pseudo-distributed mode -->
<property>
	<name>dfs.replication</name>
	<value>1</value>
</property>
3. Format the NameNode

In the ${HADOOP_HOME} directory:

$ bin/hdfs namenode -format     

Format only once; do not repeat it.

4. Start the HDFS daemons
$ sbin/hadoop-daemon.sh start namenode    //Start the NameNode process
$ sbin/hadoop-daemon.sh start datanode    //Start the DataNode process

View Daemons

$  jps
3097 Jps
2931 NameNode
3023 DataNode
5. Web access interface

http://<hostname>:50070/

6. Common HDFS file system commands
$ bin/hdfs dfs   //Lists all available dfs operation commands
$ bin/hdfs dfs -ls /  
$ bin/hdfs dfs -mkdir -p /input/test
$ bin/hdfs dfs -rmdir /input/test
$ bin/hdfs dfs -put /opt/software/jdk-7u67-linux-x64.tar.gz /input/test
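A few other frequently used commands (a sketch; the paths are only examples):
$ bin/hdfs dfs -cat /input/test/wordcount.txt     //Print a file's contents
$ bin/hdfs dfs -get /input/test/wordcount.txt ~/  //Download a file to the local file system
$ bin/hdfs dfs -rm -r /input/test                 //Delete a directory recursively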
7. Configure YARN
a. Configure etc/hadoop/yarn-site.xml
<!-- Specify the hostname of the ResourceManager server -->
<property>
	<name>yarn.resourcemanager.hostname</name>
	<value>[hostname]</value>
</property>

<!-- Specify that MapReduce uses the shuffle service when it runs -->
<property>
	<name>yarn.nodemanager.aux-services</name>
	<value>mapreduce_shuffle</value>
</property>

<property>
	<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
	<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>

Copy and rename the template file:

$ cp etc/hadoop/mapred-site.xml.template etc/hadoop/mapred-site.xml
b. Configure etc/hadoop/mapred-site.xml
<!-- Specify that MapReduce runs on YARN -->
<property>
	<name>mapreduce.framework.name</name>
	<value>yarn</value>
</property>
8. Start the HDFS and YARN daemons

Restart HDFS and start YARN:

$ sbin/hadoop-daemon.sh start namenode
$ sbin/hadoop-daemon.sh start datanode
$ sbin/yarn-daemon.sh start resourcemanager
$ sbin/yarn-daemon.sh start nodemanager
9. Check the Daemons
$ jps

4595 ResourceManager
4419 NameNode
4836 NodeManager
5896 Jps
4506 DataNode

YARN's web UI: http://centos01:8088

10. Submit a MapReduce task to YARN
1) Calculate pi
$ bin/yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0-cdh5.14.2.jar pi 5 100

2) wordcount word frequency statistics

a. Create a wordcount.txt in the user's home directory

$ vi /home/user01/wordcount.txt
hadoop java
html java
linux hadoop
yarn hadoop

b. Upload to the input directory of hdfs

$ bin/hdfs dfs -put ~/wordcount.txt /input/

c. Submit wordcount task

$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0-cdh5.14.2.jar wordcount /input /output
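After the job finishes, the result can be read from the output directory (assuming the default single reducer, so the result is in part-r-00000). For the wordcount.txt above, the counts should be hadoop 3, java 2, and html, linux, and yarn 1 each.

$ bin/hdfs dfs -cat /output/part-r-00000
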
11. Common errors
1: Host name configured incorrectly or not configured;
2: XML not configured correctly: SAXParserException;
3: NameNode reformatted;
4: Configuration parameters modified but not saved;
6: Running wordcount again reports that the output directory already exists:
		org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://hadoop.beifeng.com:8020/output already exists
	Solution: delete the output directory on HDFS (see the command after this list) or specify a different output directory;
7: UnknownHostException: check the host mapping;
8: hdfs http://centos01:50070 cannot be accessed from a browser on Windows: check the firewall and the Windows host mapping.
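The stale output directory from a previous run can be removed like this (the path matches the wordcount example above):

$ bin/hdfs dfs -rm -r /output
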
12. Configure log aggregation

Modify mapred-site.xml to add configuration

<!-- Specify the host and RPC port of the JobHistory server -->
<property>
	<name>mapreduce.jobhistory.address</name>
	<value>[hostname]:10020</value>
</property>

<!-- Specify the host and port of the JobHistory server's web UI -->
<property>
	<name>mapreduce.jobhistory.webapp.address</name>
	<value>[hostname]:19888</value>
</property>

Modify configuration yarn-site.xml

<!--  Enable log aggregation  -->
<property>
	<name>yarn.log-aggregation-enable</name>
	<value>true</value>
</property>

<!-- Log retention time in seconds (86400 seconds = 1 day) -->
<property>
	<name>yarn.log-aggregation.retain-seconds</name>
	<value>86400</value>
</property>

13. Stop all processes and restart them so that the configuration takes effect

1) Stop process

$ sbin/hadoop-daemon.sh stop namenode
$ sbin/hadoop-daemon.sh stop datanode
$ sbin/yarn-daemon.sh stop resourcemanager
$ sbin/yarn-daemon.sh stop nodemanager

2) Start process

$ sbin/hadoop-daemon.sh start namenode
$ sbin/hadoop-daemon.sh start datanode
$ sbin/yarn-daemon.sh start resourcemanager
$ sbin/yarn-daemon.sh start nodemanager
//Start the JobHistory server
$ sbin/mr-jobhistory-daemon.sh start historyserver

3) View Daemons

28904 ResourceManager
28724 NameNode
28808 DataNode
29152 NodeManager
29304 JobHistoryServer
30561 Jps

14. View the Hadoop cluster web interfaces
View the HDFS web interface:
http://hadoop.beifeng.com:50070
View YARN's web interface:
http://hadoop.beifeng.com:8088
View the history log (JobHistory) web interface:
http://hadoop.beifeng.com:19888