1 big data
1.1 big data concept
Big data is an IT-industry term for collections of data that cannot be captured, managed, and processed with conventional software tools within an acceptable time. It is a massive, fast-growing, and diverse information asset that requires new processing models in order to provide stronger decision-making, insight and discovery, and process-optimization capabilities.
1.2 features of big data
The five Vs: Volume, Velocity, Variety, Value (high overall value, low value density), and Veracity.
1.3 units of big data:
bit, Byte, KB, MB, GB, TB, PB, EB (1 Byte = 8 bit; each later unit is 1024 times the previous one)
1.4 data sources of big data
Business system databases, logs, purchased data interfaces, and web crawlers.
2 Hadoop
2.1 source of Hadoop's ideas: Google
Google's three papers, the big data troika: the GFS file system (Google File System), the MapReduce computation framework, and BigTable.
2.2 development of Hadoop
1. Father of Hadoop: Doug Cutting
2. Doug Cutting spent two years of spare time implementing HDFS and MapReduce
3. Google vs. Hadoop counterparts: GFS (Google File System) → HDFS; MapReduce → MapReduce
2.3 introduction to Hadoop
Official website: http://hadoop.apache.org/
An open-source, reliable framework for distributed storage and computing.
2.4 distributions of Hadoop
- Apache Hadoop
- CDH Hadoop (Cloudera)
- HDP Hadoop (Hortonworks)
3 important modules of Hadoop
- Hadoop Common: utilities that support the other Hadoop modules, including the serialization interfaces, RPC protocols, etc.
- Hadoop Distributed File System (HDFS): a distributed file system; responsible for storage.
- Hadoop YARN: a framework for job/task scheduling and cluster resource management.
- Hadoop MapReduce: a YARN-based system for parallel processing of large data sets; responsible for computation.
4 HDFS architecture
- Master/slave architecture
NameNode: the master node; manages metadata (data describing the data: storage hosts, paths, number of replicas, data sizes, ...) and handles clients' read and write requests.
DataNode: the slave node; manages the actual data blocks.
Client: initiates file read/write requests.
SecondaryNameNode: synchronizes (checkpoints) the metadata.
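On a running cluster these roles can be checked from the command line; a quick sketch, assuming HDFS is already started:
$ bin/hdfs dfsadmin -report    # NameNode's view: capacity, live DataNodes and their storage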
- HDFS read and write flows
HDFS write: the client sends a write request to the NameNode. If the request succeeds, the NameNode writes a record and returns a list of DataNodes. The client splits the file into blocks of 128 MB (the default) and writes each block to the corresponding DataNodes. (Note the details here: a block is first written to the nearest DataNode, which then forwards it to the next DataNode storing a replica of that block; the data travels as packets through a write pipeline.) When the write completes, the client sends a completion signal to the NameNode.
HDFS read: the client sends a read request to the NameNode; the NameNode queries its metadata and returns the list of block locations for the file; the client then selects the nearest DataNode and reads the data from it.
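Both flows run underneath ordinary shell operations; a minimal sketch (the file paths are examples only):
$ bin/hdfs dfs -put bigfile.tar.gz /input/     # write flow: split into 128MB blocks, pipelined to DataNodes
$ bin/hdfs dfs -get /input/bigfile.tar.gz .    # read flow: block locations from NameNode, data from the nearest DataNode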
5 Yarn architecture
Main components
ResourceManager: handles client requests; starts and monitors the ApplicationMaster; monitors NodeManagers; allocates and schedules resources.
NodeManager: manages the resources of a single node; handles commands from the ResourceManager and from the ApplicationMaster.
Client: submits applications.
ApplicationMaster: splits the input data; requests resources for the application and assigns them to internal tasks; monitors tasks and handles fault tolerance.
Container: an abstraction of the task-running environment; it encapsulates multi-dimensional resources such as CPU and memory, plus task-related information such as environment variables and startup commands.
Flow of submitting a task to YARN
1. The client sends a request to the ResourceManager to submit a task.
2. The ResourceManager sends a command to a NodeManager node to start the ApplicationMaster.
3. After the ApplicationMaster starts, it requests resources from the ResourceManager.
4. The ApplicationMaster sends commands to the NodeManagers where the resources are located to start the tasks, and monitors the status of these tasks.
5. When the tasks finish, they report back to the ApplicationMaster.
6. After all tasks complete, the ApplicationMaster reports back to the ResourceManager, and the resources of the task are reclaimed.
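While an application runs, this flow can be observed with the yarn command-line tool, for example:
$ bin/yarn application -list    # running applications, their state and progress
$ bin/yarn node -list           # NodeManagers registered with the ResourceManager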
6 MapReduce
A distributed parallel computing framework: a job is split into a map phase and a reduce phase that run across the cluster.
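Besides the Java examples used later (pi, wordcount), Hadoop Streaming lets plain shell commands act as mapper and reducer; a sketch, assuming the streaming jar sits in its usual place in this CDH build:
$ bin/hadoop jar share/hadoop/tools/lib/hadoop-streaming-2.6.0-cdh5.14.2.jar \
    -input /input -output /output-streaming \
    -mapper cat -reducer 'wc -l'    # map passes lines through, reduce counts them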
7 Hadoop operating environment
- Standalone mode
No daemons are started; all programs run in a single JVM. Generally used for development.
- Pseudo-distributed mode
A single node; all daemons run on that one node. Used for development and testing.
- Fully distributed mode
Multiple nodes; generally used for testing and production.
8 setting up the pseudo-distributed mode
8.1 Linux network configuration
1. Modify the host name (it must not start with a digit or special characters)
# vi /etc/hostname
2. Modify the IP address (a sample static configuration is sketched after this list)
# vi /etc/sysconfig/network-scripts/ifcfg-ens33
3. Host mapping
# vi /etc/hosts
192.168.47.81 linux01
Host mapping in Windows (the hosts file):
192.168.47.81 linux01
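A sample static configuration for ifcfg-ens33 (a sketch; the IP matches the mapping above, while gateway and DNS are example values to adjust):
TYPE=Ethernet
BOOTPROTO=static            # static IP instead of DHCP
ONBOOT=yes                  # bring the interface up at boot
IPADDR=192.168.47.81        # must match /etc/hosts
NETMASK=255.255.255.0
GATEWAY=192.168.47.2        # example value
DNS1=192.168.47.2           # example value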
8.2 jdk installation
Use an ordinary (non-root) user to manage the cluster.
Create two directories:
# mkdir /opt/software /opt/modules
/opt/software stores the packages (*.tar.gz)
/opt/modules is where the software is installed
Change ownership:
# chown -R hadoop:hadoop /opt/modules /opt/software
Install the JDK:
# rpm -qa | grep jdk              # check for a preinstalled JDK
# rpm -e jdkxxxxxxxx --nodeps     # uninstall it without verifying dependencies
$ tar -zxf /opt/software/jdk-8u112-linux-x64.tar.gz -C /opt/modules/
Configure environment variables:
# vim /etc/profile
Add the following:
#JAVA_HOME
export JAVA_HOME=/opt/modules/jdk1.8.0_112
export PATH=$PATH:$JAVA_HOME/bin
Make the configuration take effect:
# source /etc/profile
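Verify:
$ java -version    # should report version 1.8.0_112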
8.3 Linux firewall and security subsystem
# systemctl stop firewalld       ## turn off the firewall
# systemctl disable firewalld    ## do not start it at boot
# systemctl status firewalld     ## check the firewall status
Disable SELinux, then reboot:
# vi /etc/sysconfig/selinux
SELINUX=disabled
# reboot
8.4 configure Hadoop
${HADOOP_HOME} indicates the directory where hadoop is installed
Unzip hadoop-2.x.tar.gz (pay attention to users and permissions):
$ tar -zxvf /opt/software/hadoop-2.6.0-cdh5.14.2.tar.gz -C /opt/modules/
1. Configure Hadoop's Java environment support
Modify hadoop-env.sh, mapred-env.sh, and yarn-env.sh; in all three files configure:
export JAVA_HOME=/opt/modules/jdk1.8.0_112
2. HDFS-related configuration
Configure core-site.xml:
<!-- NameNode address; 8020 is the RPC port, the access entry point -->
<property>
    <name>fs.defaultFS</name>
    <value>hdfs://linux01:8020</value>
</property>
<!-- Directory for files Hadoop generates at runtime; the metadata is stored here -->
<property>
    <name>hadoop.tmp.dir</name>
    <value>/opt/modules/hadoop-2.6.0-cdh5.14.2/data</value>
</property>
Configure hdfs-site.xml:
<!-- Number of replicas for files stored on HDFS; set to 1 for pseudo-distributed mode -->
<property>
    <name>dfs.replication</name>
    <value>1</value>
</property>
3 format the NameNode
In the ${HADOOP_HOME} directory:
$ bin/hdfs namenode -format
Format only once; do not repeat it.
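A successful format creates the metadata directory under the hadoop.tmp.dir configured above; a quick check, assuming the default name-directory layout:
$ ls /opt/modules/hadoop-2.6.0-cdh5.14.2/data/dfs/name/current    # fsimage, edits and VERSION files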
4 start the HDFS daemons
$ sbin/hadoop-daemon.sh start namenode    # start the NameNode process
$ sbin/hadoop-daemon.sh start datanode    # start the DataNode process
View the daemons:
$ jps
3097 Jps
2931 NameNode
3023 DataNode
5 Web access interface: http://linux01:50070 (NameNode web UI)
6 HDFS file system common commands
$ bin/hdfs dfs           # lists all dfs-related commands
$ bin/hdfs dfs -ls /
$ bin/hdfs dfs -mkdir -p /input/test
$ bin/hdfs dfs -rmdir /input/test
$ bin/hdfs dfs -put /opt/software/jdk-7u67-linux-x64.tar.gz /input/test
7 configure YARN
a. Configure etc/hadoop/yarn-site.xml
<!-- Host name of the ResourceManager server -->
<property>
    <name>yarn.resourcemanager.hostname</name>
    <value>[hostname]</value>
</property>
<!-- Use shuffle as the auxiliary service when running MapReduce -->
<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
</property>
<property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
Copy and rename template file
$ cp etc/hadoop/mapred-site.xml.template etc/hadoop/mapred-site.xml
b. Configure etc/hadoop/mapred-site.xml
<!-- Run MapReduce on YARN -->
<property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
</property>
8. Start the HDFS and YARN daemons
Restart HDFS and start YARN:
$ sbin/hadoop-daemon.sh start namenode
$ sbin/hadoop-daemon.sh start datanode
$ sbin/yarn-daemon.sh start resourcemanager
$ sbin/yarn-daemon.sh start nodemanager
9. Check the Daemons
$ jps
4595 ResourceManager
4419 NameNode
4836 NodeManager
5896 Jps
4506 DataNode
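Alternatively, the bundled cluster scripts start the daemons together (they require passwordless SSH to localhost):
$ sbin/start-dfs.sh     # namenode + datanode (+ secondarynamenode)
$ sbin/start-yarn.sh    # resourcemanager + nodemanager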
YARN web access interface: http://centos01:8088
10. Submit MapReduce tasks to YARN
1) Calculate pi (the two arguments are the number of map tasks and the number of samples per map)
$ bin/yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0-cdh5.14.2.jar pi 5 100
2) wordcount word frequency statistics
a. Create a wordcount.txt in the user's home directory
$ vi /home/user01/wordcount.txt
with the content:
hadoop java html java linux hadoop yarn hadoop
b. Upload to the input directory of hdfs
$ bin/hdfs dfs -put ~/wordcount.txt /input/
c. Submit the wordcount task
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0-cdh5.14.2.jar wordcount /input /output
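d. View the result; reducer output follows the standard part-r-NNNNN naming. For the sample file above the expected counts are:
$ bin/hdfs dfs -cat /output/part-r-00000
hadoop  3
html    1
java    2
linux   1
yarn    1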
11 common errors
1. Host name configured incorrectly or not configured.
2. XML not configured correctly: SAXParserException.
3. NameNode reformatted (the DataNode's cluster ID no longer matches).
4. Configuration parameters modified but not saved.
5. Running wordcount again reports that the output directory already exists:
org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://hadoop.beifeng.com:8020/output already exists
Solution: delete the output directory on HDFS, or specify a different output directory.
6. UnknownHostException: check the host mapping.
7. http://centos01:50070 cannot be reached from a browser on Windows: check the firewall and the Windows host mapping.
12. Configure log aggregation
Modify mapred-site.xml to add configuration
<!-- Host and RPC port of the JobHistory service -->
<property>
    <name>mapreduce.jobhistory.address</name>
    <value>[hostname]:10020</value>
</property>
<!-- Host and port for web access to the JobHistory service -->
<property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>[hostname]:19888</value>
</property>
Modify configuration yarn-site.xml
<!-- Enable log aggregation -->
<property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
</property>
<!-- Log retention time, in seconds -->
<property>
    <name>yarn.log-aggregation.retain-seconds</name>
    <value>86400</value>
</property>
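Once aggregation is enabled and a job has finished, its logs can be pulled through the CLI (the application id below is a placeholder):
$ bin/yarn logs -applicationId application_1234567890123_0001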
13. Stop all processes and restart them for the configuration to take effect
1) Stop process
$ sbin/hadoop-daemon.sh stop namenode
$ sbin/hadoop-daemon.sh stop datanode
$ sbin/yarn-daemon.sh stop resourcemanager
$ sbin/yarn-daemon.sh stop nodemanager
2) Start process
$ sbin/hadoop-daemon.sh start namenode
$ sbin/hadoop-daemon.sh start datanode
$ sbin/yarn-daemon.sh start resourcemanager
$ sbin/yarn-daemon.sh start nodemanager
# start the history service
$ sbin/mr-jobhistory-daemon.sh start historyserver
3) View the daemons
$ jps
28904 ResourceManager
28724 NameNode
28808 DataNode
29152 NodeManager
29304 JobHistoryServer
30561 Jps
- View the Hadoop cluster in a web browser
HDFS web interface: http://hadoop.beifeng.com:50070
YARN web interface: http://hadoop.beifeng.com:8088
History-log web interface: http://hadoop.beifeng.com:19888