Some experience using Hadoop and HDFS

Foreword:

I have been working on big data at my company for some time. Here I take the time to organize the problems I encountered and some of the better optimization methods.

1. HDFS multi-directory storage

1.1 Production server disks

1.2 Configure multiple directories in hdfs-site.xml, and pay attention to the access permissions of newly mounted disks.

The path where a DataNode stores data is determined by the dfs.datanode.data.dir parameter, whose default value is file://${hadoop.tmp.dir}/dfs/data. If the server has multiple disks, this parameter must be modified to cover each mount point. Adjusted for the disks mounted on the server, the result looks like this:

<property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///dfs/data1,file:///hd2/dfs/data2,file:///hd3/dfs/data3,file:///hd4/dfs/data4</value>
</property>
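Newly mounted disks are often owned by root, so the directories above must be usable by the account that runs the DataNode. A minimal sketch of preparing one such directory; the /tmp path is a hypothetical stand-in for a real mount point, and the hdfs:hadoop owner in the comment is an assumption to adapt to your environment:

```shell
# Hypothetical stand-in path for a real mount point such as /hd2/dfs/data2.
DATA_DIR=/tmp/demo/hd2/dfs/data2

mkdir -p "$DATA_DIR"
# The DataNode expects its data directories to match dfs.datanode.data.dir.perm
# (default 700) and will complain about directories with looser permissions.
chmod 700 "$DATA_DIR"
# On a real cluster, also chown the directory to the DataNode's service user,
# e.g.: chown -R hdfs:hadoop /hd2/dfs/data2
ls -ld "$DATA_DIR"
```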
2. Cluster data balancing

1. Data balancing between nodes

Start the inter-node data balancer:

start-balancer.sh -threshold 10

The parameter 10 means that the difference between each node's disk utilization and the cluster's average utilization must not exceed 10 percentage points; adjust it according to the actual situation.
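To illustrate the threshold semantics, the sketch below uses made-up utilization figures to flag the nodes that a threshold of 10 would consider out of balance, i.e. nodes whose utilization differs from the cluster average by more than 10 points:

```shell
# Hypothetical disk utilizations (%) for four DataNodes.
echo "75 40 68 52" | awk -v t=10 '{
    for (i = 1; i <= NF; i++) sum += $i
    avg = sum / NF                         # cluster-average utilization
    for (i = 1; i <= NF; i++)
        if ($i > avg + t || $i < avg - t)
            printf "node%d (%d%%) is more than %d points from avg %.2f%%\n", i, $i, t, avg
}'
```

With these numbers the average is 58.75%, so node1 (75%) and node2 (40%) fall outside the ±10-point band and would have data moved off or onto them.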

Stop the data balancer:

stop-balancer.sh

2. Data balancing between the disks of a single node

(1) Generate a balance plan

hdfs diskbalancer -plan test-dj01

(2) Execute the balance plan

hdfs diskbalancer -execute test-dj01.plan.json

(3) View the status of the current balancing task

hdfs diskbalancer -query test-dj01

(4) Cancel the balancing task

hdfs diskbalancer -cancel test-dj01.plan.json
3. Creating an LZO index

Create an index for the LZO file. The splittability of an LZO-compressed file depends on its index, so we must create the index manually; without one, an LZO file produces only a single split.

The command to create the index is:

hadoop jar /opt/module/hadoop-3.1.3/share/hadoop/common/hadoop-lzo-0.4.20.jar com.hadoop.compression.lzo.DistributedLzoIndexer /input/bigtable.lzo
4. Benchmarking

These tests were run on a local virtual machine, not on the company cluster.

1) Test HDFS write performance

Test content: write ten 128 MB files to the HDFS cluster

hadoop jar /opt/module/hadoop-3.1.3/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-3.1.3-tests.jar TestDFSIO -write -nrFiles 10 -fileSize 128MB



2020-04-16 13:41:24,724 INFO fs.TestDFSIO: ----- TestDFSIO ----- : write
2020-04-16 13:41:24,724 INFO fs.TestDFSIO:             Date & time: Thu Apr 16 13:41:24 CST 2020
2020-04-16 13:41:24,724 INFO fs.TestDFSIO:         Number of files: 10
2020-04-16 13:41:24,725 INFO fs.TestDFSIO:  Total MBytes processed: 1280
2020-04-16 13:41:24,725 INFO fs.TestDFSIO:       Throughput mb/sec: 8.88
2020-04-16 13:41:24,725 INFO fs.TestDFSIO:  Average IO rate mb/sec: 8.96
2020-04-16 13:41:24,725 INFO fs.TestDFSIO:   IO rate std deviation: 0.87
2020-04-16 13:41:24,725 INFO fs.TestDFSIO:      Test exec time sec: 67.61

2) Test HDFS read performance

Test content: read ten 128 MB files from the HDFS cluster

hadoop jar /opt/module/hadoop-3.1.3/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-3.1.3-tests.jar TestDFSIO -read -nrFiles 10 -fileSize 128MB



2020-04-16 13:43:38,857 INFO fs.TestDFSIO: ----- TestDFSIO ----- : read
2020-04-16 13:43:38,858 INFO fs.TestDFSIO:             Date & time: Thu Apr 16 13:43:38 CST 2020
2020-04-16 13:43:38,859 INFO fs.TestDFSIO:         Number of files: 10
2020-04-16 13:43:38,859 INFO fs.TestDFSIO:  Total MBytes processed: 1280
2020-04-16 13:43:38,859 INFO fs.TestDFSIO:       Throughput mb/sec: 85.54
2020-04-16 13:43:38,860 INFO fs.TestDFSIO:  Average IO rate mb/sec: 100.21
2020-04-16 13:43:38,860 INFO fs.TestDFSIO:   IO rate std deviation: 44.37
2020-04-16 13:43:38,860 INFO fs.TestDFSIO:      Test exec time sec: 53.61

3) Delete the data generated by the tests

hadoop jar /opt/module/hadoop-3.1.3/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-3.1.3-tests.jar TestDFSIO -clean

4) Use the Sort program to evaluate MapReduce

(1) Use RandomWriter to generate random data: each node runs 10 map tasks, and each map generates about 1 GB of random binary data.

hadoop jar /opt/module/hadoop-3.1.3/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar randomwriter random-data

(2) Execute the Sort program

hadoop jar /opt/module/hadoop-3.1.3/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar sort random-data sorted-data

(3) Verify that the data is actually sorted

hadoop jar /opt/module/hadoop-3.1.3/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-3.1.3-tests.jar testmapredsort -sortInput random-data -sortOutput sorted-data
5. Hadoop parameter tuning

1) HDFS parameter tuning (hdfs-site.xml)

dfs.namenode.handler.count = 20 × ln(cluster size). For example, when the cluster has 8 nodes, this parameter should be set to about 41.
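The estimate above can be reproduced directly; a small sketch of the 20 × ln(cluster size) rule of thumb:

```shell
# Rule of thumb: dfs.namenode.handler.count ≈ 20 * ln(cluster size).
cluster_size=8
handler_count=$(awk -v n="$cluster_size" 'BEGIN { printf "%d", 20 * log(n) }')
echo "$handler_count"   # 20 * ln(8) ≈ 41
```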

The number of Namenode RPC server threads that listen to requests from clients. If dfs.namenode.servicerpc-address is not configured then Namenode RPC server threads listen to requests from all nodes.
The NameNode has a pool of worker threads that handles different tasks: concurrent DataNode heartbeats and concurrent client metadata operations. For large clusters, or clusters with a large number of clients, you usually need to increase dfs.namenode.handler.count from its default value of 10.
<property>
    <name>dfs.namenode.handler.count</name>
    <value>10</value>
</property>

2) YARN parameter tuning (yarn-site.xml)

(1) Scenario: 7 machines in total, processing hundreds of millions of records every day. Data flow: data source -> Flume -> Kafka -> HDFS -> Hive.

Problem: HiveSQL is mainly used for data statistics. There is no data skew, small files have been merged, JVM reuse is enabled, IO is not blocked, and less than 50% of memory is used. However, jobs still run very slowly, and when the data volume peaks, the whole cluster goes down. Is there an optimization scheme for this situation?

(2) Solution:

Memory utilization is insufficient. This is generally caused by two YARN settings: the maximum amount of memory a single task may request, and the amount of memory available on a single Hadoop node. Adjusting these two parameters improves the system's memory utilization.

(a) yarn.nodemanager.resource.memory-mb

	The total amount of physical memory that YARN can use on this node. The default is 8192 (MB). Note that if your node has less than 8 GB of memory, you need to lower this value; YARN does not automatically detect the node's total physical memory.

(b) yarn.scheduler.maximum-allocation-mb

	The maximum amount of physical memory that a single task can request. The default is 8192 (MB).
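Putting the two parameters together, a yarn-site.xml fragment might look like the sketch below. The 6144 MB values are placeholders for a node with 8 GB of physical memory, leaving headroom for the OS and the DataNode; they are an assumption for illustration, not a recommendation from the original text:

```xml
<!-- Placeholder values; size these for your own nodes. -->
<property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>6144</value>
</property>
<property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>6144</value>
</property>
```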

Keywords: Big Data, Hadoop, HDFS, MapReduce

Added by Soldier Jane on Fri, 28 Jan 2022 02:06:47 +0200