009 Optimization & new features & HA

1. Hadoop data compression

| Compression format | Original file size | Compressed file size | Compression speed | Decompression speed | Built into Hadoop? | Splittable? | Program changes needed? |
| --- | --- | --- | --- | --- | --- | --- | --- |
| gzip  | 8.3GB | 1.8GB | 17.5MB/s | 58MB/s   | yes | no  | no  |
| bzip2 | 8.3GB | 1.1GB | 2.4MB/s  | 9.5MB/s  | yes | yes | no  |
| LZO   | 8.3GB | 2.9GB | 49.3MB/s | 74.6MB/s | no  | yes | yes |
  • Input compression (Hadoop uses the file extension to decide which codec to apply; configured in core-site.xml):
    org.apache.hadoop.io.compress.DefaultCodec
    org.apache.hadoop.io.compress.GzipCodec
    org.apache.hadoop.io.compress.BZip2Codec
    com.hadoop.compression.lzo.LzopCodec
    org.apache.hadoop.io.compress.SnappyCodec
  • Mapper output (enterprises often use the LZO or Snappy codec at this stage; configured in mapred-site.xml):
    com.hadoop.compression.lzo.LzopCodec
    org.apache.hadoop.io.compress.SnappyCodec
  • Reducer output (use standard tools or codecs such as gzip and bzip2; configured in mapred-site.xml):
    org.apache.hadoop.io.compress.GzipCodec
    org.apache.hadoop.io.compress.BZip2Codec

PS: The LZO format is GPL-licensed and therefore cannot be distributed with Apache Hadoop; its Hadoop codec must be downloaded, compiled, and installed separately on Linux. The LzopCodec is compatible with the lzop tool: it is LZO data with an extra header, which is usually what you want. There is also LzoCodec, a pure LZO codec that uses lzo_deflate as its extension (by analogy with DEFLATE, which is the gzip format without the headers).
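
For reference, the codec class names listed above are plugged into standard Hadoop property keys. Below is a minimal sketch of setting the same options programmatically on a Configuration instead of editing core-site.xml / mapred-site.xml; the codec choices are only examples.

import org.apache.hadoop.conf.Configuration;

public class CompressionConfExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Input side (normally core-site.xml): codecs Hadoop may use to decode input files
        conf.set("io.compression.codecs",
                "org.apache.hadoop.io.compress.GzipCodec,"
              + "org.apache.hadoop.io.compress.BZip2Codec,"
              + "org.apache.hadoop.io.compress.SnappyCodec");

        // Mapper output (normally mapred-site.xml)
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.set("mapreduce.map.output.compress.codec",
                "org.apache.hadoop.io.compress.SnappyCodec");

        // Reducer output (normally mapred-site.xml)
        conf.setBoolean("mapreduce.output.fileoutputformat.compress", true);
        conf.set("mapreduce.output.fileoutputformat.compress.codec",
                "org.apache.hadoop.io.compress.GzipCodec");
    }
}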

1.1. Compression and decompression of data streams

// Compression: get the codec by name (e.g. "bzip2")
CompressionCodecFactory factory = new CompressionCodecFactory(new Configuration());
CompressionCodec codec = factory.getCodecByName(method);
// Open a normal output stream; append the codec's default extension to the file name
FileOutputStream fos = new FileOutputStream(new File(filename + codec.getDefaultExtension()));
// Wrap fos with the codec to get a compressing output stream
CompressionOutputStream cos = codec.createOutputStream(fos);

// Decompression: get the codec from the file extension
CompressionCodecFactory factory = new CompressionCodecFactory(new Configuration());
CompressionCodec codec = factory.getCodec(new Path(filename));
// Open a normal input stream
FileInputStream fis = new FileInputStream(new File(filename));
// Wrap fis with the codec to get a decompressing input stream
CompressionInputStream cis = codec.createInputStream(fis);
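
The fragments above only obtain the wrapped streams. A minimal end-to-end sketch, assuming local files, a codec name such as "bzip2", and copying with IOUtils.copyBytes, could look like this:

import java.io.*;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.*;

public class StreamCompressDemo {

    // Compress fileName into fileName + codec extension (e.g. hello.txt -> hello.txt.bz2)
    public static void compress(String fileName, String method) throws IOException {
        CompressionCodecFactory factory = new CompressionCodecFactory(new Configuration());
        CompressionCodec codec = factory.getCodecByName(method);
        try (FileInputStream fis = new FileInputStream(fileName);
             FileOutputStream fos = new FileOutputStream(fileName + codec.getDefaultExtension());
             CompressionOutputStream cos = codec.createOutputStream(fos)) {
            IOUtils.copyBytes(fis, cos, 1024 * 1024);
        }
    }

    // Decompress fileName, choosing the codec from the file extension
    public static void decompress(String fileName) throws IOException {
        CompressionCodecFactory factory = new CompressionCodecFactory(new Configuration());
        CompressionCodec codec = factory.getCodec(new Path(fileName));
        if (codec == null) {
            System.err.println("No codec found for " + fileName);
            return;
        }
        String outFile = CompressionCodecFactory.removeSuffix(fileName, codec.getDefaultExtension());
        try (CompressionInputStream cis = codec.createInputStream(new FileInputStream(fileName));
             FileOutputStream fos = new FileOutputStream(outFile)) {
            IOUtils.copyBytes(cis, fos, 1024 * 1024);
        }
    }

    public static void main(String[] args) throws IOException {
        compress("e:/hello.txt", "bzip2");      // path and codec name are just examples
        decompress("e:/hello.txt.bz2");
    }
}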

1.2. Compressing Map and Reduce output

The Mapper and Reducer remain unchanged; only the driver changes:

// Enable map output compression
conf.setBoolean("mapreduce.map.output.compress", true);
// Set the map-side output compression codec
conf.setClass("mapreduce.map.output.compress.codec", BZip2Codec.class, CompressionCodec.class);
// Enable output compression on the reduce side
FileOutputFormat.setCompressOutput(job, true);
// Set the reduce-side output compression codec
FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);
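
For context, here is a hedged sketch of a complete driver with these settings; the input/output paths come from args and the Mapper/Reducer registrations are left as placeholders. Note that the map-side settings go on the Configuration before Job.getInstance(conf), because the Job takes its own copy of the configuration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressionDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Map-output compression is set on conf BEFORE Job.getInstance(conf)
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
                BZip2Codec.class, CompressionCodec.class);

        Job job = Job.getInstance(conf);
        job.setJarByClass(CompressionDriver.class);
        // job.setMapperClass(...);  job.setReducerClass(...);  // your own classes go here

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Reduce-output compression is set on the Job via FileOutputFormat
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}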

2. Hadoop enterprise optimization

2.1. Bottlenecks of MapReduce program efficiency

1) Machine performance: CPU, memory, disk, network
2) I/O operation bottlenecks: data skew, an unreasonable number of MapTasks or ReduceTasks, too many small files, large compressed files that cannot be split, too many spills, too many merges, and overly long Reduce time

Solutions:
1) Input stage: use CombineTextInputFormat to merge large numbers of small files on the input side
2) Map stage: reduce the number of spills, reduce the number of merges, and add a Combiner (a per-job sketch for items 1 and 2 follows the config excerpt below)
mapred-default.xml

<!-- Increase the memory limit that triggers a spill -->
<property>
  <name>mapreduce.task.io.sort.mb</name>
  <value>100</value>
  <description>The total amount of buffer memory to use while sorting
  files, in megabytes.  By default, gives each merge stream 1MB, which
  should minimize seeks.</description>
</property>
<property>
  <name>mapreduce.map.sort.spill.percent</name>
  <value>0.80</value>
  <description>The soft limit in the serialization buffer. Once reached, a
  thread will begin to spill the contents to disk in the background. Note that
  collection will not block if this threshold is exceeded while a spill is
  already in progress, so spills may be larger than this threshold when it is
  set to less than .5</description>
</property>

<!-- Increase the number of files merged at once -->
<property>
  <name>mapreduce.task.io.sort.factor</name>
  <value>10</value>
  <description>The number of streams to merge at once while sorting
  files.  This determines the number of open file handles.</description>
</property>
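
Instead of editing mapred-site.xml cluster-wide, the same knobs can be overridden per job. A hedged sketch follows (the values are illustrative, not recommendations); it also covers the CombineTextInputFormat setting from the input stage above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;

public class MapStageTuning {
    public static Job buildJob() throws Exception {
        Configuration conf = new Configuration();

        // Bigger sort buffer -> fewer spills to disk (default 100)
        conf.setInt("mapreduce.task.io.sort.mb", 200);
        // Spill later (default 0.80)
        conf.setFloat("mapreduce.map.sort.spill.percent", 0.90f);
        // Merge more spill files per pass (default 10)
        conf.setInt("mapreduce.task.io.sort.factor", 20);

        Job job = Job.getInstance(conf);

        // Input stage: pack many small files into fewer splits
        job.setInputFormatClass(CombineTextInputFormat.class);
        CombineTextInputFormat.setMaxInputSplitSize(job, 4 * 1024 * 1024); // 4 MB, illustrative

        // Map stage: a Combiner shrinks the spilled and shuffled data
        // job.setCombinerClass(YourReducer.class);  // placeholder

        return job;
    }
}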

3) Reduce stage: set the number of MapTasks and ReduceTasks reasonably (too few and tasks wait; too many and they compete for resources); let Map and Reduce overlap (start the reducers after a fraction of the maps have finished); and avoid relying on Reduce where possible, because fetching map output over the network to the reducers is expensive (a per-job sketch follows the config excerpt below)
mapred-default.xml

<property>
  <name>mapreduce.job.reduce.slowstart.completedmaps</name>
  <value>0.05</value>
  <description>Fraction of the number of maps in the job which should be
  complete before reduces are scheduled for the job.
  </description>
</property>

<property>
  <name>mapreduce.reduce.input.buffer.percent</name>
  <value>0.0</value>
  <description>The percentage of memory- relative to the maximum heap size- to
  retain map outputs during the reduce. When the shuffle is concluded, any
  remaining map outputs in memory must consume less than this threshold before
  the reduce can begin.
  </description>
</property>
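
As above, these reduce-stage knobs can also be set per job. A sketch with illustrative values only:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ReduceStageTuning {
    public static Job buildJob() throws Exception {
        Configuration conf = new Configuration();

        // Start reducers once 70% of the maps have finished (default 0.05)
        conf.setFloat("mapreduce.job.reduce.slowstart.completedmaps", 0.70f);
        // Allow part of the shuffled map output to stay in memory on the reduce side (default 0.0)
        conf.setFloat("mapreduce.reduce.input.buffer.percent", 0.1f);

        Job job = Job.getInstance(conf);
        // Match the number of reducers to the data volume and cluster capacity
        job.setNumReduceTasks(5);
        return job;
    }
}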

4) I/O stage: use the Snappy or LZO codec for compression and use SequenceFile binary files (see also Hive's binary storage formats, SequenceFile and RCFile)
5) Data skew: sample the data and use range partitioning (preset partitions based on the sample), write a custom Partitioner, use a Combiner to shrink the data, and prefer Map Join over Reduce Join (a Partitioner sketch follows)
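
As an example of the custom-partition approach, here is a hedged sketch of a Partitioner that scatters one assumed hot key over several reducers; the hot-key value and the split of reducers are illustrative, and the scattered partial results then need a second aggregation pass. Register it with job.setPartitionerClass(SkewAwarePartitioner.class).

import java.util.Random;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class SkewAwarePartitioner extends Partitioner<Text, IntWritable> {
    private static final String HOT_KEY = "hot_key";   // assumed hot key
    private final Random random = new Random();

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (HOT_KEY.equals(key.toString())) {
            // Scatter the hot key over the first half of the reducers
            return random.nextInt(Math.max(1, numPartitions / 2));
        }
        // Everything else is hash-partitioned over the remaining reducers
        int rest = Math.max(1, numPartitions - numPartitions / 2);
        return numPartitions / 2 + (key.hashCode() & Integer.MAX_VALUE) % rest;
    }
}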

2.2. Common tuning parameters of Hadoop

2.3. Hadoop small file optimization methods

Supplement: a SequenceFile consists of a series of binary key/value pairs; if the key is the file name and the value is the file content, a large number of small files can be merged into one large file, as in the sketch below.
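
A minimal sketch of that idea, assuming an input directory of small files on HDFS and an output path of our choosing:

import java.io.ByteArrayOutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFileToSequenceFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path inputDir = new Path("/user/atguigu/input");               // assumed paths
        Path seqFile  = new Path("/user/atguigu/output/small_files.seq");

        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(seqFile),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {

            for (FileStatus status : fs.listStatus(inputDir)) {
                if (status.isFile()) {
                    ByteArrayOutputStream baos = new ByteArrayOutputStream();
                    IOUtils.copyBytes(fs.open(status.getPath()), baos, conf, true);
                    // key = file name, value = whole file content
                    writer.append(new Text(status.getPath().getName()),
                                  new BytesWritable(baos.toByteArray()));
                }
            }
        }
    }
}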

3. New features of Hadoop

3.1. Use the distcp command for recursive data copying between two Hadoop clusters

[atguigu@hadoop102 hadoop-3.1.3]$  bin/hadoop distcp hdfs://hadoop102:9820/user/atguigu/hello.txt hdfs://hadoop105:9820/user/atguigu/hello.txt

3.2. Small file archiving

# Archive file
[atguigu@hadoop102 hadoop-3.1.3]$ hadoop archive -archiveName input.har -p  /user/atguigu/input   /user/atguigu/output
# View Archive
[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -ls /user/atguigu/output/input.har
[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -ls har:///user/atguigu/output/input.har
# Un-archive (extract) the files
[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -cp har:///user/atguigu/output/input.har/*    /user/atguigu

3.3. Hadoop Trash (recycle bin) usage guide

Note: files deleted programmatically do not go through the recycle bin; call moveToTrash() explicitly to move them into the trash:

Trash trash = new Trash(conf);
trash.moveToTrash(path);
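
A slightly fuller, hedged sketch (the path is an assumption; it also requires fs.trash.interval > 0 on the cluster, otherwise moveToTrash() simply returns false and nothing is moved):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.Trash;

public class TrashDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/user/atguigu/hello.txt");   // assumed path

        // Instance API, as in the snippet above
        Trash trash = new Trash(conf);
        boolean moved = trash.moveToTrash(path);

        // Static helper that resolves the trash directory for the path's own file system:
        // Trash.moveToAppropriateTrash(fs, path, conf);

        System.out.println("moved to trash: " + moved);
    }
}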

3.4. Hadoop 3.x new features

PS: Erasure Coding (a new HDFS storage mechanism in Hadoop 3.x that reduces storage overhead compared with replication)
