1. Hadoop data compression
Compression format | Original file size | Compressed file size | Compression speed | Decompression speed | Built into Hadoop | Splittable | Requires code changes |
---|---|---|---|---|---|---|---|
gzip | 8.3GB | 1.8GB | 17.5MB/s | 58MB/s | yes | no | no |
bzip2 | 8.3GB | 1.1GB | 2.4MB/s | 9.5MB/s | yes | yes | no |
LZO | 8.3GB | 2.9GB | 49.3MB/s | 74.6MB/s | no | yes | yes |
- Input compression (Hadoop uses the file extension to determine whether it supports a codec; configured in core-site.xml):
  - org.apache.hadoop.io.compress.DefaultCodec
  - org.apache.hadoop.io.compress.GzipCodec
  - org.apache.hadoop.io.compress.BZip2Codec
  - com.hadoop.compression.lzo.LzopCodec
  - org.apache.hadoop.io.compress.SnappyCodec
- Mapper output (enterprises often use the LZO or Snappy codec to compress data at this stage; configured in mapred-site.xml):
  - com.hadoop.compression.lzo.LzopCodec
  - org.apache.hadoop.io.compress.SnappyCodec
- Reducer output (use standard tools or codecs such as gzip and bzip2; configured in mapred-site.xml):
  - org.apache.hadoop.io.compress.GzipCodec
  - org.apache.hadoop.io.compress.BZip2Codec
PS: The LZO format is GPL-licensed and therefore cannot be distributed with Apache Hadoop, so its Hadoop codec must be downloaded, compiled, and installed separately on Linux. LzopCodec is compatible with the lzop tool: it is the LZO format plus an extra header, which is exactly what we want. There is also a pure-LZO codec, LzoCodec, which uses lzo_deflate as its file extension (analogous to DEFLATE, which is the gzip format without the headers).
1.1. Compression and decompression of data streams
// Compression: get the codec by name
CompressionCodecFactory factory = new CompressionCodecFactory(new Configuration());
CompressionCodec codec = factory.getCodecByName(method);
// Open a plain output stream; append the codec's default extension to the file name
FileOutputStream fos = new FileOutputStream(new File(filename + codec.getDefaultExtension()));
// Wrap fos with the codec to obtain a compressed output stream
CompressionOutputStream cos = codec.createOutputStream(fos);
// Decompression: infer the codec from the file extension
CompressionCodecFactory factory = new CompressionCodecFactory(new Configuration());
CompressionCodec codec = factory.getCodec(new Path(filename));
// Open a plain input stream
FileInputStream fis = new FileInputStream(new File(filename));
// Wrap fis with the codec to obtain a decompressed input stream
CompressionInputStream cis = codec.createInputStream(fis);
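Putting the two snippets together, here is a minimal end-to-end sketch that compresses a local file with bzip2 and then decompresses it again; the file names are placeholders chosen for illustration.

```java
import java.io.*;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.*;

public class StreamCompressionDemo {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);

        // Compress: input.txt -> input.txt.bz2 (placeholder file names)
        CompressionCodec codec = factory.getCodecByName("bzip2");
        try (InputStream in = new FileInputStream("input.txt");
             CompressionOutputStream out = codec.createOutputStream(
                     new FileOutputStream("input.txt" + codec.getDefaultExtension()))) {
            IOUtils.copyBytes(in, out, conf);
        }

        // Decompress: the codec is inferred from the .bz2 extension
        String compressed = "input.txt" + codec.getDefaultExtension();
        CompressionCodec inCodec = factory.getCodec(new Path(compressed));
        try (CompressionInputStream in = inCodec.createInputStream(new FileInputStream(compressed));
             OutputStream out = new FileOutputStream("input.decoded.txt")) {
            IOUtils.copyBytes(in, out, conf);
        }
    }
}
```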
1.2. Compressing Map and Reduce output
Mapper and Reducer remain unchanged
// Enable map output compression
conf.setBoolean("mapreduce.map.output.compress", true);
// Set the map-side output compression codec
conf.setClass("mapreduce.map.output.compress.codec", BZip2Codec.class, CompressionCodec.class);
// Enable output compression on the reduce side
FileOutputFormat.setCompressOutput(job, true);
// Set the output compression codec
FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);
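For context, here is a minimal, self-contained driver sketch showing where these four calls sit in a job; the word-count Mapper/Reducer and the choice of BZip2Codec are only illustrative.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressedWordCount {

    public static class WcMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final Text word = new Text();
        private final IntWritable one = new IntWritable(1);
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                word.set(token);
                context.write(word, one);
            }
        }
    }

    public static class WcReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Compress intermediate (map) output before it is shuffled
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec", BZip2Codec.class, CompressionCodec.class);

        Job job = Job.getInstance(conf, "compressed word count");
        job.setJarByClass(CompressedWordCount.class);
        job.setMapperClass(WcMapper.class);
        job.setReducerClass(WcReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Compress the final (reduce) output files
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```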
2. Hadoop enterprise optimization
2.1. Bottlenecks of MapReduce program efficiency
1) Computer performance: CPU, memory, hard disk, network
2) I/O operation issues: data skew, an unreasonable number of MapTasks or ReduceTasks, too many small files, compressed files that cannot be split, too many spills, too many merges, and overly long Reduce times
Solution:
1) Input stage: use CombineTextInputFormat to merge large numbers of small files on the input side
2) Map stage: reduce the number of spills, reduce the number of merges, and add a Combiner
mapred-default.xml
<!-- Increase the memory limit that triggers a spill -->
<property>
  <name>mapreduce.task.io.sort.mb</name>
  <value>100</value>
  <description>The total amount of buffer memory to use while sorting files, in megabytes. By default, gives each merge stream 1MB, which should minimize seeks.</description>
</property>
<property>
  <name>mapreduce.map.sort.spill.percent</name>
  <value>0.80</value>
  <description>The soft limit in the serialization buffer. Once reached, a thread will begin to spill the contents to disk in the background. Note that collection will not block if this threshold is exceeded while a spill is already in progress, so spills may be larger than this threshold when it is set to less than .5</description>
</property>
<!-- Increase the number of files merged at once -->
<property>
  <name>mapreduce.task.io.sort.factor</name>
  <value>10</value>
  <description>The number of streams to merge at once while sorting files. This determines the number of open file handles.</description>
</property>
3) Reduce stage: set the number of MapTasks and ReduceTasks reasonably (too few and tasks wait, too many and they compete for resources), let Map and Reduce coexist (start running Reduce after a certain fraction of Maps have completed), and avoid relying on Reduce where possible (fetching data on the Reduce side causes heavy network traffic)
mapred-default.xml
<property>
  <name>mapreduce.job.reduce.slowstart.completedmaps</name>
  <value>0.05</value>
  <description>Fraction of the number of maps in the job which should be complete before reduces are scheduled for the job.</description>
</property>
<property>
  <name>mapreduce.reduce.input.buffer.percent</name>
  <value>0.0</value>
  <description>The percentage of memory- relative to the maximum heap size- to retain map outputs during the reduce. When the shuffle is concluded, any remaining map outputs in memory must consume less than this threshold before the reduce can begin.</description>
</property>
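These defaults can also be overridden per job from the driver rather than cluster-wide in mapred-site.xml; a small sketch follows, where the concrete values are illustrative examples, not recommendations.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class TunedJobConf {
    // Builds a Job whose map/reduce tuning parameters override the defaults shown above.
    // All values here are examples only; tune them against your own workload.
    public static Job buildJob() throws Exception {
        Configuration conf = new Configuration();
        // Map stage: larger sort buffer, higher spill threshold, wider merge factor
        conf.setInt("mapreduce.task.io.sort.mb", 200);
        conf.setFloat("mapreduce.map.sort.spill.percent", 0.90f);
        conf.setInt("mapreduce.task.io.sort.factor", 20);
        // Reduce stage: delay reduce start until half of the maps have finished,
        // and allow some map output to stay in memory during the reduce
        conf.setFloat("mapreduce.job.reduce.slowstart.completedmaps", 0.5f);
        conf.setFloat("mapreduce.reduce.input.buffer.percent", 0.2f);
        return Job.getInstance(conf, "tuned job");
    }
}
```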
4) I/O stage: use the Snappy and LZO codecs for compression, and use the binary SequenceFile format (see the summary of Hive binary storage formats, i.e. SequenceFile and RCFile)
5) Data skew: use sampling and range partitioning (preset partition boundaries from a data sample), custom partitioners (see the sketch after this list), a Combiner to reduce the data volume, and avoid Reduce Join in favour of Map Join wherever possible
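To illustrate the custom-partitioner idea from point 5, here is a small sketch that routes one known hot key to a dedicated reducer and hash-partitions everything else; the hot key name is made up for the example.

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Illustrative partitioner for skewed data: a known hot key gets its own partition,
// all other keys are hash-partitioned over the remaining reducers.
public class SkewAwarePartitioner extends Partitioner<Text, IntWritable> {
    private static final String HOT_KEY = "null_user";   // made-up hot key for the example

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (numPartitions == 1) {
            return 0;
        }
        // Reserve the last partition for the hot key
        if (HOT_KEY.equals(key.toString())) {
            return numPartitions - 1;
        }
        // Hash the remaining keys over the other partitions
        return (key.hashCode() & Integer.MAX_VALUE) % (numPartitions - 1);
    }
}
```

It would be registered in the driver with job.setPartitionerClass(SkewAwarePartitioner.class) together with a matching job.setNumReduceTasks(...).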
2.2. Common tuning parameters of Hadoop
2.3. Hadoop small file optimization methods
Supplement: a SequenceFile is composed of a series of binary key/value pairs. If the key is the file name and the value is the file content, a large number of small files can be merged into one large file.
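A small sketch of this idea, assuming a local directory of small files and an HDFS target path (both placeholders): each file is written into one SequenceFile with the file name as key and the raw bytes as value.

```java
import java.io.File;
import java.nio.file.Files;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilesToSequenceFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://hadoop102:9820");   // assumed NameNode, matching section 3.1

        Path target = new Path("/user/atguigu/merged.seq");  // placeholder output path
        SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(target),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class));
        try {
            File[] smallFiles = new File("/tmp/small-files").listFiles();   // placeholder local directory
            if (smallFiles != null) {
                for (File f : smallFiles) {
                    byte[] content = Files.readAllBytes(f.toPath());
                    // key = file name, value = file content
                    writer.append(new Text(f.getName()), new BytesWritable(content));
                }
            }
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}
```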
3. New features of Hadoop
3.1. Use the distcp command for recursive data replication between two Hadoop clusters
[atguigu@hadoop102 hadoop-3.1.3]$ bin/hadoop distcp hdfs://hadoop102:9820/user/atguigu/hello.txt hdfs://hadoop105:9820/user/atguigu/hello.txt
3.2. Small file archiving
# Archive the files
[atguigu@hadoop102 hadoop-3.1.3]$ hadoop archive -archiveName input.har -p /user/atguigu/input /user/atguigu/output
# View the archive
[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -ls /user/atguigu/output/input.har
[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -ls har:///user/atguigu/output/input.har
# Extract (unarchive) the files
[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -cp har:///user/atguigu/output/input.har/* /user/atguigu
3.3. Hadoop Trash recycle bin usage guide
Supplement: files deleted programmatically do not go through the recycle bin; you need to call moveToTrash() explicitly to move them there.
Trash trash = new Trash(conf);
trash.moveToTrash(path);
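A self-contained sketch around these two lines; the NameNode address and the path to delete are placeholders. moveToTrash() returns false when the trash is disabled (fs.trash.interval = 0), so the sketch falls back to a permanent delete in that case.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.Trash;

public class TrashDeleteDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://hadoop102:9820");   // assumed NameNode, matching section 3.1
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/user/atguigu/hello.txt");     // placeholder file to delete
        Trash trash = new Trash(fs, conf);
        if (!trash.moveToTrash(path)) {
            // Trash disabled on this cluster: delete permanently instead
            fs.delete(path, true);
        }
    }
}
```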