Xiaobai's Big Data Journey: Leveling Up < Hadoop Compression >

Xiaobai's big data journey (57)

Hadoop compression

Last review

After introducing ZooKeeper, the next step is Hadoop's extended knowledge points: compression and HA. Because HA is built on top of ZooKeeper, now is a good time to cover these topics

Hadoop compression

Compression overview

  • First of all, we should know that compression is an optimization technique for data
  • Using compression effectively reduces the number of bytes read from and written to HDFS, making better use of network bandwidth and disk space
  • Because Shuffle and Merge take a lot of time when an MR job runs, using compression can improve the efficiency of our MR programs

Advantages and disadvantages of compression

  • Although compressing data in the MR process reduces disk IO and thus speeds up the MR program, compression also increases the computational burden on the CPU
  • Because compressed data must be decompressed before it can be used, applying compression appropriately can improve performance, while applying it inappropriately may actually reduce it

Scenes using compression

Knowing the advantages and disadvantages of compression, we can naturally infer its application scenarios:

  • For compute-intensive jobs, use compression sparingly
  • For IO-intensive jobs, use compression

Compression coding of MR

  • Compression comes in many formats because the underlying algorithms differ
  • Different compression techniques suit different scenarios
  • Next, let's look at the compression techniques supported by MR

MR supported compression coding

| Compression format | Bundled with Hadoop? | Algorithm | File extension | Splittable? | Does the original program need changes after switching to this format? |
| --- | --- | --- | --- | --- | --- |
| DEFLATE | Yes, usable directly | DEFLATE | .deflate | No | No; handled the same as plain text |
| Gzip | Yes, usable directly | DEFLATE | .gz | No | No; handled the same as plain text |
| bzip2 | Yes, usable directly | bzip2 | .bz2 | Yes | No; handled the same as plain text |
| LZO | No, must be installed | LZO | .lzo | Yes | Yes; you need to build an index and specify the input format |
| Snappy | No, must be installed | Snappy | .snappy | No | No; handled the same as plain text |
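
To make the "splittable" column concrete, here is a minimal sketch (my own illustration, not from the original post) of how splittability can be checked programmatically: in Hadoop, a codec supports splitting exactly when it implements the SplittableCompressionCodec interface, which BZip2Codec does and GzipCodec does not.

    package com.compress;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.compress.BZip2Codec;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.io.compress.SplittableCompressionCodec;
    import org.apache.hadoop.util.ReflectionUtils;

    public class SplittableCheck {
        public static void main(String[] args) {
            Configuration conf = new Configuration();

            //Instantiate two of the codecs listed in the table above
            CompressionCodec gzip = ReflectionUtils.newInstance(GzipCodec.class, conf);
            CompressionCodec bzip2 = ReflectionUtils.newInstance(BZip2Codec.class, conf);

            //A codec supports splitting only if it implements SplittableCompressionCodec
            System.out.println("gzip splittable:  " + (gzip instanceof SplittableCompressionCodec));  //false
            System.out.println("bzip2 splittable: " + (bzip2 instanceof SplittableCompressionCodec)); //true
        }
    }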

Hadoop encoding / decoding API

| Compression format | Corresponding encoder/decoder |
| --- | --- |
| DEFLATE | org.apache.hadoop.io.compress.DefaultCodec |
| gzip | org.apache.hadoop.io.compress.GzipCodec |
| bzip2 | org.apache.hadoop.io.compress.BZip2Codec |
| LZO | com.hadoop.compression.lzo.LzopCodec |
| Snappy | org.apache.hadoop.io.compress.SnappyCodec |

Compression performance comparison

| Compression algorithm | Original file size | Compressed file size | Compression speed | Decompression speed |
| --- | --- | --- | --- | --- |
| gzip | 8.3 GB | 1.8 GB | 17.5 MB/s | 58 MB/s |
| bzip2 | 8.3 GB | 1.1 GB | 2.4 MB/s | 9.5 MB/s |
| LZO | 8.3 GB | 2.9 GB | 49.3 MB/s | 74.6 MB/s |
| Snappy | 8.3 GB | 4 GB | 250 MB/s | 500 MB/s |

Snappy's compression speed is the fastest, but its compression ratio is the weakest: it basically only compresses a file to about half of its original size. If you want to learn more about Snappy, see its GitHub page: http://google.github.io/snappy

Compression parameter configuration

| Parameter | Default value | Stage | Recommendation |
| --- | --- | --- | --- |
| io.compression.codecs (in core-site.xml) | org.apache.hadoop.io.compress.DefaultCodec, org.apache.hadoop.io.compress.GzipCodec, org.apache.hadoop.io.compress.BZip2Codec | Input compression | Hadoop uses the file extension to determine whether a codec is supported |
| mapreduce.map.output.compress (in mapred-site.xml) | false | Mapper output | Set to true to enable compression at this stage |
| mapreduce.map.output.compress.codec (in mapred-site.xml) | org.apache.hadoop.io.compress.DefaultCodec | Mapper output | Enterprises often use LZO or Snappy at this stage |
| mapreduce.output.fileoutputformat.compress (in mapred-site.xml) | false | Reducer output | Set to true to enable compression at this stage |
| mapreduce.output.fileoutputformat.compress.codec (in mapred-site.xml) | org.apache.hadoop.io.compress.DefaultCodec | Reducer output | Use a standard codec such as gzip or bzip2 |
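
As a quick illustration of how these parameters are used, below is a minimal driver sketch (my own example, not from the original post) that enables Snappy for mapper output and bzip2 for reducer output; the codec choices and the job name are only illustrative.

    package com.compress;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.compress.BZip2Codec;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.SnappyCodec;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class CompressionConfigDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            //Mapper output: switch on compression and pick a fast codec (Snappy)
            conf.setBoolean("mapreduce.map.output.compress", true);
            conf.setClass("mapreduce.map.output.compress.codec",
                    SnappyCodec.class, CompressionCodec.class);

            Job job = Job.getInstance(conf, "compression config demo");

            //Reducer output: switch on compression and pick a standard codec (bzip2)
            FileOutputFormat.setCompressOutput(job, true);
            FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);

            //...set the Mapper/Reducer classes and input/output paths as usual,
            //then submit with job.waitForCompletion(true)
        }
    }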

With these parameters in mind, let's summarize the application scenarios of each compression format:

  • Gzip
    • Hadoop supports it natively and it is easy to use: processing a Gzip file in a program is the same as processing plain text, and most Linux systems ship with the gzip command
    • Because Gzip does not support splitting, it is suitable when each compressed file is roughly no larger than one block (around 130 MB)
  • Bzip2
    • Bzip2 is also natively supported by Hadoop. Its strength is that a single large text file can be compressed and still split; when that is what you need, choose Bzip2
    • However, Bzip2's compression and decompression speeds are relatively slow
  • Lzo
    • Fast compression and decompression combined with support for splitting make it one of the most commonly used compression formats; you can install the lzop command on Linux to work with it
    • LZO files need some extra handling: to support splitting, you must build an index and specify the InputFormat for the LZO format (see the sketch after this list)
    • LZO is a good choice when the compressed data is larger than about 200 MB
    • The larger a single file is, the more obvious LZO's advantage becomes
  • Snappy
    • Its defining characteristic is one word: fast. Its compression and decompression speed is beyond the reach of the other formats
    • However, it does not support splitting, and the compressed output is only about half the size of the source file
    • Because it is so fast, it is usually used as the intermediate format between Map and Reduce when the Map-stage output is large, or as the output of one MR job that feeds the input of another MR job
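
Since LZO support comes from the third-party hadoop-lzo library, the exact class names depend on the version you install. The rough sketch below assumes hadoop-lzo is on the classpath and provides com.hadoop.compression.lzo.LzoIndexer and com.hadoop.mapreduce.LzoTextInputFormat; it only illustrates the two extra steps mentioned in the LZO bullet above, building an index and specifying the input format.

    package com.compress;

    import com.hadoop.mapreduce.LzoTextInputFormat;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class LzoInputDemo {
        public static void main(String[] args) throws Exception {
            //Step 1 (outside this program): build an index next to each .lzo file so it can be split, e.g.
            //  hadoop jar hadoop-lzo.jar com.hadoop.compression.lzo.LzoIndexer /input
            //(hadoop-lzo also ships DistributedLzoIndexer, which builds the indexes with a MapReduce job)

            Job job = Job.getInstance(new Configuration(), "lzo input demo");

            //Step 2: replace the default TextInputFormat with the LZO-aware input format,
            //which reads the .index files and produces splits for the .lzo input
            job.setInputFormatClass(LzoTextInputFormat.class);
            FileInputFormat.addInputPath(job, new Path("/input"));

            //...the rest of the driver (Mapper, Reducer, output path) is unchanged
        }
    }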

Compression and decompression of data stream

  • Now that we understand compression and decompression and the formats MR supports, let's demonstrate how to create and use specific compression through code
  • In Hadoop, we can use the built-in CompressionCodec object to create compressed streams and complete the compression and decompression of files
    package com.compress;
    
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.compress.*;
    import org.apache.hadoop.util.ReflectionUtils;
    import org.junit.Test;
    
    import java.io.FileInputStream;
    import java.io.FileNotFoundException;
    import java.io.FileOutputStream;
    
    public class CompressDemo {
        /**
         *  compress
         */
        @Test
        public void test() throws Exception {
            //1. Input stream -- ordinary file stream
            FileInputStream fis = new FileInputStream("D:\\io\\compress\\aaa.txt");
    
    
            //2. Output stream -- compressed stream
            //2.1 create the object of the corresponding codec
            CompressionCodec gzipCodec =
                    ReflectionUtils.newInstance(GzipCodec.class, new Configuration());
            //2.2 create the compressed output stream
            //gzipCodec.getDefaultExtension(): get the extension of compression type
            CompressionOutputStream cos = gzipCodec.createOutputStream(
                    new FileOutputStream("D:\\io\\decompress\\aaa.txt" +
                            gzipCodec.getDefaultExtension()));
    
            //3. Copy the file; passing true closes both streams when the copy finishes
            IOUtils.copyBytes(fis,cos,1024,true);
    
        }
    
    
        /**
         * decompression 
         */
        @Test
        public void test2() throws Exception {
            //Input stream --- compressed stream
            CompressionCodec gzipCodec =
                    ReflectionUtils.newInstance(GzipCodec.class, new Configuration());
            CompressionInputStream cis = gzipCodec.createInputStream(
                    new FileInputStream("D:\\io\\decompress\\aaa.txt.gz"));
    
            //Output stream --- ordinary file stream
            FileOutputStream fos = new FileOutputStream("D:\\io\\compress\\aaa.txt");
    
            //Copy the file; passing true closes both streams when the copy finishes
            IOUtils.copyBytes(cis,fos,1024,true);
        }
    
    
        /**
         * decompression, letting CompressionCodecFactory choose the codec from the file extension
         */
        @Test
        public void test3() throws Exception {
            //Create a codec factory
            CompressionCodecFactory factory = new CompressionCodecFactory(new Configuration());
    
            //Input stream --- compressed stream
            //Let the factory create the matching codec object based on the file's extension
            CompressionCodec gzipCodec =
                    factory.getCodec(new Path("D:\\io\\decompress\\aaa.txt.gz"));
    
            if (gzipCodec != null){ //null would mean no codec matches the file extension
    
                CompressionInputStream cis = gzipCodec.createInputStream(
                        new FileInputStream("D:\\io\\decompress\\aaa.txt.gz"));
    
                //Output stream --- ordinary file stream
                FileOutputStream fos = new FileOutputStream("D:\\io\\compress\\aaa.txt");
    
                //Copy the file
                IOUtils.copyBytes(cis,fos,1024,true);
            }
    
    
        }
    
    }
    
    

Summary

This chapter shared the knowledge points of compression. Understanding and using compression is essential knowledge for our daily work, and using it sensibly can greatly improve the execution efficiency of our programs. Well, that is all for this chapter. The next chapter brings you Hadoop HA
