Using Function Compute the right way: compressing large files with Brotli

The large-file problem

Function Compute limits the size of an uploaded zip code package to 50 MB. Some scenarios exceed this limit, for example an untrimmed serverless-chrome, similar tools such as LibreOffice, and the common case of machine learning model files.
There are currently three ways to solve the large-file problem:

  1. Use an algorithm with a higher compression ratio, such as the Brotli algorithm introduced in this article
  2. Download the files from OSS at runtime
  3. Share the files through NAS

Here is a brief comparison of the advantages and disadvantages of these three methods:

Method | Advantages | Disadvantages
High-ratio compression | Easy to release; fastest startup | Code package is slower to upload; decompression code must be written; package size is still limited to 50 MB
OSS | Files can total up to 512 MB after download and decompression | Files must be uploaded to OSS in advance; download and decompression code must be written; download speed is about 50 MB/s
NAS | No file size limit; no compression needed | Files must be uploaded to NAS in advance; the VPC environment adds a cold start delay (~5 s)

In general, if the code package can be kept below 50 MB, startup is faster. The project is also simpler: data and code live together, and no extra script is needed to keep OSS or NAS in sync.
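The comparison can be condensed into a small decision rule. The following is only a sketch built from the limits quoted in the table: the class name, the enum, and the choice to treat 1 MB as 1024 * 1024 bytes are assumptions made for illustration, and a real decision would also weigh startup latency and release workflow.

```java
/** Hypothetical helper that picks a delivery method from the limits quoted above. */
class PackagingStrategy {

    enum Method { CODE_PACKAGE, OSS, NAS }

    static final long MB = 1024L * 1024L;             // assumption: binary megabytes
    static final long CODE_PACKAGE_LIMIT = 50 * MB;   // zip upload limit from the table
    static final long OSS_UNPACKED_LIMIT = 512 * MB;  // decompressed size limit from the table

    /**
     * @param compressedBytes size of the compressed code package
     * @param unpackedBytes   size of the files after decompression
     */
    static Method choose(long compressedBytes, long unpackedBytes) {
        if (compressedBytes <= CODE_PACKAGE_LIMIT) {
            return Method.CODE_PACKAGE; // simplest release, fastest startup
        }
        if (unpackedBytes <= OSS_UNPACKED_LIMIT) {
            return Method.OSS;          // download and decompress at runtime
        }
        return Method.NAS;              // no size limit, but needs a VPC
    }

    public static void main(String[] args) {
        System.out.println(choose(33 * MB, 104 * MB));   // CODE_PACKAGE
        System.out.println(choose(200 * MB, 400 * MB));  // OSS
        System.out.println(choose(800 * MB, 2048 * MB)); // NAS
    }
}
```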

Compression algorithm

Brotli is an open source compression algorithm developed by Google engineers. Recent versions of mainstream browsers support it as a compression algorithm for HTTP transfer. Below is a benchmark, found online, that compares Brotli with other common compression algorithms.




The benchmark's three figures show that, compared with gzip, xz, and bz2, Brotli achieves the highest compression ratio and a decompression speed close to gzip's, but has the slowest compression speed.

In our scenario, however, slow compression matters little: compression only needs to run once, while preparing the materials during development.

Making compressed files

First, here is how to make the compressed file. The code and use case below come from the project packed-selenium-java-example.

Install the brotli command

Mac users

brew install brotli

Windows users can download it from the releases page: https://github.com/google/brotli/releases

Pack and compress

The two original files are 7.5 MB and 97 MB respectively.

╭─ ~/D/test1[◷ 18:15:21]
╰─  ll
total 213840
-rwxr-xr-x  1 vangie  staff   7.5M  3  5 11:13 chromedriver
-rwxr-xr-x  1 vangie  staff    97M  1 25  2018 headless-chromium

Packing and compressing them with gzip gives a 44 MB file.

╭─ ~/D/test1[◷ 18:15:33]
╰─  tar -czvf chromedriver.tar chromedriver headless-chromium
a chromedriver
a headless-chromium
╭─ ~/D/test1[◷ 18:16:41]
╰─  ll
total 306216
-rwxr-xr-x  1 vangie  staff   7.5M  3  5 11:13 chromedriver
-rw-r--r--  1 vangie  staff    44M  3  6 18:16 chromedriver.tar
-rwxr-xr-x  1 vangie  staff    97M  1 25  2018 headless-chromium

Remove the z option from tar and pack again, this time without compression. The size is 104 MB.

╭─ ~/D/test1[◷ 18:16:42]
╰─  tar -cvf chromedriver.tar chromedriver headless-chromium
a chromedriver
a headless-chromium
╭─ ~/D/test1[◷ 18:17:06]
╰─  ll
total 443232
-rwxr-xr-x  1 vangie  staff   7.5M  3  5 11:13 chromedriver
-rw-r--r--  1 vangie  staff   104M  3  6 18:17 chromedriver.tar
-rwxr-xr-x  1 vangie  staff    97M  1 25  2018 headless-chromium

Now compress the tar file with brotli. The result is 33 MB, much smaller than the 44 MB produced by gzip. Compression takes 6 minutes and 18 seconds, however, while gzip takes only 5 seconds.

╭─ ~/D/test1[◷ 18:17:08]
╰─  time brotli -q 11 -j -f chromedriver.tar
brotli -q 11 -j -f chromedriver.tar  375.39s user 1.66s system 99% cpu 6:18.21 total
╭─ ~/D/test1[◷ 18:24:23]
╰─  ll
total 281552
-rwxr-xr-x  1 vangie  staff   7.5M  3  5 11:13 chromedriver
-rw-r--r--  1 vangie  staff    33M  3  6 18:17 chromedriver.tar.br
-rwxr-xr-x  1 vangie  staff    97M  1 25  2018 headless-chromium
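To put these sizes in perspective, the compression ratios can be worked out directly. The sketch below uses the rounded sizes reported by ll (104 MB tar, 44 MB gzip, 33 MB Brotli), so the results are approximate; the class and method names are made up for this illustration.

```java
import java.util.Locale;

/** Worked arithmetic on the file sizes shown in the listings above. */
class CompressionRatio {

    /** Original size divided by compressed size; larger means better compression. */
    static double ratio(double originalMb, double compressedMb) {
        return originalMb / compressedMb;
    }

    /** Fraction of space saved relative to a baseline size. */
    static double saving(double baselineMb, double compressedMb) {
        return 1.0 - compressedMb / baselineMb;
    }

    public static void main(String[] args) {
        double tar = 104, gzip = 44, brotli = 33; // sizes in MB, as printed by ll

        // gzip ratio: 2.36
        System.out.println(String.format(Locale.ROOT, "gzip ratio: %.2f", ratio(tar, gzip)));
        // brotli ratio: 3.15
        System.out.println(String.format(Locale.ROOT, "brotli ratio: %.2f", ratio(tar, brotli)));
        // brotli vs gzip: 25% smaller
        System.out.println(String.format(Locale.ROOT, "brotli vs gzip: %.0f%% smaller",
                100 * saving(gzip, brotli)));
    }
}
```

In other words, Brotli's output is about a quarter smaller than gzip's here, which is what brings the package under the 50 MB limit with room to spare.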

Runtime decompression

Take a Java Maven project as an example.

Add the decompression dependencies

<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-compress</artifactId>
    <version>1.18</version>
</dependency>

<dependency>
    <groupId>org.brotli</groupId>
    <artifactId>dec</artifactId>
    <version>0.1.2</version>
</dependency>

Commons Compress is a compression toolkit provided by Apache. It offers a consistent abstract interface for various compression algorithms. For Brotli it supports decompression only, which is enough here. The org.brotli:dec package is the underlying implementation of the Brotli decompression algorithm, provided by Google.

Implementing the initialize method

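import com.aliyun.fc.runtime.Context;
import com.aliyun.fc.runtime.FunctionInitializer;
import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;
import org.apache.commons.compress.compressors.brotli.BrotliCompressorInputStream;
// Assumption: IOUtils is the commons-compress utility class, since that is the declared dependency.
import org.apache.commons.compress.utils.IOUtils;

import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.time.Duration;
import java.time.Instant;

public class ChromeDemo implements FunctionInitializer {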

    public void initialize(Context context) throws IOException {

        Instant start = Instant.now();

        try (TarArchiveInputStream in =
                     new TarArchiveInputStream(
                             new BrotliCompressorInputStream(
                                     new BufferedInputStream(
                                             new FileInputStream("chromedriver.tar.br"))))) {

            TarArchiveEntry entry;
            while ((entry = in.getNextTarEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                File file = new File("/tmp/bin", entry.getName());
                File parent = file.getParentFile();
                if (!parent.exists()) {
                    parent.mkdirs();
                }

                System.out.println("extract file to " + file.getAbsolutePath());

                try (FileOutputStream out = new FileOutputStream(file)) {
                    IOUtils.copy(in, out);
                }

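                // getPosixFilePermission is a helper defined in packed-selenium-java-example;
                // it converts the tar entry's mode bits into a Set<PosixFilePermission>.
                Files.setPosixFilePermissions(file.getCanonicalFile().toPath(),
                        getPosixFilePermission(entry.getMode()));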
            }
        }

        Instant finish = Instant.now();
        long timeElapsed = Duration.between(start, finish).toMillis();

        System.out.println("Extract binary elapsed: " + timeElapsed + "ms");

    }
}

The class implements the initialize method of the FunctionInitializer interface. Decompression starts with four nested stream layers, which work as follows:

  1. FileInputStream reads the file
  2. BufferedInputStream provides buffering, which reduces the context switches caused by system calls and improves read speed
  3. BrotliCompressorInputStream decodes the compressed byte stream
  4. TarArchiveInputStream parses the files in the tar package one by one
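The same layering can be tried out with nothing but the JDK. In the sketch below, GZIPInputStream stands in for BrotliCompressorInputStream, a ByteArrayInputStream stands in for the file, and a DataInputStream reading simple (name, length, bytes) records stands in for the tar parser. The record format and all names are invented for this illustration; it is not how tar or Brotli work internally.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class NestedStreamsDemo {

    /** Writes entries as (name, length, bytes) records and gzips the result. */
    static byte[] pack(Map<String, byte[]> entries) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(new GZIPOutputStream(sink))) {
            out.writeInt(entries.size());
            for (Map.Entry<String, byte[]> e : entries.entrySet()) {
                out.writeUTF(e.getKey());
                out.writeInt(e.getValue().length);
                out.write(e.getValue());
            }
        }
        return sink.toByteArray();
    }

    /** Reads the records back through four nested stream layers. */
    static Map<String, byte[]> unpack(byte[] packed) throws IOException {
        Map<String, byte[]> entries = new LinkedHashMap<>();
        try (DataInputStream in =
                     new DataInputStream(                      // 4. parses entries one by one
                             new GZIPInputStream(              // 3. decodes the compressed bytes
                                     new BufferedInputStream(  // 2. buffers reads
                                             new ByteArrayInputStream(packed))))) { // 1. raw source
            int count = in.readInt();
            for (int i = 0; i < count; i++) {
                String name = in.readUTF();
                byte[] content = new byte[in.readInt()];
                in.readFully(content);
                entries.put(name, content);
            }
        }
        return entries;
    }

    public static void main(String[] args) throws IOException {
        Map<String, byte[]> files = new LinkedHashMap<>();
        files.put("chromedriver", "driver bytes".getBytes(StandardCharsets.UTF_8));
        files.put("headless-chromium", "browser bytes".getBytes(StandardCharsets.UTF_8));

        for (Map.Entry<String, byte[]> e : unpack(pack(files)).entrySet()) {
            System.out.println(e.getKey() + " -> "
                    + new String(e.getValue(), StandardCharsets.UTF_8));
        }
    }
}
```

Closing the outermost stream in the try-with-resources statement closes every layer beneath it, which is why the original code only needs one resource declaration.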

Files.setPosixFilePermissions then restores the permissions recorded for each file in the tar package. The getPosixFilePermission helper is too long to reproduce here; please refer to packed-selenium-java-example.
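As a rough idea of what converting a tar entry's mode into a permission set involves, here is a minimal sketch using only the JDK. The class and method names (ModeToPermissions, fromMode) are hypothetical, and the actual helper in packed-selenium-java-example may well be written differently.

```java
import java.nio.file.attribute.PosixFilePermission;
import java.nio.file.attribute.PosixFilePermissions;
import java.util.EnumSet;
import java.util.Set;

class ModeToPermissions {

    // Ordered from the highest bit (owner read, 0400) to the lowest (others execute, 0001).
    private static final PosixFilePermission[] ORDERED = {
        PosixFilePermission.OWNER_READ,
        PosixFilePermission.OWNER_WRITE,
        PosixFilePermission.OWNER_EXECUTE,
        PosixFilePermission.GROUP_READ,
        PosixFilePermission.GROUP_WRITE,
        PosixFilePermission.GROUP_EXECUTE,
        PosixFilePermission.OTHERS_READ,
        PosixFilePermission.OTHERS_WRITE,
        PosixFilePermission.OTHERS_EXECUTE,
    };

    /** Maps the low nine bits of a Unix mode (for example 0755) to a permission set. */
    static Set<PosixFilePermission> fromMode(int mode) {
        Set<PosixFilePermission> result = EnumSet.noneOf(PosixFilePermission.class);
        for (int i = 0; i < ORDERED.length; i++) {
            int bit = 1 << (ORDERED.length - 1 - i);
            if ((mode & bit) != 0) {
                result.add(ORDERED[i]);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Both binaries in the listing are -rwxr-xr-x, i.e. mode 0755.
        System.out.println(PosixFilePermissions.toString(fromMode(0755))); // rwxr-xr-x
    }
}
```

Only the low nine bits are inspected, so any file-type bits stored above them in the mode are ignored.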

Instant start = Instant.now();
...

Instant finish = Instant.now();
long timeElapsed = Duration.between(start, finish).toMillis();

System.out.println("Extract binary elapsed: " + timeElapsed + "ms");

The code above prints the time spent on decompression, which is about 3.7 seconds.

Finally, don't forget to configure Initializer and InitializationTimeout in template.yml.

Reference reading

  1. https://www.opencpu.org/posts/brotli-benchmarks/
  2. https://github.com/vangie/packed-selenium-java-example


Keywords: Linux Java Google Selenium github

Added by stb74 on Fri, 21 Feb 2020 09:43:32 +0200