Small file processing topics

I. MapReduce

1.1 Problems caused by small files

  1. On HDFS, every file, directory, and block is tracked as an object of roughly 150 bytes in NameNode memory. A large number of small files therefore eats up NameNode heap (for example, 100 million small files need on the order of 15 GB for the file objects alone, before counting block objects) and slows down metadata lookups.
  2. When MapReduce processes the data, each small file becomes its own input split and starts its own MapTask. Each MapTask requests 1 GB of memory by default, which consumes a lot of NodeManager memory; in addition, the startup and initialization time of a MapTask is far longer than the actual processing time, so most of the resources are wasted.

1.2 How to solve it

  1. Fix the problem at the data source: control the file rolling parameters when Flume writes data into HDFS, or write the incoming data into HBase and merge/compact it periodically.

    ## Avoid generating a large number of small files (rolling strategy: roll by time, by file size, or by event count)
    # roll a new file every hour
    a1.sinks.k1.hdfs.rollInterval = 3600
    # or as soon as the file reaches 128 MB
    a1.sinks.k1.hdfs.rollSize = 134217728
    # never roll based on the number of events
    a1.sinks.k1.hdfs.rollCount = 0
    
  2. Small files already stored on HDFS can be merged with HAR (Hadoop Archive), but this only relieves NameNode memory pressure and does not help MapReduce at all: a HAR file is essentially a layered directory, each archived small file still becomes its own split, so one split is created per small file and processing remains inefficient.
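    As an illustration (the paths are made up), an archive is created with the hadoop archive tool and then addressed through the har:// scheme:

    # pack everything under /user/hadoop/input into a single archive
    hadoop archive -archiveName data.har -p /user/hadoop/input /user/hadoop/archive
    # the archive behaves like a directory and is read through the har scheme
    hadoop fs -ls har:///user/hadoop/archive/data.har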

  3. In MapReduce itself:

    1. On the MR input side, use CombineTextInputFormat to pack multiple small files into a single split, and set the maximum size of the merged (virtual-storage) splits:

      // If no InputFormat is set, the job defaults to TextInputFormat
      job.setInputFormatClass(CombineTextInputFormat.class);
      // Maximum size of a virtual-storage split: 128 MB (134217728 bytes)
      CombineTextInputFormat.setMaxInputSplitSize(job, 134217728);
      
    2. Turn on uber mode to get JVM reuse. By default every task gets its own JVM, which is torn down when the task finishes, so the next task has to request a new one. If the JVM is kept alive after a task completes and reused by the following task, the time spent requesting and starting JVMs is saved.

      https://blog.csdn.net/myproudcodelife/article/details/44477819 (detailed explanation of uber mode and JVM reuse: Hadoop 1.x supports JVM reuse, Hadoop 2.x supports uber mode)

      JVM reuse lets a JVM instance be reused up to N times within the same job. N is configured in Hadoop's mapred-site.xml, usually somewhere between 10 and 20:

      <property>
        <name>mapreduce.job.jvm.numtasks</name>
        <value>10</value>
        <description>How many tasks to run per jvm. If set to -1, there is no limit.</description>
      </property>
      

      **Disadvantage:** when JVM reuse is enabled, the slots used by the tasks stay occupied until the entire job finishes, so they cannot be released for other jobs in the meantime.
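      For reference, uber mode itself is switched on with its own properties in mapred-site.xml. A minimal sketch, using the stock Hadoop 2.x property names and default thresholds:

      <property>
        <name>mapreduce.job.ubertask.enable</name>
        <value>true</value>
        <description>Run all tasks of a small enough job inside the ApplicationMaster JVM</description>
      </property>
      <property>
        <name>mapreduce.job.ubertask.maxmaps</name>
        <value>9</value>
      </property>
      <property>
        <name>mapreduce.job.ubertask.maxreduces</name>
        <value>1</value>
      </property>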

    3. Tuning does not mean memorizing a set of "tuned" values. In general the default values are used; parameter tuning means changing a default to handle a special situation. The defaults cover most cases, and tuning exists for the exceptions.

II. Hive

2.1 How small files are generated in Hive

  1. Dynamic partitioning: Hive's dynamic partition feature automatically creates partitions based on a field's values. If the partition field has many distinct values, a large number of small files is written to HDFS, and the resulting MR jobs then have to process all of them (see the sketch after this list).
  2. Too many reducers: each ReduceTask writes its own output file by default, so a high reducer count means many output files.
  3. The data source itself already consists of a large number of small files.
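    As a sketch of point 1 (table and column names are made up), a dynamic-partition insert like the following writes files per partition value, so a high-cardinality partition column quickly produces a large number of small files:

    set hive.exec.dynamic.partition=true;
    set hive.exec.dynamic.partition.mode=nonstrict;
    insert overwrite table dwd_user_log partition (dt)
    select user_id, action, dt from ods_user_log;
    -- every reducer may write one file per distinct dt value it receives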

2.2 Hazards of small files

The same as for Hadoop MapReduce (see section 1.1).

2.3 How to solve it

  1. Use Hadoop Archive (HAR) to archive the small files (see section I).

  2. Rebuild the table with a smaller number of reducers, which reduces the number of output files, for example:

    distribute by <column_name>, cast(rand() * 123 as int)   -- rebuild the table, spreading rows evenly across the reducers
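    A fuller sketch (the table name and the reducer count are made up): fix the number of reducers and spread the rows evenly over them, so the table is rewritten into a few large files:

    set mapred.reduce.tasks=10;                  -- write at most 10 output files
    insert overwrite table user_log
    select * from user_log
    distribute by cast(rand() * 10 as int);      -- spread rows evenly over the 10 reducers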
    
  3. Set parameters to merge small files on the input side before the Hive job runs:

-- Maximum input size per Map (this value determines the number of files after merging)
set mapred.max.split.size=256000000;
-- Minimum split size on one node (determines whether files on multiple DataNodes need to be merged)
set mapred.min.split.size.per.node=100000000;
-- Minimum split size on one rack (determines whether files on multiple racks need to be merged)
set mapred.min.split.size.per.rack=100000000;
-- Merge small files before the map phase
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
  4. Set parameters to merge the map-side and reduce-side outputs, and adjust the number of maps and reduces:
-- Merge small files at the end of a map-only job (default: true)
set hive.merge.mapfiles = true;
-- Merge small files at the end of a map-reduce job (default: false)
set hive.merge.mapredfiles = true;
-- Target size of the merged files (256 MB)
set hive.merge.size.per.task = 256000000;
-- When the average output file size is below this value, start an extra MapReduce job to merge the files
set hive.merge.smallfiles.avgsize = 16000000;

III. Spark

  1. How Spark handles small files:

    Small files are handled mainly by reducing the number of partitions with repartition or coalesce, so that fewer and larger files are written out (see the sketch after this list).

    1. repartition always performs a shuffle and redistributes the data evenly; it is the more commonly used option
    2. coalesce does not shuffle by default (it only merges existing partitions), which is cheaper but may leave the data skewed
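    A minimal sketch (paths and partition counts are made up; sc is an existing SparkContext) of shrinking the number of files before writing:

    // thousands of small input files give the RDD thousands of partitions
    val logs = sc.textFile("/data/ods/user_log")
    // coalesce merges them without a shuffle; repartition(16) would do the same with a full shuffle
    logs.coalesce(16).saveAsTextFile("/data/dwd/user_log")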
  2. Spark SQL can set the target amount of data each reducer reads after a shuffle (adaptive execution):

    val sc = new SparkContext(conf)
    val hiveContext = new HiveContext(sc)
    // Enable Adaptive Execution, which sets the number of shuffle partitions automatically
    hiveContext.setConf("spark.sql.adaptive.enabled", "true")
    // Target amount of data each reducer reads after a shuffle. Default is 64 MB; usually raised to the cluster block size (128 MB)
    hiveContext.setConf("spark.sql.adaptive.shuffle.targetPostShuffleInputSize", "128000000")
    
  3. https://blog.csdn.net/a13705510005/article/details/102295768 (specific small-file scenarios in Spark and their solutions)

  4. Many operators can also set the number of partitions directly (see the sketch below).
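    A minimal sketch (paths are made up): many shuffle operators take a numPartitions argument, so the partition count can be fixed at the operator itself:

    val counts = sc.textFile("/data/ods/user_log")
      .map(line => (line, 1))
      .reduceByKey(_ + _, 10)          // the shuffled result is held in only 10 partitions
    counts.saveAsTextFile("/data/dwd/line_count")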
