hive distribute by group application tuning
Group by fields in the table
set hive.auto.convert.join=true; set hive.auto.convert.join.noconditionaltask=true; set hive.auto.convert.join.noconditionaltask.size=10000000; set hive.mapjoin.smalltable.filesize=200000000; set hive.merge.mapfiles = true; set hive.merge.mapredfiles = false; --MR Small files are not merged at the end of the operation set hive.merge.size.per.task = 256000000; set hive.merge.smallfiles.avgsize = 256000000; set hive.merge.orcfile.stripe.level=false; set hive.merge.rcfile.block.level=false; --appoint tmp.tmp_01 The default file storage format is textfile set hive.default.fileformat = orc; drop table if exists tmp.tmp_01; create table tmp.tmp_01 --STORED AS orc as SELECT a.item_sku_id FROM ( SELECT item_sku_id,sku_name,brandname_cn,data_type FROM big_table WHERE dt = '2022-02-15' AND brandname_cn = 'Fen Wine' distribute by item_sku_id -- Large tables are grouped by fields in the table ) a --Data size 2 T about LEFT OUTER JOIN ( SELECT item_sku_id FROM small_table WHERE version = '1' AND normal_brandname_en = 'fenjiu' AND dt = '2022-02-15' ) b --The data size is less than 10 M ON a.item_sku_id = b.item_sku_id WHERE b.item_sku_id IS NULL;
result:
2022-02-16 23:03:47 INFO Table tmp.tmp_ljjtest02 stats: [numFiles=23, numRows=23, totalSize=5894, rawDataSize=2252] 2022-02-16 23:03:47 INFO Stage-1: Map: 18024 Reduce: 1009 Cumulative CPU: 377078.2 sec HDFS Read: 193.287 GB HDFS Write: 0.000 GB SUCCESS Elapsed : 4m18s539ms job_5690061100801_32622232 2022-02-16 23:03:47 INFO Stage-10: Map: 920 Cumulative CPU: 5431.21 sec HDFS Read: 0.008 GB HDFS Write: 0.000 GB SUCCESS Elapsed : 52s336ms job_5690061100801_32622707 2022-02-16 23:03:47 INFO Stage-4: Map: 832 Cumulative CPU: 1078.67 sec HDFS Read: 0.001 GB HDFS Write: 0.000 GB SUCCESS Elapsed : 31s273ms job_8766029101801_36250210 2022-02-16 23:03:47 INFO Total MapReduce CPU Time Spent: 4d10h33m8s80ms 2022-02-16 23:03:47 INFO Total Map: 19776 Total Reduce: 1009 2022-02-16 23:03:47 INFO Total HDFS Read: 193.296 GB Written: 0.000 GB
Result analysis:
Grouped by fields in the table, if the data is uneven, it may cause data skew.
Group by random number
set hive.auto.convert.join=true; set hive.auto.convert.join.noconditionaltask=true; set hive.auto.convert.join.noconditionaltask.size=10000000; set hive.mapjoin.smalltable.filesize=200000000; set hive.merge.mapfiles = true; set hive.merge.mapredfiles = false; --MR Small files are not merged at the end of the operation set hive.merge.size.per.task = 256000000; set hive.merge.smallfiles.avgsize = 256000000; set hive.merge.orcfile.stripe.level=false; set hive.merge.rcfile.block.level=false; --appoint tmp.tmp_01 The default file storage format is textfile set hive.default.fileformat = orc; drop table if exists tmp.tmp_01; create table tmp.tmp_01 --STORED AS orc as SELECT a.item_sku_id FROM ( SELECT item_sku_id,sku_name,brandname_cn,data_type FROM big_table WHERE dt = '2022-02-15' AND brandname_cn = 'Fen Wine' distribute by item_sku_id -- Large tables are grouped in random numbers ) a --Data size 2 T about LEFT OUTER JOIN ( SELECT item_sku_id FROM small_table WHERE version = '1' AND normal_brandname_en = 'fenjiu' AND dt = '2022-02-15' ) b --The data size is less than 10 M ON a.item_sku_id = b.item_sku_id WHERE b.item_sku_id IS NULL;
result:
2022-02-16 22:53:00 INFO Table tmp.tmp_01 stats: [numFiles=3, numRows=23, totalSize=958, rawDataSize=2235] 2022-02-16 22:53:00 INFO Stage-1: Map: 18024 Reduce: 1009 Cumulative CPU: 372613.35 sec HDFS Read: 193.289 GB HDFS Write: 0.000 GB SUCCESS Elapsed : 5m32s865ms job_5690061100801_32621522 2022-02-16 22:53:00 INFO Stage-10: Map: 924 Cumulative CPU: 5424.53 sec HDFS Read: 0.008 GB HDFS Write: 0.000 GB SUCCESS Elapsed : 41s846ms job_5690061100801_32621870 2022-02-16 22:53:00 INFO Stage-4: Map: 858 Cumulative CPU: 1374.95 sec HDFS Read: 0.001 GB HDFS Write: 0.000 GB SUCCESS Elapsed : 43s592ms job_5690061100801_32621913 2022-02-16 22:53:00 INFO Total MapReduce CPU Time Spent: 4d9h23m32s830ms 2022-02-16 22:53:00 INFO Total Map: 19806 Total Reduce: 1009 2022-02-16 22:53:00 INFO Total HDFS Read: 193.298 GB Written: 0.000 GB
Result analysis:
Grouped by random numbers, the data can be randomly and evenly split from the map side to the reduce side
Summary:
Data grouping (triggering an mr task), which controls how to split the data at the map end to the reduce end. The default is hash
distribute by item_sku_id will cause data skew
distribute by rand() if one of the reduce tasks fails during execution, it may cause data loss or duplication
distribute by rand(int seed) uses random seeds to avoid the problem of inaccurate data
The specific risk principle of distribute by rand() is as follows:
-
Some reduce shuffle s fail, resulting in the data regeneration of individual upstream map task s. If the shuffle fragmentation logic contains random factors, the newly generated data shuffle fragmentation will not be completely consistent with the previous one, resulting in repeated reading or data loss of some data
-
reduce task pulls the data of the corresponding partition from the result file of each map task. The data is already partitioned in the map phase, and there will be an additional index file to record the starting offset of each partition. Therefore, when reducetask fetches data, it directly pulls the data according to the offset.
-
The newly generated data in the shuffle slice will lose data and also contain the data distributed to other reducer s.
Map only and mapReduce job output small file merging and tuning
Small file merge at the end of map only operation
set hive.auto.convert.join=true; set hive.auto.convert.join.noconditionaltask=true; set hive.auto.convert.join.noconditionaltask.size=10000000; set hive.mapjoin.smalltable.filesize=200000000; set hive.merge.mapfiles = true; -- map-only Small file merge at end set hive.merge.mapredfiles = false; -- MR Small files are not merged at the end of the operation set hive.merge.size.per.task = 256000000; set hive.merge.smallfiles.avgsize = 256000000; set hive.merge.orcfile.stripe.level=false; set hive.merge.rcfile.block.level=false; --appoint tmp.tmp_01 The default file storage format is textfile set hive.default.fileformat = orc; drop table if exists tmp.tmp_01; create table tmp.tmp_01 --STORED AS orc as create table tmp.tmp_01 as SELECT item_sku_id FROM cdm.cdm_goods_360buy_act_basic_da WHERE dt = '2022-02-15' AND barndname_cn = 'Fen Wine'
result:
2022-02-17 14:45:37 INFO Table tmp.tmp_ljjtest021 stats: [numFiles=1, numRows=18415, totalSize=110953, rawDataSize=1762456]
Result analysis:
[set hive.merge.mapfiles = true;] Controls the merging of small files at the end of the map only operation
Merge small files at the end of mapReduce operation
set hive.auto.convert.join=true; set hive.auto.convert.join.noconditionaltask=true; set hive.auto.convert.join.noconditionaltask.size=10000000; set hive.mapjoin.smalltable.filesize=200000000; set hive.merge.mapfiles = false; -- map-only Small files are not merged at the end set hive.merge.mapredfiles = true; -- MR At the end of the operation, merge the small files set hive.merge.size.per.task = 256000000; set hive.merge.smallfiles.avgsize = 256000000; set hive.merge.orcfile.stripe.level=false; set hive.merge.rcfile.block.level=false; --appoint tmp.tmp_01 The default file storage format is textfile set hive.default.fileformat = orc; drop table if exists tmp.tmp_01; create table tmp.tmp_01 --STORED AS orc as SELECT a.item_sku_id FROM ( SELECT item_sku_id,sku_name,brandname_cn,data_type FROM big_table WHERE dt = '2022-02-15' AND brandname_cn = 'Fen Wine' ) a --Data size 2 T about LEFT OUTER JOIN ( SELECT item_sku_id FROM small_table WHERE version = '1' AND normal_brandname_en = 'fenjiu' AND dt = '2022-02-15' ) b --The data size is less than 10 M ON a.item_sku_id = b.item_sku_id WHERE b.item_sku_id IS NULL;
result:
2022-02-16 23:12:45 INFO Table tmp.tmp_01 stats: [numFiles=1, numRows=23, totalSize=426, rawDataSize=2235]
Result analysis:
[set hive.merge.mapredfiles = true;] Controls the merging of small files at the end of mapReduce and mapjoin operations
Summary:
Here, you should pay attention to the mapjoin operation. Although only the map operation is triggered, the reduce operation is not triggered.
However, this is the result of optimization and is essentially a mapReduce task, so it is [set hive.merge.mapredfiles = true;] Parameter control.
hive output small file problem tuning
Execute mapjoin without merging small files and grouping
set hive.auto.convert.join=true; set hive.auto.convert.join.noconditionaltask=true; set hive.auto.convert.join.noconditionaltask.size=10000000; set hive.mapjoin.smalltable.filesize=200000000; set hive.merge.mapfiles = true; set hive.merge.mapredfiles = false; --MR Small files are not merged at the end of the operation set hive.merge.size.per.task = 256000000; set hive.merge.smallfiles.avgsize = 256000000; set hive.merge.orcfile.stripe.level=false; set hive.merge.rcfile.block.level=false; --appoint tmp.tmp_01 The default file storage format is textfile set hive.default.fileformat = orc; drop table if exists tmp.tmp_01; create table tmp.tmp_01 --STORED AS orc as SELECT a.item_sku_id FROM ( SELECT item_sku_id,sku_name,brandname_cn,data_type FROM big_table WHERE dt = '2022-02-15' AND brandname_cn = 'Fen Wine' --distribute by rand(1) -- No grouping ) a --Data size 2 T about LEFT OUTER JOIN ( SELECT item_sku_id FROM small_table WHERE version = '1' AND normal_brandname_en = 'fenjiu' AND dt = '2022-02-15' ) b --The data size is less than 10 M ON a.item_sku_id = b.item_sku_id WHERE b.item_sku_id IS NULL;
result:
2022-02-17 16:53:18 INFO Table tmp.tmp_01 stats: [numFiles=18020, numRows=23, totalSize=887747, rawDataSize=2252] 2022-02-17 16:53:18 INFO MapReduce Jobs Launched: 2022-02-17 16:53:19 INFO Stage-4: Map: 18020 Cumulative CPU: 205038.93 sec HDFS Read: 193.386 GB HDFS Write: 0.000 GB SUCCESS Elapsed : 3m28s419ms job_8766029101801_36451400 2022-02-17 16:53:19 INFO 2022-02-17 16:53:19 INFO Total MapReduce CPU Time Spent: 2d8h57m18s930ms 2022-02-17 16:53:19 INFO Total Map: 18020 Total Reduce: 0 2022-02-17 16:53:19 INFO Total HDFS Read: 193.386 GB Written: 0.000 GB
Result analysis:
Execute mapjoin (Association aggregation on the map side), do not merge small files, and each map corresponds to an output file. A large number of small files appear in the output results.
Execute mapjoin without merging small files and grouping large tables
set hive.auto.convert.join=true; set hive.auto.convert.join.noconditionaltask=true; set hive.auto.convert.join.noconditionaltask.size=10000000; set hive.mapjoin.smalltable.filesize=200000000; set hive.merge.mapfiles = true; set hive.merge.mapredfiles = false; --MR Small files are not merged at the end of the operation set hive.merge.size.per.task = 256000000; set hive.merge.smallfiles.avgsize = 256000000; set hive.merge.orcfile.stripe.level=false; set hive.merge.rcfile.block.level=false; --appoint tmp.tmp_01 The default file storage format is textfile set hive.default.fileformat = orc; drop table if exists tmp.tmp_01; create table tmp.tmp_01 --STORED AS orc as SELECT a.item_sku_id FROM ( SELECT item_sku_id,sku_name,brandname_cn,data_type FROM big_table WHERE dt = '2022-02-15' AND brandname_cn = 'Fen Wine' --distribute by rand(1) -- Large table grouping ) a --Data size 2 T about LEFT OUTER JOIN ( SELECT item_sku_id FROM small_table WHERE version = '1' AND normal_brandname_en = 'fenjiu' AND dt = '2022-02-15' ) b --The data size is less than 10 M ON a.item_sku_id = b.item_sku_id WHERE b.item_sku_id IS NULL;
result:
2022-02-16 22:53:00 INFO Table tmp.tmp_01 stats: [numFiles=3, numRows=23, totalSize=958, rawDataSize=2235] 2022-02-16 22:53:00 INFO Stage-1: Map: 18024 Reduce: 1009 Cumulative CPU: 372613.35 sec HDFS Read: 193.289 GB HDFS Write: 0.000 GB SUCCESS Elapsed : 5m32s865ms job_5690061100801_32621522 2022-02-16 22:53:00 INFO Stage-10: Map: 924 Cumulative CPU: 5424.53 sec HDFS Read: 0.008 GB HDFS Write: 0.000 GB SUCCESS Elapsed : 41s846ms job_5690061100801_32621870 2022-02-16 22:53:00 INFO Stage-4: Map: 858 Cumulative CPU: 1374.95 sec HDFS Read: 0.001 GB HDFS Write: 0.000 GB SUCCESS Elapsed : 43s592ms job_5690061100801_32621913 2022-02-16 22:53:00 INFO Total MapReduce CPU Time Spent: 4d9h23m32s830ms 2022-02-16 22:53:00 INFO Total Map: 19806 Total Reduce: 1009 2022-02-16 22:53:00 INFO Total HDFS Read: 193.298 GB Written: 0.000 GB
Result analysis:
Execute mapjoin (Association and aggregation on the map side), do not merge small files, pre group large tables, trigger MR, and reduce the number of maps in the mapjoin stage. Small output files are greatly reduced.
Execute mapjoin to merge small files
set hive.auto.convert.join=true; set hive.auto.convert.join.noconditionaltask=true; set hive.auto.convert.join.noconditionaltask.size=10000000; set hive.mapjoin.smalltable.filesize=200000000; set hive.merge.mapfiles = true; set hive.merge.mapredfiles = true; --MR At the end of the operation, merge the small files set hive.merge.size.per.task = 256000000; set hive.merge.smallfiles.avgsize = 256000000; set hive.merge.orcfile.stripe.level=false; set hive.merge.rcfile.block.level=false; --appoint tmp.tmp_01 The default file storage format is textfile set hive.default.fileformat = orc; drop table if exists tmp.tmp_01; create table tmp.tmp_01 --STORED AS orc as SELECT a.item_sku_id FROM ( SELECT item_sku_id,sku_name,brandname_cn,data_type FROM big_table WHERE dt = '2022-02-15' AND brandname_cn = 'Fen Wine' --distribute by rand(1) -- No grouping,Because grouping is meaningless at this time, it will only increase the task overhead ) a --Data size 2 T about LEFT OUTER JOIN ( SELECT item_sku_id FROM small_table WHERE version = '1' AND normal_brandname_en = 'fenjiu' AND dt = '2022-02-15' ) b --The data size is less than 10 M ON a.item_sku_id = b.item_sku_id WHERE b.item_sku_id IS NULL;
result:
2022-02-16 23:12:45 INFO Table tmp.tmp_01 stats: [numFiles=1, numRows=23, totalSize=426, rawDataSize=2235] 2022-02-16 23:12:45 INFO Stage-1: Map: 18024 Reduce: 1009 Cumulative CPU: 365693.05 sec HDFS Read: 193.289 GB HDFS Write: 0.000 GB SUCCESS Elapsed : 3m21s803ms job_5690061100801_32623163 2022-02-16 23:12:45 INFO Stage-10: Map: 918 Cumulative CPU: 5232.29 sec HDFS Read: 0.008 GB HDFS Write: 0.000 GB SUCCESS Elapsed : 40s544ms job_8766029101801_36250732 2022-02-16 23:12:45 INFO Stage-4: Map: 1 Cumulative CPU: 13.95 sec HDFS Read: 0.000 GB HDFS Write: 0.000 GB SUCCESS Elapsed : 43s762ms job_8766029101801_36250769 2022-02-16 23:12:45 INFO Total MapReduce CPU Time Spent: 4d7h2m19s290ms 2022-02-16 23:12:45 INFO Total Map: 18943 Total Reduce: 1009 2022-02-16 23:12:45 INFO Total HDFS Read: 193.297 GB Written: 0.000 GB
Result analysis:
Execute mapjoin (Association aggregation on the map side) and merge small files. The output result is small and the number of files is the least.
Summary:
The optimization improvement of mapjoin in small table and large table join is still obvious, but there are some problems to pay attention to when using mapjoin.
In the above example, when mapjoin is executed, there are 18020 maps, and 18020 small files are output accordingly (executing mapReduce will greatly reduce the number of small files output, and the default maximum number of reduce in the test cluster is 1009).
Therefore, it is necessary to consider the merging of small files in the later stage. In the scenario of a large number of small files, the best way is to start the merging of small files
Enable small file merging of output results:
rcfile and orc need special settings to merge at the document level. parquet, textfile and sequencefile merge at the file level by default
set hive.merge.mapfiles = true;--Open only map(map-only)Small file output merge at the end of the operation set hive.merge.mapredfiles = true;--Open yes map and reduce Small file output merge at the end of the operation set hive.merge.size.per.task = 256000000;--Size of each file after merging set hive.merge.smallfiles.avgsize=256000000;--When the average size of the output file is less than 256000000, start a separate file MR The task performs file merging and triggers the threshold value of small file merging set hive.merge.orcfile.stripe.level=false; --When this parameter is set to true,orc File for stripe Level Level merge,When set to false,orc Files are merged at the file level. set hive.merge.rcfile.block.level=false;--When this parameter is set to true,rcfile File for block Level Level merge,When set to false,rcfile Files are merged at the file level.
Tuning of a large number of small files in hive intermediate process (between two job s)
No grouping, no merging of small files
set hive.auto.convert.join=true; set hive.auto.convert.join.noconditionaltask=true; set hive.auto.convert.join.noconditionaltask.size=10000000; set hive.mapjoin.smalltable.filesize=200000000; set hive.merge.mapfiles = false; set hive.merge.mapredfiles = false; --MR Small files are not merged at the end of the operation set hive.merge.size.per.task = 256000000; set hive.merge.smallfiles.avgsize = 256000000; set hive.merge.orcfile.stripe.level=false; set hive.merge.rcfile.block.level=false; WITH tmp as ( SELECT a.* FROM ( SELECT item_sku_id,sku_name,brandname_cn,data_type FROM big_table WHERE dt = '2022-02-15' AND brandname_cn = 'Fen Wine' --distribute by rand(1) -- No grouping ) a --Data size 2 T about LEFT OUTER JOIN ( SELECT item_sku_id FROM small_table WHERE version = '1' AND normal_brandname_en = 'fenjiu' AND dt = '2022-02-15' ) b --The data size is less than 10 M ON a.item_sku_id = b.item_sku_id WHERE b.item_sku_id IS NULL ) create table tmp.tmp_01 as SELECT c.item_sku_id FROM tmp c join small_table d on c.item_sku_id=d.item_sku_id
result:
2022-02-17 20:49:21 INFO Stage-8: Map: 18026 Cumulative CPU: 210204.23 sec HDFS Read: 193.388 GB HDFS Write: 0.001 GB SUCCESS Elapsed : 2m55s772ms job_5690061100801_32863650 2022-02-17 20:49:21 INFO Stage-6: Map: 472 Cumulative CPU: 3330.08 sec HDFS Read: 0.334 GB HDFS Write: 0.000 GB SUCCESS Elapsed : 50s42ms job_5690061100801_32865433 2022-02-17 20:49:21 INFO Total MapReduce CPU Time Spent: 2d11h18m54s310ms 2022-02-17 20:49:21 INFO Total Map: 18498 Total Reduce: 0 2022-02-17 20:49:21 INFO Total HDFS Read: 193.722 GB Written: 0.001 GB 2022-02-17 20:49:22 INFO Time taken: 1045.892 seconds, Fetched: 46 row(s) 2022-02-17 20:49:23 INFO End of task! sql Execution time: 0 hour 17 mins 38 s
Result analysis:
It takes 1045 seconds without grouping and merging small files
Merge small files without grouping
set hive.auto.convert.join=true; set hive.auto.convert.join.noconditionaltask=true; set hive.auto.convert.join.noconditionaltask.size=10000000; set hive.mapjoin.smalltable.filesize=200000000; set hive.merge.mapfiles = true; set hive.merge.mapredfiles = true; --MR At the end of the operation, merge the small files set hive.merge.size.per.task = 256000000; set hive.merge.smallfiles.avgsize = 256000000; set hive.merge.orcfile.stripe.level=false; set hive.merge.rcfile.block.level=false; WITH tmp as ( SELECT a.* FROM ( SELECT item_sku_id,sku_name,brandname_cn,data_type FROM big_table WHERE dt = '2022-02-15' AND brandname_cn = 'Fen Wine' --distribute by rand(1) -- No grouping ) a --Data size 2 T about LEFT OUTER JOIN ( SELECT item_sku_id FROM small_table WHERE version = '1' AND normal_brandname_en = 'fenjiu' AND dt = '2022-02-15' ) b --The data size is less than 10 M ON a.item_sku_id = b.item_sku_id WHERE b.item_sku_id IS NULL ) create table tmp.tmp_01 as SELECT c.item_sku_id FROM tmp c join small_table d on c.item_sku_id=d.item_sku_id
result:
2022-02-17 20:45:25 INFO Stage-8: Map: 18026 Cumulative CPU: 214628.08 sec HDFS Read: 193.388 GB HDFS Write: 0.001 GB SUCCESS Elapsed : 3m41s147ms job_8766029101801_36477823 2022-02-17 20:45:25 INFO Stage-6: Map: 472 Cumulative CPU: 3995.4 sec HDFS Read: 0.334 GB HDFS Write: 0.000 GB SUCCESS Elapsed : 59s677ms job_5690061100801_32865009 2022-02-17 20:45:25 INFO Total MapReduce CPU Time Spent: 2d12h43m43s480ms 2022-02-17 20:45:25 INFO Total Map: 18498 Total Reduce: 0 2022-02-17 20:45:25 INFO Total HDFS Read: 193.722 GB Written: 0.001 GB 2022-02-17 20:45:27 INFO Time taken: 1074.534 seconds, Fetched: 46 row(s) 2022-02-17 20:45:27 INFO End of task! sql Execution time: 0 hour 18 mins 10 s
Result analysis:
It takes 1074 seconds to merge small files without grouping
Grouping, merging small files
set hive.auto.convert.join=true; set hive.auto.convert.join.noconditionaltask=true; set hive.auto.convert.join.noconditionaltask.size=10000000; set hive.mapjoin.smalltable.filesize=200000000; set hive.merge.mapfiles = true; set hive.merge.mapredfiles = true; --MR At the end of the operation, merge the small files set hive.merge.size.per.task = 256000000; set hive.merge.smallfiles.avgsize = 256000000; set hive.merge.orcfile.stripe.level=false; set hive.merge.rcfile.block.level=false; WITH tmp as ( SELECT a.* FROM ( SELECT item_sku_id,sku_name,brandname_cn,data_type FROM big_table WHERE dt = '2022-02-15' AND brandname_cn = 'Fen Wine' distribute by rand(1) -- grouping ) a --Data size 2 T about LEFT OUTER JOIN ( SELECT item_sku_id FROM small_table WHERE version = '1' AND normal_brandname_en = 'fenjiu' AND dt = '2022-02-15' ) b --The data size is less than 10 M ON a.item_sku_id = b.item_sku_id WHERE b.item_sku_id IS NULL ) create table tmp.tmp_01 as SELECT c.item_sku_id FROM tmp c join small_table d on c.item_sku_id=d.item_sku_id
result:
2022-02-17 19:57:57 INFO Stage-1: Map: 18026 Reduce: 1009 Cumulative CPU: 362544.56 sec HDFS Read: 193.290 GB HDFS Write: 0.000 GB SUCCESS Elapsed : 4m55s871ms job_5690061100801_32857972 2022-02-17 19:57:57 INFO Stage-9: Map: 928 Cumulative CPU: 5520.28 sec HDFS Read: 0.008 GB HDFS Write: 0.000 GB SUCCESS Elapsed : 2m33s752ms job_5690061100801_32858541 2022-02-17 19:57:57 INFO Stage-7: Map: 472 Cumulative CPU: 3638.63 sec HDFS Read: 0.334 GB HDFS Write: 0.000 GB SUCCESS Elapsed : 1m42s589ms job_5690061100801_32858927 2022-02-17 19:57:57 INFO Total MapReduce CPU Time Spent: 4d7h15m3s470ms 2022-02-17 19:57:57 INFO Total Map: 19426 Total Reduce: 1009 2022-02-17 19:57:57 INFO Total HDFS Read: 193.634 GB Written: 0.000 GB 2022-02-17 19:57:58 INFO Time taken: 690.782 seconds, Fetched: 46 row(s) 2022-02-17 19:57:58 INFO End of task! sql Execution time: 0 hour 11 mins 44 s
Result analysis:
In the case of grouping and merging small files, it takes 690 seconds
Summary:
When a large number of small files are generated in the intermediate result of hive task, the setting of merging small files has no effect on the result;
However, setting a large table to be grouped in advance, triggering MR and reducing the number of maps in the mapjoin stage will greatly reduce the time of computing tasks