hive tuning example analysis

hive distribute by group application tuning

Group by fields in the table

set hive.auto.convert.join=true;
set hive.auto.convert.join.noconditionaltask=true;
set hive.auto.convert.join.noconditionaltask.size=10000000;
set hive.mapjoin.smalltable.filesize=200000000;

set hive.merge.mapfiles = true;
set hive.merge.mapredfiles = false; --MR Small files are not merged at the end of the operation
set hive.merge.size.per.task = 256000000;
set hive.merge.smallfiles.avgsize = 256000000;
set hive.merge.orcfile.stripe.level=false;
set hive.merge.rcfile.block.level=false;

--appoint tmp.tmp_01 The default file storage format is textfile
set hive.default.fileformat = orc;

drop table if exists tmp.tmp_01;
create table tmp.tmp_01 
--STORED AS orc
as
SELECT
      a.item_sku_id 
FROM  (
            SELECT
                    item_sku_id,sku_name,brandname_cn,data_type
            FROM big_table 
            WHERE dt = '2022-02-15' AND brandname_cn = 'Fen Wine'
            distribute by item_sku_id -- Large tables are grouped by fields in the table
      ) a --Data size 2 T about
LEFT OUTER JOIN
      (     SELECT item_sku_id 
            FROM small_table 
            WHERE version = '1' AND normal_brandname_en = 'fenjiu' AND dt = '2022-02-15'
      ) b --The data size is less than 10 M
ON a.item_sku_id = b.item_sku_id
WHERE b.item_sku_id IS NULL;

result:

2022-02-16 23:03:47 INFO Table tmp.tmp_ljjtest02 stats: [numFiles=23, numRows=23, totalSize=5894, rawDataSize=2252]
2022-02-16 23:03:47 INFO Stage-1: Map: 18024  Reduce: 1009   Cumulative CPU: 377078.2 sec   HDFS Read: 193.287 GB HDFS Write: 0.000 GB SUCCESS  Elapsed : 4m18s539ms job_5690061100801_32622232
2022-02-16 23:03:47 INFO Stage-10: Map: 920   Cumulative CPU: 5431.21 sec   HDFS Read: 0.008 GB HDFS Write: 0.000 GB SUCCESS  Elapsed : 52s336ms job_5690061100801_32622707
2022-02-16 23:03:47 INFO Stage-4: Map: 832   Cumulative CPU: 1078.67 sec   HDFS Read: 0.001 GB HDFS Write: 0.000 GB SUCCESS  Elapsed : 31s273ms job_8766029101801_36250210
2022-02-16 23:03:47 INFO Total MapReduce CPU Time Spent: 4d10h33m8s80ms
2022-02-16 23:03:47 INFO Total Map: 19776  Total Reduce: 1009
2022-02-16 23:03:47 INFO Total HDFS Read: 193.296 GB  Written: 0.000 GB

Result analysis:

Grouped by fields in the table, if the data is uneven, it may cause data skew.

Group by random number

set hive.auto.convert.join=true;
set hive.auto.convert.join.noconditionaltask=true;
set hive.auto.convert.join.noconditionaltask.size=10000000;
set hive.mapjoin.smalltable.filesize=200000000;

set hive.merge.mapfiles = true;
set hive.merge.mapredfiles = false; --MR Small files are not merged at the end of the operation
set hive.merge.size.per.task = 256000000;
set hive.merge.smallfiles.avgsize = 256000000;
set hive.merge.orcfile.stripe.level=false;
set hive.merge.rcfile.block.level=false;

--appoint tmp.tmp_01 The default file storage format is textfile
set hive.default.fileformat = orc;

drop table if exists tmp.tmp_01;
create table tmp.tmp_01 
--STORED AS orc
as
SELECT
      a.item_sku_id 
FROM  (
            SELECT
                    item_sku_id,sku_name,brandname_cn,data_type
            FROM big_table 
            WHERE dt = '2022-02-15' AND brandname_cn = 'Fen Wine'
            distribute by item_sku_id -- Large tables are grouped in random numbers
      ) a --Data size 2 T about
LEFT OUTER JOIN
      (     SELECT item_sku_id 
            FROM small_table 
            WHERE version = '1' AND normal_brandname_en = 'fenjiu' AND dt = '2022-02-15'
      ) b --The data size is less than 10 M
ON a.item_sku_id = b.item_sku_id
WHERE b.item_sku_id IS NULL;

result:

2022-02-16 22:53:00 INFO Table tmp.tmp_01 stats: [numFiles=3, numRows=23, totalSize=958, rawDataSize=2235]
2022-02-16 22:53:00 INFO Stage-1: Map: 18024  Reduce: 1009   Cumulative CPU: 372613.35 sec   HDFS Read: 193.289 GB HDFS Write: 0.000 GB SUCCESS  Elapsed : 5m32s865ms job_5690061100801_32621522
2022-02-16 22:53:00 INFO Stage-10: Map: 924   Cumulative CPU: 5424.53 sec   HDFS Read: 0.008 GB HDFS Write: 0.000 GB SUCCESS  Elapsed : 41s846ms job_5690061100801_32621870
2022-02-16 22:53:00 INFO Stage-4: Map: 858   Cumulative CPU: 1374.95 sec   HDFS Read: 0.001 GB HDFS Write: 0.000 GB SUCCESS  Elapsed : 43s592ms job_5690061100801_32621913
2022-02-16 22:53:00 INFO Total MapReduce CPU Time Spent: 4d9h23m32s830ms
2022-02-16 22:53:00 INFO Total Map: 19806  Total Reduce: 1009
2022-02-16 22:53:00 INFO Total HDFS Read: 193.298 GB  Written: 0.000 GB

Result analysis:

Grouped by random numbers, the data can be randomly and evenly split from the map side to the reduce side

Summary:

Data grouping (triggering an mr task), which controls how to split the data at the map end to the reduce end. The default is hash
distribute by item_sku_id will cause data skew
distribute by rand() if one of the reduce tasks fails during execution, it may cause data loss or duplication
distribute by rand(int seed) uses random seeds to avoid the problem of inaccurate data

The specific risk principle of distribute by rand() is as follows:

  • Some reduce shuffle s fail, resulting in the data regeneration of individual upstream map task s. If the shuffle fragmentation logic contains random factors, the newly generated data shuffle fragmentation will not be completely consistent with the previous one, resulting in repeated reading or data loss of some data

  • reduce task pulls the data of the corresponding partition from the result file of each map task. The data is already partitioned in the map phase, and there will be an additional index file to record the starting offset of each partition. Therefore, when reducetask fetches data, it directly pulls the data according to the offset.

  • The newly generated data in the shuffle slice will lose data and also contain the data distributed to other reducer s.

Map only and mapReduce job output small file merging and tuning

Small file merge at the end of map only operation

set hive.auto.convert.join=true;
set hive.auto.convert.join.noconditionaltask=true;
set hive.auto.convert.join.noconditionaltask.size=10000000;
set hive.mapjoin.smalltable.filesize=200000000;

set hive.merge.mapfiles = true; -- map-only Small file merge at end
set hive.merge.mapredfiles = false; -- MR Small files are not merged at the end of the operation
set hive.merge.size.per.task = 256000000;
set hive.merge.smallfiles.avgsize = 256000000;
set hive.merge.orcfile.stripe.level=false;
set hive.merge.rcfile.block.level=false;

--appoint tmp.tmp_01 The default file storage format is textfile
set hive.default.fileformat = orc;

drop table if exists tmp.tmp_01;
create table tmp.tmp_01 
--STORED AS orc
as
create table tmp.tmp_01 as
SELECT
      item_sku_id
FROM
      cdm.cdm_goods_360buy_act_basic_da
WHERE
      dt = '2022-02-15'
      AND barndname_cn = 'Fen Wine'

result:

2022-02-17 14:45:37 INFO Table tmp.tmp_ljjtest021 stats: [numFiles=1, numRows=18415, totalSize=110953, rawDataSize=1762456]

Result analysis:

[set hive.merge.mapfiles = true;] Controls the merging of small files at the end of the map only operation

Merge small files at the end of mapReduce operation

set hive.auto.convert.join=true;
set hive.auto.convert.join.noconditionaltask=true;
set hive.auto.convert.join.noconditionaltask.size=10000000;
set hive.mapjoin.smalltable.filesize=200000000;

set hive.merge.mapfiles = false; -- map-only Small files are not merged at the end
set hive.merge.mapredfiles = true; -- MR At the end of the operation, merge the small files
set hive.merge.size.per.task = 256000000;
set hive.merge.smallfiles.avgsize = 256000000;
set hive.merge.orcfile.stripe.level=false;
set hive.merge.rcfile.block.level=false;

--appoint tmp.tmp_01 The default file storage format is textfile
set hive.default.fileformat = orc;

drop table if exists tmp.tmp_01;
create table tmp.tmp_01 
--STORED AS orc
as
SELECT
      a.item_sku_id 
FROM  (
            SELECT
                    item_sku_id,sku_name,brandname_cn,data_type
            FROM big_table 
            WHERE dt = '2022-02-15' AND brandname_cn = 'Fen Wine'
      ) a --Data size 2 T about
LEFT OUTER JOIN
      (     SELECT item_sku_id 
            FROM small_table 
            WHERE version = '1' AND normal_brandname_en = 'fenjiu' AND dt = '2022-02-15'
      ) b --The data size is less than 10 M
ON a.item_sku_id = b.item_sku_id
WHERE b.item_sku_id IS NULL;

result:

2022-02-16 23:12:45 INFO Table tmp.tmp_01 stats: [numFiles=1, numRows=23, totalSize=426, rawDataSize=2235]

Result analysis:

[set hive.merge.mapredfiles = true;] Controls the merging of small files at the end of mapReduce and mapjoin operations

Summary:

Here, you should pay attention to the mapjoin operation. Although only the map operation is triggered, the reduce operation is not triggered.
However, this is the result of optimization and is essentially a mapReduce task, so it is [set hive.merge.mapredfiles = true;] Parameter control.

hive output small file problem tuning

Execute mapjoin without merging small files and grouping

set hive.auto.convert.join=true;
set hive.auto.convert.join.noconditionaltask=true;
set hive.auto.convert.join.noconditionaltask.size=10000000;
set hive.mapjoin.smalltable.filesize=200000000;

set hive.merge.mapfiles = true;
set hive.merge.mapredfiles = false; --MR Small files are not merged at the end of the operation
set hive.merge.size.per.task = 256000000;
set hive.merge.smallfiles.avgsize = 256000000;
set hive.merge.orcfile.stripe.level=false;
set hive.merge.rcfile.block.level=false;

--appoint tmp.tmp_01 The default file storage format is textfile
set hive.default.fileformat = orc;

drop table if exists tmp.tmp_01;
create table tmp.tmp_01 
--STORED AS orc
as
SELECT
      a.item_sku_id 
FROM  (
            SELECT
                    item_sku_id,sku_name,brandname_cn,data_type
            FROM big_table 
            WHERE dt = '2022-02-15' AND brandname_cn = 'Fen Wine'
            --distribute by rand(1) -- No grouping
      ) a --Data size 2 T about
LEFT OUTER JOIN
      (     SELECT item_sku_id 
            FROM small_table 
            WHERE version = '1' AND normal_brandname_en = 'fenjiu' AND dt = '2022-02-15'
      ) b --The data size is less than 10 M
ON a.item_sku_id = b.item_sku_id
WHERE b.item_sku_id IS NULL;

result:

2022-02-17 16:53:18 INFO Table tmp.tmp_01 stats: [numFiles=18020, numRows=23, totalSize=887747, rawDataSize=2252]
2022-02-17 16:53:18 INFO MapReduce Jobs Launched:
2022-02-17 16:53:19 INFO Stage-4: Map: 18020   Cumulative CPU: 205038.93 sec   HDFS Read: 193.386 GB HDFS Write: 0.000 GB SUCCESS  Elapsed : 3m28s419ms job_8766029101801_36451400
2022-02-17 16:53:19 INFO 
2022-02-17 16:53:19 INFO Total MapReduce CPU Time Spent: 2d8h57m18s930ms
2022-02-17 16:53:19 INFO Total Map: 18020  Total Reduce: 0
2022-02-17 16:53:19 INFO Total HDFS Read: 193.386 GB  Written: 0.000 GB

Result analysis:

Execute mapjoin (Association aggregation on the map side), do not merge small files, and each map corresponds to an output file. A large number of small files appear in the output results.

Execute mapjoin without merging small files and grouping large tables

set hive.auto.convert.join=true;
set hive.auto.convert.join.noconditionaltask=true;
set hive.auto.convert.join.noconditionaltask.size=10000000;
set hive.mapjoin.smalltable.filesize=200000000;

set hive.merge.mapfiles = true;
set hive.merge.mapredfiles = false; --MR Small files are not merged at the end of the operation
set hive.merge.size.per.task = 256000000;
set hive.merge.smallfiles.avgsize = 256000000;
set hive.merge.orcfile.stripe.level=false;
set hive.merge.rcfile.block.level=false;

--appoint tmp.tmp_01 The default file storage format is textfile
set hive.default.fileformat = orc;

drop table if exists tmp.tmp_01;
create table tmp.tmp_01 
--STORED AS orc
as
SELECT
      a.item_sku_id 
FROM  (
            SELECT
                    item_sku_id,sku_name,brandname_cn,data_type
            FROM big_table 
            WHERE dt = '2022-02-15' AND brandname_cn = 'Fen Wine'
            --distribute by rand(1) -- Large table grouping
      ) a --Data size 2 T about
LEFT OUTER JOIN
      (     SELECT item_sku_id 
            FROM small_table 
            WHERE version = '1' AND normal_brandname_en = 'fenjiu' AND dt = '2022-02-15'
      ) b --The data size is less than 10 M
ON a.item_sku_id = b.item_sku_id
WHERE b.item_sku_id IS NULL;

result:

2022-02-16 22:53:00 INFO Table tmp.tmp_01 stats: [numFiles=3, numRows=23, totalSize=958, rawDataSize=2235]
2022-02-16 22:53:00 INFO Stage-1: Map: 18024  Reduce: 1009   Cumulative CPU: 372613.35 sec   HDFS Read: 193.289 GB HDFS Write: 0.000 GB SUCCESS  Elapsed : 5m32s865ms job_5690061100801_32621522
2022-02-16 22:53:00 INFO Stage-10: Map: 924   Cumulative CPU: 5424.53 sec   HDFS Read: 0.008 GB HDFS Write: 0.000 GB SUCCESS  Elapsed : 41s846ms job_5690061100801_32621870
2022-02-16 22:53:00 INFO Stage-4: Map: 858   Cumulative CPU: 1374.95 sec   HDFS Read: 0.001 GB HDFS Write: 0.000 GB SUCCESS  Elapsed : 43s592ms job_5690061100801_32621913
2022-02-16 22:53:00 INFO Total MapReduce CPU Time Spent: 4d9h23m32s830ms
2022-02-16 22:53:00 INFO Total Map: 19806  Total Reduce: 1009
2022-02-16 22:53:00 INFO Total HDFS Read: 193.298 GB  Written: 0.000 GB

Result analysis:

Execute mapjoin (Association and aggregation on the map side), do not merge small files, pre group large tables, trigger MR, and reduce the number of maps in the mapjoin stage. Small output files are greatly reduced.

Execute mapjoin to merge small files

set hive.auto.convert.join=true;
set hive.auto.convert.join.noconditionaltask=true;
set hive.auto.convert.join.noconditionaltask.size=10000000;
set hive.mapjoin.smalltable.filesize=200000000;

set hive.merge.mapfiles = true;
set hive.merge.mapredfiles = true; --MR At the end of the operation, merge the small files
set hive.merge.size.per.task = 256000000;
set hive.merge.smallfiles.avgsize = 256000000;
set hive.merge.orcfile.stripe.level=false;
set hive.merge.rcfile.block.level=false;

--appoint tmp.tmp_01 The default file storage format is textfile
set hive.default.fileformat = orc;

drop table if exists tmp.tmp_01;
create table tmp.tmp_01 
--STORED AS orc
as
SELECT
      a.item_sku_id 
FROM  (
            SELECT
                    item_sku_id,sku_name,brandname_cn,data_type
            FROM big_table 
            WHERE dt = '2022-02-15' AND brandname_cn = 'Fen Wine'
            --distribute by rand(1) -- No grouping,Because grouping is meaningless at this time, it will only increase the task overhead
      ) a --Data size 2 T about
LEFT OUTER JOIN
      (     SELECT item_sku_id 
            FROM small_table 
            WHERE version = '1' AND normal_brandname_en = 'fenjiu' AND dt = '2022-02-15'
      ) b --The data size is less than 10 M
ON a.item_sku_id = b.item_sku_id
WHERE b.item_sku_id IS NULL;

result:

2022-02-16 23:12:45 INFO Table tmp.tmp_01 stats: [numFiles=1, numRows=23, totalSize=426, rawDataSize=2235]
2022-02-16 23:12:45 INFO Stage-1: Map: 18024  Reduce: 1009   Cumulative CPU: 365693.05 sec   HDFS Read: 193.289 GB HDFS Write: 0.000 GB SUCCESS  Elapsed : 3m21s803ms job_5690061100801_32623163
2022-02-16 23:12:45 INFO Stage-10: Map: 918   Cumulative CPU: 5232.29 sec   HDFS Read: 0.008 GB HDFS Write: 0.000 GB SUCCESS  Elapsed : 40s544ms job_8766029101801_36250732
2022-02-16 23:12:45 INFO Stage-4: Map: 1   Cumulative CPU: 13.95 sec   HDFS Read: 0.000 GB HDFS Write: 0.000 GB SUCCESS  Elapsed : 43s762ms job_8766029101801_36250769
2022-02-16 23:12:45 INFO Total MapReduce CPU Time Spent: 4d7h2m19s290ms
2022-02-16 23:12:45 INFO Total Map: 18943  Total Reduce: 1009
2022-02-16 23:12:45 INFO Total HDFS Read: 193.297 GB  Written: 0.000 GB

Result analysis:

Execute mapjoin (Association aggregation on the map side) and merge small files. The output result is small and the number of files is the least.

Summary:

The optimization improvement of mapjoin in small table and large table join is still obvious, but there are some problems to pay attention to when using mapjoin.
In the above example, when mapjoin is executed, there are 18020 maps, and 18020 small files are output accordingly (executing mapReduce will greatly reduce the number of small files output, and the default maximum number of reduce in the test cluster is 1009).
Therefore, it is necessary to consider the merging of small files in the later stage. In the scenario of a large number of small files, the best way is to start the merging of small files

Enable small file merging of output results:
rcfile and orc need special settings to merge at the document level. parquet, textfile and sequencefile merge at the file level by default

set hive.merge.mapfiles = true;--Open only map(map-only)Small file output merge at the end of the operation
set hive.merge.mapredfiles = true;--Open yes map and reduce Small file output merge at the end of the operation
set hive.merge.size.per.task = 256000000;--Size of each file after merging
set hive.merge.smallfiles.avgsize=256000000;--When the average size of the output file is less than 256000000, start a separate file MR The task performs file merging and triggers the threshold value of small file merging
set hive.merge.orcfile.stripe.level=false; --When this parameter is set to true,orc File for stripe Level Level merge,When set to false,orc Files are merged at the file level.
set hive.merge.rcfile.block.level=false;--When this parameter is set to true,rcfile File for block Level Level merge,When set to false,rcfile Files are merged at the file level.

Tuning of a large number of small files in hive intermediate process (between two job s)

No grouping, no merging of small files

set hive.auto.convert.join=true;
set hive.auto.convert.join.noconditionaltask=true;
set hive.auto.convert.join.noconditionaltask.size=10000000;
set hive.mapjoin.smalltable.filesize=200000000;

set hive.merge.mapfiles = false;
set hive.merge.mapredfiles = false; --MR Small files are not merged at the end of the operation
set hive.merge.size.per.task = 256000000;
set hive.merge.smallfiles.avgsize = 256000000;
set hive.merge.orcfile.stripe.level=false;
set hive.merge.rcfile.block.level=false;

WITH tmp as 
(
SELECT
      a.* 
FROM  (
            SELECT
                    item_sku_id,sku_name,brandname_cn,data_type
            FROM big_table 
            WHERE dt = '2022-02-15' AND brandname_cn = 'Fen Wine'
            --distribute by rand(1) -- No grouping
      ) a --Data size 2 T about
LEFT OUTER JOIN
      (     SELECT item_sku_id 
            FROM small_table 
            WHERE version = '1' AND normal_brandname_en = 'fenjiu' AND dt = '2022-02-15'
      ) b --The data size is less than 10 M
ON a.item_sku_id = b.item_sku_id
WHERE b.item_sku_id IS NULL
)
create table tmp.tmp_01 
as
SELECT
c.item_sku_id
FROM tmp c
join small_table d
on c.item_sku_id=d.item_sku_id

result:

2022-02-17 20:49:21 INFO Stage-8: Map: 18026   Cumulative CPU: 210204.23 sec   HDFS Read: 193.388 GB HDFS Write: 0.001 GB SUCCESS  Elapsed : 2m55s772ms job_5690061100801_32863650
2022-02-17 20:49:21 INFO Stage-6: Map: 472   Cumulative CPU: 3330.08 sec   HDFS Read: 0.334 GB HDFS Write: 0.000 GB SUCCESS  Elapsed : 50s42ms job_5690061100801_32865433
2022-02-17 20:49:21 INFO Total MapReduce CPU Time Spent: 2d11h18m54s310ms
2022-02-17 20:49:21 INFO Total Map: 18498  Total Reduce: 0
2022-02-17 20:49:21 INFO Total HDFS Read: 193.722 GB  Written: 0.001 GB

2022-02-17 20:49:22 INFO Time taken: 1045.892 seconds, Fetched: 46 row(s)
2022-02-17 20:49:23 INFO End of task! sql Execution time: 0 hour 17 mins 38 s

Result analysis:

It takes 1045 seconds without grouping and merging small files

Merge small files without grouping

set hive.auto.convert.join=true;
set hive.auto.convert.join.noconditionaltask=true;
set hive.auto.convert.join.noconditionaltask.size=10000000;
set hive.mapjoin.smalltable.filesize=200000000;

set hive.merge.mapfiles = true;
set hive.merge.mapredfiles = true; --MR At the end of the operation, merge the small files
set hive.merge.size.per.task = 256000000;
set hive.merge.smallfiles.avgsize = 256000000;
set hive.merge.orcfile.stripe.level=false;
set hive.merge.rcfile.block.level=false;

WITH tmp as 
(
SELECT
      a.* 
FROM  (
            SELECT
                    item_sku_id,sku_name,brandname_cn,data_type
            FROM big_table 
            WHERE dt = '2022-02-15' AND brandname_cn = 'Fen Wine'
            --distribute by rand(1) -- No grouping
      ) a --Data size 2 T about
LEFT OUTER JOIN
      (     SELECT item_sku_id 
            FROM small_table 
            WHERE version = '1' AND normal_brandname_en = 'fenjiu' AND dt = '2022-02-15'
      ) b --The data size is less than 10 M
ON a.item_sku_id = b.item_sku_id
WHERE b.item_sku_id IS NULL
)
create table tmp.tmp_01 
as
SELECT
c.item_sku_id
FROM tmp c
join small_table d
on c.item_sku_id=d.item_sku_id

result:

2022-02-17 20:45:25 INFO Stage-8: Map: 18026   Cumulative CPU: 214628.08 sec   HDFS Read: 193.388 GB HDFS Write: 0.001 GB SUCCESS  Elapsed : 3m41s147ms job_8766029101801_36477823
2022-02-17 20:45:25 INFO Stage-6: Map: 472   Cumulative CPU: 3995.4 sec   HDFS Read: 0.334 GB HDFS Write: 0.000 GB SUCCESS  Elapsed : 59s677ms job_5690061100801_32865009
2022-02-17 20:45:25 INFO Total MapReduce CPU Time Spent: 2d12h43m43s480ms
2022-02-17 20:45:25 INFO Total Map: 18498  Total Reduce: 0
2022-02-17 20:45:25 INFO Total HDFS Read: 193.722 GB  Written: 0.001 GB

2022-02-17 20:45:27 INFO Time taken: 1074.534 seconds, Fetched: 46 row(s)
2022-02-17 20:45:27 INFO End of task! sql Execution time: 0 hour 18 mins 10 s

Result analysis:

It takes 1074 seconds to merge small files without grouping

Grouping, merging small files

set hive.auto.convert.join=true;
set hive.auto.convert.join.noconditionaltask=true;
set hive.auto.convert.join.noconditionaltask.size=10000000;
set hive.mapjoin.smalltable.filesize=200000000;

set hive.merge.mapfiles = true;
set hive.merge.mapredfiles = true; --MR At the end of the operation, merge the small files
set hive.merge.size.per.task = 256000000;
set hive.merge.smallfiles.avgsize = 256000000;
set hive.merge.orcfile.stripe.level=false;
set hive.merge.rcfile.block.level=false;

WITH tmp as 
(
SELECT
      a.* 
FROM  (
            SELECT
                    item_sku_id,sku_name,brandname_cn,data_type
            FROM big_table 
            WHERE dt = '2022-02-15' AND brandname_cn = 'Fen Wine'
            distribute by rand(1) -- grouping
      ) a --Data size 2 T about
LEFT OUTER JOIN
      (     SELECT item_sku_id 
            FROM small_table 
            WHERE version = '1' AND normal_brandname_en = 'fenjiu' AND dt = '2022-02-15'
      ) b --The data size is less than 10 M
ON a.item_sku_id = b.item_sku_id
WHERE b.item_sku_id IS NULL
)
create table tmp.tmp_01 
as
SELECT
c.item_sku_id
FROM tmp c
join small_table d
on c.item_sku_id=d.item_sku_id

result:

2022-02-17 19:57:57 INFO Stage-1: Map: 18026  Reduce: 1009   Cumulative CPU: 362544.56 sec   HDFS Read: 193.290 GB HDFS Write: 0.000 GB SUCCESS  Elapsed : 4m55s871ms job_5690061100801_32857972
2022-02-17 19:57:57 INFO Stage-9: Map: 928   Cumulative CPU: 5520.28 sec   HDFS Read: 0.008 GB HDFS Write: 0.000 GB SUCCESS  Elapsed : 2m33s752ms job_5690061100801_32858541
2022-02-17 19:57:57 INFO Stage-7: Map: 472   Cumulative CPU: 3638.63 sec   HDFS Read: 0.334 GB HDFS Write: 0.000 GB SUCCESS  Elapsed : 1m42s589ms job_5690061100801_32858927
2022-02-17 19:57:57 INFO Total MapReduce CPU Time Spent: 4d7h15m3s470ms
2022-02-17 19:57:57 INFO Total Map: 19426  Total Reduce: 1009
2022-02-17 19:57:57 INFO Total HDFS Read: 193.634 GB  Written: 0.000 GB

2022-02-17 19:57:58 INFO Time taken: 690.782 seconds, Fetched: 46 row(s)
2022-02-17 19:57:58 INFO End of task! sql Execution time: 0 hour 11 mins 44 s

Result analysis:

In the case of grouping and merging small files, it takes 690 seconds

Summary:

When a large number of small files are generated in the intermediate result of hive task, the setting of merging small files has no effect on the result;
However, setting a large table to be grouped in advance, triggering MR and reducing the number of maps in the mapjoin stage will greatly reduce the time of computing tasks

Keywords: Hadoop hive Data Warehouse

Added by jaydeee on Thu, 17 Feb 2022 20:24:42 +0200