Are you still using group by? Experts are using grouping sets

Let's take a look at a problem from actual production:

There is a user access log table that records each user's visits by channel, operating system and version,
as follows:

create table test.test10 as 
select '10001' as uid,'xiaomi' as qid,'android' as os,'1.2.2' as ver
union all
select '10002' as uid,'xiaomi' as qid,'android' as os,'1.2.1' as ver
union all
select '10003' as uid,'oppo' as qid,'android' as os,'1.2.3' as ver
union all
select '10002' as uid,'appstore' as qid,'ios' as os,'1.2.0' as ver
union all
select '10001' as uid,'oppo' as qid,'android' as os,'1.2.3' as ver
union all
select '10003' as uid,'appstore' as qid,'ios' as os,'1.1.9' as ver
union all
select '10001' as uid,'xiaomi' as qid,'android' as os,'1.2.2' as ver
union all
select '10002' as uid,'oppo' as qid,'android' as os,'1.2.1' as ver;

Question:

  1. Number of visits (pv) and number of unique visitors (uv) per channel, operating system and version
  2. Number of visits and unique visitors per operating system
  3. Total number of visits and unique visitors

Analysis:
All three questions ask for visit and visitor counts. Counting visitors means deduplicating by user, so the real question is: how do we do that most efficiently?
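To make the target numbers concrete, here is a small Python sketch (plain dicts, not Spark) that computes pv (row count) and uv (distinct uid count) for each of the three questions; the sample rows are transcribed from test.test10 above.

```python
from collections import Counter

# sample rows transcribed from test.test10: (uid, qid, os, ver)
rows = [
    ("10001", "xiaomi",   "android", "1.2.2"),
    ("10002", "xiaomi",   "android", "1.2.1"),
    ("10003", "oppo",     "android", "1.2.3"),
    ("10002", "appstore", "ios",     "1.2.0"),
    ("10001", "oppo",     "android", "1.2.3"),
    ("10003", "appstore", "ios",     "1.1.9"),
    ("10001", "xiaomi",   "android", "1.2.2"),
    ("10002", "oppo",     "android", "1.2.1"),
]

def agg(rows, key):
    """Return {key: (uv, pv)}: uv = distinct uids, pv = number of rows."""
    pv, uids = Counter(), {}
    for uid, qid, os_, ver in rows:
        k = key(qid, os_, ver)
        pv[k] += 1
        uids.setdefault(k, set()).add(uid)
    return {k: (len(uids[k]), pv[k]) for k in pv}

by_qid_os_ver = agg(rows, lambda q, o, v: (q, o, v))  # question 1
by_os = agg(rows, lambda q, o, v: o)                  # question 2
total = agg(rows, lambda q, o, v: "all")              # question 3

print(by_os)         # {'android': (3, 6), 'ios': (2, 2)}
print(total["all"])  # (3, 8)
```

These are exactly the numbers the SQL below should reproduce.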

Let's first look at the group by approach:

  1. First deduplicate by user within each channel, operating system and version
  2. From the result of 1, deduplicate by user within each operating system
  3. From the result of 2, deduplicate by user overall

Deduplication:
Deduplicate by user within each channel, operating system and version:

create table test.test10_qid_os_ver_uid as 
select
  uid,
  qid,
  os,
  ver,
  count(1) as pv
from test.test10
group by uid,qid,os,ver;

Deduplicate by user within each operating system:

create table test.test10_os_uid as 
select
  uid,
  os,
  sum(pv) as pv
from test.test10_qid_os_ver_uid
group by uid,os;

Deduplicate by user:

create table test.test10_uid as 
select
  uid,
  sum(pv) as pv
from test.test10_os_uid
group by uid;
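The three chained tables can be simulated in Python to see why summing pv up the chain preserves the visit counts (a sketch over the sample rows, not Spark):

```python
from collections import Counter

# sample rows transcribed from test.test10: (uid, qid, os, ver)
rows = [
    ("10001", "xiaomi",   "android", "1.2.2"),
    ("10002", "xiaomi",   "android", "1.2.1"),
    ("10003", "oppo",     "android", "1.2.3"),
    ("10002", "appstore", "ios",     "1.2.0"),
    ("10001", "oppo",     "android", "1.2.3"),
    ("10003", "appstore", "ios",     "1.1.9"),
    ("10001", "xiaomi",   "android", "1.2.2"),
    ("10002", "oppo",     "android", "1.2.1"),
]

# step 1: test10_qid_os_ver_uid -- group by (uid, qid, os, ver), pv = count(1)
t_qid_os_ver = Counter((u, q, o, v) for u, q, o, v in rows)

# step 2: test10_os_uid -- group by (uid, os), pv = sum(pv)
t_os = Counter()
for (u, q, o, v), pv in t_qid_os_ver.items():
    t_os[(u, o)] += pv

# step 3: test10_uid -- group by uid, pv = sum(pv)
t_uid = Counter()
for (u, o), pv in t_os.items():
    t_uid[u] += pv

print(dict(t_uid))  # {'10001': 3, '10002': 3, '10003': 2}
```

Each step collapses one dimension while summing pv, so the total visit count (8) survives all the way down.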

Aggregation:
Aggregate by channel, operating system and version:

spark-sql> select
         >   qid,
         >   os,
         >   ver,
         >   count(1) as uv,
         >   sum(pv) as pv
         > from test.test10_qid_os_ver_uid
         > group by qid,os,ver;
appstore    ios       1.1.9   1   1
appstore    ios       1.2.0   1   1
xiaomi      android   1.2.2   1   2
oppo        android   1.2.3   2   2
xiaomi      android   1.2.1   1   1
oppo        android   1.2.1   1   1
Time taken: 6.842 seconds, Fetched 6 row(s)

Aggregation per operating system:

spark-sql> select
         >   os,
         >   count(1) as uv,
         >   sum(pv) as pv
         > from test.test10_os_uid
         > group by os;
android   3   6
ios       2   2
Time taken: 10.561 seconds, Fetched 2 row(s)

Total aggregation (across all users):

spark-sql> select
         >   count(1) as uv,
         >   sum(pv) as pv
         > from test.test10_uid;
3       8                                                                       
Time taken: 3.711 seconds, Fetched 1 row(s)

That is a lot of code: one deduplication plus one aggregation per dimension.
(If you don't need the intermediate tables, you could also deduplicate and aggregate each dimension directly from the detail table with qid, os and ver, but the number of queries stays the same.)

Let's take another look at the grouping sets approach:

  1. First deduplicate by all the required dimension combinations in one pass
  2. Then aggregate

Deduplication:
The dimension combinations we deduplicate by:

  • uid,qid,os,ver
  • uid,os
  • uid

spark-sql> select
         >   uid,
         >   qid,
         >   os,
         >   ver,
         >   count(1) as pv
         > from test.test10
         > group by uid,qid,os,ver
         > grouping sets
         > (
         >   (uid,qid,os,ver),
         >   (uid,os),
         >   (uid)
         > )
         > order by qid,os,ver;
10001   NULL       NULL      NULL    3
10002   NULL       NULL      NULL    3
10003   NULL       NULL      NULL    2
10001   NULL       android   NULL    3
10003   NULL       android   NULL    1
10002   NULL       android   NULL    2
10003   NULL       ios       NULL    1
10002   NULL       ios       NULL    1
10003   appstore   ios       1.1.9   1
10002   appstore   ios       1.2.0   1
10002   oppo       android   1.2.1   1
10003   oppo       android   1.2.3   1
10001   oppo       android   1.2.3   1
10002   xiaomi     android   1.2.1   1
10001   xiaomi     android   1.2.2   2
Time taken: 4.051 seconds, Fetched 15 row(s)

Note:
In the rows produced by the (uid) grouping set, qid, os and ver are all NULL: those rows are the per-user totals summarized across all channels, operating systems and versions.
In the rows produced by the (uid,os) grouping set, qid and ver are NULL: those rows are per-user totals summarized within each operating system.
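Conceptually, grouping sets behaves like a UNION ALL of one group by per set, with the unused dimensions padded with NULL. A Python sketch of that equivalence (None standing in for NULL, sample rows transcribed from test.test10):

```python
from collections import Counter

# sample rows transcribed from test.test10: (uid, qid, os, ver)
rows = [
    ("10001", "xiaomi",   "android", "1.2.2"),
    ("10002", "xiaomi",   "android", "1.2.1"),
    ("10003", "oppo",     "android", "1.2.3"),
    ("10002", "appstore", "ios",     "1.2.0"),
    ("10001", "oppo",     "android", "1.2.3"),
    ("10003", "appstore", "ios",     "1.1.9"),
    ("10001", "xiaomi",   "android", "1.2.2"),
    ("10002", "oppo",     "android", "1.2.1"),
]

def group_by(rows, keyfn):
    # one group by over the padded key, pv = count(1)
    return Counter(keyfn(*r) for r in rows)

# grouping sets ((uid,qid,os,ver), (uid,os), (uid)) as a union of group-bys
sets = [
    lambda u, q, o, v: (u, q, o, v),          # (uid, qid, os, ver)
    lambda u, q, o, v: (u, None, o, None),    # (uid, os)
    lambda u, q, o, v: (u, None, None, None), # (uid)
]
result = []
for keyfn in sets:
    result.extend(group_by(rows, keyfn).items())

print(len(result))  # 15 rows, matching the spark-sql output above
```

One pass over the data produces all three deduplication levels at once, which is exactly what saves us the three separate tables.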

Aggregation:
Aggregate the deduplicated result by the remaining dimensions:

spark-sql> select 
         >   qid,
         >   os,
         >   ver,
         >   count(1) as uv,
         >   sum(pv) as pv
         > from
         > (select
         >   uid,
         >   nvl(qid,'all') as qid,
         >   nvl(os,'all') as os,
         >   nvl(ver,'all') as ver,
         >   count(1) as pv
         > from test.test10
         > group by uid,qid,os,ver
         > grouping sets
         > (
         >   (uid,qid,os,ver),
         >   (uid,os),
         >   (uid)
         > )) t2
         > group by qid,os,ver
         > order by qid,os,ver;
all        all       all     3   8
all        android   all     3   6
all        ios       all     2   2
appstore   ios       1.1.9   1   1
appstore   ios       1.2.0   1   1
xiaomi     android   1.2.2   1   2
oppo       android   1.2.3   2   2
xiaomi     android   1.2.1   1   1
oppo       android   1.2.1   1   1
Time taken: 8.257 seconds, Fetched 9 row(s)

Note: there is a pitfall here. If the source data itself contains NULL values in the qid, os or ver dimensions, the results will be wrong.
Case:

spark-sql> with test1 as
         > (select '10001' as uid,'xiaomi' as qid
         > union all
         > select '10002' as uid,'xiaomi' as qid
         > union all
         > select '10003' as uid,'oppo' as qid
         > union all
         > select '10003' as uid,NULL as qid)
         > select 
         >   qid,
         >   count(1) as uv,
         >   sum(pv) as pv
         > from 
         > (select 
         >   uid,
         >   qid,
         >   count(1) as pv
         > from test1
         > group by uid,qid
         > grouping sets
         > (
         >   (uid,qid),
         >   (uid)
         > )) t1
         > group by qid;
NULL     4   5
oppo     1   1
xiaomi   2   2
Time taken: 6.396 seconds, Fetched 3 row(s)

The qid dimension contains a NULL value in the data, and the summary row is wrong as a result: uv should be 3 and pv should be 4, but we got uv 4 and pv 5. This happens because grouping sets marks rolled-up dimensions with NULL, so the data row whose qid is genuinely NULL gets counted into the summary bucket when we group by qid again. How can we solve this? Replace NULLs in every dimension field before grouping, e.g. nvl(qid, 'unknow') as qid.
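The collision can be reproduced outside Spark. In this Python sketch (None standing in for NULL, same four test1 rows), the data row whose qid is genuinely NULL lands in the same outer bucket as the rollup rows that the (uid) grouping set produces:

```python
from collections import Counter

# the test1 rows: (uid, qid), with one genuine NULL qid
rows = [("10001", "xiaomi"), ("10002", "xiaomi"), ("10003", "oppo"), ("10003", None)]

# inner query: grouping sets ((uid, qid), (uid)) as a union of two group-bys
inner = list(Counter((u, q) for u, q in rows).items())                             # (uid, qid)
inner += [((u, None), pv) for (u,), pv in Counter((u,) for u, _ in rows).items()]  # (uid)

# outer query: group by qid -> uv = count(1), pv = sum(pv)
outer = {}
for (u, q), pv in inner:
    uv, total = outer.get(q, (0, 0))
    outer[q] = (uv + 1, total + pv)

# the (10003, NULL) data row is indistinguishable from the (uid) rollup rows:
print(outer[None])  # (4, 5) instead of the correct (3, 4)
```

The per-channel buckets stay correct; only the NULL bucket is polluted, which is why nvl-ing the dimension fields first fixes the problem.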
Case:

spark-sql> with test1 as
         > (select '10001' as uid,'xiaomi' as qid
         > union all
         > select '10002' as uid,'xiaomi' as qid
         > union all
         > select '10003' as uid,'oppo' as qid
         > union all
         > select '10003' as uid,NULL as qid)
         > 
         > select 
         >   qid,
         >   count(1) as uv,
         >   sum(pv) as pv
         > from 
         > (select 
         >   uid,
         >   qid,
         >   count(1) as pv
         > from 
         > (select 
         >   uid,
         >   nvl(qid,'unknow') as qid
         > from test1) t1
         > group by uid,qid
         > grouping sets
         > (
         >   (uid,qid),
         >   (uid)
         > )) t2
         > group by qid;
NULL     3   4
oppo     1   1
xiaomi   2   2
unknow   1   1
Time taken: 3.47 seconds, Fetched 4 row(s)

In this way, rows whose qid is NULL show up under the placeholder unknow, and the NULL summary row is correct again (uv 3, pv 4).

Summary:
Plain group by approach:
Advantages: a deduplicated uid table is materialized for every dimension, so querying any single dimension later is fast, and the logic is clear and intuitive.
Disadvantages: a lot of code that is troublesome to maintain.
grouping sets approach:
Advantages: the code is short and easy to maintain.
Disadvantages: it consumes more resources when the data volume is large (the dimension combinations under grouping sets can be split into several smaller queries to mitigate this).
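The segmentation idea in that last point can be sketched as: run the grouping sets in smaller batches and union the results; each batch is a cheaper job, and the union equals the full grouping sets output. A Python illustration over the same padded-key model (the batch split here is an assumption for illustration; in Spark you would run one grouping sets query per batch and UNION ALL them):

```python
from collections import Counter

rows = [
    ("10001", "xiaomi", "android", "1.2.2"),
    ("10002", "oppo",   "android", "1.2.1"),
]  # tiny illustrative sample

# the full list of grouping sets, as padded-key functions
full = [
    lambda u, q, o, v: (u, q, o, v),          # (uid, qid, os, ver)
    lambda u, q, o, v: (u, None, o, None),    # (uid, os)
    lambda u, q, o, v: (u, None, None, None), # (uid)
]
# split the sets into two batches (assumed split; any partition works)
batches = [full[:2], full[2:]]

def run(rows, sets):
    out = []
    for keyfn in sets:
        out.extend(Counter(keyfn(*r) for r in rows).items())
    return out

# the union of the batch results equals the single full-grouping-sets result
union_of_batches = [row for batch in batches for row in run(rows, batch)]
assert sorted(map(str, union_of_batches)) == sorted(map(str, run(rows, full)))
```

Since grouping sets is a union of independent group-bys, splitting it across jobs changes the resource profile but not the answer.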


Added by jpowermacg4 on Sat, 25 Dec 2021 18:29:07 +0200