Let's take a look at a problem in actual generation:
There is a user access record table, which records the user's access under different channels, operating systems and versions,
As follows:
create table test.test10 as select '10001' as uid,'xiaomi' as qid,'android' as os,'1.2.2' as ver union all select '10002' as uid,'xiaomi' as qid,'android' as os,'1.2.1' as ver union all select '10003' as uid,'oppo' as qid,'android' as os,'1.2.3' as ver union all select '10002' as uid,'appstore' as qid,'ios' as os,'1.2.0' as ver union all select '10001' as uid,'oppo' as qid,'android' as os,'1.2.3' as ver union all select '10003' as uid,'appstore' as qid,'ios' as os,'1.1.9' as ver union all select '10001' as uid,'xiaomi' as qid,'android' as os,'1.2.2' as ver union all select '10002' as uid,'oppo' as qid,'android' as os,'1.2.1' as ver;
Question:
- Number of visits and number of visitors per channel, operating system and version
- Number of visits and number of visitors per operating system
- Total number of visits and number of visitors
analysis:
The above three questions are about the number of visits and the number of visitors. When it comes to the number of visitors, our first consideration is to de duplicate.
How to get the highest efficiency?
Let's first look at the method of group by:
- First remove the duplication according to each channel, operating system and version
- According to the de duplication result of 1, and then according to the operating system
- According to the result of 2, the user dimension is de duplicated
duplicate removal:
De duplication according to each channel, operating system and version:
create table test.test10_qid_os_ver_uid as select uid, qid, os, ver, count(1) as pv from test.test10 group by uid,qid,os,ver;
De duplication according to operating system:
create table test.test10_os_uid as select uid, os, sum(pv) as pv from test.test10_qid_os_ver_uid group by uid,os;
According to user weight removal:
create table test.test10_uid as select uid, sum(pv) as pv from test.test10_os_uid group by uid;
Aggregation:
Aggregation of each channel, operating system and version:
spark-sql> select > qid, > os, > ver, > count(1) as uv, > sum(pv) as pv > from test.test10_qid_os_ver_uid > group by qid,os,ver; appstore ios 1.1.9 1 1 appstore ios 1.2.0 1 1 xiaomi android 1.2.2 1 2 oppo android 1.2.3 2 2 xiaomi android 1.2.1 1 1 oppo android 1.2.1 1 1 Time taken: 6.842 seconds, Fetched 6 row(s)
Aggregation per operating system:
spark-sql> select > os, > count(1) as uv, > sum(pv) as pv > from test.test10_os_uid > group by os; android 3 6 ios 2 2 Time taken: 10.561 seconds, Fetched 2 row(s)
Aggregation per user:
spark-sql> select > count(1) as uv, > sum(pv) as pv > from test.test10_uid; 3 8 Time taken: 3.711 seconds, Fetched 1 row(s)
Is it a little scary to see so much code above? De duplication of each dimension and aggregation of each dimension.
If you do not consider code optimization, you can directly de duplicate and aggregate from the most detailed tables of qid, os and ver.
Let's take another look at the method of grouping sets:
- First, remove the duplicate according to the required dimensions
- Then polymerization
duplicate removal:
The dimensions of our de duplication:
- uid,qid,os,ver
- uid,os
- uid
spark-sql> select > uid, > qid, > os, > ver, > count(1) as pv > from test.test10 > group by uid,qid,os,ver > grouping sets > ( > (uid,qid,os,ver), > (uid,os), > (uid) > ) > order by qid,os,ver; 10001 NULL NULL NULL 3 10002 NULL NULL NULL 3 10003 NULL NULL NULL 2 10001 NULL android NULL 3 10003 NULL android NULL 1 10002 NULL android NULL 2 10003 NULL ios NULL 1 10002 NULL ios NULL 1 10003 appstore ios 1.1.9 1 10002 appstore ios 1.2.0 1 10002 oppo android 1.2.1 1 10003 oppo android 1.2.3 1 10001 oppo android 1.2.3 1 10002 xiaomi android 1.2.1 1 10001 xiaomi android 1.2.2 2 Time taken: 4.051 seconds, Fetched 15 row(s)
Note:
When grouping sets only removes duplicates according to uid dimensions, qid, os and ver dimensions are NULL, that is, qid, os and ver dimensions are summarized
When grouping sets is de duplicated according to uid and os dimensions, qid and ver are NULL, that is, qid and ver are summarized
Aggregation:
Aggregate according to the results of uid de duplication:
spark-sql> select > qid, > os, > ver, > count(1) as uv, > sum(pv) as pv > from > (select > uid, > nvl(qid,'all') as qid, > nvl(os,'all') as os, > nvl(ver,'all') as ver, > count(1) as pv > from test.test10 > group by uid,qid,os,ver > grouping sets > ( > (uid,qid,os,ver), > (uid,os), > (uid) > )) t2 > group by qid,os,ver > order by qid,os,ver; all all all 3 8 all android all 3 6 all ios all 2 2 appstore ios 1.1.9 1 1 appstore ios 1.2.0 1 1 xiaomi android 1.2.2 1 2 oppo android 1.2.3 2 2 xiaomi android 1.2.1 1 1 oppo android 1.2.1 1 1 Time taken: 8.257 seconds, Fetched 9 row(s)
Note: there is a pit here. If there is a NULL value in the qid, os and ver dimensions, the data will be abnormal
Case:
spark-sql> with test1 as > (select '10001' as uid,'xiaomi' as qid > union all > select '10002' as uid,'xiaomi' as qid > union all > select '10003' as uid,'oppo' as qid > union all > select '10003' as uid,NULL as qid) > select > qid, > count(1) as uv, > sum(pv) as pv > from > (select > uid, > qid, > count(1) as pv > from test1 > group by uid,qid > grouping sets > ( > (uid,qid), > (uid) > )) t1 > group by qid; NULL 4 5 oppo 1 1 xiaomi 2 2 Time taken: 6.396 seconds, Fetched 3 row(s)
There is a NULL value in the qid dimension. As a result, there is a problem with the summary data. Originally, uv is 3 and pv is 4. Now uv is 4 and pv is 5. This is because growing sets calculates it into the summary dimension when processing the qid dimension again. How can we solve this problem? The values of all dimension fields are emptied NVL (qid, 'unknown') as qid
Case:
spark-sql> with test1 as > (select '10001' as uid,'xiaomi' as qid > union all > select '10002' as uid,'xiaomi' as qid > union all > select '10003' as uid,'oppo' as qid > union all > select '10003' as uid,NULL as qid) > > select > qid, > count(1) as uv, > sum(pv) as pv > from > (select > uid, > qid, > count(1) as pv > from > (select > uid, > nvl(qid,'unknow') as qid > from test1) t1 > group by uid,qid > grouping sets > ( > (uid,qid), > (uid) > )) t2 > group by qid; NULL 3 4 oppo 1 1 xiaomi 2 2 unknow 1 1 Time taken: 3.47 seconds, Fetched 4 row(s)
In this way, if qid is NULL, the value will be displayed as unknow n
Summary:
Normal group by mode:
Advantages: the uid tables of all dimensions are implemented. If you query a specific dimension, the efficiency will be higher; The logic is clear and intuitive
Disadvantages: large amount of code and troublesome maintenance
grouping sets mode:
Advantages: the code is simple and easy to maintain
Disadvantages: when the amount of data is large, more resources are consumed (the dimensions under grouping sets can be segmented to solve the problem of large amount of data)