Data warehouse user behavior funnel analysis how to implement SQL (Section 1)

"Contents"

  • Demand 1: user active theme
  • Requirement 2: user added theme
  • Demand 3: user retention theme
  • Demand 4: number of silent users
  • Demand 5: number of returned users this week
  • Demand 6: number of lost users
  • Demand 7: number of active users in the last three consecutive weeks
  • Demand 8: number of active users for three consecutive days in the last seven days
  • Demand 9: total turnover of GMV (gross merchant volume) in a period of time
  • Demand 10: conversion rate = new users / daily users
  • Requirement 11: user behavior funnel analysis
  • Demand 12: brand repurchase rate
  • Demand XIII: analysis of ADS layer brand repurchase rate report
  • Demand 14: find the ranking of the top ten commodities with the corresponding repurchase rate of users at each level

Demand 1: user active theme

DWS layer -- (wide surface layer of user behavior) goal: count the details of each equipment in the current day, current week and current month

1. Daily active equipment details DWD_ start_ log--->dws_ uv_ detail_ day

--Collect the same field_ Set to an array and press mid_id grouping (for later statistics)

collect_ Set de summarizes the values of a field to generate an array type field. For example: concat_ws('|', collect_set(user_id)) user_id,

Create partition table dws_uv_detail_day: partitioned by ('dt' string)

drop table if exists dws_uv_detail_day;
create table dws_uv_detail_day( 
    `mid_id` string COMMENT 'Unique identification of equipment',
    `user_id` string COMMENT 'User ID', 
    `version_code` string COMMENT 'Program version number', 
    `version_name` string COMMENT 'Program version name', 
`lang` string COMMENT 'System language', 
`source` string COMMENT 'Channel number', 
`os` string COMMENT 'Android version', 
`area` string COMMENT 'region', 
`model` string COMMENT 'Mobile phone model', 
`brand` string COMMENT 'Mobile phone brand', 
`sdk_version` string COMMENT 'sdkVersion', 
`gmail` string COMMENT 'gmail', 
`height_width` string COMMENT 'Screen width and height',
`app_time` string COMMENT 'Time when the client log was generated',
`network` string COMMENT 'Network mode',
`lng` string COMMENT 'longitude',
`lat` string COMMENT 'latitude'
) COMMENT 'Daily details of active users'
PARTITIONED BY ( `dt` string)
stored as  parquet
location '/warehouse/gmall/dws/dws_uv_detail_day/'
;

Data import

Zoning by week; Filter out the data within one week; Group by device id; = = => Count (*) to get the final result;

partition(dt='2019-02-10') from dwd_ start_ log where dt='2019-02-10' group by mid_ ID (mid_id device unique identification)

Aggregation is based on the user's single day access as the key. If a user uses two operating systems, two system versions and multiple regions in a day, log in to different accounts and take only one of them

hive (gmall)>
set hive.exec.dynamic.partition.mode=nonstrict;

insert overwrite table dws_uv_detail_day  partition(dt='2019-02-10')
select  
    mid_id,
    concat_ws('|', collect_set(user_id)) user_id,
    concat_ws('|', collect_set(version_code)) version_code,
    concat_ws('|', collect_set(version_name)) version_name,
    concat_ws('|', collect_set(lang))lang,
    concat_ws('|', collect_set(source)) source,
    concat_ws('|', collect_set(os)) os,
    concat_ws('|', collect_set(area)) area, 
    concat_ws('|', collect_set(model)) model,
    concat_ws('|', collect_set(brand)) brand,
    concat_ws('|', collect_set(sdk_version)) sdk_version,
    concat_ws('|', collect_set(gmail)) gmail,
    concat_ws('|', collect_set(height_width)) height_width,
    concat_ws('|', collect_set(app_time)) app_time,
    concat_ws('|', collect_set(network)) network,
    concat_ws('|', collect_set(lng)) lng,
    concat_ws('|', collect_set(lat)) lat
from dwd_start_log
where dt='2019-02-10'  
group by mid_id;

Query and import results;

hive (gmall)> select * from dws_uv_detail_day limit 1;

###The last count(*) is the number of active devices per day;
hive (gmall)> select count(*) from dws_uv_detail_day;

2 weekly (dws_uv_detail_wk) active equipment detail partition(wk_dt)

Monday to Sunday concat (date_add (next_day ('2019-02-10 ',' mo '), - 7),' _ ', date_ Add (next_day ('2019-02-10 ',' mo '), - 1)) is 2019-02-04_ 2019-02-10

Create partition table: partitioned by ('wk_dt 'string)

hive (gmall)>
drop table if exists dws_uv_detail_wk;

create table dws_uv_detail_wk( 
    `mid_id` string COMMENT 'Unique identification of equipment',
    `user_id` string COMMENT 'User ID', 
    `version_code` string COMMENT 'Program version number', 
    `version_name` string COMMENT 'Program version name', 
`lang` string COMMENT 'System language', 
`source` string COMMENT 'Channel number', 
`os` string COMMENT 'Android version', 
`area` string COMMENT 'region', 
`model` string COMMENT 'Mobile phone model', 
`brand` string COMMENT 'Mobile phone brand', 
`sdk_version` string COMMENT 'sdkVersion', 
`gmail` string COMMENT 'gmail', 
`height_width` string COMMENT 'Screen width and height',
`app_time` string COMMENT 'Time when the client log was generated',
`network` string COMMENT 'Network mode',
`lng` string COMMENT 'longitude',
`lat` string COMMENT 'latitude',
    `monday_date` string COMMENT 'Monday date',
    `sunday_date` string COMMENT  'Sunday date' 
) COMMENT 'Details of active users by week'
PARTITIONED BY (`wk_dt` string)
stored as  parquet
location '/warehouse/gmall/dws/dws_uv_detail_wk/'
;

Import data: divided by week; Filter out the data within one month and group by device id;

Monday: date_add(next_day('2019-05-16','MO'),-7);

Sunday: date_add(next_day('2019-05-16','MO'),-1);

Monday to Sunday: concat (date_add (next_day ('may 16, 2019 ',' mo '), - 7), "_", date_ add(next_day('2019-05-16', 'MO'), -1));

insert overwrite table dws_uv_detail_wk partition(wk_dt)
select mid_id,
concat_ws('|', collect_set(user_id)) user_id,
concat_ws('|', collect_set(version_code)) version_code,
concat_ws('|', collect_set(version_name)) version_name,
concat_ws('|', collect_set(lang)) lang,
concat_ws('|', collect_set(source)) source,
concat_ws('|', collect_set(os)) os,
concat_ws('|', collect_set(area)) area, 
concat_ws('|', collect_set(model)) model,
concat_ws('|', collect_set(brand)) brand,
concat_ws('|', collect_set(sdk_version)) sdk_version,
concat_ws('|', collect_set(gmail)) gmail,
concat_ws('|', collect_set(height_width)) height_width,
concat_ws('|', collect_set(app_time)) app_time,
concat_ws('|', collect_set(network)) network,
concat_ws('|', collect_set(lng)) lng,
concat_ws('|', collect_set(lat)) lat,
date_add(next_day('2019-02-10', 'MO'), -7),
date_add(next_day('2019-02-10', 'MO'), -1),
concat(date_add(next_day('2019-02-10', 'MO'), -7), '_', date_add(next_day('2019-02-10', 'MO'), -1))
from dws_uv_detail_day
where dt >= date_add(next_day('2019-02-10', 'MO'), -7) and dt <= date_add(next_day('2019-02-10', 'MO'), -1)
group by mid_id;

Query import results

hive (gmall)> select * from dws_uv_detail_wk limit 1;
hive (gmall)> select count(*) from dws_uv_detail_wk;

3. Monthly active equipment details DWS_ uv_ detail_ Mn partition (MN) - insert daily data

DWS layer creates partition table partitioned by(mn string)

hive (gmall)>
drop table if exists dws_uv_detail_mn;

create  external table dws_uv_detail_mn( 
    `mid_id` string COMMENT 'Unique identification of equipment',
    `user_id` string COMMENT 'User ID', 
    `version_code` string COMMENT 'Program version number', 
    `version_name` string COMMENT 'Program version name', 
`lang` string COMMENT 'System language', 
`source` string COMMENT 'Channel number', 
`os` string COMMENT 'Android version', 
`area` string COMMENT 'region', 
`model` string COMMENT 'Mobile phone model', 
`brand` string COMMENT 'Mobile phone brand', 
`sdk_version` string COMMENT 'sdkVersion', 
`gmail` string COMMENT 'gmail', 
`height_width` string COMMENT 'Screen width and height',
`app_time` string COMMENT 'Time when the client log was generated',
`network` string COMMENT 'Network mode',
`lng` string COMMENT 'longitude',
`lat` string COMMENT 'latitude'
) COMMENT 'Monthly details of active users'
PARTITIONED BY (`mn` string)
stored as  parquet
location '/warehouse/gmall/dws/dws_uv_detail_mn/'
;

Data import is divided by month; Filter out the data within one month and group it according to the device id;

data_format('2019-03-10', 'yyyy-MM') ---> 2019-03

where date_format('dt', 'yyyy-MM') = date_format('2019-02-10', 'yyyy-MM') group by mid_id;

hive (gmall)>
set hive.exec.dynamic.partition.mode=nonstrict;

insert  overwrite table dws_uv_detail_mn  partition(mn)
select  
    mid_id,
    concat_ws('|', collect_set(user_id)) user_id,
    concat_ws('|', collect_set(version_code)) version_code,
    concat_ws('|', collect_set(version_name)) version_name,
    concat_ws('|', collect_set(lang)) lang,
    concat_ws('|', collect_set(source)) source,
    concat_ws('|', collect_set(os)) os,
    concat_ws('|', collect_set(area)) area, 
    concat_ws('|', collect_set(model)) model,
    concat_ws('|', collect_set(brand)) brand,
    concat_ws('|', collect_set(sdk_version)) sdk_version,
    concat_ws('|', collect_set(gmail)) gmail,
    concat_ws('|', collect_set(height_width)) height_width,
    concat_ws('|', collect_set(app_time)) app_time,
    concat_ws('|', collect_set(network)) network,
    concat_ws('|', collect_set(lng)) lng,
    concat_ws('|', collect_set(lat)) lat,
    date_format('2019-02-10','yyyy-MM')
from dws_uv_detail_day
where date_format(dt,'yyyy-MM') = date_format('2019-02-10','yyyy-MM')   
group by mid_id;

Query import results

hive (gmall)> select * from dws_uv_detail_mn limit 1;
hive (gmall)> select count(*) from dws_uv_detail_mn ;

DWS layer loading data script

Create a script in the / home/kris/bin directory of Hadoop 101

[kris@hadoop101 bin]$ vim dws.sh

#!/bin/bash

# Define variables for easy modification
APP=gmall
hive=/opt/module/hive/bin/hive

# If it is the entered date, the entered date will be obtained according to the; If no date is entered, take the day before the current time
if [ -n "$1" ] ;then
    do_date=$1
else 
    do_date=`date -d "-1 day" +%F`  
fi 


sql="
  set hive.exec.dynamic.partition.mode=nonstrict;

  insert overwrite table "$APP".dws_uv_detail_day partition(dt='$do_date')
  select  
    mid_id,
    concat_ws('|', collect_set(user_id)) user_id,
    concat_ws('|', collect_set(version_code)) version_code,
    concat_ws('|', collect_set(version_name)) version_name,
    concat_ws('|', collect_set(lang)) lang,
    concat_ws('|', collect_set(source)) source,
    concat_ws('|', collect_set(os)) os,
    concat_ws('|', collect_set(area)) area, 
    concat_ws('|', collect_set(model)) model,
    concat_ws('|', collect_set(brand)) brand,
    concat_ws('|', collect_set(sdk_version)) sdk_version,
    concat_ws('|', collect_set(gmail)) gmail,
    concat_ws('|', collect_set(height_width)) height_width,
    concat_ws('|', collect_set(app_time)) app_time,
    concat_ws('|', collect_set(network)) network,
    concat_ws('|', collect_set(lng)) lng,
    concat_ws('|', collect_set(lat)) lat
  from "$APP".dwd_start_log
  where dt='$do_date'  
  group by mid_id;


  insert  overwrite table "$APP".dws_uv_detail_wk partition(wk_dt)
  select  
    mid_id,
    concat_ws('|', collect_set(user_id)) user_id,
    concat_ws('|', collect_set(version_code)) version_code,
    concat_ws('|', collect_set(version_name)) version_name,
    concat_ws('|', collect_set(lang)) lang,
    concat_ws('|', collect_set(source)) source,
    concat_ws('|', collect_set(os)) os,
    concat_ws('|', collect_set(area)) area, 
    concat_ws('|', collect_set(model)) model,
    concat_ws('|', collect_set(brand)) brand,
    concat_ws('|', collect_set(sdk_version)) sdk_version,
    concat_ws('|', collect_set(gmail)) gmail,
    concat_ws('|', collect_set(height_width)) height_width,
    concat_ws('|', collect_set(app_time)) app_time,
    concat_ws('|', collect_set(network)) network,
    concat_ws('|', collect_set(lng)) lng,
    concat_ws('|', collect_set(lat)) lat,
    date_add(next_day('$do_date','MO'),-7),
    date_add(next_day('$do_date','SU'),-7),
    concat(date_add( next_day('$do_date','MO'),-7), '_' , date_add(next_day('$do_date','MO'),-1) 
  )
  from "$APP".dws_uv_detail_day 
  where dt>=date_add(next_day('$do_date','MO'),-7) and dt<=date_add(next_day('$do_date','MO'),-1) 
  group by mid_id; 


  insert overwrite table "$APP".dws_uv_detail_mn partition(mn)
  select  
    mid_id,
    concat_ws('|', collect_set(user_id)) user_id,
    concat_ws('|', collect_set(version_code)) version_code,
    concat_ws('|', collect_set(version_name)) version_name,
    concat_ws('|', collect_set(lang))lang,
    concat_ws('|', collect_set(source)) source,
    concat_ws('|', collect_set(os)) os,
    concat_ws('|', collect_set(area)) area, 
    concat_ws('|', collect_set(model)) model,
    concat_ws('|', collect_set(brand)) brand,
    concat_ws('|', collect_set(sdk_version)) sdk_version,
    concat_ws('|', collect_set(gmail)) gmail,
    concat_ws('|', collect_set(height_width)) height_width,
    concat_ws('|', collect_set(app_time)) app_time,
    concat_ws('|', collect_set(network)) network,
    concat_ws('|', collect_set(lng)) lng,
    concat_ws('|', collect_set(lat)) lat,
    date_format('$do_date','yyyy-MM')
  from "$APP".dws_uv_detail_day
  where date_format(dt,'yyyy-MM') = date_format('$do_date','yyyy-MM')   
  group by mid_id;
"

$hive -e "$sql"

Add script execution permission Chmod 777 DWS sh

Script use[ kris@hadoop101 module]$ dws. sh 2019-02-11

Query results

hive (gmall)> select count(*) from dws_uv_detail_day;
hive (gmall)> select count(*) from dws_uv_detail_wk;
hive (gmall)> select count(*) from dws_uv_detail_mn ;

Script execution time; In enterprise development, it is usually from 30:00 to 1:00 every morning

Target of ADS layer: the number of active devices in the current day, week and month uses day_count table join wk_count join mn_count, connect the three tables together

Create table ads_uv_count table:

Field has day_count,wk_count,mn_count is_weekend if(date_add(next_day('2019-02-10', 'MO'), -1) = '2019-02-10', 'Y', 'N') is_monthend if(last_day('2019-02-10') = '2019-02-10', 'Y', 'N')

drop table if exists ads_uv_count;
create external table ads_uv_count(
`dt` string comment 'Statistical date',
`day_count` bigint comment 'Daily user volume',
`wk_count` bigint comment 'Number of users in the current week',
`mn_count` bigint comment 'Number of users in current month',
`is_weekend` string comment 'Y,N Is it a weekend,Used to get the final result of this week',
`is_monthend` string comment 'Y,N Is it month end,Used to get the final result of this month'
) comment 'Number of active users per day'
stored as parquet
location '/warehouse/gmall/ads/ads_uv_count/';

Import data:

hive (gmall)>
insert  overwrite table ads_uv_count 
select  
  '2019-02-10' dt,
   daycount.ct,
   wkcount.ct,
   mncount.ct,
   if(date_add(next_day('2019-02-10','MO'),-1)='2019-02-10','Y','N') ,
   if(last_day('2019-02-10')='2019-02-10','Y','N') 
from 
(
   select  
      '2019-02-10' dt,
       count(*) ct
   from dws_uv_detail_day
   where dt='2019-02-10'  
)daycount   join 
( 
   select  
     '2019-02-10' dt,
     count (*) ct
   from dws_uv_detail_wk
   where wk_dt=concat(date_add(next_day('2019-02-10','MO'),-7),'_' ,date_add(next_day('2019-02-10','MO'),-1) )
)  wkcount  on daycount.dt=wkcount.dt
join 
( 
   select  
     '2019-02-10' dt,
     count (*) ct
   from dws_uv_detail_mn
   where mn=date_format('2019-02-10','yyyy-MM')  
)mncount on daycount.dt=mncount.dt
;

Query import results

hive (gmall)> select * from ads_uv_count ;

ADS layer loading data script

1) Create a script in the / home/kris/bin directory of Hadoop 101

[kris@hadoop101 bin]$ vim ads.sh

#!/bin/bash

# Define variables for easy modification
APP=gmall
hive=/opt/module/hive/bin/hive

# If it is the entered date, the entered date will be obtained according to the; If no date is entered, take the day before the current time
if [ -n "$1" ] ;then
    do_date=$1
else 
    do_date=`date -d "-1 day" +%F`  
fi 

sql="
  set hive.exec.dynamic.partition.mode=nonstrict;

insert into table "$APP".ads_uv_count 
select  
  '$do_date' dt,
   daycount.ct,
   wkcount.ct,
   mncount.ct,
   if(date_add(next_day('$do_date','MO'),-1)='$do_date','Y','N') ,
   if(last_day('$do_date')='$do_date','Y','N') 
from 
(
   select  
      '$do_date' dt,
       count(*) ct
   from "$APP".dws_uv_detail_day
   where dt='$do_date'  
)daycount   join 
( 
   select  
     '$do_date' dt,
     count (*) ct
   from "$APP".dws_uv_detail_wk
   where wk_dt=concat(date_add(next_day('$do_date','MO'),-7),'_' ,date_add(next_day('$do_date','MO'),-1) )
)  wkcount  on daycount.dt=wkcount.dt
join 
( 
   select  
     '$do_date' dt,
     count (*) ct
   from "$APP".dws_uv_detail_mn
   where mn=date_format('$do_date','yyyy-MM')  
)mncount on daycount.dt=mncount.dt;
"

$hive -e "$sql"

Add script execution permission chmod 777 ads.sh

The script uses ads.sh 2019-02-11

Query import result hive (GMALL) > select * from ads_ uv_ count ;

Requirement 2: user added theme

Users who use the application online for the first time. If a user opens an APP for the first time, the user is defined as a new user; Uninstalled and reinstalled devices will not be counted as a new addition. New users include daily new users, weekly new users and monthly new users.

Daily new (not including old users, who have not logged in before, but today is the first time to log in) devices -- there is no partition -- > there is no him in the previous new library, but he is active today, that is, new users;

1 DWS layer (daily newly added equipment list) create daily newly added equipment list: dws_new_mid_day

hive (gmall)>
drop table if exists  dws_new_mid_day;
create  table  dws_new_mid_day
(
    `mid_id` string COMMENT 'Unique identification of equipment',
    `user_id` string COMMENT 'User ID', 
    `version_code` string COMMENT 'Program version number', 
    `version_name` string COMMENT 'Program version name', 
`lang` string COMMENT 'System language', 
`source` string COMMENT 'Channel number', 
`os` string COMMENT 'Android version', 
`area` string COMMENT 'region', 
`model` string COMMENT 'Mobile phone model', 
`brand` string COMMENT 'Mobile phone brand', 
`sdk_version` string COMMENT 'sdkVersion', 
`gmail` string COMMENT 'gmail', 
`height_width` string COMMENT 'Screen width and height',
`app_time` string COMMENT 'Time when the client log was generated',
`network` string COMMENT 'Network mode',
`lng` string COMMENT 'longitude',
`lat` string COMMENT 'latitude',
    `create_date`  string  comment 'Creation time' 
)  COMMENT 'Daily new equipment information'
stored as  parquet
location '/warehouse/gmall/dws/dws_new_mid_day/';

dws_ uv_ detail_ Day (daily active equipment details) left join DWS_ new_ mid_ Day nm (previous new user table, new field create_time2019-02-10) nm mid_ id is null;

Import data

Use the daily active user table left join to add a daily device table. The associated condition is mid_ The IDs are equal. If it is a daily added device, it is null in the daily added device table.

from dws_uv_detail_day ud left join dws_new_mid_day nm on ud.mid_id=nm.mid_id

where ud.dt='2019-02-10' and nm.mid_id is null;

hive (gmall)>
insert into table dws_new_mid_day  
select  
    ud.mid_id,
    ud.user_id , 
    ud.version_code , 
    ud.version_name , 
    ud.lang , 
    ud.source, 
    ud.os, 
    ud.area, 
    ud.model, 
    ud.brand, 
    ud.sdk_version, 
    ud.gmail, 
    ud.height_width,
    ud.app_time,
    ud.network,
    ud.lng,
    ud.lat,
    '2019-02-10'
from dws_uv_detail_day ud left join dws_new_mid_day nm on ud.mid_id=nm.mid_id
where ud.dt='2019-02-10' and nm.mid_id is null;

Query import data

hive (gmall)> select count(*) from dws_new_mid_day ;

2. Ads layer (daily newly added equipment table) creates daily newly added equipment table ads_new_mid_count

hive (gmall)>
drop table if exists  `ads_new_mid_count`;
create  table  `ads_new_mid_count`
(
    `create_date`     string  comment 'Creation time' ,
    `new_mid_count`   BIGINT comment 'Number of new equipment' 
)  COMMENT 'Number of new equipment information per day'
row format delimited  fields terminated by '\t' 
location '/warehouse/gmall/ads/ads_new_mid_count/';

Import data count (*) DWS_ new_ mid_ Just use the day table

Add create_date must be group by create_time, otherwise an error is reported: not in GROUP BY key 'create_date'

hive (gmall)>
insert into table ads_new_mid_count 
select create_date , count(*)  from dws_new_mid_day
where create_date='2019-02-10'
group by create_date ;

Query import data

hive (gmall)> select * from ads_new_mid_count;

Extended monthly additions:

--New monthly
drop table if exists dws_new_mid_mn;
create table dws_new_mid_mn(
    `mid_id` string COMMENT 'Unique identification of equipment',
    `user_id` string COMMENT 'User ID', 
    `version_code` string COMMENT 'Program version number', 
    `version_name` string COMMENT 'Program version name', 
    `lang` string COMMENT 'System language', 
    `source` string COMMENT 'Channel number', 
    `os` string COMMENT 'Android version', 
    `area` string COMMENT 'region', 
    `model` string COMMENT 'Mobile phone model', 
    `brand` string COMMENT 'Mobile phone brand', 
    `sdk_version` string COMMENT 'sdkVersion', 
    `gmail` string COMMENT 'gmail', 
    `height_width` string COMMENT 'Screen width and height',
    `app_time` string COMMENT 'Time when the client log was generated',
    `network` string COMMENT 'Network mode',
    `lng` string COMMENT 'longitude',
    `lat` string COMMENT 'latitude'
)comment "Monthly new details"
partitioned by(mn string)
stored as parquet
location "/warehouse/gmall/dws/dws_new_mid_mn";

insert overwrite table dws_new_mid_mn partition(mn)
select
    um.mid_id,
    um.user_id , 
    um.version_code , 
    um.version_name , 
    um.lang , 
    um.source, 
    um.os, 
    um.area, 
    um.model, 
    um.brand, 
    um.sdk_version, 
    um.gmail, 
    um.height_width,
    um.app_time,
    um.network,
    um.lng,
    um.lat,
    date_format('2019-02-10', 'yyyy-MM')
from dws_uv_detail_mn um left join dws_new_mid_mn nm on um.mid_id = nm.mid_id
where um.mn =date_format('2019-02-10', 'yyyy-MM') and nm.mid_id = null; ----Why is it empty?? I can't find the data
--##Note that date cannot be written here_ format(um.mn, 'yyyy-MM') =date_ format('2019-02-10', 'yyyy-MM') 
    |

Update to requirement 2 first, and I will continue to update the subsequent requirements...... Coming soon!!!!!

Added by hillbilly928 on Wed, 09 Mar 2022 04:55:41 +0200