Development of Big Data Module--Statistical Analysis

After the data warehouse is built, users can write Hive SQL statements to access them and analyze the data.

In actual production, which statistical indicators are needed are usually proposed by the relevant departments of data demand, and new statistical requirements will continue to emerge. The following are some typical indicators in website traffic analysis.

Note: Each statistical index can be drilled with the dimension table.

1. Traffic analysis 1.1. Multidimensional statistics PV total amount by time dimension

- Compute hourly pvs, pay attention to gruop by grammar

select count(*) as pvs,month,day,hour from ods_weblog_detail group by month,day,hour;

Way 1: Query directly on the ods_weblog_detail form

--Calculate the hours in the batch (day) pvs

drop table dw_pvs_everyhour_oneday;

create table dw_pvs_everyhour_oneday(month string,day string,hour string,pvs bigint) partitioned by(datestr string);

insert into table dw_pvs_everyhour_oneday partition(datestr='20130918')

select a.month as month,a.day as day,a.hour as hour,count(*) as pvs from ods_weblog_detail a

where  a.datestr='20130918' group by a.month,a.day,a.hour;

--Calculate daily pvs

drop table dw_pvs_everyday;

create table dw_pvs_everyday(pvs bigint,month string,day string);

insert into table dw_pvs_everyday

select count(*) as pvs,a.month as month,a.day as day from ods_weblog_detail a

group by a.month,a.day;

Mode 2: Association query with time dimension table

--Dimension: Day

drop table dw_pvs_everyday;

create table dw_pvs_everyday(pvs bigint,month string,day string);

insert into table dw_pvs_everyday

select count(*) as pvs,a.month as month,a.day as day from (select distinct month, day from t_dim_time) a

join ods_weblog_detail b

on a.month=b.month and a.day=b.day

group by a.month,a.day;

--Dimension: Month

drop table dw_pvs_everymonth;

create table dw_pvs_everymonth (pvs bigint,month string);

insert into table dw_pvs_everymonth

select count(*) as pvs,a.month from (select distinct month from t_dim_time)  a

join ods_weblog_detail b on a.month=b.month group by a.month;

--In addition, the previous calculation results can also be used directly. For example, count every day from the previous hourly results

Insert into table dw_pvs_everyday

Select sum(pvs) as pvs,month,day from dw_pvs_everyhour_oneday group by month,day having day='18';
 —

By terminal dimension

The field in the data that can reflect user terminal information is http_user_agent.

User Agent is also referred to as UA. It is a special string header. It is an identification that provides information about the type and version of browser, operating system and version, browser kernel, etc. used to visit a website. For example:

User-Agent,Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.276 Safari/537.36

The above UA information can extract the following information:

chrome 58.0, browser chrome, browser version 58.0, system platform windows

Browser Kernel webkit

Instead of expanding your knowledge here, you can see how UA is parsed in Resources.

The following statements can be used for tentative statistics, of course, such accuracy is not very high.

select distinct(http_user_agent) from ods_weblog_detail where http_user_agent like '%Chrome%' limit 200;

By Column Dimension

Website columns can be understood as a collection of topics related to the content of the website. In terms of domain names, different columns will have different secondary directories. For example, a website's website is www.xxxx.cn, and its columns can be accessed as follows:

Column dimension:. / job

Column dimension:. / news

Column dimension:. / sports

Column dimension:. / technology

Then the access column can be parsed according to the user request url, and then statistical analysis can be carried out according to the column.

By referer dimension
--------

--Hourly visits url Produced pv amount

drop table dw_pvs_referer_everyhour;

create table dw_pvs_referer_everyhour(referer_url string,referer_host string,month string,day string,hour string,pv_referer_cnt bigint) partitioned by(datestr string);

insert into table dw_pvs_referer_everyhour partition(datestr='20130918')

select http_referer,ref_host,month,day,hour,count(1) as pv_referer_cnt

from ods_weblog_detail

group by http_referer,ref_host,month,day,hour

having ref_host is not null

order by hour asc,day asc,month asc,pv_referer_cnt desc;
 -------- 
--Hourly visits host Generative pv Number and Sort

drop table dw_pvs_refererhost_everyhour;

create table dw_pvs_refererhost_everyhour(ref_host string,month string,day string,hour string,ref_host_cnts bigint) partitioned by(datestr string);

insert into table dw_pvs_refererhost_everyhour partition(datestr='20130918')

select ref_host,month,day,hour,count(1) as ref_host_cnts

from ods_weblog_detail

group by ref_host,month,day,hour

having ref_host is not null

order by hour asc,day asc,month asc,ref_host_cnts desc;
 -------- 

Note: It can also be calculated according to the source area dimension and the visitor terminal dimension.
1.2. Per capita browsing

Requirement Description: Statistics the average number of pages requested by all visitors today.

Per capita browsing volume is also called per capita page number, which can explain the stickiness of the website to users.

Per capita page views indicate the average number of times a user visits a page at a given time.

Calculating method: total page requests / total number of duplicated pages

Remote_addr represents different users. We can first count the pv of different remote_addr, then sum all pv as the total number of page requests, and count all remote_addr as the total number of weightless.

--Total number of page requests/Total weight loss

drop table dw_avgpv_user_everyday;

create table dw_avgpv_user_everyday(

day string,

avgpv string);

insert into table dw_avgpv_user_everyday

select '20130918',sum(b.pvs)/count(b.remote_addr) from

(select remote_addr,count(1) as pvs from ods_weblog_detail where datestr='20130918' group by remote_addr) b;

1.3. TOPN (grouped TOP) is the largest source of total pv.

Requirement Description: The top N (topN) with the largest number of pvs generated by visiting host s per hour is counted.

row_number() function

Ba grammar: row_number() over (partition by XXX order by xxxx) rank, rank is the alias of the group, equivalent to adding a field named rank.

partition by is used for grouping, for example, according to the sex field

order by is used for sorting within groups, for example, by sex and by age within groups.

After sorting, a number is returned for each grouping record in each grouping from the beginning.

If you click a data in a group, you can use a grammar like where table name. rank > x to fetch it.

The following statement ranks the number of visits per hour in reverse order:

select ref_host,ref_host_cnts,concat(month,day,hour),

row_number() over (partition by concat(month,day,hour) order by ref_host_cnts desc) as od from dw_pvs_refererhost_everyhour;

The results are as follows:

According to the function of row_number mentioned above, we can write hql to get the number of ref_host visits per hour topn.

drop table dw_pvs_refhost_topn_everyhour;

create table dw_pvs_refhost_topn_everyhour(

hour string,

toporder string,

ref_host string,

ref_host_cnts string

)partitioned by(datestr string);

insert into table dw_pvs_refhost_topn_everyhour partition(datestr='20130918')

select t.hour,t.od,t.ref_host,t.ref_host_cnts from

(select ref_host,ref_host_cnts,concat(month,day,hour) as hour,

row_number() over (partition by concat(month,day,hour) order by ref_host_cnts desc) as od

from dw_pvs_refererhost_everyhour) t where od<=3;

The results are as follows:

2. Interview analysis (from page perspective) 2.1. Visit statistics of each page

Mainly for data request s for statistical analysis, such as each page PV, each page UV and so on.

All of the above indicators are based on the field group by the page. For example:

--Statistical pages pv

select request as request,count(request) as request_counts from

ods_weblog_detail group by request having request is not null order by request_counts desc limit 20;
 -------- 

2.2. Hot Page Statistics

--Statistics of the hottest pages of the day top10

drop table dw_hotpages_everyday;

create table dw_hotpages_everyday(day string,url string,pvs string);

insert into table dw_hotpages_everyday

select '20130918',a.request,a.request_counts from

(select request as request,count(request) as request_counts from ods_weblog_detail where datestr='20130918' group by request having request is not null) a

order by a.request_counts desc limit 10;
 -------- 

3. Visitor Analysis 3.1. Independent Visitors

Requirement Description: Independent visitors and their generated pv are counted in terms of time dimension, such as hours.

For the identification of independent visitors, if there is a user ID in the original log, it is very good to implement according to the user ID; here, because there is no user ID in the original log, the simulation with the visitor IP is the same technically, but the accuracy is relatively low.

--Time Dimension: Time

drop table dw_user_dstc_ip_h;

create table dw_user_dstc_ip_h(

remote_addr string,

pvs      bigint,

hour     string);

insert into table dw_user_dstc_ip_h

select remote_addr,count(1) as pvs,concat(month,day,hour) as hour

from ods_weblog_detail

Where datestr='20130918'

group by concat(month,day,hour),remote_addr;

//Based on the results table, further statistics can be made, such as the total number of independent visitors per hour:

select count(1) as dstc_ip_cnts,hour from dw_user_dstc_ip_h group by hour;
--Time Dimension: Day

select remote_addr,count(1) as counts,concat(month,day) as day

from ods_weblog_detail

Where datestr='20130918'

group by concat(month,day),remote_addr;
--Time Dimension: Month

select remote_addr,count(1) as counts,month

from ods_weblog_detail

group by month,remote_addr;

3.2. Daily New Visitors

Demand: Statistics new visitors every day.

Realization idea: Create a cumulative table to re-visit visitors, and then compare the cumulative table with the daily visitors.

--Calendar Visitor Accumulation Form

drop table dw_user_dsct_history;

create table dw_user_dsct_history(

day string,

ip string

)

partitioned by(datestr string);

--Daily New Visitor List

drop table dw_user_new_d;

create table dw_user_new_d (

day string,

ip string

)

partitioned by(datestr string);

--Daily New Users Insert New Visitor List

insert into table dw_user_new_d partition(datestr='20130918')

select tmp.day as day,tmp.today_addr as new_ip from

(

select today.day as day,today.remote_addr as today_addr,old.ip as old_addr

from

(select distinct remote_addr as remote_addr,"20130918" as day from ods_weblog_detail where datestr="20130918") today

left outer join

dw_user_dsct_history old

on today.remote_addr=old.ip

) tmp

where tmp.old_addr is null;

--Daily New Users Added to Cumulative Table

insert into table dw_user_dsct_history partition(datestr='20130918')

select day,ip from dw_user_new_d where datestr='20130918';

Verify View:

select count(distinct remote_addr) from ods_weblog_detail;

select count(1) from dw_user_dsct_history where datestr='20130918';

select count(1) from dw_user_new_d where datestr='20130918';

Note: It can also be calculated according to the source area dimension and the visitor terminal dimension.

Visit Visit Analysis (Click Stream Model) 4.1. Back-to-Back/Single Visitor Statistics

Requirement: Query all returning visitors and their visits today.

Ideas for realization: Visitors who appear more than 1 times in the table above are returning visitors; on the contrary, they are single visitors.

drop table dw_user_returning;

create table dw_user_returning(

day string,

remote_addr string,

acc_cnt string)

partitioned by (datestr string);

insert overwrite table dw_user_returning partition(datestr='20130918')

select tmp.day,tmp.remote_addr,tmp.acc_cnt

from

(select '20130918' as day,remote_addr,count(session) as acc_cnt from ods_click_stream_visit group by remote_addr) tmp

where tmp.acc_cnt>1;

4.2. Per capita visits

Requirements: Calculate the average number of visit s per day for all users.

Total number of visit s / total number of users to de-duplicate

select sum(pagevisits)/count(distinct remote_addr) from ods_click_stream_visit where datestr='20130918';

5. Critical Path Conversion Rate Analysis (Funnel Model) 5.1. Demand Analysis

Conversion: In a specified business process, the number of completed steps and the percentage relative to the previous step.

5.2. Model design

Define the page identifier in the business process. The steps in the following example are as follows:

Step1, /item

Step2, /category

Step3, /index

Step4, /order

5.3. Development and Implementation

Query the tota l number of visitors per step

--Query the number of people in each step dw_oute_numbs

create table dw_oute_numbs as

select 'step1' as step,count(distinct remote_addr)  as numbs from ods_click_pageviews where datestr='20130920' and request like '/item%'

union

select 'step2' as step,count(distinct remote_addr)  as numbs from ods_click_pageviews where datestr='20130920' and request like '/category%'

union

select 'step3' as step,count(distinct remote_addr)  as numbs from ods_click_pageviews where datestr='20130920' and request like '/order%'

union

select 'step4' as step,count(distinct remote_addr)  as numbs from ods_click_pageviews where datestr='20130920' and request like '/index%';

Note: UNION sets the results of multiple SELECT statements into a separate result set.

l Query the proportion of the number of people at each step relative to the starting point of the path

Idea: Cascading queries, using self-join

-dw_oute_numbs Follow Yourself join

select rn.step as rnstep,rn.numbs as rnnumbs,rr.step as rrstep,rr.numbs as rrnumbs  from dw_oute_numbs rn

inner join

dw_oute_numbs rr;
--Number of people per step/Number of first steps==Proportion of the number of people at each step relative to the starting point

select tmp.rnstep,tmp.rnnumbs/tmp.rrnumbs as ratio

from

(

select rn.step as rnstep,rn.numbs as rnnumbs,rr.step as rrstep,rr.numbs as rrnumbs  from dw_oute_numbs rn

inner join

dw_oute_numbs rr) tmp

where tmp.rrstep='step1';

l Query the leak rate of each step relative to the previous step

--since join The table filters out the records that follow each step.

select rn.step as rnstep,rn.numbs as rnnumbs,rr.step as rrstep,rr.numbs as rrnumbs  from dw_oute_numbs rn

inner join

dw_oute_numbs rr

where cast(substr(rn.step,5,1) as int)=cast(substr(rr.step,5,1) as int)-1;
 -------- 
select tmp.rrstep as step,tmp.rrnumbs/tmp.rnnumbs as leakage_rate

from

(

select rn.step as rnstep,rn.numbs as rnnumbs,rr.step as rrstep,rr.numbs as rrnumbs  from dw_oute_numbs rn

inner join

dw_oute_numbs rr) tmp

where cast(substr(tmp.rnstep,5,1) as int)=cast(substr(tmp.rrstep,5,1) as int)-1;
 -------- 

l Summarize the above two indicators

select abs.step,abs.numbs,abs.rate as abs_ratio,rel.rate as leakage_rate

from

(

select tmp.rnstep as step,tmp.rnnumbs as numbs,tmp.rnnumbs/tmp.rrnumbs as rate

from

(

select rn.step as rnstep,rn.numbs as rnnumbs,rr.step as rrstep,rr.numbs as rrnumbs  from dw_oute_numbs rn

inner join

dw_oute_numbs rr) tmp

where tmp.rrstep='step1'

) abs

left outer join

(

select tmp.rrstep as step,tmp.rrnumbs/tmp.rnnumbs as rate

from

(

select rn.step as rnstep,rn.numbs as rnnumbs,rr.step as rrstep,rr.numbs as rrnumbs  from dw_oute_numbs rn

inner join

dw_oute_numbs rr) tmp

where cast(substr(tmp.rnstep,5,1) as int)=cast(substr(tmp.rrstep,5,1) as int)-1

) rel

on abs.step=rel.step;

Keywords: Big Data Windows hive SQL Session

Added by davey10101 on Fri, 23 Aug 2019 06:58:37 +0300