Statistical topics on ant forest plant application (Hive example)

Application Statistics of ant forest plants

Create two tables
user_low_carbon: it records the user's daily low-carbon life in ant forest
plant_carbon: ant forest plant exchange form, which is used to record the carbon emission reduction required to apply for environmental protection plants

Table structure

Table 1
table_name: user_low_carbon
user_id data_dt low_carbon
Carbon emission received on user's date (g)
Table II
table_name: plant_carbon
plant_id plant_name low_carbon
Plant number plant name carbon needed to buy plants

Table 1

create table if not exists user_low_carbon(user_id string,date_dt string,low_carbon int)
row format delimited fields terminated by ' '

Table II

create table if not exists user_low_carbon(user_id string,date_dt string,low_carbon int)
row format delimited fields terminated by ' '

Import data
//user_low_carbon.txt

u_001 2017/1/1 10
u_001 2017/1/2 150
u_001 2017/1/2 110
u_001 2017/1/2 10
u_001 2017/1/4 50
u_001 2017/1/4 10
u_001 2017/1/6 45
u_001 2017/1/6 90
u_002 2017/1/1 10
u_002 2017/1/2 150
u_002 2017/1/2 70
u_002 2017/1/3 30
u_002 2017/1/3 80
u_002 2017/1/4 150
u_002 2017/1/5 101
u_002 2017/1/6 68
......
load data local inpath '/home/dt/user_low_carbon.txt' into table user_low_carbon

//plant_carbon.txt

p001 Haloxylon ammodendron 17
p002 Salix psammophila 19
p003 Camphor tree 146
p004 Populus euphratica 215
LOAD DATA LOCAL INPATH '/home/dt/plant_carbon.txt' INTO TABLE plant_carbon

Topic 1

Question: suppose that the user_low_carbon data is recorded on January 1, 2017, and that all users who meet the application conditions have applied for a p004 Populus euphratica before October 1, 2017,
All the remaining energy is used to receive "p002- Salix psammophila".
Statistics of the top 10 users who applied for "p002 Salix" on October 1; And he received several more Salix than the latter.
The statistical results obtained are as follows:

Result example

user_id  plant_count less_count(He received several more Salix than the latter)
u_101    1000         100
u_088    900          400
u_103    500          ...

sql statement

from(
select t3.user_id user_id,floor(t3.residue/t4.low_carbon) as shaliunum
from (
select t1.user_id,t1.sumres-t2.low_carbon as residue
from (SELECT a.user_id,sum(a.low_carbon) as sumres
-- OVER(PARTITION BY a.user_id) 
from user_low_carbon a
group by a.user_id) as t1,(select low_carbon
from plant_carbon
where plant_id="p004") as t2) as t3,(select low_carbon
from plant_carbon
where plant_id="p002") as t4
order by shaliunum desc) as tt
select tt.user_id,tt.shaliunum,tt.shaliunum-lead(shaliunum,1,0) over() diff,row_number() over() rownum
) 

result

Topic 2

Ranking analysis of low carbon users in ant forest
Question: query user_ low_ Daily running records in carbon table, with the following conditions:
In 2017, in three consecutive days (or more),
Users who reduce carbon emissions (low_carbon) by more than 100g per day have low-carbon water.
You need to query and return the user that meets the above conditions_ low_ Record flow in carbon table.
For example, user u_002 the qualified records are as follows, because the sum of carbon emissions for four consecutive days from January 2, 2017 to January 5, 2017 is greater than or equal to 100g:

Result example

seq(key) user_id data_dt  low_carbon
xxxxx10    u_002  2017/1/2  150
xxxxx11    u_002  2017/1/2  70
xxxxx12    u_002  2017/1/3  30
xxxxx13    u_002  2017/1/3  80
xxxxx14    u_002  2017/1/4  150
xxxxx14    u_002  2017/1/5  101

The first step is to find the number of days in three consecutive days (or more),
Users who reduce carbon emissions by more than 100g per day.
sql statement

from(--Find the total amount of carbon in the first two days of each day
  from(
    -- duplicate removal
    from(
    -- Total carbon per user per day
    select user_id,date_dt,sum(low_carbon) over(partition by date_dt) carbon_byday
    from user_low_carbon
    )as t
    select user_id,date_dt,collect_set(carbon_byday)\[0\] carbon_byday
    group by user_id,date_dt
  ) as t1
  select t1.user_id,t1.date_dt,t1.carbon_byday,lag(carbon_byday,1,0) over(partition by user_id) as carbon_lastday,
  lag(carbon_byday,2,0) over(partition by user_id) as carbon_last2day,
  lag(date_dt,1,0) over(partition by user_id) as lastday,
  lag(date_dt,2,0) over(partition by user_id) as last2day
)as t2
select * 
--More than 100 in three days
where t2.carbon_lastday>=100 and t2.carbon_byday>=100 and t2.carbon_last2day>=100
--See if the three days are continuous
and datediff(t2.date_dt,t2.lastday)=1 and datediff(t2.date_dt,t2.last2day)=2

Operation results

The second step is to connect with the original table to calculate the flow records of users who have exceeded 100g for three consecutive days
sql statement

from(from(--Find the total amount of carbon in the first two days of each day
  from(
    -- duplicate removal
    from(
    -- Total carbon per user per day
    select user_id,date_dt,sum(low_carbon) over(partition by date_dt) carbon_byday
    from user_low_carbon
    )as t
    select user_id,date_dt,collect_set(carbon_byday)[0] carbon_byday
    group by user_id,date_dt
  ) as t1
  select t1.user_id,t1.date_dt,t1.carbon_byday,lag(carbon_byday,1,0) over(partition by user_id) as carbon_lastday,
  lag(carbon_byday,2,0) over(partition by user_id) as carbon_last2day
  ,lag(date_dt,1,0) over(partition by user_id) as l1,
  lag(date_dt,2,0) over(partition by user_id) as l2
)as t2
select distinct(t2.user_id)
where t2.carbon_lastday>=100 and t2.carbon_byday>=100 and t2.carbon_last2day>=100
and (substr(t2.l1,8)+1)=substr(t2.date_dt,8) and (substr(t2.l2,8)+2)=substr(t2.date_dt,8)
) as t3,user_low_carbon
select user_low_carbon.*
where t3.user_id=user_low_carbon.user_id

Operation results

Topic 3
Find the flow records of stores with sales records for three consecutive days
raw data

A,2017-10-11,300
A,2017-10-12,200
A,2017-10-13,100
A,2017-10-15,100
A,2017-10-16,300
A,2017-10-17,150
A,2017-10-18,340
A,2017-10-19,360
B,2017-10-11,400
B,2017-10-12,200
B,2017-10-15,600
C,2017-10-11,350
C,2017-10-13,250
C,2017-10-14,300
C,2017-10-15,400
C,2017-10-16,200
D,2017-10-13,500
E,2017-10-14,600
E,2017-10-15,500
D,2017-10-14,600


Problem solution

CREATE TABLE IF NOT EXISTS xs(dp STRING,dt STRING,xl int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOAD DATA LOCAL INPATH '/home/dt/xs.txt' INTO TABLE xs

The first step is to find out the stores with sales records for three consecutive days
sql statement

from(
  from(
    from(
    select dp,dt,sum(xl) over(partition by dt) xl_byday
    from xs
    )as t
    select dp,dt,collect_set(xl_byday)[0] xl_byday
    group by dp,dt
    ) as t1
  select t1.dp,t1.dt,t1.xl_byday,lag(xl_byday,1,0) over(partition by dp) as xl_lastday,
  lag(xl_byday,2,0) over(partition by dp) as xl_last2day,
  lag(dt,1,0) over(partition by dp) as ldt,
  lag(dt,2,0) over(partition by dp) as last2day
  )as t2
select * 
--There are records in three days
where t2.xl_lastday>0 and t2.xl_byday>0 and t2.xl_last2day>0
--See if the three days are continuous
and datediff(t2.dt,t2.ldt)=1 and datediff(t2.dt,t2.last2day)=2

Operation results

The second step is to connect with the original table to find the flow records of stores with sales records for three consecutive days
sql statement

from(from(
  from(
    from(
    -- Total carbon per user per day
    select dp,dt,sum(xl) over(partition by dt) xl_byday
    from xs
    )as t
    select dp,dt,collect_set(xl_byday)[0] xl_byday
    group by dp,dt
    ) as t1
  select t1.dp,t1.dt,t1.xl_byday,lag(xl_byday,1,0) over(partition by dp) as xl_lastday,
  lag(xl_byday,2,0) over(partition by dp) as xl_last2day,
  lag(dt,1,0) over(partition by dp) as ldt,
  lag(dt,2,0) over(partition by dp) as last2day
  )as t2
select distinct(t2.dp)
where t2.xl_lastday>0 and t2.xl_byday>0 and t2.xl_last2day>0 and datediff(t2.dt,t2.ldt)=1 
and datediff(t2.dt,t2.last2day)=2) as t3,xs
select xs.*
where t3.dp=xs.dp

Operation results

Keywords: hive SQL

Added by TheSeeker on Sun, 30 Jan 2022 14:59:02 +0200