Application Statistics of ant forest plants
Create two tables
user_low_carbon: it records the user's daily low-carbon life in ant forest
plant_carbon: ant forest plant exchange form, which is used to record the carbon emission reduction required to apply for environmental protection plants
Table structure
Table 1
table_name: user_low_carbon
user_id data_dt low_carbon
Carbon emission received on user's date (g)
Table II
table_name: plant_carbon
plant_id plant_name low_carbon
Plant number plant name carbon needed to buy plants
Table 1
create table if not exists user_low_carbon(user_id string,date_dt string,low_carbon int) row format delimited fields terminated by ' '
Table II
create table if not exists user_low_carbon(user_id string,date_dt string,low_carbon int) row format delimited fields terminated by ' '
Import data
//user_low_carbon.txt
u_001 2017/1/1 10 u_001 2017/1/2 150 u_001 2017/1/2 110 u_001 2017/1/2 10 u_001 2017/1/4 50 u_001 2017/1/4 10 u_001 2017/1/6 45 u_001 2017/1/6 90 u_002 2017/1/1 10 u_002 2017/1/2 150 u_002 2017/1/2 70 u_002 2017/1/3 30 u_002 2017/1/3 80 u_002 2017/1/4 150 u_002 2017/1/5 101 u_002 2017/1/6 68 ......
load data local inpath '/home/dt/user_low_carbon.txt' into table user_low_carbon
//plant_carbon.txt
p001 Haloxylon ammodendron 17 p002 Salix psammophila 19 p003 Camphor tree 146 p004 Populus euphratica 215
LOAD DATA LOCAL INPATH '/home/dt/plant_carbon.txt' INTO TABLE plant_carbon
Topic 1
Question: suppose that the user_low_carbon data is recorded on January 1, 2017, and that all users who meet the application conditions have applied for a p004 Populus euphratica before October 1, 2017,
All the remaining energy is used to receive "p002- Salix psammophila".
Statistics of the top 10 users who applied for "p002 Salix" on October 1; And he received several more Salix than the latter.
The statistical results obtained are as follows:
Result example
user_id plant_count less_count(He received several more Salix than the latter) u_101 1000 100 u_088 900 400 u_103 500 ...
sql statement
from( select t3.user_id user_id,floor(t3.residue/t4.low_carbon) as shaliunum from ( select t1.user_id,t1.sumres-t2.low_carbon as residue from (SELECT a.user_id,sum(a.low_carbon) as sumres -- OVER(PARTITION BY a.user_id) from user_low_carbon a group by a.user_id) as t1,(select low_carbon from plant_carbon where plant_id="p004") as t2) as t3,(select low_carbon from plant_carbon where plant_id="p002") as t4 order by shaliunum desc) as tt select tt.user_id,tt.shaliunum,tt.shaliunum-lead(shaliunum,1,0) over() diff,row_number() over() rownum )
result
Topic 2
Ranking analysis of low carbon users in ant forest
Question: query user_ low_ Daily running records in carbon table, with the following conditions:
In 2017, in three consecutive days (or more),
Users who reduce carbon emissions (low_carbon) by more than 100g per day have low-carbon water.
You need to query and return the user that meets the above conditions_ low_ Record flow in carbon table.
For example, user u_002 the qualified records are as follows, because the sum of carbon emissions for four consecutive days from January 2, 2017 to January 5, 2017 is greater than or equal to 100g:
Result example
seq(key) user_id data_dt low_carbon xxxxx10 u_002 2017/1/2 150 xxxxx11 u_002 2017/1/2 70 xxxxx12 u_002 2017/1/3 30 xxxxx13 u_002 2017/1/3 80 xxxxx14 u_002 2017/1/4 150 xxxxx14 u_002 2017/1/5 101
The first step is to find the number of days in three consecutive days (or more),
Users who reduce carbon emissions by more than 100g per day.
sql statement
from(--Find the total amount of carbon in the first two days of each day from( -- duplicate removal from( -- Total carbon per user per day select user_id,date_dt,sum(low_carbon) over(partition by date_dt) carbon_byday from user_low_carbon )as t select user_id,date_dt,collect_set(carbon_byday)\[0\] carbon_byday group by user_id,date_dt ) as t1 select t1.user_id,t1.date_dt,t1.carbon_byday,lag(carbon_byday,1,0) over(partition by user_id) as carbon_lastday, lag(carbon_byday,2,0) over(partition by user_id) as carbon_last2day, lag(date_dt,1,0) over(partition by user_id) as lastday, lag(date_dt,2,0) over(partition by user_id) as last2day )as t2 select * --More than 100 in three days where t2.carbon_lastday>=100 and t2.carbon_byday>=100 and t2.carbon_last2day>=100 --See if the three days are continuous and datediff(t2.date_dt,t2.lastday)=1 and datediff(t2.date_dt,t2.last2day)=2
Operation results
The second step is to connect with the original table to calculate the flow records of users who have exceeded 100g for three consecutive days
sql statement
from(from(--Find the total amount of carbon in the first two days of each day from( -- duplicate removal from( -- Total carbon per user per day select user_id,date_dt,sum(low_carbon) over(partition by date_dt) carbon_byday from user_low_carbon )as t select user_id,date_dt,collect_set(carbon_byday)[0] carbon_byday group by user_id,date_dt ) as t1 select t1.user_id,t1.date_dt,t1.carbon_byday,lag(carbon_byday,1,0) over(partition by user_id) as carbon_lastday, lag(carbon_byday,2,0) over(partition by user_id) as carbon_last2day ,lag(date_dt,1,0) over(partition by user_id) as l1, lag(date_dt,2,0) over(partition by user_id) as l2 )as t2 select distinct(t2.user_id) where t2.carbon_lastday>=100 and t2.carbon_byday>=100 and t2.carbon_last2day>=100 and (substr(t2.l1,8)+1)=substr(t2.date_dt,8) and (substr(t2.l2,8)+2)=substr(t2.date_dt,8) ) as t3,user_low_carbon select user_low_carbon.* where t3.user_id=user_low_carbon.user_id
Operation results
Topic 3
Find the flow records of stores with sales records for three consecutive days
raw data
A,2017-10-11,300 A,2017-10-12,200 A,2017-10-13,100 A,2017-10-15,100 A,2017-10-16,300 A,2017-10-17,150 A,2017-10-18,340 A,2017-10-19,360 B,2017-10-11,400 B,2017-10-12,200 B,2017-10-15,600 C,2017-10-11,350 C,2017-10-13,250 C,2017-10-14,300 C,2017-10-15,400 C,2017-10-16,200 D,2017-10-13,500 E,2017-10-14,600 E,2017-10-15,500 D,2017-10-14,600
Problem solution
CREATE TABLE IF NOT EXISTS xs(dp STRING,dt STRING,xl int) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOAD DATA LOCAL INPATH '/home/dt/xs.txt' INTO TABLE xs
The first step is to find out the stores with sales records for three consecutive days
sql statement
from( from( from( select dp,dt,sum(xl) over(partition by dt) xl_byday from xs )as t select dp,dt,collect_set(xl_byday)[0] xl_byday group by dp,dt ) as t1 select t1.dp,t1.dt,t1.xl_byday,lag(xl_byday,1,0) over(partition by dp) as xl_lastday, lag(xl_byday,2,0) over(partition by dp) as xl_last2day, lag(dt,1,0) over(partition by dp) as ldt, lag(dt,2,0) over(partition by dp) as last2day )as t2 select * --There are records in three days where t2.xl_lastday>0 and t2.xl_byday>0 and t2.xl_last2day>0 --See if the three days are continuous and datediff(t2.dt,t2.ldt)=1 and datediff(t2.dt,t2.last2day)=2
Operation results
The second step is to connect with the original table to find the flow records of stores with sales records for three consecutive days
sql statement
from(from( from( from( -- Total carbon per user per day select dp,dt,sum(xl) over(partition by dt) xl_byday from xs )as t select dp,dt,collect_set(xl_byday)[0] xl_byday group by dp,dt ) as t1 select t1.dp,t1.dt,t1.xl_byday,lag(xl_byday,1,0) over(partition by dp) as xl_lastday, lag(xl_byday,2,0) over(partition by dp) as xl_last2day, lag(dt,1,0) over(partition by dp) as ldt, lag(dt,2,0) over(partition by dp) as last2day )as t2 select distinct(t2.dp) where t2.xl_lastday>0 and t2.xl_byday>0 and t2.xl_last2day>0 and datediff(t2.dt,t2.ldt)=1 and datediff(t2.dt,t2.last2day)=2) as t3,xs select xs.* where t3.dp=xs.dp
Operation results