hive small case - Comprehensive windowing function, judgment statement, date conversion, time mean calculation


What needs to be done: a full scale to calculate the average start and end time of the task in seven days

1, Introduction data

The data table is a full synchronization table. The partition is based on the date. It contains the start time, end time, total seconds of start time (total seconds to the early morning), and total seconds of end time (total seconds to the early morning).

Partition value example: log_date = '20218', not '2022-02-18'

Now you need to find the average end and start time of each task in seven days

2, ER model

When analyzing a requirement, try to clarify the field relationship and the results we want. After the model is established, the idea will be opened. The ER model of this requirement is actually very clear. The data granularity is naturally a record of a task, and then it is nothing more than finding the average start time and average end time, and then judging the range of time, Then connect it

3, Train of thought combing

  1. First of all, there is no doubt that you need to use the windowing function
  2. Secondly, how to set the partition in the windowing function
  3. Because it is within seven days, you need to sort the data according to which field
  4. Because the partition format is' 20220218 ', if the date is automatically obtained, the format is' 2022-02-18', so how to convert
  5. The average time of seven days cannot be calculated directly from the sum of dates. How to average the time
  6. How to convert the obtained time average into time

We need to consider the above issues. When these issues are considered, we can write normal code

4, Problem solving

Solve the above small problems one by one

  1. Windowing function

    Recall the use of windowing function,

    Usage syntax:

    udf() over(partition by ... order by ... rows ... ... ...)
    

    The function we want to implement here is to take the average value, so it is

    avg(values) over(...)
    
  2. How to set partition in windowing function

    First of all, we need to understand the purpose we need to achieve. Our data granularity is a record of a task and the required seven day average value. Therefore, we need to enter the data of the same job into a window, so it is natural to partition according to the job

    avg(values) over(partition by job_id ... ... ...)
    
  3. How to sort the data

    Because there are a lot of data entering the window, but we only need to sort the data in the previous seven days in reverse order to get the data in the previous seven days, so we should sort according to our start time. Note here that if it is the start time, we use the start time; if it is the end time, we use the end time

    avg(values) over(partition by job_id order by start_time ...)
    
  4. How to convert date format

    If the date is obtained automatically, the default date format is' 2022-02-18 ', which is different from the partition requirements, so how to convert it

    At the beginning, I didn't find the corresponding method, that is, when I first convert it to time stamp, and then convert it from time stamp to time, I set it to 'yyyyMMdd'

    select from_unixtime(unix_timestamp('2022-02-18','yyyy-mm-dd'),'yyyymmdd')
    
  5. Average time

    We can't calculate it directly, such as' 2022-02-18 12:00:00 'and' 2022-02-17 12:00:00 '. The average value is obviously' 2022-02-18 00:00 ', which obviously doesn't meet the requirements. What we need should be "12:00:00". Therefore, when calculating the average value, we need to convert it to the total seconds to zero in the morning, and then calculate the average value

    hour(start_time)*3600 + minute(start_time)*60 + second(start_time)
    

    However, there is a field in the table that has recorded this data, so it can be used directly

  6. When we get the average of the total seconds, how can we convert it into time format

    For example, how can 3600s be converted to time '01:00:00'

    You can do this:

    concat(floor(36839/3600),':',floor((36839%3600)/60),':',floor((36839%3600)%60))
    

    This can be converted, but there is a problem. The time "12:00:00" will be parsed into "12:0:0", which is obviously not beautiful. Then you can use the judgment function to make up a 0 if it is one bit

    concat_ws(':',
                if(length(cast(floor(t2.avg_start_time/3600) as string)) = 1,concat('0',cast(floor(t2.avg_start_time/3600) as string)),cast(floor(t2.avg_start_time/3600) as string))
                ,if(length(cast(floor(t2.avg_start_time%3600/60) as string)) = 1,concat('0',cast(floor(t2.avg_start_time%3600/60) as string)),cast(floor(t2.avg_start_time%3600/60) as string)) 
                ,if(length(cast(t2.avg_start_time%3600%60 as string)) = 1,concat('0',cast(t2.avg_start_time%3600%60 as string)),cast(t2.avg_start_time%3600%60 as string))) start_time_week_avg, -- Seconds are converted to minutes and seconds
    

    However, in real production, it is suggested that this function can be encapsulated into udf. One is to simplify the code, and the other is that this function is really commonly used

5, Total sql preparation

When all the small problems are solved, sum them up

SELECT
    t2.jb job_id,
    t2.st start_time,
    t2.et end_time,
    t2.sts start_time_ss,
    t2.ets end_time_ss,
    case 
    when t2.avg_start_time<28800 then 1
    when t2.avg_start_time<36000 then 2
    else 3 end avg_start_time_range,
    case
    when t2.avg_end_time<28800 then 1
    when t2.avg_end_time<36000 then 2
    else 3 end avg_end_time_range,
    concat_ws(':',
            if(length(cast(floor(t2.avg_start_time/3600) as string)) = 1,concat('0',cast(floor(t2.avg_start_time/3600) as string)),cast(floor(t2.avg_start_time/3600) as string))
            ,if(length(cast(floor(t2.avg_start_time%3600/60) as string)) = 1,concat('0',cast(floor(t2.avg_start_time%3600/60) as string)),cast(floor(t2.avg_start_time%3600/60) as string)) 
            ,if(length(cast(t2.avg_start_time%3600%60 as string)) = 1,concat('0',cast(t2.avg_start_time%3600%60 as string)),cast(t2.avg_start_time%3600%60 as string))) start_time_week_avg, -- Seconds are converted to minutes and seconds
            
  concat_ws(':',
            if(length(cast(floor(t2.avg_end_time/3600) as string)) = 1,concat('0',cast(floor(t2.avg_end_time/3600) as string)),cast(floor(t2.avg_end_time/3600) as string))
            ,if(length(cast(floor(t2.avg_end_time%3600/60) as string)) = 1,concat('0',cast(floor(t2.avg_end_time%3600/60) as string)),cast(floor(t2.avg_end_time%3600/60) as string)) 
            ,if(length(cast(t2.avg_end_time%3600%60 as string)) = 1,concat('0',cast(t2.avg_end_time%3600%60 as string)),cast(t2.avg_end_time%3600%60 as string))) end_time_week_avg -- Seconds are converted to minutes and seconds
from
  (SELECT
      t1.job_id jb,
      t1.start_time st,
   	  t1.end_time et,
      t1.start_time_ss sts,
      t1.end_time_ss ets,
      floor(t1.avg_start) avg_start_time,
      floor(t1.avg_end) avg_end_time
  from
    (select
        job_id,
        start_time,
     	end_time,
        start_time_ss,
        end_time_ss,
        AVG(start_time_ss)
        over(partition by job_id order by start_time desc rows BETWEEN 7 preceding and current row) avg_start,
       AVG(end_time_ss)
          over(partition by job_id order by end_time desc rows BETWEEN 7 preceding and current row) avg_end
    from table_name
    where log_date='20220218') t1) t2;

6, Display of operation results

Keywords: Big Data hive Data Warehouse

Added by quicknik on Mon, 21 Feb 2022 16:58:37 +0200