hiveSQL advanced -- 10__ Tests whether a value exists in a set of data

0 - problem description

SQL tests whether there is a value in a group of data. You want to generate a Boolean value according to whether a group of row records contain a specific value. Imagine an example where a student will take several exams over a period of time. Suppose he takes three exams every three months. As long as he has passed any one, a flag will be returned to indicate that he has passed the exam. If he has failed in all three games, a flag will also be returned to indicate that he has failed the exam.

The following data are available:

create table v
as
select 1 student_id,
       1 test_id,
       2 grade_id,
       1 period_id,
       '2005-02-01' test_date,
       0 pass_fail
union all
select 1, 2, 2, 1, '2005-03-01', 1 
union all
select 1, 3, 2, 1, '2005-04-01', 0 
union all
select 1, 4, 2, 2, '2005-05-01', 0 
union all
select 1, 5, 2, 2, '2005-06-01', 0 
union all
select 1, 6, 2, 2, '2005-07-01', 0 

View data:

hive> select * from v;
OK
1	1	2	1	2005-02-01	0
1	2	2	1	2005-03-01	1
1	3	2	1	2005-04-01	0
1	4	2	2	2005-05-01	0
1	5	2	2	2005-06-01	0
1	6	2	2	2005-07-01	0
Time taken: 0.275 seconds, Fetched: 6 row(s)
 

Carefully observing the above result set, we can see that the student took a total of 6 exams in two semesters. He passed one of them (1 means "pass" and 0 means "fail"), so his academic performance in the first semester was passed. He was in the second semester (the next 3 months) did not pass any exam, so the PASS_FAIL column of the three exams is 0. You want to return a result set to indicate whether the student passed the exam in a semester. Finally, you want to get the result set shown below.

STUDENT_ID TEST_ID GRADE_ID PERIOD_ID TEST_DATE   METREQ IN_PROGRESS
---------- ------- -------- --------- ----------- ------ -----------
         1       1        2         1 2005-02-01      +           0
         1       2        2         1 2005-03-01      +           0
         1       3        2         1 2005-04-01      +           0
         1       4        2         2 2005-05-01      -           0
         1       5        2         2 2005-06-01      -           0
         1       6        2         2 2005-07-01      -           1

The values of METREQ (indicating whether to pass) are "+" and "-", indicating whether the student has passed at least one exam in a semester (3 months). If a student has passed at least one exam in a semester, the IN_PROGRESS value is 0. If he has not passed any exam, the IN_PROGRESS value corresponding to the last exam he took should be 1.

Requirements: the value in METQER is aligned with the last character of the field, the same as in_ The progress field value is also aligned with the last character.

1 - problem solving

  • Through the problem description, we know that the essence of this problem is to add two columns on the basis of the original table as auxiliary judgment. Therefore, based on previous experience, we need to complete it with the help of window function construction.

  • The first column METQER: indicates whether students have passed at least one exam in a semester (3 months). The keyword is at least, which means that as long as one exam is passed, because pass_fail has only 0 and 1, we only need to require the value of max (pass_fail) over (partition by student_id, grade_id, period_id). As long as the maximum value is 1, it means passing, At this time, all the values of METQER are set to +. If it is 0, it is set to-

  • Second column IN_PROGRESS: if a student has passed at least one exam in a semester, in_ The progress value is 0. If he fails to pass any exam, the in corresponding to the last exam he took_ The progress value should be 1. Here, whether the test passes or not is solved by max() over(), but the trouble is that it is the last test, so how to judge whether it is the last test? Judging by time, max (test_date) over (partition by student_id, grade_id, period_id) represents the last test of the candidate in the same period.

  • With the above analysis, we analyze the following with the help of window function:
    (1) Obtain judgment conditions

select *
      ,max(pass_fail) over(partition by student_id,grade_id,period_id)
      ,max(test_date) over(partition by student_id,grade_id,period_id) 
from v
OK
1	2	2	1	2005-03-01	1	1	2005-04-01
1	1	2	1	2005-02-01	0	1	2005-04-01
1	3	2	1	2005-04-01	0	1	2005-04-01
1	4	2	2	2005-05-01	0	0	2005-07-01
1	5	2	2	2005-06-01	0	0	2005-07-01
1	6	2	2	2005-07-01	0	0	2005-07-01

(2) Make conditional judgment according to (1) result set to obtain the final result.

For the last requirement in the title, we use the left complement function lpad() to align the last character of the field. The specific SQL is as follows

select student_id
      ,test_id
      ,grade_id
      ,period_id
      ,test_date
      ,if(max_p=1,lpad('+',6,' '),lpad('-',6,' ')) as metreq 
      ,if( max_p=1,lpad('0',11,' ')
                    ,if(datediff(last_d,test_date)=0
                                 ,lpad('1',11,' '),lpad('0',11,' '))) as in_progress
from(
     select *
            ,max(pass_fail) over(partition by student_id,grade_id,period_id) as max_p --Condition 1
            ,max(test_date) over(partition by student_id,grade_id,period_id) as last_d --Condition 2
     from v
) m
+-------------+----------+-----------+------------+-------------+---------+--------------+--+
| student_id  | test_id  | grade_id  | period_id  |  test_date  | metreq  | in_progress  |
+-------------+----------+-----------+------------+-------------+---------+--------------+--+
| 1           | 1        | 2         | 1          | 2005-02-01  |      +  |           0  |
| 1           | 2        | 2         | 1          | 2005-03-01  |      +  |           0  |
| 1           | 3        | 2         | 1          | 2005-04-01  |      +  |           0  |
| 1           | 4        | 2         | 2          | 2005-05-01  |      -  |           0  |
| 1           | 5        | 2         | 2          | 2005-06-01  |      -  |           0  |
| 1           | 6        | 2         | 2          | 2005-07-01  |      -  |           1  |
+-------------+----------+-----------+------------+-------------+---------+--------------+--+

2 - Summary

This paper analyzes whether there is a value in a group of data through window function, and uses the technique of window function to construct auxiliary judgment conditions to solve this kind of problem, in which the technique of lpad() function is used for numerical alignment.

Keywords: Big Data hive SQL Interview Data Warehouse

Added by Satria Ox41464b on Thu, 16 Dec 2021 13:58:28 +0200