On Hive advanced functions

Basic operation of function

  • View the description information of the specified function: desc function function name;
  • Display function extension content: desc function extended function name;

Typical advanced functions

Group sorting takes TopN

To implement the grouping sorting function, you need to use row_number and over functions. row_ The number and over functions are usually used together.

  • row_number will number the data, starting with 1
  • over can be understood as dividing the data into a window, where partition by can be added to indicate that the data is grouped according to fields, and order by can also be added to indicate that the data in each group is sorted according to a certain field

Example of group sorting using TopN

Demand: there is a student's test score information, including Chinese, mathematics and English. Now it is necessary to calculate the names of the top three students in a single subject in the class

  1. Test data: student_score.data
id      name    sub     score
1       zs1     chinese 80
2       zs1     math    90
3       zs1     english 89
4       zs2     chinese 60
5       zs2     math    75
6       zs2     english 80
7       zs3     chinese 79
8       zs3     math    83
9       zs3     english 72
10      zs4     chinese 90
11      zs4     math    76
12      zs4     english 80
13      zs5     chinese 98
14      zs5     math    80
15      zs5     english 70

  1. Build table
create external table student_score(
id int,
name string,
sub string,
score int
)row format delimited
fields terminated by '\t'
location '/data/student_score';
  1. Import data
load data local inpath '/data/soft/hivedata/student_score.data' into table student_score
  1. Query data_ Grouping sorting
 select * ,row_number() over(partition by sub order by score desc)as num from student_score;

13      zs5     chinese 98      1
10      zs4     chinese 90      2
1       zs1     chinese 80      3
7       zs3     chinese 79      4
4       zs2     chinese 60      5
3       zs1     english 89      1
6       zs2     english 80      2
12      zs4     english 80      3
9       zs3     english 72      4
15      zs5     english 70      5
2       zs1     math    90      1
8       zs3     math    83      2
14      zs5     math    80      3
11      zs4     math    76      4
5       zs2     math    75      5
  • Query data_ Get the top three
 select * from (
               select *, row_number() over (partition by sub order by score desc) as num 
               from student_score 
               ) s where s.num<=3>;
               
13      zs5     chinese 98      1
10      zs4     chinese 90      2
1       zs1     chinese 80      3
3       zs1     english 89      1
6       zs2     english 80      2
12      zs4     english 80      3
2       zs1     math    90      1
8       zs3     math    83      2
14      zs5     math    80      3

[note]

  • row_number() over() is the normal sort
  • rank() over() is a skip sort. When there are two first places, the next is the third place (in each group)
  • dense_rank() over() is a continuous sort. When there are two first names, it is still followed by the second name (within each group)

Row to column

Row to column is to convert multiple rows of data into one column of data. For row column conversion, concat is mainly used_ WS(),COLLECT_SET() ,COLLECT_LIST() function.

  • CONCAT_WS() function can splice the values of multiple fields according to the specified delimiter, and finally convert them into a string with delimiter.
  • COLLECT_SET() returns a set set. The elements in the set are not repeated. It is generally used together with group by
  • COLLECT_LIST() returns a list set, and the elements in the set will be repeated. It is generally used in combination with group by

Row to column example

  1. Test data: student_favors.data
zs      swing
zs      footbal
zs      sing
zs      codeing
zs      swing
  1. Build table
create external table student_favors(
name string,
favor string
)row format delimited
fields terminated by '\t'
location '/data/student_favors';
  1. Import data
load data local inpath '/data/soft/hivedata/student_favors.data' into table student_favors
  1. Column to row
select 
      name,concat_ws(',',collect_set(favor)) as favors 
from 
      student_favors
group by name;

name    favors
zs      swing,footbal,sing,codeing

[note] if the group by command is not added, a SemanticException error will be reported.

Column to row

Column to row conversion mainly uses SPLIT(), expand() and final view functions.

  • The Split function accepts a string character and cutting rules, uses the cutting rules to cut the data in the string, and finally returns an array
  • Expand (array): means to convert each element in the array into a row
  • Expand (map): means to convert each key value pair in the map into a row. Key is a column and value is a column
  • lateral view can generate a virtual table that supports aliases

Column to row example

  1. Test data: student_favors2.data
zs      swing,footbal,sing
ls      codeing,swing
  1. Build table
hive (default)> create external table student_favors_2(
              > name string,
              > favorList string
              > )row format delimited
              > fields terminated by '\t'
              > location '/data/student_favors_2';
  1. Import data
load data local inpath '/data/soft/hivedata/student_favors2.data' into table student_favors_2
  1. Row to column
# First, use split to cut the favorlist field
# Then use explode to operate the data
# Use the lateral view to take the data returned by expand as a virtual table, named table1, and then name the column of data in the table favor_new
hive (default)> SELECT
              >      name,favor_new
              > FROM
              >      student_favors_2 
              > lateral view
              >      explode(split(favorList,',')) table1 as favor_new;
OK
name    favor_new
zs      swing
zs      footbal
zs      sing
ls      codeing
ls      swing

Sorting function

  • ORDER BY: perform a global sort on the query results. When using this statement, there is only one reduce task generated
  • SORT BY: local sort
  • Distribute by: only the data will be partitioned according to the specified key, but will not be sorted. In general, it can be used in combination with sort by to partition the data first and then sort it. When the two are used together, distribute by must be written before sort by
  • Cluster by: the function of cluster by is the short form of distribution by and sort by, that is, cluster by id equals distribution by ID sort by ID

[note] the columns specified by cluster by can only be in ascending order, and asc and desc cannot be specified

Grouping and de duplication function

  • GROUP BY: group the data according to the specified field
  • DISTINCT: de duplicates the duplicate values of the specified fields in the data

realization

First: select count(distinct name) from order
 Second: select count(tmp.name) from (select name from order group by name) tmp

analysis

First: use distinct Will put all name all shuffle To one reducer Inside, low performance
 Second: first name Grouping, because grouping is actually de duplication. At this time, it can be calculated in parallel and then calculated count Just enough, high performance.

reference

Muke network https://www.imooc.com/wiki/BigData

Keywords: Hadoop hive Data Warehouse

Added by alexhard on Fri, 14 Jan 2022 10:47:21 +0200