Basic function operations
- View the description of a specified function: desc function <function name>;
- Show a function's extended description: desc function extended <function name>;
Typical advanced functions
Grouped sorting to get Top N
To implement grouped sorting you need the row_number and over functions, which are usually used together.
- row_number numbers the rows, starting from 1
- over can be understood as dividing the data into windows; partition by can be added to group the data by a field, and order by can be added to sort the data within each group by a field
Example: grouped sorting for Top N
Requirement: given students' exam scores in Chinese, math, and English, find the names of the top three students in each subject.
- Test data: student_score.data
id name sub score
1 zs1 chinese 80
2 zs1 math 90
3 zs1 english 89
4 zs2 chinese 60
5 zs2 math 75
6 zs2 english 80
7 zs3 chinese 79
8 zs3 math 83
9 zs3 english 72
10 zs4 chinese 90
11 zs4 math 76
12 zs4 english 80
13 zs5 chinese 98
14 zs5 math 80
15 zs5 english 70
- Build table
create external table student_score(
  id int,
  name string,
  sub string,
  score int
)row format delimited
fields terminated by '\t'
location '/data/student_score';
- Import data
load data local inpath '/data/soft/hivedata/student_score.data' into table student_score;
- Query data: grouped sorting
select *, row_number() over(partition by sub order by score desc) as num from student_score;
13 zs5 chinese 98 1
10 zs4 chinese 90 2
1 zs1 chinese 80 3
7 zs3 chinese 79 4
4 zs2 chinese 60 5
3 zs1 english 89 1
6 zs2 english 80 2
12 zs4 english 80 3
9 zs3 english 72 4
15 zs5 english 70 5
2 zs1 math 90 1
8 zs3 math 83 2
14 zs5 math 80 3
11 zs4 math 76 4
5 zs2 math 75 5
- Query data: get the top three
select * from (
  select *, row_number() over (partition by sub order by score desc) as num
  from student_score
) s where s.num<=3;
13 zs5 chinese 98 1
10 zs4 chinese 90 2
1 zs1 chinese 80 3
3 zs1 english 89 1
6 zs2 english 80 2
12 zs4 english 80 3
2 zs1 math 90 1
8 zs3 math 83 2
14 zs5 math 80 3
[note]
- row_number() over() assigns consecutive, unique ranks
- rank() over() is a skipping rank: when there are two first places, the next rank is third (within each group)
- dense_rank() over() is a continuous rank: when there are two first places, the next rank is still second (within each group)
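A sketch query contrasting the three functions side by side on the same window. On the sample data above, the two tied english scores of 80 illustrate the difference: row_number gives them 2 and 3, rank gives both 2 with the next row ranked 4, and dense_rank gives both 2 with the next row ranked 3.

```sql
select id, name, sub, score,
       row_number() over (partition by sub order by score desc) as rn,
       rank()       over (partition by sub order by score desc) as rk,
       dense_rank() over (partition by sub order by score desc) as drk
from student_score;
```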
Row to column
Row-to-column conversion combines multiple rows of data into a single column (one field). It mainly uses the CONCAT_WS(), COLLECT_SET(), and COLLECT_LIST() functions.
- CONCAT_WS() joins the values of multiple fields with a specified delimiter, producing a single delimited string.
- COLLECT_SET() returns a set; its elements are deduplicated. It is generally used together with group by.
- COLLECT_LIST() returns a list; duplicate elements are kept. It is also generally used together with group by.
Row to column example
- Test data: student_favors.data
zs swing
zs footbal
zs sing
zs codeing
zs swing
- Build table
create external table student_favors(
  name string,
  favor string
)row format delimited
fields terminated by '\t'
location '/data/student_favors';
- Import data
load data local inpath '/data/soft/hivedata/student_favors.data' into table student_favors;
- Row to column
select name, concat_ws(',', collect_set(favor)) as favors from student_favors group by name;
name favors
zs swing,footbal,sing,codeing
[note] if group by is omitted here, Hive reports a SemanticException.
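Because the sample data contains swing twice for zs, swapping collect_list in for collect_set would keep the duplicate; a sketch (the order of elements inside the list is not guaranteed):

```sql
-- collect_list keeps both occurrences of swing,
-- e.g. producing swing,footbal,sing,codeing,swing
select name, concat_ws(',', collect_list(favor)) as favors
from student_favors
group by name;
```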
Column to row
Column-to-row conversion mainly uses the SPLIT(), EXPLODE(), and LATERAL VIEW functions.
- SPLIT() accepts a string and a delimiter, splits the string by the delimiter, and returns an array
- explode(array) converts each element of the array into a row
- explode(map) converts each key/value pair of the map into a row, with the key in one column and the value in another
- lateral view generates a virtual table from the exploded data and supports aliasing the table and its columns
Column to row example
- Test data: student_favors2.data
zs swing,footbal,sing
ls codeing,swing
- Build table
create external table student_favors_2(
  name string,
  favorList string
)row format delimited
fields terminated by '\t'
location '/data/student_favors_2';
- Import data
load data local inpath '/data/soft/hivedata/student_favors2.data' into table student_favors_2;
- Column to row
# First, use split to cut the favorList field
# Then use explode on the resulting array
# Use lateral view to treat the data returned by explode as a virtual table named table1, and name its column favor_new
select name, favor_new
from student_favors_2
lateral view explode(split(favorList,',')) table1 as favor_new;
OK
name favor_new
zs swing
zs footbal
zs sing
ls codeing
ls swing
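explode(map) works analogously, as described above; a hypothetical sketch using an inline map literal (assuming a Hive version that allows SELECT without FROM, e.g. 0.13+):

```sql
-- Each key/value pair becomes one row; k and v are the column aliases,
-- so this yields two rows: (chinese, 80) and (math, 90).
select explode(map('chinese', 80, 'math', 90)) as (k, v);
```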
Sorting function
- ORDER BY: performs a global sort of the query results; only one reduce task is generated
- SORT BY: local sort within each reducer
- DISTRIBUTE BY: only partitions the data by the specified key, without sorting. It is usually combined with sort by to partition first and then sort; when the two are used together, distribute by must come before sort by
- CLUSTER BY: shorthand for distribute by plus sort by on the same column, i.e. cluster by id is equivalent to distribute by id sort by id
[note] the column specified by cluster by can only be sorted ascending; asc and desc cannot be specified
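A sketch contrasting these clauses on the student_score table. To make the local/global distinction visible, more than one reducer must be configured (e.g. set mapreduce.job.reduces=2;):

```sql
-- Global sort: all rows pass through a single reducer.
select * from student_score order by score desc;

-- Partition rows by subject, then sort within each reducer only.
select * from student_score distribute by sub sort by score desc;

-- Shorthand for: distribute by id sort by id (ascending only).
select * from student_score cluster by id;
```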
Grouping and deduplication functions
- GROUP BY: groups the data by the specified fields
- DISTINCT: deduplicates repeated values of the specified fields
Implementation
First: select count(distinct name) from order
Second: select count(tmp.name) from (select name from order group by name) tmp
Analysis
First: distinct shuffles all name values to a single reducer, so performance is low.
Second: grouping by name first (grouping effectively deduplicates) can be computed in parallel across reducers, and count is then applied, so performance is high.
Reference
imooc https://www.imooc.com/wiki/BigData