[hard Hive] the four sorts in Hive are explained in detail, and the usage will no longer be confused

Welcome to the blog home page: https://blog.csdn.net/u013411339
Welcome to like, collect, leave messages, and exchange messages!
This article was originally written by [Wang Zhiwu] and started on CSDN blog!
This article is the first CSDN forum. It is strictly prohibited to reprint without the permission of the official and myself!

This article is right [hard big data learning route] learning guide for experts from zero to big data (fully upgraded version) The Hive section of.

Four sorts in Hive

Sorting is a common operation, especially in data analysis. We often need to sort the data. hive has four keywords related to sorting. Today we'll take a look at their functions.

Data preparation

Now we have a temperature data, tab segmentation

2008    32.0
2008    21.0
2008    31.5
2008    17.0
2013    34.0
2015    32.0
2015    33.0
2015    15.9
2015    31.0
2015    19.9
2015    27.0
2016    23.0
2016    39.9
2016    32.0

Create table and load data

create table ods_temperature(
    `year` int,
    temper float
)
row format delimited fields terminated by '\t';
load data local inpath '/Users/liuwenqiang/workspace/hive/temperature.data' overwrite into table ods_temperature;

1. Order by

order by will sort the input globally, so there is only one reducer (multiple reducers cannot ensure global order). However, there is only one reducer, which will consume a long calculation time when the input scale is large.

Descending order: desc
Ascending order: asc does not need to be specified. The default is ascending order

It should be noted that it is affected by hive.mapred.mode. In strict mode, limit must be used to limit the amount of sorted data. Because the amount of data is large, OOM or running time will be too long if there is only one reducer. Therefore, in strict mode, an error will be reported if limit is not applicable. Please refer to Hive's strict mode and local mode for more information.

Error: Error while compiling statement: FAILED: SemanticException 1:39 Order by-s without limit are disabled for safety reasons. If you know what you are doing, please set hive.strict.checks.orderby.no.limit to false and make sure that hive.mapred.mode is not set to 'strict' to proceed. Note that you may get errors or incorrect results if you make a mistake while using some of the unsafe features.. Error encountered near token 'year' (state=42000,code=40000)

Next, let's take a look at the sorting result of order by, select * from ods_temperature order by year;

2. Sort by

It is not a global sort. It completes the sort before the data enters the reducer, that is, it will generate a sorted file for each reducer before the data enters the reduce. Therefore, if sort by is used for sorting and MapReduce. Job. Reductions > 1 is set, sort by only ensures the output order of each reducer, not the global order.

It is not affected by the Hive.mapred.mode attribute. The data of sort by can only ensure that the data in the same reduce can be sorted by the specified field. Using sort by, you can specify the number of reductions to be performed (specified by set mapred.reduce.tasks=n), and then merge and sort the output data to get all the results.

set mapred.reduce.tasks=3;
select * from ods_temperature sort by year;

It seems that the above output is not in order. Haha, let's output the data to a file in another way, because we set the reduce number to 3, there should be three file outputs.

set mapred.reduce.tasks=3;
insert overwrite local directory '/Users/liuwenqiang/workspace/hive/sort' row format delimited fields terminated by '\t' select * from ods_temperature sort by year;

It can be seen that this is much clearer. We can see that the year in a partition does not agree, and the data of that year are available.

Execution efficiency of sort by and order by

First, let's look at a phenomenon. In general, we think that sort by should be faster than order by, because order by can only use one reducer to sort all, but it is not necessarily when the amount of data is relatively small, because the startup time of reducer may be much longer than that of data processing, as in the following example, order by is faster than sort by.

limt in sort by

You can use the limit clause in sort by to reduce the amount of data. After using limit n, the number of data records transmitted to the reduce end will be reduced to n * (the number of map s), that is, we use limit in sort by to limit the number of data in each reducer, then order by according to the sorting field of sort by, and finally return n pieces of data to the client, In other words, if you use the limit clause in sort by, you will still use order by for the final sorting.

The use of limit in order by is to limit the sorted result file and then give it to reducer. You can see that the limit clause in sort by will reduce the amount of data involved in sorting, but not in order by, which will only limit the amount of data returned to the client.

From the above execution efficiency, we can see that sort by limit is almost twice as high as order by limit. I guess there should be an additional link.

Next, let's take a look at the implementation plans of order by limit and sort by limit respectively

explain select * from ods_temperature order by year limit 2;

explain select * from ods_temperature sort by year limit 2;

From the screenshot above, I can see where I circled

sort by limit has one more stage(order limit) than order by limit
sort by limit actually executes limit twice, reducing the amount of data involved in sorting

3. Distribute by

Distribution by controls how to split data on the map side to the reduce side. It is similar to partitioning data in MapReduce with partitioner

hive will distribute the data to the corresponding reducer according to the following columns of distribute by. By default, the method of hash algorithm + remainder is adopted.

Sort by generates a sort file for each reduce. In some cases, you need to control which reducer a specific line should be written to. This is usually for subsequent aggregation operations. Distribute by can just do it. Therefore, distribute by is often used with sort by.

For example, in the above example of sort by, we found that the data of different years are not in the same file, that is, not in the same reducer. Next, let's see how to output the same years together and sort them in ascending order of temperature

First, let's try the implementation of SQL without distribute by

insert overwrite local directory '/Users/liuwenqiang/workspace/hive/sort' row format delimited fields terminated by '\t'  select * from ods_temperature sort by temper ;

It is found that the data of the same year are not allocated together. Next, let's use distribute by

insert overwrite local directory '/Users/liuwenqiang/workspace/hive/sort' row format delimited fields terminated by '\t' 
select * from ods_temperature distribute by year sort by temper ;

We can see that 2013 and 2016 are put together, but there is no certain order. At this time, we can sort the distribution by field again

insert overwrite local directory '/Users/liuwenqiang/workspace/hive/sort' row format delimited fields terminated by '\t' 
select * from ods_temperature distribute by year sort by year,temper ;

4. cluster by

In addition to the function of distribute by, cluster by also has the function of sort by. However, sorting can only be in ascending order, and the sorting rule cannot be ASC or DESC.

When the partition field and sorting field are the same, cluster by can simplify the SQL writing method of distribution by + sort by, that is, when the distribution by and sort by fields are the same, cluster by can be used instead of distribution by and sort by

insert overwrite local directory '/Users/liuwenqiang/workspace/hive/sort' row format delimited fields terminated by '\t' 
select * from ods_temperature  distribute by year sort by year ;

insert overwrite local directory '/Users/liuwenqiang/workspace/hive/sort' row format delimited fields terminated by '\t' 
select * from ods_temperature cluster by year;

We can see that the output results of the above two SQL writing methods are the same, which proves our statement that when the distribution by and sort by fields are the same, cluster by can be used instead of distribution by and sort by

When you try to specify the sorting direction for cluster by, you will get the following error.

Error: Error while compiling statement: FAILED: ParseException line 2:46 extraneous input 'desc' expecting EOF near '<EOF>' (state=42000,code=40000)

summary

order by is a global sort, which may have poor performance;
The sort by partition is orderly. It is often used in conjunction with the distribution by to determine what data the partition has;
Distribution by determines the rules of data distribution, and the data meeting the same conditions is distributed to a reducer;
Cluster by when the fields of distribution by and sort by are the same, cluster by can be used instead of distribution by and sort by, but cluster by defaults to ascending order and cannot specify sorting direction;
sort by limit is equivalent to the data limit of each reduce. After that, order by and then limit;

Keywords: Big Data hive

Added by mshen on Tue, 07 Sep 2021 00:19:08 +0300

Programming VIP