How to limit the footwall volume of temporary data files in data warehouse

Absrtact: if the intermediate result set of query is too large, resulting in the drop of the temporary data file generated, this paper provides two schemes to limit the amount of data in the lower disk of the temporary data file to prevent affecting the normal business operation.

This article is shared from Huawei cloud community< How does GaussDB(DWS) limit the amount of temporary data files >, author: Wang Xiaojuan 8.

For some SQL statements, the intermediate result set is too large to fit in memory and needs to be dropped to external memory (for example, there are operations such as aggregating a large amount of data, resulting in the drop of the intermediate result set of the aggregation operation when it cannot fit in memory), and the space occupied by the temporary data file generated by the drop is too large, which will affect the normal data writing business and cannot be executed, Disks can only provide read-only operations.

For the above scenario, there are two ways to control the amount that can be dropped in the intermediate result set during user execution. When the limit is exceeded, an error will be reported and the execution of the statement will be terminated to prevent the temporary data file from occupying too much space:

1. Scheme 1: set the limit on the amount of temporary file falling disk data for each thread
2. Scheme 2: set the falling disk space limit of intermediate result set for users

Scheme 1: set the limit on the amount of temporary file falling disk data for each thread

Set the GUC parameter temp_file_limit can limit the amount of temporary file falling disk data for each thread. temp_file_limit is a SUSET type parameter. Value range: integer, unit: KB. Where - 1 means there is no limit. Default: - 1.

1. How to set temp_file_limit parameter

Available through GS_ The global settings of the GUC tool are as follows:
gs_guc reload -Z coordinator -Z datanode -N all -I all -c "temp_file_limit = 1024"

2. temp_file_limit value calculation formula

You can use the following formula to roughly calculate a temp_file_limit value: temp_file_limit = estimated total number of footwall threads / number of simultaneous footwall threads

The total footwall volume can generally be set to 20% of the available space, which can be adjusted according to the user's acceptability. At the same time, the number of threads in the footwall is the number of threads that generate intermediate temporary data footwall in concurrent query in business operation. As the amount of data stored in the database increases, temp_ file_ The value of limit should be adjusted in time.

be careful:

This parameter limits the amount of temporary file disk dropping data of each thread. If a query has multiple threads and the amount of disk dropping data of a single thread exceeds the limit of this parameter, the query will report an error and exit. If each thread does not exceed the limit, but the cumulative amount of footwall data of multiple threads exceeds the limit of this parameter, there will be no error and exit.

3. Examples

Customer in TPC-DS 1x data_ Demographics table as an example.

The SQL query does not push down, and the intermediate result set is only placed on the CN

postgres=# show temp_file_limit;
 temp_file_limit 
-----------------
 1MB
(1 row)

postgres=# set enable_stream_operator=off;
SET

postgres=# explain select * from customer_demographics c1, customer_demographics c2 order by c1.cd_demo_sk;
                                                          QUERY PLAN                                                          
------------------------------------------------------------------------------------------------------------------------------
  id |                                operation                                 |    E-rows     | E-width |     E-costs      
 ----+--------------------------------------------------------------------------+---------------+---------+------------------
   1 | ->  Sort                                                                 | 3689472640000 |     112 | 2329492473090.72 
   2 |    ->  Nested Loop (3,4)                                                 | 3689472640000 |     112 | 36894726400.00   
   3 |       ->  Data Node Scan on customer_demographics "_REMOTE_TABLE_QUERY_" |       1920800 |      56 | 0.00             
   4 |       ->  Data Node Scan on customer_demographics "_REMOTE_TABLE_QUERY_" |       1920800 |      56 | 0.00             
(6 rows)

postgres=# select * from customer_demographics c1, customer_demographics c2 order by c1.cd_demo_sk;
ERROR:  temporary file size exceeds temp_file_limit (1024kB)

Scheme 2: set the falling disk space limit of intermediate result set for users

1. How to set the space limit of user intermediate result set

There are two ways to set the falling disk space limit of user intermediate result sets:
1) Specify spin space through CREATE USER to set the drop limit of intermediate result set for new users
CREATE USER user_name ... SPILL SPACE 'spillspacelimit';
2) Specify spin space through ALTER USER to modify the landing space limit of intermediate result set of existing users
ALTER USER user_name ... SPILL SPACE 'spillspacelimit';

For example:
CREATE USER u1 PASSWORD ‘ abcd@1234 ’ SPILL SPACE 'unlimited'; -- Create a user and set the landing limit of intermediate result set to unlimited
ALTER USER u1 SPILL SPACE '1G'; -- Modify the setting limit of user U1's intermediate result set to 1g

explain:
1) This setting is effective for all nodes, that is, if the sum of the drop disk data of an SQL in the CN and all DN S of the cluster exceeds the limit, the statement will report an error and terminate.
2) When the intermediate result set is dropped, the amount of temporary file drop data of the user increases accordingly; When the temporary file is deleted, the amount of data of the user's temporary file will be reduced accordingly.
3) This setting is user level. If the same user runs multiple queries simultaneously, the amount of disk falling data of each query intermediate result set will be accumulated.

be careful:
To make the above settings effective, you need to set the GUC parameter enable_perm_space is on.
If multiple users will perform a large number of footwall operations of intermediate result sets, you need to set each user involved.

2. Examples

Example 1: the intermediate result set will drop on both CN and DN, and the total amount of drop data will exceed 1G

postgres=# create user u1 password 'abcd@1234';
CREATE USER
postgres=# grant select on customer_demographics to u1;
GRANT
postgres=# alter user u1 spill space '1G';
ALTER USER
postgres=# alter session set session authorization u1 password 'abcd@1234';
SET
postgres=> select * from customer_demographics c1, customer_demographics c2 order by c1.cd_demo_sk;
ERROR:  spill space is out of user's spill space limit

Example 2: the SQL query does not push down, and the intermediate result set is only available on the CN

postgres=# set enable_stream_operator=off;
SET
postgres=# alter session set session authorization u1 password 'abcd@1234';
SET
postgres=> explain select * from customer_demographics c1, customer_demographics c2 order by c1.cd_demo_sk;
                                                          QUERY PLAN                                                          
------------------------------------------------------------------------------------------------------------------------------
  id |                                operation                                 |    E-rows     | E-width |     E-costs      
 ----+--------------------------------------------------------------------------+---------------+---------+------------------
   1 | ->  Sort                                                                 | 3689472640000 |     112 | 2329492473090.72 
   2 |    ->  Nested Loop (3,4)                                                 | 3689472640000 |     112 | 36894726400.00   
   3 |       ->  Data Node Scan on customer_demographics "_REMOTE_TABLE_QUERY_" |       1920800 |      56 | 0.00             
   4 |       ->  Data Node Scan on customer_demographics "_REMOTE_TABLE_QUERY_" |       1920800 |      56 | 0.00             
(6 rows)

postgres=> select * from customer_demographics c1, customer_demographics c2 order by c1.cd_demo_sk;
ERROR: spill space is out of user's spill space limit

summary

The first scheme focuses on limiting the amount of temporary file footwall of each thread, and the second scheme focuses on limiting the amount of temporary file footwall of users. More appropriate parameters and parameter settings should be selected in combination with the purpose of business to avoid excessive amount of temporary file footwall affecting normal business execution.

Click focus to learn about Huawei cloud's new technologies for the first time~

Keywords: Big Data GaussDB

Added by Brusca on Tue, 23 Nov 2021 12:29:29 +0200

Programming VIP