preface:
Recently, a DB2 paging query in the project was very slow, and the team leader assigned the optimization of this paging to me; Then, after a meal of optimization (random operation), the time consumption of DB2 paging query is reduced to a satisfactory situation, [Kaisen];
Then immediately report the results to the team leader. After checking my presentation, the team leader found that the paging query is indeed much faster, which can be "acceptable", which is quite acceptable than the page before optimization;
Note: the optimized presentation environment is basically the same as that when paging query is found to be slow, including the amount of data in the library, DB2 configuration, server configuration, etc.
Firstly, by checking the execution plan, it is found that the indexes in the SQL statement are used, so it is not the index problem for the time being. Finally, it is found that there is a problem in the SQL statement. After optimizing the SQL, the query is fast;
The following is a brief description of how DB2 paging SQL is optimized and binggou starts;
The main line of this paper:
- Prerequisite: select the logical execution order of query SQL
- Optimization process: DB2 paging SQL optimization process
- Knowledge expansion: share some special cases that lead to slow paging query
Prerequisites:
When optimizing query SQL statements, you need to know their logical execution order, which is very helpful for SQL optimization;
The logical execution order of SQL refers to how a whole SQL statement should be executed according to certain rules, and at what time each keyword and clause part should be executed;
1. The logical execution sequence of a simple select query SQL is as follows:
- First, execute from table join table to obtain the table to be operated and the associated table, and calculate the Cartesian product of them to obtain a virtual table v1;
- Then, execute the on condition, query and filter the virtual table v1 to get a virtual table v2;
- Then, execute the where condition to filter the data in the virtual table v2 to obtain the virtual table v3;
- Then, after executing the group by statement, group the data in the virtual table v3 to obtain the virtual table v4;
- Then, the existing condition filtering is performed, and the grouped virtual table v4 is conditionally filtered to obtain the virtual table v5;
- Then execute select projection column on the data of virtual table v5, and only keep the displayed fields in select to obtain virtual table v6;
- Then execute the order by sorting statement to sort the data in the virtual table v6 to get the virtual table v7;
- Finally, execute page splitting statements such as limit (limit the number of entries) to get the virtual table v8.
2. Query slow paging SQL under simple analysis:
The above briefly describes the logical execution sequence of query SQL. Next, we will analyze the logical execution sequence of paging statements with slow query;
SELECT * FROM ( SELECT B.*, ROWNUMBER() OVER() AS RN FROM ( select ts.name, ts.age, tc.class_name, ts.describe, ts.birthday from t_student ts LEFT JOIN t_class tc on tc.class_id = ts.class_id where ts.age = 23 AND ts.birthday >= TO_DATE('2020-06-12','yyyy-MM-dd') AND ts.birthday <= TO_DATE('2020-07-15' , 'yyyy-MM-dd') order by ts.birthday desc ) AS B )AS A WHERE A.RN BETWEEN 10 AND 20
t_ There is a secondary index in the student table: index(age, birthday)
t_ Primary key of class table: class_id
2.1. SQL logic execution sequence analysis:
- First, t_student ts and t_ The class tc table calculates the Cartesian product. If the amount of data in the TS table is 10 and the amount of data in the tc table is 10, then the Cartesian product is 10 * 10 = 100 to obtain the virtual table v1
- Then the on condition is filtered to get the virtual table v2
- Then, the virtual table v3 is obtained by filtering the where condition
- Then execute select to project the column, and only keep the displayed fields in the select to get the virtual table v4
- Then sort according to the birthday field to get the virtual table v5
- Finally, the data in the virtual table v5 is paged according to the row number to obtain the data required by the current page and get the final result set
2.2. The logical execution sequence obtained through analysis shows that:
- In the first step, calculate the Cartesian product. If the amount of data in the two tables is large, the Cartesian product will be very large, and the subsequent on condition screening and where condition screening will be much slower. Here you can find ways to optimize it
- In the select projection column, you must return the fields as needed, especially the large fields not used in the front-end page. You must not add them to the projection column. For example, the describe personal description field page in this SQL does not need to be displayed
- The last paging statement is to filter out the data required by the current page according to the line number after filtering out all the data. I think, can I? Don't filter out all the data, but filter out the data required by the current page
Optimization process:
Through the logical execution sequence analysis of SQL above, we know that there are three places where we can try to optimize to see if the query is getting faster;
1. Reduce Cartesian product:
Because the following where condition filtering is for the main table t_student table, so you can use the where condition to filter t in advance_ Filter the student table to get the virtual table,
Then use the virtual table and the connected table t_class calculates the Cartesian product. At this time, the Cartesian product is much smaller;
The SQL statement is as follows:
SELECT * FROM ( SELECT B.*, ROWNUMBER() OVER() AS RN FROM ( select ts.name, ts.age, tc.class_name, ts.describe, ts.birthday from (select class_id, name, age, describe, birthday from t_student where age = 23 AND birthday >= TO_DATE('2020-06-12','yyyy-MM-dd') AND birthday <= TO_DATE('2020-07-15' , 'yyyy-MM-dd') ) ts LEFT JOIN t_class tc on tc.class_id = ts.class_id order by ts.birthday desc ) AS B )AS A WHERE A.RN BETWEEN 10 AND 20
2. Unnecessary large fields cannot be returned:
In the select projection column, do not put the unnecessary large field describe into it, because if it is put into the projection column, it will occupy more buffer pool space during query. If the buffer pool space is full, disk IO will be carried out, which is very time-consuming;
In addition, if there are large fields in the response, the transmission speed will slow down during network IO and the page will slow down when loading data;
The SQL statement is as follows:
SELECT * FROM ( SELECT B.*, ROWNUMBER() OVER() AS RN FROM ( select ts.name, ts.age, tc.class_name, ts.birthday from (select class_id, name, age, birthday from t_student where age = 23 AND birthday >= TO_DATE('2020-06-12','yyyy-MM-dd') AND birthday <= TO_DATE('2020-07-15' , 'yyyy-MM-dd') ) ts LEFT JOIN t_class tc on tc.class_id = ts.class_id order by ts.birthday desc ) AS B )AS A WHERE A.RN BETWEEN 10 AND 20
3. Restrict data filtering during paging:
Paging queries are based on pageNo and pageSize to filter the required page data;
For example, if the current page is the second page, each page displays 10 pieces of data. The filtering of the current page data is rownum between ((pageNo-1) * pageSize) and (pageNo * pageSize). When filtering the data before paging, you can use fetch first (page * rows) rows only to limit the amount of filtered data and don't filter all of them, Instead, you can only filter to the front of the largest line number in the current page at most;
Due to the use of fetch first (page * rows) rows only, a lot of filtering operations are reduced and the speed will be much faster. Especially when clicking on the first few pages, the speed is very fast. The more you turn back the page, the slower the response will be. However, when paging, you basically view the first few pages and rarely look at the later ones, so the effect looks very good at this time
The SQL statement is as follows:
select ts.name, ts.age, tc.class_name, ts.birthday from ( select aa.class_id, aa.name, aa.age, aa.birthday from( select class_id, name, age, birthday, ROWNUMBER() OVER () AS RN from t_student where age = 23 AND birthday >= TO_DATE('2020-06-12','yyyy-MM-dd') AND birthday <= TO_DATE('2020-07-15' , 'yyyy-MM-dd') order by birthday desc fetch first 20 rows only ) aa where aa.RN BETWEEN 10 AND 20 ) ts LEFT JOIN t_class tc on tc.class_id = ts.class_id
So far, the optimization of DB2 paging SQL has been basically completed, and the query speed has been increased by more than half. However, there is still room for optimization of this paging SQL. You can further improve the query speed by modifying the SQL statement and index(age, birthday);
The following is a brief description of how to optimize again in the knowledge expansion part;
Knowledge expansion:
1. Further optimization of paging SQL:
Let's take a look at the following SQL statement:
select class_id, name, age, birthday, ROWNUMBER() OVER () AS RN from t_student where age = 23 AND birthday >= TO_DATE('2020-06-12','yyyy-MM-dd') AND birthday <= TO_DATE('2020-07-15' , 'yyyy-MM-dd') order by birthday desc fetch first 20 rows only
The secondary index(age, birthday) will be used during the execution of this SQL statement. Since all field values cannot be queried on this index, it is necessary to query the class in the back table_ The values of ID and name fields will slow down the query due to too many back to table operations, so we should find ways to avoid back to table at this time;
1.1. Method 1 to avoid returning the table:
Only modify the index. Change the secondary index index(age, birthday) to index(age, birthday, name, class_id). The SQL statement does not need to be changed;
However, when there are more index fields, it will have a great impact on the operations of adding, updating and deleting SQL, which will lead to longer execution time, because the index needs to be maintained; Therefore, if there are many write operations in these two tables, it is not recommended to modify the index directly;
1.2. Method 2 to avoid returning the table:
In fact, this method also needs to return to the table, but the number of times of returning to the table will be much less than that before, which is almost negligible;
This method does not need to modify the index, but modify the SQL statement. First, it is no longer necessary to return class when querying the secondary index(age, birthday)_ The fields ID, name, age and birthday are, but only t is returned_ Primary key stu of student table_ ID, there is no need to return to the table at this time;
Then, it will be used in the secondary index(age, birthday) to sort by default. Finally, only the first 20 data items will be returned according to the fetch first 20 rows only. At this time, 20 primary key stus will be returned_ ID value, which can then be based on the primary key stu_id left outer connection t_student to obtain other field values;
At this time, it is equivalent to that you only need to return the table 20 times, and you don't need to return all the tables;
The SQL statement is as follows:
select stu2.class_id, stu2.name, stu2.age, stu2.birthday, ROWNUMBER() OVER () AS RN from ( select stu_id from t_student where age = 23 AND birthday >= TO_DATE('2020-06-12','yyyy-MM-dd') AND birthday <= TO_DATE('2020-07-15' , 'yyyy-MM-dd') order by birthday desc fetch first 20 rows only ) stu1 left join t_student stu2 on stu1.stu_id = stu2.stu_id
The final optimized paging SQL is as follows:
select ts.name, ts.age, tc.class_name, ts.birthday from ( select aa.class_id, aa.name, aa.age, aa.birthday from( select stu2.class_id, stu2.name, stu2.age, stu2.birthday, ROWNUMBER() OVER () AS RN from ( select stu_id from t_student where age = 23 AND birthday >= TO_DATE('2020-06-12','yyyy-MM-dd') AND birthday <= TO_DATE('2020-07-15' , 'yyyy-MM-dd') order by birthday desc fetch first 20 rows only ) stu1 left join t_student stu2 on stu1.stu_id = stu2.stu_id ) aa where aa.RN BETWEEN 10 AND 20 ) ts LEFT JOIN t_class tc on tc.class_id = ts.class_id
2. Special cases in paging query:
When performing paging query, you usually first query the total amount of data that meets the conditions. This amount of data is used for paging to obtain the number of pages;
Then we may encounter some very special situations. The SQL query of the total data volume is as follows:
select count(*) from t_student ts LEFT JOIN t_class tc on tc.class_id = ts.class_id where ts.age = 23 AND ts.birthday >= TO_DATE('2020-06-12','yyyy-MM-dd') AND ts.birthday <= TO_DATE('2020-07-15' , 'yyyy-MM-dd')
In fact, the above SQL statement can be optimized. In fact, it does not need to be associated with t_class table, the specific reasons are as follows:
- First, the where condition is the T of the main table_ Student filtering
- The second is the main table t_student and appearance t_ The data relationship in class is one-to-one
You can directly query the qualified data in the main table, so you can remove the redundant table connections, which will greatly improve the query speed of SQL;
Under gossip, why does someone write that?
The guess is mainly for the convenience of the figure, and the paging SQL statements copied directly are changed [neither crying nor laughing];
❤ Like + comment + forward
If this article is helpful to you, please wave your little hand and praise it. Your support is the driving force of my continuous creation. Thank you!
You can search WeChat Muzi, the official account, and learn a lot of dry cargo articles from Java. You can take a look at it!