Interviewer: you have 10 million rows of data. How do you query them quickly?

Preface

  • Interviewer: tell me, how would you query 10 million rows of data?
  • Brother B: just page through them directly, using LIMIT.
  • Interviewer: have you actually tried it in practice?
  • Brother B: of course I have

Cue the song "Liang Liang" for Brother B; in other words, he's toast.

Some of you may never have worked with a table holding tens of millions of rows, so you may not know what happens when you query one.

Today let's practice it together. All tests below are based on MySQL 5.7.26.

Preparing the data

What if you don't have 10 million rows?

Create them.

Generate 10 million rows from application code? Out of the question: it's far too slow and could genuinely take a whole day to run. A database script executes much faster.

Create table
CREATE TABLE `user_operation_log`  (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `user_id` varchar(64) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `ip` varchar(20) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `op_data` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `attr1` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `attr2` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `attr3` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `attr4` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `attr5` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `attr6` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `attr7` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `attr8` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `attr9` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `attr10` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `attr11` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `attr12` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  PRIMARY KEY (`id`) USING BTREE
) ENGINE = InnoDB AUTO_INCREMENT = 1 CHARACTER SET = utf8mb4 COLLATE = utf8mb4_general_ci ROW_FORMAT = Dynamic;
Data generation script

Batch insertion is much more efficient, committing once every 1,000 rows. If each batch is too large, the batch insert itself becomes slow.

DELIMITER ;;
CREATE PROCEDURE batch_insert_log()
BEGIN
  DECLARE i INT DEFAULT 1;
  DECLARE userId INT DEFAULT 10000000;
  -- Shared INSERT prefix; row tuples are accumulated in @execData
  SET @execSql = 'INSERT INTO `test`.`user_operation_log`(`user_id`, `ip`, `op_data`, `attr1`, `attr2`, `attr3`, `attr4`, `attr5`, `attr6`, `attr7`, `attr8`, `attr9`, `attr10`, `attr11`, `attr12`) VALUES';
  SET @execData = '';
  WHILE i <= 10000000 DO
    SET @attr = "'Test long, long, long, long, long, long, long, long, long, long, long, long, long, long, long, long properties'";
    -- Append one row tuple to the pending batch
    SET @execData = concat(@execData, "(", userId + i, ", '10.0.69.175', 'User login operation'", ",", @attr, ",", @attr, ",", @attr, ",", @attr, ",", @attr, ",", @attr, ",", @attr, ",", @attr, ",", @attr, ",", @attr, ",", @attr, ",", @attr, ")");
    IF i % 1000 = 0 THEN
      -- Every 1,000 rows: build the full statement, execute it, and commit
      SET @stmtSql = concat(@execSql, @execData, ";");
      PREPARE stmt FROM @stmtSql;
      EXECUTE stmt;
      DEALLOCATE PREPARE stmt;
      COMMIT;
      SET @execData = "";
    ELSE
      SET @execData = concat(@execData, ",");
    END IF;
    SET i = i + 1;
  END WHILE;
END;;
DELIMITER ;
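
With the procedure defined, generating the data is just a CALL away. Fair warning: on slow hardware this runs for a long time (38 minutes for about 3 million rows on my machine, as noted below):

CALL batch_insert_log();

-- Optional cleanup once the data has been generated
DROP PROCEDURE IF EXISTS batch_insert_log;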

Starting the test

My machine's configuration is fairly low-end: Windows 10, a run-of-the-mill i5, and an SSD that reads and writes at roughly 500 MB/s.

Because of the weak hardware, this test only prepared 3,148,000 rows, which occupied 5 GB of disk (without secondary indexes) and took 38 minutes to generate. Readers with better machines can insert more data to test with.

SELECT count(1) FROM `user_operation_log`

Returned result: 3148000

The three COUNT query times were:

  • 14060 ms
  • 13755 ms
  • 13447 ms
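
If you want to double-check how much disk the table occupies, you can ask information_schema. A rough sketch, assuming the table lives in the test schema as in the script above (InnoDB's reported sizes are estimates):

SELECT table_name,
       ROUND((data_length + index_length) / 1024 / 1024 / 1024, 2) AS size_gb
FROM information_schema.tables
WHERE table_schema = 'test' AND table_name = 'user_operation_log';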

Ordinary paging query

MySQL supports the LIMIT clause to select a specified number of rows; Oracle uses ROWNUM for the same purpose.

The MySQL paging syntax is:

SELECT * FROM table LIMIT [offset,] rows | rows OFFSET offset
  • The first parameter specifies the offset of the first row to return
  • The second parameter specifies the maximum number of rows to return
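
For example, the following two statements are equivalent; both skip the first 10,000 rows and return the next 10:

SELECT * FROM `user_operation_log` LIMIT 10000, 10
SELECT * FROM `user_operation_log` LIMIT 10 OFFSET 10000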

Let's start testing:

SELECT * FROM `user_operation_log` LIMIT 10000, 10

The query times are as follows:

  • 59 ms
  • 49 ms
  • 50 ms

The speed seems fine, but this is a local database, so it is naturally faster.

Test from another angle

Same offset, different page sizes
SELECT * FROM `user_operation_log` LIMIT 10000, 10
SELECT * FROM `user_operation_log` LIMIT 10000, 100
SELECT * FROM `user_operation_log` LIMIT 10000, 1000
SELECT * FROM `user_operation_log` LIMIT 10000, 10000
SELECT * FROM `user_operation_log` LIMIT 10000, 100000
SELECT * FROM `user_operation_log` LIMIT 10000, 1000000

The query times were as follows:

Rows returned   First run   Second run   Third run
10              53ms        52ms        47ms
100             50ms        60ms        55ms
1000            61ms        74ms        60ms
10000           164ms       180ms       217ms
100000          1609ms      1741ms      1764ms
1000000         16219ms     16889ms     17081ms

From these results we can conclude: the more rows returned, the longer the query takes.

Same page size, different offsets
SELECT * FROM `user_operation_log` LIMIT 100, 100
SELECT * FROM `user_operation_log` LIMIT 1000, 100
SELECT * FROM `user_operation_log` LIMIT 10000, 100
SELECT * FROM `user_operation_log` LIMIT 100000, 100
SELECT * FROM `user_operation_log` LIMIT 1000000, 100

Offset    First run   Second run   Third run
100       36ms        40ms        36ms
1000      31ms        38ms        32ms
10000     53ms        48ms        51ms
100000    622ms       576ms       627ms
1000000   4891ms      5076ms      4856ms

From these results we can conclude: the larger the offset, the longer the query takes.

As a preview of the result-set-size problem discussed below, compare the speed of these two statements, which return the same rows but different columns:

SELECT * FROM `user_operation_log` LIMIT 100, 100
SELECT id, attr1 FROM `user_operation_log` LIMIT 100, 100

How to optimize

After all that experimenting, we have our conclusions. Now let's optimize the two problems identified above: large offsets and large result sets.

Optimizing large offsets

Use a subquery

We can first locate the id at the offset position, then query the data from there:

SELECT * FROM `user_operation_log` LIMIT 1000000, 10

SELECT id FROM `user_operation_log` LIMIT 1000000, 1

SELECT * FROM `user_operation_log` WHERE id >= (SELECT id FROM `user_operation_log` LIMIT 1000000, 1) LIMIT 10

The query results are as follows:

SQL                            Time taken
Statement 1                    4818ms
Statement 2 (without index)    4329ms
Statement 2 (with index)       199ms
Statement 3 (without index)    4319ms
Statement 3 (with index)       201ms

From these results we can conclude:

  • The first statement takes the longest, and without an index the third is only slightly better
  • With an index, the subquery approach is much faster (see the EXPLAIN sketch below)
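
To see where the time goes, you can run EXPLAIN on both forms; broadly, the inner SELECT id can be answered from an index alone, while a plain SELECT * with a huge offset must fetch and discard a million complete rows first. This is a quick check rather than a full plan analysis; the exact output depends on your version and indexes:

EXPLAIN SELECT * FROM `user_operation_log` LIMIT 1000000, 10
EXPLAIN SELECT * FROM `user_operation_log` WHERE id >= (SELECT id FROM `user_operation_log` LIMIT 1000000, 1) LIMIT 10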

Drawback: it only applies when id is monotonically increasing.

If id is not monotonically increasing, you can use the following approach instead, though its drawback is that the pagination has to live inside the subquery.

Note: some MySQL versions do not support LIMIT directly inside an IN clause, hence the extra nested SELECT:

SELECT * FROM `user_operation_log` WHERE id IN (SELECT t.id FROM (SELECT id FROM `user_operation_log` LIMIT 1000000, 10) AS t)

Using an id range

This method is more demanding: the id must be continuously increasing, and you have to calculate the id range in advance, then use BETWEEN. The SQL looks like this:

SELECT * FROM `user_operation_log` WHERE id between 1000000 AND 1000100 LIMIT 100

SELECT * FROM `user_operation_log` WHERE id >= 1000000 LIMIT 100

The query results are as follows:

SQL           Time taken
Statement 1   22ms
Statement 2   21ms

The results show that this approach is extremely fast.

Note: the LIMIT here only restricts the number of rows returned; no offset is involved.
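
The same trick extends to paging forward, often called keyset pagination: remember the last id of the previous page and continue just after it. A minimal sketch, assuming id increases continuously and that 1000099 happens to be the last id of the previous page (a hypothetical value):

SELECT * FROM `user_operation_log` WHERE id >= 1000000 ORDER BY id LIMIT 100
-- Next page: continue from the last id seen (1000099 is hypothetical)
SELECT * FROM `user_operation_log` WHERE id > 1000099 ORDER BY id LIMIT 100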

Optimizing large result sets

The amount of data returned also directly affects query speed:

SELECT * FROM `user_operation_log` LIMIT 1, 1000000

SELECT id FROM `user_operation_log` LIMIT 1, 1000000

SELECT id, user_id, ip, op_data, attr1, attr2, attr3, attr4, attr5, attr6, attr7, attr8, attr9, attr10, attr11, attr12 FROM `user_operation_log` LIMIT 1, 1000000

The query results are as follows:

SQL           Time taken
Statement 1   15676ms
Statement 2   7298ms
Statement 3   15960ms

The results show that cutting out unneeded columns noticeably improves query efficiency.

The first and third queries run at almost the same speed, at which point you'll surely grumble: then why bother writing out all those field names? Just use * and be done with it.

Note that my MySQL server and client are on the same machine, so the query results don't differ much; readers who can should test with the client and MySQL on separate machines.

Isn't SELECT * good enough?

By the way, let me add why SELECT * is banned in many coding standards. Isn't it simple and mindless? Isn't it great?

There are two main points:

  1. With "SELECT *" the database has to resolve more related objects: fields, permissions, attributes, and so on. With complex SQL statements and frequent hard parsing, this puts a heavy burden on the database.
  2. It increases network overhead. SELECT * can accidentally drag along useless, large text fields such as log or IconMD5, and the transferred data size grows dramatically. The overhead is especially obvious when MySQL and the application are not on the same machine.

The end

Finally, I hope you'll practice this yourself; you'll certainly gain more that way. Feel free to leave a comment!!

I've even given you the script. What are you waiting for!!!
