Issue 38: specific implementation of MySQL time class partition

The scenario with the most partitions or tables is still split for the time field. In this section, we will talk about how to better split based on the time field. According to the implementation methods of year, month and day dimensions and some details.

First, take the year as the dimension to split

The selection of date field splitting granularity is closely related to business retrieval requests. For example, if the data is kept for 10 years and each query is based on a specific year as the filter condition, it is certainly best to split by year. For example, the following SQL:

select * from ytt_pt1 where log_date >='2018-01-01' and log_date < '2019-01-01';

Let's take a look at the actual example of splitting by year: table ytt_pt1, including 1000W records, and the partition table is established with the granularity of year.

mysql> create table ytt_pt1(id bigint, log_date date);
Query OK, 0 rows affected (0.18 sec)

mysql> insert into ytt_pt1 select id,log_date from ytt_p1 limit 10000000;
Query OK, 10000000 rows affected (3 min 49.53 sec)
Records: 10000000  Duplicates: 0  Warnings: 0

mysql> ALTER TABLE ytt_pt1 PARTITION BY RANGE (year(log_date))
    -> (
    -> PARTITION p0001 VALUES LESS THAN (2012),
    -> PARTITION p0002 VALUES LESS THAN (2013),
    -> PARTITION p0003 VALUES LESS THAN (2014),
    -> PARTITION p0004 VALUES LESS THAN (2015),
    -> PARTITION p0005 VALUES LESS THAN (2016),
    -> PARTITION p0006 VALUES LESS THAN (2017),
    -> PARTITION p0007 VALUES LESS THAN (2018),
    -> PARTITION p0008 VALUES LESS THAN (2019),
    -> PARTITION p0009 VALUES LESS THAN (2020),
    -> PARTITION p0010 VALUES LESS THAN (2021),
    -> PARTITION p_max VALUES LESS THAN (maxvalue)
    -> );
Query OK, 10000000 rows affected (2 min 33.31 sec)
Records: 10000000  Duplicates: 0  Warnings: 0

Let's look at the query effect by year: the following SQL directly goes to partition p0008, and the query time is 0.91 seconds. This time is not short. In the later stage, you can add filter conditions to reduce the query time.

mysql> select count(*) from ytt_pt1 where log_date >='2018-01-01' and log_date < '2019-01-01';
+----------+
| count(*) |
+----------+
|  1000204 |
+----------+
1 row in set (0.91 sec)

mysql> explain  select count(*) from ytt_pt1 where log_date >='2018-01-01' and log_date < '2019-01-01'\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: ytt_pt1
   partitions: p0008
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 998002
     filtered: 11.11
        Extra: Using where
1 row in set, 1 warning (0.00 sec)

Note: queries can only be filtered directly based on fields. If they are filtered based on field expressions, MySQL does not know which partition to go, and will scan all partitions. The processing method is the same as that of single table queries. For example, the following statement:

select count(*) from ytt_pt1 where year(log_date) = '2018' ;

Look at the execution: MySQL scans all partitions, and the query execution time is more than 9 seconds.

mysql> select count(*) from ytt_pt1 where year(log_date) = '2018' ;
+----------+
| count(*) |
+----------+
|  1000204 |
+----------+
1 row in set (9.19 sec)

mysql> explain select count(*) from ytt_pt1 where year(log_date) = '2018' \G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: ytt_pt1
   partitions: p0001,p0002,p0003,p0004,p0005,p0006,p0007,p0008,p0009,p0010,p_max
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 9982648
     filtered: 100.00
        Extra: Using where
1 row in set, 1 warning (0.00 sec)

If you do not adhere to this method, you can give the optimizer a prompt to retrieve data in the specified partition, or create a virtual column based on the field expression:

mysql> select count(*) from ytt_pt1 partition(p0008) where year(log_date) = '2018' ;
+----------+
| count(*) |
+----------+
|  1000204 |
+----------+
1 row in set (0.84 sec)

If the query is frequently filtered by month as a dimension, it is better to split it by month; For example, some records of the current month in 2020 need to be retrieved for subsequent data processing, which is roughly as follows:

select * from ytt_pt1_Split table by month where log_date in ('2020-01-01'，'2020-01-02'，...)

Second, take the month as the dimension to split

There are two ways to split by month:

The first method: directly dismantle 12 partitions by month: the following table YTT_ pt1_ The month1 partition type is LIST, which is calculated directly based on the function month.

mysql> show create table ytt_pt1_month1\G
*************************** 1. row ***************************
       Table: ytt_pt1_month1
Create Table: CREATE TABLE `ytt_pt1_month1` (
  `id` bigint DEFAULT NULL,
  `log_date` date DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
/*!50100 PARTITION BY LIST (month(`log_date`))
(PARTITION p0001 VALUES IN (1) ENGINE = InnoDB,
 PARTITION p0002 VALUES IN (2) ENGINE = InnoDB,
 PARTITION p0003 VALUES IN (3) ENGINE = InnoDB,
 PARTITION p0004 VALUES IN (4) ENGINE = InnoDB,
 PARTITION p0005 VALUES IN (5) ENGINE = InnoDB,
 PARTITION p0006 VALUES IN (6) ENGINE = InnoDB,
 PARTITION p0007 VALUES IN (7) ENGINE = InnoDB,
 PARTITION p0008 VALUES IN (8) ENGINE = InnoDB,
 PARTITION p0009 VALUES IN (9) ENGINE = InnoDB,
 PARTITION p0010 VALUES IN (10) ENGINE = InnoDB,
 PARTITION p0011 VALUES IN (11) ENGINE = InnoDB,
 PARTITION p0012 VALUES IN (12) ENGINE = InnoDB) */
1 row in set (0.00 sec)

For example, to query the records of the first half of 2020: the query is limited to partition p0001, but the time is not ideal. It takes 0.66 seconds.

mysql> select count(*) from ytt_pt1_month1 where log_date in ('2020-01-01','2020-01-02','2020-01-03','2020-01-04','2020-01-05','2020-01-06','2020-01-07','2020-01-08','2020-01-09','2020-01-10','2020-01-11','2020-01-12','2020-01-13','2020-01-14','2020-01-15');
+----------+
| count(*) |
+----------+
|    41540 |
+----------+
1 row in set (0.66 sec)

mysql> explain select count(*) from ytt_pt1_month1 where log_date in ('2020-01-01','2020-01-02','2020-01-03','2020-01-04','2020-01-05','2020-01-06','2020-01-07','2020-01-08','2020-01-09','2020-01-10','2020-01-11','2020-01-12','2020-01-13','2020-01-14','2020-01-15')\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: ytt_pt1_month1
   partitions: p0001
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 848224
     filtered: 50.00
        Extra: Using where
1 row in set, 1 warning (0.00 sec)

The second: for the data of each year, it is divided into 12 partitions separately, that is, according to the joint dimension of month and year. There are 144 partitions in total, and each partition corresponds to the data of a month of a specific year.

Adding this partition is a little more troublesome. Clone the table ytt_pt1_month1 is ytt_pt1_month2, write a stored procedure here to add partition information:

DELIMITER $$

USE `ytt`$$

DROP PROCEDURE IF EXISTS `sp_add_partition_ytt_pt1_month2`$$

CREATE DEFINER=`root`@`%` PROCEDURE `sp_add_partition_ytt_pt1_month2`()
BEGIN
    DECLARE i,j INT UNSIGNED DEFAULT 1;
    DECLARE v_tmp_date DATE;
    SET @stmt = '';
    SET @stmt_begin = 'ALTER TABLE ytt_pt1_month2 PARTITION BY RANGE COLUMNS (log_date)(';
        SET i = 2010;        
        WHILE i <= 2020 DO
          SET j = 1;
          WHILE j <= 12 DO
            SET v_tmp_date = CONCAT(i,'-01-01');
            SET @stmt = CONCAT(@stmt,'PARTITION p',i,'_',LPAD(j,2,"0"),' VALUES LESS THAN (''',DATE_ADD(v_tmp_date,INTERVAL j MONTH),'''),');
            SET j = j + 1;
          END WHILE;
          SET i = i + 1;
        END WHILE;    
    SET @stmt_end = 'PARTITION p_max VALUES LESS THAN (maxvalue))';
        SET @stmt = CONCAT(@stmt_begin,@stmt,@stmt_end);
        PREPARE s1 FROM @stmt;
        EXECUTE s1;
        DROP PREPARE s1;

        SET @stmt = NULL;
        SET @stmt_begin = NULL;
        SET @stmt_end = NULL;    
    END$$

DELIMITER ;

mysql> call sp_add_partition_ytt_pt1_month2;
Query OK, 0 rows affected (2 min 20.48 sec)

The results are similar:

PARTITION p2010_01 VALUES LESS THAN ('2010-02-01') ENGINE = InnoDB,
 ...
 PARTITION p2010_12 VALUES LESS THAN ('2011-01-01') ENGINE = InnoDB,
 PARTITION p2011_01 VALUES LESS THAN ('2011-02-01') ENGINE = InnoDB,
...
 PARTITION p2011_12 VALUES LESS THAN ('2012-01-01') ENGINE = InnoDB,
 ...
 PARTITION p2020_12 VALUES LESS THAN ('2021-01-01') ENGINE = InnoDB,
 PARTITION p_max VALUES LESS THAN (MAXVALUE) ENGINE = InnoDB)

After adding partitions, take a look at the query just now to obtain the records of the first half of 2020:

mysql> select count(*) from ytt_pt1_month2 where log_date in ('2020-01-01','2020-01-02','2020-01-03','2020-01-04','2020-01-05','2020-01-06','2020-01-07','2020-01-08','2020-01-09','2020-01-10','2020-01-11','2020-01-12','2020-01-13','2020-01-14','2020-01-15');
+----------+
| count(*) |
+----------+
|    41540 |
+----------+
1 row in set (0.06 sec)

mysql> explain   select count(*) from ytt_pt1_month2 where log_date in ('2020-01-01','2020-01-02','2020-01-03','2020-01-04','2020-01-05','2020-01-06','2020-01-07','2020-01-08','2020-01-09','2020-01-10','2020-01-11','2020-01-12','2020-01-13','2020-01-14','2020-01-15')\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: ytt_pt1_month2
   partitions: p2020_01
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 85498
     filtered: 50.00
        Extra: Using where
1 row in set, 1 warning (0.00 sec)

The query time is 0.06 seconds, which is 10 times faster than the first splitting method.

Does that mean that zoning according to the combination of months and months must be more optimized than zoning according to a single month? Not necessarily. If the query filter conditions include the fixed month of the partition, such as every December, it is better to partition directly by month than by year.

Third, take the day as the dimension to split

Filter frequent queries based on days. It is best to split queries based on days. For example, you only want to query the data on January 1, 2020, which is roughly as follows:

select * from ytt_pt1 where log_date = '2020-01-01'

Similarly, write a script or stored procedure to add partitions according to the joint dimension of month and year. The only thing to note here is that there is a limit on the number of MySQL partition tables, up to 8192. Therefore, if you partition by day and store data for 10 years, the number of partitions is 3650, which is also within the limit.

Modify the previous stored procedure. The code is as follows:

DELIMITER $$

USE `ytt`$$

DROP PROCEDURE IF EXISTS `sp_add_partition_ytt_pt1_day`$$

CREATE DEFINER=`root`@`%` PROCEDURE `sp_add_partition_ytt_pt1_day`(
IN f_year_start YEAR,
IN f_year_end YEAR
)
BEGIN
    DECLARE v_days INT UNSIGNED DEFAULT 365;
    DECLARE v_year DATE DEFAULT '2010-01-01';
    DECLARE v_partition_name VARCHAR(64) DEFAULT '';
    DECLARE v_log_date DATE;
    DECLARE i,j INT UNSIGNED DEFAULT 1;
    SET @stmt = '';
    SET @stmt_begin = 'ALTER TABLE ytt_pt1_day PARTITION BY RANGE COLUMNS (log_date)(';
    SET i = f_year_start;
    WHILE i <= f_year_end DO 
      SET v_year = CONCAT(i,'-01-01');
      SET v_days = DATEDIFF(DATE_ADD(v_year,INTERVAL 1 YEAR),v_year);            
      SET j = 1;
      WHILE j <= v_days DO
        SET v_log_date = DATE_ADD(v_year,INTERVAL j DAY);
       SET v_partition_name = CONCAT('p',i,'_',LPAD(j,3,'0'));
       SET @stmt = CONCAT(@stmt,'PARTITION ',v_partition_name,' VALUES LESS THAN (''',v_log_date,'''),');
       SET j = j + 1;        
      END WHILE;
      SET i = i + 1;    
    END WHILE;
    SET @stmt_end = 'PARTITION p_max VALUES LESS THAN (maxvalue))';
    SET @stmt = CONCAT(@stmt_begin,@stmt,@stmt_end);
    PREPARE s1 FROM @stmt;
    EXECUTE s1;
    DROP PREPARE s1;
    SELECT NULL,NULL,NULL INTO @stmt,@stmt_begin,@stmt_end;
END$$

DELIMITER ;

mysql> CALL sp_add_partition_ytt_pt1_day('2010','2020');
Query OK, 1 row affected (14 min 13.69 sec)

Next, the query time in days must be the shortest, only 0.01 seconds.

mysql> select count(*) from ytt_pt1_day where log_date = '2020-01-01';
+----------+
| count(*) |
+----------+
|     2675 |
+----------+
1 row in set (0.01 sec)

At this time, if such queries are based on year or month, the performance is certainly not optimal.

mysql> select count(*) from ytt_pt1 where log_date = '2020-01-01';
+----------+
| count(*) |
+----------+
|     2675 |
+----------+
1 row in set (0.68 sec)

mysql> select count(*) from ytt_pt1_month1 where log_date = '2020-01-01';
+----------+
| count(*) |
+----------+
|     2675 |
+----------+
1 row in set (0.87 sec)

mysql> select count(*) from ytt_pt1_month2 where log_date = '2020-01-01';
+----------+
| count(*) |
+----------+
|     2675 |
+----------+
1 row in set (0.09 sec)

You can see that the partition time of such queries based on other methods is significantly longer than that based on days.

To sum up:

This article mainly describes the most commonly used data splitting method in daily work: splitting by time. The time field is often implemented in three different dimensions: year, month and day, and the specific implementation methods and applicable scenarios are described in detail.

What else do you want to know about the technical content of MySQL? Leave a message and tell Xiaobian!

Keywords: MySQL

Added by wardo on Tue, 28 Dec 2021 02:35:16 +0200

Programming VIP