Fault case | Troubleshooting a TokuDB engine error in a master-slave replication environment

Welcome to the MySQL technical articles shared by the GreatSQL community. If you have questions or topics you would like covered, leave a message in the comment area below and we will answer when we see it.

  • GreatSQL community original content may not be used without authorization. To reprint, please contact the editor and credit the source.

0. Background

To keep historical data compressed, one of our systems stores its data with the TokuDB engine.

The data disk of the machine hosting the slave node is 33TB in total, and only 1.1TB was free at the time of the failure.

[root@redhat76-greatdb greatdb]# df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1        33T   32T  1.1T  97% /
devtmpfs         63G     0   63G   0% /dev
tmpfs            63G     0   63G   0% /dev/shm
tmpfs            63G  4.0G   59G   7% /run

1. Symptom description

The master node works normally, while the slave node's database error log shows the following:

2021-05-08T18:31:00.210203+08:00 44458 [ERROR] Slave SQL for channel '': Worker 1 failed executing transaction 'fb18799a-afeb-11eb-a3f0-fa163e18e1d9:513684180' at master log greatdb-bin.031344, end_log_pos 8397; Could not execute write_rows event on table test.t1; Got error 28 from storage engine, Error_code: 1030; handler error No Error!; the event's master log FIRST, end_log_pos 8397, Error_code:1030
2021-05-08T18:31:00.210236+08:00 44457 [Warning] Slave SQL for channel '': ... The slave coordinator and worker threads are stopped, possibly leaving data in inconsistent state, A restart should restore consistency automatically, although using non-transactional storage for data or info tables or DDL queries could lead to problems. In such cases you have to examine your data (see documentation for details). Error_code:1756

At this point the slave can no longer apply writes, and replay is aborted.
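
To confirm the replication state on the slave, one would typically check the slave status. The excerpt below is illustrative only (not captured from this environment); in this scenario the output should show the SQL thread stopped with the same error number as in the log above:

MySQL [(none)]> SHOW SLAVE STATUS\G
...
        Slave_IO_Running: Yes
       Slave_SQL_Running: No
          Last_SQL_Errno: 1030
...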

2. Cause analysis

Since we had not dealt with TokuDB-related problems before, and only 1.1TB of disk space was left at the time, the first reaction was to check whether the tmpdir had enough space.

MySQL [(none)]> show variables like '%tmpdir%';
+-------------------+----------------------------------------+
| Variable_name     | Value                                  |
+-------------------+----------------------------------------+
| innodb_tmpdir     |                                        |
| slave_load_tmpdir | /greatdb/tmp                           |
| tmpdir            | /greatdb/dbdata/greatdb57_data3307/tmp |
+-------------------+----------------------------------------+
3 rows in set (0.01 sec)

Checking the space and contents of the tmpdir directories confirms that there is enough space, so tmpdir should not be the cause of the problem.

Checking the error log again: Error_code 1030 with "Got error 28 from storage engine" (OS error 28 is ENOSPC, "No space left on device") suggests a TokuDB-side restriction, so we examine the TokuDB engine's relevant parameters.

Note the parameter tokudb_fs_reserve_percent:

MySQL [(none)]> show variables like '%tokudb%fs%';
+---------------------------+-------+
| Variable_name             | Value |
+---------------------------+-------+
| tokudb_fs_reserve_percent | 5     |
| tokudb_fsync_log_period   | 0     |
+---------------------------+-------+
2 rows in set (0.00 sec)

The explanation of this parameter in the manual is:

variable tokudb_fs_reserve_percent
Command Line: Yes
Config File: Yes
Scope: Global
Dynamic: No
Variable Type: Numeric
Range: 0-100
Default Value: 5
This variable controls the percentage of the file system that must be available for inserts to be allowed. By default, this is set to 5. We recommend that this reserve be at least half the size of your physical memory. See Full Disks for more information.

The default value is 5, meaning that when less than 5% of the file system is free, TokuDB refuses writes until more space is released.

At this point only about 3% of the slave node's data disk is free (1.1TB out of 33TB, versus the roughly 1.65TB that a 5% reserve requires), which should be the cause of the problem.

3. Handling steps

  • If a restart can be scheduled, lower tokudb_fs_reserve_percent and restart the instance (this option was not taken in this environment, considering the large data volume, safety, and other factors).
  • Clean up expired binlogs and relay logs: PURGE BINARY LOGS plus relay_log_purge (see the command sketch after this list).
  • Since multiple slave instances are deployed on this machine and each has a large relay log backlog (about 12TB), first record the master binlog coordinates, then use RESET SLAVE to drop the relay logs temporarily and fetch them again.
  • Clean up other logs, unused installation packages, and so on.
  • Ask the business side to clean up unneeded tables: truncate on the master first, and on the slave run SET sql_log_bin=0; TRUNCATE TABLE ...; SET sql_log_bin=1;.
  • Keep monitoring the remaining disk space to make sure master-slave replay proceeds normally.
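
Below is a minimal command sketch of the cleanup steps above, assuming MySQL 5.7 syntax; the binlog file name, position, and table name are placeholders (taken from or modeled on the error log) and must be adapted to the actual environment:

-- Purge binary logs that are no longer needed, after confirming that no
-- downstream replica still requires them (file name is a placeholder):
PURGE BINARY LOGS TO 'greatdb-bin.031344';

-- Make sure relay logs are removed automatically once they have been applied:
SET GLOBAL relay_log_purge = ON;

-- On the slave: record the master coordinates, drop the relay log backlog,
-- then resume replication.
STOP SLAVE;
SHOW SLAVE STATUS\G   -- note Relay_Master_Log_File / Exec_Master_Log_Pos
RESET SLAVE;          -- deletes all relay log files, keeps connection settings
-- If GTID auto-positioning is enabled, START SLAVE is enough; otherwise
-- re-point replication with CHANGE MASTER TO using the noted coordinates.
START SLAVE;

-- On the slave: truncate the tables agreed with the business side, without
-- writing the statements to the slave's own binlog (table name is an example):
SET sql_log_bin = 0;
TRUNCATE TABLE test.t1;
SET sql_log_bin = 1;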

4. Summary

To keep the database service healthy, TokuDB checks the remaining disk space every 5 seconds. By default, when less than 5% of the file system is free, writes are blocked until more space is released.

This restriction is controlled by the read-only, static parameter tokudb_fs_reserve_percent, which sets the percentage of the file system that must remain free. InnoDB, MyISAM, and other engines have no such parameter, so they can write until the disk is 100% full.

When using TokuDB, plan this parameter in advance: once disk usage reaches 95%, start preparing for capacity expansion. Note that 5% is only Percona's default recommendation; in practice it should be adjusted according to the size of the data disk.
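
As a planning sketch (the value shown is purely illustrative, not a recommendation from this article): since tokudb_fs_reserve_percent is static, it has to be set in my.cnf before the instance starts and verified afterwards.

-- tokudb_fs_reserve_percent is static (Dynamic: No), so it cannot be changed
-- with SET GLOBAL; put the desired value in my.cnf before starting the instance:
--   [mysqld]
--   tokudb_fs_reserve_percent = 2   -- illustrative value for a very large data disk
-- After the restart, confirm that the setting took effect:
SHOW GLOBAL VARIABLES LIKE 'tokudb_fs_reserve_percent';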

Enjoy GreatSQL :)

Recommended articles:

GreatSQL MGR FAQ
https://mp.weixin.qq.com/s/J6...

Wan Da #12: How can an MGR cluster automatically elect a primary without manual intervention?
https://mp.weixin.qq.com/s/07...

"2021 data technology Carnival · ON LINE": evolution and practice of MySQL high availability architecture
https://mp.weixin.qq.com/s/u7...

Using packet capture to analyze where an SQL statement is slow
https://mp.weixin.qq.com/s/AY...

Wan Da #15: What conditions may cause the MGR service to fail to start?
https://mp.weixin.qq.com/s/in...

Technology sharing | Why the AFTER consistency mode is not recommended for MGR
https://mp.weixin.qq.com/s/rN...

About GreatSQL

GreatSQL is a MySQL branch maintained by Wanli Database. It focuses on improving the reliability and performance of MGR and supports InnoDB parallel query, making it a MySQL branch suited to financial-grade applications.

Gitee:
https://gitee.com/GreatSQL/Gr...

GitHub:
https://github.com/GreatSQL/G...

Bilibili:
https://space.bilibili.com/13...

WeChat & QQ group:
Search for and add the GreatSQL community assistant on WeChat, then send the verification message "add group" to join the GreatSQL/MGR WeChat discussion group.

QQ group: 533341697
WeChat assistant: wanlidbc

This article is published via OpenWrite, a multi-channel blog publishing platform.
