Fault analysis | case analysis of master-slave synchronization error after MySQL slave fault restart

Author: Fu Xiang

Now living in Zhuhai, he is mainly responsible for the maintenance of Oracle, MySQL, mongoDB and Redis.

Source: original contribution

*It is produced by aikesheng open source community. The original content cannot be used without authorization. For reprint, please contact Xiaobian and indicate the source.

1. Environmental description

1.1. MySQL version

root@3306 (none)> SELECT @@VERSION;
+------------+
| @@VERSION |
+------------+
| 5.7.34-log |
+------------+
1 row in set (0.01 sec)

1.2. binlog related parameters

root@3306 (none)> select @@log_bin,@@log_slave_updates;
+-----------+---------------------+
| @@log_bin | @@log_slave_updates |
+-----------+---------------------+
| 1 | 1 |
+-----------+---------------------+
1 row in set (0.00 sec)
root@3306 (none)>

1.3. GTID related parameters

root@3306 (none)> select
@@binlog_gtid_simple_recovery,@@enforce_gtid_consistency,@@gtid_mode;
+-------------------------------+----------------------------+-------------+
| @@binlog_gtid_simple_recovery | @@enforce_gtid_consistency | @@gtid_mode |
+-------------------------------+----------------------------+-------------+
| 1 | ON | ON |
+-------------------------------+----------------------------+-------------+
1 row in set (0.01 sec)
root@3306 (none)>

1.4. Semi synchronous related parameters

root@3306 (none)> show global variables like '%semi%';
+-------------------------------------------+------------+
| Variable_name | Value |
+-------------------------------------------+------------+
| rpl_semi_sync_master_enabled | ON |
| rpl_semi_sync_master_timeout | 1000 |
| rpl_semi_sync_master_trace_level | 32 |
| rpl_semi_sync_master_wait_for_slave_count | 1 |
| rpl_semi_sync_master_wait_no_slave | ON |
| rpl_semi_sync_master_wait_point | AFTER_SYNC |
| rpl_semi_sync_slave_enabled | ON |
| rpl_semi_sync_slave_trace_level | 32 |
+-------------------------------------------+------------+
8 rows in set (0.00 sec)
root@3306 (none)>

1.5. Multithread synchronization related parameters

root@3306 (none)> select
@@binlog_transaction_dependency_tracking,@@slave_parallel_type,@@slave_parallel_
workers;
+------------------------------------------+-----------------------+---------
-----------------+
| @@binlog_transaction_dependency_tracking | @@slave_parallel_type |
@@slave_parallel_workers |
+------------------------------------------+-----------------------+---------
-----------------+
| COMMIT_ORDER | LOGICAL_CLOCK |
2 |
+------------------------------------------+-----------------------+---------
-----------------+
1 row in set (0.00 sec)
root@3306 (none)>

2. Fault phenomenon

After MySQL restarts from the host where the database is located, sql_thread error:

root@3306 (none)> show slave status\G
-- Useful information is extracted as follows:
Slave_IO_Running: Yes
Slave_SQL_Running: No
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 1062
Last_Error: Coordinator stopped because there were
error(s) in the worker(s). The most recent failure being: Worker 1 failed
executing transaction '471c2974-f9bb-11eb-afb1-52540010fb89:88313207' at master
log MySql-bin.000685, end_log_pos 1004756557. See error log and/or
performance_schema.replication_applier_status_by_worker table for more details
about this failure or others, if any.
......ellipsis..................
Retrieved_Gtid_Set: 471c2974-f9bb-11eb-afb1-
52540010fb89:88313207-88341912
Executed_Gtid_Set: 471c2974-f9bb-11eb-afb1-52540010fb89:1-
88313206,
d4c228df-f9c6-11eb-a2d8-5254006f63b6:1-5
Auto_Position: 1
root@3306 (none)> select * from
performance_schema.replication_applier_status_by_worker\G
*************************** 1. row ***************************
CHANNEL_NAME:
WORKER_ID: 1
THREAD_ID: NULL
SERVICE_STATE: OFF
LAST_SEEN_TRANSACTION: 471c2974-f9bb-11eb-afb1-52540010fb89:88313207
LAST_ERROR_NUMBER: 1062
LAST_ERROR_MESSAGE: Worker 1 failed executing transaction '471c2974-f9bb-
11eb-afb1-52540010fb89:88313207' at master log MySql-bin.000685, end_log_pos
1004756557; Could not execute Write_rows event on table kefumobile.i_sms_proxy;
Duplicate entry '14765130' for key 'PRIMARY', Error_code: 1062; handler error
HA_ERR_FOUND_DUPP_KEY; the event's master log FIRST, end_log_pos 1004756557
LAST_ERROR_TIMESTAMP: 2022-01-24 23:05:02
*************************** 2. row ***************************
CHANNEL_NAME:
WORKER_ID: 2
THREAD_ID: NULL
SERVICE_STATE: OFF
LAST_SEEN_TRANSACTION:
LAST_ERROR_NUMBER: 0
LAST_ERROR_MESSAGE:
LAST_ERROR_TIMESTAMP: 0000-00-00 00:00:00
2 rows in set (0.00 sec)

It can be seen from the error message that when the worker thread replays the transaction '471c2974-f9bb-11eb-afb1-52540010fb89:88313207', it reports an error due to the conflict of the record primary key to be inserted.

3. Fault analysis

Before the host restarts, the master-slave synchronization is normal. After the host restarts, the master-slave synchronization reports an error due to the conflict of primary keys. The row records of the conflicting primary keys are compared
The records in the master-slave database are consistent. According to the preliminary analysis, the transaction '471c2974-f9bb-11eb-afb1-52540010fb89:88313207' is in the host
The transaction has been replayed from the database before the failure, so why is the transaction replayed repeatedly?

When gtid mode is enabled, if master is specified_ auto_ Position = 1, start slave
Retrieved_Gtid_Set and executed_ Gtid_ The union of set is sent to the main library, which will receive the Union and its own
gtid_ The executed comparison sends all the missing transactions in the slave library gtid set to the slave library.

After the host restarts, the transaction is replayed repeatedly, indicating that it has been Retrieved_Gtid_Set and executed_ GTID_ GTID transaction in set
Missing, resulting in primary key conflict error caused by repeated acquisition transaction execution. Retrieved_Gtid_Set and Executed_Gtid_Set is memory variable
Quantity, retrieved after MySQL restart_ GTID_ Set is initialized to a null value to infer Executed_Gtid_Set has GTID transaction missing
Lost.

Executed_Gtid_Set from gtid_executed variable, gtid_executed variable persistence media
mysql.gtid_executed table and binlog, where mysql GTID_ The executed table was introduced after MySQL 5.7. In MySQL 5.6, to use GTID from the database, log must be set first_ bin=on,log_ slave_ Updates = on, because the GTID executed from the library is only kept in binlog.

When log_bin=on ，log_ slave_ Gtid when updates = off_ The update of the executed variable is persisted to MySQL in real time gtid_ In the executed table, gtid after MySQL restart_ The value of the executed variable is from mysql gtid_executed table read.
When log_bin=on ，log_ slave_ When updates = on, mysql will be updated only when binlog is switched GTID_ The executed table saves the GTID set executed until the last binlog. After mysql restarts, set binlog as the default parameter_ GTID_ simple_ When recovery = 1, GTID_ The value of the executed variable is calculated from the last binlog file.

gtid_ We can infer the value of the real-time variable, sync, and execute log_ binlog :

root@3306 (none)> show variables like 'sync_binlog';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| sync_binlog | 600 |
+---------------+-------+
1 row in set (0.00 sec)

Through the above analysis, the cause and effect of the fault is clear:

Worker thread reported 1062 primary key conflict error -- > gtid_ The executed information is stale -- > binlog is not persisted in real time

4. Test verification

Build a master-slave test environment, simulate the concurrent insertion of the master library through sysbench, and the fault reappears after the slave library host is shut down violently:

root@mysql.sock][(none)]> select * from
performance_schema.replication_applier_status_by_worker\G
*************************** 1. row ***************************
CHANNEL_NAME:
WORKER_ID: 1
THREAD_ID: NULL
SERVICE_STATE: OFF
LAST_SEEN_TRANSACTION: 4a0ad3da-b89e-11eb-9d0b-000c299b4d6c:452362
LAST_ERROR_NUMBER: 1062
LAST_ERROR_MESSAGE: Worker 1 failed executing transaction '4a0ad3da-b89e-
11eb-9d0b-000c299b4d6c:452362' at master log MySql-bin.000012, end_log_pos
1011339749; Could not execute Write_rows event on table sbtest.sbtest5;
Duplicate entry '111305' for key 'PRIMARY', Error_code: 1062; handler error
HA_ERR_FOUND_DUPP_KEY; the event's master log FIRST, end_log_pos 1011339749
LAST_ERROR_TIMESTAMP: 2022-01-26 09:56:38
*************************** 2. row ***************************
CHANNEL_NAME:
WORKER_ID: 2
THREAD_ID: NULL
SERVICE_STATE: OFF
LAST_SEEN_TRANSACTION:
LAST_ERROR_NUMBER: 0
LAST_ERROR_MESSAGE:
LAST_ERROR_TIMESTAMP: 0000-00-00 00:00:00
2 rows in set (0.00 sec)
[root@mysql.sock][(none)]>

5. Troubleshooting

Since the cause of the error is the repeated execution of transactions, it's good to skip the error. There are two methods, one of which can be selected as needed:

5.1. Replace the error reporting transaction with an empty transaction

set gtid_next='xxxxxx';
begin;
commit;
set gtid_next=AUTOMATIC;
start slave sql_thread

If the latest binglog loses a lot of gtids, manual execution is cumbersome and trial and error is required. A stored procedure can be written and executed in batches:

set sql_log_bin=0;
DELIMITER $$
create procedure p_fx_gtid_next(i_master_Executed_Gtid_max varchar(100))
begin
declare v_gtid_next varchar(100);
declare master_Executed_Gtid_max varchar(100);
declare slave_Executed_Gtid_max varchar(100);
-- Currently executed by the main library gtid Maximum value, as the condition for exiting the cycle show master status
set master_Executed_Gtid_max=i_master_Executed_Gtid_max;
loop_name:loop
SELECT ifnull(min(LAST_SEEN_TRANSACTION),'empty') FROM
performance_schema.replication_applier_status_by_worker WHERE
LAST_ERROR_NUMBER=1062
into v_gtid_next;
if v_gtid_next <> 'empty' then
set gtid_next = v_gtid_next;
start transaction;
commit;
set gtid_next =AUTOMATIC;
start slave sql_thread;
end if;
select max(LAST_SEEN_TRANSACTION) from
performance_schema.replication_applier_status_by_worker into
slave_Executed_Gtid_max;
if slave_Executed_Gtid_max = master_Executed_Gtid_max then
leave loop_name;
end if;
select SLEEP(1);
end loop;
end $$
DELIMITER ;
set sql_log_bin=1;
call p_fx_gtid_next('XXXXX:XXX');

5.2. Slave with parameters_ skip_ Errors = 1062 restart MySQL

After the master-slave synchronization is normal, cancel the parameter slave_ skip_ Set errors to restart MySQL.

Keywords: MySQL

Added by crees on Tue, 22 Feb 2022 09:22:34 +0200

Programming VIP