Author: Fu Xiang
Now living in Zhuhai, he is mainly responsible for the maintenance of Oracle, MySQL, mongoDB and Redis.
Source: original contribution
*It is produced by aikesheng open source community. The original content cannot be used without authorization. For reprint, please contact Xiaobian and indicate the source.
1. Environmental description
1.1. MySQL version
root@3306 (none)> SELECT @@VERSION; +------------+ | @@VERSION | +------------+ | 5.7.34-log | +------------+ 1 row in set (0.01 sec)
1.2. binlog related parameters
root@3306 (none)> select @@log_bin,@@log_slave_updates; +-----------+---------------------+ | @@log_bin | @@log_slave_updates | +-----------+---------------------+ | 1 | 1 | +-----------+---------------------+ 1 row in set (0.00 sec) root@3306 (none)>
1.3. GTID related parameters
root@3306 (none)> select @@binlog_gtid_simple_recovery,@@enforce_gtid_consistency,@@gtid_mode; +-------------------------------+----------------------------+-------------+ | @@binlog_gtid_simple_recovery | @@enforce_gtid_consistency | @@gtid_mode | +-------------------------------+----------------------------+-------------+ | 1 | ON | ON | +-------------------------------+----------------------------+-------------+ 1 row in set (0.01 sec) root@3306 (none)>
1.4. Semi synchronous related parameters
root@3306 (none)> show global variables like '%semi%'; +-------------------------------------------+------------+ | Variable_name | Value | +-------------------------------------------+------------+ | rpl_semi_sync_master_enabled | ON | | rpl_semi_sync_master_timeout | 1000 | | rpl_semi_sync_master_trace_level | 32 | | rpl_semi_sync_master_wait_for_slave_count | 1 | | rpl_semi_sync_master_wait_no_slave | ON | | rpl_semi_sync_master_wait_point | AFTER_SYNC | | rpl_semi_sync_slave_enabled | ON | | rpl_semi_sync_slave_trace_level | 32 | +-------------------------------------------+------------+ 8 rows in set (0.00 sec) root@3306 (none)>
1.5. Multithread synchronization related parameters
root@3306 (none)> select @@binlog_transaction_dependency_tracking,@@slave_parallel_type,@@slave_parallel_ workers; +------------------------------------------+-----------------------+--------- -----------------+ | @@binlog_transaction_dependency_tracking | @@slave_parallel_type | @@slave_parallel_workers | +------------------------------------------+-----------------------+--------- -----------------+ | COMMIT_ORDER | LOGICAL_CLOCK | 2 | +------------------------------------------+-----------------------+--------- -----------------+ 1 row in set (0.00 sec) root@3306 (none)>
2. Fault phenomenon
After MySQL restarts from the host where the database is located, sql_thread error:
root@3306 (none)> show slave status\G -- Useful information is extracted as follows: Slave_IO_Running: Yes Slave_SQL_Running: No Replicate_Do_DB: Replicate_Ignore_DB: Replicate_Do_Table: Replicate_Ignore_Table: Replicate_Wild_Do_Table: Replicate_Wild_Ignore_Table: Last_Errno: 1062 Last_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '471c2974-f9bb-11eb-afb1-52540010fb89:88313207' at master log MySql-bin.000685, end_log_pos 1004756557. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any. ......ellipsis.................. Retrieved_Gtid_Set: 471c2974-f9bb-11eb-afb1- 52540010fb89:88313207-88341912 Executed_Gtid_Set: 471c2974-f9bb-11eb-afb1-52540010fb89:1- 88313206, d4c228df-f9c6-11eb-a2d8-5254006f63b6:1-5 Auto_Position: 1 root@3306 (none)> select * from performance_schema.replication_applier_status_by_worker\G *************************** 1. row *************************** CHANNEL_NAME: WORKER_ID: 1 THREAD_ID: NULL SERVICE_STATE: OFF LAST_SEEN_TRANSACTION: 471c2974-f9bb-11eb-afb1-52540010fb89:88313207 LAST_ERROR_NUMBER: 1062 LAST_ERROR_MESSAGE: Worker 1 failed executing transaction '471c2974-f9bb- 11eb-afb1-52540010fb89:88313207' at master log MySql-bin.000685, end_log_pos 1004756557; Could not execute Write_rows event on table kefumobile.i_sms_proxy; Duplicate entry '14765130' for key 'PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log FIRST, end_log_pos 1004756557 LAST_ERROR_TIMESTAMP: 2022-01-24 23:05:02 *************************** 2. row *************************** CHANNEL_NAME: WORKER_ID: 2 THREAD_ID: NULL SERVICE_STATE: OFF LAST_SEEN_TRANSACTION: LAST_ERROR_NUMBER: 0 LAST_ERROR_MESSAGE: LAST_ERROR_TIMESTAMP: 0000-00-00 00:00:00 2 rows in set (0.00 sec)
It can be seen from the error message that when the worker thread replays the transaction '471c2974-f9bb-11eb-afb1-52540010fb89:88313207', it reports an error due to the conflict of the record primary key to be inserted.
3. Fault analysis
Before the host restarts, the master-slave synchronization is normal. After the host restarts, the master-slave synchronization reports an error due to the conflict of primary keys. The row records of the conflicting primary keys are compared
The records in the master-slave database are consistent. According to the preliminary analysis, the transaction '471c2974-f9bb-11eb-afb1-52540010fb89:88313207' is in the host
The transaction has been replayed from the database before the failure, so why is the transaction replayed repeatedly?
When gtid mode is enabled, if master is specified_ auto_ Position = 1, start slave
Retrieved_Gtid_Set and executed_ Gtid_ The union of set is sent to the main library, which will receive the Union and its own
gtid_ The executed comparison sends all the missing transactions in the slave library gtid set to the slave library.
After the host restarts, the transaction is replayed repeatedly, indicating that it has been Retrieved_Gtid_Set and executed_ GTID_ GTID transaction in set
Missing, resulting in primary key conflict error caused by repeated acquisition transaction execution. Retrieved_Gtid_Set and Executed_Gtid_Set is memory variable
Quantity, retrieved after MySQL restart_ GTID_ Set is initialized to a null value to infer Executed_Gtid_Set has GTID transaction missing
Lost.
Executed_Gtid_Set from gtid_executed variable, gtid_executed variable persistence media
mysql.gtid_executed table and binlog, where mysql GTID_ The executed table was introduced after MySQL 5.7. In MySQL 5.6, to use GTID from the database, log must be set first_ bin=on,log_ slave_ Updates = on, because the GTID executed from the library is only kept in binlog.
- When log_bin=on ,log_ slave_ Gtid when updates = off_ The update of the executed variable is persisted to MySQL in real time gtid_ In the executed table, gtid after MySQL restart_ The value of the executed variable is from mysql gtid_executed table read.
- When log_bin=on ,log_ slave_ When updates = on, mysql will be updated only when binlog is switched GTID_ The executed table saves the GTID set executed until the last binlog. After mysql restarts, set binlog as the default parameter_ GTID_ simple_ When recovery = 1, GTID_ The value of the executed variable is calculated from the last binlog file.
gtid_ We can infer the value of the real-time variable, sync, and execute log_ binlog :
root@3306 (none)> show variables like 'sync_binlog'; +---------------+-------+ | Variable_name | Value | +---------------+-------+ | sync_binlog | 600 | +---------------+-------+ 1 row in set (0.00 sec)
Through the above analysis, the cause and effect of the fault is clear:
Worker thread reported 1062 primary key conflict error -- > gtid_ The executed information is stale -- > binlog is not persisted in real time
4. Test verification
Build a master-slave test environment, simulate the concurrent insertion of the master library through sysbench, and the fault reappears after the slave library host is shut down violently:
root@mysql.sock][(none)]> select * from performance_schema.replication_applier_status_by_worker\G *************************** 1. row *************************** CHANNEL_NAME: WORKER_ID: 1 THREAD_ID: NULL SERVICE_STATE: OFF LAST_SEEN_TRANSACTION: 4a0ad3da-b89e-11eb-9d0b-000c299b4d6c:452362 LAST_ERROR_NUMBER: 1062 LAST_ERROR_MESSAGE: Worker 1 failed executing transaction '4a0ad3da-b89e- 11eb-9d0b-000c299b4d6c:452362' at master log MySql-bin.000012, end_log_pos 1011339749; Could not execute Write_rows event on table sbtest.sbtest5; Duplicate entry '111305' for key 'PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log FIRST, end_log_pos 1011339749 LAST_ERROR_TIMESTAMP: 2022-01-26 09:56:38 *************************** 2. row *************************** CHANNEL_NAME: WORKER_ID: 2 THREAD_ID: NULL SERVICE_STATE: OFF LAST_SEEN_TRANSACTION: LAST_ERROR_NUMBER: 0 LAST_ERROR_MESSAGE: LAST_ERROR_TIMESTAMP: 0000-00-00 00:00:00 2 rows in set (0.00 sec) [root@mysql.sock][(none)]>
5. Troubleshooting
Since the cause of the error is the repeated execution of transactions, it's good to skip the error. There are two methods, one of which can be selected as needed:
5.1. Replace the error reporting transaction with an empty transaction
set gtid_next='xxxxxx'; begin; commit; set gtid_next=AUTOMATIC; start slave sql_thread
If the latest binglog loses a lot of gtids, manual execution is cumbersome and trial and error is required. A stored procedure can be written and executed in batches:
set sql_log_bin=0; DELIMITER $$ create procedure p_fx_gtid_next(i_master_Executed_Gtid_max varchar(100)) begin declare v_gtid_next varchar(100); declare master_Executed_Gtid_max varchar(100); declare slave_Executed_Gtid_max varchar(100); -- Currently executed by the main library gtid Maximum value, as the condition for exiting the cycle show master status set master_Executed_Gtid_max=i_master_Executed_Gtid_max; loop_name:loop SELECT ifnull(min(LAST_SEEN_TRANSACTION),'empty') FROM performance_schema.replication_applier_status_by_worker WHERE LAST_ERROR_NUMBER=1062 into v_gtid_next; if v_gtid_next <> 'empty' then set gtid_next = v_gtid_next; start transaction; commit; set gtid_next =AUTOMATIC; start slave sql_thread; end if; select max(LAST_SEEN_TRANSACTION) from performance_schema.replication_applier_status_by_worker into slave_Executed_Gtid_max; if slave_Executed_Gtid_max = master_Executed_Gtid_max then leave loop_name; end if; select SLEEP(1); end loop; end $$ DELIMITER ; set sql_log_bin=1; call p_fx_gtid_next('XXXXX:XXX');
5.2. Slave with parameters_ skip_ Errors = 1062 restart MySQL
After the master-slave synchronization is normal, cancel the parameter slave_ skip_ Set errors to restart MySQL.