mysql IO replication exception - heartbeat is not compatible with local info

mysql IO replication exception - heartbeat is not compatible with local info

1. Problem description

An IO replication exception case was recorded. One set of MySQL Cluster slave database on a system line had the problem of IO replication thread interruption at about 22:57 every day in the recent week. Check the error information "heartbeat is not compatible with local info" recorded in the MySQL log. It is preliminarily judged that it is related to the heartbeat of master-slave replication. The DBA can recover by enabling the IO replication thread through the command.

2. Error log

2.1. Master database log

2021-09-11T22:54:02.307477+08:00 7337734 [Warning] Timeout waiting for reply of binlog (file: mysql-bin.001050, pos: 8081671672), semi-sync up to file mysql-bin.001050, position 234.
2021-09-11T22:55:15.606620+08:00 0 [ERROR] Read semi-sync reply magic number error
2021-09-11T22:55:15.607093+08:00 0 [ERROR] /usr/local/mysql/bin/mysqld: Got an error reading communication packets
2021-09-11T22:55:36.919916+08:00 7333647 [Warning] Timeout waiting for reply of binlog (file: mysql-bin.001050, pos: 25398346083), semi-sync up to file mysql-bin.001050, position 8081671672.
...
2021-09-12T22:54:48.389928+08:00 7533299 [Warning] Timeout waiting for reply of binlog (file: mysql-bin.001078, pos: 8081671672), semi-sync up to file mysql-bin.001078, position 234.
2021-09-12T22:56:02.291112+08:00 0 [ERROR] Read semi-sync reply magic number error
2021-09-12T22:56:02.291666+08:00 0 [ERROR] /usr/local/mysql/bin/mysqld: Got an error reading communication packets
2021-09-12T22:56:23.248059+08:00 7529261 [Warning] Timeout waiting for reply of binlog (file: mysql-bin.001078, pos: 25398613574), semi-sync up to file mysql-bin.001078, position 8081671672.
...
2021-09-13T22:54:29.329231+08:00 7728527 [Warning] Timeout waiting for reply of binlog (file: mysql-bin.001106, pos: 8081671672), semi-sync up to file mysql-bin.001106, position 234.
2021-09-13T22:55:43.308640+08:00 0 [ERROR] Read semi-sync reply magic number error
2021-09-13T22:55:43.308954+08:00 0 [ERROR] /usr/local/mysql/bin/mysqld: Got an error reading communication packets
2021-09-13T22:56:04.073410+08:00 7724514 [Warning] Timeout waiting for reply of binlog (file: mysql-bin.001106, pos: 25399063738), semi-sync up to file mysql-bin.001106, position 8081671672.

Print semi synchronous replication exception information from master database log:

  • Read semi-sync reply magic number error

  • Got an error reading communication packets

This is the exception of semi synchronous replication, which is also caused by the interruption of IO replication thread, and is not the root cause. Therefore, the exceptions recorded in the main database log can be ignored temporarily.

2.2. Log from library

2021-09-13T22:55:43.306352+08:00 1 [ERROR] Slave I/O for channel '': Unexpected master's heartbeat data: heartbeat is not compatible with local info; the event's data: log_file_name mysql-bin.001106ùÁò! log_pos 3786704376, Error_code: 1623
2021-09-13T22:55:43.313095+08:00 1 [ERROR] Slave I/O for channel '': Relay log write failure: could not queue event from master, Error_code: 1595
2021-09-13T23:01:03.132595+08:00 94910 [Warning] Storing MySQL user name or password information in the master info repository is not secure and is therefore not recommended. Please consider using the USER and PASSWORD connection options for START SLAVE; see the 'START SLAVE Syntax' in the MySQL Manual for more information.

Print IO copy thread exception information from library log:

  • MySQL Error_code: 1623 ER_SLAVE_HEARTBEAT_FALLURE Unexpected master's heartbeat data:%s

  • MySQL Error_code: 1595 ER_SLAVE_RELAY_LOG_WRITE_FALLURE Relay log write failure:%s

Here we need to pay special attention. From the literal meaning of the error message, it is related to the heartbeat and displays "the heartbeat is incompatible with the local information". Then locate the source code content through the error message, as follows:

ER_SLAVE_HEARTBEAT_FALLURE source code (rpl_slave.cc, lines 8339 to 8364)

/* 
   compare local and event's versions of log_file, log_pos.
   
   Heartbeat is sent only after an event corresponding to the corrdinates
   the heartbeat carries.
   Slave can not have a difference in coordinates except in the only
   special case when mi->get_master_log_name(), mi->get_master_log_pos() have never
   been updated by Rotate event i.e when slave does not have any history
   with the master (and thereafter mi->get_master_log_pos() is NULL).

   TODO: handling `when' for SHOW SLAVE STATUS' snds behind
*/
if (memcmp(const_cast<char *>(mi->get_master_log_name()),
           hb.get_log_ident(), hb.get_ident_len())
    || (mi->get_master_log_pos() > hb.common_header->log_pos))
{
  /* missed events of heartbeat from the past */
  char errbuf[1024];
  char llbuf[22];
  sprintf(errbuf, "heartbeat is not compatible with local info; "
          "the event's data: log_file_name %-.512s log_pos %s",
          hb.get_log_ident(), llstr(hb.common_header->log_pos, llbuf));
  mi->report(ERROR_LEVEL, ER_SLAVE_HEARTBEAT_FAILURE,
             ER(ER_SLAVE_HEARTBEAT_FAILURE), errbuf);
  goto err;
}

From the content of the source code, if the received position is less than the current position, the heartbeat mechanism will detect this change, think it is a missed heartbeat to get past events, and return an error.

ER_SLAVE_RELAY_LOG_WRITE_FALLURE source code (rpl_slave.cc, lines 5789 to 5798)

/* XXX: 'synced' should be updated by queue_event to indicate
     whether event has been synced to disk */
bool synced= 0;
if (queue_event(mi, event_buf, event_len))
{
    mi->report(ERROR_LEVEL, ER_SLAVE_RELAY_LOG_WRITE_FAILURE,
               ER(ER_SLAVE_RELAY_LOG_WRITE_FAILURE),
               "could not queue event from master");
    goto err;
}

3. Task inspection

After communicating with the project team, it was learned that the system will perform data exchange every night, and the full amount of data will be written into the latest data of the day, which is mainly provided to the report platform for business report display. At the same time, the amount of table data written is tens of millions, which belongs to mass transaction execution. Locate the specific data exchange task according to the alarm time. The project team adjusted the task time, staggered the execution, and no IO replication exception occurred again.

In terms of task execution, the two large transactions are basically committed at a relatively close time. At the same time, combined with the content of the source code, it is speculated that the IO replication thread interrupt problem is related to the position of the two large transactions, that is, one transaction is sent to the slave library, and the position of the next transaction is less than that of the previous transaction.

4. Event analysis

The currently configured binlog size is up to 1G. After a normal large transaction is completed, it will directly switch to the next log file. Unless two large transactions are committed at almost the same time and are divided into the same commit group, two large transactions may be written in the same binlog file.

Secondly, the header structure of binlog event is next_ The length of the position field is only 4 bytes, that is, the maximum value of the field is the 31st power of 2, that is, 4G. When the current binglog position is greater than 4G, the field will overflow, which leads to binlog event transmission, next_position will suddenly become smaller, and the heartbeat mechanism will detect this change and report an error.

There have been similar BUG feedback before, Error 95418.

5. Simulation test

Test reference: Question 29: MySQL replication heartbeat says it doesn't want to jump

6. Problem solving

  • Split large transactions;
  • Adjust the execution time of large transaction tasks.

Keywords: Database MySQL DBA

Added by niroshan on Tue, 04 Jan 2022 15:25:53 +0200