Fault Case | The Whole Troubleshooting Process of a MySQL High Availability Switchover Caused by Slow SQL

Author: Liang Hang, DBA at Wanli Database. Skilled in diagnosing database performance problems and analyzing transaction and lock issues, responsible for handling problems in the daily operation and maintenance of MySQL, and keenly interested in open source database technologies.

  • This is original content from the GreatSQL community and may not be used without authorization. For reprints, please contact the editors and cite the source.

1. Symptom description

During an inspection we found that the primary and standby MySQL servers had switched roles, yet the MySQL service itself looked perfectly normal. The DBA had not performed any switchover and there had been no server maintenance. Fortunately, the business was not significantly affected. So why did this happen?

For reference, the original primary server address is 172.16.87.72 and the original standby server address is 172.16.87.123.

2. Troubleshooting approach

  • Check the MySQL metrics in the monitoring system
  • Check the keepalived switchover log
  • Check the MySQL error log

3. Troubleshooting process

3.1 Check MySQL metrics in the monitoring system to determine when the failure occurred (roughly locate the switchover time from the traffic)

In the monitoring system, check the metrics of the current primary MySQL instance (the one serving the business) over the last day and the last fifteen days (taking queries per second as an example).

We found that the figure has risen sharply since the 22nd, so we initially suspect that a primary/standby switchover happened on the 22nd. The line chart of the standby instance points the same way. To confirm, we check the keepalived switchover log to pin down the exact time of the switchover.
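For reference, this per-second query rate is usually derived by sampling a cumulative status counter and dividing the delta by the sampling interval; a manual spot check could look like the sketch below (the exact counter the monitoring system uses may differ):

-- Sample the cumulative statement counter twice, N seconds apart;
-- QPS over that window is (second value - first value) / N
SHOW GLOBAL STATUS LIKE 'Questions';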

3.2 Check keepalived to determine why the switchover occurred

Log in to 87.72 and check the keepalived switchover log. Note: the system is Ubuntu and keepalived is not configured with a dedicated log output, so it writes to the default system log file, syslog. The log information is as follows:

shell> cat syslog.6|grep "vrrp"
Oct 23 15:50:34 bjuc-mysql1 Keepalived_vrrp: VRRP_Script(check_mysql) failed
Oct 23 15:50:34 bjuc-mysql1 syslog-ng[31634]: Error opening file for writing; filename='/var/log/tang/Keepalived_vrrp.log', error='No such file or directory (2)'
Oct 23 15:50:35 bjuc-mysql1 Keepalived_vrrp: VRRP_Instance(VI_74) Received higher prio advert
Oct 23 15:50:35 bjuc-mysql1 Keepalived_vrrp: VRRP_Instance(VI_74) Entering BACKUP STATE
Oct 23 15:50:35 bjuc-mysql1 syslog-ng[31634]: Error opening file for writing; filename='/var/log/tang/Keepalived_vrrp.log', error='No such file or directory (2)'
Oct 23 15:50:35 bjuc-mysql1 syslog-ng[31634]: Error opening file for writing; filename='/var/log/tang/Keepalived_vrrp.log', error='No such file or directory (2)'
Oct 23 15:50:35 bjuc-mysql1 syslog-ng[31634]: Error opening file for writing; filename='/var/log/tang/Keepalived_vrrp.log', error='No such file or directory (2)'
Oct 23 15:50:35 bjuc-mysql1 syslog-ng[31634]: Error opening file for writing; filename='/var/log/tang/Keepalived_vrrp.log', error='No such file or directory (2)'
Oct 23 15:50:56 bjuc-mysql1 Keepalived_vrrp: VRRP_Script(check_mysql) succeeded

Log analysis results

From the log we can see that at Oct 23 15:50:34 the check script failed to detect the MySQL service, which caused the VIP to switch over. Next, check the keepalived log on 87.123 to determine when it took over:

Oct 23 15:50:35 prod-UC-yewu-db-slave2 Keepalived_vrrp: VRRP_Instance(VI_74) forcing a new MASTER election 
Oct 23 15:50:36 prod-UC-yewu-db-slave2 Keepalived_vrrp: VRRP_Instance(VI_74) Transition to MASTER STATE 
Oct 23 15:50:37 prod-UC-yewu-db-slave2 Keepalived_vrrp: VRRP_Instance(VI_74) Entering MASTER STATE 

From the above log we can see that the switchover between the two masters occurred at Oct 23 15:50:35.

3.3 Question: so why did the check script fail?

View the check script content:

shell> cat check_mysql.sh
#!/bin/bash
export PATH=$PATH:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
USER=xxx
PASSWORD=xxx
PORT=3306
IP=127.0.0.1

# Probe MySQL with a trivial query; a non-zero exit status tells keepalived the check failed
mysql --connect-timeout=1 -u$USER -p$PASSWORD -h$IP -P$PORT -e "select 1;"
exit $?

The check script is very simple: it runs "select 1" with a 1-second connection timeout. This SQL does not interfere with any MySQL operation and costs almost nothing, so why did it fail? Let's look at the MySQL error log and see whether we can find anything.

3.4 Check the MySQL error log

191023 15:50:54 [ERROR] /usr/sbin/mysqld: Sort aborted: Server shutdown in progress
191023 15:50:55 [ERROR] /usr/sbin/mysqld: Sort aborted: Server shutdown in progress
191023 15:50:55 [ERROR] /usr/sbin/mysqld: Sort aborted: Server shutdown in progress
191023 15:50:56 [ERROR] /usr/sbin/mysqld: Sort aborted: Server shutdown in progress
191023 15:50:56 [ERROR] /usr/sbin/mysqld: Sort aborted: Server shutdown in progress
191023 15:50:57 [ERROR] /usr/sbin/mysqld: Sort aborted: Server shutdown in progress
191023 15:50:58 [ERROR] /usr/sbin/mysqld: Sort aborted: Server shutdown in progress
191023 15:50:58 [ERROR] /usr/sbin/mysqld: Sort aborted: Server shutdown in progress
191023 15:50:59 [ERROR] /usr/sbin/mysqld: Sort aborted: Server shutdown in progress
191023 15:51:28 [ERROR] Slave SQL: Error 'Lock wait timeout exceeded; try restarting transaction' on query. Default database: 'statusnet'. Query: 'UPDATE ticket SET max_id=max_id+step WHERE biz_tag='p2p_file'', Error_code: 1205
191023 15:54:43 [ERROR] Slave SQL thread retried transaction 10 time(s) in vain, giving up. Consider raising the value of the slave_transaction_retries variable.
191023 15:54:43 [Warning] Slave: Lock wait timeout exceeded; try restarting transaction Error_code: 1205
191023 15:54:43 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mysql-bin.001432' position 973809430
191023 16:04:13 [ERROR] Error reading packet from server: Lost connection to MySQL server during query ( server_errno=2013)
191023 16:04:13 [Note] Slave I/O thread killed while reading event
191023 16:04:13 [Note] Slave I/O thread exiting, read up to log 'mysql-bin.001432', position 998926454
191023 16:04:20 [Note] Slave SQL thread initialized, starting replication in log 'mysql-bin.001432' at position 973809430, relay log './mysql-relay-bin.000188' position: 54422
191023 16:04:20 [Note] 'SQL_SLAVE_SKIP_COUNTER=1' executed at relay_log_file='./mysql-relay-bin.000188', relay_log_pos='54422', master_log_name='mysql-bin.001432', master_log_pos='973809430' and new position at relay_log_file='./mysql-relay-bin.000188', relay_log_pos='54661', master_log_name='mysql-bin.001432', master_log_pos='973809669'
191023 16:04:20 [Note] Slave I/O thread: connected to master 'zhu@x:3306',replication started in log 'mysql-bin.001432' at position 998926454

From the log we can see that MySQL started reporting "Server shutdown in progress" errors at 15:50:54, the replication SQL thread aborted at 15:54:43, and replication was back to normal at 16:04:20. So what does "Server shutdown in progress" mean here, given that the server never shut down? This error comes from a MySQL 5.5 behavior: if a long slow query that is using the sort buffer is actively killed, the server behaves much as if it were restarting and the connection is closed.

Official bug address: https://bugs.mysql.com/bug.php?id=18256
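To make this concrete, the scenario roughly looks like the following sketch (not a live reproduction; the query shape and the thread id are borrowed from the slow log and the pt-kill log shown later):

-- Session A: a long-running query that needs a filesort, i.e. uses the sort buffer
SELECT g.id, g.nickname, g.modified
FROM user_group AS g
LEFT JOIN group_member AS gm ON g.id = gm.group_id
WHERE gm.profile_id = 7382782
ORDER BY g.modified ASC;

-- Session B: locate the query and kill it, just as pt-kill does
SHOW PROCESSLIST;   -- note the Id of the long-running query
KILL 75278028;      -- thread id borrowed from the pt-kill log below

-- On MySQL 5.5 the killed connection is closed and the error log records
-- "Sort aborted: Server shutdown in progress", even though mysqld never restarted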

Let's quickly check whether the production environment is indeed running MySQL 5.5:

mysql> \s
--------------
mysql  Ver 14.14 Distrib 5.5.22, for debian-linux-gnu (x86_64) using readline 6.2

Connection id:  76593111
Current database:
Current user:  xxx@127.0.0.1
SSL:   Not in use
Current pager:  stdout
Using outfile:  ''
Using delimiter: ;
Server version:  5.5.22-0ubuntu1-log (Ubuntu)
Protocol version: 10
Connection:  127.0.0.1 via TCP/IP
Server characterset: utf8
Db     characterset: utf8
Client characterset: utf8
Conn.  characterset: utf8
TCP port:  3306
Uptime:   462 days 11 hours 50 min 11 sec

Threads: 9  Questions: 14266409536  Slow queries: 216936  Opens: 581411  Flush tables: 605  Open tables: 825  Queries per second avg: 357.022

We can see that the production environment is indeed MySQL 5.5, with an uptime of 462 days 11 hours 50 min 11 sec, which proves that the mysqld process was never actually restarted. Combined with the error log we just examined (including the replication interruption and recovery), this confirms the MySQL bug: killed sorting queries made the server close connections as if it were shutting down, so the keepalived check saw MySQL as unavailable and the check_mysql script failed.
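If you only need the version and the uptime rather than the full \s output, an equivalent spot check would be (a small sketch):

SELECT VERSION();                  -- e.g. 5.5.22-0ubuntu1-log
SHOW GLOBAL STATUS LIKE 'Uptime';  -- how long mysqld has been running, in seconds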

3.5 Let's dig deeper: what caused the slow SQL to be killed?

We continue by looking at the slow query log around the corresponding time:

# Time: 191023 15:50:39
# User@Host: dstatusnet[dstatusnet] @  [x]
# Query_time: 35.571088  Lock_time: 0.742899 Rows_sent: 1  Rows_examined: 72
SET timestamp=1571817039;
SELECT g.id,g.group_pinyin, g.external_type,g.external_group_id,g.mount_id,g.is_display as group_type,g.nickname,g.site_id,g.group_logo,g.created,g.modified,g.misc,g.name_update_flag as name_flag, g.logo_level,g.group_status,g.conversation, g.only_admin_invite,g.reach_limit_count,g.water_mark  FROM user_group AS g LEFT JOIN group_member as gm ON g.id = gm.group_id WHERE g.is_display in (1,2,3) AND gm.profile_id = 7382782 AND gm.join_state != 2 AND UNIX_TIMESTAMP(g.modified) > 1571722305 ORDER BY g.modified ASC LIMIT 0, 1000;
# User@Host: dstatusnet[dstatusnet] @  [x]
# Query_time: 34.276504  Lock_time: 8.440094 Rows_sent: 1  Rows_examined: 1
SET timestamp=1571817039;
SELECT conversation, is_display as group_type, group_status FROM user_group WHERE id=295414;
# Time: 191023 15:50:40
# User@Host: dstatusnet[dstatusnet] @  [x]
# Query_time: 31.410245  Lock_time: 0.561083 Rows_sent: 0  Rows_examined: 7
SET timestamp=1571817040;
SELECT id, clientScope, msgTop FROM uc_application where site_id=106762 AND msgTop=1;
# User@Host: dstatusnet[dstatusnet] @  [x]
# Query_time: 38.442446  Lock_time: 2.762945 Rows_sent: 2  Rows_examined: 38
SET timestamp=1571817040;
SELECT g.id,g.group_pinyin, g.external_type,g.external_group_id,g.mount_id,g.is_display as group_type,g.nickname,g.site_id,g.group_logo,g.created,g.modified,g.misc,g.name_update_flag as name_flag, g.logo_level,g.group_status,g.conversation, g.only_admin_invite,g.reach_limit_count,g.water_mark  FROM user_group AS g LEFT JOIN group_member as gm ON g.id = gm.group_id WHERE g.is_display in (1,2,3) AND gm.profile_id = 9867566 AND gm.join_state != 2 AND UNIX_TIMESTAMP(g.modified) > 1571796860 ORDER BY g.modified ASC LIMIT 0, 1000;
# Time: 191023 15:50:46
# User@Host: dstatusnet[dstatusnet] @  [x]
# Query_time: 40.481602  Lock_time: 8.530303 Rows_sent: 0  Rows_examined: 1
SET timestamp=1571817046;
update heartbeat set modify=1571817004 where session_id='89b480936c1384305e35b531221b705332a4aebf';
# Time: 191023 15:50:41
# User@Host: dstatusnet[dstatusnet] @  [x]
# Query_time: 32.748755  Lock_time: 0.000140 Rows_sent: 0  Rows_examined: 3
SET timestamp=1571817041;
SELECT id,nickname,mount_id,is_display,external_type,misc,site_id,conversation as conv_id,only_admin_invite,reach_limit_count,water_mark, group_logo,logo_level,created,modified,name_update_flag as name_flag, group_pinyin, group_status,external_group_id,event_id FROM user_group WHERE id in(1316262,1352179,1338753) AND UNIX_TIMESTAMP(modified) > 1571023021 ORDER BY created ASC;
# Time: 191023 15:50:46
# User@Host: dstatusnet[dstatusnet] @  [x]
# Query_time: 40.764872  Lock_time: 4.829707 Rows_sent: 0  Rows_examined: 7
SET timestamp=1571817046;
SELECT id,nickname,mount_id,is_display,external_type,misc,site_id,conversation as conv_id,only_admin_invite,reach_limit_count,water_mark, group_logo,logo_level,created,modified,name_update_flag as name_flag, group_pinyin, group_status,external_group_id,event_id FROM user_group WHERE id in(673950,1261923,1261748,1262047,1038545,1184675,1261825) AND UNIX_TIMESTAMP(modified) > 1571744114 ORDER BY created ASC;

As shown above, these statements were executed many times during this window, each taking anywhere from 30 to 60 seconds. These are seriously slow queries.

3.6 So what operation issued the kill?

After asking around and further investigation, we found that pt-kill is used for automatically killing long-running queries, configured with a 30-second busy-time threshold:

shell> ps -aux|grep 87.72
Warning: bad syntax, perhaps a bogus '-'? See /usr/share/doc/procps-3.2.8/FAQ
root      3813  0.0  0.0 218988  8072 ?        Ss   Oct19   9:54 perl /usr/bin/pt-kill --busy-time 30 --interval 10 --match-command Query --ignore-user rep|repl|dba|ptkill --ignore-info 40001 SQL_NO_CACHE --victims all --print --kill --daemonize --charset utf8 --log=/tmp/ptkill.log.kill_____xxx_____172.16.87.72 -h172.16.87.72 -uxxxx -pxxxx

shell> cat /tmp/ptkill.log.kill_____xxx_____172.16.87.72
2019-10-23T15:45:37 KILL 75278028 (Query 39 sec) statusnet SELECT t.conv_id, t.notice_id, t.thread_type, t.parent_conv_id, t.seq, t.reply_count, t.last_reply_seq1, t.last_reply_seq2, t.revocation_seq, t.creator_id, t.status, t.operator_id, t.extra1 FROM notice_thread t WHERE parent_conv_id = 3075773 and notice_id in (1571816367355,1571814620251,1571814611880,1571814601809,1571814589604,1571814578823,1571814568507,1571814559399,1571812785487,1571812769810)

2019-10-23T15:45:37 KILL 75277932 (Query 39 sec) statusnet SELECT t.conv_id, t.notice_id, t.thread_type, t.parent_conv_id, t.seq, t.reply_count, t.last_reply_seq1, t.last_reply_seq2, t.revocation_seq, t.creator_id, t.status, t.operator_id, t.extra1 FROM notice_thread t WHERE parent_conv_id = 6396698 and notice_id in (18606128)

2019-10-23T15:45:37 KILL 75191686 (Query 39 sec) statusnet SELECT id,nickname,mount_id,is_display,external_type,misc,site_id,conversation as conv_id,only_admin_invite,reach_limit_count,water_mark, group_logo,logo_level,created,modified,name_update_flag as name_flag, group_pinyin, group_status,external_group_id,event_id FROM user_group WHERE id in(988091,736931,843788,739447,737242,1252702,736180,1150776) AND UNIX_TIMESTAMP(modified) > 1571740050 ORDER BY created ASC

So it was pt-kill that actively killed these queries. Combined with the earlier findings, the whole picture is now clear.
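For reference, what pt-kill selects with --busy-time 30 --match-command Query is roughly equivalent to the following query against the processlist (a sketch; the real tool also applies the --ignore-user and --ignore-info filters shown above):

-- Queries that have been executing for more than 30 seconds
SELECT id, user, host, db, time, info
FROM information_schema.processlist
WHERE command = 'Query'
  AND time > 30;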

3.7 Why did this SQL run so long, and why did it only cause trouble this time?

Let's compare slow log entries from other times with those from the incident window:

# User@Host: dstatusnet[dstatusnet] @  [x]
# Query_time: 2.046068  Lock_time: 0.000079 Rows_sent: 1  Rows_examined: 1
SET timestamp=1572306496;
SELECT ... FROM user_group WHERE id in(693790) AND UNIX_TIMESTAMP(modified) > 0 ORDER BY created ASC;

We noticed a pattern: when the "id in (...)" list contains a single value, the query takes roughly two seconds; when it contains multiple values, the elapsed time varies widely. This conclusion still needs to be confirmed with further evidence.
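One way to gather that evidence would be to compare the execution plans of the two shapes with EXPLAIN (a sketch; the actual plans depend on the indexes on user_group, which are not shown here):

-- Single-value IN list (the roughly two-second case)
EXPLAIN SELECT id, nickname, modified
FROM user_group
WHERE id IN (693790) AND UNIX_TIMESTAMP(modified) > 0
ORDER BY created ASC;

-- Multi-value IN list (the 30~60 second case)
EXPLAIN SELECT id, nickname, modified
FROM user_group
WHERE id IN (673950, 1261923, 1261748) AND UNIX_TIMESTAMP(modified) > 1571744114
ORDER BY created ASC;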

4. So how do we solve it?

Solutions and workarounds:

  • Upgrade MySQL to a newer version (for example, 5.7 or later)

  • Work with the business side to optimize the slow SQL (one candidate rewrite is sketched below)
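On the SQL side, one candidate rewrite (a sketch, to be validated with the business side and with EXPLAIN) is to stop wrapping the column in a function: UNIX_TIMESTAMP(modified) > N has to be evaluated row by row and prevents any index on modified from being used, whereas comparing the column directly to a constant does not (assuming modified is a DATETIME/TIMESTAMP column and the time zone settings are consistent):

-- Before: the function on the column defeats any index on modified
SELECT id, nickname, modified
FROM user_group
WHERE id IN (673950, 1261923, 1261748)
  AND UNIX_TIMESTAMP(modified) > 1571744114
ORDER BY created ASC;

-- After (equivalent, sargable): compare the column directly to a constant
SELECT id, nickname, modified
FROM user_group
WHERE id IN (673950, 1261923, 1261748)
  AND modified > FROM_UNIXTIME(1571744114)
ORDER BY created ASC;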

Enjoy GreatSQL 😃

