MGR status monitoring | MGR in simple terms

As with traditional master-slave replication, with MGR we mainly focus on the running status of each node and on whether transactions on the Secondary nodes are delayed. This article introduces how to monitor MGR node status and transaction status.

1. Node status monitoring

The status of each MGR node can be seen by querying the performance_schema.replication_group_members table:

mysql> select * from performance_schema.replication_group_members;
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
| group_replication_applier | af39db70-6850-11ec-94c9-00155d064000 | |        4306 | ONLINE       | PRIMARY     | 8.0.25         |
| group_replication_applier | b05c0838-6850-11ec-a06b-00155d064000 | |        4307 | ONLINE       | SECONDARY   | 8.0.25         |
| group_replication_applier | b0f86046-6850-11ec-92fe-00155d064000 | |        4308 | ONLINE       | SECONDARY   | 8.0.25         |

The main columns in the output are interpreted as follows:

  • The MEMBER_ID column value is the server_uuid of each node, used to uniquely identify it. In command-line mode, member_id is passed in when calling a UDF to specify a particular node.
  • MEMBER_ROLE indicates the role of each node. PRIMARY means the node can accept read-write transactions; SECONDARY means it can only accept read-only transactions. If only one node is PRIMARY and the rest are SECONDARY, the cluster is currently in single-primary mode; if all nodes are PRIMARY, it is in multi-primary mode.
  • MEMBER_STATE indicates the state of each node. The possible states are ONLINE, RECOVERING, OFFLINE, ERROR, and UNREACHABLE, each described below.
    • ONLINE indicates that the node is in a normal state and can provide services.
    • RECOVERING means the node is performing distributed recovery and waiting to join the cluster. At this point it may be copying data from a donor node via clone, or retrieving transactions from the donor's binlog.
    • OFFLINE indicates that the node is currently offline. Note that when a node is about to join or rejoin the cluster, its status may also show as OFFLINE for a very brief moment.
    • ERROR indicates that the node is in an error state and cannot function as a cluster member. A node may enter this state while performing distributed recovery or applying transactions. While in the ERROR state, the node cannot participate in cluster transaction certification. When a node is joining or rejoining the cluster, it may also display ERROR status before it passes the compatibility check and becomes a formal MGR node.
    • UNREACHABLE: when group communication messages time out, the failure detection mechanism marks the node as suspicious, suspecting that it can no longer connect to the other nodes, for example when a node is accidentally disconnected. If you see other nodes in the UNREACHABLE state from one node, it may mean the group has been partitioned: the nodes are split into two or more subsets whose members can communicate within each subset but not across subsets.

When a node's status is not ONLINE, an alarm should be raised immediately and the cause investigated.
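A simple check for this can be sketched as a filter on the same table (this query is not from the original article, just one possible alerting probe):

```sql
-- list any members that are not ONLINE; an empty result means all is well
SELECT MEMBER_ID, MEMBER_HOST, MEMBER_PORT, MEMBER_STATE
FROM performance_schema.replication_group_members
WHERE MEMBER_STATE <> 'ONLINE';
```

A monitoring system can raise an alarm whenever this query returns any rows.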

When a node's state changes, or when a node joins or leaves, the data in the performance_schema.replication_group_members table is updated. This status information is exchanged and shared among the nodes, so it can be viewed from any node.

2. MGR transaction status monitoring

Another important thing to watch is the transaction status of the Secondary nodes, more specifically, the size of the queues of transactions waiting to be certified and applied. Execute the following SQL to check, mainly focusing on whether the COUNT_TRANSACTIONS_IN_QUEUE and COUNT_TRANSACTIONS_REMOTE_IN_APPLIER_QUEUE values of the non-Primary nodes are large:
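The query text itself did not survive in this copy of the article; a sketch against performance_schema.replication_group_member_stats, with aliases chosen to match the output below (the alias names are assumptions, not from the original):

```sql
-- sketch: map replication_group_member_stats counters to short aliases
SELECT
  MEMBER_ID                                  AS id,
  COUNT_TRANSACTIONS_IN_QUEUE                AS trx_tobe_verified,
  COUNT_TRANSACTIONS_REMOTE_IN_APPLIER_QUEUE AS trx_tobe_applied,
  COUNT_TRANSACTIONS_CHECKED                 AS trx_chkd,
  COUNT_TRANSACTIONS_REMOTE_APPLIED          AS trx_done,
  COUNT_TRANSACTIONS_LOCAL_PROPOSED          AS proposed
FROM performance_schema.replication_group_member_stats;
```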

| id                                   | trx_tobe_verified | trx_tobe_applied | trx_chkd | trx_done | proposed |
| 4ebd3504-11d9-11ec-8f92-70b5e873a570 |                 0 |                0 |   422248 |        6 |   422248 |
| 549b92bf-11d9-11ec-88e1-70b5e873a570 |                 0 |           238391 |   422079 |   183692 |        0 |
| 5596116c-11d9-11ec-8624-70b5e873a570 |              2936 |           238519 |   422115 |   183598 |        0 |
| ed5fe7ba-37c2-11ec-8e12-70b5e873a570 |              2976 |           238123 |   422167 |   184044 |        0 |

Here, the value of COUNT_TRANSACTIONS_REMOTE_IN_APPLIER_QUEUE indicates the size of the queue of transactions waiting to be applied, and COUNT_TRANSACTIONS_IN_QUEUE indicates the size of the queue of transactions waiting to be certified. If either value is greater than 0, there is some degree of delay.

You can also watch how these two values change over time to see whether the queues are gradually growing or shrinking, and thus judge whether the Primary node is "running too fast" or the Secondary nodes are "running too slow".

One more note: when flow control is enabled and either of the above two values exceeds the corresponding threshold (the defaults of group_replication_flow_control_applier_threshold and group_replication_flow_control_certifier_threshold are both 25000), the flow control mechanism is triggered.
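The current thresholds can be inspected directly (a quick check, not from the original article):

```sql
-- show the flow control thresholds currently in effect (defaults: 25000)
SHOW GLOBAL VARIABLES LIKE 'group_replication_flow_control%threshold';
```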

3. Other monitoring

In addition, you can compare the set of received transactions with the set of executed transactions to judge the gap:

mysql> SELECT RECEIVED_TRANSACTION_SET FROM performance_schema.replication_connection_status WHERE  channel_name = 'group_replication_applier' UNION ALL SELECT variable_value FROM performance_schema.global_variables WHERE  variable_name = 'gtid_executed'\G
*************************** 1. row ***************************
RECEIVED_TRANSACTION_SET: 6cfb873b-573f-11ec-814a-d08e7908bcb1:1-3124520
*************************** 2. row ***************************
RECEIVED_TRANSACTION_SET: 6cfb873b-573f-11ec-814a-d08e7908bcb1:1-3078139

It can be seen that the received transactions' GTID has reached 3124520, while local execution has only reached 3078139, a gap of 46381 transactions.
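Instead of doing the subtraction by hand, the gap can also be computed with the built-in GTID_SUBTRACT() function (the GTID sets below are the ones from the output above):

```sql
-- GTID_SUBTRACT returns the GTIDs present in the first set but not the second,
-- i.e. the transactions received but not yet executed locally
SELECT GTID_SUBTRACT(
  '6cfb873b-573f-11ec-814a-d08e7908bcb1:1-3124520',
  '6cfb873b-573f-11ec-814a-d08e7908bcb1:1-3078139'
) AS gtid_gap;
-- returns 6cfb873b-573f-11ec-814a-d08e7908bcb1:3078140-3124520
```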

By the way, you can keep watching how this difference changes to estimate whether the local node will catch up in time or fall further behind.

In addition, when the original Primary node fails and you want to manually select a node as the new Primary, you should first check which node has the largest executed transaction GTID set and give priority to that node.
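A minimal check for this, run on each candidate node in turn (compare the outputs manually; the node whose executed set is a superset of the others is the best candidate):

```sql
-- run on each candidate node and compare the results
SELECT @@GLOBAL.gtid_executed;
```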

4. Summary

This article introduced the main concerns of MGR monitoring, including node status and replication delay, and how to estimate whether replication delay will keep growing or be caught up in time.


Due to my limited personal level, errors and omissions in this column are inevitable. Do not copy the commands and methods in this document directly into an online production environment. Readers must fully understand and verify them in a test environment before formal implementation, to avoid damage to the production environment.

Enjoy GreatSQL :)

About GreatSQL

GreatSQL is a MySQL branch maintained by Wanli Database. It focuses on improving the reliability and performance of MGR and supports the InnoDB parallel query feature. It is a MySQL branch version suitable for financial-grade applications.



Added by scoppc on Wed, 02 Mar 2022 05:41:03 +0200