os: ubunbu 16.04
postgresql: 9.6.8
repmgr: 4.1.1
192.168.56.101 node1
192.168.56.102 node2
Under the premise that the production environment ensures the stability of the network, the database can auto failover is the best state. For synchronous replication, it is recommended to configure two Slavs. For asynchronous replication, it is also recommended to configure two Slavs.
To implement auto failover, you need to enable repmgrd. To enable repmgrd, you need to set the shared [preload] libraries = 'repmgr' of postgresql.conf
$ vi /etc/postgresql/9.6/main/postgresql.conf shared_preload_libraries = 'repmgr'
/Contents of etc/repmgr.conf
Viewed on node1 node, node2 node is basically the same.
# more /etc/repmgr.conf ############################################## # node information node_id=1 node_name=node1 conninfo='host=192.168.56.101 user=repmgr dbname=repmgr connect_timeout=2' data_directory='/var/lib/postgresql/9.6/main' use_replication_slots=true pg_bindir='/usr/lib/postgresql/9.6/bin' pg_ctl_options='-l /var/log/postgres/startup.log' rsync_options=--exclude=postgresql.local.conf --archive #use_primary_conninfo_password=true location='location01' ############################################## # service_start_command service_start_command = 'sudo pg_ctlcluster 9.6 main start' service_stop_command = 'sudo pg_ctlcluster 9.6 main stop' service_restart_command = 'sudo pg_ctlcluster 9.6 main restart' service_reload_command = 'sudo pg_ctlcluster 9.6 main reload' service_promote_command = 'sudo pg_ctlcluster 9.6 main promote' ############################################## # log info log_level=INFO log_status_interval=10 log_file=/var/log/postgresql/repmgrd.log ############################################## # repmgrd options monitoring_history=yes monitor_interval_secs=5 reconnect_attempts=10 reconnect_interval=5 ############################################## # automatic failover failover=automatic promote_command='/usr/bin/repmgr standby promote -f /etc/repmgr.conf --log-to-file' follow_command='/usr/bin/repmgr standby follow -f /etc/repmgr.conf --log-to-file --upstream-node-id=%n'
Start repmgrd
Each node needs to be started
Manual start
# repmgrd -f /etc/repmgr.conf --pid-file /tmp/repmgrd.pid
System boot
# vi /etc/default/repmgrd # default settings for repmgrd. This file is source by /bin/sh from # /etc/init.d/repmgrd # disable repmgrd by default so it won't get started upon installation # valid values: yes/no REPMGRD_ENABLED=yes # configuration file (required) REPMGRD_CONF="/etc/repmgr.conf" # additional options #REPMGRD_OPTS="" # user to run repmgrd as REPMGRD_USER=postgres # repmgrd binary REPMGRD_BIN=/usr/bin/repmgrd # pid file REPMGRD_PIDFILE=/var/run/repmgrd.pid
Repmgrd? Conf points to the configuration file
It is necessary to configure the system to enable.
# systemctl status repmgrd.service # systemctl start repmgrd.service # systemctl enable repmgrd.service
repmgrd process on node1
# ps -ef|grep -i repmgrd |grep -v grep postgres 4091 1 0 14:30 ? 00:00:00 /usr/lib/postgresql/9.6/bin/repmgrd --config-file /etc/repmgr.conf # cat /var/log/postgresql/repmgrd.log [2018-09-25 14:30:28] [NOTICE] repmgrd (repmgrd 4.1.1) starting up [2018-09-25 14:30:28] [INFO] connecting to database "host=192.168.56.101 user=repmgr dbname=repmgr connect_timeout=2" [2018-09-25 14:30:28] [NOTICE] starting monitoring of node "node1" (ID: 1) [2018-09-25 14:30:28] [NOTICE] monitoring cluster primary "node1" (node ID: 1) [2018-09-25 14:30:38] [INFO] monitoring primary node "node1" (node ID: 1) in normal state [2018-09-25 14:30:48] [INFO] monitoring primary node "node1" (node ID: 1) in normal state
repmgrd process on node2
# ps -ef|grep -i repmgrd |grep -v grep postgres 4349 1 0 14:18 ? 00:00:00 /usr/lib/postgresql/9.6/bin/repmgrd --config-file /etc/repmgr.conf # cat /var/log/postgresql/repmgrd.log [2018-09-25 14:33:44] [NOTICE] repmgrd (repmgrd 4.1.1) starting up [2018-09-25 14:33:44] [INFO] connecting to database "host=192.168.56.102 user=repmgr dbname=repmgr connect_timeout=2" [2018-09-25 14:33:44] [NOTICE] starting monitoring of node "node2" (ID: 2) [2018-09-25 14:33:44] [INFO] monitoring connection to upstream node "node1" (node ID: 1) [2018-09-25 14:33:55] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in normal state [2018-09-25 14:33:55] [DETAIL] last monitoring statistics update was 5 seconds ago [2018-09-25 14:34:05] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in normal state [2018-09-25 14:34:05] [DETAIL] last monitoring statistics update was 5 seconds ago
Simulation exception of manually closing the main database
Operation on node1 node
$ repmgr -f /etc/repmgr.conf cluster show ID | Name | Role | Status | Upstream | Location | Connection string ----+-------+---------+-----------+----------+----------+----------------------------------------------------------------- 1 | node1 | primary | * running | | default | host=192.168.56.101 user=repmgr dbname=repmgr connect_timeout=2 2 | node2 | standby | running | node1 | default | host=192.168.56.102 user=repmgr dbname=repmgr connect_timeout=2 $ pg_ctl -D /var/lib/postgresql/9.6/main -m fast stop //perhaps $ sudo pg_ctlcluster 9.6 main stop $ tail -f /var/log/postgresql/repmgrd.log [2018-09-25 15:08:23] [NOTICE] repmgrd (repmgrd 4.1.1) starting up [2018-09-25 15:08:23] [INFO] connecting to database "host=192.168.56.101 user=repmgr dbname=repmgr connect_timeout=2" [2018-09-25 15:08:23] [NOTICE] starting monitoring of node "node1" (ID: 1) [2018-09-25 15:08:23] [NOTICE] monitoring cluster primary "node1" (node ID: 1) [2018-09-25 15:08:33] [INFO] monitoring primary node "node1" (node ID: 1) in normal state [2018-09-25 15:08:43] [INFO] monitoring primary node "node1" (node ID: 1) in normal state [2018-09-25 15:08:53] [INFO] monitoring primary node "node1" (node ID: 1) in normal state [2018-09-25 15:09:03] [INFO] monitoring primary node "node1" (node ID: 1) in normal state [2018-09-25 15:09:13] [INFO] monitoring primary node "node1" (node ID: 1) in normal state [2018-09-25 15:09:23] [INFO] monitoring primary node "node1" (node ID: 1) in normal state [2018-09-25 15:09:33] [INFO] monitoring primary node "node1" (node ID: 1) in normal state [2018-09-25 15:09:43] [INFO] monitoring primary node "node1" (node ID: 1) in normal state [2018-09-25 15:09:53] [WARNING] unable to connect to local node [2018-09-25 15:09:53] [INFO] checking state of node 1, 1 of 10 attempts [2018-09-25 15:09:53] [INFO] sleeping 5 seconds until next reconnection attempt [2018-09-25 15:09:58] [INFO] checking state of node 1, 2 of 10 attempts [2018-09-25 15:09:58] [INFO] sleeping 5 seconds until next reconnection attempt [2018-09-25 15:10:03] [INFO] checking state of node 1, 3 of 10 attempts [2018-09-25 15:10:03] [INFO] sleeping 5 seconds until next reconnection attempt [2018-09-25 15:10:08] [INFO] checking state of node 1, 4 of 10 attempts [2018-09-25 15:10:08] [INFO] sleeping 5 seconds until next reconnection attempt [2018-09-25 15:10:13] [INFO] checking state of node 1, 5 of 10 attempts [2018-09-25 15:10:13] [INFO] sleeping 5 seconds until next reconnection attempt [2018-09-25 15:10:18] [INFO] checking state of node 1, 6 of 10 attempts [2018-09-25 15:10:18] [INFO] sleeping 5 seconds until next reconnection attempt [2018-09-25 15:10:23] [INFO] checking state of node 1, 7 of 10 attempts [2018-09-25 15:10:23] [INFO] sleeping 5 seconds until next reconnection attempt [2018-09-25 15:10:28] [INFO] checking state of node 1, 8 of 10 attempts [2018-09-25 15:10:28] [INFO] sleeping 5 seconds until next reconnection attempt [2018-09-25 15:10:33] [INFO] checking state of node 1, 9 of 10 attempts [2018-09-25 15:10:33] [INFO] sleeping 5 seconds until next reconnection attempt [2018-09-25 15:10:38] [INFO] checking state of node 1, 10 of 10 attempts [2018-09-25 15:10:38] [WARNING] unable to reconnect to node 1 after 10 attempts [2018-09-25 15:10:38] [NOTICE] unable to connect to local node, falling back to degraded monitoring [2018-09-25 15:10:38] [INFO] monitoring primary node "node1" (node ID: 1) in degraded state [2018-09-25 15:10:38] [DETAIL] waiting for the node to become available [2018-09-25 15:10:48] [INFO] monitoring primary node "node1" (node ID: 1) in degraded state [2018-09-25 15:10:48] [DETAIL] waiting for the node to become available
View on node2
$ repmgr -f /etc/repmgr.conf cluster show ID | Name | Role | Status | Upstream | Location | Connection string ----+-------+---------+---------------+----------+----------+----------------------------------------------------------------- 1 | node1 | primary | ? unreachable | | default | host=192.168.56.101 user=repmgr dbname=repmgr connect_timeout=2 2 | node2 | standby | running | node1 | default | host=192.168.56.102 user=repmgr dbname=repmgr connect_timeout=2 WARNING: following issues were detected - when attempting to connect to node "node1" (ID: 1), following error encountered : "could not connect to server: Connection refused Is the server running on host "192.168.56.101" and accepting TCP/IP connections on port 5432?" - node "node1" (ID: 1) is registered as an active primary but is unreachable
$ tail -f /var/log/postgresql/repmgrd.log [2018-09-25 15:07:58] [NOTICE] repmgrd (repmgrd 4.1.1) starting up [2018-09-25 15:07:58] [INFO] connecting to database "host=192.168.56.101 user=repmgr dbname=repmgr connect_timeout=2" [2018-09-25 15:07:58] [NOTICE] starting monitoring of node "node2" (ID: 2) [2018-09-25 15:07:58] [INFO] monitoring connection to upstream node "node1" (node ID: 1) [2018-09-25 15:08:08] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in normal state [2018-09-25 15:08:08] [DETAIL] last monitoring statistics update was 5 seconds ago [2018-09-25 15:08:18] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in normal state [2018-09-25 15:08:18] [DETAIL] last monitoring statistics update was 5 seconds ago [2018-09-25 15:08:28] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in normal state [2018-09-25 15:08:28] [DETAIL] last monitoring statistics update was 5 seconds ago [2018-09-25 15:08:38] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in normal state [2018-09-25 15:08:38] [DETAIL] last monitoring statistics update was 5 seconds ago [2018-09-25 15:08:48] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in normal state [2018-09-25 15:08:48] [DETAIL] last monitoring statistics update was 5 seconds ago [2018-09-25 15:08:58] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in normal state [2018-09-25 15:08:58] [DETAIL] last monitoring statistics update was 5 seconds ago [2018-09-25 15:09:08] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in normal state [2018-09-25 15:09:08] [DETAIL] last monitoring statistics update was 5 seconds ago [2018-09-25 15:09:18] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in normal state [2018-09-25 15:09:18] [DETAIL] last monitoring statistics update was 5 seconds ago [2018-09-25 15:09:28] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in normal state [2018-09-25 15:09:28] [DETAIL] last monitoring statistics update was 5 seconds ago [2018-09-25 15:09:38] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in normal state [2018-09-25 15:09:38] [DETAIL] last monitoring statistics update was 5 seconds ago [2018-09-25 15:09:48] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in normal state [2018-09-25 15:09:48] [DETAIL] last monitoring statistics update was 5 seconds ago
[2018-09-25 15:09:53] [WARNING] unable to connect to upstream node "node1" (node ID: 1) [2018-09-25 15:09:53] [INFO] checking state of node 1, 1 of 10 attempts [2018-09-25 15:09:53] [INFO] sleeping 5 seconds until next reconnection attempt [2018-09-25 15:09:58] [INFO] checking state of node 1, 2 of 10 attempts [2018-09-25 15:09:58] [INFO] sleeping 5 seconds until next reconnection attempt [2018-09-25 15:10:03] [INFO] checking state of node 1, 3 of 10 attempts [2018-09-25 15:10:03] [INFO] sleeping 5 seconds until next reconnection attempt [2018-09-25 15:10:08] [INFO] checking state of node 1, 4 of 10 attempts [2018-09-25 15:10:08] [INFO] sleeping 5 seconds until next reconnection attempt [2018-09-25 15:10:13] [INFO] checking state of node 1, 5 of 10 attempts [2018-09-25 15:10:13] [INFO] sleeping 5 seconds until next reconnection attempt [2018-09-25 15:10:18] [INFO] checking state of node 1, 6 of 10 attempts [2018-09-25 15:10:18] [INFO] sleeping 5 seconds until next reconnection attempt [2018-09-25 15:10:23] [INFO] checking state of node 1, 7 of 10 attempts [2018-09-25 15:10:23] [INFO] sleeping 5 seconds until next reconnection attempt [2018-09-25 15:10:28] [INFO] checking state of node 1, 8 of 10 attempts [2018-09-25 15:10:28] [INFO] sleeping 5 seconds until next reconnection attempt [2018-09-25 15:10:33] [INFO] checking state of node 1, 9 of 10 attempts [2018-09-25 15:10:33] [INFO] sleeping 5 seconds until next reconnection attempt
[2018-09-25 15:10:38] [INFO] checking state of node 1, 10 of 10 attempts [2018-09-25 15:10:38] [WARNING] unable to reconnect to node 1 after 10 attempts [2018-09-25 15:10:38] [NOTICE] this node is the only available candidate and will now promote itself [2018-09-25 15:10:38] [INFO] promote_command is: "/usr/bin/repmgr standby promote -f /etc/repmgr.conf --log-to-file" [2018-09-25 15:10:38] [NOTICE] redirecting logging output to "/var/log/postgresql/repmgrd.log" [2018-09-25 15:10:38] [NOTICE] promoting standby to primary [2018-09-25 15:10:38] [DETAIL] promoting server "node2" (ID: 2) using "sudo pg_ctlcluster 9.6 main promote" [2018-09-25 15:10:38] [DETAIL] waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete [2018-09-25 15:10:39] [NOTICE] STANDBY PROMOTE successful [2018-09-25 15:10:39] [DETAIL] server "node2" (ID: 2) was successfully promoted to primary [2018-09-25 15:10:39] [INFO] switching to primary monitoring mode [2018-09-25 15:10:39] [NOTICE] monitoring cluster primary "node2" (node ID: 2) [2018-09-25 15:10:49] [INFO] monitoring primary node "node2" (node ID: 2) in normal state [2018-09-25 15:10:59] [INFO] monitoring primary node "node2" (node ID: 2) in normal state [2018-09-25 15:11:09] [INFO] monitoring primary node "node2" (node ID: 2) in normal state [2018-09-25 15:11:19] [INFO] monitoring primary node "node2" (node ID: 2) in normal state
After 10*5s, auto failover: promoting standby to primary occurs, as follows
[2018-09-25 15:10:38] [INFO] checking state of node 1, 10 of 10 attempts [2018-09-25 15:10:38] [WARNING] unable to reconnect to node 1 after 10 attempts [2018-09-25 15:10:38] [NOTICE] this node is the only available candidate and will now promote itself [2018-09-25 15:10:38] [INFO] promote_command is: "/usr/bin/repmgr standby promote -f /etc/repmgr.conf --log-to-file" [2018-09-25 15:10:38] [NOTICE] redirecting logging output to "/var/log/postgresql/repmgrd.log" [2018-09-25 15:10:38] [NOTICE] promoting standby to primary [2018-09-25 15:10:38] [DETAIL] promoting server "node2" (ID: 2) using "sudo pg_ctlcluster 9.6 main promote" [2018-09-25 15:10:38] [DETAIL] waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete [2018-09-25 15:10:39] [NOTICE] STANDBY PROMOTE successful [2018-09-25 15:10:39] [DETAIL] server "node2" (ID: 2) was successfully promoted to primary [2018-09-25 15:10:39] [INFO] switching to primary monitoring mode [2018-09-25 15:10:39] [NOTICE] monitoring cluster primary "node2" (node ID: 2) [2018-09-25 15:10:49] [INFO] monitoring primary node "node2" (node ID: 2) in normal state
View cluster show on node2
$ repmgr -f /etc/repmgr.conf cluster show ID | Name | Role | Status | Upstream | Location | Connection string ----+-------+---------+-----------+----------+----------+----------------------------------------------------------------- 1 | node1 | primary | - failed | | default | host=192.168.56.101 user=repmgr dbname=repmgr connect_timeout=2 2 | node2 | primary | * running | | default | host=192.168.56.102 user=repmgr dbname=repmgr connect_timeout=2 WARNING: following issues were detected - when attempting to connect to node "node1" (ID: 1), following error encountered : "could not connect to server: Connection refused Is the server running on host "192.168.56.101" and accepting TCP/IP connections on port 5432?"
Virtual machine power failure simulation exception
After the previous test, node2 is now the master, and node1 becomes slave after node rejoin ing. Please refer to the previous blog for details.
View the status of node2 node before power failure
$ repmgr -f /etc/repmgr.conf cluster show ID | Name | Role | Status | Upstream | Location | Connection string ----+-------+---------+-----------+----------+----------+----------------------------------------------------------------- 1 | node1 | standby | running | node2 | default | host=192.168.56.101 user=repmgr dbname=repmgr connect_timeout=2 2 | node2 | primary | * running | | default | host=192.168.56.102 user=repmgr dbname=repmgr connect_timeout=2
node2 starts to power down
View on node1
$ repmgr -f /etc/repmgr.conf cluster show ID | Name | Role | Status | Upstream | Location | Connection string ----+-------+---------+---------------+----------+----------+----------------------------------------------------------------- 1 | node1 | standby | running | node2 | default | host=192.168.56.101 user=repmgr dbname=repmgr connect_timeout=2 2 | node2 | primary | ? unreachable | | default | host=192.168.56.102 user=repmgr dbname=repmgr connect_timeout=2 WARNING: following issues were detected - when attempting to connect to node "node2" (ID: 2), following error encountered : "could not connect to server: No route to host Is the server running on host "192.168.56.102" and accepting TCP/IP connections on port 5432?" - node "node2" (ID: 2) is registered as an active primary but is unreachable
At this time, the node1 node will also be promoted to master, and the log output promoted to master is basically similar, so it will not be pasted again.
Reference resources:
https://www.2ndquadrant.com/en/resources/repmgr/
https://github.com/2ndQuadrant/repmgr
https://repmgr.org/docs/4.1/using-repmgrd.html