PostgreSQL High Availability with repmgr, Part 6: Auto Failover with 1 Primary + 1 Standby

os: Ubuntu 16.04
postgresql: 9.6.8
repmgr: 4.1.1

192.168.56.101 node1
192.168.56.102 node2

Assuming the production network is stable, automatic database failover is the ideal state. For both synchronous and asynchronous replication, configuring two standbys is recommended.
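With two standbys, synchronous replication can remain available even if one standby fails. A minimal sketch of the relevant primary-side settings, assuming standbys named node2 and node3 (node3 is hypothetical; this walkthrough itself uses only two nodes):

```
# postgresql.conf on the primary (sketch only, not part of this walkthrough's setup)
synchronous_commit = on
# PostgreSQL 9.6 syntax: wait for 1 of the listed standbys, in priority order
synchronous_standby_names = '1 (node2, node3)'
```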

To implement auto failover, repmgrd must be running. To enable repmgrd, set shared_preload_libraries = 'repmgr' in postgresql.conf:

$ vi /etc/postgresql/9.6/main/postgresql.conf
shared_preload_libraries = 'repmgr'
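Changing shared_preload_libraries requires a restart of the cluster; the setting can then be verified from psql. A sketch, assuming local peer access for the postgres user:

```
$ sudo pg_ctlcluster 9.6 main restart
$ sudo -u postgres psql -c "SHOW shared_preload_libraries;"
```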

Contents of /etc/repmgr.conf

Shown as viewed on node1; the configuration on node2 is essentially the same.

# more /etc/repmgr.conf

##############################################
# node information
node_id=1
node_name=node1
conninfo='host=192.168.56.101 user=repmgr dbname=repmgr connect_timeout=2'
data_directory='/var/lib/postgresql/9.6/main'
use_replication_slots=true
pg_bindir='/usr/lib/postgresql/9.6/bin'
pg_ctl_options='-l /var/log/postgres/startup.log'
rsync_options=--exclude=postgresql.local.conf --archive
#use_primary_conninfo_password=true
location='location01'
##############################################
# service_start_command
service_start_command   = 'sudo pg_ctlcluster 9.6 main start'
service_stop_command    = 'sudo pg_ctlcluster 9.6 main stop'
service_restart_command = 'sudo pg_ctlcluster 9.6 main restart'
service_reload_command  = 'sudo pg_ctlcluster 9.6 main reload' 
service_promote_command = 'sudo pg_ctlcluster 9.6 main promote'
##############################################
# log info
log_level=INFO
log_status_interval=10
log_file=/var/log/postgresql/repmgrd.log
##############################################
# repmgrd options
monitoring_history=yes
monitor_interval_secs=5
reconnect_attempts=10
reconnect_interval=5
##############################################
# automatic failover
failover=automatic
promote_command='/usr/bin/repmgr standby promote -f /etc/repmgr.conf --log-to-file'
follow_command='/usr/bin/repmgr standby follow -f /etc/repmgr.conf --log-to-file --upstream-node-id=%n'

Start repmgrd

repmgrd must be started on every node.
Manual start

# repmgrd -f /etc/repmgr.conf --pid-file /tmp/repmgrd.pid
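Once repmgrd is running, the node's view of replication can be checked with repmgr's `node status` subcommand (available in repmgr 4.x); run it as the postgres user:

```
$ repmgr -f /etc/repmgr.conf node status
```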

Start at system boot

# vi /etc/default/repmgrd 

# default settings for repmgrd. This file is sourced by /bin/sh from
# /etc/init.d/repmgrd

# disable repmgrd by default so it won't get started upon installation
# valid values: yes/no
REPMGRD_ENABLED=yes

# configuration file (required)
REPMGRD_CONF="/etc/repmgr.conf"

# additional options
#REPMGRD_OPTS=""

# user to run repmgrd as
REPMGRD_USER=postgres

# repmgrd binary
REPMGRD_BIN=/usr/bin/repmgrd

# pid file
REPMGRD_PIDFILE=/var/run/repmgrd.pid

REPMGRD_CONF points to the repmgr configuration file.
The repmgrd service must also be enabled in systemd:

# systemctl status repmgrd.service
# systemctl start repmgrd.service
# systemctl enable repmgrd.service

repmgrd process on node1

# ps -ef|grep -i repmgrd |grep -v grep
postgres  4091     1  0 14:30 ?        00:00:00 /usr/lib/postgresql/9.6/bin/repmgrd --config-file /etc/repmgr.conf

# cat /var/log/postgresql/repmgrd.log 
[2018-09-25 14:30:28] [NOTICE] repmgrd (repmgrd 4.1.1) starting up
[2018-09-25 14:30:28] [INFO] connecting to database "host=192.168.56.101 user=repmgr dbname=repmgr connect_timeout=2"
[2018-09-25 14:30:28] [NOTICE] starting monitoring of node "node1" (ID: 1)
[2018-09-25 14:30:28] [NOTICE] monitoring cluster primary "node1" (node ID: 1)
[2018-09-25 14:30:38] [INFO] monitoring primary node "node1" (node ID: 1) in normal state
[2018-09-25 14:30:48] [INFO] monitoring primary node "node1" (node ID: 1) in normal state

repmgrd process on node2

# ps -ef|grep -i repmgrd |grep -v grep
postgres  4349     1  0 14:18 ?        00:00:00 /usr/lib/postgresql/9.6/bin/repmgrd --config-file /etc/repmgr.conf

# cat /var/log/postgresql/repmgrd.log 
[2018-09-25 14:33:44] [NOTICE] repmgrd (repmgrd 4.1.1) starting up
[2018-09-25 14:33:44] [INFO] connecting to database "host=192.168.56.102 user=repmgr dbname=repmgr connect_timeout=2"
[2018-09-25 14:33:44] [NOTICE] starting monitoring of node "node2" (ID: 2)
[2018-09-25 14:33:44] [INFO] monitoring connection to upstream node "node1" (node ID: 1)
[2018-09-25 14:33:55] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in normal state
[2018-09-25 14:33:55] [DETAIL] last monitoring statistics update was 5 seconds ago
[2018-09-25 14:34:05] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in normal state
[2018-09-25 14:34:05] [DETAIL] last monitoring statistics update was 5 seconds ago

Simulating a failure by manually stopping the primary database

On node1:

$ repmgr -f /etc/repmgr.conf cluster show
 ID | Name  | Role    | Status    | Upstream | Location | Connection string                                            
----+-------+---------+-----------+----------+----------+-----------------------------------------------------------------
 1  | node1 | primary | * running |          | default  | host=192.168.56.101 user=repmgr dbname=repmgr connect_timeout=2
 2  | node2 | standby |   running | node1    | default  | host=192.168.56.102 user=repmgr dbname=repmgr connect_timeout=2
 
$ pg_ctl -D /var/lib/postgresql/9.6/main -m fast stop
# or
$ sudo pg_ctlcluster 9.6 main stop

$ tail -f /var/log/postgresql/repmgrd.log
[2018-09-25 15:08:23] [NOTICE] repmgrd (repmgrd 4.1.1) starting up
[2018-09-25 15:08:23] [INFO] connecting to database "host=192.168.56.101 user=repmgr dbname=repmgr connect_timeout=2"
[2018-09-25 15:08:23] [NOTICE] starting monitoring of node "node1" (ID: 1)
[2018-09-25 15:08:23] [NOTICE] monitoring cluster primary "node1" (node ID: 1)
[2018-09-25 15:08:33] [INFO] monitoring primary node "node1" (node ID: 1) in normal state
[2018-09-25 15:08:43] [INFO] monitoring primary node "node1" (node ID: 1) in normal state
[2018-09-25 15:08:53] [INFO] monitoring primary node "node1" (node ID: 1) in normal state
[2018-09-25 15:09:03] [INFO] monitoring primary node "node1" (node ID: 1) in normal state
[2018-09-25 15:09:13] [INFO] monitoring primary node "node1" (node ID: 1) in normal state
[2018-09-25 15:09:23] [INFO] monitoring primary node "node1" (node ID: 1) in normal state
[2018-09-25 15:09:33] [INFO] monitoring primary node "node1" (node ID: 1) in normal state
[2018-09-25 15:09:43] [INFO] monitoring primary node "node1" (node ID: 1) in normal state
[2018-09-25 15:09:53] [WARNING] unable to connect to local node
[2018-09-25 15:09:53] [INFO] checking state of node 1, 1 of 10 attempts
[2018-09-25 15:09:53] [INFO] sleeping 5 seconds until next reconnection attempt
[2018-09-25 15:09:58] [INFO] checking state of node 1, 2 of 10 attempts
[2018-09-25 15:09:58] [INFO] sleeping 5 seconds until next reconnection attempt
[2018-09-25 15:10:03] [INFO] checking state of node 1, 3 of 10 attempts
[2018-09-25 15:10:03] [INFO] sleeping 5 seconds until next reconnection attempt
[2018-09-25 15:10:08] [INFO] checking state of node 1, 4 of 10 attempts
[2018-09-25 15:10:08] [INFO] sleeping 5 seconds until next reconnection attempt
[2018-09-25 15:10:13] [INFO] checking state of node 1, 5 of 10 attempts
[2018-09-25 15:10:13] [INFO] sleeping 5 seconds until next reconnection attempt
[2018-09-25 15:10:18] [INFO] checking state of node 1, 6 of 10 attempts
[2018-09-25 15:10:18] [INFO] sleeping 5 seconds until next reconnection attempt
[2018-09-25 15:10:23] [INFO] checking state of node 1, 7 of 10 attempts
[2018-09-25 15:10:23] [INFO] sleeping 5 seconds until next reconnection attempt
[2018-09-25 15:10:28] [INFO] checking state of node 1, 8 of 10 attempts
[2018-09-25 15:10:28] [INFO] sleeping 5 seconds until next reconnection attempt
[2018-09-25 15:10:33] [INFO] checking state of node 1, 9 of 10 attempts
[2018-09-25 15:10:33] [INFO] sleeping 5 seconds until next reconnection attempt
[2018-09-25 15:10:38] [INFO] checking state of node 1, 10 of 10 attempts
[2018-09-25 15:10:38] [WARNING] unable to reconnect to node 1 after 10 attempts
[2018-09-25 15:10:38] [NOTICE] unable to connect to local node, falling back to degraded monitoring
[2018-09-25 15:10:38] [INFO] monitoring primary node "node1" (node ID: 1) in degraded state
[2018-09-25 15:10:38] [DETAIL] waiting for the node to become available
[2018-09-25 15:10:48] [INFO] monitoring primary node "node1" (node ID: 1) in degraded state
[2018-09-25 15:10:48] [DETAIL] waiting for the node to become available

View on node2

$ repmgr -f /etc/repmgr.conf cluster show
 ID | Name  | Role    | Status        | Upstream | Location | Connection string                                            
----+-------+---------+---------------+----------+----------+-----------------------------------------------------------------
 1  | node1 | primary | ? unreachable |          | default  | host=192.168.56.101 user=repmgr dbname=repmgr connect_timeout=2
 2  | node2 | standby |   running     | node1    | default  | host=192.168.56.102 user=repmgr dbname=repmgr connect_timeout=2

WARNING: following issues were detected
  - when attempting to connect to node "node1" (ID: 1), following error encountered :
"could not connect to server: Connection refused
	Is the server running on host "192.168.56.101" and accepting
	TCP/IP connections on port 5432?"
  - node "node1" (ID: 1) is registered as an active primary but is unreachable
$ tail -f /var/log/postgresql/repmgrd.log 

[2018-09-25 15:07:58] [NOTICE] repmgrd (repmgrd 4.1.1) starting up
[2018-09-25 15:07:58] [INFO] connecting to database "host=192.168.56.101 user=repmgr dbname=repmgr connect_timeout=2"
[2018-09-25 15:07:58] [NOTICE] starting monitoring of node "node2" (ID: 2)
[2018-09-25 15:07:58] [INFO] monitoring connection to upstream node "node1" (node ID: 1)
[2018-09-25 15:08:08] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in normal state
[2018-09-25 15:08:08] [DETAIL] last monitoring statistics update was 5 seconds ago
[2018-09-25 15:08:18] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in normal state
[2018-09-25 15:08:18] [DETAIL] last monitoring statistics update was 5 seconds ago
[2018-09-25 15:08:28] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in normal state
[2018-09-25 15:08:28] [DETAIL] last monitoring statistics update was 5 seconds ago
[2018-09-25 15:08:38] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in normal state
[2018-09-25 15:08:38] [DETAIL] last monitoring statistics update was 5 seconds ago
[2018-09-25 15:08:48] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in normal state
[2018-09-25 15:08:48] [DETAIL] last monitoring statistics update was 5 seconds ago
[2018-09-25 15:08:58] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in normal state
[2018-09-25 15:08:58] [DETAIL] last monitoring statistics update was 5 seconds ago
[2018-09-25 15:09:08] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in normal state
[2018-09-25 15:09:08] [DETAIL] last monitoring statistics update was 5 seconds ago
[2018-09-25 15:09:18] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in normal state
[2018-09-25 15:09:18] [DETAIL] last monitoring statistics update was 5 seconds ago
[2018-09-25 15:09:28] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in normal state
[2018-09-25 15:09:28] [DETAIL] last monitoring statistics update was 5 seconds ago
[2018-09-25 15:09:38] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in normal state
[2018-09-25 15:09:38] [DETAIL] last monitoring statistics update was 5 seconds ago
[2018-09-25 15:09:48] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in normal state
[2018-09-25 15:09:48] [DETAIL] last monitoring statistics update was 5 seconds ago
[2018-09-25 15:09:53] [WARNING] unable to connect to upstream node "node1" (node ID: 1)
[2018-09-25 15:09:53] [INFO] checking state of node 1, 1 of 10 attempts
[2018-09-25 15:09:53] [INFO] sleeping 5 seconds until next reconnection attempt
[2018-09-25 15:09:58] [INFO] checking state of node 1, 2 of 10 attempts
[2018-09-25 15:09:58] [INFO] sleeping 5 seconds until next reconnection attempt
[2018-09-25 15:10:03] [INFO] checking state of node 1, 3 of 10 attempts
[2018-09-25 15:10:03] [INFO] sleeping 5 seconds until next reconnection attempt
[2018-09-25 15:10:08] [INFO] checking state of node 1, 4 of 10 attempts
[2018-09-25 15:10:08] [INFO] sleeping 5 seconds until next reconnection attempt
[2018-09-25 15:10:13] [INFO] checking state of node 1, 5 of 10 attempts
[2018-09-25 15:10:13] [INFO] sleeping 5 seconds until next reconnection attempt
[2018-09-25 15:10:18] [INFO] checking state of node 1, 6 of 10 attempts
[2018-09-25 15:10:18] [INFO] sleeping 5 seconds until next reconnection attempt
[2018-09-25 15:10:23] [INFO] checking state of node 1, 7 of 10 attempts
[2018-09-25 15:10:23] [INFO] sleeping 5 seconds until next reconnection attempt
[2018-09-25 15:10:28] [INFO] checking state of node 1, 8 of 10 attempts
[2018-09-25 15:10:28] [INFO] sleeping 5 seconds until next reconnection attempt
[2018-09-25 15:10:33] [INFO] checking state of node 1, 9 of 10 attempts
[2018-09-25 15:10:33] [INFO] sleeping 5 seconds until next reconnection attempt
[2018-09-25 15:10:38] [INFO] checking state of node 1, 10 of 10 attempts
[2018-09-25 15:10:38] [WARNING] unable to reconnect to node 1 after 10 attempts
[2018-09-25 15:10:38] [NOTICE] this node is the only available candidate and will now promote itself
[2018-09-25 15:10:38] [INFO] promote_command is:
  "/usr/bin/repmgr standby promote -f /etc/repmgr.conf --log-to-file"
[2018-09-25 15:10:38] [NOTICE] redirecting logging output to "/var/log/postgresql/repmgrd.log"

[2018-09-25 15:10:38] [NOTICE] promoting standby to primary
[2018-09-25 15:10:38] [DETAIL] promoting server "node2" (ID: 2) using "sudo pg_ctlcluster 9.6 main promote"
[2018-09-25 15:10:38] [DETAIL] waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
[2018-09-25 15:10:39] [NOTICE] STANDBY PROMOTE successful
[2018-09-25 15:10:39] [DETAIL] server "node2" (ID: 2) was successfully promoted to primary
[2018-09-25 15:10:39] [INFO] switching to primary monitoring mode
[2018-09-25 15:10:39] [NOTICE] monitoring cluster primary "node2" (node ID: 2)
[2018-09-25 15:10:49] [INFO] monitoring primary node "node2" (node ID: 2) in normal state
[2018-09-25 15:10:59] [INFO] monitoring primary node "node2" (node ID: 2) in normal state
[2018-09-25 15:11:09] [INFO] monitoring primary node "node2" (node ID: 2) in normal state
[2018-09-25 15:11:19] [INFO] monitoring primary node "node2" (node ID: 2) in normal state

After 10 reconnection attempts at 5-second intervals (10 × 5 s = 50 s), automatic failover kicks in and the standby is promoted to primary, as follows:

[2018-09-25 15:10:38] [INFO] checking state of node 1, 10 of 10 attempts
[2018-09-25 15:10:38] [WARNING] unable to reconnect to node 1 after 10 attempts
[2018-09-25 15:10:38] [NOTICE] this node is the only available candidate and will now promote itself
[2018-09-25 15:10:38] [INFO] promote_command is:
  "/usr/bin/repmgr standby promote -f /etc/repmgr.conf --log-to-file"
[2018-09-25 15:10:38] [NOTICE] redirecting logging output to "/var/log/postgresql/repmgrd.log"

[2018-09-25 15:10:38] [NOTICE] promoting standby to primary
[2018-09-25 15:10:38] [DETAIL] promoting server "node2" (ID: 2) using "sudo pg_ctlcluster 9.6 main promote"
[2018-09-25 15:10:38] [DETAIL] waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
[2018-09-25 15:10:39] [NOTICE] STANDBY PROMOTE successful
[2018-09-25 15:10:39] [DETAIL] server "node2" (ID: 2) was successfully promoted to primary
[2018-09-25 15:10:39] [INFO] switching to primary monitoring mode
[2018-09-25 15:10:39] [NOTICE] monitoring cluster primary "node2" (node ID: 2)
[2018-09-25 15:10:49] [INFO] monitoring primary node "node2" (node ID: 2) in normal state
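The 50-second detection window follows directly from the repmgr.conf settings shown earlier (reconnect_attempts=10, reconnect_interval=5). A quick check of the arithmetic:

```shell
# values taken from /etc/repmgr.conf
reconnect_attempts=10
reconnect_interval=5
echo "$(( reconnect_attempts * reconnect_interval )) seconds until promotion is triggered"
# → 50 seconds until promotion is triggered
```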

Check cluster status on node2

$ repmgr -f /etc/repmgr.conf cluster show
 ID | Name  | Role    | Status    | Upstream | Location | Connection string                                            
----+-------+---------+-----------+----------+----------+-----------------------------------------------------------------
 1  | node1 | primary | - failed  |          | default  | host=192.168.56.101 user=repmgr dbname=repmgr connect_timeout=2
 2  | node2 | primary | * running |          | default  | host=192.168.56.102 user=repmgr dbname=repmgr connect_timeout=2

WARNING: following issues were detected
  - when attempting to connect to node "node1" (ID: 1), following error encountered :
"could not connect to server: Connection refused
	Is the server running on host "192.168.56.101" and accepting
	TCP/IP connections on port 5432?"
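After a failover, the history of cluster events (including the promotion) can be reviewed with repmgr's event log (the `cluster event` subcommand is available in repmgr 4.x); run it as the postgres user on the new primary:

```
$ repmgr -f /etc/repmgr.conf cluster event
```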
	

Simulating a failure by powering off the virtual machine

After the previous test, node2 is now the primary, and node1 becomes a standby after rejoining the cluster. Please refer to the previous post for details.
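The rejoin itself can be sketched with repmgr's `node rejoin` command (introduced in repmgr 4.1). This assumes the old primary's data directory is intact and its timeline has not diverged beyond what pg_rewind can repair; `--force-rewind` additionally requires `wal_log_hints = on` or data checksums on the old primary:

```
# on node1, with its PostgreSQL instance stopped
$ repmgr -f /etc/repmgr.conf \
    -d 'host=192.168.56.102 user=repmgr dbname=repmgr' \
    node rejoin --force-rewind
```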
Check the status on node2 before powering it off:

$ repmgr -f /etc/repmgr.conf cluster show
 ID | Name  | Role    | Status    | Upstream | Location | Connection string                                            
----+-------+---------+-----------+----------+----------+-----------------------------------------------------------------
 1  | node1 | standby |   running | node2    | default  | host=192.168.56.101 user=repmgr dbname=repmgr connect_timeout=2
 2  | node2 | primary | * running |          | default  | host=192.168.56.102 user=repmgr dbname=repmgr connect_timeout=2

Power off node2.

View on node1

$ repmgr -f /etc/repmgr.conf cluster show
 ID | Name  | Role    | Status        | Upstream | Location | Connection string                                            
----+-------+---------+---------------+----------+----------+-----------------------------------------------------------------
 1  | node1 | standby |   running     | node2    | default  | host=192.168.56.101 user=repmgr dbname=repmgr connect_timeout=2
 2  | node2 | primary | ? unreachable |          | default  | host=192.168.56.102 user=repmgr dbname=repmgr connect_timeout=2

WARNING: following issues were detected
  - when attempting to connect to node "node2" (ID: 2), following error encountered :
"could not connect to server: No route to host
	Is the server running on host "192.168.56.102" and accepting
	TCP/IP connections on port 5432?"
  - node "node2" (ID: 2) is registered as an active primary but is unreachable
  

At this point, node1 will also promote itself to primary; the log output for the promotion is essentially the same as in the previous test, so it is not repeated here.

References:
https://www.2ndquadrant.com/en/resources/repmgr/
https://github.com/2ndQuadrant/repmgr
https://repmgr.org/docs/4.1/using-repmgrd.html


Added by lasse48 on Wed, 25 Dec 2019 22:35:37 +0200