corosync + pacemaker + crmsh High Availability Cluster

Principle:

corosync provides the cluster's messaging layer: it carries heartbeat and cluster transaction information between nodes and monitors heartbeats over multicast.
pacemaker works at the resource allocation layer and acts as the cluster resource manager; resources are configured through crmsh, its command-line interface.
One handles heartbeat detection, the other handles resource transfer; together they provide automatic management of a high availability architecture.
Heartbeat detection checks whether a server is still providing service; as soon as it stops responding it is considered dead.
When a server is detected as dead, its service resources must be transferred elsewhere.
corosync is the open source software running at the heartbeat (messaging) layer.
pacemaker is the open source software running at the resource transfer (management) layer.

For more detailed background, see the following forum thread (it collects write-ups from several experts):
http://www.dataguru.cn/thread-527749-1-1.html

Get ready:

Environment:
Red Hat Enterprise Linux 6.5 (RHEL 6.5)

Machines and software:
hostname: server6 ip: 172.25.12.6 corosync pacemaker crmsh
hostname: server7 ip: 172.25.12.7 corosync pacemaker (crmsh)
hostname: server8 ip: 172.25.12.8 corosync pacemaker (crmsh)

1. The clocks of the three hosts must be synchronized. Since all three are snapshots of the same virtual machine, this is already taken care of here.
2. Each hostname must match the output of uname -n.
3. Configure the hosts file on all three hosts (name resolution).
4. The root users of the nodes must be able to reach each other with key-based SSH, because once the cluster is up the resources are controlled by the CRM, which needs to start and stop resources on every node (a minimal sketch follows this list).
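
A minimal sketch of steps 3 and 4, using the hostnames and IPs from this article (the passphrase-less key is purely for lab convenience):

cat >> /etc/hosts <<EOF      # repeat on every node
172.25.12.6 server6
172.25.12.7 server7
172.25.12.8 server8
EOF

ssh-keygen -t rsa -P '' -f /root/.ssh/id_rsa   # generate a root key pair
ssh-copy-id root@server7                       # push the public key to the other nodes
ssh-copy-id root@server8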

Software source:

corosync and pacemaker can be installed straight from the official installation media with yum, but the yum repositories have to be configured first:

The baseurl entries below point at my own mounted mirror; adjust them for your environment.

[Server]
name=Red Hat Enterprise Linux $releasever - $basearch - Source
baseurl=http://172.25.12.250/rhel6.5/Server
enabled=1
gpgcheck=0
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-redhat-release

[HighAvailability]
name=HighAvailability
baseurl=http://172.25.12.250/rhel6.5/HighAvailability
gpgcheck=0

[LoadBalancer]
name=LoadBalancer
baseurl=http://172.25.12.250/rhel6.5/LoadBalancer
gpgcheck=0

[ResilientStorage]
name=ResilientStorage
baseurl=http://172.25.12.250/rhel6.5/ResilientStorage
gpgcheck=0

[ScalableFileSystem]
name=ScalableFileSystem
baseurl=http://172.25.12.250/rhel6.5/ScalableFileSystem
gpgcheck=0

crmsh download:
http://rpm.pbone.net/index.php3/stat/4/idpl/23861008/dir/RedHat%C2%A0EL%C2%A06/com/crmsh-1.2.6-0.rc2.2.1.x86_64.rpm.html
crmsh dependent pssh download:
http://rpm.pbone.net/index.php3/stat/4/idpl/25907913/dir/redhat_el_6/com/pssh-2.3.1-5.el6.noarch.rpm.html

Installation:

Put the two downloaded RPMs in the current directory, then install everything in one go:
yum install -y corosync pacemaker crmsh-1.2.6-0.rc2.2.1.x86_64.rpm pssh-2.3.1-5.el6.noarch.rpm
The remaining dependencies are resolved automatically by yum.

To configure:

The corosync configuration file lives in the /etc/corosync/ directory:
mv corosync.conf.example corosync.conf   # a template is provided; modify it directly

#Totem defines how the nodes in the cluster communicate with each other. totem is the protocol corosync uses between nodes, and it is versioned.
totem {
    version: 2    #totem protocol version, do not change
    secauth: off  #security authentication; consumes a lot of CPU when enabled
    threads: 0    #number of parallel threads used once security authentication is enabled
    interface {
        ringnumber: 0   #ring number; with multiple network cards this keeps heartbeat rings apart
        bindnetaddr: 172.25.12.0  #heartbeat network address; corosync works out which local NIC address belongs to this network and multicasts heartbeat information through that interface
        mcastaddr: 226.94.1.1  #heartbeat multicast address (must be identical on all nodes)
        mcastport: 5405 #multicast port
        ttl: 1   #limit multicast packets to one hop to avoid loops
    }
}

logging {
    fileline: off
    to_stderr: no
    to_logfile: yes
    to_syslog: yes
    logfile: /var/log/cluster/corosync.log
    debug: off
    timestamp: on
    logger_subsys {
        subsys: AMF
        debug: off
    }
}

amf {
    mode: disabled
}

#Let pacemaker start as a plugin inside corosync:
service {
    ver: 0           #plugin version; 0 means corosync starts pacemaker itself
    name: pacemaker  #module name: starting corosync also starts pacemaker
}

#Identity the daemons run as
aisexec {
    user: root
    group: root
}

Generate multicast information key:

corosync-keygen generates the pre-shared key used to authenticate heartbeat traffic. It needs 1024 bits of entropy, which it reads from /dev/random.
# The resulting key is written to an authkey file in the configuration directory (/etc/corosync/authkey).

Be careful:
/dev/random is the Linux kernel's random number generator; it produces random numbers from an entropy pool that is filled from system interrupts. Encryption and key-generation programs consume large amounts of random data, and /dev/random blocks once the entropy pool is empty: the process waits until interrupts have replenished it.
Because a 1024-bit key is needed here, the entropy pool can easily run dry and key generation would hang, so /dev/urandom is used instead. /dev/urandom does not block waiting for interrupts, at the cost of lower-quality randomness.

Operation:

mv /dev/{random,random.bak}
ln -s /dev/urandom /dev/random
corosync-keygen   #It won't get stuck.

chmod 400 authkey  # Key file permissions must be 400 or 600
scp -p authkey corosync.conf server7:/etc/corosync/
scp -p authkey corosync.conf server8:/etc/corosync/
  # Copy the newly generated key and configuration file to other nodes and save permissions.

service corosync start   # start the service on all nodes

crm_mon   # view cluster status

Last updated: Fri May 26 20:27:03 2017
Last change: Fri May 26 19:19:16 2017 via cibadmin on server6
Stack: classic openais (with plugin)
Current DC: server8 - partition with quorum
Version: 1.1.10-14.el6-368c726
3 Nodes configured, 3 expected votes  #Three nodes, three votes
0 Resources configured    #Number of resources


Online: [ server6 server7 server8 ]  #Online Node

Log files:
/var/log/cluster/corosync.log

Once start-up is complete, run a series of checks on each node to confirm the components are working properly:

grep -e "Corosync Cluster Engine" -e "configuration file" /var/log/cluster/corosync.log
#Ensure Cluster Engine works properly
grep  "TOTEM"  /var/log/cluster/corosync.log
# Check that the initial membership notifications were issued properly
grep pcmk_startup /var/log/cluster/corosync.log
# See if the pcmk(pacemaker abbreviation) plug-in works properly
grep ERROR /var/log/cluster/corosync.log
# Check for errors during startup        
# If the log complains that pacemaker should not run as a plugin, that message can be ignored;
# it may also report that the PE is not working properly; check that with crm_verify -L -V.

crm_verify -L -V

error: unpack_resources: Resource start-up disabled since no STONITH resources have been defined
error: unpack_resources: Either configure some or disable STONITH with the stonith-enabled option
error: unpack_resources: NOTE: Clusters with shared data need STONITH to ensure data integrity
Errors found during check: config not valid
# stonith is enabled by default, but this cluster has no stonith device, so the default configuration cannot pass validation yet.
# In other words there is no STONITH device; for the purposes of this experiment the error can be ignored.

crm:
1. Configurations made in the crm management interface are synchronized to each node
2. Any operation will take effect only after commit submission.
3. You need to stop a resource before you want to delete it.
4. Use help COMMAND to get help for a command
5. Supporting TAB Completion

Working modes:
1. Batch mode: pass commands directly on the shell command line
2. Interactive mode (an example of each follows)
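
For example, the same status query can be issued either way (a minimal sketch; the interactive form is what the rest of this article uses):

# batch mode: pass the command on the shell command line, get the output, return to the shell
crm status
crm configure show

# interactive mode: enter the crm shell, then type subcommands
crm
crm(live)# status
crm(live)# quit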

[root@server8 ~]# crm
crm(live)# help   #Get the currently available commands
This is crm shell, a Pacemaker command line interface.
Available commands:
cib              manage shadow CIBs # cib sandbox
resource         resources management # All resources are defined after this subcommand
configure        CRM cluster configuration # Editing Cluster Configuration Information
node             nodes management # Cluster Node Management Subcommand
options          user preferences # user preferences
history          CRM cluster history
site             Geo-cluster support
ra               resource agents information center # Resource Agent Subcommand (all processes associated with resource agents are under this command)
status           show cluster status # Display status information for the current cluster
help,?           show help (help topics for list of topics)# View possible commands in the current area
end,cd,up        go back one level # Return to Level 1 crm(live)#
quit,bye,exit    exit the program  # Exit crm (live) interaction mode

Common subcommands:

1.resource subcommand # defines the state of all resources

crm(live)resource# help
Available commands:
status           show status of resources #Display resource status information
start            start a resource #Start a resource
stop             stop a resource #Stop a resource
restart          restart a resource #Restart a resource
promote          promote a master-slave resource #Promoting a master-slave resource
demote           demote a master-slave resource #Downgrading a master-slave resource
manage           put a resource into managed mode
unmanage         put a resource into unmanaged mode
migrate          migrate a resource to another node #Migrate resources to another node
unmigrate        unmigrate a resource to another node
param            manage a parameter of a resource #Parameters for managing resources
secret           manage sensitive parameters #Managing sensitive parameters
meta             manage a meta attribute #Managing meta attributes of a resource
utilization      manage a utilization attribute
failcount        manage failcounts #Management Failure Counter
cleanup          cleanup resource status #Clean up resource status
refresh          refresh CIB from the LRM status #Refresh the CIB (Cluster Information Base) from the LRM (Local Resource Manager) status
reprobe          probe for resources not started by the CRM #Detecting resources that are not started in CRM
trace            start RA tracing #Enabling Resource Agent (RA) Tracking
untrace          stop RA tracing #Disable Resource Agent (RA) Tracking
help             show help (help topics for list of topics) #Display help
end              go back one level #Return to Level 1(crm(live)#)
quit             exit the program #Exit Interactive Program

2.configure subcommand # resource stickiness, resource type, resource constraints

crm(live)configure# help
Available commands:
node             define a cluster node #Define a cluster node
primitive        define a resource #Defining resources
monitor          add monitor operation to a primitive #Add a monitor operation to a resource (e.g. check interval, timeout, what to do when the check fails)
group            define a group #Define a group type (integrating multiple resources)
clone            define a clone #Define a clone type (you can set the total number of clones and run several clones on each node)
ms               define a master-slave resource #Define a master-slave type (the nodes in the cluster can only have one running master resource and the other slaves can be used as standby)
rsc_template     define a resource template #Define a resource template
location         a location preference # Location constraint: a preference score for running a resource on a particular node (the node with the highest score runs it)
colocation       colocate resources #Colocation constraint: how strongly resources must (or must not) run on the same node
order            order resources #Order constraint: the order in which resources are started
rsc_ticket       resources ticket dependency#
property         set a cluster property #Setting Cluster Properties
rsc_defaults     set resource defaults #Setting default properties of resources (stickiness)
fencing_topology node fencing order #Sequence of Isolated Nodes
role             define role access rights #Define access rights for roles
user             define user access rights #Define user access rights
op_defaults      set resource operations defaults #Setting default options for resources
schema           set or display current CIB RNG schema
show             display CIB objects #Display Cluster Information Base objects
edit             edit CIB objects #Editing Cluster Information Base Objects (Editing in vim mode)
filter           filter CIB objects #Filtering CIB objects
delete           delete CIB objects #Delete CIB objects
default-timeouts     set timeouts for operations to minimums from the meta-data
rename           rename a CIB object #Rename CIB objects
modgroup         modify group #Change resource groups
refresh          refresh from CIB #Reread CIB information
erase            erase the CIB #Clear CIB information
ptest            show cluster actions if changes were committed
rsctest          test resources as currently configured
cib              CIB shadow management
cibstatus        CIB status management and editing
template         edit and import a configuration from a template
commit           commit the changes to the CIB #Submit the changed information to CIB
verify           verify the CIB with crm_verify #CIB Syntax Verification
upgrade          upgrade the CIB to version 1.0
save             save the CIB to a file #Export the current CIB to a file (written to the directory from which crm was started)
load             import the CIB from a file #Loading CIB from File Content
graph            generate a directed graph
xml              raw xml
help             show help (help topics for list of topics) #display help information
end              go back one level #Back to Level 1(crm(live)#)

3.node subcommand Node management and status

crm(live)# node
crm(live)node# help
Node management and status commands.
Available commands:
status           show nodes status as XML #Display node status information in xml format
show             show node #Display node status information in command line format
standby          put node into standby #Take a node offline into standby (append the node name/FQDN to target a specific node)
online           set node online #Node Re-online
maintenance      put node into maintenance mode
ready            put node into ready mode
fence            fence node #Isolated Node
clearstate       Clear node state #Clean up node status information
delete           delete node #Delete a node
attribute        manage attributes
utilization      manage utilization attributes
status-attr      manage status attributes
help             show help (help topics for list of topics)
end              go back one level
quit             exit the program

4.ra subcommand # Resource agent classes and information are all here

crm(live)# ra
crm(live)ra# help
Available commands:
classes          list classes and providers # List resource agent classes and providers
list             list RA for a class (and provider) # List the resource agents available in a class
meta             show meta data for a RA # Show the available parameters of a resource agent (e.g. meta ocf:heartbeat:IPaddr2)
providers        show providers for a RA and a class
help             show help (help topics for list of topics)
end              go back one level
quit             exit the program
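
As a quick illustration (commands only, output omitted), the IPaddr agent used later in this article can be located and inspected like this:

crm(live)ra# classes                    # show all resource agent classes
crm(live)ra# list ocf heartbeat         # list the agents under ocf:heartbeat
crm(live)ra# meta ocf:heartbeat:IPaddr  # show the parameters IPaddr accepts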

Disable stonith (if there is no stonith device, it should be disabled):

configure
crm(live)configure# property stonith-enabled=false
crm(live)configure# commit
crm_verify -L -V   #The stonith device is no longer checked, so validation now passes.

Configure vip:

crm(live)# configure
crm(live)configure# primitive vip ocf:heartbeat:IPaddr params ip=172.25.12.200 nic='eth0' cidr_netmask='24' broadcast='172.25.12.255'
# As long as IPaddr exists under only one resource agent class, the ocf:heartbeat prefix can be omitted
crm(live)configure# verify
crm(live)configure# commit
crm(live)configure# cd
crm(live)# status
Last updated: Fri May 26 20:56:39 2017
Last change: Fri May 26 20:56:34 2017 via cibadmin on server8
Stack: classic openais (with plugin)
Current DC: server8 - partition with quorum
Version: 1.1.10-14.el6-368c726
3 Nodes configured, 3 expected votes
1 Resources configured
Online: [ server6 server7 server8 ]

 vip    (ocf::heartbeat:IPaddr):    Started server6
 #vip runs on server 6

Check the VIP:

[root@server6 ~]# ip addr show eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 52:54:00:0b:8f:1f brd ff:ff:ff:ff:ff:ff
    inet 172.25.12.6/24 brd 172.25.12.255 scope global eth0
    inet 172.25.12.200/24 brd 172.25.12.255 scope global secondary eth0
    inet6 fe80::5054:ff:fe0b:8f1f/64 scope link 
       valid_lft forever preferred_lft forever

Now that the VIP has been configured, resources defined through crm are synchronized to and take effect on every node. Next, put server6 into standby and watch the VIP move to another node.

[root@server6 ~]# crm
crm(live)# node
crm(live)node# standby 
crm(live)node# cd
crm(live)# status
Last updated: Fri May 26 21:01:33 2017
Last change: Fri May 26 21:01:28 2017 via crm_attribute on server6
Stack: classic openais (with plugin)
Current DC: server8 - partition with quorum
Version: 1.1.10-14.el6-368c726
3 Nodes configured, 3 expected votes
1 Resources configured


Node server6: standby
Online: [ server7 server8 ]  #Now only server7 server8 is running

 vip    (ocf::heartbeat:IPaddr):    Started server8 
 #vip was transferred to server 8

View the transferred vip:

[root@server8 ~]# ip addr show eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 52:54:00:3e:fb:19 brd ff:ff:ff:ff:ff:ff
    inet 172.25.12.8/24 brd 172.25.12.255 scope global eth0
    inet 172.25.12.200/24 brd 172.25.12.255 scope global secondary eth0
    inet6 fe80::5054:ff:fe3e:fb19/64 scope link 
       valid_lft forever preferred_lft forever

crm node online   # bring server6 back online

Be careful:
If the cluster has only two nodes:
when one node is stopped, the resources disappear instead of moving to the other node, because with one of two nodes gone the survivor cannot reach a majority of votes and the cluster status becomes WITHOUT quorum.
That is why three nodes are used here: when one stops, the remaining two can still form a quorum.
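
If a two-node cluster is unavoidable, the usual workaround (not used in this article, mentioned only as an aside) is to tell pacemaker to keep running resources even when quorum is lost:

crm(live)configure# property no-quorum-policy=ignore
crm(live)configure# commit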

Experiment:

Define a high availability cluster:
1. VIP: 172.25.12.200
2. Configure the apache (httpd) service
3. Define constraints to control the resource start-up order and to keep the two resources on the same node

The monitor operation (monitoring resources):

 monitor <rsc> [:<role>] <interval>  [:<timeout>]  

That is: which resource to monitor, in which role, how often to check it, and how long before a check times out.
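
As a standalone sketch (in this article the monitor is instead attached with op monitor when the primitive is defined below), adding a 10-second monitor with a 20-second timeout to a resource named webip would look like:

crm(live)configure# monitor webip 10s:20s
crm(live)configure# commit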

Before running the experiment, delete the VIP defined earlier.
A resource must be stopped before it can be deleted:

crm(live)# resource 
crm(live)resource# status
 vip    (ocf::heartbeat:IPaddr):    Started 
crm(live)resource# stop vip
crm(live)resource# status
 vip    (ocf::heartbeat:IPaddr):    Stopped 
crm(live)resource# cd
crm(live)# configure 
crm(live)configure# delete vip
crm(live)configure# commit

1. Define vip:

crm(live)configure# primitive webip IPaddr params ip=172.25.12.200 op monitor interval=10s timeout=20s
crm(live)configure# verify
crm(live)configure# commit 
crm(live)configure# cd
crm(live)# status
Last updated: Fri May 26 19:00:35 2017
Last change: Fri May 26 19:00:27 2017 via cibadmin on server6
Stack: classic openais (with plugin)
Current DC: server8 - partition with quorum
Version: 1.1.10-14.el6-368c726
3 Nodes configured, 3 expected votes
1 Resources configured


Node server6: standby
Online: [ server7 server8 ]

 webip  (ocf::heartbeat:IPaddr):    Started server7

2. Install apache (httpd) on every node:
yum install httpd
Do not start the service and do not enable it at boot; the cluster will start and stop it (a quick per-node test page is sketched below).
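
To make the failover visible in the browser later, a distinct test page can be created on each node (a hypothetical test page for this lab, not required by the cluster itself; change the text per node):

echo "apache on server6" > /var/www/html/index.html   # run on server6; use server7/server8 accordingly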

3. Define apache resources:

crm(live)# configure 
crm(live)configure# primitive webserver lsb:httpd op monitor interval=30s timeout=15s
crm(live)configure# verify
crm(live)configure# commit 
crm(live)configure# cd
crm(live)# status
Last updated: Fri May 26 19:03:47 2017
Last change: Fri May 26 19:03:37 2017 via cibadmin on server6
Stack: classic openais (with plugin)
Current DC: server8 - partition with quorum
Version: 1.1.10-14.el6-368c726
3 Nodes configured, 3 expected votes
2 Resources configured


Node server6: standby
Online: [ server7 server8 ]

 webip  (ocf::heartbeat:IPaddr):    Started server7 
 webserver  (lsb:httpd):    Started server8

Now both resources are started, but they are scattered across different nodes: by default the cluster spreads resources across the nodes as evenly as possible.

Two solutions:
1. Group the resources: define the two resources as a group and operate on them as a single unit (see the sketch after this list);
2. Define a colocation constraint (also called a placement constraint), which states that the two resources must run together.

We chose the second one.
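
For reference, the group approach (option 1, not used here) would look roughly like this, with webgroup as a hypothetical group name:

crm(live)configure# group webgroup webip webserver
crm(live)configure# commit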

crm(live)# configure 
crm(live)configure# colocation apache inf: webserver webip
crm(live)configure# show
node server6 \
    attributes standby="on"
node server7
node server8
primitive webip ocf:heartbeat:IPaddr \
    params ip="172.25.12.200" \
    op monitor interval="10s" timeout="20s"
primitive webserver lsb:httpd \
    op monitor interval="30s" timeout="15s"
colocation apache inf: webserver webip
property $id="cib-bootstrap-options" \
    dc-version="1.1.10-14.el6-368c726" \
    cluster-infrastructure="classic openais (with plugin)" \
    expected-quorum-votes="3" \
    stonith-enabled="false"
crm(live)configure# cd
crm(live)# status
Last updated: Fri May 26 22:37:13 2017
Last change: Fri May 26 22:16:30 2017 via cibadmin on server8
Stack: classic openais (with plugin)
Current DC: server8 - partition with quorum
Version: 1.1.10-14.el6-368c726
3 Nodes configured, 3 expected votes
2 Resources configured


Online: [ server6 server7 server8 ]

 webip  (ocf::heartbeat:IPaddr):    Started server7 
 webserver  (lsb:httpd):    Started server7
 ##Both resources are on 7.

Define order constraints:

crm(live)configure# order webip_before_webserver mandatory: webip webserver
crm(live)configure# commit

Note: mandatory makes the constraint compulsory: webip and webserver must start in exactly the order given, webip first.

Experimental tests:

Visiting http://172.25.12.200 returns the http page served by server7.
Now put server7 into standby:

crm(live)# status 
Last updated: Fri May 26 22:42:51 2017
Last change: Fri May 26 22:42:45 2017 via crm_attribute on server7
Stack: classic openais (with plugin)
Current DC: server8 - partition with quorum
Version: 1.1.10-14.el6-368c726
3 Nodes configured, 3 expected votes
2 Resources configured


Node server7: standby
Online: [ server6 server8 ]

 webip  (ocf::heartbeat:IPaddr):    Started server6 
 webserver  (lsb:httpd):    Started server6 

The whole set of resources is transferred to server6; refreshing http://172.25.12.200 now returns the http page served by server6.

Bring server7 back online; the resources do not fail back:

crm(live)# status 
Last updated: Fri May 26 22:44:32 2017
Last change: Fri May 26 22:44:27 2017 via crm_attribute on server8
Stack: classic openais (with plugin)
Current DC: server8 - partition with quorum
Version: 1.1.10-14.el6-368c726
3 Nodes configured, 3 expected votes
2 Resources configured


Online: [ server6 server7 server8 ]

 webip  (ocf::heartbeat:IPaddr):    Started server6 
 webserver  (lsb:httpd):    Started server6 
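
As an aside (not configured in this article), if you want to state explicitly that resources should stay where they are after a failover, resource stickiness can be set as a resource default via the rsc_defaults command listed in the configure help above:

crm(live)configure# rsc_defaults resource-stickiness=100
crm(live)configure# commit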
