3.6 CSM, RMC and RSCT management system

Last updated 2021 / 07 / 14

This is a group of logical components, made up of several pieces of software, that handles communication and control among the Power server, its partitions and the HMC. CSM, RMC and RSCT are intertwined and call one another. As a fan of IBM, Power Systems and AIX, the author usually embraces IBM's new technologies, praises their strengths and defends their weaknesses as far as possible. Even so, IBM now and then builds something it finds elegant and its users find nauseating, and CSM, RMC and RSCT are the best-known examples. Looking back through history, only PSSP outdoes them: that was the management software of the IBM SP (the machine whose architecture was used to play chess against Kasparov). The only defense I can offer is that IBM was ahead of its time and did not anticipate that not all of its users are mathematicians.

RSCT, RMC and CSM are becoming more and more important in current AIX virtualization environments. Many dynamic partition operations are carried out by these three working together. RSCT establishes the communication link between the HMC and the partition; RMC uses RSCT to identify and control partition resources, which is what makes dynamic resource adjustment (DLPAR) possible; CSM is involved when the HMC installs or restarts a partition, and serves as an auxiliary to NIM network installation.

The CSM, RSCT and AIX patch levels must correspond to one another. Generally, a given AIX version and patch level either contains or requires a specific version of RSCT and CSM. Installable versions of CSM and RSCT can be found on the AIX installation media, and patches must be downloaded from the IBM website. RSCT can also be installed from the HACMP/PowerHA CD, but that version does not necessarily match.

RSCT is essentially cluster management software. Its purpose is to manage all nodes in a cluster from a single point. However, RSCT itself has no management function; it only serves as the foundation for management by providing communication between nodes. Nodes in the same administrative domain are called peers, and the administrative domain is called a peer domain. The important management function in RSCT comes from the Configuration Resource Manager (CRM). When AIX is installed, or RSCT is installed separately, the peer domain is created automatically. At the same time, RSCT calls the preprpnode command, which sets the trusted hosts and IP addresses. RSCT also initializes an ACL control file, /var/ct/cfg/ctrmc.acls. This configuration allows other nodes in the same peer domain to access the resources of every node in it.

Because RSCT lets other nodes access its information through peer domain authorization, the most common error is "Permission denied". The list of authorized hosts is saved in /var/ct/cfg/ct_has.thl. Although this list is not encrypted, it is not plain text that can be displayed directly. Two other files, /var/ct/cfg/ctrmc.acls and /var/ct/cfg/ctsec.nodeinfo, hold additional information in clear text. All three are related to authorization. The thl file stores node information, including the node itself and its managers. The HMC is set as the default management node: after AIX is installed, the system automatically saves the HMC-related authorization in the configuration files. The details can be displayed with the command /usr/sbin/rsct/bin/ctsthl -l. The authorized operations are stored in the acls file.
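As a rough aid for reading the trusted host list, the sketch below pulls only the host identities out of ctsthl -l output. It assumes every entry carries a "Host Identity:" line, as in the sample text; thl_hosts is a hypothetical helper name, and the sample hosts and identifier values are illustrative, not taken from a real system.

```shell
# thl_hosts: hypothetical helper; reads "ctsthl -l" output on stdin and
# prints one host identity per line.
# Assumption: every entry contains a "Host Identity:" line.
thl_hosts() {
    awk '/^Host Identity:/ { sub(/^Host Identity:[ \t]*/, ""); print }'
}

# Illustrative sample (values are made up). On a real system use:
#   /usr/sbin/rsct/bin/ctsthl -l | thl_hosts
thl_sample='--------------------
Host Identity:                 10.10.160.72
Identifier Generation Method:  rsa512
Identifier Value:
120200a1b2c3
--------------------
Host Identity:                 server5
Identifier Generation Method:  rsa512
Identifier Value:
120200d4e5f6
--------------------'

printf '%s\n' "$thl_sample" | thl_hosts
```

If the HMC address is missing from this list, the authorization files are the first thing to suspect.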

If one or more of the above files is damaged, the RSCT management function fails, and so do DLPAR operations from the HMC. /usr/sbin/rsct/install/bin/recfgct reinitializes the authorization files. Since many IBM products are built on RSCT, reinitializing RSCT affects them too (GPFS, HACMP and so on), so it is best to stop GPFS, PowerHA/HACMP and similar software before performing this operation, to avoid the risk of downtime.
If the problem still cannot be solved after reconfiguration, the simpler repair is to remove the RMC, RSCT and related filesets and reinstall them. However, if the system is very important and the filesets cannot be removed and reinstalled at will, the following process can help you check and repair RMC faults step by step. Of course, this process is long and requires some experience.

  1. Check the RMC daemon status. Run the lssrc command and check the output.
# lssrc -s ctrmc
Subsystem         Group            PID     Status
 ctrmc            rsct             2388    active
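The status check above can be scripted. The following is a minimal sketch, assuming the column layout shown in the sample output; src_status is a hypothetical helper name, and it reads the text from standard input so that it can be tried without an AIX system.

```shell
# src_status: hypothetical helper; reads "lssrc -s <name>" output on stdin
# and prints the Status column for the named subsystem.
# The status is taken as the last field, because the PID column is empty
# when the subsystem is not running.
src_status() {
    awk -v name="$1" '$1 == name { print $NF }'
}

# Sample taken from the output above. On a real system use:
#   lssrc -s ctrmc | src_status ctrmc
lssrc_sample='Subsystem         Group            PID     Status
 ctrmc            rsct             2388    active'

status=$(printf '%s\n' "$lssrc_sample" | src_status ctrmc)
if [ "$status" = "active" ]; then
    echo "ctrmc is active"
else
    echo "ctrmc is not active (status: ${status:-unknown})"
fi
```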

If the status of the daemon is inoperative, you need to find out why it is not active, because the ctrmc process is started automatically when AIX boots. The errpt error report sometimes reveals the problem. Some useful information is also recorded in the syslog file (/var/log/messages); search for the keyword rmcd.

If ctrmc is not active, you can restart with the command rmcctrl, and then check the status and error records again:

#/usr/sbin/rsct/bin/rmcctrl -s

You can also run /usr/sbin/rsct/bin/rmcd directly, in which case any error message is printed straight to the screen. (rmcctrl -s also runs rmcd to start RMC.)

  2. If rmcd does not start normally, check with netstat -a, -an or -A. rmcd needs to listen on port 657 (TCP and UDP). If this port is occupied, rmcd cannot start. You can use a command such as lsof to find out what is occupying port 657. Once rmcd has started normally, port 657 should be in the LISTEN state.
# netstat -an | more
Active Internet connections (including servers)
Proto Recv-Q Send-Q  Local Address  Foreign Address        (state)
tcp        0      0  *.21                   *.*                    LISTEN
tcp4       0      0  *.22                   *.*                    LISTEN
tcp        0      0  *.23                   *.*                    LISTEN
tcp4       0      0  *.111                  *.*                    LISTEN
tcp4       0      0  *.199                  *.*                    LISTEN
tcp        0      0  *.657                  *.*                    LISTEN
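A minimal sketch of the port check follows. It assumes AIX-style local addresses such as *.657, where the port is the part after the last dot; port_listening is a hypothetical helper name, and it reads netstat -an output from standard input.

```shell
# port_listening: hypothetical helper; reads "netstat -an" output on stdin
# and succeeds if a TCP socket is in LISTEN state on the given port.
# Assumption: local addresses look like "*.657" or "10.1.2.3.657".
port_listening() {
    awk -v port="$1" '
        $1 ~ /^tcp/ && $NF == "LISTEN" {
            n = split($4, parts, ".")
            if (parts[n] == port) found = 1
        }
        END { exit found ? 0 : 1 }
    '
}

# Two lines from the output above. On a real system use:
#   netstat -an | port_listening 657
netstat_sample='tcp4       0      0  *.22                   *.*                    LISTEN
tcp        0      0  *.657                  *.*                    LISTEN'

if printf '%s\n' "$netstat_sample" | port_listening 657; then
    echo "port 657 is in LISTEN state"
else
    echo "nothing is listening on port 657"
fi
```

The check only says that something is listening on the port; it does not prove that the listener is rmcd.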
  3. Use rmcdomainstatus to view the RMC management domain information. You should see the IP address of each HMC and its status (node information). If there is no output, or some other message appears, there is a problem in the RMC management domain.
#/usr/sbin/rsct/bin/rmcdomainstatus -s ctrmc
Management Domain Status: Management Control Points
  I A  0xe81cea1a25060a2c  0001  10.10.160.72
  I A  0xb7b0064a1b1b75e9  0002  10.10.160.73

The node information in an RMC management domain uses several status letters:

  • S: the node's own information
  • I: the node is up, as determined by the RMC heartbeat
  • A: active, with no messages queued for the node
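To check the management domain from a script, the data lines can be filtered as in the following sketch. It assumes each management control point line has five fields, as in the sample output above; mcp_addresses is a hypothetical helper name.

```shell
# mcp_addresses: hypothetical helper; reads "rmcdomainstatus -s ctrmc" output
# on stdin and prints "<letter1> <letter2> <address>" for every management
# control point line.
# Assumption: data lines have five fields and the third starts with "0x".
mcp_addresses() {
    awk 'NF == 5 && $3 ~ /^0x/ { print $1, $2, $5 }'
}

# Sample taken from the output above. On a real system use:
#   /usr/sbin/rsct/bin/rmcdomainstatus -s ctrmc | mcp_addresses
domain_sample='Management Domain Status: Management Control Points
  I A  0xe81cea1a25060a2c  0001  10.10.160.72
  I A  0xb7b0064a1b1b75e9  0002  10.10.160.73'

mcps=$(printf '%s\n' "$domain_sample" | mcp_addresses)
if [ -z "$mcps" ]; then
    echo "no management control points found: check the RMC management domain"
else
    printf '%s\n' "$mcps"
fi
```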
  4. Check the configuration and the operation logs related to the RMC management domain.
  • The IP addresses and host names in /etc/hosts must match. The host name and IP of the HMC need not be listed there, but if there are entries in /etc/hosts, they must be correct.
  • Delete the following logs:
/var/ct/IW/log/mc/trace (ctrmc log)
/var/ct/IW/log/mc/IBM.CSMAgentRM/trace (CSMAgentRM log)
/var/ct/IW/log/mc/IBM.DMRSM/trace (DRM log)
  • Restart RMC with the following commands, then review the logs again:
# rmcctrl -z    (on both the HMC and the LPAR)
# rmcctrl -A    (on the HMC first)
# rmcctrl -A    (on the LPAR second)
  • Wait 5 minutes, then view the logs listed above with the rpttr -o dtic trace command
  • Use lsrsrc (with the -l parameter) to view the resource classes in turn, such as IBM.MCP
# lsrsrc IBM.MCP
Resource Persistent Attributes for IBM.MCP
resource 1:
        MNName           = "192.168.42.164"
        NodeID           = 16725500514158381612
        KeyToken         = "10.10.160.72"
        IPAddresses= {"10.10.160.72","172.17.0.2","fe80::e61f:13ff:fe2c:349c","fe80::e61f:13ff:fe2c:349e"}
        ActivePeerDomain = ""
        NodeNameList     = {"server5"}
resource 2:
        MNName           = "192.168.42.164"
        NodeID           = 13236086220194018793
        KeyToken         = "10.10.160.73"
        IPAddresses= {"10.10.160.73","172.16.0.1","fe80::e61f:13ff:fe2c:31b8","fe80::e61f:13ff:fe2c:31ba"}
        ActivePeerDomain = ""
        NodeNameList     = {"server5"}
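The KeyToken values are the HMC addresses that the partition has on record, so they are worth comparing with the addresses you expect. A minimal sketch, assuming the attribute layout shown above (mcp_keytokens is a hypothetical helper name):

```shell
# mcp_keytokens: hypothetical helper; reads "lsrsrc IBM.MCP" output on stdin
# and prints the KeyToken value of each resource, one per line.
# Assumption: the value is enclosed in double quotes, as in the output above.
mcp_keytokens() {
    awk -F'"' '$1 ~ /^[ \t]*KeyToken[ \t]*=/ { print $2 }'
}

# Shortened sample from the output above. On a real system use:
#   lsrsrc IBM.MCP | mcp_keytokens
mcp_sample='resource 1:
        MNName           = "192.168.42.164"
        KeyToken         = "10.10.160.72"
resource 2:
        MNName           = "192.168.42.164"
        KeyToken         = "10.10.160.73"'

printf '%s\n' "$mcp_sample" | mcp_keytokens
```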
  5. Run lspartition in debug mode on the HMC
lspartition -dlpar -debug
     <#0> Partition:<002, rp02.mytest.com, 10.114.69.66>
          Active:<1>, OS:<AIX, 5.2>, DCaps:<0xf>, CmdCaps:<0x1, 0x1>
          -------
     <#1> Partition:<004, rp04.mytest.com, 10.114.69.68>
          Active:<1>, OS:<AIX, 5.2>, DCaps:<0xf>, CmdCaps:<0x1, 0x1>
          -------
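With many partitions the listing is easier to read when filtered. The sketch below prints only the partitions that report Active:<0> or DCaps:<0x0>. Treating those two values as "not ready for DLPAR" is an assumption made here, not something stated in the output itself. dlpar_not_ready is a hypothetical helper name, and the second partition in the sample was altered for illustration.

```shell
# dlpar_not_ready: hypothetical helper; reads lspartition output on stdin and
# prints the partition tuple of every partition that reports Active:<0> or
# DCaps:<0x0>.
# Assumption: each Partition line is followed by its Active/DCaps line.
dlpar_not_ready() {
    awk '
        /Partition:</ { part = $0; sub(/^.*Partition:/, "", part) }
        /Active:<0>/ || /DCaps:<0x0>/ { print part }
    '
}

# Illustrative sample: the first partition is from the output above, the
# second one has been altered to show a failing case.
lspartition_sample='     <#0> Partition:<002, rp02.mytest.com, 10.114.69.66>
          Active:<1>, OS:<AIX, 5.2>, DCaps:<0xf>, CmdCaps:<0x1, 0x1>
     <#1> Partition:<004, rp04.mytest.com, 10.114.69.68>
          Active:<0>, OS:<, >, DCaps:<0x0>, CmdCaps:<0x0, 0x0>'

printf '%s\n' "$lspartition_sample" | dlpar_not_ready
```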
  6. Check the DLPAR daemons (to run the checks on the HMC, you first need to obtain the pe password and then su to root)
  • IBM.LparCmdRM purpose:
    a) Run on HMC
    b) Complete DLPAR operation
  • IBM.DRM
    a) Run on partition
    b) Execute DLPAR command
    c) Execute after the HMC sends the command through LparCmdRM
              HMC                               LPARs
1) DMSRM Create encryption key and push to LPARs   ------->    CSMAgentRM
                 <----------  Return record information <-----------|
2) LparCmdRM  -----------> send out LparCmd_cb  ------>DRM
                                                  |
                                             implement DLPAR command
  7. Increase the trace size. If the existing trace file is too small to record enough information (the trace file wraps around), you can increase its size.
#stopsrc -s <RM resource name>
#startsrc -s <RM resource name> -e TR_SIZE=#######

Repeat this for each resource manager whose trace file needs to be enlarged.
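When several resource managers are involved, it can help to generate the command pairs first and review them before running anything. The following sketch only prints the commands. print_trace_resize is a hypothetical helper name, and the size value and subsystem names in the call are placeholders.

```shell
# print_trace_resize: hypothetical helper; prints (does not run) the
# stopsrc/startsrc pair for each resource manager named on the command line.
# $1 is the TR_SIZE value; the remaining arguments are subsystem names.
print_trace_resize() {
    size=$1
    shift
    for subsys in "$@"; do
        echo "stopsrc -s $subsys"
        echo "startsrc -s $subsys -e TR_SIZE=$size"
    done
}

# The size and the subsystem names below are placeholders for illustration.
print_trace_resize 1000000 IBM.DRM IBM.CSMAgentRM
```

Printing instead of executing keeps the sketch harmless; the output can be pasted into a root shell once it has been checked.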

  8. Change the trace levels
  • ctcasd (the following change raises the trace detail level from 1 to 8; the maximum is 9):
vi /usr/sbin/rsct/cfg/ctcasd.cfg
 Find the line "TRACELEVELS= _SEC:Info=1,_SEC:Errors=1" and change it to:
"TRACELEVELS= _SEC:Info=8,_SEC:Errors=8"
stopsrc -s ctcasd
startsrc -s ctcasd
  • RMC daemon trace
/usr/sbin/rsct/bin/rmctrace -s ctrmc -a all_but_msgs=2
# Set the trace level of all items except msgs to 2

/usr/sbin/rsct/bin/rmctrace -s ctrmc -a all=2
# Set the trace level of all items to 2

/usr/sbin/rsct/bin/rmctrace -s ctrmc -a all=0
# Set the trace level of all items to 0, which suppresses all messages. Because some messages are not controlled by level, the system still shows some of them even at 0.

chssys -s ctrmc -a "-d all=2"
# Trace rmcd startup
  • Trace RMC GS (Group Services)
vi /usr/sbin/rsct/bin/rmcd_start
 Add the following line before the line "exec /usr/sbin/rsct/bin/rmcd $ROPT $MOPT $NOPT $@":
export CT_TR_TRACE_LEVELS="_GSA:Info=2"
  • Trace RMC commands
# Enable:
/usr/sbin/rsct/bin/rmctrace -s ctrmc -a prm=100
# The trace file is in the /var/ct/IW/log/mc directory

# Disable:
/usr/sbin/rsct/bin/rmctrace -s ctrmc -a prm=0

# Format the output. The trace file is located in /var/ct/<peer domain or IW>/log/mc (each command has its own trace file; the peer domain is a string of digits and should differ on each AIX instance)
/usr/bin/rpttr -o dtic <trace file name>
  • RMC API. RMC is widely used, for example in HMC partition management and HACMP node management, and all of these call the RMC API. To add API calls to the RMC trace, use the following commands:
export CT_TR_FILENAME=/tmp/cmd.trace
export CT_TR_SIZE=0x100000
# CT_TR_SIZE controls the size of the trace file; here it is 1 MB
export CT_TR_TRACE_LEVELS="_MCA:*=8"
# * stands for all items
# Run lsrsrc or another related command and check whether /tmp/cmd.trace has been created
# Run the other RMC commands or actions that need to be traced. The following 2 commands disable tracing
unset CT_TR_FILENAME
unset CT_TR_TRACE_LEVELS
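An alternative to exporting and unsetting by hand is to set the variables for a single command only. The sketch below does this in a subshell, so the calling shell is left untouched. with_rmc_trace is a hypothetical wrapper name; the variable names and values are the ones used above.

```shell
# with_rmc_trace: hypothetical wrapper; runs one command with the three
# CT_TR_* variables set inside a subshell, so nothing has to be unset
# afterwards.
# $1 is the trace file; the remaining arguments are the command to run.
with_rmc_trace() {
    trace_file=$1
    shift
    (
        CT_TR_FILENAME=$trace_file
        CT_TR_SIZE=0x100000
        CT_TR_TRACE_LEVELS="_MCA:*=8"
        export CT_TR_FILENAME CT_TR_SIZE CT_TR_TRACE_LEVELS
        "$@"
    )
}

# On an AIX partition this would be, for example:
#   with_rmc_trace /tmp/cmd.trace lsrsrc IBM.MCP
# Here env is used only to show that the variables reach the command.
with_rmc_trace /tmp/cmd.trace env | grep '^CT_TR_'
```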

There are two common RSCT problems. The first is that older AIX levels do not reinitialize the unique RSCT node ID when a system is installed from a mksysb, so two systems installed from the same mksysb end up with conflicting RSCT IDs. The recommended fixes are to upgrade AIX to a newer TL, to remove and reinstall RSCT, or to delete the node ID and reconfigure. The process of completely deleting the node ID and then reconfiguring is as follows:

#stopsrc -g rsct_rm
#stopsrc -g rsct
#/usr/bin/odmdelete -o CuAt -q 'attribute=node_uuid'
#/usr/sbin/rsct/bin/mknodeid -f
#/usr/sbin/rsct/install/bin/recfgct
# After the last command, rsct and the related processes normally restart automatically. If they do not, refer to the earlier part of this section
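Before resetting anything, it is worth confirming that the IDs really collide. The sketch below looks for duplicates in a list of "hostname node_id" pairs. The list format is an assumption: it has to be collected by hand, for example by running /usr/sbin/rsct/bin/lsnodeid on each partition. duplicate_node_ids is a hypothetical helper name, and the sample values are made up.

```shell
# duplicate_node_ids: hypothetical helper; reads "hostname node_id" pairs on
# stdin and prints every node ID used by more than one host, followed by the
# hosts that share it.
duplicate_node_ids() {
    awk '
        { count[$2]++; hosts[$2] = hosts[$2] " " $1 }
        END { for (id in count) if (count[id] > 1) print id ":" hosts[id] }
    '
}

# Illustrative input; the host names and IDs are made up.
nodeid_sample='lpar1 6bc2a5ff0a4e7f31
lpar2 6bc2a5ff0a4e7f31
lpar3 90d1e2c47b3a5518'

printf '%s\n' "$nodeid_sample" | duplicate_node_ids
```

Every host listed after a duplicated ID, except one, is a candidate for the reset procedure above.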

The second problem is a firewall between the HMC and the partition. It is recommended to connect the HMC and the partitions over a separate management network. An independent VLAN that is not routed to other VLANs is enough for this connection and provides sufficient security.

Keywords: PowerVM

Added by juminoz on Wed, 19 Jan 2022 02:09:09 +0200