Handling common DRBD operation and maintenance faults

Introduction to DRBD

1. What is DRBD?
DRBD (Distributed Replicated Block Device) is a software-based, shared-nothing replicated storage solution that mirrors the contents of block devices between servers. It works as a mirrored block device: every data block written on one node is replicated to the other.

2. Difference between DRBD and RAID1
RAID1 also mirrors data across storage devices. The difference is that in RAID1 all devices are attached to a RAID controller on a single host, while DRBD mirrors the storage devices of different hosts over the network.

Basic operation

The installation process is not described here.

  1. View DRBD status

    drbd-overview
    

    The values depend on the machine. The fields have the following meaning:

    	0:test1/0                       DRBD minor number:resource name/volume number
    	Connected                       Connection state
    	Primary/Secondary               Local role/peer role
    	UpToDate/UpToDate               Local disk state/peer disk state
    	/data/test                      Mount point (shown only when the device is mounted)
    	xfs                             File system (shown only when the device is mounted)
    	4.1T                            Total capacity (shown only when the device is mounted)
    	485G                            Used capacity (shown only when the device is mounted)
    	3.7T                            Available capacity (shown only when the device is mounted)
    	12%                             Usage percentage (shown only when the device is mounted)
    

  2. View status from the kernel (/proc/drbd)

     root@demo1r01n02:~# cat /proc/drbd 
    version: 8.4.11-1 (api:1/proto:86-101)
    GIT-hash: 66145a308421e9c124ec391a7848ac20203bb03c build by root@c165, 2019-08-19 15:26:38
     0: cs:Connected ro:Primary/Secondary ds:UpToDate/Diskless A r-----
        ns:1607893187 nr:0 dw:865871622 dr:1018289021 al:19682555 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:204143444
     1: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate A r-----
        ns:0 nr:438102284 dw:2129640104 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:d oos:0
    

    cs: connection state
    ro: roles (local/peer)
    ds: disk states (local/peer), for example UpToDate or Inconsistent
    ns/nr: data sent to / received from the peer over the network, in KiB
    dw/dr: data written to / read from the local disk, in KiB
    oos: data currently out of sync, in KiB
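
The fields above can be pulled out of the kernel view with a small helper. This is a minimal sketch that assumes the DRBD 8.4 /proc/drbd layout shown in the sample. The function name drbd_summary and its output format are made up for illustration.

```shell
# Summarize /proc/drbd-style text (DRBD 8.4 layout): one line per device.
# Reads a file argument or stdin, e.g. on a live node: drbd_summary /proc/drbd
drbd_summary() {
    awk '
        # Device line, e.g. " 0: cs:Connected ro:Primary/Secondary ds:UpToDate/Diskless A r-----"
        /^ *[0-9]+: cs:/ {
            minor = $1; sub(/:$/, "", minor)
            cs = ""; ro = ""; ds = ""
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^cs:/) cs = substr($i, 4)
                if ($i ~ /^ro:/) ro = substr($i, 4)
                if ($i ~ /^ds:/) ds = substr($i, 4)
            }
            next
        }
        # Counter line that follows a device line: pick out oos (KiB out of sync)
        /oos:[0-9]+/ && minor != "" {
            for (i = 1; i <= NF; i++)
                if ($i ~ /^oos:/)
                    printf "minor=%s cs=%s ro=%s ds=%s oos_kib=%s\n", minor, cs, ro, ds, substr($i, 5)
            minor = ""
        }
    ' "$@"
}
```

A device whose oos_kib stays above zero while cs is Connected, as with minor 0 in the sample, deserves a closer look at the peer disk state.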

  3. View the connection state of a resource

         drbdadm cstate <resource>
    
    A resource is in one of the following connection states:
    StandAlone: no network configuration is available. The resource has not been connected yet, has been disconnected administratively (with drbdadm disconnect), or has dropped its connection because of failed authentication or split brain
    Disconnecting: temporary state during disconnection. The next state is StandAlone
    Unconnected: temporary state before a connection attempt. The next state may be WFConnection or WFReportParams
    Timeout: temporary state after communication with the peer node timed out. The next state is Unconnected
    BrokenPipe: temporary state after the connection to the peer node was lost. The next state is Unconnected
    NetworkFailure: temporary state after the connection to the peer node was lost. The next state is Unconnected
    ProtocolError: temporary state after the connection to the peer node was lost. The next state is Unconnected
    TearDown: temporary state while the peer node closes the connection. The next state is Unconnected
    WFConnection: this node is waiting for the peer node to become visible on the network
    WFReportParams: the TCP connection has been established, and this node is waiting for the first network packet from the peer node
    Connected: the DRBD connection has been established and data mirroring is active. This is the normal state
    StartingSyncS: a full synchronization started by the administrator is just beginning. The next state may be SyncSource or PausedSyncS
    StartingSyncT: a full synchronization started by the administrator is just beginning. The next state is WFSyncUUID
    WFBitMapS: a partial synchronization is just beginning. The next state may be SyncSource or PausedSyncS
    WFBitMapT: a partial synchronization is just beginning. The next state is WFSyncUUID
    WFSyncUUID: synchronization is about to start. The next state may be SyncTarget or PausedSyncT
    SyncSource: synchronization is in progress, with this node as the source
    SyncTarget: synchronization is in progress, with this node as the target
    PausedSyncS: this node is the source of an ongoing synchronization that is currently paused, either because another synchronization has to finish first or because it was paused manually with drbdadm pause-sync
    PausedSyncT: this node is the target of an ongoing synchronization that is currently paused, either because another synchronization has to finish first or because it was paused manually with drbdadm pause-sync
    VerifyS: online device verification is in progress, with this node as the verification source
    VerifyT: online device verification is in progress, with this node as the verification target
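
For monitoring scripts it can help to collapse these states into a few coarse groups. The sketch below does that with a case statement. The function name cstate_category and the group names (healthy, syncing, and so on) are assumptions chosen for illustration, not DRBD terminology.

```shell
# Map a DRBD 8.4 connection state name to a coarse, made-up category.
cstate_category() {
    case "$1" in
        Connected)
            echo healthy ;;
        StartingSyncS|StartingSyncT|WFBitMapS|WFBitMapT|WFSyncUUID|SyncSource|SyncTarget)
            echo syncing ;;
        PausedSyncS|PausedSyncT)
            echo paused ;;
        VerifyS|VerifyT)
            echo verifying ;;
        WFConnection|WFReportParams)
            echo waiting ;;
        StandAlone)
            echo disconnected ;;
        Disconnecting|Unconnected|Timeout|BrokenPipe|NetworkFailure|ProtocolError|TearDown)
            echo transient ;;
        *)
            echo unknown ;;
    esac
}
```

On a live node the argument would come from the command above, for example cstate_category "$(drbdadm cstate test1)" for a single-volume resource named test1.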
  4. View disk status
    drbdadm dstate <resource>
    The disks of the local and peer nodes may each be in one of the following states:
    Diskless: no local block device is assigned to DRBD. Either no device is available, it was detached manually with drbdadm, or it was detached automatically after an underlying I/O error
    Attaching: transient state while the metadata is being read
    Failed: transient state after the local block device reported an I/O error. The next state is Diskless
    Negotiating: transient state while a disk is attached to a DRBD device that is already connected
    Inconsistent: the data is inconsistent. This state occurs on both nodes right after a new resource is created (before the initial full synchronization), and on the synchronization target during synchronization
    Outdated: the resource data is consistent but out of date
    DUnknown: used for the peer disk when no network connection is available
    Consistent: the data of a node without a connection is consistent. When the connection is established, DRBD decides whether the data is UpToDate or Outdated
    UpToDate: the data is consistent and current. This is the normal state
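
A quick health check on the disk states might look like the following sketch. It takes the local/peer pair in the form that drbdadm dstate prints (for example UpToDate/UpToDate). The function name dstate_check and the message texts are hypothetical.

```shell
# Check a "local/peer" disk state pair; succeed only when both are UpToDate.
dstate_check() {
    local pair="$1"
    local local_ds="${pair%%/*}"   # text before the first slash
    local peer_ds="${pair#*/}"     # text after the first slash
    if [ "$local_ds" = "UpToDate" ] && [ "$peer_ds" = "UpToDate" ]; then
        echo "OK: both disks UpToDate"
        return 0
    fi
    echo "WARN: local=$local_ds peer=$peer_ds"
    return 1
}
```

Because the function returns a non-zero status for anything other than UpToDate/UpToDate, it can be used directly in an if statement or a cron check.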

Common DRBD fault handling

  1. Handling the Unconfigured state

    root@test01:~# drbd-overview 
     0:test01/0  WFConnection Primary/Unknown   UpToDate/DUnknown /data/test xfs 4.1T 322G 3.8T 8% 
     1:test02/0  Connected    Secondary/Primary UpToDate/UpToDate
    
    root@test02:~# drbd-overview 
     0:test02/0  Connected    Primary/Secondary UpToDate/UpToDate /data/test xfs 4.1T 485G 3.7T 12% 
     1:test01/0  Unconfigured .     .
    

    The output on test02 shows that the standby resource test01 is Unconfigured, which means the resource is down. Bring it up with "drbdadm up <resource>":

    root@test02:~# drbdadm up test01
    

    Then run drbd-overview again to check whether the resource has reconnected and is synchronizing.
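
When several resources are down at once, the commands can be generated from the drbd-overview output. The sketch below only prints the commands so that they can be reviewed first. It assumes the minor:resource/volume layout shown above, and the function name plan_up_unconfigured is made up.

```shell
# Print a "drbdadm up" command for every resource reported as Unconfigured.
# Reads drbd-overview output on stdin; nothing is executed.
plan_up_unconfigured() {
    awk '$2 == "Unconfigured" {
        res = $1
        sub(/^[0-9]+:/, "", res)    # drop the minor number, "1:test01/0" -> "test01/0"
        sub(/\/[0-9]+$/, "", res)   # drop the volume number, "test01/0" -> "test01"
        print "drbdadm up " res
    }'
}
```

Typical use would be drbd-overview | plan_up_unconfigured, followed by running the printed commands by hand.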

  2. The roles are correct, but the primary disk is in WFConnection and the standby disk is StandAlone

     root@r01n02:~#drbd-overview 
       0:s01n02/0  SyncSource Primary/Secondary UpToDate/Inconsistent C      r----- /data xfs 19T 209G 18T 2% 
      [==========>.........] sync'ed: 57.9% (8120/19272)Mfinish: 0:03:57 speed: 34,976 (25,716) K/sec
       1:s01n03/0  StandAlone Secondary/Unknown UpToDate/DUnknown     r-----
     
     root@r01n03:~#drbd-overview 
       0:s01n03/0  WFConnection Primary/Unknown   UpToDate/DUnknown     C r----- /data xfs 19T 107G 19T 1% 
       1:s01n02/0  SyncTarget   Secondary/Primary Inconsistent/UpToDate C r----- 
     	[===========>........] sync'ed: 62.3% (7272/19272)Mfinish: 0:03:31 speed: 35,224 (26,144) want: 71,720 K/sec
    

    The s01n03/0 resource is not synchronizing. The primary disk holds the Primary role and the standby disk holds the Secondary role, so the roles are correct and need no adjustment. In this case, discard the data on the standby disk (here on r01n02, where s01n03 is Secondary and StandAlone) and let it resynchronize from the primary. For example:

    root@r01n02:~# drbdadm connect --discard-my-data s01n03
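
The decision in this case can be written down as a small dry-run helper. This is a sketch under the same assumption as the text: the Primary side holds the data to keep. The function name plan_reconnect is hypothetical, and it prints commands rather than running them.

```shell
# Print the reconnect command for one resource, given its state on this node.
# Arguments: resource name, connection state, role pair (local/peer).
plan_reconnect() {
    local res="$1" cs="$2" ro="$3"
    case "$cs:$ro" in
        StandAlone:Secondary/*)
            # Standby side: its data is overwritten by the peer after reconnecting
            echo "drbdadm connect --discard-my-data $res" ;;
        StandAlone:Primary/*)
            # Primary side that dropped to StandAlone: a plain reconnect keeps its data
            echo "drbdadm connect $res" ;;
        WFConnection:*)
            echo "# $res is already waiting for its peer, nothing to do" ;;
        *)
            echo "# $res: state $cs ($ro) is not covered by this sketch" ;;
    esac
}
```

For the output above, plan_reconnect s01n03 StandAlone Secondary/Unknown on r01n02 prints the discard command, while the WFConnection side on r01n03 needs no action and reconnects on its own.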
    
  3. Handling incorrect primary and standby roles
    The primary and standby disk roles can be wrong in the following ways. Unmount the disk before changing its role.
    3.1 Both the primary and standby disks are Primary/Unknown
    Fix: demote the standby disk

    root@standby-node:~# drbdadm secondary <standby-resource>
    

    3.2 Both the primary and standby disks are Secondary/Unknown
    Fix: promote the primary disk

    root@primary-node:~# drbdadm primary --force <primary-resource>
    

    3.3 The primary disk is Secondary/Unknown and the standby disk is Primary/Unknown
    In this case, first make sure that the data on the standby disk has been migrated. Then promote the primary disk to Primary, and afterwards demote the standby disk to Secondary

     root@primary-node:~# drbdadm primary --force <primary-resource>
    
     root@standby-node:~# drbdadm secondary <standby-resource>
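
The three cases reduce to comparing the role a disk should have with the role it currently reports. The following dry-run sketch encodes that comparison. The function name plan_role_fix and its arguments are made up for illustration, and it prints the command instead of running it.

```shell
# Print the command that moves a disk to its intended role.
# Arguments: resource name, intended role, current local role.
plan_role_fix() {
    local res="$1" intended="$2" current="$3"
    if [ "$intended" = "$current" ]; then
        echo "# $res: already $current, nothing to do"
    elif [ "$intended" = "Primary" ]; then
        # Cases 3.2 and 3.3, run on the primary node
        echo "drbdadm primary --force $res"
    else
        # Cases 3.1 and 3.3, run on the standby node after unmounting the device
        echo "drbdadm secondary $res"
    fi
}
```

Because --force overrides DRBD's safety checks, the printed promote command should only be run once it is clear which node holds the data to keep.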
    
  4. Changing the role of a DRBD disk fails with "Device is held open by someone" or a busy error

      root@test01:~# drbdadm secondary test02
     1: State change failed: (-12) Device is held open by someone
     Command 'drbdsetup-84 secondary 1' terminated with exit code 11
    

    Check whether any process is still using the device:

    lsof /dev/drbd1
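
A mounted file system keeps the device open even when no process has a file open on it, so it is worth checking the mount table as well. The sketch below looks up a device in /proc/mounts-style text. The function name mounted_at is hypothetical.

```shell
# Print the mount point(s) of a block device, given /proc/mounts-style text on stdin.
# Exit status is 0 if the device is mounted, 1 otherwise.
mounted_at() {
    awk -v dev="$1" '
        $1 == dev { print $2; found = 1 }
        END { exit !found }
    '
}
```

For example, mounted_at /dev/drbd1 < /proc/mounts prints the mount point if there is one. After stopping the processes that use it and unmounting it, drbdadm secondary can be retried. fuser -vm /dev/drbd1 is another way to list the processes using the mounted file system.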

  5. Handling the Diskless state

     root@s01n02:~#drbd-overview 
       0:s01n02/0  Connected Primary/Secondary UpToDate/Diskless C r----- data/ xfs 19T 205G 18T 2% 
       1:s01n03/0  Connected Secondary/Primary UpToDate/Diskless C r----- 
    

    The Diskless state is usually caused by a disk failure, a RAID controller failure, or an operating system kernel panic. Operations on the affected node then hang, so it cannot be restarted remotely. The hardware has to be restarted manually on site.

Speeding up DRBD synchronization

If the resynchronization rate is being limited, it can be raised manually:

    drbdadm disk-options --c-plan-ahead=0 --resync-rate=250M <resource_id>
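
In this command, --c-plan-ahead=0 switches off the dynamic resync-rate controller so that the fixed --resync-rate applies. A rough lower bound for the remaining time follows from the oos counter in /proc/drbd (KiB still out of sync) divided by the rate. The sketch below does that arithmetic. It assumes the rate is given in MiB/s, and the function name resync_eta_seconds is made up.

```shell
# Estimate the resync time in whole seconds.
# Arguments: out-of-sync amount in KiB (oos), fixed resync rate in MiB/s.
resync_eta_seconds() {
    local oos_kib="$1" rate_mib="$2"
    echo $(( $oos_kib / ($rate_mib * 1024) ))
}
```

With oos:204143444 and a rate of 250, resync_eta_seconds 204143444 250 prints 797, a little over 13 minutes. The real rate is capped by disk and network throughput, so the actual time may be longer. Running drbdadm adjust <resource_id> afterwards restores the settings from the configuration file.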


Added by IWS on Thu, 06 Jan 2022 12:53:43 +0200