Introduction to DRBD
1. What is DRBD?
DRBD (Distributed Replicated Block Device) is a software-based, shared-nothing storage replication solution that mirrors the contents of block devices between servers. The mirroring happens at the block level, so every node holds the same data block for block.
2. Difference between DRBD and RAID1
RAID 1 also mirrors data across storage devices. The difference is that in RAID 1 every storage device is attached to a RAID controller inside a single host, while DRBD mirrors storage devices that belong to different hosts, over the network.
Basic operation
The installation process is not described here.
1. How to view DRBD status
drbd-overview
The actual values depend on the machine. The fields are defined as follows:
0:test1/0            DRBD disk ID
Connected            Connection state
Primary/Secondary    Local disk role / peer disk role
UpToDate/UpToDate    Local disk state / peer disk state
/data/test           Mount point (shown only when the disk is mounted)
xfs                  File system (shown only when the disk is mounted)
4.1T                 Total capacity (shown only when the disk is mounted)
485G                 Used capacity (shown only when the disk is mounted)
3.7T                 Remaining capacity (shown only when the disk is mounted)
12%                  Utilization (shown only when the disk is mounted)
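The field layout above can be turned into a small parser. This is a minimal sketch, assuming the whitespace-separated layout shown in this article; the function name `parse_overview_line` and the dictionary keys are made up for illustration and are not part of DRBD.

```python
def _split_pair(value):
    """Split a 'Local/Peer' field; return (None, None) if it is not a pair."""
    if "/" not in value:
        return (None, None)
    local, peer = value.split("/", 1)
    return (local, peer)


def parse_overview_line(line):
    """Parse one line of drbd-overview output into a dictionary.

    Assumes the layout described above:
    device, connection state, roles, disk states, then optional mount details.
    """
    fields = line.split()
    if len(fields) < 2:
        raise ValueError("not a drbd-overview line: %r" % line)
    # Pad short lines (for example an Unconfigured resource) to four fields.
    fields = fields + [""] * (4 - len(fields))
    local_role, peer_role = _split_pair(fields[2])
    local_disk, peer_disk = _split_pair(fields[3])
    # The mount point, if any, is the first later field that starts with "/".
    mount_point = next((f for f in fields[4:] if f.startswith("/")), None)
    return {
        "device": fields[0],
        "cstate": fields[1],
        "local_role": local_role,
        "peer_role": peer_role,
        "local_disk": local_disk,
        "peer_disk": peer_disk,
        "mount_point": mount_point,
    }
```

A line such as `1:test01/0 Unconfigured . .` has no role or disk pair, so those keys come back as `None` instead of raising an error.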
2. Kernel view
root@demo1r01n02:~# cat /proc/drbd
version: 8.4.11-1 (api:1/proto:86-101)
GIT-hash: 66145a308421e9c124ec391a7848ac20203bb03c build by root@c165, 2019-08-19 15:26:38
 0: cs:Connected ro:Primary/Secondary ds:UpToDate/Diskless A r-----
    ns:1607893187 nr:0 dw:865871622 dr:1018289021 al:19682555 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:204143444
 1: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate A r-----
    ns:0 nr:438102284 dw:2129640104 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:d oos:0
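For scripting, the kernel view can be read into a dictionary per device. The following is a rough sketch, assuming the DRBD 8.4 layout of /proc/drbd in which each device has a status line followed by a counter line starting with `ns:`. The names `parse_proc_drbd`, `DEVICE_RE` and `COUNTER_RE` are illustrative only.

```python
import re

# Status line, e.g. " 0: cs:Connected ro:Primary/Secondary ds:UpToDate/Diskless A r-----"
DEVICE_RE = re.compile(
    r"^\s*(\d+):\s+cs:(\S+)\s+ro:([^/\s]+)/(\S+)\s+ds:([^/\s]+)/(\S+)"
)
# Numeric counters, e.g. "ns:1607893187"; non-numeric ones such as "wo:f" are skipped.
COUNTER_RE = re.compile(r"(\w+):(\d+)")


def parse_proc_drbd(text):
    """Return {minor: info} for every configured device found in the text."""
    devices = {}
    current = None
    for line in text.splitlines():
        match = DEVICE_RE.match(line)
        if match:
            current = {
                "cstate": match.group(2),
                "local_role": match.group(3),
                "peer_role": match.group(4),
                "local_disk": match.group(5),
                "peer_disk": match.group(6),
                "counters": {},
            }
            devices[int(match.group(1))] = current
        elif current is not None and line.strip().startswith("ns:"):
            # Counter line that belongs to the most recent device.
            for key, value in COUNTER_RE.findall(line):
                current["counters"][key] = int(value)
    return devices
```

On a live node the input would come from reading /proc/drbd. Devices whose status line lacks the `ro:` and `ds:` fields are not matched by this sketch and are simply left out of the result.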
cs: connection state
ro: roles (local/peer)
ds: disk states (local/peer), for example Inconsistent/UpToDate
ns/nr: amount of data sent to / received from the peer over the network
dw/dr: amount of data written to / read from the local disk
3. View resource connection status
drbdadm cstate <resource_name>
A resource may be in one of the following connection states:
StandAlone: no network configuration is available. The resource has not been connected yet, has been disconnected administratively (with drbdadm disconnect), or has dropped its connection because of failed authentication or split brain
Disconnecting: temporary state during disconnection. The next state is StandAlone
Unconnected: temporary state before a connection attempt. The next state may be WFConnection or WFReportParams
Timeout: temporary state after the connection to the peer node timed out. The next state is Unconnected
BrokenPipe: temporary state after the connection to the peer node was lost. The next state is Unconnected
NetworkFailure: temporary state after the connection to the peer node was lost. The next state is Unconnected
ProtocolError: temporary state after the connection to the peer node was lost. The next state is Unconnected
TearDown: temporary state while the peer node is closing the connection. The next state is Unconnected
WFConnection: this node is waiting for a network connection with the peer node
WFReportParams: the TCP connection has been established. This node is waiting for the first network packet from the peer node
Connected: the DRBD connection has been established and data mirroring is active. This is the normal state
StartingSyncS: a full synchronization initiated by the administrator is just starting. The next state may be SyncSource or PausedSyncS
StartingSyncT: a full synchronization initiated by the administrator is just starting. The next state is WFSyncUUID
WFBitMapS: a partial synchronization is just starting. The next state may be SyncSource or PausedSyncS
WFBitMapT: a partial synchronization is just starting. The next state is WFSyncUUID
WFSyncUUID: synchronization is about to start. The next state may be SyncTarget or PausedSyncT
SyncSource: synchronization is in progress, with this node as the synchronization source
SyncTarget: synchronization is in progress, with this node as the synchronization target
PausedSyncS: the local node is the source of an ongoing synchronization, but the synchronization is paused. This may be because another synchronization has to finish first, or because it was paused manually with drbdadm pause-sync
PausedSyncT: the local node is the target of an ongoing synchronization, but the synchronization is paused. This may be because another synchronization has to finish first, or because it was paused manually with drbdadm pause-sync
VerifyS: online device verification is in progress, with the local node as the verification source
VerifyT: online device verification is in progress, with the local node as the verification target
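When monitoring many resources it helps to collapse the states listed above into a few groups. The sketch below shows one possible grouping; the group names ("transient", "waiting", "syncing" and so on) are this example's own labels, not DRBD terminology.

```python
# States the list above describes as temporary.
TRANSIENT_STATES = {
    "Disconnecting", "Unconnected", "Timeout", "BrokenPipe",
    "NetworkFailure", "ProtocolError", "TearDown",
}
# States in which the node waits for the peer.
WAITING_STATES = {"WFConnection", "WFReportParams"}
# States that belong to a full or partial synchronization.
SYNC_STATES = {
    "StartingSyncS", "StartingSyncT", "WFBitMapS", "WFBitMapT", "WFSyncUUID",
    "SyncSource", "SyncTarget", "PausedSyncS", "PausedSyncT",
}
# States that belong to online verification.
VERIFY_STATES = {"VerifyS", "VerifyT"}


def classify_cstate(cstate):
    """Map a connection state reported by drbdadm cstate to a coarse group."""
    if cstate == "Connected":
        return "connected"
    if cstate == "StandAlone":
        return "standalone"
    if cstate in TRANSIENT_STATES:
        return "transient"
    if cstate in WAITING_STATES:
        return "waiting"
    if cstate in SYNC_STATES:
        return "syncing"
    if cstate in VERIFY_STATES:
        return "verifying"
    return "unknown"
```

A monitoring script could, for example, alert right away on "standalone" and only alert on "transient" or "waiting" when the state persists across several checks.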
4. View hard disk status
drbdadm dstate <resource_name>
The disks of the local and peer nodes may each be in one of the following states:
Diskless: no local block device is assigned to DRBD. Either no device is available, the device was detached manually with drbdadm, or it was detached automatically after a lower-level I/O error
Attaching: transient state while the metadata is being read
Failed: transient state after the local block device reported an I/O error. The next state is Diskless
Negotiating: transient state while a disk is being attached to a DRBD device that is already connected
Inconsistent: the data is inconsistent. Both nodes are in this state right after a new resource is created (before the initial full synchronization). During synchronization, one node (the synchronization target) is also in this state
Outdated: the resource data is consistent but out of date
DUnknown: state shown for the peer disk when no network connection is available
Consistent: consistent data on a node that has no connection. Once a connection is established, it is decided whether the data is UpToDate or Outdated
UpToDate: the data is consistent and up to date. This is the normal state
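Combining the connection state with the two disk states gives a simple check. The sketch below treats anything other than Connected with UpToDate/UpToDate as worth a look, which is an assumption of this example: a running resynchronization is also reported, even though it is not a fault. The function name `find_problems` is illustrative.

```python
def find_problems(cstate, local_disk, peer_disk):
    """Return a list of human-readable findings; an empty list means normal."""
    problems = []
    if cstate != "Connected":
        problems.append("connection state is %s" % cstate)
    for side, state in (("local", local_disk), ("peer", peer_disk)):
        if state != "UpToDate":
            problems.append("%s disk is %s" % (side, state))
    return problems


if __name__ == "__main__":
    # Values taken from the first resource in the /proc/drbd sample above.
    for finding in find_problems("Connected", "UpToDate", "Diskless"):
        print(finding)
```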
Common DRBD fault handling
1. Handling the Unconfigured state
root@test01:~# drbd-overview
 0:test01/0  WFConnection Primary/Unknown   UpToDate/DUnknown /data/test xfs 4.1T 322G 3.8T 8%
 1:test02/0  Connected    Secondary/Primary UpToDate/UpToDate
root@test02:~# drbd-overview
 0:test02/0  Connected    Primary/Secondary UpToDate/UpToDate /data/test xfs 4.1T 485G 3.7T 12%
 1:test01/0  Unconfigured . .
On test02, the standby disk test01 is Unconfigured, which means the resource is down. Bring it up with "drbdadm up <resource_name>":
root@test02:~# drbdadm up test01
Then run drbd-overview again to check that the resource is synchronizing.
2. The primary and standby roles are correct, but the primary disk is in WFConnection and the standby disk is StandAlone
root@r01n02:~# drbd-overview
 0:s01n02/0  SyncSource Primary/Secondary UpToDate/Inconsistent C r----- /data xfs 19T 209G 18T 2%
    [==========>.........] sync'ed: 57.9% (8120/19272)M
    finish: 0:03:57 speed: 34,976 (25,716) K/sec
 1:s01n03/0  StandAlone Secondary/Unknown UpToDate/DUnknown r-----
root@r01n03:~# drbd-overview
 0:s01n03/0  WFConnection Primary/Unknown UpToDate/DUnknown C r----- /data xfs 19T 107G 19T 1%
 1:s01n02/0  SyncTarget Secondary/Primary Inconsistent/UpToDate C r-----
    [===========>........] sync'ed: 62.3% (7272/19272)M
    finish: 0:03:31 speed: 35,224 (26,144) want: 71,720 K/sec
The output shows that the s01n03/0 disk pair is not synchronized: the primary copy is in WFConnection and the standby copy is StandAlone. The primary disk has the Primary role and the standby disk has the Secondary role, so the roles are correct and do not need to be changed. In this case, discard the data on the standby disk and let it resynchronize from the primary. Run the command on the standby node, for example:
root@s1r01n02:~# drbdadm connect --discard-my-data s1r01n03
3. Incorrect primary and standby roles
The primary and standby disk roles can be wrong in the following ways. Unmount the disk before changing its role.
3.1 Both the primary and the standby disk are Primary/Unknown
Fix: demote the standby disk to Secondary.
root@<standby_node>:~# drbdadm secondary <standby_disk_id>
3.2 Both the primary and the standby disk are Secondary/Unknown
Fix: promote the primary disk to Primary.
root@<primary_node>:~# drbdadm primary --force <primary_disk_id>
3.3 The primary disk is Secondary/Unknown and the standby disk is Primary/Unknown
In this case, first make sure that the data on the standby disk has been migrated. Then set the primary disk to Primary, and after that set the standby disk to Secondary.
root@<primary_node>:~# drbdadm primary --force <primary_disk_id>
root@<standby_node>:~# drbdadm secondary <standby_disk_id>
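Cases 3.1 to 3.3 above can be written down as a small lookup that returns which command to run on which node. This is only a sketch: the function name `role_fix_commands` is made up, and it assumes that the resource carries the same name on both nodes (as test01 does in the drbd-overview output earlier). It only builds command strings and does not execute anything.

```python
def local_role(view):
    """Return the local role from a 'Local/Peer' string such as 'Primary/Unknown'."""
    return view.split("/", 1)[0]


def role_fix_commands(primary_node_view, standby_node_view, resource):
    """Return a list of (node, command) pairs in the order they should be run.

    primary_node_view: roles as seen on the node that should be primary
    standby_node_view: roles as seen on the node that should be standby
    """
    promote = ("primary", "drbdadm primary --force %s" % resource)
    demote = ("standby", "drbdadm secondary %s" % resource)
    roles = (local_role(primary_node_view), local_role(standby_node_view))
    if roles == ("Primary", "Secondary"):
        return []                      # roles are already correct
    if roles == ("Primary", "Primary"):
        return [demote]                # case 3.1
    if roles == ("Secondary", "Secondary"):
        return [promote]               # case 3.2
    if roles == ("Secondary", "Primary"):
        # Case 3.3: migrate the standby data first, then swap the roles.
        return [promote, demote]
    raise ValueError("unexpected roles: %r" % (roles,))
```

As noted above, the disk has to be unmounted before its role is changed, and in case 3.3 the standby data has to be migrated first; the helper does not check either condition.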
4. Changing the DRBD disk role fails with "Device is held open by someone" or a busy error
root@test01:~# drbdadm secondary test02
1: State change failed: (-12) Device is held open by someone
Command 'drbdsetup-84 secondary 1' terminated with exit code 11
Check whether any process is holding the device open:
lsof /dev/drbd1
5. Handling the Diskless state
root@s01n02:~# drbd-overview
 0:s01n02/0  Connected Primary/Secondary UpToDate/Diskless C r----- /data xfs 19T 205G 18T 2%
 1:s01n03/0  Connected Secondary/Primary UpToDate/Diskless C r-----
The Diskless state is usually caused by a disk failure, a RAID controller failure, or an operating system kernel panic. Any operation hangs in this situation, so the machine cannot be restarted remotely. The hardware has to be restarted manually on site.
Speeding up DRBD synchronization
If the resynchronization rate is limited, it can be raised manually:
drbdadm disk-options --c-plan-ahead=0 --resync-rate=250M <resource_id>
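To get a feel for what a given rate means, the remaining resynchronization time can be estimated from the oos counter in /proc/drbd. The sketch below assumes that oos is reported in KiB, that a bare --resync-rate number means KiB/s, and that the K/M/G suffixes are binary multiples; both function names are illustrative. The result is a best case, because the configured rate is an upper limit and the real throughput also depends on the network and the disks.

```python
# Multipliers to KiB/s (assumption: binary suffixes, bare number = KiB/s).
RATE_UNITS = {"K": 1, "M": 1024, "G": 1024 * 1024}


def parse_rate_kib(rate):
    """Convert a rate such as '250M' to KiB per second."""
    rate = rate.strip()
    suffix = rate[-1].upper()
    if suffix in RATE_UNITS:
        return int(rate[:-1]) * RATE_UNITS[suffix]
    return int(rate)


def estimate_resync_seconds(oos_kib, rate):
    """Best-case time in seconds to resynchronize oos_kib KiB at the given rate."""
    return oos_kib / parse_rate_kib(rate)


if __name__ == "__main__":
    # oos value of device 0 in the /proc/drbd sample above, at 250M.
    seconds = estimate_resync_seconds(204143444, "250M")
    print("about %.0f minutes" % (seconds / 60))
```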