1. PG introduction
Following the previous post, "Ceph introduction and architecture overview", this article walks through the various states of a PG in Ceph in detail. The PG is one of the most complex and hardest concepts to grasp. Its complexity comes from the following:
- At the architecture level, PG sits in the middle of the RADOS layer:
a. upward, it receives and processes requests from clients;
b. downward, it translates those requests into transactions that the local object store can understand.
- It is the basic unit of the storage pool: many features of the storage pool are implemented at PG granularity.
- Because of the replication strategy across failure domains, a PG usually has to write data across several nodes, so synchronizing data between nodes and repairing data during recovery also rely on the PG (a quick way to see the object-to-PG-to-OSD mapping is shown below).
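To make the mapping concrete, the cluster can be asked which PG an object hashes to and which OSDs currently serve that PG. A minimal sketch, assuming a pool named test_pool and an object named myobject (both hypothetical names):

```bash
# Which PG does this object map to, and which OSDs are in its up/acting set?
ceph osd map test_pool myobject

# How many PGs does the pool have, and how many replicas per PG?
ceph osd pool get test_pool pg_num
ceph osd pool get test_pool size
```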
2. PG status table
The normal PG state is 100% active+clean, which means every PG is accessible and all replicas of every PG are available.
If that is not the case, Ceph reports a warning or error state for the PG. PG state table:
State | Description |
---|---|
Activating | Peering has completed. PG is waiting for all PG instances to synchronize and solidify peering results (Info, Log, etc.) |
Active | Active state. PG can normally handle read and write requests from clients |
Backfilling | Backfilling in the background. Backfill is a special case of recovery: after Peering completes, if some PG instances in the Up Set cannot be synchronized incrementally from the current authoritative log (for example, because the OSDs hosting them have been offline too long, or because a newly added OSD causes the PG instance to be migrated wholesale), they are synchronized in full by copying all objects from the current Primary |
Backfill-toofull | An OSD hosting a PG instance that needs to be backfilled does not have enough free space, so the backfill is currently suspended |
Backfill-wait | Waiting for Backfill resource reservation |
Clean | Clean. There are currently no objects to repair in the PG, the Acting Set and the Up Set have the same contents, and their size equals the number of replicas of the storage pool |
Creating | PG is being created |
Deep | PG is performing, or is about to perform, an object-consistency deep scrub |
Degraded | Degraded state. After Peering completes, the PG finds that some PG instance contains objects that are inconsistent (need to be synchronized or repaired), or the current Acting Set is smaller than the number of storage pool replicas |
Down | During Peering, the PG detected an Interval that cannot be skipped (for example, during that Interval the PG completed Peering and switched to the Active state, so it may have served client reads and writes normally), and the OSDs that remain online are not sufficient to complete the data repair |
Incomplete | Peering could not complete normally because a. an authoritative log could not be selected, or b. the Acting Set chosen by choose_acting is not sufficient to complete the data repair |
Inconsistent | Inconsistent state. A scrub or deep scrub detected inconsistencies among the replicas of one or more objects in the PG, such as differing object sizes, or a replica of an object still missing after Recovery |
Peered | Peering has completed, but the current ActingSet size of PG is less than the minimum number of replicas (min_size) specified by the storage pool |
Peering | Peering in progress. The PG's replicas are carrying out the peering (synchronization) process |
Recovering | Restoring state. The cluster is performing migration or synchronizing objects and their replicas |
Recovering-wait | Waiting for Recovery resource reservation |
Remapped | Remapped state. Whenever the Acting Set of a PG changes, data is migrated from the old Acting Set to the new one. During the migration, client requests are still served by the Primary OSD of the old Acting Set; once the migration completes, the Primary OSD of the new Acting Set takes over |
Repair | The PG is being repaired. If inconsistent objects that can be repaired are found during a scrub, the PG repairs them automatically |
Scrubbing | PG is or is about to perform object consistency scanning |
Unactive | Inactive state. The PG cannot process read or write requests |
Unclean | Unclean state. The PG cannot recover from a previous failure |
Stale | Not refreshed. The PG's state has not been updated by any OSD, which indicates either that all OSDs storing this PG may be down, or that the Mon has not received statistics from the Primary (for example, because of network jitter) |
Undersized | The current Acting Set of the PG is smaller than the number of storage pool replicas |
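The states in the table can be queried directly from the command line. A short sketch (the exact set of subcommands varies a little between Ceph releases):

```bash
# Overall PG state counters
ceph pg stat

# PGs stuck in a problematic state longer than the configured threshold
ceph pg dump_stuck inactive
ceph pg dump_stuck unclean
ceph pg dump_stuck stale

# List PGs currently carrying a given state flag (Luminous and later)
ceph pg ls degraded
ceph pg ls incomplete
```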
3. Detailed state explanation and fault simulation reproduction
3.1 Degraded
3.1.1 description
Degraded: as described above, each PG has three replicas stored on different OSDs. In the non-failure case the PG is in the active+clean state. If the OSD holding one replica, say osd.4, goes down, the PG enters the degraded state.
3.1.2 fault simulation
a. Stop osd.1

$ systemctl stop ceph-osd@1

b. Check the PG state

$ bin/ceph pg stat
20 pgs: 20 active+undersized+degraded; 14512 kB data, 302 GB used, 6388 GB / 6691 GB avail; 12/36 objects degraded (33.333%)

c. Check the cluster health status

$ bin/ceph health detail
HEALTH_WARN 1 osds down; Degraded data redundancy: 12/36 objects degraded (33.333%), 20 pgs unclean, 20 pgs degraded; application not enabled on 1 pool(s)
OSD_DOWN 1 osds down
    osd.1 (root=default,host=ceph-xx-cc00) is down
PG_DEGRADED Degraded data redundancy: 12/36 objects degraded (33.333%), 20 pgs unclean, 20 pgs degraded
    pg 1.0 is active+undersized+degraded, acting [0,2]
    pg 1.1 is active+undersized+degraded, acting [2,0]

d. Client IO operations

# Write an object
$ bin/rados -p test_pool put myobject ceph.conf
# Read the object back into a file
$ bin/rados -p test_pool get myobject ceph.conf.old
# Check the files
$ ll ceph.conf*
-rw-r--r-- 1 root root 6211 Jun 25 14:01 ceph.conf
-rw-r--r-- 1 root root 6211 Jul 3 19:57 ceph.conf.old
Fault summary:
To simulate the failure (size = 3, min_size = 2) we manually stopped osd.1 and checked the PG state. The PG is now active+undersized+degraded: when an OSD hosting a PG goes down, the PG enters the undersized+degraded state. The acting set [0,2] shown after it means that the two surviving replicas are on osd.0 and osd.2, and the client can still read and write normally at this point.
3.1.3 summary
Degraded means that, after certain failures such as an OSD going down, Ceph marks all PGs on that OSD as Degraded.
A degraded cluster can still read and write data normally; a degraded PG is a minor issue, not a serious problem.
Undersized means the PG currently has only 2 surviving replicas, fewer than the configured 3. The flag indicates that the number of stored replicas is insufficient; this is also not a serious problem.
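If the OSD is being taken down deliberately (planned maintenance rather than a real failure), it is common to tell the cluster not to rebalance while it is out, so the affected PGs stay degraded/undersized briefly instead of triggering recovery and backfill. A hedged sketch of that workflow:

```bash
# Prevent OSDs from being marked out (and data from being rebalanced) during planned maintenance
ceph osd set noout

systemctl stop ceph-osd@1      # do the maintenance work
systemctl start ceph-osd@1

# Re-enable normal behaviour once the OSD is back and PGs return to active+clean
ceph osd unset noout
```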
3.2 Peered
3.2.1 description
Peering has completed, but the current Acting Set size of PG is less than the minimum number of replicas (min_size) specified by the storage pool.
3.2.2 fault simulation
a. Stop two replicas, osd.1 and osd.0

$ systemctl stop ceph-osd@1
$ systemctl stop ceph-osd@0

b. Check the cluster health status

$ bin/ceph health detail
HEALTH_WARN 1 osds down; Reduced data availability: 4 pgs inactive; Degraded data redundancy: 26/39 objects degraded (66.667%), 20 pgs unclean, 20 pgs degraded; application not enabled on 1 pool(s)
OSD_DOWN 1 osds down
    osd.0 (root=default,host=ceph-xx-cc00) is down
PG_AVAILABILITY Reduced data availability: 4 pgs inactive
    pg 1.6 is stuck inactive for 516.741081, current state undersized+degraded+peered, last acting [2]
    pg 1.10 is stuck inactive for 516.737888, current state undersized+degraded+peered, last acting [2]
    pg 1.11 is stuck inactive for 516.737408, current state undersized+degraded+peered, last acting [2]
    pg 1.12 is stuck inactive for 516.736955, current state undersized+degraded+peered, last acting [2]
PG_DEGRADED Degraded data redundancy: 26/39 objects degraded (66.667%), 20 pgs unclean, 20 pgs degraded
    pg 1.0 is undersized+degraded+peered, acting [2]
    pg 1.1 is undersized+degraded+peered, acting [2]

c. Client IO operation (IO hangs)

# Reading the object to a file hangs
$ bin/rados -p test_pool get myobject ceph.conf.old
Fault summary:
The PG now survives only on osd.2, and the PG has gained another state: peered. Literally the word means to look closely; here it can be understood as negotiating, searching.
At this point, reading the object hangs: the command simply blocks. Why can the content not be read? Because min_size = 2 is set: when fewer than 2 replicas survive (here only 1), external IO requests are not served.
d. Setting min_size=1 resolves the IO hang

# Set min_size = 1
$ bin/ceph osd pool set test_pool min_size 1
set pool 1 min_size to 1

e. Check the cluster health status

$ bin/ceph health detail
HEALTH_WARN 1 osds down; Degraded data redundancy: 26/39 objects degraded (66.667%), 20 pgs unclean, 20 pgs degraded, 20 pgs undersized; application not enabled on 1 pool(s)
OSD_DOWN 1 osds down
    osd.0 (root=default,host=ceph-xx-cc00) is down
PG_DEGRADED Degraded data redundancy: 26/39 objects degraded (66.667%), 20 pgs unclean, 20 pgs degraded, 20 pgs undersized
    pg 1.0 is stuck undersized for 65.958983, current state active+undersized+degraded, last acting [2]
    pg 1.1 is stuck undersized for 65.960092, current state active+undersized+degraded, last acting [2]
    pg 1.2 is stuck undersized for 65.960974, current state active+undersized+degraded, last acting [2]

f. Client IO operation

# Read the object into a file
$ ll -lh ceph.conf*
-rw-r--r-- 1 root root 6.1K Jun 25 14:01 ceph.conf
-rw-r--r-- 1 root root 6.1K Jul 3 20:11 ceph.conf.old
-rw-r--r-- 1 root root 6.1K Jul 3 20:11 ceph.conf.old.1
Fault summary:
The Peered state is now gone from the PG, and client file IO can read and write normally.
With min_size = 1, as long as one replica in the cluster is alive, the PG can serve external IO requests.
3.2.3 summary
The Peered state can be understood here as waiting for the other replicas to come back online.
With min_size = 2, at least two replicas must be alive before the Peered state can be cleared.
A PG in the Peered state cannot serve external requests, and IO hangs.
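Before lowering min_size as in the simulation above, it is worth checking what the pool is currently configured with; lowering it trades availability against the risk of acknowledging writes on a single surviving copy. A small sketch, assuming the pool is named test_pool:

```bash
# Current replica counts for the pool
ceph osd pool get test_pool size
ceph osd pool get test_pool min_size

# Temporarily allow IO with a single surviving replica, and restore it afterwards
ceph osd pool set test_pool min_size 1
ceph osd pool set test_pool min_size 2
```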
3.3 Remapped
3.3.1 description
After Peering is completed, if the current Acting Set of PG is inconsistent with the Up Set, the Remapped state will appear.
3.3.2 fault simulation
a. Stop osd.x

$ systemctl stop ceph-osd@x

b. Start osd.x again after an interval of 5 minutes

$ systemctl start ceph-osd@x

c. Check the PG state

$ ceph pg stat
1416 pgs: 6 active+clean+remapped, 1288 active+clean, 3 stale+active+clean, 119 active+undersized+degraded; 74940 MB data, 250 GB used, 185 TB / 185 TB avail; 1292/48152 objects degraded (2.683%)
$ ceph pg dump | grep remapped
dumped all
13.cd 0 0 0 0 0 0 2 2 active+clean+remapped 2018-07-03 20:26:14.478665 9453'2 20716:11343 [10,23] 10 [10,23,14] 10 9453'2 2018-07-03 20:26:14.478597 9453'2 2018-07-01 13:11:43.262605
3.1a 44 0 0 0 0 373293056 1500 1500 active+clean+remapped 2018-07-03 20:25:47.885366 20272'79063 20716:109173 [9,23] 9 [9,23,12] 9 20272'79063 2018-07-03 03:14:23.960537 20272'79063 2018-07-03 03:14:23.960537
5.f 0 0 0 0 0 0 0 0 active+clean+remapped 2018-07-03 20:25:47.888430 0'0 20716:15530 [23,8] 23 [23,8,22] 23 0'0 2018-07-03 06:44:05.232179 0'0 2018-06-30 22:27:16.778466
3.4a 45 0 0 0 0 390070272 1500 1500 active+clean+remapped 2018-07-03 20:25:47.886669 20272'78385 20716:108086 [7,23] 7 [7,23,17] 7 20272'78385 2018-07-03 13:49:08.190133 7998'78363 2018-06-28 10:30:38.201993
13.102 0 0 0 0 0 0 5 5 active+clean+remapped 2018-07-03 20:25:47.884983 9453'5 20716:11334 [1,23] 1 [1,23,14] 1 9453'5 2018-07-02 21:10:42.028288 9453'5 2018-07-02 21:10:42.028288
13.11d 1 0 0 0 0 4194304 1539 1539 active+clean+remapped 2018-07-03 20:25:47.886535 20343'22439 20716:86294 [4,23] 4 [4,23,15] 4 20343'22439 2018-07-03 17:21:18.567771 20343'22439 2018-07-03 17:21:18.567771

# Query again after 2 minutes
$ ceph pg stat
1416 pgs: 2 active+undersized+degraded+remapped+backfilling, 10 active+undersized+degraded+remapped+backfill_wait, 1401 active+clean, 3 stale+active+clean; 74940 MB data, 247 GB used, 179 TB / 179 TB avail; 260/48152 objects degraded (0.540%); 49665 kB/s, 9 objects/s recovering
$ ceph pg dump | grep remapped
dumped all
13.1e8 2 0 2 0 0 8388608 1527 1527 active+undersized+degraded+remapped+backfill_wait 2018-07-03 20:30:13.999637 9493'38727 20754:165663 [18,33,10] 18 [18,10] 18 9493'38727 2018-07-03 19:53:43.462188 0'0 2018-06-28 20:09:36.303126

d. Client IO operation

# rados reads and writes normally
$ rados -p test_pool put myobject /tmp/test.log
3.3.3 summary
When an OSD goes down or the cluster is expanded, CRUSH recalculates the OSDs assigned to the PG and remaps the PG to other OSDs.
While a PG is Remapped, its current Acting Set is inconsistent with its Up Set.
The client IO can read and write normally.
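The mismatch between the Up Set and the Acting Set can be observed directly for a single PG, which is an easy way to confirm why it is flagged remapped. A minimal sketch, using the PG id 13.cd from the dump above as an example:

```bash
# Show the up set and acting set of one PG; while it is remapped the two lists differ
ceph pg map 13.cd

# A fuller picture, including recovery/backfill progress for that PG
ceph pg 13.cd query | less
```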
3.4 Recovery
3.4.1 description
Recovery is the process in which a PG uses the PGLog to synchronize and repair objects whose data is inconsistent.
3.4.2 fault simulation
a. Stop osd.x

$ systemctl stop ceph-osd@x

b. Start osd.x again after 1 minute

$ systemctl start ceph-osd@x

c. Check the cluster health status

$ ceph health detail
HEALTH_WARN Degraded data redundancy: 183/57960 objects degraded (0.316%), 17 pgs unclean, 17 pgs degraded
PG_DEGRADED Degraded data redundancy: 183/57960 objects degraded (0.316%), 17 pgs unclean, 17 pgs degraded
    pg 1.19 is active+recovery_wait+degraded, acting [29,9,17]
3.4.3 summary
Recovery restores data using the operations recorded in the PGLog.
The PGLog is capped at osd_max_pg_log_entries = 10000 entries; within that limit, the data can be recovered incrementally from the PGLog.
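The PG log length that decides between incremental recovery and full backfill is an OSD configuration option; the default and even the exact option names can differ between releases, so treat the following as a sketch for checking the values on a running OSD (osd.0 is only an example):

```bash
# Inspect the pg log limits of a running OSD via its admin socket
ceph daemon osd.0 config show | grep pg_log_entries

# On newer releases the centralized config database can be queried instead
ceph config get osd osd_max_pg_log_entries
```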
3.5 Backfill
3.5.1 description
When a PG replica's data can no longer be recovered from the PGLog, full synchronization is required; it is performed by copying all objects from the current Primary in full.
3.5.2 fault simulation
a. Stop osd.x

$ systemctl stop ceph-osd@x

b. Start osd.x again after 10 minutes

$ systemctl start ceph-osd@x

c. Check the cluster health status

$ ceph health detail
HEALTH_WARN Degraded data redundancy: 6/57927 objects degraded (0.010%), 1 pg unclean, 1 pg degraded
PG_DEGRADED Degraded data redundancy: 6/57927 objects degraded (0.010%), 1 pg unclean, 1 pg degraded
    pg 3.7f is active+undersized+degraded+remapped+backfilling, acting [21,29]
3.5.3 summary
When the data cannot be recovered from the recorded PGLog, the Backfill process is needed to recover the data in full.
If the gap exceeds osd_max_pg_log_entries = 10000 entries, the data must be recovered in full by backfill.
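Backfill copies whole objects, so it is much heavier than log-based recovery; when it competes with client IO it is often throttled. A hedged sketch of the usual knobs (option names and sensible values vary by release and cluster):

```bash
# Limit concurrent backfills per OSD and slow recovery down (values are only examples)
ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1'

# Temporarily stop backfill/recovery entirely, e.g. during peak hours
ceph osd set nobackfill
ceph osd set norecover

# Re-enable afterwards
ceph osd unset nobackfill
ceph osd unset norecover
```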
3.6 Stale
3.6.1 description
A PG goes stale when:
- the Mon detects that the Primary OSD of the PG is down;
- the Primary OSD fails to report PG information to the Mon within the timeout (for example, because of network congestion);
- all three replicas of the PG are down.
3.6.2 fault simulation
a. Stop the three replica OSDs of a PG one by one; first stop osd.23

$ systemctl stop ceph-osd@23

b. Then stop osd.24

$ systemctl stop ceph-osd@24

c. Check the state of PG 1.45 with two replicas stopped (undersized+degraded+peered)

$ ceph health detail
HEALTH_WARN 2 osds down; Reduced data availability: 9 pgs inactive; Degraded data redundancy: 3041/47574 objects degraded (6.392%), 149 pgs unclean, 149 pgs degraded, 149 pgs undersized
OSD_DOWN 2 osds down
    osd.23 (root=default,host=ceph-xx-osd02) is down
    osd.24 (root=default,host=ceph-xx-osd03) is down
PG_AVAILABILITY Reduced data availability: 9 pgs inactive
    pg 1.45 is stuck inactive for 281.355588, current state undersized+degraded+peered, last acting [10]

d. Stop osd.10, the third replica of PG 1.45

$ systemctl stop ceph-osd@10

e. Check the state of PG 1.45 with all three replicas stopped (stale+undersized+degraded+peered)

$ ceph health detail
HEALTH_WARN 3 osds down; Reduced data availability: 26 pgs inactive, 2 pgs stale; Degraded data redundancy: 4770/47574 objects degraded (10.026%), 222 pgs unclean, 222 pgs degraded, 222 pgs undersized
OSD_DOWN 3 osds down
    osd.10 (root=default,host=ceph-xx-osd01) is down
    osd.23 (root=default,host=ceph-xx-osd02) is down
    osd.24 (root=default,host=ceph-xx-osd03) is down
PG_AVAILABILITY Reduced data availability: 26 pgs inactive, 2 pgs stale
    pg 1.9 is stuck inactive for 171.200290, current state undersized+degraded+peered, last acting [13]
    pg 1.45 is stuck stale for 171.206909, current state stale+undersized+degraded+peered, last acting [10]
    pg 1.89 is stuck inactive for 435.573694, current state undersized+degraded+peered, last acting [32]
    pg 1.119 is stuck inactive for 435.574626, current state undersized+degraded+peered, last acting [28]

f. Client IO operation

# Reads and writes on the mounted disk hang
$ ll /mnt/
Fault summary:
After stopping two replicas of the same PG, its state is undersized+degraded+peered.
After stopping all three replicas of the same PG, its state is stale+undersized+degraded+peered.
3.6.3 summary
When all three replicas of a PG are down, the stale state appears.
At that point the PG cannot serve client reads or writes, and IO hangs.
The stale state also appears when the Primary OSD fails to report PG information to the Mon in time (for example, because of network congestion).
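Stale PGs are usually easiest to track down by listing them together with the OSDs they were last active on, and then checking those OSDs. A short sketch:

```bash
# PGs whose state has not been refreshed recently, with their last acting set
ceph pg dump_stuck stale

# Cross-check which OSDs are actually down
ceph osd stat
ceph osd tree
```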
3.7 Inconsistent
3.7.1 description
Through Scrub, the PG detects that one or more objects are inconsistent between its PG instances.
3.7.2 fault simulation
a. Delete the on-disk file of an object in the osd.34 replica of PG 3.0

$ rm -rf /var/lib/ceph/osd/ceph-34/current/3.0_head/DIR_0/1000000697c.0000122c__head_19785300__3

b. Manually trigger a scrub of PG 3.0

$ ceph pg scrub 3.0
instructing pg 3.0 on osd.34 to scrub

c. Check the cluster health status

$ ceph health detail
HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
OSD_SCRUB_ERRORS 1 scrub errors
PG_DAMAGED Possible data damage: 1 pg inconsistent
    pg 3.0 is active+clean+inconsistent, acting [34,23,1]

d. Repair PG 3.0

$ ceph pg repair 3.0
instructing pg 3.0 on osd.34 to repair

# Check the cluster health status
$ ceph health detail
HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent, 1 pg repair
OSD_SCRUB_ERRORS 1 scrub errors
PG_DAMAGED Possible data damage: 1 pg inconsistent, 1 pg repair
    pg 3.0 is active+clean+scrubbing+deep+inconsistent+repair, acting [34,23,1]

# The cluster health status has returned to normal
$ ceph health detail
HEALTH_OK
Fault summary:
When the three replicas of a PG hold inconsistent data, repairing the inconsistent files only requires running the ceph pg repair command; Ceph copies the missing or damaged data from the other replicas to repair it.
3.7.3 fault simulation
If an OSD goes down only briefly, the other two replicas in the cluster continue to accept writes, but the data on osd.34 is not updated. When osd.34 comes back online a little later, its data is stale; during recovery it is brought up to date from the other OSDs, and the PG state goes from inconsistent -> recovering -> clean, finally returning to normal.
This is a scenario in which the cluster heals a fault by itself.
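Before running ceph pg repair, it can help to see exactly which objects and which shards the scrub flagged, since repair copies data from the replicas it believes are good. A sketch using the rados inconsistency listing (available on Jewel and later):

```bash
# List the objects that scrub found inconsistent in this PG, with per-shard errors
rados list-inconsistent-obj 3.0 --format=json-pretty

# Then trigger the repair as in the simulation above
ceph pg repair 3.0
```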
3.8 Down
3.8.1 description
During Peering, the PG detected an Interval that cannot be skipped (for example, during that Interval the PG completed Peering and switched to the Active state, so it may have served client reads and writes normally), and the OSDs that remain online are not sufficient to complete the data repair.
3.8.2 fault simulation
a. Check the replicas of PG 3.7f

$ ceph pg dump | grep ^3.7f
dumped all
3.7f 43 0 0 0 0 494927872 1569 1569 active+clean 2018-07-05 02:52:51.512598 21315'80115 21356:111666 [5,21,29] 5 [5,21,29] 5 21315'80115 2018-07-05 02:52:51.512568 6206'80083 2018-06-29 22:51:05.831219

b. Stop osd.21, one replica of PG 3.7f

$ systemctl stop ceph-osd@21

c. Check the state of PG 3.7f

$ ceph pg dump | grep ^3.7f
dumped all
3.7f 66 0 89 0 0 591396864 1615 1615 active+undersized+degraded 2018-07-05 15:29:15.741318 21361'80161 21365:128307 [5,29] 5 [5,29] 5 21315'80115 2018-07-05 02:52:51.512568 6206'80083 2018-06-29 22:51:05.831219

d. Run client IO so that new data lands on the surviving replicas [5,29] of PG 3.7f

$ fio -filename=/mnt/xxxsssss -direct=1 -iodepth 1 -thread -rw=read -ioengine=libaio -bs=4M -size=2G -numjobs=30 -runtime=200 -group_reporting -name=read-libaio
read-libaio: (g=0): rw=read, bs=4M-4M/4M-4M/4M-4M, ioengine=libaio, iodepth=1
...
fio-2.2.8
Starting 30 threads
read-libaio: Laying out IO file(s) (1 file(s) / 2048MB)
Jobs: 5 (f=5): [_(5),R(1),_(5),R(1),_(3),R(1),_(2),R(1),_(1),R(1),_(9)] [96.5% done] [1052MB/0KB/0KB /s] [263/0/0 iops] [eta 00m:02s]
read-libaio: (groupid=0, jobs=30): err= 0: pid=32966: Thu Jul 5 15:35:16 2018
  read : io=61440MB, bw=1112.2MB/s, iops=278, runt= 55203msec
    slat (msec): min=18, max=418, avg=103.77, stdev=46.19
    clat (usec): min=0, max=33, avg= 2.51, stdev= 1.45
     lat (msec): min=18, max=418, avg=103.77, stdev=46.19
    clat percentiles (usec):
     |  1.00th=[    1],  5.00th=[    1], 10.00th=[    1], 20.00th=[    2],
     | 30.00th=[    2], 40.00th=[    2], 50.00th=[    2], 60.00th=[    2],
     | 70.00th=[    3], 80.00th=[    3], 90.00th=[    4], 95.00th=[    5],
     | 99.00th=[    7], 99.50th=[    8], 99.90th=[   10], 99.95th=[   14],
     | 99.99th=[   32]
    bw (KB  /s): min=15058, max=185448, per=3.48%, avg=39647.57, stdev=12643.04
    lat (usec) : 2=19.59%, 4=64.52%, 10=15.78%, 20=0.08%, 50=0.03%
  cpu          : usr=0.01%, sys=0.37%, ctx=491792, majf=0, minf=15492
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=15360/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
   READ: io=61440MB, aggrb=1112.2MB/s, minb=1112.2MB/s, maxb=1112.2MB/s, mint=55203msec, maxt=55203msec

e. Stop osd.29, another replica of PG 3.7f, and check the state of PG 3.7f (undersized+degraded+peered)

# Stop the PG replica osd.29
$ systemctl stop ceph-osd@29
# The PG 3.7f state is undersized+degraded+peered
$ ceph pg dump | grep ^3.7f
dumped all
3.7f 70 0 140 0 0 608174080 1623 1623 undersized+degraded+peered 2018-07-05 15:35:51.629636 21365'80169 21367:132165 [5] 5 [5] 5 21315'80115 2018-07-05 02:52:51.512568 6206'80083 2018-06-29 22:51:05.831219

f. Stop osd.5, the last replica of PG 3.7f, and check the state of PG 3.7f (stale+undersized+degraded+peered)

# Stop the PG replica osd.5
$ systemctl stop ceph-osd@5
# The PG 3.7f state is stale+undersized+degraded+peered
$ ceph pg dump | grep ^3.7f
dumped all
3.7f 70 0 140 0 0 608174080 1623 1623 stale+undersized+degraded+peered 2018-07-05 15:35:51.629636 21365'80169 21367:132165 [5] 5 [5] 5 21315'80115 2018-07-05 02:52:51.512568 6206'80083 2018-06-29 22:51:05.831219

g. Bring osd.21 (whose data is now stale) back up, and check the PG state (down)

# Start osd.21
$ systemctl start ceph-osd@21
# The PG state is down
$ ceph pg dump | grep ^3.7f
dumped all
3.7f 66 0 0 0 0 591396864 1548 1548 down 2018-07-05 15:36:38.365500 21361'80161 21370:111729 [21] 21 [21] 21 21315'80115 2018-07-05 02:52:51.512568 6206'80083 2018-06-29 22:51:05.831219

h. Client IO operation

# The client IO hangs
$ ll /mnt/
Fault summary:
PG 3.7f initially has three replicas [5,21,29]. After osd.21 is stopped, data is written to osd.5 and osd.29. Then osd.29 and osd.5 are stopped, and finally osd.21 is brought back up. At this point the data on osd.21 is stale and the PG goes down; client IO hangs, and the problem can only be fixed by bringing the downed OSDs back up.
3.8.3 The OSD of a Down PG is lost or cannot be brought back up
Repair method (validated in a production environment):
a. Delete the OSD that cannot be brought back up.
b. Re-create an OSD with the same ID.
c. The PG's Down state will disappear.
d. For PGs with unfound objects, choose either delete or revert:

$ ceph pg {pg-id} mark_unfound_lost revert|delete
3.8.4 conclusion
Typical scenario: A (primary), B, C
a. First kill B.
b. Write new data to A and C.
c. Kill A and C.
d. Bring B back up.
The PG is Down because the data on the surviving OSD is too old and the other online OSDs are not sufficient to complete the data repair.
At this point the PG cannot serve client IO, and reads and writes hang.
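When a PG is down, ceph pg query reports which OSDs peering is waiting for, which tells you exactly which OSD has to be brought back (or, as a last resort, declared lost). A hedged sketch, reusing the PG id from the simulation above:

```bash
# Look at the peering section of the query output; fields such as "blocked_by" and
# "down_osds_we_would_probe" (names may vary slightly by release) show which OSDs are needed
ceph pg 3.7f query

# Only if the missing OSD is truly unrecoverable: declare it lost so peering can continue.
# This may discard the newest writes, so treat it as a last resort.
ceph osd lost 21 --yes-i-really-mean-it
```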
3.9 Incomplete
During Peering, Peering could not complete normally because a. an authoritative log could not be selected, or b. the Acting Set chosen by choose_acting is not sufficient to complete the data repair.
This commonly happens when servers in a Ceph cluster are repeatedly restarted, or lose power, while PGs are peering.
3.9.1 summary
Repair method (see the mailing-list discussion "wanted: command to clear 'incomplete' PGs"):
For example, if pg 1.1 is incomplete, first compare the object counts of pg 1.1 across its replicas and export from the replica that has the most objects,
then import into the replicas with fewer objects and mark the PG complete. Be sure to export a backup of the pg first.
Done the simple, careless way, data may be lost again.
a. Stop the OSD that is primary for the incomplete PG.
b. Run: ceph-objectstore-tool --data-path ... --journal-path ... --pgid $PGID --op mark-complete
c. Start the OSD.
Ensure data integrity
# 1. Check the object counts of the replicas of pg 1.1. If the primary has the most objects, go to the OSD node where the primary is located and export it
$ ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0/ --journal-path /var/lib/ceph/osd/ceph-0/journal --pgid 1.1 --op export --file /home/pg1.1

# 2. scp /home/pg1.1 to the node holding the replica (if there are several replicas, repeat for each of them), then on that node import it
$ ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1/ --journal-path /var/lib/ceph/osd/ceph-1/journal --pgid 1.1 --op import --file /home/pg1.1

# 3. Then mark the pg complete
$ ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1/ --journal-path /var/lib/ceph/osd/ceph-1/journal --pgid 1.1 --op mark-complete

# 4. Finally, start the OSD again
$ systemctl start ceph-osd@1
Verification scheme
# 1. Mark a pg in the incomplete state as complete. It is recommended to verify the procedure in a test
#    environment and to be familiar with ceph-objectstore-tool before operating on production.
#    PS: the OSD must be stopped before using ceph-objectstore-tool, otherwise the tool reports an error.

# 2. Query the details of pg 7.123 (can be run on a live cluster)
$ ceph pg 7.123 query > /export/pg-7.123-query.txt

# 3. Query each OSD replica node
$ ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-641/ --type bluestore --pgid 7.123 --op info > /export/pg-7.123-info-osd641.txt
# For example:
#   pg 7.123 on OSD 1 contains objects 1,2,3,4,5
#   pg 7.123 on OSD 2 contains objects 1,2,3,6
#   pg 7.123 on OSD 3 contains objects 1,2,3,7

# 4. Query and compare the data
# 4.1 Export the pg object list
$ ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-641/ --type bluestore --pgid 7.123 --op list > /export/pg-7.123-object-list-osd-641.txt
# 4.2 Count the number of objects in the pg
$ ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-641/ --type bluestore --pgid 7.123 --op list | wc -l
# 4.3 Compare whether the object lists of all replicas are consistent
$ diff -u /export/pg-7.123-object-list-osd-1.txt /export/pg-7.123-object-list-osd-2.txt
# For example: if pg 7.123 is incomplete, compare the object counts across all replicas of pg 7.123
- As above, after the diff comparison, check whether the object lists of all replicas (primary and secondaries) are consistent, to avoid data inconsistency. Prefer the backup that has the most objects and contains all objects found in the comparison.
- As above, if after the diff comparison the counts differ and the replica with the most objects does not contain all objects, consider exporting without overwriting first, and only import once a complete set of objects has been assembled. Note: the import removes the existing pg beforehand, i.e. it is an overwriting import.
- As above, if the data is consistent after the diff comparison, use the backup with the most objects and import it into the pg replicas with fewer objects, then mark complete on all replicas. Be sure to export a pg backup on the OSD node of every replica first, so that the pg can be restored if anything goes wrong.
# 5. Export a backup
# Check the object counts of all replicas of pg 7.123. Assuming, as above, that osd-641 has the most objects and the
# diff comparison shows its data is consistent, run the export on the OSD node whose replica has the most objects and
# a consistent object list (preferably export a backup from every replica; a replica with 0 objects should also be backed up).
$ ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-641/ --type bluestore --pgid 7.123 --op export --file /export/pg1.414-osd-1.obj

# 6. Import the backup
# scp /export/pg1.414-osd-1.obj to the node holding the replica with fewer objects, and run the import on that OSD node
# (preferably export a backup from every replica first; a replica with 0 objects should also be backed up).
# Importing writes the exported pg metadata into the current pg, and the current pg must be removed before the import
# (export a backup of the pg before removing it). If the current pg is not removed, the import fails and reports that it already exists.
$ ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-57/ --type bluestore --pgid 7.123 --op remove
# --force must be added for the removal to succeed.
$ ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-57/ --type bluestore --pgid 7.123 --op import --file /export/pg1.414-osd-1.obj

# 7. Mark the pg complete
$ ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-57/ --type bluestore --pgid 7.123 --op mark-complete
Author: Lihang Lucien
Link: https://www.jianshu.com/p/36c2d5682d87
Source: Jianshu
The copyright belongs to the author. For commercial reprint, please contact the author for authorization, and for non-commercial reprint, please indicate the source.