Ceph operation, maintenance and repair

Important links and documentation

How to see the status of the cluster

On any cluster node, you can run the ceph health, ceph health detail, or ceph status commands to get an increasingly detailed overview of the cluster's status.

Important: read the "Placement group states" page (linked above) for what status strings like "active", "backfilling", "remapped", etc., mean.
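
To list which placement groups are currently in a particular state, the PG query commands can be filtered by state name. A minimal sketch; the state names here are only examples, so substitute whatever ceph health detail reports (ceph pg ls accepts state filters on recent releases):

# list PGs that have been stuck in an unclean or degraded state
ceph pg dump_stuck unclean
ceph pg dump_stuck degraded

# list PGs currently in specific states
ceph pg ls remapped
ceph pg ls backfill_toofull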

ceph health

ceph health gives a very condensed status of the cluster. Ideally, its output will look like this:

root@storage00:~# ceph health
HEALTH_OK
root@storage00:~#

Less ideally, it will report a HEALTH_WARN or HEALTH_ERR status, which might look like this:

[root@vmfram1 ~]# ceph health
HEALTH_WARN Low space hindering backfill (add storage if this doesn't resolve itself): 2 pgs backfill_toofull
[root@vmfram1 ~]#

The above warning indicates that the cluster is not able to shuffle some objects around (backfilling) due to a lack of disk space. This state might be temporary while other backfilling is in progress, which may free up some space. More likely, the "too full" state will persist and you will need to either add storage (best) or make some architectural or OSD weight changes (less than ideal) to force space to become available; a sketch of the relevant commands follows.
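
To see how full each OSD is and, if necessary, push data off the fullest ones, something like the following can be used. This is only a sketch: the OSD id (12) and the weight values are placeholders, and reweighting will itself trigger more backfilling, so prefer adding storage where possible.

# show per-OSD utilisation, laid out along the CRUSH tree
ceph osd df tree

# temporarily reduce the amount of data Ceph places on an overfull OSD
# (reweight value is between 0 and 1; 1 is the default)
ceph osd reweight 12 0.85

# or adjust the CRUSH weight itself (normally the size of the device in TiB)
ceph osd crush reweight osd.12 1.6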

ceph health detail

Below is the more detailed output of ceph health detail for the same warning shown above.

[root@vmfram1 ~]# ceph health detail
HEALTH_WARN Low space hindering backfill (add storage if this doesn't resolve itself): 2 pgs backfill_toofull
PG_BACKFILL_FULL Low space hindering backfill (add storage if this doesn't resolve itself): 2 pgs backfill_toofull
    pg 9.5 is active+remapped+backfill_wait+backfill_toofull, acting [305,406,103]
    pg 9.e is active+remapped+backfill_wait+backfill_toofull, acting [306,406,104]
[root@vmfram1 ~]#

ceph status

ceph status gives the fullest overview: the cluster's health, its running services, data usage, and current client and recovery I/O. On a healthy cluster the output looks like this:

root@storage00:~# ceph status
  cluster:
    id:     db5b6a5a-1080-46d2-974a-80fe8274c8ba
    health: HEALTH_OK
 
  services:
    mon: 3 daemons, quorum storage00,storage01,compute01 (age 12d)
    mgr: storage01(active, since 12d), standbys: storage00
    mds: vm:1 {0=storage01=up:active} 1 up:standby
    osd: 10 osds: 8 up (since 12d), 8 in (since 3M)
 
  data:
    pools:   4 pools, 448 pgs
    objects: 15.60k objects, 59 GiB
    usage:   177 GiB used, 14 TiB / 14 TiB avail
    pgs:     448 active+clean
 
  io:
    client:   341 B/s wr, 0 op/s rd, 0 op/s wr

root@storage00:~#

And here is the same command on the cluster with the backfill_toofull warning shown earlier:

[root@vmfram1 ~]# ceph status
  cluster:
    id:     afaf721c-f7ea-4466-bae5-b7eda68eb85a
    health: HEALTH_WARN
            Low space hindering backfill (add storage if this doesn't resolve itself): 2 pgs backfill_toofull
 
  services:
    mon: 3 daemons, quorum vmfram1,vmfram3,vmfram4 (age 13d)
    mgr: vmfram1(active, since 13d)
    osd: 16 osds: 16 up (since 2h), 16 in (since 2h); 227 remapped pgs
 
  data:
    pools:   3 pools, 384 pgs
    objects: 1.35M objects, 5.1 TiB
    usage:   16 TiB used, 14 TiB / 29 TiB avail
    pgs:     1067532/4047075 objects misplaced (26.378%)
             220 active+remapped+backfill_wait
             157 active+clean
             5   active+remapped+backfilling
             2   active+remapped+backfill_wait+backfill_toofull
 
  io:
    client:   86 KiB/s wr, 0 op/s rd, 14 op/s wr
    recovery: 78 MiB/s, 19 objects/s

[root@vmfram1 ~]#
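
While a cluster is backfilling like the one above, the recovery line in ceph status shows the current throughput. To follow progress continuously, one option is:

# stream cluster status and log messages as they happen
ceph -w

# or simply re-run the status summary every few seconds
watch -n 5 ceph status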

Error states

Unfound objects

My development Ceph cluster (vmfarm) got an unrecoverable unfound object error over the weekend. ceph health detail showed the error in pool 7 (see ceph osd lspools; that's the RBD block device pool used for my VMs), and ceph pg repair couldn't repair it, but the output did show that the primary OSD for that placement group was OSD 205 (which is on vmfram2). Running dmesg on that server showed a physical sector error on the actual disk device.
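
A sketch of the diagnostic commands used here; the pg id (7.b) and OSD id (205) are the ones from this particular incident, taken from the health output below:

# overall error detail, including the affected pg and its acting set
ceph health detail

# map pool ids to pool names (pool 7 here was the RBD pool for the VMs)
ceph osd lspools

# ask the primary OSD to repair the pg (did not help in this case)
ceph pg repair 7.b

# find out which host holds the primary OSD (first entry in the acting set)
ceph osd find 205

# then, on that host, look for hardware errors
dmesg | grep -i error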

The non-obvious fix for this is, a bit counterintuitively, to stop the OSD (systemctl stop ceph-osd@205). This fails the OSD, and Ceph then starts rebuilding using the remaining good OSDs (on which the object is still found). The failed OSD is probably best thrown away (removed) and rebuilt from scratch rather than trying to fix it.
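
One common manual sequence for removing the failed OSD so it can be rebuilt from scratch is sketched below, again using OSD 205 from this incident. Check that the cluster can tolerate losing the OSD before doing this.

# stop the daemon and mark the OSD out so its data is rebuilt elsewhere
systemctl stop ceph-osd@205
ceph osd out 205

# once recovery has finished, remove it from CRUSH, auth, and the OSD map
ceph osd crush remove osd.205
ceph auth del osd.205
ceph osd rm 205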

============================================================

Mon Jul 10 06:15:02 AEST 2023

HEALTH_ERR 1/1353953 objects unfound (0.000%); Possible data damage: 1 pg recovery_unfound; Degraded data redundancy: 3/4061859 objects degraded (0.000%), 1 pg degraded
OBJECT_UNFOUND 1/1353953 objects unfound (0.000%)
    pg 7.b has 1 unfound objects
PG_DAMAGED Possible data damage: 1 pg recovery_unfound
    pg 7.b is active+recovery_unfound+degraded+repair, acting [205,103,304], 1 unfound
PG_DEGRADED Degraded data redundancy: 3/4061859 objects degraded (0.000%), 1 pg degraded
    pg 7.b is active+recovery_unfound+degraded+repair, acting [205,103,304], 1 unfound

============================================================