Placement groups are down
Understand and troubleshoot placement groups that are in a down state.
The ceph health detail command reports that some placement groups are down. For example,
HEALTH_ERR 7 pgs degraded; 12 pgs down; 12 pgs peering; 1 pgs recovering; 6 pgs stuck unclean; 114/3300 degraded (3.455%); 1/3 in osds are down
...
pg 0.5 is down+peering
pg 1.4 is down+peering
...
osd.1 is down since epoch 69, last address 192.168.106.220:6801/8651
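When many placement groups are reported, it can help to pull the IDs of the down ones out of the health output before querying them. The following is a minimal sketch, not part of Ceph: the function name down_pgs is hypothetical, and it assumes that each placement group line has the form shown in the example above ("pg ID is STATE", with states joined by "+").

```python
import re

# Matches lines such as "pg 0.5 is down+peering".
# Assumption: the line format follows the sample output in this topic.
PG_LINE = re.compile(r"^\s*pg (\S+) is (\S+)", re.MULTILINE)


def down_pgs(health_detail: str) -> list[str]:
    """Return the IDs of placement groups whose state includes "down"."""
    return [
        pg_id
        for pg_id, state in PG_LINE.findall(health_detail)
        # Drop a trailing comma, then split "down+peering" into its parts.
        if "down" in state.rstrip(",").split("+")
    ]


# Lines copied from the example output above.
sample = """pg 0.5 is down+peering
pg 1.4 is down+peering
osd.1 is down since epoch 69, last address 192.168.106.220:6801/8651
"""

print(down_pgs(sample))
```

The OSD line is skipped because it does not start with "pg", so only placement group IDs are returned.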
What this means
In certain cases, the peering process can be blocked, which prevents a placement group from becoming active and usable. Usually, the failure of an OSD is what blocks peering.
For more information, see Ceph OSD peering.
Troubleshooting this problem
Determine what blocks the peering process.
ceph pg ID query
Replace ID with the ID of the placement group that is down.
For example,
[ceph: root@host01 /]# ceph pg 0.5 query
{ "state": "down+peering",
...
"recovery_state": [
{ "name": "Started\/Primary\/Peering\/GetInfo",
"enter_time": "2021-08-06 14:40:16.169679",
"requested_info_from": []},
{ "name": "Started\/Primary\/Peering",
"enter_time": "2021-08-06 14:40:16.169659",
"probing_osds": [
0,
1],
"blocked": "peering is blocked due to down osds",
"down_osds_we_would_probe": [
1],
"peering_blocked_by": [
{ "osd": 1,
"current_lost_at": 0,
"comment": "starting or marking this osd lost may let us proceed"}]},
{ "name": "Started",
"enter_time": "2021-08-06 14:40:16.169513"}
]
}
The recovery_state section includes information on why the peering process is blocked.
- If the output includes the peering is blocked due to down osds error message, see Down OSDs.
- If you see any other error message, open a support ticket with IBM Support.
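The check on the recovery_state section can also be scripted. The following is a minimal sketch under stated assumptions: the function name peering_blockers is hypothetical, the input is the complete JSON text printed by ceph pg ID query, and the field names (recovery_state, peering_blocked_by, osd) are taken from the example output in this topic. The sample is a trimmed copy of that output with the elided "..." line left out, because the elision is not valid JSON.

```python
import json


def peering_blockers(pg_query_json: str) -> list[int]:
    """Return the OSD IDs listed under peering_blocked_by in recovery_state."""
    report = json.loads(pg_query_json)
    blockers = []
    for entry in report.get("recovery_state", []):
        # Only the blocked peering entry carries this key; others are skipped.
        for item in entry.get("peering_blocked_by", []):
            blockers.append(item["osd"])
    return blockers


# Trimmed copy of the example query output (raw string keeps the "\/" escapes).
sample = r'''
{ "state": "down+peering",
  "recovery_state": [
    { "name": "Started\/Primary\/Peering",
      "blocked": "peering is blocked due to down osds",
      "down_osds_we_would_probe": [1],
      "peering_blocked_by": [
        { "osd": 1,
          "current_lost_at": 0,
          "comment": "starting or marking this osd lost may let us proceed"}]},
    { "name": "Started" }
  ]
}
'''

print(peering_blockers(sample))
```

An empty result means that no entry names a blocking OSD, which corresponds to the second case in the list above.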