Troubleshooting a failing or missing NVMe drive during the 2.0.2 upgrade
The typical failure mode for a drive is to fail during a reboot or power cycle. Because reboots, and occasionally power cycles, happen during an upgrade, a drive is more likely to fail during an upgrade than during normal operation. If a worker node NVMe drive fails or goes missing during apupgrade from 2.x to 2.x, the MCP (machine config pool) component can spend a long time retrying and eventually fail the upgrade. Follow these steps to determine whether you ran into the issue and to complete the upgrade successfully when it occurs.
Procedure
Follow these steps to determine if you ran into the issue.
- In /var/log/appliance/apupgrade/apupgrade.log, look for check_update_completion retries with a remaining count greater than zero. For example:
[root@gt01-node1 ~]# tail -n 20 /var/log/appliance/apupgrade/apupgrade.log
2022-08-08 17:28:59 INFO: 25 try of check_update_completion.
2022-08-08 17:29:00 INFO: Function: check_update_completion failed 25 try.will retry in 60 seconds Retries remaining: 65
2022-08-08 17:30:00 INFO: 26 try of check_update_completion.
2022-08-08 17:30:01 INFO: Function: check_update_completion failed 26 try.will retry in 60 seconds Retries remaining: 64
2022-08-08 17:31:01 INFO: 27 try of check_update_completion.
2022-08-08 17:31:01 INFO: Function: check_update_completion failed 27 try.will retry in 60 seconds Retries remaining: 63
2022-08-08 17:32:01 INFO: 28 try of check_update_completion.
2022-08-08 17:32:02 INFO: Function: check_update_completion failed 28 try.will retry in 60 seconds Retries remaining: 62
[root@gt01-node1 ~]# tail -n 10 /var/log/appliance/apupgrade/apupgrade.log.tracelog
LOGGING FROM: mcp_postinstaller.py:check_update_completion:50
2022-08-08 17:39:07 TRACE: RC: 0.
STDOUT: [worker   rendered-worker-d0ef374d6867bdaeaed83bc1bb9a94e5   False   True   True   3   0   0   1   30h ]
STDERR: []
LOGGING FROM: mcp_postinstaller.py:check_update_completion:50
2022-08-08 17:39:07 INFO: Function: check_update_completion failed 35 try.will retry in 60 seconds Retries remaining: 55
LOGGING FROM: mcp_postinstaller.py:attempt_recovery:94
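If you only want to track the retry counter rather than the full log, a simple filter on the same file is enough; this is just a convenience on top of the commands above and assumes the default log path:
# Show the most recent MCP post-install retry messages
grep "Retries remaining" /var/log/appliance/apupgrade/apupgrade.log | tail -n 5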
- Run oc get mcp, oc get nodes, and oc describe node against the affected node to confirm that the worker MachineConfigPool is degraded because the node drain timed out while evicting an osd pod. For example:
[root@gt01-node1 ~]# oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-a8aa23cab71f9d6e2418ee14587002a6   True      False      False      3              3                   3                     0                      30h
unset    rendered-unset-32ea3872f9f22c90550a9831d6e21873    True      False      False      2              2                   2                     0                      30h
worker   rendered-worker-d0ef374d6867bdaeaed83bc1bb9a94e5   False     True       True       3              0                   0                     1                      30h
[root@gt01-node1 ~]# oc get nodes
NAME                STATUS                     ROLES    AGE   VERSION
e1n1-master.fbond   Ready                      master   30h   v1.19.0+d670f74
e1n2-master.fbond   Ready                      master   30h   v1.19.0+d670f74
e1n3-master.fbond   Ready                      master   30h   v1.19.0+d670f74
e1n4.fbond          Ready,SchedulingDisabled   worker   30h   v1.19.0+d670f74
e2n1.fbond          Ready                      worker   30h   v1.19.0+d670f74
e2n2.fbond          Ready                      worker   30h   v1.19.0+d670f74
e2n3.fbond          Ready                      worker   30h   v1.19.0+d670f74
e2n4.fbond          Ready                      worker   30h   v1.19.0+d670f74
[root@gt01-node1 ~]# oc describe node e1n4.fbond | grep -A7 Annotations
Annotations:        csi.volume.kubernetes.io/nodeid: {"openshift-storage.cephfs.csi.ceph.com":"e1n4.fbond","openshift-storage.rbd.csi.ceph.com":"e1n4.fbond","smb.csi.k8s.io":"e1n4.fbond"}
                    machineconfiguration.openshift.io/currentConfig: rendered-worker-d0ef374d6867bdaeaed83bc1bb9a94e5
                    machineconfiguration.openshift.io/desiredConfig: rendered-worker-32ea3872f9f22c90550a9831d6e21873
                    machineconfiguration.openshift.io/reason: failed to drain node (5 tries): timed out waiting for the condition: [error when evicting pod "rook-ceph-osd-2-546b98f44f-n89ck": global t...
                    machineconfiguration.openshift.io/state: Degraded
                    volumes.kubernetes.io/controller-managed-attach-detach: true
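If you only need the drain failure reason, the annotation can be read directly; a minimal sketch, assuming the degraded node is e1n4.fbond as in the example above:
# Print only the machineconfiguration.openshift.io/reason annotation for the degraded node
oc get node e1n4.fbond -o jsonpath='{.metadata.annotations.machineconfiguration\.openshift\.io/reason}'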
- Run ap issues to check for open alerts that point to a missing or unhealthy drive. For example:
[root@gt01-node1 ~]# ap issues
Open alerts (issues) and unacknowledged events
+------+---------------------+--------------------+-----------------------------------------------------+----------------------------------+----------+--------------+
| ID   | Date (EDT)          | Type               | Reason Code and Title                               | Target                           | Severity | Acknowledged |
+------+---------------------+--------------------+-----------------------------------------------------+----------------------------------+----------+--------------+
| 1009 | 2022-08-08 14:43:53 | SW_NEEDS_ATTENTION | 490: OCS component is not healthy                   | sw://ocs                         | MAJOR    | N/A          |
| 1010 | 2022-08-08 14:43:58 | HW_NEEDS_ATTENTION | 204: Component is missing                           | hw://enclosure2.node2.drive1     | MINOR    | N/A          |
| 1011 | 2022-08-08 14:44:59 | SW_NEEDS_ATTENTION | 490: OCS component is not healthy                   | sw://ocs/enclosure2.node2.drive1 | MAJOR    | N/A          |
| 1012 | 2022-08-08 15:48:15 | SW_NEEDS_ATTENTION | 491: MachineConfig update degraded                  | sw://openshift/enclosure1.node4  | WARNING  | N/A          |
| 1013 | 2022-08-08 15:54:55 | SW_NEEDS_ATTENTION | 436: Failed to collect status from resource manager | ocs@sw://ocs                     | MAJOR    | N/A          |
| 1016 | 2022-08-08 16:04:26 | SW_NEEDS_ATTENTION | 496: Some of platform management pods are not ready | sw://magneto                     | MAJOR    | N/A          |
+------+---------------------+--------------------+-----------------------------------------------------+----------------------------------+----------+--------------+
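To narrow the listing down to the drive-related alerts, you can filter the output by the reason titles shown above; this is only a convenience and assumes the titles match your release:
# Show only the alerts about the missing drive and unhealthy OCS components
ap issues | grep -E "Component is missing|OCS component is not healthy"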
- Run oc -n openshift-storage get pods -o wide | grep rook-ceph-osd | grep -v prepare to confirm that an osd pod is in CrashLoopBackOff state; the ap issues output also hinted at a missing drive. For example:
[root@gt01-node1 ~]# oc -n openshift-storage get pods -o wide | grep rook-ceph-osd | grep -v prepare
rook-ceph-osd-0-7f4b48956b-zgd69    1/1   Running            0    27h   9.254.12.10   e1n4.fbond   <none>   <none>
rook-ceph-osd-1-5c9bcffcc9-wl2w7    1/1   Running            0    27h   9.254.12.9    e1n4.fbond   <none>   <none>
rook-ceph-osd-10-568fd6466c-g75d7   1/1   Running            0    26h   9.254.8.12    e2n1.fbond   <none>   <none>
rook-ceph-osd-11-998cb4-sf2k7       1/1   Running            0    26h   9.254.8.10    e2n1.fbond   <none>   <none>
rook-ceph-osd-2-546b98f44f-n89ck    1/1   Running            0    27h   9.254.12.12   e1n4.fbond   <none>   <none>
rook-ceph-osd-3-6d456d6f65-b2qmt    1/1   Running            0    26h   9.254.6.11    e2n2.fbond   <none>   <none>
rook-ceph-osd-4-9788cc9bf-s2psd     1/1   Running            0    26h   9.254.6.10    e2n2.fbond   <none>   <none>
rook-ceph-osd-5-5dcb7cd58c-tbpmk    1/1   Running            0    27h   9.254.12.11   e1n4.fbond   <none>   <none>
rook-ceph-osd-6-b7d498f57-55mmq     1/1   Running            0    26h   9.254.6.12    e2n2.fbond   <none>   <none>
rook-ceph-osd-7-fc77899-dzjqs       0/1   CrashLoopBackOff   38   26h   9.254.6.13    e2n2.fbond   <none>   <none>
rook-ceph-osd-8-687cc7d76c-rh7xq    1/1   Running            0    26h   9.254.8.11    e2n1.fbond   <none>   <none>
rook-ceph-osd-9-8bb779f47-w57zl     1/1   Running            0    26h   9.254.8.9     e2n1.fbond   <none>   <none>
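On a busy cluster it can be easier to show only the unhealthy osd pods; a small filtering sketch based on the same command:
# List only the rook-ceph-osd pods that are not fully Running
oc -n openshift-storage get pods -o wide | grep rook-ceph-osd | grep -v prepare | grep -v Running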
- Run oc -n openshift-storage get pdb to inspect the pdb (PodDisruptionBudget) objects. The pdb is the actual mechanism that blocks the drain required for the machineconfig update to move on; check that MAX UNAVAILABLE equals zero. For example:
[root@gt01-node1 ~]# oc -n openshift-storage get pdb
NAME                                              MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
rook-ceph-mds-ocs-storagecluster-cephfilesystem   1               N/A               1                     29h
rook-ceph-mon-pdb                                 N/A             1                 0                     29h
rook-ceph-osd-rack-rack0                          N/A             0                 0                     3h13m
rook-ceph-osd-rack-rack1                          N/A             0                 0                     3h13m
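The same information can also be read straight from the PDB objects; a minimal sketch using the standard spec.maxUnavailable and status.disruptionsAllowed fields:
# Show each PodDisruptionBudget's configured maxUnavailable and the disruptions it currently allows
oc -n openshift-storage get pdb -o custom-columns=NAME:.metadata.name,MAX_UNAVAILABLE:.spec.maxUnavailable,ALLOWED_DISRUPTIONS:.status.disruptionsAllowed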
By now, your upgrade might have already timed out, or the MCP component might still be retrying. Follow these steps to complete the upgrade successfully when this issue occurs.
- Run ap sd disable to disable the problematic drive and remove it from OCS, which in turn redistributes its data to the remaining drives in the cluster and prevents the data loss that the pdb was protecting against. For example:
[root@gt01-node1 ~]# ap sd disable enclosure2.node2.drive1
Storage drive enclosure2.node2.drive1 will be disabled.
Are you sure you want to proceed? [y/N]: y
State change request sent successfully
- Monitor the progress by running:
[root@gt01-node1 ~]# tail -f /var/log/appliance/platform/management/sysmgr.log
[root@gt01-node1 ~]# tail -f /var/log/appliance/platform/xcat/cpds_disk_mgt.log.tracelog
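If you prefer a single terminal, tail can follow both files at once; this is only a convenience over the two commands above:
# Follow both monitoring logs in one terminal
tail -f /var/log/appliance/platform/management/sysmgr.log /var/log/appliance/platform/xcat/cpds_disk_mgt.log.tracelog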
- Wait for the drive to get disabled. For example:
[root@gt01-node1 ~]# tail -n 40 /var/log/appliance/platform/management/sysmgr.log
2022-08-08 17:59:51.514 EDT I | <PETH> - Disabling storage drive enclosure2.node2.drive1
~snip~
2022-08-08 18:08:42.072 EDT I | <PETH> - Disabled storage drive enclosure2.node2.drive1
2022-08-08 18:08:42.105 EDT I | <AETH> - Alert raise requested: id: 1020, reason_code: 837, target: hw://enclosure2.node2.drive1
After that, you see that the osd pods from the drained node are Pending because the drain was successful, and you see that the node transitioned to NotReady. For example:
rook-ceph-osd-0-7f4b48956b-mlfq6    0/1   Pending   0   89s
rook-ceph-osd-1-5c9bcffcc9-68xnm    0/1   Pending   0   89s
rook-ceph-osd-10-568fd6466c-g75d7   1/1   Running   0   26h
rook-ceph-osd-11-998cb4-sf2k7       1/1   Running   0   26h
rook-ceph-osd-2-546b98f44f-hqlfx    0/1   Pending   0   89s
rook-ceph-osd-3-6d456d6f65-b2qmt    1/1   Running   0   26h
rook-ceph-osd-4-9788cc9bf-s2psd     1/1   Running   0   26h
rook-ceph-osd-5-5dcb7cd58c-xnjfh    0/1   Pending   0   110s
rook-ceph-osd-6-b7d498f57-55mmq     1/1   Running   0   26h
rook-ceph-osd-8-687cc7d76c-rh7xq    1/1   Running   0   26h
rook-ceph-osd-9-8bb779f47-w57zl     1/1   Running   0   26h
[root@gt01-node1 ~]# oc get nodes
NAME                STATUS                        ROLES    AGE   VERSION
e1n1-master.fbond   Ready                         master   30h   v1.19.0+d670f74
e1n2-master.fbond   Ready                         master   30h   v1.19.0+d670f74
e1n3-master.fbond   Ready                         master   30h   v1.19.0+d670f74
e1n4.fbond          NotReady,SchedulingDisabled   worker   30h   v1.19.0+d670f74
e2n1.fbond          Ready                         worker   30h   v1.19.0+d670f74
e2n2.fbond          Ready                         worker   30h   v1.19.0+d670f74
e2n3.fbond          Ready                         worker   30h   v1.19.0+d670f74
e2n4.fbond          Ready                         worker   30h   v1.19.0+d670f74
[root@gt01-node1 ~]# oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-a8aa23cab71f9d6e2418ee14587002a6   True      False      False      3              3                   3                     0                      30h
unset    rendered-unset-32ea3872f9f22c90550a9831d6e21873    True      False      False      2              2                   2                     0                      30h
worker   rendered-worker-d0ef374d6867bdaeaed83bc1bb9a94e5   False     True       True       3              0                   0                     1                      30h
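To watch the drained node reboot and come back without re-running the command, you can stream node status; a small optional sketch:
# Stream node status until the drained node returns to Ready
oc get nodes -w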
- Run oc get mcp to see that the MCP is no longer degraded and that another node is now draining and rebooting as the MCP update progresses. For example:
[root@gt01-node1 ~]# oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-a8aa23cab71f9d6e2418ee14587002a6   True      False      False      3              3                   3                     0                      31h
unset    rendered-unset-32ea3872f9f22c90550a9831d6e21873    True      False      False      2              2                   2                     0                      30h
worker   rendered-worker-d0ef374d6867bdaeaed83bc1bb9a94e5   False     True       False      3              1                   1                     0                      31h
[root@gt01-node1 ~]# oc get nodes
NAME                STATUS                     ROLES    AGE   VERSION
e1n1-master.fbond   Ready                      master   31h   v1.19.0+d670f74
e1n2-master.fbond   Ready                      master   31h   v1.19.0+d670f74
e1n3-master.fbond   Ready                      master   31h   v1.19.0+d670f74
e1n4.fbond          Ready                      worker   31h   v1.19.0+d670f74
e2n1.fbond          Ready,SchedulingDisabled   worker   31h   v1.19.0+d670f74
e2n2.fbond          Ready                      worker   31h   v1.19.0+d670f74
e2n3.fbond          Ready                      worker   31h   v1.19.0+d670f74
e2n4.fbond          Ready                      worker   31h   v1.19.0+d670f74
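Instead of polling oc get mcp, you can also wait for the worker pool to report that it finished updating; a minimal sketch, and the 60-minute timeout is only an assumption:
# Block until the worker MachineConfigPool reports Updated=True
oc wait mcp/worker --for=condition=Updated --timeout=60m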
This unblocks the MCP, so the upgrade can resume. However, after the upgrade finishes, the drive must be physically replaced and integrated back into the cluster with ap sd enable, which IBM Support can coordinate.
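For reference, re-enabling follows the same pattern as the disable command earlier in this procedure; the drive identifier below is only the one from this example, and the command should be run in coordination with IBM Support:
# Re-integrate the replaced drive into the cluster
ap sd enable enclosure2.node2.drive1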