Troubleshooting a failing or missing NVMe drive during a 2.0.2 upgrade

Drives typically fail during a reboot or power cycle. Because upgrades involve reboots, and in some cases power cycles, the risk of a drive failing during an upgrade is higher than during normal operation. If a worker node NVMe drive fails or goes missing during an apupgrade from 2.x to 2.x, the MCP component might retry for a long time and eventually fail to upgrade. Follow these steps to determine whether you ran into this issue and to complete the upgrade successfully if you did.

Procedure

Follow these steps to determine whether you ran into the issue.

  1. In /var/log/appliance/apupgrade/apupgrade.log, look for retry messages where the number of retries remaining is greater than zero. For example:
    [root@gt01-node1 ~]# tail -n 20 /var/log/appliance/apupgrade/apupgrade.log
    2022-08-08 17:28:59 INFO: 25 try of check_update_completion.
    2022-08-08 17:29:00 INFO: Function: check_update_completion failed 25 try.will retry in 60 seconds Retries remaining: 65
    2022-08-08 17:30:00 INFO: 26 try of check_update_completion.
    2022-08-08 17:30:01 INFO: Function: check_update_completion failed 26 try.will retry in 60 seconds Retries remaining: 64
    2022-08-08 17:31:01 INFO: 27 try of check_update_completion.
    2022-08-08 17:31:01 INFO: Function: check_update_completion failed 27 try.will retry in 60 seconds Retries remaining: 63
    2022-08-08 17:32:01 INFO: 28 try of check_update_completion.
    2022-08-08 17:32:02 INFO: Function: check_update_completion failed 28 try.will retry in 60 seconds Retries remaining: 62
    [root@gt01-node1 ~]# tail -n 10 /var/log/appliance/apupgrade/apupgrade.log.tracelog
    
                               LOGGING FROM: mcp_postinstaller.py:check_update_completion:50
    2022-08-08 17:39:07 TRACE: RC: 0.
                               STDOUT: [worker   rendered-worker-d0ef374d6867bdaeaed83bc1bb9a94e5   False   True   True   3     0     0     1     30h
                               ]
                               STDERR: []
    
                               LOGGING FROM: mcp_postinstaller.py:check_update_completion:50
    2022-08-08 17:39:07 INFO: Function: check_update_completion failed 35 try.will retry in 60 seconds Retries remaining: 55
                               LOGGING FROM: mcp_postinstaller.py:attempt_recovery:94
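    If the log is long, one way to show only the most recent retry messages is to filter for them directly, using the same log path:
    grep "Retries remaining" /var/log/appliance/apupgrade/apupgrade.log | tail -n 5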
  2. Run:
    oc get mcp
    oc get nodes
    oc describe node
    to confirm that the node drain timed out while evicting an OSD pod. For example:
    [root@gt01-node1 ~]# oc get mcp
    NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
    master   rendered-master-a8aa23cab71f9d6e2418ee14587002a6   True      False      False      3              3                   3                     0                      30h
    unset    rendered-unset-32ea3872f9f22c90550a9831d6e21873    True      False      False      2              2                   2                     0                      30h
    worker   rendered-worker-d0ef374d6867bdaeaed83bc1bb9a94e5   False     True       True       3              0                   0                     1                      30h
    
    [root@gt01-node1 ~]# oc get nodes
    NAME                STATUS                     ROLES    AGE   VERSION
    e1n1-master.fbond   Ready                      master   30h   v1.19.0+d670f74
    e1n2-master.fbond   Ready                      master   30h   v1.19.0+d670f74
    e1n3-master.fbond   Ready                      master   30h   v1.19.0+d670f74
    e1n4.fbond          Ready,SchedulingDisabled   worker   30h   v1.19.0+d670f74
    e2n1.fbond          Ready                      worker   30h   v1.19.0+d670f74
    e2n2.fbond          Ready                      worker   30h   v1.19.0+d670f74
    e2n3.fbond          Ready                      worker   30h   v1.19.0+d670f74
    e2n4.fbond          Ready                      worker   30h   v1.19.0+d670f74
    
    [root@gt01-node1 ~]# oc describe node e1n4.fbond | grep -A7 Annotations
    Annotations:        csi.volume.kubernetes.io/nodeid:
                          {"openshift-storage.cephfs.csi.ceph.com":"e1n4.fbond","openshift-storage.rbd.csi.ceph.com":"e1n4.fbond","smb.csi.k8s.io":"e1n4.fbond"}
                        machineconfiguration.openshift.io/currentConfig: rendered-worker-d0ef374d6867bdaeaed83bc1bb9a94e5
                        machineconfiguration.openshift.io/desiredConfig: rendered-worker-32ea3872f9f22c90550a9831d6e21873
                        machineconfiguration.openshift.io/reason:
                          failed to drain node (5 tries): timed out waiting for the condition: [error when evicting pod "rook-ceph-osd-2-546b98f44f-n89ck": global t...
                        machineconfiguration.openshift.io/state: Degraded
                        volumes.kubernetes.io/controller-managed-attach-detach: true
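    If you only need the drain failure reason, one way to read it is to query the node annotation directly (using the node name from this example):
    oc get node e1n4.fbond -o jsonpath='{.metadata.annotations.machineconfiguration\.openshift\.io/reason}'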
  3. Run:
    ap issues
    to list open alerts and unacknowledged events. For example:
    [root@gt01-node1 ~]# ap issues
    Open alerts (issues) and unacknowledged events
    +------+---------------------+--------------------+-----------------------------------------------------+----------------------------------+----------+--------------+
    | ID   |          Date (EDT) |               Type | Reason Code and Title                               | Target                           | Severity | Acknowledged |
    +------+---------------------+--------------------+-----------------------------------------------------+----------------------------------+----------+--------------+
    | 1009 | 2022-08-08 14:43:53 | SW_NEEDS_ATTENTION | 490: OCS component is not healthy                   | sw://ocs                         |    MAJOR |          N/A |
    | 1010 | 2022-08-08 14:43:58 | HW_NEEDS_ATTENTION | 204: Component is missing                           | hw://enclosure2.node2.drive1     |    MINOR |          N/A |
    | 1011 | 2022-08-08 14:44:59 | SW_NEEDS_ATTENTION | 490: OCS component is not healthy                   | sw://ocs/enclosure2.node2.drive1 |    MAJOR |          N/A |
    | 1012 | 2022-08-08 15:48:15 | SW_NEEDS_ATTENTION | 491: MachineConfig update degraded                  | sw://openshift/enclosure1.node4  |  WARNING |          N/A |
    | 1013 | 2022-08-08 15:54:55 | SW_NEEDS_ATTENTION | 436: Failed to collect status from resource manager | ocs@sw://ocs                     |    MAJOR |          N/A |
    | 1016 | 2022-08-08 16:04:26 | SW_NEEDS_ATTENTION | 496: Some of platform management pods are not ready | sw://magneto                     |    MAJOR |          N/A |
    +------+---------------------+--------------------+-----------------------------------------------------+----------------------------------+----------+--------------+
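    To confirm the missing drive from the operating system side, one option (assuming that enclosure2.node2 corresponds to worker node e2n2.fbond, as in this example) is to list the block devices from a debug pod:
    oc debug node/e2n2.fbond -- chroot /host lsblk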
    
  4. Run:
    oc -n openshift-storage get pods -o wide | grep rook-ceph-osd | grep -v prepare
    to confirm that an OSD pod is in the CrashLoopBackOff state. The ap issues output in the previous step also pointed to a missing drive. For example:
    [root@gt01-node1 ~]# oc -n openshift-storage get pods -o wide | grep rook-ceph-osd | grep -v prepare
    rook-ceph-osd-0-7f4b48956b-zgd69                                  1/1     Running            0          27h     9.254.12.10   e1n4.fbond   <none>           <none>
    rook-ceph-osd-1-5c9bcffcc9-wl2w7                                  1/1     Running            0          27h     9.254.12.9    e1n4.fbond   <none>           <none>
    rook-ceph-osd-10-568fd6466c-g75d7                                 1/1     Running            0          26h     9.254.8.12    e2n1.fbond   <none>           <none>
    rook-ceph-osd-11-998cb4-sf2k7                                     1/1     Running            0          26h     9.254.8.10    e2n1.fbond   <none>           <none>
    rook-ceph-osd-2-546b98f44f-n89ck                                  1/1     Running            0          27h     9.254.12.12   e1n4.fbond   <none>           <none>
    rook-ceph-osd-3-6d456d6f65-b2qmt                                  1/1     Running            0          26h     9.254.6.11    e2n2.fbond   <none>           <none>
    rook-ceph-osd-4-9788cc9bf-s2psd                                   1/1     Running            0          26h     9.254.6.10    e2n2.fbond   <none>           <none>
    rook-ceph-osd-5-5dcb7cd58c-tbpmk                                  1/1     Running            0          27h     9.254.12.11   e1n4.fbond   <none>           <none>
    rook-ceph-osd-6-b7d498f57-55mmq                                   1/1     Running            0          26h     9.254.6.12    e2n2.fbond   <none>           <none>
    rook-ceph-osd-7-fc77899-dzjqs                                     0/1     CrashLoopBackOff   38         26h     9.254.6.13    e2n2.fbond   <none>           <none>
    rook-ceph-osd-8-687cc7d76c-rh7xq                                  1/1     Running            0          26h     9.254.8.11    e2n1.fbond   <none>           <none>
    rook-ceph-osd-9-8bb779f47-w57zl                                   1/1     Running            0          26h     9.254.8.9     e2n1.fbond   <none>           <none>
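    To investigate why the OSD pod keeps crashing, you can inspect it further, for example with the pod name from this output:
    oc -n openshift-storage describe pod rook-ceph-osd-7-fc77899-dzjqs
    oc -n openshift-storage logs rook-ceph-osd-7-fc77899-dzjqs --previous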
  5. Run:
    oc -n openshift-storage get pdb
    to confirm that a pdb (PodDisruptionBudget) exists. The PodDisruptionBudget is the mechanism that blocks the drain that is required for the MachineConfig update to proceed. Check that MAX UNAVAILABLE is equal to zero. For example:
    [root@gt01-node1 ~]# oc -n openshift-storage get pdb
    NAME                                              MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
    rook-ceph-mds-ocs-storagecluster-cephfilesystem   1               N/A               1                     29h
    rook-ceph-mon-pdb                                 N/A             1                 0                     29h
    rook-ceph-osd-rack-rack0                          N/A             0                 0                     3h13m
    rook-ceph-osd-rack-rack1                          N/A             0                 0                     3h13m
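    You can also read the allowed disruptions directly from the PodDisruptionBudget status, for example with one of the rack pdb names from this output:
    oc -n openshift-storage get pdb rook-ceph-osd-rack-rack0 -o jsonpath='{.status.disruptionsAllowed}'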

By now, your upgrade might have already timed out, or the MCP component might still be retrying. Follow these steps to complete the upgrade successfully.

  1. Run:
    ap sd disable
    to disable the problematic drive and remove it from OCS. OCS then redistributes the drive's data to the remaining drives in the cluster, preventing the data loss that the PodDisruptionBudget was protecting against. For example:
    [root@gt01-node1 ~]# ap sd disable enclosure2.node2.drive1
    Storage drive enclosure2.node2.drive1 will be disabled. Are you sure you want to proceed? [y/N]: y
    State change request sent successfully
  2. Monitor the progress by running:
    [root@gt01-node1 ~]# tail -f /var/log/appliance/platform/management/sysmgr.log
    [root@gt01-node1 ~]# tail -f /var/log/appliance/platform/xcat/cpds_disk_mgt.log.tracelog
  3. Wait for the drive to be disabled. For example:
    [root@gt01-node1 ~]# tail -n 40 /var/log/appliance/platform/management/sysmgr.log
    2022-08-08 17:59:51.514 EDT I | <PETH> - Disabling storage drive enclosure2.node2.drive1
    ~snip~
    2022-08-08 18:08:42.072 EDT I | <PETH> - Disabled storage drive enclosure2.node2.drive1
    2022-08-08 18:08:42.105 EDT I | <AETH> - Alert raise requested: id: 1020, reason_code: 837, target: hw://enclosure2.node2.drive1
    After that, the OSD pods from the drained node show Pending because the drain was successful, and the node transitions to NotReady. For example:
    rook-ceph-osd-0-7f4b48956b-mlfq6                                  0/1     Pending            0          89s
    rook-ceph-osd-1-5c9bcffcc9-68xnm                                  0/1     Pending            0          89s
    rook-ceph-osd-10-568fd6466c-g75d7                                 1/1     Running            0          26h
    rook-ceph-osd-11-998cb4-sf2k7                                     1/1     Running            0          26h
    rook-ceph-osd-2-546b98f44f-hqlfx                                  0/1     Pending            0          89s
    rook-ceph-osd-3-6d456d6f65-b2qmt                                  1/1     Running            0          26h
    rook-ceph-osd-4-9788cc9bf-s2psd                                   1/1     Running            0          26h
    rook-ceph-osd-5-5dcb7cd58c-xnjfh                                  0/1     Pending            0          110s
    rook-ceph-osd-6-b7d498f57-55mmq                                   1/1     Running            0          26h
    rook-ceph-osd-8-687cc7d76c-rh7xq                                  1/1     Running            0          26h
    rook-ceph-osd-9-8bb779f47-w57zl                                   1/1     Running            0          26h
    [root@gt01-node1 ~]# oc get nodes
    NAME                STATUS                        ROLES    AGE   VERSION
    e1n1-master.fbond   Ready                         master   30h   v1.19.0+d670f74
    e1n2-master.fbond   Ready                         master   30h   v1.19.0+d670f74
    e1n3-master.fbond   Ready                         master   30h   v1.19.0+d670f74
    e1n4.fbond          NotReady,SchedulingDisabled   worker   30h   v1.19.0+d670f74
    e2n1.fbond          Ready                         worker   30h   v1.19.0+d670f74
    e2n2.fbond          Ready                         worker   30h   v1.19.0+d670f74
    e2n3.fbond          Ready                         worker   30h   v1.19.0+d670f74
    e2n4.fbond          Ready                         worker   30h   v1.19.0+d670f74
    [root@gt01-node1 ~]# oc get mcp
    NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
    master   rendered-master-a8aa23cab71f9d6e2418ee14587002a6   True      False      False      3              3                   3                     0                      30h
    unset    rendered-unset-32ea3872f9f22c90550a9831d6e21873    True      False      False      2              2                   2                     0                      30h
    worker   rendered-worker-d0ef374d6867bdaeaed83bc1bb9a94e5   False     True       True       3              0                   0                     1                      30h
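    To watch the OSD pods while the node drains, one option is:
    watch -n 30 'oc -n openshift-storage get pods | grep rook-ceph-osd | grep -v prepare'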
  4. Run:
    oc get mcp
    to confirm that the worker MCP is no longer degraded and that another node is now draining and rebooting as the update progresses. For example:
    [root@gt01-node1 ~]# oc get mcp
    NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
    master   rendered-master-a8aa23cab71f9d6e2418ee14587002a6   True      False      False      3              3                   3                     0                      31h
    unset    rendered-unset-32ea3872f9f22c90550a9831d6e21873    True      False      False      2              2                   2                     0                      30h
    worker   rendered-worker-d0ef374d6867bdaeaed83bc1bb9a94e5   False     True       False      3              1                   1                     0                      31h
    [root@gt01-node1 ~]# oc get nodes
    NAME                STATUS                     ROLES    AGE   VERSION
    e1n1-master.fbond   Ready                      master   31h   v1.19.0+d670f74
    e1n2-master.fbond   Ready                      master   31h   v1.19.0+d670f74
    e1n3-master.fbond   Ready                      master   31h   v1.19.0+d670f74
    e1n4.fbond          Ready                      worker   31h   v1.19.0+d670f74
    e2n1.fbond          Ready,SchedulingDisabled   worker   31h   v1.19.0+d670f74
    e2n2.fbond          Ready                      worker   31h   v1.19.0+d670f74
    e2n3.fbond          Ready                      worker   31h   v1.19.0+d670f74
    e2n4.fbond          Ready                      worker   31h   v1.19.0+d670f74
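    To follow the remaining worker nodes through the update, you can rerun these checks periodically, for example:
    watch -n 60 'oc get mcp; oc get nodes'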

    This unblocks the MCP, so the upgrade can resume. However, after the upgrade finishes, the drive must be physically replaced and reintegrated into the cluster with ap sd enable, which IBM Support can coordinate.
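    Based on the ap sd disable syntax shown earlier, reintegrating the replacement drive would look similar to the following; run it only when IBM Support directs you to:
    ap sd enable enclosure2.node2.drive1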