Troubleshooting a failing or missing NVMe drive during a 2.0.2 upgrade

Drives typically fail during a reboot or power cycle. Because upgrades involve reboots, and in some cases power cycles, the risk of a drive failing during an upgrade is higher than during normal operation. If a worker node NVMe drive fails or goes missing during an apupgrade from 2.x to 2.x, the MCP component might retry for a long time and eventually fail to upgrade. Follow these steps to determine whether you ran into this issue and to complete the upgrade successfully if you did.

Procedure

Follow these steps to determine whether you ran into the issue.

  1. In /var/log/appliance/apupgrade/apupgrade.log, look for retry messages where the number of retries remaining is greater than zero. For example:
    [root@gt01-node1 ~]# tail -n 20 /var/log/appliance/apupgrade/apupgrade.log
    2022-08-08 17:28:59 INFO: 25 try of check_update_completion.
    2022-08-08 17:29:00 INFO: Function: check_update_completion failed 25 try.will retry in 60 seconds Retries remaining: 65
    2022-08-08 17:30:00 INFO: 26 try of check_update_completion.
    2022-08-08 17:30:01 INFO: Function: check_update_completion failed 26 try.will retry in 60 seconds Retries remaining: 64
    2022-08-08 17:31:01 INFO: 27 try of check_update_completion.
    2022-08-08 17:31:01 INFO: Function: check_update_completion failed 27 try.will retry in 60 seconds Retries remaining: 63
    2022-08-08 17:32:01 INFO: 28 try of check_update_completion.
    2022-08-08 17:32:02 INFO: Function: check_update_completion failed 28 try.will retry in 60 seconds Retries remaining: 62
    [root@gt01-node1 ~]# tail -n 10 /var/log/appliance/apupgrade/apupgrade.log.tracelog
    
                               LOGGING FROM: mcp_postinstaller.py:check_update_completion:50
    2022-08-08 17:39:07 TRACE: RC: 0.
                               STDOUT: [worker   rendered-worker-d0ef374d6867bdaeaed83bc1bb9a94e5   False   True   True   3     0     0     1     30h
                               ]
                               STDERR: []
    
                               LOGGING FROM: mcp_postinstaller.py:check_update_completion:50
    2022-08-08 17:39:07 INFO: Function: check_update_completion failed 35 try.will retry in 60 seconds Retries remaining: 55
                               LOGGING FROM: mcp_postinstaller.py:attempt_recovery:94
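    If the log is long, one way to show only the most recent retry messages is to filter for them directly, using the same log path:
    grep "Retries remaining" /var/log/appliance/apupgrade/apupgrade.log | tail -n 5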
  2. Run:
    oc get mcp
    oc get nodes
    oc describe node
    to confirm that the node drain timed out while evicting an OSD pod. For example:
    [root@gt01-node1 ~]# oc get mcp
    NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
    master   rendered-master-a8aa23cab71f9d6e2418ee14587002a6   True      False      False      3              3                   3                     0                      30h
    unset    rendered-unset-32ea3872f9f22c90550a9831d6e21873    True      False      False      2              2                   2                     0                      30h
    worker   rendered-worker-d0ef374d6867bdaeaed83bc1bb9a94e5   False     True       True       3              0                   0                     1                      30h
    
    [root@gt01-node1 ~]# oc get nodes
    NAME                STATUS                     ROLES    AGE   VERSION
    e1n1-master.fbond   Ready                      master   30h   v1.19.0+d670f74
    e1n2-master.fbond   Ready                      master   30h   v1.19.0+d670f74
    e1n3-master.fbond   Ready                      master   30h   v1.19.0+d670f74
    e1n4.fbond          Ready,SchedulingDisabled   worker   30h   v1.19.0+d670f74
    e2n1.fbond          Ready                      worker   30h   v1.19.0+d670f74
    e2n2.fbond          Ready                      worker   30h   v1.19.0+d670f74
    e2n3.fbond          Ready                      worker   30h   v1.19.0+d670f74
    e2n4.fbond          Ready                      worker   30h   v1.19.0+d670f74
    
    [root@gt01-node1 ~]# oc describe node e1n4.fbond | grep -A7 Annotations
    Annotations:        csi.volume.kubernetes.io/nodeid:
                          {"openshift-storage.cephfs.csi.ceph.com":"e1n4.fbond","openshift-storage.rbd.csi.ceph.com":"e1n4.fbond","smb.csi.k8s.io":"e1n4.fbond"}
                        machineconfiguration.openshift.io/currentConfig: rendered-worker-d0ef374d6867bdaeaed83bc1bb9a94e5
                        machineconfiguration.openshift.io/desiredConfig: rendered-worker-32ea3872f9f22c90550a9831d6e21873
                        machineconfiguration.openshift.io/reason:
                          failed to drain node (5 tries): timed out waiting for the condition: [error when evicting pod "rook-ceph-osd-2-546b98f44f-n89ck": global t...
                        machineconfiguration.openshift.io/state: Degraded
                        volumes.kubernetes.io/controller-managed-attach-detach: true
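    If you only need the drain failure reason, one way to read it is to query the node annotation directly (using the node name from this example):
    oc get node e1n4.fbond -o jsonpath='{.metadata.annotations.machineconfiguration\.openshift\.io/reason}'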
  3. Run:
    ap issues
    to list open alerts and unacknowledged events. For example:
    [root@gt01-node1 ~]# ap issues
    Open alerts (issues) and unacknowledged events
    +------+---------------------+--------------------+-----------------------------------------------------+----------------------------------+----------+--------------+
    | ID   |          Date (EDT) |               Type | Reason Code and Title                               | Target                           | Severity | Acknowledged |
    +------+---------------------+--------------------+-----------------------------------------------------+----------------------------------+----------+--------------+
    | 1009 | 2022-08-08 14:43:53 | SW_NEEDS_ATTENTION | 490: OCS component is not healthy                   | sw://ocs                         |    MAJOR |          N/A |
    | 1010 | 2022-08-08 14:43:58 | HW_NEEDS_ATTENTION | 204: Component is missing                           | hw://enclosure2.node2.drive1     |    MINOR |          N/A |
    | 1011 | 2022-08-08 14:44:59 | SW_NEEDS_ATTENTION | 490: OCS component is not healthy                   | sw://ocs/enclosure2.node2.drive1 |    MAJOR |          N/A |
    | 1012 | 2022-08-08 15:48:15 | SW_NEEDS_ATTENTION | 491: MachineConfig update degraded                  | sw://openshift/enclosure1.node4  |  WARNING |          N/A |
    | 1013 | 2022-08-08 15:54:55 | SW_NEEDS_ATTENTION | 436: Failed to collect status from resource manager | ocs@sw://ocs                     |    MAJOR |          N/A |
    | 1016 | 2022-08-08 16:04:26 | SW_NEEDS_ATTENTION | 496: Some of platform management pods are not ready | sw://magneto                     |    MAJOR |          N/A |
    +------+---------------------+--------------------+-----------------------------------------------------+----------------------------------+----------+--------------+
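    To confirm the missing drive from the operating system side, one option (assuming that enclosure2.node2 corresponds to worker node e2n2.fbond, as in this example) is to list the block devices from a debug pod:
    oc debug node/e2n2.fbond -- chroot /host lsblk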
    
  4. Run:
    oc -n openshift-storage get pods -o wide | grep rook-ceph-osd | grep -v prepare
    to confirm that an OSD pod is in the CrashLoopBackOff state. The ap issues output in the previous step also pointed to a missing drive. For example:
    [root@gt01-node1 ~]# oc -n openshift-storage get pods -o wide | grep rook-ceph-osd | grep -v prepare
    rook-ceph-osd-0-7f4b48956b-zgd69                                  1/1     Running            0          27h     9.254.12.10   e1n4.fbond   <none>           <none>
    rook-ceph-osd-1-5c9bcffcc9-wl2w7                                  1/1     Running            0          27h     9.254.12.9    e1n4.fbond   <none>           <none>
    rook-ceph-osd-10-568fd6466c-g75d7                                 1/1     Running            0          26h     9.254.8.12    e2n1.fbond   <none>           <none>
    rook-ceph-osd-11-998cb4-sf2k7                                     1/1     Running            0          26h     9.254.8.10    e2n1.fbond   <none>           <none>
    rook-ceph-osd-2-546b98f44f-n89ck                                  1/1     Running            0          27h     9.254.12.12   e1n4.fbond   <none>           <none>
    rook-ceph-osd-3-6d456d6f65-b2qmt                                  1/1     Running            0          26h     9.254.6.11    e2n2.fbond   <none>           <none>
    rook-ceph-osd-4-9788cc9bf-s2psd                                   1/1     Running            0          26h     9.254.6.10    e2n2.fbond   <none>           <none>
    rook-ceph-osd-5-5dcb7cd58c-tbpmk                                  1/1     Running            0          27h     9.254.12.11   e1n4.fbond   <none>           <none>
    rook-ceph-osd-6-b7d498f57-55mmq                                   1/1     Running            0          26h     9.254.6.12    e2n2.fbond   <none>           <none>
    rook-ceph-osd-7-fc77899-dzjqs                                     0/1     CrashLoopBackOff   38         26h     9.254.6.13    e2n2.fbond   <none>           <none>
    rook-ceph-osd-8-687cc7d76c-rh7xq                                  1/1     Running            0          26h     9.254.8.11    e2n1.fbond   <none>           <none>
    rook-ceph-osd-9-8bb779f47-w57zl                                   1/1     Running            0          26h     9.254.8.9     e2n1.fbond   <none>           <none>
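    To investigate why the OSD pod keeps crashing, you can inspect it further, for example with the pod name from this output:
    oc -n openshift-storage describe pod rook-ceph-osd-7-fc77899-dzjqs
    oc -n openshift-storage logs rook-ceph-osd-7-fc77899-dzjqs --previous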
  5. Run:
    oc -n openshift-storage get pdb
    to confirm that a pdb (PodDisruptionBudget) exists. The PodDisruptionBudget is the mechanism that blocks the drain that is required for the MachineConfig update to proceed. Check that MAX UNAVAILABLE is equal to zero. For example:
    [root@gt01-node1 ~]# oc -n openshift-storage get pdb
    NAME                                              MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
    rook-ceph-mds-ocs-storagecluster-cephfilesystem   1               N/A               1                     29h
    rook-ceph-mon-pdb                                 N/A             1                 0                     29h
    rook-ceph-osd-rack-rack0                          N/A             0                 0                     3h13m
    rook-ceph-osd-rack-rack1                          N/A             0                 0                     3h13m
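    You can also read the allowed disruptions directly from the PodDisruptionBudget status, for example with one of the rack pdb names from this output:
    oc -n openshift-storage get pdb rook-ceph-osd-rack-rack0 -o jsonpath='{.status.disruptionsAllowed}'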

By now, your upgrade might have already timed out, or the MCP component might still be retrying. Follow these steps to complete the upgrade successfully.

  1. Run:
    ap sd disable
    to disable the problematic drive and remove it from OCS. OCS then redistributes the drive's data to the remaining drives in the cluster, preventing the data loss that the PodDisruptionBudget was protecting against. For example:
    [root@gt01-node1 ~]# ap sd disable enclosure2.node2.drive1
    Storage drive enclosure2.node2.drive1 will be disabled. Are you sure you want to proceed? [y/N]: y
    State change request sent successfully
  2. Monitor the progress by running:
    [root@gt01-node1 ~]# tail -f /var/log/appliance/platform/management/sysmgr.log
    [root@gt01-node1 ~]# tail -f /var/log/appliance/platform/xcat/cpds_disk_mgt.log.tracelog
  3. Wait for the drive to be disabled. For example:
    [root@gt01-node1 ~]# tail -n 40 /var/log/appliance/platform/management/sysmgr.log
    2022-08-08 17:59:51.514 EDT I | <PETH> - Disabling storage drive enclosure2.node2.drive1
    ~snip~
    2022-08-08 18:08:42.072 EDT I | <PETH> - Disabled storage drive enclosure2.node2.drive1
    2022-08-08 18:08:42.105 EDT I | <AETH> - Alert raise requested: id: 1020, reason_code: 837, target: hw://enclosure2.node2.drive1
    After that, the OSD pods from the drained node show Pending because the drain was successful, and the node transitions to NotReady. For example:
    rook-ceph-osd-0-7f4b48956b-mlfq6                                  0/1     Pending            0          89s
    rook-ceph-osd-1-5c9bcffcc9-68xnm                                  0/1     Pending            0          89s
    rook-ceph-osd-10-568fd6466c-g75d7                                 1/1     Running            0          26h
    rook-ceph-osd-11-998cb4-sf2k7                                     1/1     Running            0          26h
    rook-ceph-osd-2-546b98f44f-hqlfx                                  0/1     Pending            0          89s
    rook-ceph-osd-3-6d456d6f65-b2qmt                                  1/1     Running            0          26h
    rook-ceph-osd-4-9788cc9bf-s2psd                                   1/1     Running            0          26h
    rook-ceph-osd-5-5dcb7cd58c-xnjfh                                  0/1     Pending            0          110s
    rook-ceph-osd-6-b7d498f57-55mmq                                   1/1     Running            0          26h
    rook-ceph-osd-8-687cc7d76c-rh7xq                                  1/1     Running            0          26h
    rook-ceph-osd-9-8bb779f47-w57zl                                   1/1     Running            0          26h
    [root@gt01-node1 ~]# oc get nodes
    NAME                STATUS                        ROLES    AGE   VERSION
    e1n1-master.fbond   Ready                         master   30h   v1.19.0+d670f74
    e1n2-master.fbond   Ready                         master   30h   v1.19.0+d670f74
    e1n3-master.fbond   Ready                         master   30h   v1.19.0+d670f74
    e1n4.fbond          NotReady,SchedulingDisabled   worker   30h   v1.19.0+d670f74
    e2n1.fbond          Ready                         worker   30h   v1.19.0+d670f74
    e2n2.fbond          Ready                         worker   30h   v1.19.0+d670f74
    e2n3.fbond          Ready                         worker   30h   v1.19.0+d670f74
    e2n4.fbond          Ready                         worker   30h   v1.19.0+d670f74
    [root@gt01-node1 ~]# oc get mcp
    NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
    master   rendered-master-a8aa23cab71f9d6e2418ee14587002a6   True      False      False      3              3                   3                     0                      30h
    unset    rendered-unset-32ea3872f9f22c90550a9831d6e21873    True      False      False      2              2                   2                     0                      30h
    worker   rendered-worker-d0ef374d6867bdaeaed83bc1bb9a94e5   False     True       True       3              0                   0                     1                      30h
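    To watch the OSD pods while the node drains, one option is:
    watch -n 30 'oc -n openshift-storage get pods | grep rook-ceph-osd | grep -v prepare'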
  4. Run:
    oc get mcp
    to confirm that the worker MCP is no longer degraded and that another node is now draining and rebooting as the update progresses. For example:
    [root@gt01-node1 ~]# oc get mcp
    NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
    master   rendered-master-a8aa23cab71f9d6e2418ee14587002a6   True      False      False      3              3                   3                     0                      31h
    unset    rendered-unset-32ea3872f9f22c90550a9831d6e21873    True      False      False      2              2                   2                     0                      30h
    worker   rendered-worker-d0ef374d6867bdaeaed83bc1bb9a94e5   False     True       False      3              1                   1                     0                      31h
    [root@gt01-node1 ~]# oc get nodes
    NAME                STATUS                     ROLES    AGE   VERSION
    e1n1-master.fbond   Ready                      master   31h   v1.19.0+d670f74
    e1n2-master.fbond   Ready                      master   31h   v1.19.0+d670f74
    e1n3-master.fbond   Ready                      master   31h   v1.19.0+d670f74
    e1n4.fbond          Ready                      worker   31h   v1.19.0+d670f74
    e2n1.fbond          Ready,SchedulingDisabled   worker   31h   v1.19.0+d670f74
    e2n2.fbond          Ready                      worker   31h   v1.19.0+d670f74
    e2n3.fbond          Ready                      worker   31h   v1.19.0+d670f74
    e2n4.fbond          Ready                      worker   31h   v1.19.0+d670f74
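    To follow the remaining worker nodes through the update, you can rerun these checks periodically, for example:
    watch -n 60 'oc get mcp; oc get nodes'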

    This unblocks the MCP, so the upgrade can resume. However, after the upgrade finishes, the drive must be physically replaced and reintegrated into the cluster with ap sd enable, which IBM Support can coordinate.
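    Based on the ap sd disable syntax shown earlier, reintegrating the replacement drive would look similar to the following; run it only when IBM Support directs you to:
    ap sd enable enclosure2.node2.drive1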