IBM Support

Monitoring goes into a degraded state due to stale network CR


All nodes in the IBM Fusion 2.9.0 rack go into a monitoring degraded state with the following error message and condition in each node monitoring CR:

- lastTransitionTime: <timestamp>
  message: 'Failed to get NetworkSwitchCR info: stale switch cr entries, switch <switch-name> lastTransitionTime time greater than <integer-number> minutes'
  observedGeneration: 1
  reason: immMonitoring
  status: 'True'
  type: Error

environmentReadiness:
- category: ComputeHealth
  message: Failed to get hardware monitoring data for the node <compute-node-name>.
  messageArgs:
  - <compute-node-name>
  messageCode: BMYMO0002W
  messageType: WARNING

Note: Node upsize and configuration might also fail because of the same issue.

Workaround

Follow these steps to resolve the issue:
  1. Take a backup of the network switch CRs. To download each switch CR instance, do the following:
    • Log in to the OpenShift web console.
    • Go to Administration > CustomResourceDefinitions.
    • Enter the text "switch" in the search bar and select the Switch CRD from the search list.
    • Go to the Instances tab.
    • Click each switch instance and go to the YAML tab.
    • Click Download to save the YAML of the switch instance.
  2. Run the following commands by using the OpenShift CLI to list the switch CRs:
    oc project ibm-spectrum-fusion-ns
    oc get switch

    Example output:
    NAME            AGE
    hspeed1-isfdeld 6d7h
    hspeed2-isfdeld 6d8h
    mgmt1-isfdeld   6d7h
    mgmt2-isfdeld   6d8h
  3. Delete the hspeed1 and hspeed2 switch CRs by using the following command:
    oc delete switch <hspeed-name>

    For example:
    oc delete switch hspeed1-isfdeld
    oc delete switch hspeed2-isfdeld
  4. Delete the network operator pod by using the following commands:
    oc get pods | grep network
    oc delete pods <network-pod-name>
  5. Run the following command to watch the switch CRs and confirm that the hspeed1 and hspeed2 switch CRs are recreated:
    oc get switch -w
  6. After the hspeed1 and hspeed2 switch CRs are recreated, use the following command to delete the mgmt1 and mgmt2 switch CRs:
    oc delete switch <mgmt-name>

    For example:
    oc delete switch mgmt1-isfdeld
    oc delete switch mgmt2-isfdeld
  7. Delete the network operator pod again by using the following commands:
    oc get pods | grep network
    oc delete pods <network-pod-name>
  8. Run the following command to watch the switch CRs and confirm that the mgmt1 and mgmt2 switch CRs are recreated:
    oc get switch -w
  9. After all the switch CRs are recreated, the nodes should return to the Normal state within a few minutes.
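
The backup in step 1 and the checks in steps 5 and 8 can also be scripted. The following is a minimal sketch, not an IBM-provided tool. It assumes that oc is already logged in to the cluster and that the switch CRs are in the ibm-spectrum-fusion-ns project, as in the steps above. The function names backup_switch_crs and wait_for_switch_crs and the backup directory are hypothetical and chosen only for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: helper functions for the workaround steps.
# Assumption: `oc` is logged in and `switch` resolves to the Switch CRD.

# Save every switch CR in namespace $1 as YAML under directory $2.
# CLI alternative to downloading each instance from the web console.
backup_switch_crs() {
  local ns="$1" outdir="$2" res name
  mkdir -p "$outdir" || return 1
  # `-o name` prints one <resource>/<name> entry per line.
  for res in $(oc get switch -n "$ns" -o name); do
    name="${res##*/}"   # keep only the part after the last slash
    oc get "$res" -n "$ns" -o yaml > "$outdir/$name.yaml" || return 1
    echo "saved $outdir/$name.yaml"
  done
}

# Poll until all named switch CRs exist again, or give up after a timeout.
# Arguments: namespace, timeout in seconds, poll interval in seconds, CR names.
# Scripted alternative to watching `oc get switch -w` by eye.
wait_for_switch_crs() {
  local ns="$1" timeout="$2" interval="$3"
  shift 3
  local deadline=$(( SECONDS + timeout )) name missing
  while :; do
    missing=0
    for name in "$@"; do
      oc get switch "$name" -n "$ns" >/dev/null 2>&1 || missing=1
    done
    [ "$missing" -eq 0 ] && return 0          # every CR is present
    [ "$SECONDS" -ge "$deadline" ] && return 1 # timed out
    sleep "$interval"
  done
}

# Example usage (not run automatically):
#   backup_switch_crs ibm-spectrum-fusion-ns ./switch-cr-backup
#   wait_for_switch_crs ibm-spectrum-fusion-ns 600 10 hspeed1-isfdeld hspeed2-isfdeld
```

The wait helper checks only that the CRs exist again. It does not verify their status, so confirm in the node monitoring CRs that the nodes return to the Normal state.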


Document Information

Modified date:
12 March 2025

UID

ibm17185468