Monitoring goes into a degraded state due to stale network CR

General Page

All nodes in the IBM Fusion 2.9.0 rack go into a monitoring degraded state with the following error message and condition in each node monitoring CR:

- lastTransitionTime: <timestamp>
  message: 'Failed to get NetworkSwitchCR info: stale switch cr entries, switch <switch-name> lastTransitionTime time greater than <integer-number> minutes'
  observedGeneration: 1
  reason: immMonitoring
  status: 'True'
  type: Error

environmentReadiness:
- category: ComputeHealth
  message: Failed to get hardware monitoring data for the node <compute-node-name>.
  messageArgs:
  - <compute-node-name>
  messageCode: BMYMO0002W
  messageType: WARNING

Note: Node upsize and configuration might also fail because of the same issue.

Workaround

Follow the steps to resolve the issue:

1. Take a backup of the network switch CRs. To download the switch CR instances:
   - Log in to the OpenShift web console.
   - Go to Administration > CustomResourceDefinitions.
   - Enter "switch" in the search bar and select the Switch CRD from the search results.
   - Go to the Instances tab.
   - Click each switch instance and go to the YAML tab.
   - Click Download to save the switch instance.
2. Run the following commands using the OpenShift CLI:
   oc project ibm-spectrum-fusion-ns
   oc get switch

   Example output:
   NAME              AGE
   hspeed1-isfdeld   6d7h
   hspeed2-isfdeld   6d8h
   mgmt1-isfdeld     6d7h
   mgmt2-isfdeld     6d8h
3. Delete the hspeed1 and hspeed2 switch CRs using the following command:
   oc delete switch <hspeed-name>

   For example:
   oc delete switch hspeed1-isfdeld
   oc delete switch hspeed2-isfdeld
4. Delete the network operator pod using the following commands:
   oc get pods | grep network
   oc delete pods <network-pod-name>
5. Run the following command to check whether the hspeed1 and hspeed2 switch CRs are recreated:
   oc get switch -w
6. After the hspeed1 and hspeed2 switch CRs are recreated, delete the mgmt1 and mgmt2 switch CRs using the following command:
   oc delete switch <mgmt-name>

   For example:
   oc delete switch mgmt1-isfdeld
   oc delete switch mgmt2-isfdeld
7. Delete the network operator pod again using the following commands:
   oc get pods | grep network
   oc delete pods <network-pod-name>
8. Run the following command to check whether the mgmt1 and mgmt2 switch CRs are recreated:
   oc get switch -w

After all the switch CRs are recreated, the nodes should return to the Normal state within a few minutes.
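
The backup and the "wait until recreated" checks in the procedure above can also be done from the CLI. The following is a minimal sketch, not part of the documented workaround. The function names backup_switch_crs and wait_for_switch, the default directory switch-cr-backup, and the 10-second poll interval and 600-second timeout are assumptions chosen for illustration. Only the namespace and the switch resource name come from the steps above.

```shell
#!/bin/sh
# Namespace used in the procedure above.
NS="ibm-spectrum-fusion-ns"

# Save every switch CR in $NS to <dir>/<name>.yaml.
# CLI alternative to downloading each instance from the web console.
# The default directory name is an assumption.
backup_switch_crs() {
    dir="${1:-switch-cr-backup}"
    mkdir -p "$dir" || return 1
    for res in $(oc get switch -n "$NS" -o name); do
        # "-o name" prints <resource>/<name>; keep the part after the last slash.
        short="${res##*/}"
        oc get "$res" -n "$NS" -o yaml > "$dir/$short.yaml" || return 1
        echo "saved $dir/$short.yaml"
    done
}

# Poll until the named switch CR exists again.
# Returns 1 if it has not appeared after <timeout> seconds (assumed default: 600).
wait_for_switch() {
    sw="$1"
    timeout="${2:-600}"
    waited=0
    until oc get switch "$sw" -n "$NS" >/dev/null 2>&1; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep 10
        waited=$((waited + 10))
    done
}
```

For example, run backup_switch_crs before deleting anything, then wait_for_switch hspeed1-isfdeld after restarting the network operator pod. The sketch only checks that the CR exists again; confirming that the monitoring condition has cleared is still a manual check.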

