Compute node quiesced issues

Troubleshooting

Problem

A compute node is quiesced automatically, and attempts to start it fail: the start job fails and the node returns to the quiesced state.

Resolving The Problem

Type 1: Known issue with compute nodes in IBM Cloud Pak System releases prior to 2.3.x.x (fixed by APAR IT25311)

Do this step to check the issue:
- Download the failed job log and search for the following message:
  CWZIP8760E It was not possible to determine if the logical unit number (LUN) partition was bound to this host
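Once the job log is downloaded, the search can be scripted. The following is a minimal sketch, not a product tool: the function name and the log path are hypothetical, and it matches only on the message ID CWZIP8760E.

```shell
# Sketch: report whether a downloaded job log contains the Type 1 message ID.
# The function name and log path are hypothetical, not part of the product.
has_lun_bind_error() {
    # -q: no output, exit status only; -F: treat the pattern as a fixed string
    grep -qF 'CWZIP8760E' "$1"
}

# Example (run on the workstation where the log was downloaded):
#   has_lun_bind_error ./failed_job.log && echo "Type 1 suspected"
```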

Do these steps to recover from the issue:
1. Put the node in maintenance mode to migrate all VMs off it (the node is still connected in vCenter, so vMotion is possible).
2. Power off the compute node from the user interface.
3. Power on the compute node from the user interface.
4. Start the node.

Note: You can complete these steps by yourself.

Type 2: Known issue in vCenter/ESXi in which the vSphere HA agent cannot be reconfigured on all hosts

Do this step to check the issue:
- Find a warning event with the following message:
  CWZIP6212W Setting compute node SN#XXX to quiesced because its hypervisor high availability agent failed to reconfigure
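If the events are available as plain text (an assumption; this article does not describe an export format), the affected node serial numbers can be pulled out of the CWZIP6212W messages. The function name below is hypothetical, and the pattern relies only on the message wording shown above.

```shell
# Sketch: print the serial number of each node named in a CWZIP6212W event.
# Reads event text on stdin; the plain-text events input is an assumption.
quiesced_nodes_from_events() {
    sed -n 's/.*CWZIP6212W Setting compute node \(SN#[^ ]*\) to quiesced.*/\1/p'
}

# Example:
#   quiesced_nodes_from_events < ./events.txt
```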

Do these steps to recover from the issue:
- See this article for reference: https://kb.vmware.com/articleview?docid=2008609.
- Disable and re-enable the vSphere HA setting in the vSphere cluster (PureApplication cloud group).

Note: Customer vCenter access does not have the permissions needed for this step. Open a PMR and work with IBM PureApplication/IBM Cloud Pak System Support.

Type 3: Known issue in ESXi in which the hostd process hangs and cannot be restarted except by rebooting the ESXi host

Do these steps to check the issue:
- Confirm that the node is in "Not responding" or "Disconnected" state in vCenter.
- SSH to that compute node (create external application users, if needed).
- Run the following commands:
  Important: Stopping or restarting these processes can cause PSM to reboot the node if the commands are run while the node is in the "Connected" state in vCenter.
  /etc/init.d/vmware-fdm [status|stop|start|restart]
  /etc/init.d/vpxa [status|stop|start|restart]
  /etc/init.d/hostd [status|stop|start|restart]
- Check whether restarting hostd fails with the message "No such process".
- See this article for reference: https://kb.vmware.com/s/article/1007261?lang=en_US
  
  Note: You can also confirm by killing the process manually. Run these commands:
  - ps | grep hostd
  - kill -9 <the process ID of hostd-worker that is shown by the above command>
- Check whether this action also fails with the message "No such process".
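The "No such process" check above can be expressed as a small helper that inspects captured command output. This is a sketch under assumptions: the helper name is hypothetical, and it tests only for the text "No such process" that this article names, not for any other failure output.

```shell
# Sketch: decide from captured command output whether hostd looks hung.
# hostd_looks_hung is a hypothetical helper name.
hostd_looks_hung() {
    case "$1" in
        *'No such process'*) return 0 ;;  # restart or kill reported the Type 3 symptom
        *) return 1 ;;
    esac
}

# Example, on the ESXi host, only while the node is NOT "Connected" in vCenter:
#   out=$(/etc/init.d/hostd restart 2>&1)
#   hostd_looks_hung "$out" && echo "Type 3 suspected"
```

The example is left commented out on purpose, because restarting hostd on a node that is still "Connected" can cause PSM to reboot the node.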

Do these steps to recover from the issue:
1. Restart ESXi. vMotion is not possible in this state, so VMs cannot be migrated off the node first.
2. To restart, SSH to that compute node and run the reboot command.

Note:

ESXi can take 10-15 minutes to begin shutting down. Plan for VM downtime if any VMs are running on that node.
You can complete these steps yourself, but you will most likely want to open a PMR and work with IBM PureApplication/IBM Cloud Pak System Support.

Document Location

Worldwide

Product: PureApplication System, versions 2.2.6, 2.3.0, 2.3.1, 2.3.2, 2.3.3
Product: IBM Cloud Pak System Software, versions 2.3.0, 2.3.1, 2.3.2, 2.3.3
Platform: Linux, Windows
