Troubleshooting
Problem
The compute node was quiesced on its own, and starting it fails: the start job fails and the node automatically returns to the quiesced state.
Resolving The Problem
Type 1: Known issue of compute nodes in IBM Cloud Pak System prior to 2.3.x.x releases (fixed by APAR IT25311)
- Do this step to check the issue:
- Download the failed job log and search for the following message:
CWZIP8760E It was not possible to determine if the logical unit number (LUN) partition was bound to this host
- Do these steps to recover from the issue:
- Put the node in maintenance mode to migrate all VMs out (the node is still connected in vCenter, and vMotion is possible).
- Power off the compute node from the user interface.
- Power on the compute node from the user interface.
- Start the node.
Note: You can complete these steps by yourself.
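The Type 1 log check can be sketched as a small shell helper. This is a minimal sketch, not an IBM-provided tool: the function name `is_type1_failure` and the log file path you pass to it are hypothetical, and it assumes the failed job log has already been downloaded as a plain-text file.

```shell
#!/bin/sh
# Hypothetical helper (not part of the product): succeeds if the
# downloaded job log contains the CWZIP8760E message that identifies
# the Type 1 issue, and fails otherwise.
is_type1_failure() {
    log_file="$1"
    grep -q 'CWZIP8760E' "$log_file"
}

# Example use, with an assumed file name:
#   if is_type1_failure ./failed-job.log; then
#       echo "Type 1: follow the maintenance mode and power cycle steps"
#   fi
```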
Type 2: Known issue in vCenter/ESXi in which the vSphere HA agent cannot be reconfigured on all hosts
- Do this step to check the issue:
- Find a warning event with the following message:
CWZIP6212W Setting compute node SN#XXX to quiesced because its hypervisor high availability agent failed to reconfigure
- Do these steps to recover from the issue:
- See this article for reference: https://kb.vmware.com/articleview?docid=2008609.
- Disable and re-enable the vSphere HA setting in the vSphere cluster (PureApplication cloud group).
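If the events have been saved to a text file, the affected nodes for Type 2 can be listed with a short filter. This is only a sketch under assumptions: the function name `quiesced_nodes` is hypothetical, the events are assumed to be available as plain text with one message per line, and the serial number is assumed to contain no spaces after the `SN#` prefix shown in the message above.

```shell
#!/bin/sh
# Hypothetical filter: reads event text on standard input and prints the
# unique serial numbers named in CWZIP6212W warnings, one per line.
# Lines that do not contain the warning are ignored.
quiesced_nodes() {
    sed -n 's/.*CWZIP6212W Setting compute node \(SN#[^ ]*\) to quiesced.*/\1/p' | sort -u
}

# Example use, with an assumed file name:
#   quiesced_nodes < ./events.txt
```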
Type 3: Known issue in ESXi in which the hostd process hangs and cannot be restarted except by rebooting the ESXi host
- Do this step to check the issue:
- Confirm that the node is in "Not responding" or "Disconnected" state in vCenter.
- SSH to that compute node (create external application users, if needed).
- Run the following commands:
Important: Stopping or restarting these processes while the node is in "Connected" state in vCenter can cause PSM to reboot the node.
/etc/init.d/vmware-fdm [status|stop|start|restart]
/etc/init.d/vpxa [status|stop|start|restart]
/etc/init.d/hostd [status|stop|start|restart]
- Check whether restarting hostd fails with the state "No such process".
- See this article for reference: https://kb.vmware.com/s/article/1007261?lang=en_US
Note: You can also confirm by killing the process manually. Run these commands:
- ps | grep hostd
- kill -9 <the process ID of hostd-worker that is shown by the above command>
- Check if this action fails with the state as "No such process".
- Do these steps to recover from the issue:
- Restart ESXi (vMotion is not possible in this state).
- SSH to that compute node and run the reboot command.
Note:
- In ESXi, it can take 10-15 minutes for the shutdown to start. Plan for VM downtime if any VMs are running on that node.
- You can complete these steps by yourself. However, you will most likely want to open a PMR and work with IBM PureApplication/IBM Cloud Pak System Support.
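The status part of the Type 3 check can be sketched as a read-only loop over the three service scripts named above. This is a minimal sketch: the function name `report_agent_status` and the `INIT_DIR` override are hypothetical additions for illustration, and the loop only calls `status`, never `stop`, `start`, or `restart`, because of the warning about nodes in "Connected" state.

```shell
#!/bin/sh
# Directory that holds the service scripts. /etc/init.d is the path used
# in the steps above; the override exists only so the sketch can be
# tried on another machine (an assumption, not a product setting).
INIT_DIR="${INIT_DIR:-/etc/init.d}"

# Print the status of each management service. Nothing is stopped or
# restarted here.
report_agent_status() {
    for svc in vmware-fdm vpxa hostd; do
        if [ -x "$INIT_DIR/$svc" ]; then
            printf '%s: ' "$svc"
            "$INIT_DIR/$svc" status
        else
            printf '%s: script not found in %s\n' "$svc" "$INIT_DIR"
        fi
    done
}

# Example use on the compute node over SSH:
#   report_agent_status
```

Keeping the loop to `status` only means it can be run to gather information before deciding whether a reboot is needed.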
Document Location
Worldwide
Applies to: PureApplication System 2.2.6, 2.3.0, 2.3.1, 2.3.2, 2.3.3; IBM Cloud Pak System Software 2.3.0, 2.3.1, 2.3.2, 2.3.3 (platforms: Linux, Windows)
Document Information
Modified date:
11 September 2020
UID
ibm10886025