IBM Support

Avoiding Power10 code update failures from CR1 or CR2 HMCs

Troubleshooting


Problem

An investigation is ongoing into 7063-CR1 and 7063-CR2 HMCs that hang unexpectedly and must be powered off and on, either locally at the console or through the BMC if it is accessible. The hangs affect code updates to the Power10 9080-HEX system in a DHCP environment.

Symptom

The DHCP lease to the FSP lasts 10 hours. The FSP expects the DHCP server on the HMC to be available at the end of that lease. While the HMC is hung, the DHCP server is unavailable. If the FSP's lease comes due during the hang, the FSP sends a DHCPDISCOVER when the HMC becomes operational again. The FSP then abandons the IP address it previously had, yet it retains both the abandoned IP address and the new IP address.
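The timing condition described above can be sketched in a few lines. This is a simplified illustration, not FSP or HMC code: the function name and the timestamps are hypothetical, and it only checks whether the 10-hour lease comes due while the HMC is hung, ignoring any earlier renewal attempts a DHCP client might make.

```python
from datetime import datetime, timedelta

LEASE_DURATION = timedelta(hours=10)  # lease length stated in this document


def lease_expires_during_hang(lease_start, hang_start, hang_end):
    """Return True if the lease that began at lease_start comes due
    while the HMC (and therefore its DHCP server) is hung."""
    lease_end = lease_start + LEASE_DURATION
    return hang_start <= lease_end <= hang_end


# Hypothetical timestamps, for illustration only.
lease_start = datetime(2022, 9, 9, 0, 0)
hang_start = datetime(2022, 9, 9, 8, 30)
hang_end = datetime(2022, 9, 9, 11, 0)

# The lease comes due at 10:00, inside the 08:30-11:00 hang window.
print(lease_expires_during_hang(lease_start, hang_start, hang_end))  # True
```

A hang that ends before the lease comes due would not, by this reasoning, trigger the condition.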

Cause

The FSP should release the abandoned IP address and store only the new IP address, but it retains both.
If the steps in this tip are not followed before a code update, the following error occurs, leaving the FSP on the P-side without a concurrent recovery:
image-20220909060021-6

Environment

HMC Version 10 Any Release
9080-HEX <1010.40 or 1030.00

Diagnosing The Problem

Several symptoms are visible once this occurs.
ASM shows blank values for the IP address fields on the primary and secondary FSP. Launch ASM to the primary FSP by selecting the HEX system -> Operations -> Launch Advanced System Management.
image-20220915060939-1
Navigate to Network Services -> Network Configuration. Select Primary from the "Service Processor" drop-down menu, then click Continue.
image-20220909055309-1
Select IPv4 and then Continue:
image-20220915062007-6
If the IP address and subnet mask fields from the DHCP server are blank, the FSP is impacted by this issue and code update fails, leaving the FSP on the P-side and requiring an outage to resolve. If DHCP is set for both eth0 and eth1, both sets of values may be blank if the redundant HMC also hung. If one side is static, the IP addresses are populated. This issue is specific to DHCP; static IP addresses are not impacted.
The symptom is "Type of address" set to Dynamic with blank values for "IP address".
image-20220915063440-7
Now verify the secondary FSP. From the primary ASM, go back to Network Services -> Network Configuration, select Secondary from the "Service Processor" drop-down menu, then click Continue:
image-20220915061352-2
Select IPv4 and Continue:
image-20220915061911-5
image-20220915063836-8


Another symptom is that the HMC lssysconn command shows four connections for the HEX system, with the duplicates displaying a connection_error_code of "Already connected":
resource_type=sys,type_model_serial_num=9080-HEX*134AB38,sp_type=fsp,sp=secondary,sp_phys_loc=U78D6.SC1.KIC3264-P1-C3,ipaddr=10.254.0.57,alt_ipaddr=unavailable,state=Connected
resource_type=sys,type_model_serial_num=9080-HEX*134AB38,sp_type=fsp,sp=primary,sp_phys_loc=U78D6.SC1.KIC3264-P1-C4,ipaddr=10.254.0.28,alt_ipaddr=unavailable,state=Connected
resource_type=sys,type_model_serial_num=9080-HEX*134AB38,sp_type=fsp,sp=unavailable,sp_phys_loc=unavailable,ipaddr=10.254.0.150,alt_ipaddr=unavailable,state=No Connection,connection_error_code=Already connected  0402-0001-00000025
resource_type=sys,type_model_serial_num=9080-HEX*134AB38,sp_type=fsp,sp=unavailable,sp_phys_loc=unavailable,ipaddr=10.254.0.198,alt_ipaddr=unavailable,state=No Connection,connection_error_code=Already connected  0402-0001-00000025
In some cases the admin may already have removed the connections. The preferred check is to launch ASM and verify that both the primary and secondary FSP display the blank values, which confirms that this tip applies to the environment.
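As a rough aid for spotting the duplicate entries, the lssysconn output above can be parsed off the HMC with a short script. This is a sketch only: the function names are hypothetical, and the parser assumes the simple comma-separated key=value layout shown in the sample, with no commas inside values.

```python
def parse_lssysconn(output):
    """Parse 'lssysconn -r all' style output into one dict per line."""
    records = []
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        record = {}
        for field in line.split(","):
            key, _, value = field.partition("=")
            record[key.strip()] = value.strip()
        records.append(record)
    return records


def find_stale_connections(records):
    """Return records whose error code begins with 'Already connected'."""
    return [
        r for r in records
        if r.get("connection_error_code", "").startswith("Already connected")
    ]


# Sample output copied from this document.
SAMPLE = """\
resource_type=sys,type_model_serial_num=9080-HEX*134AB38,sp_type=fsp,sp=secondary,sp_phys_loc=U78D6.SC1.KIC3264-P1-C3,ipaddr=10.254.0.57,alt_ipaddr=unavailable,state=Connected
resource_type=sys,type_model_serial_num=9080-HEX*134AB38,sp_type=fsp,sp=primary,sp_phys_loc=U78D6.SC1.KIC3264-P1-C4,ipaddr=10.254.0.28,alt_ipaddr=unavailable,state=Connected
resource_type=sys,type_model_serial_num=9080-HEX*134AB38,sp_type=fsp,sp=unavailable,sp_phys_loc=unavailable,ipaddr=10.254.0.150,alt_ipaddr=unavailable,state=No Connection,connection_error_code=Already connected  0402-0001-00000025
resource_type=sys,type_model_serial_num=9080-HEX*134AB38,sp_type=fsp,sp=unavailable,sp_phys_loc=unavailable,ipaddr=10.254.0.198,alt_ipaddr=unavailable,state=No Connection,connection_error_code=Already connected  0402-0001-00000025
"""

stale_ips = [r["ipaddr"] for r in find_stale_connections(parse_lssysconn(SAMPLE))]
print(stale_ips)  # ['10.254.0.150', '10.254.0.198']
```

An empty result does not rule out the issue, because the duplicate connections may already have been removed.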

Resolving The Problem

If both eth0 and eth1 are DHCP on the primary and secondary FSP, scroll to the bottom of the page and click "Reset Network Configuration" to force the FSP to assign itself only one IP address. Repeat on the secondary FSP. If only one ethX port is DHCP, do not select "Reset Network Configuration". Instead, select the "Configure this interface" checkbox, set Type of IP address to "Dynamic", and click OK.
image-20220915061621-3
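The choice between the two ASM actions can be summarized as a small decision function. This is only an illustration of the rule stated above: the function name and the "dynamic"/"static" input strings are assumptions, not ASM values or an ASM API.

```python
def choose_asm_action(eth0_type, eth1_type):
    """Pick the ASM action for one FSP from its two interface types.

    Each argument is "dynamic" (DHCP) or "static" (case-insensitive).
    """
    dynamic = [t.strip().lower() == "dynamic" for t in (eth0_type, eth1_type)]
    if all(dynamic):
        # Both ports use DHCP: reset the whole network configuration.
        return "Reset Network Configuration"
    if any(dynamic):
        # Only one port uses DHCP: reconfigure just that interface.
        return "Configure this interface (Dynamic)"
    # Static addresses are not impacted by this issue.
    return "No action (static addresses are not impacted)"


print(choose_asm_action("Dynamic", "Dynamic"))  # Reset Network Configuration
print(choose_asm_action("Dynamic", "Static"))   # Configure this interface (Dynamic)
```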
Scroll to the bottom. If both eth0 and eth1 are blank, click "Reset Network Configuration":
image-20220909055752-4
Wait a few minutes and recheck the network configuration to verify IP addresses have populated:
image-20220915064400-11
Next, proceed to the secondary FSP:
image-20220909055724-3
Scroll to the bottom. If both eth0 and eth1 are blank, click "Reset Network Configuration":
image-20220909055803-5
Wait a few minutes and recheck the network configuration to verify IP addresses have populated:
image-20220915064325-10
Wait 15 minutes for the HMC and FSP to finish their DHCP communication and for any automatic clean-up to take place.
Finally, re-run lssysconn -r all from the HMC and check the connections. You may need to use rmsysconn to clean up old connections so that only one good "Connected" state exists for each of the primary and secondary FSP:
rmsysconn -o remove --ip 10.254.0.150
rmsysconn -o remove --ip 10.254.0.198
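When several stale entries exist, the removal commands can be generated rather than typed by hand. The sketch below only builds command strings in the form shown above and does not run anything on the HMC; the function name is hypothetical, and the list of stale addresses is assumed to come from your own review of the lssysconn output.

```python
import ipaddress


def build_rmsysconn_commands(stale_ips):
    """Return one 'rmsysconn -o remove --ip <addr>' string per address.

    Raises ValueError if an entry is not a valid IP address, so a typo
    is caught before any command is pasted into the HMC.
    """
    commands = []
    for ip in stale_ips:
        ipaddress.ip_address(ip)  # validation only; raises on bad input
        commands.append(f"rmsysconn -o remove --ip {ip}")
    return commands


# Addresses of the duplicate connections from the example in this document.
for command in build_rmsysconn_commands(["10.254.0.150", "10.254.0.198"]):
    print(command)
```

Review each generated command before running it, since removing a connection that is actually in use would disconnect the HMC from that FSP.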
If connections were removed and those IP addresses are now the valid IP addresses, run the following command:
mksysconn -o auto
The resulting lssysconn -r all output should contain only two good connections, one to the primary and one to the secondary:
resource_type=sys,type_model_serial_num=9080-HEX*134AB38,sp_type=fsp,sp=secondary,sp_phys_loc=U78D6.SC1.KIC3264-P1-C3,ipaddr=10.254.0.57,alt_ipaddr=unavailable,state=Connected
resource_type=sys,type_model_serial_num=9080-HEX*134AB38,sp_type=fsp,sp=primary,sp_phys_loc=U78D6.SC1.KIC3264-P1-C4,ipaddr=10.254.0.28,alt_ipaddr=unavailable,state=Connected
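The expected end state can be checked with a short script run against saved lssysconn output. This is a sketch under stated assumptions: the function name is hypothetical, the output is assumed to cover a single managed system, and the parser assumes the comma-separated key=value layout shown above with no commas inside values.

```python
def connections_are_clean(output):
    """Return True if the output has exactly one Connected primary and
    one Connected secondary entry, and nothing else."""
    records = []
    for line in output.splitlines():
        if not line.strip():
            continue
        record = {}
        for field in line.strip().split(","):
            key, _, value = field.partition("=")
            record[key] = value
        records.append(record)

    roles = sorted(r.get("sp", "") for r in records)
    all_connected = all(r.get("state") == "Connected" for r in records)
    return roles == ["primary", "secondary"] and all_connected


# Expected final output copied from this document.
SAMPLE = """\
resource_type=sys,type_model_serial_num=9080-HEX*134AB38,sp_type=fsp,sp=secondary,sp_phys_loc=U78D6.SC1.KIC3264-P1-C3,ipaddr=10.254.0.57,alt_ipaddr=unavailable,state=Connected
resource_type=sys,type_model_serial_num=9080-HEX*134AB38,sp_type=fsp,sp=primary,sp_phys_loc=U78D6.SC1.KIC3264-P1-C4,ipaddr=10.254.0.28,alt_ipaddr=unavailable,state=Connected
"""

print(connections_are_clean(SAMPLE))  # True
```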

Document Location

Worldwide

Document Information

Modified date:
09 March 2023

UID

ibm16619359