IBM Support

IBM ESS Alert (HIPER): Simultaneous canister power cycle, missing NVMe drives and out of range sensor data in ESS 3500 5141-FN2 storage enclosures

Troubleshooting


Problem

Enclosure management software running in the BMC of each canister in an ESS 3500 5141-FN2 enclosure may exhibit the following conditions and symptoms:
• Enclosure management software may incorrectly detect that the 3.3V Electronic Circuit Breaker in the power supply assembly is open and reset the power supply. In rare cases, this may reset both power supplies at the same time, causing a momentary loss of power to each canister so that both canisters reboot simultaneously.
• Enclosure management software may incorrectly read the state of the NVMe drive hot plug register and set the drive to a stuck or unallocated state from a canister's view. This may result in one or more missing paths to an NVMe drive.
• Enclosure management software may intermittently read incorrect power supply and enclosure fan operational values (such as power supply input/output voltage and current, and fan rpm) and presence. In rare cases, health monitoring tools such as mmhealth and mmlsenclosure may sample while the enclosure management software is presenting incorrect values and generate false alerts, including call home.

Symptom

ESS 3500 may show one or more of the following symptoms:
• The system may report that both canisters, including the BMC, rebooted at the same time, as seen from the uptime. In most cases, no vmcore is stored for the reboot.
• May show intermittent NATIVE_RAID DEGRADED in mmhealth node show and in mmhealth node eventlog:
2023-02-10 04:56:55.889416 EST    enclosure_needsservice                 WARNING   Enclosure 78XXXXX needs service.
2023-02-10 04:56:55.892811 EST    power_supply_failed                    WARNING   Power supply psu2_right_id1 is FAILED.
2023-02-10 04:56:55.894619 EST    voltage_sensor_failed                  WARNING   Voltage sensor psu2_v_out_id38 is FAILED.
2023-02-10 05:01:55.252427 EST    enclosure_ok                           INFO      Enclosure 78XXXXX is OK.
2023-02-10 05:01:55.255970 EST    power_supply_ok                        INFO      Power supply psu2_right_id1 is OK.
2023-02-10 05:01:55.257572 EST    voltage_sensor_ok                      INFO      Voltage 
• May show intermittent enclosure needs service in mmlsenclosure
• May intermittently show out-of-range power supply parameters in ipmitool sel elist, as shown below.
01/21/2023 | 00:43:34 | Voltage PSU1_V_IN | Lower Critical going low  | Asserted | Reading 0 < Threshold 180 Volts
  01/21/2023 | 00:43:34 | Voltage PSU1_V_IN | Lower Non-recoverable going low  | Asserted | Reading 0 < Threshold 170 Volts
  01/21/2023 | 00:43:34 | Voltage PSU1_V_OUT | Lower Non-critical going low  | Asserted | Reading 0 < Threshold 10.50 Volts
  01/21/2023 | 00:43:34 | Voltage PSU1_V_OUT | Lower Critical going low  | Asserted | Reading 0 < Threshold 10 Volts
  01/21/2023 | 00:43:34 | Voltage PSU1_V_OUT | Lower Non-recoverable going low  | Asserted | Reading 0 < Threshold 9 Volts
  01/21/2023 | 00:43:34 | Fan PSU1_FAN_TACH | Lower Non-critical going low  | Asserted | Reading 0 < Threshold 2500 RPM
  01/21/2023 | 00:43:34 | Fan PSU1_FAN_TACH | Lower Critical going low  | Asserted | Reading 0 < Threshold 2000 RPM
  01/21/2023 | 00:43:34 | Fan PSU1_FAN_TACH | Lower Non-recoverable going low  | Asserted | Reading 0 < Threshold 1000 RPM
• May show an intermittent power supply fault sensor reading in ipmitool sensor
PSU1_FAULT_SEN   | 0x0   | discrete | 0x80c0| na  | na  | na   | na   | na  | na
• May show NVMe drive missing or path missing in the pdisk list
mmvdisk pdisk list -L --rg --da --not-ok
• The kernel log on one canister may show that an NVMe drive was powered off
[Thu Jan 26 09:49:14 2022] pcieport 0000:00:01.7: Slot(12): Powering off due to button press
[Thu Jan 26 09:49:19 2022] pci 0000:07:00.0: Removing from iommu group 42
[Thu Jan 26 09:49:20 2022] pcieport 0000:00:01.7: Slot(12): Power fault   
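
The SEL pattern shown above, where several PSU sensors report a reading of exactly 0 against a lower threshold, can be picked out of a saved or live log with a small filter. The following is a minimal sketch, not an IBM tool: the function names are hypothetical, and the match relies only on the line layout in the sample ipmitool sel elist output above. A zero reading can also come from a genuine loss of power, so matches are candidates for review rather than proof of the BMC defect.

```shell
#!/bin/sh
# Sketch only: sel_zero_readings and sel_zero_count are illustrative names.

# Read `ipmitool sel elist` output on stdin and print the entries where a
# PSU sensor crossed a lower threshold with a reading of exactly 0.
sel_zero_readings() {
    awk '/PSU[0-9]+_/ && /going low/ && /Reading 0 < Threshold/ { print }'
}

# Count the matching entries.
sel_zero_count() {
    sel_zero_readings | wc -l | tr -d ' '
}
```

Assuming ipmitool is installed on the canister, the filter could be used as `ipmitool sel elist | sel_zero_readings`, or against a saved copy of the log.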

Cause

Enclosure management software in the BMC reads power supply status, enclosure fan status, and drive hot plug registers over the PM and I2C buses. The BMC had a defect in its I2C bus error handling and recovery, resulting in intermittent incorrect readings.

Environment

ESS 3500 enclosure solution running any ESS release prior to ESS release 6.1.6.0.
ESS release 6.1.6.1 incorporates updated BMC software version 12.64 with improved I2C bus error handling and recovery. The BMC firmware level can be obtained from the ipmitool mc info command.
ipmitool mc info | grep ^Firmware
Firmware Revision         : 12.64
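
To check a canister against the fixed level, the revision reported by ipmitool mc info can be compared with 12.64. The following is a minimal sketch under stated assumptions: the function names are hypothetical, and it assumes the revision is always printed as major.minor and that a numerically higher value means a later BMC level.

```shell
#!/bin/sh
# Sketch only: bmc_revision and bmc_is_fixed are illustrative names.
FIXED_LEVEL="12.64"   # BMC level named in this document

# Read `ipmitool mc info` output on stdin and print the firmware revision.
bmc_revision() {
    awk -F: '/^Firmware Revision/ { gsub(/[ \t]/, "", $2); print $2; exit }'
}

# Return 0 if revision $1 is at or above FIXED_LEVEL, 1 if it is below,
# and 2 if $1 is not in major.minor form.
bmc_is_fixed() {
    rev="${1:-}"
    case "$rev" in
        ""|*[!0-9.]*|*.*.*|.*|*.) return 2 ;;
        *.*) ;;
        *) return 2 ;;
    esac
    major=${rev%%.*}
    minor=${rev#*.}
    fmajor=${FIXED_LEVEL%%.*}
    fminor=${FIXED_LEVEL#*.}
    [ "$major" -gt "$fmajor" ] && return 0
    [ "$major" -eq "$fmajor" ] && [ "$minor" -ge "$fminor" ]
}
```

On a canister, this could be used as `bmc_is_fixed "$(ipmitool mc info | bmc_revision)" && echo "BMC at fixed level"`.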

Resolving The Problem

Affected customers need to upgrade to ESS 6.1.6.1 or later. In some rare instances, the missing NVMe issue might persist. Please contact IBM Support if you experience this. See the following link for more details of ESS 6.1.6.1.
Customers who cannot upgrade to the fixed version can request the efix via APAR IJ47373.
While IBM recommends a full system upgrade, the BMC firmware can be upgraded to the 12.64 level without a full ESS upgrade. To do so, download firmware release gpfs.ess.firmware-6.1.6.1-1.x86_64.rpm from Fix Central and then run mmchfirmware on the ESS 3500 nodes as follows:
1) Prepare for upgrade and follow all necessary preparation and checks for upgrade as stated in the Deployment and Upgrade procedure.
Download the 6.1.6.1-1 ESS firmware from the following link:
Note: If only the firmware upgrade is performed without a full system upgrade to release 6.1.6.1 and if a subsequent system upgrade is performed to a release level below ESS 6.1.6.1, the BMC firmware will be rolled back to that release level.
2) Install the firmware rpm in each canister.
    cd to the directory where the rpm is saved and run:
# yum upgrade ./gpfs.ess.firmware-6.1.6.1-1.x86_64.rpm
   Verify that the rpm is upgraded:
# rpm -qa | grep gpfs.ess.firmware 
  gpfs.ess.firmware-6.1.6.1-1.x86_64 
3) Update the canister firmware using mmchfirmware. Run the command from each canister, one at a time:
For Canister A:
# mmchfirmware --type storage-enclosure --serial-number <serial num of enclosure> -N canisterA_node_name
Where <serial num of enclosure>  is the serial number of the ESS 3500 enclosure.
Example:
# mmchfirmware --type storage-enclosure --serial-number 78EXXXX -N ess3500a2-hs
  
  mmchfirmware: Processing node ess3500a2-hs
  ess3500a2-hs: update-directory /usr/lpp/mmfs/updates/latest/firmware/enclosure/.
  ess3500a2-hs: [I]Found storage-enclosure firmware update-id firmwareTable version 6.1.6.1-1.
  ess3500a2-hs: Updating enclosure firmware ESM_A.
  ess3500a2-hs: Found storage-enclosure 5141-FN2 78E4004, update-id
  /usr/lpp/mmfs/updates/latest/firmware/enclosure/ess3500fw.111Z.tar.
  ess3500a2-hs: update-directory /usr/lpp/mmfs/updates/latest/firmware/enclosure/.
  ess3500a2-hs: [I]Found storage-enclosure firmware update-id firmwareTable version 6.1.6.1-1.
  ess3500a2-hs: Updating enclosure firmware ESM_B. 
For Canister B:
# mmchfirmware --type storage-enclosure --serial-number <serial num of enclosure> -N canisterB_node_name
Where <serial num of enclosure> is the serial number of the ESS 3500 enclosure.
Example:
# mmchfirmware --type storage-enclosure --serial-number 78E4004 -N ess3500b2-emsvm-hs

  mmchfirmware: Processing node ess3500b2-emsvm-hs.test.net
  ess3500b2-hs: update-directory /usr/lpp/mmfs/updates/latest/firmware/enclosure/.
  ess3500b2-hs: [I]Found storage-enclosure firmware update-id firmwareTable version 6.1.6.1-1.
  ess3500b2-hs: Updating enclosure firmware ESM_A.
  ess3500b2-hs: Found storage-enclosure 5141-FN2 78E4004, update-id
  /usr/lpp/mmfs/updates/latest/firmware/enclosure/ess3500fw.111Z.tar.
  ess3500b2-hs: update-directory /usr/lpp/mmfs/updates/latest/firmware/enclosure/.
  ess3500b2-hs: [I]Found storage-enclosure firmware update-id firmwareTable version 6.1.6.1-1.
  ess3500b2-hs: Updating enclosure firmware ESM_B. 
4) On each canister node, run the following command to verify that the BMC firmware is updated to 12.64.
# /usr/lpp/mmfs/updates/latest/firmware/enclosure/ess3500lsfw.sh -v BMC 
  BMC: 12.64
5) Known issues with the BMC version 12.64 upgrade:
In this upgrade, only the BMC firmware of the canister is upgraded, and the BMC is restarted (without restarting the canister OS). When the BMC is restarted without a canister OS restart, the enclosure system status LED can become out of sync and may falsely activate. The system status LED can be set properly by running the following command:
/opt/ibm/ess/tools/samples/3500faultled.sh -e E -a R
Follow the prompts. The command must be run on the canister server where BMC is acting as primary. The script will prompt if not run on the primary canister.
If the BMC firmware is upgraded without a full system upgrade to ESS version 6.1.6.1 and 3500faultled.sh is not available in the samples directory, please contact IBM Support for further assistance.
Note : Internal reference D.307419
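
Steps 3 and 4 above can be wrapped in small helpers so that the same command line is built for each canister and the reported BMC level is checked afterwards. The following is a minimal sketch, not part of ESS: the helper names are hypothetical, the mmchfirmware flags are only those shown in step 3, and the check assumes ess3500lsfw.sh prints the level in the "BMC: 12.64" form shown in step 4. The comparison is an exact match, so a later BMC level would need EXPECTED_BMC adjusted.

```shell
#!/bin/sh
# Sketch only: helper names are illustrative.
EXPECTED_BMC="12.64"   # fixed BMC level named in this document

# Print the mmchfirmware command for one canister (flags as in step 3).
# $1 = enclosure serial number, $2 = canister node name
canister_update_cmd() {
    if [ -z "${1:-}" ] || [ -z "${2:-}" ]; then
        echo "usage: canister_update_cmd SERIAL NODE" >&2
        return 2
    fi
    printf 'mmchfirmware --type storage-enclosure --serial-number %s -N %s\n' "$1" "$2"
}

# Read `ess3500lsfw.sh -v BMC` output on stdin and print the BMC level.
bmc_level_from_lsfw() {
    awk -F: '$1 ~ /BMC[ \t]*$/ { gsub(/[ \t]/, "", $2); print $2; exit }'
}

# Return 0 only if the level read from stdin equals EXPECTED_BMC.
bmc_level_ok() {
    [ "$(bmc_level_from_lsfw)" = "$EXPECTED_BMC" ]
}
```

For example, `canister_update_cmd 78E4004 ess3500a2-hs` prints the command to review before running it on that canister, and `/usr/lpp/mmfs/updates/latest/firmware/enclosure/ess3500lsfw.sh -v BMC | bmc_level_ok` checks the result once the update completes.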

Document Location

Worldwide

Document Information

Modified date:
05 July 2023

UID

ibm16985231