MH1010
For Impact, Severity, and other firmware definitions, refer to the 'Glossary of firmware terms' at the URL below:
https://www.ibm.com/support/pages/node/6555136 |
MH1010_166_094 / FW1010.60
2023/06/15
|
Impact: Data Severity: HIPER
System firmware changes that affect all systems
- HIPER/Pervasive: AIX logical partitions that own virtual I/O devices or SR-IOV virtual functions may have data incorrectly written to platform memory or an I/O device, resulting in undetected data loss when Dynamic Platform Optimizer (DPO), predictive memory deconfiguration occurs, or memory mirroring defragmentation is performed.
- To mitigate the risk of this issue, please install the latest FW1010 service pack (FW1010.60 or later).
- HIPER/Pervasive: A security problem was fixed for systems running vTPM 2.0 for vulnerabilities CVE-2023-1017 and CVE-2023-1018. These vulnerabilities can allow a denial of service attack or arbitrary code execution on the vTPM 2.0 device.
- A problem was fixed for a possible unexpected SRC BD70E510 with a core checkstop for an OCMB/DIMM failure with no DIMM callout. This is a low-frequency failure that only occurs when memory mirroring is disabled and an OCMB gets a PMIC fail. IBM support would be needed to determine if an OCMB was at fault for the checkstop. If an 'EQ_CORE_FIR(8)[14] MCHK received while ME=0 - non-recoverable' checkstop is seen that does not analyze to a root cause, MC_DSTL_FIR bits 0, 1, 4, and 5 could be checked in the log to determine if an OCMB was at fault (an illustrative bit-check sketch appears at the end of this section's fix list).
- A problem was fixed for partitions using SLES 15 SP4 and SP5 not being able to boot if Secure Boot is Enabled and Enforced for the Linux Operating System, with SRC BA540010 reported. If the OS Secure Boot setting is Enabled and Log Only, the partition will boot, but the error log BA540020 will be generated at every boot. With the fix, a new SLES Secure Boot key certificate has been added to the Partition Firmware code.
- A change was made for certain SR-IOV adapters to move up to the latest level of adapter firmware. This update contains important reliability improvements and security hardening enhancements. This change updates the adapter firmware to XX.34.1002 for the following Feature Codes and CCIN: #EC66/EC67 with CCIN 2CF3. If this adapter firmware level is concurrently applied, AIX and VIOS VFs may become failed. Certain levels of AIX and VIOS do not properly handle concurrent SR-IOV updates and can leave the virtual resources in a DEAD state. Please review the following document for further details: https://www.ibm.com/support/pages/node/6997885. A re-IPL of the system instead of concurrently updating the SR-IOV adapter firmware would also work to prevent a VF failure. Update instructions: https://www.ibm.com/docs/en/power10?topic=adapters-updating-sr-iov-adapter-firmware
- A problem was fixed for a timeout occurring for an SR-IOV adapter firmware LID load during an IPL, with SRC B400FF04 logged. This problem can occur if a system has a large number of SR-IOV adapters to initialize. The system recovers automatically when the boot completes for the SR-IOV adapter.
- A problem was fixed for an SR-IOV virtual function (VF) failing to configure for a Linux partition. This problem can occur if an SR-IOV adapter that had been in use on prior activation of the partition was removed and then replaced with an SR-IOV adapter VF with a different capacity. As a workaround, the partition with the failure can be rebooted.
- A problem was fixed for unexpected vNIC failovers that can occur if all vNIC backing devices are in LinkDown status. This problem is very rare and occurs only if both vNIC server backing devices are in LinkDown status, causing vNIC failovers that bounce back and forth in a loop until one of the vNIC backing devices returns to Operational status.
- A problem was fixed for Power Systems Private Cloud with Shared Utility Capacity (formerly known as Power Enterprise Pools 2.0 (PEP 2.0)) for a "Throttled" indicator that is missing on the HMC. PEP 2.0 throttling occurs if PEP 2.0 expiration has occurred. This is a rare event as most customers have automatic PEP 2.0 renewal and those that do not are notified prior to expiration that their PEP 2.0 is about to expire. Also, the throttling causes a performance degradation that should be noticeable.
- A problem was fixed for missing countdown expiration messages after a renewal of PEP 2.0. Power Enterprise Pools 2.0 (PEP 2.0), also known as Power Systems Private Cloud with Shared Utility Capacity, normally has automatic renewal, but if this does not occur for some reason, expiration of PEP 2.0 should be warned by countdown messages before expiration and by daily messages after expiration. As a workaround, the CMC appliance can be examined to see the current status of the PEP 2.0 subscription.
- A problem was fixed for a performance issue after PEP 2.0 throttling or usage of the optmem HMC command.
- This issue can be triggered by the following scenario for Power Enterprise Pools 2.0 (PEP 2.0), also known as Power Systems Private Cloud with Shared Utility Capacity:
- Due to a PEP 2.0 budget being reached or an issue with licensing for the pool, the CPU resources may be restricted (throttled).
- At the start of the next month, after a change in the budget limit or after correction of the licensing issue, the CPU resources will be returned to the server (un-throttled).
- At this point in time, the performance of the PEP 2.0 pool may not return to the level of performance before throttling.
- As a workaround, partitions and VIOS can be restarted to restore the performance to the expected levels. Although this fix applies concurrently, a restart of partitions or VIOS would need to be done to correct the system performance if it has been affected.
- A problem was fixed for an erroneous notification from the HMC that a PEP 2.0 workload is being throttled.
- Any system with Power Enterprise Pools 2.0 (PEP 2.0) enabled, also known as Power Systems Private Cloud with Shared Utility Capacity, may get a false throttle notification if the FW1010.50 firmware level had been activated concurrently. As a workaround, customers can call IBM service to get a renewal key which will clear the throttle indicator.
- A problem was fixed for a system with Power Enterprise Pools 2.0 (PEP 2.0) enabled, also known as Power Systems Private Cloud with Shared Utility Capacity, for an incorrect CoD history log entry on the HMC showing “0” authorized days for a PEP 2.0 activation history log entry. This can happen after applying a start/renewal PEP 2.0 activation code with designated proc support. However, a pop-up notification after applying the activation will show the correct number of authorized days. The "authorized days" is the number of authorized metered days for that activation. The error is only in what is logged in the history entry with no further impacts to the system as the firmware correctly applies the activation code for the correct number of authorized days provided in the activation code.
- A problem was fixed for the HMC Repair and Verify (R&V) procedure failing during concurrent maintenance of the #EMX0 Cable Card. This problem can occur if a partition is IPLed after a hardware failure before attempting the R&V operation. As a workaround, the R&V can be performed with the affected partition powered off or the system powered off.
- A problem was fixed for a possible incomplete state for the HMC-managed system with SRCs B17BE434 and B182953C logged, with the PowerVM hypervisor hung. This error can occur if a system has a dedicated processor partition configured to not allow processor sharing while active.
- A problem was fixed for incorrect SRC callouts being logged for link train failures on the Cable Card to Drawer PCIe link. SRC B7006A32 is logged for the link train failure when SRC B7006AA9 should be logged instead, and B7006A32 calls out the cable card/PHB/planar when B7006AA9 should be calling out the cable card/cables/drawer module. Every link train failure on the Cable Card to Drawer PCIe link can cause this issue.
- A problem was fixed for the following SRCs missing a callout for the PCIe Extender (PCIEXTN): B400FF01, B400FF07, B400FF08, and B7006920. If a problem exists with the PCIe Extender card which results in one of these SRCs, the failing PCIe Extender will not be identified in the FRU callout list. As a workaround, replace the extender card if the existing FRU callout list does not resolve the issue for impacted SRCs.
- A problem was fixed for SRC B7006A99 being logged as a Predictive error calling out cable hardware when no cable replacement is needed. This SRC does not have an impact on PCIe function; with the fix, it is logged as Informational to prevent unnecessary service actions for this non-functional error.
- A problem was fixed for an IBM i partition dump failing with an SRC B2008105. This may happen on IBM i partitions running v7r4 or newer and running with more than 64 virtual processors. It requires at least one DLPAR remove of a virtual processor followed by a partition dump sometime afterward. The problem can be avoided if DLPAR remove of virtual processors is not performed for the IBM i partition.
- If the problem is encountered, either the fix can be installed and the dump retried, or if the fix is not installed, the partition dump can be retried repeatedly until it succeeds.
- A problem was fixed for incomplete descriptions for the display of devices attached to the FC adapter in SMS menus. The FC LUNs are displayed using this path in SMS menus: "SMS -> I/O Device Information -> SAN -> FCP -> <FC adapter>". This problem occurs if there are LUNs in the SAN that are not OPEN-able, which prevents the detailed descriptions from being shown for that device.
- A problem was fixed for newly delivered systems having dumps on them from a manufacturing process that was trying to read blank registry keys. These dumps can be ignored.
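For the OCMB/DIMM checkstop entry above that mentions checking MC_DSTL_FIR bits 0, 1, 4, and 5, the following is a minimal sketch of such a bit check, illustration only. It assumes the IBM convention that FIR bits are numbered from the most-significant (leftmost) bit of the 64-bit register; the function and variable names are not from this document.

# Illustrative only: checks whether MC_DSTL_FIR bits 0, 1, 4, or 5 are set in a
# 64-bit FIR value taken from the error log. Assumes IBM FIR bit numbering,
# where bit 0 is the most-significant (leftmost) bit.
OCMB_SUSPECT_BITS = (0, 1, 4, 5)

def fir_bit_set(fir_value: int, bit: int, width: int = 64) -> bool:
    """Return True if the given (MSB-numbered) bit is set in fir_value."""
    return (fir_value >> (width - 1 - bit)) & 1 == 1

def ocmb_suspected(mc_dstl_fir: int) -> bool:
    """Return True if any of the OCMB-related MC_DSTL_FIR bits are set."""
    return any(fir_bit_set(mc_dstl_fir, b) for b in OCMB_SUSPECT_BITS)

print(ocmb_suspected(0x8000000000000000))  # bit 0 set  -> True
print(ocmb_suspected(0x0000000000000001))  # bit 63 set -> False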
|
MH1010_163_094 / FW1010.51
2023/05/17
|
Impact: Security Severity: HIPER
System Firmware changes that affect all systems
-
HIPER/Pervasive: An internally discovered vulnerability in PowerVM on Power9 and Power10 systems could allow an attacker with privileged user access to a logical partition to perform an undetected violation of the isolation between logical partitions which could lead to data leakage or the execution of arbitrary code in other logical partitions on the same physical server. The Common Vulnerability and Exposure number is CVE-2023-30438. For additional information refer to https://www.ibm.com/support/pages/node/6987797 .
-
A problem was fixed for the ASMI failing to load when using the Firefox browser for a stand-alone ASMI session, or loading only partially when connecting to ASMI from the HMC proxy.
-
A problem was identified internally by IBM related to SRIOV virtual function support in PowerVM. An attacker with privileged user access to a logical partition that has an assigned SRIOV virtual function (VF) may be able to create a Denial of Service of the VF assigned to other logical partitions on the same physical server and/or undetected arbitrary data corruption. The Common Vulnerability and Exposure number is CVE-2023-30440.
|
MH1010_151_094 / FW1010.50
2023/03/17
|
-
New features and functions
Support for using a Redfish (REST) API to gather power usage for all nodes in watts and the ambient temperature for the system.
The Redfish sample response is as shown below:
==>> GET redfish/v1/Systems/<>
...
"Oem": {
"IBMEnterpriseComputerSystem": {
...
...
"PowerInputWatts" : <> ( number in watts), <<<<============
"AmbientTemp" : <> (number in Celsius) <<<<============
}
},
...
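As a usage illustration, a small Python sketch for reading these fields follows. The host name, credentials, and system ID are placeholders (not from this document), and the Redfish system member would normally be discovered by first listing /redfish/v1/Systems; only the "Oem"/"IBMEnterpriseComputerSystem" field names come from the sample response above.

import requests

FSP_HOST = "https://service-processor.example.com"   # placeholder address
SYSTEM_ID = "<system-id>"                             # placeholder; discover via GET /redfish/v1/Systems

resp = requests.get(
    f"{FSP_HOST}/redfish/v1/Systems/{SYSTEM_ID}",
    auth=("admin", "password"),    # placeholder credentials
    verify=False,                  # service processors commonly use self-signed certificates
    timeout=30,
)
resp.raise_for_status()
oem = resp.json()["Oem"]["IBMEnterpriseComputerSystem"]
print("PowerInputWatts:", oem["PowerInputWatts"])     # power usage for all nodes, in watts
print("AmbientTemp:", oem["AmbientTemp"])             # ambient temperature, in Celsius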
System firmware changes that affect all systems
-
HIPER/Non-Pervasive: If a partition running in Power9 or Power10 compatibility mode encounters an uncorrectable memory error during a Dynamic Platform Optimization (DPO), memory guard, or memory mirroring defragmentation operation, undetected data corruption may occur in any partition(s) within the system or the system may terminate with SRC B700F105.
-
HIPER/Non-Pervasive: If a partition running in Power9 compatibility mode encounters memory errors and a Live Partition Mobility (LPM) operation is subsequently initiated for that partition, undetected data corruption within GZIP operations (via hardware acceleration) may occur within that specific partition.
-
HIPER/Non-Pervasive: If a partition with dedicated maximum processors set to 1 is shutting down or in a failed state while another partition is activating or DLPAR adding a processor, the system may terminate with SRC B700F103, B700F105, or B111E504 or undetected partition data corruption may occur if triggered by:
-
- Partition DLPAR memory add
- Partition activation
- Dynamic Platform Optimization (DPO)
- Memory guard
- Memory mirroring defragmentation
- Live Partition Mobility (LPM)
-
DEFERRED: For a system with I/O Enlarged Capacity enabled and PCIe expansion drawers attached, a problem was fixed for the hypervisor using unnecessarily large amounts of storage that could result in system termination. This happens because extra memory is allocated for the external I/O drawers which should have been excluded from "I/O Enlarged Capacity". This problem can be avoided by not enabling "I/O Enlarged Capacity". This fix requires an IPL to take effect because the Huge Dynamic DMA Window capability (HDDW) TCE tables for the I/O memory are allocated during the IPL.
-
DEFERRED: For a multi-node system, a problem was fixed for the wrong processor configurations being sent to each chip's Self-Boot Engine (SBE). With this incorrect knowledge, at the start of a memory-preserving reboot (MPIPL), the SBEs may fail to wait for other nodes to quiesce, causing non-deterministic errors. If this error occurs, the system should auto-recover, but an MPIPL dump could be lost on the re-IPL.
-
DEFERRED: A problem was fixed for false PMIC N mode fails for select DDIMMs. Data between PMIC2 and PMIC3 was swapped, which caused a current imbalance error to be reported and led to N mode fails in the PMIC health check telemetry log. The error is more likely to show up on RCD-less DDIMMs.
-
A change was made to reduce the number of hidden logs when doing fabric hang recovery. This has very little impact on the system other than to reduce system time spent on creating unneeded logs, but the fix is listed here because it changes the SBE firmware (SBE changes result in a slightly longer firmware update time).
-
A security problem was fixed for a scenario where the IBM PowerVM Hypervisor could allow an attacker to obtain sensitive information if they gain service access to the HMC.
-
Security problems were fixed for the FSP ASMI GUI for security vulnerabilities CVE-2022-4304 (an attacker who can send a high volume of requests to the FSP and has large amounts of processing power can retrieve a plaintext password) and CVE-2022-4450 (the administrator can crash the web server when uploading an HTTPS certificate). For CVE-2022-4304, the vulnerability is exposed whenever the FSP is on the network. For CVE-2022-4450, the vulnerability is exposed if the FSP administrator uploads a malicious certificate.
The Common Vulnerabilities and Exposures issue numbers for these problems are CVE-2022-4304 and CVE-2022-4450.
-
A problem was fixed for a security scan with NSFOCUS reporting a medium-level vulnerability for a slow HTTPS request denial of service attack against ASMI. This occurs whenever NSFOCUS scans are run.
-
A problem was fixed for a security scan with NSFOCUS reporting the following low-priority vulnerabilities:
1. Low. Web server enabled "options"
2. Low. Response no "Referrer-Policy" header
3. Low. Response no "X-Permitted-Cross-Domain-Policies" header
4. Low. Response no "X-Download-Options" header
5. Low. Response no "Content-Security-Policy" header
There is no impact to the system from these as the FSP service processor does not provide any features which can be exploited by the five vulnerabilities.
-
A problem was fixed for the ASMI SMP cable validation not being able to detect cross-plugged SMP cables. This always occurs if the cross-plugged SMP cables are of different lengths.
-
A problem was fixed for the NVMe drive identify LED lights not being lit when service is required. As a workaround, the location code of the drive should be used to locate the drive when doing the repair operation.
-
A problem was fixed for the digital power system sweep (DPSS) not doing a self-recovery from corruption when SRC 1100D00C is logged. As a workaround, a reset of the FSP will re-download the DPSS code to correct the corruption.
-
A problem was fixed for an errant concurrent firmware update that results in a deconfigured FSP. This is a rare error that can occur if an FSP runs out of memory during the code update while a firmware file is being updated on it. If this problem occurs, the failed FSP can be recovered by doing a disruptive firmware update to get the levels back to the old driver level. Then clear the FSP deconfiguration and do an AC cycle or pinhole reset.
-
A problem was fixed for performance slowdowns that can occur during the Live Partition Mobility (LPM) migration of a partition in POWER9, POWER10, or default processor compatibility modes. For this to happen to a partition in default processor compatibility mode, it must have booted on a Power10 system. If this problem occurs, the performance will return to normal after the partition migration completes. As a workaround, the partition to be migrated can be put into POWER9_base processor compatibility mode or older.
-
A problem was fixed for an SR-IOV adapter showing up as "n/a" on the HMC's Hardware Virtualized I/O menu. This is an infrequent error that can occur if an I/O drawer is moved to a different parent slot. As a workaround, the PowerVM Hypervisor NVRAM can be cleared or the I/O drawer can be moved back to the original parent slot to clean up the configuration.
-
A problem was fixed for a resource dump (rscdump) having incorrect release information in the dump header. There is a four-character length pre-pended to the value and the last four characters of the release are truncated. This problem was introduced in Power 10.
-
A problem was fixed for too frequent callouts for repair action for recoverable errors for Predictive Error (PE) SRCs B7006A72, B7006A74, and B7006A75. These SRCs for PCIe correctable error events called for a repair action but the threshold for the events was too low for a recoverable error that does not impact the system. The threshold for triggering the PE SRCs has been increased for all PLX and non-PLX switch correctable errors.
-
A problem was fixed for not being able to reduce partition memory when the PowerVM hypervisor has insufficient memory for normal operations. With the fix, a partition configuration change to reduce memory is allowed when the hypervisor has insufficient memory. A possible workaround for this error is to free up system memory by deleting a partition.
-
A problem was fixed for Power Systems Private Cloud with Shared Utility Capacity (formerly known as Power Enterprise Pools 2.0) to change system throttling from immediate to gradual over 20 days if this service is not renewed and the system becomes non-compliant. This change provides more time for the system administrator to resolve the compliance issue before jobs running on the system are impacted by the reduced resources. Once the system has become non-compliant, the number of cores available will be reduced daily over 20 days until the system is back to a base level (an illustrative ramp-down sketch appears at the end of this list of changes).
-
A problem was fixed for a DLPAR remove of an adapter from a partition that could leave the adapter unusable for another partition on a DLPAR add.
-
A problem was fixed for Power Enterprise Pools (PEP) 1.0 where, when making processor (proc) changes for the partitions, fewer procs are reported as available for assignment to partitions than there actually should be. This can happen on systems with more IFL activations available than in use. These IFL activations can cause confusion in the calculation of the number of GP procs available for partitions because the GP procs are not properly counted as "unreturned" resources when PEP 1.0 procs are removed.
This issue can be fixed by a re-IPL of the system to reset the miscalculated proc amounts, or by reapplying the PEP 1.0 procs in this situation.
-
For a system that has I/O Enlarged Capacity enabled, more than 8 TB of memory, and an adapter in SR-IOV shared mode, a problem was fixed for partition or system termination for a failed memory page relocation. This can occur if the SR-IOV adapter is assigned to a VIOS and virtualized to a client partition and then does an I/O DMA on a section of memory greater than 2 GB in size. This problem can be avoided by not enabling "I/O Enlarged Capacity".
-
A problem was fixed for cable card cable (PCIe3 Optical Cable Adapter for the PCIe3 Expansion Drawer) FRUs and location codes that may not appear in an Exchange FRU list during a service repair using the HMC. This prevents the Exchange FRU procedure from being started to complete the repair. This problem is triggered by scenarios in which cable card VPD is not or cannot be read (for example, cable card swap for an invalid configuration). These scenarios would lead to cable card ports not being added to the Location Code Maps in the PowerVM hypervisor. The presence of these location codes is required for the HMC Service Focal Point (SFP) to show them on the service panels.
-
A problem was fixed for an incorrect capacity displayed for a Fibre Channel device using SMS option "I/O Device Information". This happens every time for a device that has a capacity greater than 2 TB. For this case, the capacity value displayed may be significantly less than 2 TB. For example, a 2 TB device would be shown as having a capacity of 485 GB.
-
A problem was fixed for not all adapter ports being displayed when using the System Management Service (SMS) menu option I/O Device Information to display Fibre Channel devices that support NVMe over Fabric. The host NVMe Qualified Name (NQN) value may not be displayed either. The problem is caused by using SMS I/O Device Information to display FC NVMe over Fabric adapter ports and is dependent on the number of ports assigned to the logical partition. This issue is only seen when using I/O Device Information. All ports are correctly displayed when attempting to select a boot device or when setting the boot device list from SMS.
-
A problem was fixed for a partition firmware data storage error with SRC BA210003 logged or for a failure to locate NVMe target namespaces when attempting to access NVMe devices over Fibre Channel (FC-NVME) SANs connected to third-party vendor storage systems. This error condition, if it occurs, prevents firmware from accessing NVMe namespaces over FC as described in the following scenarios:
1) Boot attempts from an NVMe namespace over FC using the current SMS bootlist could fail.
2) From SMS menus via option 3 - I/O Device Information - no devices can be found when attempting to view NVMe over FC devices.
3) From SMS menus via option 5 - Select Boot Options - no bootable devices can be found when attempting to view and select an NVMe over FC bootable device for the purpose of boot, viewing the current device order, or modifying the boot device order.
The trigger for the problem is attempted access of NVMe namespaces over Fibre Channel SANs connected to storage systems via one of the scenarios listed above. The frequency of this problem can be high for some of the vendor storage systems.
-
A problem was fixed for an HMC lpar_netboot error for a partition with a VNIC configuration. The lpar_netboot logs show a timeout due to a missing value. As a workaround, doing the boot manually in SMS works. The lpar_netboot could also work as long as broadcast bootp is not used, but instead use lpar_netboot with a standard set of parameters that include Client, Server, and Gateway IP addresses.
-
A problem was fixed to prevent a predictive callout and guard of a processor on the first occurrence of a processor core recoverable error with FIR bits (INT_CQ_FIR[47:50]) set. This is a recoverable array error in the interrupt unit of the core that should not be called out and guarded until a certain threshold of these errors is exceeded. The SRC is B113E504 but the FIR bits in the log need to be checked to determine that this is the problem. With the fix, the threshold for the error has been set to 32 per day before there is a predictive callout and guard of the errant core.
-
A problem was fixed to prevent unnecessary predictive core guards caused by PCIe I/O errors with BC70E540 SRCs logged for an "L2FIR[13] = NCU_POWERBUS_DATA_TIMEOUT" error. This was a secondary fault of PCIe I/O errors and not a true processor core timeout.
-
A problem was fixed to isolate the core from the Matrix-Multiply Assist (MMA) for the purpose of determining core health. Without this fix, an MMA in an unavailable state could cause a core to be guarded, even though the core was otherwise usable and good.
-
A problem was fixed to allow core recovery to handle recoverable processor core errors without thresholding in the hypervisor. The thresholding can cause a system checkstop and an unnecessary guard of a core. Core recovery was also changed to not threshold a processor core recoverable error with FIR bit (EQ_CORE_FIR[37]) set if LSU_HOLD_OUT_REG7[4:5] has a non-zero value.
-
A problem was fixed for an SR-IOV adapter virtual function (VF) not being accessible by the OS after a reboot or immediate restart of the logical partition (LPAR) owning the VF. This can happen for SR-IOV adapters located in PCIe3 expansion drawers as they are not being fully reset on the shutdown of a partition. As a workaround, do not do an immediate restart of an LPAR - leave the LPAR shut down for more than a minute so that the VF can quiesce before restarting the LPAR.
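For the gradual Shared Utility Capacity (PEP 2.0) throttling change earlier in this list, the sketch below illustrates a 20-day ramp-down from the cores currently in use to the base activated cores. A simple linear daily reduction is assumed purely for illustration; the release note does not specify the actual reduction schedule, and the names and numbers below are hypothetical.

def cores_available(day, current_cores, base_cores, ramp_days=20):
    # Assumed linear daily reduction over ramp_days; illustration only.
    if day <= 0:
        return current_cores
    if day >= ramp_days:
        return base_cores
    step = (current_cores - base_cores) / ramp_days
    return max(base_cores, round(current_cores - step * day))

# Example: a pool using 40 cores against a base of 8 permanently activated cores.
for day in (0, 5, 10, 20):
    print(day, cores_available(day, current_cores=40, base_cores=8))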
System firmware changes that affect certain systems
-
For a system with an IBM i partition, a problem was fixed for the IBM i 60-day "Trial 5250" function not working. The "Trial 5250" is only needed for the case of an incomplete system order that results in the IBM i 100% 5250 feature being missing. Since the "Trial 5250" is temporary and valid for only 60 days, an order for the permanent 5250 feature is needed to fully resolve the problem.
|
MH1010_146_094 / FW1010.40
2022/10/31 |
Impact: Security Severity: HIPER
System firmware changes that affect all systems
- HIPER/Pervasive: The following problems were fixed for certain SR-IOV adapters in shared mode when the physical port is configured for Virtual Ethernet Port Aggregator (VEPA):
1) A security problem for CVE-2022-34331 was addressed where switches configured to monitor network traffic for malicious activity are not effective because of errant adapter configuration changes. The misconfigured adapter can cause network traffic to flow directly between the VFs and not out the physical port, hence bypassing any possible monitoring that could be configured in the switch.
2) Packets may not be forwarded after a firmware update, or after certain error scenarios which require an adapter reset. Users configuring or using VEPA mode should install this update.
These fixes pertain to adapters with the following Feature Codes and CCINs: #EC2R/EC2S with CCIN 58FA; #EC2T/EC2U with CCIN 58FB; and #EC66/EC67 with CCIN 2CF3.
Update instructions: https://www.ibm.com/docs/en/power10?topic=updates-sr-iov-firmware-update
- HIPER/Pervasive: A problem was fixed for intermittent PCIe adapter failures during an IPL with SRC B7006976 logged. The #EN1J/#EN1K PCIe4 32 GB 2-port Optical Fibre Channel adapters may fail during link training. If a failure occurs, the adapter will not be able to be used until a restart of the LPAR is done or a DLPAR is done to do a remove/add for the failed adapter slot.
- Security problems were fixed for vTPM 1.2 by updating its OpenSSL library to version 0.9.8zh. Security vulnerabilities CVE-2022-0778, CVE-2018-5407, CVE-2014-0076, and CVE-2009-3245 were addressed. These problems only impact a partition if vTPM version 1.2 is enabled for the partition.
- A security problem was fixed for vTPM 2.0 by updating its libtpms library. Security vulnerability CVE-2021-3746 was addressed. This problem only impacts a partition if vTPM version 2.0 is enabled for the partition. The biggest threat from this vulnerability is system availability.
- A change was made for DDIMM operation to comply with the DRAM controller requirement to disable periodic ZQ calibration during a concurrent row repair operation and then restore it afterward. The change improves resiliency against possible memory errors during the row repair operation.
- A change was made for certain SR-IOV adapters to move up to the latest level of adapter firmware. No specific adapter problems were addressed at this new level. This change updates the adapter firmware to XX.32.1010 for the following Feature Codes and CCINs: #EC2R/EC2S with CCIN 58FA; #EC2T/EC2U with CCIN 58FB; and #EC66/EC67 with CCIN 2CF3.
Update instructions: https://www.ibm.com/docs/en/power10?topic=updates-sr-iov-firmware-update
- A problem was fixed for a factory reset failing to restore "Aggressive Prefetch" to the default of "Disabled". After a factory reset, the setting for "Aggressive Prefetch" was preserved from what it was before the factory reset. The ASMI menu can be used to disable the "Aggressive Prefetch" mode.
- A problem was fixed for an intermittent service processor core dump for MboxDeviceMsg with SRCs B1818601 and B6008601 logged while the system is running. This is a timing failure related to a double file close on an NVRAM file. The service processor will automatically recover from this error with no impact on the system.
- A problem was fixed for a ramp-up on fan speeds across all nodes when only one node is running hot. This happens whenever temperatures run high on a single node of a multi-node system.
- A problem was fixed for an LPAR activation failure with SRC B2001236 logged for an NVRAM decryption error because of a bad NVRAM key. This can occur for a partition with vTPM 2.0 configured and Platform Keystore (PKS) not configured if there has been a partition recovery using the HMC. After the partition recovery and if the partition is allowed to activate before powering off the system, the partition may fail to activate on the next IPL attempt with B2001236 logged. The workaround on a partition recovery activation is to allow the partition to activate to run long enough for data to be written to the NVRAM, which will flush the vTPM 2.0 data to the service processor with the correct NVRAM key.
- A problem was fixed for an SR-IOV adapter in shared mode failing on an IPL with SRC B2006002 logged. This is an infrequent error caused by a different SR-IOV adapter than expected being associated with the slot because of the same memory buffer being used by two SR-IOV adapters. The failed SR-IOV adapter can be powered on again and it should boot correctly.
- A problem was fixed for a PCIe3 I/O Expansion Drawer not activating with only a single cable attached after Concurrent Maintenance or an IPL. When a #EJ24 cable card with CCIN 6B53 is in an x8 CEC slot and only the low cable is connected (high cable is disconnected), the PCIe connections will not activate.
The workaround is to attach both cables and then retry the operation.
- A problem was fixed for a partition with VPMEM failing to activate after a system IPL with SRC B2001230 logged for a "HypervisorDisallowsIPL" condition. This problem is very rare and is triggered by the partition's hardware page table (HPT) being too big to fit into a contiguous space in memory. As a workaround, the problem can be averted by reducing the memory needed for the HPT. For example, if the system memory is mirrored, the HPT size is doubled, so turning off mirroring is one option to save space. Or the size of the VPMEM LUN could be reduced. The goal of these options would be to free up enough contiguous blocks of memory to fit the partition's HPT size.
- A problem was fixed for a failed removal of a virtual ethernet adapter enabled as a trunk adapter in a VIOS. This happens on any attempt to remove this type of virtual ethernet adapter. Internally, a "Get Platform Info 0x010A" command from the HMC is returned with an unknown family instead of "Power 10", causing the removal error.
- A problem was fixed for an HMC incomplete state for the managed system after a concurrent firmware update. This is an infrequent error caused by an HMC query race condition while the concurrent update is rebooting tasks in the hypervisor. A system re-IPL is needed to recover from the error.
- A problem was fixed for a system crash with SRC B7000103 that can occur when adding or removing FRUs from a PCIe3 expansion drawer (Feature code #EMX0). This error is caused by a very rare race scenario when processing multiple power alerts from the expansion drawer at the same time.
- A problem was fixed for degraded performance for PCIe adapters with SRC 57B14160 logged. This happens more frequently for the IBM i OS partitions, triggered by a hot reset of the adapter during the IPL. The degraded performance may be recovered with an LPAR IPL, DLPAR, or a device reset through the OS. If this error is happening in the IBM i, the problem may recur on a re-IPL of the partition until this fix is installed.
- A problem was fixed for a system crash with a B700F103 logged after a local core checkstop of a core with a running partition. This infrequent error also requires a configuration change on the system like changing the processor configuration of the affected partition or running Dynamic Platform Optimizer (DPO).
- A problem was fixed for a rare system hang that can happen any time Dynamic Platform Optimizer (DPO), memory guard recovery, or memory mirroring defragmentation occurs for a dedicated processor partition running in Power9 or Power10 processor compatibility mode. This does not affect partitions in Power9_base or older processor compatibility modes. If the partition has the "Processor Sharing" setting set to "Always Allow" or "Allow when partition is active", it may be more likely to encounter this than if the setting is set to "Never allow" or "Allow when partition is inactive".
This problem can be avoided by using Power9_base processor compatibility mode for dedicated processor partitions. This can also be avoided by changing all dedicated processor partitions to use shared processors.
- A problem was fixed for a rare partition hang that can happen any time Dynamic Platform Optimizer (DPO), memory guard recovery, or memory mirroring defragmentation occurs for a shared processor partition running in any compatibility mode if there is also a dedicated processor partition running in Power9 or Power10 processor compatibility mode. This does not happen if the dedicated partition is in Power9_base or older processor compatibility modes. Also, if the dedicated partition has the "Processor Sharing" setting set to "Always Allow" or "Allow when partition is active", it may be more likely to cause a shared processor partition to hang than if the setting is set to "Never allow" or "Allow when partition is inactive".
This problem can be avoided by using Power9_base processor compatibility mode for any dedicated processor partitions. This problem can also be avoided by changing all dedicated processor partitions to use shared processors.
- A problem was fixed for too frequent callouts for repair action for recoverable errors for Predictive Error (PE) SRCs B7006A72, B7006A74, and B7006A75. These SRCs for PCIe correctable error events called for a repair action but the threshold for the events was too low for a recoverable error that does not impact the system. The threshold for triggering the PE SRCs has been increased.
- A problem was fixed for an SR-IOV adapter in shared mode failing during run time with SRC B400FF04 or B400F104 logged. This is an infrequent error and may result in a temporary loss of communication as the affected SR-IOV adapter is reset to recover from the error.
- A problem was fixed for an adapter port link not coming up after the port connection speed was set to "auto". This can happen if the speed had been changed to a supported but invalid value for the adapter hardware prior to changing the speed to "auto". A workaround to this problem is to disable and enable the switch port.
- A problem was fixed for the SMS menu option "I/O Device Information". When using a partition's SMS menu option "I/O Device Information" to list devices under a physical or virtual Fibre Channel adapter, the list may be missing or entries in the list may be confusing. If the list does not display, the following message is displayed:
"No SAN adapters present. Press any key to continue".
An example of a confusing entry in a list follows:
"Pathname: /vdevice/vfc-client@30000004
WorldWidePortName: 0123456789012345
1. 500173805d0c0110,0 Unrecognized device type: c"
- A problem was fixed for booting an OS using iSCSI from SMS menus that fails with a BA010013 information log. This failure is intermittent and infrequent. If the contents of the BA010013 are inspected, the following messages can be seen embedded within the log:
" iscsi_read: getISCSIpacket returned ERROR"
" updateSN: Old iSCSI Reply - target_tag, exp_tag"
- A problem was fixed for a failed NIM download/install of OS images that are greater than 32M. This only happens when using the default TFTP block size of 512 bytes. The latest versions of AIX are greater than 32M in size and can have this problem. As a workaround, in the SMS menu, change "TFTP blocksize" from 512 to 1024. To do this, go to the SMS "Advanced Setup: BOOTP" menu option when setting up NIM install parameters. This will allow a NIM download of an image up to 64M (a short sketch at the end of this list of changes illustrates where the 32M and 64M limits come from).
- A problem was fixed for a memory leak in the service processor (FSP) that can result in an out of memory (OOM) condition in the FSP kernel with an FSP dump and reset of the FSP. This can occur after the FSP has been active for more than 80 days of uptime. If the problem occurs, the system automatically recovers with a reset/reload of the FSP. This problem is more likely to occur on systems with NVMe adapters configured.
- A problem was fixed for errant DRAM memory row repairs. Row repair was going to the wrong address, or was not being cleared properly and then repaired with either a spare DRAM or a chip mark. These row repair failures put the system closer to a predictive callout of a DRAM.
- A problem was fixed for an IPL failure with SRC BD21E510 "MC_FIR(0)[1] MC internal non-recoverable error" logged. This is a rare early IPL failure for a Self Boot Engine (SBE) error. The problem can be recovered by retrying the IPL. The memory controller (MC) that is called out on the SRC should not be guarded when doing the retry of the IPL.
- A problem was fixed for Hostboot dumps not having a copy of Hostboot memory contents. This problem reduces the ability of IBM Support to debug certain classes of Hostboot failures. Any Hostboot crash or hang will trigger a Hostboot dump that will be missing the memory information.
- A problem was fixed for a post dump IPL failing and a system dump being lost following an abnormal system termination. This can only happen on a system when the system is going through a post dump IPL and there are not sufficient operational cores on the boot processor to support an IPL. This triggers resource recovery for the cores which can fail to restore the necessary cores if extra cores have been errantly deconfigured.
- A problem was fixed for a processor core being incorrectly predictively deconfigured with SRC BC13E504 logged. This is an infrequent error triggered by a cache line delete fail for the core with error log "Signature": "EQ_L2_FIR[0]: L2 Cache Read CE, Line Delete Failed".
- A problem was fixed for a possible system checkstop for a core hardware predictive error prematurely reaching a failure threshold. This can occur if there are multiple recovery events for a core hardware error with multiple hypervisor maintenance interrupts (HMIs) issued for each recovery event, causing the failure threshold to be reached earlier than needed. With the fix, only a single HMI is issued for each recovery event.
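For the NIM/TFTP entry above, the 32M and 64M figures match the classic TFTP limit of 65535 data blocks per transfer (a 16-bit block counter, assuming no block-number wraparound support), so the maximum image size scales with the negotiated block size. The short sketch below works out that arithmetic; it is an illustration only, not firmware code.

MAX_BLOCKS = 65535   # classic TFTP 16-bit block counter, assuming no wraparound support

def max_tftp_image_bytes(blocksize):
    return MAX_BLOCKS * blocksize

for blocksize in (512, 1024):
    mib = max_tftp_image_bytes(blocksize) / (1024 * 1024)
    print(f"TFTP blocksize {blocksize} bytes -> maximum image of about {mib:.0f} MiB")
# 512-byte blocks  -> about 32 MiB, so images larger than 32M fail
# 1024-byte blocks -> about 64 MiB, matching the documented 64M limit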
System firmware changes that affect certain systems
- A problem was fixed for an FSP service processor on a DHCP-configured network that could lose its dynamic IP address, leading to the FSP becoming inaccessible (if the redundant network is not configured). This issue is exposed when the DHCP server is not accessible when the DHCP lease expires on the service processor, which results in abandonment of the IP address. However, the expired IP address continues to be used to successfully access the FSP until the service processor is reset/rebooted. This reset typically occurs during a service processor firmware update, resulting in a failed firmware update.
|
MH1010_140_094 / FW1010.34
2022/08/26 |
Impact: Availability Severity: HIPER
System firmware changes that affect all systems
- HIPER/Pervasive: A problem was fixed for an issue attempting to recover from a processor core error. The failed recovery escalates to either a system checkstop or a processor core hang. The system checkstop is reported with SRC B113E504 or B181E540. The processor core hang has been observed as a partition hang and SRC B200F007 is reported when the partition fails to shut down. The issue may also result in a partition crash or HMC Incomplete. With this fix, the processor core recovery will work correctly with no effect on the system.
|
MH1010_135_094 / FW1010.32
2022/07/14 |
Impact: Availability Severity: HIPER
System firmware changes that affect all systems
- HIPER/Pervasive: A problem was fixed for a system hang during the concurrent code update of FW1010.31 and another problem was fixed for a potential impact to performance following any concurrent code update. If the server has booted with FW1010.31, then there is no need to install FW1010.32. If the server has applied FW1010.31 concurrently and not booted on this level, then IBM recommends applying FW1010.32 or perform a system reboot on FW1010.31 to avoid the potential performance impact. If the server is running a level prior to FW1010.31, then IBM strongly recommends installing FW1010.32 to address these and other HIPER issues fixed in FW1010.31.
|
MH1010_132_094 / FW1010.31
2022/07/01 |
Impact: Data Severity: HIPER
New Features and Functions
- HIPER/Pervasive: For systems with Power Linux partitions, support was added for a new Linux secure boot key. The support for the new secure boot key for Linux partitions may cause secure boot for Linux to fail if the Linux OS for SUSE or RHEL distributions does not have a secure boot key update.
The affected Linux distributions, which need the Linux fix level that includes "Key for secure boot signing grub2 builds ppc64le", are as follows:
1) SLES 15 SP4 - The GA for this Linux level includes the secure boot fix.
2) RHEL 8.5 - This Linux level has no fix. The user must update to RHEL 8.6 or RHEL 9.0.
3) RHEL 8.6
4) RHEL 9.0.
The update to a Linux level that supports the new secure boot key also addresses the following security issues in Linux GRUB2, which are the reasons the change in secure boot key is needed, as documented in the following six CVEs:
1) CVE-2021-3695
2) CVE-2022-28733
3) CVE-2022-28734
4) CVE-2022-28735
5) CVE-2022-28736
6) CVE-2022-28737
Please note that when this firmware level of FW1010.31 is applied, any Linux OS not updated to a secure boot fix level will fail to secure boot. Likewise, any Linux OS partition updated to a fix level for secure boot requires a minimum firmware level of FW1010.30 or later to be able to do a secure boot. If FW1010.30, FW1010.31, or later is not installed but the Linux fix levels for secure boot are loaded for the Linux partition, the secure boot failure that occurs will have BA540010 logged. If secure boot verification is enabled, but not enforced (log only mode), then the fixed Linux partition boots, but a BA540020 informational error will be logged.
- Support was added for new memory refresh settings to enhance reliability for new systems shipped from manufacturing. Existing systems will pick up the enhancement on the IPL following the application of this firmware level. There is no change in system performance due to this enhancement.
- Support was added for a new Advanced System Management Interface (ASMI) System Configuration panel for Prefetch settings to enable or disable an alternate configuration of the processor core/nest to favor more aggressive prefetching behavior for the cache. "Aggressive Prefetch" is disabled by default and a change to enable it must be done at service processor standby. The default behavior of the system ("Aggressive Prefetch" disabled) will not change in any way with this new feature. The customer will need to power off and enable "Aggressive Prefetch" in ASMI to get the new behavior. Only change the "Aggressive Prefetch" value if instructed by support or if recommended by a solution vendor as it might cause degraded system performance.
System firmware changes that affect all systems
- HIPER/Pervasive: A problem was fixed for an issue that may cause undetected corruption of the Translation Look Aside Buffer (TLB). This could result in undetected data corruption or a system crash.
- HIPER/Pervasive: A problem was fixed for an issue where a register file soft error could result in undetected data corruption or a system crash. If a soft error is detected a log will be generated.
- HIPER/Pervasive: A problem was fixed for a checkstop with SRC B113E504 logged that could occur for a recoverable core event anytime after a concurrent code update has been performed. If this service pack is not installed, then a system IPL is required to eliminate the exposure.
- HIPER/Pervasive: A problem was fixed for a recoverable processor core error which fails to recover and causes a system checkstop with SRC B113E504 or B181E540 logged. With the fix, the core recovery is successful with no impact to the running workload.
- HIPER/Non-Pervasive: A problem was fixed for possible undetected data corruption, or a hardware checkstop. In IBM internal testing, it was found that the execution of the new Power10 STXVP instruction may cause undetected data corruption, or a hardware detected error reported with reference code B111E540 in certain instances.
The following applications on AIX 7.3 and/or Linux are currently known to be exposed:
OpenBLAS 0.3.12
ESSL 7.1
Eigen 3.4
Applications compiled with Open XL v17.1.0, GCC V10/V11, or CLANG/LLVM 12, 13, or 14
Any other applications exploiting the Power10 STXVP instruction.
- A security problem was fixed for a flaw in OpenSSL certificate parsing that could result in an infinite loop in the hypervisor, causing a hang in a Live Partition Mobility (LPM) target partition. The trigger for this failure is an LPM migration of a partition with a corrupted physical trusted platform module (pTPM) certificate. This is expected to be a rare problem. The Common Vulnerability and Exposure number for this problem is CVE-2022-0778.
- A problem was fixed for a potential performance impact for systems that have Lateral Cast Out Control set to disabled. This problem can occur when a processor is deconfigured. Performing a re-IPL of the system will recover from this problem.
- A problem was fixed so that Service Location Protocol (SLP) is disabled by default for a newly shipped system, SLP is disabled by a reset to manufacturing defaults on all systems, and SLP is also disabled on all systems when this fix is applied by the firmware update. The SLP configuration change has been made to reduce memory usage on the service processor by disabling a service that is not needed for normal system operations. In the case where SLP does need to be enabled, the SLP setting can be changed using ASMI with the options "ASMI -> System Configuration -> Security -> External Services Management" to enable or disable the service. Without this fix, resetting to manufacturing defaults from ASMI does not change the SLP setting that is currently active.
- A problem was fixed for a missing warning in the ASMI Power On/Off menu that a power off while system dump is in progress will cause a truncated dump. The warning is displayed correctly in the ASMI Immediate Power Off menu. This fix also adds a warning that a power off should not be performed when a firmware update is in progress.
- A problem was fixed for a rare service processor core dump for NetsCommonMsgServer with SRC B1818611 logged that can occur when doing an AC power-on of the system. This error does not have a system impact beyond the logging of the error as an auto-recovery happens.
- A problem was fixed for a partition reboot recovery for an adapter in SR-IOV shared mode that rebooted with an SR-IOV port missing. Prior to the reboot, this adapter had SR-IOV ports that failed and were removed after multiple adapter faults. This problem should only occur rarely, as it requires a sequence of multiple faults on an SR-IOV adapter in a short time interval to force the SR-IOV Virtual Function (VF) into the errant unrecoverable state. The missing SR-IOV port can be recovered for the partition by doing a remove and add of the failed adapter with DLPAR, or the system can be re-IPLed.
- A problem was fixed for an apparent hang in a partition shutdown where the HMC is stuck in a status of "shutting down" for the partition. This infrequent error is caused by a timing window during the system or partition power down where the HMC checks too soon and does not see the partition in the "Powered Off" state. However, the power off of the partition does complete even though the HMC does not acknowledge it. This error can be recovered by rebuilding the HMC representation of the managed system by following the below steps:
1) In the navigation area on the HMC, select Systems Management > Servers.
2) In the contents pane, select the required managed system.
3) Select Tasks > Operations > Rebuild.
4) Select Yes to refresh the internal representation of the managed system.
- A problem was fixed that could potentially impact the performance of a dedicated processor partition after DLPAR is used to dynamically remove a dedicated processor from the partition. This can affect all dedicated processor partitions but would more likely affect idle partitions or partitions set to share processors while active. Performing a re-IPL of the partition will recover from this problem.
- A problem was fixed for a PowerVM hypervisor task failure when using the "chhwres" command on the HMC to change an SR-IOV adapter firmware level to the alternate level with the "alternate_config" parameter. This problem can occur if NVRAM was in use by the adapter prior to the attempt to change the adapter firmware level. A re-IPL of the system is needed to recover from this error. Below is an example of an HMC command that can fail, along with the error message from the HMC:
chhwres -m d135a -r sriov --rsubtype adapter -o s -a "alternate_config=1,adapter_id=4"
HSCL129A The operation to switch the adapter in slot 4 to dedicated mode failed with the following errors:
HSCL1400 An error has occurred during the operation to the managed system. Try the task again.
- A problem was fixed for a concurrent core initialization operation failure during a concurrent firmware update. This problem can occur if a core has been deconfigured due to exceeding a recoverable error threshold. Performing a re-IPL of the system will recover from this problem.
- A problem was fixed for removing an unneeded callout for the PCIe adapter cassette extender card from eleven platform event logs with SRCs matching the B7006xxx pattern. This fix will prevent unnecessary hardware replacement. The PCIe adapter cassette has CCIN 6B91 and PN 02WF424. The following SRCs have been corrected to remove the unneeded callout: B7006977, B7006A2A, B7006A2B, B7006A75, B7006A88, B7006A93, B7006A98, B7006A9D, B7006AA1, B7006AA9, and B7006AB1.
Note: the PCIe adapter cassette is never the first callout as it always follows the cable card in the callout list.
- A problem was fixed for a penalty throttle for invalid AIX Key Entitlement date and PEP 2.0 activation attempts that blocks further activation attempts until there is a re-IPL of the system. This occurs if an activation code for these specific resources is improperly entered after five previous failed attempts. With the fix, the penalty throttle is cleared after one hour has expired, and then additional activations for the affected resources can be entered again. As a workaround, a re-IPL of the system clears the number of failed activation attempts, allowing new activations to be entered.
- A problem was fixed for a hypervisor task failure with SRC B7000602 logged when running debug macro "sbdumptrace -sbmgr -detail 2" to capture diagnostic data. The secure boot trace buffer is not aligned on a 16-byte boundary in memory which triggers the failure. With the fix, the hypervisor buffer dump utility is changed to recognize 8-byte aligned end of buffer boundaries.
- A problem was fixed for a hang in the IPL of the system when it is trying to power on. The problem is very infrequent and is caused by a slow response from the IIC bus when the IIC bus is busy with multiple requests. To recover from the problem, reset the service processor and try the IPL again.
- A problem was fixed for a failed correctable error recovery for a DIMM that causes a flood of SRC BC81E580 error logs and also can prevent dynamic memory deallocation from occurring for a hard memory error. This is a very rare problem caused by an unexpected number of correctable error symbols for the DIMM in the per-symbol counter registers.
- A problem was fixed for certain LPC clock failures not guarding the appropriate hardware. This problem could lead to repeated failures on subsequent reboots for a hard failure. It would also not prevent future service processor failovers, leading to more errors and long failure scenarios. This error is seen when there is an LPC clock failure on the redundant path for the backup service processor during an IPL.
- A problem was fixed for deconfigured ECO cores reducing the Workload Optimized Frequency (WOF) more than it should, thereby causing system performance to be reduced.
- A problem was fixed for the isolation, callouts, and guard for core errors that cause a system checkstop. When a core causes a system checkstop, the isolation of the core is invalid and there is no callout or guard of the failing core.
- A problem was fixed for an IPL failure with an RC_STOP_TRANSITION_PENDING hardware procedure error on a warm (memory-preserving) re-IPL of the system if there were certain processor cores deconfigured at runtime. For this problem to occur, a core must have been deconfigured at runtime prior to the re-IPL of the system. A workaround to this problem is to power off the system and then do a power on IPL.
- A problem was fixed for a checkstop that can occur on a warm (memory-preserving) re-IPL of the system if there were any processor cores deconfigured at runtime. For this problem to occur, a core must have been deconfigured at runtime prior to the re-IPL of the system. A workaround to this problem is to power off the system and then do a power on IPL.
- A problem was fixed for a hypervisor hang that can occur during concurrent firmware update resulting in an Incomplete managed system state on the HMC. The issue can occur when the Processor Sharing option for dedicated processor partitions is set to "Never Allow" or the system contains unlicensed processors. Exposure to this issue can be reduced by configuring the Processor Sharing option for dedicated processor partitions to "Allow Always".
- A problem was fixed for possible Serial Presence Detect (SPD) EEPROM corruption on a memory DIMM during certain power off scenarios, causing loss of a DIMM with SRC BC8A1D07, BC201D48, or B155A437 logged. This problem can occur for certain uncontrolled power off scenarios such as pulling the AC power cord when the system is powered on, or other loss of AC power while the system is running. If this problem happens, the failing memory DIMM must be replaced.
System firmware changes that affect certain systems
- For a system that does not have an HMC attached, a problem was fixed for a system dump 2GB or greater in size failing to offload to the OS with an SRC BA280000 logged in the OS and an SRC BA28003B logged on the service processor. This problem does not affect systems with an attached HMC since in that case system dumps are offloaded to the HMC, not the OS, where there is no 2GB boundary error for the dump size.
|
MH1010_122_094 / FW1010.22
2022/05/19 |
Impact: Availability Severity: HIPER
Special Note: If you have applied FW1010.20, FW1010.21, or FW1010.22 concurrently, a system reboot is strongly recommended. If a reboot is not done, your system could experience an unexpected outage. If a recoverable core event occurs anytime after a concurrent code update has been performed, the system will terminate. A system IPL will eliminate the exposure.
System firmware changes that affect all systems
- HIPER/Pervasive: A problem was fixed for loss of memory resources during the system IPL with SRCs BC20E504 and BC20090F logged and memory DIMMs deconfigured. This happens because of an intermittent failure during DIMM initialization. These memory errors can be recovered by clearing all the memory deconfiguration and then doing a re-IPL of the system. The problem has a greater likelihood of occurrence on servers at FW1010.20 or FW1010.21.
|
MH1010_120_094 / FW1010.21
2022/04/29 |
Impact: Data Severity: HIPER
System firmware changes that affect all systems
- HIPER/Non-Pervasive: A problem was fixed for possible undetected data corruption. In IBM internal testing, it was found that the execution of the new Power10 STXVP instruction may cause undetected data corruption in certain instances.
The following applications on AIX 7.3 and/or Linux are currently known to be exposed:
OpenBLAS 0.3.12
ESSL 7.1
Eigen 3.4
Applications compiled with Open XL v17.1.0, GCC V10/V11, or CLANG/LLVM 12, 13, 14
Any other applications exploiting the Power10 STXVP instruction.
- A change was made to modify a core error from a core checkstop to a system checkstop with SRC B113E504 logged. The core reporting the error will be deconfigured.
|
MH1010_117_094 / FW1010.20
2022/03/31 |
Impact: Availability Severity: SPE
New Features and Functions
- Support was added for an Advanced System Management Interface (ASMI) System Configuration panel option to disable or enable the system Lateral Cast-Out function (LCO). LCO is enabled by default and a change to disable it must be done at service processor standby. POWER processor chips since POWER7 have a feature called “Lateral Cast-Out” (LCO), enabled by default, where the contents of data cast out of one core’s L3 can be written into another core’s L3. Then if a core has a cache miss on its own L3, it can often find the needed data block in another local core’s L3. This has the useful effect of slightly increasing the length of time that a storage block gets to stay in a chip’s cache, providing a performance boost for most applications. However, for some applications such as SAP HANA, the performance can be better if LCO is disabled. More information on how LCO is being configured by SAP HANA can be found in the SAP HANA on Power Advanced Operation Guide manual that can be accessed using the following link:
http://ibm.biz/sap-linux-power-library
Follow the "SAP HANA Operation" link on this page to the "SAP HANA Operation Guides" folder. In this folder, locate the updated "SAP_HANA_on_Power_Advanced_Operation_Guide" manual that has a new topic added of "Manage IBM Power Lateral Cast Out settings" which provides the additional information.
The default behavior of the system (LCO enabled) will not change in any way by this new feature. The customer will need to power off and disable LCO in ASMI to get the new behavior.
- Support was added for Secure Boot for SUSE Linux Enterprise Server (SLES) partitions. The SUSE Linux level must be SLES 15 SP4 or later. Without this feature, partitions with SLES 15 SP4 or later and which have the OS Secure Boot partition property set to "Enabled and Enforced" will fail to boot. A workaround to this is to change the partition's Secure Boot setting in the HMC partition configuration to "Disabled" or "Enabled and Log only".
System firmware changes that affect all systems
- A problem was fixed for a possible unexpected SRC B1812641 logged if the system is powered off immediately after an IPL. The frequency of this problem is expected to be very rare because systems are not normally powered off immediately after powering on. If this SRC occurs in this scenario, it can be ignored.
- A problem was fixed for a logical partition failing to boot with an SRC B700F104 logged after a memory DDIMM power fault. This is a rare problem needing a double failure on the Power Management Integrated Circuit (PMIC) that handles memory DDIMM power regulation for the OpenCAPI Memory Buffer (OCMB). A re-IPL of the system is needed to recover from this problem.
- A problem was fixed for a firmware update error with "HSCF0180E Operation failed" displayed on the HMC with error code E302F854. This fix is only available for firmware updates from FW1010.20 to a later service pack. For firmware updates from earlier levels to FW1010.20, a failure is expected unless the following circumvention is performed: On the firmware update from the HMC, select the "Advanced options" to automatically accept the new code level. This is the default setting for an HMC at PTF levels MF69286 or MF69287 for HMC V10 R1 M1011. For earlier levels of the HMC, the automatically accept option must be manually changed to on when performing the code update, as it defaults to off. To do this, use the following steps:
1. When running the HMC code update wizard, click on "Advanced options".
2. From "Advanced options", select "Install and Activate (Implied retrieve)".
3. On the "Install and Activate panel", you will see the guidance text of "Select a LIC level type and accept option; then click OK." The two accept options displayed are as follows:
o Automatically accept
o Do Not automatically accept
To prevent the problem from occurring, the "Automatically accept" option must be selected.
- A problem was fixed for errors that can occur if doing a Live Partition Mobility (LPM) migration and a Dynamic Platform Optimizer (DPO) operation at the same time. The migration may abort or the system or partition may crash. This problem requires running multiple migrations and DPO at the same time. As a circumvention, do not use DPO while doing LPM migrations.
- A problem was fixed for a system hypervisor hang and an Incomplete state on the HMC after a logical partition (LPAR) is deleted that has an active virtual session from another LPAR. This problem happens every time an LPAR is deleted with an active virtual session. This is a rare problem because virtual sessions from an HMC (a more typical case) prevent an LPAR deletion until the virtual session is closed, but virtual sessions originating from another LPAR do not have the same check.
- A problem was fixed for vTPM 2.0 updates not being applied concurrently on a firmware update. The updates are applied after a reboot of the system.
- A problem was fixed for vague and misleading errors caused by using an invalid logical partition (LP) id for a resource dump request. With the fix, the invalid LP id is rejected immediately as a user input error instead of being processed by the main storage dump to create what appear to be severe errors.
- A problem was fixed for a partition with an SR-IOV logical port (VF) having a delay in the start of the partition. If the partition boot device is an SR-IOV logical port network device, this issue may result in the partition failing to boot with SRCs BA180010 and BA155102 logged and then becoming stuck on progress code SRC 2E49 for an AIX partition. This problem is infrequent because it requires multiple error conditions at the same time on the SR-IOV adapter. To trigger this problem, multiple SR-IOV logical ports for the same adapter must encounter EEH conditions at roughly the same time such that a new logical port EEH condition is occurring while a previous EEH condition's handling is almost complete but not yet notified to the hypervisor. To recover from this problem, reboot the partition.
- A problem was fixed for a secondary fault after a partition creation error that could result in a Terminate Immediate (TI) of the system with an SRC B700F103 logged. The failed partition creation that might trigger the secondary fault can be explicit or implicit. One example of an implicit partition create is the ghost partition created for a Live Partition Mobility (LPM) migration. This type of partition can fail to create when there is insufficient memory available for the hardware page table (HPT) for the new partition.
- A problem was fixed for an I/O adapter slot error when powering on the slot with SRC B4000202 and B400F104 logged. One example where this problem has been seen is when moving an SR-IOV adapter to shared mode. This problem is infrequent and can be recovered by retrying the operation that failed, such as DLPAR, starting the partition, or moving the SR-IOV adapter.
- A problem was fixed for a System Management Services (SMS) iSCSI information panel being incorrect and an SMS abort when navigating away from the panel. The iSCSI target and initiator names are not shown. The configured IP addresses to be used for an iSCSI boot are all zeroes even after they are set. Navigating away from the iSCSI information panel causes an SMS abort. This problem is triggered by setting an iSCSI disk alias in SMS menus then attempting to show information with the following selection: "Select Boot Options -> Configure Boot Device Order -> Select 1st Boot Device -> Network -> ISCSI -> iscsi-disk1 -> Information". The probability is low that this issue will be encountered because it requires iSCSI disk aliases to be used for a boot. Normally for an iSCSI boot disk, most users use a fully qualified iSCSI OF device path which does not trigger the problem. If an SMS abort does occur when navigating away from the iSCSI information menu, the logical partition (LPAR) can be restarted to SMS menus.
- A problem was fixed for a Hostboot hang during an IPL with SRC BC141E2B logged. This is a very rare failure for a timing problem involving multiple process threads. To recover from the problem, do a re-IPL of the system.
- A problem was fixed for detecting a bad SBE SEEPROM with a SEEPROM and processor callout with SRC BC102224 logged when an SBE update is attempted and fails. The fix allows the boot to continue on the old level of the SEEPROM. This is a rare problem that only occurs with an SBE SEEPROM that cannot be written. Without the fix, the IPL will loop and hang, with SBE update errors being continually logged.
- A problem was fixed for a clock error during the IPL that should have been recoverable but instead failed the IPL with extra error logs that included BC8A285E and B111B901. The trigger for this problem requires a recoverable Hostboot IPL failure of some kind to occur (such as a clock error) and specifically a situation that does not result in a deconfiguration of Hostboot targets.
- A problem was fixed for a system hang caused by an Open Memory Interface (OMI) memory loop. This is a very rare error that can only occur if the OMI host memory controller data link has gone into degraded bandwidth mode (x8->x4) because of another error and it also requires a specific memory data pattern to be transmitted when in this degraded mode for the problem to occur.
- A problem was fixed for an IPL failure involving a processor that does not have any functional cores. For this rare problem to occur, a processor with only one functional core must have that core fail with a checkstop. Then on the ensuing post-dump IPL, the error occurs during the deconfiguration of the failed processor. This fix updates the Self Boot Engine (SBE).
- A problem was fixed for ASMI TTY menus allowing an unsupported change in hypervisor mode to OPAL. This causes an IPL failure with BB821410 logged if OPAL is selected. The hypervisor mode is not user-selectable in POWER9 and POWER10. Instead, the hypervisor mode is determined by the MTM of the system. With this fix, the "Firmware Configuration" option in ASMI TTY menus is removed so that it matches the options given by the ASMI GUI menus.
System firmware changes that affect certain systems
- For a system with an AIX or Linux partition, a problem was fixed for a partition start failure for AIX or Linux with SRC BA54504D logged. This problem occurs if the partition is an MDC default partition with virtual Trusted Platform Module (vTPM) enabled. As a circumvention, power off the system and disable vTPM using the HMC GUI to change the default partition property for Virtualized Trusted Platform Module (VTPM) to off.
- For a system with an IBM i partition in MDC mode, a problem was fixed for a possible system hang if an HMC virtual IBM i console fails to connect. A rare timing problem with a shared lock can occur during the console connect attempt. This problem can be recovered by a re-IPL of the system.
- For systems with Linux partitions, a problem was fixed for Linux energy scale features not being enabled in Linux partitions for POWER10. With the problem, Linux is prevented from knowing that energy scale operations are available for use by the partition.
|
MH1010_094_094 / FW1010.10
2021/12/06 |
Impact: Availability Severity: HIPER
New Features and Functions
- Support for three and four node configurations for IBM Power System E1080 (9080-HEX).
- Support for PowerVM enablement of Virtual Trusted Platform Module (vTPM) 2.0.
- Support for Remote restart for vTPM 2.0 enabled partitions. Remote restart is not supported for vTPM 1.2 enabled partitions.
- TPM firmware upgraded to Nuvoton 7.2.3.0. This allows Live Partition Mobility (LPM) migrations from systems running FW920/FW930 and older service pack levels of FW940/FW950 to FW1010.10 systems.
- Support vNIC and Hybrid Network Virtualization (HNV) system configurations in Live Partition Mobility (LPM) migrations to and from FW1010.10 systems. Note: this is not supported on the earlier levels of FW1010.
- Support to increase the clock frequency on the 256GB and 128GB 4U DDIMMs to 2933 Mbps, up from 2666 Mbps.
- Support to allow a partition that fits in a single drawer to be spread across multiple drawers for I/O performance reasons.
- Support was added for OMI Time Domain Reflectometry (TDR) screening for ESD damage on the processor when replacing DIMMs. This damage, when undetected, could lead to IPL or runtime OMI errors.
- DISRUPTIVE: Added information to #EXM0 PCIe3 Expansion Drawer error logs that will be helpful when analyzing problems.
- Support to add OMI Connected Memory Buffer Chip (OCMB) related information into the HOSTBOOT and HW system dumps.
System firmware changes that affect all systems
- HIPER/Non-Pervasive: A problem was fixed for the IBM PowerVM Hypervisor where a specific sequence of VM management operations could lead to a violation of the isolation between peer VMs. This Common Vulnerability and Exposure number is CVE-2021-38918.
- HIPER/Non-Pervasive: A processor configuration setting was changed to avoid a timing issue that can lead to system termination under high processor temperature conditions.
- A problem was fixed for system NVRAM corruption that can occur during PowerVM hypervisor shutdown. This is a rare error caused by a timing issue during the hypervisor shutdown. If this error occurs, the partition data cannot be read from the invalid NVRAM when trying to activate partitions, so the NVRAM must be cleared and the partition profile data restored from the HMC.
- A problem was fixed for Live Partition Mobility (LPM) to remove restrictions that were active for firmware levels FW1010.00, FW1010.01, and FW1010.02. For more information on the rules for migrating between firmware levels, refer to this LPM support matrix document: https://www.ibm.com/docs/en/power10?topic=mobility-firmware-support-matrix-partition.
- A problem was fixed for system fans not increasing in speed when partitions are booted with PCIe hot adapters that require additional cooling. This fan speed problem can also occur if there is a change in the power mode that requires a higher minimum speed for the system fans than is currently active. Fans running at a lower speed than required for proper system cooling could lead to over temperature conditions for the system.
- A problem was fixed for certain PCIe3 Fibre Channel adapters going to an unknown and undetected state after a power off/on or a DLPAR add or remove operation with SRCs BA180020 and BA250010 logged. The affected adapters are the PCIe3 x8 4-port Fibre Channel Adapters (16 Gb/s) with feature codes #EN1E and #EN1F with CCIN 579A and the PCIe3 x8 2-port Fibre Channel Adapters (16 Gb/s) with feature codes #EN1G and #EN1H with CCIN 579B.
- A problem was fixed for performance data collection that may be inaccurate due to incorrect processor bus topology reporting by the PowerVM hypervisor. This will happen anytime a performance tool uses the "H_GetPerformanceCounterInfo" hypervisor call to get the processor bus topology data.
- A problem was fixed for the system powering off after a hardware discovery IPL. This will happen if a hardware discovery IPL is initiated while the system is set to "Power off when last partition powers off". The system will power off when the Hardware Discovery Information (IOR) partition that does hardware discovery powers off. As a workaround, one should not use the "Power off when last partition powers off" setting when doing the hardware discovery IPL. Alternatively, one can just do a normal IPL after the system powers off, and then continue as normal.
- A problem was fixed for a partition hang or unexpected interrupt behavior following a Live Partition Mobility (LPM) operation. This can happen after migrating a partition with the effective processor compatibility mode of Power9 from a Power9 or Power10 system to a Power10 system. A partition can have an effective processor compatibility mode of Power9 when the partition supports Power9 processor compatibility mode and if either of the following is true:
1) The user selected "POWER9" processor compatibility mode on the HMC.
2) Or the user selected "POWER10" compatibility mode and the partition does not support Power10 hardware.
This will not occur if a user selects the "default" or the "POWER9_Base" processor compatibility modes on the HMC. The partition hangs may not be seen until the partition is migrated back to a Power9 system. To recover from the problem, the partition can be rebooted.
- A problem was fixed for a PCIe3 Expansion Drawer Cable Card (#EJ24) losing links during the IPL. This is a rare problem that results in failures displaying the PCIe Hardware Topology Screen from the HMC and ASMI and it also can prevent ability to do concurrent maintenance on the cable card. As a workaround, power off the system, reseat or replace the cable card causing issues, and power on the system.
- A problem was fixed for dedicated processor partitions with "Maximum Processors" set to 1 that may encounter dispatching delays. The issue can occur anytime after Dynamic Platform Optimization (DPO), memory guard, or processor guard occurs on a dedicated processor partition with "Maximum Processors" set to 1. As a workaround, change the "Maximum Processors" for all dedicated processor partitions to at least 2.
- DISRUPTIVE: A problem was fixed for the lack of an error log notification for a TPM firmware update failure. The error log for a failed update is the same as the one for the working update case with SRC B7009005 logged. In the failed case, the system is running on the old level of TPM firmware but without the proper notification to the user that this has happened, and it may result in a secured boot failure.
- A problem was fixed for an HMC ExchangeFru operation which may fail when attempting to repair an EMX0 PCIe3 Expansion Drawer Module. This error only occurs with the RightBay and in the case where the Low CXP cable has a fault or is improperly plugged. A workaround to the problem can be done by connecting or replacing the Low CXP cable and then retrying the repair procedure.
- A problem was fixed for the HMC Repair and Verify (R&V) procedure failing with "Unable to isolate the resource" during concurrent maintenance of the #EMX0 Cable Card. This could lead one to take a disruptive action in order to do the repair. This should occur infrequently and only with cases where a physical hardware failure has occurred which prevents access to the PCIe reset line (PERST) but allows access to the slot power controls.
As a workaround, pulling both cables from the Cable Card to the #EMX0 expansion drawer will result in a completely failed state that can be handled by bringing up the "PCIe Hardware Topology" screen from either ASMI or the HMC. Then retry the R&V operation to recover the Cable Card.
- A problem was fixed to prevent a flood of informational PCIe Host Bridge (PHB) error logs with SRC B7006A74 that cause a wrap of internal flight recorders and loss of data needed for problem debug. This flood can be triggered by bad cables or other issues that cause frequent informational error logs. With the fix, thresholding has been added for informational PHB correctable errors at 10 in 24 hours before a Predictive Error is logged.
- A problem was fixed for performance that may not be optimal for shared processor partitions after the Dynamic Platform Optimizer (DPO) is run. The PowerVM hypervisor tries to evenly spread the home dispatching cores on the same chip across all the shared cores on the chip. Because of this problem, there are situations where the hypervisor may not be spreading the virtual processors across all the shared cores on a chip. Note, the partition is assigned the optimal processor chips, just not the optimal cores in some situations. This problem only occurs after a DPO operation with shared processor partitions. To recover from the problem, a system reboot is needed to correct the accounting data that is used to track home core affinity.
- A problem was fixed to reduce an IPL window where the resource values for Power Enterprise Pools (PEP) 1.0 pool are pending prior to a system IPL completing. With the fix, the IPL time for a system in a PEP 1.0 pool has been decreased such that the partition min/cur/max values for PEP are available sooner. It is still the case that the IPL must be completed before the PEP resource values are correct.
- A problem was fixed for incorrect Power Enterprise Pools (PEP) 2.0 throttling when the system goes out of compliance. When the system is IPLed after going out of compliance, the amount of throttled resources is lower than it should be on the first day after the IPL. Later on, the IBM Cloud Management Console (CMC) corrects the throttle value. This problem requires that a PEP 2.0 system has to go out of compliance, so it should happen only rarely. To recover from this problem, the user can wait for up to one day after the IPL or have the CMC resend the desired PEP Throttling resource amount to correct it immediately.
- DISRUPTIVE: A problem was fixed for no errors being logged when unsupported cables are installed for the PCIe expansion drawer enhanced fanout module (#EMXH). Cables with feature codes #ECC6, #ECC7, #ECC8, and #ECC9 should be detected as bad cables on the install but they are not. To recover from this problem, replace the cables with the correct supported cables.
- A PowerVM hypervisor Terminate Immediate (TI) was added for the case where an NX unit can fail unexpectedly and is not functioning correctly. The trigger for this problem is a symmetric NX job failing with a rare target space exhausted completion code (CC = 13) for jobs that do not require target space.
- A problem was fixed for certain SR-IOV adapters that encountered a rare adapter condition, had some response delays, and logged an Unrecoverable Error with SRC B400FF02. With the fix, handling of this rare condition is accomplished without the delay, an Informational Error is logged, and the adapter initialization continues without interruption. This fix pertains to adapters with the following Feature Codes and CCINs: #EC2R/EC2S with CCIN 58FA; #EC2T/EC2U with CCIN 58FB; and #EC66/EC67 with CCIN 2CF3.
Update instructions: https://www.ibm.com/docs/en/power10?topic=updates-sr-iov-firmware-update
- A problem was fixed for an SR-IOV adapter in shared mode configured as Virtual Ethernet Port Aggregator (VEPA) where the SR-IOV adapter goes through EEH error recovery, causing an informational error with SRC B400FF04 and additional information text that indicates a command failed. This always happens when an adapter goes through EEH recovery and a physical port is in VEPA mode. With the fix, the informational error is not logged.
Update instructions: https://www.ibm.com/docs/en/power10?topic=updates-sr-iov-firmware-update
- A problem was fixed for certain SR-IOV adapters where Virtual Functions (VFs) failed to configure after an immediate restart of a logical partition (LPAR) or a shutdown/restart of an LPAR. This problem only happens intermittently but is more likely to occur for the immediate restart case. A workaround for the problem is to try another shutdown and restart of the partition or use DLPAR to remove the failing VF and then use DLPAR to add it back in. This fix pertains to adapters with the following Feature Codes and CCINs: #EC2R/EC2S with CCIN 58FA; #EC2T/EC2U with CCIN 58FB; and #EC66/EC67 with CCIN 2CF3.
The fix is in the Partition Firmware and is effective immediately after a firmware update to the fix level.
- A problem was fixed for SMS menus not showing on a failed boot for a partition with a bad configuration but the HMC displaying the state of the partition as "Running" instead of "Open Firmware". This problem always occurs for a brand new partition with no I/O devices or for a partition with boot devices in the boot list which do not exist in the logical partition (LPAR). The boot mode must also be set to "Normal". This causes the partition to fail the boot (expected), but then drop into the OF prompt (unexpected) along with setting the LPAR state to "Running". The correct behavior is to display SMS menus on LPAR boot failure and to set the LPAR state to "Open Firmware". As a workaround, subsequent reboots of the LPAR can be stopped at the SMS menus, and the HMC will display "Open Firmware" as the LPAR state.
- A problem was fixed for a risk of overcurrent in the system in power management modes where Workload Optimized Frequency (WOF) is disabled. Overcurrent, if it occurs, is handled by throttling that is induced by the On-Chip Controller (OCC) Powerstate General Purpose Engine (PGPE) to keep the system safe, but at a reduction in performance. As a workaround, keep the power management mode set to the default Maximum Performance mode.
- A problem was fixed for the On-Chip Controller (OCC) going into Safe Mode causing degraded performance for a memory card failure with SRC B1242A00 logged. This is a rare failure that requires a memory channel failure. As a workaround, the failed DIMM FRU that is called out can be replaced. It is also possible that a second DIMM FRU is called out with SRC B1242A09 for an unrelated DIMM in the memory sub-channel. This DIMM is not failed and should not be replaced. This fix also eliminates the second incorrect DIMM callout and the B1242A09 SRC.
- A problem was fixed for error log callouts for the Trusted Platform Module (TPM). Without the fix, the failing TPM will not be called out, just the processor.
- A problem was fixed for an error in the Time Of Day (TOD) clock register not calling out and guarding the system reference clock.
- A problem was fixed for missing callouts for a hardware reference clock error with SRC BC50240B logged. The trigger for this problem is a system checkstop due to LPC bus issues originating from the reference clock. With the fix, the PNOR is guarded to ensure that the service processor fails over on this type of error.
- A problem was fixed for a system IPL termination with SRC B181E6C7 logged. This is a very rare problem. To recover, perform the IPL of the system again.
- A problem was fixed for fan (Air Moving Device) errors during a system power off with SRC 11007610, 11007650 and B1812047 logged. This infrequent error is triggered by a system slowdown caused by a flood of informational error logs. This is not a hardware problem, so nothing needs to be replaced. The affected power off does complete after a period of about 20 to 30 minutes.
- A problem was fixed for Open Memory Interface (OMI) Connected Memory Buffer Chip (OCMB) Predictive errors with SRC BC20E504 logged but not being reported to the OS or having hardware FRU callouts for guard action. The SRC is a Predictive Error (PE) and "Guard Predictive" but no guards are seen in the system for the problem. The signature description of the error is "ocmb(n0p49) (OMIDLFIR[6]) OMI-DL0 parity error detection on a lane".
- A problem was fixed for a missing processor core callout for a failed core on a multi-node system. The error is logged with SRC B111E550 with signature description "pu(n0p0) No active error bits found", indicating that error isolation for a checkstop attention failed. This is a rare error that can occur during the system IPL for a system with more than one node where a core has a fault in one of the nodes.
- A problem was fixed for a power off progress code SRC C19220FF not being cleared on the panel when the power off of the system has completed successfully. This happens every time during a power off. The progress code does change on the next user action such as a power on. For POWER9, the SRC C19220FF progress code is cleared at the end of the power off.
- A problem was fixed for a system failure on the first AC apply for a new system install with SRC B1813436 logged. This problem occurs frequently on the first apply of AC on a new system. The workaround is to do another AC cycle until the system boots to service processor standby. This could be one or more AC cycles of the system. The problem could recur after a recovery AC cycle if the system is allowed to stay at service processor standby for a couple of hours before doing a power on IPL of the system. If this happens, do another AC cycle and then IPL power on immediately to boot the system.
- A problem was fixed for possibly the wrong Workload Optimized Frequency (WOF) table being selected for a processor, causing the system to run at a non-optimal speed. This problem can happen when the number of present cores is less than the number of all possible cores for a processor.
- A problem was fixed for incorrect behavior in the guard and resource recovery for the Trusted Platform Module (TPM). These are the two fixed scenarios:
1) If a guarded TPM part is replaced, the guard record will not be automatically removed, preventing the new TPM part from going into service.
2) If a TPM is guarded and the system would not be able to boot without it, resource recovery could recover the TPM. If the system is later powered off and the user clears the guard record, and then IPLs again, the system could skip applying other non-TPM guard records, bringing guarded parts back into service unexpectedly.
As a workaround for a TPM that is guarded after a replace operation, use the service processor ASMI utility to manually clear the TPM guard records. For the case where guard records are not applied, move or remove the guarded parts as needed and IPL the system.
- DEFERRED: A problem was fixed for an unexpected failover of the service processor for a PMIC/DDIMM fault with SRC B124B901 logged. PMIC is the Power Management Integrated Circuit (PMIC) for DDIMM power regulation. This problem is rare because it requires a dual PMIC failure to get the failover. No recovery is needed as the failover keeps the system running.
- A problem was fixed for a system termination for a failed clock card with an error on the LPC bus. The failed clock card was not guarded as needed, so it caused the subsequent re-IPL to fail. As a workaround, the failed clock card can be manually guarded using ASMI, and then the system will be able to IPL.
- A problem was fixed for a failed SEEPROM on a secondary processor causing a re-IPL hang and a system termination. In this case, the firmware should have attempted to boot off of the alternate SEEPROM side but it kept booting off the failed SEEPROM side. This problem is rare and requires a hard SEEPROM failure to happen.
- A problem was fixed for a processor being deconfigured if a single SMPGROUP (SMP link) is guarded or deconfigured for the processor. With the fix, the processor is not deconfigured unless all the SMP links to the processor have failed.
- DEFERRED: A problem was fixed for many UE errors occurring for accessing Workload Optimized Frequency (WOF) data for cores on a non-boot chip that has all cores deconfigured. SRCs BC10332B and BCBA090F are logged incorrectly for each deconfigured core. A workaround for this problem is to reconfigure one of the missing processor cores or replace the processor hardware.
- DEFERRED: A problem was fixed to better distinguish clock card transient errors from clock card hard errors and also be able to detect a failed clock crystal oscillator during the IPL. This fix will reduce clock card callouts for the very rare clock transient errors and crystal oscillator failures. If any clock card is guarded with a Predictive error, an AC power cycle before a re-IPL will enable the firmware to detect a failed crystal oscillator.
- A problem was fixed for a clock oscillator fault with SRC B158BA24 logged causing processors to be unnecessarily guarded with BC10090F logged. The loss of processors could prevent the system from IPLing using the redundant clock source. This should be an infrequent problem. To recover from the loss of processors, manually reconfigure the affected processors through ASMI, and IPL again.
- A problem was fixed to remove a boot delay of at least one minute to reduce the time needed for a power on IPL.
- DEFERRED: A problem was fixed for a Hostboot terminate with SRC BC8A0506 logged when there was a functional but imperfect OMI connection to the DIMMs. To recover from this problem, the failing FRUs must be replaced.
- A problem was fixed for unnecessary guards to the TPM(s). The problem is triggered when a single processor's module VPD is not accessible from any of its sources (cache, primary SEEPROM, backup SEEPROM); discovery of the remaining system parts will then fail and any unprocessed parts will erroneously be marked as not present, potentially causing fatal guards to the TPM(s). Losing all three sources is considered an infrequent occurrence.
As a workaround, isolate the part whose VPD cannot be read from all of its sources, and fix the VPD or replace the part.
- A problem was fixed for a fused core guarded at runtime having its "deconfigured by error log ID" value reported by ASMI and GUI as 0. With the fix, the error log ID that led to the deconfiguration is reported.
- A problem was fixed for a Hostboot hang on a warm re-IPL with SRC BC130311 logged when the first 4 cores of the first processor are dead cores.
- A problem was fixed for a Power Management halt error that could prevent the On-Chip Controller (OCC) Safe mode from being fully achieved (OCC is disabled but frequencies are not throttled), resulting in the system running at valid high voltage and frequencies but without the means to react to future thermal events. This could cause the processors to run too hot and generate over-temperature warnings in some situations.
This fix was previously delivered for service pack FW1010.02 but it was found that frequencies were not being throttled in some cases when in Safe Mode.
- A problem was fixed for processor cores marked dead by the hypervisor preventing a re-IPL and dump collection with SRC B150BA2A logged during the Hostboot failure on the re-IPL. With the fix, processing actions on the dead cores are skipped on the re-IPL so that the IPL can complete.
This fix updates the Self Boot Engine (SBE).
System firmware changes that affect certain systems
- On systems with IBM i partitions, the PowerVM hypervisor is vulnerable to a carefully crafted IBM i hypervisor call that can lead to a system crash. This Common Vulnerability and Exposure number is CVE-2021-38937.
- For a system with an AIX or Linux partition, a problem was fixed for Platform Error Logs (PELs) that are truncated to only eight bytes for error logs created by the firmware and reported to the AIX or Linux OS. These PELs may appear to be blank or missing on the OS. This rare problem is triggered by multiple error log events in the firmware occurring close together in time and each needing to be reported to the OS, causing a truncation in the reporting of the PEL. As a workaround, the full error logs for the truncated logs are available on the HMC or can be viewed using ASMI on the service processor.
- For a system with a Linux partition using an SR-IOV adapter, a problem was fixed for ping failures and packet loss for an SR-IOV logical port when a Dynamic DMA Window (DDW) changes from a bigger DMA window page size (such as 64K) back to the smaller default window page size (4K). This can happen during an error recovery that causes a DDW reset back to the default window page size.
- For a system with an AIX partition, a problem was fixed for a missing AIX errpt error log for an AIX Access Key that has expired. As a workaround, the AIX user can query the Expiration Date from the AIX command line and directly see if it has expired or not using the AIX "lparstat" command as shown in this example:
# lparstat -u
FW Update Access Key Expiration (YYYYMMDD): 20220801
AIX Update Access Key Expiration (YYYYMMDD): 20211017
AIX Image Date (YYYYMMDD): 20211210
- For systems with IBM i partitions, a problem was fixed for incorrect Power Enterprise Pools (PEP) 2.0 messages reporting "Out of Compliance" with regards to IBM i licenses. These messages can be ignored as there is no compliance issue to address in this case.
|
MH1010_076_076 / FW1010.02_2
(FW1010.02 rebuild/refresh)
2024/11/07
|
Impact: Security Severity: Hiper
System firmware changes that affect all systems
- A security problem was fixed for CVE-2024-45656.
|
MH1010_075_075 / FW1010.02
2021/10/14 |
Impact: Availability Severity: Hiper
System firmware changes that affect all systems
- HIPER/Non-Pervasive: A problem was fixed for a system checkstop during a power on IPL when service processor FSP B is the primary with SRC BC50E504 logged. This problem always happens if FSP B is in the primary role for the IPL. If FSP B is currently the primary service processor, this problem can be circumvented by doing an HMC FSP failover to make FSP A the primary and FSP B the secondary.
- A problem was fixed for a Power Management halt error that could prevent the On-Chip Controller (OCC) Safe mode from being fully achieved (OCC is disabled but frequencies are not throttled), resulting in the system running at valid high voltage and frequencies but without the means to react to future thermal events. This could cause the processors to run too hot and generate over-temperature warnings in some situations.
This fix updates the Self-Boot Engine (SBE).
|
MH1010_069_069 / FW1010.01
2021/09/28 |
Impact: Availability Severity: Hiper
This service pack is a mandatory install service pack.
New Features and Functions
- DEFERRED: The Minimum Secure Version level was updated to correlate to FW1010.01. This change will prevent a back-level firmware update of the system to FW1010.00.
System firmware changes that affect all systems
- HIPER/Non-Pervasive: A problem was fixed for memory DIMM failures during the IPL with SRCs BC20090F and BC20E504 logged. This is an intermittent and rare problem for a false memory training error that can be recovered from by unguarding the failed DIMMs and doing another IPL of the system.
- HIPER/Non-Pervasive: A problem was fixed for processor spare lane deployment in case of lane failures. Without the spare lane fix the processor bus goes to half-bandwidth with a degrade in performance when there are link errors. To recover from this error, the processor must be replaced.
- A problem was fixed for a system failure during processor recovery with SRC B113E504 logged. The occurrence of errors which trigger the need for processor recovery are rare.
- A security problem was fixed for the PowerVM Hypervisor that could allow a privileged user to gain access to another VM due to an assignment of duplicate World Wide Port Names (WWPNs). In some cases, the PowerVM hypervisor can assign duplicate WWPN IDs to virtual Fibre Channel adapters in peer VMs after a specific series of service actions are performed. The WWPN needs to be a unique identifier in the network. This Common Vulnerabilities and Exposures (CVE) id is CVE-2021-38923.
|
MH1010_064_064 / FW1010.00
2021/09/17 |
Impact: New Severity: New
GA Level with key features included listed below.
New Features and Functions
- This server firmware includes the SR-IOV adapter firmware level xx.30.1004 for the following Feature Codes and CCINs: #EC2R/EC2S with CCIN 58FA; #EC2T/EC2U with CCIN 58FB; and #EC66/EC67 with CCIN 2CF3.
- Added support in ASMI for a new panel to do Self-Boot Engine (SBE) SEEPROM validation. This validation can only be run at the service processor standby state.
If the validation detects a problem, IBM recommends the system not be used and that IBM service be called.
- Support was added for a new service processor command that can be used to 'lock' the power management mode, such that the mode cannot be changed except by doing a factory reset.
- Support was changed to disable Service Location Protocol (SLP) by default for newly shipped systems or systems that are reset to manufacturing defaults. This change has been made to reduce memory usage on the service processor by disabling a service that is not needed for normal system operations.
- Support was added to generate a service processor fipsdump whenever there is a Hostboot (HB) TI and HB dump. Without this new support, an HB crash (with an HB dump) does not generate a fipsdump capturing the FSP FFDC at that point in time, so it was difficult to correlate what was seen in the HB dump with what was happening on the FSP at the time of the HB failure.
- Support added to Redfish to provide a command to set the ASMI user passwords using a new AccountService schema. Using this service, the ASMI admin, HMC, and general user passwords can be changed.
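The exact account URIs, user names, and authentication details vary by service processor level; the following is an illustrative sketch only (the host name, account path, and passwords below are placeholder assumptions, not taken from this document) of how a password change is typically done through the standard DMTF Redfish AccountService by patching the Password property of the target account resource:
# Illustrative example only; the account URI and credentials are assumptions
curl -k -u admin:CURRENT_PASSWORD -H "Content-Type: application/json" \
  -X PATCH https://<service-processor>/redfish/v1/AccountService/Accounts/<account-id> \
  -d '{"Password": "NEW_PASSWORD"}'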
- Support for Live Partition Mobility (LPM) to allow LPM migrations when virtual optical devices are configured for a source partition. LPM automatically removes virtual optical devices as part of the LPM process. Without this enhancement, LPM is blocked if virtual optical devices are configured.
- Support for Live Partition Mobility (LPM) to select the fastest network connection for data transfer between Mover Service Partitions (MSPs). The configured network capacity of the adapters is used as the metric to determine what may provide the fastest connection. The MSP is the term used to designate the Virtual I/O Server that is chosen to transmit the partition’s memory contents between source and target servers.
- Support has been dropped for the smaller logical-memory block (LMB) sizes of 16MB, 32MB, and 64MB. 128MB and 256MB are the only LMB sizes that can be selected in ASMI.
- Support has been dropped for Active Memory Sharing (AMS) on POWER10 servers.
- Support for PowerVM for an AIX Update Access Key (UAK) for AIX 7.2. Interfaces are provided that validate the OS image date against the AIX UAK expiration date. Informational messages are generated when the release date for the AIX operating system has passed the expiration date of the AIX UAK during normal operation. Additionally, the server periodically checks and informs the administrator about AIX UAKs that are about to expire, AIX UAKs that have expired, or AIX UAKs that are missing. It is recommended that you replace the AIX UAK within 30 days prior to expiration.
For more information, please refer to the Q&A document for "Management of AIX Update Access Keys" at
https://www.ibm.com/support/pages/node/6480845.
- Support for LPAR Radix Page Table mode in PowerVM.
- Support for PowerVM encrypted NVRAM that enables encryption of all partition NVRAM data and partition configuration information.
- Support for isolating faults to a single node that occur between an SMP cable and two nodes by using Time Domain Reflectometry (TDR).
- Support for booting IBM i from a PCIe4 LP 32Gb 2-port Optical Fibre Channel Adapter with Feature Code #EN1K.
- Support for VIOS 3.1.3 (based on AIX 7.2 TL5 (AIX 72X)) on POWER10 servers.
- Support for the IBM 4769 PCIe3 Cryptographic Coprocessor hardware security module (HSM). This HSM has Feature Code #EJ37 with CCIN C0AF. Its predecessors are IBM 4768, IBM 4767, and IBM 4765.
- Support for a mainstream 800GB NVME U.2 7 mm SSD (Solid State Drive) PCIe4 drive in a 7 mm carrier with Feature Code #EC7Q and CCIN 59B4 for AIX, Linux, and VIOS.
- Support for a PCIe4 x16 to CXP Converter card for the attachment of two active optical cables (AOC) to be used for external storage and PCIe fan-out attachment to the PCIe expansion drawers. This cable card has Feature Code #EJ24 and CCIN 6B53.
- Support for new PCIe 4.0 x8 dual-port 32 Gb optical Fibre Channel (FC) short form adapter based on the Marvell QLE2772 PCIe host bus adapter (6.6 inches x 2.731 inches). The adapter provides two ports of 32 Gb FC capability by using SR optics. Each port can provide up to 6,400 MBps bandwidth. This adapter has feature codes #EN1J/#EN1K with CCIN 579C.
- Support for new PCIe 3.0 16 Gb quad-port optical Fibre Channel (FC) x8 short form adapter based on the Marvell QLE2694L PCIe host bus adapter (6.6 inches x 2.371 inches). The adapter provides four ports of 16 Gb FC capability by using SR optics. Each port can provide up to 3,200 MBps bandwidth. This adapter has feature codes #EN1E/#EN1F with CCIN 579A.
|