Power9 System Firmware Fix History - Release levels VM9xx

Fix Readme

Abstract

Firmware History for VM9xx Levels.

Content

VM950

VM950 For Impact, Severity and other Firmware definitions, Please refer to the below 'Glossary of firmware terms' url: https://www.ibm.com/support/pages/node/6555136
VM950_175_045 / FW950.D0 2024/12/19	Impact: Security Severity: HIPER System firmware changes that affect all systems Security problem was fixed for CVE-2023-45863 Security problem was fixed for CVE-2024-35960 Security problem was fixed for CVE-2024-35888 Security problem was fixed for CVE-2023-52781 Security problem was fixed for CVE-2024-26859 Security problem was fixed for CVE-2023-52686 Security problem was fixed for CVE-2024-26735 Security problem was fixed for CVE-2024-26763 Security problem was fixed for CVE-2024-26688 Security problem was fixed for CVE-2024-26733 Security problem was fixed for CVE-2023-52881 Security problem was fixed for CVE-2024-26659 Security problem was fixed for CVE-2024-26934 Security problem was fixed for CVE-2024-36883 Security problem was fixed for CVE-2023-52486 Security problem was fixed for CVE-2024-26671 Security problem was fixed for CVE-2024-36004 Security problem was fixed for CVE-2024-36008 Security problem was fixed for CVE-2024-26791 Security problem was fixed for CVE-2024-26644 Security problem was fixed for CVE-2023-52615 Security problem was fixed for CVE-2024-26640 Security problem was fixed for CVE-2023-52607 Security problem was fixed for CVE-2023-52457 Security problem was fixed for CVE-2023-52451 Security problem was fixed for CVE-2023-52454 Security problem was fixed for CVE-2023-6356 Security problem was fixed for CVE-2023-6536 Security problem was fixed for CVE-2023-6535 Security problem was fixed for CVE-2024-23850 Security problem was fixed for CVE-2024-0841 Security problem was fixed for CVE-2023-6915 Security problem was fixed for CVE-2023-6932 A problem has been fixed related to installing new firmware for platforms with EMX0 PCIe expansion drawers attached. SRC10009047 and SRC10009109 may be logged during the installation of new system firmware. A problem was fixed for a Novalink partition where it may not get automatically restarted if it encounters a software issue during initialization. A problem was fixed where an HMC will not properly display a server as being in a PEP2 pool. After a concurrent firmware update is performed to this Service Pack level, the problem can be fixed by rebooting the HMC or resetting the server's HMC connection. The built-in version of OpenSSL that PowerVM uses has been updated to the most current LTS release of OpenSSL3, 3.0.15. A problem was fixed where the certificate generation feature on IBM CertHub requires inclusion of the CN (Common Name) value in the CSR (certificate signing request). The FSP does not prompt for the CN field. Therefore, the CSR file generated from the FSP will fail to create a CA signed SSL certificate. The fix will prompt the user to add an CN in ASMi while CSR creation, which is required by IBM CertHub. A problem was fixed where the certificate generation feature on IBM CertHub requires inclusion of the OU (organization unit) value in the CSR (certificate signing request). The FSP does not prompt for the OU field. Therefore, the CSR file generated from the FSP will fail to create a CA signed SSL certificate. The fix will prompt the user to add an OU in ASMi while CSR creation, which is required by IBM CertHub. A problem was fixed where Redfish API output displays the psu input power value as Line input voltage. A crash caused by SRC B113E504 was updated to isolate to the correct hardware for the crash.
VM950_174_045 / FW950.C2 2024/12/19	Impact: Security Severity: HIPER System firmware changes that affect all systems Security problem was fixed for CVE-2023-45863 Security problem was fixed for CVE-2024-35960 Security problem was fixed for CVE-2024-35888 Security problem was fixed for CVE-2023-52781 Security problem was fixed for CVE-2024-26859 Security problem was fixed for CVE-2023-52686 Security problem was fixed for CVE-2024-26735 Security problem was fixed for CVE-2024-26763 Security problem was fixed for CVE-2024-26688 Security problem was fixed for CVE-2024-26733 Security problem was fixed for CVE-2023-52881 Security problem was fixed for CVE-2024-26659 Security problem was fixed for CVE-2024-26934 Security problem was fixed for CVE-2024-36883 Security problem was fixed for CVE-2023-52486 Security problem was fixed for CVE-2024-26671 Security problem was fixed for CVE-2024-36004 Security problem was fixed for CVE-2024-36008 Security problem was fixed for CVE-2024-26791 Security problem was fixed for CVE-2024-26644 Security problem was fixed for CVE-2023-52615 Security problem was fixed for CVE-2024-26640 Security problem was fixed for CVE-2023-52607 Security problem was fixed for CVE-2023-52457 Security problem was fixed for CVE-2023-52451 Security problem was fixed for CVE-2023-52454 Security problem was fixed for CVE-2023-6356 Security problem was fixed for CVE-2023-6536 Security problem was fixed for CVE-2023-6535 Security problem was fixed for CVE-2024-23850 Security problem was fixed for CVE-2024-0841 Security problem was fixed for CVE-2023-6915 Security problem was fixed for CVE-2023-6932
VM950_168_045 / FW950.C1 2024/10/25	Impact: Security Severity: HIPER System firmware changes that affect all systems A security problem was fixed for CVE-2024-45656
VM950_161_045 / FW950.C0 2024/09/27	Impact: Availability Severity: ATT System firmware changes that affect all systems A change was made to a base framework layer of various virtualization features, including SR-IOV. While the change is not expressly needed for SR-IOV, concurrent system firmware updates to this level will cause SR-IOV adapters to indicate that a update is available. Applying the update will cause a re-initialization of the specific SR-IOV adapter environment, causing a brief outage. Update instructions: https://www.ibm.com/support/knowledgecenter/en/POWER9/p9efd/p9efd_updating_sriov_firmware.htm. A security problem was fixed for CVE-2024-41781. A problem was fixed for an HMC ExchangeFru operation which may fail with SRC B7006A9E when attempting to repair an EMX0 PCIe3 Expansion Drawer Module. This error only occurs with the RightBay in the case where the Low CXP cable has a fault or is improperly plugged. To workaround the problem, connect or replace the Low CXP cable and then retry the repair procedure. A problem was fixed in the firmware for the EMX0 PCIe Gen3 I/O expansion drawer calling out cable thermal or power alarms. Most likely System Reference Codes logged can be: SRC B7006A99 SRC B7006AA6 SRC B7006AA7. This fix only pertains to systems with an attached EMX0 PCIe Gen3 I/O expansion drawer having EMXH fanout modules. A problem was fixed for SRC B7006A99 with word4 of 3741412C being logged as a Predictive error calling out cable hardware when no cable replacement is needed. This SRC does not have an impact on PCIe function and will be logged as Informational to prevent unnecessary service actions for the non-functional error. A problem was fixed for expansion drawer serviceable events not including expansion drawer cables in the FRU callout list when the expansion drawer cable may be the source of the problem. The fix changes some uses of SRC B7006A84 to either SRC B7006A85 or SRC B7006A89 to correctly include expansion drawer cables in the FRU callout list. DEFERRED: A problem was fixed in the firmware for the EMX0 PCIe Gen3 I/O expansion drawer calling out cable or other related hardware, possibly leading to link degradation. Most likely System Reference Codes logged can be: SRC B7006A80, SRC B7006A85, SRC B7006A88, SRC B7006A89. This fix only pertains to systems with an attached EMX0 PCIe Gen3 I/O expansion drawer having EMXH fanout modules. A problem was fixed that would cause an LPM to fail due to an insufficient memory for firmware error while deleting a partition on the source system. A problem was fixed for a rare problem creating and offloading platform system dumps. An SRC B7000602 will be created at the time of the failure. The fix allows for platform system dumps to be created and offloaded normally. A problem was fixed where, if TPM hardware communication becomes unstable, it can lead to sporadic LPM (Live Partition Mobility) failures. This fix adds robustness to LPM operations to avoid usage of TPM hardware that is deemed unstable in preference of more stable TPM HW or customer configured PowerVM Trusted System Key. An enhancement was made to provide a daily TPM health check to allow for advance notification of a TPM failures so that it can be addressed before performing operations dependent on it such as LPM, Disruptive System Dumps, etc. The first two times this daily TPM health check fails, a new informational SRC will be posted: B700900D. After 3 failures, the TPM will be marked as failed and the existing serviceable TPM failure SRC will posted instead. A problem was fixed where an LPAR posted error log with SRC BA54504D. The problem has been seen on systems where only one core is active. A problem was fixed for possible intermittent shared processor LPAR dispatching delays. The problem only occurs for capped shared processor LPARs or uncapped shared processor LPARS running within their allocated processing units. The problem is more likely to occur when there is a single shared processor in the system. An SRC B700F142 informational log may also be produced. A problem was fixed for a possible system hang during a Dynamic Platform Optimization (DPO), memory guard recovery, or memory mirroring defragmentation operation. The problem only occurs if the operation is performed while an LPAR is running in POWER9 processor compatibility mode. A problem was fixed where ASMI menus are not displayed correctly in all the languages. A problem was fixed where the firmware update process failed when the FSP went through the reset/reload due to FSP boot watchdog timeout error.
VM950_149_045 / FW950.B0 2024/04/16	Impact: Availability Severity: ATT System firmware changes that affect all systems A problem was fixed where a long running firmware operation involving elastic and trial-based CoD (Capacity on Demand) entitlements may time-out. This results in the server state being set to incomplete on the HMC, which will require a rebuild of the server. A problem was fixed where virtual serial numbers may not all be populated on a system properly when an activation code to generate them is applied. This results in some virtual serial numbers being incorrect or missing. A problem was fixed for an intermittent issue preventing all Power Enterprise Pool mobile resources from being restored after a server power on when both processor and memory mobile resources are in use. Additionally, a problem was fixed where Power Enterprise Pools mobile resources were being reclaimed and restored automatically during server power on such that resource assignments were impacted. The problem only impacts systems utilizing Power Enterprise Pools 1.0 resources. A problem was fixed where the target system would terminate with a B700F103 during LPM (Logical Partition Migration). The problem only occurs if there were low amounts of free space on the target system. A problem was fixed for partitions configured to use shared processor mode and set to capped potentially not being able to fully utilize their assigned processing units. To mitigate the issue if it is encountered, the partition processor configuration can be changed to uncapped. A problem was fixed where a bad core is not guarded and repeatedly causes system to crash. The SRC requiring service has the format BxxxE540. The problem can be avoided by replacing or manually guarding the bad hardware. This service pack implements a new Update Access Key (UAK) Policy. See description at https://www.ibm.com/support/pages/node/7131459 . A problem was fixed where CIMP provided sensor values (Ex. ambient temperature sensor) were not coming back after an FSP reset at system power off state. A security problem is fixed in service processor firmware by upgrading curl library to the latest version beyond 8.1.0. The Common Vulnerabilities and Exposures number for this problem is CVE-2023-28322. An enhancement was made related to vNIC failover performance. The performance benefit will be gained when a vNIC client unicast MAC address is unchanged during the failover. The performance benefit is not very significant but a minor one compared to overall vNIC failover performance. A change was made for certain SR-IOV adapters to move up to the latest level of adapter firmware. No specific adapter problems were addressed at this new level. This change updates the adapter firmware to 16.35.2000 for Feature codes EC67,EC66 and CCIN 2CF3. If this adapter firmware levels are concurrently applied, AIX and VIOS VFs may become failed. Certain levels of AIX and VIOS do not properly handle concurrent SR-IOV updates and can leave the virtual resources in a DEAD state. Please review the following document for further details: https://www.ibm.com/support/pages/node/6997885. A re-IPL of the system instead of concurrently updating the SR-IOV adapter firmware would also work to prevent a VF failure. Update instructions: https://www.ibm.com/docs/en/power9?topic=adapters-updating-sr-iov-adapter-firmware A problem was fixed where service for a processor FRU was requested when no service is actually was required. The SRC requiring service has the format BxxxE504 with a PRD Signature description matching (OCC_FIR[45]) PPC405 cache CE. The problem can be ignored unless the issue is persistently reported on subsequent IPLs. If that occurs, hardware replacement may be required.
VM950_145_045 / FW950.A0 2024/01/18	Impact: Data Severity: HIPER System firmware changes that affect all systems HIPER: Power9 servers with an I/O adapter in SRIOV shared mode, and an SRIOV virtual function assigned to an active Linux partition assigned 8GB or less of platform memory may have undetected data loss or data corruption when Dynamic Platform Optimizer (DPO), memory guard recovery, or memory mirroring defragmentation is performed. A security problem was fixed for CVE-2023-46183. A change was made to update the POWER hypervisor version of OpenSSL A security problem was fixed for CVE-2023-33851. A problem was fixed for assignment of memory to a logical partition which does not maximize the affinity between processors and memory allocations of the logical partition. This problem can occur when the system is utilizing Active Memory Mirroring (AMM) on a memory-constrained system. This only applies to systems which are capable of using AMM. As a workaround, Dynamic Platform Optimizer (DPO) can be run to improve the affinity. A problem was fixed for Logical Partition Migration (LPM) failures with an HSCLB60C message. The target partition will be rebooted when the failure occurs. This error can occur during the LPM of partitions with a large amount of memory configured (32TB or more) and where an LPM failover has started on one of the connections to a Virtual I/O Server (VIOS) designated as the Mover Service Partitions (MSP). A problem was fixed for an incorrect SRC B7005308 "SRIOV Shared Mode Disabled" error log being reported on an IPL after relocating an SRIOV adapter. This error log calls out the old slot where the SRIOV adapter was before being relocated. This error log occurs only if the old slot is not empty. However, the error log can be ignored as the relocation works correctly. A problem was fixed for transitioning an IO adapter from dedicated to SR-IOV shared mode. When this failure occurs, an SRC B4000202 will be logged. This problem may occur if an IO adapter is transitioned between dedicated and SR-IOV shared mode multiple times on a single platform IPL. A problem was fixed for Logical Partition Migration (LPM) to better handle errors reading/writing data to the VIOS which can lead to a VIOS and/or Hypervisor hang. The error could be encountered if the VIOS crashes during LPM. A problem was fixed that prevents dumps (primarily SYSDUMP files) greater than or equal to 4GB (4294967296 bytes) in size from being offloaded successfully to AIX or Linux operating systems. This problem primarily affects larger dump files such as SYSDUMP files, but could affect any dump that reaches or exceeds 4GB (RSCDUMP, BMCDUMP, etc.). The problem only occurs for systems which are not HMC managed where dumps are offloaded directly to the OS. A side effect of an attempt to offload such a dump will be the continuous writing of the dump file to the OS until the configured OS dump space is exhausted which will potentially affect the ability to offload any subsequent dumps. The resulting dump file will not be valid and can be deleted to free dump space. A problem was fixed for errors reported or partition hangs when using the SMS menu I/O Device Information to list SAN devices. One or more of SRCs BA210000, BA210003, or BA210013 will be logged. As a possible workaround, verify at least one LUN is mapped to each WWPN zoned to the partition. The partition console may display text similar to the following: Detected bad memory access to address: ffffffffffffffff Package path = / Loc-code = ... Return Stack Trace ------------------ @ - 2842558 ALLOC-FC-DEV-ENTRY - 2a9f4b4 RECORD-FC-DEV - 2aa0a00 GET-ATTACHED-FC-LIST - 2aa0fe4 SELECT-ATTACHED-DEV - 2aa12b0 PROCESS-FC-CARD - 2aa16d4 SELECT-FC-CARD - 2aa18ac SELECT-FABRIC - 2aae868 IO-INFORMATION - 2ab0ed4 UTILS - 2ab6224 OBE - 2ab89d4 evaluate - 28527e0 invalid pointer - 2a79c4d invalid pointer - 7 invalid pointer - 7 process-tib - 28531e0 quit - 2853614 quit - 28531f8 syscatch - 28568b0 syscatch - 28568b A problem was fixed for fetching the CPU temperature data from HMC energy and thermal metrics.
VM950_136_045 / FW950.90 2023/09/22	Impact: Availability Severity: SPE System firmware changes that affect all systems A problem was fixed for being unable to make configuration changes for partitions, except to reduce memory to the partitions, when upgrading to a new firmware release. bhThis can occur on systems with SR-IOV adapters in shared mode that are using most or all the available memory on the system, not leaving enough memory for the PowerVM hypervisor to fit. As a workaround, configuration changes to the system to reduce memory usage could be made before upgrading to a new firmware release. A problem was fixed for possible performance degradation in a partition when doing Nest Accelerator (NX) GZIP hardware compression. The degradation could occur if the partition falls back to software-based GZIP compression if a new Virtual Accelerator Switchboard (VAS) window allocation becomes blocked. Only partitions running in Power9 processor compatibility mode are affected. A problem was fixed for a Live Partition Mobility (LPM) migration hang that can occur during the suspended phase. The migration can hang if an error occurs during the suspend process that is ignored by the OS. This problem rarely happens as it requires an error to occur during the LPM suspend. To recover from the hang condition, IBM service can be called to issue a special abort command, or, if an outage is acceptable, the system or VIOS partitions involved in the migration can be rebooted. A problem was fixed for a possible shared processor partition becoming unresponsive or having reduced performance. This problem only affects partitions using shared processors. As a workaround, partitions can be changed to use dedicated processors. If a partition is hung with this issue, the partition can be rebooted to recover. A problem was fixed that causes slot power on processing to occur a second time when the slot is already powered on. The second slot power-on can occur in certain cases and is not needed. There is a potential for this behavior to cause a failure in older adapter microcode. A problem was fixed for SRC B7006A99 being logged as a Predictive error calling out cable hardware when no cable replacement is needed. This SRC does not have an impact on PCIe function and will be logged as Informational to prevent unnecessary service actions for the non-functional error. A problem was fixed for an extra IFL (Integrated Facility for Linux) proc resource being available during PEP 2.0 throttling. This issue can be triggered by the following scenario for Power Enterprise Pools 2.0 (PEP 2.0), also known as Power Systems Private Cloud with Shared Utility Capacity: PEP 2.0 throttling has been engaged and IFL processors are being used in the environment. A problem was fixed for inconsistencies in the link status LED to help with the service of faulty cables using the link activity lights. With the fix, LEDs are now “all or none”. If one lane or more is active in the entire link where the link spans both cables, then both link activity LEDs are activated. If zero lanes are active (link train fail), then the link activity LEDs are off. A problem was fixed for a boot failing from the SMS menu if a network adapter has been configured with VLAN tags. This issue can be seen when a VLAN ID is used during a boot from the SMS menu and if the external network environment, such as a switch, triggers incoming ARP requests to the server. This problem can be circumvented by not using VLAN ID from the SMS menu. After the install and boot, VLAN can be configured from the OS. A problem was fixed for a bad format of a PEL reported by SRC BD802002. In this case, the malformed log will be a Partition Firmware created SRC of BA28xxxx (RTAS hardware error), BA2Bxxxx (RTAS non-hardware error), or BA188001 (EEH Temp error) log. No other log types are affected by this error condition. This problem occurs anytime one of the affected SRCs is created by Partition Firmware. These are hidden informational logs used to provide supplemental FFDC information so there should not be a large impact on system users by this problem. A problem was fixed for DLPAR removes of embedded I/O (such as integrated USB) that fail. An SRC BA2B000B hidden log will also be produced because of the failure. This error does not impact DLPAR remove of slot-based (hot-pluggable) I/O. Any attempt to DLPAR remove embedded I/O will trigger the issue and result in a DLPAR failure. A problem was fixed for the total hardware uptime on the ASMI power on/off system page being incorrect. For a system run for a longer time (more than 30 days), the uptime value overflows and resets to 0, before counting up again. With the fix, the internal 32-bit counter has been increased to 64 bits to prevent the overflow condition. A problem was fixed for SRC 110015x1 for a current share fault calling out a power supply for replacement. For this SRC, the power supply does not need to be replaced or serviced, so this fix changes the SRC to be informational instead of a serviceable event. As a workaround, this SRC can be ignored. A problem was fixed for an incorrect “Current hardware uptime” being displayed on the backup FSP ASMI welcome screen. Since this value cannot be maintained by the backup FS, the field has been removed from the backup FSP with the fix. The “Current hardware uptime” value can be found shown correctly on the primary FSP ASMI welcome screen. A problem was fixed for a missing hardware callout for NVMe drives that are having a temperature failure (failure to read temperature or over temperature).
VM950_131_045 / FW950.80 2023/05/26	Impact: Data Severity: HIPER System firmware changes that affect all systems HIPER/Pervasive: AIX logical partitions that own virtual I/O devices or SR-IOV virtual functions may have data incorrectly written to platform memory or an I/O device, resulting in undetected data loss when Dynamic Platform Optimizer (DPO), predictive memory deconfiguration occurs, or memory mirroring defragmentation is performed. To mitigate the risk of this issue, please install the latest FW950 service pack (FW950.80 or later). A security problem was fixed for a scenario where the IBM PowerVM Hypervisor could allow an attacker to obtain sensitive information if they gain service access to the HMC. The Common Vulnerabilities and Exposures number for this problem is CVE-2023-25683. A change was made for certain SR-IOV adapters to move up to the latest level of adapter firmware. This update contains important reliability improvements and security hardening enhancements This change updates the adapter firmware to XX.34.1002 for the following Feature Codes and CCIN: #EC66/EC67 with CCIN 2CF3. If this adapter firmware level is concurrently applied, AIX and VIOS VFs may become failed. Certain levels of AIX and VIOS do not properly handle concurrent SR-IOV updates, and can leave the virtual resources in a DEAD state. Please review the following document for further details: https://www.ibm.com/support/pages/node/6997885. A re-IPL of the system instead of concurrently updating the SR-IOV adapter firmware would also work to prevent a VF failure. Update instructions: https://www.ibm.com/docs/en/power9?topic=adapters-updating-sr-iov-adapter-firmware A problem was fixed for an SR-IOV virtual function (VF) failing to configure for a Linux partition. This problem can occur if an SR-IOV adapter that had been in use on prior activation of the partition was removed and then replaced with an SR-IOV adapter VF with a different capacity. As a workaround, the partition with the failure can be rebooted. A problem was fixed for a timeout occurring for an SR-IOV adapter firmware LID load during an IPL, with SRC B400FF04 logged. This problem can occur if a system has a large number of SR-IOV adapters to initialize. The system recovers automatically when the boot completes for the SR-IOV adapter. A problem was fixed for the ASMI "Real-time Progress Indicator" not refreshing automatically to show the new progress codes. The ASMI must be refreshed manually to show the new progress codes during the IPL. A problem was fixed for a system failing an IPL with SRC B700F10A but not calling out the processor with the TOD error. This happens whenever the PowerVM hypervisor does a TI checkstop due to a TOD error. As a workaround, the bad processor must be guarded or replaced to allow the system to IPL. A problem was fixed for the Redfish (REST) API not returning data. The REST API to gather power usage for all nodes in watts and the ambient temperature for the system does not return the data. The new schema IBMEnterpriseComputerSystem.v1_1_0.json is missing, causing the Redfish GETs to fail. A problem was fixed for unexpected vNIC failovers that can occur if all vNIC backing devices are in LinkDown status. This problem is very rare that only occurs if both vNIC server backing devices are in LinkDown, causing vNIC failovers that bounce back and forth in a loop until one of the vNIC backing devices comes to Operational status. A problem was fixed for the HMC Repair and Verify (R&V) procedure failing during concurrent maintenance of the #EMX0 Cable Card. This problem can occur if a partition is IPLed after a hardware failure before attempting the R&V operation. As a workaround, the R&V can be performed with the affected partition powered off or the system powered off. A problem was fixed for a possible incomplete state for the HMC-managed system with SRCs B17BE434 and B182953C logged, with the PowerVM hypervisor hung. This error can occur if a system has a dedicated processor partition configured to not allow processor sharing while active. A problem was fixed for Power Systems Private Cloud with Shared Utility Capacity (formerly known as Power Enterprise Pools 2.0) to change system throttling from immediate to gradual over 20 days if this service is not renewed and the system becomes incompliant. This change provides more time for the system administrator to resolve the compliance issue before jobs running on the system are impacted by the reduced resources. Once the system has become non-compliant, the number of cores available will be reduced daily over 20 days until the system is back to a base level. A problem was fixed for missing countdown expiration messages after a renewal of PEP 2.0. Power Enterprise Pools 2.0 (PEP 2.0), also known as Power Systems Private Cloud with Shared Utility Capacity, normally has automatic renewal, but if this does not occur for some reason, expiration of PEP 2.0 should be warned by countdown messages before expiration and by daily messages after expiration. As a workaround, the CMC appliance can be examined to see the current status of the PEP 2.0 subscription. A problem was fixed for a performance issue after PEP 2.0 throttling or usage of the optmem HMC command. This issue can be triggered by the following scenario for Power Enterprise Pools 2.0 (PEP 2.0), also known as Power Systems Private Cloud with Shared Utility Capacity: Due to a PEP 2.0 budget being reached or an issue with licensing for the pool, the CPU resources may be restricted (throttled). At the start of the next month, after a change in the budget limit or after correction of the licensing issue, the CPU resources will be returned to the server (un-throttled). At this point in time, the performance of the PEP 2.0 pool may not return to the level of performance before throttling. As a workaround, partitions and VIOS can be restarted to restore the performance to the expected levels. Although this fix applies concurrently, a restart of partitions or VIOS would need to be done to correct the system performance if it has been affected. A problem was fixed for incomplete descriptions for the display of devices attached to the FC adapter in SMS menus. The FC LUNs are displayed using this path in SMS menus: "SMS->I/O Device Information -> SAN-> FCP-> <FC adapter>". This problem occurs if there are LUNs in the SAN that are not OPEN-able, which prevents the detailed descriptions from being shown for that device. A problem was fixed for an HMC lpar_netboot error for a partition with a VNIC configuration. The lpar_netboot logs show a timeout due to a missing value. As a workaround, doing the boot manually in SMS works. The lpar_netboot could also work as long as broadcast bootp is not used, but instead use lpar_netboot with a standard set of parameters that include Client, Server, and Gateway IP addresses. Problems were fixed for DIMM memory row repairs that were failing and getting repaired with chip marks instead, reducing the repairability of the memory.
VM950_124_045 / FW950.71 2023/05/17	Impact: Security Severity: HIPER System Firmware changes that affect all systems HIPER/Pervasive: An internally discovered vulnerability in PowerVM on Power9 and Power10 systems could allow an attacker with privileged user access to a logical partition to perform an undetected violation of the isolation between logical partitions which could lead to data leakage or the execution of arbitrary code in other logical partitions on the same physical server. The Common Vulnerability and Exposure number is CVE-2023-30438. For additional information refer to https://www.ibm.com/support/pages/node/6987797 A problem was identified internally by IBM related to SRIOV virtual function support in PowerVM. An attacker with privileged user access to a logical partition that has an assigned SRIOV virtual function (VF) may be able to create a Denial of Service of the VF assigned to other logical partitions on the same physical server and/or undetected arbitrary data corruption. The Common Vulnerability and Exposure number is CVE-2023-30440.
VM950_119_045 / FW950.70 2023/02/15	Impact: Data Severity: HIPER System firmware changes that affect all systems HIPER/Pervasive: If a partition running in Power9 compatibility mode encounters memory errors and a Live Partition Mobility (LPM) operation is subsequently initiated for that partition, undetected data corruption within GZIP operations (via hardware acceleration) may occur within that specific partition. HIPER/Pervasive: If a partition running in Power9 compatibility mode encounters an uncorrectable memory error during a Dynamic Platform Optimization (DPO), memory guard, or memory mirroring defragmentation operation, undetected data corruption may occur in any partition(s) within the system or the system may terminate with SRC B700F105. A problem was fixed for performance slowdowns that can occur during the Live Partition Mobility (LPM) migration of a partition in POWER9, POWER10, or default processor compatibility modes. For this to happen to a partition in default processor compatibility mode, it must have booted on a Power10 system. If this problem occurs, the performance will return to normal after the partition migration completes. As a workaround, the partition to be migrated can be put into POWER9_base processor compatibility mode or older. A problem was fixed for FSP slowness or system failing to IPL with SRC B1812624 errors logged. This may occur if IPMI is used to request CPU temperatures when the On-Chip Controller is not available. This would be the case if the IPMI requests were made while the system was powered down. A problem was fixed for a processor core not being called out and guarded if a recoverable core error recovery fails and triggers a system checkstop. This happens only if core error recovery fails with a core unit checkstop. For a system with I/O Enlarged Capacity enabled, and greater than 8 TB of memory, and having an adapter in SR-IOV shared mode, a problem was fixed for partition or system termination for a failed memory page relocation. This can occur if the SR-IOV adapter is assigned to a VIOS and virtualized to a client partition and then does an I/O DMA on a section of memory greater than 2 GB in size. This problem can be avoided by not enabling "I/O Enlarged Capacity". A problem was fixed for an SR-IOV adapter showing up as "n/a" on the HMC's Hardware Virtualized I/O menu. This is an infrequent error that can occur if an I/O drawer is moved to a different parent slot. As a workaround, the PowerVM Hypervisor NVRAM can be cleared or the I/O drawer can be moved back to the original parent slot to clean up the configuration. A problem was fixed for too frequent callouts for repair action for recoverable errors for Predictive Error (PE) SRCs B7006A72, B7006A74, and B7006A75. These SRCs for PCIe correctable error events called for a repair action but the threshold for the events was too low for a recoverable error that does not impact the system. The threshold for triggering the PE SRCs has been increased for all PLX and non-PLX switch correctable errors. A problem was fixed for not being able to reduce partition memory when the PowerVM hypervisor has insufficient memory for normal operations. With the fix, a partition configuration change to reduce memory is allowed when the hypervisor has insufficient memory. A possible workaround for this error is to free up system memory by deleting a partition. A problem was fixed for an incorrect capacity displayed for a Fibre Channel device using SMS option "I/O Device Information". This happens every time for a device that has a capacity greater than 2 TB. For this case, the capacity value displayed may be significantly less than 2 TB. For example, a 2 TB device would be shown as having a capacity of 485 GB. A problem was fixed for a partition firmware data storage error with SRC BA210003 logged or for a failure to locate NVMe target namespaces when attempting to access NVMe devices over Fibre Channel (FC-NVME) SANs connected to third-party vendor storage systems. This error condition, if it occurs, prevents firmware from accessing NVMe namespaces over FC as described in the following scenarios: 1) Boot attempts from an NVMe namespace over FC using the current SMS bootlist could fail. 2) From SMS menus via option 3 - I/O Device Information - no devices can be found when attempting to view NVMe over FC devices. 3) From SMS menus via option 5 - Select Boot Options - no bootable devices can be found when attempting to view and select an NVMe over FC bootable device for the purpose of boot, viewing the current device order, or modifying the boot device order. The trigger for the problem is the attempted access of NVMe namespaces over Fibre Channel SANs connected to storage systems via one of the scenarios listed above. The frequency of this problem can be high for some of the vendor storage systems. A problem was fixed for a failed NIM download/install of OS images that are greater than 32M. This only happens when using the default TFTP block size of 512 bytes. The latest versions of AIX are greater than 32M in size and can have this problem. As a workaround, in the SMS menu, change "TFTP blocksize" from 512 to 1024. To do this, go to the SMS "Advanced Setup: BOOTP" menu option when setting up NIM install parameters. This will allow a NIM download of an image up to 64M. A problem was fixed for a security scan with NSFOCUS reporting the following low-priority vulnerabilities: 1. Low. Web server enabled "options" 2. Low. Response no "Referrer-Policy" header 3. Low. Response no "X-Permitted-Cross-Domain-Policies" header 4. Low. Response no "X-Download-Options" header 5. Low. Response no "Content-Security-Policy" header There is no impact to the system from these as the FSP service processor does not provide any features which can be exploited by the five vulnerabilities. A problem was fixed for a security scan with NSFOCUS reporting a medium-level vulnerability for a slow HTTPS request denial of service attack against ASMI. This occurs whenever NSFOCUS scans are run. Support for using a Redfish (REST) API to gather power usage for all nodes in watts and the ambient temperature for the system. Redfish sample response is as shown below: ==>> GET redfish/v1/Systems/<> ... "Oem": { "IBMEnterpriseComputerSystem": { ... ... "PowerInputWatts" : <> ( number in watts), <<<<============ "AmbientTemp" : <> (number in Celsius) <<<<============ } }, ...
VM950_111_045 / FW950.60 2022/10/20	Impact: Availability Severity: SPE System firmware changes that affect all systems A change was made for certain SR-IOV adapters to move up to the latest level of adapter firmware. No specific adapter problems were addressed at this new level. This change updates the adapter firmware to XX.32.1010 for the following Feature Codes and CCINs: #EC2R/EC2S with CCIN 58FA; #EC2T/EC2U with CCIN 58FB; and #EC66/EC67 with CCIN 2CF3. If this adapter firmware level is concurrently applied, AIX and VIOS VFs may become failed. To prevent the VF failure, the VIOS and AIX partitions must have the fix for IJ44288 (or a sibling APAR) applied prior to concurrently updating SR-IOV adapter firmware. AIX/VIOS SPs Spring 2023 will ship this fix. Until then, interim fixes (ifixes) are available from https://aix.software.ibm.com/aix/efixes/ij44288/ or by calling IBM support if an ifix is required for a different level. A re-IPL of the system instead of concurrently updating the SR-IOV adapter firmware would also work to prevent a VF failure. Please review the following document for further details: https://www.ibm.com/support/pages/node/6997885 Security problems were fixed for vTPM 1.2 by updating its OpenSSL library to version 0.9.8zh. Security vulnerabilities CVE-2022-0778, CVE-2018-5407, CVE-2014-0076, and CVE-2009-3245 were addressed. These problems only impact a partition if vTPM version 1.2 is enabled for the partition. A problem was fixed for an intermittent service processor core dump for MboxDeviceMsg with SRCs B1818601 and B6008601 logged while the system is running. This is a timing failure related to a double file close on an NVRAM file. The service processor will automatically recover from this error with no impact on the system. A problem was fixed for an SR-IOV adapter in shared mode failing on an IPL with SRC B2006002 logged. This is an infrequent error caused by a different SR-IOV adapter than expected being associated with the slot because of the same memory buffer being used by two SR-IOV adapters. The failed SR-IOV adapter can be powered on again and it should boot correctly. A problem was fixed for an SR-IOV adapter in shared mode failing during run time with SRC B400FF04 or B400F104 logged. This is an infrequent error and may result in a temporary loss of communication as the affected SR-IOV adapter is reset to recover from the error. A problem was fixed for a system crash with a B700F103 logged after a local core checkstop of a core with a running partition. This infrequent error also requires a configuration change on the system like changing the processor configuration of the affected partition or running Dynamic Platform Optimizer (DPO). A problem was fixed for a rare system hang that can happen any time Dynamic Platform Optimizer (DPO), memory guard recovery, or memory mirroring defragmentation occurs for a dedicated processor partition running in Power9 or Power10 processor compatibility mode. This does not affect partitions in Power9_base or older processor compatibility modes. If the partition has the "Processor Sharing" setting set to "Always Allow" or "Allow when partition is active", it may be more likely to encounter this than if the setting is set to "Never allow" or "Allow when partition is inactive". This problem can be avoided by using Power9_base processor compatibility mode for dedicated processor partitions. This can also be avoided by changing all dedicated processor partitions to use shared processors. A problem was fixed for a partition with VPMEM failing to activate after a system IPL with SRC B2001230 logged for a "HypervisorDisallowsIPL" condition. This problem is very rare and is triggered by the partition's hardware page table (HPT) being too big to fit into a contiguous space in memory. As a workaround, the problem can be averted by reducing the memory needed for the HPT. For example, if the system memory is mirrored, the HPT size is doubled, so turning off mirroring is one option to save space. Or the size of the VPMEM LUN could be reduced. The goal of these options would be to free up enough contiguous blocks of memory to fit the partition's HPT size. A problem was fixed for a rare partition hang that can happen any time Dynamic Platform Optimizer (DPO), memory guard recovery, or memory mirroring defragmentation occurs for a shared processor partition running in any compatibility mode if there is also a dedicated processor partition running in Power9 or Power10 processor compatibility mode. This does not happen if the dedicated partition is in Power9_base or older processor compatibility modes. Also, if the dedicated partition has the "Processor Sharing" setting set to "Always Allow" or "Allow when partition is active", it may be more likely to cause a shared processor partition to hang than if the setting is set to "Never allow" or "Allow when partition is inactive". This problem can be avoided by using Power9_base processor compatibility mode for any dedicated processor partitions. This problem can also be avoided by changing all dedicated processor partitions to use shared processors. A problem was fixed for booting an OS using iSCSI from SMS menus that fails with a BA010013 information log. This failure is intermittent and infrequent. If the contents of the BA010013 are inspected, the following messages can be seen embedded within the log: " iscsi_read: getISCSIpacket returned ERROR" " updateSN: Old iSCSI Reply - target_tag, exp_tag" A problem was fixed for the SMS menu option "I/O Device Information". When using a partition's SMS menu option "I/O Device Information" to list devices under a physical or virtual Fibre Channel adapter, the list may be missing or entries in the list may be confusing. If the list does not display, the following message is displayed: "No SAN adapters present. Press any key to continue". An example of a confusing entry in a list follows: "Pathname: /vdevice/vfc-client@30000004 WorldWidePortName: 0123456789012345 1. 500173805d0c0110,0 Unrecognized device type: c" A problem was fixed for a memory leak in the service processor (FSP) that can result in an out of memory (OOM) condition in the FSP kernel with an FSP dump and reset of the FSP. This can occur after the FSP has been active for more than 80 days of uptime. If the problem occurs, the system automatically recovers with a reset/reload of the FSP. A problem was fixed for too frequent callouts for repair action for recoverable errors for SRCs B7006A72, B7006A74, and B7006A75. The current threshold limit for the switch correctable errors is 5 occurring in 10 minutes, which is too low for a predictable event that requests a part replacement. With the fix, the threshold value for calling out a part replacement is increased to match what is done for the PCIe Host Bridge ( PHB) correctable errors. Every correctable error threshold condition on the switch link triggers the too frequent callouts. A problem was fixed for a service processor FSP kernel panic dump and reset/reload that can occur if there is a network configuration error when using ASMI to change the network. The SRCs B1817201 and B1817212 are logged prior to the dump. This problem only occurs when changing the network configuration to an incorrect setting that causes a network timeout. System firmware changes that affect certain systems On a system with no HMC and a serially attached terminal, a problem was fixed for an intermittent service processor core dump for NetsVTTYServer with B181D30B logged that can when using the terminal console for the OS. This error causes the console to be lost but can be recovered by doing a soft reset of the service processor.
VM950_105_045 / FW950.50 2022/07/29	Impact: Availability Severity: HIPER System firmware changes that affect all systems HIPER/Non-Pervasive: The following problems were fixed for certain SR-IOV adapters in shared mode when the physical port is configured for Virtual Ethernet Port Aggregator (VEPA): 1) A security problem for CVE-2022-34331 was addressed where switches configured to monitor network traffic for malicious activity are not effective because of errant adapter configuration changes. The misconfigured adapter can cause network traffic to flow directly between the VFs and not out the physical port hence bypassing any possible monitoring that could be configured in the switch. 2) Packets may not be forwarded after a firmware update, or after certain error scenarios which require an adapter reset. Users configuring or using VEPA mode should install this update. These fixes pertains to adapters with the following Feature Codes and CCINs: #EC2R/#EC2S with CCIN 58FA; #EC2T/#EC2U with CCIN 58FB; #EC3L/#EC3M with CCIN 2CEC; and #EC66/#EC67 with CCIN 2CF3. Update instructions: https://www.ibm.com/support/knowledgecenter/en/POWER9/p9efd/p9efd_updating_sriov_firmware.htm. HIPER/Non-Pervasive: A problem was fixed for certain SR-IOV adapters in shared mode when the physical port is configured for Virtual Ethernet Port Aggregator (VEPA). Packets may not be forwarded after a firmware update, or after certain error scenarios which require an adapter reset. Users configuring or using VEPA mode should install this update. This fix pertains to adapters with the following Feature Codes and CCINs: #EC2R/#EC2S with CCIN 58FA; #EC2T/#EC2U with CCIN 58FB; #EC3L/#EC3M with CCIN 2CEC; and #EC66/#EC67 with CCIN 2CF3. Update instructions: https://www.ibm.com/support/knowledgecenter/en/POWER9/p9efd/p9efd_updating_sriov_firmware.htm. A problem was fixed for a rare service processor core dump for NetsCommonMsgServer with SRC B1818611 logged that can occur when doing an AC power-on of the system. This error does not have a system impact beyond the logging of the error as an auto-recovery happens. A problem was fixed for the wrong IBM part number (PN) being displayed in inventory reports and callouts for the 16GB-based Samsung 128 GB DIMM with IBM part number 78P7468 and Samsung part number: M393AAG40M32-CAE. The PN 78P7468 should be shown for the Samsung memory DIMM instead of PN 78P6925 which is specific to the Hynix 128GB memory DIMM. A problem was fixed for an apparent hang in a partition shutdown where the HMC is stuck in a status of "shutting down" for the partition. This infrequent error is caused by a timing window during the system or partition power down where the HMC checks too soon and does not see the partition in the "Powered Off" state. However, the power off of the partition does complete even though the HMC does not acknowledge it. This error can be recovered by rebuilding the HMC representation of the managed system by following the below steps: 1) In the navigation area on the HMC, select Systems Management > Servers. 2) In the contents pane, select the required managed system. 3) Select Tasks > Operations > Rebuild. 4) Select Yes to refresh the internal representation of the managed system. A problem was fixed for a hypervisor task failure with SRC B7000602 logged when running debug macro "sbdumptrace -sbmgr -detail 2" to capture diagnostic data. The secure boot trace buffer is not aligned on a 16-byte boundary in memory which triggers the failure. With the fix, the hypervisor buffer dump utility is changed to recognize 8-byte aligned end of buffer boundaries. A problem was fixed for Predictive Error (PE) SRCs B7006A72 and B7006A74 being logged too frequently. These SRCs for PCIe correctable error events called for a repair action but the threshold for the events was too low for a recoverable error that does not impact the system. The threshold for triggering the PE SRCs has been increased. A problem was fixed for a system crash with SRC B7000103 that can occur when adding or removing FRUs from a PCIe3 expansion drawer (Feature code #EXM0). This error is caused by a very rare race scenario when processing multiple power alerts from the expansion drawer at the same time. A problem was fixed for an HMC incomplete state for the managed system after a concurrent firmware update. This is an infrequent error caused by an HMC query race condition while the concurrent update is rebooting tasks in the hypervisor. A system re-IPL is needed to recover from the error. A problem was fixed for an On-Chip Controller (OCC) and a Core Management Engine ( CME) boot failure during the IPL with SRC BC8A090F and a RC_STOP_GPE_INIT_TIMEOUT error logged. This is an intermittent IPL failure. The system can be recovered by retrying the IPL. This fix reduces the frequency of the error but it may still rarely occur. If it does occur, the retry of the IPL will be successful to recover the system. A problem was fixed for a failed correctable error recovery for a DIMM that causes a flood of SRC BC81E580 error logs and also can prevent dynamic memory deallocation from occurring for a hard memory error. This is a very rare problem caused by an unexpected number of correctable error symbols for the DIMM in the per-symbol counter registers.
VM950_099_045 / FW950.40 2022/05/06	Impact: Security Severity: HIPER System firmware changes that affect all systems HIPER/Non-Pervasive: A problem was fixed for a flaw in OpenSSL TLS which can lead to an attacker being able to compute the pre-master secret in connections that have used a Diffie-Hellman (DH) based ciphersuite. In such a case this would result in the attacker being able to eavesdrop on all encrypted communications sent over that TLS connection. OpenSSL supports encrypted communications via the Transport Layer Security (TLS) and Secure Sockets Layer (SSL) protocols. With the fix, the service processor Lighttpd web server is changed to only use a strict cipher list that precludes the use of the vulnerable ciphersuites. The Common Vulnerability and Exposure number for this problem is CVE-2020-1968. A problem was fixed for a change made to disable Service Location Protocol (SLP) by default for a newly shipped system so that the SLP is disabled by a reset to manufacturing defaults on all systems and to also disable SLP on all systems when this fix is applied by the firmware update. The SLP configuration change has been made to reduce memory usage on the service processor by disabling a service that is not needed for normal system operations. In the case where SLP does need to be enabled, the SLP setting can be changed using ASMI with the options "ASMI -> System Configuration -> Security -> External Services Management" to enable or disable the service. Without this fix, resetting to manufacturing defaults from ASMI does not change the SLP setting that is currently active. A problem was fixed for ASMI TTY menus allowing an unsupported change in hypervisor mode to OPAL. This causes an IPL failure with BB821410 logged if OPAL is selected. The hypervisor mode is not user-selectable in POWER9 and POWER10. Instead, the hypervisor mode is determined by the MTM of the system. With this fix, the "Firmware Configuration" option in ASMI TTY menus is removed so that it matches the options given by the ASMI GUI menus. A problem was fixed for correct ASMI passwords being rejected when accessing ASMI using an ASCII terminal with a serial connection to the server. This problem always occurs for systems at firmware level FW950.10 and later. A problem was fixed for a flaw in OpenSSL certificate parsing that could result in an infinite loop in the hypervisor, causing a hang in a Live Partition Mobility (LPM) target partition. The trigger for this failure is an LPM migration of a partition with a corrupted physical trusted platform module (pTPM) certificate. This is expected to be a rare problem. The Common Vulnerability and Exposure number for this problem is CVE-2022-0778. A problem was fixed for a partition with an SR-IOV logical port (VF) having a delay in the start of the partition. If the partition boot device is an SR-IOV logical port network device, this issue may result in the partition failing in boot with SRCs BA180010 and BA155102 logged and then stuck on progress code SRC 2E49 for an AIX partition. This problem is infrequent because it requires multiple error conditions at the same time on the SR-IOV adapter. To trigger this problem, multiple SR-IOV logical ports for the same adapter must encounter EEH conditions at roughly the same time such that a new logical port EEH condition is occurring while a previous EEH condition's handling is almost complete but not notified to the hypervisor yet. To recover from this problem, reboot the partition. A problem was fixed for errors that can occur if doing a Live Partition Mobility (LPM) migration and a Dynamic Platform Optimizer (DPO) operation at the same time. The migration may abort or the system or partition may crash. This problem requires running multiple migrations and DPO at the same time. As a circumvention, do not use DPO while doing LPM migrations. A problem was fixed for a secondary fault after a partition creation error that could result in a Terminate Immediate (TI) of the system with an SRC B700F103 logged. The failed creation of partitions can be explicit or implicit which might trigger the secondary fault. One example of an implicit partition create is the ghost partition created for a Live Partition Mobility (LPM) migration. This type of partition can fail to create when there is insufficient memory available for the hardware page table (HPT) for the new partition. A problem was fixed for a partition reboot recovery for an adapter in SR-IOV shared mode that rebooted with an SR-IOV port missing. Prior to the reboot, this adapter had SR-IOV ports that failed and were removed after multiple adapter faults, This problem should only occur rarely as it requires a sequence of multiple faults on an SR-IOV adapter in a short time interval to force the SR-IOV Virtual Function (VF) into the errant unrecoverable state. The missing SR-IOV port can be recovered for the partition by doing a remove and add of the failed adapter with DLPAR, or the system can be re-IPLed. The following problems were fixed for certain SR-IOV adapters: 1)A problem was fixed for certain SR-IOV adapters that occurs during a VNIC failover where the VNIC backing device has a physical port down due to an adapter internal error with an SRC B400FF02 logged. This is an improved version of the fix delivered in earlier service pack FW950.10 for adapter firmware level 11.4.415.37 and it significantly reduces the frequency of the error being fixed. 2) A problem was fixed for an adapter issue where traffic doesn’t flow on a VF when the VF is configured with a PVID set to zero and using OS VLAN tagging is configured on a physical port where a VF with a PVID set to the same VLAN ID already exists. This problem occurs whenever this specific VF configuration is dynamically added to a partition or is activated as part of a partition activation. This fix updates the adapter firmware to 11.4.415.43 for the following Feature Codes and CCINs: #EN15/#EN16 with CCIN 2CE3, #EN17/#EN18 with CCIN 2CE4, #EN0H/#EN0J with CCIN 2B93, #EN0M/#EN0N with CCIN 2CC0, and #EN0K/#EN0L with CCIN 2CC1. Update instructions: https://www.ibm.com/support/knowledgecenter/en/POWER9/p9efd/p9efd_updating_sriov_firmware.htm. A problem was fixed for multiple incorrect informational error logs with Thermal Management SRC B1812649 being logged on the service processor, These error logs are more frequent on multiple node systems, but can occur on all system models. The error is triggered by a false time-out and does not reflect a real problem on the service processor. System firmware changes that affect certain systems For a system with an AIX or Linux partition, a problem was fixed a partition start failure for AIX or Linux with SRC BA54504D logged. This problem occurs if the partition is an MDC default partition with virtual Trusted Platform Module (vTPM) enabled. As a circumvention, power off the system and disable vTPM using the HMC GUI to change the default partition property for Virtualized Trusted Platform Module (VTPM) to off. For a system with vTPM enabled, a problem was fixed for an intermittent system hang with SRCs 11001510 and B17BE434 logged and the HMC showing the system in the "Incomplete" state. This problem is very rare. It may be triggered by different scenarios such as a partition power off; a processor DLPAR remove operation; or a Simultaneous Multi-threading (SMT) mode change in a partition. For a system that does not have an HMC attached, a problem was fixed for a system dump 2GB or greater in size failing to off-load to the OS with an SRC BA280000 logged in the OS and an SRC BA28003B logged on the service processor. This problem does not affect systems with an attached HMC since in that case system dumps are off-loaded to the HMC, not the OS, where there is no 2GB boundary error for the dump size.
VM950_092_045 / FW950.30 2021/12/09	Impact: Security Severity: HIPER System firmware changes that affect all systems HIPER/Non-Pervasive: A security problem was fixed to prevent an attacker that gains service access to the FSP service processor from reading and writing PowerVM system memory using a series of carefully crafted service procedures. This problem is Common Vulnerability and Exposure number CVE-2021-38917. HIPER/Non-Pervasive: A problem was fixed for the IBM PowerVM Hypervisor where through a specific sequence of VM management operations could lead to a violation of the isolation between peer VMs. The Common Vulnerability and Exposure number is CVE-2021-38918. A problem was fixed for a possible denial of service on the service processor for ASMI and Redfish users. This problem is very rare and could be triggered by a large number of invalid log in attempts to Redfish over a short period of time. A problem was fixed for a service processor hang after a successful system power down with SRC B181460B and SRC B181BA07 logged. This is a very rare problem that results in a fipsdump and a reset/reload of the service processor that recovers from the problem. A problem was fixed for system fans not increasing in speed when partitions are booted with PCIe hot adapters that require additional cooling. This fan speed problem can also occur if there is a change in the power mode that requires a higher minimum speed for the fans of the system than is currently active. Fans running at a slower speed than required for proper system cooling could lead to over-temperature conditions for the system. A problem was fixed for a hypervisor hang and HMC Incomplete error with SRC B17BE434 logged on a system with a virtual Network Interface Controller (vNIC) adapters. The failure is triggered by actions occurring on two different SR-IOV logical ports for the same adapter in the VIOS that is backing the vNIC that result in a deadlock condition. This is a rare failure that can occur during a Live Partition Mobility (LPM) migration for a partition with vNIC adapters. A problem was fixed for a longer boot time for a shared processor partition on the first boot after the processor chip 0 has been guarded. The partition boot could stall at SRC C20012FF but eventually complete. This rare problem is triggered by the loss of all cores in processor chip 0. On subsequent partition boots after the slow problem boot, the boot speeds return to normal. A problem was fixed for a Live Partition Mobility (LPM) hang during LPM validation on the target system. This is a rare system problem triggered during an LPM migration that causes LPM attempts to fail as well as other functionality such as configuration changes and partition shutdowns. To recover from this problem to be able to do LPM and other operations such as configuration changes and shutting down partitions, the system must be re-IPLed. A problem was fixed for incorrect Power Enterprise Pools(PEP) 2.0 throttling when the system goes out of compliance. When the system is IPLed after going out of compliance, the amount of throttled resources is lower than it should be on the first day after the IPL. Later on, the IBM Cloud Management Console (CMC) corrects the throttle value. This problem requires that a PEP 2.0 system has to go out of compliance, so it should happen only rarely. To recover from this problem, the user can wait for up to one day after the IPL or have the CMC resend the desired PEP Throttling resource amount to correct it immediately. A problem was fixed for the system powering off after a hardware discovery IPL. This will happen if a hardware discovery IPL is initiated while the system is set to "Power off when last partition powers off". The system will power off when the Hardware Discovery Information (IOR) partition that does hardware discovery powers off. As a workaround, one should not use the "Power off when last partition powers off" setting when doing the hardware discovery IPL. Alternatively, one can just do a normal IPL after the system powers off, and then continue as normal. A problem was fixed for system NVRAM corruption that can occur during PowerVM hypervisor shutdown. This is a rare error caused by a timing issue during the hypervisor shutdown. If this error occurs, the partition data will not be able to read from the invalid NVRAM when trying to activate partitions, so the NVRAM must be cleared and the partition profile data restored from the HMC. A problem was fixed for the HMC Repair and Verify (R&V) procedure failing with "Unable to isolate the resource" during concurrent maintenance of the #EMX0 Cable Card. This could lead one to take disruptive action in order to do the repair. This should occur infrequently and only with cases where a physical hardware failure has occurred which prevents access to the PCIe reset line (PERST) but allows access to the slot power controls. As a workaround, pulling both cables from the Cable Card to the #EMX0 expansion drawer will result in a completely failed state that can be handled by bringing up the "PCIe Hardware Topology" screen from either ASMI or the HMC. Then retry the R&V operation to recover the Cable Card. A problem was fixed to prevent a flood of informational PCIe Host Bridge (PHB) error logs with SRC B7006A74 that cause a wrap of internal flight recorders and loss of data needed for problem debug. This flood can be triggered by bad cables or other issues that cause frequent informational error logs. With the fix, thresholding has been added for informational PHB correctable errors at 10 in 24 hours before a Predictive Error is logged. A problem was fixed for vague and misleading errors caused by using an invalid logical partition (LP) id for a resource dump request. With the fix, the invalid LP id is rejected immediately as a user input error instead of being processed by the main storage dump to create what appear to be severe errors. A problem was fixed for certain SR-IOV adapters that encountered a rare adapter condition, had some response delays, and logged an Unrecoverable Error with SRC B400FF02. With the fix, handling of this rare condition is accomplished without the response delay and an Informational Error is logged. and the adapter initialization continues without interruption. This fix pertains to adapters with the following Feature Codes and CCINs: #EC2R/EC2S with CCIN 58FA; #EC2T/EC2U with CCIN 58FB; #EC3L/EC3M with CCIN 2CEC; and #EC66/EC67 with CCIN 2CF3. Update instructions: https://www.ibm.com/support/knowledgecenter/en/POWER9/p9efd/p9efd_updating_sriov_firmware.htm. A change was made for certain SR-IOV adapters to move up to the latest level of adapter firmware. No specific adapter problems were addressed at this new level. This change updates the adapter firmware to XX.30.1004 for the following Feature Codes and CCINs: #EC2R/EC2S with CCIN 58FA; #EC2T/EC2U with CCIN 58FB; and #EC66/EC67 with CCIN 2CF3. Update instructions: https://www.ibm.com/support/knowledgecenter/en/POWER9/p9efd/p9efd_updating_sriov_firmware.htm. A problem was fixed for an SR-IOV adapter in shared mode configured as Virtual Ethernet Port Aggregator (VEPA) where the SR-IOV adapter goes through EEH error recovery, causing an informational error with SRC B400FF04 and additional information text that indicates a command failed. This always happens when an adapter goes through EEH recovery and a physical port is in VEPA mode. With the fix, the informational error is not logged. Update instructions: https://www.ibm.com/support/knowledgecenter/en/POWER9/p9efd/p9efd_updating_sriov_firmware.htm. A problem was fixed for certain SR-IOV adapters where Virtual Functions (VFs) failed to configure after an immediate restart of a logical partition (LPAR) or a shutdown/restart of an LPAR. This problem only happens intermittently but is more likely to occur for the immediate restart case. A workaround for the problem is to try another shutdown and restart of the partition or use DLPAR to remove the failing VF and then use DLPAR to add it back in. This fix pertains to adapters with the following Feature Codes and CCINs: #EC2R/EC2S with CCIN 58FA; #EC2T/EC2U with CCIN 58FB; #EC3L/EC3M with CCIN 2CEC; and #EC66/EC67 with CCIN 2CF3. The fix is in the Partition Firmware and is effective immediately after a firmware update to the fix level. A problem was fixed for a system hypervisor hang and an Incomplete state on the HMC after a logical partition (LPAR) is deleted that has an active virtual session from another LPAR. This problem happens every time an LPAR is deleted with an active virtual session. This is a rare problem because virtual sessions from an HMC (a more typical case) prevent an LPAR deletion until the virtual session is closed, but virtual sessions originating from another LPAR do not have the same check. System firmware changes that affect certain systems For a system with a Linux partition using an SR-IOV adapter, a problem was fixed for ping failures and packet loss for an SR-IOV logical port when a Dynamic DMA Window (DDW) changes from a bigger DMA window page size (such as 64K) back to the smaller default window page size (4K). This can happen during an error recovery that causes a DDW reset back to the default window page size. For a system with an AIX or Linux partition. a problem was fixed for Platform Error Logs (PELs) that are truncated to only eight bytes for error logs created by the firmware and reported to the AIX or Linux OS. These PELs may appear to be blank or missing on the OS. This rare problem is triggered by multiple error log events in the firmware occurring close together in time and each needing to be reported to the OS, causing a truncation in the reporting of the PEL. As a problem workaround, the full error logs for the truncated logs are available on the HMC or using ASMI on the service processor to view them.
VM950_087_045 / FW950.20 2021/09/16	Impact: Data Severity: HIPER New Features and Functions Support added for a mainstream 800GB NVME U.2 7 mm SSD (Solid State Drive) PCIe4 drive in a 7 mm carrier with Feature Code #EC7Q and CCIN 59B4 for AIX, Linux, and VIOS. This PCIe4 drive is also compatible with PCIe3 slots on the system. Support was changed to disable Service Location Protocol (SLP) by default for newly shipped systems or systems that are reset to manufacturing defaults. This change has been made to reduce memory usage on the service processor by disabling a service that is not needed for normal system operations. This change can be made manually for existing customers by changing it in ASMI with the options "ASMI -> System Configuration -> Security -> External Services Management" to disable the service. Support was added to generate a service processor fipsdump whenever there is Hostboot (HB) TI and HB dump. Without this new support, a HB crash (with a HB dump) does not generate a fipsdump and the FSP FFDC at that point in time. So it was difficult to correlate what was seen in the HB dump to what was happening on the FSP at the time of the HB fail. System firmware changes that affect all systems HIPER: A problem was fixed which may occur on a target system following a Live Partition Mobility (LPM) migration of an AIX partition utilizing Active Memory Expansion (AME) with 64 KB page size enabled using the vmo tunable: "vmo -ro ame_mpsize_support=1". The problem may result in AIX termination, file system corruption, application segmentation faults, or undetected data corruption. Note: If you are doing an LPM migration of an AIX partition utilizing AME and 64 KB page size enabled involving a POWER8 or POWER9 system, ensure you have a Service Pack including this change for the appropriate firmware level on both the source and target systems. A problem was fixed for a missing hardware callout and guard for a processor chip failure with SRC BC8AE540 and signature "ex(n0p0c5) (L3FIR[28]) L3 LRU array parity error". A problem was fixed for a missing hardware callout and guard for a processor chip failure with Predictive Error (PE) SRC BC70E540 and signature "ex(n1p2c6) (L2FIR[19]) Rc or NCU Pb data CE error". The PE error occurs after the number of CE errors reaches a threshold of 32 errors per day. A problem was fixed for an infrequent SRC of B7006956 that may occur during a system power off. This SRC indicates that encrypted NVRAM locations failed to synchronize with the copy in memory during the shutdown of the hypervisor. This error can be ignored as the encrypted NVRAM information is stored in a redundant location, so the next IPL of the system is successful. A problem was fixed for a service processor mailbox (mbox) timeout error with SRC B182953C during the IPL of systems with large memory configurations and "I/O Adapter Enlarged Capacity" enabled from ASMI. The error indicates that the hypervisor did not respond quickly enough to a message from the service processor but this may not result in an IPL failure. The problem is intermittent, so if the IPL does fail, the workaround is to retry the IPL. A problem was fixed for a misleading SRC B7006A20 (Unsupported Hardware Configuration) that can occur for some error cases for PCIes #EMX0 expansion drawers that are connected with copper cables. For cable unplug errors, the SRC B7006A88 (Drawer TrainError) should be shown instead of the B7006A20. If a B7006A20 is logged against copper cables with the signature "Prc UnsupportedCableswithFewerChannels" and the message "NOT A 12CHANNEL CABLE", this error should instead follow the service actions for a B7006A88 SRC. Problems were fixed for DLPAR operations that change the uncapped weight of a partition and DLPAR operations that switch an active partition from uncapped to capped. After changing the uncapped weight, the weight can be incorrect. When switching an active partition from uncapped to capped, the operation can fail. A problem was fixed where the Floating Point Unit Computational Test, which should be set to "staggered" by default, has been changed in some circumstances to be disabled. If you wish to re-enable this option, this fix is required. After applying this service pack, do the following steps: 1) Sign in to the Advanced System Management Interface (ASMI). 2) Select Floating Point Computational Unit under the System Configuration heading and change it from disabled to what is needed: staggered (run once per core each day) or periodic (a specified time). 3) Click "Save Settings". A problem was fixed for a hypervisor hang and HMC Incomplete error as a secondary problem after an SR-IOV adapter has gone into error recovery for a failure. This secondary failure is infrequent because it requires an unrecovered error first for an SR-IOV adapter. A problem was fixed for a system termination with SRC B700F107 following a time facility processor failure with SRC B700F10B. With the fix, the transparent replacement of the failed processor will occur for the B700F10B if there is a free core, with no impact to the system. A problem was fixed for an incorrect "Power Good fault" SRC logged for an #EMX0 PCIe3 expansion drawer on the lower CXP cable of B7006A85 (AOCABLE, PCICARD). The correct SRC is B7006A86 (PCICARD, AOCABLE). A problem was fixed for a Live Partition Mobility (LPM) migration that failed with the error "HSCL3659 The partition migration has been stopped because orchestrator detected an error" on the HMC. This problem is intermittent and rare that is triggered by the HMC being overrun with unneeded LPM message requests from the hypervisor that can cause a timeout in HMC queries that result in the LPM operation being aborted. The workaround is to retry the LPM migration which will normally succeed. A problem was fixed for an SR-IOV adapter in shared mode configured as Virtual Ethernet Port Aggregator (VEPA) where unmatched unicast packets were not forwarded to the promiscuous mode VF. Update instructions: https://www.ibm.com/support/knowledgecenter/en/POWER9/p9efd/p9efd_updating_sriov_firmware.htm. A problem was fixed for certain SR-IOV adapters in SR-IOV Shared mode which may cause a network interruption and SRCs B400FF02 and B400FF04 logged. The problem occurs infrequently during normal network traffic.. This fix updates the adapter firmware to 11.4.415.38 for the following Feature Codes and CCINs: #EN15/#EN16 with CCIN 2CE3, #EN17/#EN18 with CCIN 2CE4, #EN0H/#EN0J with CCIN 2B93, #EN0M/#EN0N with CCIN 2CC0, and #EN0K/#EN0L with CCIN 2CC1. Update instructions: https://www.ibm.com/support/knowledgecenter/en/POWER9/p9efd/p9efd_updating_sriov_firmware.htm. A problem was fixed for the Device Description in a System Plan related to Crypto Coprocessors and NVMe cards that were only showing the PCI vendor and device ID of the cards. This is not enough information to verify which card is installed without looking up the PCI IDs first. With the fix, more specific/useful information is displayed and this additional information does not have any adverse impact on sysplan operations. The problem is seen every time a System Plan is created for an installed Crypto Coprocessor or NVMe card. A problem was fixed for some serviceable events specific to the reporting of EEH errors not being displayed on the HMC. The sending of an associated call home event, however, was not affected. This problem is intermittent and infrequent. A problem was fixed for possible partition errors following a concurrent firmware update from FW910 or later. A precondition for this problem is that DLPAR operations of either physical or virtual I/O devices must have occurred prior to the firmware update. The error can take the form of a partition crash at some point following the update. The frequency of this problem is low. If the problem occurs, the OS will likely report a DSI (Data Storage Interrupt) error. For example, AIX produces a DSI_PROC log entry. If the partition does not crash, it is also possible that some subsequent I/O DLPAR operations will fail. A problem was fixed for Platform Error Logs (PELS) not being logged and shown by the OS if they have an Error Severity code of "critical error". The trigger is the reporting by a system firmware subsystem of an error log that has set an Event/Error Severity in the 'UH' section of the log to a value in the range, 0x50 to 0x5F. The following error logs are affected: B200308C ==> PHYP ==> A problem occurred during the IPL of a partition. The adapter type cannot be determined. Ensure that a valid I/O Load Source is tagged. B700F104 ==> PHYP ==> Operating System error. Platform Licensed Internal Code terminated a partition. B7006990 ==> PHYP ==> Service processor failure B2005149 ==> PHYP ==> A problem occurred during the IPL of a partition. B700F10B ==> PHYP ==> A resource has been disabled due to hardware problems A7001150 ==> PHYP ==> System log entry only, no service action required. No action needed unless a serviceable event was logged. B7005442 ==> PHYP ==> A parity error was detected in the hardware Segment Lookaside Buffer (SLB). B200541A ==> PHYP ==> A problem occurred during a partition Firmware Assisted Dump B7001160 ==> PHYP ==> Service processor failure. B7005121 ==> PHYP ==> Platform LIC failure BC8A0604 ==> Hostboot ==> A problem occurred during the IPL of the system. BC8A1E07 ==> Hostboot ==> Secure Boot firmware validation failed. Note that these error logs are still reported to the service processor and HMC properly. This issue does not affect the Call Home action for the error logs. A problem was fixed for Live Partition Mobility (LPM) migrations from non-trusted POWER9 systems to POWER10 systems. The LPM migration failure occurs every time a LPM migration is attempted from a non-trusted system source to FW1010 and later. For POWER9 systems, non-trusted is the default setting. The messages shown on the HMC for the failure are the following: HSCL365C The partition migration has been stopped because platform firmware detected an error (041800AC). HSCL365D The partition migration has been stopped because target MSP detected an error (05000127). HSCL365D The partition migration has been stopped because target MSP detected an error (05000127). A workaround for the problem is to enable the trusted system key on the POWER9 FW940/FW950 source system which can be done using an intricate procedure. Please contact IBM Support for help with this workaround. A problem was fixed for a missing error log SRC for an SR-IOV adapter in Shared mode that fails during the IPL because of adapter failure or because the system has insufficient memory for SR-IOV Shared mode for the adapter. The error log SRC added is B7005308, indicating a serviceable event and providing the adapter and error information. A problem was fixed for a Live Partition Mobility (LPM) migration failure from a POWER9 FW950 source to a POWER10 FW1010 target. This will fail on every attempt with the following message on the HMC: "HSCLA2CF The partition migration has been stopped unexpectedly. Perform a migration recovery for this partition, if necessary." A problem was fixed for error logs not being sent over to HMC when disconnecting/reconnecting power cords that caused a flood on informational SRCs of B1818A37 and B18187D7. After the flood of error logs, the reporting of error logs to the HMC stopped, which also prevented Call Home from working. To recover from the error, the service processor can have a reset/reload done using ASMI. System firmware changes that affect certain systems For a system with a partition running AIX 7.3, a problem was fixed for running Live Update or Live Partition Mobility (LPM). AIX 7.3 supports Virtual Persistent Memory (PMEM) but it cannot be used with these operations, but the problem was making it appear that PMEM was configured when it was not. The Live Update and LPM operations always fail when attempted on AIX 7.3. Here is the failure output from a Live Update Preview: "1430-296 FAILED: not all devices are virtual devices. nvmem0 1430-129 FAILED: The following loaded kernel extensions are not known to be safe for Live Update: nvmemdd ... 1430-218 The live update preview failed. 0503-125 geninstall: The lvupdate call failed. Please see /var/adm/ras/liveupdate/logs/lvupdlog for details." On systems with only Integrated Facility for Linux ( IFL) processors and AIX partitions, a problem was fixed for performance issues for IFL VMs (Linux and VIOS). This problem occurs if AIX partitions are active on a system with IPL only cores. As a workaround, AIX partitions should not be activated on an IFL only system. With the fix, the activation of AIX partitions is blocked on an IFL only system. If this fix is installed concurrently with AIX partitions running, these partitions will be allowed to continue to run until they are powered off. Once powered off, the AIX partitions will not be allowed to be activated again on the IFL-only system. For systems with an AIX partition and Platform Keystore (PKS) enabled for the partition, a problem was fixed for AIX not being able to access the PKS during a Main Store Dump (MSD) IPL. This may prevent the dump from completing. This will happen for every MSD IPL when the partition PKS is enabled and in use by the AIX OS. For a system with an AIX or Linux partition, a problem was fixed for a boot hang in RTAS for a partition that owns I/O which uses MSI-X interrupts. A BA180007 SRC may be logged prior to the hang. The frequency of this RTAS hang error is very low.
VM950_075_045 / FW950.11 2021/06/08	Impact: Availability Severity: HIPER System firmware changes that affect all systems HIPER/Pervasive: A problem was fixed for a checkstop due to an internal Bus transport parity error or a data timeout on the Bus. This is a very rare problem that requires a particular SMP transport link traffic pattern and timing. Both the traffic pattern and timing are very difficult to achieve with customer application workloads. The fix will have no measurable effect on most customer workloads although highly intensive OLAP-like workloads may see up to 2.5% impact.
VM950_072_045 / FW950.10 2021/04/28	Impact: Availability Severity: SPE New Features and Functions Support added to Redfish to provide a command to set the ASMI user passwords using a new AccountService schema. Using this service, the ASMI admin, HMC, and general user passwords can be changed. PowerVM support for the Platform KeyStore (PKS) for partitions has removed the FW950.00 restriction where the total amount of PKS for the system that could be configured was limited to 1 MB across all the partitions. This restriction has been removed for FW950.10. Support was added for Samsung DIMMs with part number 01GY853. If these DIMMs are installed in a system with older firmware than FW950.10, the DIMMs will fail and be guarded with SRC BC8A090F logged with HwpReturnCode " RC_CEN_MBVPD_TERM_DATA_UNSUPPORTED_VPD_ENCODE". Support was added for a new service processor command that can be used to 'lock' the power management mode, such that the mode can not be changed except by doing a factory reset. Support for new mainstream 931 GB, 1.86 TB, 3.72 TB, and 7.44 TB capacity SSDs. A 2.5-inch serial-attached SCSI (SAS) SSD is mounted on an SFF-3 carrier or tray for a POWER9 system unit or mounted on an SFF-2 for placement in an expansion drawer, such as the EXP24SX drawer, when attached to a POWER9 server. The drive is formatted to use 4224-byte (4k) sectors and does not support the 4k JBOD 4096-byte sector. Firmware level FW950.10 or later is required for these drives. The following are the feature codes and CCINs for the new drives: #ESKJ/#ESKK with CCIN 5B2B/5B29 – 931 GB Mainstream SAS 4k SFF-3/SFF-2 SSD for AIX/Linux #ESKL/#ESKM with CCIN 5B2B/5B29 - 931GB Mainstream SAS 4k SFF-3/SFF-2 SSD for IBM i #ESKN/#ESKP with CCIN 5B20/5B21- 1.86TB Mainstream SAS 4k SFF-3/SFF-2 SSD for AIX/Linux #ESKQ/#ESKR with CCIN 5B20/5B21- 1.86TB Mainstream SAS 4k SFF-3/SFF-2 SSD for IBM i #ESKS/#ESKT with CCIN 5B2C/5B2D - 3.72TB Mainstream SAS 4k SFF-3/SFF-2 SSD for AIX/Linux #ESKU/#ESKV with CCIN 5B2C/5B2D - 3.72TB Mainstream SAS 4k SFF-3/SFF-2 SSD for IBM i #ESKW/#ESKX with CCIN 5B2E/5B2F- 7.44TB Mainstream SAS 4k SFF-3/SFF-2 SSD for AIX/Linux #ESKY/#ESKZ with CCIN 5B2E/5B2F -7.44TB Mainstream SAS 4k SFF-3/SFF-2 SSD for IBM i Support for new enterprise SSDs refresh the previously available 387 GB, 775 GB, and 1550 GB capacity points for POWER9 servers. These are 400 GB, 800 GB, and 1600 GB SSDs that are always formatted either to 4224 (4k) byte sectors or to 528 (5xx) byte sectors for additional protection, resulting in 387 GB, 775 GB, and 1550 GB capacities. The 4096-byte sector, the 512-byte sector, and JBOD are not supported. Firmware level FW950.10 or later is required for these drives. The following are the feature codes and CCINs for the new drives: #ESK0/#ESK1 with CCIN 5B19/ 5B16 - 387GB Enterprise SAS 5xx SFF-3/SFF-2 SSD for AIX/Linux #ESK2/#ESK3 with CCIN 5B1A/5B17 - 775GB Enterprise SAS 5xx SFF-3/SFF-2 SSD for AIX/Linux #ESK6/#ESK8 with CCIN 5B13/5B10.- 387GB Enterprise SAS 4k SFF-3/SFF-2 SSD for AIX/Linux #ESK7/#ESK9 with CCIN 5B13/5B10- 387GB Enterprise SAS 4k SFF-3/SFF-2 SSD for IBM i #ESKA/#ESKC with CCIN 5B14/5B11- 775GB Enterprise SAS 4k SFF-3/SFF-2 SSD for AIX/Linux #ESKB/#ESKD with CCIN 5B14/5B11- 775GB Enterprise SAS 4k SFF-3/SFF-2 SSD for IBM i #ESKE/#ESKG with CCIN 5B15/5B12- 1.55TB Enterprise SAS 4k SFF-3/SFF-2 SSD for AIX/Linux #ESKF/#ESKH with CCIN 5B15/5B12- 1.55TB Enterprise SAS 4k SFF-3/SFF-2 SSD for IBM i Support for new PCIe 4.0 x8 dual-port 32 Gb optical Fibre Channel (FC) short form adapter based on the Marvell QLE2772 PCIe host bus adapter (6.6 inches x 2.731 inches). The adapter provides two ports of 32 Gb FC capability using SR optics. Each port can provide up to 6,400 MBps bandwidth. This adapter has feature codes #EN1J/#EN1K with CCIN 579C. Firmware level FW950.10 or later is required for this adapter. Support for new PCIe 3.0 16 Gb quad-port optical Fibre Channel (FC)l x8 short form adapter based on the Marvell QLE2694L PCIe host bus adapter (6.6 inches x 2.371 inches). The adapter provides four ports of 16 Gb FC capability using SR optics. Each port can provide up to 3,200 MBps bandwidth. This adapter has feature codes #EN1E/#EN1F with CCIN 579A. Firmware level FW950.10 or later is required for this adapter. Added support in ASMI for a new panel to do Self -Boot Engine (SBE) SEEPROM validation. This validation can only be run at the service processor standby state. If the validation detects a problem, IBM recommends the system not be used and that IBM service be called. System firmware changes that affect all systems DEFERRED: A problem was fixed for a rare Voltage Regulator Module (VRM) power fault with an SRC 11002700 logged for the VRM failure followed by an SRC 11002610 system crash. The trigger for this problem is intense workloads that cause what appear to be input over-current conditions. A re-IPL of the system is needed to activate this fix. A problem was fixed for certain SR-IOV adapters that have a rare, intermittent error with B400FF02 and B400FF04 logged, causing a reboot of the VF. The error is handled and recovered without any user intervention needed. The SR-IOV adapters affected have the following Feature Codes and CCINs: #EC2R/#EC2S with CCIN 58FA; #EC2T/#EC2U with CCIN 58FB; #EC3L/#EC3M with CCIN 2CE; and #EC66/#EC67 with CCIN 2CF3. A problem was fixed for initiating a Remote Restart from a PowerVC/NovaLink source system to a remote target. This happens whenever the source system is running FW950.00. The error would look like this from PowerVC (system name, release level would be specific to the environment): "Virtual machine RR-5 could not be remote restarted to Ubu_AX_9.114.255.10. Error message: PowerVM API failed to complete for instance=RR-5-71f5c2cf-0000004e.HTTP error 500 for method PUT on path /rest/api/uom/ManagedSystem/598c1be4-cb4c-3957-917d-327b764d8ac1/LogicalPartition: Internal Server Error -- [PVME01040100-0004] Internal error PVME01038003 occurred while trying to perform this command." A problem was fixed for a B1502616 SRC logged after a system is powered off. This rare error, "A critical error occurred on the thermal/power management device (TPMD); it is being disabled. " is not a real problem but occurred because the Power Management (PM) complex was being reset during the power off. No recovery is needed as the next IPL of the system is successful. A problem was fixed for the error handling of a system with an unsupported memory configuration that exceeds available memory power. Without the fix, the IPL of the system is attempted and fails with a segmentation fault with SRCs B1818611 and B181460B logged that do not call out the incorrect DIMMs. A problem was fixed for an error in the HMC GUI (Error launching task) when clicking on "Hardware Virtualized IO". This error is infrequent and is triggered by an optical cable to a PCIe3 #EMX0 expansion drawer that is failed or unplugged. With the fix, the HMC can show the working I/O adapters. A problem was fixed for performance degradation of a partition due to task dispatching delays. This may happen when a processor chip has all of its shared processors removed and converted to dedicated processors. This could be driven by DLPAR remove of processors or Dynamic Platform Optimization (DPO). A problem was fixed for an unrecoverable UE SRC B181BE12 being logged if a service processor message acknowledgment is sent to a Hostboot instance that has already shutdown. This is a harmless error log and it should have been marked as an informational log. A problem was fixed for Time of Day (TOD) being lost for the real-time clock (RTC) with an SRC B15A3303 logged when the service processor boots or resets. This is a very rare problem that involves a timing problem in the service processor kernel. If the server is running when the error occurs, there will be an SRC B15A3303 logged, and the time of day on the service processor will be incorrect for up to six hours until the hypervisor synchronizes its (valid) time with the service processor. If the server is not running when the error occurs, there will be an SRC B15A3303 logged, and If the server is subsequently IPLed without setting the date and time in ASMI to fix it, the IPL will abort with an SRC B7881201 which indicates to the system operator that the date and time are invalid. A problem was fixed for the Systems Management Services ( SMS) menu "Device IO Information" option being incorrect when displaying the capacity for an NVMe or Fibre Channel (FC) NVMe disk. This problem occurs every time the data is displayed. A problem was fixed for intermittent failures for a reset of a Virtual Function (VF) for SR-IOV adapters during Enhanced Error Handling (EEH) error recovery. This is triggered by EEH events at a VF level only, not at the adapter level. The error recovery fails if a data packet is received by the VF while the EEH recovery is in progress. A VF that has failed can be recovered by a partition reboot or a DLPAR remove and add of the VF. A problem was fixed for a logical partition activation error that can occur when trying to activate a partition when the adapter hardware for an SR-IOV logical port has been physically removed or is unavailable due to a hardware issue. This message is reported on the HMC for the activation failure: "Error: HSCL12B5 The operation to remove SR-IOV logical port <number> failed because of the following error: HSCL1552 The firmware operation failed with extended error" where the logical port number will vary. This is an infrequent problem that is only an issue if the adapter hardware has been removed or another problem makes it unavailable. The workaround for this problem is to physically add the hardware back in or correct the hardware issue. If that cannot be done, create an alternate profile for the logical partition without the SR-IOV logical port and use that until the hardware issue is resolved. A problem was fixed for incomplete periodic data gathered by IBM Service for #EMXO PCIe expansion drawer predictive error analysis. The service data is missing the PLX (PCIe switch) data that is needed for the debug of certain errors. A problem was fixed for a partition hang in shutdown with SRC B200F00F logged. The trigger for the problem is an asynchronous NX accelerator job (such as gzip or NX842 compression) in the partition that fails to clean up successfully. This is intermittent and does not cause a problem until a shutdown of the partition is attempted. The hung partition can be recovered by performing an LPAR dump on the hung partition. When the dump has been completed, the partition will be properly shut down and can then be restarted without any errors. A problem was fixed for a rare failure for an SPCN I2C command sent to a PCIe I/O expansion drawer that can occur when service data is manually collected with hypervisor macros "xmsvc -dumpCCData and xmsvc -logCCErrBuffer". If the hypervisor macro "xmsvc "is run to gather service data and a CMC Alert occurs at the same time that requires an SPCN command to clear the alert, then the I2C commands may be improperly serialized, resulting in an SPCN I2C command failure. To prevent this problem, avoid using xmsvc -dumpCCData and xmsvc -logCCErrBuffer to collect service data until this fix is applied. The following problems were fixed for certain SR-IOV adapters: 1) An error was fixed that occurs during a VNIC failover where the VNIC backing device has a physical port down or read port errors with an SRC B400FF02 logged. 2) A problem was fixed for adding a new logical port that has a PVID assigned that is causing traffic on that VLAN to be dropped by other interfaces on the same physical port which uses OS VLAN tagging for that same VLAN ID. This problem occurs each time a logical port with a non-zero PVID that is the same as an existing VLAN is dynamically added to a partition or is activated as part of a partition activation, the traffic flow stops for other partitions with OS configured VLAN devices with the same VLAN ID. This problem can be recovered by configuring an IP address on the logical port with the non-zero PVID and initiating traffic flow on this logical port. This problem can be avoided by not configuring logical ports with a PVID if other logical ports on the same physical port are configured with OS VLAN devices. This fix updates the adapter firmware to 11.4.415.37 for the following Feature Codes and CCINs: #EN15/#EN16 with CCIN 2CE3, #EN17/#EN18 with CCIN 2CE4, #EN0H/#EN0J with CCIN 2B93, #EN0M/#EN0N with CCIN 2CC0, and #EN0K/#EN0L with CCIN 2CC1. The SR-IOV adapter firmware level update for the shared-mode adapters happens under user control to prevent unexpected temporary outages on the adapters. A system reboot will update all SR-IOV shared-mode adapters with the new firmware level. In addition, when an adapter is first set to SR-IOV shared mode, the adapter firmware is updated to the latest level available with the system firmware (and it is also updated automatically during maintenance operations, such as when the adapter is stopped or replaced). And lastly, selective manual updates of the SR-IOV adapters can be performed using the Hardware Management Console (HMC). To selectively update the adapter firmware, follow the steps given at the IBM Knowledge Center for using HMC to make the updates: https://www.ibm.com/support/knowledgecenter/en/POWER9/p9efd/p9efd_updating_sriov_firmware.htm. Note: Adapters that are capable of running in SR-IOV mode, but are currently running in dedicated mode and assigned to a partition, can be updated concurrently either by the OS that owns the adapter or the managing HMC (if OS is AIX or VIOS and RMC is running). A problem was fixed for a system hang or terminate with SRC B700F105 logged during a Dynamic Platform Optimization (DPO) that is running with a partition in a failed state but that is not shut down. If DPO attempts to relocate a dedicated processor from the failed partition, the problem may occur. This problem can be avoided by doing a shutdown of any failed partitions before initiating DPO. A problem was fixed for a system crash with HMC message HSCL025D and SRC B700F103 logged on a Live Partition Mobility (LPM) inactive migration attempt that fails. The trigger for this problem is inactive migration that fails a compatibility check between the source and target systems. A problem was fixed for time-out issues in Power Enterprise Pools 1.0 (PEP 1.0) that can affect performance by having non-optimal assignments of processors and memory to the server logical partitions in the pool. For this problem to happen, the server must be in a PEP 1.0 pool and the HMC must take longer than 2 minutes to provide the PowerVM hypervisor with the information about pool resources owned by this server. The problem can be avoided by running the HMC optmem command before activating the partitions. A problem was fixed for certain SR-IOV adapters not being able to create the maximum number of VLANs that are supported for a physical port. There were insufficient memory pages allocated for the physical functions for this adapter type. The SR-IOV adapters affected have the following Feature Codes and CCINs: #EC66/#EC67 with CCIN 2CF3. A problem was fixed for certain SR-IOV adapters that can have B400FF02 SRCs logged with LPA dumps during a vNIC remove operation. The adapters can have issues with a deadlock in managing memory pages. In most cases, the operations should recover and complete. This fix updates the adapter firmware to XX.29.2003 for the following Feature Codes and CCINs: #EC2R/EC2S with CCIN 58FA; #EC2T/EC2U with CCIN 58FB; #EC3L/EC3M with CCIN 2CE; and #EC66/EC67 with CCIN 2CF3. The SR-IOV adapter firmware level update for the shared-mode adapters happens under user control to prevent unexpected temporary outages on the adapters. A system reboot will update all SR-IOV shared-mode adapters with the new firmware level. In addition, when an adapter is first set to SR-IOV shared mode, the adapter firmware is updated to the latest level available with the system firmware (and it is also updated automatically during maintenance operations, such as when the adapter is stopped or replaced). And lastly, selective manual updates of the SR-IOV adapters can be performed using the Hardware Management Console (HMC). To selectively update the adapter firmware, follow the steps given at the IBM Knowledge Center for using HMC to make the updates: https://www.ibm.com/support/knowledgecenter/en/POWER9/p9efd/p9efd_updating_sriov_firmware.htm. Note: Adapters that are capable of running in SR-IOV mode, but are currently running in dedicated mode and assigned to a partition, can be updated concurrently either by the OS that owns the adapter or the managing HMC (if OS is AIX or VIOS and RMC is running).
VM950_045_045 / FW950.00 2020/11/23	Impact: New Severity: New GA Level with key features included listed below All features and fixes from the FW930.30 and FW940.20 service packs (and below) are included in this release. New Features and Functions Host firmware support for anti-rollback protection. This feature implements firmware anti-rollback protection as described in NIST SP 800-147B "BIOS Protection Guidelines for Servers". Firmware is signed with a "secure version". Support added for a new menu in ASMI called "Host firmware security policy" to update this secure version level at the processor hardware. Using this menu, the system administrator can enable the "Host firmware secure version lock-in" policy, which will cause the host firmware to update the "minimum secure version" to match the currently running firmware. Use the "Firmware Update Policy" menu in ASMI to show the current "minimum secure version" in the processor hardware along with the "Minimum code level supported" information. The secure boot verification process will block installing any firmware secure version that is less than the "minimum secure version" maintained in the processor hardware. Prior to enabling the "lock-in" policy, it is recommended to accept the current firmware level. WARNING: Once lock-in is enabled and the system is booted, the "minimum secure version" is updated and there is no way to roll it back to allow installing firmware releases with a lesser secure version. Note: If upgrading from FW930.30 or FW940.20, this feature is already applied. This server firmware level includes the SR-IOV adapter firmware level 11.4.415.33 for the following Feature Codes and CCINs: #EN15/EN16 with CCIN 2CE3, #EN17/EN18 with CCIN 2CE4, #EN0H/EN0J with CCIN 2B93, #EN0M/EN0N with CCIN 2CC0, and #EN0K/EN0L with CCIN 2CC1. This server firmware includes the SR-IOV adapter firmware level 1x.25.6100 for the following Feature Codes and CCINs: #EC2R/EC2S with CCIN 58FA; #EC2T/EC2U with CCIN 58FB; #EC3L/EC3M with CCIN 2CEC; and #EC66/EC67 with CCIN 2CF3. Support for PCIe4 x8 1.6/3.2/6.4 TB NVMe Adapters that are Peripheral Component Interconnect Express (PCIe) Generation 4 (Gen4) x8 adapters with the following feature codes and CCINs: #EC7A/#EC7B with CCIN 594A ; #EC7C/#EC7D with CCIN 594B; and #EC7E/#EC7F with CCIN 594C for AIX/Linux. #EC7J/#EC7K with CCIN 594A ; #EC7L/#EC7M with CCIN 594B; and #EC7N/#EC7P with CCIN 594C for IBM i. PowerVM boot support for AIX for NVMe over Fabrics (NVMf) for 32Gb Fibre Channel. Natively attached adapters are supported with the following feature codes and CCINs: #EN1A/#EN1B with CCIN 578F. Support added for a PCIe2 2-Port USB 3.0 adapter with the following feature codes and CCIN: #EC6J/#EC6K with CCIN 590F. Support added for dedicated processor partitions in IBM Power Enterprise Pools (PEP) 2.0. Previously, systems added to PEP 2.0 needed to have all partitions as shared processor partitions. Support added for SR-IOV Hybrid Network Virtualization (HNV) for Linux. This capability allows a Linux partition to take advantage of the efficiency and performance benefits of SR-IOV logical ports and participate in mobility operations such as active and inactive Live Partition Mobility (LPM) and Simplified Remote Restart (SRR). HNV is enabled by selecting a new Migratable option when an SR-IOV logical port is configured. The Migratable option is used to create a backup virtual device. The backup virtual device must be a Virtual Ethernet adapter (virtual Network Interface Controller (vNIC) adapter not supported as a backup device). In addition to this firmware, HNV support in a production environment requires HMC 9.1.941.0 or later, RHEL 8., SLES 15, and VIOS 3.1.1.20 or later. Enhanced Dynamic DMA Window (DDW) for I/O adapter slots to enable the OS to use 64KB TCEs. The OS supported is Linux RHEL 8.3 LE. PowerVM support for the Platform KeyStore (PKS) for partitions. PowerVM has added new h-call interfaces allowing the partition to interact with the Platform KeyStore that is maintained by PowerVM. This keystore can be used by the partition to store items requiring confidentiality or integrity like encryption keys or certificates. Note: The total amount of PKS for the system is limited to 1 MB across all the partitions for FW950.00. Support for 64 GB 16Gbit DDR4 system memory running at 2666 Mhz with feature code #EM9B and part number 78P6815. Support for 128 GB 16Gbit DDR4 system memory running at 2666 Mhz with feature code #EM9C and part number 78P6925. System firmware changes that affect all systems HIPER/Pervasive: A problem was fixed to be able to detect a failed PFET sensing circuit in a core at runtime, and prevent a system fail with an incomplete state when a core fails to wake up. The failed core is detected on the subsequent IPL. With the fix. a core is called out with the PFET failure with SRC BC13090F and hardware description "CME detected malfunctioning of PFET headers." to isolate the error better with a correct callout. A problem was fixed for a VIOS, AIX, or Linux partition hang during an activation at SRC CA000040. This will occur on a system that has been running more than 814 days when the boot of the partition is attempted if the partitions are in POWER9_base or POWER9 processor compatibility mode. A workaround to this problem is to re-IPL the system or to change the failing partition to POWER8 compatibility mode. Note: If upgrading from FW930.30, this fix is already applied. A problem was fixed for a security vulnerability for the Self Boot Engine (SBE). The SBE can be compromised from the service processor to allow injection of malicious code. An attacker that gains root access to the service processor could compromise the integrity of the host firmware and bypass the host firmware signature verification process. This compromised state can not be detected through TPM attestation. This is Common Vulnerabilities and Exposures issue number CVE-2021-20487.

VM940

VM940 For Impact, Severity and other Firmware definitions, Please refer to the below 'Glossary of firmware terms' url: https://www.ibm.com/support/pages/node/6555136
VM940_098_027 / FW940.60 03/21/22	Impact: Availability Severity: SPE System firmware changes that affect all systems A problem was fixed for a possible denial of service on the service processor for ASMI and Redfish users. This problem is very rare and could be triggered by a large number of invalid login attempts to Redfish over a short period of time. A problem was fixed for system fans not increasing in speed when partitions are booted with PCIe hot adapters that require additional cooling. This fan speed problem can also occur if there is a change in the power mode that requires a higher minimum speed for the fans of the system than is currently active. Fans running at a slower speed than required for proper system cooling could lead to over-temperature conditions for the system. A problem was fixed for correct ASMI passwords being rejected when accessing ASMI using an ASCII terminal with a serial connection to the server. This problem always occurs for systems at firmware level FW940.40 and later. A problem was fixed for a partition with an SR-IOV logical port (VF) having a delay in the start of the partition. If the partition boot device is an SR-IOV logical port network device, this issue may result in the partition failing it boot with SRCs BA180010 and BA155102 logged and then stuck on progress code SRC 2E49 for an AIX partition. This problem is infrequent because it requires multiple error conditions at the same time on the SR-IOV adapter. To trigger this problem, multiple SR-IOV logical ports for the same adapter must encounter EEH conditions at roughly the same time such that a new logical port EEH condition is occurring while a previous EEH condition's handling is almost complete but not notified to the hypervisor yet. To recover from this problem, reboot the partition. A problem was fixed for a system hypervisor hang and an Incomplete state on the HMC after a logical partition (LPAR) is deleted that has an active virtual session from another LPAR. This problem happens every time an LPAR is deleted with an active virtual session. This is a rare problem because virtual sessions from an HMC (a more typical case) prevent an LPAR deletion until the virtual session is closed, but virtual sessions originating from another LPAR do not have the same check. A problem was fixed for a secondary fault after a partition creation error that could result in a Terminate Immediate (TI) of the system with an SRC B700F103 logged. The failed creation of partitions can be explicit or implicit that might trigger the secondary fault. One example of an implicit partition create is the ghost partition created for a Live Partition Mobility (LPM) migration. This type of partition can fail to create when there is insufficient memory available for the hardware page table (HPT) for the new partition. A problem was fixed for certain SR-IOV adapters that occurs during a VNIC failover where the VNIC backing device has a physical port down due to an adapter internal error with an SRC B400FF02 logged. This is an improved version of the fix delivered in earlier service pack FW940.40 for adapter firmware level 11.4.415.37 and it significantly reduces the frequency of the error being fixed. This fix updates the adapter firmware to 11.4.415.41 for the following Feature Codes and CCINs: #EN15/#EN16 with CCIN 2CE3, #EN17/#EN18 with CCIN 2CE4, #EN0H/#EN0J with CCIN 2B93, #EN0M/#EN0N with CCIN 2CC0, and #EN0K/#EN0L with CCIN 2CC1. Update instructions: https://www.ibm.com/support/knowledgecenter/en/POWER9/p9efd/p9efd_updating_sriov_firmware.htm. System firmware changes that affect certain systems For a system with an AIX or Linux partition, a problem was fixed a partition start failure for AIX or Linux with SRC BA54504D logged. This problem occurs if the partition is an MDC default partition with virtual Trusted Platform Module (vTPM) enabled. As a circumvention, power off the system and disable vTPM using the HMC GUI to change the default partition property for Virtualized Trusted Platform Module (VTPM) to off.
VM940_095_027 / FW940.50 11/17/21	Impact: Availability Severity: SPE System firmware changes that affect all systems A problem was fixed for an incorrect "Power Good fault" SRC logged for an #EMX0 PCIe3 expansion drawer on the lower CXP cable of B7006A85 (AOCABLE, PCICARD). The correct SRC is B7006A86 (PCICARD, AOCABLE). A problem was fixed for a missing error log SRC for an SR-IOV adapter in Shared mode that fails during the IPL because of adapter failure or because the system has insufficient memory for SR-IOV Shared mode for the adapter. The error log SRC added is B7005308, indicating a serviceable event and providing the adapter and error information. A problem was fixed for a longer boot time for a shared processor partition on the first boot after the processor chip 0 has been guarded. The partition boot would stall at SRC C20012FF but eventually complete. This rare problem is triggered by the loss of all cores in processor chip 0. On subsequent partition boots after the slow problem boot, the boot speeds return to normal. A problem was fixed for a Live Partition Mobility (LPM) hang during LPM validation on the target system. This is a rare system problem triggered during a LPM migration that causes LPM attempts to fail as well as other functionality such as configuration changes and partition shutdowns. To recover from this problem to be able to LPM and other operations such as configuration changes and shutting down partitions, the system must be re-IPLed. A problem was fixed for the system powering off after a hardware discovery IPL. This will happen if a hardware discovery IPL is initiated while the system is set to "Power off when last partition powers off". The system will power off when the Hardware Discovery Information (IOR) partition that does hardware discovery powers off. As a workaround, one should not use the "Power off when last partition powers off" setting when doing the hardware discovery IPL. Alternatively, one can just do a normal IPL after the system powers off, and then continue as normal. A problem was fixed for the HMC Repair and Verify (R&V) procedure failing with "Unable to isolate the resource" during concurrent maintenance of the #EMX0 Cable Card. This could lead one to take a disruptive action in order to do the repair. This should occur infrequently and only with cases where a physical hardware failure has occurred which prevents access to the PCIe reset line (PERST) but allows access to the slot power controls. As a workaround, pulling both cables from the Cable Card to the #EMX0 expansion drawer will result in a completely failed state that can be handled by bringing up the "PCIe Hardware Topology" screen from either ASMI or the HMC. Then retry the R&V operation to recover the Cable Card. A problem was fixed to prevent a flood of informational PCIe Host Bridge (PHB) error logs with SRC B7006A74 that cause a wrap of internal flight recorders and loss of data needed for problem debug. This flood can be triggered by bad cables or other issues that cause frequent informational error logs. With the fix, thresholding has been added for informational PHB correctable errors at 10 in 24 hours before a Predictive Error is logged. A change was made for certain SR-IOV adapters to move up to the latest level of adapter firmware. No specific adapter problems were addressed at this new level. This change updates the adapter firmware to XX.30.1004 for the following Feature Codes and CCINs: #EC2R/EC2S with CCIN 58FA; #EC2T/EC2U with CCIN 58FB; #EC3L/EC3M with CCIN 2CE; and #EC66/EC67 with CCIN 2CF3. Update instructions: https://www.ibm.com/support/knowledgecenter/en/POWER9/p9efd/p9efd_updating_sriov_firmware.htm. A problem was fixed for an SR-IOV adapter in shared mode configured as Virtual Ethernet Port Aggregator (VEPA) where the SR-IOV adapter goes through EEH error recovery, causing an informational error with SRC B400FF04 and additional information text that indicates a command failed. This always happens when an adapter goes through EEH recovery and a physical port is in VEPA mode. With the fix, the informational error is not logged. Update instructions: https://www.ibm.com/support/knowledgecenter/en/POWER9/p9efd/p9efd_updating_sriov_firmware.htm. A problem was fixed for certain SR-IOV adapters that encountered a rare adapter condition, had some response delays, and logged an Unrecoverable Error with SRC B400FF02. With the fix, handling of this rare condition is accomplished without the delay and an Informational Error is logged. and the adapter initialization continues without interruption. This fix pertains to adapters with the following Feature Codes and CCINs: #EC2R/EC2S with CCIN 58FA; #EC2T/EC2U with CCIN 58FB; #EC3L/EC3M with CCIN 2CE; and #EC66/EC67 with CCIN 2CF3. Update instructions: https://www.ibm.com/support/knowledgecenter/en/POWER9/p9efd/p9efd_updating_sriov_firmware.htm. A problem was fixed for certain SR-IOV adapters in SR-IOV shared mode which may cause a network interruption and SRCs B400FF02 and B400FF04 logged. The problem occurs infrequently during normal network traffic. This fix updates the adapter firmware to 11.4.415.38 for the following Feature Codes and CCINs: #EN15/#EN16 with CCIN 2CE3, #EN17/#EN18 with CCIN 2CE4, #EN0H/#EN0J with CCIN 2B93, #EN0M/#EN0N with CCIN 2CC0, #EN0K/#EN0L with CCIN 2CC1, #EL56/#EL38 with CCIN 2B93, and #EL57/#EL3C with CCIN 2CC1. Update instructions: https://www.ibm.com/support/knowledgecenter/en/POWER9/p9efd/p9efd_updating_sriov_firmware.htm. A problem was fixed for Platform Error Logs (PELs) not being logged and shown by the OS if they have an Error Severity code of "critical error". The trigger is the reporting by a system firmware subsystem of an error log that has set an Event/Error Severity in the 'UH' section of the log to a value in the range, 0x50 to 0x5F. The following error logs are affected: B200308C ==> PHYP ==> A problem occurred during the IPL of a partition. The adapter type cannot be determined. Ensure that a valid I/O Load Source is tagged. B700F104 ==> PHYP ==> Operating System error. Platform Licensed Internal Code terminated a partition. B7006990 ==> PHYP ==> Service processor failure B2005149 ==> PHYP ==> A problem occurred during the IPL of a partition. B700F10B ==> PHYP ==> A resource has been disabled due to hardware problems A7001150 ==> PHYP ==> System log entry only, no service action required. No action needed unless a serviceable event was logged. B7005442 ==> PHYP ==> A parity error was detected in the hardware Segment Lookaside Buffer (SLB). B200541A ==> PHYP ==> A problem occurred during a partition Firmware Assisted Dump B7001160 ==> PHYP ==> Service processor failure. B7005121 ==> PHYP ==> Platform LIC failure BC8A0604 ==> Hostboot ==> A problem occurred during the IPL of the system. BC8A1E07 ==> Hostboot ==> Secure Boot firmware validation failed. Note that these error logs are still reported to the service processor and HMC properly. This issue does not affect the Call Home action for the error logs. A problem was fixed for the Device Description in a System Plan related to Crypto Coprocessors and NVMe cards that were only showing the PCI vendor and device ID of the cards. This is not enough information to verify which card is installed without looking up the PCI IDs first. With the fix, more specific/useful information is displayed and this additional information does not have any adverse impact on sysplan operations. The problem is seen every time a System Plan is created for an installed Crypto Coprocessor or NVMe card. System firmware changes that affect certain systems For a system with an AIX or Linux partition. a problem was fixed for Platform Error Logs (PELs) that are truncated to only eight bytes for error logs created by the firmware and reported to the AIX or Linux OS. These PELs may appear to be blank or missing on the OS. This rare problem is triggered by multiple error log events in the firmware occurring close together in time and each needing to be reported to the OS, causing a truncation in the reporting of the PEL. As a problem workaround, the full error logs for the truncated logs are available on the HMC or using ASMI on the service processor to view them. For a system with an AIX or Linux partition, a problem was fixed for a boot hang in RTAS for a partition that owns I/O which uses MSI-X interrupts. A BA180007 SRC may be logged prior to the hang. The frequency of this RTAS hang error is very low.
VM940_093_027 / FW940.41 09/16/21	Impact: Data Severity: HIPER System firmware changes that affect all systems HIPER: A problem was fixed which may occur on a target system following a Live Partition Mobility (LPM) migration of an AIX partition utilizing Active Memory Expansion (AME) with 64 KB page size enabled using the vmo tunable: "vmo -ro ame_mpsize_support=1". The problem may result in AIX termination, file system corruption, application segmentation faults, or undetected data corruption. Note: If you are doing an LPM migration of an AIX partition utilizing AME and 64 KB page size enabled involving a POWER8 or POWER9 system, ensure you have a Service Pack including this change for the appropriate firmware level on both the source and target systems. HIPER/Pervasive: A problem was fixed for certain SR-IOV adapters in Shared mode where multicast and broadcast packets were not properly routed out to the physical port. This may result in network issues such as ping failure or inability to establish TCP connections. This problem only affects the SR-IOV adapters with the following Feature Codes and CCINs: #EC2R/EC2S with CCIN 58FA; #EC2T/EC2U with CCIN 58FB; #EC3L/EC3M with CCIN 2CE; and #EC66/EC67 with CCIN 2CF3. This problem was introduced by a fix delivered in the FW940.40 service pack. Update instructions: https://www.ibm.com/support/knowledgecenter/en/POWER9/p9efd/p9efd_updating_sriov_firmware.htm. A problem was fixed for Live Partition Mobility (LPM) migrations from non-trusted POWER9 systems to POWER10 systems. The LPM migration failure occurs every time a LPM migration is attempted from a non-trusted system source to FW1010 and later. For POWER9 systems, non-trusted is the default setting. The messages shown on the HMC for the failure are the following: HSCL365C The partition migration has been stopped because platform firmware detected an error (041800AC). HSCL365D The partition migration has been stopped because target MSP detected an error (05000127). HSCL365D The partition migration has been stopped because target MSP detected an error (05000127). A workaround for the problem is to enable the trusted system key on the POWER9 FW940/FW950 source system which can be done using an intricate procedure. Please contact IBM Support for help with this workaround. System firmware changes that affect certain systems For a system with a partition running AIX 7.3, a problem was fixed for running Live Update or Live Partition Mobility (LPM). AIX 7.3 supports Virtual Persistent Memory (PMEM) but it cannot be used with these operations, but the problem was making it appear that PMEM was configured when it was not. The Live Update and LPM operations always fail when attempted on AIX 7.3. Here is the failure output from a Live Update Preview: "1430-296 FAILED: not all devices are virtual devices. nvmem0 1430-129 FAILED: The following loaded kernel extensions are not known to be safe for Live Update: nvmemdd ... 1430-218 The live update preview failed. 0503-125 geninstall: The lvupdate call failed. Please see /var/adm/ras/liveupdate/logs/lvupdlog for details."
VM940_087_027 / FW940.40 07/08/21	Impact: Availability Severity: SPE New features and functions Support added to Redfish to provide a command to set the ASMI user passwords using a new AccountService schema. Using this service, the ASMI admin, HMC, and general user passwords can be changed. Support was changed to disable Service Location Protocol (SLP) by default for newly shipped systems or systems that are reset to manufacturing defaults. This change has been made to reduce memory usage on the service processor by disabling a service that is not needed for normal system operations. This change can be made manually for existing customers by changing it in ASMI with the options "ASMI -> System Configuration -> Security -> External Services Management" to disable the service. System firmware changes that affect all systems A problem was fixed for the system going to a "password update required" state on the HMC when downgrading from FW940 to FW930 service packs. This problem is rare and can only happen if the passwords on the service processor are set to the factory default values. The workaround to this problem is to update the FSP user password on the HMC. A problem was fixed for Time of Day (TOD) being lost for the real-time clock (RTC) with an SRC B15A3303 logged when the service processor boots or resets. This is a very rare problem that involves a timing problem in the service processor kernel. If the server is running when the error occurs, there will be an SRC B15A3303 logged, and the time of day on the service processor will be incorrect for up to six hours until the hypervisor synchronizes its (valid) time with the service processor. If the server is not running when the error occurs, there will be an SRC B15A3303 logged, and If the server is subsequently IPLed without setting the date and time in ASMI to fix it, the IPL will abort with an SRC B7881201 which indicates to the system operator that the date and time are invalid. A problem was fixed for intermittent failures for a reset of a Virtual Function (VF) for SR-IOV adapters during Enhanced Error Handling (EEH) error recovery. This is triggered by EEH events at a VF level only, not at the adapter level. The error recovery fails if a data packet is received by the VF while the EEH recovery is in progress. A VF that has failed can be recovered by a partition reboot or a DLPAR remove and add of the VF. A problem was fixed for performance degradation of a partition due to task dispatching delays. This may happen when a processor chip has all of its shared processors removed and converted to dedicated processors. This could be driven by a DLPAR remove of processors or Dynamic Platform Optimization (DPO). A problem was fixed for a logical partition activation error that can occur when trying to activate a partition when the adapter hardware for an SR-IOV logical port has been physically removed or is unavailable due to a hardware issue. This message is reported on the HMC for the activation failure: "Error: HSCL12B5 The operation to remove SR-IOV logical port <number> failed because of the following error: HSCL1552 The firmware operation failed with extended error" where the logical port number will vary. This is an infrequent problem that is only an issue if the adapter hardware has been removed or another problem makes it unavailable. The workaround for this problem is to physically add the hardware back in or correct the hardware issue. If that cannot be done, create an alternate profile for the logical partition without the SR-IOV logical port and use that until the hardware issue is resolved. A problem was fixed for incomplete periodic data gathered by IBM Service for #EMXO PCIe expansion drawer predictive error analysis. The service data is missing the PLX (PCIe switch) data that is needed for the debug of certain errors. A problem was fixed for a rare failure for an SPCN I2C command sent to a PCIe I/O expansion drawer that can occur when service data is manually collected with hypervisor macros "xmsvc -dumpCCData and xmsvc -logCCErrBuffer". If the hypervisor macro "xmsvc "is run to gather service data and a CMC Alert occurs at the same time that requires an SPCN command to clear the alert, then the I2C commands may be improperly serialized, resulting in an SPCN I2C command failure. To prevent this problem, avoid using xmsvc -dumpCCData and xmsvc -logCCErrBuffer to collect service data until this fix is applied. A problem was fixed for a system hang or terminate with SRC B700F105 logged during a Dynamic Platform Optimization (DPO) that is running with a partition in a failed state but that is not shut down. If DPO attempts to relocate a dedicated processor from the failed partition, the problem may occur. This problem can be avoided by doing a shutdown of any failed partitions before initiating DPO. A problem was fixed for a system crash with HMC message HSCL025D and SRC B700F103 logged on a Live Partition Mobility (LPM) inactive migration attempt that fails. The trigger for this problem is inactive migration that fails a compatibility check between the source and target systems. A problem was fixed for time-out issues in Power Enterprise Pools 1.0 (PEP 1.0) that can affect performance by having non-optimal assignments of processors and memory to the server logical partitions in the pool. For this problem to happen, the server must be in a PEP 1.0 pool and the HMC must take longer than 2 minutes to provide the PowerVM hypervisor with the information about pool resources owned by this server. The problem can be avoided by running the HMC optmem command before activating the partitions. A problem was fixed for the Systems Management Services (SMS) menu " I/O Device Information" option being incorrect when displaying the capacity for an NVMe or Fibre Channel (FC) NVMe disk. This problem occurs every time the data is displayed. A problem was fixed for an infrequent SRC of B7006956 that may occur during a system power off. This SRC indicates that encrypted NVRAM locations failed to synchronize with the copy in memory during the shutdown of the hypervisor. This error can be ignored as the encrypted NVRAM information is stored in a redundant location, so the next IPL of the system is successful. A problem was fixed for a misleading SRC B7006A20 (Unsupported Hardware Configuration) that can occur for some error cases for PCIes #EMX0 expansion drawers that are connected with copper cables. For cable unplug errors, the SRC B7006A88 (Drawer TrainError) should be shown instead of the B7006A20. If a B7006A20 is logged against copper cables with the signature "Prc UnsupportedCableswithFewerChannels" and the message "NOT A 12CHANNEL CABLE", this error should instead follow the service actions for a B7006A88 SRC. A problem was fixed for certain SR-IOV adapters not being able to create the maximum number of VLANs that are supported for a physical port. There were insufficient memory pages allocated for the physical functions for this adapter type. The SR-IOV adapters affected have the following Feature Codes and CCINs: #EC66/#EC67 with CCIN 2CF3. Update instructions: https://www.ibm.com/support/knowledgecenter/en/POWER9/p9efd/p9efd_updating_sriov_firmware.htm. A problem was fixed for certain SR-IOV adapters that can have B400FF02 SRCs logged with LPA dumps during a vNIC remove operation. The adapters can have issues with a deadlock in managing memory pages. In most cases, the operations should recover and complete. This fix updates the adapter firmware to XX.29.2003 for the following Feature Codes and CCINs: #EC2R/EC2S with CCIN 58FA; #EC2T/EC2U with CCIN 58FB; #EC3L/EC3M with CCIN 2CE; and #EC66/EC67 with CCIN 2CF3. Update instructions: https://www.ibm.com/support/knowledgecenter/en/POWER9/p9efd/p9efd_updating_sriov_firmware.htm. The following problems were fixed for certain SR-IOV adapters: 1) An error was fixed that occurs during a VNIC failover where the VNIC backing device has a physical port down or read port errors with an SRC B400FF02 logged. 2) A problem was fixed for adding a new logical port that has a PVID assigned that is causing traffic on that VLAN to be dropped by other interfaces on the same physical port which uses OS VLAN tagging for that same VLAN ID. This problem occurs each time a logical port with a non-zero PVID that is the same as an existing VLAN is dynamically added to a partition or is activated as part of a partition activation, the traffic flow stops for other partitions with OS configured VLAN devices with the same VLAN ID. This problem can be recovered by configuring an IP address on the logical port with the non-zero PVID and initiating traffic flow on this logical port. This problem can be avoided by not configuring logical ports with a PVID if other logical ports on the same physical port are configured with OS VLAN devices. This fix updates the adapter firmware to 11.4.415.37 for the following Feature Codes and CCINs: #EN15/#EN16 with CCIN 2CE3, #EN17/#EN18 with CCIN 2CE4, #EN0H/#EN0J with CCIN 2B93, #EN0M/#EN0N with CCIN 2CC0, and #EN0K/#EN0L with CCIN 2CC1. Update instructions: https://www.ibm.com/support/knowledgecenter/en/POWER9/p9efd/p9efd_updating_sriov_firmware.htm. A problem was fixed for some serviceable events specific to the reporting of EEH errors not being displayed on the HMC. The sending of an associated call home event, however, was not affected. This problem is intermittent and infrequent. A problem was fixed for possible partition errors following a concurrent firmware update from FW910 or later. A precondition for this problem is that DLPAR operations of either physical or virtual I/O devices must have occurred prior to the firmware update The error can take the form of a partition crash at some point following the update. The frequency of this problem is low. If the problem occurs, the OS will likely report a DSI (Data Storage Interrupt) error. For example, AIX produces a DSI_PROC log entry. If the partition does not crash, it is also possible that some subsequent I/O DLPAR operations will fail. A problem was fixed for a missing hardware callout and guard for a processor chip failure with SRC BC8AE540 and signature "ex(n0p0c5) (L3FIR[28]) L3 LRU array parity error". A problem was fixed for a missing hardware callout and guard for a processor chip failure with Predictive Error (PE) SRC BC70E540 and signature "ex(n1p2c6) (L2FIR[19]) Rc or NCU Pb data CE error". The PE error occurs after the number of CE errors reaches a threshold of 32 errors per day. A problem was fixed for a Live Partition Mobility (LPM) migration that failed with the error "HSCL3659 The partition migration has been stopped because orchestrator detected an error" on the HMC. This problem is intermittent and rare that is triggered by the HMC being overrun with unneeded LPM message requests from the hypervisor that can cause a timeout in HMC queries that result in the LPM operation being aborted. The workaround is to retry the LPM migration which will normally succeed. A problem was fixed for a service processor mailbox ( mbox) timeout error with SRC B182953C during the IPL of systems with large memory configurations and "I/O Adapter Enlarged Capacity" enabled from ASMI. The error indicates that the hypervisor did not respond quickly enough to a message from the service processor but this may not result in an IPL failure. The problem is intermittent, so if the IPL does fail, the workaround is to retry the IPL. Problems were fixed for DLPAR operations that change the uncapped weight of a partition and DLPAR operations that switch an active partition from uncapped to capped. After changing the uncapped weight, the weight can be incorrect. When switching an active partition from uncapped to capped, the operation can fail. A problem was fixed where the Floating Point Unit Computational Test, which should be set to "staggered" by default, has been changed in some circumstances to be disabled. If you wish to re-enable this option, this fix is required. After applying this service pack, do the following steps: 1) Sign in to the Advanced System Management Interface (ASMI). 2) Select Floating Point Computational Unit under the System Configuration heading and change it from disabled to what is needed: staggered (run once per core each day) or periodic (a specified time). 3) Click "Save Settings". A problem was fixed for a system termination with SRC B700F107 following a time facility processor failure with SRC B700F10B. With the fix, the transparent replacement of the failed processor will occur for the B700F10B if there is a free core, with no impact to the system. A problem was fixed for an SR-IOV adapter in shared mode configured as Virtual Ethernet Port Aggregator (VEPA) where unmatched unicast packets were not forwarded to the promiscuous mode VF. Update instructions: https://www.ibm.com/support/knowledgecenter/en/POWER9/p9efd/p9efd_updating_sriov_firmware.htm. System firmware changes that affect certain systems On systems with only Integrated Facility for Linux ( IFL) processors and AIX partitions, a problem was fixed for performance issues for IFL VMs (Linux and VIOS). This problem occurs if AIX partitions are active on a system with IPL only cores. As a workaround, AIX partitions should not be activated on an IFL only system. With the fix, the activation of AIX partitions is blocked on an IFL only system. If this fix is installed concurrently with AIX partitions running, these partitions will be allowed to continue to run until they are powered off. Once powered off, the AIX partitions will not be allowed to be activated again on the IFL-only system.
VM940_084_027 / FW940.32 05/25/21	Impact: Availability Severity: HIPER New features and functions Support was added for Samsung DIMMs with part number 01GY853. If these DIMMs are installed in a system with older FW940 firmware than FW940.32, the DIMMs will fail and be guarded with SRC BC8A090F logged with HwpReturnCode "RC_CEN_MBVPD_TERM_DATA_UNSUPPORTED_VPD_ENCODE". System firmware changes that affect all systems HIPER/Pervasive: A problem was fixed for a checkstop due to an internal Bus transport parity error or a data timeout on the Bus. This is a very rare problem that requires a particular SMP transport link traffic pattern and timing. Both the traffic pattern and timing are very difficult to achieve with customer application workloads. The fix will have no measurable effect on most customer workloads although highly intensive OLAP-like workloads may see up to 2.5% impact.
VM940_074_027 / FW940.31 03/24/21	Impact: Availability Severity: SPE System firmware changes that affect all systems A problem was fixed for a partition hang in shutdown with SRC B200F00F logged. The trigger for the problem is an asynchronous NX accelerator job (such as gzip or NX842 compression) in the partition that fails to clean up successfully. This is intermittent and does not cause a problem until a shutdown of the partition is attempted. The hung partition can be recovered by performing an LPAR dump on the hung partition. When the dump has been completed, the partition will be properly shut down and can then be restarted without any errors.
VM940_071_027 / FW940.30 02/04/21	Impact: Availability Severity: HIPER New features and functions Support added to be able to set the NVRAM variable 'real-base' from the Restricted OF Prompt (ROFP). Prior to the introduction of ROFP, customers had the ability to set 'real-base' from the OF prompt. This capability was removed in the initial delivery of ROFP in FW940.00. One use for this capability is that, in some cases, OS images (usually Linux) need more memory to load their image for boot. The OS image is loaded in between 0x4000 'load-base' and 0x2800000 'real-base'. Added support in ASMI for a new panel to do Self -Boot Engine (SBE) SEEPROM validation. This validation can only be run at the service processor standby state. If the validation detects a problem, IBM recommends the system not be used and that IBM service be called. System firmware changes that affect all systems HIPER/Pervasive: A problem was fixed to be able to detect a failed PFET sensing circuit in a core at runtime, and prevent a system fail with an incomplete state when a core fails to wake up. The failed core is detected on the subsequent IPL. With the fix. a core is called out with the PFET failure with SRC BC13090F and hardware description "CME detected malfunctioning of PFET headers." to isolate the error better with a correct callout. DEFERRED: A problem was fixed for a rare Voltage Regulator Module (VRM) power fault with an SRC 11002700 logged for the VRM failure followed by an SRC 11002610 system crash. The trigger for this problem is intense workloads that cause what appear to be input over-current conditions. A re-IPL of the system is needed to activate this fix. A problem was fixed for a slow down in PCIe adapter performance or loss of adapter function caused by a reduction in interrupts available to service the adapter. This problem can be triggered over time by partition activations or DLPAR adds of PCIe adapters to a partition. This fix must be applied and the system re-IPLed for existing adapter performance problems to be resolved. However, the fix will prevent future issues without re-ipl if applied before the problem is observed. A problem was fixed for not logging SRCs for certain cable pulls from the #EMXO PCIe expansion drawer. With the fix, the previously undetected cable pulls are now detected and logged with SRC B7006A8B and B7006A88 errors. A problem was fixed for a system hang and HMC "Incomplete" state that may occur when a partition hangs in shutdown with SRC B200F00F logged. The trigger for the problem is an asynchronous NX accelerator job (such as gzip or NX842 compression) in the partition that fails to clean up successfully. This is intermittent and does not cause a problem until a shutdown of the partition is attempted. A problem was fixed for a VIOS, AIX, or Linux partition hang during an activation at SRC CA000040. This will occur on a system that has been running for more than 814 days when the boot of the partition is attempted if the partitions are in POWER9_base or POWER9 processor compatibility mode. A workaround to this problem is to re-IPL the system or to change the failing partition to POWER8 compatibility mode. A problem was fixed for performance tools perfpmr, tprof and pex that may not be able to collect data for the event based options. This can occur any time an OS thread becomes idle. When the processor cores are assigned to the next active process, the performance registers may be disabled. problem was fixed for a rare system hang with SRC BC70E540 logged that may occur when adding processors through licensing or the system throttle state changing (becoming throttled or unthrottled) on an Enterprise Pool system. The trigger for the problem is a very small timing window in the hardware as the processor loads are changing. A problem was for an intermittent anchor card timeout with Informational SRC B7009020 logged when reading TPM physical storage from the anchor card. There is no customer impact for this problem as long as NVRAM is accessible. A problem was fixed for the On-Chip Controller (OCC) going into safe mode (causes loss of processor performance) with SRC BC702616 logged. This problem can be triggered by the loss of a power supply (an oversubscription event). The problem can be circumvented by fixing the issue with the power supply. A problem was fixed for error handling of a rare DIMM VPD read that causes incorrect logging of SRC B1232A09, word 8 (00000000), "Error occurred when attempting to read a memory DIMM temperature". Other SRCs seen with this error may include BC23E504 and B1561314. This error results in multiple FRUs being called out such as the system planar, processor, DIMM controller, DIMMs or the memory riser card. A problem was fixed for the error handling of a system with an unsupported memory configuration that exceeds available memory power. Without the fix, the IPL of the system is attempted and fails with a segmentation fault with SRCs B1818611 and B181460B logged that do not call out the incorrect DIMMs. A problem was fixed for the Self Boot Engine (SBE) going to termination with an SRC B150BA8D logged when booting on a bad core. Once this happens, this error will persist as the bad core is not deconfigured. To recover from this error and be able to IPL, the bad core must be manually deconfigured. With the fix, the failing core is deconfigured and the SBE is reconfigured to use another core so the system is able to IPL. A problem was fixed for certain SR-IOV adapters that have a rare, intermittent error with B400FF02 and B400FF04 logged, causing a reboot of the VF. The error is handled and recovered without any user intervention needed. The SR-IOV adapters affected have the following Feature Codes and CCINs: #EC2R/#EC2S with CCIN 58FA; #EC2T/#EC2U with CCIN 58FB; #EC3L/#EC3M with CCIN 2CE; and #EC66/#EC67 with CCIN 2CF3. A problem was fixed for Live Partition Mobility (LPM) being shown as enabled at the OS when it has been disabled by the ASMI command line using the server processor command of "cfcuod -LPM OFF". LPM is actually disabled and the status shows correctly on the HMC. The status on the OS can be ignored (for example as shown by the AIX command "lparstat -L") as LPM will not be allowed to run when it is disabled. A problem was fixed for an SRC B7006A99 informational log now posted as a Predictive with a call out of the CXP cable FRU, This fix improves FRU isolation for cases where a CXP cable alert causes a B7006A99 that occurs prior to a B7006A22 or B7006A8B. Without the fix, the SRC B7006A99 is informational and the latter SRCs cause a larger hardware replacement even though the earlier event identified a probable cause for the cable FRU. System firmware changes that affect certain systems On systems with an uncapped shared processor partition in POWER9 processor compatibility mode. a problem was fixed for a system hang following Dynamic Platform Optimization (DPO), memory mirroring defragmentation, or memory guarding that happens as part of memory error recovery during normal operations of the system. On systems with a partition using Virtual Persistent Memory (vPMEM) LUNS configured with a 16 MB MPSS (Multiple Page Segment Size) mapping, a problem was fixed for temporary system hangs. The temporary hang may occur while the memory is involved in memory operations such as Dynamic Platform Optimization (DPO), memory mirroring defragmentation, or memory guarding that happens as part of memory error recovery during normal operations of the system. On systems with partitions having user mode enabled for the External Interrupt Virtualization Engine (XIVE), a problem was fixed for a possible system crash and HMC "Incomplete" state when a force DLPAR remove of a PCIe adapter occurs after a dynamic LPAR (DLPAR) operation fails for that same PCIe adapter.
VM940_061_027 / FW940.20 09/24/20	Impact: Data Severity: HIPER New features and functions DEFERRED: Host firmware support for anti-rollback protection. This feature implements firmware anti-rollback protection as described in NIST SP 800-147B "BIOS Protection Guidelines for Servers". Firmware is signed with a "secure version". Support added for a new menu in ASMI called "Host firmware security policy" to update this secure version level at the processor hardware. Using this menu, the system administrator can enable the "Host firmware secure version lock-in" policy, which will cause the host firmware to update the "minimum secure version" to match the currently running firmware. Use the "Firmware Update Policy" menu in ASMI to show the current "minimum secure version" in the processor hardware along with the "Minimum code level supported" information. The secure boot verification process will block installing any firmware secure version that is less than the "minimum secure version" maintained in the processor hardware. Prior to enabling the "lock-in" policy, it is recommended to accept the current firmware level. WARNING: Once lock-in is enabled and the system is booted, the "minimum secure version" is updated and there is no way to roll it back to allow installing firmware releases with a lesser secure version. System firmware changes that affect all systems HIPER/Pervasive: A problem was fixed for certain SR-IOV adapters for a condition that may result from frequent resets of adapter Virtual Functions (VFs), or transmission stalls and could lead to potential undetected data corruption. The following additional fixes are also included: 1) The VNIC backing device goes to a powered off state during a VNIC failover or Live Partition Mobility (LPM) migration. This failure is intermittent and very infrequent. 2) Adapter time-outs with SRC B400FF01 or B400FF02 logged. 3) Adapter time-outs related to adapter commands becoming blocked with SRC B400FF01 or B400FF02 logged 4) VF function resets occasionally not completing quickly enough resulting in SRC B400FF02 logged. This fix updates the adapter firmware to 11.4.415.33 for the following Feature Codes and CCINs: #EN15/#EN16 with CCIN 2CE3, #EN17/#EN18 with CCIN 2CE4, #EN0H/#EN0J with CCIN 2B93, #EN0M/#EN0N with CCIN 2CC0, and #EN0K/#EN0L with CCIN 2CC1. The SR-IOV adapter firmware level update for the shared-mode adapters happens under user control to prevent unexpected temporary outages on the adapters. A system reboot will update all SR-IOV shared-mode adapters with the new firmware level. In addition, when an adapter is first set to SR-IOV shared mode, the adapter firmware is updated to the latest level available with the system firmware (and it is also updated automatically during maintenance operations, such as when the adapter is stopped or replaced). And lastly, selective manual updates of the SR-IOV adapters can be performed using the Hardware Management Console (HMC). To selectively update the adapter firmware, follow the steps given at the IBM Knowledge Center for using HMC to make the updates: https://www.ibm.com/support/knowledgecenter/en/POWER9/p9efd/p9efd_updating_sriov_firmware.htm. Note: Adapters that are capable of running in SR-IOV mode, but are currently running in dedicated mode and assigned to a partition, can be updated concurrently either by the OS that owns the adapter or the managing HMC (if OS is AIX or VIOS and RMC is running). A problem was fixed for the REST/Redfish interface to change the success return code for object creation from "200" to "201". The "200" status code means that the request was received and understood and is being processed. A "201" status code indicates that a request was successful and, as a result, a resource has been created. The Redfish Ruby Client, "redfish_client" may fail a transaction if a "200" status code is returned when "201" is expected. A problem was fixed to allow quicker recovery of PCIe links for the #EMXO PCIe expansion drawer for a run-time fault with B7006A22 logged. The time for recovery attempts can exceed six minutes on rare occasions which may cause I/O adapter failures and failed nodes. With the fix, the PCIe links will recover or fail faster (in the order of seconds) so that redundancy in a cluster configuration can be used with failure detection and failover processing by other hosts, if available, in the case where the PCIe links fail to recover. A problem was fixed for a concurrent maintenance "Repair and Verify" (R&V) operation for a #EMX0 fanout module that fails with an "Unable to isolate the resource" error message. This should occur only infrequently for cases where a physical hardware failure has occurred which prevents access to slot power controls. This problem can be worked around by bringing up the "PCIe Hardware Topology" screen from either ASMI or the HMC after the hardware failure but before the concurrent repair is attempted. This will avoid the problem with the PCIe slot isolation. These steps can also be used to recover from the error to allow the R&V repair to be attempted again. A problem was fixed for a rare system hang that can occur when a page of memory is being migrated. Page migration (memory relocation) can occur for a variety of reasons, including predictive memory failure, DLPAR of memory, and normal operations related to managing the page pool resources. A problem was fixed for utilization statistics for commands such as HMC lslparutil and third-party lpar2rrd that do not accurately represent CPU utilization. The values are incorrect every time for a partition that is migrated with Live Partition Mobility (LPM). Power Enterprise Pools 2.0 is not affected by this problem. If this problem has occurred, here are three possible recovery options: 1) Re-IPL the target system of the migration. 2) Or delete and recreate the partition on the target system. 3) Or perform an inactive migration of the partition. The cycle values get zeroed in this case. A problem was fixed for running PCM on a system with SR-IOV adapters in shared mode that results in an "Incomplete" system state with certain hypervisor tasks deadlocked. This problem is rare and is triggered when using SR-IOV adapters in shared mode and gathering performance statistics with PCM (Performance Collection and Monitoring) and also having a low-level error on an adapter. The only way to recover from this condition is to re-IPL the system. A problem was fixed for an enhanced PCIe expansion drawer FPGA reset causing EEH events from the fanout module or cable cards that disrupt the PCIe lanes for the PCIe adapters. This problem affects systems with the PCIe expansion drawer enhanced fanout module (#EMXH) and the enhanced cable card ( #EJ20). The error is associated with the following SRCs being logged: B7006A8D with PRC 37414123 (XmPrc::XmCCErrMgrBearPawPrime \| XmPrc::LocalFpgaHwReset) B7006A8E with PRC 3741412A (XmPrc::XmCCErrMgrBearPawPrime \| XmPrc::RemoteFpgaHwReset) If the EEH errors occur, the OS device drivers automatically recover but with a reset of affected PCIe adapters that would cause a brief interruption in the I/O communications. A problem was fixed for the FRU callout lists for SRCs B7006A2A and B7006A2B possibly not including the FRU containing the PCIe switch as the second FRU in the callout list. The card/drive in the slot is the first callout and the FRU containing the PCIe switch should be the second FRU in the callout list. This problem occurs when the PCIe slot is on a different planar that the PCIe switch backing the slot. This impacts the NVMe backplanes (P2 with slots C1-C4) hosting the PCIe backed SSD NVMe U.2 modules that have feature codes #EC5J and #EC5K. As a workaround for B7006A2A and B7006A2B errors where the callout FRU list is processed and the problem is not resolved, consider replacing the backplane (which includes the PCIe switch) if this was omitted in the FRU callout list. A problem was fixed for a PCIe3 expansion drawer cable that has hidden error logs for a single lane failure. This happens whenever a single lane error occurs. Subsequent lane failures are not hidden and have visible error logs. Without the fix, the hidden or informational logs would need to be examined to gather more information for the failing hardware. A problem was fixed for an infrequent issue after a Live Partition Mobility (LPM) operation from a POWER9 system to a POWER8 or POWER7 system. The issue may cause unexpected OS behavior, which may include loss of interrupts, device time-outs, or delays in dispatching. Rebooting the affected target partition will resolve the problem. A problem was fixed for a partition crash or hang following a partition activation or a DLPAR add of a virtual processor. For partition activation, this issue is only possible for a system with a single partition owning all resources. For DLPAR add, the issue is extremely rare. A problem was fixed for a DLPAR remove of memory from a partition that fails if the partition contains 65535 or more LMBs. With 16MB LMBs, this error threshold is 1 TB of memory. With 256 MB LMBs, it is 16 TB of memory. A reboot of the partition after the DLPAR will remove the memory from the partition. A problem was fixed for an IPL failure with SRC BA180020 logged for an initialization failure on a PCIe adapter in a PCIe3 expansion drawer. The PCIe adapters that are intermittently failing on the PCIe probe are the PCIe2 4-port Fibre Channel Adapter with feature code #5729 and the PCIe2 4-port 1 Gb Ethernet Adapter with feature code #5899. The failure can only occur on an IPL or re-IPL and it is very infrequent. The system can be recovered with a re-IPL. A problem was fixed for a partition configured with a large number (approximately 64) of Virtual Persistent Memory (PMEM) LUNs hanging during the partition activation with a CA00E134 checkpoint SRC posted. Partitions configured with approximately 64 PMEM LUNs will likely hang and the greater the number of LUNs, the greater the possibility of the hang. The circumvention to this problem is to reduce the number of PMEM LUNs to 64 or less in order to boot successfully. The PMEM LUNs are also known as persistent memory volumes and can be managed using the HMC. For more information on this topic, refer to https://www.ibm.com/support/knowledgecenter/POWER9/p9efd/p9efd_lpar_pmem_settings.htm. A problem was fixed for non-optimal On-Chip Controller (OCC) processor frequency adjustments when system power limits or user power caps are exceeded. When a workload causes power limits or caps to be exceeded, there can be large frequency swings for the processors and a processor chip can get stuck at minimum frequency. With the fix, the OCC now waits for new power readings when changing the processor frequency and uses a master power capping frequency to keep all processors at the same frequency. As a workaround for this problem, do not set a power cap or run a workload that would exceed the system power limit. A problem was fixed for PCIe resources under a deconfigured PCIe Host Bridge (PHB) being shown on the OS host as available resources when they should be shown as deconfigured. While this fix can be applied concurrently, a re-IPL of the system is needed to correct the state of the PCIe resources if a PHB had already been deconfigured. A problem was fixed for incorrect run-time deconfiguration of a processor core with SRC B700F10B. This problem can be circumvented by a reconfiguration of the processor core but this should only be done with the guidance of IBM Support to ensure the core is good. A problem was fixed for certain SR-IOV adapter errors where a B400F011 is reported instead of a more descriptive B400FF02 or B400FF04. The LPA dump still happens which can be used to isolate to the issue. The SR-IOV adapters affected have the following Feature Codes and CCINs: #EC2R/#EC2S with CCIN 58FA; #EC2T/#EC2U with CCIN 58FB; #EC3L/#EC3M with CCIN 2CE; and #EC66/#EC67 with CCIN 2CF3. A problem was fixed for mixing modes on the ports of SR-IOV adapters that causes SRC B200A161, B200F011, B2009014 and B400F104 to be logged on boot of the failed adapter. This error happens when one port of the adapter is changed to option 1 with a second port set at either option 0 or option 2. The error can be cleared by taking the adapter out of SR-IOV shared mode. The SR-IOV adapters affected have the following Feature Codes and CCINs: #EC2R/#EC2S with CCIN 58FA; #EC2T/#EC2U with CCIN 58FB; #EC3L/#EC3M with CCIN 2CE; and #EC66/#EC67 with CCIN 2CF3. A problem was fixed for certain SR-IOV adapters with the following issues: 1) The VNIC backing device goes to a powered off state during a VNIC failover or Live Partition Mobility (LPM) migration. This failure is intermittent and very infrequent. 2) Adapter time-outs with SRC B400FF01 or B400FF02 logged. 3) Adapter time-outs related to adapter commands becoming blocked with SRC B400FF01 or B400FF02 logged. 4)VF function resets occasionally not completing quickly enough resulting in SRC B400FF02 logged. This fix updates the adapter firmware to 11.4.415.33 for the following Feature Codes and CCINs: #EN15/#EN16 with CCIN 2CE3, #EN17/#EN18 with CCIN 2CE4, #EN0H/#EN0J with CCIN 2B93, #EN0M/#EN0N with CCIN 2CC0, and #EN0K/#EN0L with CCIN 2CC1. The SR-IOV adapter firmware level update for the shared-mode adapters happens under user control to prevent unexpected temporary outages on the adapters. A system reboot will update all SR-IOV shared-mode adapters with the new firmware level. In addition, when an adapter is first set to SR-IOV shared mode, the adapter firmware is updated to the latest level available with the system firmware (and it is also updated automatically during maintenance operations, such as when the adapter is stopped or replaced). And lastly, selective manual updates of the SR-IOV adapters can be performed using the Hardware Management Console (HMC). To selectively update the adapter firmware, follow the steps given at the IBM Knowledge Center for using HMC to make the updates: https://www.ibm.com/support/knowledgecenter/en/POWER9/p9efd/p9efd_updating_sriov_firmware.htm. Note: Adapters that are capable of running in SR-IOV mode, but are currently running in dedicated mode and assigned to a partition, can be updated concurrently either by the OS that owns the adapter or the managing HMC (if OS is AIX or VIOS and RMC is running). A problem was fixed for Novalink-created virtual ethernet and vNIC adapters having incorrect SR-IOV Hybrid Network Virtualization (HNV) values. The AIX and other OS hosts may be unable to use the adapters. This happens for all virtual ethernet and vNIC adapters created by Novalink in the FW940 releases up to the FW940.10 service pack. The fix will correct the settings for new Novalink created virtual adapters, but any pre-existing virtual adapters created by Novalink in FW940 must be deleted and recreated. A problem was fixed for partitions configured to run as AIX, VIOS, or Linux partitions that also own specific Fibre Channel (FC) I/O adapters (see below) are subject to a partition crash during boot if the partition does not already have a boot list. During the initial boot of a new partition (containing 577F, 578E, 578F or 579B adapters), the boot might fail with one of the following reference codes: BA210001, BA218001, BA210003, or BA218003. This most often occurs on deployments of new partitions that are booting for the first time for either a network install or booting to the Open Firmware prompt or SMS menus for the first time. The issue requires that the partition owns one or more of the following FC adapters and that these adapters are running at microcode firmware levels older than version 11.4.415.5: - Feature codes #EN1C,/#EN1D and #EL5X/#EL5W with CCIN 578E - Feature codes #EN1A/# EN1B and #EL5U/#EL5V with CCIN 578F - Feature codes #EN0A,/#EN0B and #EL5B/#EL43 with CCIN 577F The frequency of the problem is somewhat rare because it requires the following: - Partition does not already have a default boot list - Partition configured with one of the FC adapters listed above - The FC adapters must be running a version of microcode with unsigned/unsecure adapter microcode The following work around was created for systems having this issue: https://www.ibm.com/support/pages/node/1367103. With the fix, the FC adapters are given a temporary substitute for the FCode on the adapter but not the entire microcode image. The adapter microcode is not updated. This workaround is done so the system can boot from the adapter until the adapter can be updated by the customer with the latest available microcode from IBM Fix Central. In the meantime, the FCode substitution is made from the 12.4.257.15 level of the microcode. A problem was fixed for mixing memory DIMMs with different timings (different vendors) under the same memory controller that fail with an SRC BC20E504 error and DIMMs deconfigured. This is an "MCBIST_BRODCAST_OUT_OF_SYNC" error. The loss of memory DIMMs can result in an IPL failure. This problem can happen if the memory DIMMs have a certain level of timing differences. If the timings are not compatible, the failure will occur on the IPL during the memory training. To circumvent this problem, each memory controller should have only memory DIMMs from the same vendor plugged. A problem was fixed for the SR-IOV logical port of an I/O adapter logging a B400FF02 error because of a time-out waiting on a response from the firmware. This rare error requires a very heavily loaded system. For this error, word 8 of the error log is 80090027. No user intervention is needed for this error as the logical port recovers and continues with normal operations. A problem was fixed for a security vulnerability for the Self Boot Engine (SBE). The SBE can be compromised from the service processor to allow injection of malicious code. An attacker that gains root access to the service processor could compromise the integrity of the host firmware and bypass the host firmware signature verification process. This compromised state can not be detected through TPM attestation. This is Common Vulnerabilities and Exposures issue number CVE-2021-20487. System firmware changes that affect certain systems On systems with AIX and Linux partitions, a problem was fixed for AIX and Linux partitions that crash or hang when reporting any of the following Partition Firmware RTAS ASSERT rare conditions: 1) SRC BA33xxxx errors - Memory allocation and management errors. 2) SRC BA29xxxx errors - Partition Firmware internal stack errors. 3) SRC BA00E8xx errors - Partition Firmware initialization errors during concurrent firmware update or Live Partition Mobility (LPM) operations. This problem should be very rare. If the problem does occur, a partition reboot is needed to recover from the error.
VM940_050_027 / FW940.10 05/22/20	Impact: Availability Severity: SPE New features and functions Enable periodic logging of internal component operational data for the PCIe3 expansion drawer paths. The logging of this data does not impact the normal use of the system. Support added for SR-IOV Hybrid Network Virtualization (HNV) in a production environment (no longer a Technology Preview) for AIX and IBM i. This capability allows an AIX or IBM i partition to take advantage of the efficiency and performance benefits of SR-IOV logical ports and participate in mobility operations such as active and inactive Live Partition Mobility (LPM) and Simplified Remote Restart (SRR). HNV is enabled by selecting a new Migratable option when an SR-IOV logical port is configured. The Migratable option is used to create a backup virtual device. The backup virtual device can be either a Virtual Ethernet adapter or a virtual Network Interface Controller (vNIC) adapter. In addition to this firmware HNV support in a production environment requires HMC 9.1.941.0 or later, AIX Version 7.2 with the 7200-04 Technology Level and Service Pack 7200-04-02-2015 or AIX Version 7.1 with the 7100-05 Technology Level and Service Pack 7100-05-06-2015, IBM i 7.3 TR8 or IBM i 7.4 TR2, and VIOS 3.1.1.20. System firmware changes that affect all systems DEFERRED: A problem was fixed for a processor core failure with SRCs B150BA3C and BC8A090F logged that deconfigures the entire processor for the current IPL. A re-IPL of the system will recover the lost processor with only the bad core guarded. A problem was fixed for Performance Monitor Unit (PMU) events that had the incorrect Alink address (Xlink data given instead) that could be seen in 24x7 performance reports. The Alink event data is a recent addition for FW940 and would not have been seen at the earlier firmware levels. A problem was fixed for an SR-IOV adapter hang with B400FF02/B400FF04 errors logged during firmware update or error recovery. The adapter may recover after the error log and dump, but it is possible the adapter VF will remain disabled until the partition using it is rebooted. This affects the SR-IOV adapters with the following feature codes and CCINs: #EC2R/EC2S with CCIN 58FA; #EC2T/EC2U with CCIN 58FB; #EC3L/EC3M with CCIN 2CEC; and #EC66/EC67 with CCIN 2CF3. A problem was fixed for extraneous B400FF01 and B400FF02 SRCs logged when moving cables on SR-IOV adapters. This is an infrequent error that can occur if the HMC performance monitor is running at the same time the cables are moved. These SRCs can be ignored when accompanied by cable movement. A problem was fixed for certain SR-IOV adapters that can have B400FF02 SRCs logged with LPA dumps during Live Partition Mobility (LPM) migrations or vNIC failovers. The adapters can have issues with a deadlock on error starts after many resets of the VF and errors in managing memory pages. In most cases, the operations should recover and complete. This fix updates the adapter firmware to 1X.25.6100 for the following Feature Codes and CCINs: #EC2R/EC2S with CCIN 58FA; #EC2T/EC2U with CCIN 58FB; #EC3L/EC3M with CCIN 2CE; and #EC66/EC67 with CCIN 2CF3. The SR-IOV adapter firmware level update for the shared-mode adapters happens under user control to prevent unexpected temporary outages on the adapters. A system reboot will update all SR-IOV shared-mode adapters with the new firmware level. In addition, when an adapter is first set to SR-IOV shared mode, the adapter firmware is updated to the latest level available with the system firmware (and it is also updated automatically during maintenance operations, such as when the adapter is stopped or replaced). And lastly, selective manual updates of the SR-IOV adapters can be performed using the Hardware Management Console (HMC). To selectively update the adapter firmware, follow the steps given at the IBM Knowledge Center for using HMC to make the updates: https://www.ibm.com/support/knowledgecenter/en/POWER9/p9efd/p9efd_updating_sriov_firmware.htm. Note: Adapters that are capable of running in SR-IOV mode, but are currently running in dedicated mode and assigned to a partition, can be updated concurrently either by the OS that owns the adapter or the managing HMC (if OS is AIX or VIOS and RMC is running). A problem was fixed where SR-IOV adapter VFs occasionally failed to provision successfully on the low-speed ports (1 Gbps) with SRC B400FF04 logged, or SR-IOV adapter VFs occasionally failed to provision successfully with SRC B400FF04 logged when the RoCE option is enabled. This affects the adapters with low speed ports (1 Gbps) with the following Feature Codes and CCINs: #EN0H/EN0J with CCIN 2B93, #EN0M/EN0N with CCIN 2CC0, and #EN0K/EN0L with CCIN 2CC1. And it affects the adapters with the ROCE option enabled with the following feature codes and CCINs: #EC2R/EC2S with CCIN 58FA; #EC2T/EC2U with CCIN 58FB; #EC3L/EC3M with CCIN 2CEC; and #EC66/EC67 with CCIN 2CF3. A problem was fixed for an expired trial or elastic Capacity on Demand ( CoD) memory not warning of the use of unlicensed memory if the memory is not returned. This lack of warning can occur if the trial memory has been allocated as Virtual Persistent Memory (vPMEM). A problem was fixed for a B7006A96 fanout module FPGA corruption error that can occur in unsupported PCIe3 expansion drawer(#EMX0) configurations that mix an enhanced PCIe3 fanout module (#EMXH) in the same drawer with legacy PCIe3 fanout modules (#EMXF, #EMXG, #ELMF, or #ELMG). This causes the FPGA on the enhanced #EMXH to be updated with the legacy firmware and it becomes a non-working and unusable fanout module. With the fix, the unsupported #EMX0 configurations are detected and handled gracefully without harm to the FPGA on the enhanced fanout modules. A problem was fixed for possible dispatching delays for partitions running in POWER8, POWER9_base or POWER9 processor compatibility mode. A problem was fixed for system memory not returned after create and delete of partitions, resulting in slightly less memory available after configuration changes in the systems. With the fix, an IPL of the system will recover any of the memory that was orphaned by the issue. A problem was fixed for failover support for the Mover Service Partition (MSP) where a failover to the MSP partner during an LPM could cause the migration to abort. This vulnerability is only for a very specific window in the migration process. The recovery is to restart the migration operation. A rare problem was fixed for a checkstop during an IPL that fails to isolate and guard the problem core. An SRC is logged with B1xxE5xx and an extended hex word 8 xxxxDD90. With the fix, the failing hardware is guarded and a node is possibly deconfigured to allow the subsequent IPLs of the system to be successful. A problem was fixed for a hypervisor error during system shutdown where a B7000602 SRC is logged and the system may also briefly go "Incomplete" on the HMC but the shutdown is successful. The system will power back on with no problems so the SRC can be ignored if it occurred during a shutdown. A problem was fixed for certain large I/O adapter configurations having the PCI link information truncated on the PCI-E topology display shown with ASMI and the HMC. Because of the truncation, individual adapters may be missing on the PCI-E topology screens. A problem was fixed for certain NVRAM corruptions causing a system crash with a bad pointer reference instead of expected Terminate Immediate (TI) with B7005960 logged. A problem was fixed for certain SR-IOV adapters that do not support the "Disable Logical Port" option from the HMC but the HMC was allowing the user to select this, causing incorrect operation. The invalid state of the logical port causes an "Enable Logical Port" to fail in a subsequent operation. With the fix, the HMC provides the message that the "Disable Logical Port" is not supported for the adapter. This affects the adapters with the following Feature Codes and CCINs: #EN15/EN16 with CCIN 2CE3, #EN17/EN18 with CCIN 2CE4, #EN0H/EN0J with CCIN 2B93, #EN0M/EN0N with CCIN 2CC0, and #EN0K/EN0L with CCIN 2CC1. A problem was fixed for the service processor ASMI "Factory Reset" option to disable the IPMI service as part of the factory reset. Without the fix, the IPMI operation state will be unchanged by the factory reset. A problem was fixed to remove unneeded resets of a VF for SR-IOV adapters, providing for improved performance of the startup or recovery time of the VF. This performance difference may be noticed during a Live Partition Mobility migration of a partition or during vNIC (Virtual Network Interface Controller) failovers where many resets of VFs are occurring. A problem was fixed for SR-IOV adapters having an SRC B400FF04 logged when a VF is reset. This is an infrequent issue and can occur for a Live Partition Mobility migration of a partition or during vNIC (Virtual Network Interface Controller) failovers where many resets of VFs are occurring. This error is recovered automatically with no impact on the system. A problem was fixed for initial configuration of SR-IOV adapter VFs with certain configuration settings for the following Feature Codes and CCINs: #EC2R/EC2S with CCIN 58FA; #EC2T/EC2U with CCIN 58FB; #EC3L/EC3M with CCIN 2CE; and #EC66/EC67 with CCIN 2CF3. These VFs may then fail following an adapter restart, with other VFs functioning normally. The error causes the VF to fail with an SRC B400FF04 logged. With the fix, VFs are configured correctly when created. Because the error condition may pre-exist in an incorrectly configured logical port, a concurrent update of this fix may trigger a logical port failure when the VF logical port is restarted during the firmware update. Existing VFs with the failure condition can be recovered by dynamically removing/adding the failed port and are automatically recovered during a system restart. A problem was fixed for TPM hardware failures not causing SRCs to logged with a call out if the system is configured in ASMI to not require TPM for the IPL. If this error occurs, the user would not find out about it until they needed to run with TPM on the IPL. With the fix, the error logs and notifications will occur regardless of how the TPM is configured. System firmware changes that affect certain systems For systems with deconfigured cores and using the default performance and power setting of "Dynamic Performance Mode" or "Maximum Performance Mode", a rare problem was fixed for an incorrect voltage/frequency setting for the processors during heavy workloads with high ambient temperature. This error could impact power usage, expected performance, or system availability if a processor fault occurs. This problem can be avoided by using ASMI "Power and Performance Mode Setup" to disable "All modes" when there are cores deconfigured in the system.
VM940_041_027 / FW940.02 02/18/20	Impact: Function Severity: HIPER System firmware changes that affect all systems A problem was fixed for an HMC "Incomplete" state for a system after the HMC user password is changed with ASMI on the service processor. This problem can occur if the HMC password is changed on the service processor but not also on the HMC, and a reset of the service processor happens. With the fix, the HMC will get the needed "failed authentication" error so that the user knows to update the old password on the HMC. System firmware changes that affect certain systems HIPER/Pervasive: For systems using PowerVM NovaLink to manage partitions, a problem was fixed for the hypervisor rejecting setting the system to be NovaLink managed. The following error message is given: "FATAL pvm_apd[]: Hypervisor encountered an error creating the ibmvmc device. Error number 5." This always happens in FW940.00 and FW940.01 which prevents a system from transitioning to be NovaLink managed at these firmware levels. If you were successfully running as NovaLink managed already on FW930 and upgraded to FW940, you would not experience this issue. For more information on PowerVM Novalink, refer to the IBM Knowledge Center article: https://www.ibm.com/support/knowledgecenter/POWER9/p9eig/p9eig_kickoff.htm.
VM940_034_027 / FW940.01 01/09/20	Impact: Security Severity: SPE New features and functions Support was added for improved security for the service processor password policy. For the service processor, the "admin", "hmc" and "general" password must be set on first use for newly manufactured systems and after a factory reset of the system. The IPMI interface has been changed to be disabled by default in these scenarios. The REST/Redfish interface will return an error saying the user account is expired. This policy change helps to enforce the service processor is not left in a state with a well known password. The user can change from an expired default password to a new password using the Advanced System Management Interface (ASMI). Support was added for real-time data capture for PCIe3 expansion drawer (#EMX0) cable card connection data via resource dump selector on the HMC or in ASMI on the service processor. Using the resource selector string of "xmfr -dumpccdata" will non-disruptively generate an RSCDUMP type of dump file that has the current cable card data, including data from cables and the retimers. System firmware changes that affect all systems A problem was fixed for an intermittent IPMI core dump on the service processor. This occurs only rarely when multiple IPMI sessions are starting and cleaning up at the same time. A new IPMI session can fail initialization when one of its session objects is cleaned up. The circumvention is to retry the IPMI command that failed. A problem was fixed for system hangs or incomplete states displayed by HMC(s) with SRC B182951C logged. The hang can occur during operations that require a memory relocation for any partition such as Dynamic Platform Optimization (DPO), memory mirroring defragmentation, or memory guarding that happens as part of memory error recovery during normal operations of the system. A problem was fixed for possible unexpected interrupt behavior for partitions running in POWER9 processor compatibility mode. This issue can occur during the boot of a partition running in POWER9 processor compatibility mode with an OS level that supports the External Interrupt Virtualization Engine (XIVE) exploitation mode. For more information on compatibility modes, see the following two articles in the IBM Knowledge Center: 1) Processor compatibility mode overview: https://www.ibm.com/support/knowledgecenter/POWER9/p9hc3/p9hc3_pcm.htm 2) Processor compatibility mode definitions: https://www.ibm.com/support/knowledgecenter/POWER9/p9hc3/p9hc3_pcmdefs.htm A problem was fixed for an intermittent IPL failure with SRC B181E540 logged with fault signature " ex(n2p1c0) (L2FIR[13]) NCU Powerbus data timeout". No FRU is called out. The error may be ignored and the re-IPL is successful. The error occurs very infrequently. This is the second iteration of the fix that has been released. Expedient routing of the Powerbus interrupts did not occur in all cases in the prior fix, so the timeout problem was still occurring.
VM940_027_027 / FW940.00 11/25/19	Impact: New Severity: New GA Level with key features included listed below All features and fixes from the FW930.11. service pack (and below) are included in this release. At the time of the FW940.00 release, the FW930.11 is a future FW930 service pack scheduled for the fourth quarter of 2019. New Features and Functions User Mode NX Accelerator Enablement for PowerVM. This enables the access of NX accelerators such as the gzip engine through user mode interfaces. The IBM Virtual HMC (vHMC) 9.1.940 provides a user interface to this feature. The LPAR must be running in POWER9 compatibility mode to use this feature. For more information on compatibility modes, see the following two articles in the IBM Knowledge Center: 1) Processor compatibility mode overview: https://www.ibm.com/support/knowledgecenter/POWER9/p9hc3/p9hc3_pcm.htm 2) Processor compatibility mode definitions: https://www.ibm.com/support/knowledgecenter/POWER9/p9hc3/p9hc3_pcmdefs.htm Support for user mode enablement of the External Interrupt Virtualization Engine (XIVE). This user mode enables the management of interrupts to move from the hypervisor to the operating system for improved efficiency. Operating systems may also have to be updated to enable this support. The LPAR must be running in POWER9 compatibility mode to use this feature. For more information on compatibility modes, see the following two articles in the IBM Knowledge Center: 1) Processor compatibility mode overview: https://www.ibm.com/support/knowledgecenter/POWER9/p9hc3/p9hc3_pcm.htm 2) Processor compatibility mode definitions: https://www.ibm.com/support/knowledgecenter/POWER9/p9hc3/p9hc3_pcmdefs.htm Extended support for PowerVM Firmware Secure Boot. This feature restricts access to the Open Firmware prompt and validates all adapter boot driver code. Boot adapters, or adapters which may be used as boot adapters in the future, must be updated to the latest microcode from IBM Fix Central. The latest microcode will ensure the adapters support the Firmware Secure Boot feature of Power Systems. This requirement applies when updating system firmware from a level prior to FW940 to levels FW940 and later. The latest adapter microcode levels include signed boot driver code. If a boot-capable PCI adapter is not installed with the latest level of adapter microcode, the partition which owns the adapter will boot, but error logs with SRCs BA5400A5 or BA5400A6 will be posted. Once the adapter(s) are updated, the error logs will no longer be posted. Linux OS support was added for PowerVM LPARs for the PCIe4 2x100GbE ConnectX-5 RoCE adapter with feature codes of #EC66/EC67 and CCIN 2CF3. Linux versions RHEL 7.5 and SLES 12.3 are supported. This server firmware level includes the SR-IOV adapter firmware level 11.4.415.28 for the following Feature Codes and CCINs: #EN15/EN16 with CCIN 2CE3, #EN17/EN18 with CCIN 2CE4, #EN0H/EN0J with CCIN 2B93, #EN0M/EN0N with CCIN 2CC0, and #EN0K/EN0L with CCIN 2CC1. This server firmware includes the SR-IOV adapter firmware level 1x.25.6000 for the following Feature Codes and CCINs: #EC2R/EC2S with CCIN 58FA; #EC2T/EC2U with CCIN 58FB; #EC3L/EC3M with CCIN 2CEC; and #EC66/EC67 with CCIN 2CF3. System firmware changes that affect all systems A problem was fixed for incorrect call outs for PowerVM hypervisor terminations with SRC B7000103 logged. With the fix, the call outs are changed from SVCDOCS, FSPSP04, and FSPSP06 to FSPSP16. When this type of termination occurs, IBM support requires the dumps be collected to determine the cause of failure. A problem was fixed for an IPL failure with the following possible SRCs logged: 11007611, 110076x1, 1100D00C, and 110015xx. The service processor may reset/reload for this intermittent error and end up in the termination state.

VM930

VM930 For Impact, Severity and other Firmware definitions, Please refer to the below 'Glossary of firmware terms' url: https://www.ibm.com/support/pages/node/6555136
VM930_145_035 / FW930.50 9/17/21	Impact: Data Severity: HIPER New features and functions Support was changed to disable Service Location Protocol (SLP) by default for newly shipped systems or systems that are reset to manufacturing defaults. This change has been made to reduce memory usage on the service processor by disabling a service that is not needed for normal system operations. This change can be made manually for existing customers by changing it in ASMI with the options "ASMI -> System Configuration -> Security -> External Services Management" to disable the service. System firmware changes that affect all systems HIPER: A problem was fixed which may occur on a target system following a Live Partition Mobility (LPM) migration of an AIX partition utilizing Active Memory Expansion (AME) with 64 KB page size enabled using the vmo tunable: "vmo -ro ame_mpsize_support=1". The problem may result in AIX termination, file system corruption, application segmentation faults, or undetected data corruption. Note: If you are doing an LPM migration of an AIX partition utilizing AME and 64 KB page size enabled involving a POWER8 or POWER9 system, ensure you have a Service Pack including this change for the appropriate firmware level on both the source and target systems. A problem was fixed for a missing hardware callout and guard for a processor chip failure with SRC BC8AE540 and signature "ex(n0p0c5) (L3FIR[28]) L3 LRU array parity error". A problem was fixed for a missing hardware callout and guard for a processor chip failure with Predictive Error (PE) SRC BC70E540 and signature "ex(n1p2c6) (L2FIR[19]) Rc or NCU Pb data CE error". The PE error occurs after the number of CE errors reaches a threshold of 32 errors per day. A problem was fixed for a rare failure for an SPCN I2C command sent to a PCIe I/O expansion drawer that can occur when service data is manually collected with hypervisor macros "xmsvc -dumpCCData and xmsvc -logCCErrBuffer". If the hypervisor macro "xmsvc "is run to gather service data and a CMC Alert occurs at the same time that requires an SPCN command to clear the alert, then the I2C commands may be improperly serialized, resulting in an SPCN I2C command failure. To prevent this problem, avoid using xmsvc -dumpCCData and xmsvc -logCCErrBuffer to collect service data until this fix is applied. A problem was fixed for a system hang or terminate with SRC B700F105 logged during a Dynamic Platform Optimization (DPO) that is running with a partition in a failed state but that is not shut down. If DPO attempts to relocate a dedicated processor from the failed partition, the problem may occur. This problem can be avoided by doing a shutdown of any failed partitions before initiating DPO. A problem was fixed for a system crash with HMC message HSCL025D and SRC B700F103 logged on a Live Partition Mobility (LPM) inactive migration attempt that fails. The trigger for this problem is inactive migration that fails a compatibility check between the source and target systems. A problem was fixed for time-out issues in Power Enterprise Pools 1.0 (PEP 1.0) that can affect performance by having non-optimal assignments of processors and memory to the server logical partitions in the pool. For this problem to happen, the server must be in a PEP 1.0 pool and the HMC must take longer than 2 minutes to provide the PowerVM hypervisor with the information about pool resources owned by this server. The problem can be avoided by running the HMC optmem command before activating the partitions. A problem was fixed for a Live Partition Mobility (LPM) migration that failed with the error "HSCL3659 The partition migration has been stopped because orchestrator detected an error" on the HMC. This problem is intermittent and rare that is triggered by the HMC being overrun with unneeded LPM message requests from the hypervisor that can cause a timeout in HMC queries that result in the LPM operation being aborted. The workaround is to retry the LPM migration which will normally succeed. A problem was fixed for a system becoming unresponsive when a processor goes into a tight loop condition with an SRC B17BE434, indicating that the service processor has lost communication with the hypervisor. This problem is triggered by an SR-IOV shared mode adapter going through a recovery VF reset for an error condition, without releasing a critical lock. If a later reset is then needed for the VF, the problem can occur. The problem is infrequent because a combination of errors needs to occur in a specific sequence for the adapter. A problem was fixed for a misleading SRC B7006A20 (Unsupported Hardware Configuration) that can occur for some error cases for PCIe #EMX0 expansion drawers that are connected with copper cables. For cable unplug errors, the SRC B7006A88 (Drawer TrainError) should be shown instead of the B7006A20. If a B7006A20 is logged against copper cables with the signature "Prc UnsupportedCableswithFewerChannels" and the message "NOT A 12CHANNEL CABLE", this error should instead follow the service actions for a B7006A88 SRC. A problem was fixed where the Floating Point Unit Computational Test, which should be set to "staggered" by default, has been changed in some circumstances to be disabled. If you wish to re-enable this option, this fix is required. After applying this service pack, do the following steps: 1) Sign in to the Advanced System Management Interface (ASMI). 2) Select Floating Point Computational Unit under the System Configuration heading and change it from disabled to what is needed: staggered (run once per core each day) or periodic (a specified time). 3) Click "Save Settings". A problem was fixed for a system termination with SRC B700F107 following a time facility processor failure with SRC B700F10B. With the fix, the transparent replacement of the failed processor will occur for the B700F10B if there is a free core, with no impact to the system. A problem was fixed for an incorrect "Power Good fault" SRC logged for an #EMX0 PCIe3 expansion drawer on the lower CXP cable of B7006A85 (AOCABLE, PCICARD). The correct SRC is B7006A86 (PCICARD, AOCABLE). A problem was fixed for certain SR-IOV adapters that can have B400FF02 SRCs logged with LPA dumps during a vNIC remove operation. In most cases, the operations should recover and complete. This fix updates the adapter firmware to XX.29.2003 for the following Feature Codes and CCINs: #EC2R/EC2S with CCIN 58FA; #EC2T/EC2U with CCIN 58FB; #EC3L/EC3M with CCIN 2CE; and #EC66/EC67 with CCIN 2CF3. Update instructions: https://www.ibm.com/support/knowledgecenter/en/POWER9/p9efd/p9efd_updating_sriov_firmware.htm. A problem was fixed for certain SR-IOV adapters not being able to create the maximum number of VLANs that are supported for a physical port. The SR-IOV adapters affected have the following Feature Codes and CCINs: #EC66/#EC67 with CCIN 2CF3. Update instructions: https://www.ibm.com/support/knowledgecenter/en/POWER9/p9efd/p9efd_updating_sriov_firmware.htm. A problem was fixed for an SR-IOV adapter in shared mode configured as Virtual Ethernet Port Aggregator (VEPA) where unmatched unicast packets were not forwarded to the promiscuous mode VF. This is an updated and corrected version of a similar fix delivered in the FW940.40 service pack that had side effects of network issues such as ping failure or inability to establish TCP connections. Update instructions: https://www.ibm.com/support/knowledgecenter/en/POWER9/p9efd/p9efd_updating_sriov_firmware.htm. The following problems were fixed for certain SR-IOV adapters: 1) An error was fixed that occurs during a VNIC failover where the VNIC backing device has a physical port down due to an adapter internal error with an SRC B400FF02 logged. This is an improved version of the fix delivered in earlier service pack FW930.40 for adapter firmware 11.4.415.36 and it significantly reduces the frequency of the error being fixed. 2) A problem was fixed for an adapter in SR-IOV shared mode which may cause a network interruption and SRCs B400FF02 and B400FF04 logged. The problem occurs infrequently during normal network traffic. The fixes update the adapter firmware to 11.4.415.38 for the following Feature Codes and CCINs: #EN15/#EN16 with CCIN 2CE3, #EN17/#EN18 with CCIN 2CE4, #EN0H/#EN0J with CCIN 2B93, #EN0M/#EN0N with CCIN 2CC0, and #EN0K/#EN0L with CCIN 2CC1. Update instructions: https://www.ibm.com/support/knowledgecenter/en/POWER9/p9efd/p9efd_updating_sriov_firmware.htm. A problem was fixed for the Device Description in a System Plan related to Crypto Coprocessors and NVMe cards that were only showing the PCI vendor and device ID of the cards. This is not enough information to verify which card is installed without looking up the PCI IDs first. With the fix, more specific/useful information is displayed and this additional information does not have any adverse impact on sysplan operations. The problem is seen every time a System Plan is created for an installed Crypto Coprocessor or NVMe card. A problem was fixed for possible partition errors following a concurrent firmware update from FW910 or later. A precondition for this problem is that DLPAR operations of either physical or virtual I/O devices must have occurred prior to the firmware update. The error can take the form of a partition crash at some point following the update. The frequency of this problem is low. If the problem occurs, the OS will likely report a DSI (Data Storage Interrupt) error. For example, AIX produces a DSI_PROC log entry. If the partition does not crash, it is also possible that some subsequent I/O DLPAR operations will fail. A problem was fixed for some serviceable events specific to the reporting of EEH errors not being displayed on the HMC. The sending of an associated call home event, however, was not affected. This problem is intermittent and infrequent. System firmware changes that affect certain systems For a system with a partition running AIX 7.3, a problem was fixed for running Live Update or Live Partition Mobility (LPM). AIX 7.3 supports Virtual Persistent Memory (PMEM) but it cannot be used with these operations, but the problem was making it appear that PMEM was configured when it was not. The Live Update and LPM operations always fail when attempted on AIX 7.3. Here is the failure output from a Live Update Preview: "1430-296 FAILED: not all devices are virtual devices. nvmem0 1430-129 FAILED: The following loaded kernel extensions are not known to be safe for Live Update: nvmemdd ... 1430-218 The live update preview failed. 0503-125 geninstall: The lvupdate call failed. Please see /var/adm/ras/liveupdate/logs/lvupdlog for details." On systems with only Integrated Facility for Linux ( IFL) processors and AIX partitions, a problem was fixed for performance issues for IFL VMs (Linux and VIOS). This problem occurs if AIX partitions are active on a system with IPL only cores. As a workaround, AIX partitions should not be activated on an IFL only system. With the fix, the activation of AIX partitions is blocked on an IFL only system. If this fix is installed concurrently with AIX partitions running, these partitions will be allowed to continue to run until they are powered off. Once powered off, the AIX partitions will not be allowed to be activated again on the IFL-only system.
VM930_139_035 / FW930.41 5/25/21	Impact: Availability Severity: HIPER System firmware changes that affect all systems HIPER/Pervasive: A problem was fixed to be able to detect a failed PFET sensing circuit in a core at runtime, and prevent a system fail with an incomplete state when a core fails to wake up. The failed core is detected on the subsequent IPL. With the fix. a core is called out with the PFET failure with SRC BC13090F and hardware description "CME detected malfunctioning of PFET headers." to isolate the error better with a correct callout. HIPER/Pervasive: A problem was fixed for a checkstop due to an internal Bus transport parity error or a data timeout on the Bus. This is a very rare problem that requires a particular SMP transport link traffic pattern and timing. Both the traffic pattern and timing are very difficult to achieve with customer application workloads. The fix will have no measurable effect on most customer workloads although highly intensive OLAP-like workloads may see up to 2.5% impact.
VM930_134_035 / FW930.40 3/10/21	Impact: Availability Severity: SPE New features and functions Added support in ASMI for a new panel to do Self -Boot Engine (SBE) SEEPROM validation. This validation can only be run at the service processor standby state. If the validation detects a problem, IBM recommends the system not be used and that IBM service be called. System firmware changes that affect all systems DEFERRED: A problem was fixed for a rare Voltage Regulator Module (VRM) power fault with an SRC 11002700 logged for the VRM failure followed by an SRC 11002610 system crash. The trigger for this problem is intense workloads that cause what appear to be input over-current conditions. A re-IPL of the system is needed to activate this fix. A problem was fixed for the On-Chip Controller (OCC) going into safe mode (causes loss of processor performance) with SRC BC702616 logged. This problem can be triggered by the loss of a power supply (an oversubscription event). The problem can be circumvented by fixing the issue with the power supply. A problem was fixed for certain SR-IOV adapters that have a rare, intermittent error with B400FF02 and B400FF04 logged, causing a reboot of the VF. The error is handled and recovered without any user intervention needed. The SR-IOV adapters affected have the following Feature Codes and CCINs: #EC2R/#EC2S with CCIN 58FA; #EC2T/#EC2U with CCIN 58FB; #EC3L/#EC3M with CCIN 2CE; and #EC66/#EC67 with CCIN 2CF3. A problem was fixed for not logging SRCs for certain cable pulls from the #EMXO PCIe expansion drawer. With the fix, the previously undetected cable pulls are now detected and logged with SRC B7006A8B and B7006A88 errors. A problem was fixed for a rare system hang with SRC BC70E540 logged that may occur when adding processors through licensing or the system throttle state changing (becoming throttled or unthrottled) on an Enterprise Pool system. The trigger for the problem is a very small timing window in the hardware as the processor loads are changing. A problem was fixed for the error handling of a system with an unsupported memory configuration that exceeds available memory power. Without the fix, the IPL of the system is attempted and fails with a segmentation fault with SRCs B1818611 and B181460B logged that do not call out the incorrect DIMMs. A problem was fixed for the Systems Management Services ( SMS) menu "Device IO Information" option being incorrect when displaying the capacity for an NVMe or Fibre Channel (FC) NVMe disk. This problem occurs every time the data is displayed. A problem was fixed for an unrecoverable UE SRC B181BE12 being logged if a service processor message acknowledgment is sent to a Hostboot instance that has already shutdown. This is a harmless error log and it should have been marked as an informational log. A problem was fixed for Time of Day (TOD) being lost for the real-time clock (RTC) when the system initializes from AC power off to service processor standby state with an SRC B15A3303 logged. This is a very rare problem that involves a timing problem in the service processor kernel that can be recovered by setting the system time with ASMI. A problem was fixed for intermittent failures for a reset of a Virtual Function (VF) for SR-IOV adapters during Enhanced Error Handling (EEH) error recovery. This is triggered by EEH events at a VF level only, not at the adapter level. The error recovery fails if a data packet is received by the VF while the EEH recovery is in progress. A VF that has failed can be recovered by a partition reboot or a DLPAR remove and add of the VF. A problem was fixed for performance degradation of a partition due to task dispatching delays. This may happen when a processor chip has all of its shared processors removed and converted to dedicated processors. This could be driven by DLPAR remove of processors or Dynamic Platform Optimization (DPO). The following problems were fixed for certain SR-IOV adapters: 1) An error was fixed that occurs during VNIC failover where the VNIC backing device has a physical port down with an SRC B400FF02 logged. 2) A problem was fixed for adding a new logical port that has a PVID assigned that is causing traffic on that VLAN to be dropped by other interfaces on the same physical port which uses OS VLAN tagging for that same VLAN ID. This problem occurs each time a logical port with a non-zero PVID that is the same as an existing VLAN is dynamically added to a partition or is activated as part of a partition activation, the traffic flow stops for other partitions with OS configured VLAN devices with the same VLAN ID. This problem can be recovered by configuring an IP address on the logical port with the non-zero PVID and initiating traffic flow on this logical port. This problem can be avoided by not configuring logical ports with a PVID if other logical ports on the same physical port are configured with OS VLAN devices. This fix updates the adapter firmware to 11.4.415.36 for the following Feature Codes and CCINs: #EN15/#EN16 with CCIN 2CE3, #EN17/#EN18 with CCIN 2CE4, #EN0H/#EN0J with CCIN 2B93, #EN0M/#EN0N with CCIN 2CC0, and #EN0K/#EN0L with CCIN 2CC1. The SR-IOV adapter firmware level update for the shared-mode adapters happens under user control to prevent unexpected temporary outages on the adapters. A system reboot will update all SR-IOV shared-mode adapters with the new firmware level. In addition, when an adapter is first set to SR-IOV shared mode, the adapter firmware is updated to the latest level available with the system firmware (and it is also updated automatically during maintenance operations, such as when the adapter is stopped or replaced). And lastly, selective manual updates of the SR-IOV adapters can be performed using the Hardware Management Console (HMC). To selectively update the adapter firmware, follow the steps given at the IBM Knowledge Center for using HMC to make the updates: https://www.ibm.com/support/knowledgecenter/en/POWER9/p9efd/p9efd_updating_sriov_firmware.htm. Note: Adapters that are capable of running in SR-IOV mode, but are currently running in dedicated mode and assigned to a partition, can be updated concurrently either by the OS that owns the adapter or the managing HMC (if OS is AIX or VIOS and RMC is running). A problem was fixed for incomplete periodic data gathered by IBM Service for #EMXO PCIe expansion drawer predictive error analysis. The service data is missing the PLX (PCIe switch) data that is needed for the debug of certain errors. A problem was fixed for a partition hang in shutdown with SRC B200F00F logged. The trigger for the problem is an asynchronous NX accelerator job (such as gzip or NX842 compression) in the partition that fails to clean up successfully. This is intermittent and does not cause a problem until a shutdown of the partition is attempted. The hung partition can be recovered by performing an LPAR dump on the hung partition. When the dump has been completed, the partition will be properly shut down and can then be restarted without any errors.
VM930_116_035 / FW930.30 10/21/20	Impact: Data Severity: HIPER New features and functions DEFERRED: Host firmware support for anti-rollback protection. This feature implements firmware anti-rollback protection as described in NIST SP 800-147B "BIOS Protection Guidelines for Servers". Firmware is signed with a "secure version". Support added for a new menu in ASMI called "Host firmware security policy" to update this secure version level at the processor hardware. Using this menu, the system administrator can enable the "Host firmware secure version lock-in" policy, which will cause the host firmware to update the "minimum secure version" to match the currently running firmware. Use the "Firmware Update Policy" menu in ASMI to show the current "minimum secure version" in the processor hardware along with the "Minimum code level supported" information. The secure boot verification process will block installing any firmware secure version that is less than the "minimum secure version" maintained in the processor hardware. Prior to enabling the "lock-in" policy, it is recommended to accept the current firmware level. WARNING: Once lock-in is enabled and the system is booted, the "minimum secure version" is updated and there is no way to roll it back to allow installing firmware releases with a lesser secure version. Enable periodic logging of internal component operational data for the PCIe3 expansion drawer paths. The logging of this data does not impact the normal use of the system. System firmware changes that affect all systems HIPER/Pervasive: A problem was fixed for certain SR-IOV adapters for a condition that may result from frequent resets of adapter Virtual Functions (VFs), or transmission stalls and could lead to potential undetected data corruption. The following additional fixes are also included: 1) The VNIC backing device goes to a powered off state during a VNIC failover or Live Partition Mobility (LPM) migration. This failure is intermittent and very infrequent. 2) Adapter time-outs with SRC B400FF01 or B400FF02 logged. 3) Adapter time-outs related to adapter commands becoming blocked with SRC B400FF01 or B400FF02 logged 4) VF function resets occasionally not completing quickly enough resulting in SRC B400FF02 logged. This fix updates the adapter firmware to 11.4.415.33 for the following Feature Codes and CCINs: #EN15/#EN16 with CCIN 2CE3, #EN17/#EN18 with CCIN 2CE4, #EN0H/#EN0J with CCIN 2B93, #EN0M/#EN0N with CCIN 2CC0, and #EN0K/#EN0L with CCIN 2CC1. The SR-IOV adapter firmware level update for the shared-mode adapters happens under user control to prevent unexpected temporary outages on the adapters. A system reboot will update all SR-IOV shared-mode adapters with the new firmware level. In addition, when an adapter is first set to SR-IOV shared mode, the adapter firmware is updated to the latest level available with the system firmware (and it is also updated automatically during maintenance operations, such as when the adapter is stopped or replaced). And lastly, selective manual updates of the SR-IOV adapters can be performed using the Hardware Management Console (HMC). To selectively update the adapter firmware, follow the steps given at the IBM Knowledge Center for using HMC to make the updates: https://www.ibm.com/support/knowledgecenter/en/POWER9/p9efd/p9efd_updating_sriov_firmware.htm. Note: Adapters that are capable of running in SR-IOV mode, but are currently running in dedicated mode and assigned to a partition, can be updated concurrently either by the OS that owns the adapter or the managing HMC (if OS is AIX or VIOS and RMC is running). DEFERRED: A problem was fixed for a slow down in PCIe adapter performance or loss of adapter function caused by a reduction in interrupts available to service the adapter. This problem can be triggered over time by partition activations or DLPAR adds of PCIe adapters to a partition. This fix must be applied and the system re-IPLed for existing adapter performance problems to be resolved. A rare problem was fixed for a checkstop during an IPL that fails to isolate and guard the problem core. An SRC is logged with B1xxE5xx and an extended hex word 8 xxxxDD90. With the fix, the suspected failing hardware is guarded. A problem was fixed to allow quicker recovery of PCIe links for the #EMXO PCIe expansion drawer for a run-time fault with B7006A22 logged. The time for recovery attempts can exceed six minutes on rare occasions which may cause I/O adapter failures and failed nodes. With the fix, the PCIe links will recover or fail faster (in the order of seconds) so that redundancy in a cluster configuration can be used with failure detection and failover processing by other hosts, if available, in the case where the PCIe links fail to recover. A problem was fixed for system memory not returned after create and delete of partitions, resulting in slightly less memory available after configuration changes in the systems. With the fix, an IPL of the system will recover any of the memory that was orphaned by the issue. A problem was fixed for certain SR-IOV adapters that do not support the "Disable Logical Port" option from the HMC but the HMC was allowing the user to select this, causing incorrect operation. The invalid state of the logical port causes an "Enable Logical Port" to fail in a subsequent operation. With the fix, the HMC provides the message that the "Disable Logical Port" is not supported for the adapter. This affects the adapters with the following Feature Codes and CCINs: #EN15/#EN16 with CCIN 2CE3, #EN17/#EN18 with CCIN 2CE4, #EN0H/#EN0J with CCIN 2B93, #EN0M/#EN0N with CCIN 2CC0, and #EN0K/#EN0L with CCIN 2CC1. A problem was fixed for SR-IOV adapters having an SRC B400FF04 logged when a VF is reset. This is an infrequent issue and can occur for a Live Partition Mobility migration of a partition or during vNIC (Virtual Network Interface Controller) failovers where many resets of VFs are occurring. This error is recovered automatically with no impact on the system. A problem was fixed to remove unneeded resets of a Virtual Function (VF) for SR-IOV adapters, providing for improved performance of the startup or recovery time of the VF. This performance difference may be noticed during a Live Partition Mobility migration of a partition or during vNIC (Virtual Network Interface Controller) failovers where many resets of VFs are occurring. A problem was fixed for TPM hardware failures not causing SRCs to logged with a call out if the system is configured in ASMI to not require TPM for the IPL. If this error occurs, the user would not find out about it until they needed to run with TPM on the IPL. With the fix, the error logs and notifications will occur regardless of how the TPM is configured. A problem was fixed for PCIe resources under a deconfigured PCIe Host Bridge (PHB) being shown on the OS host as available resources when they should be shown as deconfigured. While this fix can be applied concurrently, a re-IPL of the system is needed to correct the state of the PCIe resources if a PHB had already been deconfigured. A problem was fixed for the REST/Redfish interface to change the success return code for object creation from "200" to "201". The "200" status code means that the request was received and understood and is being processed. A "201" status code indicates that a request was successful and, as a result, a resource has been created. The Redfish Ruby Client, "redfish_client" may fail a transaction if a "200" status code is returned when "201" is expected. A problem was fixed for a concurrent maintenance "Repair and Verify" (R&V) operation for a #EMX0 fanout module that fails with an "Unable to isolate the resource" error message. This should occur only infrequently for cases where a physical hardware failure has occurred which prevents access to slot power controls. This problem can be worked around by bringing up the "PCIe Hardware Topology" screen from either ASMI or the HMC after the hardware failure but before the concurrent repair is attempted. This will avoid the problem with the PCIe slot isolation These steps can also be used to recover from the error to allow the R&V repair to be attempted again. A problem was fixed for certain large I/O adapter configurations having the PCI link information truncated on the PCI-E topology display shown with ASMI and the HMC. Because of the truncation, individual adapters may be missing on the PCI-E topology screens. A problem was fixed for a rare system hang that can occur when a page of memory is being migrated. Page migration (memory relocation) can occur for a variety of reasons, including predictive memory failure, DLPAR of memory, and normal operations related to managing the page pool resources. A problem was fixed for utilization statistics for commands such as HMC lslparutil and third-party lpar2rrd that do not accurately represent CPU utilization. The values are incorrect every time for a partition that is migrated with Live Partition Mobility (LPM). Power Enterprise Pools 2.0 is not affected by this problem. If this problem has occurred, here are three possible recovery options: 1) Re-IPL the target system of the migration. 2) Or delete and recreate the partition on the target system. 3) Or perform an inactive migration of the partition. The cycle values get zeroed in this case. A problem was fixed for running PCM on a system with SR-IOV adapters in shared mode that results in an "Incomplete" system state with certain hypervisor tasks deadlocked. This problem is rare and is triggered when using SR-IOV adapters in shared mode and gathering performance statistics with PCM (Performance Collection and Monitoring) and also having a low level error on an adapter. The only way to recover from this condition is to re-IPL the system. A problem was fixed for an enhanced PCIe expansion drawer FPGA reset causing EEH events from the fanout module or cable cards that disrupt the PCIe lanes for the PCIe adapters. This problem affects systems with the PCIe expansion drawer enhanced fanout module (#EMXH) and the enhanced cable card ( #EJ20). The error is associated with the following SRCs being logged: B7006A8D with PRC 37414123 (XmPrc::XmCCErrMgrBearPawPrime \| XmPrc::LocalFpgaHwReset) B7006A8E with PRC 3741412A (XmPrc::XmCCErrMgrBearPawPrime \| XmPrc::RemoteFpgaHwReset) If the EEH errors occur, the OS device drivers automatically recover but with a reset of affected PCIe adapters that would cause a brief interruption in the I/O communications. A problem was fixed for the FRU callout lists for SRCs B7006A2A and B7006A2B possibly not including the FRU containing the PCIe switch as the second FRU in the callout list. The card/drive in the slot is the first callout and the FRU containing the PCIe switch should be the second FRU in the callout list. This problem occurs when the PCIe slot is on a different planar that the PCIe switch backing the slot. This impacts the NVMe backplanes (P2 with slots C1-C4) hosting the PCIe backed SSD NVMe U.2 modules that have feature codes #EC5J and #EC5K. As a workaround for B7006A2A and B7006A2B errors where the callout FRU list is processed and the problem is not resolved, consider replacing the backplane (which includes the PCIe switch) if this was omitted in the FRU callout list. A problem was fixed for a PCIe3 expansion drawer cable that has hidden error logs for a single lane failure. This happens whenever a single lane error occurs. Subsequent lane failures are not hidden and have visible error logs. Without the fix, the hidden or informational logs would need to be examined to gather more information for the failing hardware. A problem was fixed for mixing modes on the ports of SR-IOV adapters that causes SRCs B200A161, B200F011, B2009014 and B400F104 to be logged on boot of the failed adapter. This error happens when one port of the adapter is changed to option 1 with a second port set at either option 0 or option 2. The error can be cleared by taking the adapter out of SR-IOV shared mode. The SR-IOV adapters affected have the following Feature Codes and CCINs: #EC2R/#EC2S with CCIN 58FA; #EC2T/#EC2U with CCIN 58FB; #EC3L/#EC3M with CCIN 2CE; and #EC66/#EC67 with CCIN 2CF3. A problem was fixed for a partition configured with a large number (approximately 64) of Virtual Persistent Memory (PMEM) LUNs hanging during the partition activation with a CA00E134 checkpoint SRC posted. Partitions configured with approximately 64 PMEM LUNs will likely hang and the greater the number of LUNs, the greater the possibility of the hang. The circumventionf or this problem is to reduce the number of PMEM LUNs to 64 or less in order to boot successfully. The PMEM LUNs are also known as persistent memory volumes and can be managed using the HMC. For more information on this topic, refer to https://www.ibm.com/support/knowledgecenter/POWER9/p9efd/p9efd_lpar_pmem_settings.htm. A problem was fixed for non-optimal On-Chip Controller (OCC) processor frequency adjustments when system power limits or user power caps are exceeded. When a workload causes power limits or caps to be exceeded, there can be large frequency swings for the processors and a processor chip can get stuck at minimum frequency. With the fix, the OCC now waits for new power readings when changing the processor frequency and uses a master power capping frequency to keep all processors at the same frequency. As a workaround for this problem, do not set a power cap or run a workload that would exceed the system power limit. A problem was fixed for mixing memory DIMMs with different timings (different vendors) under the same memory controller that fail with an SRC BC20E504 error and DIMMs deconfigured. This is an "MCBIST_BRODCAST_OUT_OF_SYNC" error. The loss of memory DIMMs can result in a IPL failure. This problem can happen if the memory DIMMs have a certain level of timing differences. If the timings are not compatible, the failure will occur on the IPL during the memory training. To circumvent this problem, each memory controller should have only memory DIMMs from the same vendor plugged. A problem was fixed for the Self Boot Engine (SBE) going to termination with an SRC B150BA8D logged when booting on a bad core. Once this happens, this error will persist as the bad core is not deconfigured. To recover from this error and be able to IPL, the bad core must be manually deconfigured. With the fix, the failing core is deconfigured and the SBE is reconfigured to use another core so the system is able to IPL. A problem was fixed for guard clearing where a specific unguard action may cause other unrelated predictive and manual guards to also be cleared. A problem was fixed for an infrequent issue after a Live Partition Mobility (LPM) operation from a POWER9 system to a POWER8 or POWER7 system. The issue may cause unexpected OS behavior, which may include loss of interrupts, device time-outs, or delays in dispatching. Rebooting the affected target partition will resolve the problem. A problem was fixed for a partition crash or hang following a partition activation or a DLPAR add of a virtual processor. For partition activation, this issue is only possible for a system with a single partition owning all resources. For DLPAR add, the issue is extremely rare. A problem was fixed for a DLPAR remove of memory from a partition that fails if the partition contains 65535 or more LMBs. With 16MB LMBs, this error threshold is 1 TB of memory. With 256 MB LMBs, it is 16 TB of memory. A reboot of the partition after the DLPAR will remove the memory from the partition. A problem was fixed for incorrect run-time deconfiguration of a processor core with SRC B700F10B. This problem can be circumvented by a reconfiguration of the processor core but this should only be done with the guidance of IBM Support to ensure the core is good. A problem was fixed for Live Partition Mobility (LPM) being shown as enabled at the OS when it has been disabled by the ASMI command line using the server processor command of "cfcuod -LPM OFF". LPM is actually disabled and the status shows correctly on the HMC. The status on the OS can be ignored (for example as shown by the AIX command "lparstat -L") as LPM will not be allowed to run when it is disabled. A problem was fixed for a VIOS, AIX, or Linux partition hang during an activation at SRC CA000040. This will occur on a system that has been running more than 814 days when the boot of the partition is attempted if the partitions are in POWER9_base or POWER9 processor compatibility mode. A workaround to this problem is to re-IPL the system or to change the failing partition to POWER8 compatibility mode. A problem was fixed for performance tools perfpmr, tprof and pex that may not be able to collect data for the event-based options. This can occur any time an OS thread becomes idle. When the processor cores are assigned to the next active process, the performance registers may be disabled. A problem was fixed for a system hang and HMC "Incomplete" state that may occur when a partition hangs in shutdown with SRC B200F00F logged. The trigger for the problem is an asynchronous NX accelerator job (such as gzip) in the partition that fails to clean up successfully. This is intermittent and does not cause a problem until a shutdown of the partition is attempted. A problem was fixed for an SRC B7006A99 informational log now posted as a Predictive with a call out of the CXP cable FRU, This fix improves FRU isolation for cases where a CXP cable alert causes a B7006A99 that occurs prior to a B7006A22 or B7006A8B. Without the fix, the SRC B7006A99 is informational and the latter SRCs cause a larger hardware replacement even though the earlier event identified a probable cause for the cable FRU. A problem was fixed for a security vulnerability for the Self Boot Engine (SBE). The SBE can be compromised from the service processor to allow injection of malicious code. An attacker that gains root access to the service processor could compromise the integrity of the host firmware and bypass the host firmware signature verification process. This compromised state can not be detected through TPM attestation. This is Common Vulnerabilities and Exposures issue number CVE-2021-20487. System firmware changes that affect certain systems On systems with AIX and Linux partitions, a problem was fixed for AIX and Linux partitions that crash or hang when reporting any of the following Partition Firmware RTAS ASSERT rare conditions: 1) SRC BA33xxxx errors - Memory allocation and management errors. 2) SRC BA29xxxx errors - Partition Firmware internal stack errors. 3) SRC BA00E8xx errors - Partition Firmware initialization errors during concurrent firmware update or Live Partition Mobility (LPM) operations. This problem should be very rare. If the problem does occur, a partition reboot is needed to recover from the error.
VM930_101_035 / FW930.20 02/27/20	Impact: Availability Severity: HIPER New features and functions Support was added for real-time data capture for PCIe3 expansion drawer (#EMX0) cable card connection data via resource dump selector on the HMC or in ASMI on the service processor. Using the resource selector string of "xmfr -dumpccdata" will non-disruptively generate an RSCDUMP type of dump file that has the current cable card data, including data from cables and the retimers. System firmware changes that affect all systems HIPER/Pervasive: A problem was fixed for a possible system crash and HMC "Incomplete" state when a logical partition (LPAR) power off after a dynamic LPAR (DLPAR) operation fails for a PCIe adapter. This scenario is likely to occur during concurrent maintenance of PCIe adapters or for #EMX0 components such as PCIe3 Cable adapters, Active Optical or copper cables, fanout modules, chassis management cards, or midplanes. The DLPAR fail can leave page table mappings active for the adapter, causing the problems on the power down of the LPAR. If the system does not crash, the DLPAR will fail if it is retried until a platform IPL is performed. HIPER/Pervasive: A problem was fixed for an HMC "Incomplete" state for a system after the HMC user password is changed with ASMI on the service processor. This problem can occur if the HMC password is changed on the service processor but not also on the HMC, and a reset of the service processor happens. With the fix, the HMC will get the needed "failed authentication" error so that the user knows to update the old password on the HMC. DEFERRED: A problem was fixed for a processor core failure with SRCs B150BA3C and BC8A090F logged that deconfigures the entire processor for the current IPL. A re-IPL of the system will recover the lost processor with only the bad core guarded. A problem was fixed for certain SR-IOV adapters that can have an adapter reset after a mailbox command timeout error. This fix updates the adapter firmware to 11.2.211.39 for the following Feature Codes and CCINs: #EN15/EN16 with CCIN 2CE3, #EN17/EN18 with CCIN 2CE4, #EN0H/EN0J with CCIN 2B93, #EN0M/EN0N with CCIN 2CC0, and #EN0K/EN0L with CCIN 2CC1. The SR-IOV adapter firmware level update for the shared-mode adapters happens under user control to prevent unexpected temporary outages on the adapters. A system reboot will update all SR-IOV shared-mode adapters with the new firmware level. In addition, when an adapter is first set to SR-IOV shared mode, the adapter firmware is updated to the latest level available with the system firmware (and it is also updated automatically during maintenance operations, such as when the adapter is stopped or replaced). And lastly, selective manual updates of the SR-IOV adapters can be performed using the Hardware Management Console (HMC). To selectively update the adapter firmware, follow the steps given at the IBM Knowledge Center for using HMC to make the updates: https://www.ibm.com/support/knowledgecenter/en/POWER9/p9efd/p9efd_updating_sriov_firmware.htm. Note: Adapters that are capable of running in SR-IOV mode, but are currently running in dedicated mode and assigned to a partition, can be updated concurrently either by the OS that owns the adapter or the managing HMC (if OS is AIX or VIOS and RMC is running). A problem was fixed for an SR-IOV adapter failure with B400FFxx errors logged when moving the adapter to shared mode. This is an infrequent race condition where the adapter is not yet ready for commands and it can also occur during EEH error recovery for the adapter. This affects the SR-IOV adapters with the following feature codes and CCINs: #EC2R/EC2S with CCIN 58FA; #EC2T/EC2U with CCIN 58FB; #EC3L/EC3M with CCIN 2CEC; and #EC66/EC67 with CCIN 2CF3. A problem was fixed for an IPL failure with the following possible SRCs logged: 11007611, 110076x1, 1100D00C, and 110015xx. The service processor may reset/reload for this intermittent error and end up in the termination state. A problem was fixed for delayed interrupts on a Power9 system following a Live Partition Mobility operation from a Power7 or Power8 system. The delayed interrupts could cause device time-outs, program dispatching delays, or other device problems on the target Power9 system. A problem was fixed for processor cores not being able to be used by dedicated processor partitions if they were DLPAR removed from a dedicated processor partition. This error can occur if there was a firmware assisted dump or a Live Partition Mobility (LPM) operation after the DLPAR of the processor. A re-IPL of the system will recover the processor cores. A problem was fixed for a B7006A96 fanout module FPGA corruption error that can occur in unsupported PCIe3 expansion drawer(#EMX0) configurations that mix an enhanced PCIe3 fanout module (#EMXH) in the same drawer with legacy PCIe3 fanout modules (#EMXF, #EMXG, #ELMF, or #ELMG). This causes the FPGA on the enhanced #EMXH to be updated with the legacy firmware and it becomes a non-working and unusable fanout module. With the fix, the unsupported #EMX0 configurations are detected and handled gracefully without harm to the FPGA on the enhanced fanout modules. A problem was fixed for lost interrupts that could cause device time-outs or delays in dispatching a program process. This can occur during memory operations that require a memory relocation for any partition such as mirrored memory defragmentation done by the HMC optmem command, or memory guarding that happens as part of memory error recovery during normal operations of the system. A problem was fixed for extraneous informational logging of SRC B7006A10 ("Insufficient SR-IOV resources available") with a 1306 PRC. This SRC is logged whenever an SR-IOV adapter is moved from dedicated mode to shared mode. This SRC with the 1306 PRC should be ignored as no action is needed and there is no issue with SR-IOV resources. A problem was fixed for a hypervisor error during system shutdown where a B7000602 SRC is logged and the system may also briefly go "Incomplete" on the HMC but the shutdown is successful. The system will power back on with no problems so the SRC can be ignored if it occurred during a shutdown. A problem was fixed for possible dispatching delays for partitions running in POWER8, POWER9_base or POWER9 processor compatibility mode. A problem was fixed for extraneous B400FF01 and B400FF02 SRCs logged when moving cables on SR-IOV adapters. This is an infrequent error that can occur if the HMC performance monitor is running at the same time the cables are moved. These SRCs can be ignored when accompanied by cable movement.
VM930_093_035 / FW930.11 12/11/19	Impact: Availability Severity: SPE System firmware changes that affect all systems DEFERRED: PARTITION_DEFERRED: A problem was fixed for vHMC having no useable local graphics console when installed on FW930.00 and later partitions. A problem was fixed for an IPMI core dump and SRC B181720D logged, causing the service processor to reset due to a low memory condition. The memory loss is triggered by frequently using the ipmitool to read the network configuration. The service processor recovers from this error but if three of these errors occur within a 15 minute time span, the service processor will go to a failed hung state with SRC B1817212 logged. Should a service processor hang occur, OS workloads will continue to run but it will not be possible for the HMC to interact with the partitions. This service processor hung state can be recovered by doing a re-IPL of the system with a scheduled outage. A problem was fixed for the Advanced System Management Interface (ASMI) menu for "PCIe Hardware Topology/Reset link" showing the wrong value. This value is always wrong without the fix. A problem was fixed for PLL unlock error with SRC B124E504 causing a secondary error of PRD Internal Firmware Software Fault with SRC B181E580 and incorrect FRU call outs. A problem was fixed for an initialization failure of certain SR-IOV adapter ports during its boot, causing a B400FF02 SRC to be logged. This is a rare problem and it recovers automatically by the reboot of the adapter on the error. This problem affects the SR-IOV adapters with the following feature codes and CCINs: #EC2R/EC2S with CCIN 58FA; #EC2T/EC2U with CCIN 58FB; #EC3L/EC3M with CCIN 2CEC; and #EC66/EC67 with CCIN 2CF3. A problem was fixed for the SR-IOV Virtual Functions (VFs) when the multi-cast promiscuous flag has been turned on for the VF. Without the fix, the VF device driver sometimes may erroneously fault when it senses that the multi-cast promiscuous mode had not been achieved although it had been requested. A problem was fixed for SR-IOV adapters to provide a consistent Informational message level for cable plugging issues. For transceivers not plugged on certain SR-IOV adapters, an unrecoverable error (UE) SRC B400FF03 was changed to an Informational message logged. This affects the SR-IOV adapters with the following feature codes and CCINs: #EC2R/EC2S with CCIN 58FA; #EC2T/EC2U with CCIN 58FB; #EC3L/EC3M with CCIN 2CEC; and #EC66/EC67 with CCIN 2CF3. For copper cables unplugged on certain SR-IOV adapters, a missing message was replaced with an Informational message logged. This affects the SR-IOV adapters with the following feature codes and CCINs: #EN17/EN18 with CCIN 2CE4, and #EN0K/EN0L with CCIN 2CC1. A problem was fixed for incorrect DIMM callouts for DIMM over-temperature errors. The error log for the DIMM over temperature will have incorrect FRU callouts, either calling out the wrong DIMM or the wrong DIMM controller memory buffer. A problem was fixed for an Operations Panel hang after using it set LAN Console as the console type for several iterations. After several iterations, the operations panel may hang with "Function 41" displayed on the operations panel. A hot unplug and plug of the operations panel can be used to recover it from the hang. A problem was fixed for shared processor pools where uncapped shared processor partitions placed in a pool may not be able to consume all available processor cycles. The problem may occur when the sum of the allocated processing units for the pool member partitions equals the maximum processing units of the pool. A problem was fixed for Novalink failing to activate partitions that have names with character lengths near the maximum allowed character length. This problem can be circumvented by changing the partition name to have 32 characters or less. A problem was fixed where a Linux or AIX partition type was incorrectly reported as unknown. Symptoms include: IBM Cloud Management Console (CMC) not being able to determine the RPA partition type (Linux/AIX) for partitions that are not active; and HMC attempts to dynamically add CPU to Linux partitions may fail with a HSCL1528 error message stating that there are not enough Integrated Facility for Linux ( IFL) cores for the operation. A problem was fixed for a hypervisor hang that can occur on the target side when doing a Live Partition Mobility (LPM) migration from a system that does not support encryption and compression of LPM data. If the hang occurs, the HMC will go to an "Incomplete" state for the target system. The problem is rare because the data from the source partition must be in a very specific pattern to cause the failure. When the failure occurs, a B182951C will be logged on the target (destination) system and the HMC for the source partition will issue the following message: "HSCLA318 The migration command issued to the destination management console failed with the following error: HSCLA228 The requested operation cannot be performed because the managed system <system identifier> is not in the Standby or Operating state.". To recover, the target system must be re-IPLed. A problem was fixed for performance collection tools not collecting data for event-based options. This fix pertains to perfpmr and tprof on AIX, and Performance Explorer (PEX) on IBM i. A problem was fixed a Live Partition Mobility (LPM) migration of a large memory partition to a target system that causes the target system to crash and for the HMC to go to the "Incomplete" state. For servers with the default LMB size (256MB), if the partition is >=16TB and if the desired memory is different than the maximum memory, LPM may fail on the target system. Servers with LMB sizes less than the default could hit this problem with smaller memory partition sizes. A circumvention to the problem is to set the desired and maximum memory to the same value for the large memory partition that is to be migrated. A problem was fixed for certain SR-IOV adapters with the following issues: 1) If the SR-IOV logical port's VLAN ID (PVID) is modified while the logical port is configured, the adapter will use an incorrect PVID for the Virtual Function (VF). This problem is rare because most users do not change the PVID once the logical port is configured, so they will not have the problem. 2) Adapters with an SRC of B400FF02 logged. This fix updates the adapter firmware to 11.2.211.38 for the following Feature Codes and CCINs: #EN15/EN16 with CCIN 2CE3, #EN17/EN18 with CCIN 2CE4, #EN0H/EN0J with CCIN 2B93, #EN0M/EN0N with CCIN 2CC0, and #EN0K/EN0L with CCIN 2CC1. The SR-IOV adapter firmware level update for the shared-mode adapters happens under user control to prevent unexpected temporary outages on the adapters. A system reboot will update all SR-IOV shared-mode adapters with the new firmware level. In addition, when an adapter is first set to SR-IOV shared mode, the adapter firmware is updated to the latest level available with the system firmware (and it is also updated automatically during maintenance operations, such as when the adapter is stopped or replaced). And lastly, selective manual updates of the SR-IOV adapters can be performed using the Hardware Management Console (HMC). To selectively update the adapter firmware, follow the steps given at the IBM Knowledge Center for using HMC to make the updates: https://www.ibm.com/support/knowledgecenter/en/POWER9/p9efd/p9efd_updating_sriov_firmware.htm. Note: Adapters that are capable of running in SR-IOV mode, but are currently running in dedicated mode and assigned to a partition, can be updated concurrently either by the OS that owns the adapter or the managing HMC (if OS is AIX or VIOS and RMC is running). A problem was fixed for certain SR-IOV adapters where after some error conditions the adapter may hang with no messages or error recovery. This is a rare problem for certain severe adapter errors. This problem affects the SR-IOV adapters with the following feature codes: #EC66/EC67 with CCIN 2CF3. This problem can be recovered by removing the adapter from SR-IOV mode and putting it back in SR-IOV mode, or the system can be re-IPLed. A problem was fixed for an initialization failure of certain SR-IOV adapters when changed into SR-IOV mode. This is an infrequent problem that most likely can occur following a concurrent firmware update when the adapter also needs to be updated. This problem affects the SR-IOV adapter with the following feature codes and CCINs: #EC2R/EC2S with CCIN 58FA; #EC2T/EC2U with CCIN 58FB; #EC3L/EC3M with CCIN 2CEC; and #EC66/EC67 with CCIN 2CF3. This problem can be recovered by removing the adapter from SR-IOV mode and putting it back in SR-IOV mode, or the system can be re-IPLed. A problem was fixed for a rare IPL failure with SRCs BC8A090F and BC702214 logged caused by an overflow of VPD repair data for the processor cores. A re-IPL of the system should recover from this problem. A problem was fixed for a false memory error that can be logged during the IPL with SRC BC70E540 with the description "mcb(n0p0c1) (MCBISTFIR[12]) WAT_DEBUG_ATTN" but with no hardware call outs. This error log can be ignored. A problem was fixed for an IPL failure after installing DIMMs of different sizes, causing memory access errors. Without the fix, the memory configuration should be restored to only use DIMMs of the same size. A problem was fixed for a memory DIMM plugging rule violation that causes the IPL to terminate with an error log with RC_GET_MEM_VPD_UNSUPPORTED_CONFIG IPL that calls out the memory port but has no DIMM call outs and no DIMM deconfigurations are done. With the fix, the DIMMs that violate the plugging rules will be deconfigured and the IPL will complete. Without the fix, the memory configuration should be restored to the prior working configuration to allow the IPL to be successful. A problem was fixed for a B7006A22 Recoverable Error for the enhanced PCIe3 expansion drawer (#EMX0) I/O drawer with fanout PCIe Six Slot Fan Out modules (#EMXH) installed. This can occur up to two hours after an IPL from power off. This can be a frequent occurrence on an IPL for systems that have the PCIe Six Slot Fan Out module (#EMXH). The error is automatically recovered at the hypervisor level. If an LPAR fails to start after this error, a restart of the LPAR is needed. A problem was fixed for degraded memory bandwidth on systems with memory that had been dynamically repaired with symbols to mark the bad bits. A problem was fixed for an intermittent IPMI core dump on the service processor. This occurs only rarely when multiple IPMI sessions are starting and cleaning up at the same time. A new IPMI session can fail initialization when one of its session objects is inadvertently cleaned up. The circumvention is to retry the IPMI command that failed. A problem was fixed for an intermittent IPL failure with SRC B181E540 logged with fault signature " ex(n2p1c0) (L2FIR[13]) NCU Powerbus data timeout". No FRU is called out. The error may be ignored since the automatic re-IPL is successful. The error occurs very infrequently. This is the second iteration of the fix that has been released. Expedient routing of the Powerbus interrupts did not occur in all cases in the prior fix, so the timeout problem was still occurring. System firmware changes that affect certain systems On systems with PCIe3 expansion drawers(feature code #EMX0), a problem was fixed for a concurrent exchange of a PCIe expansion drawer cable card, although successful, leaves the fault LED turned on. On systems with 16TB or more of memory, a problem was fixed for certain SR-IOV adapters not being able to start a Virtual Function (VF) if "I/O Adapter Enlarged Capacity" is enabled and VF option 0 has been selected for the number of supported VFs . This problem affects the SR-IOV adapters with the following feature codes and CCINs: #EC2R/EC2S with CCIN 58FA; #EC2T/EC2U with CCIN 58FB; #EC3L/EC3M with CCIN 2CEC; and #EC66/EC67 with CCIN 2CF3. This problem can be circumvented by the following action: change away from VF option 0. VF option 1 is the default option and it will work. On systems with 16GB huge-pages, a problem was fixed for certain SR-IOV adapters with all or nearly all memory assigned to them preventing a system IPL. This affects the SR-IOV adapters with the following feature codes and CCINs: #EC2R/EC2S with CCIN 58FA; #EC2T/EC2U with CCIN 58FB; #EC3L/EC3M with CCIN 2CEC; and #EC66/EC67 with CCIN 2CF3. The problem can be circumvented by powering off the system and turning off all the huge-page allocations.
VM930_068_035 / FW930.03 08/22/19	Impact: Data Severity: HIPER System firmware changes that affect all systems HIPER/Pervasive: A change was made to fix an intermittent processor anomaly that may result in issues such as operating system or hypervisor termination, application segmentation fault, hang, or undetected data corruption. The only issues observed to date have been operating system or hypervisor terminations. A problem was fixed for a very intermittent partition error when using Live Partition Mobility (LPM) or concurrent firmware update. For a mobility operation, the issue can result in a partition crash if the mobility target system is FW930.00, FW930.01 or FW930.02. For a code update operation, the partition may hang. The recovery is to reboot the partition after the crash or hang.
VM930_048_035 / FW930.02 06/28/19	Impact: Availability Severity: SPE System firmware changes that affect all systems A problem was fixed for a bad link for the PCIe3 expansion drawer (#EMX0) I/O drawer with the clock enhancement causing a system failure with B700F103. This error could occur during an IPL or a concurrent add of the link hardware. A problem was fixed for On-Chip Controller (OCC) power capping operation time-outs with SRC B1112AD3 that caused the system to enter safe mode, resulting in reduced performance. The problem only occurred when the system was running with high power consumption, requiring the need for OCC power capping. A problem was fixed for the "PCIe Topology " option to get cable information in the HMC or ASMI that was returning the wrong cable part numbers if the PCIe3 expansion drawer (#EMX0) I/O drawer clock enhancement was configured. If cables with the incorrect part numbers are used for an enhanced PCIe3 expansion drawer configuration, the hypervisor will log a B7006A20 with PRC 4152 indicating an invalid configuration - https://www.ibm.com/support/knowledgecenter/9080-M9S/p9eai/B7006A20.htm. A problem was fixed for a drift in the system time (time lags and the clock runs slower than the true value of time) that occurs when the system is powered off to the service processor standby state. To recover from this problem, the system time must be manually corrected using the Advanced System Management Interface (ASMI) before powering on the system. The time lag increases in proportion to the duration of time that the system is powered off.
VM930_035_035 / FW930.00 05/17/19	Impact: New Severity: New All features and fixes from the FW920.30 service pack (and below) are included in this release. New Features and Functions Support was added to allow the FPGA soft error checking on the PCIe I/O expansion drawer (#EMX0) to be disabled with the help of IBM support using the hypervisor "xmsvc" macro. This new setting will persist until it it is changed by the user or IBM support. The effect of disabling FPGA soft error checking is to eliminate the FPGA soft error recovery which causes a recoverable PCIe adapter outage. Some of the soft errors will be hidden by this change but others may have unpredictable results, so this should be done only under guidance of IBM support. Support for the PCIe3 expansion drawer (#EMX0) I/O drawer clock enhancement so that a reset of the drawer does not affect the reference clock to the adapters so the PCIe lanes for the PCIe adapters can keep running through an I/O drawer FPGA reset. To use this support, new cable cards, fanout modules, and optical cables are needed after this support is installed: PCIe Six Slot Fan out module(#EMXH) - only allowed to be connected to converter adapter cable card; PCIe X16 to CXP Optical or CU converter adapter for the expansion drawer (#EJ20); and new AOC cables with feature/part number of #ECCR/78P6567, #ECCX/78P6568, #ECCY/78P6569, and #ECCZ/78P6570. These parts cannot be install concurrently, so a scheduled outage is needed to complete the migration. Support added for RDMA Over Converged Ethernet (RoCE) for SR-IOV adapters. Support added for SMS menu to enhance the I/O information option to have "vscsi" and "network" options. The information shown for "vscsi" devices is similar to that provided for SAS and Fibre Channel devices. The "network" option provides connectivity information for the adapter ports and shows which can be used for network boots and installs. Support added to monitor the thermal sensors on the NVMe SSD drives (feature codes #EC5J, #EC5K, #EC5L) and use that information to adjust the speed of the system fans for improved cooling of the SSD drives. Support added to allow integrated USB ports to be disabled. This is available via an Advanced System Management Interface (ASMI) menu option: "System Configuration -> Security -> USB Policy". The USB disable policy, if selected, does not apply to pluggable USB adapters plugged into PCIe slots such as the 4-Port USB adapter (#EC45/#EC46), which are always enabled. System firmware changes that affect all systems A problem was fixed for a system IPLing with an invalid time set on the service processor that causes partitions to be reset to the Epoch date of 01/01/1970. With the fix, on the IPL, the hypervisor logs a B700120x when the service processor real time clock is found to be invalid and halts the IPL to allow the time and date to be corrected by the user. The Advanced System Management Interface (ASMI) can be used to correct the time and date on the service processor. On the next IPL, if the time and date have not been corrected, the hypervisor will log a SRC B7001224 (indicating the user was warned on the last IPL) but allow the partitions to start, but the time and date will be set to the Epoch value. A problem was fixed for a possible boot failure from a ISO/IEC 13346 formatted image, also known as Universal Disk Format (UDF). UDF is a profile of the specification known as ISO/IEC 13346 and is an open vendor-neutral file system for computer data storage for a broad range of media such as DVDs and newer optical disc formats. The failure is infrequent and depends on the image. In rare cases, the boot code erroneously fails to find a file in the current directory. If the boot fails on a specific image, the boot of that image will always fail without the fix. A problem was fixed for broadcast bootp installs or boots that fail with a UDP checksum error. A problem was fixed for failing to boot from an AIX mksysb backup on a USB RDX drive with SRCs logged of BA210012, AA06000D, and BA090010. The boot error does not occur if a serial console is used to navigate the SMS menus.

VM920

VM920 For Impact, Severity and other Firmware definitions, Please refer to the below 'Glossary of firmware terms' url: https://www.ibm.com/support/pages/node/6555136
VM920_123_101 / FW920.60 06/30/20	Impact: Availability Severity: HIPER New features and functions Support was added for real-time data capture for PCIe3 expansion drawer (#EMX0) cable card connection data via resource dump selector on the HMC or in ASMI on the service processor. Using the resource selector string of "xmfr -dumpccdata" will non-disruptively generate an RSCDUMP type of dump file that has the current cable card data, including data from cables and the retimers. System firmware changes that affect all systems HIPER/Pervasive: A problem was fixed for an HMC "Incomplete" state for a system after the HMC user password is changed with ASMI on the service processor. This problem can occur if the HMC password is changed on the service processor but not also on the HMC, and a reset of the service processor happens. With the fix, the HMC will get the needed "failed authentication" error so that the user knows to update the old password on the HMC. DEFERRED: A problem was fixed for a processor core failure with SRCs B150BA3C and BC8A090F logged that deconfigures the entire processor for the current IPL. A re-IPL of the system will recover the lost processor with only the bad core guarded. A problem was fixed for an intermittent IPMI core dump on the service processor. This occurs only rarely when multiple IPMI sessions are starting and cleaning up at the same time. It is possible for the new IPMI session to fail initialization when one of its session objects is cleaned up. The circumvention is to retry the IPMI command that failed. A rare problem was fixed for a checkstop during an IPL that fails to isolate and guard the problem core. An SRC is logged with B1xxE5xx and an extended hex word 8 xxxxDD90. With the fix, the failing hardware is guarded and a node is possibly deconfigured to allow the subsequent IPLs of the system to be successful. A problem was fixed for a B7006A96 fanout module FPGA corruption error that can occur in unsupported PCIe3 expansion drawer(#EMX0) configurations that mix an enhanced PCIe3 fanout module (#EMXH) in the same drawer with legacy PCIe3 fanout modules (#EMXF, #EMXG, #ELMF, or #ELMG). This causes the FPGA on the enhanced #EMXH to be updated with the legacy firmware and it becomes a non-working and unusable fanout module. With the fix, the unsupported #EMX0 configurations are detected and handled gracefully without harm to the FPGA on the enhanced fanout modules. A problem was fixed for system memory not returned after create and delete of partitions, resulting in slightly less memory available after configuration changes in the systems. With the fix, an IPL of the system will recover any of the memory that was orphaned by the issue. A problem was fixed to allow quicker recovery of PCIe links for the #EMXO PCIe expansion drawer for a run time fault with B7006A22 logged. The time for recovery attempts can exceed six minutes on rare occasions which may cause I/O adapter failures and failed nodes. With the fix, the PCIe links will recover or fail faster (in the order of seconds) so that redundancy in a cluster configuration can be used with failure detection and failover processing by other hosts, if available, in the case where the PCIe links fail to recover. A problem was fixed for certain large I/O adapter configurations having the PCI link information truncated on the PCI-E topology display shown with ASMI and the HMC. Because of the truncation, individual adapters may be missing on the PCI-E topology screens. A problem was fixed for extraneous B400FF01 and B400FF02 SRCs logged when moving cables on SR-IOV adapters. This is an infrequent error that can occur if the HMC performance monitor is running at the same time the cables are moved. These SRCs can be ignored when accompanied by cable movement. A problem was fixed for certain SR-IOV adapters that can have an adapter reset after a mailbox command timeout error. This fix updates the adapter firmware to 11.2.211.39 for the following Feature Codes and CCINs: #EN15/#EN16 with CCIN 2CE3, #EN17/#EN18 with CCIN 2CE4, #EN0H/#EN0J with CCIN 2B93, #EN0M/#EN0N with CCIN 2CC0, and #EN0K/#EN0L with CCIN 2CC1. The SR-IOV adapter firmware level update for the shared-mode adapters happens under user control to prevent unexpected temporary outages on the adapters. A system reboot will update all SR-IOV shared-mode adapters with the new firmware level. In addition, when an adapter is first set to SR-IOV shared mode, the adapter firmware is updated to the latest level available with the system firmware (and it is also updated automatically during maintenance operations, such as when the adapter is stopped or replaced). And lastly, selective manual updates of the SR-IOV adapters can be performed using the Hardware Management Console (HMC). To selectively update the adapter firmware, follow the steps given at the IBM Knowledge Center for using HMC to make the updates: https://www.ibm.com/support/knowledgecenter/en/POWER9/p9efd/p9efd_updating_sriov_firmware.htm Note: Adapters that are capable of running in SR-IOV mode, but are currently running in dedicated mode and assigned to a partition, can be updated concurrently either by the OS that owns the adapter or the managing HMC (if OS is AIX or VIOS and RMC is running). A problem was fixed for SR-IOV adapters having an SRC B400FF04 logged when a VF is reset. This is an infrequent issue and can occur for a Live Partition Mobility migration of a partition or during vNIC (Virtual Network Interface Controller) failovers where many resets of VFs are occurring. This error is recovered automatically with no impact on the system. A problem was fixed for certain SR-IOV adapters that do not support the "Disable Logical Port" option from the HMC but the HMC was allowing the user to select this, causing incorrect operation. The invalid state of the logical port causes an "Enable Logical Port" to fail in a subsequent operation. With the fix, the HMC provides the message that the "Disable Logical Port" is not supported for the adapter. This affects the adapters with the following Feature Codes and CCINs: #EN15/#EN16 with CCIN 2CE3, #EN17/#EN18 with CCIN 2CE4, #EN0H/#EN0J with CCIN 2B93, #EN0M/#EN0N with CCIN 2CC0, and #EN0K/#EN0L with CCIN 2CC1. A problem was fixed for an IPL failure with the following possible SRCs logged: 11007611, 110076x1, 1100D00C, and 110015xx. The service processor may reset/reload for this intermittent error and end up in the termination state. A problem was fixed for Trusted Platform Module (TPM) hardware failures not causing SRCs to logged with a call out if the system is configured in ASMI to not require TPM for the IPL. If this error occurs, the user would not find out about it until they needed to run with TPM on the IPL. With the fix, the error logs and notifications will occur regardless of how the TPM is configured. A problem was fixed for an intermittent IPL failure calling out the system planar. There is no hardware error here and another IPL of the system should be successful. With the fix, a corrupt error message in hostboot from a prior power off of the system is not allowed to be processed, so the next IPL of the system is not adversely affected. A problem was fixed for an intermittent IPL failure with SRC B181E540 logged with fault signature " ex(n2p1c0) (L2FIR[13]) NCU Powerbus data timeout". No FRU is called out. The error may be ignored and the re-IPL is successful. The error occurs very infrequently. This is the second iteration of the fix that has been released. Expedient routing of the Powerbus interrupts did not occur in all cases in the prior fix, so the timeout problem was still occurring. A problem was fixed for the REST/Redfish interface to change the success return code for object creation from "200" to "201". The "200" status code means that the request was received and understood and is being processed. A "201" status code indicates that a request was successful and, as a result, a resource has been created. The Redfish Ruby Client, "redfish_client" may fail a transaction if a "200" status code is returned when "201" is expected. A problem was fixed for a hypervisor error during system shutdown where a B7000602 SRC is logged and the system may also briefly go "Incomplete" on the HMC but the shutdown is successful. The system will power back on with no problems so the SRC can be ignored if it occurred during a shutdown. A problem was fixed for utilization statistics for commands such as HMC lslparutil and third-party lpar2rrd that do not accurately represent CPU utilization. The values are incorrect every time for a partition that is migrated with Live Partition Mobility (LPM). Power Enterprise Pools 2.0 is not affected by this problem. If this problem has occurred, here are three possible recovery options: 1) Re-IPL the target system of the migration. 2) Or delete and recreate the partition on the target system. 3) Or perform an inactive migration of the partition. The cycle values get zeroed in this case.
VM920_118_101 / FW920.50 11/21/19	Impact: Availability Severity: HIPER System firmware changes that affect all systems HIPER/Pervasive: A problem was fixed for a possible system crash and HMC "Incomplete" state when a logical partition (LPAR) power off after a dynamic LPAR (DLPAR) operation fails for a PCIe adapter. This scenario is likely to occur during concurrent maintenance of PCIe adapters or for #EMX0 components such as PCIe3 Cable adapters, Active Optical or copper cables, fanout modules, chassis management cards, or midplanes. The DLPAR fail can leave page table mappings active for the adapter, causing the problems on the power down of the LPAR. If the system does not crash, the DLPAR will fail if it is retried until a platform IPL is performed. A problem was fixed for PLL unlock error with SRC B124E504 causing a secondary error of PRD Internal Firmware Software Fault with SRC B181E580 and incorrect FRU call outs. A problem was fixed for a Operations Panel hang after using it set LAN Console as the console type for several iterations. After several iterations, the op panel may hang with "Function 41" displayed on the op panel. A hot unplug and plug of the op panel can be used to recover it from the hang. A problem was fixed for Novalink failing to activate partitions that have names with character lengths near the maximum allowed character length. This problem can be circumvented by changing the partition name to have 32 characters or less. A problem was fixed for a possible system crash with SRC B7000103 if the HMC session is closed while the performance monitor is active. As a circumvention for this problem, make sure the performance monitor is turned off before closing the HMC sessions. A problem was fixed a Live Partition Mobility (LPM) migration of a large memory partition to a target system that causes the target system to crash and for the HMC to go to the "Incomplete" state. For servers with the default LMB size (256MB), if partition is >=16TB and if desired memory is different than the maximum memory, LPM may fail on the target system. Servers with LMB sizes less than the default could hit this problem with smaller memory partition sizes. A circumvention to the problem is to set the desired and maximum memory to the same value for the large memory partition that is to be migrated. A problem was fixed for an intermittent IPL failure with SRC B181E540 logged with fault signature " ex(n2p1c0) (L2FIR[13]) NCU Powerbus data timeout". No FRU is called out. The error may be ignored and the re-IPL is successful. The error occurs very infrequently. A problem was fixed for a rare IPL failure with SRCs BC8A090F and BC702214 logged caused by an overflow of VPD repair data for the processor cores. A re-IPL of the system should recover from this problem. A problem was fixed for an IPL failure after installing DIMMs of different sizes, causing memory access errors. Without the fix, the memory configuration should be restored to only use DIMMs of the same size. A problem was fixed for certain SR-IOV adapters with the following issues: 1) If the SR-IOV logical port's VLAN ID (PVID) is modified while the logical port is configured, the adapter will use an incorrect PVID for the Virtual Function (VF). This problem is rare because most users do not change the PVID once the logical port is configured, so they will not have the problem. 2) Adapters failing with error1=00007410 and error2=00000000. This fix updates the adapter firmware to 11.2.211.38 for the following Feature Codes and CCINs: #EN15/EN16 with CCIN 2CE3, #EN17/EN18 with CCIN 2CE4, #EN0H/EN0J with CCIN 2B93, #EN0M/EN0N with CCIN 2CC0, and #EN0K/EN0L with CCIN 2CC1. The SR-IOV adapter firmware level update for the shared-mode adapters happens under user control to prevent unexpected temporary outages on the adapters. A system reboot will update all SR-IOV shared-mode adapters with the new firmware level. In addition, when an adapter is first set to SR-IOV shared mode, the adapter firmware is updated to the latest level available with the system firmware (and it is also updated automatically during maintenance operations, such as when the adapter is stopped or replaced). And lastly, selective manual updates of the SR-IOV adapters can be performed using the Hardware Management Console (HMC). To selectively update the adapter firmware, follow the steps given at the IBM Knowledge Center for using HMC to make the updates: https://www.ibm.com/support/knowledgecenter/en/POWER9/p9efd/p9efd_updating_sriov_firmware.htm. Note: Adapters that are capable of running in SR-IOV mode, but are currently running in dedicated mode and assigned to a partition, can be updated concurrently either by the OS that owns the adapter or the managing HMC (if OS is AIX or VIOS and RMC is running). A problem was fixed for a memory DIMM plugging rule violation that causes the IPL to terminate with an error log with RC_GET_MEM_VPD_UNSUPPORTED_CONFIG IPL that calls out the memory port but has no DIMM call outs and no DIMM deconfigurations are done. With the fix, the DIMMs that violate the plugging rules will be deconfigured and the IPL will complete. Without the fix, the memory configuration should be restored to the prior working configuration to allow the IPL to be successful. A problem was fixed for an initialization failure of certain SR-IOV adapters when changed into SR-IOV mode. This is an infrequent problem that most likely can occur following a concurrent firmware update when the adapter also needs to be updated. This problem affects the SR-IOV adapter with the following feature codes and CCINs: #EC2R/EC2S with CCIN 58FA; #EC2T/EC2U with CCIN 58FB; #EC3L/EC3M with CCIN 2CEC; and #EC66/EC67 with CCIN 2CF3. This problem can be recovered by removing the adapter from SR-IOV mode and putting it back in SR-IOV mode, or the system can be re-IPLed. A problem was fixed for lost interrupts that could cause device time-outs or delays in dispatching a program process. This can occur during memory operations that require a memory relocation for any partition such as mirrored memory defragmentation done by the HMC optmem command, or memory guarding that happens as part of memory error recovery during normal operations of the system. A problem was fixed for delayed interrupts on a Power9 system following a Live Partition Mobility operation from a Power7 or Power8 system. The delayed interrupts could cause device time-outs, program dispatching delays, or other device problems on the target Power9 system. A problem was fixed for processor cores not being able to be used by dedicated processor partitions if they were DLPAR removed from a dedicated processor partition. This error can occur if there was a firmware assisted dump or a Live Partition Mobility (LPM) operation after the DLPAR of the processor. A re-IPL of the system will recover the processor cores. A problem was fixed where a Linux or AIX partition type was incorrectly reported as unknown. Symptoms include: IBM Cloud Management Console (CMC) not being able to determine the RPA partition type (Linux/AIX) for partitions that are not active; and HMC attempts to dynamically add CPU to Linux partitions may fail with a HSCL1528 error message stating that there are not enough Integrated Facility for Linux ( IFL) cores for the operation. On systems with 16GB huge-pages, a problem was fixed for certain SR-IOV adapters with all or nearly all memory assigned to them preventing a system IPL. This affects the SR-IOV adapters with the following feature codes and CCINs: #EC2R/EC2S with CCIN 58FA; #EC2T/EC2U with CCIN 58FB; #EC3L/EC3M with CCIN 2CEC; and #EC66/EC67 with CCIN 2CF3. The problem can be circumvented by powering off the system and turning off all the huge-page allocations. A problem was fixed for a SR-IOV adapter failure with B400FFxx errors logged when moving the adapter to shared mode. This is an infrequent race condition where the adapter is not yet ready for commands and it can also occur during EEH error recovery for the adapter. This affects the SR-IOV adapters with the following feature codes and CCINs: #EC2R/EC2S with CCIN 58FA; #EC2T/EC2U with CCIN 58FB; #EC3L/EC3M with CCIN 2CEC; and #EC66/EC67 with CCIN 2CF3.
VM920_112_101 / FW920.40 08/06/19	Impact: Data Severity: HIPER New features and functions An option was added to the SMS Remote IPL (RIPL) menus to enable or disable the UDP checksum calculation for any device type. Previously, this checksum option was only available for logical LAN devices but now it extended to all types. The default is for the UDP checksum calculation to be done, but if this calculation causes errors for the device, it can be turned off with the new option. System firmware changes that affect all systems HIPER/Pervasive: A change was made to fix an intermittent processor anomaly that may result in issues such as operating system or hypervisor termination, application segmentation fault, hang, or undetected data corruption. The only issues observed to date have been operating system or hypervisor terminations. DEFERRED:PARTITION_DEFERRED: A problem was fixed for repeated CPU DLPAR remove operations by Linux (Ubuntu, SUSE, or RHEL) OSes possibly resulting in a partition crash. No specific SRCs or error logs are reported. The problem can occur on any DLPAR CPU remove operation if running on Linux. The occurrence is intermittent and rare. The partition crash may result in one or more of the following console messages (in no particular order): 1) Bad kernel stack pointer addr1 at addr2 2) Oops: Bad kernel stack pointer 3) ***** RTAS CALL BUFFER CORRUPTION ***** 4) ERROR: Token not supported This fix does not activate until there is a reboot of the partition. A problem was fixed for a concurrent firmware update failure with SRC B7000AFF logged. This is a rare problem triggered by a power mode change preceding a concurrent firmware update. To recover from this problem, run the code update again without any power mode changes. A problem was fixed for an IPMI core dump and SRC B181720D logged, causing the service processor to reset due to a low memory condition. The memory loss is triggered by frequently using the ipmitool to read the network configuration. The service processor recovers from this error but if three of these errors occur within a 15 minute time span, the service processor will go to a failed hung state with SRC B1817212 logged. Should a service processor hang occur, OS workloads will continue to run but it will not be possible for the HMC to interact with the partitions. This service processor hung state can be recovered by doing a re-IPL of the system with a scheduled outage. A problem was fixed for informational logs flooding the error log if a "Get Sensor Reading" is not working. A problem was fixed for a concurrent firmware hang with SRC B1813450 logged. This is a rare problem triggered by an error or power mode change that requires a Power Management (PM) Complex Reset. To recover from this problem, re-IPL the system and it will be running at the target firmware update level. A problem was fixed for a concurrent replace of the base Operator Panel with the LCD Operator Panel with feature code #EU0B that could result in the following errors if the replace takes longer than four minutes: 1) SRCs B1504805 and B1504804 logged against the Operator Panel. 2) The ambient temperature sensors in the system will be considered faulted by firmware. As a result, firmware will not automatically shut the system down due to high ambient temperature (EPOW). The system can be recovered with a reset of the service processor or a power down/ power up of the system. The system can also be recovered by removing the LCD Operator Panel for at least two minutes and then plugging it back in. A problem was fixed for shared processor pools where uncapped shared processor partitions placed in a pool may not be able to consume all available processor cycles. The problem may occur when the sum of the allocated processing units for the pool member partitions equals the maximum processing units of the pool. A problem was fixed for an outage of I/O connected to a single PCIe Host Bridge (PHB) with a B7006970 SRC logged. With the fix, the rare PHB fault will have an EEH event detected and recovered by firmware. A problem was fixed for partitions becoming unresponsive or the HMC not being able to communicate with the system after a processor configuration change or a partition power on and off. A problem was fixed for a concurrent firmware update error with SRC B7000AFF logged. This is a rare problem triggered by an error or power mode change that requires a Power Management (PM) Complex Reset. To recover from this problem, re-IPL the system and it will be running at the target firmware update level. A problem was fixed for possible abnormal terminations of programs on partitions running in POWER7 or POWER8 compatibility mode. A problem was fixed for a hypervisor hang that can occur on the target side when doing a Live Partition Mobility (LPM) migration from a system that does not support encryption and compression of LPM data. If the hang occurs, the HMC will go to an "Incomplete" state for the target system. The problem is rare because the data from the source partition must be in a very specific pattern to cause the fail. When the failure occurs, a B182951C will be logged on the target (destination) system and the HMC for the source partition will issue the following message: "HSCLA318 The migration command issued to the destination management console failed with the following error: HSCLA228 The requested operation cannot be performed because the managed system <system identifier> is not in the Standby or Operating state.". To recover, the target system must be re-IPLed. A problem was fixed for an initialization failure of an SR-IOV adapter port during its boot, causing a B400FF02 SRC to be logged. This is a rare problem and it recovers automatically by the reboot of the adapter on the error. A problem was fixed for SR-IOV adapter Virtual Functions (VFs) that can fail to restore to their configuration after a low-level EEH error, causing loss of function for the adapter. This problem can occur if the other than the default NIC VF configuration was selected when the VF was created. The problem will occur all the time for VFs configured as RDMA over Converged Ethernet (RoCE) but much less frequent and intermittent for other non-default VF configurations. A problem was fixed which caused network traffic failures for Virtual Functions (VFs) operating in non-promiscuous multicast mode. In non-promiscuous mode, when a VF receives a frame, it will drop it unless the frame is addressed to the VF's MAC address, or is a broadcast or multicast addressed frame. With the problem, the VF drops the frame even though it is multicast, thereby blocking the network traffic, which can result in ping failures and impact other network operations. To recover from the issue, turn multicast promiscuous on. This may cause some unwanted multicast traffic to flow to the partition. A problem was fixed for a boot failure using a N_PORT ID Virtualization (NPIV) LUN for an operating system that is installed on a disk of 2 TB or greater, and having a device driver for the disk that adheres to a non-zero allocation length requirement for the "READ CAPACITY 16". The IBM partition firmware had always used an invalid zero allocation length for the return of data and that had been accepted by previous device drivers. Now some of the newer device drivers are adhering to the specification and needing an allocation length of non-zero to allow the boot to proceed. A problem was fixed for a possible boot failure from a ISO/IEC 13346 formatted image, also known as Universal Disk Format (UDF). UDF is a profile of the specification known as ISO/IEC 13346 and is an open vendor-neutral file system for computer data storage for a broad range of media such as DVDs and newer optical disc formats. The failure is infrequent and depends on the image. In rare cases, the boot code erroneously fails to find a file in the current directory. If the boot fails on a specific image, the boot of that image will always fail without the fix. A problem was fixed for broadcast bootp installs or boots that fail with a UDP checksum error. A problem was fixed for failing to boot from an AIX mksysb backup on a USB RDX drive with SRCs logged of BA210012, AA06000D, and BA090010. The boot error does not occur if a serial console is used to navigate the SMS menus. A problem was fixed for possible loss of mainstore memory dump data for system termination errors. A problem was fixed for an intermittent IPL failure with B181345A, B150BA22, BC131705, BC8A1705, or BC81703 logged with a processor core called out. This is a rare error and does not have a real hardware fault, so the processor core can be unguarded and used again on the next IPL. A problem was fixed for two false UE SRCs of B1815285 and B1702A03 possibly being logged on the first IPL of a 2-node system. A VPD timing error can cause a 2-node system to be misread as a 4-node, causing the false SRCs. This can only occur on the first IPL of the system. A problem was fixed for a processor core fault in the early stages of the IPL that causes the service processor to terminate. With the fix, the system is reconfigured to remove the bad core and the system is IPLed with the remaining processor cores. A problem was fixed for a drift in the system time (time lags and the clock runs slower than the true value of time) that occurs when the system is powered off to the service processor standby state. To recover from this problem, the system time must be manually corrected using the Advanced System Management Interface (ASMI) before powering on the system. The time lag increases in proportion to the duration of time that the system is powered off. A problem was fixed for hypervisor tasks getting deadlocked that cause the hypervisor to be unresponsive to the HMC ( this shows as an incomplete state on the HMC) with SRC B200F011 logged. This is a rare timing error. With this problem, OS workloads will continue to run but it will not be possible for the HMC to interact with the partitions. This error can be recovered by doing a re-IPL of the system with a scheduled outage. A problem was fixed for eight or more simultaneous Live Partition Mobility (LPM) migrations to the same system possibly failing in validation with the HMC error message of "HSCL0273 A command that was targeted to the managed system has timed out". The problem can be circumvented by doing the LPM migrations to the same system in smaller batches. A problem was fixed for a system IPLing with an invalid time set on the service processor that causes partitions to be reset to the Epoch date of 01/01/1970. With the fix, on the IPL, the hypervisor logs a B700120x when the service processor real time clock is found to be invalid and halts the IPL to allow the time and date to be corrected by the user. The Advanced System Management Interface (ASMI) can be used to correct the time and date on the service processor. On the next IPL, if the time and date have not been corrected, the hypervisor will log a SRC B7001224 (indicating the user was warned on the last IPL) but allow the partitions to start, but the time and date will be set to the Epoch value. A problem was fixed for the Advanced System Management Interface (ASMI) menu for "PCIe Hardware Topology/Reset link" showing the wrong value. This value is always wrong without the fix. A problem was fixed for SR-IOV adapters to provide a consistent Informational message level for cable plugging issues. For transceivers not plugged on certain SR-IOV adapters, an unrecoverable error (UE) SRC B400FF03 was changed to an Informational message logged. This affects the SR-IOV adapters with the following feature codes: EC2S, EC2U, and EC3M. For copper cables unplugged on certain SR-IOV adapters, a missing message was replaced with an Informational message logged. This affects the SR-IOV adapters with the following feature codes: EN17, EN0K, EN0L, EL3C, and EL57. A problem was fixed for a drift in the system time (time lags and the clock runs slower than the true value of time) that occurs when the system is powered off to the service processor standby state. To recover from this problem, the system time must be manually corrected using the Advanced System Management Interface (ASMI) before powering on the system. The time lag increases in proportion to the duration of time that the system is powered off. A problem was fixed for incorrect Centaur DIMM callouts for DIMM over temperature errors. The error log for the DIMM over temperature will have incorrect FRU callouts, either calling out the wrong DIMM or the wrong Centaur memory buffer. System firmware changes that affect certain systems On systems with PCIe3 expansion drawers(feature code #EMX0), a problem was fixed for a concurrent exchange of a PCIe expansion drawer cable card, although successful, leaves the fault LED turned on. On systems using Utility COD, a problem was fixed for "Shared Processor Utilization Data" showing a too-large number of Non-Utility processors, much more than even installed. This incorrect information can prevent the billing for the use of the Utility Processors.
VM920_101_101 / FW920.30 03/08/19	Impact: Data Severity: HIPER New features and functions Support was added to allow 3-socket processor configurations for the system. Previously, there had to be a minimum of two sockets and a maximum of 4 sockets but 3 socket configurations were not supported. The Operations Panel was enhanced to display "Disruptive" warning for control panel operations that would disturb a running system. For example, control panel function "03" is used to re-IPL the system and would get the warning message to alert the operator that the system could be impacted. A new SRC of B7006A74 was added for PHB LEM 62 errors that had surpassed a threshold in the path of the #EMX0 expansion drawer. This replaces the SRC B7006A72 to have a correct callout list. Without the feature, when B7006A72 is logged against a PCIe slot in the CEC containing a cable card, the FRUs in the full #EMX0 expansion drawer path should be considered (use the B7006A8B FRU callout list as a reference). System firmware changes that affect all systems HIPER/Pervasive: DISRUPTIVE: A problem was fixed where, under certain conditions, a Power Management Reset (PM Reset) event may result in undetected data corruption. PM Resets occur under various scenarios such as power management mode changes between Dynamic Performance and Maximum Performance, Concurrent FW updates, power management controller recovery procedures, or system boot. DEFERRED: A problem was fixed for I/O adapters that use LSI (Level Sensitive Interrupts) not functioning in slot C6. This problem can be avoided by moving the adapter to a direct PCIe slot. The system must be re-IPLed to activate this fix. DEFERRED: A problem with slower than expected L2 cache memory update response was fixed to improve system performance for some workloads. The slowdown was triggered by many concurrent processor threads trying to update the L2 cache memory atomicallly with a Power LARX/STCX instruction sequence. Without the fix, the rate that the system could do these atomic updates was slower than the normal L2 cache response which could cause the system overall performance to decrease. This problem could be noticed for workloads that are cache bound (where speed of cache access is an important factor in determining the speed at which the program gets executed). For example, if the most visited part of a program is a small section of code inside a loop small enough to be contained within the cache, then the program may be cache bound. A problem was fixed for not being able to concurrently add the PCIe to USB conversion card with CCIN 6B6C. The Vital Product Data (VPD )for the new FRU is not updated into the system, so the added part is not functional until the system is re-IPLed. A problem was fixed for a system mis-configured with a mix of DDR3 and DDR4 DIMMs in the same node failing without callouts for the problem DIMMs. The system fails with SRC B181BAD4. With fix, the IPL will still fail but the SRC provides a list of the problem DIMMs so they can be guarded or physically removed. A problem was fixed for failed hardware such as a clock card causing the service processor to have slow performance. This might be seen if a hardware problem occurs and the service processor appears to be hanging while error logs are collected. A problem was fixed for an IPL failing with B7000103 if there is an error in a PCIe Hub (PHB). With the fix, the IPL is allowed to complete but there may be failed I/O adapters if the errant PHB is populated with PCIe adapters. A problem was fixed for hypervisor task getting deadlocked if partitions are powered on at the same time that SR-IOV is being configured for an adapter. With this problem, workloads will continue to run but it will not be possible to change the virtualization configuration or power partitions on and off. This error can be recovered by doing a re-IPL of the system. A problem was fixed for I/O adapters not recovering from low-level EEH errors, resulting in a Permanent EEH error with SRC B7006971 logged. These errors can occur during memory relocation in parallel with heavy I/O traffic, The affected adapters can be recovered by a re-IPL of the system. A problem was fixed for the an unexpected Core Watchdog error during a reset of the service processor with a SRC B150B901 logged . With enough service processor resets in a row, it is possible for the service processor to go to a failed state with SRC B1817212 on systems with a single service processor. On systems with redundant service processors, the failed service processor would get guarded with a B151E6D0 or B152E6D0 SRC depending on which service processor fails. The hypervisor and the partition workloads would continue to run in these cases of failed service processors. A problem was fixed for an intermittent IPL failure with BC131705 and BC8A1703 logged with a processor core called out. This is a rare error and does not have a real hardware fault, so the processor core can be unguarded and used again on the next IPL. A problem was fixed for DDR4 2933 MHZ and 3200 MHZ DIMMs not defaulting to the 2666 MHZ speed on a new DIMM plug, thus preventing the system from IPLing. A problem was fixed for a PCIe Hub checkstop with SRC B138E504 logged that fails to guard the errant processor chip. With the fix, the problem hardware FRU is guarded so there is not a recurrence of the error on the next IPL. A problem was fixed for a VRM error for a Self Boot Engine (SBE) that caused the system to go to terminate state after the error rather than re-IPLing to run-time. A re-IPL will recover the system. A problem was fixed for an IPMI core dump and SRC B1818601 logged intermittently when an IPMI session is closed. A flood of B1818A03 SRCs may be logged after the error occurs. The IPMI server is not impacted and a call home is reported for the problem. There is no service outage for the IPMI users because of this. A problem was fixed for a boot device hang, leading to a long time-out condition before the service processor gives up. This problem has a very low frequency and a re-IPL is normally successful to recover the system. A problem was fixed for DIMM row repairs for 8Gb and 16Gb DIMMs to allow the ECC spare repair to be used. This problem does not affect all the memory in the DIMM, just the memory in the first rank position. A problem was fixed for deconfigured FRUs that showed as Unit Type of "Unknown" in the Advanced System Management Interface (ASMI). The following FRU type names will be displayed if deconfigured (shown here is a description of the FRU type as well): DMI: Processor to Memory Buffer Interface MC: Memory Controller MFREFCLK: Multi Function Reference Clock MFREFCLKENDPT: Muti function reference clock end point MI: Processor to Memory Buffer Interface NPU: Nvidia Processing Unit OBUS_BRICK: OBUS SYSREFCLKENDPT: System reference clock end point TPM: Trusted Platform Module A problem was fixed for insufficient fan speeds for PCIe cards that require additional cooling. To circumvent this problem for systems that have the high performance PCIe cards, disable Idle Power Saver mode and ensure system power mode is set to Nominal, Dynamic Performance, or Maximum Performance. With the fix, If an adapter is known to require higher levels of cooling, the system automatically speeds up fans to increase airflow across the PCIe adapters. The affected adapters that need the additional cooling are the following PCIe SAS adapters with feature codes EJ0J, EJ0K, EJ0L, EJ10, EJ14, EJIN, and EJIP. A problem was fixed for shared processor partitions going unresponsive after changing the processor sharing mode of a dedicated processor partition from "allow when partition is active" to either "allow when partition is inactive" or "never". This problem can be circumvented by avoiding disabling processor sharing when active on a dedicated processor partition. To recovery the partition if the issue has been encountered, enable "processor sharing when active" for the partition. A problem was fixed for hypervisor error logs issued during the IPL missing the firmware version. This happens on every IPL for logs generated during the early part of the IPL. A problem was fixed for a continuous logging of B7006A28 SRCs after the threshold limit of PCIe Advanced Error Reporting (AER) correctable errors. The error log flooding can cause error buffer wrapping and other performance issues. A problem was fixed for an error in deleting a partition with the virtualized Trusted Platform Module (vTPM) enabled and SRC B7000602 logged. When this error occurs, the encryption process in the hypervisor may become unusable. The problem can be recovered from with a re-IPL of the system. A problem was fixed in Live Partition Mobility (LPM) of a partition to a shared processor pool, which results in the partition being unable to consume uncapped cycles on the target system. To prevent the issue from occurring, partitions can be migrated to the default shared processor pool and then dynamically moved to the desired shared processor pool. To recover from the issue, use DLPAR to add or remove a virtual processor to/from the affected partition, dynamically move the partition between shared processor pools, reboot the partition, or re-IPL the system. A problem was fixed for informational (INF) errors for the PCIe Hub (PHB) at a threshold limit causing the I/O slots to go non-operational. The system I/O can be recovered with a re-IPL. A problem was fixed for the HMC in some instances reporting a VIOS partition as an AIX partition. The VIOS partition can be used correctly even when it is misidentified. A problem was fixed for errors in the PHB performance counters collected by the 24x7 performance monitor. A problem was fixed for certain SR-IOV adapters where SRC B400FF01 errors are seen during configuration of the adapter into SR-IOV mode or updating adapter firmware. This fix updates the adapter firmware to 11.2.211.37 for the following Feature Codes: EN15, EN17, EN0H, EN0J, EN0M, EN0N, EN0K, EN0L, EL38, EL3C, EL56, and EL57. The SR-IOV adapter firmware level update for the shared-mode adapters happens under user control to prevent unexpected temporary outages on the adapters. A system reboot will update all SR-IOV shared-mode adapters with the new firmware level. In addition, when an adapter is first set to SR-IOV shared mode, the adapter firmware is updated to the latest level available with the system firmware (and it is also updated automatically during maintenance operations, such as when the adapter is stopped or replaced). And lastly, selective manual updates of the SR-IOV adapters can be performed using the Hardware Management Console (HMC). To selectively update the adapter firmware, follow the steps given at the IBM Knowledge Center for using HMC to make the updates: https://www.ibm.com/support/knowledgecenter/en/POWER9/p9efd/p9efd_updating_sriov_firmware.htm. Note: Adapters that are capable of running in SR-IOV mode, but are currently running in dedicated mode and assigned to a partition, can be updated concurrently either by the OS that owns the adapter or the managing HMC (if OS is AIX or VIOS and RMC is running). A problem was fixed for a system terminating if there was even one predictive or recoverable SRC. For this problem, all hardware SRCs logged are treated as terminating SRCs. For this behavior to occur, the initial service processor boot from the AC power off state failed to complete cleanly, instead triggering an internal reset (a rare error), leaving some parts of the service processor not initialized. This problem can be recovered by doing an AC power cycle, or concurrently on an active system with the assistance of IBM support. A security problem was fixed in the service processor OpenSSL support that could cause secured sockets to hang, disrupting HMC communications for system management and partition operations. The Common Vulnerabilities and Exposures issue number is CVE-2018-0732. A security problem was fixed in the service processor Network Security Services (NSS) services which, with a man-in-the-middle attack, could provide false completion or errant network transactions or exposure of sensitive data from intercepted SSL connections to ASMI, Redfish, or the service processor message server. The Common Vulnerabilities and Exposures issue number is CVE-2018-12384. A problem was fixed for IPMI sessions in the service processor causing a flood of B181A803 informational error logs on registry read fails for IPv6 and IPv4 keywords. These error logs do not represent a real problem and may be ignored. A security problem was fixed in the service processor TCP stack that would allow a Denial of Service (DOS) attack with TCP packets modified to trigger time and calculation expensive calls. By sending specially modified packets within ongoing TCP sessions with the Management Consoles, this could lead to a CPU saturation and possible reset and termination of the service processor. The Common Vulnerabilities and Exposures issue number is CVE-2018-5390. A security problem was fixed in the service processor TCP stack that would allow a Denial of Service (DOS) attack by allowing very large IP fragments to trigger time and calculation expensive calls in packet reassembly. This could lead to a CPU saturation and possible reset and termination of the service processor. The Common Vulnerabilities and Exposures issue number is CVE-2018-5391. With the fix, changes were made to lower the IP fragment thresholds to invalidate the attack.
VM920_089_075 / FW920.24 02/12/19	Impact: Performance Severity: SPE New Features and Functions Support for up to 8 production SAP HANA LPARs and 16 TB of memory. System firmware changes that affect all systems A problem was fixed for a concurrent firmware update that could hang during the firmware activation, resulting in the system entering into Power safe mode. The system can be recovered by doing a re-IPL of the system with a power down and power up. A concurrent remove of this fix to the firmware level FW920.22 will fail with the hang, so moving back to this level should only be done with a disruptive firmware update. A problem was fixed where installing a partition with a NIM server may fail when using an SR-IOV adapter with a Port VLAN ID (PVID) configured. This error is a regression problem introduced in the 11.2.211.32 adapter firmware. This fix reverts the adapter firmware back to 11.2.211.29 for the following Feature Codes: EN15, EN17, EN0H, and EN0K. Because the adapter firmware is reverted to the prior version, all changes included in the 11.2.211.32 are reverted as well. Circumvention options for this problem can be found at the following link: http://www.ibm.com/support/docview.wss?uid=ibm10794153. The SR-IOV adapter firmware level update for the shared-mode adapters happens under user control to prevent unexpected temporary outages on the adapters. A system reboot will update all SR-IOV shared-mode adapters with the new firmware level. In addition, when an adapter is first set to SR-IOV shared mode, the adapter firmware is updated to the latest level available with the system firmware (and it is also updated automatically during maintenance operations, such as when the adapter is stopped or replaced). And lastly, selective manual updates of the SR-IOV adapters can be performed using the Hardware Management Console (HMC). To selectively update the adapter firmware, follow the steps given at the IBM Knowledge Center for using HMC to make the updates: https://www.ibm.com/support/knowledgecenter/en/POWER9/p9efd/p9efd_updating_sriov_firmware.htm. Note: Adapters that are capable of running in SR-IOV mode, but are currently running in dedicated mode and assigned to a partition, can be updated concurrently either by the OS that owns the adapter or the managing HMC (if OS is AIX or VIOS and RMC is running).
VM920_080_075 / FW920.22 12/13/18	Impact: Availability Severity: SPE System firmware changes that affect all systems A problem was fixed for an intermittent IPL failure with SRCs B150BA40 and B181BA24 logged. The system can be recovered by IPLing again. The failure is caused by a memory buffer misalignment, so it represents a transient fault that should occur only rarely. A problem was fixed for intermittent PCIe correctable errors which would eventually threshold and cause SRC B7006A72 to be logged. PCIe performance degradation or temporary loss of one or more PCIe IO slots could also occur resulting in SRCs B7006970 or B7006971.
VM920_078_075 / FW920.21 11/28/18	Impact: Availability Severity: SPE This Service Pack contained updates for MANUFACTURING ONLY.
VM920_075_075 / FW920.20 11/16/18	Impact: Data Severity: HIPER New features and functions Support was enabled for eRepair spare lane deployment for fabric and memory buses. Support was added for doing soft post package memory row repair (sPPR) on the DDR4 DIMMs during the system IPL. The sPPR feature saves on the use of ECC spares for memory recovery, reducing the number of DIMMs that have to be guarded for memory errors. Support was added for Multi-Function clock card failover. System firmware changes that affect all systems HIPER/Non-Pervasive: DISRUPTIVE: Fixes included to address potential scenarios that could result in undetected data corruption, system hangs, or system terminations. DISRUPTIVE: A problem was fixed for PCIe and SAS adapters in slots attached to a PLX (PCIe switch) failing to initialize and not being found by the Operating System. The problem should not occur on the first IPL after an AC power cycle, but subsequent IPLs may experience the problem. DEFERRED: A problem was fixed for a PCIe clock failure in the PCIe3 I/O expansion drawer (feature #EMX0), causing loss of PCIe slots. The system must be re-IPLed for the fix to activate. DEFERRED: A problem was fixed for a possible system hang in the early boot stage. This could occur during periods of very high activity for memory read operations which deplete all read buffers, hanging an internal process that requires a read buffer, With the fix, a congested memory controller can stall the read pipeline to make a read buffer available for the internal processes. DEFERRED: A problem was fixed for concurrent maintenance operations for PCIe expansion drawer cable cards and PCI adapters that could cause loss of system hardware information in the hypervisor with these side effects: 1) partition secure boots could fail with SRC BA540100 logged.; 2) Live Partition Mobility (LPM) migrations could be blocked; 3) SR-IOV adapters could be blocked from going into shared mode; 4) Power Management services could be lost; and 5) warm re-IPLs of the system can fail. The system can be recovered by powering off and then IPLing again. DEFERRED: A problem was fixed for a transient VRM over-current condition for loads on the USB bus that could fail an IPL with SRC 11002700 00002708 logged. The frequency of the failure is about 1 in every 5 IPL attempts. The system can be recovered by doing another IPL. A problem was fixed for an unhelpful error message of "HSCL1473 Cannot execute atomic operation. Atomic operations are not enabled." that is displayed on the HMC if there are no licensed processors available for the boot of a partition. A problem was fixed for a memory channel failure due to a RCD parity error calling out the affected DIMMs correctly, but also falsely calling out either the memory controller or a processor, or both. A problem was fixed for adapters in slots attached to a PLX (PCIe switch) failing with SRCs B7006970 and BA188002 when a second and subsequent errors on the PLX failed to initiate PLX recovery. For this infrequent problem to occur, it requires a second error on the PLX after recovery from the first error. A problem was fixed for the system going into Safe Mode after a run-time deconfiguration of a processor core, resulting in slower performance. For this problem to occur, there must be a second fault in the Power Management complex after the processor core has been deconfigured. A problem was fixed for service processor resets confusing the wakeup state of processor cores, resulting in degraded cores that cannot be managed for power usage. This will result in the system consuming more power, but also running slower due to the inability to make use of WOF optimizations around the cores. The degraded processor cores can be recovered by a re-IPL of the system. A problem was fixed for the On-Chip Controller (OCC) MAX memory bandwidth sensor sometimes having values that are too high. A problem was fixed for DDR4 memory training in the IPL to improve the DDR4 write margin. Lesser write margins can potentially cause memory errors. A problem was fixed for a system failure with SRC B700F103 that can occur if a shared-mode SR-IOV adapter is moved from a high-performance slot to a lower performance slot. This problem can be avoided by disabling shared mode on the SR-IOV adapter; moving the adapter; and then re-enabling shared mode. A problem was fixed for the system going to Safe Mode if all the cores of a processor are lost at run-time. A problem was fixed for a Core Management Engine (CME) fault causing a system failure with SRC B700F105 if processor cores had been guarded during the IPL. A problem was fixed for a Core Management Engine (CME) fault that could result in a system checkstop. A problem was fixed for a missing error log for the case of the TPM card not being detected when it is required for a trusted boot. A problem was fixed for a flood of BC130311 SRCs that could occur when changing Energy Scale Power settings, if the Power Management is in a reset loop because of errors. A problem was fixed for coherent accelerator processor proxy (CAPP) unit errors being called out as CEC hardware Subsystem instead of PROCESSOR_UNIT. A problem was fixed for an incorrect processor callout on a memory channel error that causes a CHIFIR[61] checkstop on the processor. A problem was fixed for a Logical LAN (l-lan) device failing to boot when there is a UDP packet checksum error. With the fix, there is a new option when configuring a l-lan port in SMS to enable or disable the UDP checksum validation. If the adapter is already providing the checksum validation, then the l-lan port needs to have its validation disabled. A problem was fixed for missing error logs for hardware faults if the hypervisor terminates before the faults can be processed. With the fix, the hardware attentions for the bad FRUs will get handled, prior to processing the termination of the hypervisor. A problem was fixed for the diagnostics for a system boot checkstop failing to isolate to the bad FRU if it occurred on a non-master processor or a memory chip connected to a non-master processor. With the fix, the fault attentions from a non-master processor are properly isolated to the failing chip so it can be guarded or recovered as needed to allow the IPL to continue. A problem was fixed for Hostboot error log IDs (EID) getting reused from one IPL to the next, resulting in error logs getting suppressed (missing) for new problems on the subsequent IPLs if they have a re-used EID that was already present in the service processor error logs. A problem was fixed for Live Partition Mobility (LPM) partition migration to preserve the Secure Boot setting on the target partition. Secure Boot is supported in FW920 and later partitions. If the Secure Boot setting is non-zero for the partition, it will zero after the migration. A problem was fixed for an SR-IOV adapter using the wrong Port VLAN ID (PVID) for a logical port (VF) when its non-zero PVID could be changed following a network install using the logical port. This fix updates adapter firmware to 11.2.211.32 for the following Feature Codes: EN15, EN17, EN0H, EN0J, EN0M, EN0N, EN0K, EN0L, EL38, EL3C, EL56, and EL57. The SR-IOV adapter firmware level update for the shared-mode adapters happens under user control to prevent unexpected temporary outages on the adapters. A system reboot will update all SR-IOV shared-mode adapters with the new firmware level. In addition, when an adapter is first set to SR-IOV shared mode, the adapter firmware is updated to the latest level available with the system firmware (and it is also updated automatically during maintenance operations, such as when the adapter is stopped or replaced). And lastly, selective manual updates of the SR-IOV adapters can be performed using the Hardware Management Console (HMC). To selectively update the adapter firmware, follow the steps given at the IBM Knowledge Center for using HMC to make the updates: https://www.ibm.com/support/knowledgecenter/en/POWER9/p9efd/p9efd_updating_sriov_firmware.htm. Note: Adapters that are capable of running in SR-IOV mode, but are currently running in dedicated mode and assigned to a partition, can be updated concurrently either by the OS that owns the adapter or the managing HMC (if OS is AIX or VIOS and RMC is running). A problem was fixed for a SMS ping failure for a SR-IOV adapter VF with a non-zero Port VLAN ID (PVID). This failure may occur after the partition with the adapter has been booted to AIX, and then rebooted back to SMS. Without the fix, residue information from the AIX boot is retained for the VF that should have been cleared. A problem was fixed for a SR-IOV adapter vNIC configuration error that did not provide a proper SRC to help resolve the issue of the boot device not pinging in SMS due to maximum transmission unit (MTU) size mismatch in the configuration. The use of a vNIC backing device does not allow configuring VFs for jumbo frames when the Partition Firmware configuration for the adapter (as specified on the HMC) does not support jumbo frames. When this happens, the vNIC adapter will fail to ping in SMS and thus cannot be used as a boot device. With the fix, the vNIC driver configuration code is now checking the vNIC login (open) return code so it can issue an SRC when the open fails for a MTU issue (such as jumbo frame mismatch) or for some other reason. A jumbo frame is an Ethernet frame with a payload greater than the standard MTU of 1,500 bytes and can be as large as 9,000 bytes. A problem was fixed for three bad lanes causing a memory channel fail on the DMI interface. With the fix, the errors on the third lane on the DMI interface will be recovered and it will continue to be used as long as it functions. A problem was fixed for preventing loss of function on an SR-IOV adapter with an 8MB adapter firmware image if it is placed into SR-IOV shared mode. The 8MB image is not supported at the FW920.20 firmware level. With the fix, the adapter with the 8MB image is rejected with an error without an attempt to load the older 4MB image on the adapter which could damage it. This problem affects the following SR-IOV adapters: #EC2R/#EC2S with CCIN 58FA; and #EC2T/#EC2U with CCIN 58FB. A problem was fixed for incorrect recovery from a service processor mailbox error that was causing the system IPL to fail with the loss of all the PCIe links. If this occurs, the system will normally re-IPL successfully. A problem was fixed for SR-IOV adapter failures when running in shared mode in a Huge Dynamic DMA Window (HDDW) slot. I/O slots are enabled with HDDW by using the I/O Adapter Enlarged Capacity setting in the Advanced System Management Interface (ASMI). This problem can be circumvented by moving the SR-IOV adapter to a non-HDDW slot, or alternatively, disabling HDDW on the system. A problem was fixed for system termination for a re-IPL with power on with SRC B181E540 logged. The system can be recovered by powering off and then IPLing. This problem occurs infrequently and can be avoided by powering off the system between IPLs. System firmware changes that affect certain systems For a shared memory partition, a problem was fixed for Live Partition Mobility (LPM) migration hang after a Mover Service Partition (MSP) failover in the early part of the migration. To recover from the hang, a migration stop command must be given on the HMC. Then the migration can be retried. For a shared memory partition, a problem was fixed for Live Partition Mobility (LPM) migration failure to an indeterminate state. This can occur if the Mover Service Partition (MSP) has a failover that occurs when the migrating partition is in the state of "Suspended." To recover from this problem, the partition must be shutdown and restarted. On a system with a Cloud Management Console and a HMC Cloud Connector, a problem was fixed for memory leaks in the Redfish server causing Out of Memory (OOM) resets of the service processor. On a system with a partition with dedicated processors that are set to allow processor sharing with "Allow when partition is active" or "Allow always", a problem was fixed for a potential system hang if the partition is booting or shutting down while Dynamic Platform Optimizer (DPO) is running. As a work-around to the problem, the processor sharing can be turned off before running DPO, or avoid starting or shutting down dedicated partitions with processor sharing while DPO is active. On a system with an AMS partition, a problem was fixed for a Live Partition Mobility (LPM) migration failure when migrating from P9 to a pre-FW860 P8 or P7 system. This failure can occur if the P9 partition is in dedicated memory mode, and the Physical Page Table (PPT) ratio is explicitly set on the HMC (rather than keeping the default value) and the partition is then transitioned to Active Memory Sharing (AMS) mode prior to the migration to the older system. This problem can be avoided by using dedicated memory in the partition being migrated back to the older system.
VM920_057_057 / FW920.10 09/24/18	Impact: Data Severity: HIPER New features and functions DISRUPTIVE: Support was added for installing and running mixed levels of P9 processors on the system in compatibility mode. Support added for PCIe4 2-port 100Gb ROCE RN adapter with feature code #EC66 for AIX and IBM i. This PCIe Gen4 Ethernet x16 adapter provides two 100 GbE QSFP28 ports. Support was added to enable mirrored Hostboot memory. System firmware changes that affect all systems HIPER/Non-Pervasive: A problem was fixed for a potential problem that could result in undetected data corruption. DEFERRED: A problem was fixed for the Input Offset Voltage (VIO) to the processor being set too low, having less margin for PCIe and XBUS errors that could cause a higher than normal rate of processor or PCIe device failures during the IPL or at run time. A problem was fixed for truncated firmware assisted dumps (fadump/kdump). This can happen when the dumps are configured with chunks > 1Gb. A problem was fixed for the default gateway in the Advanced System Management Interface (ASMI) IPv4 network configurations showing as 0.0.0.0 which is an invalid gateway IP address. This problem can occur if ASMI is used to clear the gateway value with blanks. A problem was fixed for the Advanced System Management Interface (ASMI) displaying the IPv6 network prefix in decimal instead of hex character values. The service processor command line "ifconfig" can be used to see the IPv6 network prefix value in hex as a circumvention to the problem. A problem was fixed for link speed for PCIe Generation 4 adapters showing as "unknown" in the Advanced System Management Interface (ASMI) PCIe Hardware Topology menu. A problem was fixed for the system crashing on PCIe errors that result in guard action for the FRU. A problem was fixed for an extraneous SRC B7000602 being logged intermittently when is the system is being powered off. The trigger for the error log is a HMC request for information that does not complete before the system is shut down. If the HMC sends certain commands to get capacity information (eg, 0x8001/0x0107) while the CEC is shutting down, the SFLPHMCCMD task can fail with this assertion. This error log may be ignored. A problem was fixed for the service processor Thermal Management not being made aware of a Power Management failure that the hypervisor had detected. This could cause the system to go into Safe Mode with degraded performance if the error does not have recovery done. A problem was fixed for the On-Chip Controller (OCC) being held in reset after a channel error for the memory. The system would remain in Safe Mode (with degraded performance) until a re-IPL of the system. The trigger for the problem requires the memory channel checkstop and the OCC not being able to detect the error. Both of these conditions are rare, making the problem unlikely to occur. A problem was fixed for the memory bandwidth sensors for the P9 memory modules being off by a factor of 2. As a workaround, divide memory sensor values by 2 to get a corrected value. A problem was fixed for known bad DRAM bits having errors logs being generated repeatedly with each IPL. With the fix, the error logs only occur one time at the initial failure and then thereafter the known bad DRAM bits are repaired as part of the normal memory initialization. A problem was fixed for a Hostboot run time memory channel error where the processor could be called out erroneously instead of the memory DIMM. For this error to happen, there must be a RCD parity error on the memory DIMM with a channel failure attention on the processor side of the bus and no channel failure attention on the memory side of the bus, and the system must recover from the channel failure. A problem was fixed for DDR3 DIMM memory training where the ranks not being calibrated had their outputs enabled. The JEDEC specification requires that the outputs be disabled. Adding the termination settings on the non-calibrating ranks can improve memory margins ( thereby reduce the rate of memory failures), and it matches the memory training technique used for the DDR4 memory. A problem was fixed for a PCIe2 4-port Slot Adapter with feature code #2E17 that cannot recover from a double EEH error if the second error occurs during the EEH recovery. Because is a double-error scenario, the problem should be very infrequent. A rare problem was fixed for slow downs in a Live Partition Mobility migration of a partition with Active Memory Sharing (AMS). The AMS partition does not fail but the slower performance could cause time-outs in the workload if there are time constraints on the operations. A problem was fixed for isolation of memory channel failure attentions on the processor side of the differential memory interface (DMI) bus. This only is a problem if there are no attentions from the memory module side of the bus and it could cause the service processor run time diagnostics to get caught in hang condition, or result in a system checkstop with the processor called out. A problem was fixed for the memory bandwidth sensors for the P9 memory modules sometimes being zero. A problem was fixed for deconfiguring checkstopped processor cores at run time. Without the fix, the processor core checkstop error could cause a checkstop of the system and a re-IPL, or it could force the system into Safe Mode. A problem was fixed for a failed TPM card preventing a system IPL, even after the card was replaced. A problem was fixed for differential memory interface (DMI) lane sparing to prevent shutting down a good lane on the TX side of the bus when a lane has been spared on the RX side of the bus. If the XBUS or DMI bus runs out of spare lanes, it can checkstop the system, so the fix helps use these resources more efficiently. A problem was fixed for IPL failures with SRC BC50090F when replacing Xbus FRUs. The problem occurs if VPD has a stale bad memory lane record and that record does not exist on both ends of the bus. A problem was fixed for SR-IOV adapter dumps hanging with low-level EEH events causing failures on VFs of other non-target SR-IOV adapters. A problem was fixed for SR-IOV VF configured with a PVID that fails to function correctly after a virtual function reset. It will allow receiving untagged frames but not be able to transmit the untagged frames. A problem was fixed for SR-IOV VFs, where a VF configured with a PVID priority may be presented to the OS with an incorrect priority value. A problem was fixed for a Self Boot Engine (SBE) recoverable error at run time causing the system to go into Safe Mode. A problem was fixed for a rare Live Partition Mobility migration hang with the partition left in VPM (Virtual Page Mode) which causes performance concerns. This error is triggered by a migration failover operation occurring during the migration state of "Suspended" and there has to be insufficient VASI buffers available to clear all partition state data waiting to be sent to the migration target. Migration failovers are rare and the migration state of "Suspended" is a migration state lasting only a few seconds for most partitions, so this problem should not be frequent. On the HMC, there will be an inability to complete either a migration stop or a recovery operation. The HMC will show the partition as migrating and any attempt to change that will fail. The system must be re-IPLed to recover from the problem. A problem was fixed for Self Boot Engine (SBE) failure data being collected from the wrong processor if the SBE is not running on processor 0. This can result in the wrong FRU being called out for SBE failures. System firmware changes that affect certain systems On systems which do not have a HMC attached, a problem was fixed for a firmware update initiated from the OS from FW920.00 to FW920.10 that caused a system crash one hour after the code update completed. This does not fix the case of the OS initiated firmware update back to FW920.00 from FW920.10 which will still result in a crash of the system. Do not initiate a FW920.10 to FW920.00 code update via the operating system. Use only HMC or USB methods of code update for this case. If a HMC or USB code update is not an option, please contact IBM support. A problem was fixed for Linux or AIX partitions crashing during a firmware assisted dump or when using Linux kexec to restart with a new kernel. This problem was more frequent for the Linux OS with kdump failing with "Kernel panic - not syncing: Attempted to kill init" in some cases.
VM920_040_040 / FW920.00 08/20/18	Impact: New Severity: New New Features and Functions GA Level

[{"Type":"MASTER","Line of Business":{"code":"LOB68","label":"Power HW"},"Business Unit":{"code":"BU070","label":"IBM Infrastructure"},"Product":{"code":"TI0007E","label":"Power System E950 Server (9040-MR9)"},"ARM Category":[{"code":"a8m0z000000bpKLAAY","label":"Firmware"}],"Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"All Versions"}]

Was this topic helpful?

Document Information

More support for:
Power System E950 Server (9040-MR9)

Component:
Firmware

Software version:
All Versions

Document number:
6955589

Modified date:
26 February 2025

UID

ibm16955589

IBM Support

Tips