APAR status
Closed as documentation error.
Error description
After a Power On/Reset (POR) of an IBM z15 system running in DPM mode, a multi-node accelerator cluster that uses RoCE cards does not start. The system details show the RoCE cards in status "Stopped" (issue 1). After the cards were activated manually, communication between the head node and the data nodes of the multi-node cluster still did not work. A further look at the system details revealed that the FIDs assigned to the RoCE cards were no longer the ones that had been valid before the POR (issue 2). After the accelerator's configuration JSON file was adapted to use the new FIDs, communication between the head node and the data nodes worked again and the accelerator became operational.

Background information:
On a z16, neither issue 1 nor issue 2 occurs. On a z15 (and on a z14):
- Issue 1 is caused by an architectural limitation of the hardware / firmware.
- Issue 2 is caused by a bug in the firmware of the z15 (z14) machine.
Resolving issue 2 on an IBM z15 system requires installing an MCL patch (P46598.532) that is included in bundle S85.

Additional keywords: TS015305644 POR Z15 DPM ROCE FID BUNDLE S85 DT269759 DRIVER41C
Local fix
Problem summary
Problem Summary: After Power On/Reset of an IBM z15 machine (IBM model types 8561, 8562) using DPM mode, a multi-node accelerator cluster using RoCE cards does not start.
Users Affected: Customers to whom all of the following apply:
- They run a multi-node accelerator.
- They use RoCE cards for the internode communication.
- They have deployed the accelerator on IBM z15 hardware (IBM model types 8561, 8562).
- They use IBM Z Dynamic Partition Manager (DPM).
Problem Scenario: See APAR Error Description.
Problem Symptoms: See APAR Error Description.
Problem conclusion
The root cause has been addressed (partially fixed) with IBM Z Driver 41C firmware patch P46598.532. This patch is part of IBM Z Firmware Bundle S85, which was made available on February 5, 2024. If you are one of the affected customers described above, upgrade your IBM z15 firmware accordingly.

What does "addressed (partially fixed)" mean?
- For newly plugged RoCE cards, the assigned FIDs are persistent and are not changed by a new system initialization cycle such as the one run during a POR.
- For RoCE cards that are already plugged in, the FIDs become persistent only AFTER the system has run through one more system initialization cycle, such as the one run during a POR.

For affected customers, this means in particular: after MCL bundle S85 has been applied, the next POR assigns new FIDs to the plugged-in RoCE cards, which makes it necessary to adapt the FIDs specified in the accelerator's configuration JSON file.

To repeat an important statement from the APAR Error Description above: after any POR of an IBM z15 machine (IBM model types 8561, 8562) using DPM mode, the RoCE cards need to be activated manually!
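Adapting the FIDs in the configuration JSON file after such a POR can be scripted. The following is a minimal sketch only, not an IBM-provided tool. It assumes the FIDs appear in the file as plain string values and that the old-to-new FID mapping has been collected by hand from the system details. The function name `remap_fids`, the key names `nodes`, `name`, and `roce_fid`, and all FID values are hypothetical; the actual layout of the accelerator configuration file is not described in this APAR.

```python
import json


def remap_fids(node, fid_map):
    """Return a copy of a parsed JSON structure in which every string value
    that exactly matches a key of fid_map is replaced by the mapped value.

    Assumption: FIDs are stored as strings. Matching is exact and
    case-sensitive, and dictionary keys are never rewritten.
    """
    if isinstance(node, dict):
        return {key: remap_fids(value, fid_map) for key, value in node.items()}
    if isinstance(node, list):
        return [remap_fids(item, fid_map) for item in node]
    if isinstance(node, str) and node in fid_map:
        return fid_map[node]
    return node


if __name__ == "__main__":
    # Hypothetical configuration fragment; the key names are illustrative only.
    old_config = {
        "nodes": [
            {"name": "head", "roce_fid": "0x0010"},
            {"name": "data1", "roce_fid": "0x0011"},
        ]
    }
    # Hypothetical mapping: FID before the POR -> FID after the POR.
    fid_map = {"0x0010": "0x0020", "0x0011": "0x0021"}

    new_config = remap_fids(old_config, fid_map)
    print(json.dumps(new_config, indent=2))
```

Because the sketch replaces any matching string value regardless of the key it is stored under, compare the result with the original file before using it, and keep a backup of the original configuration.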
Temporary fix
Comments
APAR Information
APAR number
PH59572
Reported component name
ANYTCS ACCLTR Z
Reported component ID
5697DA700
Reported release
750
Status
CLOSED DOC
PE
NoPE
HIPER
NoHIPER
Special Attention
NoSpecatt / Xsystem
Submitted date
2024-02-01
Closed date
2024-06-20
Last modified date
2024-06-20
APAR is sysrouted FROM one or more of the following:
APAR is sysrouted TO one or more of the following:
Fix information
Applicable component levels
[{"Business Unit":{"code":"BU011","label":"Systems - zSystems software"},"Product":{"code":"SG19M"},"Platform":[{"code":"PF054","label":"z Systems"}],"Version":"750"}]
Document Information
Modified date:
20 June 2024