IBM Support

PH59572: After Power On/Reset of an IBM z15 HW using DPM mode, a multi-node accelerator cluster using RoCE cards does not start

 

APAR status

  • Closed as documentation error.

Error description

  • After Power On/Reset (POR) of an IBM z15 HW using DPM mode, a
    multi-node accelerator cluster using RoCE cards does not start.
    
    The system details showed the RoCE cards in status "Stopped"
    (issue 1).
    After manual activation, communication between the head node and
    the data nodes of the multi-node cluster still did not work.
    A further look at the system details revealed that the FIDs
    assigned to the RoCE cards were no longer the ones that had been
    valid before the POR was executed (issue 2).
    After the json file was adapted to contain the new FIDs,
    communication between the head node and the data nodes worked
    again, and the accelerator was operational again.
    
    Background information:
    On a z16, both issue 1 and issue 2 do NOT occur.
    
    On a z15 (and on a z14):
    
    - Issue 1 happens due to an architectural limitation of the
      hardware / firmware.
    - Issue 2 happens due to a bug in the firmware of the z15 (z14)
      machine.
    
    Fixing issue 2 on an IBM z15 HW system requires installation of
    an MCL patch (P46598.532) that is included in bundle S85.
    
    Additional keywords:
    TS015305644 POR Z15 DPM ROCE FID BUNDLE S85 DT269759 DRIVER41C
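To illustrate issue 2, the following minimal Python sketch compares the FIDs noted before a POR with those seen afterwards and reports which ones changed. It is an illustration only, not an IBM tool: the function name `fid_changes`, the card identifiers, and all FID values are made up, and it assumes you have recorded the FID of each RoCE card by hand from the system details.

```python
def fid_changes(before, after):
    """Map each old FID to its new FID for cards whose FID changed.

    Both arguments are dicts of {card_id: fid}. The card_id is a
    hypothetical stable identifier for a RoCE card; use whatever
    stays constant across a POR in your environment.
    """
    changes = {}
    for card_id, old_fid in before.items():
        new_fid = after.get(card_id)
        # Skip cards that are missing afterwards or kept their FID.
        if new_fid is not None and new_fid != old_fid:
            changes[old_fid] = new_fid
    return changes


# Made-up example values, noted by hand before and after a POR.
before_por = {"card-A": "0010", "card-B": "0011", "card-C": "0012"}
after_por = {"card-A": "0020", "card-B": "0011", "card-C": "0022"}

changed = fid_changes(before_por, after_por)
print(changed)
```

An empty result would suggest that the FIDs survived the POR, so the json file would not need to be adapted for that reason.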
    

Local fix

Problem summary

  • Problem Summary:
    After Power On/Reset of an IBM z15 machine (IBM model types
    8561, 8562) using DPM mode, a multi-node accelerator cluster
    using RoCE cards does not start.
    
    Users Affected:
    Customers to whom all of the following applies:
    - They run a multi-node accelerator.
    - They use RoCE cards for the internode communication.
    - They have deployed the accelerator on IBM z15 hardware (IBM
      model types 8561, 8562).
    - They use IBM Z Dynamic Partition Manager (DPM).
    
    Problem Scenario:
    See APAR Error Description.
    
    Problem Symptoms:
    See APAR Error Description.
    

Problem conclusion

  • The root cause has been addressed (partially fixed) with IBM Z
    Driver 41C Firmware patch P46598.532. This patch is part of IBM
    Z Firmware Bundle S85, which was made available on February 5,
    2024. If you are one of the affected customers described above,
    upgrade your IBM z15 firmware accordingly.
    What does "addressed (partially fixed)" mean?
    For newly plugged RoCE cards, the assigned FIDs are persistent
    and are not changed by a new system initialization cycle such
    as the one run during a POR.
    For RoCE cards that are already plugged in, the FIDs become
    persistent only AFTER (!) the system has run through one more
    system initialization cycle, such as the one run during a POR.
    
    For the affected customers, this means in particular:
    after the MCL bundle S85 has been applied, the next POR will
    assign new FIDs to the RoCE cards already plugged in, which
    makes it necessary to adapt the FIDs specified in the
    Accelerator's configuration json file.
    
    To repeat an important statement from the APAR Error Description
    above:
    after any POR of an IBM z15 machine (IBM model types 8561, 8562)
    using DPM mode, RoCE cards need to be activated manually!
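Adapting the FIDs in the configuration json file can be sketched as below. This is a minimal Python sketch under stated assumptions, not the Accelerator's actual file layout: the property name `fid`, the `nodes` and `roce` structure, the helper `remap_fids`, and all FID values are hypothetical. Check the layout of your own configuration file before applying anything like this.

```python
import json


def remap_fids(node, mapping, key="fid"):
    """Return a copy of a parsed JSON structure with FIDs replaced.

    `key` is the assumed name of the property holding a FID; the
    real configuration file may use a different name or layout.
    """
    if isinstance(node, dict):
        result = {}
        for name, value in node.items():
            if name == key and not isinstance(value, (dict, list)):
                # Replace a known old FID, keep anything else as is.
                result[name] = mapping.get(value, value)
            else:
                result[name] = remap_fids(value, mapping, key)
        return result
    if isinstance(node, list):
        return [remap_fids(item, mapping, key) for item in node]
    return node


# Made-up configuration fragment and FID values, for illustration only.
old_text = """
{"nodes": [
  {"name": "head",  "roce": {"fid": "0010"}},
  {"name": "data1", "roce": {"fid": "0011"}}
]}
"""
old_to_new = {"0010": "0020"}

new_config = remap_fids(json.loads(old_text), old_to_new)
print(json.dumps(new_config, indent=2))
```

The sketch builds a new structure instead of changing the parsed one in place, so the old and new configuration can be compared before the file is written back.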
    

Temporary fix

Comments

APAR Information

  • APAR number

    PH59572

  • Reported component name

    ANYTCS ACCLTR Z

  • Reported component ID

    5697DA700

  • Reported release

    750

  • Status

    CLOSED DOC

  • PE

    NoPE

  • HIPER

    NoHIPER

  • Special Attention

    NoSpecatt / Xsystem

  • Submitted date

    2024-02-01

  • Closed date

    2024-06-20

  • Last modified date

    2024-06-20

  • APAR is sysrouted FROM one or more of the following:

  • APAR is sysrouted TO one or more of the following:

Fix information

Applicable component levels

  • Business unit: Systems - zSystems software (BU011); product code:
    SG19M; platform: z Systems (PF054); version: 750

Document Information

Modified date:
20 June 2024