IBM Support

PH10643: CASCADING SERVER RESTART DUE TO TRANSACTION AUTO PEER RECOVERY LOCKING ISSUE

APAR status

  • Closed as program error.

Error description

  • A cascading server restart can occur in a WebSphere ND cluster
    environment where transaction logs are stored in a relational
    database and automated peer transaction recovery has been
    configured. If the HAManager view becomes split, typically
    because of a network partition but also when servers become
    highly resource constrained, servers will log WTRN0100E and
    WTRN0112E messages and may repeatedly start, fail and restart
    in a cascading fashion.
    In a partitioned network, servers on either side of the
    partition may attempt to automatically recover the transaction
    logs of peer servers with which they have lost communication
    and, in so doing, take possession of the recovery logs of a
    running server. The peer effectively steals ownership of the
    logs; the owning application server detects this and forces
    itself to shut down. This leads to cascading server restarts.
    When transaction recovery logs are stored in a relational
    database, the locking mechanism does not prevent peer
    recovery from taking place for a server that is running.
    

Local fix

  • Disable auto peer recovery.
    

Problem summary

  • ****************************************************************
    * USERS AFFECTED:  All users of IBM WebSphere Application      *
    *                  Server storing transaction and              *
    *                  compensation logs in a relational           *
    *                  database.                                   *
    ****************************************************************
    * PROBLEM DESCRIPTION: Servers shut down reporting             *
    *                      WTRN0100E/WTRN0112E messages when       *
    *                      recovery logs are stored in an RDBMS    *
    *                      with auto peer recovery.                *
    ****************************************************************
    * RECOMMENDATION:                                              *
    ****************************************************************
    This issue can occur in a WebSphere ND cluster environment
    where transaction and compensation logs are stored in a
    relational database and automated peer transaction recovery
    has been configured. There is no such issue where the recovery
    logs are stored in a filesystem, which is the default
    configuration.
    The symptoms described by this APAR could arise in a cluster
    where the network becomes partitioned or where servers become
    severely resource constrained and therefore unresponsive.
    In such scenarios HAManager may trigger servers to attempt to
    take over and recover the logs of running peer servers with
    which they have lost communication. A running server will
    detect this, report with a WTRN0100E error that another server
    has taken ownership of its logs, and shut itself down. On
    restart the server may be unable to open its logs and report a
    WTRN0112E error, or it may report further WTRN0100E errors as
    peers take possession of its logs once more.
    When the recovery logs are stored in a filesystem the locking
    mechanism prevents peers from recovering the logs of a running
    server. In the case where the logs are stored in a relational
    database the locking mechanism does not prevent such peer log
    takeovers.
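
    To check whether a server is showing the symptoms described
    above, its logs can be searched for the two message IDs. The
    following is a minimal sketch, not an IBM-supplied tool: the
    function name count_apar_messages is hypothetical, and the
    sketch assumes only that the message IDs appear as plain text
    in each log line. It makes no assumption about the rest of the
    log format.

```python
import re
from collections import Counter

# Message IDs named in this APAR.
MESSAGE_IDS = ("WTRN0100E", "WTRN0112E")
_PATTERN = re.compile(r"\b(" + "|".join(MESSAGE_IDS) + r")\b")


def count_apar_messages(lines):
    """Count occurrences of each APAR message ID in an iterable of log lines.

    Hypothetical helper for illustration only. Returns a dict that always
    contains both message IDs, with 0 for IDs that never appear.
    """
    counts = Counter()
    for line in lines:
        for message_id in _PATTERN.findall(line):
            counts[message_id] += 1
    # Report both IDs explicitly so a zero count is visible to the caller.
    return dict((message_id, counts[message_id]) for message_id in MESSAGE_IDS)
```

    The function accepts any iterable of strings, so an open log
    file can be passed directly. Repeated WTRN0100E entries across
    several cluster members around the same time would be
    consistent with the cascading restarts described here, but a
    count alone does not confirm the cause.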
    

Problem conclusion

  • The locking mechanism in the relational database case has been
    changed to cater for this situation. To enable the new locking
    mechanism it is necessary to specify a new Transaction
    Service Custom Property named ENABLE_DBTXLOG_PEERLOCKING on
    each server in the cluster.
    The value of the property should be set to TRUE. The entire
    cluster should be quiesced prior to adding this custom property
    to each cluster member.
    
    The fix for this APAR is targeted for inclusion in fix packs
    8.5.5.19 and 9.0.5.6. For more information, see 'Recommended
    Updates for WebSphere Application Server':
    https://www.ibm.com/support/pages/node/715553
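
    The custom property described above could be added by script
    rather than by hand. The following is a hedged sketch for
    wsadmin in Jython mode, not a procedure taken from this APAR.
    It assumes that a Transaction Service custom property can be
    created as a Property object under the server's
    TransactionService configuration object. The function name
    enable_peer_locking and its parameters are hypothetical. The
    AdminConfig object is passed in as an argument so the function
    can also be exercised outside wsadmin.

```python
# Sketch for wsadmin (-lang jython). Written without Python 3 only syntax
# so that it also loads under Jython 2.x.

PROPERTY_NAME = "ENABLE_DBTXLOG_PEERLOCKING"
PROPERTY_VALUE = "TRUE"


def enable_peer_locking(admin_config, server_ids):
    """Add the peer-locking custom property to each server's Transaction Service.

    admin_config: the wsadmin AdminConfig object (or a stand-in with the
                  same list/create methods).
    server_ids:   configuration IDs of the cluster members to update.

    Assumption: the property is stored as a 'Property' child of the
    'TransactionService' object. This sketch does not check whether the
    property already exists and does not save the configuration.
    """
    updated = []
    for server_id in server_ids:
        # Locate the Transaction Service configuration object of this server.
        ts_id = admin_config.list("TransactionService", server_id)
        admin_config.create(
            "Property",
            ts_id,
            [["name", PROPERTY_NAME], ["value", PROPERTY_VALUE]],
        )
        updated.append(ts_id)
    return updated
```

    Inside wsadmin the caller would pass the built-in AdminConfig
    object and the configuration IDs of the cluster members, then
    call AdminConfig.save() and synchronize the nodes. As stated
    above, the cluster should be quiesced before the property is
    added.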
    

Temporary fix

Comments

APAR Information

  • APAR number

    PH10643

  • Reported component name

    WEBS APP SERV N

  • Reported component ID

    5724H8800

  • Reported release

    850

  • Status

    CLOSED PER

  • PE

    NoPE

  • HIPER

    NoHIPER

  • Special Attention

    NoSpecatt / Xsystem

  • Submitted date

    2019-04-08

  • Closed date

    2020-11-27

  • Last modified date

    2020-12-14

  • APAR is sysrouted FROM one or more of the following:

  • APAR is sysrouted TO one or more of the following:

Fix information

  • Fixed component name

    WEBS APP SERV N

  • Fixed component ID

    5724H8800

Applicable component levels

  • R850 PSY

       UP

  • R900 PSY

       UP


Document Information

Modified date:
01 November 2021