PH10643: CASCADING SERVER RESTART DUE TO TRANSACTION AUTO PEER RECOVERY LOCKING ISSUE

APAR status

Closed as program error.

Error description

A cascading server restart can occur in a WebSphere ND cluster
environment where transaction logs are stored in a relational
database and automated peer transaction recovery has been
configured. If the HAManager view becomes split, typically
because of a network partition but also where servers become
highly resource constrained, servers will log WTRN0100E and
WTRN0112E messages and may repeatedly start, fail and restart
in a cascading fashion.
In a partitioned network, servers on either side of the
partition may attempt to automatically recover the transaction
logs of peer servers with which they have lost communication
and, in so doing, take possession of the recovery logs of a
running server. The peer effectively steals ownership of the
logs; the owning application server detects this and forces
itself to shut down. This leads to cascading server restarts.
The locking mechanism used when transaction recovery logs are
stored in a relational database does not prevent peer recovery
from taking place for a server that is running.
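
The symptom described above can be checked for by searching server logs for the two message IDs. The following is a minimal sketch, assuming the log is available as plain text lines; the function names are hypothetical and no particular log line layout is assumed beyond the message ID appearing as a whole word.

```python
import re
from collections import Counter

# Message IDs named in this APAR.
APAR_MESSAGE_IDS = ("WTRN0100E", "WTRN0112E")
_PATTERN = re.compile(r"\b(" + "|".join(APAR_MESSAGE_IDS) + r")\b")


def count_apar_messages(lines):
    """Count occurrences of each APAR message ID in an iterable of log lines."""
    counts = Counter({msg_id: 0 for msg_id in APAR_MESSAGE_IDS})
    for line in lines:
        for match in _PATTERN.finditer(line):
            counts[match.group(1)] += 1
    return dict(counts)


def shows_symptom(lines):
    """Return True if any of the APAR message IDs appears in the log lines."""
    return any(n > 0 for n in count_apar_messages(lines).values())
```

For example, `count_apar_messages(open(path))` could be run against each cluster member's log (where `path` is whatever log file the installation uses) to see which servers reported the messages and how often.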

Local fix

```
Disable auto peer recovery.
```

Problem summary

****************************************************************
* USERS AFFECTED:  All users of IBM WebSphere Application      *
*                  Server storing transaction and              *
*                  compensation logs in a relational           *
*                  database.                                   *
****************************************************************
* PROBLEM DESCRIPTION: Servers shut down reporting             *
*                      WTRN0100E/WTRN0112E messages when       *
*                      recovery logs are stored in an RDBMS    *
*                      with auto peer recovery.                *
****************************************************************
* RECOMMENDATION:                                              *
****************************************************************
This issue can occur in a WebSphere ND cluster environment where
transaction and compensation logs are stored in a relational
database and automated peer transaction recovery has been
configured. The issue does not arise where the recovery logs are
stored in a filesystem, which is the default configuration.
The symptoms described by this APAR could arise in a cluster
where the network becomes partitioned or where servers become
severely resource constrained and therefore unresponsive.
In such scenarios HAManager may trigger servers to attempt to
take over and recover the logs of running peer servers with
which they have lost communication. A running server will detect
this, report with a WTRN0100E error that another server has
taken ownership of its logs, and shut itself down. On restart
the server may be unable to open its logs and report a WTRN0112E
error, or it may report further WTRN0100E errors as peers take
possession of its logs once more.
When the recovery logs are stored in a filesystem, the locking
mechanism prevents peers from recovering the logs of a running
server. When the logs are stored in a relational database, the
locking mechanism does not prevent such peer log takeovers.
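
The difference between the two cases can be illustrated with a toy model. This is purely an illustrative sketch: the class, its methods and the `check_owner_liveness` flag are hypothetical, and it does not describe how WebSphere actually implements log locking. It only shows why a takeover that ignores whether the owner is still running forces the owner to stop.

```python
class LogStore:
    """Toy recovery-log store that tracks who owns each server's logs."""

    def __init__(self, check_owner_liveness):
        # True models a lock that protects a running owner;
        # False models a store with no such protection.
        self.check_owner_liveness = check_owner_liveness
        self.owner = {}       # home server name -> current owner of its logs
        self.running = set()  # servers currently running

    def start(self, server):
        """A server starts and claims its own logs."""
        self.running.add(server)
        self.owner[server] = server

    def peer_recover(self, peer, target):
        """Peer attempts to take over target's logs; True means it succeeded."""
        if self.check_owner_liveness and target in self.running:
            return False
        self.owner[target] = peer
        return True

    def home_server_must_stop(self, server):
        """A running server must stop once another server owns its logs."""
        return self.owner.get(server) != server
```

With `LogStore(False)`, a call to `peer_recover("B", "A")` after `start("A")` succeeds and `home_server_must_stop("A")` becomes true, which corresponds to the shutdown described above. With `LogStore(True)` the takeover is refused and server A keeps running.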

Problem conclusion

The locking mechanism in the relational database case has been
changed to cater for this situation. To enable the new locking
mechanism it is necessary to specify a new Transaction
Service Custom Property named ENABLE_DBTXLOG_PEERLOCKING on
each server in the cluster.
The value of the property should be set to TRUE. The entire
cluster should be quiesced prior to adding this custom property
to each cluster member.
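
As a rough sketch of how the property might be added to every cluster member in one pass, the function below uses the wsadmin `AdminConfig` object (`getid`, `list`, `showAttribute`, `create`, `save`). The function name and the choice to pass `AdminConfig` in as a parameter are assumptions made for illustration, and the script has not been validated against a live cell. It assumes the cluster has already been quiesced as described above.

```python
PROPERTY_NAME = "ENABLE_DBTXLOG_PEERLOCKING"
PROPERTY_VALUE = "TRUE"


def enable_peer_locking(admin_config, cluster_name):
    """Add the custom property to the Transaction Service of each member.

    admin_config is expected to be the wsadmin AdminConfig object.
    Returns the names of the cluster members that were changed.
    """
    changed = []
    cluster_id = admin_config.getid("/ServerCluster:%s/" % cluster_name)
    for member in admin_config.list("ClusterMember", cluster_id).splitlines():
        node = admin_config.showAttribute(member, "nodeName")
        server = admin_config.showAttribute(member, "memberName")
        server_id = admin_config.getid("/Node:%s/Server:%s/" % (node, server))
        ts_id = admin_config.list("TransactionService", server_id).splitlines()[0]
        # Skip members that already define the property (its value is not changed).
        existing = [admin_config.showAttribute(prop, "name")
                    for prop in admin_config.list("Property", ts_id).splitlines()
                    if prop]
        if PROPERTY_NAME in existing:
            continue
        admin_config.create("Property", ts_id,
                            [["name", PROPERTY_NAME], ["value", PROPERTY_VALUE]])
        changed.append(server)
    if changed:
        admin_config.save()
    return changed
```

Inside a wsadmin Jython session this would be called as `enable_peer_locking(AdminConfig, "MyCluster")`, where `MyCluster` is a placeholder cluster name. The same property can instead be added by hand as a Transaction Service custom property on each server.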

The fix for this APAR is targeted for inclusion in fix packs
8.5.5.19 and 9.0.5.6. For more information, see 'Recommended
Updates for WebSphere Application Server':
https://www.ibm.com/support/pages/node/715553

Temporary fix

Comments

APAR Information

APAR number
PH10643
Reported component name
WEBS APP SERV N
Reported component ID
5724H8800
Reported release
850
Status
CLOSED PER
PE
NoPE
HIPER
NoHIPER
Special Attention
NoSpecatt / Xsystem
Submitted date
2019-04-08
Closed date
2020-11-27
Last modified date
2020-12-14

APAR is sysrouted FROM one or more of the following:
APAR is sysrouted TO one or more of the following:

Fix information

Fixed component name
WEBS APP SERV N
Fixed component ID
5724H8800

Applicable component levels

R850 PSY
UP
R900 PSY
UP


Document Information

Modified date:
01 November 2021
