APAR status
Closed as program error.
Error description
Cascading server restarts can occur in a WebSphere ND cluster environment where transaction logs are stored in a relational database and automated peer transaction recovery has been configured. If the HAManager view becomes split, typically because of a network partition but also where servers become highly resource constrained, servers will log WTRN0100E and WTRN0112E messages and may repeatedly start, fail and restart in a cascading fashion. In a partitioned network, servers on either side of the partition may attempt to automatically recover the transaction logs of peer servers with which they have lost communication, and in so doing take possession of the recovery logs of a running server. The peer effectively steals ownership of the logs; the owning application server detects this and forces itself to shut down. This leads to cascading server restarts. When transaction recovery logs are stored in a relational database, the locking mechanism does not prevent peer recovery from taking place for a server that is still running.
Local fix
Disable auto peer recovery.
Problem summary
****************************************************************
* USERS AFFECTED:  All users of IBM WebSphere Application      *
*                  Server storing transaction and              *
*                  compensation logs in a relational           *
*                  database.                                   *
****************************************************************
* PROBLEM DESCRIPTION: Servers shut down reporting             *
*                      WTRN0100E/WTRN0112E messages when       *
*                      recovery logs are stored in an RDBMS    *
*                      with auto peer recovery.                *
****************************************************************
* RECOMMENDATION:                                              *
****************************************************************
This issue can occur in a WebSphere ND cluster environment where transaction and compensation logs are stored in a relational database and automated peer transaction recovery has been configured. There is no such issue where the recovery logs are stored in a filesystem, which is the default configuration.
The symptoms described by this APAR could arise in a cluster where the network becomes partitioned or where servers become severely resource constrained and therefore unresponsive. In such scenarios HAManager may trigger servers to attempt to take over and recover the logs of running peer servers with which they have lost communication. A running server will detect this, report with a WTRN0100E error that another server has taken ownership of its logs, and shut itself down. On restart, the server may be unable to open its logs and report a WTRN0112E error, or it may report further WTRN0100E errors as peers take possession of its logs once more.
When the recovery logs are stored in a filesystem, the locking mechanism prevents peers from recovering the logs of a running server. When the logs are stored in a relational database, the locking mechanism does not prevent such peer log takeovers.
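To check whether a server has hit the symptoms described above, its logs can be searched for the two message IDs. The following is a minimal Python sketch, not an IBM-supplied tool: the function name count_apar_messages is hypothetical, and it assumes only that each message ID appears as a whole word somewhere in a log line.

```python
import re
from collections import Counter

# Message IDs named in this APAR.
APAR_MESSAGE_IDS = ("WTRN0100E", "WTRN0112E")
_PATTERN = re.compile(r"\b(WTRN0100E|WTRN0112E)\b")


def count_apar_messages(lines):
    """Count occurrences of each APAR message ID in an iterable of log lines.

    Hypothetical helper: it makes no assumption about the log line layout
    beyond the message ID appearing as a whole word.
    """
    counts = Counter({message_id: 0 for message_id in APAR_MESSAGE_IDS})
    for line in lines:
        for match in _PATTERN.findall(line):
            counts[match] += 1
    return dict(counts)
```

A typical use would be to open a server log file and pass the file object to count_apar_messages. Repeated WTRN0100E entries across several cluster members, interleaved with restarts, would be consistent with the cascading pattern described in this APAR, although these messages alone do not prove it.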
Problem conclusion
The locking mechanism in the relational database case has been changed to cater for this situation. To enable the new locking mechanism it is necessary to specify a new Transaction Service Custom Property named ENABLE_DBTXLOG_PEERLOCKING on each server in the cluster. The value of the property should be set to TRUE. The entire cluster should be quiesced prior to adding this custom property to each cluster member. The fix for this APAR is targeted for inclusion in fix packs 8.5.5.19 and 9.0.5.6. For more information, see 'Recommended Updates for WebSphere Application Server': https://www.ibm.com/support/pages/node/715553
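Adding the custom property to every cluster member can be scripted. The following is a sketch in wsadmin-style Jython/Python, under stated assumptions: the function name enable_peer_locking is hypothetical, the AdminConfig object is passed in as a parameter so that the logic can be exercised outside wsadmin, and the script does not check whether the property already exists. Verify the AdminConfig calls against the wsadmin documentation for your release before use, and quiesce the cluster first as described above.

```python
PROPERTY_NAME = "ENABLE_DBTXLOG_PEERLOCKING"  # property name from this APAR


def enable_peer_locking(admin_config, cluster_name):
    """Add the custom property to the transaction service of each member.

    admin_config is expected to behave like the wsadmin AdminConfig object.
    Returns the list of member names that were updated.
    """
    cluster_id = admin_config.getid("/ServerCluster:%s/" % cluster_name)
    updated = []
    for member in admin_config.list("ClusterMember", cluster_id).splitlines():
        member = member.strip()
        if not member:
            continue
        node = admin_config.showAttribute(member, "nodeName")
        server = admin_config.showAttribute(member, "memberName")
        server_id = admin_config.getid("/Node:%s/Server:%s/" % (node, server))
        tx_service = admin_config.list("TransactionService", server_id).strip()
        # Assumption: no property with this name exists yet on this server.
        admin_config.create(
            "Property",
            tx_service,
            [["name", PROPERTY_NAME], ["value", "TRUE"]],
        )
        updated.append(server)
    admin_config.save()
    return updated
```

Inside a wsadmin session this might be invoked as enable_peer_locking(AdminConfig, "MyCluster"), where MyCluster is a placeholder cluster name. The same property can also be added by hand to each server's transaction service custom properties.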
Temporary fix
Comments
APAR Information
APAR number
PH10643
Reported component name
WEBS APP SERV N
Reported component ID
5724H8800
Reported release
850
Status
CLOSED PER
PE
NoPE
HIPER
NoHIPER
Special Attention
NoSpecatt / Xsystem
Submitted date
2019-04-08
Closed date
2020-11-27
Last modified date
2020-12-14
APAR is sysrouted FROM one or more of the following:
APAR is sysrouted TO one or more of the following:
Fix information
Fixed component name
WEBS APP SERV N
Fixed component ID
5724H8800
Applicable component levels
R850 PSY
UP
R900 PSY
UP
Document Information
Modified date:
01 November 2021