IBM Support

Unable to obtain an exclusive file lock on transaction logs during failover:
CWRLS0024E: Exception caught during recovery!

Troubleshooting


Problem

During transaction log failover, the server that takes over for a failed server cannot obtain an exclusive lock on the failed server's transaction logs (tranlogs). The takeover server needs this lock to take over the logs and continue processing the failed server's in-flight transactions to resolution.

Symptom

Unable to obtain an exclusive file lock on /tools/.../.../tranlogs/....node.../tranlog/log1.
[10/22/14 16:07:47:274 CEST] 000001ae CoordinationL E CWRLS0024E: Exception caught during recovery!
java.nio.channels.OverlappingFileLockException
    at sun.nio.ch.FileChannelImpl.checkList(FileChannelImpl.java:913)
    at sun.nio.ch.FileChannelImpl.tryLock(FileChannelImpl.java:880)
    at java.nio.channels.FileChannel.tryLock(FileChannel.java:981)
    at com.ibm.ws.recoverylog.spi.CoordinationLock.obtainLock(CoordinationLock.java:288)
    at com.ibm.ws.recoverylog.spi.CoordinationLock.lock(CoordinationLock.java:182)
    at com.ibm.ws.recoverylog.spi.RecoveryHandler.run(RecoveryHandler.java:180)


Look for the following entries in the transaction trace log file:

[1/5/14 16:19:04:164] 0000001b RLSHAGroupCal [RLSHA: ACTIVATE recovery for failure scope FileFailureScope: ???001\???001_node_01\????node_01.0 [1156553819]]
[1/5/14 16:19:04:165] 0000001b RLSHAGroupCal [RLSHA: This is a recovery request for the LOCAL server]
[1/5/14 16:19:04:170] 0000001d RecoveryHandl [RLSHA: STARTING RecoveryHandler for failure scope FileFailureScope: ???001\???001_node_01\???001.???c.???001_node_01.0 [1156553819]]
[1/5/14 16:19:04:172] 0000001d TransactionSe [Retrieving transaction log configuration for server '???001\???001_node_01\???001.gpcc.???001_node_01.0']
[1/5/14 16:19:04:173] 0000001d TransactionSe [Resolved transaction log directory '${TRANLOG_ROOT}/${WAS_SERVER_NAME};3M']
[1/5/14 16:19:04:174] 0000001d TransactionSe [logFile path after var expand: /tools/wps/var/tranlogs/???001.????.???001_node_01.0]

For the peer:

[1/5/14 16:19:06:808] 0000001c RLSHAGroupCal [RLSHA: ACTIVATE recovery for failure scope FileFailureScope: ???001\???001_node_02\???001.gpcc.???001_node_02.0 [1167870907]]
[1/5/14 16:19:06:809] 0000001c RLSHAGroupCal [RLSHA: This is a recovery request for a PEER server]
[1/5/14 16:19:06:809] 00000023 RecoveryHandl [RLSHA: STARTING RecoveryHandler for failure scope FileFailureScope: ???001\???001_node_02\???001.????.???001_node_02.1 [1167870908]]
[1/5/14 16:19:06:810] 00000023 TransactionSe [Retrieving transaction log configuration for server '???001\???001_node_02\???001.gpcc.???001_node_02.1']
[1/5/14 16:19:06:810] 00000023 TransactionSe [Resolved transaction log directory '${TRANLOG_ROOT}/${WAS_SERVER_NAME};??']
[1/5/14 16:19:06:811] 00000023 TransactionSe [logFile path after var expand: /tools/wps/var/tranlogs/???001.????.???001_node_01.0]
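
The telltale sign in the trace above is that the local recovery and the peer recovery both report the same "logFile path after var expand" value. As a rough aid, the following Java sketch scans trace entries for that marker and reports any expanded path that appears more than once. The class name, method name, and sample paths are hypothetical, and the sketch assumes each trace entry sits on a single line. A repeated path is only a hint, so confirm it against the failure scopes in the surrounding entries.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of WebSphere Application Server.
class TranlogTraceCheck {
    // Text that precedes the expanded path in the trace entries shown above.
    private static final String MARKER = "logFile path after var expand:";

    /** Returns every expanded log path that appears in more than one trace entry. */
    static List<String> duplicateLogPaths(List<String> traceEntries) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String entry : traceEntries) {
            int at = entry.indexOf(MARKER);
            if (at < 0) {
                continue; // not a path-expansion entry
            }
            String path = entry.substring(at + MARKER.length()).trim();
            if (path.endsWith("]")) {
                // Drop the bracket that closes the trace message.
                path = path.substring(0, path.length() - 1).trim();
            }
            counts.merge(path, 1, Integer::sum);
        }
        List<String> duplicates = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > 1) {
                duplicates.add(e.getKey());
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        // Sample entries with made-up paths, shaped like the trace above.
        List<String> entries = Arrays.asList(
            "[1/5/14 16:19:04:174] 0000001d TransactionSe [logFile path after var expand: /example/tranlogs/serverA]",
            "[1/5/14 16:19:06:811] 00000023 TransactionSe [logFile path after var expand: /example/tranlogs/serverA]");
        System.out.println("Paths used by more than one recovery: " + duplicateLogPaths(entries));
    }
}
```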

Cause

You are using substitution variables in the serverindex.xml:
xmi:id="RecoveryLog_???"
transactionLogDirectory="${TRANLOG_ROOT}/${WAS_SERVER_NAME};"/>

This works for a single server. In an HA cluster with peer recovery, however, a server that recovers the transaction logs of a peer resolves the variables with its own configured values rather than the peer's values. As a consequence, when the server starts two recovery threads, one for itself and one for the peer, both end up referencing the same log file.
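
The effect can be illustrated with a minimal Java sketch. It is not the product's implementation: the expand helper, the server names, and the root directory are hypothetical, and the template omits the trailing size suffix seen in the trace. It only shows that when both the local and the peer entry are resolved with the local server's variable values, the two results are identical, while an entry that names the peer explicitly stays distinct.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration, not WebSphere code.
class TranlogPathExpansion {
    /** Replaces each ${NAME} in the template with its value from vars. */
    static String expand(String template, Map<String, String> vars) {
        String result = template;
        for (Map.Entry<String, String> e : vars.entrySet()) {
            result = result.replace("${" + e.getKey() + "}", e.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        // Variable values of the server that performs the recovery (made-up values).
        Map<String, String> localVars = new HashMap<>();
        localVars.put("TRANLOG_ROOT", "/example/tranlogs");
        localVars.put("WAS_SERVER_NAME", "serverA");

        // Both servers are configured with the same variable-based entry.
        String localEntry = "${TRANLOG_ROOT}/${WAS_SERVER_NAME}";
        String peerEntry = "${TRANLOG_ROOT}/${WAS_SERVER_NAME}";

        // The peer entry is resolved with the LOCAL values, so both paths collide.
        String localPath = expand(localEntry, localVars);
        String peerPath = expand(peerEntry, localVars);
        System.out.println("local: " + localPath);
        System.out.println("peer:  " + peerPath);
        System.out.println("collision: " + localPath.equals(peerPath));

        // An entry that names the peer explicitly does not depend on WAS_SERVER_NAME.
        String explicitPeerPath = expand("${TRANLOG_ROOT}/serverB", localVars);
        System.out.println("explicit peer: " + explicitPeerPath);
    }
}
```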

Resolving The Problem

Because both threads try to lock the same file, the second thread's attempt to access the log file fails with java.nio.channels.OverlappingFileLockException.
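
The exception itself is standard Java NIO behavior: FileChannel.tryLock throws OverlappingFileLockException when the same Java virtual machine already holds a lock that overlaps the requested region. The following sketch reproduces it outside the application server with a temporary file. The class and method names are hypothetical, and the demo uses a single thread because the check applies to the whole JVM rather than to individual threads.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.channels.OverlappingFileLockException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical demo class, not WebSphere code.
class OverlappingLockDemo {
    /** Returns true if a second tryLock on the same file from this JVM is rejected as overlapping. */
    static boolean secondLockOverlaps(Path file) throws IOException {
        try (FileChannel first = FileChannel.open(file, StandardOpenOption.WRITE);
             FileChannel second = FileChannel.open(file, StandardOpenOption.WRITE)) {
            // First exclusive lock on the whole file.
            FileLock held = first.tryLock();
            if (held == null) {
                throw new IOException("File is locked by another process");
            }
            try {
                // Second attempt on the same file from the same JVM.
                second.tryLock();
                return false;
            } catch (OverlappingFileLockException e) {
                return true;
            } finally {
                held.release();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("log1", ".tmp");
        try {
            System.out.println("Second lock attempt overlaps: " + secondLockOverlaps(file));
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```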

Change the transactionLogDirectory entries to qualified entries. If all servers are set up identically, you can keep the TRANLOG_ROOT variable, but you must specify the WAS_SERVER_NAME part explicitly. It is safer to use a fully qualified name.

Product: WebSphere Application Server
Component: Java Transaction Service (JTS)
Platform: AIX, HP-UX, IBM i, Linux, Solaris, Windows
Version: 9.0; 8.5.5; 8.0; 7.0
Edition: Base; Network Deployment
Business Unit: Cloud & Data Platform
Line of Business: Automation

Document Information

Modified date:
15 June 2018

UID

swg21460319