Technical Blog Post
Abstract
DB2 & TSA: WHAT PROGRAM DID TSA REPORT TIMING OUT?
Body
In a pureScale environment, TSA is used to monitor the various resources
that need to be running to ensure a healthy pureScale system.
Some of these resources might fail. If that happens, you want to find
out which one failed and take appropriate action. Sometimes it's not easy
to navigate the various files to find out what happened.
This short document uses a live example to show you how to do it.
When a problem occurs, it is reported in the system error log, which
can be viewed using 'errpt -a'. For example:
---------------------------------------------------------------------------
LABEL: GBLRESRM_MONITOR_TI
IDENTIFIER: 87EB4A70
Date/Time: Fri Aug 11 05:33:00 KORST 2017
Sequence Number: 6932
Machine Id: 00CD15C74C00
Node Id: prodnode1
Class: O
Type: PERM
WPAR: Global
Resource Name: GblResRM
Description
IBM.Application monitor command timed out.
Probable Causes
The Resource Manager killed the monitor command
because it did not return within the specified command timeout
Failure Causes
The Resource Manager killed the monitor command
because it did not return within the specified command timeout
Recommended Actions
Check the time the monitor command needs to complete.
Adapt the MonitorCommandTimeout attribute if necessary.
Detail Data
DETECTING MODULE
RSCT,Application.C,1.92.3.177,6127
ERROR ID
REFERENCE CODE
Resource name
ca_db2prod_0-rs
The above message contains three important pieces of information that
allow us to go a bit further:
Date of event: Fri Aug 11 05:33:00 KORST 2017
Class and Nature of failure: IBM.Application monitor command timed out
Resource for which monitoring failed: ca_db2prod_0-rs
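When the error log holds many entries, picking these three items out by eye gets tedious. The following is a minimal sketch, not an IBM-provided tool: a hypothetical helper, errpt_summary, that reads 'errpt -a' text on standard input and prints one line per entry that has a "Resource name" detail. It assumes the entry layout matches the sample above (the description on the line after "Description", the resource on the line after "Resource name").

```shell
# Hypothetical helper: summarize 'errpt -a' text read from stdin as
#   date | description | resource
# Assumes the field labels and layout of the sample entry shown above.
errpt_summary() {
    awk '
        { gsub(/^[ \t]+|[ \t]+$/, "") }           # trim surrounding whitespace
        /^LABEL:/ { when = ""; desc = ""; res = "" }   # a new entry starts
        /^Date\/Time:/ { sub(/^Date\/Time:[ \t]*/, ""); when = $0 }
        prev == "Description"   { desc = $0 }     # line after the "Description" header
        prev == "Resource name" { res = $0; print when " | " desc " | " res }
        { prev = $0 }
    '
}
```

It could be used as: errpt -a | errpt_summary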
The next step is to find out which program is used to monitor the resource.
This can be found by printing the list of resources and their attributes using
the 'lsrsrc' command. To do that, we use the class we found in the
'errpt -a' output above, that is, 'IBM.Application':
lsrsrc IBM.Application
The output might be rather long, but we are looking for the resource
named 'ca_db2prod_0-rs', as it appeared in the 'errpt -a' output. We find it
in the 'lsrsrc' output:
resource 29:
Name = "ca_db2prod_0-rs"
ResourceType = 0
AggregateResource = "0x2028 0xffff 0x187fd6e7 0xb47a34be 0x94760fcd 0xdef74e22"
StartCommand = "/db2/db2prod/sqllib/adm/db2rocme 1 CF db2prod 128 START"
StopCommand = "/db2/db2prod/sqllib/adm/db2rocme 1 CF db2prod 128 STOP"
MonitorCommand = "/db2/db2prod/sqllib/adm/db2rocme 1 CF db2prod 128 MONITOR"
The attribute we are interested in is 'MonitorCommand'. So we find that the
command that timed out was:
"/db2/db2prod/sqllib/adm/db2rocme 1 CF db2prod 128 MONITOR"
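The lookup above can also be scripted rather than done by scrolling through the output. This is a minimal sketch, assuming the 'lsrsrc' output uses the 'Attribute = "value"' layout shown above; monitor_command_of is a hypothetical helper name, not part of RSCT or DB2.

```shell
# Hypothetical helper: print the MonitorCommand of the named resource.
# Reads 'lsrsrc IBM.Application' output on stdin; $1 is the resource name.
monitor_command_of() {
    awk -v name="$1" '
        /^[ \t]*resource [0-9]+:/ { hit = 0 }     # a new resource block starts
        $1 == "Name" && $2 == "=" { hit = ($3 == "\"" name "\"") }
        hit && $1 == "MonitorCommand" && $2 == "=" {
            sub(/^[^"]*"/, "")                    # drop everything up to the opening quote
            sub(/"[ \t]*$/, "")                   # drop the closing quote
            print
            exit
        }
    '
}
```

It could be used as: lsrsrc IBM.Application | monitor_command_of ca_db2prod_0-rs
The 'lsrsrc' command also accepts a selection string through its -s option, which can narrow the output to one resource before any parsing is needed.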
Most commands can be run on their own (with the appropriate environment) to
help diagnose the issue further.
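Because the reported failure is a timeout, it helps to see how long the command takes when run by hand. The following is a minimal sketch of a hypothetical wrapper, run_timed, that runs any command and reports its exit code and elapsed time. It assumes the local 'date' command supports the %s format; which user and environment the monitor command needs is not covered here.

```shell
# Hypothetical wrapper: run a command, then report its exit code and
# elapsed wall-clock seconds. Assumes 'date +%s' is supported.
run_timed() {
    start=$(date +%s)
    "$@"                      # run the command exactly as given
    rc=$?
    end=$(date +%s)
    echo "exit code: $rc, elapsed: $((end - start))s"
    return "$rc"              # pass the command's exit code through
}
```

It could be used as: run_timed /db2/db2prod/sqllib/adm/db2rocme 1 CF db2prod 128 MONITOR
The elapsed time can then be compared with the resource's MonitorCommandTimeout attribute mentioned in the error log entry.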
UID
ibm11140370