Technical Blog Post
Abstract
Why a Db2 LUW instance went down suddenly without any diagnostic dumps
Body
Why would a Db2 LUW instance go down suddenly without any associated diagnostic dumps?
This is already explained across several documents, but the question is still asked regularly.
So, I thought I would write a few lines on it.
First of all, we need to understand how Db2 dumps diagnostics when it experiences a crash or trap.
Db2 has its own signal handler. When the Db2 instance process receives a UNIX signal, the handler routine runs and decides what to dump based on which signal was received.
A very common signal in crashes is SIGSEGV (signal 11), which is raised on a memory boundary violation.
Db2 can handle that kind of signal and dump the needed diagnostics.
But some UNIX signals, such as SIGKILL (signal 9), cannot be handled, by UNIX design.
That means if the db2sysc process receives a signal 9, it has no chance to dump any diagnostics, because the signal cannot be handled.
As a result, Db2 can go down silently, leaving very few details or no dump at all.
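The difference between a signal that can be handled and one that cannot is easy to see outside Db2 as well. Below is a minimal Python sketch, not Db2 code: a small child process stands in for db2sysc, SIGTERM stands in for any catchable signal, and the handler and its "dump" file are made up purely for illustration.

```python
import os
import signal
import subprocess
import sys
import tempfile

# Hypothetical stand-in for a server process: it installs a handler that
# writes a "dump" file, reports that it is ready, then waits for a signal.
CHILD = r"""
import signal, sys, time
dump_path = sys.argv[1]

def handler(signum, frame):
    with open(dump_path, "w") as f:
        f.write("dump for signal %d\n" % signum)
    sys.exit(1)

signal.signal(signal.SIGTERM, handler)
print("ready", flush=True)
while True:
    time.sleep(1)
"""


def run_and_signal(signum):
    """Start the child, send it signum, return (returncode, dump_written)."""
    with tempfile.TemporaryDirectory() as tmp:
        dump_path = os.path.join(tmp, "dump.txt")
        proc = subprocess.Popen(
            [sys.executable, "-c", CHILD, dump_path],
            stdout=subprocess.PIPE,
            text=True,
        )
        proc.stdout.readline()   # wait until the handler is installed
        proc.send_signal(signum)
        proc.wait()
        proc.stdout.close()
        return proc.returncode, os.path.exists(dump_path)


if __name__ == "__main__":
    print("SIGTERM:", run_and_signal(signal.SIGTERM))
    print("SIGKILL:", run_and_signal(signal.SIGKILL))
```

With SIGTERM the handler runs and the dump file is written. With SIGKILL the child never gets control, so no dump exists, and the parent only sees a negative return code (-9), which is how Python reports "terminated by signal 9".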
It's very easy to test this out.
Just create a Db2 instance and, while it's running, issue a kill -9 against the db2sysc process.
Messages like the following will show up in db2diag.log as the first set of messages when this issue is hit:
2018-12-11-12.18.39.959403-300 E16012A577 LEVEL: Severe
PID : 21364950 TID : 258 PROC : db2wdog
INSTANCE: db2inst1 NODE : 000
HOSTNAME: myhost1
EDUID : 258 EDUNAME: db2wdog [db2inst1]
FUNCTION: DB2 UDB, base sys utilities, sqleWatchDog, probe:20
MESSAGE : ADM0503C An unexpected internal processing error has occurred. All
DB2 processes associated with this instance have been shutdown.
Diagnostic information has been recorded. Contact IBM Support for
further assistance.
2018-12-11-12.18.40.004468-300 E16590A455 LEVEL: Error
PID : 21364950 TID : 258 PROC : db2wdog
INSTANCE: db2inst1 NODE : 000
HOSTNAME: myhost1
EDUID : 258 EDUNAME: db2wdog [db2inst1]
FUNCTION: DB2 UDB, base sys utilities, sqleWatchDog, probe:9064
DATA #1 : Process ID, 4 bytes
2818312
DATA #2 : Hexdump, 8 bytes
0x07000000003FDC58 : 0000 0102 0000 0009 ........
This says that the db2wdog process found that process 2818312 (which is the db2sysc process; check earlier in db2diag.log) was killed with signal 9 (the 0009 pattern).
Depending on the Db2 level, it can show up a bit differently too.
For example: 0201 0000 0900 0000
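Both patterns can be read with a few lines of Python. This is only a sketch under an assumption that does not come from Db2 documentation: judging from the two samples above, the last 4 bytes of the hexdump look like the signal number stored as a 32-bit integer, in one byte order or the other. The function name is made up for illustration.

```python
def signal_from_hexdump(hexdump):
    """Return the signal number found in an 8-byte db2wdog hexdump string.

    Assumption (inferred from the two samples only): the last 4 bytes hold
    the signal number as a 32-bit integer, in either byte order.
    """
    raw = bytes.fromhex(hexdump.replace(" ", ""))
    if len(raw) != 8:
        raise ValueError("expected 8 bytes, got %d" % len(raw))
    tail = raw[4:]
    for order in ("big", "little"):
        value = int.from_bytes(tail, order)
        if 1 <= value <= 64:   # plausible range for a signal number
            return value
    return None                # nothing that looks like a signal number


if __name__ == "__main__":
    print(signal_from_hexdump("0000 0102 0000 0009"))
    print(signal_from_hexdump("0201 0000 0900 0000"))
```

Both sample strings give 9 under this reading, that is, SIGKILL.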
The only time Db2 issues signal 9 against itself is when it needs to clean up the instance after a crash or trap, or sometimes even after a graceful shutdown.
To find out whether the signal 9 was issued by Db2 itself to clean up the leftover instance, check whether the signal 9 message is logged after the initial crash. The cleanup happens as an aftereffect of a crash or trap.
Example messages:
2018-12-11-12.18.40.006951-300 I18450A984 LEVEL: Event
PID : 21364950 TID : 258 PROC : db2wdog
INSTANCE: db2inst1 NODE : 000
HOSTNAME: myhost1
EDUID : 258 EDUNAME: db2wdog [db2inst1]
FUNCTION: DB2 UDB, oper system services, sqlossig, probe:10
MESSAGE : Sending SIGKILL to the following process id
DATA #1 : signed integer, 4 bytes
59375676
....
2018-12-11-12.18.40.046480-300 I19435A390 LEVEL: Event
PID : 21364950 TID : 258 PROC : db2wdog
INSTANCE: db2inst1 NODE : 000
HOSTNAME: myhost1
EDUID : 258 EDUNAME: db2wdog [db2inst1]
FUNCTION: DB2 UDB, oper system services, sqloCleanUpPosixIPCResources, probe:100
MESSAGE : Clean up POSIX resources attempt from engine.
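The ordering check described above can be roughly automated. The following Python sketch is a heuristic only: the function names are hypothetical, the sample text is abbreviated from the excerpts in this post, and it assumes you feed it just the part of db2diag.log around the outage, since older Error records would otherwise confuse it.

```python
import re

# A db2diag.log record starts with a timestamp such as
# 2018-12-11-12.18.39.959403-300
RECORD_START = re.compile(
    r"^\d{4}-\d{2}-\d{2}-\d{2}\.\d{2}\.\d{2}\.\d{6}[+-]\d{3}\s",
    re.MULTILINE,
)


def split_records(text):
    """Split db2diag.log text into records, one per timestamp header."""
    starts = [m.start() for m in RECORD_START.finditer(text)]
    ends = starts[1:] + [len(text)]
    return [text[a:b] for a, b in zip(starts, ends)]


def sigkill_follows_failure(text):
    """True if a Severe/Error record precedes the first 'Sending SIGKILL'.

    Returns None when no 'Sending SIGKILL' record is present at all.
    """
    failure_seen = False
    for record in split_records(text):
        if "Sending SIGKILL" in record:
            return failure_seen
        if re.search(r"LEVEL:\s*(Severe|Error)", record):
            failure_seen = True
    return None


# Abbreviated sample built from the excerpts in this post.
SAMPLE = """\
2018-12-11-12.18.39.959403-300 E16012A577 LEVEL: Severe
PID : 21364950 TID : 258 PROC : db2wdog
FUNCTION: DB2 UDB, base sys utilities, sqleWatchDog, probe:20
2018-12-11-12.18.40.006951-300 I18450A984 LEVEL: Event
PID : 21364950 TID : 258 PROC : db2wdog
FUNCTION: DB2 UDB, oper system services, sqlossig, probe:10
MESSAGE : Sending SIGKILL to the following process id
"""

if __name__ == "__main__":
    print(sigkill_follows_failure(SAMPLE))
```

A True result only says that the "Sending SIGKILL" record came after an earlier failure record, which matches the cleanup pattern. It does not by itself say what caused that first failure.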
Any other use of signal 9 originates from outside Db2.
Common sources of signal 9:
- UNIX experienced an out of memory (OOM) condition. To protect the OS, it issued kill -9 against a Db2 process.
- The cluster manager decided to bring down the Db2 instance based on some criteria.
- Somebody manually issued kill -9 against the db2sysc process.
- Somebody accidentally issued db2_kill against the wrong instance.
Another very common question is: why can't Db2 log any details about who issued the signal 9? What is the source of the signal 9?
The answer is that Db2 has no control over, or awareness of, who issues a signal 9 from outside. It is not expected behavior for the Db2 engine, so it cannot keep track of the origin of an external signal 9.
Then how can that be tracked?
That has to be done at the OS level, and it might not always be easy to track.
If any OS-level auditing is possible, it should be used.
Otherwise, depending on the OS, there are some other ways.
For example, on AIX there is ProbeVue, which can be used as explained in this Technote.
UID
ibm11139926