Technical Blog Post
Abstract
On pureScale, db2start may hang or time out if the idle and CA resources cannot be brought online
Body
In a pureScale environment, when you start a CF or a member, DB2 first attempts to bring the idle resources and the CA resources online.
If this takes a long time for any reason, db2start may appear to hang, or it may eventually time out.
For instance, the following stack shows db2start waiting for the TSA resources to come online:
root@lpar215ps4:/>ps -elf|grep db2start
200001 A db2inst1 11010278 9240664 0 60 20 8a92b5590 8804 f1000a00e0bc4ab0 13:57:20 pts/0 0:00 db2start cf 128
root@lpar215ps4:/>pstack 11010278
ksh: pstack: not found.
root@lpar215ps4:/>procstack 11010278
11010278: db2start cf 128
0x090000000055a910 _p_nsleep(??, ??) + 0x10
0x09000000000397e4 nsleep(??, ??) + 0xe4
0x090000000015da90 nanosleep(??, ??) + 0x190
0x090000000118e468 ossSleep(??) + 0xa8
0x0900000003b37f00 sqlhaWaitForResourceState(SQLHA_CLUSTER_OBJECT_INFO*,_sqlhaObjStates,SQLHA_CONTROL_BLOCK*)(0x80000000000080, 0x100000001, 0x200) + 0x1640
0x0900000003b364d0 sqlhaOnlineClusterObject(SQLHA_CLUSTER_OBJECT_INFO*,SQLHA_CONTROL_BLOCK*)(??, ??) + 0x1e30
0x0900000003b6b9b4 sqlhaOperationOnClusterObjectsByType(char*,_sqlhaClusterObjType,SQLHA_CLUSTER_OPERATION,unsigned long,SQLHA_CLUSTER_OBJECT_INFO**,SQLHA_CLUSTER_OPERATION_RESULT_LIST**,SQLHA_CONTROL_BLOCK*)(0xffffffffffffffff, 0x1200000012, 0x0, 0x1, 0x0, 0x90000000b2dc79c, 0xcd) + 0x1054
0x0900000003b6dc28 sqlhaStartSDInfrastructure(char*,unsigned long,SQLHA_CLUSTER_OPERATION_RESULT_LIST**,SQLHA_CONTROL_BLOCK*,short)(0x8000000080, 0x5, 0x0, 0x1, 0x3) + 0x15c8
0x0900000009c95814 sqleIssueStartStop(int,void*,char*,char*,sqlf_kcfd*,SQLE_INTERNAL_ARGS*,unsigned int,unsigned int,sqlca*)(0x100, 0x2a9, 0x0, 0x2f64623266733031, 0x2f646232696e7374, 0x312f73716c6c6962, 0x5f73686172656400, 0x0) + 0x9df4
0x0900000009c88cf8 sqleProcessStartStop(int,void*,SQLE_INTERNAL_ARGS*,sqlf_kcfd*,char*,unsigned int,unsigned int,sqlca*)(0x100000001, 0x0, 0x3f2000003f2, 0x0, 0x0, 0x0, 0x0, 0x0) + 0x1138
0x0000000100002a1c main(??, ??) + 0x219c
0x00000001000002f8 __start() + 0x70
In that case, first check whether any idle resource is offline:
root@lpar215ps4:/>lssam|grep idle
Online IBM.ResourceGroup:idle_db2inst1_997_lpar214ps3-rg Nominal=Online
'- Online IBM.Application:idle_db2inst1_997_lpar214ps3-rs
'- Online IBM.Application:idle_db2inst1_997_lpar214ps3-rs:lpar214ps3
Online IBM.ResourceGroup:idle_db2inst1_997_lpar215ps4-rg Nominal=Online
'- Online IBM.Application:idle_db2inst1_997_lpar215ps4-rs
'- Online IBM.Application:idle_db2inst1_997_lpar215ps4-rs:lpar215ps4
Online IBM.ResourceGroup:idle_db2inst1_998_lpar214ps3-rg Nominal=Online
'- Online IBM.Application:idle_db2inst1_998_lpar214ps3-rs
'- Online IBM.Application:idle_db2inst1_998_lpar214ps3-rs:lpar214ps3
Online IBM.ResourceGroup:idle_db2inst1_998_lpar215ps4-rg Nominal=Online
'- Online IBM.Application:idle_db2inst1_998_lpar215ps4-rs
'- Online IBM.Application:idle_db2inst1_998_lpar215ps4-rs:lpar215ps4
Online IBM.ResourceGroup:idle_db2inst1_999_lpar214ps3-rg Nominal=Online
'- Online IBM.Application:idle_db2inst1_999_lpar214ps3-rs
'- Online IBM.Application:idle_db2inst1_999_lpar214ps3-rs:lpar214ps3
Online IBM.ResourceGroup:idle_db2inst1_999_lpar215ps4-rg Nominal=Online
'- Online IBM.Application:idle_db2inst1_999_lpar215ps4-rs
'- Online IBM.Application:idle_db2inst1_999_lpar215ps4-rs:lpar215ps4
Online IBM.Equivalency:idle_db2inst1_997_lpar214ps3-rg_group-equ
Online IBM.Equivalency:idle_db2inst1_997_lpar215ps4-rg_group-equ
Online IBM.Equivalency:idle_db2inst1_998_lpar214ps3-rg_group-equ
Online IBM.Equivalency:idle_db2inst1_998_lpar215ps4-rg_group-equ
Online IBM.Equivalency:idle_db2inst1_999_lpar214ps3-rg_group-equ
Online IBM.Equivalency:idle_db2inst1_999_lpar215ps4-rg_group-equ
If any idle resource is offline, check whether the cause is a "Depends-On" resource that is not available.
Next, also check whether any CA or primary resource is offline:
root@lpar215ps4:/>lssam|egrep "ca_|primary"
Online IBM.ResourceGroup:ca_db2inst1_0-rg Nominal=Online
'- Online IBM.Application:ca_db2inst1_0-rs
|- Online IBM.Application:ca_db2inst1_0-rs:lpar214ps3
'- Online IBM.Application:ca_db2inst1_0-rs:lpar215ps4
Online IBM.ResourceGroup:primary_db2inst1_900-rg Nominal=Online
'- Online IBM.Application:primary_db2inst1_900-rs
|- Offline IBM.Application:primary_db2inst1_900-rs:lpar214ps3
'- Online IBM.Application:primary_db2inst1_900-rs:lpar215ps4
Online IBM.Equivalency:ca_db2inst1_0-rg_group-equ
Online IBM.Equivalency:primary_db2inst1_900-rg_group-equ
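On a cluster with many resources, it can be easier to print only the lssam lines that are not Online. The following is a minimal sketch, not part of DB2 or TSA: the function name not_online is hypothetical, and it assumes lssam prints plain, uncoloured text in the same layout as the samples above, with the state as the first word on each line.

```shell
# Hypothetical helper: print the lssam lines that match a name pattern
# and whose state (the first word on the line) is anything except "Online".
# It reads lssam output on stdin, so it also works on saved output.
not_online() {
    pattern=$1
    grep -E "$pattern" | grep -Ev '^[^A-Za-z]*Online '
}

# Possible usage on a cluster host:
#   lssam | not_online 'idle'
#   lssam | not_online 'ca_|primary'
```

A line printed by this filter is not necessarily a problem. In the sample output above, primary_db2inst1_900-rs is Offline on lpar214ps3 while its resource group is Online, so interpret each result against the state of the owning resource group.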
If db2start eventually times out, check db2diag.log to see which resource is taking a long time to come online.
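One way to narrow down the log is to look for the most recent entries that mention the sqlha cluster functions seen in the stack above. This is only a sketch under assumptions: the function name sqlha_lines is hypothetical, the path to db2diag.log must be supplied by you because it depends on the instance's diagnostic path settings, and it assumes the relevant entries contain the string "sqlha".

```shell
# Hypothetical helper: show the last N lines of a db2diag.log file that
# mention "sqlha", each prefixed with its line number in the file.
# $1 = path to db2diag.log (supplied by the caller), $2 = N (default 20)
sqlha_lines() {
    logfile=$1
    count=${2:-20}
    grep -n 'sqlha' "$logfile" | tail -n "$count"
}

# Possible usage (the path is a placeholder):
#   sqlha_lines /path/to/db2diag.log 40
```

The line numbers make it easy to open the log at the matching entry and read the full record, including the resource name and timestamps around it.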
UID
ibm11140316