Technical Blog Post
Abstract
DB2 pureScale: db2start fails with SQL1517N and cannot be resolved by repairing resources
Body
Recently I saw a user upgrading his pureScale cluster to V10.5 FP8 with the online upgrade method, and he hit this error:
$ ./installFixPack -p /opt/IBM/db2/V10.5.0.8 -I db2hk -online -l /tmp/install.log -t /tmp/install.trc -f RSCT -f GPFS
Execution of a rolling update task failed with an error
Error Message :
SQL1517N db2start failed because the cluster manager resource states are inconsistent.
Refer to db2diag.log for more details
From the install trace, we can see that the upgrade failed at the last step, "db2start instance on db203":
23176 |||||||||||| 1 InstallProcess::executeLocal 90 -DATA- , STRING = /source/10_5_fp8/universal/db2/aix/install/db2iexec -n -o "/tmp/db2ioMi7ace:/tmp/db2ieMi7acf" db2hk "db2start instance on db203"
23177 |||||||||||| 1 InstallProcess::executeLocal 100 -DATA- , INT = 1024
23178 |||||||||||| 1 InstallProcess::executeLocal 110 -DATA- , STRING = WEXITSTATUS
23179 |||||||||||| 1 InstallProcess::executeLocal 150 -DATA- , INT = 4
23180 |||||||||||\ 1 InstallProcess::executeLocal EXIT Wed Feb 15 20:33:10 2017 -- , INT = 0
23181 ||||||||||\ 1 InstallProcess::execute EXIT Wed Feb 15 20:33:10 2017 -- , INT = 0
23182 ||||||||||/ 1 InstallProcess::exitCode ENTRY Wed Feb 15 20:33:10 2017 -- ,
23183 ||||||||||\ 1 InstallProcess::exitCode EXIT Wed Feb 15 20:33:10 2017 -- , INT = 0
23184 ||||||||||/ 1 InstallProcess::getStdOutputLength ENTRY Wed Feb 15 20:33:10 2017 -- ,
23185 |||||||||||/ 1 iPutFileInBuffer ENTRY Wed Feb 15 20:33:10 2017 -- ,
23186 |||||||||||| 1 iPutFileInBuffer 10 -DATA- , STRING = /tmp/db2ioMi7ace
23187 ||||||||||||/ 1 iFopen ENTRY Wed Feb 15 20:33:10 2017 -- ,
23188 ||||||||||||| 1 iFopen 10 -DATA- , STRING = /tmp/db2ioMi7ace
23189 ||||||||||||| 1 iFopen 20 -DATA- , STRING = rt
23190 ||||||||||||\ 1 iFopen EXIT Wed Feb 15 20:33:10 2017 -- , INT = 0
23191 |||||||||||| 1 iPutFileInBuffer 20 -DATA- , STRING = SQL1517N db2start failed because the cluster manager resource states are inconsistent.
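A side note on the two numbers in the trace above: the value 1024 looks like the raw wait status of the db2iexec child process, and the 4 logged after the WEXITSTATUS marker is the exit code taken from it. The sketch below shows that arithmetic. The function name is hypothetical, and the bit layout (exit code in bits 8 to 15) is the conventional UNIX one, which I am assuming applies here.

```shell
# Hypothetical helper: pull the exit code out of a raw wait status,
# the same way the C macro WEXITSTATUS does on common UNIX platforms.
wait_status_to_exit_code() {
    echo $(( ($1 >> 8) & 255 ))
}

wait_status_to_exit_code 1024   # prints 4
```

So the db2start issued by the installer ended with exit code 4, which matches the SQL1517N failure captured in the output file.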
The user then saw the same error when he issued the commands below:
db2start instance on db203
db2start cf 129
As a starting point, I suggested that he try to repair the inconsistent resources with db2cluster.
Unfortunately, that didn't work. There seemed to be no quick fix for this problem, so we had to find the cause of the failure first.
Checking db2diag.log, we can see the following error messages:
2017-02-15-20.33.10.188475+480 E21788608A1525 LEVEL: Error
PID : 8847618 TID : 1 PROC : db2start
INSTANCE: db2sdin1 NODE : 000
HOSTNAME: db203
EDUID : 1
FUNCTION: DB2 UDB, Shared Data Structure Abstraction Layer for CF, sqleCAGetTransportMethod, probe:684
MESSAGE : ZRC=0x87270023=-2027487197=SQLE_SAL_UNEXPECTED_ERROR
"Unexpected SAL Error."
DATA #1 : String, 34 bytes
Unable to determine transport type
DATA #2 : Codepath, 8 bytes
18:22
DATA #3 : unsigned integer, 8 bytes
1
DATA #4 : unsigned integer, 8 bytes
1
DATA #5 : unsigned integer, 4 bytes
0
DATA #6 : unsigned integer, 4 bytes
0
CALLSTCK: (Static functions may not be resolved correctly, as they are resolved to the nearest symbol)
[0] 0x09000000315876CC sqleCAGetTransportMethod + 0xC08
[1] 0x0900000031585F84 sqleCAGetTransportMethod + 0x1244
[2] 0x0900000031582080 sqleCAIsRoCE + 0x6E4
[3] 0x0900000031C05618 sqlhaVerifyNetworkResources__FPPPcPiPbiP19SQLHA_CONTROL_BLOCK + 0x920
[4] 0x0900000031C0A99C sqlhaVerifyClusterResources__FPcP18sqlo_db2nodes_descPbP19SQLHA_CONTROL_BLOCK + 0x1CA4
[5] 0x090000002FAECD8C sqleIssueStartStop__FiPvPcT3P9sqlf_kcfdP18SQLE_INTERNAL_ARGSUiT7P5sqlca + 0xE22C
[6] 0x090000002FADA4E8 sqleIssueStartStop__FiPvPcT3P9sqlf_kcfdP18SQLE_INTERNAL_ARGSUiT7P5sqlca + 0x96CC
[7] 0x09000000312D8B68 sqleProcessStartStop__FiPvP18SQLE_INTERNAL_ARGSP9sqlf_kcfdPcUiT6P5sqlca + 0xAF8
[8] 0x0000000100002950 main + 0x20D0
[9] 0x00000001000002F8 __start + 0x70
2017-02-15-20.33.10.190401+480 I21790134A639 LEVEL: Error
PID : 8847618 TID : 1 PROC : db2start
INSTANCE: db2sdin1 NODE : 000
HOSTNAME: db203
EDUID : 1
FUNCTION: DB2 UDB, Shared Data Structure Abstraction Layer for CF, sqleCAIsRoCE, probe:2105
MESSAGE : ZRC=0x87270023=-2027487197=SQLE_SAL_UNEXPECTED_ERROR
"Unexpected SAL Error."
DATA #1 : Codepath, 8 bytes
2
DATA #2 : String, 0 bytes
Object not dumped: Address: 0x0900000033236D50 Size: 0 Reason: Zero-length data
DATA #3 : String, 0 bytes
Object not dumped: Address: 0x0FFFFFFFFFFAEDA0 Size: 0 Reason: Zero-length data
2017-02-15-20.33.10.218674+480 E21790774A520 LEVEL: Error
PID : 8847618 TID : 1 PROC : db2start
INSTANCE: db2sdin1 NODE : 000
HOSTNAME: db203
EDUID : 1
FUNCTION: DB2 UDB, high avail services, sqlhaVerifyClusterResources, probe:16760
MESSAGE : ZRC=0x87270023=-2027487197=SQLE_SAL_UNEXPECTED_ERROR
"Unexpected SAL Error."
DATA #1 : String, 37 bytes
public network equivalency is missing
DATA #2 : signed integer, 4 bytes
0
DATA #3 : Boolean, 1 bytes
false
2017-02-15-20.33.10.273849+480 I21792079A1374 LEVEL: Event
PID : 8847618 TID : 1 PROC : db2start
INSTANCE: db2sdin1 NODE : 000
HOSTNAME: db203
EDUID : 1
FUNCTION: DB2 UDB, base sys utilities, sqleIssueStartStop, probe:6007
MESSAGE : ZRC=0x87270023=-2027487197=SQLE_SAL_UNEXPECTED_ERROR
"Unexpected SAL Error."
DATA #1 : SQLCA, PD_DB2_TYPE_SQLCA, 136 bytes
sqlcaid : SQLCA sqlcabc: 136 sqlcode: 0 sqlerrml: 0
sqlerrmc:
sqlerrp : SQL10058
sqlerrd : (1) 0x00000000 (2) 0x00000000 (3) 0x00000000
(4) 0x00000000 (5) 0x00000000 (6) 0x00000000
sqlwarn : (1) (2) (3) (4) (5) (6)
(7) (8) (9) (10) (11)
sqlstate:
DATA #2 : SQLCA, PD_DB2_TYPE_SQLCA, 136 bytes
sqlcaid : SQLCA sqlcabc: 136 sqlcode: -1517 sqlerrml: 0
sqlerrmc:
sqlerrp : SQLESSCM
sqlerrd : (1) 0x00000000 (2) 0x00000000 (3) 0x00000000
(4) 0x00000000 (5) 0x00000000 (6) 0x00000000
sqlwarn : (1) (2) (3) (4) (5) (6)
(7) (8) (9) (10) (11)
sqlstate:
DATA #3 : Boolean, 1 bytes
false
DATA #4 : Boolean, 1 bytes
false
DATA #5 : Boolean, 1 bytes
false
DATA #6 : Boolean, 1 bytes
false
DATA #7 : Boolean, 1 bytes
false
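Each MESSAGE line above prints the same ZRC twice, once in hex and once as a signed 32-bit decimal. When searching logs or notes where only one form appears, it can help to convert between them. This is a minimal sketch with a hypothetical function name, assuming a shell with 64-bit arithmetic (bash, ksh93 and dash all qualify as far as I know).

```shell
# Hypothetical helper: interpret a 32-bit hex value as a signed integer,
# which is how db2diag.log prints the decimal form of a ZRC.
hex32_to_signed() {
    value=$(( $1 ))
    if [ "$value" -ge 2147483648 ]; then
        value=$(( value - 4294967296 ))
    fi
    echo "$value"
}

hex32_to_signed 0x87270023   # prints -2027487197
```

The same conversion applies to the rc = 0x90000620 = -1879046624 value that shows up later in the db2trc output.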
The first error is "Unable to determine transport type", which is usually caused by an incorrect interconnect configuration. So I asked the user to check db2nodes.cfg, /etc/hosts and /etc/dat.conf, but everything looked fine.
Finally, I managed to collect a db2trc while reproducing the error:
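When db2diag.log is large, finding the first failing function by eye is tedious. The sketch below lists the function and probe of every Error or Severe record in order, so the first line of output points at the first failure. The function name is hypothetical, and the script assumes the record layout shown in this post (a header line containing "LEVEL:" followed by a "FUNCTION:" line whose last two comma-separated fields are the function and the probe).

```shell
# Hypothetical helper: print "function probe:N" for each Error/Severe
# record in a db2diag.log style file, in the order they were logged.
diag_error_functions() {
    awk '
        /LEVEL: /  { is_error = ($0 ~ /LEVEL: (Error|Severe)/) }
        is_error && /^FUNCTION:/ {
            n = split($0, parts, ", ")
            print parts[n-1] " " parts[n]
        }
    ' "$1"
}
```

Running diag_error_functions against the excerpt above would print sqleCAGetTransportMethod probe:684 first, followed by sqleCAIsRoCE probe:2105 and sqlhaVerifyClusterResources probe:16760; the Event-level record from sqleIssueStartStop is skipped.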
$ db2trc on -f db2trc.dmp
Trace is turned on
$ db2start cf 129
SQL1517N db2start failed because the cluster manager resource states are inconsistent.
$ db2trc off
Trace is turned off
$ db2trc flw db2trc.dmp db2trc.flw
$ db2trc fmt db2trc.dmp db2trc.fmt
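The flow file can run to tens of thousands of lines, so rather than reading it top to bottom it is quicker to search for error probes. A minimal sketch, assuming the flow layout shown in this post (a line ending in "error [probe N]" marks a failing trace point); the function name is hypothetical.

```shell
# Hypothetical helper: list the error trace points in a db2trc flow file.
trace_error_probes() {
    grep -E ' error \[probe [0-9]+\]' "$1"
}
```

For example, trace_error_probes db2trc.flw prints only the lines where a function recorded an error, and the sequence number at the start of each line can then be looked up in db2trc.fmt for the detailed data.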
Checking the db2trc output, I can see that the error comes from ossDATCheckIfInterfaceHasValidUDAPLDevice:
2017-02-17-12.58.12.787555+480 E13703A1525 LEVEL: Error
PID : 12582994 TID : 1 PROC : db2start
INSTANCE: db2sdin1 NODE : 000
HOSTNAME: db203
EDUID : 1
FUNCTION: DB2 UDB, Shared Data Structure Abstraction Layer for CF, sqleCAGetTransportMethod, probe:684
MESSAGE : ZRC=0x87270023=-2027487197=SQLE_SAL_UNEXPECTED_ERROR
"Unexpected SAL Error."
DATA #1 : String, 34 bytes
Unable to determine transport type
DATA #2 : Codepath, 8 bytes
18:22
DATA #3 : unsigned integer, 8 bytes
1
DATA #4 : unsigned integer, 8 bytes
1
DATA #5 : unsigned integer, 4 bytes
0
DATA #6 : unsigned integer, 4 bytes
0
CALLSTCK: (Static functions may not be resolved correctly, as they are resolved to the nearest symbol)
[0] 0x090000000CBF56CC sqleCAGetTransportMethod + 0xC08
[1] 0x090000000CBF3F84 sqleCAGetTransportMethod + 0x1244
[2] 0x090000000CBF0080 sqleCAIsRoCE + 0x6E4
[3] 0x090000000D273618 sqlhaVerifyNetworkResources__FPPPcPiPbiP19SQLHA_CONTROL_BLOCK + 0x920
[4] 0x090000000D27899C sqlhaVerifyClusterResources__FPcP18sqlo_db2nodes_descPbP19SQLHA_CONTROL_BLOCK + 0x1CA4
[5] 0x090000000B15AD8C sqleIssueStartStop__FiPvPcT3P9sqlf_kcfdP18SQLE_INTERNAL_ARGSUiT7P5sqlca + 0xE22C
[6] 0x090000000B1484E8 sqleIssueStartStop__FiPvPcT3P9sqlf_kcfdP18SQLE_INTERNAL_ARGSUiT7P5sqlca + 0x96CC
[7] 0x090000000C946B68 sqleProcessStartStop__FiPvP18SQLE_INTERNAL_ARGSP9sqlf_kcfdPcUiT6P5sqlca + 0xAF8
[8] 0x0000000100002950 main + 0x20D0
[9] 0x00000001000002F8 __start + 0x70
50021 | | | | | | sqleCAGetTransportMethod data [probe 15]
50022 | | | | | | sqleCAGetTransportMethod data [probe 20]
50023 | | | | | | | sqleCAGetNetworkInterfaceTransportType entry
50021 data DB2 UDB Shared Data Structure Abstraction Layer for CF sqleCAGetTransportMethod fnc (3.3.39.49.0.15)
...
50058 | | | | | | | | ossDATGetUDAPLDeviceForInterface data [probe 976]
50059 | | | | | | | | | ossDATCheckIfInterfaceHasValidUDAPLDevice entry
50060 | | | | | | | | | | OSSHLibrary::getFuncAddress entry
50061 | | | | | | | | | | OSSHLibrary::getFuncAddress data [probe 10]
50062 | | | | | | | | | | OSSHLibrary::getFuncAddress data [probe 100]
50063 | | | | | | | | | | OSSHLibrary::getFuncAddress exit
50064 | | | | | | | | | | OSSHLibrary::getFuncAddress entry
50065 | | | | | | | | | | OSSHLibrary::getFuncAddress data [probe 10]
50066 | | | | | | | | | | OSSHLibrary::getFuncAddress data [probe 100]
50067 | | | | | | | | | | OSSHLibrary::getFuncAddress exit
50068 | | | | | | | | | | OSSHLibrary::getFuncAddress entry
50069 | | | | | | | | | | OSSHLibrary::getFuncAddress data [probe 10]
50070 | | | | | | | | | | OSSHLibrary::getFuncAddress data [probe 100]
50071 | | | | | | | | | | OSSHLibrary::getFuncAddress exit
50072 | | | | | | | | | ossDATCheckIfInterfaceHasValidUDAPLDevice error [probe 691]
50078 | | | | | | | | | ossDATCheckIfInterfaceHasValidUDAPLDevice exit [rc = 0x90000620 = -1879046624]
50079 | | | | | | | | ossDATGetUDAPLDeviceForInterface data [probe 1050]
50072 error DB2 Common OSSe ossDATCheckIfInterfaceHasValidUDAPLDevice cei (4.1.3.219.2.691)
pid 12582994 tid 1 probe 691
Error ZRC = 0x00000000 = 0 = PSM_OK
bytes 12
Data1 (PD_TYPE_DEFAULT,4) Hexdump:
9000 0620 ...
50078 exit DB2 Common OSSe ossDATCheckIfInterfaceHasValidUDAPLDevice cei (2.1.3.219.2)
pid 12582994 tid 1
rc = 0x90000620 = -1879046624
50079 data DB2 Common OSSe ossDATGetUDAPLDeviceForInterface cei (3.1.3.220.2.1050)
pid 12582994 tid 1 probe 1050
bytes 297
Data1 (PD_TYPE_DEFAULT,4) Hexdump:
9000 0620 ...
Data2 (PD_TYPE_DEFAULT,256) Hexdump:
6863 6131 0000 0000 0000 0000 0000 0000 hca1............
0000 0000 0000 0000 0000 0000 0000 0000 ................
0000 0000 0000 0000 0000 0000 0000 0000 ................
0000 0000 0000 0000 0000 0000 0000 0000 ................
0000 0000 0000 0000 0000 0000 0000 0000 ................
0000 0000 0000 0000 0000 0000 0000 0000 ................
0000 0000 0000 0000 0000 0000 0000 0000 ................
0000 0000 0000 0000 0000 0000 0000 0000 ................
0000 0000 0000 0000 0000 0000 0000 0000 ................
0000 0000 0000 0000 0000 0000 0000 0000 ................
0000 0000 0000 0000 0000 0000 0000 0000 ................
0000 0000 0000 0000 0000 0000 0000 0000 ................
0000 0000 0000 0000 0000 0000 0000 0000 ................
0000 0000 0000 0000 0000 0000 0000 0000 ................
0000 0000 0000 0000 0000 0000 0000 0000 ................
0000 0000 0000 0000 0000 0000 0000 0000 ................
Data3 (PD_TYPE_DEFAULT,13) Hexdump:
3130 2E31 3333 2E36 332E 3130 39 10.133.63.109
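The hexdumps in the trace already carry an ASCII column, but when only the hex words are at hand they can be decoded by hand or with a few lines of shell. The sketch below uses a hypothetical function name and assumes the input is the space-separated hex words copied from the formatted trace. It skips NUL padding bytes and prints the rest as text.

```shell
# Hypothetical helper: turn db2trc hexdump words into readable text.
hexdump_to_ascii() {
    hex=$(printf '%s' "$1" | tr -d ' ')
    while [ -n "$hex" ]; do
        rest=${hex#??}
        # Stop if fewer than two hex digits remain (malformed input).
        [ "$rest" = "$hex" ] && break
        byte=${hex%"$rest"}
        hex=$rest
        # Skip NUL padding such as the tail of the 256-byte buffer.
        [ "$byte" = "00" ] && continue
        # Convert the hex byte to octal and let printf %b emit the character.
        printf '%b' "\\0$(printf '%o' "0x$byte")"
    done
    echo
}

hexdump_to_ascii '6863 6131 0000 0000'                  # prints hca1
hexdump_to_ascii '3130 2E31 3333 2E36 332E 3130 39'     # prints 10.133.63.109
```

Read this way, the failing call was looking for a valid uDAPL device named hca1 for the interface with address 10.133.63.109.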
It looks like the adapter is not working, so I suggested that the user check the adapter status:
$ ibstat -v
ERROR: "/dev/roce0": open failed rc=46, errno=46
The adapter was indeed down on the host. After the adapter was fixed, the problem was resolved.
I also recommended that the user verify the adapters by running the DB2ClusterPing tool.
UID
ibm11140610