Technical Blog Post
Abstract
A SQL running very slowly on pureScale database with isolation level of RS
Body
We had a customer, he had a WebSphere application connecting to a database running on pureScale server. The application would execute a SQL statement as below:
SELECT DISTINCT(MYCOLUMN1) FROM MYTABLE FOR FETCH ONLY
Normally, the SQL take only < 10 ms to complete. However, when the SQL is executed from the WebSphere application, it ALWAYS(NOT randomly) take around 2 seconds.
To narrow down the issue, he already tried to:
1. Connect from the DB2 CLP on the same WebSphere machine to the DB, SQL only took < 10 ms to complete
2. Run a standalone java program which connect to the same pureScale DB on the same WebSphere machine, SQL also took < 10 ms to complete
3. Using the same WebSphere application connect to another standalone ESE database and ran the same SQL, it took < 10 ms to complete.
Asked him to collect explain and db2trc for both bad case and good case.
Note, to collect explain(a.k.a access plan) from section, the following steps can be followed:
db2 "call SYSPROC.SYSINSTALLOBJECTS('EXP
db2 "call SYSPROC.SYSINSTALLOBJECTS('EXP
# run the statement from WebSphere
# back to db2 server
db2 "SELECT executable_id, Varchar(stmt_text, 200) as stmt_text FROM TABLE(MON_GET_PKG_CACHE_STMT (NULL, NULL,NULL, -1)) AS T"
# from above output find out the executable_id for the statement 'SELECT DISTINCT(MYCOLUMN1) FROM MYTABLE FOR FETCH ONLY', assume it is x'0000000100000000000000000000
# then:
db2 "CALL EXPLAIN_FROM_SECTION(x'0000000
db2exfmt -d <db name> -g TIC -w -1 -n % -s % -# 0 -o explain.out
For usage of EXPLAIN_FROM_SECTION, refer to:
https://www.ibm.com/support/kn
See the db2trc analysis report, pay attention to the lines in red color, we found CF communication due to lock requests took much of the time:
GOOD CASE:
Pid Lvl FuncName TTime(ms) HTime(ms) LTime(ms) AvgTime(ms) NCalls ERecnumHTime
---------- --- ------------------------------- --------------- --------------- --------------- --------------- --------- -------------
22085672(Tid = 276671, Node = 0)
...
0 sqljsParse 90.027 89.696 0.330 45.013 2 272
1 sqljsParseRdbAccessed 89.996 89.269 0.057 22.499 4 875
2 sqljs_ddm_opnqry 89.268 89.268 89.268 89.268 1 876
3 sqlrr_open 89.258 89.258 89.258 89.258 1 885
4 sqlrr_process_fetch_request 88.949 88.949 88.949 88.949 1 1033
5 sqlriSectInvoke 88.938 88.938 88.938 88.938 1 1034
6 sqlritaSimplePerm 88.932 88.932 88.932 88.932 1 1037
7 sqldRowFetch 88.436 88.436 88.436 88.436 1 1307
8 sqldIndexFetch 88.433 88.433 88.433 88.433 1 1308
9 sqlirdk 88.431 88.431 88.431 88.431 1 1309
11 sqlifnxt 88.347 88.347 88.347 88.347 1 1372
12 sqliScanLeaf2 88.113 11.406 3.471 8.010 11 165897
13 sqlriDistinct2 70.476 0.417 0.001 0.002 34564 89377
BAD CASE:
Pid Lvl FuncName TTime(ms) HTime(ms) LTime(ms) AvgTime(ms) NCalls ERecnumHTime
---------- --- ------------------------------- --------------- --------------- --------------- --------------- --------- -------------
22085672(Tid = 267899, Node = 0)
0 sqljsParse 5178.360 5177.289 1.071 2589.180 2 32
1 sqljsParseRdbAccessed 5178.327 5177.216 0.055 2589.136 3 116
2 sqljs_ddm_opnqry 5177.214 5177.214 5177.214 5177.214 1 117
3 sqlrr_open 5177.203 5177.203 5177.203 5177.203 1 126
4 sqlrr_process_fetch_request 5176.753 5176.753 5176.753 5176.753 1 324
5 sqlriSectInvoke 5176.704 5176.704 5176.704 5176.704 1 325
6 sqlritaSimplePerm 5176.695 5176.695 5176.695 5176.695 1 328
7 sqldRowFetch 5176.639 5176.639 5176.639 5176.639 1 374
8 sqldIndexFetch 5176.636 5176.636 5176.636 5176.636 1 375
9 sqlirdk 5176.635 5176.635 5176.635 5176.635 1 376
11 sqlifnxt 5176.458 5176.458 5176.458 5176.458 1 530
12 sqliScanLeaf2 5176.042 650.660 301.629 650.463 10 2799047
13 sqlplrq 4132.845 7.371 0.046 0.135 34368 3460771
14 sqlplMakeNewRequestSD 3984.122 7.367 0.044 0.131 34368 3460772
15 sqlpLLMInformGLMIfStateChanged 3704.859 7.356 0.040 0.123 34368 3460779
16 sqlpLLMSetLockState 3214.568 7.342 0.033 0.108 34368 3460784
17 SAL_GLM_HANDLE::SAL_SetLockSta 3150.604 7.340 0.032 0.106 34368 3460785
18 SAL_GLM_HANDLE::SAL_SetLockSta 3106.833 7.338 0.031 0.104 34368 3460786
In bad case, we can see within the index scan, sqlplrq->sqlplMakeNewRequestSD->sqlpLLMInformGLMIfStateChanged->sqlpLLMSetLockState were called to set the lock state on the CF:
324 0.256215503 | | | | sqlrr_process_fetch_request entry [eduid 267899 eduname db2agent]
325 0.256216466 | | | | | sqlriSectInvoke entry [eduid 267899 eduname db2agent]
328 0.256219351 | | | | | | sqlritaSimplePerm entry [eduid 267899 eduname db2agent]
374 0.256273871 | | | | | | | sqldRowFetch entry [eduid 267899 eduname db2agent]
375 0.256275224 | | | | | | | | sqldIndexFetch entry [eduid 267899 eduname db2agent]
376 0.256276154 | | | | | | | | | sqlirdk entry [eduid 267899 eduname db2agent]
364 0.256264945 | | | | | | | | | | sqliSetupReadaheadPrefetch entry [eduid 267899 eduname db2agent
530 0.256432664 | | | | | | | | | | | sqlifnxt entry [eduid 267899 eduname db2agent]
536 0.256436914 | | | | | | | | | | | | sqliScanLeaf2 entry [eduid 267899 eduname db2agent]
537 0.256439289 | | | | | | | | | | | | | sqlplrq entry [eduid 267899 eduname db2agent]
538 0.256440707 | | | | | | | | | | | | | | sqlplMakeNewRequestSD entry [eduid 267899 eduname db2agent]
539 0.256441044 | | | | | | | | | | | | | | | sqlplComputeLocalConflict entry [eduid 267899 eduname db2agent]
540 0.256441521 | | | | | | | | | | | | | | | sqlplComputeLocalConflict exit
541 0.256441976 | | | | | | | | | | | | | | | sqlpUpdateLrbForNewRequest entry [eduid 267899 eduname db2agent]
542 0.256442583 | | | | | | | | | | | | | | | sqlpUpdateLrbForNewRequest exit
543 0.256443039 | | | | | | | | | | | | | | | sqlpLLMComputeCurrentNonXF entry [eduid 267899 eduname db2agent]
544 0.256443726 | | | | | | | | | | | | | | | sqlpLLMComputeCurrentNonXF exit
545 0.256443990 | | | | | | | | | | | | | | | sqlpLLMInformGLMIfStateChanged entry [eduid 267899 eduname db2agent]
546 0.256444357 | | | | | | | | | | | | | | | | sqlpLLMPrepareSetState entry [eduid 267899 eduname db2agent]
547 0.256444568 | | | | | | | | | | | | | | | | | sqlpLLMDetermineSendType entry [eduid 267899 eduname db2agent]
548 0.256444972 | | | | | | | | | | | | | | | | | sqlpLLMDetermineSendType exit
549 0.256445480 | | | | | | | | | | | | | | | | sqlpLLMPrepareSetState exit
550 0.256446126 | | | | | | | | | | | | | | | | sqlpLLMSetLockState entry [eduid 267899 eduname db2agent]
However, no such behavior in good case:
1033 13.491516158 | | | | sqlrr_process_fetch_request entry [eduid 276671 eduname db2agent]
1034 13.491517132 | | | | | sqlriSectInvoke entry [eduid 276671 eduname db2agent]
1037 13.491519658 | | | | | | sqlritaSimplePerm entry [eduid 276671 eduname db2agent]
1307 13.492014476 | | | | | | | sqldRowFetch entry [eduid 276671 eduname db2agent]
1308 13.492016142 | | | | | | | | sqldIndexFetch entry [eduid 276671 eduname db2agent]
1309 13.492016675 | | | | | | | | | sqlirdk entry [eduid 276671 eduname db2agent]
1297 13.492006033 | | | | | | | | | | sqliSetupReadaheadPrefetch entry [eduid 276671 eduname db2agent]
1372 13.492086845 | | | | | | | | | | | sqlifnxt entry [eduid 276671 eduname db2agent]
1378 13.492092304 | | | | | | | | | | | | sqliScanLeaf2 entry [eduid 276671 eduname db2agent]
1379 13.492095648 | | | | | | | | | | | | | sqlriDistinct2 entry [eduid 276671 eduname db2agent]
1380 13.492096175 | | | | | | | | | | | | | | sqlriProcessDistinct entry [eduid 276671 eduname db2agent]
1381 13.492097595 | | | | | | | | | | | | | | | sqlrxcp1 entry [eduid 276671 eduname db2agent]
1382 13.492098220 | | | | | | | | | | | | | | | sqlrxcp1 exit [rc = 1]
1383 13.492098771 | | | | | | | | | | | | | | sqlriProcessDistinct exit
1384 13.492099087 | | | | | | | | | | | | | sqlriDistinct2 exit
Why did Db2 have such differences between good case and bad case? Any difference between the access plans?
At the first glance, they had exactly the 'same' explain:
Rows
RETURN
( 1)
Cost
I/O
|
47
UNIQUE
( 2)
87.025
10
|
34470
IXSCAN
( 3)
85.9056
10
|
34470
INDEX: DB2SDIN1
MYTABLE
Q1
Only compare the plan graph is not enough, then I I saw in the access plan related to the index was that the good case had an isolation level of CS while the bad case had an isolation level of RS in addition to not having arguments to the IXSCAN, CUR_COMMIT and LCKAVOID set to TRUE. These two would likely result in an increased number of locks as RS isolation level needs more locks to enable repeateable reads and CUR_COMMIT and LCKAVOID both help reduce the number of locks needed during an index scan.
More info on isolation levels
https://www.ibm.com/support/kn
More info on Currently Commited:
https://www.ibm.com/support/kn
GOOD CASE:
3) IXSCAN: (Index Scan)
Cumulative Total Cost: 85.9056
Cumulative CPU Cost: 5.94928e+07
Cumulative I/O Cost: 10
Cumulative Re-Total Cost: 15.4126
Cumulative Re-CPU Cost: 5.93274e+07
Cumulative Re-I/O Cost: 0
Cumulative First Row Cost: 7.05876
Estimated Bufferpool Buffers: 11
Arguments:
---------
CUR_COMM: (Currently Committed)
TRUE
LCKAVOID: (Lock Avoidance)
TRUE
MAXPAGES: (Maximum pages for prefetch)
10
PREFETCH: (Type of Prefetch)
SEQUENTIAL,READAHEAD
ROWLOCK : (Row Lock intent)
SHARE (CS/RS)
SCANDIR : (Scan Direction)
FORWARD
SKIP_INS: (Skip Inserted Rows)
TRUE
TABLOCK : (Table Lock intent)
INTENT SHARE
TBISOLVL: (Table access Isolation Level)
CURSOR STABILITY
BAD CASE:
3) IXSCAN: (Index Scan)
Cumulative Total Cost: 85.9056
Cumulative First Row Cost: 7.05876
Estimated Bufferpool Buffers: 11
Arguments:
---------
MAXPAGES: (Maximum pages for prefetch)
10
PREFETCH: (Type of Prefetch)
READAHEAD
ROWLOCK : (Row Lock intent)
SHARE (CS/RS)
SCANDIR : (Scan Direction)
FORWARD
SKIP_INS: (Skip Inserted Rows)
TRUE
TABLOCK : (Table Lock intent)
INTENT SHARE
TBISOLVL: (Table access Isolation Level)
READ STABILITY
Why did we have isolation level of RS(i.e. READ STABILITY) in bad case? Did the Websphere application explicitly set the isolation level to RS as the default isolation level should be CS(Ii.e. CURSOR STABILITY)?
Then I checked the JCC trace collected in bad case (refer to this link to collect JCC trace with different ways -> http://www-01.ibm.com/support/
[jcc][Time:2017-10-26-11:10:01
For Transaction Isolation levels in JCC world, see: https://www.ibm.com/support/kn
So, the thing got clear, the customer removed the call of setTransactionIsolation (4) from his application, the performance became good, the SQL completed in 10 ms!
UID
ibm13286197