Technical Blog Post
Abstract
DB2 case sharing: slow deactivation of an HADR standby caused by an I/O issue
Body
One of my customers complained that it usually took more than 6 minutes to deactivate a standby database, and he asked me to help him find out the reason.
As a start, I asked him to collect "db2pd -stack all" and "db2pd -eve" output at least 3 times while deactivating the database. The data was then collected.
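The collection step above can be sketched as a small loop; this is only a sketch, assuming the DB2 instance owner runs it while the deactivate is in progress (the `command -v` guard merely lets the sketch run on machines where db2pd is absent):

```shell
# Take 3 rounds of db2pd stack dumps and EDU listings during the deactivate.
rounds=0
for i in 1 2 3; do
  command -v db2pd >/dev/null && db2pd -stack all           # stacks go to DIAGPATH
  command -v db2pd >/dev/null && db2pd -eve > "db2pd_eve.$i.txt"
  rounds=$((rounds + 1))
  sleep 1    # use a longer interval, e.g. 30 seconds, on a real system
done
echo "collected $rounds rounds"
```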
From the stacks, the agent handling the deactivate db request was busy flushing dirty pages to disk:
0x0900000000037C18 pwrite64 + 0x38
0x09000000042861BC sqloseekwrite64 + 0x144
0x0900000004285FDC sqloWriteBlocks + 0xE0
0x0900000004285E64 sqlbWriteBlocks__FP16SqlbOpenFileInfoPvlUlUiPUlP12SQLB_GLOBALS + 0x48
0x09000000042859A8 sqlbWritePageToDisk__FP11SQLO_IO_REQP16SqlbOpenFileInfoP12SQLB_GLOBALS + 0x22C
0x09000000042E05DC sqlbWritePage + 0xF8
0x09000000040F90D4 sqlbFlushForDLSubset__FP9SQLP_LSN8UiUsiT4P12SQLB_GLOBALS + 0x7F8
0x0900000004446478 sqlbFlush__FP9SQLP_LSN8UiUsP12SQLB_GLOBALS + 0x24
0x0900000003C166A0 sqlbFlush__FP9SQLP_LSN8UiUsP12SQLB_GLOBALS@glue52A + 0x7C
0x0900000004A300E4 TermDbConnect__16sqeLocalDatabaseFP8sqeAgentP5sqlcai + 0xE4
0x0900000004131634 AppStopUsing__14sqeApplicationFP8sqeAgentUcP5sqlca + 0xDA4
0x090000000445F02C AppLocalStart__14sqeApplicationFP14db2UCinterface + 0x65C
0x0900000004256488 sqlelostWrp__FP14db2UCinterface + 0x40
0x090000000425653C sqleUCengnInit__FP14db2UCinterfaceUs + 0x64
0x090000000446CC40 sqleUCagentConnect + 0x2B4
0x090000000446B5F4 sqljsConnectAttach__FP13sqljsDrdaAsCbP14db2UCinterface + 0x104
0x090000000446B344 sqljs_ddm_accsec__FP14db2UCinterfaceP13sqljDDMObject + 0x340
0x090000000446AF6C sqljsParseConnect__FP13sqljsDrdaAsCbP13sqljDDMObjectP14db2UCinterface + 0x5C
0x090000000446AEA0 .sqljsParse.fdpr.clone.378__FP13sqljsDrdaAsCbP14db2UCinterfaceP8sqeAgentb + 0x28
0x0900000004368A04 @63@sqljsSqlam__FP14db2UCinterfaceP8sqeAgentb + 0x2D0
0x090000000448B818 @63@sqljsDriveRequests__FP8sqeAgentP14db2UCconHandle + 0xB4
0x090000000448B4CC @63@sqljsDrdaAsInnerDriver__FP18SQLCC_INITSTRUCT_Tb + 0x2D4
0x090000000448AF54 sqljsDrdaAsDriver__FP18SQLCC_INITSTRUCT_T + 0x100
0x090000000417E110 RunEDU__8sqeAgentFv + 0x2F0
0x0900000004185E3C EDUDriver__9sqzEDUObjFv + 0xF8
0x090000000418B580 sqloEDUEntry + 0x264
Summary:
Found in 5 stacks of a total of 280 stacks ( 1.79% ) in 56 files
Found in:
./collect_ys/17170518.13271.000.stack.txt -- db2agent(PAYEE) -- 2016-07-13-21.39.52.661077(Signal #30)
./collect_ys/17170518.13271.000.stack.txt -- db2agent(PAYEE) -- 2016-07-13-21.40.36.216268(Signal #30)
./collect_ys/17170518.13271.000.stack.txt -- db2agent(PAYEE) -- 2016-07-13-21.41.17.765316(Signal #30)
./collect_ys/17170518.13271.000.stack.txt -- db2agent(PAYEE) -- 2016-07-13-21.42.05.778419(Signal #30)
./collect_ys/17170518.13271.000.stack.txt -- db2agent(PAYEE) -- 2016-07-13-21.43.09.038292(Signal #30)
Normally we could decrease CHNGPGS_THRESH so that the page cleaners start writing dirty pages earlier and fewer of them accumulate before deactivation:
CHNGPGS_THRESH 80 80
However, I saw that the DB2_USE_ALTERNATE_PAGE_CLEANING registry variable is set. When this variable is set (that is, when the alternate method of page cleaning is in use), the chngpgs_thresh parameter has no effect, and the database manager automatically determines how many dirty pages to maintain in the buffer pool:
DB2_USE_ALTERNATE_PAGE_CLEANING=ON
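A quick way to decide whether chngpgs_thresh tuning is even applicable is to grep the registry output for this variable. This is a sketch: the `sample` variable stands in for live `db2set -all` output, whose exact line layout may differ.

```shell
# If DB2_USE_ALTERNATE_PAGE_CLEANING=ON appears in the registry output,
# chngpgs_thresh is ignored; on a real system pipe `db2set -all` in instead.
sample='DB2_USE_ALTERNATE_PAGE_CLEANING=ON'
if printf '%s\n' "$sample" | grep -q 'DB2_USE_ALTERNATE_PAGE_CLEANING=ON'; then
  verdict="chngpgs_thresh has no effect"
else
  verdict="chngpgs_thresh applies"
fi
echo "$verdict"
```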
The number of dirty pages is high:
Address Id Name PageSz PA-NumPgs BA-NumPgs BlkSize NumTbsp PgsLeft CurrentSz PostAlter SuspndTSCt Automatic
0x0A0004000010DE20 1 IBMDEFAULTBP 32768 111594 0 0 4
Bufferpool: 1
Dirty pages % : 101283 / 111594 (90.76% dirty)
Bufferpool minbuflsn: 000000D0A629906A
101283 * 32768 ~= 3 GB ==> flushing about 3 GB of data to disk took 4 minutes, which looks like an I/O bottleneck.
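The arithmetic above can be sanity-checked directly. The dirty-page count, total page count, and page size come from the db2pd output shown earlier; the 240-second flush time is an assumption taken from the customer's "about 4 minutes" report:

```shell
# Verify the dirty-page volume and the implied flush throughput.
dirty=101283
total=111594
pagesz=32768
bytes=$((dirty * pagesz))                                   # total dirty data in bytes
pct=$(awk "BEGIN { printf \"%.2f\", $dirty * 100 / $total }")
mbps=$(awk "BEGIN { printf \"%.0f\", $bytes / 240 / 1024 / 1024 }")
echo "$bytes bytes (${pct}% dirty), ~${mbps} MB/s if flushed in 240s"
```

Roughly 13 MB/s of sustained write throughput is very low for a database data path, which is consistent with the I/O-bottleneck suspicion.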
So I asked the customer to collect additional data the next time the problem happened:
db2pd -eve > db2pd.eve
nohup iostat -Dl 10 > iostat.out &
db2trc on -Madd sqlbFlushForDLSubset -Madd sqloseekwrite64 -t -f db2trc.dmp
db2 deactivate db payee
db2trc off
db2trc flw -t -wc db2trc.dmp db2trc.flw
db2trc perfrep -g -sort timeelapsed db2trc.dmp db2trc.rpt
#stop the iostat command
The data was collected, and from it I did see the I/O bottleneck. sqloseekwrite64 directly invokes the pwrite64 system call. The db2agent EDU flushing data pages to disk showed AvgElapsed = 0.182239108 seconds per write, while the db2hadrs and db2loggr EDUs writing to the transaction logs showed AvgElapsed of only 0.001648201 and 0.001366460 seconds respectively. 0.182 seconds versus roughly 0.0014 seconds per write suggests the I/O on the data containers is slow. Please check and/or tune the I/O.
Node : 0 , PID : 20513020 , TID : 11310 ==> db2hadrs
nCalls TotalElapsed AvgElapsed TotalSpent AvgSpent FunctionName
15603 25.716884552 0.001648201 25.716884552 0.001648201 sqloseekwrite64
Node : 0 , PID : 20513020 , TID : 4113 ==> db2agent
nCalls TotalElapsed AvgElapsed TotalSpent AvgSpent FunctionName
1 233.521891755 233.521891755 0.073594333 0.073594333 sqlbFlushForDLSubset
1281 233.448297421 0.182239108 233.448297421 0.182239108 sqloseekwrite64
Node : 0 , PID : 20513020 , TID : 4885 ==> db2loggr
nCalls TotalElapsed AvgElapsed TotalSpent AvgSpent FunctionName
4 0.005465841 0.001366460 0.005465841 0.001366460 sqloseekwrite64
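The gap between data-container and log writes can be put as a single ratio, using the AvgElapsed values from the perfrep excerpt above (db2agent vs db2loggr):

```shell
# Per-write latency ratio: data container writes vs transaction log writes.
agent_avg=0.182239108
loggr_avg=0.001366460
ratio=$(awk "BEGIN { printf \"%.0f\", $agent_avg / $loggr_avg }")
echo "data writes are ~${ratio}x slower than log writes"
```

Since both paths go through the same sqloseekwrite64/pwrite64 code, a difference of two orders of magnitude points at the storage behind the data containers rather than at DB2 itself.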
The high number of dirty pages was most probably caused by the same I/O issue: the page cleaners could not keep up with the slow writes.
So I asked the customer to check the I/O performance from the OS/storage side.
UID
ibm13286737