Technical Blog Post
Abstract
HADR slow ! Log disk ? or communication ?
Body
Once you find DB2 HADR ( High Availability Disaster Recovery ) operation is somehow slow, what's the next ?
You may want to narrow down if this delay is from writing to transaction log disk subsystem or something on network communication.
Here is a sample scenario.
- Intermittently, the number of 'commit active' status applications spikes.
- Application team reports this has very serious impact even though that status stays very short 1 second period.
( Application processes increasing with High CPU consumption due to 'fork' operation....etc. )
To investigate this on HADR side, I suggest to collect data using the following two functions in overall database point of view.
MON_GET_TRANSACTION_LOG
https://www.ibm.com/support/knowledgecenter/SSEPGG_11.1.0/com.ibm.db2.luw.sql.rtn.doc/doc/r0059253.html
MON_GET_WORKLOAD
https://www.ibm.com/support/knowledgecenter/SSEPGG_11.1.0/com.ibm.db2.luw.sql.rtn.doc/doc/r0053940.html
And here is the sample result output summary to study.
( Collected the output using the two functions in every 5 seconds.
Then calculated the delta value for each 5 seconds. )
MON_GET_WORKLOAD shows TOTAL_COMMIT_TIME spiked and this is from LOG_DISK_WAIT_TIME.
TIMESTAMP / LOCK_WAIT_TIME /LOG_DISK_WAIT_TIME/TOTAL_COMMIT_TIME
17:39:16, 0, 2, 3,
17:39:21, 0, 2, 2,
17:39:26, 0, 2, 2,
17:39:31, 0, 4, 4,
17:39:37, 0, 2, 2,
17:39:42, 0, 9, 9,
17:39:47, 0, 3, 3,
17:39:52, 79, 99, 99, <====
17:39:58, 0, 3, 3,
17:40:03, 0, 2, 2,
17:40:08, 1, 4, 4,
17:40:13, 0, 4, 4,
17:40:19, 0, 4, 4,
LOG_DISK_WAIT_TIME sometimes overlaps with LOG write time and sometimes with HADR communication time.
So it would be helpful to check the LOG_WRITE_TIME and LOG_HADR_WAIT_TIME from MON_GET_TRANSACTION_LOG function.
In the same way, I get the delta values and did some calculation.
-Log write Time per IOs = LOG_WRITE_TIME / LOG_WRITES
-AVG HADR Wait time = LOG_HADR_WAIT_TIME / LOG_HADR_WAITS_TOTAL
TIMESTAMP /Log write Time per IOs /AVG HADR Wait time
17:39:16, 0.310, 0.041
17:39:21, 0.298, 0.042
17:39:26, 0.308, 0.040
17:39:32, 0.302, 0.072
17:39:37, 0.294, 0.042
17:39:42, 0.342, 0.805<===
17:39:47, 0.326, 0.058
17:39:52, 0.334, 0.485<===
17:39:57, 0.344, 0.064
17:40:02, 0.299, 0.046
As you can see above, HADR wait time spiked around the time whereas log write time did not increase much.
So the next is checking network side.
( NOTE : I assume there is not transaction blocking in HADR primary side due to any HADR receive buffer full on standby.)
UID
ibm11139986