Technical Blog Post
Abstract
In Db2 LUW, FCM wait time is showing high. Is that an FCM problem?
Body
I have come across this question quite a few times:
In Db2 LUW, FCM wait time is showing high. Is that an FCM problem?
I wanted to clarify that question a bit more, on top of what this Technote already explains:
/support/pages/node/279243
FCM stands for Fast Communication Manager.
When it is used for communication across two physical hosts, it goes through FCM channels,
which use network sockets underneath.
When it is used within the same physical host, for whatever reason, it uses IPC through shared memory instead.
IPC usually does not have the kind of response problems that TCP/IP sockets can have.
Within one physical host, FCM can be used for communication across two logical members/partitions.
However, as the above Technote describes, it is also used when INTRA_PARALLEL is set to ON.
In that case FCM local channels carry the communication across sub-agents.
Here is an example in which most of the Db2 wait time is spent in FCM_SEND_WAIT_TIME:
========================================================
-- Detailed breakdown of TOTAL_WAIT_TIME --

                            %    Total
                            ---  ---------------------------------------------
TOTAL_WAIT_TIME             100  239995

I/O wait time
  POOL_READ_TIME            0    293
  POOL_WRITE_TIME           0    0
  DIRECT_READ_TIME          0    51
  DIRECT_WRITE_TIME         0    65
  LOG_DISK_WAIT_TIME        3    7290
LOCK_WAIT_TIME              0    263
AGENT_WAIT_TIME             0    0
Network and FCM
  TCPIP_SEND_WAIT_TIME      0    1668
  TCPIP_RECV_WAIT_TIME      1    4249
  IPC_SEND_WAIT_TIME        0    0
  IPC_RECV_WAIT_TIME        0    0
  FCM_SEND_WAIT_TIME        90   216724
  FCM_RECV_WAIT_TIME        3    8228
WLM_QUEUE_TIME_TOTAL        0    0
CF_WAIT_TIME                0    0
RECLAIM_WAIT_TIME           0    0
SMP_RECLAIM_WAIT_TIME       0    0
====================================================
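To make the arithmetic behind the "%" column concrete, here is a minimal Python sketch that recomputes it from the values in the report above. It assumes each percentage is simply the element's value divided by TOTAL_WAIT_TIME and truncated to a whole number. That assumption reproduces every figure shown, but it is not taken from the Db2 source. The helper name share_of_total is made up for this illustration.

```python
# Values copied from the report above (same units as the report).
TOTAL_WAIT_TIME = 239995

wait_times = {
    "LOG_DISK_WAIT_TIME": 7290,
    "TCPIP_SEND_WAIT_TIME": 1668,
    "TCPIP_RECV_WAIT_TIME": 4249,
    "FCM_SEND_WAIT_TIME": 216724,
    "FCM_RECV_WAIT_TIME": 8228,
}


def share_of_total(value, total=TOTAL_WAIT_TIME):
    """Whole-number percentage of total wait time (assumed: truncated, not rounded)."""
    return 100 * value // total


for name, value in wait_times.items():
    print(f"{name:<24}{share_of_total(value):>4}%  {value}")
```

The point to notice is that the denominator is the total wait time inside Db2 and nothing else.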
In this case, INTRA_PARALLEL was set to ON in a single member/partition setup.
As a result, the queries were parallelized using multiple sub-agents.
The sub-agents of a query communicate with each other through something called a tableQ (table queue).
One part of a query sends a portion of the total work to the sub-agents through a tableQ and waits to hear back from them.
The sub-agents take whatever time they need to complete the assigned work and then send the result back to the
parent (coordinator) agent, which assembles the overall result. The parent agent waits until all the sub-agents have reported completion.
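The waiting pattern described above can be illustrated with a small Python analogy. This is not how Db2 implements table queues. It is only a sketch, using threads and a queue, of why the coordinator's time spent blocked on the queue tracks how long the sub-agents take to do their work rather than how fast the queue itself is. All names (sub_agent, coordinator, work_seconds) are hypothetical.

```python
import queue
import threading
import time


def sub_agent(chunk, result_q, work_seconds):
    """Hypothetical sub-agent: does its share of the work, then reports back."""
    time.sleep(work_seconds)      # stands in for the real query work
    result_q.put(sum(chunk))      # send the partial result to the coordinator


def coordinator(data, n_sub_agents, work_seconds):
    """Hypothetical coordinator: hands out work, then waits on the queue."""
    result_q = queue.Queue()
    chunks = [data[i::n_sub_agents] for i in range(n_sub_agents)]
    threads = [
        threading.Thread(target=sub_agent, args=(chunk, result_q, work_seconds))
        for chunk in chunks
    ]
    for t in threads:
        t.start()

    total, waited = 0, 0.0
    for _ in threads:
        start = time.perf_counter()
        total += result_q.get()   # blocks until some sub-agent has finished
        waited += time.perf_counter() - start

    for t in threads:
        t.join()
    return total, waited


if __name__ == "__main__":
    total, waited = coordinator(list(range(100)), 4, 0.2)
    print(f"result={total}, coordinator blocked for about {waited:.2f}s")
```

Increasing work_seconds increases the time the coordinator spends blocked, even though the queue itself is no slower.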
So if the FCM wait time increases in a single-member setup, it is not due to an issue in FCM itself. It is an issue at the query level, in the work that uses the tableQ.
For certain big queries this is a normal observation. In other cases the query access plan
can be checked and improved, which might reduce the FCM wait time as a result.
In summary, wait time incurred below the FCM layer is reflected as part of the FCM wait time, and that might confuse users.
It is important to understand that if no multi-partition setup across different physical hosts is involved,
then the FCM wait time should be purely due to wait time in the layer below FCM, and not an
FCM layer issue.
It is also important to understand that when outputs such as monreport.dbsummary(), or many
third-party tools, show similar wait times, they treat the total wait time within Db2 as 100%
and report what percentage each area contributes.
So in the above example, the 90% is out of the total wait time inside Db2.
It is a purely relative figure.
It is possible that only one query was active in the database at that time, that it was using a tableQ,
and that nothing else was running.
Even if the overall database response was fast, those outputs show the percentage within that total database time.
So the 90% figure is not a wait time relative to the total time of the physical box.
In fact, it is purely a measurement internal to Db2.
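A short Python sketch shows why the percentage alone says little. The first pair of numbers is taken from the example report; the second pair is invented for illustration and describes a hypothetical, nearly idle database. The helper name fcm_share_percent is also made up.

```python
def fcm_share_percent(fcm_wait, total_wait):
    """Share of FCM wait within total Db2 wait time, as a whole-number percentage."""
    return 100 * fcm_wait // total_wait


# From the example report in this post.
example = {"fcm_wait": 216724, "total_wait": 239995}

# Hypothetical numbers: a nearly idle database where one parallel query ran briefly.
nearly_idle = {"fcm_wait": 90, "total_wait": 100}

for label, sample in (("example", example), ("nearly idle", nearly_idle)):
    pct = fcm_share_percent(sample["fcm_wait"], sample["total_wait"])
    print(f"{label:<12} FCM share = {pct}%  (absolute FCM wait = {sample['fcm_wait']})")
```

Both cases report the same 90% share, although the absolute wait differs by several orders of magnitude, so the absolute values and the workload running at the time need to be looked at together with the percentage.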
UID
ibm11139944