Technical Blog Post
Abstract
ITM Agent Insights: Common Causes for High CPU with the Windows OS Agent
Body
One of the more common problems customers report with their OS agents is high CPU utilization. Often the agent itself is not causing the problem; it is simply the mechanism that makes the problem visible. This blog provides some steps that can help you resolve the problem or identify its cause with the assistance of IBM Support.
Symptom: Windows OS agent high CPU message
During the initialization phase, the agent performs a set of sanity checks on the performance counters it is going to use, and it can detect corrupted or malfunctioning counters. Messages like “This is a possible source of high CPU usage” can be found in the agent logs in one of the following directories on the server hosting the agent:
%ITM_Install%\logs\TMAITM6\logs
%ITM_Install%\logs\TMAITM6_x64\logs
For example:
(...:kntkthrd.cpp,533,"kntkthrd::ServiceThreadMain") Counter:'1500' is taking long (>3 sec). This is a possible source of high CPU usage
(...:kntkthrd.cpp,534,"kntkthrd::ServiceThreadMain") Consider to disable it by means of the NT_EXCLUDE_PERF_OBJS variable
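To find every counter the agent has flagged, the logs can be scanned for this message. The following is a minimal sketch, assuming the message wording matches the sample above and that the trace files in the logs directory end in .log; the function names are hypothetical and not part of ITM.

```python
import re
from pathlib import Path

# Pattern modelled on the sample log line above; other agent
# versions may word the message differently (assumption).
SLOW_COUNTER = re.compile(r"Counter:'(?P<counter>[^']+)' is taking long")


def find_slow_counters(lines):
    """Return counter IDs flagged as slow, in first-seen order, without duplicates."""
    seen = []
    for line in lines:
        match = SLOW_COUNTER.search(line)
        if match and match.group("counter") not in seen:
            seen.append(match.group("counter"))
    return seen


def scan_log_dir(log_dir):
    """Map each *.log file in log_dir to the slow counters it reports.

    The .log extension is an assumption; adjust the glob if your
    trace files are named differently.
    """
    results = {}
    for path in sorted(Path(log_dir).glob("*.log")):
        with open(path, errors="replace") as handle:
            counters = find_slow_counters(handle)
        if counters:
            results[path.name] = counters
    return results
```

Calling scan_log_dir with one of the two directories listed above returns a dictionary keyed by log file name, which gives a quick list of counter IDs to investigate.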
Symptom: High CPU due to Windows Performance Counters
Issues like this are often related to the Windows perfmon counters. Try rebuilding the perfmon counters as described in the link below to see whether that resolves the problem.
For details on how to rebuild performance counters refer to
http://support.microsoft.com/default.aspx?scid=kb;en-us;300956
The following Microsoft forum thread describes a problem with corrupt performance counters on Windows 2008 servers:
http://social.technet.microsoft.com/Forums/en-US/winservergen/thread/9efd2ef8-8c0e-4550-a0eb-afd826cf8b7e
If the problem persists, you may need to engage Microsoft Tech Support to correct it.
Symptom: System Event Throttle settings for Windows High CPU
Sometimes data being processed by the OS agent may need to be filtered. The following environment variables can be set in the KNTENV file to enable duplicate events to be throttled back or dropped.
Apply to all six event logs:
NT_LOG_THROTTLE=X
Apply to each log separately:
NT_APPLICATION_LOG_THROTTLE=X
NT_SYSTEM_LOG_THROTTLE=X
NT_SECURITY_LOG_THROTTLE=X
NT_DNS_LOG_THROTTLE=X
NT_DIRSERVICE_LOG_THROTTLE=X
NT_FILEREPSRV_LOG_THROTTLE=X
where X=0, event drop throttle disabled
X=1, drop all duplicate events every read cycle of the event log
X=a value > 1, drop all duplicate events in groups of X every read cycle of the event log.
For example, if X=50 then duplicate events are dropped in groups of 50.
X=1 should be a good place to start for most customers. A value > 1 is intended for event storms.
Restart the agent to activate the changes.
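The edit to the KNTENV file can be scripted. The sketch below assumes the file is plain text with one NAME=VALUE entry per line; the helper name set_env_var is hypothetical. It works on the file contents as a string, so reading and writing the file, and keeping a backup copy, is left to the caller.

```python
def set_env_var(text, name, value):
    """Return text with NAME=value set.

    An existing NAME= line is replaced in place; if none exists,
    the entry is appended at the end.
    """
    lines = text.splitlines()
    new_line = f"{name}={value}"
    replaced = False
    for index, line in enumerate(lines):
        if line.strip().startswith(name + "="):
            lines[index] = new_line
            replaced = True
    if not replaced:
        lines.append(new_line)
    return "\n".join(lines) + "\n"
```

For example, set_env_var(contents, "NT_LOG_THROTTLE", 1) applies the suggested starting value to all six event logs, and the same call with NT_SYSTEM_LOG_THROTTLE targets only the System log.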
Symptom: KNTCMA.EXE consuming high CPU due to inefficient formula and/or persistent situations
An inefficiently written situation formula can also cause high CPU utilization. Using wildcard characters (*) and/or the MISSING function in situations is one of the most common causes of high CPU usage related to situation evaluation.
The persistent situations file, called psit_Primary_<hostname>_NT.str and located in the directory shown below, stores the list of all situations the agent is supposed to run. Its purpose is to reduce RPC traffic from the TEMS to the agent during agent startup. If there is no psit file, the TEMS has to send multiple RPC requests to start situations, one RPC per situation. With the psit file, there is only one RPC to confirm the integrity of the psit file content. If the psit file is not up to date or not usable, the TEMS sends additional RPCs to stop or start situations. Renaming the file can often help correct excessive RPC requests.
On the server where the Windows NT agent resides,
- Stop the agent
- Rename the file \IBM\ITM\TMAITM6\psit_Primary_<hostname>_NT.str to \IBM\ITM\TMAITM6\psit_Primary_<hostname>_NT.old
- Restart the agent
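The rename step can be sketched as follows. This is an illustration only: the function name rename_psit is hypothetical, agent_dir stands for the TMAITM6 directory shown above, and the agent is assumed to have been stopped already.

```python
from pathlib import Path


def rename_psit(agent_dir, hostname):
    """Rename psit_Primary_<hostname>_NT.str to the same name ending in .old.

    Returns the new path, or None if no psit file exists.
    Refuses to overwrite an existing .old file.
    """
    source = Path(agent_dir) / f"psit_Primary_{hostname}_NT.str"
    if not source.exists():
        return None
    target = source.with_suffix(".old")
    if target.exists():
        raise FileExistsError(f"{target} already exists; move it aside first")
    source.rename(target)
    return target
```

Keeping the old file instead of deleting it means it can be restored if the rename does not change the agent's behavior.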
The following technote helped resolve similar problems for customers that encountered high CPU after installing ITM 6.23 FP1:
Distributed Agent may Loop after ITM V622 FP07 or V623 FP1
http://www-01.ibm.com/support/docview.wss?uid=swg21591510
If you suspect a situation formula may be causing high utilization, try disabling it by removing the managed system from the situation's distribution list and restarting the agent. Allow the agent to stabilize for about 10 minutes. If the utilization drops to a normal range, collect the situation definition using the viewSit command shown in the following link and open a PMR with IBM Support.
http://www-01.ibm.com/support/knowledgecenter/SSTFXA_6.3.0/com.ibm.itm.doc_6.3/cmdref/viewsit.htm
Symptom: KNTCMA.EXE consuming high CPU due to a high number of running situations
In the logs directory referenced above you will find a file with a name similar to <HOSTNAME>_NT.LG0. This file will show all the situations that have been started by this agent. Sometimes too many situations are started or running with too short a sampling interval (e.g. one minute) which causes high utilization.
Removing unnecessary situations or increasing the sampling interval to 5 or 10 minutes may resolve the problem.
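As a rough illustration of why the interval matters, the sketch below estimates how many situation evaluations the agent performs per hour. It assumes each running situation is evaluated once per sampling interval and ignores any grouping or optimization the agent may apply; the function name and the figure of 40 situations are hypothetical.

```python
def evaluations_per_hour(situation_count, interval_seconds):
    """Estimate evaluations per hour, assuming one evaluation per
    situation per sampling interval (a simplifying assumption)."""
    return situation_count * 3600 / interval_seconds


one_minute = evaluations_per_hour(40, 60)     # 40 situations sampled every minute
five_minutes = evaluations_per_hour(40, 300)  # the same 40 sampled every 5 minutes
print(one_minute, five_minutes)  # prints 2400.0 480.0
```

Under these assumptions, moving from a one-minute to a five-minute interval cuts the evaluation count by a factor of five.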
Symptom: KNTCMA.EXE consuming high CPU due to Historical Collection
Historical Collection is another culprit that may contribute to high CPU. When data is collected at the agent, combined with situations and other required agent activities, it may result in high utilization.
If possible, temporarily disable historical collection on the problematic system to see if utilization returns to normal.
Setting traces and gathering a PDCollect
If you are unable to resolve your problem with any of the recommendations in this blog, collecting logs and environmental information is the next step. This can be easily accomplished with the PDCollect utility. Here is a link:
http://www-01.ibm.com/support/knowledgecenter/SSTFXA_6.3.0/com.ibm.itm.doc_6.3/cmdref/pdcollect.htm
Then open a PMR with IBM Support.
For High CPU initial contact with IBM:
The KBB_RAS1 value must be left at the default level of ERROR. This ensures that the high utilization is not caused by trace logging.
For High CPU general:
If the cause of the high CPU cannot be determined, IBM Support will ask for the following:
- Edit <ITM home>\TMAITM6\KNTENV and set KBB_RAS1=ALL
- Increase the number of trace log files by setting MAXFILES=10 and COUNT=10 in the KBB_RAS1_LOG parameter.
- Restart the agent to activate the changes and run a PDCollect.
- Provide the output of the PDCollect to IBM for evaluation.
For High CPU specific:
If the errors indicate the problem is related to an ITM component, set the following:
- Edit <ITM home>\TMAITM6\KNTENV and set KBB_RAS1=ERROR (UNIT:KNT ALL) (UNIT:KRA ALL) (UNIT:KNL ALL) (UNIT:KNZ ALL)
- Increase the number of trace log files by setting MAXFILES=10 and COUNT=10 in the KBB_RAS1_LOG parameter.
- Restart the agent to activate the changes and run a PDCollect.
- Provide the output of the PDCollect to IBM for evaluation.
Resources
Situation to detect high CPU
Refer to the following link for a Windows situation you can add to your environment to detect high CPU Utilization:
Diagnosing Resources used by the Agent
To diagnose this condition, the Agent Workload Audit tool produces a report that summarizes activity at the agent. The goal is to measure agent-side processing. For more details refer to
Summary
Hopefully this blog has given you a better understanding of why your Windows NT agent may be experiencing and reporting high CPU utilization. If you were not able to resolve the problem, following the steps outlined here should help reduce the time needed to identify the cause.
Future blogs will cover this topic as it relates to the Linux and UNIX OS agents.
UID
ibm11085367