Fixing high CPU issue with Agentless Linux

Technical Blog Post

Abstract

Body

If you have one or more Agentless Linux instances, each monitoring a meaningful number of remote systems (say 70-80 per instance), you may notice that the Agentless processes constantly consume around 30% of CPU.

This can cause a resource shortage when other processes start or when other workloads produce temporary CPU peaks.

This happens because, by default, the Agentless agent collects data for all attribute groups every 60 seconds.

Depending on the number of monitored servers, CPU usage can reach 20%-30%.

Typically, the Processes attribute group takes the longest to collect, because some of the remote systems can be running a lot of processes.

To reduce CPU consumption, fine-tune the Agentless instances by increasing the data collection interval.

This is configurable, down to the attribute group level.

Here are some of the available tuning options.

These environment variables are set in the `r<n>.ini/env` file:

CDP_DP_REFRESH_INTERVAL = 60
This is the overall SNMP polling interval in seconds which updates the cache. Default is 60.

You can raise this to a value such as 120 or 180 seconds.

CDP_DP_CACHE_TTL = 60
This is the timeout for the cache in seconds. Default is 60.
Set it to at least the highest value among the REFRESH_INTERVAL settings.

You can also specify collection intervals per attribute group using

CDP_<attribute group name>_REFRESH_INTERVAL

This way, different attribute groups can be collected at different intervals.

For example:

CDP_DISK_REFRESH_INTERVAL=70
CDP_PROCESSES_REFRESH_INTERVAL=90
CDP_NETWORK_REFRESH_INTERVAL=80
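
The rule for CDP_DP_CACHE_TTL (at least the highest REFRESH_INTERVAL value) can be checked with a small helper. This is only a sketch: the function name `cache_ttl` is made up for illustration, and the values are the ones from the example above.

```python
def cache_ttl(overall_interval, group_intervals):
    """Return the smallest cache TTL (seconds) that covers every refresh interval."""
    return max([overall_interval, *group_intervals.values()])

# Values taken from the example settings above; 60 is the default overall interval.
group_intervals = {"DISK": 70, "PROCESSES": 90, "NETWORK": 80}
ttl = cache_ttl(60, group_intervals)
print(f"CDP_DP_CACHE_TTL={ttl}")  # prints CDP_DP_CACHE_TTL=90
```

With these example values, the cache timeout would need to be at least 90 seconds.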


The biggest benefit will most likely come from increasing the interval for the attribute groups that typically need the most requests back and forth, like Processes.

These are the attribute groups available for the Linux Agentless agent:

LNX Performance Object Status
Performance Object Status
Managed Systems
Disk
hrStorageTable
Memory
Network
Processes
Processor
System
Thread Pool Status
Total Virtual MB
Used Virtual MB
Virtual Memory

As a best practice, check which situations are active on the Agentless instances, note the interval of each situation, find the smallest situation interval for each attribute group, and then use that value to decide how far the agent data collection interval can be raised.
For example, we know that the Processes attribute group can be CPU intensive, especially on servers with a large number of running processes.
If the situations running against the Processes attribute group have a 5-minute interval, you can set the data collection interval for this attribute group to 3 or 4 minutes instead of 1 minute.
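
The evaluation above can be sketched in a few lines of Python. Both the helper name `suggest_interval` and the 80% factor are assumptions chosen for illustration, not a product rule; the only grounded numbers are the 60-second default and the 5-minute example.

```python
DEFAULT_INTERVAL = 60  # seconds, the default data collection interval

def suggest_interval(situation_intervals, percent=80):
    """Suggest a collection interval (seconds) below the smallest situation interval.

    The percentage is an illustrative assumption; the result never drops
    below the default interval.
    """
    smallest = min(situation_intervals)
    return max(DEFAULT_INTERVAL, smallest * percent // 100)

# Hypothetical Processes situations evaluated every 5 and 10 minutes.
print(suggest_interval([300, 600]))  # prints 240, i.e. 4 minutes
```

A lower percentage, such as 60, would give the 3-minute value mentioned above.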

A similar evaluation should be made for all the other attribute groups once you know the intervals of the situations and of the historical collections currently active. Generally speaking, I would not expect any of them to stay at the default 1-minute interval.

For the attribute groups you are not interested in, you can set a very high collection interval (hourly, for example) to avoid wasting CPU on data collection you do not use.
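
As a rough illustration, the resulting settings could be written out like this. The helper `render_settings` is hypothetical, and the choice of which group is tuned and which is unneeded is made up; only the variable names already shown in this post are used.

```python
HOURLY = 3600  # one hour, in seconds

def render_settings(intervals):
    """Format a {GROUP: seconds} mapping as CDP_<GROUP>_REFRESH_INTERVAL lines."""
    return [f"CDP_{group}_REFRESH_INTERVAL={seconds}"
            for group, seconds in sorted(intervals.items())]

# Hypothetical choice: Processes is tuned to 4 minutes, Network is not needed.
lines = render_settings({"PROCESSES": 240, "NETWORK": HOURLY})
print("\n".join(lines))
```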

Thanks for reading



UID

ibm11082637
