Windows: Monitoring with the Guardium Agent Monitor

The Guardium Agent Monitor (GAM) process monitors Guardium agent performance and responsiveness. It is intended for detailed analysis during troubleshooting.

Note: The GAM service is off by default because it requires configuration specific to the environment in which it is installed. Improper configuration can cause serious operational issues. GAM is a troubleshooting aid and is not otherwise required.

Monitoring covers:

  • CPU usage
  • Memory
  • Handles
  • Number of threads
  • Alive: responsiveness (supported agents only; S-TAP is currently the only supported agent). See Responsiveness.

If a monitored agent exceeds a configured threshold, or if it does not respond to the console request, the following actions can be taken, in any combination:
  • Automatically run diag.bat
  • Automatically stop/restart the service
  • Automatically perform a core dump

Guardium Agent Monitor is installed when S-TAP is installed but is not enabled by default. When S-TAP is uninstalled, GAM is uninstalled.

Note: Just like S-TAP, GAM requires administrative privileges. When installing, use "Run as Administrator" while logged in as an administrative user.

The default install location for GAM is the parent folder of S-TAP (C:\Program Files\IBM\Guardium Agent Monitor\).

The default location for GAM output is the \Bin\ subfolder.

After enabling GAM, make sure that the GAM process (resmon.exe) is running on the database server.
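
The check above can be scripted. The following is a minimal sketch, assuming Python is available on the database server and that the standard Windows tasklist command is on the PATH. The function names are hypothetical and are not part of GAM.

```python
import csv
import io
import subprocess


def is_process_listed(tasklist_csv: str, image_name: str = "resmon.exe") -> bool:
    """Return True if image_name appears in `tasklist /FO CSV /NH` output."""
    for row in csv.reader(io.StringIO(tasklist_csv)):
        # The first CSV column is the image name; informational lines
        # (for example "INFO: No tasks are running ...") never match.
        if row and row[0].strip().lower() == image_name.lower():
            return True
    return False


def resmon_is_running() -> bool:
    """Ask Windows whether resmon.exe is running (run on the database server)."""
    result = subprocess.run(
        ["tasklist", "/FI", "IMAGENAME eq resmon.exe", "/FO", "CSV", "/NH"],
        capture_output=True,
        text=True,
        check=False,
    )
    return is_process_listed(result.stdout)
```

Splitting the parsing from the tasklist call keeps the parsing logic testable on any platform; only resmon_is_running() needs Windows.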

GAM Configuration

The Guardium Agent Monitor runs with its configuration file, resmon.ini, as its argument, and is controlled through the settings in that file. See Example of resmon.ini at the bottom of this topic, which shows the default values for all of the parameters.

Global Configuration

NUMBER_OF_SERVICES: Number of services being monitored

UPDATE_INTERVAL: The length of the interval between polling metrics, in seconds

DEBUG: 1 enables the GAM debug log, 0 disables the log

NUMBER_BYTES_IN_LOG: Maximum size of the GAM log, in KB
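
To review these settings outside of GAM, the [Global] section can be read with a standard INI parser. The following is a minimal sketch using Python's configparser. The read_global helper and the SAMPLE text are illustrative assumptions; GAM itself does not use this code.

```python
import configparser

# Illustrative excerpt; parameter names are taken from the descriptions above.
SAMPLE = """\
;Semi-colon at the beginning of the line indicates a comment
[Global]
NUMBER_OF_SERVICES=1
UPDATE_INTERVAL=1
DEBUG=1
NUMBER_BYTES_IN_LOG=200
"""


def read_global(text: str) -> dict:
    """Parse the [Global] section into typed values."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep parameter names in their original case
    parser.read_string(text)
    section = parser["Global"]
    return {
        "NUMBER_OF_SERVICES": section.getint("NUMBER_OF_SERVICES"),
        "UPDATE_INTERVAL": section.getint("UPDATE_INTERVAL"),  # seconds
        "DEBUG": section.getint("DEBUG") == 1,                 # 1 = log enabled
        "NUMBER_BYTES_IN_LOG": section.getint("NUMBER_BYTES_IN_LOG"),  # KB
    }
```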

CPU Threshold Configuration

CPU_LOAD_LIMIT: Percentage CPU threshold. When the threshold is reached, either an action is taken or GAM starts counting the intervals in which the threshold is reached (see CPU_INTERVALS_ALLOWED)

CPU_INTERVALS_ALLOWED: Number of intervals the CPU can be above the threshold before triggering an action (used in conjunction with UPDATE_INTERVAL to set a time limit)

UPDATE_INTERVAL: 0 = action is taken when CPU reaches its load limit. 1 = action is taken when CPU has reached its load limit the number of times specified by CPU_INTERVALS_ALLOWED

CPUAVE: Defines the type of CPU average. 1 = usage averaged across all CPU cores (system average), 0 = percentage of the core used by the process.
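
The difference between the two CPUAVE modes can be illustrated with simple arithmetic. The following sketch shows one plausible reading of the setting: it assumes that the system average is the single-core percentage divided by the number of cores. The function name and its inputs are hypothetical, and GAM's internal calculation may differ.

```python
def cpu_percent(cpu_seconds_used: float, interval_seconds: float,
                core_count: int, cpuave: int) -> float:
    """Convert CPU time consumed in one interval to a percentage.

    cpuave=0: percentage of one core.
    cpuave=1: percentage averaged across all cores (assumed to be the
              single-core figure divided by core_count).
    """
    if interval_seconds <= 0 or core_count <= 0:
        raise ValueError("interval_seconds and core_count must be positive")
    one_core_pct = 100.0 * cpu_seconds_used / interval_seconds
    return one_core_pct / core_count if cpuave == 1 else one_core_pct
```

Under this reading, a process that keeps one core of a 4-core server fully busy is reported as 100% with CPUAVE=0 and 25% with CPUAVE=1, so the same CPU_LOAD_LIMIT value is much stricter when CPUAVE=0.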

Memory Usage, Handle Count and Thread Count Thresholds Configuration

For these metrics there are two thresholds, a limit and a peak limit. An action is triggered when the limit threshold is exceeded for more intervals than allowed, or when the peak limit threshold is exceeded once. In the parameter names below, [METRIC] stands for the metric, for example MEM_USAGE, HANDLE_COUNT, or THREAD_COUNT.

[METRIC]_LIMIT: Lower level threshold. An action is triggered if this limit is exceeded for more intervals than [METRIC]_INTERVALS_ALLOWED

[METRIC]_INTERVALS_ALLOWED: Number of intervals allowed for the lower limit threshold before an action is triggered (used with UPDATE_INTERVAL for time limit)

[METRIC]_PEAK_LIMIT: Upper level threshold. An action is triggered if this threshold is exceeded once

Note: [METRIC]_INTERVALS_ALLOWED is used in conjunction with UPDATE_INTERVAL to set a time limit for the threshold. For example, UPDATE_INTERVAL=1, CPU_INTERVALS_ALLOWED=10, and CPU_LOAD_LIMIT=10 mean that an action is triggered if the CPU load is over 10% for more than 10 seconds.
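
The limit and peak limit rules described above can be sketched as follows. This is an illustration of the documented behavior, not GAM's implementation. The ThresholdMonitor class is hypothetical, and it assumes that the interval count applies to consecutive intervals and resets whenever a sample falls back under the limit.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ThresholdMonitor:
    limit: float                        # [METRIC]_LIMIT (or CPU_LOAD_LIMIT)
    intervals_allowed: int              # [METRIC]_INTERVALS_ALLOWED
    peak_limit: Optional[float] = None  # [METRIC]_PEAK_LIMIT, if the metric has one
    over_count: int = 0                 # consecutive intervals above the limit

    def sample(self, value: float) -> bool:
        """Record one UPDATE_INTERVAL sample; return True if an action triggers."""
        if self.peak_limit is not None and value > self.peak_limit:
            self.over_count = 0
            return True  # the peak limit triggers on a single sample
        if value > self.limit:
            self.over_count += 1
        else:
            self.over_count = 0  # assumption: the count is of sequential intervals
        if self.over_count > self.intervals_allowed:
            self.over_count = 0
            return True
        return False
```

With limit=10 and intervals_allowed=10, and one sample per second (UPDATE_INTERVAL=1), the first ten samples above 10% do not trigger and the eleventh does, which matches "over 10% for more than 10 seconds".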

Responsiveness

NAMEDPIPE_INTERVAL: The interval, in seconds, at which the S-TAP agent is pinged to verify responsiveness. Set to "0" to disable.

Action Configuration

The actions that can be triggered are described under Core Dump Configuration and Diagnostic Configuration. The second and third actions are only initiated if they are triggered within ACTION_RESET_INTERVALS seconds of the previous action. If that time has elapsed with no new triggers, the next trigger starts a new cycle with the FIRST_ACTION.

FIRST_ACTION: 0 = no action. 1 = stop then restart the service. 2 = stop the service.

SECOND_ACTION: The action initiated the second time there is a trigger within the ACTION_RESET_INTERVALS period. 0 = no action. 1 = stop then restart the service. 2 = stop the service.

THIRD_ACTION: The action initiated the third time there is a trigger within the ACTION_RESET_INTERVALS period. 0 = no action. 1 = stop then restart the service. 2 = stop the service.

ACTION_RESET_INTERVALS: Number of seconds before resetting the actions.
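
The escalation cycle can be sketched as a small state machine. The ActionEscalator class below is hypothetical and only illustrates the rules described in this section. What GAM does on a fourth or later trigger within the reset period is not described here; the sketch assumes that THIRD_ACTION is repeated.

```python
from typing import Optional


class ActionEscalator:
    """Pick FIRST/SECOND/THIRD_ACTION based on the time since the last trigger."""

    def __init__(self, first: int, second: int, third: int, reset_seconds: float):
        self.actions = (first, second, third)
        self.reset_seconds = reset_seconds  # ACTION_RESET_INTERVALS
        self.step = 0
        self.last_trigger: Optional[float] = None

    def next_action(self, now: float) -> int:
        """Return the action code (0, 1, or 2) for a trigger at time `now` (seconds)."""
        if self.last_trigger is None or now - self.last_trigger > self.reset_seconds:
            self.step = 0  # reset period elapsed: start a new cycle
        # Assumption: after the third trigger, keep applying THIRD_ACTION.
        action = self.actions[min(self.step, 2)]
        self.step += 1
        self.last_trigger = now
        return action
```

With FIRST_ACTION=1, SECOND_ACTION=1, THIRD_ACTION=2, and a 60-second reset period, triggers at 0, 10, and 20 seconds give restart, restart, stop. A later trigger at 200 seconds starts a new cycle with a restart.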

Core Dump Configuration

A core dump can be taken every time an action is triggered.

ACTION: 1 = take a core dump whenever an action is triggered; 0 = no core dump is taken.

MAX_NUM_DUMP: The maximum number of core dumps to be stored in the dump directory (keeping the latest).

MDTIMEOUT: Core dump timeout, in milliseconds
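
The MAX_NUM_DUMP retention rule (keep only the latest dumps) can be illustrated as follows. This is a sketch only: the prune_dumps helper is hypothetical, and the *.dmp file pattern and the use of file modification time to decide which dumps are "latest" are assumptions, not documented GAM behavior.

```python
from pathlib import Path
from typing import List


def prune_dumps(dump_dir: Path, max_num_dump: int, pattern: str = "*.dmp") -> List[Path]:
    """Delete all but the newest max_num_dump files; return the removed paths."""
    if max_num_dump < 0:
        raise ValueError("max_num_dump must not be negative")
    # Newest first, judged by modification time (an assumption).
    dumps = sorted(dump_dir.glob(pattern), key=lambda p: p.stat().st_mtime, reverse=True)
    stale = dumps[max_num_dump:]
    for path in stale:
        path.unlink()
    return stale
```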

Diagnostic Configuration

A diagnostic file can be run whenever an action is triggered. The diag.bat diagnostic script, found in the same folder as the service's executable, runs with the DIAG_PARAMETER parameters.

DIAGACTION: 1 = run the diagnostic script whenever an action is triggered; 0 = no diagnostic script is run.

DIAGNAME: Name of the diagnostic file to be run (must be in the same folder as the service executable)

DIAG_PARAMETER: Parameters to be used when running the diagnostic file

Example of resmon.ini
;Semi-colon at the beginning of the line indicates a comment
;
[Global]
NUMBER_OF_SERVICES=1
;
;Interval for checking thresholds (seconds)
UPDATE_INTERVAL=1
;
;Enables monitor log
DEBUG=1
;
;"0" means no minidump is taken on action. "1" means a minidump is taken
ACTION=1
;
;The maximum number of dumps stored in the dump directory
MAX_NUM_DUMP=3
;
;The average CPU time, "0" is percentage of one core, "1" is average percentage of all cores in system
CPUAVE=1
;
;Minidump timeout in milliseconds
MDTIMEOUT=1000
;
;Maximum size of the monitor log (in KB)
NUMBER_BYTES_IN_LOG=200
;
;Configuration for the service
[Service1]
Name=GUARDIUM_STAP
;
;Interval to check aliveness (supported agents only), set to "0" to disable
NAMEDPIPE_INTERVAL=30
;
;Run diagnostic on action, set to "1" to enable
DIAGACTION=0
;
;Diagnostic file name
DIAGNAME=diag.bat
;
;Diagnostic parameters. If the parameter has spaces it needs to be enclosed with quotes
DIAG_PARAMETER=
;
;Percentage of cpu limit
CPU_LOAD_LIMIT=10
;
;Maximum sequential intervals over CPU_LOAD_LIMIT allowed
CPU_INTERVALS_ALLOWED=10
;
;Memory limit (KB)
MEM_USAGE_LIMIT=150000
MEM_USAGE_PEAK_LIMIT=200000
MEM_USAGE_INTERVALS_ALLOWED=30
;
;Handle limit
HANDLE_COUNT_LIMIT=500
HANDLE_COUNT_PEAK_LIMIT=1000
HANDLE_COUNT_INTERVALS_ALLOWED=20
;
;Thread limit
THREAD_COUNT_LIMIT=200
THREAD_COUNT_PEAK_LIMIT=300
THREAD_COUNT_INTERVALS_ALLOWED=20
;
;'0' no action
;'1' take action, then restart the service
;'2' take action, then stop the service without restarting it
FIRST_ACTION=1
SECOND_ACTION=1
THIRD_ACTION=2
;
;Reset interval in seconds
ACTION_RESET_INTERVALS=60
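
Because improper configuration can cause operational issues, it can help to sanity-check a resmon.ini file before enabling GAM. The following is a minimal sketch using Python's configparser. The check_resmon_ini function and the two rules it applies (the number of [ServiceN] sections matches NUMBER_OF_SERVICES, and each peak limit is above its limit) are assumptions for illustration, not checks that GAM documents or performs.

```python
import configparser
from typing import List

# Metric prefixes that have both a limit and a peak limit in the sample above.
METRIC_PREFIXES = ("MEM_USAGE", "HANDLE_COUNT", "THREAD_COUNT")


def check_resmon_ini(text: str) -> List[str]:
    """Return a list of human-readable problems found in resmon.ini text."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep parameter names in their original case
    parser.read_string(text)
    problems = []

    expected = parser.getint("Global", "NUMBER_OF_SERVICES")
    services = [name for name in parser.sections() if name.startswith("Service")]
    if len(services) != expected:
        problems.append(
            f"NUMBER_OF_SERVICES={expected} but {len(services)} service section(s) found"
        )

    for name in services:
        section = parser[name]
        for prefix in METRIC_PREFIXES:
            limit = section.getint(f"{prefix}_LIMIT", fallback=None)
            peak = section.getint(f"{prefix}_PEAK_LIMIT", fallback=None)
            if limit is not None and peak is not None and peak <= limit:
                problems.append(f"[{name}] {prefix}_PEAK_LIMIT should be above {prefix}_LIMIT")
    return problems
```

An empty list means that neither rule was violated; it does not mean that the thresholds are appropriate for a particular environment.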