Windows: Monitoring with the Guardium Agent Monitor
The Guardium Agent Monitor (GAM) process monitors Guardium agent performance and responsiveness. It is useful for detailed analysis during troubleshooting.
Monitoring covers:
- CPU usage
- Memory
- Handles
- Number of threads
- Alive (responsiveness): supported agents only; S-TAP is currently the only supported agent. See Responsiveness.
- Automatically run diag.bat
- Automatically stop/restart the service
- Automatically perform a core dump
Guardium Agent Monitor is installed when S-TAP is installed but is not enabled by default. When S-TAP is uninstalled, GAM is uninstalled.
The default install location for GAM is the parent folder of S-TAP (C:\Program Files\IBM\Guardium Agent Monitor\).
The default location for GAM output is the \Bin\ subfolder.
After enabling GAM, make sure the resmon.exe process is running on the database server.
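One way to confirm that resmon.exe is running is to query the Windows tasklist command. The following is a minimal sketch, not part of GAM: the function names are hypothetical, and it assumes the standard tasklist CSV output (image name in the first column, PID in the second).

```python
import csv
import io
import subprocess


def parse_tasklist_csv(output: str, image_name: str = "resmon.exe") -> list:
    """Return the PIDs of matching rows in `tasklist /FO CSV /NH` output."""
    pids = []
    for row in csv.reader(io.StringIO(output)):
        # When nothing matches, tasklist prints an INFO line instead of CSV
        # rows; that line has no PID column and is skipped here.
        if len(row) >= 2 and row[0].lower() == image_name.lower() and row[1].isdigit():
            pids.append(int(row[1]))
    return pids


def resmon_pids() -> list:
    """Ask Windows for running resmon.exe processes (Windows only)."""
    result = subprocess.run(
        ["tasklist", "/FI", "IMAGENAME eq resmon.exe", "/FO", "CSV", "/NH"],
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_tasklist_csv(result.stdout)
```

On the database server, an empty list from resmon_pids() would suggest that GAM is not running.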
- GAM Configuration
The Guardium Agent Monitor runs with its configuration file, resmon.ini, as its argument, and is controlled through that file. See Example of resmon.ini. Note that the default values for all of the parameters are at the bottom of the sample ini.
- Global Configuration
NUMBER_OF_SERVICES: Number of services being monitored
UPDATE_INTERVAL: The length of the interval between polling metrics, in seconds
DEBUG: 1 enables the GAM debug log, 0 disables the log
NUMBER_BYTES_IN_LOG: Maximum size of the GAM log. Despite the parameter name, the value is in KB
- CPU Threshold Configuration
CPU_LOAD_LIMIT: Percentage CPU threshold at which either action is taken, or UPDATE_INTERVAL starts counting occurrences of reaching threshold
CPU_INTERVALS_ALLOWED: Number of intervals the CPU can be above the threshold before triggering an action (used in conjunction with UPDATE_INTERVAL to set a time limit)
UPDATE_INTERVAL: 0 = action is taken when CPU reaches its load limit. 1 = action is taken when CPU has reached its load limit the number of times specified by CPU_INTERVALS_ALLOWED
CPUAVE: Defines the type of CPU average. 1 = usage averaged across all CPU cores (system average), 0 = percentage of the core used by the process.
- Memory Usage, Handle Count and Thread Count Thresholds Configuration
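For these metrics there are two thresholds: a limit and a peak limit. An action is triggered when the limit threshold is exceeded for more intervals than allowed, or when the peak limit threshold is exceeded once. In the parameter names below, [METRIC] stands for the metric being monitored (CPU, memory, and so on); the sample file uses the prefixes MEM_USAGE, HANDLE_COUNT, and THREAD_COUNT.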
[METRIC]_LIMIT: Lower level threshold. An action is triggered if this limit is exceeded for more intervals than [METRIC]_INTERVALS_ALLOWED
[METRIC]_INTERVALS_ALLOWED: Number of intervals allowed for the lower limit threshold before an action is triggered (used with UPDATE_INTERVAL for time limit)
[METRIC]_PEAK_LIMIT: Upper level threshold. An action is triggered if this threshold is exceeded once
Note: [METRIC]_INTERVALS_ALLOWED is used in conjunction with UPDATE_INTERVAL to set a time limit for the threshold. For example, UPDATE_INTERVAL=1, CPU_INTERVALS_ALLOWED=10, and CPU_LOAD_LIMIT=10 mean that an action is triggered if the CPU load is over 10% for more than 10 seconds.
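The limit and peak limit rules can be sketched as a small tracker that is fed one reading per UPDATE_INTERVAL. This is an illustration of the documented behavior, not GAM's actual implementation. The class name is hypothetical, and it assumes that the interval count resets when a reading drops back under the limit (the sample file describes the intervals as sequential) and again after a trigger.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ThresholdTracker:
    limit: float                         # [METRIC]_LIMIT (or CPU_LOAD_LIMIT)
    intervals_allowed: int               # [METRIC]_INTERVALS_ALLOWED
    peak_limit: Optional[float] = None   # [METRIC]_PEAK_LIMIT; the sample sets none for CPU
    over_count: int = 0                  # consecutive readings above the limit

    def sample(self, value: float) -> bool:
        """Record one reading; return True if it should trigger an action."""
        # Exceeding the peak limit once is enough to trigger.
        if self.peak_limit is not None and value > self.peak_limit:
            self.over_count = 0
            return True
        # Assumption: readings must be consecutive, so a dip resets the count.
        if value > self.limit:
            self.over_count += 1
        else:
            self.over_count = 0
        # Trigger only when the limit is exceeded for MORE intervals than allowed.
        if self.over_count > self.intervals_allowed:
            self.over_count = 0
            return True
        return False
```

With the values from the note (limit 10, 10 intervals allowed, one reading per second), ten consecutive readings above 10% do not trigger, and the eleventh does.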
- Responsiveness
- NAMEDPIPE_INTERVAL: The interval, in seconds, at which the S-TAP agent is pinged to verify responsiveness. Set to "0" to disable
- Action Configuration
The actions that can be triggered are described under Core Dump Configuration and Diagnostic Configuration. The second and third actions are only initiated if they are triggered within ACTION_RESET_INTERVALS seconds of the previous action. If the ACTION_RESET_INTERVALS time has elapsed with no new triggers, the next trigger starts a new cycle with the FIRST_ACTION.
FIRST_ACTION: 0 = no action. 1 = stop then restart the service. 2 = stop the service.
SECOND_ACTION: The action initiated the second time there is a trigger during the ACTION_RESET_INTERVALS period. 0 = no action. 1 = stop then restart the service. 2 = stop the service.
THIRD_ACTION: The action initiated the third time there is a trigger during the ACTION_RESET_INTERVALS period. 0 = no action. 1 = stop then restart the service. 2 = stop the service.
ACTION_RESET_INTERVALS: Number of seconds before resetting the actions.
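The escalation cycle can be modeled as a small state machine. The sketch below is illustrative only and is not taken from GAM. The class and method names are hypothetical, and what happens on a fourth trigger inside the reset window is not documented here, so the sketch assumes that THIRD_ACTION is repeated.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ActionCycle:
    actions: List[int]                    # [FIRST_ACTION, SECOND_ACTION, THIRD_ACTION]
    reset_interval: float                 # ACTION_RESET_INTERVALS, in seconds
    step: int = 0                         # index of the next action in the cycle
    last_trigger: Optional[float] = None  # time of the previous trigger, in seconds

    def on_trigger(self, now: float) -> int:
        """Return the action code (0, 1, or 2) for a trigger at time `now`."""
        # A trigger after the reset interval has elapsed starts a new cycle.
        if self.last_trigger is None or now - self.last_trigger > self.reset_interval:
            self.step = 0
        # Assumption: triggers beyond the third repeat the last configured action.
        action = self.actions[min(self.step, len(self.actions) - 1)]
        self.step += 1
        self.last_trigger = now
        return action
```

With the sample values (FIRST_ACTION=1, SECOND_ACTION=1, THIRD_ACTION=2, ACTION_RESET_INTERVALS=60), triggers at 0, 30, and 70 seconds escalate from restart to restart to stop, and a trigger at 200 seconds starts over with a restart.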
- Core Dump Configuration
A core dump can be taken every time an action is triggered.
ACTION: 1 = take a core dump whenever an action is triggered; 0 = no core dump is taken.
MAX_NUM_DUMP: The maximum number of core dumps to be stored in the dump directory (keeping the latest).
MDTIMEOUT: Core dump timeout time (in milliseconds)
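The retention rule for MAX_NUM_DUMP (keep only the latest dumps) could look like the following sketch. It is an assumption about the behavior, not GAM's code: the function name and the file names in the usage note are hypothetical, and the dump files are represented as (name, modification time) pairs so that no real directory layout is implied.

```python
from typing import List, Tuple


def dumps_to_delete(dumps: List[Tuple[str, float]], max_num_dump: int) -> List[str]:
    """Return the names of dumps that fall outside the newest `max_num_dump`.

    `dumps` holds (file name, modification time) pairs; a larger time is newer.
    """
    newest_first = sorted(dumps, key=lambda d: d[1], reverse=True)
    # Everything after the first max_num_dump entries is older and can go.
    return [name for name, _ in newest_first[max(max_num_dump, 0):]]
```

For example, with MAX_NUM_DUMP=3 and four dumps, only the oldest one is returned for deletion.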
- Diagnostic Configuration
A diagnostic file can be run whenever an action is triggered. The diag.bat diagnostic script, located in the same folder as the service executable, runs with the DIAG_PARAMETER parameters.
DIAGACTION: 1 = run the diagnostic script whenever an action is triggered; 0 = no diagnostic script is run.
DIAGNAME: Name of the diagnostic file to be run (must be in the same folder as the service executable)
DIAG_PARAMETER: Parameters to be used when running the diagnostic file
- Example of resmon.ini
;Semi-colon at the beginning of the line indicates a comment
;
[Global]
NUMBER_OF_SERVICES=1
;
;Interval for checking thresholds (seconds)
UPDATE_INTERVAL=1
;
;Enables monitor log
DEBUG=1
;
;"0" means it won't take minidump for action. "1", it will take minidump
ACTION=1
;
;The maximum number of dump stores in dump directory
MAX_NUM_DUMP=3
;
;The average CPU time, "0" is percentage of one core, "1" is average percentage of all cores in system
CPUAVE=1
;
;miniDump timeout in milliseconds
MDTIMEOUT=1000
;Maximum number of BYTES for monitor log (in KB)
NUMBER_BYTES_IN_LOG=200
;
;Configuration for the service
[Service1]
Name=GUARDIUM_STAP
;
;Interval to check aliveness (supported agents only), set to "0" to disable
NAMEDPIPE_INTERVAL=30
;
;Run diagnostic on action, set to "1" to enable
DIAGACTION=0
;
;Diagnostic file name
DIAGNAME=diag.bat
;
;Diagnostic parameters. If the parameter has spaces it needs to be enclosed with quotes
DIAG_PARAMETER=
;
;Percentage of cpu limit
CPU_LOAD_LIMIT=10
;
;Maximum sequential intervals over CPU_LOAD_LIMIT allowed
CPU_INTERVALS_ALLOWED=10
;
;Memory limit (KB)
MEM_USAGE_LIMIT=150000
MEM_USAGE_PEAK_LIMIT=200000
MEM_USAGE_INTERVALS_ALLOWED=30
;
;Handle limit
HANDLE_COUNT_LIMIT=500
HANDLE_COUNT_PEAK_LIMIT=1000
HANDLE_COUNT_INTERVALS_ALLOWED=20
;
;Thread limit
THREAD_COUNT_LIMIT=200
THREAD_COUNT_PEAK_LIMIT=300
THREAD_COUNT_INTERVALS_ALLOWED=20
;
;'1' take action, then restart the service
;'2' take action, then stop the service without start
FIRST_ACTION=1
SECOND_ACTION=1
THIRD_ACTION=2
;
;Reset interval in seconds
ACTION_RESET_INTERVALS=60
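Before deploying an edited resmon.ini, it can help to load it and sanity-check the thresholds. The sketch below uses Python's standard configparser module, which treats lines starting with a semicolon as comments. It is not an IBM tool: the helper names are hypothetical, and it assumes that service sections are named [Service1] through [ServiceN], which the sample shows only for a single service.

```python
import configparser

# Trimmed copy of the sample above, used only for illustration.
SAMPLE = """\
;Semi-colon at the beginning of the line indicates a comment
[Global]
NUMBER_OF_SERVICES=1
UPDATE_INTERVAL=1
;
[Service1]
Name=GUARDIUM_STAP
CPU_LOAD_LIMIT=10
CPU_INTERVALS_ALLOWED=10
MEM_USAGE_LIMIT=150000
MEM_USAGE_PEAK_LIMIT=200000
MEM_USAGE_INTERVALS_ALLOWED=30
"""


def load_resmon(text: str) -> configparser.ConfigParser:
    """Parse resmon.ini text, keeping the upper-case parameter names."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # configparser lower-cases option names by default
    parser.read_string(text)
    return parser


def cpu_window_seconds(parser: configparser.ConfigParser, section: str) -> int:
    """Time limit implied by UPDATE_INTERVAL and CPU_INTERVALS_ALLOWED."""
    return parser.getint("Global", "UPDATE_INTERVAL") * parser.getint(
        section, "CPU_INTERVALS_ALLOWED"
    )


def check_thresholds(parser: configparser.ConfigParser) -> list:
    """Return a list of problems; an empty list means nothing was flagged."""
    problems = []
    count = parser.getint("Global", "NUMBER_OF_SERVICES")
    for i in range(1, count + 1):
        section = f"Service{i}"  # assumed naming scheme
        if not parser.has_section(section):
            problems.append(f"missing [{section}]")
            continue
        for metric in ("MEM_USAGE", "HANDLE_COUNT", "THREAD_COUNT"):
            limit = parser.getint(section, f"{metric}_LIMIT", fallback=None)
            peak = parser.getint(section, f"{metric}_PEAK_LIMIT", fallback=None)
            # A peak limit below the lower limit would make the lower limit moot.
            if limit is not None and peak is not None and peak < limit:
                problems.append(f"[{section}] {metric}_PEAK_LIMIT is below {metric}_LIMIT")
    return problems
```

Calling check_thresholds(load_resmon(SAMPLE)) returns an empty list for the trimmed sample, and cpu_window_seconds reports the 10-second window described in the note on interval limits.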