Monitoring thresholds

You can configure IBM Spectrum Scale™ to raise events when certain thresholds are reached. Use the Monitoring > Thresholds page to define or modify thresholds for the data that is collected through the performance monitoring sensors.

You can set the following two types of threshold levels for data collected through performance monitoring sensors:

Warning level: When the data that is being monitored reaches the warning level, the system raises an event with severity "Warning". When the observed value moves back across the warning level (under it or over it, depending on the configured direction), the system removes the warning.
Error level: When the data that is being monitored reaches the error level, the system raises an event with severity "Error". When the observed value moves back across the error level (under it or over it, depending on the configured direction), the system removes the error state.
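
As a rough illustration of how the two levels relate, the following Python sketch classifies a single observed value. The function name classify, the state labels, and the inclusive comparison used for "reaches" are assumptions made for this example and are not part of the product. Hysteresis is not considered here.

```python
def classify(value, warn_level, error_level, direction="high"):
    """Return "ERROR", "WARNING", or "HEALTHY" for one observed value.

    Hypothetical helper for illustration only.
    direction="high": a level is reached when the value is at or above it.
    direction="low":  a level is reached when the value is at or below it.
    """
    if direction not in ("high", "low"):
        raise ValueError("direction must be 'high' or 'low'")

    def reached(level):
        # Assumption: "reaches" is treated as an inclusive comparison.
        return value >= level if direction == "high" else value <= level

    # Check the error level first because it is the more severe condition.
    if error_level is not None and reached(error_level):
        return "ERROR"
    if warn_level is not None and reached(warn_level):
        return "WARNING"
    return "HEALTHY"
```

For example, with a warning level of 80 and an error level of 90, a value of 85 is classified as "WARNING" and a value of 95 as "ERROR".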

Certain types of thresholds are predefined in the system. The following predefined thresholds are available:

Inode utilization at the fileset level
Data pool capacity utilization
Metadata pool capacity utilization
Free memory utilization

Apart from the predefined thresholds, you can also create user-defined thresholds for the data collected through the performance monitoring sensors.

You can use the Monitoring > Thresholds page in the GUI and the mmhealth command to manage both predefined and user-defined thresholds.

Defining thresholds

Use the Create Thresholds option to define user-defined thresholds or to modify the predefined thresholds. You can specify the following details in a threshold rule:

Metric category: Lists all performance monitoring sensors that are enabled in the system and thresholds that are derived by performing certain calculations on certain performance metrics. These derived thresholds are referred to as measurements. The measurements category provides the flexibility to edit certain predefined threshold rules. The following measurements are available for selection:
Fileset_inode

Inode capacity utilization at the fileset level. This is calculated as:
(sum(gpfs_fset_allocInodes)-sum(gpfs_fset_freeInodes))/sum(gpfs_fset_maxInodes)

DataPool_capUtil

Data pool capacity utilization. This is calculated as:
(sum(gpfs_pool_total_dataKB)-sum(gpfs_pool_free_dataKB))/sum(gpfs_pool_total_dataKB)

MetaDataPool_capUtil

Metadata pool capacity utilization. This is calculated as:
(sum(gpfs_pool_total_metaKB)-sum(gpfs_pool_free_metaKB))/sum(gpfs_pool_total_metaKB)

FsLatency_diskWaitRd

File system latency for the read operations. Average disk wait time per read operation on the IBM Spectrum Scale client.
sum(gpfs_fs_tot_disk_wait_rd)/sum(gpfs_fs_read_ops)

FsLatency_diskWaitWr

File system latency for the write operations. Average disk wait time per write operation on the IBM Spectrum Scale client.
sum(gpfs_fs_tot_disk_wait_wr)/sum(gpfs_fs_write_ops)

SMBNodeLatency_read

SMB read latency at the node level.
avg(op_time)/avg(op_count)

SMBNodeLatency_write

SMB write latency at the node level.
avg(op_time)/avg(op_count)

NFSNodeLatency_read

NFS read latency at the node level.
sum(nfs_read_lat)/sum(nfs_read_ops)

NFSNodeLatency_write

NFS write latency at the node level.
sum(nfs_write_lat)/sum(nfs_write_ops)
Metric name: The list of performance metrics that are available under the selected performance monitoring sensor or the measurement.
Name: User-defined name of the threshold rule.
Filter by: Defines the filter criteria for the threshold rule.
Group by: Groups the threshold values by the selected grouping criteria. If you select a value in this field, you must select an aggregation function in the Aggregator field. By default, there is no grouping, which means that the thresholds are evaluated based on the finest available key.
Warning level: Defines the threshold level for warning events to be raised for the selected metric. When the warning level is reached, the system raises an event with severity "Warning". You can customize the warning message to specify the user action that is required to fix the issue.
Error level: Defines the threshold level for error events to be raised for the selected metric. When the error level is reached, the system raises an event with severity "Error". You can customize the error message to specify the user action that is required to fix the issue.
Aggregator: When grouping is selected in the Group by field, an aggregator must be chosen to define the aggregation function. When the Rate aggregator is set, the grouping is automatically set to the finest available grouping.
Sensitivity: Defines the sample interval value. If a sensor is configured with an interval period greater than 5 minutes, the sensitivity is set to the same value as the sensor's period. The minimum value allowed is 120 seconds. If a sensor is configured with an interval period of less than 120 seconds, the sensitivity is set to 120 seconds.
Hysteresis: Defines the percentage of the observed value that must be under or over the current threshold level to switch back to the previous state. The default value is 0%. Hysteresis is used to avoid frequent state changes when the values are close to the threshold. The level needs to be set according to the volatility of the metric.
Direction: Defines whether the events and messages are sent when the value that is being monitored exceeds or goes below the threshold level.
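
The capacity and latency measurements listed under Metric category are simple ratios of summed metrics. The following Python sketch reproduces three of those formulas. The metric names come from the formulas above; the input layout (a list of dictionaries keyed by metric name), the function names, and the handling of a zero denominator are assumptions made for this example.

```python
def ratio(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def total(samples, metric):
    """Sum one metric over all samples (each sample is a dict of metric values)."""
    return sum(sample[metric] for sample in samples)


def fileset_inode(samples):
    # (sum(gpfs_fset_allocInodes)-sum(gpfs_fset_freeInodes))/sum(gpfs_fset_maxInodes)
    used = total(samples, "gpfs_fset_allocInodes") - total(samples, "gpfs_fset_freeInodes")
    return ratio(used, total(samples, "gpfs_fset_maxInodes"))


def datapool_cap_util(samples):
    # (sum(gpfs_pool_total_dataKB)-sum(gpfs_pool_free_dataKB))/sum(gpfs_pool_total_dataKB)
    total_kb = total(samples, "gpfs_pool_total_dataKB")
    return ratio(total_kb - total(samples, "gpfs_pool_free_dataKB"), total_kb)


def fs_latency_disk_wait_rd(samples):
    # sum(gpfs_fs_tot_disk_wait_rd)/sum(gpfs_fs_read_ops)
    return ratio(total(samples, "gpfs_fs_tot_disk_wait_rd"),
                 total(samples, "gpfs_fs_read_ops"))


# Made-up sample values, for illustration only.
fileset_samples = [
    {"gpfs_fset_allocInodes": 1000, "gpfs_fset_freeInodes": 200, "gpfs_fset_maxInodes": 4000},
    {"gpfs_fset_allocInodes": 3000, "gpfs_fset_freeInodes": 800, "gpfs_fset_maxInodes": 4000},
]
inode_utilization = fileset_inode(fileset_samples)  # (4000 - 1000) / 8000
```

The remaining measurements follow the same pattern with their own metric names.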

You can also edit and delete a threshold rule. Use the Use as Template option that is available in the Actions menu to use an already defined threshold as the template to create a threshold.

Threshold configuration - A scenario

The user wants to configure a threshold rule to monitor the maximum disk capacity usage. The following table shows the values against each field of the Create Threshold dialog and their respective functionality.

Table 1. Threshold rule configuration - A sample scenario
GUI fields	Value and Function
Metric Category	GPFSDiskCap Specifies that the threshold rule is going to be defined for the metrics that belong to the GPFSDiskCap sensor.
Metric name	Total Capacity The threshold rule is going to be defined to monitor the threshold levels of total capacity usage.
Name	Total capacity threshold By default, the performance monitoring metric name is used as the threshold rule name. In this scenario, the default value is overwritten with "Total capacity threshold".
Filter by	Cluster The values are filtered at the cluster level.
Group by	File System Groups the selected metric by file system.
Aggregator	Maximum When the maximum capacity exceeds the threshold level, the system raises the event. If one of the following values is selected, the nature of the threshold rule changes accordingly: Sum: When the sum of the metric values exceeds the threshold levels, the system raises the events. Average: When the average value exceeds the threshold levels, the system raises the events. Maximum: When the maximum value exceeds the threshold levels, the system raises the events. Minimum: When the minimum value exceeds or goes below the threshold levels, the system raises the events. Rate: When the rate exceeds the threshold value, the system raises the events. Rate is supported only for the finest "group by" clause; a rate for a partial key is not supported. That is, when Rate is selected, the system automatically selects the best possible values in the grouping field.
Warning level	9 GiB The system raises an event with severity Warning when the total capacity usage exceeds 9 GiB.
Error level	10 GiB The system raises an event with severity level Error when the total capacity usage exceeds 10 GiB.
Sensitivity	10 Minutes The threshold value is being monitored at an interval of 10 minutes.
Hysteresis	50% If the value is reduced below 4.5 GiB, the warning state is removed. Similarly, if the value is reduced below 5 GiB, the error state is removed.
Direction	High When the value that is being monitored exceeds the threshold limit, the system raises an event.
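
The scenario in Table 1 can be traced with a small Python sketch: values are grouped by file system, the Maximum aggregator is applied, and the result is compared against the 9 GiB and 10 GiB levels with 50% hysteresis and direction High. The function names, the sample layout, the state labels, and the exact state-transition logic are assumptions made for this illustration; only the levels and the 4.5 GiB and 5 GiB clearing points are taken from the table.

```python
from collections import defaultdict

WARN_GIB = 9.0
ERROR_GIB = 10.0
HYSTERESIS_PCT = 50.0


def clear_level(level, hysteresis_pct):
    """Value below which a state raised at `level` is removed (direction High)."""
    return level * (1 - hysteresis_pct / 100)


def max_by_file_system(samples):
    """Group (file_system, value) pairs by file system and keep the maximum."""
    grouped = defaultdict(list)
    for file_system, value in samples:
        grouped[file_system].append(value)
    return {fs: max(values) for fs, values in grouped.items()}


def next_state(state, value):
    """Return the new state for one aggregated value (assumed transition logic)."""
    if value > ERROR_GIB:
        return "ERROR"
    # An existing error state stays until the value drops below 5 GiB.
    if state == "ERROR" and value >= clear_level(ERROR_GIB, HYSTERESIS_PCT):
        return "ERROR"
    if value > WARN_GIB:
        return "WARNING"
    # A warning (or a just-cleared error) stays until the value drops below 4.5 GiB.
    if state in ("WARNING", "ERROR") and value >= clear_level(WARN_GIB, HYSTERESIS_PCT):
        return "WARNING"
    return "HEALTHY"


# Made-up values in GiB for one file system, evaluated once per interval.
history = []
state = "HEALTHY"
for observed in [9.5, 10.5, 7.0, 4.8, 4.0]:
    state = next_state(state, observed)
    history.append(state)
```

With these made-up values, the state stays raised at 7.0 GiB even though that is below both levels, which is the effect of the 50% hysteresis setting.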