Monitoring thresholds
You can configure IBM Spectrum Scale™ to raise events when certain thresholds are reached. Use the page to define or modify thresholds for the data that is collected through the performance monitoring sensors.
- Warning level
- When the data that is being monitored reaches the warning level, the system raises an event with severity "Warning". When the observed value returns to the other side of the threshold level (under or over it, depending on the direction of the rule), the system removes the warning.
- Error level
- When the data that is being monitored reaches the error level, the system raises an event with severity "Error". When the observed value returns to the other side of the threshold level (under or over it, depending on the direction of the rule), the system removes the error state.
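The relationship between the two levels can be illustrated with a minimal Python sketch. It is not the product's implementation: the function name `classify`, the state name "Normal", and the choice to treat "reaches" as an inclusive comparison are assumptions made for illustration.

```python
def classify(value, warn_level, error_level, direction="high"):
    """Map an observed value to a severity (illustrative sketch only).

    direction="high": values at or above a level trigger that level.
    direction="low":  values at or below a level trigger that level.
    The inclusive comparison is an assumption, not documented behavior.
    """
    if direction not in ("high", "low"):
        raise ValueError("direction must be 'high' or 'low'")

    def reached(level):
        return value >= level if direction == "high" else value <= level

    # The error level is checked first because it is the more severe state.
    if reached(error_level):
        return "Error"
    if reached(warn_level):
        return "Warning"
    return "Normal"


# Capacity utilization as a fraction: warn at 80%, error at 90%.
print(classify(0.85, warn_level=0.80, error_level=0.90))
# Free memory in MB, where low values are the problem.
print(classify(40, warn_level=100, error_level=50, direction="low"))
```

The sketch is stateless; it does not model hysteresis or sensitivity, which affect when a raised state is removed again.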
The following thresholds are predefined in the system:
- Inode utilization at the fileset level
- Data pool capacity utilization
- Metadata pool capacity utilization
- Free memory utilization
Apart from the predefined thresholds, you can also create user-defined thresholds for the data collected through the performance monitoring sensors.
You can use the GUI and the mmhealth command to manage both predefined and user-defined thresholds.
Defining thresholds
- Metric category: Lists all performance monitoring sensors that are enabled in the system and thresholds that are derived by performing certain calculations on certain performance metrics. These derived thresholds are referred to as measurements. The measurements category provides the flexibility to edit certain predefined threshold rules. The following measurements are available for selection:
- Fileset_inode
- Inode capacity utilization at the fileset level. This is calculated as:
(sum(gpfs_fset_allocInodes)-sum(gpfs_fset_freeInodes))/sum(gpfs_fset_maxInodes)
- DataPool_capUtil
- Data pool capacity utilization. This is calculated as:
(sum(gpfs_pool_total_dataKB)-sum(gpfs_pool_free_dataKB))/sum(gpfs_pool_total_dataKB)
- MetaDataPool_capUtil
- Metadata pool capacity utilization. This is calculated as:
(sum(gpfs_pool_total_metaKB)-sum(gpfs_pool_free_metaKB))/sum(gpfs_pool_total_metaKB)
- FsLatency_diskWaitRd
- File system latency for the read operations. Average disk wait time per read operation on the IBM Spectrum Scale client.
sum(gpfs_fs_tot_disk_wait_rd)/sum(gpfs_fs_read_ops)
- FsLatency_diskWaitWr
- File system latency for the write operations. Average disk wait time per write operation on the IBM Spectrum Scale client.
sum(gpfs_fs_tot_disk_wait_wr)/sum(gpfs_fs_write_ops)
- SMBNodeLatency_read
- SMB read latency at the node level.
avg(op_time)/avg(op_count)
- SMBNodeLatency_write
- SMB write latency at the node level.
avg(op_time)/avg(op_count)
- NFSNodeLatency_read
- NFS read latency at the node level.
sum(nfs_read_lat)/sum(nfs_read_ops)
- NFSNodeLatency_write
- NFS write latency at the node level.
sum(nfs_write_lat)/sum(nfs_write_ops)
- Metric name: The list of performance metrics that are available under the selected performance monitoring sensor or the measurement.
- Name: User-defined name of the threshold rule.
- Filter by: Defines the filter criteria for the threshold rule.
- Group by: Groups the threshold values by the selected grouping criteria. If you select a value in this field, you must also select an aggregator in the Aggregator field. By default, there is no grouping, which means that the thresholds are evaluated based on the finest available key.
- Warning level: Defines the threshold level for warning events to be raised for the selected metric. When the warning level is reached, the system raises an event with severity "Warning". You can customize the warning message to specify the user action that is required to fix the issue.
- Error level: Defines the threshold level for error events to be raised for the selected metric. When the error level is reached, the system raises an event with severity "Error". You can customize the error message to specify the user action that is required to fix the issue.
- Aggregator: When grouping is selected in the Group by field, an aggregator must be chosen to define the aggregation function. When the Rate aggregator is set, the grouping is automatically set to the finest available grouping.
- Sensitivity: Defines the sample interval value. If a sensor is configured with an interval period greater than 5 minutes, the sensitivity is set to the same value as the sensor's period. The minimum value allowed is 120 seconds. If a sensor is configured with an interval period of less than 120 seconds, the sensitivity is set to 120 seconds.
- Hysteresis: Defines the percentage of the observed value that must be under or over the current threshold level to switch back to the previous state. The default value is 0%. Hysteresis is used to avoid frequent state changes when the values are close to the threshold. The level needs to be set according to the volatility of the metric.
- Direction: Defines whether the events and messages are sent when the value that is being monitored exceeds or goes below the threshold level.
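The measurement formulas listed under Metric category are simple ratios of summed metric samples. The following Python sketch shows how three of them could be evaluated. The function names and the sample numbers are hypothetical, and how the system actually collects and sums the samples is not described here.

```python
from typing import Optional, Sequence


def data_pool_cap_util(total_kb: Sequence[float], free_kb: Sequence[float]) -> float:
    """DataPool_capUtil:
    (sum(gpfs_pool_total_dataKB)-sum(gpfs_pool_free_dataKB))/sum(gpfs_pool_total_dataKB)
    """
    total = sum(total_kb)
    if total == 0:
        raise ValueError("total capacity is zero")
    return (total - sum(free_kb)) / total


def fileset_inode_util(alloc: Sequence[float], free: Sequence[float],
                       max_inodes: Sequence[float]) -> float:
    """Fileset_inode:
    (sum(gpfs_fset_allocInodes)-sum(gpfs_fset_freeInodes))/sum(gpfs_fset_maxInodes)
    """
    maximum = sum(max_inodes)
    if maximum == 0:
        raise ValueError("maximum inode count is zero")
    return (sum(alloc) - sum(free)) / maximum


def fs_latency(total_wait: Sequence[float], ops: Sequence[float]) -> Optional[float]:
    """FsLatency_diskWaitRd / FsLatency_diskWaitWr: sum(wait time)/sum(operations).

    Returns None when no operations were recorded (a choice made for this sketch).
    """
    total_ops = sum(ops)
    if total_ops == 0:
        return None
    return sum(total_wait) / total_ops


# Hypothetical samples from two pools, two filesets, and two file systems.
print(data_pool_cap_util([1000, 3000], [250, 750]))
print(fileset_inode_util([500, 300], [100, 100], [1000, 1000]))
print(fs_latency([0.5, 1.5], [100, 300]))
```

Because each formula divides one sum by another, the result is a utilization fraction or a per-operation average across all contributing samples, not the mean of the individual ratios.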
You can also edit and delete a threshold rule. Use the Use as Template option that is available in the Actions menu to use an already defined threshold as the template to create a threshold.
Threshold configuration - A scenario
A user wants to configure a threshold rule to monitor the maximum disk capacity usage. The following table shows the value for each field of the Create Threshold dialog and its function.
GUI field | Value | Function |
---|---|---|
Metric category | GPFSDiskCap | Specifies that the threshold rule is defined for the metrics that belong to the GPFSDiskCap sensor. |
Metric name | Total Capacity | The threshold rule monitors the threshold levels of total capacity usage. |
Name | Total capacity threshold | By default, the performance monitoring metric name is used as the threshold rule name. Here, the default value is overwritten with "Total capacity threshold". |
Filter by | Cluster | The values are filtered at the cluster level. |
Group by | File System | Groups the selected metric by file system. |
Aggregator | Maximum | When the maximum capacity exceeds the threshold level, the system raises the event. If another aggregator value is selected, the nature of the threshold rule changes accordingly. |
Warning level | 9 GiB | The system raises an event with severity Warning when the total capacity usage exceeds 9 GiB. |
Error level | 10 GiB | The system raises an event with severity Error when the total capacity usage exceeds 10 GiB. |
Sensitivity | 10 minutes | The threshold value is monitored at an interval of 10 minutes. |
Hysteresis | 50% | If the value is reduced below 4.5 GiB, the warning state is removed. Similarly, if the value is reduced below 5 GiB, the error state is removed. |
Direction | High | When the value that is being monitored exceeds the threshold limit, the system raises an event. |
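The scenario can be replayed with a small Python sketch. It groups hypothetical samples by file system with the Maximum aggregator and then applies the 9 GiB and 10 GiB levels with 50% hysteresis. All names and sample values are illustrative. The rule that an error state falls back to Warning while the value is still at or above 4.5 GiB, and the inclusive comparisons at the levels, are assumptions of this sketch rather than documented behavior.

```python
WARN_GIB = 9.0
ERROR_GIB = 10.0
HYSTERESIS = 0.50  # 50%

# With direction High, a state is removed only below level * (1 - hysteresis).
WARN_CLEAR_GIB = WARN_GIB * (1 - HYSTERESIS)    # 4.5 GiB
ERROR_CLEAR_GIB = ERROR_GIB * (1 - HYSTERESIS)  # 5.0 GiB


def max_per_file_system(samples):
    """Group (file_system, value_gib) pairs by file system and keep the maximum."""
    grouped = {}
    for file_system, value_gib in samples:
        grouped[file_system] = max(value_gib, grouped.get(file_system, value_gib))
    return grouped


def next_state(state, value_gib):
    """Return the new state for one aggregated value, honoring hysteresis."""
    if value_gib >= ERROR_GIB:
        return "Error"
    if state == "Error" and value_gib >= ERROR_CLEAR_GIB:
        return "Error"    # not yet below 5 GiB, so the error state stays
    if value_gib >= WARN_GIB:
        return "Warning"
    if state in ("Warning", "Error") and value_gib >= WARN_CLEAR_GIB:
        return "Warning"  # not yet below 4.5 GiB, so the warning stays
    return "Normal"


def replay(values, state="Normal"):
    """Apply next_state to a sequence of values and record each resulting state."""
    history = []
    for value_gib in values:
        state = next_state(state, value_gib)
        history.append(state)
    return history


print(max_per_file_system([("fs0", 8.0), ("fs0", 9.5), ("fs1", 3.0)]))
print(replay([8.0, 9.5, 10.5, 7.0, 4.8, 4.0]))
```

In the replayed sequence, the value 7.0 GiB keeps the error state even though it is below the 10 GiB error level, because hysteresis requires the value to drop below 5 GiB first. This is the effect that avoids frequent state changes near the threshold.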