Resolving alerts and errors
IBM Storage Fusion Data Foundation can detect and automatically resolve a number of common failure scenarios. However, some problems require administrator intervention. Use this information to resolve alerts and errors.
CephMonVersionMismatch
- Message: There are multiple versions of storage services running.
- Description: There are {{ $value }} different versions of Ceph Mon components running.
- Severity: Warning
- Resolution: Fix
- Procedure:
  - Inspect the user interface and log, and verify if an update is in progress.
  - If an update is in progress, this alert is temporary.
  - If an update is not in progress, restart the upgrade process.
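One way to verify which Ceph Mon versions are running is to list the container image used by each Mon pod. This is a minimal sketch, assuming the default openshift-storage namespace and the standard Rook label app=rook-ceph-mon; both can differ in your deployment.

  # List each Mon pod and its image; more than one distinct image indicates
  # the version mismatch that this alert reports.
  oc get pods -n openshift-storage -l app=rook-ceph-mon \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].image}{"\n"}{end}'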
CephOSDVersionMismatch
- Message: There are multiple versions of storage services running.
- Description: There are {{ $value }} different versions of Ceph OSD components running.
- Severity: Warning
- Resolution: Fix
- Procedure:
  - Inspect the user interface and log, and verify if an update is in progress.
  - If an update is in progress, this alert is temporary.
  - If an update is not in progress, restart the upgrade process.
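The same image check applies to the OSD daemons, assuming the standard Rook label app=rook-ceph-osd:

  oc get pods -n openshift-storage -l app=rook-ceph-osd \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].image}{"\n"}{end}'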
CephClusterCriticallyFull
- Message: Storage cluster is critically full and needs immediate expansion
- Description: Storage cluster utilization has crossed 85%.
- Severity: Critical
- Resolution: Fix
- Procedure: Remove unnecessary data or expand the cluster.
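Before expanding, it can help to confirm the utilization that Ceph itself reports. This is a sketch, assuming the default openshift-storage namespace and that capacity figures are exposed under status.ceph.capacity of the CephCluster resource; the field path can vary by release.

  # Show the raw capacity figures (total, used, available bytes) reported by Ceph.
  oc get cephcluster -n openshift-storage -o jsonpath='{.items[0].status.ceph.capacity}'

Expansion itself is typically done by adding capacity through the management console or by increasing the storage device set count in the storage cluster resource; follow the expansion procedure for your platform.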
CephClusterNearFull
- Message: Storage cluster is nearing full. Expansion is required.
- Description: Storage cluster utilization has crossed 75%.
- Severity: Warning
- Resolution: Fix
- Procedure: Remove unnecessary data or expand the cluster.
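If the Rook toolbox pod is deployed in your cluster, you can also check raw and per-pool utilization directly. This is a sketch; the rook-ceph-tools deployment name is an assumption, and the toolbox might need to be enabled first.

  # Run `ceph df` inside the toolbox to see overall and per-pool usage.
  oc -n openshift-storage rsh deploy/rook-ceph-tools ceph df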
NooBaaBucketErrorState
- Message: A NooBaa Bucket Is In Error State
- Description: A NooBaa bucket {{ $labels.bucket_name }} is in error state for more than 6m
- Severity: Warning
- Resolution: Workaround
- Procedure: See Resolving NooBaa Bucket Error State.
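A common first step when a bucket is in error state is to check the health of the NooBaa system and of the backing stores that the bucket's bucket class uses. This is a sketch, assuming the default openshift-storage namespace; the backing store name is a placeholder.

  # Check overall NooBaa health and the state of backing stores and bucket classes.
  oc get noobaa,backingstore,bucketclass -n openshift-storage
  # Inspect a specific backing store for error details.
  oc describe backingstore <backing-store-name> -n openshift-storage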
NooBaaNamespaceResourceErrorState
- Message: A NooBaa Namespace Resource Is In Error State
- Description: A NooBaa namespace resource {{ $labels.namespace_resource_name }} is in error state for more than 5m
- Severity: Warning
- Resolution: Fix
- Procedure: See Resolving NooBaa Bucket Error State.
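For namespace resources, the corresponding custom resource is the NamespaceStore; its status and events often show the underlying error. This is a sketch, assuming the default openshift-storage namespace; the resource name is a placeholder for the one reported by the alert.

  oc get namespacestore -n openshift-storage
  oc describe namespacestore <namespace-store-name> -n openshift-storage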
NooBaaBucketExceedingQuotaState
- Message: A NooBaa Bucket Is In Exceeding Quota State
- Description: A NooBaa bucket {{ $labels.bucket_name }} is exceeding its quota - {{ printf "%0.0f" $value }}% used
- Severity: Warning
- Resolution: Fix
NooBaaBucketLowCapacityState
- Message: A NooBaa Bucket Is In Low Capacity State
- Description: A NooBaa bucket {{ $labels.bucket_name }} is using {{ printf "%0.0f" $value }}% of its capacity
- Severity: Warning
- Resolution: Fix
NooBaaBucketNoCapacityState
- Message: A NooBaa Bucket Is In No Capacity State
- Description: A NooBaa bucket {{ $labels.bucket_name }} is using all of its capacity
- Severity: Warning
- Resolution: Fix
NooBaaBucketReachingQuotaState
- Message: A NooBaa Bucket Is In Reaching Quota State
- Description: A NooBaa bucket {{ $labels.bucket_name }} is using {{ printf "%0.0f" $value }}% of its quota
- Severity: Warning
- Resolution: Fix
NooBaaResourceErrorState
- Message: A NooBaa Resource Is In Error State
- Description: A NooBaa resource {{ $labels.resource_name }} is in error state for more than 6m
- Severity: Warning
- Resolution: Workaround
- Procedure: See Resolving NooBaa Bucket Error State.
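If the failing resource's custom resource does not show an obvious cause, the NooBaa operator log is another place to look. This is a sketch, assuming the deployment name noobaa-operator in the default openshift-storage namespace; both are assumptions about your deployment.

  # Tail the operator log for reconcile errors that mention the failing resource.
  oc logs deployment/noobaa-operator -n openshift-storage --tail=100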
NooBaaSystemCapacityWarning100
- Message: A NooBaa System Approached Its Capacity
- Description: A NooBaa system approached its capacity, usage is at 100%
- Severity: Warning
- Resolution: Fix
NooBaaSystemCapacityWarning85
- Message: A NooBaa System Is Approaching Its Capacity
- Description: A NooBaa system is approaching its capacity, usage is more than 85%
- Severity: Warning
- Resolution: Fix
CephMdsMissingReplicas
- Message: Insufficient replicas for storage metadata service.
- Description: The minimum required replicas for the storage metadata service are not available. This might affect the working of the storage cluster.
- Severity: Warning
- Resolution: Contact IBM Support.
- Procedure:
  - Check for alerts and operator status.
  - If the issue cannot be identified, contact IBM Support.
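To check the metadata service replicas and the operator status before contacting support, a sketch assuming the default openshift-storage namespace and the standard Rook label app=rook-ceph-mds:

  # MDS pods back the storage metadata service; fewer ready pods than expected
  # matches this alert.
  oc get pods -n openshift-storage -l app=rook-ceph-mds
  # Check the operator status (ClusterServiceVersions) in the same namespace.
  oc get csv -n openshift-storage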
CephMgrIsAbsent
- Message: Storage metrics collector service not available anymore.
- Description: Ceph Manager has disappeared from Prometheus target discovery.
- Severity: Critical
- Resolution: Contact IBM Support.
- Procedure:
  - Inspect the user interface and log, and verify if an update is in progress.
  - If an update is in progress, this alert is temporary.
  - If an update is not in progress, restart the upgrade process.
  - Once the upgrade is complete, check for alerts and operator status.
  - If the issue persists or cannot be identified, contact IBM Support.
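To confirm whether a Ceph Manager pod is running at all, a sketch assuming the default openshift-storage namespace and the standard Rook label app=rook-ceph-mgr:

  # If no pod is listed, or the pod is not Ready, the metrics collector is absent.
  oc get pods -n openshift-storage -l app=rook-ceph-mgr
  oc describe pod -n openshift-storage -l app=rook-ceph-mgr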
CephNodeDown
- Message: Storage node {{ $labels.node }} went down
- Description: Storage node {{ $labels.node }} went down. Please check the node immediately.
- Severity: Critical
- Resolution: Contact IBM Support.
- Procedure:
  - Check which node stopped functioning and its cause.
  - Take appropriate actions to recover the node. If the node cannot be recovered:
    - See Replacing nodes.
    - Contact IBM Support.
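To identify the failed node and its condition, a sketch; the node name is the one reported in the alert's {{ $labels.node }}:

  # Nodes in NotReady state are candidates for this alert.
  oc get nodes
  # Inspect the conditions and recent events of the affected node.
  oc describe node <node-name>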
CephClusterErrorState
- Message: Storage cluster is in error state
- Description: Storage cluster is in error state for more than 10m.
- Severity: Critical
- Resolution: Contact IBM Support.
- Procedure:
  - Check for alerts and operator status.
  - If the issue cannot be identified, download log files and diagnostic information using must-gather.
  - Open a support ticket with IBM Support. Be sure to attach the output of the must-gather.
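Diagnostic information is collected with the must-gather tool. A sketch; the exact must-gather image depends on your IBM Storage Fusion Data Foundation release, so treat the image reference as a placeholder.

  # Collect diagnostics into a local directory and attach the result to the support ticket.
  oc adm must-gather --image=<must-gather-image> --dest-dir=./must-gather-output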
CephClusterWarningState
- Message: Storage cluster is in degraded state
- Description: Storage cluster is in warning state for more than 10m.
- Severity: Warning
- Resolution: Contact IBM Support.
- Procedure:
  - Check for alerts and operator status.
  - If the issue cannot be identified, download log files and diagnostic information using must-gather.
  - Open a support ticket with IBM Support. Be sure to attach the output of the must-gather.
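The health state that drives this alert is also visible on the CephCluster resource. A sketch, assuming the default openshift-storage namespace; the exact printer columns vary by release.

  # Columns such as HEALTH and PHASE show whether the cluster is in a warning or error state.
  oc get cephcluster -n openshift-storage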
CephDataRecoveryTakingTooLong
- Message: Data recovery is slow
- Description: Data recovery has been active for too long.
- Severity: Warning
- Resolution: Contact IBM Support.
CephOSDDiskNotResponding
- Message: Disk not responding
- Description: Disk device {{ $labels.device }} not responding, on host {{ $labels.host }}.
- Severity: Critical
- Resolution: Contact IBM Support.
CephOSDDiskUnavailable
- Message: Disk not accessible
- Description: Disk device {{ $labels.device }} not accessible on host {{ $labels.host }}.
- Severity: Critical
- Resolution: Contact IBM Support.
CephPGRepairTakingTooLong
- Message: Self heal problems detected
- Description: Self heal operations taking too long.
- Severity: Warning
- Resolution: Contact IBM Support.
CephMonHighNumberOfLeaderChanges
- Message: Storage Cluster has seen many leader changes recently.
- Description: Ceph Monitor "{{ $labels.job }}": instance {{ $labels.instance }} has seen {{ $value | printf "%.2f" }} leader changes per minute recently.
- Severity: Warning
- Resolution: Contact IBM Support.
CephMonQuorumAtRisk
- Message: Storage quorum at risk
- Description: Storage cluster quorum is low.
- Severity: Critical
- Resolution: Contact IBM Support.
ClusterObjectStoreState
- Message: Cluster Object Store is in unhealthy state. Please check Ceph cluster health
- Description: Cluster Object Store is in unhealthy state for more than 15s. Please check Ceph cluster health
- Severity: Critical
- Resolution: Contact IBM Support.
- Procedure:
  - Check the CephObjectStore CR instance.
  - Contact IBM Support.
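To check the CephObjectStore CR instance, a sketch assuming the default openshift-storage namespace; the object store name is a placeholder.

  # The status shows whether the object store is Ready or failing.
  oc get cephobjectstore -n openshift-storage
  oc describe cephobjectstore <object-store-name> -n openshift-storage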
CephOSDFlapping
- Message: Storage daemon osd.x has restarted 5 times in the last 5 minutes. Please check the pod events or Ceph status to find out the cause
- Description: Storage OSD restarts more than 5 times in 5 minutes
- Severity: Critical
- Resolution: Contact IBM Support.
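To check the pod events for a restarting OSD, a sketch assuming the default openshift-storage namespace; the pod name is a placeholder for the osd.x daemon named in the alert.

  # Restart counts and recent events usually point to the cause (for example,
  # an out-of-memory kill or a failing disk).
  oc describe pod <rook-ceph-osd-pod-name> -n openshift-storage
  # Recent events in the namespace, most recent last.
  oc get events -n openshift-storage --sort-by=.lastTimestamp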
OdfPoolMirroringImageHealth
- Message: Mirroring image(s) (PV) in the pool <pool-name> are in Warning state for more than 1m. Mirroring might not work as expected.
- Description: Disaster recovery is failing for one or a few applications.
- Severity: Warning
- Resolution: Contact IBM Support.
OdfMirrorDaemonStatus
- Message: Mirror daemon is unhealthy
- Description: Disaster recovery is failing for the entire cluster. Mirror daemon is in unhealthy status for more than 1m. Mirroring on this cluster is not working as expected.
- Severity: Critical
- Resolution: Contact IBM Support.
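To check whether the mirror daemon pod is running before contacting support, a sketch assuming the default openshift-storage namespace and the standard Rook label app=rook-ceph-rbd-mirror:

  # An unhealthy or missing rbd-mirror pod matches this alert.
  oc get pods -n openshift-storage -l app=rook-ceph-rbd-mirror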