Resolving alerts and errors

IBM Storage Fusion Data Foundation can detect and automatically resolve a number of common failure scenarios. However, some problems require administrator intervention. Use this information to resolve alerts and errors.

To see which alerts are currently firing, check any of the following locations:

Observe > Alerting > Firing
Home > Overview > Cluster
Storage > Data Foundation > Storage System, click the storage system link in the pop-up, and then go to Overview > Block and File
Storage > Data Foundation > Storage System, click the storage system link in the pop-up, and then go to Overview > Object

Copy the displayed error and search for it in the following section to find its severity and resolution:
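The lookup described above can also be sketched in code. The following Python example is a minimal illustration, not part of the product: the table holds only a few entries copied from this page, and the names ALERT_REFERENCE and look_up_alert are hypothetical.

```python
# Hypothetical lookup table: a handful of entries copied from this page.
# Extend it with the remaining alerts as needed.
ALERT_REFERENCE = {
    "CephClusterCriticallyFull": {"severity": "Critical", "resolution": "Fix"},
    "CephClusterNearFull": {"severity": "Warning", "resolution": "Fix"},
    "NooBaaBucketErrorState": {"severity": "Warning", "resolution": "Workaround"},
    "CephNodeDown": {"severity": "Critical", "resolution": "Contact IBM Support"},
}


def look_up_alert(name):
    """Return the severity and resolution for an alert name, or None if it is not listed."""
    # Strip whitespace that is often picked up when copying from the console.
    return ALERT_REFERENCE.get(name.strip())


print(look_up_alert("CephNodeDown"))
```

An alert name that is not in the table returns None, which is a cue to search the full list that follows.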

CephMonVersionMismatch

Message: There are multiple versions of storage services running.

Description: There are {{ $value }} different versions of Ceph Mon components running.

Severity: Warning

Resolution: Fix

Procedure: Inspect the user interface and log, and verify if an update is in progress.

If an update is in progress, this alert is temporary.
If an update is not in progress, restart the upgrade process.

CephOSDVersionMismatch

Message: There are multiple versions of storage services running.

Description: There are {{ $value }} different versions of Ceph OSD components running.

Severity: Warning

Resolution: Fix

Procedure: Inspect the user interface and log, and verify if an update is in progress.

If an update is in progress, this alert is temporary.
If an update is not in progress, restart the upgrade process.
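The condition behind CephMonVersionMismatch and CephOSDVersionMismatch is that more than one distinct version is running at once. The following Python sketch illustrates that check only; the daemon names and version strings are hypothetical inputs, and the product evaluates the real condition through its own monitoring rules.

```python
def count_distinct_versions(daemon_versions):
    """Count the distinct version strings reported by a set of daemons."""
    return len(set(daemon_versions.values()))


def versions_mismatch(daemon_versions):
    """Return True when the daemons report more than one distinct version."""
    return count_distinct_versions(daemon_versions) > 1


# Hypothetical daemon names and version strings, for illustration only.
mons = {"mon-a": "17.2.6", "mon-b": "17.2.6", "mon-c": "17.2.5"}
print(count_distinct_versions(mons), versions_mismatch(mons))
```

During an update, a mismatch like this is expected until every daemon has been restarted on the new version.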

CephClusterCriticallyFull

Message: Storage cluster is critically full and needs immediate expansion

Description: Storage cluster utilization has crossed 85%.

Severity: Critical

Resolution: Fix

Procedure: Remove unnecessary data or expand the cluster.

CephClusterNearFull

Message: Storage cluster is nearing full. Expansion is required.

Description: Storage cluster utilization has crossed 75%.

Severity: Warning

Resolution: Fix

Procedure: Remove unnecessary data or expand the cluster.
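The two capacity alerts differ only in their threshold: 75% for CephClusterNearFull and 85% for CephClusterCriticallyFull. The following Python sketch shows how a utilization figure maps to those alerts. It is an illustration under assumptions: the function name is hypothetical, and "crossed" is read here as strictly greater than the threshold, which might differ from the actual rule.

```python
NEAR_FULL_PERCENT = 75.0        # CephClusterNearFull threshold from this page
CRITICALLY_FULL_PERCENT = 85.0  # CephClusterCriticallyFull threshold from this page


def capacity_alert(used_bytes, total_bytes):
    """Return the name of the capacity alert that would apply, or None."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    percent = 100.0 * used_bytes / total_bytes
    # Assumption: "crossed" means strictly greater than the threshold.
    if percent > CRITICALLY_FULL_PERCENT:
        return "CephClusterCriticallyFull"
    if percent > NEAR_FULL_PERCENT:
        return "CephClusterNearFull"
    return None


print(capacity_alert(80, 100))
```

The critical threshold is checked first so that a cluster above 85% is reported under the more severe alert.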

NooBaaBucketErrorState

Message: A NooBaa Bucket Is In Error State

Description: A NooBaa bucket {{ $labels.bucket_name }} is in error state for more than 6m

Severity: Warning

Resolution: Workaround

Procedure: Resolving NooBaa Bucket Error State

NooBaaNamespaceResourceErrorState

Message: A NooBaa Namespace Resource Is In Error State

Description: A NooBaa namespace resource {{ $labels.namespace_resource_name }} is in error state for more than 5m

Severity: Warning

Resolution: Fix

Procedure: Resolving NooBaa Bucket Error State

NooBaaBucketExceedingQuotaState

Message: A NooBaa Bucket Is In Exceeding Quota State

Description: A NooBaa bucket {{ $labels.bucket_name }} is exceeding its quota - {{ printf "%0.0f" $value }}% used

Severity: Warning

Resolution: Fix

Procedure: Resolving NooBaa Bucket Exceeding Quota State

NooBaaBucketLowCapacityState

Message: A NooBaa Bucket Is In Low Capacity State

Description: A NooBaa bucket {{ $labels.bucket_name }} is using {{ printf "%0.0f" $value }}% of its capacity

Severity: Warning

Resolution: Fix

Procedure: Resolving NooBaa Bucket Capacity or Quota State
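The Description fields on this page contain template placeholders such as {{ $labels.bucket_name }} and {{ printf "%0.0f" $value }}, which are filled in with real values when the alert fires. The following Python sketch approximates that substitution for the NooBaaBucketLowCapacityState description so that the fired text is easier to recognize. The function name, bucket name, and value are hypothetical, and Python's % formatting stands in for the template engine.

```python
def render_low_capacity_description(bucket_name, value):
    """Approximate the fired text of the NooBaaBucketLowCapacityState description."""
    # "%0.0f" formats the value with no decimal places, as in the template;
    # "%%" produces the literal percent sign.
    return "A NooBaa bucket %s is using %0.0f%% of its capacity" % (bucket_name, value)


print(render_low_capacity_description("my-bucket", 87.6))
```

When searching this page for a fired alert, match on the fixed wording around the placeholders rather than on the substituted values.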

NooBaaBucketNoCapacityState

Message: A NooBaa Bucket Is In No Capacity State

Description: A NooBaa bucket {{ $labels.bucket_name }} is using all of its capacity

Severity: Warning

Resolution: Fix

Procedure: Resolving NooBaa Bucket Capacity or Quota State

NooBaaBucketReachingQuotaState

Message: A NooBaa Bucket Is In Reaching Quota State

Description: A NooBaa bucket {{ $labels.bucket_name }} is using {{ printf "%0.0f" $value }}% of its quota

Severity: Warning

Resolution: Fix

Procedure: Resolving NooBaa Bucket Capacity or Quota State

NooBaaResourceErrorState

Message: A NooBaa Resource Is In Error State

Description: A NooBaa resource {{ $labels.resource_name }} is in error state for more than 6m

Severity: Warning

Resolution: Workaround

Procedure: Resolving NooBaa Bucket Error State

NooBaaSystemCapacityWarning100

Message: A NooBaa System Approached Its Capacity

Description: A NooBaa system approached its capacity, usage is at 100%

Severity: Warning

Resolution: Fix

Procedure: Resolving NooBaa Bucket Capacity or Quota State

NooBaaSystemCapacityWarning85

Message: A NooBaa System Is Approaching Its Capacity

Description: A NooBaa system is approaching its capacity, usage is more than 85%

Severity: Warning

Resolution: Fix

Procedure: Resolving NooBaa Bucket Capacity or Quota State

CephMdsMissingReplicas

Message: Insufficient replicas for storage metadata service.

Description: Minimum required replicas for storage metadata service not available. Might affect the working of storage cluster.

Severity: Warning

Resolution: Contact IBM Support.

Procedure:

Check for alerts and operator status.
If the issue cannot be identified, contact IBM Support.

CephMgrIsAbsent

Message: Storage metrics collector service not available anymore.

Description: Ceph Manager has disappeared from Prometheus target discovery.

Severity: Critical

Resolution: Contact IBM Support.

Procedure:

Inspect the user interface and log, and verify if an update is in progress.
- If an update is in progress, this alert is temporary.
- If an update is not in progress, restart the upgrade process.
Once the upgrade is complete, check for alerts and operator status.
If the issue persists or cannot be identified, contact IBM Support.

CephNodeDown

Message: Storage node {{ $labels.node }} went down

Description: Storage node {{ $labels.node }} went down. Please check the node immediately.

Severity: Critical

Resolution: Contact IBM Support.

Procedure:

Check which node stopped functioning and its cause.
Take appropriate actions to recover the node. If the node cannot be recovered:
- See Replacing nodes.
- Contact IBM Support.

CephClusterErrorState

Message: Storage cluster is in error state

Description: Storage cluster is in error state for more than 10m.

Severity: Critical

Resolution: Contact IBM Support.

Procedure:

Check for alerts and operator status.
If the issue cannot be identified, download log files and diagnostic information using must-gather.
Open a support ticket with IBM Support. Be sure to attach the must-gather output.

CephClusterWarningState

Message: Storage cluster is in degraded state

Description: Storage cluster is in warning state for more than 10m.

Severity: Warning

Resolution: Contact IBM Support.

Procedure:

Check for alerts and operator status.
If the issue cannot be identified, download log files and diagnostic information using must-gather.
Open a support ticket with IBM Support. Be sure to attach the must-gather output.

CephDataRecoveryTakingTooLong

Message: Data recovery is slow

Description: Data recovery has been active for too long.

Severity: Warning

Resolution: Contact IBM Support.

CephOSDDiskNotResponding

Message: Disk not responding

Description: Disk device {{ $labels.device }} not responding, on host {{ $labels.host }}.

Severity: Critical

Resolution: Contact IBM Support.

CephOSDDiskUnavailable

Message: Disk not accessible

Description: Disk device {{ $labels.device }} not accessible on host {{ $labels.host }}.

Severity: Critical

Resolution: Contact IBM Support.

CephPGRepairTakingTooLong

Message: Self heal problems detected

Description: Self heal operations taking too long.

Severity: Warning

Resolution: Contact IBM Support.

CephMonHighNumberOfLeaderChanges

Message: Storage Cluster has seen many leader changes recently.

Description: 'Ceph Monitor "{{ $labels.job }}": instance {{ $labels.instance }} has seen {{ $value | printf "%.2f" }} leader changes per minute recently.'

Severity: Warning

Resolution: Contact IBM Support.

CephMonQuorumAtRisk

Message: Storage quorum at risk

Description: Storage cluster quorum is low.

Severity: Critical

Resolution: Contact IBM Support.

ClusterObjectStoreState

Message: Cluster Object Store is in unhealthy state. Please check Ceph cluster health

Description: Cluster Object Store is in unhealthy state for more than 15s. Please check Ceph cluster health

Severity: Critical

Resolution: Contact IBM Support.

Procedure:

Check the CephObjectStore CR instance.
Contact IBM Support.

CephOSDFlapping

Message: Storage daemon osd.x has restarted 5 times in the last 5 minutes. Please check the pod events or Ceph status to find out the cause

Description: Storage OSD restarts more than 5 times in 5 minutes

Severity: Critical

Resolution: Contact IBM Support.
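The CephOSDFlapping condition is a count of restarts inside a time window. The following Python sketch illustrates that idea with a list of restart timestamps. It follows the Description wording (more than 5 restarts in 5 minutes); the function name and inputs are hypothetical, and the real alert is evaluated by the product's monitoring rules, not by code like this.

```python
from datetime import datetime, timedelta


def is_flapping(restart_times, now, window=timedelta(minutes=5), max_restarts=5):
    """Return True if more than max_restarts restarts fall in the window ending at now."""
    window_start = now - window
    recent = [t for t in restart_times if window_start < t <= now]
    return len(recent) > max_restarts


# Hypothetical example: six restarts, 30 seconds apart, ending at "now".
now = datetime(2024, 1, 1, 12, 0, 0)
restarts = [now - timedelta(seconds=30 * i) for i in range(6)]
print(is_flapping(restarts, now))
```

Restarts older than the window are ignored, so an OSD that restarted many times earlier in the day but has since been stable is not reported.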

OdfPoolMirroringImageHealth

Message: Mirroring image(s) (PV) in the pool <pool-name> are in Warning state for more than a 1m. Mirroring might not work as expected.

Description: Disaster recovery is failing for one or a few applications.

Severity: Warning

Resolution: Contact IBM Support.

OdfMirrorDaemonStatus

Message: Mirror daemon is unhealthy

Description: Disaster recovery is failing for the entire cluster. Mirror daemon is in unhealthy status for more than 1m. Mirroring on this cluster is not working as expected.

Severity: Critical

Resolution: Contact IBM Support.