Troubleshooting installation and upgrade issues in IBM Storage Fusion services
Use this troubleshooting information to identify problems and their workarounds when you work with the IBM Storage Fusion services.
IBM Storage Fusion Data Cataloging service
- Data Cataloging service is not installed successfully
- The Data Cataloging service is in the installing state for hours. To resolve the problem, do the following steps:
  1. Label the GPU nodes with isf.ibm.com/nodeType=gpu. The node name values depend on the GPU node names in ru25 and ru27:
     oc label node compute-1-ru25.mydomain.com isf.ibm.com/nodeType=gpu
     oc label node compute-1-ru27.mydomain.com isf.ibm.com/nodeType=gpu
  2. Patch the FusionServiceDefinition (FSD) with a new affinity so that the isd workload is not scheduled on those nodes. Create the fsd_dcs_patch.yaml file:
     cat >> fsd_dcs_patch.yaml << EOF
     apiVersion: service.isf.ibm.com/v1
     kind: FusionServiceDefinition
     metadata:
       name: data-cataloging-service-definition
       namespace: ibm-spectrum-fusion-ns
     spec:
       onboarding:
         parameters:
         - dataType: string
           defaultValue: ibm-data-cataloging
           descriptionCode: BMYSRV00003
           displayNameCode: BMYSRV00004
           name: namespace
           required: true
           userInterface: false
         - dataType: storageClass
           defaultValue: ''
           descriptionCode: BMYDC0300
           displayNameCode: BMYDC0301
           name: rwx_storage_class
           required: true
           userInterface: true
         - dataType: bool
           defaultValue: 'true'
           descriptionCode: descriptionCode
           displayNameCode: displayNameCode
           name: doInstall
           required: true
           userInterface: false
         - dataType: json
           defaultValue: '{"accept": true}'
           descriptionCode: descriptionCode
           displayNameCode: displayNameCode
           name: license
           required: true
           userInterface: false
         - dataType: json
           defaultValue: '{"nodeAffinity":{"requiredDuringSchedulingIgnoredDuringExecution":{"nodeSelectorTerms":[{"matchExpressions":[{"key":"isf.ibm.com/nodeType","operator":"NotIn","values":["gpu"]}]}]}}}'
           descriptionCode: descriptionCode
           displayNameCode: displayNameCode
           name: affinity
           required: true
           userInterface: false
     EOF
     Then apply the patch:
     oc -n ibm-spectrum-fusion-ns patch fusionservicedefinitions.service.isf.ibm.com data-cataloging-service-definition --patch "$(cat fsd_dcs_patch.yaml)"
  3. Display the patched FSD:
     oc -n ibm-spectrum-fusion-ns get fusionservicedefinitions.service.isf.ibm.com data-cataloging-service-definition -o yaml
  4. Delete the namespace:
     oc delete ns ibm-data-cataloging
  5. Delete the FSD instance:
     oc -n ibm-spectrum-fusion-ns delete fusionserviceinstances.service.isf.ibm.com data-cataloging-service-instance
  6. Install the service from the user interface.
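After labeling, you can confirm which nodes carry the GPU label before patching the FSD. A minimal sketch, assuming a logged-in `oc` session; the helper only filters the tabular output of `oc get nodes --show-labels`:

```shell
# List node names whose labels include isf.ibm.com/nodeType=gpu.
# Assumes the standard `oc get nodes --show-labels` layout, where
# the comma-separated labels appear in the last column.
gpu_nodes() {
  awk 'NR > 1 && $NF ~ /isf\.ibm\.com\/nodeType=gpu/ { print $1 }'
}

# Usage against a live cluster:
#   oc get nodes --show-labels | gpu_nodes
```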
- Data Cataloging service not available
- The Data Cataloging service is not available whenever a single node is down or maintenance mode is enabled. If the Data Cataloging service is in a degraded state, check the node status and the scale pod status to ensure that everything is up and running.
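The node check can be scripted. A minimal sketch, assuming a logged-in `oc` session; the helper filters standard `oc get nodes` output for nodes that are not in the plain Ready state:

```shell
# Print names of nodes that are not in the plain Ready state
# (for example NotReady or Ready,SchedulingDisabled).
not_ready_nodes() {
  awk 'NR > 1 && $2 != "Ready" { print $1 }'
}

# Usage against a live cluster:
#   oc get nodes | not_ready_nodes
#   oc -n ibm-data-cataloging get pods   # then inspect the pod status
```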
- Data Cataloging installation is stuck at 35% for more than 1 hour
- Cause:
  Either the instdb or the restore-morph job is stuck in the sync step. Use oc -n ibm-data-cataloging logs -f jobs/c-isd-instdb and oc -n ibm-data-cataloging logs -f jobs/c-isd-restore-morph to determine whether the last line of the logs is sync and it is not progressing.
  Resolution:
  1. Based on the logs, identify the job that is stuck with sync (c-isd-instdb or c-isd-restore-morph) by using the commands that are provided in the Cause section.
  2. Run the following command to determine the name of the node where the job was allocated:
     oc -n ibm-data-cataloging get pod -l job-name=<JOB_NAME> -o jsonpath='{.items[].spec.nodeName}'
     Here, <JOB_NAME> is either c-isd-instdb or c-isd-restore-morph.
  3. Do a graceful restart of the affected node. For the actual procedure, see the Red Hat OpenShift documentation.
  4. Run the following command to scale down the operator:
     oc -n ibm-data-cataloging scale --replicas=0 deployment/spectrum-discover-operator
  5. Run the following command to delete the Db2 instance:
     oc -n ibm-data-cataloging delete db2u isd
  6. Wait until the ibm-data-cataloging project is free of pods whose names start with c-isd.
  7. Run the following command to scale up the operator:
     oc -n ibm-data-cataloging scale --replicas=1 deployment/spectrum-discover-operator
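Step 6 above (waiting until no c-isd pods remain) can be polled from a script. A minimal sketch; the predicate only inspects the `-o name` output of `oc get pods`, and the polling loop in the comment assumes a logged-in `oc` session:

```shell
# Succeeds while the input contains pods whose names start with c-isd.
has_c_isd_pods() {
  grep -q '^pod/c-isd'
}

# Usage against a live cluster:
#   while oc -n ibm-data-cataloging get pods -o name | has_c_isd_pods; do
#     echo "c-isd pods still present, waiting..."
#     sleep 30
#   done
```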
- Data Cataloging installation is stuck at 80% for more than 4 hours
- Cause:
  If you find that this issue is related to the IBM Db2 resource, then manually restart the Db2 deployment.
  Resolution:
  1. Run the following command to scale down the operator:
     oc -n ibm-data-cataloging scale --replicas=0 deployment/spectrum-discover-operator
  2. Run the following command to delete the Db2 instance:
     oc -n ibm-data-cataloging delete db2u isd
  3. Wait until the Db2 pods and persistent volume claims are removed:
     oc -n ibm-data-cataloging get pod,pvc -o name | grep c-isd
     Note: Run the following command if you cannot delete the PVC for a long period of time (more than one hour):
     oc patch pvc c-isd-meta -n ibm-data-cataloging -p '{"metadata":{"finalizers":null}}'
  4. Run the following command to scale up the operator:
     oc -n ibm-data-cataloging scale --replicas=1 deployment/spectrum-discover-operator
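The wait for Db2 pods and PVCs to disappear can also be polled until the count reaches zero. A sketch; the counter only counts c-isd entries in the `oc get pod,pvc -o name` output, and the live command in the comment assumes a logged-in `oc` session:

```shell
# Count remaining pod and PVC entries whose names contain c-isd.
# Prints 0 when nothing matches (grep -c exits non-zero on no match,
# so || true keeps the function's exit status clean).
count_c_isd_resources() {
  grep -c 'c-isd' || true
}

# Usage against a live cluster:
#   oc -n ibm-data-cataloging get pod,pvc -o name | count_c_isd_resources
```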
IBM Storage Fusion Backup & Restore service
- MongoDB pod crashes with CrashLoopBackOff status
- The MongoDB pod crashes due to an OOM error. To resolve the error, increase the memory limit from 256Mi to 512Mi. Do the following steps to change the memory limit:
- Log in to the Red Hat® OpenShift® web console as an administrator.
- Go to Workloads > Pods.
- Select the project ibm-backup-restore.
- Select the MongoDB pod, and go to the YAML tab.
- In the YAML, change the memory limit for MongoDB container from 256Mi to 512Mi.
- Click Save.
- Pods in CrashLoopBackOff state after upgrade
- The Backup & Restore service health changes to unknown and two pods go into the CrashLoopBackOff state.
Resolution:
In the resource settings of the guardian-dp-operator pod that resides in the ibm-backup-restore namespace, set the value of the IBM Storage Fusion operator memory limit to 1000Mi.
Example:
  resources:
    limits:
      cpu: 1000m
      memory: 1000Mi
    requests:
      cpu: 500m
      memory: 250Mi
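If you prefer the CLI over editing the YAML in the console, the same memory change can be applied with a JSON patch. A sketch, assuming the operator runs from a deployment named guardian-dp-operator with the target container at index 0 (verify both on your cluster):

```shell
# Build a JSON patch that replaces the memory limit of the first
# container in a pod template with the given value (for example 1000Mi).
mem_limit_patch() {
  printf '[{"op":"replace","path":"/spec/template/spec/containers/0/resources/limits/memory","value":"%s"}]' "$1"
}

# Usage against a live cluster (deployment name is an assumption):
#   oc -n ibm-backup-restore patch deployment guardian-dp-operator \
#     --type=json -p "$(mem_limit_patch 1000Mi)"
```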
- Backup and restore service installation gets stuck at 77%
- Cause:
  Pods get stuck in the ContainerCreating state.
Resolution:
1. Run the following command to locate the Kafka cluster YAML:
   oc get kafkas.kafka.strimzi.io guardian-kafka-cluster -o yaml -n ibm-backup-restore | grep -e " version"
2. In the YAML, if the value of spec.kafka.version is 3.2.3, then proceed to the next step.
3. Run the following script to modify spec.kafka.version from 3.2.3 to 3.3.1, and inter.broker.protocol.version from 3.2 to 3.3:
   oc patch kafkas.kafka.strimzi.io guardian-kafka-cluster -n ibm-backup-restore --type='json' -p='[{"op": "replace", "path":"/spec/kafka/version", "value":"3.3.1"}]'
   oc patch kafkas.kafka.strimzi.io guardian-kafka-cluster -n ibm-backup-restore --type='json' -p='[{"op": "replace", "path":"/spec/kafka/config/inter.broker.protocol.version", "value":"3.3"}]'
   for po in $(oc get pods --field-selector='status.phase=Pending' -n ibm-backup-restore -o name)
   do
     oc delete ${po} -n ibm-backup-restore
   done
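After the patch, you can confirm the new version from the cluster resource. A sketch; the helper parses the YAML lines that step 1 filters with grep, and the live command in the comment assumes a logged-in `oc` session:

```shell
# Extract the value of the first "version:" line from Kafka CR YAML.
kafka_version() {
  awk '$1 == "version:" { print $2; exit }'
}

# Usage against a live cluster:
#   oc get kafkas.kafka.strimzi.io guardian-kafka-cluster -n ibm-backup-restore \
#     -o yaml | kafka_version
```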
IBM Storage Fusion Backup & Restore (Legacy) service
- The IBM Spectrum Protect Plus server could not be contacted with a 'no route available' message after the IBM Spectrum Protect Plus server reinstall
- After the re-installation of the IBM Spectrum Protect Plus server, the transaction-manager pods went into a 1/3 crashloop state, and the transaction-manager-worker pods indicated that the IBM Spectrum Protect Plus server could not be contacted with a no route available message.
- ImagePull failure during Backup & Restore (Legacy) installation or upgrade
- If an ImagePull failure occurs on the virgo pod during the installation or upgrade of Backup & Restore (Legacy), then as a resolution restart the Backup & Restore (Legacy) virgo pod in the ibm-spectrum-protect-plus-ns namespace:
  - Go to the OpenShift Container Platform web management console.
- Go to Workloads > Pods.
- Select the ibm-spectrum-protect-plus-ns project.
- Search for the sppvirgo pod.
sppvirgopod. - From the Actions menu, click Delete pod to respun it.
Common issues
- ImagePull failure during installation or upgrade of any service
- If an ImagePull failure occurs during the installation or upgrade of any service, then restart the pod and retry. If the issue persists, contact IBM support.
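To find which pod to restart, you can filter for image pull failures. A sketch over the standard `oc get pods` columns (NAME READY STATUS RESTARTS AGE); the namespace placeholder and the live commands in the comment are illustrative:

```shell
# Print names of pods whose STATUS column shows an image pull failure.
image_pull_failures() {
  awk 'NR > 1 && ($3 == "ImagePullBackOff" || $3 == "ErrImagePull") { print $1 }'
}

# Usage against a live cluster (restart by deleting the pod):
#   oc -n <service-namespace> get pods | image_pull_failures \
#     | xargs -r oc -n <service-namespace> delete pod
```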
- Bundle unpacking failed
- If you see a Job was active longer than specified deadline message during the installation or upgrade of any service, then delete the job with 0 of 1 completion status in the openshift-marketplace namespace.
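The failed unpack job can be located by its completion count. A sketch; the filter assumes the standard `oc get jobs` column layout (NAME COMPLETIONS DURATION AGE), and the live commands in the comment require a logged-in `oc` session:

```shell
# Print names of jobs that show 0/1 in the COMPLETIONS column.
stuck_jobs() {
  awk 'NR > 1 && $2 == "0/1" { print $1 }'
}

# Usage against a live cluster:
#   oc -n openshift-marketplace get jobs | stuck_jobs \
#     | xargs -r oc -n openshift-marketplace delete job
```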