Troubleshooting installation and upgrade issues in IBM Storage Fusion services

Use this troubleshooting information to identify problems and their workarounds when you work with the IBM Storage Fusion services.

IBM Storage Fusion Data Cataloging service

Data Cataloging service is not installed successfully
The Data Cataloging service remains in the installing state for hours. To resolve the problem, complete the following steps:
  1. Label the GPU nodes with isf.ibm.com/nodeType=gpu. The node name values depend on the GPU node names in ru25 and ru27:
    
    oc label node compute-1-ru25.mydomain.com isf.ibm.com/nodeType=gpu
    oc label node compute-1-ru27.mydomain.com isf.ibm.com/nodeType=gpu
  2. Patch the FusionServiceDefinition (FSD) with a node affinity so that the isd workload is not scheduled on those nodes:
    oc -n ibm-spectrum-fusion-ns patch fusionservicedefinitions.service.isf.ibm.com data-cataloging-service-definition --patch "$(cat fsd_dcs_patch.yaml)"
    Create the fsd_dcs_patch.yaml file:
    
    cat > fsd_dcs_patch.yaml << EOF
    
    apiVersion: service.isf.ibm.com/v1
    kind: FusionServiceDefinition
    metadata:
      name: data-cataloging-service-definition
      namespace: ibm-spectrum-fusion-ns
    spec:
      onboarding:
        parameters:
          - dataType: string
            defaultValue: ibm-data-cataloging
            descriptionCode: BMYSRV00003
            displayNameCode: BMYSRV00004
            name: namespace
            required: true
            userInterface: false
          - dataType: storageClass
            defaultValue: ''
            descriptionCode: BMYDC0300
            displayNameCode: BMYDC0301
            name: rwx_storage_class
            required: true
            userInterface: true
          - dataType: bool
            defaultValue: 'true'
            descriptionCode: descriptionCode
            displayNameCode: displayNameCode
            name: doInstall
            required: true
            userInterface: false
          - dataType: json
            defaultValue: '{"accept": true}'
            descriptionCode: descriptionCode
            displayNameCode: displayNameCode
            name: license
            required: true
            userInterface: false
          - dataType: json
            defaultValue: '{"nodeAffinity":{"requiredDuringSchedulingIgnoredDuringExecution":{"nodeSelectorTerms":[{"matchExpressions":[{"key":"isf.ibm.com/nodeType","operator":"NotIn","values":["gpu"]}]}]}}}'
            descriptionCode: descriptionCode
            displayNameCode: displayNameCode
            name: affinity
            required: true
            userInterface: false
        
    EOF
    
  3. Display the patched FSD:
    
    oc -n ibm-spectrum-fusion-ns get fusionservicedefinitions.service.isf.ibm.com data-cataloging-service-definition -o yaml
  4. Install from the user interface.
  5. Delete the namespace:
    oc delete ns ibm-data-cataloging
  6. Delete the FSD instance:
    oc -n ibm-spectrum-fusion-ns delete fusionserviceinstances.service.isf.ibm.com data-cataloging-service-instance
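After step 3, you can check mechanically that the patched affinity is in place. This is a minimal sketch, not part of the documented procedure: the check is factored into a function so it can be run against saved oc output, and it simply looks for the NotIn operator string from the patch above (the exact YAML rendering of the defaultValue JSON may vary):

```shell
#!/bin/sh
# Succeeds when the FSD YAML passed on stdin contains the gpu-exclusion affinity.
affinity_patched() {
  grep -q 'NotIn'
}

# Typical use against the live cluster:
# oc -n ibm-spectrum-fusion-ns get fusionservicedefinitions.service.isf.ibm.com \
#   data-cataloging-service-definition -o yaml | affinity_patched && echo "affinity patch present"
```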
Data Cataloging service not available
The Data Cataloging service is not available whenever a single node is down or maintenance mode is enabled. If the Data Cataloging service is in a degraded state, check the node status and Scale pod status to ensure that everything is up and running.
Data Cataloging installation is stuck at 35% for more than 1 hour
Cause:

Either the instdb or the restore-morph job is stuck in the sync step. Run oc -n ibm-data-cataloging logs -f jobs/c-isd-instdb and oc -n ibm-data-cataloging logs -f jobs/c-isd-restore-morph to determine whether the last line of the logs is sync and it is not progressing.
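The log check in the Cause section can be scripted. This is a minimal sketch, assuming the stuck step shows up as a trailing sync line in the job logs; the last-line test is factored into a function so it can be run against saved log text:

```shell
#!/bin/sh
# Succeeds when the last log line passed on stdin mentions the sync step.
stuck_on_sync() {
  tail -n 1 | grep -q 'sync'
}

# Typical use:
# oc -n ibm-data-cataloging logs jobs/c-isd-instdb | stuck_on_sync && echo "instdb looks stuck"
# oc -n ibm-data-cataloging logs jobs/c-isd-restore-morph | stuck_on_sync && echo "restore-morph looks stuck"
```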

Resolution:
  1. Based on the logs, identify the job that is stuck with sync (c-isd-instdb or c-isd-restore-morph) by using the commands that are provided in the Cause section.
  2. Run the following command to determine the name of the running node based on where the job was allocated:
    oc -n ibm-data-cataloging get pod -l job-name=<JOB_NAME> -o jsonpath='{.items[].spec.nodeName}'
    Here, <JOB_NAME> is either c-isd-instdb or c-isd-restore-morph.
  3. Do a graceful reboot of the affected node. For the actual procedure, see the Red Hat OpenShift documentation.
  4. Run the following command to scale down the operator:
    oc -n ibm-data-cataloging scale --replicas=0 deployment/spectrum-discover-operator
  5. Run the following command to delete the Db2 instance:
    oc -n ibm-data-cataloging delete db2u isd
  6. Wait until the ibm-data-cataloging project is free of pods starting with c-isd.
  7. Run the following command to scale up the operator:
    oc -n ibm-data-cataloging scale --replicas=1 deployment/spectrum-discover-operator
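The wait in step 6 can also be scripted. This is a minimal sketch: the pod-name check is factored into a function so it can be run against saved output, and the 30-second poll interval is an arbitrary choice:

```shell
#!/bin/sh
# Succeeds when any pod name passed on stdin starts with c-isd.
has_cisd_pods() {
  grep -q '^c-isd'
}

# Poll until the ibm-data-cataloging project is free of c-isd pods.
wait_for_cisd_cleanup() {
  while oc -n ibm-data-cataloging get pods -o name | sed 's|^pod/||' | has_cisd_pods; do
    echo "c-isd pods still present; waiting 30 seconds..."
    sleep 30
  done
  echo "No c-isd pods remain."
}
```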
Data Cataloging installation is stuck at 80% for more than 4 hours
Cause:

If you find that this issue is related to the IBM Db2 resource, manually restart the Db2 deployment.

Resolution:
  1. Run the following command to scale down the operator:
    oc -n ibm-data-cataloging scale --replicas=0 deployment/spectrum-discover-operator
  2. Run the following command to delete the Db2 instance:
    oc -n ibm-data-cataloging delete db2u isd
  3. Wait until the Db2 pods and persistent volume claims are removed:
    oc -n ibm-data-cataloging get pod,pvc -o name | grep c-isd
    Note: Run the following command if the PVC cannot be deleted for a long period of time (more than one hour):
    oc patch pvc c-isd-meta -n ibm-data-cataloging -p '{"metadata":{"finalizers":null}}'
  4. Run the following command to scale up the operator:
    oc -n ibm-data-cataloging scale --replicas=1 deployment/spectrum-discover-operator

IBM Storage Fusion Backup & Restore service

MongoDB pod crashes with CrashLoopBackOff status
The MongoDB pod crashes due to an out-of-memory (OOM) error. To resolve the error, increase the memory limit from 256Mi to 512Mi. Complete the following steps to change the memory limit:
  1. Log in to the Red Hat® OpenShift® web console as an administrator.
  2. Go to Workloads > StatefulSet.
  3. Select the project ibm-backup-restore.
  4. Select the MongoDB StatefulSet and go to the YAML tab.
  5. In the YAML, change the memory limit for MongoDB container from 256Mi to 512Mi.
  6. Click Save.
Pods in CrashLoopBackOff state after upgrade
The Backup & Restore service health changes to unknown and two pods go into the CrashLoopBackOff state.

Resolution:

In the resource settings of the guardian-dp-operator pod that resides in the ibm-backup-restore namespace, set the value of the IBM Storage Fusion operator memory limit to 1000Mi.

Example:

resources:
  limits:
    cpu: 1000m
    memory: 1000Mi
  requests:
    cpu: 500m
    memory: 250Mi
Backup & Restore service installation gets stuck at 77%

Cause:

Pods get stuck in ContainerCreating state.

Resolution:

  1. Run the following command to check the Kafka cluster version in the YAML:
    oc get kafkas.kafka.strimzi.io guardian-kafka-cluster -o yaml -n ibm-backup-restore | grep -e " version"
  2. If the value of spec.kafka.version is 3.2.3, proceed to the next step.
  3. Run the following script to change spec.kafka.version from 3.2.3 to 3.3.1 and inter.broker.protocol.version from 3.2 to 3.3, and then delete any pending pods:
    
    oc patch kafkas.kafka.strimzi.io guardian-kafka-cluster -n ibm-backup-restore --type='json' -p='[{"op": "replace", "path":"/spec/kafka/version", "value":3.3.1}]'
    oc patch kafkas.kafka.strimzi.io guardian-kafka-cluster -n ibm-backup-restore --type='json' -p='[{"op": "replace", "path":"/spec/kafka/config/inter.broker.protocol.version", "value":"3.3"}]'
    for po in $(oc get pods --field-selector='status.phase=Pending' -n ibm-backup-restore -o name)
    do
      oc delete ${po} -n ibm-backup-restore 
    done
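The version check in steps 1 and 2 can be wired into the script. This is a minimal sketch: using a jsonpath query avoids grepping the full YAML, and the comparison is factored into a function so it can be run without a cluster:

```shell
#!/bin/sh
# Succeeds when the given Kafka version is the one that needs the patch.
needs_kafka_patch() {
  [ "$1" = "3.2.3" ]
}

# Typical use:
# VER=$(oc -n ibm-backup-restore get kafkas.kafka.strimzi.io guardian-kafka-cluster \
#   -o jsonpath='{.spec.kafka.version}')
# if needs_kafka_patch "$VER"; then
#   echo "Kafka $VER needs the patch"   # then run the oc patch commands from step 3
# fi
```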

IBM Storage Fusion Backup & Restore (Legacy) service

The IBM Spectrum Protect Plus server could not be contacted with a 'no route available' message after the IBM Spectrum Protect Plus server reinstall
After the reinstallation of the IBM Spectrum Protect Plus server, the transaction-manager pods went into a 1/3 crash loop state, and the transaction-manager-worker pods indicated that the IBM Spectrum Protect Plus server could not be contacted, with a no route available message.
To resolve the error, reinstall the IBM Spectrum Protect Plus agent.
ImagePull failure during Backup & Restore (Legacy) installation or upgrade
If an ImagePull failure occurs on the virgo pod during the installation or upgrade of Backup & Restore (Legacy), restart the Backup & Restore (Legacy) virgo pod in the ibm-spectrum-protect-plus-ns namespace:
  1. Go to OpenShift Container Platform web management console.
  2. Go to Workloads > Pods.
  3. Select ibm-spectrum-protect-plus-ns project.
  4. Search for the sppvirgo pod.
  5. From the Actions menu, click Delete Pod to respin it.

Common issues

ImagePull failure during installation or upgrade of any service
If an ImagePull failure occurs during the installation or upgrade of any service, restart the pod and retry. If the issue persists, contact IBM support.
Bundle unpacking failed
If you see a Job was active longer than specified deadline message during the installation or upgrade of any service, delete the job that has a 0 of 1 completions status in the openshift-marketplace namespace.
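Locating the failed unpack job can be scripted. This is a minimal sketch, assuming the default oc get jobs table layout in which the second column is COMPLETIONS; the filter is factored into a function so it can be run against saved output:

```shell
#!/bin/sh
# Print the names of jobs whose COMPLETIONS column reads 0/1, skipping the header row.
incomplete_jobs() {
  awk 'NR > 1 && $2 == "0/1" { print $1 }'
}

# Typical use (review the printed list before deleting):
# oc -n openshift-marketplace get jobs | incomplete_jobs | \
#   xargs -r -n1 oc -n openshift-marketplace delete job
```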