Troubleshooting installation and upgrade issues in IBM Storage Fusion services

Use this troubleshooting information to identify problems and their workarounds when you work with the IBM Storage Fusion services.

IBM Storage Fusion Data Cataloging service

Data Cataloging service is not installed successfully
The Data Cataloging service remains in the installing state for hours. To resolve the problem, complete the following steps:
  1. Label the GPU nodes with isf.ibm.com/nodeType=gpu. The node name values depend on the GPU node names in ru25 and ru27:
    
    oc label node compute-1-ru25.mydomain.com isf.ibm.com/nodeType=gpu
    oc label node compute-1-ru27.mydomain.com isf.ibm.com/nodeType=gpu
  2. Patch the FusionServiceDefinition (FSD) with a node affinity so that the isd workload is not scheduled on those nodes:
    oc -n ibm-spectrum-fusion-ns patch fusionservicedefinitions.service.isf.ibm.com data-cataloging-service-definition --patch "$(cat fsd_dcs_patch.yaml)"
    Create the fsd_dcs_patch.yaml file:
    
    cat > fsd_dcs_patch.yaml << EOF
    
    apiVersion: service.isf.ibm.com/v1
    kind: FusionServiceDefinition
    metadata:
      name: data-cataloging-service-definition
      namespace: ibm-spectrum-fusion-ns
    spec:
      onboarding:
        parameters:
          - dataType: string
            defaultValue: ibm-data-cataloging
            descriptionCode: BMYSRV00003
            displayNameCode: BMYSRV00004
            name: namespace
            required: true
            userInterface: false
          - dataType: storageClass
            defaultValue: ''
            descriptionCode: BMYDC0300
            displayNameCode: BMYDC0301
            name: rwx_storage_class
            required: true
            userInterface: true
          - dataType: bool
            defaultValue: 'true'
            descriptionCode: descriptionCode
            displayNameCode: displayNameCode
            name: doInstall
            required: true
            userInterface: false
          - dataType: json
            defaultValue: '{"accept": true}'
            descriptionCode: descriptionCode
            displayNameCode: displayNameCode
            name: license
            required: true
            userInterface: false
          - dataType: json
            defaultValue: '{"nodeAffinity":{"requiredDuringSchedulingIgnoredDuringExecution":{"nodeSelectorTerms":[{"matchExpressions":[{"key":"isf.ibm.com/nodeType","operator":"NotIn","values":["gpu"]}]}]}}}'
            descriptionCode: descriptionCode
            displayNameCode: displayNameCode
            name: affinity
            required: true
            userInterface: false
        
    EOF
    
  3. Display the patched FSD:
    
    oc -n ibm-spectrum-fusion-ns get fusionservicedefinitions.service.isf.ibm.com data-cataloging-service-definition -o yaml
  4. Install from the user interface.
  5. Delete the namespace:
    oc delete ns ibm-data-cataloging
  6. Delete the FSD instance:
    oc -n ibm-spectrum-fusion-ns delete fusionserviceinstances.service.isf.ibm.com data-cataloging-service-instance
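After step 3, you can check mechanically that the patched affinity is in place. This is a minimal sketch, not part of the documented procedure: the check is factored into a function so it can be run against saved oc output, and it simply looks for the NotIn operator string from the patch above (the exact YAML rendering of the defaultValue JSON may vary):

```shell
#!/bin/sh
# Succeeds when the FSD YAML passed on stdin contains the gpu-exclusion affinity.
affinity_patched() {
  grep -q 'NotIn'
}

# Typical use against the live cluster:
# oc -n ibm-spectrum-fusion-ns get fusionservicedefinitions.service.isf.ibm.com \
#   data-cataloging-service-definition -o yaml | affinity_patched && echo "affinity patch present"
```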
Data Cataloging service not available
The Data Cataloging service is not available whenever a single node is down or maintenance mode is enabled. If the Data Cataloging service is in a degraded state, check the node status and Scale pod status to ensure that everything is up and running.
Data Cataloging installation is stuck at 35% for more than 1 hour
Cause:

Either the instdb or the restore-morph job is stuck in the sync step. Run oc -n ibm-data-cataloging logs -f jobs/c-isd-instdb and oc -n ibm-data-cataloging logs -f jobs/c-isd-restore-morph to determine whether the last line of the logs is sync and it is not progressing.
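The log check in the Cause section can be scripted. This is a minimal sketch, assuming the stuck step shows up as a trailing sync line in the job logs; the last-line test is factored into a function so it can be run against saved log text:

```shell
#!/bin/sh
# Succeeds when the last log line passed on stdin mentions the sync step.
stuck_on_sync() {
  tail -n 1 | grep -q 'sync'
}

# Typical use:
# oc -n ibm-data-cataloging logs jobs/c-isd-instdb | stuck_on_sync && echo "instdb looks stuck"
# oc -n ibm-data-cataloging logs jobs/c-isd-restore-morph | stuck_on_sync && echo "restore-morph looks stuck"
```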

Resolution:
  1. Based on the logs, identify the job that is stuck with sync (c-isd-instdb or c-isd-restore-morph) by using the commands that are provided in the Cause section.
  2. Run the following command to determine the name of the running node based on where the job was allocated:
    oc -n ibm-data-cataloging get pod -l job-name=<JOB_NAME> -o jsonpath='{.items[].spec.nodeName}'
    Here, <JOB_NAME> is either c-isd-instdb or c-isd-restore-morph.
  3. Do a graceful reboot of the affected node. For the actual procedure, see the Red Hat OpenShift documentation.
  4. Run the following command to scale down the operator:
    oc -n ibm-data-cataloging scale --replicas=0 deployment/spectrum-discover-operator
  5. Run the following command to delete the Db2 instance:
    oc -n ibm-data-cataloging delete db2u isd
  6. Wait until the ibm-data-cataloging project is free of pods starting with c-isd.
  7. Run the following command to scale up the operator:
    oc -n ibm-data-cataloging scale --replicas=1 deployment/spectrum-discover-operator
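The wait in step 6 can also be scripted. This is a minimal sketch: the pod-name check is factored into a function so it can be run against saved output, and the 30-second poll interval is an arbitrary choice:

```shell
#!/bin/sh
# Succeeds when any pod name passed on stdin starts with c-isd.
has_cisd_pods() {
  grep -q '^c-isd'
}

# Poll until the ibm-data-cataloging project is free of c-isd pods.
wait_for_cisd_cleanup() {
  while oc -n ibm-data-cataloging get pods -o name | sed 's|^pod/||' | has_cisd_pods; do
    echo "c-isd pods still present; waiting 30 seconds..."
    sleep 30
  done
  echo "No c-isd pods remain."
}
```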
Data Cataloging installation is stuck at 80% for more than 4 hours
Cause:

If you find that this issue is related to the IBM Db2 resource, manually restart the Db2 deployment.

Resolution:
  1. Run the following command to scale down the operator:
    oc -n ibm-data-cataloging scale --replicas=0 deployment/spectrum-discover-operator
  2. Run the following command to delete the Db2 instance:
    oc -n ibm-data-cataloging delete db2u isd
  3. Wait until the Db2 pods and persistent volume claims are removed:
    oc -n ibm-data-cataloging get pod,pvc -o name | grep c-isd
    Note: Run the following command if the PVC cannot be deleted for a long period of time (more than one hour):
    oc patch pvc c-isd-meta -n ibm-data-cataloging -p '{"metadata":{"finalizers":null}}'
  4. Run the following command to scale up the operator:
    oc -n ibm-data-cataloging scale --replicas=1 deployment/spectrum-discover-operator

IBM Storage Fusion Backup & Restore service

MongoDB pod crashes with CrashLoopBackOff status
The MongoDB pod crashes due to an out-of-memory (OOM) error. To resolve the error, increase the memory limit from 256Mi to 512Mi. Complete the following steps to change the memory limit:
  1. Log in to the Red Hat® OpenShift® web console as an administrator.
  2. Go to Workloads > StatefulSet.
  3. Select the project ibm-backup-restore.
  4. Select the MongoDB StatefulSet and go to the YAML tab.
  5. In the YAML, change the memory limit for MongoDB container from 256Mi to 512Mi.
  6. Click Save.
Pods in CrashLoopBackOff state after upgrade
The Backup & Restore service health changes to unknown and two pods go into the CrashLoopBackOff state.

Resolution:

In the resource settings of the guardian-dp-operator pod that resides in the ibm-backup-restore namespace, set the value of the IBM Storage Fusion operator memory limit to 1000Mi.

Example:

resources:
  limits:
    cpu: 1000m
    memory: 1000Mi
  requests:
    cpu: 500m
    memory: 250Mi
Backup & Restore service installation gets stuck at 77%

Cause:

Pods get stuck in ContainerCreating state.

Resolution:

  1. Run the following command to check the Kafka cluster version in the YAML:
    oc get kafkas.kafka.strimzi.io guardian-kafka-cluster -o yaml -n ibm-backup-restore | grep -e " version"
  2. If the value of spec.kafka.version is 3.2.3, proceed to the next step.
  3. Run the following script to change spec.kafka.version from 3.2.3 to 3.3.1 and inter.broker.protocol.version from 3.2 to 3.3, and then delete any pending pods:
    
    oc patch kafkas.kafka.strimzi.io guardian-kafka-cluster -n ibm-backup-restore --type='json' -p='[{"op": "replace", "path":"/spec/kafka/version", "value":3.3.1}]'
    oc patch kafkas.kafka.strimzi.io guardian-kafka-cluster -n ibm-backup-restore --type='json' -p='[{"op": "replace", "path":"/spec/kafka/config/inter.broker.protocol.version", "value":"3.3"}]'
    for po in $(oc get pods --field-selector='status.phase=Pending' -n ibm-backup-restore -o name)
    do
      oc delete ${po} -n ibm-backup-restore 
    done
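The version check in steps 1 and 2 can be wired into the script. This is a minimal sketch: using a jsonpath query avoids grepping the full YAML, and the comparison is factored into a function so it can be run without a cluster:

```shell
#!/bin/sh
# Succeeds when the given Kafka version is the one that needs the patch.
needs_kafka_patch() {
  [ "$1" = "3.2.3" ]
}

# Typical use:
# VER=$(oc -n ibm-backup-restore get kafkas.kafka.strimzi.io guardian-kafka-cluster \
#   -o jsonpath='{.spec.kafka.version}')
# if needs_kafka_patch "$VER"; then
#   echo "Kafka $VER needs the patch"   # then run the oc patch commands from step 3
# fi
```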

IBM Storage Fusion Backup & Restore (Legacy) service

The IBM Spectrum Protect Plus server could not be contacted with a 'no route available' message after the IBM Spectrum Protect Plus server reinstall
After the reinstallation of the IBM Spectrum Protect Plus server, the transaction-manager pods went into a 1/3 crash loop state, and the transaction-manager-worker pods indicated that the IBM Spectrum Protect Plus server could not be contacted, with a no route available message.
To resolve the error, reinstall the IBM Spectrum Protect Plus agent.
ImagePull failure during Backup & Restore (Legacy) installation or upgrade
If an ImagePull failure occurs on the virgo pod during the installation or upgrade of Backup & Restore (Legacy), restart the Backup & Restore (Legacy) virgo pod in the ibm-spectrum-protect-plus-ns namespace:
  1. Go to OpenShift Container Platform web management console.
  2. Go to Workloads > Pods.
  3. Select ibm-spectrum-protect-plus-ns project.
  4. Search for the sppvirgo pod.
  5. From the Actions menu, click Delete Pod to respin it.

Common issues

ImagePull failure during installation or upgrade of any service
If an ImagePull failure occurs during the installation or upgrade of any service, restart the pod and retry. If the issue persists, contact IBM support.
Bundle unpacking failed
If you see a Job was active longer than specified deadline message during the installation or upgrade of any service, delete the job that has a 0 of 1 completions status in the openshift-marketplace namespace.
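Locating the failed unpack job can be scripted. This is a minimal sketch, assuming the default oc get jobs table layout in which the second column is COMPLETIONS; the filter is factored into a function so it can be run against saved output:

```shell
#!/bin/sh
# Print the names of jobs whose COMPLETIONS column reads 0/1, skipping the header row.
incomplete_jobs() {
  awk 'NR > 1 && $2 == "0/1" { print $1 }'
}

# Typical use (review the printed list before deleting):
# oc -n openshift-marketplace get jobs | incomplete_jobs | \
#   xargs -r -n1 oc -n openshift-marketplace delete job
```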