Backup & Restore service issues

Use this troubleshooting information to resolve install and upgrade problems that are related to Backup & Restore service.

Installation and upgrade in CrashLoopBackOff error state
Backup of virtual machines created on Red Hat® OpenShift® Container Platform 4.13 fail after upgrade to 4.14
Backup & Restore server installation throws an Invalid Input error
Backup & Restore hub service in a custom namespace
Backup & Restore stuck at 5% during upgrade
MongoDB pod crashes with CrashLoopBackOff status
Storage-operator pod crashes with CrashLoopBackOff status
Pods in Crashloopbackoff state after upgrade
Backup & Restore service goes into unknown state

Installation and upgrade in CrashLoopBackOff error state

When you install or upgrade to IBM Storage Fusion 2.7.2, you may encounter an issue wherein the application-controller and transaction-manager pods in the ibm-backup-restore namespace get stuck in CrashLoopBackOff status. This may occur when your cluster uses kubeconfig. To confirm whether you are affected by this issue, you can check the


application-controller

and transaction-manager pod logs. A sample error is as follows:

UnicodeDecodeError: utf-8' codec can't decode byte 0x98 in position 1: invalid start byte

As a resolution, apply hot fix.

Make a note of the image location. Replace the sample image locations with the actual hotfix image location.

Update the transaction manager:

oc set image deployment/transaction-manager transaction-manager=cp.icr.io/cp/fbr/guardian-transaction-manager:2.7.2-15294285-33585 -n ibm-backup-restore

Update the application controller:

oc set image deployment/application-controller application-controller=cp.icr.io/cp/fbr/guardian-transaction-manager:2.7.2-15294285-33585 -n ibm-backup-restore

Note: The two pods use the same image, so the value for the image is identical for both commands.

Backup of virtual machines created on Red Hat OpenShift Container Platform 4.13 fail after upgrade to 4.14

Backup of virtual machines created on Red Hat OpenShift Container Platform version 4.13 fail after the cluster gets upgraded to OpenShift Container Platform version 4.14 and OpenShift Virtualization to 4.14. The backup job details show either of the following errors:

Failed transferring data
Virtual machine snapshots failed

The log shows ‘Retrieved VirtualMachineSnapshot .... waiting for snapshot to complete message.

Cause:

The backups fail because the virtual machine snapshots time out after 30 minutes.

Resolution:

Stop and start the virtual machine and retry the backup.

Backup & Restore server installation throws an Invalid Input error

If the Backup & Restore service deployment fails with the following error in the IBM Storage Fusion user interface, try with a private browser window.

Install Error: Invalid Input: Both the api Server and bootstrapToken fields need to be populated

Backup & Restore hub service in a custom namespace

In general, the Backup & Restore service is installed or upgraded to the default ibm-backup-restore namespace. To avoid issues during the installation or upgrade of the service with custom namespace, do the resolution steps.

Resolution

Go to Operators > Installed Operators > Fusion Service Definition.
Open the ibm-backup-restore-service CR and navigate to the YAML tab.

In spec.onboarding.parameters, search for parameter with name as namespace and change the defaultValue to the custom-namespace where the service must be installed.

Example:


parameters:
      - dataType: string
        name: namespace
        defaultValue: <custom-namespace>
        userInterface: false
        required: true
        descriptionCode: BMYSRV00003
        displayNameCode: BMYSRV00004

Store the following YAML in server-fsi.yaml, and replace <custom-namespace> with the namespace where the service must be installed.


apiVersion: service.isf.ibm.com/v1
kind: FusionServiceInstance
metadata:
  name: ibm-backup-restore-service-instance
  namespace: ibm-spectrum-fusion-ns
spec:
  parameters:
    - name: doInstall
      provided: false
      value: 'true'
    - name: namespace
      provided: false
      value: <custom-namespace>
    - name: storageClass
      provided: true
      value: lvms-lmvg
  serviceDefinition: ibm-backup-restore-service
  triggerUpdate: false
  enabled: true
  doInstall: true

Run the following command to apply the changes:
```
oc apply -f server-fsi.yaml 
```
For the upgrade procedure, upgrade the IBM Storage Fusion operator, repeat step 1 of the resolution, and then upgrade the service.

Backup & Restore service installation gets stuck after upgrade to IBM Storage Fusion 2.7.0

If the installation or upgrade of the Backup & Restore service does not reach 100% completion, then check for failed startup probes. Run the following command to look for any pods that are not in the READY state:

oc get pods -n ibm-backup-restore

Example output:

NAME                                                              READY   STATUS      RESTARTS        AGE
applicationsvc-855746ffbf-2pmz7                                   0/1     Running     5 (2m8s ago)    17m
...
job-manager-845dc56b8d-r5w6j                                      0/1     Running     5 (2m50s ago)   17m
...

If found, then describe each of the pods and look at the event section to determine if there is a startup probe failure.

Example output showing probe failure:

Events:
  Type     Reason          Age                    From               Message
  ----     ------          ----                   ----               -------
  ... 
  Warning  Unhealthy       3m40s (x5 over 6m40s)  kubelet            Startup probe failed: HTTP probe failed with statuscode: 503

If one or more pods show startup probe failures, then patch the probes to provide additional start time. Update the following script and run to patch the probes. To determine the respective deployment names, update the DEPLOYMENT_NAMES variable with the list of deployments that have failing startup probes and run the script.

oc get deployment -n ibm-backup-restore

Example output:

NAME                                          READY   UP-TO-DATE   AVAILABLE   AGE
...
applicationsvc                                0/1     1            1           12h
...
job-manager                                   0/1     1            1           12h
...

Now, check whether the pods are online. If the startup probes continue to fail, then increase the STARTUP_DELAY_SEC parameter and try again.

If it is an upgrade and the pods come online, but the progress does not reach 100% in the IBM Storage Fusion user interface even after 30 minutes, then you must update the Backup & Restore Server CR to retrigger the upgrade:

Go to OpenShift Container Platform.
Go to the Installed Operators view with the selected ibm-backup-restore project or namespace.
Click IBM Storage Fusion Backup & Restore Server.
Select the Data Protection Server tab and click the instance.
Select the YAML tab and update the following:
```
triggerUpgrade: false
```
to
```
triggerUpgrade: true
```
Save and reload the YAML.
After some time, all pods must be READY and the Backup & Restore service must show healthy in IBM Storage Fusion user interface.

Backup & Restore stuck at 5% during upgrade

Diagnosis:

In OpenShift user interface, go to the Installed Operator and filter on ibm-backup-restore namespace.
Click IBM Storage Fusion Backup and Restore Server.
Go to Subscriptions.

If you see the following errors, then do the workaround steps to resolve the issue:

"error validating existing CRs against new CRD's schema for "guardiancopyrestores.guardian.isf.ibm.com": error validating custom resource against new schema for GuardianCopyRestore"

Resolution:

From command line or command prompt, log in to the cluster and run the following commands to delete the guardiancopyrestore CRs and 2.6.0 cs:

oc -n ibm-backup-restore delete guardiancopyrestore.guardian.isf.ibm.com --all
oc -n ibm-backup-restore delete csv guardian-dm-operator.v2.6.0

From OpenShift Container Platform console, go to Installed Operator and filter on ibm-backup-restore namespace.
Click IBM Storage Fusion Backup and Restore Server.
Go to Subscriptions.
Find the failing installplan and delete it.
Go to Installed Operator and go to ibm-backup-restore namespace.
Find IBM Storage Fusion Backup and Restore Server and click Upgrade available and approve the Install plan for IBM Storage Fusion Backup and Restore Server.
Wait for the service upgrade to resume.

MongoDB pod crashes with CrashLoopBackOff status

The MongoDB pod crashes due to OOM error. To resolve the error, increase the memory limit from 256Mi to 512Mi. Do the following steps to change the memory limit:

Log in to the Red Hat OpenShift web console as an administrator.
Go to Workloads > StatefulSet.
Select the project ibm-backup-restore.
Select the MongoDB pod, and go to the YAML tab.
In the YAML, change the memory limit for MongoDB container from 256Mi to 512Mi.
Click Save.

storage-operator pod crashes with CrashLoopBackOff status

The storage-operator pod crashes due to OOM error. To resolve the error, increase the memory limit from 300Mi to 500Mi.

Log in to the Red Hat OpenShift web console as an administrator.
Select the project ibm-spectrum-fusion-ns.
Go to Operators > Installed operators, click IBM Storage Fusion, and go to the YAML tab.
Search for control-plane: isf-storage-operator in the YAML file.

In the YAML, change the memory limit for isf-storage-operator container from 300Mi to 500Mi.


containers:
    - resources:
        limits:
          cpu: 100m
          memory: 500Mi

Click Save.
Wait until a new isf-storage-operator pod comes up.

Pods in Crashloopbackoff state after upgrade

The Backup & Restore service health changes to unknown and two pods go into Crashlookbackoff state.

Resolution:

In the resource settings of guardian-dp-operator pod that resides in ibm-backup-restore namespace, set the value of IBM Storage Fusion operator memory limits to 1000 mi.

Example:


resources:    
          limits:    
            cpu: 1000m    
            memory: 1000Mi    
          requests:    
            cpu: 500m    
            memory: 250Mi

Backup & Restore service goes into unknown state

This Backup & Restore service might go into unknown state when you upgrade IBM Storage Fusion from 2.6.x to 2.7.0.

Resolution:

It automatically shows healthy on the IBM Storage Fusion user interface after you upgrade the Backup & Restore service.