IBM Storage Fusion Backup & Restore service issues

List of known Backup & Restore issues in IBM Storage Fusion.

Service protection issues

  • Service protection can be configured on one cluster with both application and service backups. You can use the same cloud storage bucket to configure service protection on a second cluster and restore service and application backups from the first cluster. However, backups that no longer exist on cloud storage still appear in the user interface, and attempts to restore those backups fail.
    1. If the first cluster remains as is, the retention period on the original backups can expire, and the backups get removed from cloud storage. Because the second cluster is unaware of the removal, it can attempt to restore the removed backups. The attempt fails because the backup on cloud storage no longer exists.
    2. If you uninstall the Backup & Restore service from the first cluster, use the -s option to prevent DeleteBackupRequest CRs from getting created. If you do not set this option, the backups on cloud storage get removed, and the second cluster is again unaware that they no longer exist on the cloud storage.
    Note: The first deployment must not exist during the configuration of the second cluster.
  • The service protection (scheduled) backups fail with a Failed snapshot error.
    Cause
    The pod that exports the service protection data gets terminated during the data export because of a lack of resources on the node where it runs.
    Resolution
    Address the resource shortage on the node and rerun the service protection backup.
  • After a service protection restore of the control plane and a subsequent application restore, the following four columns are not populated on the backed up applications pane:
    • Backup status
    • Last backup on
    • Success rate
    • Backup capacity
    Resolution
    To populate those columns with data, run the backup again for that application. Note that the columns do not include the counts from before the restore.

Backup issues

  • The OpenShift® Container Platform cluster can have problems and become unusable. After you recover the cluster, rejoin the connections.
  • If you have a PolicyAssignment for an application on the hub and you create a PolicyAssignment for the same application on the spoke, your attempt to assign a backup policy for the application fails. In both assignments, the application, backup policy, and short-form cluster name are the same. The current format of the PolicyAssignment CR name is appName-backupPolicyName-shortFormClusterName. The issue happens when the first segment of the cluster names is identical. In this scenario, the creation gets rejected because the PolicyAssignment name already exists in OpenShift Container Platform.

    For example:

    Hub assignment creates app1-bp1-apps:
    • Application - app1
    • BackupPolicy - bp1
    • AppCluster - apps.cluster1
    Spoke assignment creates app1-bp1-apps (This is rejected by OpenShift Container Platform)
    • Application - app1
    • BackupPolicy - bp1
    • AppCluster - apps.cluster2

    Resolution:

    To create the PolicyAssignment for the spoke application, delete the PolicyAssignment CR for the hub application assignment and attempt spoke application assignment again.
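    The collision can be sketched in a few lines of shell. The names are the examples from this page; deriving the short-form cluster name from the first dot-separated segment is an assumption based on the observed behavior:

    ```shell
    # Hypothetical sketch of how the PolicyAssignment CR name is derived.
    # The short-form cluster name is assumed to be the first dot-separated
    # segment of the full cluster name, as in the example above.
    policy_assignment_name() {
      app=$1; policy=$2; cluster=$3
      short=${cluster%%.*}               # apps.cluster1 -> apps
      printf '%s-%s-%s\n' "$app" "$policy" "$short"
    }

    policy_assignment_name app1 bp1 apps.cluster1   # prints app1-bp1-apps
    policy_assignment_name app1 bp1 apps.cluster2   # prints app1-bp1-apps (duplicate name, so rejected)
    ```

    Because both calls produce the same name, the second CR creation is rejected by the API server.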

  • Sometimes, backups do not run as defined in the backup policies, especially when you set hourly policies. For example, if you set a policy for every two hours and it does not run every two hours, gaps exist in the backup history. A possible reason is that after a pod crash and restart, the scheduled jobs did not account for the time zone, causing gaps in the run intervals.
    Diagnosis
    The following are the observed symptoms:
    • Policies with a custom "every X hours at minute YY" schedule: the first scheduled run of the policy occurs at minute YY after X hours plus the time zone offset from UTC, instead of at minute YY after X hours.
    • Monthly and yearly policies run more frequently than scheduled.
    Resolution
    You can start backups manually until the next scheduled time.
  • You can encounter an error when you attempt a backup operation with the IBM Storage Fusion Backup & Restore service that is deployed in IBM Cloud Satellite.
    Backup operations fail with the following log entries:
    
    level=error msg="Error backing up item" backup=<item> error="error executing custom action (groupResource=pods, namespace=<namespace>, name=<name>): rpc error: code = Unknown desc = configmaps \"config\" not found" error.file="/remote-source/velero/app/pkg/backup/item_backupper.go:326" error.function="github.com/vmware-tanzu/velero/pkg/backup.(*itemBackupper).executeActions" logSource="/remote-source/velero/app/pkg/backup/backup.go:417" name=<name>
    level=error msg="Error backing up item" backup=<item> error="error executing custom action (groupResource=replicasets.apps, namespace=<namespace>, name=<name>): rpc error: code = Unknown desc = configmaps \"config\" not found" error.file="/remote-source/velero/app/pkg/backup/item_backupper.go:326" error.function="github.com/vmware-tanzu/velero/pkg/backup.(*itemBackupper).executeActions" logSource="/remote-source/velero/app/pkg/backup/backup.go:417" name=<name>
    level=error msg="Error backing up item" backup=<item> error="error executing custom action (groupResource=deployments.apps, namespace=<namespace>, name=<name>): rpc error: code = Unknown desc = configmaps \"config\" not found" error.file="/remote-source/velero/app/pkg/backup/item_backupper.go:326" error.function="github.com/vmware-tanzu/velero/pkg/backup.(*itemBackupper).executeActions" logSource="/remote-source/velero/app/pkg/backup/backup.go:417" name=<name>
    
    Cause
    An issue exists with the default OADP plug-in, and it must be disabled to continue.
    Resolution

    Do the following steps to disable the plug-in:

    1. In the OpenShift console, go to Administration > CustomResourceDefinitions.
    2. Search for the CustomResourceDefinition DataProtectionApplication.
    3. In the Instances tab, locate the instance that is named velero.
    4. Open the YAML file in edit mode for the instance.
    5. Under the entry spec:velero:defaultPlugins, remove the line for openshift.
    6. Save the YAML file.
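    For reference, the edit in steps 4 and 5 amounts to removing the openshift entry from the plug-in list. A minimal before/after sketch of the relevant section (surrounding fields omitted; the other plug-in names are illustrative and may differ in your instance):

    ```yaml
    # Before: the default plug-in list includes openshift
    spec:
      velero:
        defaultPlugins:
          - openshift   # remove this line
          - aws
          - csi
    # After: the openshift entry is removed
    spec:
      velero:
        defaultPlugins:
          - aws
          - csi
    ```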
  • Sometimes, backup jobs are stuck in a running state for a long time and are not canceled.
    Resolution
    Do the following steps to resolve the issue:
    1. Ensure that all jobs are finished and the queue is empty before you perform disruptive actions such as node reboots.
    2. If jobs run for a long period of time without making progress, follow these steps to delete the backup or restore CR directly.
      1. Log in to IBM Storage Fusion.
      2. Go to Backup & Restore > Jobs > Queue and get the name of the job that is stuck.
      3. Run the following command to delete the backup job:
        oc delete fbackup <job_name>
      4. Run the following command to delete the restore job:
        oc delete frestore <job_name>
  • Issues during policy creation
    Sometimes, when you create a backup policy, the following error may occur:
    Error: Policy daily-snapshot could not created. 
    
    Resolution
    Restart the isf-data-protection-operator-controller-manager-* pod in the IBM Storage Fusion namespace. The restart triggers the re-creation of the in-place-snapshot BackupStorageLocation CR.
  • Backing up to Storage Protect with Tape BSL fails while transferring data.
    This issue occurs when you create a BSL with the Storage Protect type in the UI. The BSL is created successfully and connects to the S3 cloud storage that it represents, but backups that use this BSL fail.
    Note: If you are on IBM Storage Fusion 2.7.1 or higher, then this issue is not applicable.
    Resolution
    Run the following oc command, replacing the BSL name (bsl-to-patch) and the IBM Storage Fusion namespace (ibm-spectrum-fusion-ns) with your values.
    oc patch fbsl bsl-to-patch -n ibm-spectrum-fusion-ns --type json -p '[{ "op": "replace", "path": "/spec/type", "value": "s3" }]'
    
  • A backup attempt of multiple VMs fails.

    This issue occurs when some VMs are in a migrating state, because OpenShift Container Platform does not support snapshots of VMs that are migrating.

    Resolution
    Follow these steps to resolve the issue:
    1. Check whether any virtual machines are in a migrating state.
    2. Run the following command to list the migrating VMs:
      oc get virtualmachineinstancemigrations -A
      Example output:
      NAMESPACE            NAME                                          PHASE         VMI
      fb-bm1-fs-1-5g-10    rhel8-lesser-wildcat-migration-8fhbo          Failed        rhel8-lesser-wildcat
      vm-centipede-bm2     centos-stream9-chilly-hawk-migration-57jyk    Failed        centos-stream9-chilly-hawk
      vm-centos9-bm1-1     centos-stream9-instant-toad-migration-bfyz6   Failed        centos-stream9-instant-toad
      vm-centos9-bm1-1     centos-stream9-instant-toad-migration-d9547   Failed        centos-stream9-instant-toad
      vm-windows10-bm2-1   kubevirt-workload-update-4dm57                Failed        win10-zealous-unicorn
      vm-windows10-bm2-1   kubevirt-workload-update-f2s5w                Failed        win10-zealous-unicorn
      vm-windows10-bm2-1   kubevirt-workload-update-gt6nj                Failed        win10-zealous-unicorn
      vm-windows10-bm2-1   kubevirt-workload-update-rjwmn                Failed        win10-zealous-unicorn
      vm-windows10-bm2-1   kubevirt-workload-update-vfxfl                TargetReady   win10-zealous-unicorn
      vm-windows10-bm2-1   kubevirt-workload-update-z2thw                Failed        win10-zealous-unicorn
      vm-windows11-bm2-1   kubevirt-workload-update-9gr6v                Failed        win11-graceful-coyote
      vm-windows11-bm2-1   kubevirt-workload-update-clbck                Failed        win11-graceful-coyote
      vm-windows11-bm2-1   kubevirt-workload-update-j6pmx                Failed        win11-graceful-coyote
      vm-windows11-bm2-1   kubevirt-workload-update-sfbbx                Pending       win11-graceful-coyote
      vm-windows11-bm2-1   kubevirt-workload-update-th5dd                Failed        win11-graceful-coyote
      vm-windows11-bm2-1   kubevirt-workload-update-zl679                Failed        win11-graceful-coyote
      vm-windows11-bm2-2   kubevirt-workload-update-7dp6g                Failed        win11-conservative-moth
      vm-windows11-bm2-2   kubevirt-workload-update-9nb9m                TargetReady   win11-conservative-moth
      vm-windows11-bm2-2   kubevirt-workload-update-cdrf5                Failed        win11-conservative-moth
      vm-windows11-bm2-2   kubevirt-workload-update-dm8fz                Failed        win11-conservative-moth
      vm-windows11-bm2-2   kubevirt-workload-update-kwr6c                Failed        win11-conservative-moth
      vm-windows11-bm2-2   kubevirt-workload-update-zt8wx                Failed        win11-conservative-moth
    3. Exclude the migrating virtual machines from the backup. Reattempt the backup after the migration is complete.
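    If you want to script the check, a small filter over the command output can list only the migrations that are still in progress. This sketch treats any phase other than Succeeded or Failed as in progress (an assumption; adjust the phase list as needed):

    ```shell
    # Print NAMESPACE/VMI for migrations that are not in a terminal phase.
    # Feed it real data with:
    #   oc get virtualmachineinstancemigrations -A --no-headers | list_in_progress
    list_in_progress() {
      awk '$3 != "Failed" && $3 != "Succeeded" { print $1 "/" $4 }'
    }

    # Demonstration with two rows from the sample output above:
    printf '%s\n' \
      'vm-windows10-bm2-1   kubevirt-workload-update-vfxfl   TargetReady   win10-zealous-unicorn' \
      'fb-bm1-fs-1-5g-10    rhel8-lesser-wildcat-migration-8fhbo   Failed   rhel8-lesser-wildcat' |
      list_in_progress
    # prints: vm-windows10-bm2-1/win10-zealous-unicorn
    ```

    The VMIs that the filter prints are the ones to exclude from the backup until their migrations finish.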

Restore issues

  • Backup & Restore currently uses restic to move the Velero backups to repositories on S3 buckets. When a restic repository is initialized, it creates a configuration file and several subdirectories for its snapshots. Because restic never modifies this configuration file after initialization, its modification timestamp never changes. If you configure an expiration policy on the container, it can eventually delete the configuration file and subdirectories that were never modified. All restic commands check this configuration file to determine whether the restic repository is initialized. If the file does not exist, restore jobs fail to find the repository and cannot restore the backup. Subsequent backup jobs cannot find the repository, and initialization attempts can result in a repository with errors.
  • Sometimes, you may observe the following error message:
    "exec <executable name>": exec format error
    For example:
    The pod log is empty except for this message: exec /filebrowser 
    The example error can be caused by a container image built for the wrong architecture, for example, an amd64 container on s390x nodes or an s390x container on amd64 nodes. As a resolution, check whether the architecture of the container that you want to restore matches the architecture of the local node.
  • Restore of namespaces that contain admission webhooks fails.
    Example error in IBM Storage Fusion restore job:
    "Failed restore <some resource>" "BMYBR0003 RestorePvcsFailed There was an error when processing the job in the Transaction Manager service"
    Example error in Velero pod:
    level=error msg="Namespace domino-platform, resource restore error: error restoring certificaterequests.cert-manager.io/domino-platform/hephaestus-buildkit-client-85k2v: Internal error occurred: failed calling webhook \"webhook.cert-manager.io\": failed to call webhook: Post \"https://cert-manager-webhook.domino-platform.svc:443/mutate?timeout=10s\": service \"cert-manager-webhook\" not found"
    Resolution
    1. Identify the admission webhooks that are applicable to the namespace being restored:
      oc get mutatingwebhookconfigurations
      oc describe mutatingwebhookconfigurations
    2. Change the failurePolicy parameter from Fail to Ignore to temporarily disable webhook validation before the restore:
      failurePolicy: Ignore
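    The same change can be applied from the command line with oc patch. The webhook configuration name and the entry index below are placeholders; take the real name from step 1 and repeat the patch for each webhook entry that applies. The sketch echoes the command as a dry run; drop the leading echo to apply it:

    ```shell
    # Hypothetical webhook configuration name; take the real one from
    # `oc get mutatingwebhookconfigurations` in step 1.
    WEBHOOK=cert-manager-webhook
    # JSON patch that flips failurePolicy on the first webhook entry (index 0).
    PATCH='[{"op": "replace", "path": "/webhooks/0/failurePolicy", "value": "Ignore"}]'
    # Echoed as a dry run; remove `echo` to apply the change to the cluster.
    echo oc patch mutatingwebhookconfiguration "$WEBHOOK" --type json -p "$PATCH"
    ```

    Remember to revert failurePolicy to Fail after the restore completes, so the webhook validation is enforced again.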
  • When you try to restore backups that were taken before an upgrade, the restore fails with a BMYBR0003 error.
    Diagnosis

    After you upgrade, your jobs might fail with the following statuses:

    • Backup jobs with the status:
      "Failed transferring data" "BMYBR0003 There was an error when processing the job in the Transaction Manager service"
    • Restore jobs with the status:
      "Failed restore <some resource>" "BMYBR0003 There was an error when processing the job in the Transaction Manager service"

    Confirm the issue in the logs of the manager container of the Data Mover pod.

    A sample error message:

    2023-07-26T03:39:47Z	ERROR	Failed with error.	{"controller": "guardiancopyrestore", "controllerGroup": "guardian.isf.ibm.com", "controllerKind": "GuardianCopyRestore", "GuardianCopyRestore": {"name":"52a2abfb-ea9b-422f-a60d-fed59527d38e-r","namespace":"ibm-backup-restore"}, "namespace": "ibm-backup-restore", "name": "52a2abfb-ea9b-422f-a60d-fed59527d38e-r", "reconcileID": "c8642f40-c086-413f-a10e-6d6a85531337", "attempt": 2, "total attempts": 3, "error": "EOF"}
    github.ibm.com/ProjectAbell/guardian-dm-operator/controllers/util.Retry
    	/workspace/controllers/util/utils.go:39
    github.ibm.com/ProjectAbell/guardian-dm-operator/controllers/kafka.(*kafkaWriterConnection).PublishMessage
    	/workspace/controllers/kafka/kafka_native_connection.go:71
    github.ibm.com/ProjectAbell/guardian-dm-operator/controllers.(*GuardianCopyRestoreReconciler).updateOverallCrStatus
    	/workspace/controllers/status.go:191
    github.ibm.com/ProjectAbell/guardian-dm-operator/controllers.(*GuardianCopyRestoreReconciler).doRestore
    	/workspace/controllers/guardiancopyrestore_controller.go:187
    github.ibm.com/ProjectAbell/guardian-dm-operator/controllers.(*GuardianCopyRestoreReconciler).Reconcile
    	/workspace/controllers/guardiancopyrestore_controller.go:92
    sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
    	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:122
    sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
    	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:323
    sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:274
    sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
    	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:235
    
    Resolution:
    Search for the guardian-dm-controller-manager pod and delete it. A new pod starts within a minute. After the pod reaches a healthy state, retry the backup and restore jobs.
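    A minimal sketch of the restart, assuming the ibm-backup-restore namespace shown in the error message above (adjust if your installation uses a custom namespace). The snippet prints the delete command as a dry run; set APPLY=1 to run it:

    ```shell
    NS=ibm-backup-restore   # Backup & Restore namespace; adjust if customized
    # Build the command that deletes the guardian-dm-controller-manager pod so
    # that its controller starts a fresh replacement pod.
    CMD="oc delete -n $NS \$(oc get pods -n $NS -o name | grep guardian-dm-controller-manager)"
    if [ "${APPLY:-0}" = "1" ]; then
      eval "$CMD"
    else
      echo "DRY RUN: $CMD"
    fi
    ```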
  • A "Failed restore snapshot" error occurs with applications using IBM Storage Scale storage PVCs.
    Cause
    The "disk quota exceeded" error occurs when you restore, from an object storage location, applications that use an IBM Storage Scale PVC smaller than 5 GB.
    Resolution
    Increase the IBM Storage Scale PVC size to a minimum of 5 GB, and then run the backup and restore operations again.

Job issues

  • If your jobs are stuck in the 'New' state after a node restart, restart the job manager pod.
  • If you run many jobs at the same time, performance and scalability issues can occur.
    Resolution
    As a prerequisite, modify the KafkaTopic objects on the hub cluster installation. This modification cannot be done on a spoke cluster.
    1. Log in to OpenShift Container Platform console.
    2. Go to Operators > Installed Operators.
    3. Select AMQ Streams operator.
    4. Click Kafka Topic tab.
    5. Do the following steps to edit backup, restore, and dm-job-info topic objects:
      1. Click the topic object and choose the YAML tab to edit the object.
      2. Change the spec.partitions value to 10.
      3. Click Save.
    If you want to use commands to modify KafkaTopic objects, run the following commands:
    
    oc patch -n ibm-backup-restore KafkaTopic backup --type merge -p '{"spec":{"partitions":10}}'
    oc patch -n ibm-backup-restore KafkaTopic restore --type merge -p '{"spec":{"partitions":10}}'
    oc patch -n ibm-backup-restore KafkaTopic dm-job-info --type merge -p '{"spec":{"partitions":10}}'
    
    
    In these commands, ibm-backup-restore is the default namespace. Replace it with your custom namespace if it differs.

    Do the following workaround steps:

    Note: It must be done on a Hub cluster install. It cannot be done on a Spoke cluster.
    Using commands:
    Restart the application-controller pod on the cluster where the application resides. If the application is on the same cluster as the hub, restart the application-controller pod on the hub. If the application is on a spoke cluster, restart the application-controller pod on the spoke cluster. To restart by using the command line, run the following command:
    oc delete pod -n <namespace of ibm-backup-restore install> --selector app.kubernetes.io/component=application-controller
    

    Using OpenShift Container Platform console:

    1. Go to Workloads > Pods.
    2. Select the installation namespace of the Backup & Restore agent. The default is ibm-backup-restore.
    3. Search for application-controller and find the application-controller line.
    4. From the ellipsis overflow menu, click Delete pod.
    5. Wait for the application-controller pod to restart and reach the ready state. The missing application is then available for backup in the IBM Storage Fusion user interface.

Known issues

  • In the Backup & Restore service page of the OpenShift Container Platform console, if the backup policy assignment to an application fails with a gateway timeout error, use the IBM Storage Fusion user interface instead.
  • The restore job fails because it is unable to restore the route. To avoid this issue, ensure that the combined length of the restore namespace and the name of the restore cluster is not longer than the combined length of the original namespace and the name of the original cluster.
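    The length constraint can be checked before you start the restore. A sketch with hypothetical names (substitute your own namespaces and cluster names):

    ```shell
    # Hypothetical source and restore-target names; substitute your own.
    src_ns=my-app-ns;        src_cluster=cluster-a
    dst_ns=my-app-ns-copy;   dst_cluster=cluster-b
    # Per the guidance above, the restore namespace + cluster name must not be
    # longer than the original namespace + cluster name.
    if [ $(( ${#dst_ns} + ${#dst_cluster} )) -le $(( ${#src_ns} + ${#src_cluster} )) ]; then
      echo "OK: route restore length check passed"
    else
      echo "WARNING: restore of the route may fail"
    fi
    ```

    With these example names the check prints the warning, because the target namespace is longer than the source namespace.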
  • You cannot restore multiple namespaces to a single alternative namespace. If you attempt such a restore, then the job fails. Example transaction manager log:
    2023-06-27 15:05:53,633[TM_5][c2a4b768-2acc-4169-9c5f-f88e60f3be2b][restoreguardian:restore_application_w_recipe Line 172][INFO] - altNS: ns2, number of ns: 2
    
    2023-06-27 15:05:53,633[TM_5][c2a4b768-2acc-4169-9c5f-f88e60f3be2b][restoreguardian:restore_application_w_recipe Line 176][ERROR] - Alternate namespace specified, but more than one namespace was backed up
  • You cannot restore to a cluster that does not have an identical storage class as the source cluster. However, the transaction manager still attempts to create PVCs with the non-existent storage class on the spoke cluster and eventually fails with Failed restore snapshot status.
  • After an upgrade, information for backups taken before IBM Storage Fusion 2.6 is not available on the Backed up applications page.
  • The new backed up applications page does not show the details of the application when you upgrade IBM Storage Fusion to the latest version while leaving the Backup & Restore service at the older version.

    As a resolution, upgrade the Backup & Restore service to the latest version after an IBM Storage Fusion upgrade.

  • The backup applications table does not show the new backup times for the backed-up applications. Go to the Applications and Jobs view to see the last successful backup job for a given application. For applications on the hub, the Applications table has the correct last backup time.