IBM Storage Fusion Backup & Restore service issues
List of known Backup & Restore issues in IBM Storage Fusion.
Service protection issues
- Service protection can be configured on one cluster with both application and service backups.
You can use the same cloud storage bucket to configure service protection on a second cluster and
restore service and application backups from the first cluster. Backups that no longer exist on
cloud storage still appear in the user interface, and attempts to restore those backups fail.
- If the first cluster remains as is, the retention period on the original backups can expire, and backups get removed from cloud storage. As the second cluster is unaware of the removal, it can attempt to remove the restored backups. The attempt fails because the backup on cloud storage no longer exists.
- If you uninstall the Backup & Restore service from the first cluster, use the
-s option to prevent DeleteBackupRequest CRs from getting created. If you do not set this option, the backups on cloud storage get removed, and the second cluster is again unaware that they no longer exist on the cloud storage.
Note: The first deployment must not exist during the configuration of the second cluster.
- The service protection (scheduled) backups fail with a Failed snapshot error.
- Cause
- The pod that exports the service protection data gets terminated during data export because of a lack of resources on the node where it runs.
- Resolution
- Address the resource shortage on the node and rerun the service protection backup.
- After a service protection restore of the control plane and an application, the following
four columns are not populated on the Backed up applications pane:
- Backup status
- Last backup on
- Success rate
- Backup capacity
- Resolution
- To populate those columns with data, run a backup again for that application. Note that the counts from before the restore are not included.
Backup issues
- The OpenShift® Container Platform cluster can have problems and become unusable. After you recover the cluster, rejoin the connections.
- If you have a PolicyAssignment for an application on the hub and you create a PolicyAssignment
for the same application on the spoke, then your attempt to assign a backup policy for the
application fails. In both assignments, the application, backup policy, and short-form cluster name
are the same. The current format of the PolicyAssignment CR name is
appName-backupPolicyName-shortFormClusterName. The issue happens when the first segment of the
cluster names is identical. In this scenario, the creation is rejected because the PolicyAssignment
name already exists in OpenShift Container Platform. For example:
The hub assignment creates app1-bp1-apps:
- Application - app1
- BackupPolicy - bp1
- AppCluster - apps.cluster1
The spoke assignment also resolves to app1-bp1-apps (this is rejected by OpenShift Container Platform):
- Application - app1
- BackupPolicy - bp1
- AppCluster - apps.cluster2
Resolution:
To create the PolicyAssignment for the spoke application, delete the PolicyAssignment CR for the hub application assignment, and then attempt the spoke application assignment again.
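A minimal sketch of the naming collision, assuming the CR name is derived exactly as appName-backupPolicyName-shortFormClusterName, with the short form being the first dot-separated segment of the cluster name (the helper function is illustrative, not product code):

```python
# Hypothetical sketch of how the PolicyAssignment CR name is built and why
# two clusters whose names share the same first segment collide.
def policy_assignment_name(app: str, policy: str, cluster: str) -> str:
    short_form = cluster.split(".")[0]  # short-form name: first segment only
    return f"{app}-{policy}-{short_form}"

hub = policy_assignment_name("app1", "bp1", "apps.cluster1")
spoke = policy_assignment_name("app1", "bp1", "apps.cluster2")
print(hub)    # app1-bp1-apps
print(spoke)  # app1-bp1-apps -> same name, so the second create is rejected
```

Because both cluster names start with the segment apps, the two assignments map to the same CR name, and OpenShift Container Platform rejects the duplicate.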
- Sometimes, backups do not run as defined in the backup policies, especially when you set hourly
policies. For example, if you set a policy for every two hours and it does not run every two hours,
then gaps exist in the backup history. A possible reason is that after a pod crashed and restarted,
the scheduled jobs did not account for the time zone, causing gaps in run intervals.
- Diagnosis
- The following are the observed symptoms:
- Policies with custom "every X hours at minute YY" schedules: the first scheduled run of the policy occurs at minute YY after X hours plus the time zone offset from UTC, instead of at minute YY after X hours.
- Monthly and yearly policies run more frequently.
- Resolution
- You can start backups manually until the next scheduled time.
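The drift can be illustrated with a small sketch; the scheduler internals here are an assumption based on the symptom (the UTC offset being added when the first run is computed), not the actual product code:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical illustration: a policy defined as "every 2 hours at minute 30"
# should fire at 02:30, 04:30, ... UTC. If the scheduler also adds the node's
# UTC offset (offset_hours) to the first run, every run is shifted.
def next_runs(start, every_hours, minute, offset_hours=0, count=3):
    first = start.replace(minute=minute, second=0, microsecond=0)
    first += timedelta(hours=every_hours + offset_hours)
    return [first + timedelta(hours=every_hours * i) for i in range(count)]

start = datetime(2023, 7, 1, 0, 0, tzinfo=timezone.utc)
correct = next_runs(start, 2, 30)                  # expected schedule
skewed = next_runs(start, 2, 30, offset_hours=-5)  # buggy schedule on a UTC-5 node
print([t.strftime("%H:%M") for t in correct])  # ['02:30', '04:30', '06:30']
print([t.strftime("%H:%M") for t in skewed])   # ['21:30', '23:30', '01:30']
```

The shifted schedule explains both symptoms: hourly policies show gaps relative to the expected times, and until the offset is corrected the only reliable workaround is to trigger backups manually.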
- You can encounter an error when you attempt a backup operation on the IBM Storage Fusion Backup & Restore service that is deployed in IBM Cloud Satellite. Backup operations fail with the following log entries:
level=error msg="Error backing up item" backup=<item> error="error executing custom action (groupResource=pods, namespace=<namespace>, name=<name>): rpc error: code = Unknown desc = configmaps \"config\" not found" error.file="/remote-source/velero/app/pkg/backup/item_backupper.go:326" error.function="github.com/vmware-tanzu/velero/pkg/backup.(*itemBackupper).executeActions" logSource="/remote-source/velero/app/pkg/backup/backup.go:417" name=<name>
level=error msg="Error backing up item" backup=<item> error="error executing custom action (groupResource=replicasets.apps, namespace=<namespace>, name=<name>): rpc error: code = Unknown desc = configmaps \"config\" not found" error.file="/remote-source/velero/app/pkg/backup/item_backupper.go:326" error.function="github.com/vmware-tanzu/velero/pkg/backup.(*itemBackupper).executeActions" logSource="/remote-source/velero/app/pkg/backup/backup.go:417" name=<name>
level=error msg="Error backing up item" backup=<item> error="error executing custom action (groupResource=deployments.apps, namespace=<namespace>, name=<name>): rpc error: code = Unknown desc = configmaps \"config\" not found" error.file="/remote-source/velero/app/pkg/backup/item_backupper.go:326" error.function="github.com/vmware-tanzu/velero/pkg/backup.(*itemBackupper).executeActions" logSource="/remote-source/velero/app/pkg/backup/backup.go:417" name=<name>
- Cause
- An issue exists with the default OADP plug-in and it must be disabled to continue.
- Resolution
-
Do the following steps to disable the plug-in:
- In the OpenShift console, go to .
- Search for the CustomResourceDefinition DataProtectionApplication.
- In the Instances tab, locate the instance that is named velero.
- Open the YAML file in edit mode for the instance.
- Under the entry spec:velero:defaultPlugins, remove the line for openshift.
- Save the YAML file.
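The YAML edit in the last step amounts to deleting one list entry. A minimal fragment follows; the nesting mirrors the field path named in the step, the other plug-in entries are illustrative, and your instance may nest the list under spec.configuration depending on the OADP version:

```yaml
# DataProtectionApplication instance "velero" (fragment).
spec:
  velero:
    defaultPlugins:
      - csi
      - aws
      # - openshift   # remove this entry, then save the YAML file
```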
- Sometimes, backup jobs are stuck in a running state for a long time and are not canceled.
- Resolution
- Do the following steps to resolve the issue:
- Ensure that all jobs are finished and the queue is empty before you perform some disruptive actions like node reboots.
- If jobs are running for a long period of time without making progress, do the following steps to
delete the backup or restore CR directly.
- Log in to IBM Storage Fusion.
- Go to and get the name of the job that is stuck.
- Run the following command to delete a backup job:
oc delete fbackup <job_name>
- Run the following command to delete a restore job:
oc delete frestore <job_name>
- Issues during policy creation
Sometimes, when you create a backup policy, the following error may occur:
Error: Policy daily-snapshot could not created.
- Resolution
- Restart the isf-data-protection-operator-controller-manager-* pod in the IBM Storage Fusion namespace. It triggers the re-creation of the in-place-snapshot BackupStorageLocation CR.
- Backing up to Storage Protect with a Tape BSL fails while transferring data.
This issue occurs when you create a BSL by using the Storage Protect type in the UI. The BSL is created successfully and connects to the S3 cloud storage that it represents, but backups that use this BSL fail.
Note: If you are on IBM Storage Fusion 2.7.1 or higher, this issue is not applicable.
- Resolution
-
Run the following oc command, and replace the BSL name (bsl-to-patch) and the IBM Storage Fusion namespace (ibm-spectrum-fusion-ns) with your values.
oc patch fbsl bsl-to-patch -n ibm-spectrum-fusion-ns --type json -p '[{ "op": "replace", "path": "/spec/type", "value": "s3" }]'
- An attempt to back up multiple VMs fails.
This issue occurs when some VMs are in a migrating state, and OpenShift Container Platform does not support taking a snapshot of a VM that is migrating.
- Resolution
- Follow the steps to resolve this issue:
- Check whether the virtual machine is in a migrating state:
- Run the following command to check for migrating VMs:
oc get virtualmachineinstancemigrations -A
Example output:
NAMESPACE            NAME                                          PHASE         VMI
fb-bm1-fs-1-5g-10    rhel8-lesser-wildcat-migration-8fhbo          Failed        rhel8-lesser-wildcat
vm-centipede-bm2     centos-stream9-chilly-hawk-migration-57jyk    Failed        centos-stream9-chilly-hawk
vm-centos9-bm1-1     centos-stream9-instant-toad-migration-bfyz6   Failed        centos-stream9-instant-toad
vm-centos9-bm1-1     centos-stream9-instant-toad-migration-d9547   Failed        centos-stream9-instant-toad
vm-windows10-bm2-1   kubevirt-workload-update-4dm57                Failed        win10-zealous-unicorn
vm-windows10-bm2-1   kubevirt-workload-update-f2s5w                Failed        win10-zealous-unicorn
vm-windows10-bm2-1   kubevirt-workload-update-gt6nj                Failed        win10-zealous-unicorn
vm-windows10-bm2-1   kubevirt-workload-update-rjwmn                Failed        win10-zealous-unicorn
vm-windows10-bm2-1   kubevirt-workload-update-vfxfl                TargetReady   win10-zealous-unicorn
vm-windows10-bm2-1   kubevirt-workload-update-z2thw                Failed        win10-zealous-unicorn
vm-windows11-bm2-1   kubevirt-workload-update-9gr6v                Failed        win11-graceful-coyote
vm-windows11-bm2-1   kubevirt-workload-update-clbck                Failed        win11-graceful-coyote
vm-windows11-bm2-1   kubevirt-workload-update-j6pmx                Failed        win11-graceful-coyote
vm-windows11-bm2-1   kubevirt-workload-update-sfbbx                Pending       win11-graceful-coyote
vm-windows11-bm2-1   kubevirt-workload-update-th5dd                Failed        win11-graceful-coyote
vm-windows11-bm2-1   kubevirt-workload-update-zl679                Failed        win11-graceful-coyote
vm-windows11-bm2-2   kubevirt-workload-update-7dp6g                Failed        win11-conservative-moth
vm-windows11-bm2-2   kubevirt-workload-update-9nb9m                TargetReady   win11-conservative-moth
vm-windows11-bm2-2   kubevirt-workload-update-cdrf5                Failed        win11-conservative-moth
vm-windows11-bm2-2   kubevirt-workload-update-dm8fz                Failed        win11-conservative-moth
vm-windows11-bm2-2   kubevirt-workload-update-kwr6c                Failed        win11-conservative-moth
vm-windows11-bm2-2   kubevirt-workload-update-zt8wx                Failed        win11-conservative-moth
- Exclude the migrating virtual machine from the backup, and reattempt the backup after the migration is complete.
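As a pre-check sketch, the JSON output of the command above can be filtered for migrations that are still in flight; the field names follow the KubeVirt VirtualMachineInstanceMigration resource, and the sample data is abridged from the example output:

```python
# Hypothetical sketch: given `oc get virtualmachineinstancemigrations -A -o json`,
# list VMIs whose migration has not finished, so they can be excluded from backup.
migrations = {
    "items": [
        {"metadata": {"namespace": "vm-centipede-bm2"},
         "spec": {"vmiName": "centos-stream9-chilly-hawk"},
         "status": {"phase": "Failed"}},
        {"metadata": {"namespace": "vm-windows10-bm2-1"},
         "spec": {"vmiName": "win10-zealous-unicorn"},
         "status": {"phase": "TargetReady"}},
    ]
}

# Succeeded and Failed migrations are finished; anything else is still in flight.
in_flight = [m["spec"]["vmiName"] for m in migrations["items"]
             if m["status"]["phase"] not in ("Succeeded", "Failed")]
print(in_flight)  # ['win10-zealous-unicorn'] -> exclude this VM from the backup
```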
Restore issues
- Backup & Restore currently uses restic to move the Velero backups to repositories on S3 buckets. When a restic repository is initialized, it creates a configuration file and several subdirectories for its snapshots. Because restic never modifies this configuration file after initialization, its modification timestamp never changes. If you configure an expiration policy on the container, it can eventually delete the configuration file and the subdirectories that have not been modified. All restic commands check this configuration file to determine whether the restic repository is initialized. If the file does not exist, restore jobs fail to find the repository and cannot restore the backup. Subsequent backup jobs also cannot find the repository, and initialization attempts can result in a repository with errors.
- Sometimes, you may observe the following error message:
"exec <executable name>": exec format error
For example, the pod log is empty except for this message: exec /filebrowser
This error can be caused by a wrong container architecture, for example, an amd64 container on s390x nodes or an s390x container on amd64 nodes. As a resolution, check whether the architecture of the container that you want to restore matches the local node architecture.
- Restore of namespaces that contain admission webhooks fails.
Example error in the IBM Storage Fusion restore job:
"Failed restore <some resource>" "BMYBR0003 RestorePvcsFailed There was an error when processing the job in the Transaction Manager service"
Example error in the Velero pod:
level=error msg="Namespace domino-platform, resource restore error: error restoring certificaterequests.cert-manager.io/domino-platform/hephaestus-buildkit-client-85k2v: Internal error occurred: failed calling webhook "webhook.cert-manager.io": failed to call webhook: Post "https://cert-manager-webhook.domino-platform.svc:443/mutate?timeout=10s\\": service "cert-manager-webhook" not found"
- Resolution
-
- Identify the admission webhooks that are applicable to the namespace being restored:
oc get mutatingwebhookconfigurations
oc describe mutatingwebhookconfigurations
- Change the failurePolicy parameter from Fail to Ignore to temporarily disable webhook validation before the restore:
failurePolicy: Ignore
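The change corresponds to this fragment of the MutatingWebhookConfiguration; the webhook name is taken from the example error above, and all other fields are omitted:

```yaml
# MutatingWebhookConfiguration fragment; only failurePolicy changes.
webhooks:
  - name: webhook.cert-manager.io   # webhook name from the example error
    failurePolicy: Ignore           # was "Fail"; revert to "Fail" after the restore
```

Remember to restore the original value of failurePolicy once the restore job completes, so that webhook validation is enforced again.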
- When you try to restore backups that were taken before an upgrade, the restore fails with a BMYBR0003 error.
- Diagnosis
-
After you upgrade, your jobs may fail:
- Backup jobs with the status:
"Failed transferring data" "BMYBR0003 There was an error when processing the job in the Transaction Manager service"
- Restore jobs with the status:
"Failed restore <some resource>" "BMYBR0003 There was an error when processing the job in the Transaction Manager service"
Confirm the issue in the logs of the manager container of the Data Mover pod.
A sample error message:
2023-07-26T03:39:47Z ERROR Failed with error. {"controller": "guardiancopyrestore", "controllerGroup": "guardian.isf.ibm.com", "controllerKind": "GuardianCopyRestore", "GuardianCopyRestore": {"name":"52a2abfb-ea9b-422f-a60d-fed59527d38e-r","namespace":"ibm-backup-restore"}, "namespace": "ibm-backup-restore", "name": "52a2abfb-ea9b-422f-a60d-fed59527d38e-r", "reconcileID": "c8642f40-c086-413f-a10e-6d6a85531337", "attempt": 2, "total attempts": 3, "error": "EOF"}
github.ibm.com/ProjectAbell/guardian-dm-operator/controllers/util.Retry
  /workspace/controllers/util/utils.go:39
github.ibm.com/ProjectAbell/guardian-dm-operator/controllers/kafka.(*kafkaWriterConnection).PublishMessage
  /workspace/controllers/kafka/kafka_native_connection.go:71
github.ibm.com/ProjectAbell/guardian-dm-operator/controllers.(*GuardianCopyRestoreReconciler).updateOverallCrStatus
  /workspace/controllers/status.go:191
github.ibm.com/ProjectAbell/guardian-dm-operator/controllers.(*GuardianCopyRestoreReconciler).doRestore
  /workspace/controllers/guardiancopyrestore_controller.go:187
github.ibm.com/ProjectAbell/guardian-dm-operator/controllers.(*GuardianCopyRestoreReconciler).Reconcile
  /workspace/controllers/guardiancopyrestore_controller.go:92
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
  /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:122
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
  /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:323
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
  /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:274
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
  /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:235
- Resolution:
- Search for the guardian-dm-controller-manager pod and delete it. A new pod starts within a minute. After the pod reaches a healthy state, retry the backup and restore jobs.
- A "Failed restore snapshot" error occurs with applications using IBM Storage Scale storage PVCs.
- Cause
- The "disk quota exceeded" error occurs whenever you restore, from an object storage location, applications that use an IBM Storage Scale PVC smaller than 5 GB.
- Resolution
- Increase the IBM Storage Scale PVC size to a minimum of 5 GB, and then run the backup and restore operation again.
Job issues
- If your jobs are stuck in the 'New' state after a node restart, restart the job manager pod.
- If you run many jobs at the same time, performance and scalability issues can occur.
- Resolution
- As a prerequisite, modify the KafkaTopic objects from the hub cluster installation. This cannot
be done on a spoke cluster.
- Log in to OpenShift Container Platform console.
- Go to Operators > Installed Operators.
- Select AMQ Streams operator.
- Click Kafka Topic tab.
- Do the following steps to edit the backup, restore, and dm-job-info topic objects:
- Click the topic object and choose the YAML tab to edit the object.
- Under spec.partitions, change the value to 10.
- Click Save.
If you want to use commands to modify the KafkaTopic objects, run the following commands:
oc patch -n ibm-backup-restore KafkaTopic backup --type merge -p '{"spec":{"partitions":10}}'
oc patch -n ibm-backup-restore KafkaTopic restore --type merge -p '{"spec":{"partitions":10}}'
oc patch -n ibm-backup-restore KafkaTopic dm-job-info --type merge -p '{"spec":{"partitions":10}}'
The ibm-backup-restore namespace is the default. Replace it with your custom namespace.
Do the following workaround steps:
Note: It must be done on a hub cluster installation. It cannot be done on a spoke cluster.
Using commands:
Restart the application-controller on the cluster where the application runs. If the application is on the same cluster as the hub, restart the application-controller pod on the hub. If the application is on a spoke cluster, restart the application-controller on the spoke cluster. To restart by using the command option, run the following command:
oc delete pod -n <namespace of ibm-backup-restore install> --selector app.kubernetes.io/component=application-controller
Using the OpenShift Container Platform console:
- Go to .
- Select the install namespace of the Backup & Restore agent. The default is ibm-backup-restore.
- Search for application-controller and find the application-controller line.
- From the ellipsis overflow menu, click Delete pod.
- Wait for the application-controller pod to restart and reach the ready state. The missing application is now available for backup in the IBM Storage Fusion user interface.
Known issues
- In the Backup & Restore service page of the OpenShift Container Platform console, if the backup policy assignment to an application fails with a gateway timeout error, use the IBM Storage Fusion user interface instead.
- The restore job fails because it is unable to restore the route. The solution is to ensure that the
restore namespace + nameOfRestoreCluster are not longer than the original namespace and the nameOfCluster.
- You cannot restore multiple namespaces to a single alternative namespace. If you attempt such a
restore, the job fails.
Example transaction manager log:
2023-06-27 15:05:53,633[TM_5][c2a4b768-2acc-4169-9c5f-f88e60f3be2b][restoreguardian:restore_application_w_recipe Line 172][INFO] - altNS: ns2, number of ns: 2
2023-06-27 15:05:53,633[TM_5][c2a4b768-2acc-4169-9c5f-f88e60f3be2b][restoreguardian:restore_application_w_recipe Line 176][ERROR] - Alternate namespace specified, but more than one namespace was backed up
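The log message corresponds to a simple restriction: an alternate namespace is only valid when exactly one namespace was backed up. A hypothetical sketch of the check (function name and shape are illustrative, not the transaction manager's actual code):

```python
# Illustrative validation behind the error message above.
def validate_alternate_namespace(backed_up_namespaces, alt_ns):
    if alt_ns and len(backed_up_namespaces) > 1:
        raise ValueError(
            "Alternate namespace specified, but more than one namespace was backed up"
        )

validate_alternate_namespace(["ns1"], "ns2")  # OK: a single namespace was backed up
try:
    validate_alternate_namespace(["ns1", "ns2"], "ns2")  # two namespaces -> rejected
except ValueError as e:
    print(e)
```

To restore a multi-namespace backup, omit the alternate namespace so that each namespace restores under its original name.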
- You cannot restore to a cluster that does not have a storage class identical to the source
cluster's. However, the transaction manager still attempts to create PVCs with the nonexistent
storage class on the spoke cluster and eventually fails with a Failed restore snapshot status.
- After an upgrade, backup information taken before IBM Storage Fusion 2.6 is not available in the Backed up applications page.
- The new Backed up applications page does not show the details of the application when you
upgrade IBM Storage Fusion to the latest version while leaving
the Backup & Restore service at the older version.
As a resolution, upgrade the Backup & Restore service to the latest version after an IBM Storage Fusion upgrade.
- The backup applications table does not show the new backup times for the backed-up applications. Go to the Applications and Jobs view to see the last successful backup job for a given application. For applications on the hub, the Applications table has the correct last backup time.