IBM Storage Fusion Backup & Restore service issues

List of known Backup & Restore issues in IBM Storage Fusion.

Service protection issues

  • Service protection can be configured on one cluster with both application and service backups. You can use the same cloud storage bucket to configure service protection on a second cluster and restore service and application backups from the first cluster. However, backups that no longer exist on cloud storage still appear in the user interface, and attempts to restore those backups fail.
    1. If the first cluster remains as is, the retention period on the original backups can expire, and the backups get removed from cloud storage. Because the second cluster is unaware of the removal, it can attempt to restore the removed backups. The attempt fails because the backup on cloud storage no longer exists.
    2. If you uninstall the Backup & Restore service from the first cluster, use the -s option to prevent DeleteBackupRequest CRs from getting created. If you do not set this option, the backups on cloud storage get removed, and the second cluster is again unaware that they no longer exist on the cloud storage.
    Note: The first deployment must not exist during the configuration of the second cluster.
  • The service protection (scheduled) backups fail with a Failed snapshot error.
    Cause
    The pod that exports the service protection data gets terminated during the data export because of a lack of resources on the node where it runs.
    Resolution
    Address the resource shortage on the node and rerun the service protection backup.
  • After a service protection restore of the control plane and a subsequent application restore, the following four columns are not populated on the backed up applications pane:
    • Backup status
    • Last backup on
    • Success rate
    • Backup capacity
    Resolution
    To populate those columns with data, run the backup again for that application. Note that the columns do not include the counts from before the restore.

Backup issues

  • The OpenShift® Container Platform cluster can have problems and become unusable. After you recover the cluster, rejoin the connections.
  • If you have a PolicyAssignment for an application on the hub and you create a PolicyAssignment for the same application on the spoke, your attempt to assign a backup policy for the application fails. In both assignments, the application, backup policy, and short-form cluster name are the same. The current format of the PolicyAssignment CR name is appName-backupPolicyName-shortFormClusterName. The issue happens when the first segment of the cluster names is identical. In this scenario, the creation gets rejected because the PolicyAssignment name already exists in OpenShift Container Platform.

    For example:

    Hub assignment creates app1-bp1-apps:
    • Application - app1
    • BackupPolicy - bp1
    • AppCluster - apps.cluster1
    Spoke assignment creates app1-bp1-apps (This is rejected by OpenShift Container Platform)
    • Application - app1
    • BackupPolicy - bp1
    • AppCluster - apps.cluster2

    Resolution:

    To create the PolicyAssignment for the spoke application, delete the PolicyAssignment CR for the hub application assignment and attempt spoke application assignment again.
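    The collision can be sketched in a few lines of shell. The names are the examples from this page; deriving the short-form cluster name from the first dot-separated segment is an assumption based on the observed behavior:

    ```shell
    # Hypothetical sketch of how the PolicyAssignment CR name is derived.
    # The short-form cluster name is assumed to be the first dot-separated
    # segment of the full cluster name, as in the example above.
    policy_assignment_name() {
      app=$1; policy=$2; cluster=$3
      short=${cluster%%.*}               # apps.cluster1 -> apps
      printf '%s-%s-%s\n' "$app" "$policy" "$short"
    }

    policy_assignment_name app1 bp1 apps.cluster1   # prints app1-bp1-apps
    policy_assignment_name app1 bp1 apps.cluster2   # prints app1-bp1-apps (duplicate name, so rejected)
    ```

    Because both calls produce the same name, the second CR creation is rejected by the API server.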

  • Sometimes, backups do not run as defined in the backup policies, especially when you set hourly policies. For example, if you set a policy for every two hours and it does not run every two hours, gaps exist in the backup history. A possible reason is that after a pod crash and restart, the scheduled jobs did not account for the time zone, causing gaps in the run intervals.
    Diagnosis
    The following are the observed symptoms:
    • Policies with a custom "every X hours at minute YY" schedule: the first scheduled run of the policy occurs at minute YY after X hours plus the time zone offset from UTC, instead of at minute YY after X hours.
    • Monthly and yearly policies run more frequently than scheduled.
    Resolution
    You can start backups manually until the next scheduled time.
  • You can encounter an error when you attempt a backup operation with the IBM Storage Fusion Backup & Restore service that is deployed in IBM Cloud Satellite.
    Backup operations fail with the following log entries:
    
    level=error msg="Error backing up item" backup=<item> error="error executing custom action (groupResource=pods, namespace=<namespace>, name=<name>): rpc error: code = Unknown desc = configmaps \"config\" not found" error.file="/remote-source/velero/app/pkg/backup/item_backupper.go:326" error.function="github.com/vmware-tanzu/velero/pkg/backup.(*itemBackupper).executeActions" logSource="/remote-source/velero/app/pkg/backup/backup.go:417" name=<name>
    level=error msg="Error backing up item" backup=<item> error="error executing custom action (groupResource=replicasets.apps, namespace=<namespace>, name=<name>): rpc error: code = Unknown desc = configmaps \"config\" not found" error.file="/remote-source/velero/app/pkg/backup/item_backupper.go:326" error.function="github.com/vmware-tanzu/velero/pkg/backup.(*itemBackupper).executeActions" logSource="/remote-source/velero/app/pkg/backup/backup.go:417" name=<name>
    level=error msg="Error backing up item" backup=<item> error="error executing custom action (groupResource=deployments.apps, namespace=<namespace>, name=<name>): rpc error: code = Unknown desc = configmaps \"config\" not found" error.file="/remote-source/velero/app/pkg/backup/item_backupper.go:326" error.function="github.com/vmware-tanzu/velero/pkg/backup.(*itemBackupper).executeActions" logSource="/remote-source/velero/app/pkg/backup/backup.go:417" name=<name>
    
    Cause
    An issue exists with the default OADP plug-in, and it must be disabled to continue.
    Resolution

    Do the following steps to disable the plug-in:

    1. In the OpenShift console, go to Administration > CustomResourceDefinitions.
    2. Search for the CustomResourceDefinition DataProtectionApplication.
    3. In the Instances tab, locate the instance that is named velero.
    4. Open the YAML file in edit mode for the instance.
    5. Under the entry spec:velero:defaultPlugins, remove the line for openshift.
    6. Save the YAML file.
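    For reference, the edit in steps 4 and 5 amounts to removing the openshift entry from the plug-in list. A minimal before/after sketch of the relevant section (surrounding fields omitted; the other plug-in names are illustrative and may differ in your instance):

    ```yaml
    # Before: the default plug-in list includes openshift
    spec:
      velero:
        defaultPlugins:
          - openshift   # remove this line
          - aws
          - csi
    # After: the openshift entry is removed
    spec:
      velero:
        defaultPlugins:
          - aws
          - csi
    ```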
  • Sometimes, backup jobs are stuck in a running state for a long time and are not canceled.
    Resolution
    Do the following steps to resolve the issue:
    1. Ensure that all jobs are finished and the queue is empty before you perform disruptive actions such as node reboots.
    2. If jobs run for a long period of time without making progress, follow these steps to delete the backup or restore CR directly.
      1. Log in to IBM Storage Fusion.
      2. Go to Backup & Restore > Jobs > Queue and get the name of the job that is stuck.
      3. Run the following command to delete the backup job:
        oc delete fbackup <job_name>
      4. Run the following command to delete the restore job:
        oc delete frestore <job_name>
  • Issues during policy creation
    Sometimes, when you create a backup policy, the following error may occur:
    Error: Policy daily-snapshot could not created. 
    
    Resolution
    Restart the isf-data-protection-operator-controller-manager-* pod in the IBM Storage Fusion namespace. The restart triggers the re-creation of the in-place-snapshot BackupStorageLocation CR.
  • Backing up to Storage Protect with Tape BSL fails while transferring data.
    This issue occurs when you create a BSL with the Storage Protect type in the UI. The BSL is created successfully and connects to the S3 cloud storage that it represents, but backups that use this BSL fail.
    Note: If you are on IBM Storage Fusion 2.7.1 or higher, then this issue is not applicable.
    Resolution
    Run the following oc command, replacing the BSL name (bsl-to-patch) and the IBM Storage Fusion namespace (ibm-spectrum-fusion-ns) with your values.
    oc patch fbsl bsl-to-patch -n ibm-spectrum-fusion-ns --type json -p '[{ "op": "replace", "path": "/spec/type", "value": "s3" }]'
    
  • A backup attempt of multiple VMs fails.

    This issue occurs when some VMs are in a migrating state, because OpenShift Container Platform does not support snapshots of VMs that are migrating.

    Resolution
    Follow these steps to resolve the issue:
    1. Check whether any virtual machines are in a migrating state.
    2. Run the following command to list the migrating VMs:
      oc get virtualmachineinstancemigrations -A
      Example output:
      NAMESPACE            NAME                                          PHASE         VMI
      fb-bm1-fs-1-5g-10    rhel8-lesser-wildcat-migration-8fhbo          Failed        rhel8-lesser-wildcat
      vm-centipede-bm2     centos-stream9-chilly-hawk-migration-57jyk    Failed        centos-stream9-chilly-hawk
      vm-centos9-bm1-1     centos-stream9-instant-toad-migration-bfyz6   Failed        centos-stream9-instant-toad
      vm-centos9-bm1-1     centos-stream9-instant-toad-migration-d9547   Failed        centos-stream9-instant-toad
      vm-windows10-bm2-1   kubevirt-workload-update-4dm57                Failed        win10-zealous-unicorn
      vm-windows10-bm2-1   kubevirt-workload-update-f2s5w                Failed        win10-zealous-unicorn
      vm-windows10-bm2-1   kubevirt-workload-update-gt6nj                Failed        win10-zealous-unicorn
      vm-windows10-bm2-1   kubevirt-workload-update-rjwmn                Failed        win10-zealous-unicorn
      vm-windows10-bm2-1   kubevirt-workload-update-vfxfl                TargetReady   win10-zealous-unicorn
      vm-windows10-bm2-1   kubevirt-workload-update-z2thw                Failed        win10-zealous-unicorn
      vm-windows11-bm2-1   kubevirt-workload-update-9gr6v                Failed        win11-graceful-coyote
      vm-windows11-bm2-1   kubevirt-workload-update-clbck                Failed        win11-graceful-coyote
      vm-windows11-bm2-1   kubevirt-workload-update-j6pmx                Failed        win11-graceful-coyote
      vm-windows11-bm2-1   kubevirt-workload-update-sfbbx                Pending       win11-graceful-coyote
      vm-windows11-bm2-1   kubevirt-workload-update-th5dd                Failed        win11-graceful-coyote
      vm-windows11-bm2-1   kubevirt-workload-update-zl679                Failed        win11-graceful-coyote
      vm-windows11-bm2-2   kubevirt-workload-update-7dp6g                Failed        win11-conservative-moth
      vm-windows11-bm2-2   kubevirt-workload-update-9nb9m                TargetReady   win11-conservative-moth
      vm-windows11-bm2-2   kubevirt-workload-update-cdrf5                Failed        win11-conservative-moth
      vm-windows11-bm2-2   kubevirt-workload-update-dm8fz                Failed        win11-conservative-moth
      vm-windows11-bm2-2   kubevirt-workload-update-kwr6c                Failed        win11-conservative-moth
      vm-windows11-bm2-2   kubevirt-workload-update-zt8wx                Failed        win11-conservative-moth
    3. Exclude the migrating virtual machines from the backup. Reattempt the backup after the migration is complete.
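    If you want to script the check, a small filter over the command output can list only the migrations that are still in progress. This sketch treats any phase other than Succeeded or Failed as in progress (an assumption; adjust the phase list as needed):

    ```shell
    # Print NAMESPACE/VMI for migrations that are not in a terminal phase.
    # Feed it real data with:
    #   oc get virtualmachineinstancemigrations -A --no-headers | list_in_progress
    list_in_progress() {
      awk '$3 != "Failed" && $3 != "Succeeded" { print $1 "/" $4 }'
    }

    # Demonstration with two rows from the sample output above:
    printf '%s\n' \
      'vm-windows10-bm2-1   kubevirt-workload-update-vfxfl   TargetReady   win10-zealous-unicorn' \
      'fb-bm1-fs-1-5g-10    rhel8-lesser-wildcat-migration-8fhbo   Failed   rhel8-lesser-wildcat' |
      list_in_progress
    # prints: vm-windows10-bm2-1/win10-zealous-unicorn
    ```

    The VMIs that the filter prints are the ones to exclude from the backup until their migrations finish.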

Restore issues

  • Backup & Restore currently uses restic to move the Velero backups to repositories on S3 buckets. When a restic repository is initialized, it creates a configuration file and several subdirectories for its snapshots. Because restic never modifies this configuration file after initialization, its modification timestamp never changes. If you configure an expiration policy on the container, it can eventually delete the configuration file and subdirectories that were never modified. All restic commands check this configuration file to determine whether the restic repository is initialized. If the file does not exist, restore jobs fail to find the repository and cannot restore the backup. Subsequent backup jobs cannot find the repository, and initialization attempts can result in a repository with errors.
  • Sometimes, you may observe the following error message:
    "exec <executable name>": exec format error
    For example:
    The pod log is empty except for this message: exec /filebrowser 
    The example error can be caused by a container image built for the wrong architecture, for example, an amd64 container on s390x nodes or an s390x container on amd64 nodes. As a resolution, check whether the architecture of the container that you want to restore matches the architecture of the local node.
  • Restore of namespaces that contain admission webhooks fails.
    Example error in IBM Storage Fusion restore job:
    "Failed restore <some resource>" "BMYBR0003 RestorePvcsFailed There was an error when processing the job in the Transaction Manager service"
    Example error in Velero pod:
    level=error msg="Namespace domino-platform, resource restore error: error restoring certificaterequests.cert-manager.io/domino-platform/hephaestus-buildkit-client-85k2v: Internal error occurred: failed calling webhook \"webhook.cert-manager.io\": failed to call webhook: Post \"https://cert-manager-webhook.domino-platform.svc:443/mutate?timeout=10s\": service \"cert-manager-webhook\" not found"
    Resolution
    1. Identify the admission webhooks that are applicable to the namespace being restored:
      oc get mutatingwebhookconfigurations
      oc describe mutatingwebhookconfigurations
    2. Change the failurePolicy parameter from Fail to Ignore to temporarily disable webhook validation before the restore:
      failurePolicy: Ignore
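    The same change can be applied from the command line with oc patch. The webhook configuration name and the entry index below are placeholders; take the real name from step 1 and repeat the patch for each webhook entry that applies. The sketch echoes the command as a dry run; drop the leading echo to apply it:

    ```shell
    # Hypothetical webhook configuration name; take the real one from
    # `oc get mutatingwebhookconfigurations` in step 1.
    WEBHOOK=cert-manager-webhook
    # JSON patch that flips failurePolicy on the first webhook entry (index 0).
    PATCH='[{"op": "replace", "path": "/webhooks/0/failurePolicy", "value": "Ignore"}]'
    # Echoed as a dry run; remove `echo` to apply the change to the cluster.
    echo oc patch mutatingwebhookconfiguration "$WEBHOOK" --type json -p "$PATCH"
    ```

    Remember to revert failurePolicy to Fail after the restore completes, so the webhook validation is enforced again.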
  • When you try to restore backups that were taken before an upgrade, the restore fails with a BMYBR0003 error.
    Diagnosis

    After you upgrade, your jobs might fail with the following statuses:

    • Backup jobs with the status:
      "Failed transferring data" "BMYBR0003 There was an error when processing the job in the Transaction Manager service"
    • Restore jobs with the status:
      "Failed restore <some resource>" "BMYBR0003 There was an error when processing the job in the Transaction Manager service"

    Confirm the issue in the logs of the manager container of the Data Mover pod.

    A sample error message:

    2023-07-26T03:39:47Z	ERROR	Failed with error.	{"controller": "guardiancopyrestore", "controllerGroup": "guardian.isf.ibm.com", "controllerKind": "GuardianCopyRestore", "GuardianCopyRestore": {"name":"52a2abfb-ea9b-422f-a60d-fed59527d38e-r","namespace":"ibm-backup-restore"}, "namespace": "ibm-backup-restore", "name": "52a2abfb-ea9b-422f-a60d-fed59527d38e-r", "reconcileID": "c8642f40-c086-413f-a10e-6d6a85531337", "attempt": 2, "total attempts": 3, "error": "EOF"}
    github.ibm.com/ProjectAbell/guardian-dm-operator/controllers/util.Retry
    	/workspace/controllers/util/utils.go:39
    github.ibm.com/ProjectAbell/guardian-dm-operator/controllers/kafka.(*kafkaWriterConnection).PublishMessage
    	/workspace/controllers/kafka/kafka_native_connection.go:71
    github.ibm.com/ProjectAbell/guardian-dm-operator/controllers.(*GuardianCopyRestoreReconciler).updateOverallCrStatus
    	/workspace/controllers/status.go:191
    github.ibm.com/ProjectAbell/guardian-dm-operator/controllers.(*GuardianCopyRestoreReconciler).doRestore
    	/workspace/controllers/guardiancopyrestore_controller.go:187
    github.ibm.com/ProjectAbell/guardian-dm-operator/controllers.(*GuardianCopyRestoreReconciler).Reconcile
    	/workspace/controllers/guardiancopyrestore_controller.go:92
    sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
    	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:122
    sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
    	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:323
    sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:274
    sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
    	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:235
    
    Resolution:
    Search for the guardian-dm-controller-manager pod and delete it. A new pod starts within a minute. After the pod reaches a healthy state, retry the backup and restore jobs.
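    A minimal sketch of the restart, assuming the ibm-backup-restore namespace shown in the error message above (adjust if your installation uses a custom namespace). The snippet prints the delete command as a dry run; set APPLY=1 to run it:

    ```shell
    NS=ibm-backup-restore   # Backup & Restore namespace; adjust if customized
    # Build the command that deletes the guardian-dm-controller-manager pod so
    # that its controller starts a fresh replacement pod.
    CMD="oc delete -n $NS \$(oc get pods -n $NS -o name | grep guardian-dm-controller-manager)"
    if [ "${APPLY:-0}" = "1" ]; then
      eval "$CMD"
    else
      echo "DRY RUN: $CMD"
    fi
    ```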
  • A "Failed restore snapshot" error occurs with applications using IBM Storage Scale storage PVCs.
    Cause
    The "disk quota exceeded" error occurs when you restore, from an object storage location, applications that use an IBM Storage Scale PVC smaller than 5 GB.
    Resolution
    Increase the IBM Storage Scale PVC size to a minimum of 5 GB, and then run the backup and restore operations again.

Job issues

  • If your jobs are stuck in the 'New' state after a node restart, restart the job manager pod.
  • If you run many jobs at the same time, performance and scalability issues can occur.
    Resolution
    As a prerequisite, modify the KafkaTopic objects on the hub cluster installation. This modification cannot be done on a spoke cluster.
    1. Log in to OpenShift Container Platform console.
    2. Go to Operators > Installed Operators.
    3. Select AMQ Streams operator.
    4. Click Kafka Topic tab.
    5. Do the following steps to edit backup, restore, and dm-job-info topic objects:
      1. Click the topic object and choose the YAML tab to edit the object.
      2. Change the spec.partitions value to 10.
      3. Click Save.
    If you want to use commands to modify KafkaTopic objects, run the following commands:
    
    oc patch -n ibm-backup-restore KafkaTopic backup --type merge -p '{"spec":{"partitions":10}}'
    oc patch -n ibm-backup-restore KafkaTopic restore --type merge -p '{"spec":{"partitions":10}}'
    oc patch -n ibm-backup-restore KafkaTopic dm-job-info --type merge -p '{"spec":{"partitions":10}}'
    
    
    In these commands, ibm-backup-restore is the default namespace. Replace it with your custom namespace if it differs.

    Do the following workaround steps:

    Note: It must be done on a Hub cluster install. It cannot be done on a Spoke cluster.
    Using commands:
    Restart the application-controller pod on the cluster where the application resides. If the application is on the same cluster as the hub, restart the application-controller pod on the hub. If the application is on a spoke cluster, restart the application-controller pod on the spoke cluster. To restart by using the command line, run the following command:
    oc delete pod -n <namespace of ibm-backup-restore install> --selector app.kubernetes.io/component=application-controller
    

    Using OpenShift Container Platform console:

    1. Go to Workloads > Pods.
    2. Select the installation namespace of the Backup & Restore agent. The default is ibm-backup-restore.
    3. Search for application-controller and find the application-controller line.
    4. From the ellipsis overflow menu, click Delete pod.
    5. Wait for the application-controller pod to restart and reach the ready state. The missing application is then available for backup in the IBM Storage Fusion user interface.

Known issues

  • In the Backup & Restore service page of the OpenShift Container Platform console, if the backup policy assignment to an application fails with a gateway timeout error, use the IBM Storage Fusion user interface instead.
  • The restore job fails because it is unable to restore the route. To avoid this issue, ensure that the combined length of the restore namespace and the name of the restore cluster is not longer than the combined length of the original namespace and the name of the original cluster.
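    The length constraint can be checked before you start the restore. A sketch with hypothetical names (substitute your own namespaces and cluster names):

    ```shell
    # Hypothetical source and restore-target names; substitute your own.
    src_ns=my-app-ns;        src_cluster=cluster-a
    dst_ns=my-app-ns-copy;   dst_cluster=cluster-b
    # Per the guidance above, the restore namespace + cluster name must not be
    # longer than the original namespace + cluster name.
    if [ $(( ${#dst_ns} + ${#dst_cluster} )) -le $(( ${#src_ns} + ${#src_cluster} )) ]; then
      echo "OK: route restore length check passed"
    else
      echo "WARNING: restore of the route may fail"
    fi
    ```

    With these example names the check prints the warning, because the target namespace is longer than the source namespace.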
  • You cannot restore multiple namespaces to a single alternative namespace. If you attempt such a restore, then the job fails. Example transaction manager log:
    2023-06-27 15:05:53,633[TM_5][c2a4b768-2acc-4169-9c5f-f88e60f3be2b][restoreguardian:restore_application_w_recipe Line 172][INFO] - altNS: ns2, number of ns: 2
    
    2023-06-27 15:05:53,633[TM_5][c2a4b768-2acc-4169-9c5f-f88e60f3be2b][restoreguardian:restore_application_w_recipe Line 176][ERROR] - Alternate namespace specified, but more than one namespace was backed up
  • You cannot restore to a cluster that does not have an identical storage class as the source cluster. However, the transaction manager still attempts to create PVCs with the non-existent storage class on the spoke cluster and eventually fails with Failed restore snapshot status.
  • After an upgrade, information for backups taken before IBM Storage Fusion 2.6 is not available on the Backed up applications page.
  • The new backed up applications page does not show the details of the application when you upgrade IBM Storage Fusion to the latest version while leaving the Backup & Restore service at the older version.

    As a resolution, upgrade the Backup & Restore service to the latest version after an IBM Storage Fusion upgrade.

  • The backup applications table does not show the new backup times for the backed-up applications. Go to the Applications and Jobs view to see the last successful backup job for a given application. For applications on the hub, the Applications table has the correct last backup time.