Known issues and limitations for IBM Cloud Pak for Data

The following issues apply to the IBM Cloud Pak for Data platform. Each issue includes information about the releases that it applies to. If the issue was fixed in a refresh, that information is also included.

Customer-reported issues

Issues that are found after the release are posted on the IBM Support site.

General issues

After rebooting a cluster that uses OpenShift Data Foundation storage, some Cloud Pak for Data services aren't functional

Applies to: 5.0.0 and later

Diagnosing the problem
After rebooting the cluster, some Cloud Pak for Data custom resources remain in the InProgress state.

For more information about this problem, see Missing NodeStageVolume RPC call blocks new pods from going into Running state in the Red Hat® OpenShift® Data Foundation 4.1.4 release notes.

Workaround
Do the following steps:
  1. Find the nodes that have pods that are in an Error state:
    oc get po -n ${PROJECT_CPD_INST_OPERANDS} -o wide | grep -v -P "Completed|(\d+)\/\1"
  2. Mark each node as unschedulable:
    oc adm cordon <node_name>
  3. Delete the affected pods:
    oc get pod   | grep -Ev "Comp|0/0|1/1|2/2|3/3|4/4|5/5|6/6|7/7"|awk '{print $1}' |xargs oc delete po --force=true --grace-period=0
  4. Mark each node as schedulable:
    oc adm uncordon <node_name>
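
The filtering in step 1 can also be expressed as a small helper that prints only the node names to cordon. This is a minimal sketch, not part of the documented workaround: the function name nodes_with_unready_pods is hypothetical, and it assumes the default column layout of `oc get po -o wide` (READY in field 2, STATUS in field 3, and a NODE header that lines up with the node names).

```shell
# Sketch: print the unique nodes that host pods which are not ready.
# Input (stdin): the output of `oc get po -o wide`, header row included.
# Assumptions: READY is field 2, STATUS is field 3, and the NODE column is
# found by its header offset because RESTARTS can contain spaces.
nodes_with_unready_pods() {
  awk '
    NR == 1 { node_start = index($0, "NODE"); next }
    {
      split($2, ready, "/")
      # Skip finished jobs and pods whose containers are all ready.
      if ($3 == "Completed" || ready[1] == ready[2]) next
      # The node name is the first word at the NODE column offset.
      split(substr($0, node_start), parts, " ")
      print parts[1]
    }
  ' | sort -u
}
```

To use it, pipe the pod listing into the function, for example: oc get po -n ${PROJECT_CPD_INST_OPERANDS} -o wide | nodes_with_unready_pods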

The Assist me icon is not displayed in the web client

Applies to: Upgrades from Version 4.8.x

Fixed in: 5.0.3

If you upgrade IBM Cloud Pak for Data from Version 4.8.x to Version 5.0, the Assist me icon is not visible in the web client toolbar.

The issue occurs because the ASSIST_ME_ENABLED option is set to false.

Resolving the problem
To make Assist me available in the web client:
  1. Log in to Red Hat OpenShift Container Platform as a user with sufficient permissions to complete the task.
    ${OC_LOGIN}
    Remember: OC_LOGIN is an alias for the oc login command.
  2. Run the following command to set ASSIST_ME_ENABLED to true:
    oc patch cm product-configmap \
    --namespace=${PROJECT_CPD_INST_OPERANDS} \
    --type=merge \
    --patch '{"data": {"ASSIST_ME_ENABLED": "true"}}'
  3. Confirm that the ASSIST_ME_ENABLED parameter is set to true:
    oc get cm product-configmap \
    --namespace=${PROJECT_CPD_INST_OPERANDS} \
    -o jsonpath="{.data.ASSIST_ME_ENABLED}{'\n'}"

The delete-platform-ca-certs command does not remove certificate mounts from pods

Applies to: 5.0.0

Fixed in: 5.0.3

When you run the cpd-cli manage delete-platform-ca-certs command, the command does not remove the certificate mounts from pods.

Resolving the problem
To remove the certificate mounts from pods:
  1. Delete the cpd-custom-ca-certs secret:
    oc delete secret cpd-custom-ca-certs \
    --namespace=${PROJECT_CPD_INST_OPERANDS}
  2. Run the cpd-cli manage delete-platform-ca-certs command:
    cpd-cli manage delete-platform-ca-certs \
    --cpd_instance_ns=${PROJECT_CPD_INST_OPERANDS} \
    --apply=true

When you add a secret to a vault, you cannot filter the list of users and groups to show only groups

Applies to: 5.0.0

Fixed in: 5.0.3

When you add a secret to a vault, you can optionally share the secret with other users. However, if you try to filter the list of users and groups to show only groups, the filter does not take effect.

Installation and upgrade issues

The Switch locations icon is not available if the apply-cr command times out

Applies to: 5.0.0, 5.0.1, and 5.0.2

Fixed in: 5.0.3

If you install solutions that are available in different Cloud Pak for Data experiences, the Switch locations icon is not available in the web client if the cpd-cli manage apply-cr command times out.

Resolving the problem
Re-run the cpd-cli manage apply-cr command.

Upgrades fail if the Data Foundation Rook Ceph cluster is unstable

Applies to: 5.0.0 and later

If the Red Hat OpenShift Data Foundation or IBM Storage Fusion Data Foundation Rook Ceph® cluster is unstable, upgrades fail.

One symptom is that pods will not start because of a FailedMount error. For example:

Warning  FailedMount  36s (x1456 over 2d1h)   kubelet  MountVolume.MountDevice failed for volume 
"pvc-73bf3705-43e9-40bd-87ed-c1e1656d6f12" : rpc error: code = Aborted desc = an operation with the given 
Volume ID 0001-0011-openshift-storage-0000000000000001-5e17508b-c295-4306-b684-eaa327aec2ab already exists
Diagnosing the problem
To confirm whether the Data Foundation Rook Ceph cluster is unstable:
  1. Ensure that the rook-ceph-tools pod is running.
    oc get pods -n openshift-storage | grep rook-ceph-tools
    Note: On IBM Storage Fusion HCI System or on environments that use hosted control planes, the pods are running in the openshift-storage-client project.
  2. Set the TOOLS_POD environment variable to the name of the rook-ceph-tools pod:
    export TOOLS_POD=<pod-name>
  3. Open a remote shell in the rook-ceph-tools pod:
    oc rsh -n openshift-storage ${TOOLS_POD}
  4. Run the following command to get the status of the Rook Ceph cluster:
    ceph status
    Confirm that the output includes the following line:
    health: HEALTH_WARN
  5. Exit the pod:
    exit
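
If you capture the ceph status output instead of reading it interactively, the health check in step 4 can be scripted. The following is a minimal sketch; the function names ceph_health and ceph_is_warn are hypothetical, and it assumes that the output contains a line of the form "health: HEALTH_WARN", as shown in step 4.

```shell
# Sketch: extract the health value from `ceph status` output on stdin.
ceph_health() {
  awk '$1 == "health:" { print $2; exit }'
}

# Succeeds (exit code 0) when the reported health is HEALTH_WARN.
ceph_is_warn() {
  [ "$(ceph_health)" = "HEALTH_WARN" ]
}
```

For example: oc rsh -n openshift-storage ${TOOLS_POD} ceph status | ceph_is_warn && echo "Rook Ceph cluster is unstable"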
Resolving the problem
To resolve the problem:
  1. Get the names of the rook-ceph-mgr pods:
    oc get pods -n openshift-storage | grep rook-ceph-mgr
  2. Set the MGR_POD_A environment variable to the name of the rook-ceph-mgr-a pod:
    export MGR_POD_A=<rook-ceph-mgr-a-pod-name>
  3. Set the MGR_POD_B environment variable to the name of the rook-ceph-mgr-b pod:
    export MGR_POD_B=<rook-ceph-mgr-b-pod-name>
  4. Delete the rook-ceph-mgr-a pod:
    oc delete pods ${MGR_POD_A} -n openshift-storage
  5. Ensure that the rook-ceph-mgr-a pod is running before you move to the next step:
    oc get pods -n openshift-storage | grep rook-ceph-mgr
  6. Delete the rook-ceph-mgr-b pod:
    oc delete pods ${MGR_POD_B} -n openshift-storage
  7. Ensure that the rook-ceph-mgr-b pod is running:
    oc get pods -n openshift-storage | grep rook-ceph-mgr
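
Steps 5 and 7 ask you to wait until a restarted pod is running. One way to automate that wait is a small polling helper, sketched below. The function names wait_until and pods_running are hypothetical, and the sketch assumes that STATUS is the third column of the oc get pods output.

```shell
# wait_until TRIES DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or TRIES attempts have been made.
wait_until() {
  _tries=$1
  _delay=$2
  shift 2
  while [ "$_tries" -gt 0 ]; do
    if "$@"; then
      return 0
    fi
    _tries=$((_tries - 1))
    if [ "$_tries" -gt 0 ]; then
      sleep "$_delay"
    fi
  done
  return 1
}

# pods_running PREFIX
# Input (stdin): `oc get pods` output. Succeeds when at least one pod name
# starts with PREFIX and every such pod has STATUS (field 3) Running.
pods_running() {
  awk -v prefix="$1" '
    index($1, prefix) == 1 { seen = 1; if ($3 != "Running") bad = 1 }
    END { if (seen && !bad) exit 0; exit 1 }
  '
}
```

For example, define check_mgr_a() { oc get pods -n openshift-storage | pods_running rook-ceph-mgr-a; } and run wait_until 30 10 check_mgr_a before you delete the rook-ceph-mgr-b pod. The retry count and delay are arbitrary values; adjust them for your cluster.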

Running the apply-olm command twice during an upgrade can remove required OLM resources

Applies to:
  • Upgrades from Version 4.7 to 5.0.0
  • Upgrades from Version 4.8 to 5.0.0

Upgrades to later 5.0 refreshes are not affected.

If you run the cpd-cli manage apply-olm command two times, you might notice several problems:
  • The operator subscription is missing
  • The operator cluster service version (CSV) is missing
If you continue the upgrade by running the cpd-cli manage apply-cr command, you might notice additional problems:
  • The version information is missing from the spec section of the service custom resource
  • When you run the cpd-cli manage get-cr-status command, the values for the Version and Reconciled-version parameters are different.
Resolving the problem
To resolve the problem, you must re-run the cpd-cli manage apply-olm command a third time to ensure that the required resources are available. Then, re-run the cpd-cli manage apply-cr command.

After an upgrade from Cloud Pak for Data 4.7.3, FoundationDB can indicate a Failed status

Applies to: Upgrades from Version 4.7.3 to 5.0.0 and later

After upgrading Cloud Pak for Data from Version 4.7.3 to 5.0, the status of the FoundationDB cluster can indicate that it has failed (fdbStatus: Failed). The Failed status can occur even if FoundationDB is available and working correctly. This issue occurs when the FoundationDB resources do not get properly cleaned up by the upgrade.

This issue affects deployments of the following services.
  • IBM Knowledge Catalog
  • IBM Match 360
Diagnosing the problem

To determine if this problem has occurred:

Required role: To complete this task, you must be a cluster administrator.

  1. Check the FoundationDB cluster status.
    oc get fdbcluster -o yaml | grep fdbStatus

    If the returned status is Failed, proceed to the next step to determine if the pods are available.

  2. Check to see if the FoundationDB pods are up and running.
    oc get pods -n ${PROJECT_CPD_INST_OPERANDS} | grep foundation

    The returned list of FoundationDB pods should all have a status of Running. If they are not running, then the problem is something other than this issue.
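
The two checks above can be combined into a single test: the issue applies only when fdbStatus is Failed while every FoundationDB pod is Running. The following is a minimal sketch under those assumptions; the function name fdb_false_failure is hypothetical, and the sketch treats any pod whose listing line contains "foundation" as a FoundationDB pod, as the grep in step 2 does.

```shell
# fdb_false_failure STATUS
# Input (stdin): `oc get pods` output for the operands project.
# Succeeds when STATUS is Failed and all FoundationDB pods are Running,
# which is the combination that this known issue describes.
fdb_false_failure() {
  [ "$1" = "Failed" ] || return 1
  awk '
    /foundation/ { seen = 1; if ($3 != "Running") bad = 1 }
    END { if (seen && !bad) exit 0; exit 1 }
  '
}
```

Pass the status from step 1 as the argument and pipe the pod listing from step 2 into the function.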

Resolving the problem

To resolve this issue, restart the FoundationDB controller (ibm-fdb-controller):

Required role: To complete this task, you must be a cluster administrator.

  1. Identify your FoundationDB controllers.
    oc get pods -n ${PROJECT_CPD_INST_OPERATORS} | grep fdb-controller
    This command returns the names of two FoundationDB controllers in the following formats:
    • ibm-fdb-controller-manager-<INSTANCE-ID>
    • apple-fdb-controller-manager-<INSTANCE-ID>
  2. Delete the ibm-fdb-controller-manager pod to restart it.
    oc delete pod ibm-fdb-controller-manager-<INSTANCE-ID> -n ${PROJECT_CPD_INST_OPERATORS}
  3. Wait for the controller to restart. This can take approximately one minute.
  4. Check the status of your FoundationDB cluster:
    oc -n ${PROJECT_CPD_INST_OPERANDS} get FdbCluster -o yaml
    Confirm that the fdbStatus is now Completed.

After you upgrade a Red Hat OpenShift Container Platform cluster, the FoundationDB resource can become unavailable

Applies to: 5.0.0 and later

After you upgrade your cluster to a new version of Red Hat OpenShift Container Platform, the IBM FoundationDB pods can become unavailable. When this issue occurs, services that rely on FoundationDB such as IBM Knowledge Catalog and IBM Match 360 cannot function correctly.

This issue affects deployments of the following services.
  • IBM Knowledge Catalog
  • IBM Match 360
Diagnosing the problem
To identify the cause of this issue, check the FoundationDB status and details.
  1. Check the FoundationDB status.
    oc get fdbcluster -o yaml | grep fdbStatus

    If this command is successful, the returned status is Complete. If the status is InProgress or Failed, proceed to the workaround steps.

  2. If the status is Complete but FoundationDB is still unavailable, log in to one of the FDB pods and check the status details to ensure that the database is available and all coordinators are reachable.
    oc rsh sample-cluster-log-1 /bin/fdbcli

    To check the detailed status of the FDB pod, run fdbcli to enter the FoundationDB command-line interface, then run the following command at the fdb> prompt.

    status details
    • If you get a message that is similar to Could not communicate with a quorum of coordination servers, run the coordinators command with the IP addresses specified in the error message as input.
      oc get pod -o wide | grep storage
      > coordinators IP-ADDRESS-1:4500:tls IP-ADDRESS-2:4500:tls IP-ADDRESS-3:4500:tls 

      If this step does not resolve the problem, proceed to the workaround steps.

    • If you get a different message, such as Recruiting new transaction servers, proceed to the workaround steps.
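
The coordinators command shown above takes each IP address with the :4500:tls suffix. A minimal sketch for assembling that line from a list of addresses follows; the function name coordinators_command is hypothetical, and the addresses must still come from the error message in your environment.

```shell
# coordinators_command IP [IP...]
# Prints the fdbcli coordinators command for the given IP addresses,
# using the port and tls suffix shown in the example above.
coordinators_command() {
  _cmd="coordinators"
  for _ip in "$@"; do
    _cmd="$_cmd $_ip:4500:tls"
  done
  printf '%s\n' "$_cmd"
}
```

Paste the printed line at the fdb> prompt.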
Resolving the problem
To resolve this issue, restart the FoundationDB pods.

Required role: To complete this task, you must be a cluster administrator.

  1. Restart the FoundationDB cluster pods.
    oc get fdbcluster 
    oc get po | grep ${CLUSTER_NAME} | grep -v backup | awk '{print $1}' | xargs oc delete po

    Replace ${CLUSTER_NAME} in the command with the name of your fdbcluster instance.

  2. Restart the FoundationDB operator pods.
    oc get po | grep fdb-controller | awk '{print $1}' | xargs oc delete po
  3. After the pods finish restarting, check to ensure that FoundationDB is available.
    1. Check the FoundationDB status.
      oc get fdbcluster -o yaml | grep fdbStatus

      The returned status must be Complete.

    2. Check to ensure that the database is available.
      oc rsh sample-cluster-log-1 /bin/fdbcli

      If the database is still not available, complete the following steps.

      1. Log in to the ibm-fdb-controller pod.
      2. Run the fix-coordinator script.
        kubectl fdb fix-coordinator-ips -c ${CLUSTER_NAME} -n ${PROJECT_CPD_INST_OPERATORS}

        Replace ${CLUSTER_NAME} in the command with the name of your fdbcluster instance.

        Note: For more information about the fix-coordinator script, see the workaround steps from the resolved IBM Match 360 known issue item The FoundationDB cluster can become unavailable.

After you upgrade from Cloud Pak for Data 4.7.4, generating a bearer token fails in an IAM-enabled cluster

Applies to: Upgrades from Version 4.7.4 to 5.0.0 and later

If you upgrade from Cloud Pak for Data Version 4.7.4 to Version 5.0.0 or later, the IAM access token API (/idprovider/v1/auth/identitytoken) fails. When the identitytoken API fails, you cannot log in to the user interface.

Diagnosing the problem
The following error is displayed in the log when you generate an IAM access token:
Failed to get access token, Liberty error: {"error_description":"CWWKS1406E: The token request had an invalid client credential. The request URI was \/oidc\/endpoint\/OP\/token.","error":"invalid_client"}"
Resolving the problem
  1. Log in to Red Hat OpenShift Container Platform as a cluster administrator.
    ${OC_LOGIN}
    Remember: OC_LOGIN is an alias for the oc login command.
  2. Run the following command to restart the oidc-client-registration job:
    oc -n ${PROJECT_CPD_INST_OPERANDS} delete job oidc-client-registration

Secrets are not visible in connections after upgrade

Applies to:
  • Upgrades from Version 4.7 to Version 5.0.0, 5.0.1, or 5.0.2
  • Upgrades from Version 4.8 to Version 5.0.0, 5.0.1, or 5.0.2

Fixed in: 5.0.3

If you use secrets when you create connections, the secrets are not visible in the connection details after you upgrade Cloud Pak for Data. This issue occurs when your vault uses a private CA signed certificate.

Resolving the problem
To see the secrets in the user interface:
  1. Change to the project where Cloud Pak for Data is installed:
    oc project ${PROJECT_CPD_INST_OPERANDS}
  2. Set the following environment variables:
    oc set env deployment/zen-core-api VAULT_BRIDGE_TLS_RENEGOTIATE=true
    oc set env deployment/zen-core-api VAULT_BRIDGE_TOLERATE_SELF_SIGNED=true

Node pinning is not applied to postgresql pods

Applies to: 5.0.0 and later

If you use node pinning to schedule pods on specific nodes, and your environment includes postgresql pods, the node affinity settings are not applied to the postgresql pods that are associated with your Cloud Pak for Data deployment.

The resource specification injection (RSI) webhook cannot patch postgresql pods because the EDB Postgres operator uses a PodDisruptionBudget resource to limit the number of concurrent disruptions to postgresql pods. The PodDisruptionBudget resource prevents postgresql pods from being evicted.

You must manually clean up remote physical location artifacts if the create-physical-location command fails

Applies to: 5.0.0

Fixed in: 5.0.1

If the cpd-cli manage create-physical-location command fails, the command leaves behind resources that you must clean up by running the cpd-cli manage delete-physical-location command:

cpd-cli manage delete-physical-location \
--physical_location_name=${REMOTE_PHYSICAL_LOCATION_ID} \
--management_ns=${REMOTE_PROJECT_MANAGEMENT} \
--cpd_hub_url=${CPD_HUB_URL} \
--cpd_hub_api_key=${CPD_HUB_API_KEY}

If you try to re-run the create-physical-location command against the same management project before you run the delete-physical-location command, the create-physical-location command returns the following error:

The physical-location-info-cm ConfigMap already exists in the <management-ns> project.
The physical location in the ConfigMap is called <remote-physical-location-id>
* If you need to re-run the create-physical-location command to finish creating the physical location, 
  you must specify <remote-physical-location-id>.
* If you want to create a new physical location on the cluster, you must specify a different project. 
  You cannot reuse an existing management project.

The ibm-nginx deployment does not scale fast enough when automatic scaling is configured

Applies to: 5.0.0 and later

If you configure automatic scaling for the IBM Cloud Pak for Data control plane, the ibm-nginx deployment might not scale fast enough. Some symptoms include:

  • Slow response times
  • High CPU requests are throttled
  • The deployment scales up and down even when the workload is steady

This problem typically occurs when you install watsonx Assistant or watsonx Orchestrate.

Resolving the problem
If you encounter the preceding symptoms, you must manually scale the ibm-nginx deployment:
oc patch zenservice lite-cr \
--namespace=${PROJECT_CPD_INST_OPERANDS} \
--type merge \
--patch '{"spec": {
    "Nginx": {
        "name": "ibm-nginx",
        "kind": "Deployment",
        "container": "ibm-nginx-container",
        "replicas": 5,
        "minReplicas": 2,
        "maxReplicas": 11,
        "guaranteedReplicas": 2,
        "metrics": [
            {
                "type": "Resource",
                "resource": {
                    "name": "cpu",
                    "target": {
                        "type": "Utilization",
                        "averageUtilization": 529
                    }
                }
            }
        ],
        "resources": {
            "limits": {
                "cpu": "1700m",
                "memory": "2048Mi",
                "ephemeral-storage": "500Mi"
            },
            "requests": {
                "cpu": "225m",
                "memory": "920Mi",
                "ephemeral-storage": "100Mi"
            }
        },
        "containerPolicies": [
            {
                "containerName": "*",
                "minAllowed": {
                    "cpu": "200m",
                    "memory": "256Mi"
                },
                "maxAllowed": {
                    "cpu": "2000m",
                    "memory": "2048Mi"
                },
                "controlledResources": [
                    "cpu",
                    "memory"
                ],
                "controlledValues": "RequestsAndLimits"
            }
        ]
    }
}}'

Backup and restore issues

Issues that apply to several or all backup and restore methods
Online backup and restore with the OADP backup and restore utility issues
Online backup and restore with IBM Storage Fusion issues
Online backup and restore with NetApp Astra Control Center issues
Data replication with Portworx issues
Offline backup and restore with the OADP backup and restore utility issues

OADP backup is missing EDB Postgres PVCs

Applies to: 5.0.0 and later

Diagnosing the problem
After an online or offline backup is taken with the OADP utility, EDB Postgres PVCs are missing in the PVC backup list.
Cause of the problem
EDB Postgres replica PVCs might be excluded from a backup when an EDB Postgres cluster switches primary instances.
Resolving the problem
Before you create a backup, run the following command:
oc label pvc,pods -l k8s.enterprisedb.io/cluster,velero.io/exclude-from-backup=true velero.io/exclude-from-backup- -n ${PROJECT_CPD_INST_OPERANDS}

Disk usage size error when running the du-pv command

Applies to: 5.0.0 and later

Note: Do not use this feature in a production environment.
Diagnosing the problem
When you run the du-pv command to estimate how much storage is needed to create a backup with the OADP utility, you see the following error message:
Total estimated volume usage size: 0

one or more error(s) occurred while trying to get disk usage size.  Please check reported errors in log file for details

The status of the cpdbr-agent pods is ImagePullBackOff:

oc get po -n ${OADP_NAMESPACE}
Example output:
NAME                                                READY   STATUS             RESTARTS   AGE
cpdbr-agent-9lprf                                   0/1     ImagePullBackOff   0          74s
cpdbr-agent-pf42f                                   0/1     ImagePullBackOff   0          74s
cpdbr-agent-trprx                                   0/1     ImagePullBackOff   0          74s
Cause of the problem
The --image-prefix option is not currently used by the cpdbr-agent install command. If you specify this option, it is ignored. Instead, the install command uses the default image at registry.access.redhat.com/ubi9/ubi-minimal:latest.
Resolving the problem
Do the following steps:
  1. Patch the cpdbr-agent daemonset with the desired fully-qualified image name:
    oc patch daemonset cpdbr-agent -n ${OADP_NAMESPACE} --type='json' -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/image", "value":"<fully-qualified-image-name>"}]'
  2. Wait for the daemonset to reach a healthy state:
    oc rollout status daemonset cpdbr-agent -n ${OADP_NAMESPACE}
  3. Retry the du-pv command.
Tip: For more information about this feature, see Optional: Estimating how much storage to allocate for backups.

After restore, watsonx Assistant custom resource is stuck in InProgress at 11/19 verified state

Applies to: 5.0.0-5.0.2

Fixed in: 5.0.3

Diagnosing the problem
This problem can occur after you restore an online backup to the same cluster or to a different cluster. Run the following command:
oc get wa <watsonx-Assistant-instance-name> -n ${PROJECT_CPD_INST_OPERANDS}
Example output:
NAME   VERSION   READY   READYREASON    UPDATING   UPDATINGREASON   DEPLOYED   VERIFIED   QUIESCE   AGE
wa     5.0.1     False   Initializing   True       VerifyWait       19/19      11/19                4h39m
Cause of the problem
Pods are unable to find the wa-global-etcd secret. Run the following command:
oc describe pod wa-store-<xxxxxxxxx>-<xxxxx> | tail -5
Example output:
  Normal   QueuePosition  51m (x2 over 52m)     ibm-cpd-scheduler  Queue Position: 3
  Normal   QueuePosition  50m (x2 over 52m)     ibm-cpd-scheduler  Queue Position: 2
  Normal   QueuePosition  36m                   ibm-cpd-scheduler  Queue Position: 1
  Warning  FailedMount    6m49s (x22 over 50m)  kubelet            Unable to attach or mount volumes: unmounted volumes=[global-etcd], unattached volumes=[], failed to process volumes=[]: timed out waiting for the condition
  Warning  FailedMount    74s (x33 over 52m)    kubelet            MountVolume.SetUp failed for volume "global-etcd" : secret "wa-global-etcd" not found
Resolving the problem
Delete certain deployments and recreate them by doing the following steps:
  1. Ensure that the watsonx Assistant operator is running.
  2. Create the INSTANCE environment variable and set it to the watsonx Assistant instance name:
    export INSTANCE=<watsonx-Assistant-instance-name>
  3. Run the following script:
    # Components to restart one by one
    SEQUENTIAL_DEPLOYMENTS=("ed" "dragonfly-clu-mm" "tfmm" "clu-triton-serving" "clu-serving" "nlu" "dialog" "store")
    
    # Components to restart together in parallel
    PARALLEL_DEPLOYMENTS=("analytics" "clu-embedding" "incoming-webhooks" "integrations" "recommends" "system-entities" "ui" "webhooks-connector" "gw-instance" "store-admin")
    
    for DEPLOYMENT in "${SEQUENTIAL_DEPLOYMENTS[@]}"; do
      echo "#Starting restart of $INSTANCE-$DEPLOYMENT."
    
      # Delete the deployment
      oc delete deployment $INSTANCE-$DEPLOYMENT
      # Wait until the deployment is completely deleted
      while oc get deployment $INSTANCE-$DEPLOYMENT &> /dev/null; do
        echo "Waiting for $INSTANCE-$DEPLOYMENT to be fully deleted..."
        sleep 5
      done
      
      # Ensure the deployment is recreated
      echo "Recreating $INSTANCE-$DEPLOYMENT."
      while ! oc get deployment $INSTANCE-$DEPLOYMENT &> /dev/null; do
        echo "Waiting for $INSTANCE-$DEPLOYMENT to be created..."
        sleep 5
      done
    
      echo "Waiting for $INSTANCE-$DEPLOYMENT to become ready..."
      oc rollout status deployment/$INSTANCE-$DEPLOYMENT --watch=true
    
      echo "#Rolling restart of $INSTANCE-$DEPLOYMENT completed successfully."
    done
    
    for DEPLOYMENT in "${PARALLEL_DEPLOYMENTS[@]}"; do
      echo "#Starting restart of $INSTANCE-$DEPLOYMENT."
    
      # Delete the deployment
      oc delete deployment $INSTANCE-$DEPLOYMENT &
    
    done
    
    # Wait for all parallel delete operations to complete
    wait
    
    # Ensure parallel deployments are recreated
    for DEPLOYMENT in "${PARALLEL_DEPLOYMENTS[@]}"; do
      while ! oc get deployment $INSTANCE-$DEPLOYMENT &> /dev/null; do
        echo "Waiting for $INSTANCE-$DEPLOYMENT to be created..."
        sleep 5
      done
    
      echo "Waiting for $INSTANCE-$DEPLOYMENT to become ready..."
      oc rollout status deployment/$INSTANCE-$DEPLOYMENT --watch=true
    
      echo "#Rolling restart of $INSTANCE-$DEPLOYMENT completed successfully."
    done
    
    echo "All deployments have been restarted successfully."

After restore, watsonx Assistant is stuck on the 17/19 deployed state or custom resource is stuck in InProgress state

Applies to: 5.0.1, 5.0.2

Fixed in: 5.0.3

Diagnosing the problem
This problem can occur after you restore an online backup to the same cluster or to a different cluster. Run the following command:
oc get wa -n ${PROJECT_CPD_INST_OPERANDS}
Example output:
NAME   VERSION   READY   READYREASON    UPDATING   UPDATINGREASON   DEPLOYED   VERIFIED   QUIESCE   AGE
wa     5.0.1     False   Initializing   True       VerifyWait       17/19      15/19                4h39m
Resolving the problem
Delete the wa-integrations-operand-secret and wa-integrations-datastore-connection-strings secrets by running the following commands:
oc delete secret wa-integrations-operand-secret -n ${PROJECT_CPD_INST_OPERANDS}
oc delete secret wa-integrations-datastore-connection-strings -n ${PROJECT_CPD_INST_OPERANDS}

After the secrets are deleted, the watsonx Assistant operator recreates them with the correct values, and the watsonx Assistant custom resource and pods return to a healthy state.

OADP backup precheck command fails

Applies to: 5.0.0, 5.0.1

Fixed in: 5.0.2

Diagnosing the problem
This problem occurs when you perform an offline or online backup and restore with the OADP backup and restore utility. Run the backup precheck command:
cpd-cli oadp backup precheck --tenant-operator-namespace ${PROJECT_CPD_INST_OPERATORS}

The following error message appears:

error getting csv list: : clusterserviceversions.operators.coreos.com is forbidden: User "system:serviceaccount:zen-cpdbrapi:cpdbr-api-sa" cannot list resource "clusterserviceversions" in API group "operators.coreos.com" at the cluster scope
Error: error getting csv list: : clusterserviceversions.operators.coreos.com is forbidden: User "system:serviceaccount:zen-cpdbrapi:cpdbr-api-sa" cannot list resource "clusterserviceversions" in API group "operators.coreos.com" at the cluster scope
[ERROR] <timestamp> RunPluginCommand:Execution error: exit status 1
Cause of the problem
The cpdbr-api pod does not have the necessary permission to list clusterserviceversions.operators.coreos.com in all projects (namespaces) for the backup precheck command.
Resolving the problem
Add --exclude-checks OadpOperatorCSV to the backup precheck command:
cpd-cli oadp backup precheck \
--tenant-operator-namespace ${PROJECT_CPD_INST_OPERATORS} \
--exclude-checks OadpOperatorCSV

During or after a restore, pod shows PVC is missing

Applies to: 5.0.0

Fixed in: 5.0.1

Diagnosing the problem
During or after a restore, a pod shows that one or more PVCs are missing. For example:
oc describe pod c-db2oltp-wkc-db2u-0

Example output:

Events:
  Type     Reason            Age                    From               Message
  ----     ------            ----                   ----               -------
  Warning  FailedScheduling  18m (x11076 over 16h)  ibm-cpd-scheduler  0/20 nodes are available: 20 persistentvolumeclaim "wkc-db2u-backups" not found. preemption: 0/20 nodes are available: 20 Preemption is not helpful for scheduling.
Cause of the problem
Velero does not back up PVCs that are in a Terminating state.
Resolving the problem
To work around the problem, before you restore a backup, ensure that no PVCs are in a Terminating state. To check for PVCs that are in a Terminating state after a backup is created, check the Velero pod logs for Skipping item because it's being deleted messages:
oc logs -l deploy=velero -n <oadp-operator-ns>

Example output:

time="<timestamp>" level=info msg="Skipping item because it's being deleted." backup=oadp-operator/bkupocs661-tenant-online-b1 logSource="/remote-source/velero/app/pkg/backup/item_backupper.go:161" name=wkc-db2u-backups namespace=zen1 resource=persistentvolumeclaims
time="<timestamp>" level=info msg="Backed up 286 items out of an estimated total of 292 (estimate will change throughout the backup)" backup=oadp-operator/bkupocs661-tenant-online-b1 logSource="/remote-source/velero/app/pkg/backup/backup.go:404" name=wkc-db2u-backups namespace=zen1 progress= resource=persistentvolumeclaims
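
To list only the PVCs that Velero skipped, the log lines can be filtered for the message and the name= field. This is a minimal sketch; the function name skipped_pvcs is hypothetical, and it assumes the key=value log layout shown in the example output.

```shell
# Sketch: print the unique names of PVCs that Velero skipped because they
# were being deleted. Input (stdin): Velero pod logs.
skipped_pvcs() {
  grep "Skipping item because it's being deleted" |
    grep 'resource=persistentvolumeclaims' |
    sed -n 's/.* name=\([^ ]*\).*/\1/p' |
    sort -u
}
```

Pipe the output of the oc logs command into skipped_pvcs. With the example output above, the function prints wkc-db2u-backups.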

After restoring an online backup, status of Watson Discovery custom resource remains in InProgress state

Applies to: 5.0.0-5.0.2

Fixed in: 5.0.3

Diagnosing the problem
You see the following error, even though you completed the Multicloud Object Gateway post-restore task. For example, if you used IBM Storage Fusion to do the backup and restore, you created the secrets that Watson Discovery uses to connect to Multicloud Object Gateway.
  - lastTransactionTime: <timestamp>
    message: Post task of online restore is in progress. Please ensure that MCG is
      correctly configured after restore.
    reason: PostRestoreInProgress
    status: "True"
    type: Message
Cause of the problem
The Watson Discovery post-restore task did not complete.
Resolving the problem
To work around the problem, do the following steps:
  1. Check that the Watson Discovery post-restore component exists:
    oc get wd wd -o jsonpath='{.status.componentStatus.deployedComponents[?(@=="post_restore")]}'
    If the post-restore component exists, the output of the command is:
    post_restore
  2. Check that the post-restore task is not listed as unverified:
    oc get wd wd -o jsonpath='{.status.componentStatus.unverifiedComponents[?(@=="post_restore")]}'

    If the post-restore task is not listed as unverified, the command produces no output.

  3. In this situation, some failed jobs are not rerun automatically and must be deleted:
    oc delete job wd-discovery-enrichment-model-copy wd-discovery-orchestrator-setup
  4. Check that Watson Discovery is now ready:
    oc get wd
    Example output:
    NAME   VERSION   READY   READYREASON   UPDATING   UPDATINGREASON   DEPLOYED   VERIFIED   QUIESCE        DATASTOREQUIESCE   AGE
    wd     5.0.0     True    Stable        False      Stable           23/23      23/23      NOT_QUIESCED   NOT_QUIESCED       22h
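
The DEPLOYED and VERIFIED columns in this output, and in the watsonx Assistant output earlier in this section, indicate whether a custom resource has finished reconciling. A minimal sketch for checking them in a script follows; the function name cr_fully_verified is hypothetical, and it assumes that the columns up to VERIFIED contain no embedded spaces.

```shell
# Sketch: succeed only when every row reports DEPLOYED and VERIFIED as N/N
# with the same N. Input (stdin): `oc get wa` or `oc get wd` table output.
cr_fully_verified() {
  awk '
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
    {
      rows++
      split($(col["DEPLOYED"]), d, "/")
      split($(col["VERIFIED"]), v, "/")
      if (d[1] != d[2] || v[1] != v[2] || d[2] != v[2]) bad = 1
    }
    END { if (rows > 0 && !bad) exit 0; exit 1 }
  '
}
```

For example: oc get wd | cr_fully_verified && echo "Watson Discovery is fully verified"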

After successful restore, the ibm-common-service-operator deployment fails to reach a Running state

Applies to: 5.0.0 and later

Diagnosing the problem
The following symptoms are seen:
  • Running the following command shows that the ibm-common-service-operator pod and deployment are not healthy:
    oc get pods -n ${PROJECT_CPD_INST_OPERATORS}
    Example output:
    ibm-common-service-operator-<...>                      0/1     CrashLoopBackOff        72 (4m46s ago)   6h11m

    Error logs show permission issues:

    oc logs ibm-common-service-operator-<...>
    Example output:
    ...
    # I0529 20:52:39.182025       1 request.go:665] Waited for 1.033737216s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/dashboard.opendatahub.io/v1alpha?timeout=32s
    # <date_timestamp>20:52:47.794Z	INFO	controller-runtime.metrics	metrics server is starting to listen	{"addr": ":8080"}
    # I0529 20:52:47.794980       1 main.go:130] Identifying Common Service Operator Role in the namespace cpd-operator
    # E0529 20:52:47.835106       1 util.go:465] Failed to fetch configmap kube-public/saas-config: configmaps "saas-config" is forbidden: User "system:serviceaccount:cpd-operator:ibm-common-service-operator" cannot get resource "configmaps" in API group "" in the namespace "kube-public"
    # I0529 20:52:47.837942       1 init.go:152] Single Deployment Status: false, MultiInstance Deployment status: true, SaaS Depolyment Status: false
    # I0529 20:52:49.188786       1 request.go:665] Waited for 1.340366538s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/cdi.kubevirt.io/v1beta1?timeout=32s
    # E0529 20:52:57.412736       1 init.go:1683] Failed to cleanup validatingWebhookConfig: validatingwebhookconfigurations.admissionregistration.k8s.io "ibm-common-service-validating-webhook-cpd-operator" is forbidden: User "system:serviceaccount:cpd-operator:ibm-common-service-operator" cannot delete resource "validatingwebhookconfigurations" in API group "admissionregistration.k8s.io" at the cluster scope
    # E0529 20:52:57.412762       1 main.go:153] Cleanup Webhook Resources failed: validatingwebhookconfigurations.admissionregistration.k8s.io "ibm-common-service-validating-webhook-cpd-operator" is forbidden: User "system:serviceaccount:cpd-operator:ibm-common-service-operator" cannot delete resource "validatingwebhookconfigurations" in API group "admissionregistration.k8s.io" at the cluster scope
  • Running the following command shows that the ibm-common-service-operator CSV is stuck in a Pending state:
    oc get csv -n ${PROJECT_CPD_INST_OPERATORS}
    Example output:
    NAME                      DISPLAY           VERSION   REPLACES   PHASE
    ibm-zen-operator.v6.0.0   IBM Zen Service   6.0.0                Pending

    Running the following command shows that the status of the CommonService custom resource is Succeeded:

    oc get commonservice -n ${PROJECT_CPD_INST_OPERANDS} common-service -o json | jq .status.phase
  • OLM logs show the following error:
    oc logs -n openshift-operator-lifecycle-manager -l app=catalog-operator
    oc logs -n openshift-operator-lifecycle-manager -l app=olm-operator
    Example output:
    E0530 01:00:07.268889 1 queueinformer_operator.go:319] sync {"update" "cpd-operator/ibm-common-service-operator.v4.6.0"} failed: requirements were not met
    time="<timestamp>" level=info msg="requirements were not met" csv=cpd-platform-operator.v4.6.0 id=<...> namespace=cpd-operator phase=Pending
Cause of the problem
The root cause is a known OLM issue in which ClusterRoleBindings are missing even though the InstallPlan shows that they were created. For details, see the OLM issue ClusterRoleBinding is missing although InstallPlan shows it was created.
Resolving the problem
To work around the problem, clean up the Cloud Pak for Data instance and operator projects (namespaces) and retry the restore. For cleanup instructions, see Preparing to restore Cloud Pak for Data with the OADP utility.
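
To spot CSVs that are stuck, you can filter the output of the oc get csv command that is used in the diagnosis above. The following sketch assumes only that the first column is the CSV name and the last column is the phase. csv_not_succeeded is a hypothetical helper name.

```shell
# csv_not_succeeded
# Reads `oc get csv` output on stdin (including the header line) and prints
# "<csv-name> <phase>" for every CSV whose phase is not Succeeded.
# The DISPLAY column can contain spaces, so the phase is read from the
# last field rather than from a fixed column number.
csv_not_succeeded() {
  awk 'NR > 1 && NF > 0 && $NF != "Succeeded" { print $1, $NF }'
}

# Usage (requires a logged-in oc session):
#   oc get csv -n "${PROJECT_CPD_INST_OPERATORS}" | csv_not_succeeded
```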

Restore fails with Error from server (Forbidden): configmaps is forbidden error

Applies to: 5.0.0

Fixed in: 5.0.1

Diagnosing the problem
When restoring Cloud Pak for Data to a different cluster with IBM Storage Fusion, NetApp Astra Control Center, or Portworx, you see the following error message:
Time: <timestamp> level=error - oc get configmap -n kube-public - FAILED with:  
Error from server (Forbidden): configmaps is forbidden: User "system:serviceaccount:cpd-operator:cpdbr-tenant-service-sa" 
cannot list resource "configmaps" in API group "" in the namespace "kube-public"
End Time: <timestamp>
Cause of the problem
The command to uninstall the cpdbr service was run with the incorrect --tenant-operator-namespace parameter. For example, multiple Cloud Pak for Data instances were installed in the cluster, and while cleaning up one of the instances, the incorrect project was specified when uninstalling the cpdbr service.
Resolving the problem
To work around the problem, reinstall the cpdbr service in the project where it was mistakenly uninstalled. For details, see one of the following topics:

After a restore, unable to access the Cloud Pak for Data console

Applies to: 5.0.0

Fixed in: 5.0.1

Diagnosing the problem
You see the following symptoms:
  • Running the following command shows that the ibm-iam-operator pod and deployment are not healthy:
    oc get pods -n ${PROJECT_CPD_INST_OPERATORS}
    Example output:
    ibm-iam-operator-<...>                      0/1     CrashLoopBackOff        72 (4m46s ago)   6h11m
    Error logs show permission issues:
    oc logs ibm-iam-operator-<...>
  • Running the following command shows that the ibm-iam-operator CSV is stuck in a Pending state:
    oc get csv -n ${PROJECT_CPD_INST_OPERATORS}
    Example output:
    NAME                      DISPLAY           VERSION   REPLACES   PHASE
    ibm-iam-operator.v4.6.0   IBM IM Operator   4.6.0                Pending

    Running the following command shows that the status of the CommonService custom resource is Succeeded:

    oc get commonservice -n ${PROJECT_CPD_INST_OPERANDS} common-service -o json | jq .status.phase
  • OLM logs show the following error:
    oc logs -n openshift-operator-lifecycle-manager -l app=catalog-operator
    oc logs -n openshift-operator-lifecycle-manager -l app=olm-operator
    Example output:
    E0530 01:00:07.268889 1 queueinformer_operator.go:319] sync {"update" "cpd-operator/ibm-iam-operator.v4.6.0"} failed: requirements were not met
    time="<timestamp>" level=info msg="requirements were not met" csv=ibm-iam-operator.v4.6.0 id=<...> namespace=cpd-operator phase=Pending
Cause of the problem
The ibm-iam-operator has insufficient permissions because a ClusterRole and ClusterRoleBindings are missing. The root cause is a known OLM issue in which ClusterRoleBindings are missing even though the InstallPlan shows that they were created. For details, see the OLM issue ClusterRoleBinding is missing although InstallPlan shows it was created.
Resolving the problem
To work around the problem, clean up the Cloud Pak for Data instance and operator projects (namespaces) and retry the restore. For cleanup instructions, see Preparing to restore Cloud Pak for Data with the OADP utility.

After a successful restore, the Cloud Pak for Data console points to the source cluster domain in its URL instead of the target cluster domain

Applies to: 5.0.0

Fixed in: 5.0.1

Diagnosing the problem
Get the Cloud Pak for Data console route by running the following command:
oc get route -n ${PROJECT_CPD_INST_OPERANDS}

The output of the command shows that the Cloud Pak for Data console route points to the source cluster domain rather than to the target cluster domain.

Cause of the problem
The ibmcloud-cluster-info ConfigMap from the source cluster is included in the restore when it is expected to be excluded and re-generated, causing the target restore cluster to use the source routes.
Resolving the problem
To work around the problem, do the following steps:
  1. Edit the fields in the ibmcloud-cluster-info ConfigMap to use the target cluster hostname:
    oc edit configmap ibmcloud-cluster-info -n ${PROJECT_CPD_INST_OPERANDS}
  2. Restart the ibm-zen-operator pod:
    oc delete po -l app.kubernetes.io/name=ibm-zen-operator -n ${PROJECT_CPD_INST_OPERATORS}
  3. Check that the routes are updated:
    oc get route -n ${PROJECT_CPD_INST_OPERANDS}

If restarting the ibm-zen-operator pod does not correctly update the routes, and the ibm-iam-operator deployment is not healthy, do the workaround that is described in the previous issue.
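
To confirm which routes still point to the source cluster, you can compare each route host with the target cluster domain. The following is a sketch under stated assumptions: routes_outside_domain is a hypothetical helper name, and apps.target.example.com is a placeholder for your target cluster domain.

```shell
# routes_outside_domain <expected-domain-suffix>
# Reads lines of "<route-name> <host>" on stdin and prints every route
# whose host does not end with the expected domain suffix.
routes_outside_domain() {
  awk -v dom="$1" '
    NF >= 2 {
      host = $2
      n = length(host); m = length(dom)
      if (n < m || substr(host, n - m + 1) != dom) print $1, host
    }'
}

# Usage (requires a logged-in oc session; the domain is a placeholder):
#   oc get route -n "${PROJECT_CPD_INST_OPERANDS}" --no-headers \
#     -o custom-columns=NAME:.metadata.name,HOST:.spec.host \
#     | routes_outside_domain "apps.target.example.com"
```

If the command prints nothing, every route host already ends with the target domain.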

Unable to back up Watson Discovery when the service is scaled to the xsmall size

Applies to: 5.0.0

Fixed in: 5.0.1

Diagnosing the problem
The problem that you see depends on the backup and restore method that you are using. For example, if you are using IBM Storage Fusion, a Failed snapshot message appears during the backup process.
Cause of the problem
The xsmall size configuration uses 1 OpenSearch data node. The backup process requires 2 data nodes.
Resolving the problem
To work around the problem, increase the number of OpenSearch data nodes to 2. In the ${PROJECT_CPD_INST_OPERANDS} project (namespace), run the following command:
oc patch wd wd --type=merge --patch='{"spec":{"elasticsearch":{"dataNode":{"replicas":2}}}}'

In a Cloud Pak for Data deployment that has multiple OpenPages instances, only the first instance is successfully restored

Applies to: 5.0.0

Fixed in: 5.0.1

Diagnosing the problem
After the restore, the custom resource of the first OpenPages instance is in a Completed state. The custom resources of the remaining OpenPages instances are in an InMaintenance state.
Cause of the problem
Hooks (prehooks, posthooks, etc.) are run only on the first OpenPages instance. Log files list only the results for one OpenPages instance when multiple were expected.
Resolving the problem
To work around the problem, do the following steps:
  1. Get the OpenPages instance ConfigMaps:
    oc get cm -n ${PROJECT_CPD_INST_OPERANDS} -l cpdfwk.module=openpages-aux
  2. Edit each OpenPages instance ConfigMap so that its .data.aux-meta.name field matches its .metadata.labels["cpdfwk.name"] label:
    oc edit cm -n ${PROJECT_CPD_INST_OPERANDS} <configmap-name>

Unable to log in to Cloud Pak for Data with OpenShift cluster credentials after successfully restoring to a different cluster

Applies to: 5.0.0

Fixed in: 5.0.1

Diagnosing the problem
When Cloud Pak for Data is integrated with the Identity Management Service, you cannot log in with OpenShift cluster credentials. You might be able to log in with LDAP or as cpdadmin.
Resolving the problem
To work around the problem, run the following commands:
oc delete cm platform-auth-idp -n ${PROJECT_CPD_INST_OPERANDS}
oc delete cm oauth-client-map -n ${PROJECT_CPD_INST_OPERANDS}
oc delete cm ibm-iam-bindinfo-oauth-client-map -n ${PROJECT_CPD_INST_OPERANDS}
oc delete cm ibm-iam-bindinfo-platform-auth-idp -n ${PROJECT_CPD_INST_OPERANDS}
oc delete pods -n ${PROJECT_CPD_INST_OPERATORS} -l app.kubernetes.io/instance=ibm-common-service-operator
oc delete pods -n ${PROJECT_CPD_INST_OPERANDS} -l app.kubernetes.io/instance=platform-auth-service
oc delete pods -n ${PROJECT_CPD_INST_OPERANDS} -l app.kubernetes.io/instance=platform-identity-management
oc delete pods -n ${PROJECT_CPD_INST_OPERANDS} -l app.kubernetes.io/instance=platform-identity-provider

After a restore, OperandRequest timeout error in the ZenService custom resource

Applies to: 5.0.0 and later

Diagnosing the problem
Get the ZenService custom resource in YAML format:
oc get zenservice lite-cr -n ${PROJECT_CPD_INST_OPERANDS} -o yaml

In the output, you see the following error:

...
zenMessage: '5.1.3/roles/0010-infra has failed with error: "OperandRequest" "zen-ca-operand-request":
      Timed out waiting on resource'
...
Check for failing operandrequests:
oc get operandrequests -A
For failing operandrequests, check their conditions for constraints not satisfiable messages:
oc describe operandrequest -n ${PROJECT_CPD_INST_OPERATORS} <opreq-name>
Cause of the problem
Subscription wait operations timed out. The problematic subscriptions show an error similar to the following example:
'constraints not satisfiable: clusterserviceversion ibm-db2aaservice-cp4d-operator.v5.2.0
      exists and is not referenced by a subscription, @existing/cpd-operators//ibm-db2aaservice-cp4d-operator.v5.2.0
      and ibm-db2aaservice-cp4d-operator-catalog/cpd-operators/v5.2/ibm-db2aaservice-cp4d-operator.v5.2.0
      originate from package ibm-db2aaservice-cp4d-operator, subscription ibm-db2aaservice-cp4d-operator
      requires ibm-db2aaservice-cp4d-operator-catalog/cpd-operators/v5.2/ibm-db2aaservice-cp4d-operator.v5.2.0,
      subscription ibm-db2aaservice-cp4d-operator exists'

This problem is a known issue with Red Hat Operator Lifecycle Manager. For details, see Failed to install or upgrade operator with warning 'clusterserviceversion is not referenced by a subscription'.

Workaround
Do the following steps:
  1. Delete the problematic clusterserviceversions and subscriptions, and restart the Operand Deployment Lifecycle Manager (ODLM) pod.

    For details, follow the steps in the troubleshooting document cloud-native-postgresql operator is installed with the certified-operators catalogsource.

  2. Delete Cloud Pak for Data instance projects (namespaces).

    For details, see Preparing to restore Cloud Pak for Data with the OADP utility.

  3. Retry the restore.
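
To narrow down which operandrequests to describe during diagnosis, you can list only the ones that are not healthy. The following sketch assumes that each OperandRequest reports its state in .status.phase and that Running is the healthy value; verify both on your cluster. opreq_not_running is a hypothetical helper name.

```shell
# opreq_not_running
# Reads lines of "<namespace> <name> <phase>" on stdin and prints
# "<namespace>/<name> <phase>" for every entry whose phase is not Running.
# Treating "Running" as the healthy phase is an assumption of this sketch.
opreq_not_running() {
  awk 'NF >= 3 && $3 != "Running" { print $1 "/" $2, $3 }'
}

# Usage (requires a logged-in oc session):
#   oc get operandrequests -A --no-headers \
#     -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,PHASE:.status.phase \
#     | opreq_not_running
```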

Online restore of Data Virtualization fails with post-hook errors

Applies to: 5.0.2, 5.0.3

Diagnosing the problem
Restoring an online backup of Data Virtualization on Portworx storage with the OADP backup and restore utility fails. In the CPD-CLI*.log file, you see errors like the following example:
<time>     zen/configmap/cpd-dv-aux-ckpt-cm: component=dv, op=<mode=post-restore,type=config-hook,method=rule>, status=error
time=<timestamp> level=error msg=error performing op postRestoreViaConfigHookRule for resource dv, msg: 1 error occurred:
   * : command timed out after 40m0s: timed out waiting for the condition
 func=cpdbr-oadp/pkg/quiesce.(*BasicPlanExecutor).applyPlanInternal file=/a/workspace/oadp-upload/pkg/quiesce/planexecutor.go:1535
Cause of the problem
Db2 startup is slow, causing the Data Virtualization post-restore hook to time out.
Resolving the problem
To work around the problem, take various Data Virtualization components out of write-suspend mode.
  1. Take dvutils out of write-suspend mode:
    oc rsh c-db2u-dv-dvutils-0 bash
    /opt/dv/current/dv-utils.sh -o leavesafemode --is-bar
  2. Take the Data Virtualization hurricane pod out of write-suspend mode:
    oc rsh $(oc get pods | grep -i hurricane | cut -d' ' -f 1) bash
    su - db2inst1
    /usr/ibmpacks/current/bigsql/bigsql/bigsql-cli/BIGSQL/package/scripts/bigsql-db2ubar-hook.sh -H POST -M RESTORE -L
  3. Take Db2 out of write-suspend mode:
    oc rsh c-db2u-dv-db2u-0 bash
    su - db2inst1
    /db2u/scripts/bigsql-exec.sh /usr/ibmpacks/current/bigsql/bigsql/bigsql-cli/BIGSQL/package/scripts/bigsql-db2ubar-hook.sh -H POST -M RESTORE -L
  4. After a few minutes, verify that Db2 is no longer in write-suspend mode:
    db2 connect to bigsql

    If the command finishes successfully, Db2 is no longer in write-suspend mode.

  5. Restart the Data Virtualization caching pod by deleting the existing pod:
    oc delete pod $(oc get pods | grep -i c-db2u-dv-dvcaching | cut -d' ' -f 1)
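
Step 4 asks you to wait a few minutes and then verify the connection. Instead of checking manually, you can retry the check on an interval. The following is a generic sketch: retry_until is a hypothetical helper name, and the attempt count and delay in the usage example are arbitrary values, not product recommendations.

```shell
# retry_until <attempts> <delay-seconds> <command> [args...]
# Runs the command until it succeeds or the number of attempts is used up.
# Returns 0 on the first success and 1 if every attempt fails.
retry_until() {
  attempts="$1"; delay="$2"; shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then return 0; fi
    # Sleep only between attempts, not after the last one.
    if [ "$i" -lt "$attempts" ]; then sleep "$delay"; fi
    i=$((i + 1))
  done
  return 1
}

# Usage (inside the c-db2u-dv-db2u-0 pod as db2inst1; values are examples):
#   retry_until 10 60 db2 connect to bigsql && echo "Db2 accepts connections"
```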

Online backup of Analytics Engine powered by Apache Spark fails

Applies to: 5.0.1

Fixed in: 5.0.2

Diagnosing the problem
When you try to create a backup of a Cloud Pak for Data deployment that includes the Analytics Engine powered by Apache Spark service with the OADP utility, the backup fails at the step to create a backup of Cloud Pak for Data PVCs and volume data. In the log file, you see the following error:
Hook execution breakdown by status=error/timedout:

The following hooks either have errors or timed out

pre-backup (1):

        COMPONENT                       CONFIGMAP                               METHOD  STATUS  DURATION       
        analyticsengine-cnpsql-ckpt     cpd-analyticsengine-aux-edb-ckpt-cm     rule    error   1m17.502299591s

--------------------------------------------------------------------------------

** INFO [BACKUP CREATE/SUMMARY/END] *******************************************
Error: error running pre-backup hooks: Error running pre-processing rules.  Check the /root/install_automation/cpd-cli-linux-EE-14.0.1-353/cpd-cli-workspace/logs/CPD-CLI-<date>.log for errors.
[ERROR] <timestamp> RunPluginCommand:Execution error:  exit status 1
Cause of the problem
The EDB Postgres cluster spark-hb-cloud-native-postgresql remains fenced.
Resolving the problem
Unfence the cluster by doing the following steps:
  1. Edit the spark-hb-cloud-native-postgresql cluster:
    oc edit clusters.postgresql.k8s.enterprisedb.io spark-hb-cloud-native-postgresql
  2. Remove the following line:
    k8s.enterprisedb.io/fencedInstances: ""
  3. Retry the backup.
Tip: For more information about resolving problems with EDB Postgres clusters that remain fenced, see EDB Postgres cluster is in an unhealthy state after a failed online backup.
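
Before you retry the backup, you can confirm that the fencing annotation is gone. The following sketch checks the YAML of the cluster resource for the annotation key from step 2. has_fenced_annotation is a hypothetical helper name.

```shell
# has_fenced_annotation
# Reads the YAML of an EDB Postgres cluster resource on stdin and succeeds
# if the fencedInstances annotation key is present. The pattern requires
# the key at the start of a line (after indentation) so that entries such
# as "f:k8s.enterprisedb.io/fencedInstances" in managedFields do not match.
has_fenced_annotation() {
  grep -q '^[[:space:]]*k8s\.enterprisedb\.io/fencedInstances:'
}

# Usage (requires a logged-in oc session):
#   oc get clusters.postgresql.k8s.enterprisedb.io spark-hb-cloud-native-postgresql -o yaml \
#     | has_fenced_annotation && echo "cluster is still fenced"
```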

Watson Speech services status is stuck in InProgress after restore

Applies to: 5.0.0

Fixed in: 5.0.1

Diagnosing the problem
After an online restore with the OADP utility, the CPD-CLI*.log file shows speechStatus is in the InProgress state.
Cause of the problem
The speechStatus is in the InProgress state due to a race condition in the stt-async component. Pods that are associated with this component are stuck in 0/1 Running state. Run the following command to confirm this state:
oc get pods -l app.kubernetes.io/component=stt-async
Example output:
NAME                                   READY   STATUS    RESTARTS   AGE
speech-cr-stt-async-775d5b9d55-fpj8x   0/1     Running   0          60m

If one or more pods are in the 0/1 Running state for 20 minutes or more, the pods are probably affected by this problem.

Resolving the problem
For each pod in the 0/1 Running state, run the following command:
oc delete pod <stt-async-podname>
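
If several pods are affected, you can extract their names from the oc get pods output instead of copying them by hand. The following sketch assumes the column layout of the example output above (READY in column 2, STATUS in column 3). stuck_pods is a hypothetical helper name. The sketch does not check the pod age, so confirm that the pods have been in this state long enough before you delete them.

```shell
# stuck_pods
# Reads `oc get pods` output on stdin (including the header line) and
# prints the name of every pod that is Running but shows 0/1 ready.
stuck_pods() {
  awk 'NR > 1 && $2 == "0/1" && $3 == "Running" { print $1 }'
}

# Usage (requires a logged-in oc session):
#   oc get pods -l app.kubernetes.io/component=stt-async | stuck_pods \
#     | while read -r pod; do oc delete pod "$pod"; done
```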

Common core services and dependent services in a failed state after an online restore

Applies to: 5.0.0

Diagnosing the problem
After you restore an online backup with the OADP backup and restore utility, the Common core services custom resource and the custom resource of dependent services remain in an InProgress state.
Cause of the problem
Intermittent Elasticsearch failure.
Workaround
To work around the problem, do the following steps:
  1. Make sure that the current project (namespace) is set to the project that contains the Common core services and Watson Knowledge Catalog deployment.
  2. Make sure that a valid backup is available by running the following command:
    oc exec elasticsea-0ac3-ib-6fb9-es-server-esnodes-0 -c elasticsearch -- curl --request GET --url http://localhost:19200/_cat/snapshots/cloudpak  --header 'content-type: application/json'
  3. When a valid backup is present, the command returns output like in the following example:
    cloudpak_snapshot_<timestamp> SUCCESS <epoch_timestamp> <hh:mm:ss> <epoch_timestamp> <hh:mm:ss> 200ms 3 23 0 23
  4. If a snapshot is not present, the restore has unexpectedly failed. Contact IBM Support for assistance.
  5. If a valid snapshot is present, delete the indexes on the cluster:
    oc exec -n ${PROJECT_CPD_INST_OPERANDS} elasticsea-0ac3-ib-6fb9-es-server-esnodes-0 -c elasticsearch -- curl --request DELETE --url 'http://localhost:19200/granite-3b,wkc,gs-system-index-wkc-v001,semantic' --header 'content-type: application/json'
  6. Scale the OpenSearch cluster down by quiescing:
    oc patch elasticsearchcluster elasticsearch-master --type merge --patch '{"spec": {"quiesce": true}}'
  7. Wait for the pods to scale down, checking the status with the following command:
    watch "oc get pods | grep elasticsea"
  8. When all the pods are gone, restart the cluster by unquiescing it:
    oc patch elasticsearchcluster elasticsearch-master --type merge --patch '{"spec": {"quiesce": false}}'

After you do these steps, Elasticsearch comes back up, and automatically kicks off recovery.
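
The snapshot check in steps 2 to 4 decides whether it is safe to delete the indexes, so you might want to script it. The following sketch tests the output of the _cat/snapshots request for a snapshot with the SUCCESS status, based on the format of the example output in step 3. has_successful_snapshot is a hypothetical helper name.

```shell
# has_successful_snapshot
# Reads the output of the _cat/snapshots request on stdin and succeeds if
# at least one cloudpak_snapshot_* entry has the status SUCCESS.
has_successful_snapshot() {
  awk '
    $1 ~ /^cloudpak_snapshot_/ && $2 == "SUCCESS" { ok = 1 }
    END { exit (ok ? 0 : 1) }
  '
}

# Usage (requires a logged-in oc session; the pod name is the one from step 2):
#   oc exec elasticsea-0ac3-ib-6fb9-es-server-esnodes-0 -c elasticsearch -- \
#     curl --request GET --url http://localhost:19200/_cat/snapshots/cloudpak \
#     | has_successful_snapshot || echo "No valid snapshot. Do not delete the indexes."
```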

Backup fails for the platform with error in EDB Postgres cluster

Applies to: 5.0.0 and later

Diagnosing the problem
This problem occurs in all backup and restore methods. For example, when you use IBM Storage Fusion, the backup fails at the Hook: br-service hooks/pre-backup stage in the backup sequence.

In the cpdbr-oadp.log file, you see the following error:

time=<timestamp> level=info msg=cmd stderr: Error: cannot take a cold backup of the primary instance 
or a target primary instance if the k8s.enterprisedb.io/snapshotAllowColdBackupOnPrimary annotation is not set to enabled
Cause of the problem
Labels and annotations in the EDB Postgres cluster resources were not updated after a switchover of the EDB Postgres cluster's primary instance and replica.
Resolving the problem
Do the following steps:
  1. Save the following script as reset_br_annotations_labels.sh:
    #!/bin/bash
    # Check if the correct number of arguments is provided
    if [ "$#" -ne 2 ]; then
      echo "Usage: $0 <cluster-name> <operand-namespace>"
      exit 1
    fi
    
    cluster_name=$1
    zen_namespace=$2
    
    if ! oc whoami > /dev/null 2>&1; then
      echo "[$(date '+%Y-%m-%d %H:%M:%S')] Session is invalid. Please log in."
      exit 1
    fi
    
    if ! oc get cluster.postgresql.k8s.enterprisedb.io ${cluster_name} -n ${zen_namespace}> /dev/null 2>&1; then
      echo "[$(date '+%Y-%m-%d %H:%M:%S')] Postgres EDB ${cluster_name} does not exist in namespace ${zen_namespace}"
      exit 1
    fi
    
    #The backup adapter annotation will be removed temporarily and re-added
    #backup_adapter_annotation: {"k8s.enterprisedb.io/externalBackupAdapterClusterConfig": "electedResourcesDecorators:\n  - key: \"icpdsupport/edb-elected\"\n    metadataType: \"label\"\n    value: \"true\"\nexcludedResourcesDecorators:\n  - key: \"icpdsupport/ignore-on-nd-backup\"\n    metadataType: \"label\"\n    value: \"true\"\nbackupInstanceDecorators:\n  - key: \"icpdsupport/edb-hooks\"\n    metadataType: \"label\"\n    value: \"true\"\npreBackupHookConfiguration:\n  container:\n    key: \"icpdsupport/edb-pre-backup-container\"\n  command:\n    key: \"icpdsupport/edb-pre-backup-command\"\n  onError:\n    key: \"icpdsupport/edb-pre-backup-on-error\"\npostBackupHookConfiguration:\n  container:\n    key: \"icpdsupport/edb-post-backup-container\"\n  command:\n    key: \"icpdsupport/edb-post-backup-command\""}
    
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] Fetching Backup Adapter configuration annotation from EDB cluster ${cluster_name}"
    annotation_value=$(oc get cluster.postgresql.k8s.enterprisedb.io ${cluster_name} -n ${zen_namespace} -o jsonpath='{.metadata.annotations.k8s\.enterprisedb\.io/externalBackupAdapterClusterConfig}')
    
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] Removing Backup Adapter configuration annotation from EDB cluster ${cluster_name}"
    oc annotate cluster.postgresql.k8s.enterprisedb.io ${cluster_name} -n ${zen_namespace} k8s.enterprisedb.io/externalBackupAdapterClusterConfig-
    
    #remove backup instance so it can be updated
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] Removing Backup Instance annotation from EDB cluster ${cluster_name}"
    oc annotate cluster.postgresql.k8s.enterprisedb.io ${cluster_name} -n ${zen_namespace} k8s.enterprisedb.io/backupInstance-
    
    velero_labels=$(oc get pvc,pods -l k8s.enterprisedb.io/cluster="${cluster_name}",velero.io/exclude-from-backup=true -n ${zen_namespace} -o  name | wc -l)
    if [ "$velero_labels" -gt 0 ]; then
      echo "[$(date '+%Y-%m-%d %H:%M:%S')] Removing velero label from related pvcs and pods of the EDB cluster ${cluster_name}"
      oc label pvc,pods -l k8s.enterprisedb.io/cluster="${cluster_name}",velero.io/exclude-from-backup=true velero.io/exclude-from-backup- -n ${zen_namespace}
    fi
    
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] Pausing for 30 seconds"
    sleep 30
    
    #Re-add backup annotation to cluster
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] Adding Backup Adapter configuration annotation to EDB cluster ${cluster_name}"
    oc annotate cluster.postgresql.k8s.enterprisedb.io ${cluster_name} -n ${zen_namespace} k8s.enterprisedb.io/externalBackupAdapterClusterConfig="${annotation_value}"
    
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] Reset of Backup Adapter configuration annotation / labels for EDB cluster ${cluster_name} is complete."
  2. Make the script executable:
    chmod 755 reset_br_annotations_labels.sh
  3. Get the EDB Postgres cluster name:
    oc get cluster -n ${PROJECT_CPD_INST_OPERANDS}
  4. Run the script with the following command:

    Replace <EDB_Cluster_Name> with the EDB Postgres cluster name.

    ./reset_br_annotations_labels.sh <EDB_Cluster_Name> ${PROJECT_CPD_INST_OPERANDS}
  5. Retry the backup.

Restoring an RSI-enabled backup fails

Applies to: 5.0.0-5.0.2

Fixed in: 5.0.3

Diagnosing the problem
Restoring an RSI-enabled backup with IBM Storage Fusion fails at the Hook: br-service-hooks-operators restore step. The cpdbr-tenant.log file shows the following error:
cannot create resource "mutatingwebhookconfigurations" in API group "admissionregistration.k8s.io" at the cluster scope
Cause of the problem
Permissions are missing in the cpdbr-tenant-service-clusterrole clusterrole.
Resolving the problem
Do the following steps:
  1. Install cpd-cli 5.0.3.
  2. Upgrade the cpdbr service:
    • The cluster pulls images from the IBM Entitled Registry:
      Environments with the scheduling service
      cpd-cli oadp install \
      --upgrade=true \
      --component=cpdbr-tenant \
      --tenant-operator-namespace=${PROJECT_CPD_INST_OPERATORS} \
      --cpdbr-hooks-image-prefix=icr.io/cpopen/cpd \
      --cpd-scheduler-namespace=${PROJECT_SCHEDULING_SERVICE} \
      --recipe-type=br \
      --log-level=debug \
      --verbose
      Environments without the scheduling service
      cpd-cli oadp install \
      --upgrade=true \
      --component=cpdbr-tenant \
      --tenant-operator-namespace=${PROJECT_CPD_INST_OPERATORS} \
      --cpdbr-hooks-image-prefix=icr.io/cpopen/cpd \
      --recipe-type=br \
      --log-level=debug \
      --verbose
    • The cluster pulls images from a private container registry:
      Environments with the scheduling service
      cpd-cli oadp install \
      --upgrade=true \
      --component=cpdbr-tenant \
      --namespace=${OADP_OPERATOR_NS} \
      --tenant-operator-namespace=${PROJECT_CPD_INST_OPERATORS} \
      --cpdbr-hooks-image-prefix=${PRIVATE_REGISTRY_LOCATION} \
      --cpd-scheduler-namespace=${PROJECT_SCHEDULING_SERVICE} \
      --recipe-type=br \
      --log-level=debug \
      --verbose
      Environments without the scheduling service
      cpd-cli oadp install \
      --upgrade=true \
      --component=cpdbr-tenant \
      --namespace=${OADP_OPERATOR_NS} \
      --tenant-operator-namespace=${PROJECT_CPD_INST_OPERATORS} \
      --cpdbr-hooks-image-prefix=${PRIVATE_REGISTRY_LOCATION} \
      --recipe-type=br \
      --log-level=debug \
      --verbose
  3. Retry the restore.
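
The four command variants in step 2 differ only in the image prefix, the --namespace option, and the --cpd-scheduler-namespace option. The following sketch assembles the options from environment variables. It is not part of cpd-cli: cpdbr_upgrade_args is a hypothetical helper name, and using a non-empty PRIVATE_REGISTRY_LOCATION or PROJECT_SCHEDULING_SERVICE variable as the switch between variants is an assumption of this sketch. Every option is copied from the commands above.

```shell
# cpdbr_upgrade_args
# Prints the options for `cpd-cli oadp install`, one per line.
#   PRIVATE_REGISTRY_LOCATION set   -> private registry variant
#   PROJECT_SCHEDULING_SERVICE set  -> scheduling service variant
cpdbr_upgrade_args() {
  printf '%s\n' "--upgrade=true" "--component=cpdbr-tenant"
  if [ -n "${PRIVATE_REGISTRY_LOCATION:-}" ]; then
    printf '%s\n' "--namespace=${OADP_OPERATOR_NS:-}"
    prefix="${PRIVATE_REGISTRY_LOCATION}"
  else
    prefix="icr.io/cpopen/cpd"
  fi
  printf '%s\n' "--tenant-operator-namespace=${PROJECT_CPD_INST_OPERATORS:-}"
  printf '%s\n' "--cpdbr-hooks-image-prefix=${prefix}"
  if [ -n "${PROJECT_SCHEDULING_SERVICE:-}" ]; then
    printf '%s\n' "--cpd-scheduler-namespace=${PROJECT_SCHEDULING_SERVICE}"
  fi
  printf '%s\n' "--recipe-type=br" "--log-level=debug" "--verbose"
}

# Usage (values must not contain spaces because the output is word-split):
#   cpd-cli oadp install $(cpdbr_upgrade_args)
```

Review the printed options before you run the command, especially in environments that define the registry or scheduler variables for other purposes.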

Restore fails at Hook: br-service-hooks-operators restore step

Applies to: 5.0.1

Fixed in: 5.0.2

Diagnosing the problem
This problem occurs when using IBM Storage Fusion 2.7.2.
  1. The restore process fails at the Hook: br-service-hooks-operators restore step, and you see the following error message:
    Recipe failed
    BMYBR0003 There was an error when processing the job in the Transaction Manager service
  2. The ${PROJECT_CPD_INST_OPERANDS} project was not created during the restore.
  3. When you run the following commands, the IBM Storage Fusion application custom resource does not have the Cloud Pak for Data instance project listed under .spec.includedNamespaces.
    export PROJECT_FUSION=<fusion-namespace>
    Tip: By default, the IBM Storage Fusion project is ibm-spectrum-fusion-ns.
    oc get fapp -n ${PROJECT_FUSION} ${PROJECT_CPD_INST_OPERATORS} -o json | jq .spec
Cause of the problem
The backup is incomplete, causing the restore to fail.
Resolving the problem
Do the following steps:
  1. Install cpd-cli 5.0.2.
  2. Upgrade the cpdbr service:
    • The cluster pulls images from the IBM Entitled Registry:
      Environments with the scheduling service
      cpd-cli oadp install \
      --upgrade=true \
      --component=cpdbr-tenant \
      --tenant-operator-namespace=${PROJECT_CPD_INST_OPERATORS} \
      --cpdbr-hooks-image-prefix=icr.io/cpopen/cpd \
      --cpd-scheduler-namespace=${PROJECT_SCHEDULING_SERVICE} \
      --recipe-type=br \
      --log-level=debug \
      --verbose
      Environments without the scheduling service
      cpd-cli oadp install \
      --upgrade=true \
      --component=cpdbr-tenant \
      --tenant-operator-namespace=${PROJECT_CPD_INST_OPERATORS} \
      --cpdbr-hooks-image-prefix=icr.io/cpopen/cpd \
      --recipe-type=br \
      --log-level=debug \
      --verbose
    • The cluster pulls images from a private container registry:
      Environments with the scheduling service
      cpd-cli oadp install \
      --upgrade=true \
      --component=cpdbr-tenant \
      --namespace=${OADP_OPERATOR_NS} \
      --tenant-operator-namespace=${PROJECT_CPD_INST_OPERATORS} \
      --cpdbr-hooks-image-prefix=${PRIVATE_REGISTRY_LOCATION} \
      --cpd-scheduler-namespace=${PROJECT_SCHEDULING_SERVICE} \
      --recipe-type=br \
      --log-level=debug \
      --verbose
      Environments without the scheduling service
      cpd-cli oadp install \
      --upgrade=true \
      --component=cpdbr-tenant \
      --namespace=${OADP_OPERATOR_NS} \
      --tenant-operator-namespace=${PROJECT_CPD_INST_OPERATORS} \
      --cpdbr-hooks-image-prefix=${PRIVATE_REGISTRY_LOCATION} \
      --recipe-type=br \
      --log-level=debug \
      --verbose
  3. Patch policy assignments with the backup and restore recipe details.
    1. Log in to Red Hat OpenShift Container Platform as an instance administrator.
      ${OC_LOGIN}
      Remember: OC_LOGIN is an alias for the oc login command.
    2. Get each policy assignment name:
      export PROJECT_FUSION=<fusion-namespace>
      oc get policyassignment -n ${PROJECT_FUSION}
    3. If installed, patch the ${PROJECT_SCHEDULING_SERVICE} policy assignment:
      oc -n ${PROJECT_FUSION} patch policyassignment <cpd-scheduler-policy-assignment> --type merge -p '{"spec":{"recipe":{"name":"ibmcpd-scheduler", "namespace":"'"${PROJECT_SCHEDULING_SERVICE}"'", "apiVersion":"spp-data-protection.isf.ibm.com/v1alpha1"}}}'
    4. Patch the Cloud Pak for Data tenant policy assignment:
      oc -n ${PROJECT_FUSION} patch policyassignment <cpd-tenant-policy-assignment> --type merge -p '{"spec":{"recipe":{"name":"ibmcpd-tenant", "namespace":"'"${PROJECT_CPD_INST_OPERATORS}"'", "apiVersion":"spp-data-protection.isf.ibm.com/v1alpha1"}}}'
  4. Check that the IBM Storage Fusion application custom resource for the Cloud Pak for Data operator includes the following information:
    • All projects (namespaces) that are members of the Cloud Pak for Data instance, including:
      • The Cloud Pak for Data operators project (${PROJECT_CPD_INST_OPERATORS}).
      • The Cloud Pak for Data operands project (${PROJECT_CPD_INST_OPERANDS}).
      • All tethered projects, if they exist.
    • The PARENT_NAMESPACE variable, which is set to ${PROJECT_CPD_INST_OPERATORS}.
    1. To get the list of all projects that are members of the Cloud Pak for Data instance, run the following command:
      oc get -n ${PROJECT_FUSION} applications.application.isf.ibm.com ${PROJECT_CPD_INST_OPERATORS} -o jsonpath='{.spec.includedNamespaces}'
    2. To get the PARENT_NAMESPACE variable, run the following command:
      oc get -n ${PROJECT_FUSION} applications.application.isf.ibm.com ${PROJECT_CPD_INST_OPERATORS} -o jsonpath='{.spec.variables}'
  5. Take a new backup.
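
The check in step 4 can be scripted against the output of the includedNamespaces command. The following sketch assumes that the jsonpath output is a JSON array of quoted project names, for example ["cpd-operators","cpd-instance"]. ns_listed is a hypothetical helper name.

```shell
# ns_listed <project-name>
# Reads the JSON array of included namespaces on stdin and succeeds if the
# given project name appears as a complete quoted entry.
ns_listed() {
  grep -qF -- "\"$1\""
}

# Usage (requires a logged-in oc session):
#   list=$(oc get -n "${PROJECT_FUSION}" applications.application.isf.ibm.com \
#     "${PROJECT_CPD_INST_OPERATORS}" -o jsonpath='{.spec.includedNamespaces}')
#   for ns in "${PROJECT_CPD_INST_OPERATORS}" "${PROJECT_CPD_INST_OPERANDS}"; do
#     echo "$list" | ns_listed "$ns" || echo "Missing project: $ns"
#   done
```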

Data Virtualization restore fails at post-workload step

Applies to: 5.0.0-5.0.2

Fixed in: 5.0.3

Diagnosing the problem
When restoring an online backup of a Cloud Pak for Data deployment that includes Data Virtualization with IBM Storage Fusion, the restore fails at the Hook: br-service-hooks/post-workload step in the restore sequence. In the log file, you see the following error message:
time=<timestamp> level=info msg=   zen/configmap/cpd-dv-aux-ckpt-cm: component=dv, 
op=<mode=post-restore,type=config-hook,method=rule>, status=error func=cpdbr-oadp/pkg/quiesce.logPlanResult 
file=/go/src/cpdbr-oadp/pkg/quiesce/planexecutor.go:1137
Workaround
To work around the problem, do the following steps:
  1. Scale down the Data Virtualization hurricane pod:
    oc scale deployment c-db2u-dv-hurricane-dv --replicas=0
  2. Log in to the Data Virtualization head pod:
    oc rsh c-db2u-dv-db2u-0 bash
    su - db2inst1
  3. Create a backup copy of the users.json file:
    cp /mnt/blumeta0/db2_config/users.json /mnt/PV/versioned/logs/users.json.original
  4. Edit the users.json file:
    vi /mnt/blumeta0/db2_config/users.json
  5. Locate "locked":true and change it to "locked":false.
  6. Scale up the Data Virtualization hurricane pod:
    oc scale deployment c-db2u-dv-hurricane-dv --replicas=1
  7. Restart BigSQL from the Data Virtualization head pod:
    oc exec -it c-db2u-dv-db2u-0 -- su - db2inst1 -c "bigsql start"

    The Data Virtualization head and worker pods continue with the startup sequence.

  8. Wait until the Data Virtualization head and worker pods are fully started. To check their status, run the following two commands:
    oc get pods | grep -i c-db2u-dv-dvcaching | grep 1/1 | grep -i Running
    oc exec -t c-db2u-dv-db2u-0 -- su - db2inst1 -c "ls -ls /tmp" | grep dv_setup_complete

    The Data Virtualization head and worker pods are fully started when both commands return matching lines instead of empty output.

  9. Re-create the marker file that is needed by the Data Virtualization post-restore hook logic:
    oc exec -t c-db2u-dv-db2u-0 -- su - db2inst1 -c "touch /tmp/.ready_to_connectToDb"
  10. Re-run the post-restore hook.
    1. Get the cpdbr-tenant-service pod ID:
      oc get po -A | grep "cpdbr-tenant-service"
    2. Log in to the cpdbr-tenant-service pod:
      oc rsh -n ${PROJECT_CPD_INST_OPERATORS} <cpdbr-tenant-service pod id>
    3. Run the following commands:
      /cpdbr-scripts/cpdbr-oadp restore posthooks --hook-kind=checkpoint --include-namespaces=${PROJECT_CPD_INST_OPERATORS},${PROJECT_CPD_INST_OPERANDS} --log-level=debug --verbose
      /cpdbr-scripts/cpdbr/cpdbr-cpd-operators.sh restore-namespacescope --operators-namespace ${PROJECT_CPD_INST_OPERATORS} --foundation-namespace ${PROJECT_CPD_INST_OPERATORS}
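
The two checks in step 8 can be repeated in a polling loop instead of being rerun by hand. The following is a minimal sketch, not part of the documented procedure. The helper names (wait_until, dv_caching_running, dv_setup_complete), the attempt limit, and the sleep interval are assumptions chosen for illustration.

```shell
# wait_until is a hypothetical helper: it reruns a command until the command
# succeeds or the attempt limit is reached.
# Usage: wait_until <max_attempts> <sleep_seconds> <command> [args...]
wait_until() {
  _max=$1
  _sleep=$2
  shift 2
  _attempt=1
  while ! "$@" >/dev/null 2>&1; do
    if [ "$_attempt" -ge "$_max" ]; then
      return 1
    fi
    _attempt=$((_attempt + 1))
    sleep "$_sleep"
  done
  return 0
}

# The two checks from step 8, wrapped so that each one succeeds only when
# the final grep finds a match.
dv_caching_running() {
  oc get pods | grep -i c-db2u-dv-dvcaching | grep 1/1 | grep -qi Running
}
dv_setup_complete() {
  oc exec -t c-db2u-dv-db2u-0 -- su - db2inst1 -c "ls -ls /tmp" | grep -q dv_setup_complete
}

# Example (assumed limits: 60 attempts, 30 seconds apart):
#   wait_until 60 30 dv_caching_running && wait_until 60 30 dv_setup_complete
```

If either call returns a nonzero exit code, the pods did not finish starting within the assumed limit, and the remaining steps should not be run yet.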

Restoring an online backup of Cloud Pak for Data on IBM Storage Scale Container Native storage fails

Applies to: IBM Storage Fusion 2.7.2 and later

Diagnosing the problem
When you restore an online backup with IBM Storage Fusion, the restore process fails at the Volume group: cpd-volumes step in the restore sequence.
Workaround
This problem occurs when you have Persistent Volume Claims (PVCs) that are smaller than 5Gi. To work around the problem, expand any PVC that is smaller than 5Gi to at least 5Gi before you create the backup. For details, see Volume Expansion in the IBM Storage Scale Container Storage Interface Driver documentation.
Note: You cannot manually expand Watson OpenScale PVCs. To manage PVC sizes for Watson OpenScale, see Managing persistent volume sizes for Watson OpenScale.
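
To find the PVCs that are smaller than 5Gi before you create the backup, you can filter the requested sizes. The following is a minimal sketch. The helper name small_pvcs is hypothetical, and the filter assumes that sizes use the binary suffixes Ki, Mi, Gi, or Ti. Lines with any other unit are skipped.

```shell
# small_pvcs reads "<pvc_name> <requested_size>" lines on standard input and
# prints the lines whose size is below 5Gi.
small_pvcs() {
  awk '
    function to_mib(q,   n) {
      n = q + 0                        # numeric prefix, for example 1.5 from "1.5Gi"
      if (q ~ /Ti$/) return n * 1024 * 1024
      if (q ~ /Gi$/) return n * 1024
      if (q ~ /Mi$/) return n
      if (q ~ /Ki$/) return n / 1024
      return -1                        # unrecognized unit: skip the line
    }
    { size = to_mib($2) }
    size >= 0 && size < 5 * 1024 { print $1, $2 }
  '
}

# Example:
#   oc get pvc -n ${PROJECT_CPD_INST_OPERANDS} \
#     -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.spec.resources.requests.storage}{"\n"}{end}' \
#     | small_pvcs
```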

Backup failed at Volume group: cpd-volumes stage

Applies to: IBM Storage Fusion 2.7.2

Fixed in: IBM Storage Fusion 2.7.2 hotfix

Diagnosing the problem
In the backup sequence in IBM Storage Fusion 2.7.2, the backup fails at the Volume group: cpd-volumes stage.

The transaction manager log shows several error messages, such as the following examples:

<timestamp>[TM_0] - Error: Processing of volume cc-home-pvc failed.\n", "<timestamp>[VOL_12] -
Snapshot exception (410)\\nReason: Expired: too old resource version: 2575013 (2575014)
Workaround
Install the IBM Storage Fusion 2.7.2 hotfix. For details, see IBM Storage Fusion and IBM Storage Fusion HCI hotfix.

Backup of Cloud Pak for Data operators project fails at data transfer stage

Applies to: IBM Storage Fusion 2.7.2

Fixed in: IBM Storage Fusion 2.7.2 hotfix

Diagnosing the problem
In IBM Storage Fusion 2.7.2, the backup fails at the Data transfer stage, with the following error:
Failed transferring data
There was an error when processing the job in the Transaction Manager service
Cause
The length of a Persistent Volume Claim (PVC) name is more than 59 characters.
Workaround
Install the IBM Storage Fusion 2.7.2 hotfix. For details, see IBM Storage Fusion and IBM Storage Fusion HCI hotfix.

With the hotfix, PVC names can be up to 249 characters long.
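
To check whether a project contains PVC names that exceed the 59-character limit, you can measure the name lengths. The following is a minimal sketch. The helper name long_pvc_names is hypothetical.

```shell
# long_pvc_names reads PVC names, one per line, and prints each name that is
# longer than 59 characters, prefixed with its length.
long_pvc_names() {
  awk 'length($0) > 59 { print length($0), $0 }'
}

# Example (oc get -o name prints "persistentvolumeclaim/<name>", so the
# resource prefix is stripped first):
#   oc get pvc -n ${PROJECT_CPD_INST_OPERATORS} -o name | sed 's|.*/||' | long_pvc_names
```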

Watson OpenScale etcd server fails to start after restoring from a backup

Applies to: 5.0.0 and later

Diagnosing the problem
After restoring a backup with NetApp Astra Control Center, the Watson OpenScale etcd cluster is in a Failed state.
Workaround
To work around the problem, do the following steps:
  1. Log in to Red Hat OpenShift Container Platform as a cluster administrator.
    ${OC_LOGIN}
    Remember: OC_LOGIN is an alias for the oc login command.
  2. Expand the size of the etcd PersistentVolumes by 1Gi.

    In the following example, the current PVC size is 10Gi, and the commands set the new PVC size to 11Gi.

    operatorPod=`oc get pod -n ${PROJECT_CPD_INST_OPERATORS} -l name=ibm-cpd-wos-operator | awk 'NR>1 {print $1}'`
    oc exec ${operatorPod} -n ${PROJECT_CPD_INST_OPERATORS} -- roles/service/files/etcdresizing_for_resizablepv.sh  -n ${PROJECT_CPD_INST_OPERANDS} -s 11Gi
  3. Wait for the reconciliation status of the Watson OpenScale custom resource to be in a Completed state:
    oc get WOService aiopenscale -n ${PROJECT_CPD_INST_OPERANDS} -o jsonpath='{.status.wosStatus} {"\n"}'

    The status of the custom resource changes to Completed when the reconciliation finishes successfully.
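
The new size that is passed to the -s option in step 2 is the current size plus 1Gi. The following is a minimal sketch of that calculation. The helper name next_gi is hypothetical, and it assumes that the current size is a whole number of Gi. The placeholder <etcd_pvc_name> in the example is also an assumption, because the PVC names depend on your deployment.

```shell
# next_gi prints the given size increased by 1Gi, for example 10Gi -> 11Gi.
# It rejects sizes that are not a whole number followed by "Gi".
next_gi() {
  case "$1" in
    *Gi) ;;
    *) echo "expected a size in Gi, got: $1" >&2; return 1 ;;
  esac
  _n=${1%Gi}
  case "$_n" in
    ''|*[!0-9]*) echo "expected a whole number of Gi, got: $1" >&2; return 1 ;;
  esac
  echo "$((_n + 1))Gi"
}

# Example:
#   current=$(oc get pvc <etcd_pvc_name> -n ${PROJECT_CPD_INST_OPERANDS} \
#     -o jsonpath='{.spec.resources.requests.storage}')
#   next_gi "$current"
```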

Restore fails at the running post-restore script step

Applies to: 5.0.3

Diagnosing the problem
When you use Portworx asynchronous disaster recovery, activating applications fails when you run the post-restore script. In the restore_post_hooks_<timestamp>.log file, you see an error message such as in the following example:
Time: <timestamp> level=error -  cpd-tenant-restore-<timestamp>-r2 failed
/cpdbr-scripts/cpdbr/cpdbr-tenant.sh post-restore exit code=1
*** cpdbr-tenant.sh post-restore failed ***
command terminated with exit code 1
Resolving the problem
To work around the problem, restore the custom resource definitions before you run the post-restore script. Run the following command:
cpd-cli oadp restore create <restore-name-r2> \
--from-backup=cpd-tenant-backup-<timestamp>-b2 \
--include-resources='customresourcedefinitions' \
--include-cluster-resources=true \
--skip-hooks \
--log-level=debug \
--verbose

Cloud Pak for Data resources are not migrated

Applies to: 5.0.2

Fixed in: 5.0.3

Diagnosing the problem
When you use Portworx asynchronous disaster recovery, the migration finishes almost immediately, but no volumes are migrated, or fewer resources than expected are migrated. To check the migration status, run the following command:
storkctl get migrations -n ${PX_ADMIN_NS}
Tip: ${PX_ADMIN_NS} is usually kube-system.
Example output:
NAME                                                CLUSTERPAIR       STAGE   STATUS       VOLUMES   RESOURCES   CREATED               ELAPSED                       TOTAL BYTES TRANSFERRED
cpd-tenant-migrationschedule-interval-<timestamp>   mig-clusterpair   Final   Successful   0/0       0/0         <timestamp>   Volumes (0s) Resources (3s)   0
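
To pick out the affected migrations from a longer list, you can filter on the VOLUMES and RESOURCES columns. The following is a minimal sketch. The helper name empty_migrations is hypothetical, and the filter assumes the column order that is shown in the example output.

```shell
# empty_migrations reads the output of "storkctl get migrations" and prints
# the name of each migration that reports 0/0 volumes and 0/0 resources.
# NR > 1 skips the header line.
empty_migrations() {
  awk 'NR > 1 && $5 == "0/0" && $6 == "0/0" { print $1 }'
}

# Example:
#   storkctl get migrations -n ${PX_ADMIN_NS} | empty_migrations
```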
Cause of the problem
This problem occurs starting with stork 23.11.0. Backup exec rules are not run, and expected Cloud Pak for Data resources are not migrated.
Resolving the problem
To resolve the problem, downgrade stork to a version prior to 23.11.0. For more information about stork releases, see the stork Releases page.
  1. Scale down the Portworx operator so that it doesn't reset manual changes to the stork deployment:
    oc scale -n ${PX_ADMIN_NS} deploy portworx-operator --replicas=0
  2. Edit the stork deployment image version to a version prior to 23.11.0:
    oc edit deploy -n ${PX_ADMIN_NS} stork
  3. If you need to scale up the Portworx operator, run the following command.
    Note: The Portworx operator will undo changes to the stork deployment and return to the original stork version.
    oc scale -n ${PX_ADMIN_NS} deploy portworx-operator --replicas=1
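
To confirm whether the stork version that is running is affected, you can compare the image tag with 23.11.0. The following is a minimal sketch. The helper names version_ge and image_tag are hypothetical. The sketch assumes that the stork image is referenced by a version tag rather than by a digest, and that the sort command supports the -V (version sort) option.

```shell
# version_ge succeeds when version $1 is greater than or equal to version $2.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# image_tag prints the part of an image reference after the last colon.
image_tag() {
  echo "${1##*:}"
}

# Example:
#   image=$(oc get deploy stork -n ${PX_ADMIN_NS} \
#     -o jsonpath='{.spec.template.spec.containers[0].image}')
#   if version_ge "$(image_tag "$image")" 23.11.0; then
#     echo "stork version is affected"
#   fi
```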

Creating an offline backup in REST mode stalls

Applies to: 5.0.0 and later

Diagnosing the problem
This problem occurs when you try to create an offline backup in REST mode by using a custom --image-prefix value. The offline backup stalls with cpdbr-vol-mnt pods in the ImagePullBackOff state.
Cause of the problem
When you specify the --image-prefix option in the cpd-cli oadp backup create command, the default prefix registry.redhat.io/ubi9 is always used instead of the value that you specify.
Resolving the problem
To work around the problem, create the backup in Kubernetes mode instead. To change to this mode, run the following command:
cpd-cli oadp client config set runtime-mode=

Common core services custom resource is in InProgress state after an offline restore to a different cluster

Applies to: 5.0.0, 5.0.1

Fixed in: 5.0.2

Diagnosing the problem
  1. Get the status of installed components by running the following command.
    cpd-cli manage get-cr-status \
    --cpd_instance_ns=${PROJECT_CPD_INST_OPERANDS}
  2. Check that the status of ccs-cr is InProgress.
Cause of the problem
The Common core services component failed to reconcile on the restored cluster because the dsx-requisite-pre-install-job-<xxxx> job pod is failing.
Resolving the problem
To resolve the problem, follow the instructions that are described in the technote Failed dsx-requisite-pre-install-job during offline restore.

OpenPages offline backup fails with pre-hook error

Applies to: 5.0.1, 5.0.2

Fixed in: 5.0.3

Diagnosing the problem
The CPD-CLI*.log file shows pre-backup hook errors such as in the following example:
<time>  Hook execution breakdown by status=error/timedout:
<time>  
<time>  The following hooks either have errors or timed out
<time>  
<time>  pre-backup (1):
<time>  
<time>      	COMPONENT                     	CONFIGMAP                               	METHOD	STATUS	DURATION      
<time>      	openpages-openpagesinstance-cr	openpages-openpagesinstance-cr-aux-br-cm	rule  	error 	6m0.080179343s
<time>  
<time>  --------------------------------------------------------------------------------
<time>  
<time>  
<time>  ** INFO [BACKUP CREATE/SUMMARY/END] *******************************************
<time>  Error: error running pre-backup hooks: Error running pre-processing rules.  Check the /root/br/backup/cpd-cli-workspace/logs/CPD-CLI-<timestamp>.log for errors.
<time>  [ERROR] <timestamp> RunPluginCommand:Execution error:  exit status 1
<time>  nfs0717bak-tenant-offline-b1 k8s offline backup failed
Cause of the problem
Getting the OpenPages custom resource into the InMaintenance state timed out.
Workaround
Increase the pre-hooks timeout value in the openpages-openpagesinstance-cr-aux-br-cm ConfigMap.
  1. Edit the openpages-openpagesinstance-cr-aux-br-cm ConfigMap:
    oc edit cm openpages-openpagesinstance-cr-aux-br-cm -n ${PROJECT_CPD_INST_OPERANDS}
  2. Under pre-hooks, change the timeout value to 600s.
    pre-hooks:
          exec-rules:
          - resource-kind: OpenPagesInstance
            name: openpagesinstance-cr
            actions:
            - builtins:
                name: cpdbr.cpd.ibm.com/enable-maint
                params:
                  statusFieldName: openpagesStatus
                timeout: 600s

Offline backup pre-hooks fail on Separation of Duties cluster

Applies to: 5.0.0 and later

Diagnosing the problem
The CPD-CLI*.log file shows pre-backup hook errors such as in the following example:
<timestamp> level=info msg=   test-watsonxgovernce-instance/configmap/cpd-analyticsengine-aux-br-cm: component=analyticsengine-br, op=<mode=pre-backup,type=config-hook,method=rule>, status=error func=cpdbr-oadp/pkg/quiesce.logPlanResult file=/a/workspace/oadp-upload/pkg/quiesce/planexecutor.go:1137
...
time=<timestamp> level=info msg=   test-watsonxgovernce-instance/configmap/cpd-analyticsengine-cnpsql-aux-br-cm: component=analyticsengine-cnpsql-br, op=<mode=pre-backup,type=config-hook,method=rule>, status=error func=cpdbr-oadp/pkg/quiesce.logPlanResult file=/a/workspace/oadp-upload/pkg/quiesce/planexecutor.go:1137
Cause of the problem
The EDB Postgres pod for the Analytics Engine powered by Apache Spark service is in a CrashLoopBackOff state.
Workaround
To work around the problem, follow the instructions in the IBM Support document Unable to upgrade Spark due to Enterprise database issues.

Offline restore to a different cluster fails due to management-ingress-ibmcloud-cluster-info ConfigMap not found in PodVolumeRestore

Applies to: 5.0.0

Fixed in: 5.0.1

Diagnosing the problem
After you create an offline backup, but before you restore it, check whether the management-ingress-ibmcloud-cluster-info ConfigMap was backed up by running the following commands:
cpd-cli oadp backup status --details <backup_name1> | grep management-ingress-ibmcloud-cluster-info
cpd-cli oadp backup status --details <backup_name2> | grep management-ingress-ibmcloud-cluster-info

During or after the restore, pods that mount the missing ConfigMap show errors. For example:

oc describe po c-db2oltp-wkc-db2u-0 -n ${PROJECT_CPD_INST_OPERANDS}
Example output:
Events:
  Type     Reason       Age                  From     Message
  ----     ------       ----                 ----     -------
  Warning  FailedMount  41m (x512 over 17h)  kubelet  MountVolume.SetUp failed for volume "management-ingress-ibmcloud-cluster-info" : configmap "management-ingress-ibmcloud-cluster-info" not found
  Warning  FailedMount  62s (x518 over 17h)  kubelet  Unable to attach or mount volumes: unmounted volumes=[management-ingress-ibmcloud-cluster-info], unattached volumes=[], failed to process volumes=[]: timed out waiting for the condition
Cause of the problem
When a related ibmcloud-cluster-info ConfigMap gets excluded as part of backup hooks, the management-ingress-ibmcloud-cluster-info ConfigMap copies the exclude labeling and unintentionally gets excluded from the backup.
Workaround
To work around the problem, do the following steps:
  1. Log in to Red Hat OpenShift Container Platform as a cluster administrator.
    ${OC_LOGIN}
    Remember: OC_LOGIN is an alias for the oc login command.
  2. Apply the following patch to ensure that the management-ingress-ibmcloud-cluster-info ConfigMap is not excluded from the backup:
    oc apply -n ${PROJECT_CPD_INST_OPERANDS} -f - << EOF
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: cpdbr-management-ingress-exclude-fix-br
      labels:
        cpdfwk.aux-kind: br
        cpdfwk.component: cpdbr-patch
        cpdfwk.module: cpdbr-management-ingress-exclude-fix
        cpdfwk.name: cpdbr-management-ingress-exclude-fix-br-cm
        cpdfwk.managed-by: ibm-cpd-sre
        cpdfwk.vendor: ibm
        cpdfwk.version: 1.0.0
    data:
      aux-meta: |
        name: cpdbr-management-ingress-exclude-fix-br
        description: |
          This configmap defines offline backup prehooks to prevent cases where Bedrock's management-ingress-ibmcloud-cluster-info
          configmap gets unexpectedly excluded when ibmcloud-cluster-info is excluded during cs-postgres configmap prehooks.
          This is a temporary workaround until a complete fix is implemented.
        version: 1.0.0
        component: cpdbr-patch
        aux-kind: br
        priority-order: 99999 # This should happen at the end of backup prehooks
      backup-meta: |
        pre-hooks:
          exec-rules:
          # Remove lingering velero exclude label from offline prehooks
          - resource-kind: configmap
            name: management-ingress-ibmcloud-cluster-info
            actions:
              - builtins:
                  name: cpdbr.cpd.ibm.com/label-resources
                  params:
                    action: remove
                    key: velero.io/exclude-from-backup
                    value: "true"
                  timeout: 360s
          # Remove lingering ignore-on-nd-backup exclude label from online checkpoint operation
          - resource-kind: configmap
            name: management-ingress-ibmcloud-cluster-info
            actions:
              - builtins:
                  name: cpdbr.cpd.ibm.com/label-resources
                  params:
                    action: remove
                    key: icpdsupport/ignore-on-nd-backup
                    value: "true"
                  timeout: 360s
        post-hooks:
          exec-rules: 
          - resource-kind: # do nothing for posthooks
    ---
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: cpdbr-management-ingress-exclude-fix-ckpt
      labels:
        cpdfwk.aux-kind: checkpoint
        cpdfwk.component: cpdbr-patch
        cpdfwk.module: cpdbr-management-ingress-exclude-fix
        cpdfwk.name: cpdbr-management-ingress-exclude-fix-ckpt-cm
        cpdfwk.managed-by: ibm-cpd-sre
        cpdfwk.vendor: ibm
        cpdfwk.version: 1.0.0
    data:
      aux-meta: |
        name: cpdbr-management-ingress-exclude-fix-ckpt
        description: |
          This configmap defines online backup prehooks to prevent cases where Bedrock's management-ingress-ibmcloud-cluster-info
          configmap gets unexpectedly excluded when ibmcloud-cluster-info is excluded during cs-postgres configmap checkpoint operation.
          This is a temporary workaround until a complete fix is implemented.
        version: 1.0.0
        component: cpdbr-patch
        aux-kind: ckpt
        priority-order: 99999 # This should happen at the end of backup prehooks
      backup-meta: |
        pre-hooks:
          exec-rules:
          # Remove lingering velero exclude label from offline prehooks
          - resource-kind: configmap
            name: management-ingress-ibmcloud-cluster-info
            actions:
              - builtins:
                  name: cpdbr.cpd.ibm.com/label-resources
                  params:
                    action: remove
                    key: velero.io/exclude-from-backup
                    value: "true"
                  timeout: 360s
          # Remove lingering ignore-on-nd-backup exclude label from online checkpoint operation
          - resource-kind: configmap
            name: management-ingress-ibmcloud-cluster-info
            actions:
              - builtins:
                  name: cpdbr.cpd.ibm.com/label-resources
                  params:
                    action: remove
                    key: icpdsupport/ignore-on-nd-backup
                    value: "true"
                  timeout: 360s
        post-hooks:
          exec-rules: 
          - resource-kind: # do nothing for posthooks
      checkpoint-meta: |
        exec-hooks:
          exec-rules: 
          - resource-kind: # do nothing for checkpoint
    EOF

Unable to restore offline backup of OpenPages to different cluster

Applies to: 5.0.0

Fixed in: 5.0.1

Diagnosing the problem
In the CPD-CLI*.log file, you see an error such as in the following example:
CPD-CLI-<timestamp>.log:time=<timestamp> level=error msg=failed to wait for statefulset openpages--78c5-ib-12ce in namespace <cpd_instance_ns>: 
timed out waiting for the condition func=cpdbr-oadp/pkg/kube.waitForStatefulSetPods file=/a/workspace/oadp-upload/pkg/kube/statefulset.go:173
Cause of the problem
The second RabbitMQ pod (ending in -1) is in a CrashLoopBackOff state. Run the following command:
oc get pods -n ${PROJECT_CPD_INST_OPERANDS} | grep openpages
Example output:
openpages--78c5-ib-12ce-0                                1/1     Running                 0                 23h
openpages--78c5-ib-12ce-1                                0/1     CrashLoopBackOff        248 (3m57s ago)   23h
openpages-openpagesinstance-cr-sts-0                     1/2     Running                 91 (12m ago)      23h
openpages-openpagesinstance-cr-sts-1                     1/2     Running                 91 (12m ago)      23h
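
To list only the pods that are in the CrashLoopBackOff state, you can filter on the STATUS column. The following is a minimal sketch. The helper name crashloop_pods is hypothetical, and the filter assumes the default column order of the oc get pods output.

```shell
# crashloop_pods reads "oc get pods" output and prints the name of each pod
# whose STATUS column (the third column) is CrashLoopBackOff.
crashloop_pods() {
  awk '$3 == "CrashLoopBackOff" { print $1 }'
}

# Example:
#   oc get pods -n ${PROJECT_CPD_INST_OPERANDS} | grep openpages | crashloop_pods
```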
Workaround
To work around the problem, do the following steps:
  1. Log in to Red Hat OpenShift Container Platform as a cluster administrator.
    ${OC_LOGIN}
    Remember: OC_LOGIN is an alias for the oc login command.
  2. Check the OpenPages logs for the following error in the second RabbitMQ pod:
    ===========
    Exception during startup:
    
    exit:{boot_failed,{exit_status,1}}
    
        peer:start_it/2, line 639
        rabbit_peer_discovery:query_node_props/1, line 408
        rabbit_peer_discovery:sync_desired_cluster/3, line 189
        rabbit_db:init/0, line 65
        rabbit_boot_steps:-run_step/2-lc$^0/1-0-/2, line 51
        rabbit_boot_steps:run_step/2, line 58
        rabbit_boot_steps:-run_boot_steps/1-lc$^0/1-0-/1, line 22
        rabbit_boot_steps:run_boot_steps/1, line 23
    
  3. If you see this error, check the Erlang cookie value at the top of the OpenPages logs. For example, run the following command:
    oc logs openpages--78c5-ib-12ce-1
    Example output:
    Defaulted container "openpages-openpagesinstance-cr-<instance_id>-ibm-rabbitmq" out of: openpages-openpagesinstance-cr-<instance_id>-ibm-rabbitmq, copy-rabbitmq-config (init)
    ----------------------
    +FkpbwejzK2RXfmPLQAnITroiieu3uGa3vkRA2k6t+8=
    ----------------------
    <timestamp> [warning] <0.156.0> Overriding Erlang cookie using the value set in the environment

    The plus sign (+) at the beginning of the cookie value is the source of the problem.

  4. Generate a new cookie value, encoded in Base64 format:
    openssl rand -base64 32 | tr -d '\n' | base64 | tr -d '\n'
  5. Decode from Base64 format, and make sure that the cookie value does not begin with a plus sign (+).
  6. Replace the cookie value in the auth secret.
    1. Edit the auth secret:
      oc edit secret openpages-openpagesinstance-cr-<instance_id>-rabbitmq-auth-secret
    2. Replace the rabbitmq-erlang-cookie value with the new value.
  7. Delete the StatefulSet, or scale down and then scale up to get all the pods to pick up the new cookie.
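
Because a random cookie value can begin with a plus sign by chance, steps 4 and 5 might need to be repeated. The following is a minimal sketch that regenerates the value until it is acceptable and then encodes it for the secret. The helper names cookie_ok, new_cookie, and encode_for_secret are hypothetical.

```shell
# cookie_ok succeeds when the cookie value does not begin with a plus sign.
cookie_ok() {
  case "$1" in
    +*) return 1 ;;
    *)  return 0 ;;
  esac
}

# new_cookie prints a random cookie value that passes cookie_ok. It
# regenerates the value until the first character is not a plus sign.
new_cookie() {
  while :; do
    _candidate=$(openssl rand -base64 32 | tr -d '\n')
    # Stop if openssl produced no output, for example because it is not installed.
    [ -n "$_candidate" ] || return 1
    if cookie_ok "$_candidate"; then
      printf '%s\n' "$_candidate"
      return 0
    fi
  done
}

# encode_for_secret prints the Base64 encoding of its argument on one line,
# which is the form that the data section of a secret stores.
encode_for_secret() {
  printf '%s' "$1" | base64 | tr -d '\n'
}

# Example:
#   cookie=$(new_cookie) && encode_for_secret "$cookie"
```

The encoded output is the value to paste into the rabbitmq-erlang-cookie field when you edit the auth secret.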

Security issues

Security scans return an Inadequate Account Lockout Mechanism message

Applies to: 5.0.0 and later

Diagnosing the problem
If you run a security scan against Cloud Pak for Data, the scan returns the following message.
Inadequate Account Lockout Mechanism
Resolving the problem
This behavior is by design. It is strongly recommended that you use an enterprise-grade password management solution, such as SAML SSO or an LDAP provider, for password management.

The Kubernetes version information is disclosed

Applies to: 5.0.0 and later

Diagnosing the problem
If you run an Aqua Security scan against your cluster, the scan reports that the Kubernetes version information is disclosed.
Resolving the problem
This is expected based on the following solution document from Red Hat OpenShift Customer Portal: Hide kubernetes /version API endpoint in OpenShift Container Platform 4.