Known issues and limitations for IBM Cloud Pak for Data
The following issues apply to the IBM Cloud Pak for Data platform. Each issue includes information about the releases that it applies to. If the issue was fixed in a refresh, that information is also included.
- Customer-reported issues
- General issues
- Installation and upgrade issues
- Backup and restore issues
- Flight service issues
- Security issues
The following issues apply to IBM Cloud Pak for Data services.
Customer-reported issues
Issues that are found after the release are posted on the IBM Support site.
General issues
- After rebooting a cluster that uses OpenShift Data Foundation storage, some Cloud Pak for Data services aren't functional
- The Assist me icon is not displayed in the web client
- The delete-platform-ca-certs command does not remove certificate mounts from pods
- When you add a secret to a vault, you cannot filter the list of users and groups to show only groups
After rebooting a cluster that uses OpenShift Data Foundation storage, some Cloud Pak for Data services aren't functional
Applies to: 5.0.0 and later
- Diagnosing the problem
- After rebooting the cluster, some Cloud Pak for Data custom resources remain in the InProgress state.
  For more information about this problem, see Missing NodeStageVolume RPC call blocks new pods from going into Running state in the Red Hat® OpenShift® Data Foundation 4.1.4 release notes.
- Workaround
- Do the following steps:
  - Find the nodes that have pods that are in an Error state:
    oc get po -n ${PROJECT_CPD_INST_OPERANDS} -o wide | grep -v -P "Completed|(\d+)\/\1"
  - Mark each node as unschedulable:
    oc adm cordon <node_name>
  - Delete the affected pods:
    oc get pod | grep -Ev "Comp|0/0|1/1|2/2|3/3|4/4|5/5|6/6|7/7" | awk '{print $1}' | xargs oc delete po --force=true --grace-period=0
  - Mark each node as schedulable:
    oc adm uncordon <node_name>
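The first workaround step can be sketched as a small filter that turns the pod listing into a list of node names to cordon. The function name nodes_with_unready_pods is hypothetical. The sketch assumes that the input is the output of oc get po -n <project> -o wide, where the last three columns are NODE, NOMINATED NODE, and READINESS GATES.

```shell
# Hypothetical helper: reads `oc get po -n <project> -o wide` output on stdin
# and prints the unique nodes that host pods that are not Completed and not
# fully ready (READY n/m with n != m).
# Assumption: the last three columns are NODE, NOMINATED NODE, READINESS GATES,
# so the node name is counted from the end of the line. This keeps the result
# stable when the RESTARTS column contains extra words such as "3 (5m ago)".
nodes_with_unready_pods() {
  awk '$1 != "NAME" {
         split($2, ready, "/")
         if ($3 != "Completed" && ready[1] != ready[2]) print $(NF - 2)
       }' | sort -u
}
```

A possible use is oc get po -n ${PROJECT_CPD_INST_OPERANDS} -o wide | nodes_with_unready_pods, followed by oc adm cordon for each node that is printed.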
The Assist me icon is not displayed in the web client
Applies to: Upgrades from Version 4.8.x
Fixed in: 5.0.3
If you upgrade IBM Cloud Pak for Data from Version 4.8.x to Version 5.0, the Assist me icon is not visible in the web client toolbar.
The issue occurs because the ASSIST_ME_ENABLED option is set to false.
- Resolving the problem
- To make Assist me available in the web client:
  - Log in to Red Hat OpenShift Container Platform as a user with sufficient permissions to complete the task.
    ${OC_LOGIN}
    Remember: OC_LOGIN is an alias for the oc login command.
  - Run the following command to set ASSIST_ME_ENABLED to true:
    oc patch cm product-configmap \
    --namespace=${PROJECT_CPD_INST_OPERANDS} \
    --type=merge \
    --patch '{"data": {"ASSIST_ME_ENABLED": "true"}}'
  - Confirm that the ASSIST_ME_ENABLED parameter is set to true:
    oc get cm product-configmap \
    --namespace=${PROJECT_CPD_INST_OPERANDS} \
    -o jsonpath="{.data.ASSIST_ME_ENABLED}{'\n'}"
The delete-platform-ca-certs command does not remove certificate mounts from pods
Applies to: 5.0.0
Fixed in: 5.0.3
When you run the cpd-cli manage delete-platform-ca-certs command, the command does not remove the certificate mounts from pods.
- Resolving the problem
- To remove the certificate mounts from pods:
  - Delete the cpd-custom-ca-certs secret:
    oc delete secret cpd-custom-ca-certs \
    --namespace=${PROJECT_CPD_INST_OPERANDS}
  - Run the cpd-cli manage delete-platform-ca-certs command:
    cpd-cli manage delete-platform-ca-certs \
    --cpd_instance_ns=${PROJECT_CPD_INST_OPERANDS} \
    --apply=true
When you add a secret to a vault, you cannot filter the list of users and groups to show only groups
Applies to: 5.0.0
Fixed in: 5.0.3
When you add a secret to a vault, you can optionally share the secret with other users. However, if you try to filter the list of users and groups to show only groups, the filter does not take effect.
Installation and upgrade issues
- The Switch locations icon is not available if the apply-cr command times out
- Upgrades fail if the Data Foundation Rook Ceph cluster is unstable
- Running the apply-olm command twice during an upgrade can remove required OLM resources
- After an upgrade from Cloud Pak for Data 4.7.3, FoundationDB can indicate a Failed status
- After you upgrade a Red Hat OpenShift Container Platform cluster, the FoundationDB resource can become unavailable
- After you upgrade from Cloud Pak for Data 4.7.4, generating a bearer token fails in an IAM-enabled cluster
- Secrets are not visible in connections after upgrade
- Node pinning is not applied to postgresql pods
- You must manually clean up remote physical location artifacts if the create-physical-location command fails
- The ibm-nginx deployment does not scale fast enough when automatic scaling is configured
The Switch locations icon is not available if the apply-cr command times out
Applies to: 5.0.0, 5.0.1, and 5.0.2
Fixed in: 5.0.3
If you install solutions that are available in different Cloud Pak for Data experiences, the Switch locations icon is not available in the web client if the cpd-cli manage apply-cr command times out.
- Resolving the problem
- Re-run the cpd-cli manage apply-cr command.
Upgrades fail if the Data Foundation Rook Ceph cluster is unstable
Applies to: 5.0.0 and later
If the Red Hat OpenShift Data Foundation or IBM Storage Fusion Data Foundation Rook Ceph® cluster is unstable, upgrades fail.
One symptom is that pods do not start because of a FailedMount error. For example:
Warning FailedMount 36s (x1456 over 2d1h) kubelet MountVolume.MountDevice failed for volume
"pvc-73bf3705-43e9-40bd-87ed-c1e1656d6f12" : rpc error: code = Aborted desc = an operation with the given
Volume ID 0001-0011-openshift-storage-0000000000000001-5e17508b-c295-4306-b684-eaa327aec2ab already exists
- Diagnosing the problem
- To confirm whether the Data Foundation Rook Ceph cluster is unstable:
  - Ensure that the rook-ceph-tools pod is running:
    oc get pods -n openshift-storage | grep rook-ceph-tools
    Note: On IBM Storage Fusion HCI System or on environments that use hosted control planes, the pods are running in the openshift-storage-client project.
  - Set the TOOLS_POD environment variable to the name of the rook-ceph-tools pod:
    export TOOLS_POD=<pod-name>
  - Open a remote shell in the rook-ceph-tools pod:
    oc rsh -n openshift-storage ${TOOLS_POD}
  - Run the following command to get the status of the Rook Ceph cluster:
    ceph status
    Confirm that the output includes the following line:
    health: HEALTH_WARN
  - Exit the pod:
    exit
- Resolving the problem
- To resolve the problem:
  - Get the names of the rook-ceph-mgr pods:
    oc get pods -n openshift-storage | grep rook-ceph-mgr
  - Set the MGR_POD_A environment variable to the name of the rook-ceph-mgr-a pod:
    export MGR_POD_A=<rook-ceph-mgr-a-pod-name>
  - Set the MGR_POD_B environment variable to the name of the rook-ceph-mgr-b pod:
    export MGR_POD_B=<rook-ceph-mgr-b-pod-name>
  - Delete the rook-ceph-mgr-a pod:
    oc delete pods ${MGR_POD_A} -n openshift-storage
  - Ensure that the rook-ceph-mgr-a pod is running before you move to the next step:
    oc get pods -n openshift-storage | grep rook-ceph-mgr
  - Delete the rook-ceph-mgr-b pod:
    oc delete pods ${MGR_POD_B} -n openshift-storage
  - Ensure that the rook-ceph-mgr-b pod is running:
    oc get pods -n openshift-storage | grep rook-ceph-mgr
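The health check in the diagnosing steps can also be done without an interactive shell. The following is a minimal sketch that extracts the health value from ceph status output. The function name ceph_health is hypothetical, and the sketch assumes the usual ceph status layout in which the health value appears on a line that starts with health:.

```shell
# Hypothetical helper: reads `ceph status` output on stdin and prints the
# value of the first "health:" line, for example HEALTH_OK or HEALTH_WARN.
ceph_health() {
  awk '$1 == "health:" { print $2; exit }'
}
```

One way to use it, assuming TOOLS_POD is set as described above, is oc rsh -n openshift-storage ${TOOLS_POD} ceph status | ceph_health. The same call can be repeated after the rook-ceph-mgr pods are restarted to see whether the reported health changed.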
Running the apply-olm command twice during an upgrade can remove required OLM resources
Applies to:
- Upgrades from Version 4.7 to 5.0.0
- Upgrades from Version 4.8 to 5.0.0
Upgrades to later 5.0 refreshes are not affected.
If you run the cpd-cli manage apply-olm command two times, you might notice several problems:
- The operator subscription is missing
- The operator cluster service version (CSV) is missing
If you then run the cpd-cli manage apply-cr command, you might notice additional problems:
- The version information is missing from the spec section of the service custom resource
- When you run the cpd-cli manage get-cr-status command, the values for the Version and Reconciled-version parameters are different.
- Resolving the problem
- To resolve the problem, you must re-run the cpd-cli manage apply-olm command a third time to ensure that the required resources are available. Then, re-run the cpd-cli manage apply-cr command.
After an upgrade from Cloud Pak for Data 4.7.3, FoundationDB can indicate a Failed status
Applies to: Upgrades from Version 4.7.3 to 5.0.0 and later
After upgrading Cloud Pak for Data from Version 4.7.3 to 5.0, the status of the FoundationDB cluster can indicate that it has failed (fdbStatus: Failed). The Failed status can occur even if FoundationDB is available and working correctly. This issue occurs when the FoundationDB resources do not get properly cleaned up by the upgrade.
This issue affects deployments of the following services:
- IBM Knowledge Catalog
- IBM Match 360
- Diagnosing the problem
- To determine if this problem has occurred:
  Required role: To complete this task, you must be a cluster administrator.
  - Check the FoundationDB cluster status:
    oc get fdbcluster -o yaml | grep fdbStatus
    If the returned status is Failed, proceed to the next step to determine if the pods are available.
  - Check whether the FoundationDB pods are up and running:
    oc get pods -n ${PROJECT_CPD_INST_OPERANDS} | grep foundation
    The returned FoundationDB pods should all have a status of Running. If they are not running, then the problem is something other than this issue.
- Resolving the problem
- To resolve this issue, restart the FoundationDB controller (ibm-fdb-controller):
  Required role: To complete this task, you must be a cluster administrator.
  - Identify your FoundationDB controllers:
    oc get pods -n ${PROJECT_CPD_INST_OPERATORS} | grep ibm-fdb-controller
    This command returns the names of two FoundationDB controllers in the following formats:
    ibm-fdb-controller-manager-<INSTANCE-ID>
    apple-fdb-controller-manager-<INSTANCE-ID>
  - Delete the ibm-fdb-controller-manager pod to refresh it:
    oc delete pod ibm-fdb-controller-manager-<INSTANCE-ID> -n ${PROJECT_CPD_INST_OPERATORS}
  - Wait for the controller to restart. This can take approximately one minute.
  - Check the status of your FoundationDB cluster:
    oc -n ${PROJECT_CPD_INST_OPERANDS} get FdbCluster -o yaml
    Confirm that the fdbStatus is now Completed.
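The wait and the final status check in the resolving steps can be combined into a simple polling loop. The following is a sketch under stated assumptions: the function names retry_until and fdb_completed are hypothetical, and the status check assumes that the FdbCluster YAML contains the text fdbStatus: Completed after the controller recovers.

```shell
# Hypothetical helper: retry a command until it succeeds or the attempts run out.
# Usage: retry_until <attempts> <delay-seconds> <command> [args...]
retry_until() {
  local attempts=$1 delay=$2 i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    i=$((i + 1))
    # Sleep only when another attempt follows.
    if [ "$i" -le "$attempts" ]; then
      sleep "$delay"
    fi
  done
  return 1
}

# Hypothetical check: succeeds when the FdbCluster YAML reports Completed.
# Assumption: PROJECT_CPD_INST_OPERANDS is set as elsewhere in this topic.
fdb_completed() {
  oc -n "${PROJECT_CPD_INST_OPERANDS}" get FdbCluster -o yaml |
    grep -q 'fdbStatus: Completed'
}
```

For example, retry_until 12 10 fdb_completed polls every 10 seconds for up to about two minutes, which leaves some margin over the approximately one minute that the controller restart can take.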
After you upgrade a Red Hat OpenShift Container Platform cluster, the FoundationDB resource can become unavailable
Applies to: 5.0.0 and later
After you upgrade your cluster to a new version of Red Hat OpenShift Container Platform, the IBM FoundationDB pods can become unavailable. When this issue occurs, services that rely on FoundationDB such as IBM Knowledge Catalog and IBM Match 360 cannot function correctly.
This issue affects deployments of the following services:
- IBM Knowledge Catalog
- IBM Match 360 with Watson
- Diagnosing the problem
- To identify the cause of this issue, check the FoundationDB status and details.
  - Check the FoundationDB status:
    oc get fdbcluster -o yaml | grep fdbStatus
    If this command is successful, the returned status is Complete. If the status is InProgress or Failed, proceed to the workaround steps.
  - If the status is Complete but FoundationDB is still unavailable, log in to one of the FDB pods and check the status details to ensure that the database is available and all coordinators are reachable:
    oc rsh sample-cluster-log-1 /bin/fdbcli
    To check the detailed status of the FDB pod, run fdbcli to enter the FoundationDB command-line interface, then run the following command at the fdb> prompt:
    status details
    - If you get a message that is similar to Could not communicate with a quorum of coordination servers, run the coordinators command with the IP addresses specified in the error message as input:
      oc get pod -o wide | grep storage
      > coordinators IP-ADDRESS-1:4500:tls IP-ADDRESS-2:4500:tls IP-ADDRESS-3:4500:tls
      If this step does not resolve the problem, proceed to the workaround steps.
    - If you get a different message, such as Recruiting new transaction servers, proceed to the workaround steps.
- Resolving the problem
- To resolve this issue, restart the FoundationDB pods.
  Required role: To complete this task, you must be a cluster administrator.
  - Restart the FoundationDB cluster pods:
    oc get fdbcluster
    oc get po | grep ${CLUSTER_NAME} | grep -v backup | awk '{print $1}' | xargs oc delete po
    Replace ${CLUSTER_NAME} in the command with the name of your fdbcluster instance.
  - Restart the FoundationDB operator pods:
    oc get po | grep fdb-controller | awk '{print $1}' | xargs oc delete po
  - After the pods finish restarting, check to ensure that FoundationDB is available.
    - Check the FoundationDB status:
      oc get fdbcluster -o yaml | grep fdbStatus
      The returned status must be Complete.
    - Check to ensure that the database is available:
      oc rsh sample-cluster-log-1 /bin/fdbcli
      If the database is still not available, complete the following steps.
      - Log in to the ibm-fdb-controller pod.
      - Run the fix-coordinator script:
        kubectl fdb fix-coordinator-ips -c ${CLUSTER_NAME} -n ${PROJECT_CPD_INST_OPERATORS}
        Replace ${CLUSTER_NAME} in the command with the name of your fdbcluster instance.
        Note: For more information about the fix-coordinator script, see the workaround steps from the resolved IBM Match 360 known issue item The FoundationDB cluster can become unavailable.
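The coordinators command in the diagnosing steps takes each coordinator address in the form IP-ADDRESS:4500:tls. As a small convenience, the following sketch builds that command line from a list of IP addresses so that it can be pasted at the fdb> prompt. The function name build_coordinators_cmd is hypothetical; the port and the tls suffix are taken from the command shown in this topic.

```shell
# Hypothetical helper: print an fdbcli `coordinators` command for the
# IP addresses that are passed as arguments.
# The :4500:tls suffix follows the command format shown in this topic.
build_coordinators_cmd() {
  local cmd="coordinators" ip
  for ip in "$@"; do
    cmd="$cmd ${ip}:4500:tls"
  done
  printf '%s\n' "$cmd"
}
```

For example, build_coordinators_cmd 10.0.0.1 10.0.0.2 10.0.0.3 prints a single coordinators line with three addresses. The addresses themselves still come from the error message, as described in the diagnosing steps.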
After you upgrade from Cloud Pak for Data 4.7.4, generating a bearer token fails in an IAM-enabled cluster
Applies to: Upgrades from Version 4.7.4 to 5.0.0 and later
If you upgrade from Cloud Pak for Data Version 4.7.4 to Cloud Pak for Data 5.0.0 or later, the IAM access token API (/idprovider/v1/auth/identitytoken) fails. You cannot log in to the user interface when the identitytoken API fails.
- Diagnosing the problem
- The following error is displayed in the log when you generate an IAM access token:
  Failed to get access token, Liberty error: {"error_description":"CWWKS1406E: The token request had an invalid client credential. The request URI was \/oidc\/endpoint\/OP\/token.","error":"invalid_client"}"
- Resolving the problem
- Do the following steps:
  - Log in to Red Hat OpenShift Container Platform as a cluster administrator.
    ${OC_LOGIN}
    Remember: OC_LOGIN is an alias for the oc login command.
  - Run the following command to restart the oidc-client-registration job:
    oc -n <project> delete job oidc-client-registration
    Replace <project> with the project that contains the oidc-client-registration job.
Secrets are not visible in connections after upgrade
Applies to:
- Upgrades from Version 4.7 to Version 5.0.0, 5.0.1, or 5.0.2
- Upgrades from Version 4.8 to Version 5.0.0, 5.0.1, or 5.0.2
Fixed in: 5.0.3
If you use secrets when you create connections, the secrets are not visible in the connection details after you upgrade Cloud Pak for Data. This issue occurs when your vault uses a private CA signed certificate.
- Resolving the problem
- To see the secrets in the user interface:
  - Change to the project where Cloud Pak for Data is installed:
    oc project ${PROJECT_CPD_INST_OPERANDS}
  - Set the following environment variables:
    oc set env deployment/zen-core-api VAULT_BRIDGE_TLS_RENEGOTIATE=true
    oc set env deployment/zen-core-api VAULT_BRIDGE_TOLERATE_SELF_SIGNED=true
Node pinning is not applied to postgresql pods
Applies to: 5.0.0 and later
If you use node pinning to schedule pods on specific nodes, and your environment includes postgresql pods, the node affinity settings are not applied to the postgresql pods that are associated with your Cloud Pak for Data deployment.
The resource specification injection (RSI) webhook cannot patch postgresql pods because the EDB Postgres operator uses a PodDisruptionBudget resource to limit the number of concurrent disruptions to postgresql pods. The PodDisruptionBudget resource prevents postgresql pods from being evicted.
You must manually clean up remote physical location artifacts if the create-physical-location command fails
Applies to: 5.0.0
Fixed in: 5.0.1
If the cpd-cli manage create-physical-location command fails, the command leaves behind resources that you must clean up by running the cpd-cli manage delete-physical-location command:
cpd-cli manage delete-physical-location \
--physical_location_name=${REMOTE_PHYSICAL_LOCATION_ID} \
--management_ns=${REMOTE_PROJECT_MANAGEMENT} \
--cpd_hub_url=${CPD_HUB_URL} \
--cpd_hub_api_key=${CPD_HUB_API_KEY}
If you try to re-run the create-physical-location command against the same management project before you run the delete-physical-location command, the create-physical-location command returns the following error:
The physical-location-info-cm ConfigMap already exists in the <management-ns> project.
The physical location in the ConfigMap is called <remote-physical-location-id>
* If you need to re-run the create-physical-location command to finish creating the physical location,
you must specify <remote-physical-location-id>.
* If you want to create a new physical location on the cluster, you must specify a different project.
You cannot reuse an existing management project.
The ibm-nginx deployment does not scale fast enough when automatic scaling is configured
Applies to: 5.0.0 and later
If you configure automatic scaling for the IBM Cloud Pak for Data control plane, the ibm-nginx deployment might not scale fast enough. Some symptoms include:
- Slow response times
- High CPU requests are throttled
- The deployment scales up and down even when the workload is steady
This problem typically occurs when you install watsonx Assistant or watsonx Orchestrate.
- Resolving the problem
- If you encounter the preceding symptoms, you must manually scale the ibm-nginx deployment:
  oc patch zenservice lite-cr \
  --namespace=${PROJECT_CPD_INST_OPERANDS} \
  --type merge \
  --patch '{"spec": {
    "Nginx": {
      "name": "ibm-nginx",
      "kind": "Deployment",
      "container": "ibm-nginx-container",
      "replicas": 5,
      "minReplicas": 2,
      "maxReplicas": 11,
      "guaranteedReplicas": 2,
      "metrics": [
        {
          "type": "Resource",
          "resource": {
            "name": "cpu",
            "target": {
              "type": "Utilization",
              "averageUtilization": 529
            }
          }
        }
      ],
      "resources": {
        "limits": {"cpu": "1700m", "memory": "2048Mi", "ephemeral-storage": "500Mi"},
        "requests": {"cpu": "225m", "memory": "920Mi", "ephemeral-storage": "100Mi"}
      },
      "containerPolicies": [
        {
          "containerName": "*",
          "minAllowed": {"cpu": "200m", "memory": "256Mi"},
          "maxAllowed": {"cpu": "2000m", "memory": "2048Mi"},
          "controlledResources": ["cpu", "memory"],
          "controlledValues": "RequestsAndLimits"
        }
      ]
    }
  }}'
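The metrics entry in the patch has the same shape as a Kubernetes HorizontalPodAutoscaler resource metric, where averageUtilization is a percentage of the CPU request. With the values in the patch, 529% of the 225m request is roughly 1190m per pod. The following sketch applies the standard HorizontalPodAutoscaler replica formula to show how such a target translates into a replica count. Whether the control plane applies exactly this rule to ibm-nginx is an assumption, the function name hpa_desired_replicas is hypothetical, and the sketch ignores the tolerance and stabilization behavior of a real autoscaler.

```shell
# Hypothetical helper that applies the Kubernetes HPA replica rule:
#   desired = ceil(current_replicas * current_utilization / target_utilization)
# and clamps the result to [min_replicas, max_replicas].
# Utilization values are integer percentages of the CPU request.
# Usage: hpa_desired_replicas <current> <utilization> <target> <min> <max>
hpa_desired_replicas() {
  local current=$1 util=$2 target=$3 min=$4 max=$5
  # Integer ceiling division.
  local desired=$(( (current * util + target - 1) / target ))
  if [ "$desired" -lt "$min" ]; then desired=$min; fi
  if [ "$desired" -gt "$max" ]; then desired=$max; fi
  echo "$desired"
}
```

For example, hpa_desired_replicas 5 800 529 2 11 models five replicas that average 800% of their CPU request against the 529% target, with the minReplicas and maxReplicas values from the patch.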
Backup and restore issues
- Issues that apply to several or all backup and restore methods
-
- Backup fails for the platform with error in EDB Postgres cluster
- OADP backup is missing EDB Postgres PVCs
- Disk usage size error when running the du-pv command
- After restore, watsonx Assistant custom resource is stuck in InProgress at 11/19 verified state
- After restore, watsonx Assistant is stuck on the 17/19 deployed state or custom resource is stuck in InProgress state
- OADP backup precheck command fails
- During or after a restore, pod shows PVC is missing
- After restoring an online backup, status of Watson Discovery custom resource remains in InProgress state
- After successful restore, the ibm-common-service-operator deployment fails to reach a Running state
- Restore fails with Error from server (Forbidden): configmaps is forbidden error
- After a restore, unable to access the Cloud Pak for Data console
- After a successful restore, the Cloud Pak for Data console points to the source cluster domain in its URL instead of the target cluster domain
- Unable to back up Watson Discovery when the service is scaled to the xsmall size
- In a Cloud Pak for Data deployment that has multiple OpenPages instances, only the first instance is successfully restored
- Unable to log in to Cloud Pak for Data with OpenShift cluster credentials after successfully restoring to a different cluster
- After a restore, OperandRequest timeout error in the ZenService custom resource
- Online backup and restore with the OADP backup and restore utility issues
- Online backup and restore with IBM Storage Fusion issues
-
- Restoring an RSI-enabled backup fails
- Restore fails at Hook: br-service-hooks-operators restore step
- Data Virtualization restore fails at post-workload step
- Restoring an online backup of Cloud Pak for Data on IBM Storage Scale Container Native storage fails
- Backup failed at Volume group: cpd-volumes stage
- Backup of Cloud Pak for Data operators project fails at data transfer stage
- Online backup and restore with NetApp Astra Control Center issues
- Data replication with Portworx issues
- Offline backup and restore with the OADP backup and restore utility issues
-
- Creating an offline backup in REST mode stalls
- Common core services custom resource is in InProgress state after an offline restore to a different cluster
- OpenPages offline backup fails with pre-hook error
- Offline backup pre-hooks fail on Separation of Duties cluster
- Offline restore to a different cluster fails due to management-ingress-ibmcloud-cluster-info ConfigMap not found in PodVolumeRestore
- Unable to restore offline backup of OpenPages to different cluster
OADP backup is missing EDB Postgres PVCs
Applies to: 5.0.0 and later
- Diagnosing the problem
- After an online or offline backup is taken with the OADP utility, EDB Postgres PVCs are missing in the PVC backup list.
- Cause of the problem
- EDB Postgres replica PVCs might be excluded from a backup when an EDB Postgres cluster switches primary instances.
- Resolving the problem
- Before you create a backup, run the following command:
  oc label pvc,pods -l k8s.enterprisedb.io/cluster,velero.io/exclude-from-backup=true velero.io/exclude-from-backup- -n ${PROJECT_CPD_INST_OPERANDS}
For more information, see the following topics:
Disk usage size error when running the du-pv command
Applies to: 5.0.0 and later
- Diagnosing the problem
- When you run the du-pv command to estimate how much storage is needed to create a backup with the OADP utility, you see the following error message:
  Total estimated volume usage size: 0
  one or more error(s) occurred while trying to get disk usage size. Please check reported errors in log file for details
  The status of the cpdbr-agent pods is ImagePullBackOff:
  oc get po -n ${OADP_NAMESPACE}
  Example output:
  NAME                READY   STATUS             RESTARTS   AGE
  cpdbr-agent-9lprf   0/1     ImagePullBackOff   0          74s
  cpdbr-agent-pf42f   0/1     ImagePullBackOff   0          74s
  cpdbr-agent-trprx   0/1     ImagePullBackOff   0          74s
- Cause of the problem
- The --image-prefix option is not currently used by the cpdbr-agent install command. If you specify this option, it is ignored. Instead, the install command uses the default image at registry.access.redhat.com/ubi9/ubi-minimal:latest.
- Resolving the problem
- Do the following steps:
  - Patch the cpdbr-agent daemonset with the desired fully-qualified image name:
    oc patch daemonset cpdbr-agent -n ${OADP_NAMESPACE} --type='json' -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/image", "value":"<fully-qualified-image-name>"}]'
  - Wait for the daemonset to reach a healthy state:
    oc rollout status daemonset cpdbr-agent -n ${OADP_NAMESPACE}
  - Retry the du-pv command.
    Tip: For more information about this feature, see Optional: Estimating how much storage to allocate for backups.
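The pod status check in the diagnosing steps can be sketched as a filter over the oc get po listing. The function name pods_in_status is hypothetical, and the sketch assumes the default column layout shown in the example output, where STATUS is the third column.

```shell
# Hypothetical helper: reads `oc get po` output on stdin and prints the names
# of the pods whose STATUS column equals the first argument.
# Assumption: default column order NAME READY STATUS RESTARTS AGE.
pods_in_status() {
  local status=$1
  awk -v s="$status" '$1 != "NAME" && $3 == s { print $1 }'
}
```

For example, oc get po -n ${OADP_NAMESPACE} | pods_in_status ImagePullBackOff lists the affected cpdbr-agent pods. An empty result after the daemonset is patched suggests that the image pull problem is resolved. Note that a pod with an image pull problem can also briefly report ErrImagePull.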
After restore, watsonx Assistant custom resource is stuck in InProgress at 11/19 verified state
Applies to: 5.0.0-5.0.2
Fixed in: 5.0.3
- Diagnosing the problem
- This problem can occur after you restore an online backup to the same cluster or to a different cluster. Run the following command:
  oc get <watsonx-Assistant-instance-name> -n ${PROJECT_CPD_INST_OPERANDS}
  Example output:
  NAME   VERSION   READY   READYREASON    UPDATING   UPDATINGREASON   DEPLOYED   VERIFIED   QUIESCE   AGE
  wa     5.0.1     False   Initializing   True       VerifyWait       19/19      11/19                4h39m
- Cause of the problem
- Pods are unable to find the wa-global-etcd secret. Run the following command:
  oc describe pod wa-store-<xxxxxxxxx>-<xxxxx> | tail -5
  Example output:
  Normal   QueuePosition  51m (x2 over 52m)     ibm-cpd-scheduler  Queue Position: 3
  Normal   QueuePosition  50m (x2 over 52m)     ibm-cpd-scheduler  Queue Position: 2
  Normal   QueuePosition  36m                   ibm-cpd-scheduler  Queue Position: 1
  Warning  FailedMount    6m49s (x22 over 50m)  kubelet            Unable to attach or mount volumes: unmounted volumes=[global-etcd], unattached volumes=[], failed to process volumes=[]: timed out waiting for the condition
  Warning  FailedMount    74s (x33 over 52m)    kubelet            MountVolume.SetUp failed for volume "global-etcd" : secret "wa-global-etcd" not found
- Resolving the problem
- Delete certain deployments and recreate them by doing the following steps:
  - Ensure that the watsonx Assistant operator is running.
  - Create the INSTANCE environment variable and set it to the watsonx Assistant instance name:
    export INSTANCE=<watsonx-Assistant-instance-name>
  - Run the following script:
    # Components to restart one by one
    SEQUENTIAL_DEPLOYMENTS=("ed" "dragonfly-clu-mm" "tfmm" "clu-triton-serving" "clu-serving" "nlu" "dialog" "store")
    # Components to restart together in parallel
    PARALLEL_DEPLOYMENTS=("analytics" "clu-embedding" "incoming-webhooks" "integrations" "recommends" "system-entities" "ui" "webhooks-connector" "gw-instance" "store-admin")

    for DEPLOYMENT in "${SEQUENTIAL_DEPLOYMENTS[@]}"; do
      echo "#Starting restart of $INSTANCE-$DEPLOYMENT."
      # Delete the deployment
      oc delete deployment $INSTANCE-$DEPLOYMENT
      # Wait until the deployment is completely deleted
      while oc get deployment $INSTANCE-$DEPLOYMENT &> /dev/null; do
        echo "Waiting for $INSTANCE-$DEPLOYMENT to be fully deleted..."
        sleep 5
      done
      # Ensure the deployment is recreated
      echo "Recreating $INSTANCE-$DEPLOYMENT."
      while ! oc get deployment $INSTANCE-$DEPLOYMENT &> /dev/null; do
        echo "Waiting for $INSTANCE-$DEPLOYMENT to be created..."
        sleep 5
      done
      echo "Waiting for $INSTANCE-$DEPLOYMENT to become ready..."
      oc rollout status deployment/$INSTANCE-$DEPLOYMENT --watch=true
      echo "#Rolling restart of $INSTANCE-$DEPLOYMENT completed successfully."
    done

    for DEPLOYMENT in "${PARALLEL_DEPLOYMENTS[@]}"; do
      echo "#Starting restart of $INSTANCE-$DEPLOYMENT."
      # Delete the deployment
      oc delete deployment $INSTANCE-$DEPLOYMENT &
    done
    # Wait for all parallel delete operations to complete
    wait

    # Ensure parallel deployments are recreated
    for DEPLOYMENT in "${PARALLEL_DEPLOYMENTS[@]}"; do
      while ! oc get deployment $INSTANCE-$DEPLOYMENT &> /dev/null; do
        echo "Waiting for $INSTANCE-$DEPLOYMENT to be created..."
        sleep 5
      done
      echo "Waiting for $INSTANCE-$DEPLOYMENT to become ready..."
      oc rollout status deployment/$INSTANCE-$DEPLOYMENT --watch=true
      echo "#Rolling restart of $INSTANCE-$DEPLOYMENT completed successfully."
    done
    echo "All deployments have been restarted successfully."
After restore, watsonx Assistant is stuck on the 17/19 deployed state or custom resource is stuck in InProgress state
Applies to: 5.0.1, 5.0.2
Fixed in: 5.0.3
- Diagnosing the problem
- This problem can occur after you restore an online backup to the same cluster or to a different cluster. Run the following command:
  oc get wa -n ${PROJECT_CPD_INST_OPERANDS}
  Example output:
  NAME   VERSION   READY   READYREASON    UPDATING   UPDATINGREASON   DEPLOYED   VERIFIED   QUIESCE   AGE
  wa     5.0.1     False   Initializing   True       VerifyWait       17/19      15/19                4h39m
- Resolving the problem
- Delete the wa-integrations-operand-secret and wa-integrations-datastore-connection-strings secrets by running the following commands:
  oc delete secret wa-integrations-operand-secret -n ${PROJECT_CPD_INST_OPERANDS}
  oc delete secret wa-integrations-datastore-connection-strings -n ${PROJECT_CPD_INST_OPERANDS}
  After the secrets are deleted, the watsonx Assistant operator recreates them with the correct values, and the watsonx Assistant custom resource and pods return to a good state.
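The two watsonx Assistant restore issues differ in which counter is behind: DEPLOYED is complete and VERIFIED is not in the first case, and both are incomplete in the second. The following sketch reads the oc get wa listing and reports how many components are missing from each counter. The function name wa_progress is hypothetical, and the sketch assumes the column order shown in the example output, where DEPLOYED and VERIFIED are the seventh and eighth columns.

```shell
# Hypothetical helper: reads `oc get wa` output on stdin and prints, for each
# instance, how many components are not yet deployed and not yet verified.
# Assumption: DEPLOYED is column 7 and VERIFIED is column 8, both as n/m.
wa_progress() {
  awk '$1 != "NAME" {
         split($7, dep, "/")
         split($8, ver, "/")
         printf "%s deployed_missing=%d verified_missing=%d\n",
                $1, dep[2] - dep[1], ver[2] - ver[1]
       }'
}
```

For example, oc get wa -n ${PROJECT_CPD_INST_OPERANDS} | wa_progress prints one line per instance. A nonzero deployed_missing value points toward the 17/19 deployed issue, and a zero deployed_missing value with a nonzero verified_missing value points toward the 11/19 verified issue.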
OADP backup precheck command fails
Applies to: 5.0.0, 5.0.1
Fixed in: 5.0.2
- Diagnosing the problem
- This problem occurs when you do offline or online backup and restore with the OADP backup and restore utility. Run the backup precheck command:
  cpd-cli oadp backup precheck --tenant-operator-namespace ${PROJECT_CPD_INST_OPERATORS}
  The following error message appears:
  error getting csv list: : clusterserviceversions.operators.coreos.com is forbidden: User "system:serviceaccount:zen-cpdbrapi:cpdbr-api-sa" cannot list resource "clusterserviceversions" in API group "operators.coreos.com" at the cluster scope
  Error: error getting csv list: : clusterserviceversions.operators.coreos.com is forbidden: User "system:serviceaccount:zen-cpdbrapi:cpdbr-api-sa" cannot list resource "clusterserviceversions" in API group "operators.coreos.com" at the cluster scope
  [ERROR] <timestamp> RunPluginCommand:Execution error: exit status 1
- Cause of the problem
- The cpdbr-api pod does not have the necessary permission to list clusterserviceversions.operators.coreos.com in all projects (namespaces) for the backup precheck command.
- Resolving the problem
- Add --exclude-checks OadpOperatorCSV to the backup precheck command:
  cpd-cli oadp backup precheck \
  --tenant-operator-namespace ${PROJECT_CPD_INST_OPERATORS} \
  --exclude-checks OadpOperatorCSV
During or after a restore, pod shows PVC is missing
Applies to: 5.0.0
Fixed in: 5.0.1
- Diagnosing the problem
- During or after a restore, a pod shows that one or more PVCs are missing. For example:
  oc describe pod c-db2oltp-wkc-db2u-0
  Example output:
  Events:
  Type     Reason            Age                    From               Message
  ----     ------            ----                   ----               -------
  Warning  FailedScheduling  18m (x11076 over 16h)  ibm-cpd-scheduler  0/20 nodes are available: 20 persistentvolumeclaim "wkc-db2u-backups" not found. preemption: 0/20 nodes are available: 20 Preemption is not helpful for scheduling.
- Cause of the problem
- Velero does not back up PVCs that are in a Terminating state.
- Resolving the problem
- To work around the problem, before you restore a backup, ensure that no PVCs are in a Terminating state. To check for PVCs that were in a Terminating state when the backup was created, check the Velero pod logs for Skipping item because it's being deleted messages:
  oc logs -l deploy=velero -n <oadp-operator-ns>
  Example output:
  time="<timestamp>" level=info msg="Skipping item because it's being deleted." backup=oadp-operator/bkupocs661-tenant-online-b1 logSource="/remote-source/velero/app/pkg/backup/item_backupper.go:161" name=wkc-db2u-backups namespace=zen1 resource=persistentvolumeclaims
  time="<timestamp>" level=info msg="Backed up 286 items out of an estimated total of 292 (estimate will change throughout the backup)" backup=oadp-operator/bkupocs661-tenant-online-b1 logSource="/remote-source/velero/app/pkg/backup/backup.go:404" name=wkc-db2u-backups namespace=zen1 progress= resource=persistentvolumeclaims
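The log check above can be sketched as a filter that prints only the names of the PVCs that Velero skipped. The function name skipped_pvcs is hypothetical, and the sketch assumes the key=value log format shown in the example output.

```shell
# Hypothetical helper: reads Velero log lines on stdin and prints the unique
# names of the PVCs that were skipped because they were being deleted.
# Assumption: each log line carries name=<pvc> and
# resource=persistentvolumeclaims fields, as in the example output.
skipped_pvcs() {
  grep "Skipping item because it's being deleted" |
    grep 'resource=persistentvolumeclaims' |
    sed -n 's/.* name=\([^ ]*\).*/\1/p' |
    sort -u
}
```

When logs are selected with a label selector, only the last few lines of each pod are returned by default, so a call such as oc logs -l deploy=velero -n <oadp-operator-ns> --tail=-1 | skipped_pvcs is likely needed to scan the full log. Any PVC that is printed was not included in the backup.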
After restoring an online backup, status of Watson Discovery custom resource remains in
InProgress
state
Applies to: 5.0.0-5.0.2
Fixed in: 5.0.3
- Diagnosing the problem
- You see the following error, even though you did the Multicloud Object Gateway post-restore task. For example, if you used
IBM Storage Fusion to do the backup and
restore, you created the secrets that Watson Discovery uses to
connect to Multicloud Object Gateway.
- lastTransactionTime: <timestamp> message: Post task of online restore is in progress. Please ensure that MCG is correctly configured after restore. reason: PostRestoreInProgress status: "True" type: Message
- Cause of the problem
- The Watson Discovery post-restore task did not complete.
- Resolving the problem
- To work around the problem, do the following steps:
- Check that the Watson Discovery post-restore component exists:
oc get wd wd -o jsonpath='{.status.componentStatus.deployedComponents[?(@=="post_restore")]}'
If the post-restore component exists, the output of the command is:
post_restore
- Check that the post-restore task is not unverified:
oc get wd wd -o jsonpath='{.status.componentStatus.unverifiedComponents[?(@=="post_restore")]}'
If the post-restore task is not unverified, no output is produced by the command.
- In this situation, some failed jobs are not rerun and must be deleted:
oc delete job wd-discovery-enrichment-model-copy wd-discovery-orchestrator-setup
- Check that Watson Discovery is now ready:
oc get wd
Example output:
NAME VERSION READY READYREASON UPDATING UPDATINGREASON DEPLOYED VERIFIED QUIESCE DATASTOREQUIESCE AGE
wd 5.0.0 True Stable False Stable 23/23 23/23 NOT_QUIESCED NOT_QUIESCED 22h
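The final readiness check can be scripted. The following is a minimal sketch under stated assumptions: wd_ready is a hypothetical helper name, and it assumes the column order shown in the example output (READY in column 3, DEPLOYED in column 7, VERIFIED in column 8).

```shell
# Hypothetical helper: reads `oc get wd` table output on stdin.
# Succeeds only if at least one row is present and every row has
# READY=True and matching DEPLOYED and VERIFIED counts.
wd_ready() {
  awk '$1 != "NAME" {
         found = 1
         if ($3 != "True" || $7 != $8) bad = 1
       }
       END { exit (found && !bad) ? 0 : 1 }'
}
```

For example, you might run oc get wd | wd_ready && echo "Watson Discovery is ready". If the column layout differs in your release, adjust the column numbers.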
After successful restore, the ibm-common-service-operator deployment fails to reach a Running state
Applies to: 5.0.0 and later
- Diagnosing the problem
- The following symptoms are seen:
- Running the following command shows that the ibm-common-service-operator pod and deployment are not healthy:
oc get pods -n ${PROJECT_CPD_INST_OPERATORS}
Example output:
ibm-common-service-operator-<...> 0/1 CrashLoopBackOff 72 (4m46s ago) 6h11m
Error logs show permission issues:
oc logs ibm-common-service-operator-<...>
Example output:
...
# I0529 20:52:39.182025 1 request.go:665] Waited for 1.033737216s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/dashboard.opendatahub.io/v1alpha?timeout=32s
# <date_timestamp>20:52:47.794Z INFO controller-runtime.metrics metrics server is starting to listen {"addr": ":8080"}
# I0529 20:52:47.794980 1 main.go:130] Identifying Common Service Operator Role in the namespace cpd-operator
# E0529 20:52:47.835106 1 util.go:465] Failed to fetch configmap kube-public/saas-config: configmaps "saas-config" is forbidden: User "system:serviceaccount:cpd-operator:ibm-common-service-operator" cannot get resource "configmaps" in API group "" in the namespace "kube-public"
# I0529 20:52:47.837942 1 init.go:152] Single Deployment Status: false, MultiInstance Deployment status: true, SaaS Depolyment Status: false
# I0529 20:52:49.188786 1 request.go:665] Waited for 1.340366538s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/cdi.kubevirt.io/v1beta1?timeout=32s
# E0529 20:52:57.412736 1 init.go:1683] Failed to cleanup validatingWebhookConfig: validatingwebhookconfigurations.admissionregistration.k8s.io "ibm-common-service-validating-webhook-cpd-operator" is forbidden: User "system:serviceaccount:cpd-operator:ibm-common-service-operator" cannot delete resource "validatingwebhookconfigurations" in API group "admissionregistration.k8s.io" at the cluster scope
# E0529 20:52:57.412762 1 main.go:153] Cleanup Webhook Resources failed: validatingwebhookconfigurations.admissionregistration.k8s.io "ibm-common-service-validating-webhook-cpd-operator" is forbidden: User "system:serviceaccount:cpd-operator:ibm-common-service-operator" cannot delete resource "validatingwebhookconfigurations" in API group "admissionregistration.k8s.io" at the cluster scope
- Running the following command shows that the ibm-common-service-operator CSV is stuck in a Pending state:
oc get csv -n ${PROJECT_CPD_INST_OPERATORS}
Example output:
NAME DISPLAY VERSION REPLACES PHASE
ibm-zen-operator.v6.0.0 IBM Zen Service 6.0.0 Pending
Running the following command shows that the status of the CommonService custom resource is Succeeded:
oc get commonservice -n ${PROJECT_CPD_INST_OPERANDS} common-service -o json | jq .status.phase
- OLM logs show the following error:
oc logs -n openshift-operator-lifecycle-manager -l app=catalog-operator
oc logs -n openshift-operator-lifecycle-manager -l app=olm-operator
Example output:
E0530 01:00:07.268889 1 queueinformer_operator.go:319] sync {"update" "cpd-operator/ibm-common-service-operator.v4.6.0"} failed: requirements were not met
time="<timestamp>" level=info msg="requirements were not met" csv=cpd-platform-operator.v4.6.0 id=<...> namespace=cpd-operator phase=Pending
- Cause of the problem
- The root cause is a known OLM issue in which ClusterRoleBindings are missing even though the InstallPlan shows that they were created. For details, see the OLM issue ClusterRoleBinding is missing although InstallPlan shows it was created.
- Resolving the problem
- To work around the problem, clean up the Cloud Pak for Data instance and operator projects (namespaces) and retry the restore. For cleanup instructions, see Preparing to restore Cloud Pak for Data with the OADP utility.
Restore fails with Error from server (Forbidden): configmaps is forbidden error
Applies to: 5.0.0
Fixed in: 5.0.1
- Diagnosing the problem
- When restoring Cloud Pak for Data to a different cluster with IBM Storage Fusion, NetApp Astra Control Center, or Portworx, you see the following error message:
Time: <timestamp> level=error - oc get configmap -n kube-public - FAILED with: Error from server (Forbidden): configmaps is forbidden: User "system:serviceaccount:cpd-operator:cpdbr-tenant-service-sa" cannot list resource "configmaps" in API group "" in the namespace "kube-public" End Time: <timestamp>
- Cause of the problem
- The command to uninstall the cpdbr service was run with the incorrect --tenant-operator-namespace parameter. For example, multiple Cloud Pak for Data instances were installed in the cluster, and while cleaning up one of the instances, the incorrect project was specified when uninstalling the cpdbr service.
- Resolving the problem
- To work around the problem, reinstall the cpdbr service in the project where it was mistakenly uninstalled. For details, see one of the following topics:
After a restore, unable to access the Cloud Pak for Data console
Applies to: 5.0.0
Fixed in: 5.0.1
- Diagnosing the problem
- You see the following symptoms:
- Running the following command shows that the ibm-iam-operator pod and deployment are not healthy:
oc get pods -n ${PROJECT_CPD_INST_OPERATORS}
Example output:
ibm-iam-operator-<...> 0/1 CrashLoopBackOff 72 (4m46s ago) 6h11m
Error logs show permission issues:
oc logs ibm-iam-operator-<...>
- Running the following command shows that the ibm-iam-operator CSV is stuck in a Pending state:
oc get csv -n ${PROJECT_CPD_INST_OPERATORS}
Example output:
NAME DISPLAY VERSION REPLACES PHASE
ibm-iam-operator.v4.6.0 IBM IM Operator 4.6.0 Pending
Running the following command shows that the status of the CommonService custom resource is Succeeded:
oc get commonservice -n ${PROJECT_CPD_INST_OPERANDS} common-service -o json | jq .status.phase
- OLM logs show the following error:
oc logs -n openshift-operator-lifecycle-manager -l app=catalog-operator
oc logs -n openshift-operator-lifecycle-manager -l app=olm-operator
Example output:
E0530 01:00:07.268889 1 queueinformer_operator.go:319] sync {"update" "cpd-operator/ibm-iam-operator.v4.6.0"} failed: requirements were not met
time="<timestamp>" level=info msg="requirements were not met" csv=ibm-iam-operator.v4.6.0 id=<...> namespace=cpd-operator phase=Pending
- Cause of the problem
- Permissions are insufficient because a ClusterRole and ClusterRoleBindings are missing. The root cause is a known OLM issue in which ClusterRoleBindings are missing even though the InstallPlan shows that they were created. For details, see the OLM issue ClusterRoleBinding is missing although InstallPlan shows it was created.
- Resolving the problem
- To work around the problem, clean up the Cloud Pak for Data instance and operator projects (namespaces) and retry the restore. For cleanup instructions, see Preparing to restore Cloud Pak for Data with the OADP utility.
After a successful restore, the Cloud Pak for Data console points to the source cluster domain in its URL instead of the target cluster domain
Applies to: 5.0.0
Fixed in: 5.0.1
- Diagnosing the problem
- Get the Cloud Pak for Data console route by running the following command:
oc get route -n ${PROJECT_CPD_INST_OPERANDS}
The output of the command shows that the Cloud Pak for Data console route points to the source cluster domain rather than to the target cluster domain.
- Cause of the problem
- The ibmcloud-cluster-info ConfigMap from the source cluster is included in the restore when it is expected to be excluded and re-generated, causing the target restore cluster to use the source routes.
- Resolving the problem
- To work around the problem, do the following steps:
- Edit the fields in the ibmcloud-cluster-info ConfigMap to use the target cluster hostname:
oc edit configmap ibmcloud-cluster-info -n ${PROJECT_CPD_INST_OPERANDS}
- Restart the ibm-zen-operator pod:
oc delete po -l app.kubernetes.io/name=ibm-zen-operator -n ${PROJECT_CPD_INST_OPERANDS}
- Check that the routes are updated:
oc get route -n ${PROJECT_CPD_INST_OPERANDS}
If restarting the ibm-zen-operator pod does not correctly update the routes, and the ibm-iam-operator deployment is not healthy, do the workaround that is described in the previous issue.
Unable to back up Watson Discovery when the service is scaled to the xsmall size
Applies to: 5.0.0
Fixed in: 5.0.1
- Diagnosing the problem
- The problem that you see depends on the backup and restore method that you are using. For example, if you are using IBM Storage Fusion, a Failed snapshot message appears during the backup process.
- Cause of the problem
- The xsmall size configuration uses 1 OpenSearch data node. The backup process requires 2 data nodes.
- Resolving the problem
- To work around the problem, increase the number of OpenSearch data nodes to 2. In the ${PROJECT_CPD_INST_OPERANDS} project (namespace), run the following command:
oc patch wd wd --type=merge --patch='{"spec":{"elasticsearch":{"dataNode":{"replicas":2}}}}'
In a Cloud Pak for Data deployment that has multiple OpenPages instances, only the first instance is successfully restored
Applies to: 5.0.0
Fixed in: 5.0.1
- Diagnosing the problem
- After the restore, the custom resource of the first OpenPages instance is in a Completed state. The custom resources of the remaining OpenPages instances are in an InMaintenance state.
- Cause of the problem
- Hooks (prehooks, posthooks, etc.) are run only on the first OpenPages instance. Log files list only the results for one OpenPages instance when multiple were expected.
- Resolving the problem
- To work around the problem, do the following steps:
- Get the OpenPages instance ConfigMaps:
oc get cm -n ${PROJECT_CPD_INST_OPERANDS} -l cpdfwk.module=openpages-aux
- Edit each OpenPages instance ConfigMap so that its .data.aux-meta.name field matches its .metadata.labels.["cpdfwk.name"] label:
oc edit cm -n ${PROJECT_CPD_INST_OPERANDS} <configmap-name>
Unable to log in to Cloud Pak for Data with OpenShift cluster credentials after successfully restoring to a different cluster
Applies to: 5.0.0
Fixed in: 5.0.1
- Diagnosing the problem
- When Cloud Pak for Data is integrated with the Identity Management Service, you cannot log in with OpenShift cluster credentials. You might be able to log in with LDAP or as cpdadmin.
- Resolving the problem
- To work around the problem, run the following commands:
oc delete cm platform-auth-idp -n ${PROJECT_CPD_INST_OPERANDS}
oc delete cm oauth-client-map -n ${PROJECT_CPD_INST_OPERANDS}
oc delete cm ibm-iam-bindinfo-oauth-client-map -n ${PROJECT_CPD_INST_OPERANDS}
oc delete cm ibm-iam-bindinfo-platform-auth-idp -n ${PROJECT_CPD_INST_OPERANDS}
oc delete pods -n ${PROJECT_CPD_INST_OPERATORS} -l app.kubernetes.io/instance=ibm-common-service-operator
oc delete pods -n ${PROJECT_CPD_INST_OPERANDS} -l app.kubernetes.io/instance=platform-auth-service
oc delete pods -n ${PROJECT_CPD_INST_OPERANDS} -l app.kubernetes.io/instance=platform-identity-management
oc delete pods -n ${PROJECT_CPD_INST_OPERANDS} -l app.kubernetes.io/instance=platform-identity-provider
After a restore, OperandRequest timeout error in the ZenService custom resource
Applies to: 5.0.0 and later
- Diagnosing the problem
- Get the ZenService custom resource in YAML format:
oc get zenservice lite-cr -n ${PROJECT_CPD_INST_OPERATORS} -o yaml
In the output, you see the following error:
... zenMessage: '5.1.3/roles/0010-infra has failed with error: "OperandRequest" "zen-ca-operand-request": Timed out waiting on resource' ...
Check for failing operandrequests:
oc get operandrequests -A
For failing operandrequests, check their conditions for constraints not satisfiable messages:
oc describe operandrequest -n ${PROJECT_CPD_INST_OPERATORS} <opreq-name>
- Cause of the problem
- Subscription wait operations timed out. The problematic subscriptions show an error similar to the following example:
'constraints not satisfiable: clusterserviceversion ibm-db2aaservice-cp4d-operator.v5.2.0 exists and is not referenced by a subscription, @existing/cpd-operators//ibm-db2aaservice-cp4d-operator.v5.2.0 and ibm-db2aaservice-cp4d-operator-catalog/cpd-operators/v5.2/ibm-db2aaservice-cp4d-operator.v5.2.0 originate from package ibm-db2aaservice-cp4d-operator, subscription ibm-db2aaservice-cp4d-operator requires ibm-db2aaservice-cp4d-operator-catalog/cpd-operators/v5.2/ibm-db2aaservice-cp4d-operator.v5.2.0, subscription ibm-db2aaservice-cp4d-operator exists'
This problem is a known issue with Red Hat Operator Lifecycle Manager. For details, see Failed to install or upgrade operator with warning 'clusterserviceversion is not referenced by a subscription'.
- Workaround
- Do the following steps:
- Delete the problematic clusterserviceversions and subscriptions, and restart the Operand Deployment Lifecycle Manager (ODLM) pod.
For details, follow the steps in the troubleshooting document cloud-native-postgresql operator is installed with the certified-operators catalogsource.
- Delete the Cloud Pak for Data instance projects (namespaces).
For details, see Preparing to restore Cloud Pak for Data with the OADP utility.
- Retry the restore.
Online restore of Data Virtualization fails with post-hook errors
Applies to: 5.0.2, 5.0.3
- Diagnosing the problem
- Restoring an online backup of Data Virtualization on Portworx storage with the OADP backup and restore utility fails. In the CPD-CLI*.log file, you see errors such as the following examples:
<time> zen/configmap/cpd-dv-aux-ckpt-cm: component=dv, op=<mode=post-restore,type=config-hook,method=rule>, status=error
time=<timestamp> level=error msg=error performing op postRestoreViaConfigHookRule for resource dv, msg: 1 error occurred: * : command timed out after 40m0s: timed out waiting for the condition func=cpdbr-oadp/pkg/quiesce.(*BasicPlanExecutor).applyPlanInternal file=/a/workspace/oadp-upload/pkg/quiesce/planexecutor.go:1535
- Cause of the problem
- Db2 startup is slow, causing the Data Virtualization post-restore hook to time out.
- Resolving the problem
- To work around the problem, take various Data Virtualization components out of write-suspend mode.
- Take dvutils out of write-suspend mode:
oc rsh c-db2u-dv-dvutils-0 bash
/opt/dv/current/dv-utils.sh -o leavesafemode --is-bar
- Take the Data Virtualization hurricane pod out of write-suspend mode:
oc rsh $(oc get pods | grep -i hurricane | cut -d' ' -f 1) bash
su - db2inst1
/usr/ibmpacks/current/bigsql/bigsql/bigsql-cli/BIGSQL/package/scripts/bigsql-db2ubar-hook.sh -H POST -M RESTORE -L
- Take Db2 out of write-suspend mode:
oc rsh c-db2u-dv-db2u-0 bash
su - db2inst1
/db2u/scripts/bigsql-exec.sh /usr/ibmpacks/current/bigsql/bigsql/bigsql-cli/BIGSQL/package/scripts/bigsql-db2ubar-hook.sh -H POST -M RESTORE -L
- After a few minutes, verify that Db2 is no longer in write-suspend mode:
db2 connect to bigsql
If the command finishes successfully, Db2 is no longer in write-suspend mode.
- Restart the Data Virtualization caching pod by deleting the existing pod:
oc delete pod $(oc get pods | grep -i c-db2u-dv-dvcaching | cut -d' ' -f 1)
Online backup of Analytics Engine powered by Apache Spark fails
Applies to: 5.0.1
Fixed in: 5.0.2
- Diagnosing the problem
- When you try to create a backup of a Cloud Pak for Data deployment that includes the Analytics Engine powered by Apache Spark service with the OADP utility, the backup fails at the step to create a backup of Cloud Pak for Data PVCs and volume data. In the log file, you see the following error:
Hook execution breakdown by status=error/timedout:
The following hooks either have errors or timed out
pre-backup (1):
COMPONENT CONFIGMAP METHOD STATUS DURATION
analyticsengine-cnpsql-ckpt cpd-analyticsengine-aux-edb-ckpt-cm rule error 1m17.502299591s
--------------------------------------------------------------------------------
** INFO [BACKUP CREATE/SUMMARY/END] *******************************************
Error: error running pre-backup hooks: Error running pre-processing rules. Check the /root/install_automation/cpd-cli-linux-EE-14.0.1-353/cpd-cli-workspace/logs/CPD-CLI-<date>.log for errors.
[ERROR] <timestamp> RunPluginCommand:Execution error: exit status 1
- Cause of the problem
- The EDB Postgres cluster spark-hb-cloud-native-postgresql remains fenced.
- Resolving the problem
- Unfence the cluster by doing the following steps:
- Edit the spark-hb-cloud-native-postgresql cluster:
oc edit clusters.postgresql.k8s.enterprisedb.io spark-hb-cloud-native-postgresql
- Remove the following line:
k8s.enterprisedb.io/fencedInstances: ""
- Retry the backup.
Tip: For more information about resolving problems with EDB Postgres clusters that remain fenced, see EDB Postgres cluster is in an unhealthy state after a failed online backup.
Watson Speech services status is stuck in InProgress after restore
Applies to: 5.0.0
Fixed in: 5.0.1
- Diagnosing the problem
- After an online restore with the OADP utility, the CPD-CLI*.log file shows that speechStatus is in the InProgress state.
- Cause of the problem
- The speechStatus is in the InProgress state due to a race condition in the stt-async component. Pods that are associated with this component are stuck in the 0/1 Running state. Run the following command to confirm this state:
oc get pods -l app.kubernetes.io/component=stt-async
Example output:
NAME READY STATUS RESTARTS AGE
speech-cr-stt-async-775d5b9d55-fpj8x 0/1 Running 0 60m
If one or more pods are in the 0/1 Running state for 20 minutes or more, this problem might occur.
- Resolving the problem
- For each pod in the 0/1 Running state, run the following command:
oc delete pod <stt-async-podname>
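If several stt-async pods are affected, the pod names can be extracted from the oc get pods output instead of copied by hand. The following is a minimal sketch: stuck_pods is a hypothetical helper name, and it assumes the default column order (NAME, READY, STATUS). It does not check the AGE column, so confirm separately that the pods have been in this state for 20 minutes or more.

```shell
# Hypothetical helper: reads `oc get pods` table output on stdin and
# prints the name of each pod whose READY column is 0/1 while its
# STATUS column is Running. The header row, if present, is skipped.
stuck_pods() {
  awk '$1 != "NAME" && $2 == "0/1" && $3 == "Running" { print $1 }'
}
```

For example, you might review the result of oc get pods -l app.kubernetes.io/component=stt-async | stuck_pods and then delete each listed pod.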
Common core services and dependent services in a failed state after an online restore
Applies to: 5.0.0
- Diagnosing the problem
- After you restore an online backup with the OADP backup and restore utility, the Common core services custom resource and the custom resources of dependent services remain in an InProgress state.
- Cause of the problem
- Intermittent Elasticsearch failure.
- Workaround
- To work around the problem, do the following steps:
- Make sure that the current project (namespace) is set to the project that contains the Common core services and Watson Knowledge Catalog deployment.
- Make sure that a valid backup is available by running the following command:
oc exec elasticsea-0ac3-ib-6fb9-es-server-esnodes-0 -c elasticsearch -- curl --request GET --url http://localhost:19200/_cat/snapshots/cloudpak --header 'content-type: application/json'
- When a valid backup is present, the command returns output like the following example:
cloudpak_snapshot_<timestamp> SUCCESS <epoch_timestamp> <hh:mm:ss> <epoch_timestamp> <hh:mm:ss> 200ms 3 23 0 23
- If a snapshot is not present, the restore has unexpectedly failed. Contact IBM Support for assistance.
- If a valid snapshot is present, delete the indexes on the cluster:
oc exec -n ${PROJECT_CPD_INST_OPERANDS} elasticsea-0ac3-ib-6fb9-es-server-esnodes-0 -c elasticsearch -- curl --request DELETE --url 'http://localhost:19200/granite-3b,wkc,gs-system-index-wkc-v001,semantic' --header 'content-type: application/json'
- Scale the OpenSearch cluster down by quiescing:
oc patch elasticsearchcluster elasticsearch-master --type merge --patch '{"spec": {"quiesce": true}}'
- Wait for the pods to scale down, checking the status with the following command:
watch "oc get pods | grep elasticsea"
- When all the pods are gone, restart the cluster by unquiescing it:
oc patch elasticsearchcluster elasticsearch-master --type merge --patch '{"spec": {"quiesce": false}}'
After you do these steps, Elasticsearch comes back up and automatically starts recovery.
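If you script these steps, the manual watch step can be replaced with a polling loop that has a timeout. The following is a minimal sketch, not part of the product: wait_until is a hypothetical helper name, and the timeout and interval values in the usage note are assumptions that you should tune for your cluster.

```shell
# Hypothetical helper: wait_until <timeout_seconds> <interval_seconds> <command...>
# Runs the command repeatedly until it succeeds. Returns 0 on success,
# or 1 if the command still fails after roughly timeout_seconds.
wait_until() {
  local timeout=$1 interval=$2 elapsed=0
  shift 2
  until "$@" >/dev/null 2>&1; do
    if [ "$elapsed" -ge "$timeout" ]; then
      return 1
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
}
```

For example, wait_until 600 10 sh -c '! oc get pods | grep -q elasticsea' waits up to about 10 minutes for the pods to disappear before you unquiesce the cluster.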
Backup fails for the platform with error in EDB Postgres cluster
Applies to: 5.0.0 and later
- Diagnosing the problem
- This problem occurs with all backup and restore methods. For example, when you use IBM Storage Fusion, the backup fails at the Hook: br-service hooks/pre-backup stage in the backup sequence.
In the cpdbr-oadp.log file, you see the following error:
time=<timestamp> level=info msg=cmd stderr: Error: cannot take a cold backup of the primary instance or a target primary instance if the k8s.enterprisedb.io/snapshotAllowColdBackupOnPrimary annotation is not set to enabled
- Cause of the problem
- Labels and annotations in the EDB Postgres cluster resources were not updated after a switchover of the EDB Postgres cluster's primary instance and replica.
- Resolving the problem
- Do the following steps:
- Save the following script as reset_br_annotations_labels.sh:
#!/bin/bash
# Check if the correct number of arguments is provided
if [ "$#" -ne 2 ]; then
  echo "Usage: $0 <cluster-name> <operand-namespace>"
  exit 1
fi
cluster_name=$1
zen_namespace=$2
if ! oc whoami > /dev/null 2>&1; then
  echo "[$(date '+%Y-%m-%d %H:%M:%S')] Session is invalid. Please log in."
  exit 1
fi
if ! oc get cluster.postgresql.k8s.enterprisedb.io ${cluster_name} -n ${zen_namespace} > /dev/null 2>&1; then
  echo "[$(date '+%Y-%m-%d %H:%M:%S')] Postgres EDB ${cluster_name} does not exist in namespace ${zen_namespace}"
  exit 1
fi
#The backup adapter annotation will be removed temporarily and re-added
#backup_adapter_annotation: {"k8s.enterprisedb.io/externalBackupAdapterClusterConfig": "electedResourcesDecorators:\n - key: \"icpdsupport/edb-elected\"\n metadataType: \"label\"\n value: \"true\"\nexcludedResourcesDecorators:\n - key: \"icpdsupport/ignore-on-nd-backup\"\n metadataType: \"label\"\n value: \"true\"\nbackupInstanceDecorators:\n - key: \"icpdsupport/edb-hooks\"\n metadataType: \"label\"\n value: \"true\"\npreBackupHookConfiguration:\n container:\n key: \"icpdsupport/edb-pre-backup-container\"\n command:\n key: \"icpdsupport/edb-pre-backup-command\"\n onError:\n key: \"icpdsupport/edb-pre-backup-on-error\"\npostBackupHookConfiguration:\n container:\n key: \"icpdsupport/edb-post-backup-container\"\n command:\n key: \"icpdsupport/edb-post-backup-command\""}
echo "[$(date '+%Y-%m-%d %H:%M:%S')] Fetching Backup Adapter configuration annotation from EDB cluster ${cluster_name}"
annotation_value=$(oc get cluster.postgresql.k8s.enterprisedb.io ${cluster_name} -n ${zen_namespace} -o jsonpath='{.metadata.annotations.k8s\.enterprisedb\.io/externalBackupAdapterClusterConfig}')
echo "[$(date '+%Y-%m-%d %H:%M:%S')] Removing Backup Adapter configuration annotation from EDB cluster ${cluster_name}"
oc annotate cluster.postgresql.k8s.enterprisedb.io ${cluster_name} -n ${zen_namespace} k8s.enterprisedb.io/externalBackupAdapterClusterConfig-
#remove backup instance so it can be updated
echo "[$(date '+%Y-%m-%d %H:%M:%S')] Removing Backup Instance annotation from EDB cluster ${cluster_name}"
oc annotate cluster.postgresql.k8s.enterprisedb.io ${cluster_name} -n ${zen_namespace} k8s.enterprisedb.io/backupInstance-
velero_labels=$(oc get pvc,pods -l k8s.enterprisedb.io/cluster="${cluster_name}",velero.io/exclude-from-backup=true -n ${zen_namespace} -o name | wc -l)
if [ "$velero_labels" -gt 0 ]; then
  echo "[$(date '+%Y-%m-%d %H:%M:%S')] Removing velero label from related pvcs and pods of the EDB cluster ${cluster_name}"
  oc label pvc,pods -l k8s.enterprisedb.io/cluster="${cluster_name}",velero.io/exclude-from-backup=true velero.io/exclude-from-backup- -n ${zen_namespace}
fi
echo "[$(date '+%Y-%m-%d %H:%M:%S')] Pausing for 30 seconds"
sleep 30
#Re-add backup annotation to cluster
echo "[$(date '+%Y-%m-%d %H:%M:%S')] Adding Backup Adapter configuration annotation to EDB cluster ${cluster_name}"
oc annotate cluster.postgresql.k8s.enterprisedb.io ${cluster_name} -n ${zen_namespace} k8s.enterprisedb.io/externalBackupAdapterClusterConfig="${annotation_value}"
echo "[$(date '+%Y-%m-%d %H:%M:%S')] Reset of Backup Adapter configuration annotation / labels for EDB cluster ${cluster_name} is complete."
- Make the script executable:
chmod 755 reset_br_annotations_labels.sh
- Get the EDB Postgres cluster name:
oc get cluster -n ${PROJECT_CPD_INST_OPERANDS}
- Run the script with the following command. Replace <EDB_Cluster_Name> with the EDB Postgres cluster name.
./reset_br_annotations_labels.sh <EDB_Cluster_Name> ${PROJECT_CPD_INST_OPERANDS}
- Retry the backup.
Restoring an RSI-enabled backup fails
Applies to: 5.0.0-5.0.2
Fixed in: 5.0.3
- Diagnosing the problem
- Restoring an RSI-enabled backup with IBM Storage Fusion fails at the Hook: br-service-hooks-operators restore step. The cpdbr-tenant.log file shows the following error:
cannot create resource "mutatingwebhookconfigurations" in API group "admissionregistration.k8s.io" at the cluster scope
- Cause of the problem
- Permissions are missing in the cpdbr-tenant-service-clusterrole ClusterRole.
- Resolving the problem
- Do the following steps:
- Install cpd-cli 5.0.3.
- Upgrade the cpdbr service:
- The cluster pulls images from the IBM Entitled Registry:
- Environments with the scheduling service
cpd-cli oadp install \
--upgrade=true \
--component=cpdbr-tenant \
--tenant-operator-namespace=${PROJECT_CPD_INST_OPERATORS} \
--cpdbr-hooks-image-prefix=icr.io/cpopen/cpd \
--cpd-scheduler-namespace=${PROJECT_SCHEDULING_SERVICE} \
--recipe-type=br \
--log-level=debug \
--verbose
- Environments without the scheduling service
cpd-cli oadp install \
--upgrade=true \
--component=cpdbr-tenant \
--tenant-operator-namespace=${PROJECT_CPD_INST_OPERATORS} \
--cpdbr-hooks-image-prefix=icr.io/cpopen/cpd \
--recipe-type=br \
--log-level=debug \
--verbose
- The cluster pulls images from a private container registry:
- Environments with the scheduling service
cpd-cli oadp install \
--upgrade=true \
--component=cpdbr-tenant \
--namespace=${OADP_OPERATOR_NS} \
--tenant-operator-namespace=${PROJECT_CPD_INST_OPERATORS} \
--cpdbr-hooks-image-prefix=${PRIVATE_REGISTRY_LOCATION} \
--cpd-scheduler-namespace=${PROJECT_SCHEDULING_SERVICE} \
--recipe-type=br \
--log-level=debug \
--verbose
- Environments without the scheduling service
cpd-cli oadp install \
--upgrade=true \
--component=cpdbr-tenant \
--namespace=${OADP_OPERATOR_NS} \
--tenant-operator-namespace=${PROJECT_CPD_INST_OPERATORS} \
--cpdbr-hooks-image-prefix=${PRIVATE_REGISTRY_LOCATION} \
--recipe-type=br \
--log-level=debug \
--verbose
- Retry the restore.
Restore fails at Hook: br-service-hooks-operators restore step
Applies to: 5.0.1
Fixed in: 5.0.2
- Diagnosing the problem
- This problem occurs when using IBM Storage Fusion 2.7.2.
- The restore process fails at the Hook: br-service-hooks-operators restore step, and you see the following error message:
Recipe failed BMYBR0003 There was an error when processing the job in the Transaction Manager service
- The ${PROJECT_CPD_INST_OPERANDS} project was not created during the restore.
- When you run the following commands, the IBM Storage Fusion application custom resource does not have the Cloud Pak for Data instance project listed under .spec.includedNamespaces.
export PROJECT_FUSION=<fusion-namespace>
Tip: By default, the IBM Storage Fusion project is ibm-spectrum-fusion-ns.
oc get fapp -n ${PROJECT_FUSION} ${PROJECT_CPD_INST_OPERATORS} -o json | jq .spec
- Cause of the problem
- The backup is incomplete, causing the restore to fail.
- Resolving the problem
- Do the following steps:
- Install cpd-cli 5.0.2.
- Upgrade the cpdbr service:
- The cluster pulls images from the IBM Entitled Registry:
- Environments with the scheduling service
cpd-cli oadp install \
--upgrade=true \
--component=cpdbr-tenant \
--tenant-operator-namespace=${PROJECT_CPD_INST_OPERATORS} \
--cpdbr-hooks-image-prefix=icr.io/cpopen/cpd \
--cpd-scheduler-namespace=${PROJECT_SCHEDULING_SERVICE} \
--recipe-type=br \
--log-level=debug \
--verbose
- Environments without the scheduling service
cpd-cli oadp install \
--upgrade=true \
--component=cpdbr-tenant \
--tenant-operator-namespace=${PROJECT_CPD_INST_OPERATORS} \
--cpdbr-hooks-image-prefix=icr.io/cpopen/cpd \
--recipe-type=br \
--log-level=debug \
--verbose
- The cluster pulls images from a private container registry:
- Environments with the scheduling service
cpd-cli oadp install \
--upgrade=true \
--component=cpdbr-tenant \
--namespace=${OADP_OPERATOR_NS} \
--tenant-operator-namespace=${PROJECT_CPD_INST_OPERATORS} \
--cpdbr-hooks-image-prefix=${PRIVATE_REGISTRY_LOCATION} \
--cpd-scheduler-namespace=${PROJECT_SCHEDULING_SERVICE} \
--recipe-type=br \
--log-level=debug \
--verbose
- Environments without the scheduling service
cpd-cli oadp install \
--upgrade=true \
--component=cpdbr-tenant \
--namespace=${OADP_OPERATOR_NS} \
--tenant-operator-namespace=${PROJECT_CPD_INST_OPERATORS} \
--cpdbr-hooks-image-prefix=${PRIVATE_REGISTRY_LOCATION} \
--recipe-type=br \
--log-level=debug \
--verbose
- Patch policy assignments with the backup and restore recipe details.
- Log in to Red Hat OpenShift Container Platform as an instance administrator.
${OC_LOGIN}
Remember: OC_LOGIN is an alias for the oc login command.
- Get each policy assignment name:
export PROJECT_FUSION=<fusion-namespace>
oc get policyassignment -n ${PROJECT_FUSION}
- If the scheduling service is installed, patch the ${PROJECT_SCHEDULING_SERVICE} policy assignment:
oc -n ${PROJECT_FUSION} patch policyassignment <cpd-scheduler-policy-assignment> --type merge -p "{\"spec\":{\"recipe\":{\"name\":\"ibmcpd-scheduler\", \"namespace\":\"${PROJECT_SCHEDULING_SERVICE}\", \"apiVersion\":\"spp-data-protection.isf.ibm.com/v1alpha1\"}}}"
- Patch the Cloud Pak for Data tenant policy assignment:
oc -n ${PROJECT_FUSION} patch policyassignment <cpd-tenant-policy-assignment> --type merge -p "{\"spec\":{\"recipe\":{\"name\":\"ibmcpd-tenant\", \"namespace\":\"${PROJECT_CPD_INST_OPERATORS}\", \"apiVersion\":\"spp-data-protection.isf.ibm.com/v1alpha1\"}}}"
- Check that the IBM Storage Fusion application custom resource for the Cloud Pak for Data operator includes the following information:
- All projects (namespaces) that are members of the Cloud Pak for Data instance, including:
- The Cloud Pak for Data operators project (${PROJECT_CPD_INST_OPERATORS}).
- The Cloud Pak for Data operands project (${PROJECT_CPD_INST_OPERANDS}).
- All tethered projects, if they exist.
- The PARENT_NAMESPACE variable, which is set to ${PROJECT_CPD_INST_OPERATORS}.
- To get the list of all projects that are members of the Cloud Pak for Data instance, run the following command:
oc get -n ${PROJECT_FUSION} applications.application.isf.ibm.com ${PROJECT_CPD_INST_OPERATORS} -o jsonpath={'.spec.includedNamespaces'}
- To get the PARENT_NAMESPACE variable, run the following command:
oc get -n ${PROJECT_FUSION} applications.application.isf.ibm.com ${PROJECT_CPD_INST_OPERATORS} -o jsonpath={'.spec.variables'}
- Take a new backup.
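The membership check in the preceding steps can be scripted. The following is a minimal sketch, not part of the product tooling: check_included_namespaces is a hypothetical helper, and it assumes that the jsonpath query prints the namespaces as a JSON array of quoted strings.

```shell
# Hypothetical helper (not part of cpd-cli or IBM Storage Fusion).
# Usage: check_included_namespaces '<json-array>' <namespace>...
# Prints each required namespace that is absent from the array and
# returns 1 if any namespace is missing, 0 otherwise.
check_included_namespaces() {
  local included="$1"
  shift
  local ns missing=0
  for ns in "$@"; do
    case "${included}" in
      *"\"${ns}\""*) ;;                        # quoted name found in the array
      *) echo "missing: ${ns}"; missing=1 ;;
    esac
  done
  return "${missing}"
}

# Example (assumes the environment variables from the steps above are set):
#   included="$(oc get -n "${PROJECT_FUSION}" applications.application.isf.ibm.com \
#     "${PROJECT_CPD_INST_OPERATORS}" -o jsonpath='{.spec.includedNamespaces}')"
#   check_included_namespaces "${included}" \
#     "${PROJECT_CPD_INST_OPERATORS}" "${PROJECT_CPD_INST_OPERANDS}"
```

Add any tethered projects as extra arguments so that they are checked as well.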
Data Virtualization restore fails at post-workload step
Applies to: 5.0.0-5.0.2
Fixed in: 5.0.3
- Diagnosing the problem
- When restoring an online backup of a Cloud Pak for Data deployment that includes Data Virtualization with
IBM Storage Fusion, the restore fails at the
Hook: br-service-hooks/post-workload step in the restore sequence. In the log
file, you see the following error
message:
time=<timestamp> level=info msg= zen/configmap/cpd-dv-aux-ckpt-cm: component=dv, op=<mode=post-restore,type=config-hook,method=rule>, status=error func=cpdbr-oadp/pkg/quiesce.logPlanResult file=/go/src/cpdbr-oadp/pkg/quiesce/planexecutor.go:1137
- Workaround
- To work around the problem, do the following steps:
- Scale down the Data Virtualization hurricane
pod:
oc scale deployment c-db2u-dv-hurricane-dv --replicas=0
- Log in to the Data Virtualization head
pod:
oc rsh c-db2u-dv-db2u-0 bash
su - db2inst1
- Create a backup copy of the users.json
file:
cp /mnt/blumeta0/db2_config/users.json /mnt/PV/versioned/logs/users.json.original
- Edit the users.json
file:
vi /mnt/blumeta0/db2_config/users.json
- Locate "locked":true and change it to "locked":false.
- Scale up the Data Virtualization hurricane pod:
oc scale deployment c-db2u-dv-hurricane-dv --replicas=1
- Restart BigSQL from the Data Virtualization head
pod:
oc exec -it c-db2u-dv-db2u-0 -- su - db2inst1 -c "bigsql start"
The Data Virtualization head and worker pods continue with the startup sequence.
- Wait until the Data Virtualization head and worker pods are fully started by running the following two commands:
oc get pods | grep -i c-db2u-dv-dvcaching | grep 1/1 | grep -i Running
oc exec -t c-db2u-dv-db2u-0 -- su - db2inst1 -c "ls -ls /tmp" | grep dv_setup_complete
The Data Virtualization head and worker pods are fully started when both commands return grep results instead of empty results.
- Re-create the marker file that is needed by the Data Virtualization post-restore hook logic:
oc exec -t c-db2u-dv-db2u-0 -- su - db2inst1 -c "touch /tmp/.ready_to_connectToDb"
- Re-run the post-restore hook.
- Get the cpdbr-tenant-service pod
ID:
oc get po -A | grep "cpdbr-tenant-service"
- Log in to the cpdbr-tenant-service
pod:
oc rsh -n ${PROJECT_CPD_INST_OPERATORS} <cpdbr-tenant-service pod id>
- Run the following
commands:
/cpdbr-scripts/cpdbr-oadp restore posthooks --hook-kind=checkpoint --include-namespaces=${PROJECT_CPD_INST_OPERATORS},${PROJECT_CPD_INST_OPERANDS} --log-level=debug --verbose
/cpdbr-scripts/cpdbr/cpdbr-cpd-operators.sh restore-namespacescope --operators-namespace ${PROJECT_CPD_INST_OPERATORS} --foundation-namespace ${PROJECT_CPD_INST_OPERATORS}
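The manual vi edit of users.json in the workaround can also be done non-interactively. The following is a minimal sketch, assuming that the file stores the flag as a "locked":true key-value pair, optionally with whitespace around the colon. The unlock_users name is hypothetical, and the helper rewrites every locked entry in the file, so review the result before you scale the hurricane pod back up.

```shell
# Hypothetical helper: set every "locked":true entry in a users.json-style
# file to "locked":false. The edited text is written to a temporary file
# first and then copied back with cat, so the original file keeps its
# owner and permissions and is only overwritten if sed succeeds.
unlock_users() {
  local file="$1"
  local tmp="${file}.tmp.$$"
  sed 's/"locked"[[:space:]]*:[[:space:]]*true/"locked":false/g' "${file}" > "${tmp}" &&
    cat "${tmp}" > "${file}" &&
    rm -f "${tmp}"
}

# Example (inside the head pod, as db2inst1, after taking the backup copy):
#   unlock_users /mnt/blumeta0/db2_config/users.json
```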
Restoring an online backup of Cloud Pak for Data on IBM Storage Scale Container Native storage fails
Applies to: IBM Storage Fusion 2.7.2 and later
- Diagnosing the problem
- When you restore an online backup with IBM Storage Fusion, the restore process fails at the Volume group: cpd-volumes step in the restore sequence.
- Workaround
- This problem occurs when you have Persistent Volume Claims (PVCs) that are smaller than 5Gi. To work around the problem, expand any PVC that is smaller than 5Gi to at least 5Gi before you create the backup. For details, see Volume Expansion in the IBM Storage Scale Container Storage Interface Driver documentation.
Note: You cannot manually expand Watson OpenScale PVCs. To manage PVC sizes for Watson OpenScale, see Managing persistent volume sizes for Watson OpenScale.
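To find the PVCs that need to be expanded, you can filter on the requested sizes. The following is a minimal sketch, assuming that sizes use the binary suffixes Ki, Mi, Gi, or Ti. The list_small_pvcs name is hypothetical; it reads "name size" pairs on standard input and prints the names that are below 5Gi. Sizes in any other notation are skipped, not converted.

```shell
# Hypothetical helper: print the names of PVCs that request less than 5Gi.
# Input: one "NAME SIZE" pair per line, for example "my-pvc 500Mi".
list_small_pvcs() {
  awk '
    {
      num = $2;  sub(/[A-Za-z]+$/, "", num)    # numeric part, for example 500
      unit = $2; sub(/^[0-9.]+/, "", unit)     # suffix, for example Mi
      if      (unit == "Ki") mib = num / 1024
      else if (unit == "Mi") mib = num + 0
      else if (unit == "Gi") mib = num * 1024
      else if (unit == "Ti") mib = num * 1024 * 1024
      else next                                # unknown notation: skip
      if (mib < 5 * 1024) print $1
    }'
}

# Example:
#   oc get pvc -n "${PROJECT_CPD_INST_OPERANDS}" --no-headers \
#     -o custom-columns=NAME:.metadata.name,SIZE:.spec.resources.requests.storage |
#     list_small_pvcs
```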
Backup failed at Volume group: cpd-volumes stage
Applies to: IBM Storage Fusion 2.7.2
Fixed in: IBM Storage Fusion 2.7.2 hotfix
- Diagnosing the problem
- In the backup sequence in IBM Storage Fusion 2.7.2, the backup fails at the Volume group: cpd-volumes stage.
The transaction manager log shows several error messages, such as the following examples:
<timestamp>[TM_0] - Error: Processing of volume cc-home-pvc failed.\n", "<timestamp>[VOL_12] -
Snapshot exception (410)\\nReason: Expired: too old resource version: 2575013 (2575014)
- Workaround
- Install the IBM Storage Fusion 2.7.2 hotfix. For details, see IBM Storage Fusion and IBM Storage Fusion HCI hotfix.
Backup of Cloud Pak for Data operators project fails at data transfer stage
Applies to: IBM Storage Fusion 2.7.2
Fixed in: IBM Storage Fusion 2.7.2 hotfix
- Diagnosing the problem
- In IBM Storage Fusion 2.7.2, the backup
fails at the Data transfer stage, with the following
error:
Failed transferring data There was an error when processing the job in the Transaction Manager service
- Cause
- The length of a Persistent Volume Claim (PVC) name is more than 59 characters.
- Workaround
- Install the IBM Storage Fusion 2.7.2
hotfix. For details, see IBM Storage Fusion and
IBM Storage Fusion HCI hotfix.
With the hotfix, PVC names can be up to 249 characters long.
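To check whether a cluster is exposed to this limit, you can list the PVC names that are longer than 59 characters. The following is a minimal sketch; long_pvc_names is a hypothetical helper that reads one name per line and prints the length and the name of each PVC over the limit.

```shell
# Hypothetical helper: print "<length> <name>" for every name that is
# longer than the limit (59 characters by default).
long_pvc_names() {
  local limit="${1:-59}"
  awk -v limit="${limit}" 'length($1) > limit { print length($1), $1 }'
}

# Example:
#   oc get pvc -n "${PROJECT_CPD_INST_OPERATORS}" --no-headers \
#     -o custom-columns=NAME:.metadata.name | long_pvc_names
```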
Watson OpenScale etcd server fails to start after restoring from a backup
Applies to: 5.0.0 and later
- Diagnosing the problem
- After restoring a backup with NetApp Astra Control Center, the Watson OpenScale etcd cluster is in a Failed state.
- Workaround
- To work around the problem, do the following steps:
- Log in to Red Hat OpenShift Container Platform as a cluster administrator.
${OC_LOGIN}
Remember: OC_LOGIN is an alias for the oc login command.
- Expand the size of the etcd PersistentVolumes by 1Gi.
In the following example, the current PVC size is 10Gi, and the commands set the new PVC size to 11Gi.
operatorPod=`oc get pod -n ${PROJECT_CPD_INST_OPERATORS} -l name=ibm-cpd-wos-operator | awk 'NR>1 {print $1}'`
oc exec ${operatorPod} -n ${PROJECT_CPD_INST_OPERATORS} -- roles/service/files/etcdresizing_for_resizablepv.sh -n ${PROJECT_CPD_INST_OPERANDS} -s 11Gi
- Wait for the reconciliation status of the Watson OpenScale custom resource to be in a Completed state:
oc get WOService aiopenscale -n ${PROJECT_CPD_INST_OPERANDS} -o jsonpath='{.status.wosStatus} {"\n"}'
The status of the custom resource changes to Completed when the reconciliation finishes successfully.
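The wait for the Completed state can be automated with a polling loop. The following is a minimal sketch; wait_for_status is a hypothetical helper, and the attempt count and delay in the example are illustrative values, not recommendations from the product documentation.

```shell
# Hypothetical helper: run a command repeatedly until its output, with
# all whitespace removed, equals the expected value.
# Usage: wait_for_status EXPECTED ATTEMPTS DELAY_SECONDS COMMAND [ARG...]
wait_for_status() {
  local expected="$1" attempts="$2" delay="$3"
  shift 3
  local i=1 status=""
  while [ "${i}" -le "${attempts}" ]; do
    # A failing command yields an empty status and counts as "not ready".
    status="$("$@" 2>/dev/null | tr -d '[:space:]' || true)"
    if [ "${status}" = "${expected}" ]; then
      return 0
    fi
    sleep "${delay}"
    i=$((i + 1))
  done
  echo "timed out; last status: ${status}" >&2
  return 1
}

# Example: poll every 30 seconds, for up to 60 attempts.
#   wait_for_status Completed 60 30 \
#     oc get WOService aiopenscale -n "${PROJECT_CPD_INST_OPERANDS}" \
#     -o jsonpath='{.status.wosStatus}'
```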
Restore fails at the running post-restore script step
Applies to: 5.0.3
- Diagnosing the problem
- When you use Portworx
asynchronous disaster recovery, activating
applications fails when you run the post-restore script. In the
restore_post_hooks_<timestamp>.log file, you see an error
message such as in the following
example:
Time: <timestamp> level=error - cpd-tenant-restore-<timestamp>-r2 failed /cpdbr-scripts/cpdbr/cpdbr-tenant.sh post-restore exit code=1 *** cpdbr-tenant.sh post-restore failed *** command terminated with exit code 1
- Resolving the problem
- To work around the problem, prior to running the post-restore script, restore custom resource
definitions by running the following
command:
cpd-cli oadp restore create <restore-name-r2> \
--from-backup=cpd-tenant-backup-<timestamp>-b2 \
--include-resources='customresourcedefinitions' \
--include-cluster-resources=true \
--skip-hooks \
--log-level=debug \
--verbose
Cloud Pak for Data resources are not migrated
Applies to: 5.0.2
Fixed in: 5.0.3
- Diagnosing the problem
- When you use Portworx asynchronous disaster recovery, the migration finishes almost immediately, and neither the volumes nor the expected number of resources are migrated. Run the following command:
storkctl get migrations -n ${PX_ADMIN_NS}
Tip: ${PX_ADMIN_NS} is usually kube-system.
Example output:
NAME                                                CLUSTERPAIR       STAGE   STATUS       VOLUMES   RESOURCES   CREATED       ELAPSED                       TOTAL BYTES TRANSFERRED
cpd-tenant-migrationschedule-interval-<timestamp>   mig-clusterpair   Final   Successful   0/0       0/0         <timestamp>   Volumes (0s) Resources (3s)   0
- Cause of the problem
- This problem occurs starting with stork 23.11.0. Backup exec rules are not run, and expected Cloud Pak for Data resources are not migrated.
- Resolving the problem
- To resolve the problem, downgrade stork to a version prior to 23.11.0. For
more information about stork releases, see the stork Releases page.
- Scale down the Portworx operator
so that it doesn't reset manual changes to the stork
deployment:
oc scale -n ${PX_ADMIN_NS} deploy portworx-operator --replicas=0
- Edit the stork deployment image version to a version prior to
23.11.0:
oc edit deploy -n ${PX_ADMIN_NS} stork
- If you need to scale up the Portworx operator, run the following command.
Note: The Portworx operator will undo changes to the stork deployment and return to the original stork version.
oc scale -n ${PX_ADMIN_NS} deploy portworx-operator --replicas=1
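Before and after the downgrade, you can check whether the running stork version falls in the affected range. The following is a minimal sketch that relies on GNU sort -V for version ordering. The stork_version_affected name is hypothetical, and the example assumes that the stork image tag is a plain version number such as 23.11.0, which might not hold for every registry layout.

```shell
# Hypothetical helper: succeed if a stork version is 23.11.0 or later,
# which is the range that the cause section describes.
stork_version_affected() {
  local version="$1" first_affected="23.11.0"
  # sort -V puts the lower version first; the version is affected when
  # 23.11.0 sorts first (or the two are equal).
  [ "$(printf '%s\n' "${first_affected}" "${version}" | sort -V | head -n 1)" = "${first_affected}" ]
}

# Example: read the tag from the first container image of the deployment.
#   image="$(oc get deploy stork -n "${PX_ADMIN_NS}" \
#     -o jsonpath='{.spec.template.spec.containers[0].image}')"
#   if stork_version_affected "${image##*:}"; then echo "downgrade needed"; fi
```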
Creating an offline backup in REST mode stalls
Applies to: 5.0.0 and later
- Diagnosing the problem
- This problem occurs when you try to create an offline backup in REST mode by using a custom --image-prefix value. The offline backup stalls with cpdbr-vol-mnt pods in the ImagePullBackOff state.
- Cause of the problem
- When you specify the --image-prefix option in the cpd-cli oadp backup create command, the default prefix registry.redhat.io/ubi9 is always used.
- Resolving the problem
- To work around the problem, create the backup in Kubernetes mode instead. To change to this mode,
run the following
command:
cpd-cli oadp client config set runtime-mode=
Common core services custom resource is in InProgress state after an offline restore to a different cluster
Applies to: 5.0.0, 5.0.1
Fixed in: 5.0.2
- Diagnosing the problem
- Get the status of installed components by running the following command.
cpd-cli manage get-cr-status \
--cpd_instance_ns=${PROJECT_CPD_INST_OPERANDS}
- Check that the status of ccs-cr is InProgress.
- Cause of the problem
- The Common core services component failed to reconcile on the restored cluster, because the dsx-requisite-pre-install-job-<xxxx> pod job is failing.
- Resolving the problem
- To resolve the problem, follow the instructions that are described in the technote Failed dsx-requisite-pre-install-job during offline restore.
OpenPages offline backup fails with pre-hook error
Applies to: 5.0.1, 5.0.2
Fixed in: 5.0.3
- Diagnosing the problem
- The CPD-CLI*.log file shows pre-backup hook errors such as in the following
example:
<time> Hook execution breakdown by status=error/timedout:
<time>
<time> The following hooks either have errors or timed out
<time>
<time> pre-backup (1):
<time>
<time> COMPONENT CONFIGMAP METHOD STATUS DURATION
<time> openpages-openpagesinstance-cr openpages-openpagesinstance-cr-aux-br-cm rule error 6m0.080179343s
<time>
<time> --------------------------------------------------------------------------------
<time>
<time>
<time> ** INFO [BACKUP CREATE/SUMMARY/END] *******************************************
<time> Error: error running pre-backup hooks: Error running pre-processing rules. Check the /root/br/backup/cpd-cli-workspace/logs/CPD-CLI-<timestamp>.log for errors.
<time> [ERROR] <timestamp> RunPluginCommand:Execution error: exit status 1
<time> nfs0717bak-tenant-offline-b1 k8s offline backup failed
- Cause of the problem
- Getting the OpenPages custom resource into the InMaintenance state timed out.
- Workaround
- Increase the pre-hooks timeout value in the
openpages-openpagesinstance-cr-aux-br-cm ConfigMap.
- Edit the openpages-openpagesinstance-cr-aux-br-cm
ConfigMap:
oc edit cm openpages-openpagesinstance-cr-aux-br-cm -n ${PROJECT_CPD_INST_OPERANDS}
- Under pre-hooks, change the timeout value to 600s.
pre-hooks:
  exec-rules:
  - resource-kind: OpenPagesInstance
    name: openpagesinstance-cr
    actions:
    - builtins:
        name: cpdbr.cpd.ibm.com/enable-maint
        params:
          statusFieldName: openpagesStatus
        timeout: 600s
Offline backup pre-hooks fail on Separation of Duties cluster
Applies to: 5.0.0 and later
- Diagnosing the problem
- The CPD-CLI*.log file shows pre-backup hook errors such as in the following
example:
<timestamp> level=info msg= test-watsonxgovernce-instance/configmap/cpd-analyticsengine-aux-br-cm: component=analyticsengine-br, op=<mode=pre-backup,type=config-hook,method=rule>, status=error func=cpdbr-oadp/pkg/quiesce.logPlanResult file=/a/workspace/oadp-upload/pkg/quiesce/planexecutor.go:1137
...
time=<timestamp> level=info msg= test-watsonxgovernce-instance/configmap/cpd-analyticsengine-cnpsql-aux-br-cm: component=analyticsengine-cnpsql-br, op=<mode=pre-backup,type=config-hook,method=rule>, status=error func=cpdbr-oadp/pkg/quiesce.logPlanResult file=/a/workspace/oadp-upload/pkg/quiesce/planexecutor.go:1137
- Cause of the problem
- The EDB Postgres pod for the Analytics Engine powered by Apache Spark service is in a CrashLoopBackOff state.
- Workaround
- To work around the problem, follow the instructions in the IBM Support document Unable to upgrade Spark due to Enterprise database issues.
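To confirm the cause before you follow the support document, you can list the pods that are in a CrashLoopBackOff state. The following is a minimal sketch; crashloop_pods is a hypothetical helper that filters standard oc get pods output. The exact name of the EDB Postgres pod is not stated here, so the sketch lists every matching pod in the project.

```shell
# Hypothetical helper: print the names of pods whose STATUS column is
# CrashLoopBackOff. Expects `oc get pods` output on standard input
# (columns: NAME READY STATUS RESTARTS AGE).
crashloop_pods() {
  awk '$3 == "CrashLoopBackOff" { print $1 }'
}

# Example:
#   oc get pods -n "${PROJECT_CPD_INST_OPERANDS}" --no-headers | crashloop_pods
```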
Offline restore to a different cluster fails due to management-ingress-ibmcloud-cluster-info ConfigMap not found in PodVolumeRestore
Applies to: 5.0.0
Fixed in: 5.0.1
- Diagnosing the problem
- After an offline backup is created, but before doing a restore, check if the
management-ingress-ibmcloud-cluster-info ConfigMap was backed up by running the
following
commands:
cpd-cli oadp backup status --details <backup_name1> | grep management-ingress-ibmcloud-cluster-info
cpd-cli oadp backup status --details <backup_name2> | grep management-ingress-ibmcloud-cluster-info
During or after the restore, pods that mount the missing ConfigMap show errors. For example:
oc describe po c-db2oltp-wkc-db2u-0 -n ${PROJECT_CPD_INST_OPERANDS}
Example output:
Events:
  Type     Reason       Age                  From     Message
  ----     ------       ----                 ----     -------
  Warning  FailedMount  41m (x512 over 17h)  kubelet  MountVolume.SetUp failed for volume "management-ingress-ibmcloud-cluster-info" : configmap "management-ingress-ibmcloud-cluster-info" not found
  Warning  FailedMount  62s (x518 over 17h)  kubelet  Unable to attach or mount volumes: unmounted volumes=[management-ingress-ibmcloud-cluster-info], unattached volumes=[], failed to process volumes=[]: timed out waiting for the condition
- Cause of the problem
- When a related ibmcloud-cluster-info ConfigMap gets excluded as part of backup hooks, the management-ingress-ibmcloud-cluster-info ConfigMap copies the exclude labeling and unintentionally gets excluded from the backup.
- Workaround
- To work around the problem, do the following steps:
- Log in to Red Hat OpenShift Container Platform as a cluster administrator.
${OC_LOGIN}
Remember: OC_LOGIN is an alias for the oc login command.
- Apply the following patch to ensure that the management-ingress-ibmcloud-cluster-info ConfigMap is not excluded from the backup:
oc apply -n ${PROJECT_CPD_INST_OPERANDS} -f - << EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: cpdbr-management-ingress-exclude-fix-br
  labels:
    cpdfwk.aux-kind: br
    cpdfwk.component: cpdbr-patch
    cpdfwk.module: cpdbr-management-ingress-exclude-fix
    cpdfwk.name: cpdbr-management-ingress-exclude-fix-br-cm
    cpdfwk.managed-by: ibm-cpd-sre
    cpdfwk.vendor: ibm
    cpdfwk.version: 1.0.0
data:
  aux-meta: |
    name: cpdbr-management-ingress-exclude-fix-br
    description: |
      This configmap defines offline backup prehooks to prevent cases where Bedrock's management-ingress-ibmcloud-cluster-info configmap gets unexpectedly excluded when ibmcloud-cluster-info is excluded during cs-postgres configmap prehooks.
      This is a temporary workaround until a complete fix is implemented.
    version: 1.0.0
    component: cpdbr-patch
    aux-kind: br
    priority-order: 99999 # This should happen at the end of backup prehooks
  backup-meta: |
    pre-hooks:
      exec-rules:
      # Remove lingering velero exclude label from offline prehooks
      - resource-kind: configmap
        name: management-ingress-ibmcloud-cluster-info
        actions:
          - builtins:
              name: cpdbr.cpd.ibm.com/label-resources
              params:
                action: remove
                key: velero.io/exclude-from-backup
                value: "true"
              timeout: 360s
      # Remove lingering ignore-on-nd-backup exclude label from online checkpoint operation
      - resource-kind: configmap
        name: management-ingress-ibmcloud-cluster-info
        actions:
          - builtins:
              name: cpdbr.cpd.ibm.com/label-resources
              params:
                action: remove
                key: icpdsupport/ignore-on-nd-backup
                value: "true"
              timeout: 360s
    post-hooks:
      exec-rules:
      - resource-kind: # do nothing for posthooks
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: cpdbr-management-ingress-exclude-fix-ckpt
  labels:
    cpdfwk.aux-kind: checkpoint
    cpdfwk.component: cpdbr-patch
    cpdfwk.module: cpdbr-management-ingress-exclude-fix
    cpdfwk.name: cpdbr-management-ingress-exclude-fix-ckpt-cm
    cpdfwk.managed-by: ibm-cpd-sre
    cpdfwk.vendor: ibm
    cpdfwk.version: 1.0.0
data:
  aux-meta: |
    name: cpdbr-management-ingress-exclude-fix-ckpt
    description: |
      This configmap defines online backup prehooks to prevent cases where Bedrock's management-ingress-ibmcloud-cluster-info configmap gets unexpectedly excluded when ibmcloud-cluster-info is excluded during cs-postgres configmap checkpoint operation.
      This is a temporary workaround until a complete fix is implemented.
    version: 1.0.0
    component: cpdbr-patch
    aux-kind: ckpt
    priority-order: 99999 # This should happen at the end of backup prehooks
  backup-meta: |
    pre-hooks:
      exec-rules:
      # Remove lingering velero exclude label from offline prehooks
      - resource-kind: configmap
        name: management-ingress-ibmcloud-cluster-info
        actions:
          - builtins:
              name: cpdbr.cpd.ibm.com/label-resources
              params:
                action: remove
                key: velero.io/exclude-from-backup
                value: "true"
              timeout: 360s
      # Remove lingering ignore-on-nd-backup exclude label from online checkpoint operation
      - resource-kind: configmap
        name: management-ingress-ibmcloud-cluster-info
        actions:
          - builtins:
              name: cpdbr.cpd.ibm.com/label-resources
              params:
                action: remove
                key: icpdsupport/ignore-on-nd-backup
                value: "true"
              timeout: 360s
    post-hooks:
      exec-rules:
      - resource-kind: # do nothing for posthooks
  checkpoint-meta: |
    exec-hooks:
      exec-rules:
      - resource-kind: # do nothing for checkpoint
EOF
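To see whether either exclude label is still present on the ConfigMap, you can inspect its labels. The following is a minimal sketch; exclude_labels_present is a hypothetical helper, and it assumes that the jsonpath query prints the label set as a JSON object with quoted keys. It checks only the two label keys that the patch removes.

```shell
# Hypothetical helper: print each backup-exclude label key that appears
# in a ConfigMap's label set. Returns 0 if at least one key is present
# and 1 if none is present.
exclude_labels_present() {
  local labels="$1" key found=1
  for key in velero.io/exclude-from-backup icpdsupport/ignore-on-nd-backup; do
    case "${labels}" in
      *"\"${key}\""*) echo "${key}"; found=0 ;;
    esac
  done
  return "${found}"
}

# Example:
#   labels="$(oc get cm management-ingress-ibmcloud-cluster-info \
#     -n "${PROJECT_CPD_INST_OPERANDS}" -o jsonpath='{.metadata.labels}')"
#   exclude_labels_present "${labels}" && echo "exclude label still set"
```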
Unable to restore offline backup of OpenPages to different cluster
Applies to: 5.0.0
Fixed in: 5.0.1
- Diagnosing the problem
- In the CPD-CLI*.log file, you see an error like in the following
example:
CPD-CLI-<timestamp>.log:time=<timestamp> level=error msg=failed to wait for statefulset openpages--78c5-ib-12ce in namespace <cpd_instance_ns>: timed out waiting for the condition func=cpdbr-oadp/pkg/kube.waitForStatefulSetPods file=/a/workspace/oadp-upload/pkg/kube/statefulset.go:173
- Cause of the problem
- The second RabbitMQ pod (ending in -1) is in a CrashLoopBackOff state. Run the following command:
oc get pods -n ${PROJECT_CPD_INST_OPERANDS} | grep openpages
Example output:
openpages--78c5-ib-12ce-0              1/1   Running            0                 23h
openpages--78c5-ib-12ce-1              0/1   CrashLoopBackOff   248 (3m57s ago)   23h
openpages-openpagesinstance-cr-sts-0   1/2   Running            91 (12m ago)      23h
openpages-openpagesinstance-cr-sts-1   1/2   Running            91 (12m ago)      23h
- Workaround
- To work around the problem, do the following steps:
- Log in to Red Hat OpenShift Container Platform as a cluster administrator.
${OC_LOGIN}
Remember: OC_LOGIN is an alias for the oc login command.
- Check the OpenPages logs for the following error in the second RabbitMQ pod:
===========
Exception during startup:
exit:{boot_failed,{exit_status,1}}
peer:start_it/2, line 639
rabbit_peer_discovery:query_node_props/1, line 408
rabbit_peer_discovery:sync_desired_cluster/3, line 189
rabbit_db:init/0, line 65
rabbit_boot_steps:-run_step/2-lc$^0/1-0-/2, line 51
rabbit_boot_steps:run_step/2, line 58
rabbit_boot_steps:-run_boot_steps/1-lc$^0/1-0-/1, line 22
rabbit_boot_steps:run_boot_steps/1, line 23
- If you see this error, check the Erlang cookie value at the top of the OpenPages logs. For example, run the following command:
oc logs openpages--78c5-ib-12ce-1
Example output:
Defaulted container "openpages-openpagesinstance-cr-<instance_id>-ibm-rabbitmq" out of: openpages-openpagesinstance-cr-<instance_id>-ibm-rabbitmq, copy-rabbitmq-config (init)
----------------------
+FkpbwejzK2RXfmPLQAnITroiieu3uGa3vkRA2k6t+8=
----------------------
<timestamp> [warning] <0.156.0> Overriding Erlang cookie using the value set in the environment
The plus sign (+) at the beginning of the cookie value is the source of the problem.
- Generate a new token:
openssl rand -base64 32 | tr -d '\n' | base64 | tr -d '\n'
- Decode from Base64 format, and make sure that the cookie value does not begin with a plus sign (+).
- Replace the cookie value in the auth secret.
- Edit the auth secret:
oc edit secret openpages-openpagesinstance-cr-<instance_id>-rabbitmq-auth-secret
- Replace the rabbitmq-erlang-cookie value with the new value.
- Delete the StatefulSet, or scale down and then scale up to get all the pods to pick up the new cookie.
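The generate-and-check steps for the cookie can be combined so that a value that begins with a plus sign is never produced. The following is a minimal sketch; cookie_is_safe and generate_cookie are hypothetical helper names, and the only rule that is enforced is the one described in this issue (no leading plus sign).

```shell
# Hypothetical helper: succeed if the decoded cookie value does not
# begin with a plus sign.
cookie_is_safe() {
  case "$1" in
    +*) return 1 ;;
    *)  return 0 ;;
  esac
}

# Hypothetical helper: generate random cookies until one passes
# cookie_is_safe, then print it. Fails if openssl produces no output.
generate_cookie() {
  local cookie
  while :; do
    cookie="$(openssl rand -base64 32 | tr -d '\n')"
    [ -n "${cookie}" ] || return 1
    if cookie_is_safe "${cookie}"; then
      printf '%s\n' "${cookie}"
      return 0
    fi
  done
}

# Example: print the Base64-encoded value to place in the auth secret.
#   generate_cookie | tr -d '\n' | base64 | tr -d '\n'
```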
Flight service issues
Security issues
Security scans return an Inadequate Account Lockout Mechanism message
Applies to: 5.0.0 and later
- Diagnosing the problem
- If you run a security scan against Cloud Pak for Data, the scan returns the following message.
Inadequate Account Lockout Mechanism
- Resolving the problem
- This is by design. It is strongly recommended that you use an enterprise-grade password management solution, such as SAML SSO or an LDAP provider for password management, as described in the following resources.
The Kubernetes version information is disclosed
Applies to: 5.0.0 and later
- Diagnosing the problem
- If you run an Aqua Security scan against your cluster, the scan returns the following issue:
- Resolving the problem
- This is expected based on the following solution document from Red Hat OpenShift Customer Portal: Hide kubernetes /version API endpoint in OpenShift Container Platform 4.