Troubleshooting

Before you open a support case for Cloud Pak for Business Automation, collect specific information about your deployments so that you can assess their health and diagnose the problem.

Before you begin

You must collect specific data about your environment and your Cloud Pak installation before you contact IBM support for assistance with a Cloud Pak for Business Automation issue. You must provide a detailed description of the problem and your environment.

When you run diagnostic commands, run them from an empty directory to package the files more cleanly. Run the commands from the namespace in which you observe the problematic container or component. For more information, see Collecting data to diagnose issues.
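As a sketch, the preparation might look like the following; the directory and project names are placeholders:

```shell
# Work from a fresh, empty directory so that the collected files package cleanly
mkdir cp4ba-diagnostics && cd cp4ba-diagnostics

# Switch to the project (namespace) that contains the problematic component;
# replace "my-cp4ba-project" with your own project name
oc project my-cp4ba-project
```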

The OpenShift must-gather CLI command collects information from your cluster, which can be used to debug issues. You can specify one or more images when you run the command by including the --image argument. When you specify an image, the tool collects data that is related to that image.

A must-gather extension image for all IBM Cloud Paks is also available at: opencloudio/must-gather.

You can collect logs by running the following command:

oc adm must-gather --image=quay.io/opencloudio/must-gather:latest

For more information about collecting the logs, see Collecting support information about the cluster.

About this task

The ibm-cp4a-operator locates the Cloud Pak base images and has Ansible® roles to handle the reconciliation logic and declare a set of playbook tasks for each component. The roles declare all the variables and defaults for how the role is executed.

The operator deployment creates a container on your cluster for the operator. The following diagram shows how the operator watches for events, triggers an Ansible role when a custom resource changes, and then reconciles the resources for the deployed applications.

Operator workflow

Depending on the type of operator, different logs are more useful. Use the following table to choose the Ansible or Go logs.

Table 1. Operator types
Capability                        Type of operator   Operator name
CP4BA (multi-pattern)             Ansible            ibm-cp4a-operator
CP4BA FileNet Content Manager     Ansible            ibm-content-operator
Automation Foundation             Ansible            ibm-foundation-operator
CP4BA Workflow Process Server     Go                 ibm-cp4a-wfps-operator
CP4BA Process Federation Server   Go                 ibm-cp4a-pfs-operator
Getting the logs of the Go-based operators
To get the log for Go-based operators, run the following command:
kubectl logs deployment/$operator_name -n $project_name > operator.log
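For example, to collect the log of the Workflow Process Server operator from Table 1 (the project name is a placeholder):

```shell
export project_name=my-cp4ba-project   # placeholder: your project name
export operator_name=ibm-cp4a-wfps-operator

kubectl logs deployment/$operator_name -n $project_name > operator.log
```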
Getting the logs of the Ansible-based operators
To get the log of the latest reconciliation for Ansible-based operators, run the following command:
# <Must set> Set your project name here 
export project_name=$your_project_name

# <Must set> Set target operator name here 
export operator_name=$operator_name

operator_pod_name=$(kubectl get pod -n $project_name | grep $operator_name | awk '{print $1}')
kubectl exec -i $operator_pod_name -n $project_name -- /bin/bash -c 'cat /tmp/ansible-operator/runner/icp4a.ibm.com/v1/*/*/*/artifacts/latest/stdout' > operator-ansible.log 

Optional: Export the history of the Ansible logs.

Ansible operators keep a backup of the logs under /logs/$operator_pod_name/ansible-operator/runner/<group>/<version>/<kind>/<namespace>/<name>/artifacts. The log contains information on the first 10 reconciles, including the latest reconcile. The following commands copy the logs to a local directory. Select the commands for the operator whose log you want to export.

# <Must set> Set your project name here 
export project_name=$your_project_name

export deployment_name=$(kubectl get icp4acluster | awk '{print $1}' | grep -v "NAME")

# Export the CP4BA operator's Ansible log to /tmp/$operator_pod_name-log; not needed when you installed from the Content operator
export operator_pod_name=$(kubectl get pod -n $project_name | grep ibm-cp4a-operator | awk '{print $1}')
kubectl cp $project_name/$operator_pod_name:/logs/$operator_pod_name/ansible-operator/runner/icp4a.ibm.com/v1/ICP4ACluster/$project_name/$deployment_name/artifacts /tmp/$operator_pod_name-log

# Export the Content operator's Ansible log to /tmp/$operator_pod_name-log; only needed when the Content pattern is involved
export operator_pod_name=$(kubectl get pod -n $project_name | grep ibm-content-operator | awk '{print $1}')
kubectl cp $project_name/$operator_pod_name:/logs/$operator_pod_name/ansible-operator/runner/icp4a.ibm.com/v1/Content/$project_name/$deployment_name/artifacts /tmp/$operator_pod_name-log

# Export the Foundation operator's Ansible log to /tmp/$operator_pod_name-log; not needed when you installed from the CP4BA operator
export operator_pod_name=$(kubectl get pod -n $project_name | grep icp4a-foundation-operator | awk '{print $1}')
kubectl cp $project_name/$operator_pod_name:/logs/$operator_pod_name/ansible-operator/runner/icp4a.ibm.com/v1/Foundation/$project_name/$deployment_name/artifacts /tmp/$operator_pod_name-log
Note: If you see "Cannot stat: No such file or directory" when you export the Ansible logs, it means that either no log was generated by the current operator or the operator is in its first reconcile.

Optional: Edit the verbosity of the Ansible logs.

If the operator log does not provide the level of detail that you need, you can gather more details by adding an annotation like the following example to your custom resource YAML:

metadata:
  ...
  annotations:
    ansible.sdk.operatorframework.io/verbosity: "3"
spec:

For the verbosity value, the normal rules for Ansible verbosity apply, where higher values mean more output. Acceptable values range from 0 (only the most severe messages are output) to 7 (all debugging messages are output). After you update the custom resource YAML, reapply the YAML for the changes to take effect.
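Instead of editing the YAML by hand, you can also set the annotation with oc annotate. The following sketch assumes an ICP4ACluster custom resource with the default name icp4adeploy:

```shell
# Add or update the verbosity annotation on the custom resource;
# "icp4adeploy" is the default metadata.name and might differ in your deployment
oc annotate icp4acluster icp4adeploy \
  ansible.sdk.operatorframework.io/verbosity="3" \
  --overwrite -n $project_name
```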

Getting information about pending pods
If some pods are pending, choose one of the pods, and run the following command to get more information.
kubectl describe pod <podname> 
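To find the pending pods before you pick one to describe, you can filter on the pod phase:

```shell
# List only pods that are not yet scheduled or started
kubectl get pods --field-selector=status.phase=Pending

# Then inspect one of them in detail and save the output for support
kubectl describe pod <podname> > pod-describe.log
```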
Getting information about secrets
Kubernetes secrets are used extensively, so output about them might also be useful.
kubectl get secrets
Getting information about events
Kubernetes events are objects that provide more insight into what is happening inside a cluster, such as what decisions the scheduler makes or why some pods are evicted from a node. To get information about these events, run the following command.
kubectl get events > events.log
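By default the events are not ordered chronologically; sorting by timestamp usually makes the log easier to read:

```shell
# Sort events by the time they last occurred before saving them
kubectl get events --sort-by=.lastTimestamp > events.log
```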

You can also add the verbose parameter to any kubectl command.

kubectl -v=9 get pods
Enabling Liberty tracing for Liberty-based CP4BA pods

For FNCM, BAN, and ADP pods, use the following steps to enable a WebSphere Application Server (WAS) Liberty logging trace specification:

  1. Create a custom_server.xml file with a custom Liberty trace specification. A Liberty trace specification can vary and depends on why you are enabling it. The specification might come from IBM support or Liberty support.
  2. Copy the custom_server.xml file into the target pod under the /opt/ibm/wlp/usr/servers/defaultServer/configDropins/overrides directory. This directory is mapped to a PVC where the configuration file can be persisted.

    WAS Liberty immediately detects the server configuration file and creates a trace.log file in the default directory, or in a custom directory if you specified one in the custom_server.xml file.
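The two steps might look like the following sketch. The trace specification string and the pod name are examples only; the actual specification is typically provided by IBM or Liberty support:

```shell
# Step 1: create custom_server.xml with an example trace specification
cat > custom_server.xml <<'EOF'
<server>
    <logging traceSpecification="*=info:com.ibm.ws.security.*=all"
             maxFileSize="100" maxFiles="10"/>
</server>
EOF

# Step 2: copy the file into the target pod (pod name is a placeholder)
oc cp custom_server.xml \
  $project_name/<pod_name>:/opt/ibm/wlp/usr/servers/defaultServer/configDropins/overrides/custom_server.xml
```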

Recreating the image pull secret
If your Docker registry secret expires, you can delete the secret and re-create it:
oc delete secret admin.registrykey -n <namespace>
oc create secret docker-registry admin.registrykey --docker-server=image-registry.openshift-image-registry.svc:5000 --docker-username=kubeadmin --docker-password=$(oc whoami -t) -n <namespace>
Applying changes by restarting pods
In some cases, changes that you make in the custom resource YAML by using the operator or directly in the environment are not automatically propagated to all pods. For example, modifications to data source information or changes to Kubernetes secrets are not seen by running pods until the pods are restarted.

If changes applied by the operator or other modifications that are made in the environment do not provide the expected result, restart the pods by scaling the impacted deployments down to 0 and then back up to the number of replicas that you want. Kubernetes (OpenShift) then terminates the existing pods and creates new ones.
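A minimal sketch of the scale-down/scale-up restart, assuming a deployment named icp4adeploy-navigator-deploy that normally runs two replicas:

```shell
# Scale to 0 so that Kubernetes terminates the existing pods
oc scale deployment icp4adeploy-navigator-deploy --replicas=0 -n $project_name

# Wait until the pods are gone, then scale back to the original replica count
oc scale deployment icp4adeploy-navigator-deploy --replicas=2 -n $project_name
```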

Directory mount failure prevents pod readiness
If a pod stays in a CreateContainerError state, and the description of the problem includes text similar to the following message, remove the failing mounted path.
Warning  Failed  43m  kubelet  Error: container create failed: time="2021-03-03T07:26:47Z" level=warning msg="unable to terminate initProcess" error="exit status 1"
time="2021-03-03T07:26:47Z" level=error msg="container_linux.go:366: starting container process caused: process_linux.go:472: container init caused: rootfs_linux.go:60: mounting \"/var/lib/kubelet/pods/473b091d-acff-437b-b568-2383604dac01/volume-subpaths/config-volume/icp4adeploy-cmis-deploy/3\" to rootfs at \"/var/lib/containers/storage/overlay/d011608f6df4bbfcc26c7d60568915caf7932124e61924b1a75802e6884ea060/merged/opt/ibm/wlp/usr/servers/defaultServer/configDropins/overrides/ibm_oidc_rp.xml\" caused: not a directory"

The problem occurs when a folder is generated instead of an XML file. An empty folder is created where the file was supposed to be mounted into the deployment, which raises the error.

You can remove a problematic folder from a deployment in two ways:

  • If you can access the persistent volume, go to the mounted path and delete it. You can get the path to the folder by running the following command.
    oc describe pv $pv_name
  • If you cannot access the persistent volume, edit the deployment by removing the failed mount.
    1. Edit the deployment by running the oc edit deployment <deployment_name> command. The following lines show an example mountPath:
      - mountPath: /opt/ibm/wlp/usr/servers/defaultServer/configDropins/overrides/ibm_oidc_rp.xml
        name: config-volume
        subPath: ibm_oidc_rp.xml
    2. You can then access the pod when it is Running by using the oc exec -it command.
      oc exec -it icp4adeploy-cmis-deploy-5cd4774f78-mg6pw bash
    3. Delete the file with the rm command.
      rm /opt/ibm/wlp/usr/servers/defaultServer/configDropins/overrides/ibm_oidc_rp.xml

When the folder is removed, you can wait for the operator to reconcile the change or add the removed mount path back manually to fix it.

Cannot log in to the Zen console
After installation, you might not be able to log in to the Zen console by using the default cluster administrator admin user name. The cause of this problem is that the name admin also exists in the LDAP directory.

To resolve the login issue, use the following steps.

  1. Change the name of the admin user in the platform-auth-idp-credentials secret.
  2. Change the cluster-wide role binding oidc-admin-binding to the new admin user name.
  3. Log in to the OpenShift console by using the new admin user name.
  4. Add any new users that you need in the console.
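Steps 1 and 2 can be sketched as follows. The foundational services project name and the new user name cpadmin are assumptions; adjust them to your environment:

```shell
# Step 1: replace the admin user name in the platform-auth-idp-credentials secret;
# the secret typically lives in the foundational services project (often ibm-common-services)
oc patch secret platform-auth-idp-credentials -n ibm-common-services --type=json \
  -p '[{"op":"replace","path":"/data/admin_username","value":"'$(echo -n cpadmin | base64)'"}]'

# Step 2: update the oidc-admin-binding cluster role binding to the new user name
oc edit clusterrolebinding oidc-admin-binding
```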
Zen issues with NGINX configuration
If you see a "3.5.0.0 (xxxxxxxxxx)" message instead of the user interface when you try to access a component like ACCE or Navigator, use the following workaround to restart the pods for both IBM NGINX and the Zen watcher:
  1. Delete the IBM NGINX pod by running the following command. Replace $namespace with the name of your target project.
    oc delete po -l component=ibm-nginx -n $namespace

    The names of the deleted pods are returned:

    pod "ibm-nginx-6d958c8cd6-dhllb" deleted
    pod "ibm-nginx-6d958c8cd6-n9qqh" deleted
  2. Delete the Zen Watcher pod to restart it:
    oc delete po -l component=zen-watcher -n $namespace

    The name of the deleted pod is returned:

    pod "zen-watcher-6c89d9fc7c-qw7rm" deleted
Platform UI (Zen) becomes corrupted if the ZenService is deleted without uninstalling Cloud Pak for Business Automation

If Zen is corrupted, uninstall Cloud Pak for Business Automation, delete the Zen-associated PVs, and reinstall. The following errors are symptoms of a corrupted Zen.

  • The roles and user role mappings are lost when the Zen PVs are removed.
  • You might see "<no data>" in some UIs if the translation data is missing due to the zen-translation jobs not running.

For more information about uninstalling Cloud Pak for Business Automation, see Uninstalling capabilities.

Issues trying to install after you uninstalled
If you see issues when you install a new instance on a cluster that you already used for a Cloud Pak deployment, check if the IBM Automation Foundation dependencies are properly deleted.

For more information, see Uninstallation does not remove all components.

Profile size does not scale down
When you decrease the pattern profile size after installation, from large to medium or from medium to small, IBM Automation Foundation and IBM Cloud Pak foundational services do not scale down with the profile size change. This behavior is expected. For more information about profile sizes, see System requirements.
Operator pod in OOMKilled status
If you see the Cloud Pak for Business Automation operator pod with an OOMKilled status, it means that the resources that are allocated to the operator pod are not enough for the workload. You can modify the CSV (ClusterServiceVersion) to give the operator more resources. The following example can be adjusted to get the operator pod up and running again.
oc patch csv ibm-cp4a-operator.v22.1.0 --type=json -p '[
{
"op":"replace",
"path": "/spec/install/spec/deployments/0/spec/template/spec/containers/0/resources/limits/cpu",
"value": "4"
},
{
"op":"replace",
"path": "/spec/install/spec/deployments/0/spec/template/spec/containers/0/resources/limits/memory",
"value": "8Gi"
},
{
"op":"replace",
"path": "/spec/install/spec/deployments/0/spec/template/spec/containers/0/resources/requests/cpu",
"value": "1500m"
},
{
"op":"replace",
"path": "/spec/install/spec/deployments/0/spec/template/spec/containers/0/resources/requests/memory",
"value": "1600Mi"
}
]'
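After the patch, you can verify that OLM applied the new values to the operator deployment, for example:

```shell
# Show the resource requests and limits on the operator container
oc get deployment ibm-cp4a-operator -n $project_name \
  -o jsonpath='{.spec.template.spec.containers[0].resources}'
```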
Nginx deployment failed due to leftover Zen resources in cluster

If a Cloud Pak for Business Automation deployment shows a CrashLoopBackOff status for Nginx, it indicates that the Nginx configuration was not cleaned up properly and the Nginx pods cannot start.

oc get deploy | grep nginx
ibm-nginx                                         0/2     2            0           12h
ibm-nginx-tester                                  0/1     1            0           12h

oc get po | grep nginx
ibm-nginx-568667548b-6n4cw                                        0/1     CrashLoopBackOff   147        12h
ibm-nginx-568667548b-q9d8r                                        0/1     CrashLoopBackOff   147        12h
ibm-nginx-tester-684f8f9844-p6gp5                                 0/1     CrashLoopBackOff   147        12h
setup-nginx-job-nqzgd                                             0/1     Completed          0          12h

To work around the problem, you must make sure that all the CP4BA-generated .conf files are deleted from the Nginx PV.

To delete the configuration files, use the following steps:

  1. Save the following template to a remove-zen-extension-pod.yaml file.
    kind: Pod
    apiVersion: v1
    metadata:
      name: remove-zen-extension-pod
    spec:
      containers:
        - name: remove-zen-extension-pod
          image: busybox
          securityContext:
            privileged: true
            runAsUser: 0
          volumeMounts:
          - mountPath: "/data"
            name: my-volume
          command: [ "sleep", "1000000" ]
      volumes:
        - name: my-volume
          persistentVolumeClaim:
            claimName: user-home-pvc
  2. Run the "oc apply -f remove-zen-extension-pod.yaml" command to create the pod.
  3. Make sure that the remove-zen-extension-pod is up and running, and then log in to the pod by running the "oc rsh remove-zen-extension-pod" command.
  4. In the pod, delete all of the CP4BA generated .conf files under both "/data/_global_/upstream-conf.d/" and "/data/_global_/nginx-conf.d/".
    METANAME="icp4adeploy" 
    rm -rf /data/_global_/upstream-conf.d/${METANAME}*
    rm -rf /data/_global_/nginx-conf.d/${METANAME}*

    Where METANAME is the value of the metadata.name parameter in the custom resource of your CP4BA deployment. The default name is icp4adeploy.

  5. Restart the Nginx pods that showed the CrashLoopBackOff error, and when the new Nginx pods are up and running delete the remove-zen-extension-pod by running the "oc delete pod remove-zen-extension-pod" command.

What to do next

The custom resource can be configured to enable and disable specific logging parameters, log levels, log formats, and where these logs are stored for the various capabilities. If you need more information about specific Cloud Pak capabilities, go to the relevant troubleshooting topics.