Troubleshooting
If your "demo" deployment is not working as you expect, review the listed issues and try the suggested mitigations or workarounds.
The troubleshooting information is divided into the following sections:
- Directory mount failure prevents pod readiness
- Cluster admin setup script issues
- Db2® issues
- Generated routes do not work
- Case init job failure
Directory mount failure prevents pod readiness
If a pod stays in a CreateContainerError state and the description of the problem includes text similar to the following message, remove the problematic mounted path.
Warning Failed 43m kubelet Error: container create failed: time="2021-03-03T07:26:47Z" level=warning msg="unable to terminate initProcess" error="exit status 1"
time="2021-03-03T07:26:47Z" level=error msg="container_linux.go:366: starting container process caused: process_linux.go:472: container init caused: rootfs_linux.go:60: mounting \"/var/lib/kubelet/pods/473b091d-acff-437b-b568-2383604dac01/volume-subpaths/config-volume/icp4adeploy-cmis-deploy/3\" to rootfs at \"/var/lib/containers/storage/overlay/d011608f6df4bbfcc26c7d60568915caf7932124e61924b1a75802e6884ea060/merged/opt/ibm/wlp/usr/servers/defaultServer/configDropins/overrides/ibm_oidc_rp.xml\" caused: not a directory"
The problem occurs when a folder is generated instead of an XML file: an empty folder is created at the mount point to mount the file into the deployment, and this raises the error.
You can remove a problematic folder from a deployment in two ways:
- If you can access the persistent volume, go to the mounted path and delete it. You can get the path to the folder by running the following command:
  oc describe pv $pv_name
- If you cannot access the persistent volume, edit the deployment by removing the failed mount.
- Edit the deployment by running the oc edit deployment <deployment_name> command. The following lines show an example mountPath:
  - mountPath: /opt/ibm/wlp/usr/servers/defaultServer/configDropins/overrides/ibm_oidc_rp.xml
    name: config-volume
    subPath: ibm_oidc_rp.xml
- You can then access the pod when it is Running by using the oc exec -it command:
  oc exec -it icp4adeploy-cmis-deploy-5cd4774f78-mg6pw bash
- Delete the file with the rm command:
  rm /opt/ibm/wlp/usr/servers/defaultServer/configDropins/overrides/ibm_oidc_rp.xml
When the folder is removed, you can wait for the operator to reconcile the change or add the removed mount path back manually to fix it.
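The removal steps above can be sketched as a quick check: given the deployment's volumeMounts, list every file that is mounted with a subPath, because these single-file mounts are the ones that can turn into empty folders on the persistent volume. The snippet below is a minimal sketch that reuses the YAML values from the example above.

```shell
# Write the example volumeMounts snippet to a scratch file (values taken from the example above).
cat <<'EOF' > /tmp/volume-mounts.yaml
- mountPath: /opt/ibm/wlp/usr/servers/defaultServer/configDropins/overrides/ibm_oidc_rp.xml
  name: config-volume
  subPath: ibm_oidc_rp.xml
EOF
# Print each mountPath that is backed by a subPath; these are the mounts that
# can fail with "not a directory" when a folder is generated instead of a file.
awk '/mountPath:/ {path=$3} /subPath:/ {print path}' /tmp/volume-mounts.yaml
```

In a live cluster you would run the same filter over the output of oc get deployment <deployment_name> -o yaml instead of the scratch file.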
Cluster admin setup script issues
During the execution of the cp4a-clusteradmin-setup.sh script, the CRD can fail to deploy. If the following message appears in the output, the user ('XYZ' in the example) does not have cluster-admin permission:
Start to create CRD, service account and role ... Error from server (Forbidden): error when retrieving current configuration of: "/root/git/cert-kubernetes/descriptors/ibm_cp4a_crd.yaml": customresourcedefinitions.apiextensions.k8s.io "icp4aclusters.icp4a.ibm.com" is forbidden: User "XYZ" cannot get customresourcedefinitions.apiextensions.k8s.io at the cluster scope: no RBAC policy matched
- Log out of the current session (non-admin).
- Log in to OCP with the OCP cluster admin user. Using the OpenShift CLI:
  oc login -u dbaadmin
  Where dbaadmin is the cluster admin user.
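If it is unclear which user the failing session was logged in as, the Forbidden message itself names it. A minimal sketch that pulls the user name out of a message in the format shown above (the sample text is abbreviated from the example output):

```shell
# Sample "Forbidden" message in the format shown above.
msg='customresourcedefinitions.apiextensions.k8s.io "icp4aclusters.icp4a.ibm.com" is forbidden: User "XYZ" cannot get customresourcedefinitions.apiextensions.k8s.io at the cluster scope: no RBAC policy matched'
# Extract the quoted user name so you know which session lacks cluster-admin.
echo "$msg" | sed -n 's/.*User "\([^"]*\)".*/\1/p'
```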
Db2 issues
For 21.0.1, Db2 is installed as part of the prerequisites of the patterns. The following issues can be resolved by matching the source of the problem with the proposed solution to make Db2 operational again.
- Reconciler error because Db2 cannot create schema
If you try to install a second deployment on the same cluster, you might encounter an installation issue. You might see in the operator log a reconciliation error if the deployment is trying to use a worker node where Db2 is already running. To resolve the issue, delete the first deployment.
- Intermittent issue where Db2 process is not listening on port 50000
If the message "not listening on port 50000" is found in the logs:
- Get the current running Db2 pod. Using the OpenShift CLI:
  oc get pod
- Go to the pod. Using the OpenShift CLI:
oc exec -it <db2 pod> bash
- Switch to the db2inst1 user:
su - db2inst1
- Reapply the configuration:
db2 update dbm cfg using SVCENAME 50000
- Restart Db2:
db2stop
db2start
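Before reapplying the configuration, you can confirm which service name (port) the instance is currently configured with by parsing the dbm configuration. A sketch, using a sample line in the format that db2 get dbm cfg prints (inside the pod you would pipe the real command output instead of the heredoc):

```shell
# Sample line in the format printed by `db2 get dbm cfg`; the value 50000 is
# taken from the symptom above.
cat <<'EOF' | awk -F'=' '/SVCENAME/ {gsub(/ /, "", $2); print $2; exit}'
 TCP/IP Service name                          (SVCENAME) = 50000
EOF
```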
- Db2 pod failed to start where db2u-release-db2u-0 pod shows 0/1 Ready
- This issue has the following symptoms in the Db2 pods:
[5357278.440940] db2u_root_entrypoint.sh[20]: + sudo /opt/ibm/db2/V11.5.0.0/adm/db2licm -a /db2u/license/db2u-lic
[5357278.531782] db2u_root_entrypoint.sh[20]: LIC1416N The license could not be added automatically. Return code: "-100".
[5357278.535893] db2u_root_entrypoint.sh[20]: + [[ 156 -ne 0 ]]
[5357278.536085] db2u_root_entrypoint.sh[20]: + echo '(*) Unable to apply db2 license.'
[5357278.536177] db2u_root_entrypoint.sh[20]: (*) Unable to apply db2 license.
To mitigate the issue, you have a number of options:
- Option 1: Kill Db2
- Option 2: Clean Db2 and redeploy
- Option 3: Delete the project/namespace
- Option 4: Restart the entire cluster
Option 1: Kill Db2
- Run the following command to get the worker node that db2u is running on. Using the OpenShift CLI:
  oc get nodes -o wide
- Run an ssh command as root on the worker node that hosts Db2u:
ssh root@<worker node>
- Run the following command to kill the orphaned db2u semaphores:
ipcrm -S 0x61a8
- Clean up the affected project/namespace:
The following OCP CLI command gets the custom resource name:
oc get icp4acluster
Delete the custom resource:
oc delete icp4acluster $name
Where $name is the result from the previous command.
Delete the operator deployment:
oc delete deployment <operator-deployment-name>
- Run the deployment script to start again.
Option 2: Clean Db2 and redeploy
- Get the custom resource name for icp4acluster. Using the OpenShift CLI:
  oc get icp4acluster
- Delete the CR. Using the OpenShift CLI:
  oc delete icp4acluster $name
  Alternatively, run oc delete -f $cr.yaml. The $cr.yaml is generated in the ./tmp directory. You also need to delete the operator deployment by running the following OCP CLI command:
  oc delete deployment <operator-deployment-name>
- Make sure that nothing is leftover by running the following OCP CLI commands:
oc get sts
oc get jobs
oc get deployment
oc get pvc | grep db2
- Run the deployment script to start again.
Option 3: Delete the project/namespace
If options 1 or 2 do not work, delete the project and redeploy by running the following OCP CLI command:
oc delete project $project_name
Option 4: Restart the entire cluster
- If none of the other options work, get the names of the nodes and restart them. Using the OpenShift CLI:
  oc get no --no-headers | awk '{print $1}'
- Restart all of the nodes listed (restart the worker nodes first, then the infrastructure node, and then the master node).
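The restart order in the step above can be scripted by grouping nodes by role before rebooting. A minimal sketch over sample oc get no output (the node names and role values below are illustrative, not from the source):

```shell
# Sample node listing: NAME STATUS ROLE (names and roles are illustrative).
cat <<'EOF' > /tmp/nodes.txt
worker-1 Ready worker
worker-2 Ready worker
infra-1 Ready infra
master-1 Ready master
EOF
# Print the nodes in the recommended restart order:
# workers first, then infrastructure, then masters.
for role in worker infra master; do
  awk -v r="$role" '$3 == r {print $1}' /tmp/nodes.txt
done
```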
- db2-release-db2u-restore-morph-job-xxxxx shows "Running", but fails to be "Completed"
- Run the following OCP CLI command to check and confirm this issue:
  oc get pod
  The command outputs a table that shows the STATUS and READY columns:
  NAME                                       READY   STATUS
  db2-release-db2u-restore-morph-job-xxxxx   1/1     Running
If the STATUS does not change to Completed after a few minutes:
- Delete the Db2 pod by running the oc delete command:
  oc delete pod db2-release-db2u-restore-morph-job-xxxxx
- Confirm that the Db2 job is terminated and a new pod is up and running:
  oc get pod -w
  When the job reads Completed, the pattern can continue to deploy.
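A quick way to spot jobs stuck in this state is to filter the pod listing for job pods that report Running instead of Completed. A sketch over sample output (in the cluster, pipe oc get pod --no-headers instead of the heredoc):

```shell
# Sample `oc get pod` output lines: NAME READY STATUS.
# Print job pods that are still Running (candidates for the stuck state above).
cat <<'EOF' | awk '$1 ~ /job/ && $3 == "Running" {print $1}'
db2-release-db2u-restore-morph-job-xxxxx 1/1 Running
db2u-release-db2u-0 1/1 Running
icp4adeploy-cmis-deploy-abc 1/1 Running
EOF
```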
- db2-release-db2u-restore-morph-job-xxxxx failed on bare metal nodes
- If your deployment uses bare metal nodes on your ROKS cluster, then you need to make the following updates to work around the failing db2-release-0 pod on a bare metal node.
- Get the node information on the cluster by running the following commands:
  oc get nodes
  oc get nodes --show-labels
It is important to identify which nodes are bare metal.
- To make sure that the bare metal nodes cannot be scheduled, run the following command:
  oc adm cordon <node-name>
- Delete the db2morph job:
  oc delete job db2-release-db2u-restore-morph-job
- Delete the Db2 release pod:
oc delete pod db2u-release-db2u-0
- Make sure that the new Db2 release pod moved to a non-bare metal node:
  oc get pods -o wide | grep db2u-release-db2u-0
- Delete the operator pod to recreate the morph job:
  oc get pods | grep ibm-cp4a-operator
  oc delete pod <operator-pod-name>
After the morph job is created and the operator starts deploying the RR/UMS pods, you can make your bare metal nodes schedulable again.
- To make a bare metal node schedulable, run the following command:
  oc adm uncordon <node-name>
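The first step asks you to identify which nodes are bare metal from the label listing. The exact label value varies by cluster and flavor, so the pattern below is an assumption to adjust; a sketch that filters sample --show-labels output:

```shell
# Sample `oc get nodes --show-labels` output (node names and the "baremetal"
# label value are illustrative assumptions; adjust the pattern to whatever
# marks bare metal nodes in your cluster).
cat <<'EOF' | awk '$NF ~ /baremetal/ {print $1}'
worker-1 Ready worker 10d v1.19.0 node.kubernetes.io/instance-type=baremetal
worker-2 Ready worker 10d v1.19.0 node.kubernetes.io/instance-type=b3c.4x16
EOF
```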
- db2-release-db2u pods cannot be accessed after deployment
- Open the operator log to view the deployment progress:
  oc logs <operator pod name> -c operator -n <project-name>
  Search for the string "db2u-release-db2u-statefulset pod is ready" in the log. The log might show the status of the db2u-release-db2u-statefulset pod as RETRYING.
  TASK [prerequisites : check if db2u-release-db2u-statefulset pod is ready] *****
  task path: /opt/ansible/roles/prerequisites/tasks/db2/db2-deploy.yml:141
  Monday 04 May 2020 23:29:45 +0000 (0:00:00.095) 0:01:09.060
  FAILED - RETRYING: check if db2u-release-db2u-statefulset pod is ready (35 retries left).
  FAILED - RETRYING: check if db2u-release-db2u-statefulset pod is ready (34 retries left).
  FAILED - RETRYING: check if db2u-release-db2u-statefulset pod is ready (33 retries left).
  FAILED - RETRYING: check if db2u-release-db2u-statefulset pod is ready (32 retries left).
  FAILED - RETRYING: check if db2u-release-db2u-statefulset pod is ready (3 retries left).
  FAILED - RETRYING: check if db2u-release-db2u-statefulset pod is ready (2 retries left).
  FAILED - RETRYING: check if db2u-release-db2u-statefulset pod is ready (1 retries left).
If you do see the RETRYING message, the shell script that runs inside the db2u pod is timing out. If the pods are not in a ready state after 20 to 25 minutes, delete Db2 and redeploy.
- Database issues after a cluster reboot
- A cluster reboot can cause permissions issues with Db2. Check ~/sqllib/security/db2chpw and ~/sqllib/security/db2ckpw in the Db2 pod for -r-s--x--x permission. If these are not set properly, use the following instructions to fix them: https://www.ibm.com/support/pages/database-connection-fails-authentication-error-sql1639n. Use the following commands:
  oc exec -it db2u-release-db2u-0 bash
  sudo wvcli system disable -m "Disable HA before Db2 maintenance"
  su db2inst1
  db2stop
  exit
  cd /mnt/blumeta0/home/db2inst1/sqllib/security
  chmod 4411 db2chpw db2ckpw
  cd /opt/ibm/db2/V11.5.0.0/instance/
  ./db2iupdt db2inst1
  sudo wvcli system enable -m "Enable HA after Db2 maintenance"
Afterward, delete the Content Platform Engine pod so that the cluster can recreate the pod.
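The chmod 4411 step sets the setuid bit (the leading 4) plus read for the owner and execute for group and other. The commands above must be run inside the Db2 pod as shown; the sketch below only illustrates the resulting numeric mode on a throwaway local file:

```shell
# Demonstrate mode 4411 on a scratch file:
# 4 = setuid bit, 4 = owner read, 1 = group execute, 1 = other execute.
f=$(mktemp)
chmod 4411 "$f"
stat -c '%a' "$f"   # prints 4411 (GNU stat)
rm -f "$f"
```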
- Project database limit for the Document Processing pattern
- The evaluation deployment for Document Processing includes one project database. This configuration supports the creation of only one Document Processing project.
Generated routes do not work
In some environments, route URLs contain the string apps. However, the cp4a-clusteradmin-setup.sh script returns the hostname of the infrastructure node without this string. If you entered the hostname in the cp4a-post-deployment.sh script in an environment that uses apps., the routes do not work.
Workaround: When you run the cp4a-deployment.sh script, add apps. to the infrastructure hostname.
For example, if the cp4a-clusteradmin-setup.sh script outputs the infrastructure hostname as ocp-master.tec.uk.ibm.com, enter ocp-master.apps.tec.uk.ibm.com when you run the cp4a-post-deployment.sh script.
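The fix amounts to inserting apps. after the first label of the infrastructure hostname. A one-line sketch using the example values:

```shell
# Insert "apps." after the first hostname label (example values from above).
host=ocp-master.tec.uk.ibm.com
echo "$host" | sed 's/\./.apps./'
```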
You can also run oc get route --all-namespaces and extract the common pattern URL for the routes.
Case init job failure
- If the case init job restarts several times but fails, complete the following steps.
- Check the case init job pod logs by running a command similar to the following command:
  oc logs --previous <case-init-job-pod>
  If the result has the following error, the case init job is running into a Content Platform Engine timeout.
CPE_URL=http://bawps-cpe-svc:9080/wsi/FNCEWS40MTOM Certificate was added to keystore log4j:WARN No appenders could be found for logger (filenet_error.api.com.filenet.apiimpl.util.ConfigValueLookup). log4j:WARN Please initialize the log4j system properly. CPE URI :http://bawps-cpe-svc:9080/wsi/FNCEWS40MTOM [Perf Log] No interval found. Auditor disabled. P8DOMAIN starting setup DOS and TOS executing setupTOS java.lang.RuntimeException: The case management add-ons cannot be installed in Content Engine. The installation of the AddOn 20.0.0.1 Case Management Target Object Store Extensions into the object store TARGET failed. The installation report follows: <ImportErrors><ClassDefinitions><ReplicableClassDefinition><Id>6d18ffeb-7be8-41ac-9322-38a72743a10d</Id><Name>Health Condition</Name><ExceptionMessage>The database access failed with the following error: ErrorCode 0, Message 'addSync: caught Exception' ObjectStore: "TARGET", SQL: "SELECT security_id FROM OS2USER.TableDefinition WHERE (object_id = ?)"</ExceptionMessage><ExceptionCode>DB_ERROR</ExceptionCode><HRESULT>0x800710d9</HRESULT></ReplicableClassDefinition></ClassDefinitions></ImportErrors> The case management add-ons cannot be installed in Content Engine. The installation of the AddOn 20.0.0.1 Case Management Target Object Store Extensions into the object store TARGET failed. The installation report follows: <ImportErrors><ClassDefinitions><ReplicableClassDefinition><Id>6d18ffeb-7be8-41ac-9322-38a72743a10d</Id><Name>Health Condition</Name><ExceptionMessage>The database access failed with the following error: ErrorCode 0, Message 'addSync: caught Exception' ObjectStore: "TARGET", SQL: "SELECT security_id FROM OS2USER.TableDefinition WHERE (object_id = ?)"</ExceptionMessage><ExceptionCode>DB_ERROR</ExceptionCode><HRESULT>0x800710d9</HRESULT></ReplicableClassDefinition></ClassDefinitions></ImportErrors> java.lang.RuntimeException: The case management add-ons cannot be installed in Content Engine. 
The installation of the AddOn 20.0.0.1 Case Management Target Object Store Extensions into the object store TARGET failed. The installation report follows: <ImportErrors><ClassDefinitions><ReplicableClassDefinition><Id>6d18ffeb-7be8-41ac-9322-38a72743a10d</Id><Name>Health Condition</Name><ExceptionMessage>The database access failed with the following error: ErrorCode 0, Message 'addSync: caught Exception' ObjectStore: "TARGET", SQL: "SELECT security_id FROM OS2USER.TableDefinition WHERE (object_id = ?)"</ExceptionMessage><ExceptionCode>DB_ERROR</ExceptionCode><HRESULT>0x800710d9</HRESULT></ReplicableClassDefinition></ClassDefinitions></ImportErrors> at com.ibm.casemgmt.config.ContentEngineHelper.setUpCMTOS(ContentEngineHelper.java:1833) at com.ibm.ecm.icm.config.init.repository.ConfigureObjectStore.setupTOS(ConfigureObjectStore.java:99) at com.ibm.ecm.icm.config.init.test.ConfigureContentEngine.installAddons(ConfigureContentEngine.java:48) at com.ibm.ecm.icm.config.init.test.InitCaseManager.main(InitCaseManager.java:19)
- Add a Liberty configuration file to overwrite the timeout, with the following content:
  <server>
    <transaction clientInactivityTimeout="1800s" propogatedOrBMTTranLifetimeTimeout="1800s" totalTranLifetimeTimeout="1800s"/>
  </server>
  For more information, see Tuning IBM WebSphere® Liberty for FileNet® Content Manager components.
- If the case init job stops generating new pods, delete the case init job and let the operator re-create it.
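To confirm that a failing pod hit this specific CPE timeout rather than some other error, you can grep its saved log for the DB_ERROR exception code from the installation report above. A sketch with a sample log line (in the cluster you would grep the output of oc logs --previous for the pod instead):

```shell
# Write a sample log line containing the signature from the report above.
cat <<'EOF' > /tmp/case-init.log
<ExceptionCode>DB_ERROR</ExceptionCode>
EOF
# Report whether the timeout signature is present.
grep -q 'DB_ERROR' /tmp/case-init.log && echo "CPE timeout signature found"
```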