In most cases with an Operational Decision Manager instance on OpenShift, a
problem is likely to be in your pods, replicas, or database.
About this task
On Kubernetes, you manage Operational Decision Manager services in the following
elements:
- A pod that serves as a wrapper around a single Docker container.
- The replica sets that Deployments use as a mechanism to create, delete, and update pods. Ordinarily, you do not have to worry about managing the replica sets that Deployments create; Deployments own and manage them. You can specify how many pods to run concurrently by setting .spec.replicas (see the sketch after this list).
- The database in which the data is persisted.
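A minimal illustration of that field in a Deployment manifest (a fragment only; in practice the Operational Decision Manager operator creates and manages these Deployments for you):
spec:
  replicas: 2   # run two pods concurrently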
Procedure
- List all the container images that are running in your cluster.
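For example, the following command lists the unique container images across all namespaces (a sketch; it assumes that you are allowed to list pods cluster-wide):
oc get pods --all-namespaces --output=jsonpath="{.items[*].spec.containers[*].image}" | tr -s '[[:space:]]' '\n' | sort | uniq -c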
- Check the current state and recent events of your pods to see whether they are all running.
You can get more targeted information about a pod by running the oc get or
oc describe commands:
oc get pod POD_NAME --output=yaml
oc describe pod POD_NAME
- If the pods are not created, check the status of the replica sets and inspect the events for one of them by running the following commands:
oc get replicaset
oc describe replicaset REPLICASET_NAME
- If a pod is stuck in Pending, look at the output of the oc describe command.
Find the messages from the scheduler about why it cannot schedule your pod. The most likely
reason is that you do not have enough resources. The CPU or memory in your cluster might be full. In
this case, you need to delete pods, adjust resource requests, or add new nodes to your cluster.
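To check whether the cluster is short of CPU or memory, inspect the node capacity and current usage, for example (the second command assumes that cluster metrics are available):
oc describe nodes
oc adm top nodes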
- If a pod is stuck in the Waiting state or in Init:ImagePullBackOff, look at the output of the oc describe command.
The most common cause of stuck Waiting pods is a failure to pull
the image. If you installed from the command line, check that the name of the image is correct and
that you pushed the image to the repository, and try to pull the image.
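For example, you can confirm the image reference that the pod uses and then try to pull it manually (a sketch; the podman command assumes that you are logged in to the image registry from your workstation, and the image reference is a placeholder):
oc get pod POD_NAME --output=jsonpath="{.spec.containers[*].image}"
podman pull REGISTRY/REPOSITORY/IMAGE:TAG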
- If a pod is stuck in Running and does not change to Ready after a while, the health check might take longer than the readiness and liveness timeout values. When you edit an existing deployment, another pod is created with the new timeout values.
Note: If a Decision Center pod is stuck in Running and does not change to Ready after a while, the database configuration might be broken and the pod cannot start. To resolve the problem, clean the database, for example by running the command db2 drop db <ODMDB>, and increase the readinessProbe and livenessProbe values in the custom resource (CR) file.
By default, the readiness probe times out after approximately five minutes. If there is cluster slowness, or network or database latency, double the values in the CR:
odm_configuration:
  readinessProbe:
    initialDelaySeconds: 5
    periodSeconds: 5
    failureThreshold: 90
    timeoutSeconds: 15
  livenessProbe:
    initialDelaySeconds: 300
    periodSeconds: 10
    failureThreshold: 90
    timeoutSeconds: 15
- If a pod is Evicted, it might have reached the ephemeral
storage limit. Check the logs for the following error message:
Pod ephemeral local storage usage exceeds the total limit of containers 500Mi
To fix the issue, increase the resources.limits.ephemeral-storage value of
the corresponding component. See IBM
Operational Decision Manager parameters to find the default values.
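For example, the limit might be raised in the CR as follows (a sketch; decisionServerRuntime is an assumed component name and 1Gi an arbitrary size, so use the component and value that match your deployment):
odm_configuration:
  decisionServerRuntime:            # assumed component name; use the affected component
    resources:
      limits:
        ephemeral-storage: 1Gi      # example value; set it higher than the default limit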
- If a pod fails or is otherwise unhealthy, look at the logs of the current pod:
oc logs POD_NAME
If your pod previously failed, add the --previous argument to access those logs:
oc logs --previous POD_NAME
Otherwise, you can run commands inside that pod with exec:
oc exec POD_NAME -- CMD ARG1 ARG2 ... ARGN
For example, to get a shell to the running pod:
oc exec -ti POD_NAME -- /bin/bash
In your shell, list the root directory and use other commands to view the configuration:
ls
cat /config/server.xml
cat /config/datasource.xml
cat /proc/mounts
cat /proc/1/maps
- With an operator, you can set the debug configuration parameter to add more tracing at any time.
In the odm_configuration part of the custom resource YAML file, set the value of the debug parameter to true:
odm_configuration:
  debug: true
To apply the modified CR file, run the following command:
oc apply -f <modifiedCrWithDebugFlag>
To view the trace in the operator pods, run the following command:
oc logs <OperatorPodId>
Results
If nothing looks wrong in your configuration and you continue to get no response when you try to
access your service, see Debug Services.
What to do next
If the Operational Decision Manager instance is
working correctly, but the application is not working as you expect, inspect the Operational Decision Manager logs. You might have to change
the logging levels to get more details on the suspected problem.
Operational Decision Manager runs on the Liberty
profile, which uses a unified logging component for handling messages. The logging component also
provides First Failure Data Capture (FFDC) services, and unifies the messages that are written to
System.out, System.err, and java.util.logging
with other messages. The logging component is controlled through the server configuration.
You customize the logging properties by adding logging elements to a server configuration file,
and then creating a Kubernetes configMap to apply to the configuration. The
configuration of the log level uses the following format:
<component> = <level>
The <component> is the component for which to set a log level, and the
<level> is one of the valid logger levels. For more information, see Configuring logging.
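For example, a logging element that raises the level for one component might look like the following sketch (the component name com.ibm.rules.res is illustrative; use the logger that corresponds to the area you are investigating):
<logging traceSpecification="*=info:com.ibm.rules.res=finest" traceFileName="trace.log" maxFileSize="20" maxFiles="5"/>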