Troubleshooting

In most cases, a problem with an Operational Decision Manager instance on OpenShift is likely to be in your pods, replicas, or database.

About this task

On Kubernetes, you manage Operational Decision Manager services in the following elements:

  • A pod that serves as a wrapper around a single Docker container.
  • The replicas that are used by Deployments as a mechanism to create, delete, and update pods. Ordinarily, you do not have to worry about managing the replicas that Deployments create, because Deployments own and manage them. You can specify how many pods to run concurrently by setting .spec.replicas, as in the sketch after this list.
  • The database in which the data is persisted.
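
For example, a minimal Deployment manifest sets the pod count through .spec.replicas. The names and image below are illustrative placeholders, not values from an actual Operational Decision Manager installation:

  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: my-odm-component                    # hypothetical name
  spec:
    replicas: 2                               # run two pods concurrently
    selector:
      matchLabels:
        app: my-odm-component
    template:
      metadata:
        labels:
          app: my-odm-component
      spec:
        containers:
        - name: my-odm-component
          image: my-registry/my-odm-image:tag # placeholder image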

Procedure

  1. List all the pods that are running in your cluster by using the following command:
    oc get pods
  2. Check the current state and recent events of your pods to see whether they are all running.

    You can get more targeted information about a pod by running the oc get or oc describe commands:

    oc get pod POD_NAME --output=yaml
    oc describe pod POD_NAME
    1. If the pods are not created, check the status of the replica sets and inspect the events for one of them by running the following commands:
      oc get replicaset
      oc describe replicaset REPLICASET_NAME
    2. If a pod is stuck in Pending, look at the output of the oc describe command.
      Find the messages from the scheduler about why it cannot schedule your pod. The most likely reason is insufficient resources: the CPU or memory in your cluster might be exhausted. In this case, you need to delete pods, adjust resource requests, or add new nodes to your cluster.
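
      To check how much CPU and memory remain on your nodes, you can use the following standard commands (oc adm top nodes requires cluster metrics to be enabled):

      oc describe nodes
      oc adm top nodes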
    3. If a pod is stuck in the Waiting state or Init:ImagePullBackOff, look at the output of the oc describe command.
      The most common cause of stuck Waiting pods is a failure to pull the image. If you installed from the command line, check that the image name is correct, verify that you pushed the image to the repository, and then try to pull the image manually.
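
      For example, you can review the pull error in the pod events and then try the pull yourself. The registry and image names here are placeholders:

      oc get events --field-selector involvedObject.name=POD_NAME
      docker pull my-registry/my-odm-image:tag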
    4. If a pod is stuck in Running and does not change to Ready after a while, the health check might take longer than the readiness and liveness timeout values. When you edit an existing deployment, another pod is created with the new timeout values.
      Note: If a Decision Center pod is stuck in Running and does not change to Ready after a while, the database configuration might be broken and the pod cannot start. To resolve the problem, clean the database, for example, by running the command db2 drop db <ODMDB>, and increase the readinessProbe and livenessProbe values in the custom resource (CR) file.

      By default, the readiness probe times out after approximately five minutes. If there is cluster slowness, or network or database latency, double the values in the CR:

      odm_configuration:
        readinessProbe:
          initialDelaySeconds: 5
          periodSeconds: 5
          failureThreshold: 90
          timeoutSeconds: 15
        livenessProbe:
          initialDelaySeconds: 300
          periodSeconds: 10
          failureThreshold: 90
          timeoutSeconds: 15
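
      After you change the values, apply the modified CR file, in the same way as the debug step later in this procedure:

      oc apply -f <modifiedCr>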
    5. If a pod is Evicted, it might have reached the ephemeral storage limit. Check the logs for the following error message:
      Pod ephemeral local storage usage exceeds the total limit of containers 500Mi

      To fix the issue, increase the resources.limits.ephemeral-storage value of the corresponding component, as in the sketch that follows. See IBM Operational Decision Manager parameters to find the default values.
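
      For example, a sketch of the CR fragment. The component name decisionServerRuntime is an assumption for illustration; use the component that is being evicted, and check IBM Operational Decision Manager parameters for the exact structure in your version:

      odm_configuration:
        decisionServerRuntime:
          resources:
            limits:
              ephemeral-storage: 1Gi   # increased from the 500Mi default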

    6. If a pod fails or is otherwise unhealthy, look at the logs of the current pod:
      oc logs POD_NAME 
      If your pod previously failed, add the --previous flag to access the logs of the failed instance:
      oc logs --previous POD_NAME 
      Otherwise, you can run commands inside that pod with exec:
      oc exec POD_NAME -- CMD ARG1 ARG2 ... ARGN

      For example, to get a shell to the running pod:

      oc exec -ti POD_NAME -- /bin/bash 

      In your shell, list the root directory and use other commands to view the configuration:

      ls 
      cat /config/server.xml
      cat /config/datasource.xml
      cat /proc/mounts 
      cat /proc/1/maps
  3. With an operator, you can set the debug configuration parameter to add more tracing at any time.

    In the odm_configuration part of the custom resource YAML file, set the value of the debug parameter to true:

    odm_configuration: 
       debug: true

    To apply the modified CR file, run the following command:

    oc apply -f <modifiedCrWithDebugFlag>

    To view the trace in the operator pods, run the following command:

    oc logs <OperatorPodId> 

Results

If nothing looks wrong in your configuration and you continue to get no response when you try to access your service, see Debug Services.

What to do next

If the Operational Decision Manager instance is working correctly, but the application is not working as you expect, inspect the Operational Decision Manager logs. You might have to change the logging levels to get more details on the suspected problem.

Operational Decision Manager runs on the Liberty profile, which uses a unified logging component for handling messages. The logging component also provides First Failure Data Capture (FFDC) services, and unifies the messages that are written to System.out, System.err, and java.util.logging with other messages. The logging component is controlled through the server configuration.

You customize the logging properties by adding logging elements to a server configuration file, and then creating a Kubernetes ConfigMap to apply the configuration. The log level configuration uses the following format:

<component> = <level>

The <component> is the component for which to set a log level, and the <level> is one of the valid logger levels. For more information, see Configuring logging.
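
For example, a minimal sketch of a logging element in a Liberty server configuration file. The trace specification string com.ibm.rules.*=fine and the ConfigMap name are illustrative assumptions, not values that this documentation prescribes:

<server>
    <logging traceSpecification="*=info:com.ibm.rules.*=fine"/>
</server>

You can then create the ConfigMap from the file by using a standard command, for example:

oc create configmap odm-logging-config --from-file=logging.xml

How the ConfigMap is referenced by the Operational Decision Manager pods depends on your CR parameters. See Configuring logging for the supported mechanism.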