Troubleshooting after bai-start completes

While you check that your single-node deployment is operational, you might need to take corrective action if containers do not reach their expected state or if other deployment issues occur.

Invalid security certificates
Insufficient hardware resources
Flink jobs
Elasticsearch indexes
Event emitters and dashboards

Invalid security certificates

A certificate can be invalid because it expired or because the hostname changed after the certificate was generated. To identify the issue, check the logs of the services by running the bin/bai-logs command. Then apply the corrective action that matches your type of certificate.

Custom certificates: If you provided your own certificates, replace the expired or invalid certificates and the CA certificates that were used to sign them.
Default certificates: If the certificates were auto-generated, remove the <bai_sn_install_dir>/certs directory and restart by running the bin/bai-start --acceptLicense --init command. The --init option enables the certificates to be regenerated.
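
Before you replace or regenerate certificates, it can help to confirm which ones have actually expired. The following is a minimal sketch, not part of the product tooling. It assumes that the certificates are PEM-encoded files and that the openssl command is installed. The report_cert helper name is hypothetical, and so is any file name that you pass to it, because the layout of the certs directory is not described here.

```shell
#!/bin/sh
# Hypothetical helper: report whether a PEM-encoded certificate is still valid.
# Returns 0 if the certificate is valid right now, non-zero if it has expired
# or cannot be read as a certificate.
report_cert() {
  cert_file=$1
  # Print the subject and expiry date for context (nothing is printed on error).
  openssl x509 -noout -subject -enddate -in "$cert_file" 2>/dev/null
  # -checkend 0 succeeds only if the certificate has not expired yet.
  if openssl x509 -noout -checkend 0 -in "$cert_file" >/dev/null 2>&1; then
    echo "OK: $cert_file"
  else
    echo "EXPIRED OR UNREADABLE: $cert_file"
    return 1
  fi
}

# Example call (the file name is a placeholder, not a documented path):
#   report_cert <bai_sn_install_dir>/certs/<certificate_file>
```

This check covers expiry only. A hostname mismatch still has to be identified from the service logs.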

Insufficient hardware resources

Check the logs of the Docker services by running the bin/bai-logs command. You can apply the following corrective actions.

CPU and memory: If possible, increase the CPU and memory allocated to Docker, or use a computer that has more available resources.
Storage: The STORAGE_ENABLED environment variable is false by default. If you set it to true, setting it back to false frees one slot. This change might resolve errors that are reported in the logs of the job manager or task manager containers.

Flink jobs

To identify problems that affect the Flink jobs, inspect the logs of the job manager and task manager containers and check the Flink web interface.

You can find more information in the following pages.

bai-logs: See Troubleshooting tools and Command line interface for IBM Business Automation Insights on a single node.
Hostname: See Changing the external hostname.

Flink is not able to allocate enough slots and the computer has fewer than 6 CPUs.

To determine the number of CPUs, run the getconf _NPROCESSORS_ONLN command.
Run bai-stop.
Set the TASKMANAGER_LOGICAL_CPUS variable to 5 in the .env configuration file.
Run bai-start --acceptLicense.
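
The steps above can be sketched as a small script. This is an illustration under stated assumptions, not a supported tool: the needs_slot_workaround and set_env_var helper names are hypothetical, and the script assumes that the .env file uses plain KEY=VALUE lines.

```shell
#!/bin/sh
# Returns 0 when the CPU count given as $1 is below the 6-CPU threshold
# mentioned above, meaning that the workaround applies.
needs_slot_workaround() {
  [ "$1" -lt 6 ]
}

# Hypothetical helper: set KEY=VALUE in an env-style file, replacing any
# existing assignment of KEY and leaving all other lines untouched.
set_env_var() {
  env_file=$1
  key=$2
  value=$3
  tmp_file=$(mktemp) || return 1
  # Copy every line except an existing assignment of the key.
  # grep exits non-zero when no line remains, which is not an error here.
  grep -v "^${key}=" "$env_file" > "$tmp_file" || true
  # Append the new assignment.
  printf '%s=%s\n' "$key" "$value" >> "$tmp_file"
  # Write back in place so that the original file permissions are preserved.
  cat "$tmp_file" > "$env_file"
  rm -f "$tmp_file"
}

# Example sequence, run from <bai_sn_install_dir>. The location of .env in
# that directory is an assumption.
#   cpus=$(getconf _NPROCESSORS_ONLN)
#   if needs_slot_workaround "$cpus"; then
#     bin/bai-stop
#     set_env_var .env TASKMANAGER_LOGICAL_CPUS 5
#     bin/bai-start --acceptLicense
#   fi
```

Rewriting the file through a temporary copy avoids sed -i, whose syntax differs between GNU and BSD systems.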

Flink savepoints are corrupted.

Start from a previous savepoint. For more information, see the description of the --savepoint-paths option.

Flink fails and the logs show errors.

The log shows the following error message, and the same error persists after you take the corrective action that the message mentions.

The savepoint /mnt/pv/checkpoints/<uuid>/chk-<n> was already used. The Job won’t be run from there. If you really want to use this savepoint again, you should delete the /mnt/pv/recovery/dba/<bai-processor>/chk-<n> file.

This issue can be a side effect of Flink jobs that cannot start for lack of available slots, as described in Flink is not able to allocate enough slots and the computer has fewer than 6 CPUs. After that issue is corrected, the simplest recovery is to remove all corrupted information by running the following commands.

rm -rf <bai_sn_install_dir>/volumes/flink/checkpoints
rm -rf <bai_sn_install_dir>/volumes/flink/savepoints
rm <bai_sn_install_dir>/processors/running_processors_info
rm <bai_sn_install_dir>/processors/savepoint_paths

Important:

Do not run these commands for a system in production or when data loss is critical. In such a case, find a valid savepoint or checkpoint and use it for each Flink processor.
The correct behavior of the Flink containers depends on the Kafka, ZooKeeper, and Elasticsearch containers, which must be up and healthy. Therefore, if applicable, troubleshoot these containers before the Flink containers.
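
Before you remove the Flink state with the rm commands above, it can be prudent to archive it first, even on a test system. The following is a minimal sketch that assumes the tar command is available. The backup_flink_state helper name and the archive location are hypothetical; the four paths are the ones that the rm commands target.

```shell
#!/bin/sh
# Hypothetical helper: archive the Flink state that the cleanup removes.
#   $1 = installation directory (<bai_sn_install_dir>)
#   $2 = archive file to create
backup_flink_state() {
  install_dir=$1
  archive=$2
  # Reuse the positional parameters as the list of paths that exist.
  set --
  for path in volumes/flink/checkpoints volumes/flink/savepoints \
              processors/running_processors_info processors/savepoint_paths; do
    if [ -e "$install_dir/$path" ]; then
      set -- "$@" "$path"
    fi
  done
  if [ "$#" -eq 0 ]; then
    echo "Nothing to back up in $install_dir" >&2
    return 1
  fi
  # Store paths relative to the installation directory.
  tar -czf "$archive" -C "$install_dir" "$@"
}

# Example call (the archive path is a placeholder):
#   backup_flink_state <bai_sn_install_dir> /tmp/flink-state-backup.tgz
```

An archive of corrupted state is not a substitute for a valid savepoint. It only keeps the removed files available for later inspection.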

Elasticsearch indexes

Defining a high number of fields in an Elasticsearch index might lead to a so-called mapping explosion, which might cause out-of-memory errors and situations that are difficult to recover from. The maximum number of fields in Elasticsearch indexes created by IBM Business Automation Insights is set to 1000. Field and object mappings, and field aliases, count towards this limit. Ensure that the documents that are stored in Elasticsearch indexes do not cause this limit to be reached.

To diagnose this issue, inspect the logs of the Elasticsearch service. To prevent nodes from running out of disk space, Elasticsearch indexes become read-only if the available disk space falls under the flood stage watermark. If the Elasticsearch container logs such a situation, increase the available disk space.

For more information about container logs, see Troubleshooting tools.
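
To check how close a disk is to the flood stage watermark, you can compare its usage with the threshold. The following is a minimal sketch that assumes the Elasticsearch default flood stage watermark of 95% is in effect; a deployment might configure a different value. The disk_used_percent and at_flood_stage helper names are hypothetical, and the directory that holds the Elasticsearch data is for you to supply.

```shell
#!/bin/sh
# Print the used-space percentage (digits only) of the file system
# that holds the path given as $1.
disk_used_percent() {
  # df -P prints one POSIX-format line per file system; column 5 is the capacity.
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Returns 0 when the usage $1 is at or above the threshold $2.
# The threshold defaults to 95, the assumed flood stage watermark.
at_flood_stage() {
  [ "$1" -ge "${2:-95}" ]
}

# Example (the data directory is a placeholder):
#   used=$(disk_used_percent <elasticsearch_data_dir>)
#   if at_flood_stage "$used"; then
#     echo "Disk usage ${used}% has reached the flood stage watermark"
#   fi
```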

Event emitters and dashboards

See Troubleshooting IBM Business Automation Insights on Kubernetes.