Preparing your cluster for fault tolerance

To ensure recovery from possible job failures, IBM® Business Automation Insights uses Apache Flink checkpoints and savepoints, and high availability mode. You can restart a job from a previous checkpoint or savepoint. For some possible failure cases, you can also take preventive steps.

High availability configuration

For high availability and fault tolerance to be effective, you must set the appropriate number of replicas for Apache ZooKeeper, for embedded Elasticsearch nodes, and for the administrative service. See Replica configuration for high availability and fault tolerance.

Failure of Apache Flink job manager and task managers

The job manager and task managers are deployed with a configuration that restarts their pods on failure, and jobs are configured for checkpointing. After a failure, the job manager and task managers restart automatically and processing resumes from where it left off.
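
To illustrate what resuming from a checkpoint means, here is a minimal Python sketch. It is not the Business Automation Insights or Apache Flink implementation; the function run_job, the in-memory store dictionary, and the crash_at fault-injection parameter are all hypothetical names chosen for this example. The sketch assumes a job that sums a list of events and periodically saves its input position together with its running total.

```python
def run_job(events, store, checkpoint_interval=3, crash_at=None):
    """Sum events, starting from the position saved in store.

    store is a hypothetical checkpoint store holding the input offset and
    the running total. crash_at imitates a pod failure: the job raises
    just before handling the event at that index.
    """
    offset = store["offset"]
    total = store["total"]
    while offset < len(events):
        if crash_at is not None and offset == crash_at:
            raise RuntimeError(f"simulated pod failure at offset {offset}")
        total += events[offset]
        offset += 1
        # Save position and state together so that they stay consistent.
        if offset % checkpoint_interval == 0 or offset == len(events):
            store["offset"] = offset
            store["total"] = total
    return total


events = [1, 2, 3, 4, 5, 6, 7, 8]
store = {"offset": 0, "total": 0}

try:
    run_job(events, store, crash_at=5)  # fails after the checkpoint at offset 3
except RuntimeError:
    pass

saved_offset = store["offset"]        # position of the last completed checkpoint
final_total = run_job(events, store)  # restart: resumes from that checkpoint
```

Events handled after the last checkpoint but before the failure are processed again on restart. Because the running total is restored from the same checkpoint as the offset, they are not counted twice in this sketch.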

Job failure

If a recoverable error occurs, such as a temporary network outage that prevents connection to Kafka or Elasticsearch, Flink jobs restart automatically. If a job cannot restart, see Restarting from a checkpoint or savepoint.
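
The difference between an automatic restart and a failure that needs manual recovery can be sketched as a bounded retry loop. The following Python example is only an illustration, loosely modeled on a fixed-delay restart policy; it does not show how Business Automation Insights configures its jobs. The names RecoverableError, run_with_restarts, and make_flaky_job, as well as the limit of three restarts, are assumptions made for this example.

```python
import time


class RecoverableError(Exception):
    """Stands in for a transient fault, such as a brief network outage."""


def run_with_restarts(job, max_attempts=3, delay_seconds=0.0, sleep=time.sleep):
    """Call job() until it succeeds or the restart budget is used up.

    Returns (result, restarts). Re-raises the last error once max_attempts
    restarts have failed, which is the point where manual recovery applies.
    """
    restarts = 0
    while True:
        try:
            return job(), restarts
        except RecoverableError:
            if restarts >= max_attempts:
                raise
            restarts += 1
            sleep(delay_seconds)  # wait before the next attempt


def make_flaky_job(failures):
    """Build a job that fails the given number of times, then succeeds."""
    remaining = [failures]

    def job():
        if remaining[0] > 0:
            remaining[0] -= 1
            raise RecoverableError("connection refused")
        return "finished"

    return job


# Two transient failures fit within the budget of three restarts.
result, restarts = run_with_restarts(make_flaky_job(2))

# Five failures exhaust the budget, so the error reaches the caller.
try:
    run_with_restarts(make_flaky_job(5))
    exhausted = False
except RecoverableError:
    exhausted = True
```

In this sketch, the exhausted case corresponds to a job that cannot restart on its own and must be restarted from a checkpoint or savepoint.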

Known issues

To learn how to resolve known issues, see Troubleshooting.

Restriction: The Developer Edition does not implement high availability or fault tolerance.